ChEB-AI · sfluegel05 · Jul 21, 2025 · Jun 24, 2025 · Jun 27, 2025 · Jun 27, 2025
diff --git a/.github/workflows/lint.yml b/.github/workflows/lint.yml
@@ -0,0 +1,26 @@
+name: Lint
+
+on: [push, pull_request]
+
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.10'  # or any version your project uses
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install black==25.1.0 ruff==0.12.2
+
+      - name: Run Black
+        run: black --check .
+
+      - name: Run Ruff (no formatting)
+        run: ruff check . --no-fix
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,179 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+docs/build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# pdm
+#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+#   in version control.
+#   https://pdm.fming.dev/#use-with-ide
+.pdm.toml
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+#  and can be added to the global gitignore or merged into this file.  For a more nuclear
+#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+
+# configs/ # commented as new configs can be added as a part of a feature
+
+/.idea
+/data
+/logs
+/results_buffer
+electra_pretrained.ckpt
+
+build
+.virtual_documents
+.jupyter
+chebai.egg-info
+lightning_logs
+logs
+.isort.cfg
+/.vscode
+/api/.cloned_repos
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -0,0 +1,31 @@
+repos:
+-   repo: https://github.com/psf/black
+    rev: "25.1.0"
+    hooks:
+    -   id: black
+    -   id: black-jupyter # for formatting jupyter-notebook
+
+-   repo: https://github.com/pycqa/isort
+    rev: 5.13.2
+    hooks:
+      - id: isort
+        name: isort (python)
+        args: ["--profile=black"]
+
+-   repo: https://github.com/asottile/seed-isort-config
+    rev: v2.2.0
+    hooks:
+    -   id: seed-isort-config
+
+-   repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.6.0
+    hooks:
+    -   id: check-yaml
+    -   id: end-of-file-fixer
+    -   id: trailing-whitespace
+
+-   repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.12.2
+    hooks:
+    -   id: ruff
+        args: [--fix]
diff --git a/README.md b/README.md
@@ -1,8 +1,17 @@
 # python-chebifier
-An AI ensemble model for predicting chemical classes in the ChEBI ontology.
+An AI ensemble model for predicting chemical classes in the ChEBI ontology. It integrates deep learning models,
+rule-based models and generative AI-based models.
+
+A web application for the ensemble is available at https://chebifier.hastingslab.org/.
 
 ## Installation
 
+You can get the package from PyPI:
+```bash
+pip install chebifier
+```
+
+or get the latest development version from GitHub:
 ```bash
 # Clone the repository
 git clone https://github.com/yourusername/python-chebifier.git
@@ -12,7 +21,7 @@ cd python-chebifier
 pip install -e .
 ```
 
-Some dependencies of `chebai-graph` cannot be installed automatically. If you want to use Graph Neural Networks, follow
+`chebai-graph` and its dependencies cannot be installed automatically. If you want to use Graph Neural Networks, follow
 the instructions in the [chebai-graph repository](https://github.com/ChEB-AI/python-chebai-graph).
 
 ## Usage
@@ -21,23 +30,25 @@ the instructions in the [chebai-graph repository](https://github.com/ChEB-AI/pyt
 
 The package provides a command-line interface (CLI) for making predictions using an ensemble model.
 
-```bash
-# Get help
-python -m chebifier.cli --help
+The ensemble configuration is given by a configuration file (by default, this is `chebifier/ensemble.yml`). If you
+want to change which models are included in the ensemble or how they are weighted, you can create your own configuration file.
 
-# Make predictions using a configuration file
-python -m chebifier.cli predict configs/example_config.yml --smiles "CC(=O)OC1=CC=CC=C1C(=O)O" "C1=CC=C(C=C1)C(=O)O"
+Model weights for deep learning models are downloaded automatically from [Hugging Face](https://huggingface.co/chebai).
+However, you can also supply your own model checkpoints (see `configs/example_config.yml` for an example).
 
-# Make predictions using SMILES from a file
-python -m chebifier.cli predict configs/example_config.yml --smiles-file smiles.txt
-```
+```bash
+# Make predictions
+python -m chebifier predict --smiles "CC(=O)OC1=CC=CC=C1C(=O)O" --smiles "C1=CC=C(C=C1)C(=O)O"
 
-### Configuration File
+# Make predictions using SMILES from a file
+python -m chebifier predict --smiles-file smiles.txt
 
-The CLI requires a YAML configuration file that defines the ensemble model. An example can be found in `configs/example_config.yml`.
+# Make predictions using a configuration file
+python -m chebifier predict --ensemble-config configs/my_config.yml --smiles-file smiles.txt
 
-The models and other required files are trained / generated by our [chebai](https://github.com/ChEB-AI/python-chebai) package. 
-Examples for models can be found on [kaggle](https://www.kaggle.com/datasets/sfluegel/chebai).
+# Get all available options
+python -m chebifier predict --help
+```
 
 ### Python API
 
@@ -67,7 +78,29 @@ for smiles, prediction in zip(smiles_list, predictions):
         print("No predictions")
 ```
 
+### The models
+Currently, the following models are supported:
+
+
+| Model | Description | #Classes | Publication                                                           | Repository                                                                            |
+|-------|-------------|----------|-----------------------------------------------------------------------|----------------------------------------------------------------------------------------|
+| `electra` | A transformer-based deep learning model trained on ChEBI SMILES strings. | 1522 | [Glauer, Martin, et al., 2024: Chebifier: Automating semantic classification in ChEBI to accelerate data-driven discovery, Digital Discovery 3 (2024) 896-907](https://pubs.rsc.org/en/content/articlehtml/2024/dd/d3dd00238a) | [python-chebai](https://github.com/ChEB-AI/python-chebai) |
+| `resgated` | A Residual Gated Graph Convolutional Network trained on ChEBI molecules. | 1522 | | [python-chebai-graph](https://github.com/ChEB-AI/python-chebai-graph) |
+| `chemlog_peptides` | A rule-based model specialised on peptide classes. | 18 | [Flügel, Simon, et al., 2025: ChemLog: Making MSOL Viable for Ontological Classification and Learning, arXiv](https://arxiv.org/abs/2507.13987) | [chemlog-peptides](https://github.com/sfluegel05/chemlog-peptides) |
+| `chemlog_element`, `chemlog_organox` | Extensions of ChemLog for classes that are defined either by the presence of a specific element or by the presence of an organic bond. | 118 + 37 | | [chemlog-extra](https://github.com/ChEB-AI/chemlog-extra) |
+| `c3p` | A collection of _Chemical Classifier Programs_, generated by LLMs based on the natural language definitions of ChEBI classes. | 338 | [Mungall, Christopher J., et al., 2025: Chemical classification program synthesis using generative artificial intelligence, arXiv](https://arxiv.org/abs/2505.18470) | [c3p](https://github.com/chemkg/c3p) |
+
+In addition, Chebifier also includes a ChEBI lookup that automatically retrieves the ChEBI superclasses for a class
+matched by a SMILES string. This is not activated by default, but can be included by adding 
+```yaml
+chebi_lookup:
+    type: chebi_lookup
+    model_weight: 10 # optional
+```
+to your configuration file.
+
 ### The ensemble
+<img width="700" alt="ensemble_architecture" src="https://github.com/user-attachments/assets/9275d3cd-ac88-466f-a1e9-27d20d67543b" />
 
 Given a sample (i.e., a SMILES string) and models $m_1, m_2, \ldots, m_n$, the ensemble works as follows:
 1. Get predictions from each model $m_i$ for the sample.
@@ -103,7 +136,7 @@ Trust is based on the model's performance on a validation set. After training, w
 on a validation set for each class. If the `ensemble_type` is set to `wmv-f1`, the trust is calculated as 1 + the F1 score.
 If the `ensemble_type` is set to `mv` (the default), the trust is set to 1 for all models.
 
-### Inconsistency correction
+### Inconsistency resolution
 After a decision has been made for each class independently, the consistency of the predictions with regard to the ChEBI hierarchy 
 and disjointness axioms is checked. This is
 done in 3 steps:
@@ -114,7 +147,7 @@ belongs to the direct and indirect superclasses (e.g., primary alcohol, aromatic
 - (2) Next, we check for disjointness. This is not specified directly in ChEBI, but in an additional ChEBI module ([chebi-disjoints.owl](https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/)).
 We have extracted these disjointness axioms into a CSV file and added some more disjointness axioms ourselves (see
 `data>disjoint_chebi.csv` and `data>disjoint_additional.csv`). If two classes $A$ and $B$ are disjoint and we predict
-both, we select one of them randomly (https://github.com/ChEB-AI/python-chebifier/issues/6) and set the other to 0.
+both, we select the one with the higher class score and set the other to 0.
 - (3) Since the second step might have introduced new inconsistencies into the hierarchy, we repeat the first step, but 
 with a small change. For a pair of classes $A \subseteq B$ with predictions $1$ and $0$, instead of setting $B$ to $1$,
 we now set $A$ to $0$. This has the advantage that we cannot introduce new disjointness-inconsistencies and don't have

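Note on the README hunk above: the disjointness step now keeps the class with the higher score instead of picking one at random. A minimal sketch of that rule follows. It assumes per-class scores in a dict and disjoint pairs given as tuples; the function name `resolve_disjoint` and this data layout are hypothetical and not taken from the chebifier code.

```python
def resolve_disjoint(scores, predicted, disjoint_pairs):
    """Return the predicted classes after dropping, for every disjoint pair
    that was predicted together, the class with the lower score."""
    kept = set(predicted)
    for a, b in disjoint_pairs:
        if a in kept and b in kept:
            # Drop the lower-scoring class; on a tie `a` is kept, which is
            # an arbitrary choice made for this sketch.
            loser = b if scores[a] >= scores[b] else a
            kept.discard(loser)
    return kept


if __name__ == "__main__":
    scores = {"A": 0.9, "B": 0.6, "C": 0.7}
    print(resolve_disjoint(scores, {"A", "B", "C"}, [("A", "B")]))
```

With several overlapping pairs the result of this sketch depends on the order of `disjoint_pairs`; the README does not say how the actual implementation handles that case.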
diff --git a/chebifier/__main__.py b/chebifier/__main__.py
@@ -0,0 +1,4 @@
+from chebifier.cli import cli
+
+if __name__ == "__main__":
+    cli()
diff --git a/chebifier/check_env.py b/chebifier/check_env.py
@@ -0,0 +1,30 @@
+import subprocess
+import sys
+
+
+def get_current_environment() -> str:
+    """
+    Return the path of the Python executable for the current environment.
+    """
+    return sys.executable
+
+
+def check_package_installed(package_name: str) -> None:
+    """
+    Check if the given package is installed in the current Python environment.
+    """
+    python_exec = get_current_environment()
+    try:
+        subprocess.check_output(
+            [python_exec, "-m", "pip", "show", package_name], stderr=subprocess.DEVNULL
+        )
+        print(f"✅ Package '{package_name}' is already installed.")
+    except subprocess.CalledProcessError:
+        raise RuntimeError(
+            f"❌ Please install '{package_name}' into your environment: {python_exec}"
+        )
+
+
+if __name__ == "__main__":
+    print(f"🔍 Using Python executable: {get_current_environment()}")
+    check_package_installed("numpy")  # Replace with your desired package
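Note on `check_env.py`: the check above shells out to `pip show`. A possible alternative, shown only as a sketch and not what this PR does, is to query the installed distribution metadata in-process with the standard library's `importlib.metadata` (Python 3.8+). The helper name `is_package_installed` is hypothetical.

```python
from importlib import metadata


def is_package_installed(package_name: str) -> bool:
    """Return True if a distribution named `package_name` is installed in
    the environment of the running interpreter."""
    try:
        # Looks up the distribution's metadata; raises if it is not installed.
        metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return False
    return True


if __name__ == "__main__":
    print(is_package_installed("numpy"))
```

This avoids spawning a subprocess and does not require `pip` itself to be present in the environment. It checks the distribution name (as used on PyPI), which can differ from the import name.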