Merged
42 commits
02c5409
api code to download model from hugging face
aditya0by0 Jun 24, 2025
c13423c
Merge branch 'dev' into feature/api_downloadble_models
aditya0by0 Jun 27, 2025
b539f0a
Create .pre-commit-config.yaml
aditya0by0 Jun 27, 2025
2c2aba2
utility to setup env and model package dependencies
aditya0by0 Jun 27, 2025
2b9f335
`gather_predictions` will return predicted_classes_dict
aditya0by0 Jun 27, 2025
6faf3bd
use package namespace imports for prediction models
aditya0by0 Jun 28, 2025
a4f5f85
add hugging face api
aditya0by0 Jun 28, 2025
481a2eb
api registry
aditya0by0 Jun 28, 2025
584b6a6
api cli
aditya0by0 Jun 28, 2025
05d8580
Update .gitignore
aditya0by0 Jun 28, 2025
997120e
use hugging face's cache system instead of custom file management
aditya0by0 Jun 28, 2025
9c3beea
pre-commit -run -a
aditya0by0 Jun 28, 2025
e6602ef
remove explicit config kwargs for resgated
aditya0by0 Jul 1, 2025
fd814e9
api support for ensemble
aditya0by0 Jul 6, 2025
a044f23
add ruff action workflow
aditya0by0 Jul 6, 2025
51a2d34
same version for workflow and pre-commit yaml
aditya0by0 Jul 6, 2025
d2c586a
Update base_predictor.py
aditya0by0 Jul 6, 2025
e0b3ca7
merge from dev
aditya0by0 Jul 9, 2025
ebc450f
Merge branch 'refs/heads/dev' into feature/api_downloadble_models
sfluegel05 Jul 11, 2025
f3b3905
fix readme
sfluegel05 Jul 11, 2025
001538d
fix cli and ensemble imports
sfluegel05 Jul 11, 2025
f8583cb
add huggingface download to cli
sfluegel05 Jul 11, 2025
90aedd4
reformat with black
sfluegel05 Jul 11, 2025
2bead4a
use None values to mark samples where all methods failed (usually due…
sfluegel05 Jul 11, 2025
df68ecb
init smoother at init to avoid re-initialising it for every predictio…
sfluegel05 Jul 11, 2025
c575637
add lookup classifier and chemlog-by-element classifier
sfluegel05 Jul 14, 2025
32da68b
split chemlog extra predictors into two for generalisability
sfluegel05 Jul 15, 2025
ecb48ff
fix typos
sfluegel05 Jul 15, 2025
89b4812
add c3p integration
sfluegel05 Jul 17, 2025
e8e4ec3
use class scores for smoothing, explicitly predict transitive closure…
sfluegel05 Jul 18, 2025
a11b0b7
restructure error handling and update cache indexing for gnn
sfluegel05 Jul 18, 2025
598ed6b
fix weight calculation if a model does not make predictions for all c…
sfluegel05 Jul 18, 2025
4d918d5
Add C3P explanations
sfluegel05 Jul 18, 2025
e4f1c54
add cache
sfluegel05 Jul 21, 2025
87fb66a
move files from api to chebifier, add files to huggingface
sfluegel05 Jul 21, 2025
7ebbacb
fix error handling for nns
sfluegel05 Jul 21, 2025
c30e06d
reformat with black, add development dependencies
sfluegel05 Jul 21, 2025
2115cfd
refactor with ruff
sfluegel05 Jul 21, 2025
572f0ed
update readme
sfluegel05 Jul 21, 2025
9769e23
update readme
sfluegel05 Jul 21, 2025
dcd0915
Merge branch 'dev' into fix/error-propagation
sfluegel05 Jul 21, 2025
0377027
Update README.md
sfluegel05 Jul 21, 2025
26 changes: 26 additions & 0 deletions .github/workflows/lint.yml
@@ -0,0 +1,26 @@
name: Lint

on: [push, pull_request]

jobs:
  lint:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10' # or any version your project uses

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install black==25.1.0 ruff==0.12.2

      - name: Run Black
        run: black --check .

      - name: Run Ruff (no formatting)
        run: ruff check . --no-fix
179 changes: 179 additions & 0 deletions .gitignore
@@ -0,0 +1,179 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/
docs/build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# configs/ # commented as new configs can be added as a part of a feature

/.idea
/data
/logs
/results_buffer
electra_pretrained.ckpt

build
.virtual_documents
.jupyter
chebai.egg-info
lightning_logs
logs
.isort.cfg
/.vscode
/api/.cloned_repos
31 changes: 31 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,31 @@
repos:
  - repo: https://github.com/psf/black
    rev: "25.1.0"
    hooks:
      - id: black
      - id: black-jupyter # for formatting jupyter-notebook

  - repo: https://github.com/pycqa/isort
    rev: 5.13.2
    hooks:
      - id: isort
        name: isort (python)
        args: ["--profile=black"]

  - repo: https://github.com/asottile/seed-isort-config
    rev: v2.2.0
    hooks:
      - id: seed-isort-config

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.12.2
    hooks:
      - id: ruff
        args: [--fix]
65 changes: 49 additions & 16 deletions README.md
@@ -1,8 +1,17 @@
# python-chebifier
-An AI ensemble model for predicting chemical classes in the ChEBI ontology.
+An AI ensemble model for predicting chemical classes in the ChEBI ontology. It integrates deep learning models,
+rule-based models and generative AI-based models.

+A web application for the ensemble is available at https://chebifier.hastingslab.org/.
+
## Installation

+You can get the package from PyPI:
+```bash
+pip install chebifier
+```
+
+or get the latest development version from GitHub:
```bash
# Clone the repository
git clone https://github.com/yourusername/python-chebifier.git
@@ -12,7 +21,7 @@ cd python-chebifier
pip install -e .
```

-Some dependencies of `chebai-graph` cannot be installed automatically. If you want to use Graph Neural Networks, follow
+`chebai-graph` and its dependencies cannot be installed automatically. If you want to use Graph Neural Networks, follow
the instructions in the [chebai-graph repository](https://github.com/ChEB-AI/python-chebai-graph).

## Usage
@@ -21,23 +30,25 @@ the instructions in the [chebai-graph repository](https://github.com/ChEB-AI/pyt

The package provides a command-line interface (CLI) for making predictions using an ensemble model.

-```bash
-# Get help
-python -m chebifier.cli --help
+The ensemble configuration is given by a configuration file (by default, this is `chebifier/ensemble.yml`). If you
+want to change which models are included in the ensemble or how they are weighted, you can create your own configuration file.

-# Make predictions using a configuration file
-python -m chebifier.cli predict configs/example_config.yml --smiles "CC(=O)OC1=CC=CC=C1C(=O)O" "C1=CC=C(C=C1)C(=O)O"
+Model weights for deep learning models are downloaded automatically from [Hugging Face](https://huggingface.co/chebai).
+However, you can also supply your own model checkpoints (see `configs/example_config.yml` for an example).

-# Make predictions using SMILES from a file
-python -m chebifier.cli predict configs/example_config.yml --smiles-file smiles.txt
-```
+```bash
+# Make predictions
+python -m chebifier predict --smiles "CC(=O)OC1=CC=CC=C1C(=O)O" --smiles "C1=CC=C(C=C1)C(=O)O"

-### Configuration File
+# Make predictions using SMILES from a file
+python -m chebifier predict --smiles-file smiles.txt

-The CLI requires a YAML configuration file that defines the ensemble model. An example can be found in `configs/example_config.yml`.
+# Make predictions using a configuration file
+python -m chebifier predict --ensemble-config configs/my_config.yml --smiles-file smiles.txt

-The models and other required files are trained / generated by our [chebai](https://github.com/ChEB-AI/python-chebai) package.
-Examples for models can be found on [kaggle](https://www.kaggle.com/datasets/sfluegel/chebai).
+# Get all available options
+python -m chebifier predict --help
+```
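
For scripted use, the CLI calls above can also be assembled from Python. The following is a minimal sketch, not part of this PR: the helper name `build_predict_command` is hypothetical, and it only uses the flags that appear in the README (`--smiles`, `--smiles-file`, `--ensemble-config`). Running the resulting command assumes `chebifier` is installed in the current interpreter.

```python
import sys
from typing import List, Optional, Sequence


def build_predict_command(
    smiles: Sequence[str] = (),
    smiles_file: Optional[str] = None,
    ensemble_config: Optional[str] = None,
) -> List[str]:
    """Assemble the argv list for `python -m chebifier predict` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "chebifier", "predict"]
    if ensemble_config is not None:
        cmd += ["--ensemble-config", ensemble_config]
    for s in smiles:
        # The README passes --smiles once per molecule.
        cmd += ["--smiles", s]
    if smiles_file is not None:
        cmd += ["--smiles-file", smiles_file]
    return cmd


cmd = build_predict_command(
    smiles=["CC(=O)OC1=CC=CC=C1C(=O)O", "C1=CC=C(C=C1)C(=O)O"]
)
print(cmd[1:])
# To execute it (requires chebifier to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems with SMILES strings, which often contain parentheses and `=`.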

### Python API

@@ -67,7 +78,29 @@ for smiles, prediction in zip(smiles_list, predictions):
print("No predictions")
```

+### The models
+Currently, the following models are supported:
+
+
+| Model | Description | #Classes | Publication | Repository |
+|-------|-------------|----------|-----------------------------------------------------------------------|----------------------------------------------------------------------------------------|
+| `electra` | A transformer-based deep learning model trained on ChEBI SMILES strings. | 1522 | [Glauer, Martin, et al., 2024: Chebifier: Automating semantic classification in ChEBI to accelerate data-driven discovery, Digital Discovery 3 (2024) 896-907](https://pubs.rsc.org/en/content/articlehtml/2024/dd/d3dd00238a) | [python-chebai](https://github.com/ChEB-AI/python-chebai) |
+| `resgated` | A Residual Gated Graph Convolutional Network trained on ChEBI molecules. | 1522 | | [python-chebai-graph](https://github.com/ChEB-AI/python-chebai-graph) |
+| `chemlog_peptides` | A rule-based model specialised on peptide classes. | 18 | [Flügel, Simon, et al., 2025: ChemLog: Making MSOL Viable for Ontological Classification and Learning, arXiv](https://arxiv.org/abs/2507.13987) | [chemlog-peptides](https://github.com/sfluegel05/chemlog-peptides) |
+| `chemlog_element`, `chemlog_organox` | Extensions of ChemLog for classes that are defined either by the presence of a specific element or by the presence of an organic bond. | 118 + 37 | | [chemlog-extra](https://github.com/ChEB-AI/chemlog-extra) |
+| `c3p` | A collection of _Chemical Classifier Programs_, generated by LLMs based on the natural language definitions of ChEBI classes. | 338 | [Mungall, Christopher J., et al., 2025: Chemical classification program synthesis using generative artificial intelligence, arXiv](https://arxiv.org/abs/2505.18470) | [c3p](https://github.com/chemkg/c3p) |
+
+In addition, Chebifier also includes a ChEBI lookup that automatically retrieves the ChEBI superclasses for a class
+matched by a SMILES string. This is not activated by default, but can be included by adding
+```yaml
+chebi_lookup:
+  type: chebi_lookup
+  model_weight: 10 # optional
+```
+to your configuration file.
+
### The ensemble
+<img width="700" alt="ensemble_architecture" src="https://github.com/user-attachments/assets/9275d3cd-ac88-466f-a1e9-27d20d67543b" />

Given a sample (i.e., a SMILES string) and models $m_1, m_2, \ldots, m_n$, the ensemble works as follows:
1. Get predictions from each model $m_i$ for the sample.
@@ -103,7 +136,7 @@ Trust is based on the model's performance on a validation set. After training, w
on a validation set for each class. If the `ensemble_type` is set to `wmv-f1`, the trust is calculated as 1 + the F1 score.
If the `ensemble_type` is set to `mv` (the default), the trust is set to 1 for all models.
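
The trust rule described here can be sketched in a few lines. This is an illustration under stated assumptions, not the package's implementation: the function name `trust` and the per-class F1 dictionary are hypothetical, and how trust is combined with the other factors of the vote is described in a part of the README that this diff does not show.

```python
from typing import Dict


def trust(ensemble_type: str, f1_score: float) -> float:
    """Trust of one model for one class (hypothetical helper).

    "mv"     -> plain majority vote, every model has trust 1
    "wmv-f1" -> weighted majority vote, trust is 1 + validation F1
    """
    if ensemble_type == "mv":
        return 1.0
    if ensemble_type == "wmv-f1":
        return 1.0 + f1_score
    raise ValueError(f"Unknown ensemble_type: {ensemble_type!r}")


# Made-up validation F1 scores of a single model for two classes.
validation_f1: Dict[str, float] = {"class_a": 0.5, "class_b": 0.25}
for cls, f1 in validation_f1.items():
    print(cls, trust("mv", f1), trust("wmv-f1", f1))
```

With `wmv-f1`, a model that performed well on a class during validation gets up to twice the say of a model with an F1 score of 0, while no model is ever silenced completely.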

-### Inconsistency correction
+### Inconsistency resolution
After a decision has been made for each class independently, the consistency of the predictions with regard to the ChEBI hierarchy
and disjointness axioms is checked. This is
done in 3 steps:
@@ -114,7 +147,7 @@ belongs to the direct and indirect superclasses (e.g., primary alcohol, aromatic
- (2) Next, we check for disjointness. This is not specified directly in ChEBI, but in an additional ChEBI module ([chebi-disjoints.owl](https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/)).
We have extracted these disjointness axioms into a CSV file and added some more disjointness axioms ourselves (see
`data>disjoint_chebi.csv` and `data>disjoint_additional.csv`). If two classes $A$ and $B$ are disjoint and we predict
-both, we select one of them randomly (https://github.com/ChEB-AI/python-chebifier/issues/6) and set the other to 0.
+both, we select the one with the higher class score and set the other to 0.
- (3) Since the second step might have introduced new inconsistencies into the hierarchy, we repeat the first step, but
with a small change. For a pair of classes $A \subseteq B$ with predictions $1$ and $0$, instead of setting $B$ to $1$,
we now set $A$ to $0$. This has the advantage that we cannot introduce new disjointness-inconsistencies and don't have
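
The updated disjointness step (2) of the README, which keeps the class with the higher score instead of picking one at random, can be sketched as follows. This is an assumption-laden illustration rather than the code in this PR: `resolve_disjoint`, the dictionary layout and the class names are hypothetical, and the tie-breaking rule (keep the first class of the pair) is an assumption the README does not specify.

```python
from typing import Dict, Iterable, Tuple


def resolve_disjoint(
    predictions: Dict[str, int],
    scores: Dict[str, float],
    disjoint_pairs: Iterable[Tuple[str, str]],
) -> Dict[str, int]:
    """Return a copy of `predictions` in which no disjoint pair is predicted together."""
    resolved = dict(predictions)
    for a, b in disjoint_pairs:
        if resolved.get(a) == 1 and resolved.get(b) == 1:
            # Keep the class with the higher score; on a tie keep `a` (assumption).
            loser = b if scores[a] >= scores[b] else a
            resolved[loser] = 0
    return resolved


predictions = {"A": 1, "B": 1, "C": 1}
scores = {"A": 0.9, "B": 0.6, "C": 0.7}
result = resolve_disjoint(predictions, scores, [("A", "B")])
print(result)
```

After this step a subclass of the dropped class may still be predicted, which is why the README follows it with a second hierarchy pass.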
4 changes: 4 additions & 0 deletions chebifier/__main__.py
@@ -0,0 +1,4 @@
from chebifier.cli import cli

if __name__ == "__main__":
    cli()
30 changes: 30 additions & 0 deletions chebifier/check_env.py
@@ -0,0 +1,30 @@
import subprocess
import sys


def get_current_environment() -> str:
    """
    Return the path of the Python executable for the current environment.
    """
    return sys.executable


def check_package_installed(package_name: str) -> None:
    """
    Check if the given package is installed in the current Python environment.

    Raises a RuntimeError if `pip show` cannot find the package.
    """
    python_exec = get_current_environment()
    try:
        subprocess.check_output(
            [python_exec, "-m", "pip", "show", package_name], stderr=subprocess.DEVNULL
        )
        print(f"✅ Package '{package_name}' is already installed.")
    except subprocess.CalledProcessError:
        # Raising a bare string is a TypeError, so raise a real exception instead.
        raise RuntimeError(
            f"❌ Please install '{package_name}' into your environment: {python_exec}"
        )


if __name__ == "__main__":
    print(f"🔍 Using Python executable: {get_current_environment()}")
    check_package_installed("numpy")  # Replace with your desired package
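
The same check can be done without spawning a `pip` subprocess. The sketch below is an alternative, not what this PR implements: `is_installed` is a hypothetical helper built on the standard library's `importlib.metadata` (Python 3.8+), and it assumes the name passed in is the distribution name that `pip show` would accept.

```python
from importlib import metadata


def is_installed(package_name: str) -> bool:
    """Return True if a distribution with this name is installed (hypothetical helper)."""
    try:
        metadata.version(package_name)
        return True
    except metadata.PackageNotFoundError:
        return False


# A name that should not exist in any environment.
print(is_installed("definitely-not-a-real-package-xyz-12345"))
```

Returning a boolean leaves it to the caller whether a missing package is an error, whereas `check_package_installed` always raises.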