Hallam Lab transform, data-type, and container library for the Metasmith workflow engine.
Metasmith plans and executes data workflows by composing typed transforms. This repo is the registry of types, container images, and transform implementations that the planner draws from.
data_types/                 # YAML type definitions (15 namespaces)
resources/
  containers/               # 62 OCI container references (.oci → docker:// URI)
  lib/                      # Reusable analysis scripts (clustering, plotting, etc.)
transforms/                 # Transform implementations grouped by domain
  amplicon/                 # 2 transforms
  assembly/                 # 13 transforms
  functionalAnnotation/     # 27 transforms
  logistics/                # 23 transforms (downloaders, format conversion)
  metabolicModelling/       # 5 transforms
  metagenomics/             # 9 transforms (+ binning/, taxonomy/)
  pangenome/                # 2 transforms
  responseSurface/          # 1 transform
  transcriptomics/          # 19 transforms
tests/                      # Pytest workflow & E2E tests
envs/                       # Conda env for running tests outside Metasmith
analysis/                   # Downstream analysis code (not part of the library)
main/                       # Operator scripts (DAG rendering, planner probes, run launchers)
Data types (data_types/*.yml) cover: alignment, amplicon, annotation,
binning, clustering, containers, lib, media_optimization,
metabolomics, ncbi, pangenome, ref, sequences, taxonomy,
transcriptomics.
Transform domains include genome/metagenome assembly (flye, hifiasm-meta, megahit, metaspades, …), binning (metabat2, semibin2, comebin, checkm), taxonomy (gtdbtk, fastani, kraken2/bracken, sylph, …), functional annotation (eggnog-mapper, kofamscan, interproscan, antismash, bakta, dram, busco, diamond, deepec, …), protein embeddings & structure (esmfold, esm-c, prott5, ankh, saprot, foldseek), transcriptomics (star, salmon, stringtie, braker3, deseq2/pydeseq2), and logistics (weight/DB downloaders, SRA/assembly fetchers, read format conversion, sharding, container prefetch).
The build compiles every data_types/*.yml, every resources/*/ directory,
and every transforms/*/ directory into the _metadata/ indexes that
Metasmith's planner loads at runtime.
./dev.sh -b

The script delegates to msm build all --types … --uniques … --transforms ….
It uses msm if it's on PATH; otherwise it falls back to
../Metasmith/dev.sh -r build all …, so you need either:
- a Metasmith install with msm on PATH (matching the dev-branch CLI signature), or
- a sibling checkout of hallamlab/Metasmith plus a Python env satisfying envs/base.yml (and envs/dev.yml for tests).
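
The fallback described above can be sketched in Python as follows. This is an illustration only, not the contents of dev.sh: the function name `resolve_build_command` and its `build_args` parameter are hypothetical, and `build_args` stands in for the flags that are elided in the text.

```python
import shutil
from typing import Callable, Optional, Sequence


def resolve_build_command(
    build_args: Sequence[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the argv for a library build, preferring msm on PATH.

    build_args is a placeholder for the elided flags after `build all`.
    The `which` parameter exists only so the lookup can be faked in tests.
    """
    if which("msm") is not None:
        # msm is installed and visible on PATH: call it directly.
        return ["msm", "build", "all", *build_args]
    # Otherwise fall back to the sibling Metasmith checkout's dev script.
    return ["../Metasmith/dev.sh", "-r", "build", "all", *build_args]
```

The command is returned as a list rather than executed so that the choice between the two entry points can be inspected before anything is run.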
After the build, _metadata/ directories under resources/ and
transforms/ reflect the current type graph. They're committed to the
repo so consumers can pull the library without rebuilding.
- Add the type to data_types/containers.yml (extends: container, declare what it provides).
- Create resources/containers/<name>.oci containing the registry URI (e.g. docker://quay.io/biocontainers/diamond:2.1.8--h43eeafb_0).
- Rerun ./dev.sh -b.
The build fails if any .oci file lacks a matching type definition.
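
A pre-build check along these lines could catch the mismatch before running the build. The sketch below makes two assumptions that the text does not confirm: that a container type is matched to its .oci file by file stem, and that the declared names have already been read out of data_types/containers.yml into a set. Both helper names are hypothetical.

```python
from pathlib import Path


def find_undeclared_containers(containers_dir: Path, declared: set[str]) -> list[str]:
    """Return the stems of .oci files that have no entry in `declared`.

    Assumption: a file named <name>.oci pairs with a type named <name>.
    """
    stems = sorted(p.stem for p in containers_dir.glob("*.oci"))
    return [stem for stem in stems if stem not in declared]


def read_oci_uri(oci_file: Path) -> str:
    """Read the registry URI from an .oci file and check its scheme."""
    uri = oci_file.read_text().strip()
    if not uri.startswith("docker://"):
        raise ValueError(f"{oci_file.name}: expected a docker:// URI, got {uri!r}")
    return uri
```

An empty list from `find_undeclared_containers` means every reference file has a declared counterpart under the stem-matching assumption; the real build remains the authority.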
from metasmith.python_api import *

lib = TransformInstanceLibrary.ResolveParentLibrary(__file__)
model = Transform()
# Declare the input and output types; both must exist under data_types/.
inp = model.AddRequirement(lib.GetType("sequences::assembly"))
out = model.AddProduct(lib.GetType("annotation::eggnog_table"))


def protocol(context: ExecutionContext):
    in_path = context.Input(inp)
    out_path = context.Output(out)
    # Run the tool in its container, passing container-side paths.
    context.ExecWithContainer(
        image=lib.GetResource("containers::eggnog-mapper.oci"),
        cmd=f"emapper.py -i {in_path.container} -o {out_path.container}",
    )
    # Report success only if the output exists on the local side.
    return ExecutionResult(manifest=[{out: out_path.local}], success=out_path.local.exists())


TransformInstance(
    protocol=protocol,
    model=model,
    group_by=inp,
    resources=Resources(cpus=8, memory=Size.GB(32), duration=Duration(hours=12)),
)

See transforms/_template.py for the minimal skeleton.
Every type referenced via lib.GetType("namespace::type") must exist in
data_types/<namespace>.yml or the build fails.
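
To see which type files a transform depends on before building, the references can be pulled out of its source with a regular expression. This is a rough sketch and not how the build itself resolves types: it only finds literal string arguments to GetType, and the helper names `referenced_types` and `required_type_files` are hypothetical.

```python
import re

# Matches GetType("namespace::type") with either quote style.
GET_TYPE = re.compile(r"""GetType\(\s*["']([\w-]+)::([\w.-]+)["']\s*\)""")


def referenced_types(source: str) -> list[tuple[str, str]]:
    """Return (namespace, type) pairs in order of first appearance."""
    seen: dict[tuple[str, str], None] = {}
    for namespace, type_name in GET_TYPE.findall(source):
        seen.setdefault((namespace, type_name), None)
    return list(seen)


def required_type_files(source: str) -> set[str]:
    """Map each referenced namespace to the YAML file expected to define it."""
    return {f"data_types/{namespace}.yml" for namespace, _ in referenced_types(source)}
```

Applied to a transform that references sequences::assembly and annotation::eggnog_table, `required_type_files` names data_types/sequences.yml and data_types/annotation.yml as the files to check.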
Tests use the fixtures defined in tests/conftest.py (agent,
base_resources, mlib, tmp_inputs).
pytest tests/ # full suite
pytest tests/test_binning_workflow.py -v # one workflow
pytest tests/ -m "not slow" # skip E2E
./dev.sh --test-comebin # named E2E shortcut

Workflow-planning tests call agent.GenerateWorkflow(...) and assert on
the resulting plan. End-to-end tests additionally call StageWorkflow() /
RunWorkflow() and wait for completion; they're marked @pytest.mark.slow
and need a configured execution backend.
- transforms/*/_metadata/ is build-generated; do not hand-edit.
- data_types/, resources/*/_metadata/index.yml (for declaring containers), and the transform .py files themselves are the hand-edited surface.
- Disabled transforms live in transforms/*/_disabled/ or are renamed <name>.py.disabled so the build skips them.
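
The skip conventions above can be illustrated with a small file walk. This is a sketch of the conventions as described, not the build's actual discovery code; the function name `enabled_transforms` is hypothetical, and leaving out files that sit directly under transforms/ (such as _template.py) is an assumption.

```python
from pathlib import Path


def enabled_transforms(transforms_root: Path) -> list[Path]:
    """List transform .py files, relative to the root, that are not disabled."""
    found = []
    # "*.py" never matches "<name>.py.disabled", so renamed files drop out here.
    for path in transforms_root.rglob("*.py"):
        rel = path.relative_to(transforms_root)
        if len(rel.parts) < 2:
            # Assumption: files directly under the root are not transforms.
            continue
        # Skip anything inside a _disabled/ or _metadata/ directory.
        if any(part in {"_disabled", "_metadata"} for part in rel.parts[:-1]):
            continue
        found.append(rel)
    return sorted(found)
```

The walk is recursive because some domains contain subdirectories (metagenomics/ has binning/ and taxonomy/).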
- hallamlab/metasmith — the planner / executor
- This library is consumed by Metasmith agents at workflow-generation time; it has no Python package of its own.