vepyr-diffly

vepyr-diffly compares Ensembl VEP output against vepyr output using semantic, DataFrame-based diffs.

It does four things:

run Ensembl VEP and vepyr on the same input VCF,
normalize both annotated VCF outputs into comparable tabular forms,
compare them semantically with Polars DataFrames,
print a clear console summary and write detailed diff artifacts for fixing vepyr.

Local Quickstart

This is the shortest practical path to test VEP vs vepyr locally on this machine.

1. Create the Python environment

cd /Users/lukaszjezapkowicz/Desktop/magisterka/praca/vepyr_diffly
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

2. Create `.env`

cp .env.example .env

The CLI loads .env automatically. The minimum important variables are:

VEPYR_DIFFLY_EXECUTION_MODE=local
VEPYR_DIFFLY_INPUT_VCF
VEPYR_DIFFLY_OUTPUT_DIR
VEPYR_DIFFLY_VEP_CACHE_DIR
VEPYR_DIFFLY_VEP_CACHE_VERSION
VEPYR_DIFFLY_REFERENCE_FASTA
VEPYR_DIFFLY_VEP_BIN
VEPYR_DIFFLY_VEPYR_PYTHON
VEPYR_DIFFLY_VEPYR_CACHE_OUTPUT_DIR
VEPYR_DIFFLY_VEPYR_USE_FJALL
VEPYR_DIFFLY_COMPARE_MODE
VEPYR_DIFFLY_BUCKET_COUNT
VEPYR_DIFFLY_COMPARE_WORKERS
VEPYR_DIFFLY_MEMORY_BUDGET_MB
VEPYR_DIFFLY_FINGERPRINT_ONLY

The current .env.example already points at the local paths that were verified in this workspace:

input VCF: /Users/lukaszjezapkowicz/Downloads/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf
VEP cache: /Users/lukaszjezapkowicz/.vep
FASTA: /Users/lukaszjezapkowicz/.vep/homo_sapiens/115_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa
VEP binary: /Users/lukaszjezapkowicz/VepAnnotations/ensembl/ensembl-vep/vep
vepyr Python env: /Users/lukaszjezapkowicz/Desktop/magisterka/praca/gdl-annotations-infra/modules/python/annotator_testing/runner/.vepyr/bin/python
vepyr cache root: /Users/lukaszjezapkowicz/Desktop/magisterka/praca/vepyr_diffly/.cache/vepyr_cache

For local plugin cache builds, configure plugin source files in .env as needed:

VEPYR_DIFFLY_PLUGIN_CLINVAR_SOURCE
VEPYR_DIFFLY_PLUGIN_SPLICEAI_SOURCE
VEPYR_DIFFLY_PLUGIN_CADD_SNV_SOURCE
VEPYR_DIFFLY_PLUGIN_CADD_INDEL_SOURCE
VEPYR_DIFFLY_PLUGIN_ALPHAMISSENSE_SOURCE
VEPYR_DIFFLY_PLUGIN_DBNSFP_SOURCE

If the cache directory also contains *.fjall stores, you can enable the faster vepyr backend with:

VEPYR_DIFFLY_VEPYR_USE_FJALL=true

3. Sanity-check the repo

source .venv/bin/activate
PYTHONPATH=src python -m vepyr_diffly.cli list-presets
PYTHONPATH=src pytest -q

4. Run a local smoke test

For a full end-to-end smoke test, including annotation by both engines:

source .venv/bin/activate
PYTHONPATH=src python -m vepyr_diffly.cli run \
  --output-dir runs/local-smoke-1000 \
  --sample-first-n 1000

To change how many input variants are used in the smoke test, change --sample-first-n:

source .venv/bin/activate
PYTHONPATH=src python -m vepyr_diffly.cli run \
  --output-dir runs/local-smoke-5000 \
  --sample-first-n 5000

To force the vepyr fjall backend for the run:

source .venv/bin/activate
PYTHONPATH=src python -m vepyr_diffly.cli run \
  --output-dir runs/local-smoke-fjall-5000 \
  --sample-first-n 5000 \
  --use-fjall

5. Reuse existing annotated VCFs without re-annotating

This is the faster iteration path when VEP output already exists and you only want to compare:

source .venv/bin/activate
PYTHONPATH=src python -m vepyr_diffly.cli compare-existing \
  --preset ensembl_everything \
  --left-vcf runs/local-smoke-1000/runtime/vep.annotated.vcf \
  --right-vcf runs/local-smoke-1000/runtime/vepyr.annotated.vcf \
  --output-dir runs/compare-only \
  --chromosomes 1,2,3 \
  --compare-mode fast \
  --memory-budget-mb 1024

Useful compare tuning flags:

--chromosomes 1 or --chromosomes 1,2,3: process only the selected chromosomes; standard aliases like 1 and chr1 are treated as the same chromosome
--compare-mode fast: exact precheck per bucket, then diffly only on buckets that differ
--compare-mode debug: force diffly for every bucket and keep full debug artifacts
--bucket-count <N>: override automatic consequence bucket count
--compare-workers <N>: override compare worker process count
--memory-budget-mb <N>: hard memory budget used by the bounded-memory planner for chunk sizes, bucket fanout, and worker count
--fingerprint-only: run the bucket precheck only and skip exact diff artifact generation

5a. Run only `vepyr` on an existing prepared input

This is the safest way to produce only the missing vepyr output when prepared_input.vcf already exists and you do not want to rerun VEP:

source .venv/bin/activate
PYTHONPATH=src python -m vepyr_diffly.cli annotate-vepyr \
  --input-vcf runs/full-golden-fresh/runtime/prepared_input.vcf \
  --output-vcf runs/full-golden-fresh/runtime/vepyr.annotated.vcf \
  --cache-dir /Users/lukaszjezapkowicz/Desktop/magisterka/praca/vepyr_diffly/.cache/vepyr_cache \
  --reference-fasta /Users/lukaszjezapkowicz/.vep/homo_sapiens/115_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa

Optional:

add --use-fjall when the cache root also contains *.fjall stores
add --log-path /path/to/vepyr.log to override the default log destination

5b. Benchmark compare-only on an existing annotated pair

source .venv/bin/activate
PYTHONPATH=src python -m vepyr_diffly.cli benchmark-compare \
  --left-vcf runs/full-golden-fresh/runtime/vep.annotated.vcf \
  --right-vcf runs/full-golden-fresh/runtime/vepyr.annotated.vcf \
  --output-json /tmp/vepyr-diffly-benchmark.json

This command runs compare-existing in temporary directories, writes only the compact benchmark summary JSON, and cleans the heavy temp artifacts automatically.

5c. Build a local `vepyr` cache for selected chromosomes

Use the helper script in scripts/ when you want to materialize the local cache outside a compare run.

Build only chromosome Y plus the default plugin set:

source .venv/bin/activate
python scripts/build_chr_cache.py --chromosomes Y

Build chromosomes 1 and Y:

source .venv/bin/activate
python scripts/build_chr_cache.py --chromosomes 1,Y

Build the full cache:

source .venv/bin/activate
python scripts/build_chr_cache.py

By default the script:

installs the local checkout from VEPYR_DIFFLY_VEPYR_PATH with pip install -e --no-build-isolation
builds the core cache under VEPYR_DIFFLY_VEPYR_CACHE_OUTPUT_DIR
builds core variation.fjall and translation_sift.fjall from the generated parquet cache
derives the Ensembl source cache from VEPYR_DIFFLY_VEP_CACHE_DIR
builds plugin caches for clinvar, spliceai, cadd, alphamissense, and dbnsfp
requires each requested plugin to be provided as a local source file via .env or CLI
does not download plugin sources from the Internet in the standard build path
materializes cadd from two local source files into one shared cache root layout: cadd/ and cadd.fjall/

Example local plugin source configuration:

export VEPYR_DIFFLY_PLUGIN_CLINVAR_SOURCE=/data/plugins/clinvar.vcf.gz
export VEPYR_DIFFLY_PLUGIN_CADD_SNV_SOURCE=/data/plugins/whole_genome_SNVs.tsv.gz
export VEPYR_DIFFLY_PLUGIN_CADD_INDEL_SOURCE=/data/plugins/gnomad.genomes.r4.0.indel.tsv.gz
export VEPYR_DIFFLY_PLUGIN_DBNSFP_SOURCE=/data/plugins/dbNSFP4.9c_grch38.gz

Useful flags:

--no-plugins to skip plugin cache generation
--no-core-fjall to skip variation.fjall / translation_sift.fjall
--only-plugins to skip the core cache and build only plugin caches
--chromosomes 1,4,10 to restrict both core and plugin generation to the selected chromosomes
--plugins clinvar,spliceai to limit the plugin set
--preview-rows 1000 to build both core and plugin caches from preview-sized source slices; for plugins this means first N data rows, while for the binary Ensembl core cache it means copying only the first preview-sized set of whole source shard files
when --chromosomes is used for plugins, preview slicing is chromosome-aware and uses tabix automatically when a matching .tbi index exists
--clean-plugin-output to delete only the current-layout outputs of the requested plugins before rebuilding (<version>/<plugin>/ and <version>/<plugin>.fjall/), which is useful for repeatable size and timing comparisons
--remove-old-layout-cache to delete recognized stale artifacts from the previous cache layout before building (parquet/<version>, root-level *.fjall, and root-level plugin directories)
--assume-sorted-plugin-input to skip SQL ORDER BY for single-source plugin builds when the raw input is already sorted by chrom,pos,ref,alt; this is a safe opt-in and is intentionally ignored for cadd
--clinvar-source, --spliceai-source, --cadd-snv-source, --cadd-indel-source, --alphamissense-source, --dbnsfp-source to override .env per run
--force-plugin-source is retained for CLI compatibility but ignored for local-source builds
--skip-install if vepyr is already installed in the active interpreter

Build only plugin caches without the core cache:

source .venv/bin/activate
python scripts/build_chr_cache.py --only-plugins --plugins clinvar,spliceai,cadd,alphamissense

Build only plugin preview caches from the first 1000 source rows:

source .venv/bin/activate
python scripts/build_chr_cache.py --only-plugins --plugins clinvar,spliceai,cadd,alphamissense --preview-rows 1000

Build repeatable chromosome-scoped preview caches and clean only the selected plugin outputs first:

source .venv/bin/activate
python scripts/build_chr_cache.py \
  --only-plugins \
  --plugins clinvar,spliceai,cadd,alphamissense,dbnsfp \
  --chromosomes 1 \
  --preview-rows 1000 \
  --clean-plugin-output \
  --skip-install

5d. Run a plugin round-trip validation check

Use the dedicated helper when you want an explicit convert -> read back -> verify row counts and schema check for plugin caches.

Example:

source .venv/bin/activate
python scripts/plugin_round_trip_test.py --plugins clinvar,spliceai,cadd,alphamissense --preview-rows 1000

What this script does for each plugin:

creates a preview source if --preview-rows is set
runs plugin conversion through the local vepyr build path
reads back the generated parquet files
verifies row counts: expected rows from the preview source vs actual parquet rows
verifies schema: expected plugin columns and Arrow types vs the written parquet schema
prints explicit per-plugin stages: convert, read-back, verify rows, verify schema, and final round-trip

Useful flags:

--plugins clinvar,spliceai to limit the plugin set
--preview-rows 1000 to keep the check fast on very large sources
--skip-install if the active environment already has the correct editable vepyr
--cache-dir /path/to/output to keep outputs in a known location
--keep-cache to keep the generated parquet files instead of deleting the temporary cache directory after the check

5e. Create `.tbi` indexes for plugin source files

Use the dedicated helper when you want to prepare tabix indexes for plugin inputs in plugins/.

Dry-run all known plugin sources:

source .venv/bin/activate
python scripts/create_plugin_indexes.py --dry-run

Index only dbnsfp, spliceai, and cadd:

source .venv/bin/activate
python scripts/create_plugin_indexes.py --plugins dbnsfp,spliceai,cadd

For plain gzip inputs such as clinvar.vcf.gz and AlphaMissense_hg38.tsv.gz, create sibling BGZF files first and index those:

source .venv/bin/activate
python scripts/create_plugin_indexes.py --plugins clinvar,alphamissense --recompress-plain-gzip

Notes:

BGZF inputs are indexed in place.
Plain gzip inputs cannot be indexed directly by tabix; they must be converted to BGZF first.
The recompress mode writes sibling .bgz files and leaves the original .gz untouched.
The helper prints [i/n] progress and periodic elapsed-time updates while bgzip or tabix is still running.
These .tbi files can help future chromosome-scoped plugin-cache builds; they do not speed up annotation from already-built parquet/fjall caches.

5f. Run a plugin annotation smoke test from built cache

Use this helper when you want to verify that vepyr.annotate() really picks up plugin values from the already-built plugin cache.

It:

reads one variant from each selected plugin cache parquet directory
writes a temporary mini VCF from those cache-backed variants
runs vepyr.annotate() against the partitioned cache
checks that at least one plugin-specific field is populated for each plugin

Example:

source .venv/bin/activate
python scripts/plugin_annotation_smoke_test.py --plugins clinvar,spliceai,cadd,alphamissense,dbnsfp --skip-install

Notes:

The script uses use_fjall=False deliberately, so it exercises the parquet-backed annotation path.
It expects plugin caches under .cache/vepyr_cache/115_GRCh38_vep/<plugin>/.
It expects VEPYR_DIFFLY_REFERENCE_FASTA in .env, or you can pass --reference-fasta.
If the existing local core cache contains corrupted parquet files from an interrupted older build, the smoke test can fail before plugin lookup. Rebuild the core cache into a clean cache root in that case.

5g. Issue Validation Artifact

For a step-by-step local proof that the current plugin-cache work satisfies the original plugin issue using reduced-scope evidence, see:

github_issue_check.MD

6. What to expect on stdout

run has two expensive phases:

annotation
comparison

The compare phase prints explicit progress and ends with a summary table like:

Comparison Summary
variant      yes   0   0   0   1000
consequence  yes   0   0   0   34741

The output directory then contains:

summary.json
summary.md
variant_diff.parquet
consequence_diff.parquet
variant_mismatches.tsv
consequence_mismatches.tsv
runtime/compare.progress.log
runtime/vep.log
runtime/vepyr.log

runtime/vep.log and runtime/vepyr.log are written incrementally while annotation is running, so they can be tailed live during long runs. Each log also records STARTED, ENDED, and EXIT markers to make elapsed-time checks easier.

How The Comparison Works

The comparison is semantic, not byte-for-byte.

Short Technical Summary

In short, the repo does this:

prepares one canonical input VCF for both annotators,
runs VEP and vepyr on exactly that same prepared input,
reads both annotated VCFs into tabular form,
normalizes INFO/CSQ into stable variant and consequence tables,
compares those tables with diffly,
writes a compact console summary plus detailed diff artifacts.

The main libraries used by the compare path are:

polars-bio to scan annotated VCF files,
polars to normalize and aggregate rows,
diffly to compare normalized DataFrames by primary key,
dataframely to validate the normalized schema,
rich to print readable summaries and progress.

input VCF
   |
   v
sample first N (optional)
   |
   v
split multi-allelic -> runtime/prepared_input.vcf
   |
   +--------------------+
   |                    |
   v                    v
 VEP                vepyr
   |                    |
   v                    v
vep.annotated.vcf   vepyr.annotated.vcf
   |                    |
   +----------+---------+
              |
              v
   normalize to variant tier + consequence tier
              |
              v
      diffly compare by primary key
              |
              v
 summary.json / summary.md / parquet diffs / mismatch TSV / progress logs

For small files the compare can still run directly on normalized tables. For larger files the pipeline now uses a bounded-memory path for both tiers: variant rows and consequence rows are streamed in chunks, spilled into hash buckets on disk, compacted per bucket, and then compared bucket-by-bucket. A resource planner derives chunk sizes, bucket count, worker count, and side parallelism from --memory-budget-mb, so working memory scales with the configured budget instead of full input size.

The pipeline can now also be scoped to selected chromosomes. The filter is available on both run and compare-existing, is applied before annotation in run, and during streaming normalization in compare-existing. Standard naming aliases are normalized, so 1 matches both 1 and chr1, and MT matches MT, M, chrMT, and chrM.

The pipeline is:

prepare the input VCF
annotate the same prepared input with VEP and vepyr
normalize both annotated VCFs into comparable tables
compare those tables with diffly
write console summary plus file artifacts

Input preparation

Before annotation, run prepares a canonical input:

if sampling is enabled, it writes runtime/sampled_input.vcf
if chromosome filtering is enabled, it writes runtime/filtered_input.vcf
it then decomposes multi-allelic rows into single-alt rows
it writes the final annotation input to runtime/prepared_input.vcf
it records overall and per-chromosome preparation stats in runtime/input_preparation.json

This step is required because multi-allelic input created misleading consequence-level diffs, especially for indels. After splitting to single-alt before annotation, the smoke 5000 run became fully equal again.

Two comparison tiers

The repo compares two normalized tables.

Variant tier:

primary key: chrom, pos, ref, alt
tracks counts and basic summary columns per variant/allele
useful to detect gross record-level divergence

Consequence tier:

one normalized row per semantic CSQ consequence
compared after parsing and normalizing INFO/CSQ
this is the tier used to diagnose real annotation drift

Why `diffly`

diffly is the final diff engine. We do not compare raw VCF text directly. Instead:

VCF is normalized into tabular rows
rows get stable primary keys
duplicates are counted explicitly
then diffly compares the normalized DataFrames

Why bucketization exists

The full golden annotated VCFs are too large for a naive eager compare.

For large runs, consequence comparison works in buckets:

normalize consequence rows into hash buckets
run an exact precheck per bucket
in fast mode, skip diffly for buckets already proven equal
in debug mode, compare every bucket with diffly
merge bucket diff artifacts at the end

For large runs, variant comparison now follows the same model:

stream variant rows into hash buckets on disk
compact each bucket into deterministic bucket.parquet
compare the variant tier bucket-by-bucket instead of one global in-memory diff
keep the old eager path only for safely small files

This keeps memory bounded and gives progress reporting for large compares.

Results

A. Smoke Tests

The most recent post-fix smoke matrix that was re-verified locally is:

10
100
1000
10000

For each size the workflow was:

run --compare-mode fast
then compare-existing --compare-mode debug
with the same annotated outputs
under a bounded-memory budget

Observed outcome:

Sample size	Prepared rows	`fast` variant equal	`fast` consequence equal	`debug` variant equal	`debug` consequence equal
`10`	`10`	`true`	`true`	`true`	`true`
`100`	`100`	`true`	`true`	`true`	`true`
`1000`	`1000`	`true`	`true`	`true`	`true`
`10000`	`10066`	`true`	`true`	`true`	`true`

Observations:

all currently re-verified smoke points are fully equal at both tiers
the vepyr import path issue is fixed; the runner now prefers the working .vepyr environment before falling back to the local checkout
multi-allelic decomposition is still visible at 10000, where 10000 source variants became 10066 prepared single-alt rows
the compare path is green in both fast and debug on small and medium-sized fixtures after the RAM-safe refactor

B. Golden Test

The newest retained full compare-only result is runs/full-golden-compare-fast-per-chrom-20260330, run in fast mode against:

Effective plan from summary.json:

memory_budget_mb=1024
bucket_count=256
variant_chunk_rows=10000
consequence_chunk_rows=8192
compare_workers=2
parallelize_sides=false

Observed compare-only timings:

Stage	Seconds	Approx.
Variant summary	`518.632`	`8m 39s`
Consequence bucketization	`999.975`	`16m 40s`
Schema validation	`0.031`	`<1s`
Variant diff	`4.081`	`4s`
Consequence diff	`490.070`	`8m 10s`
Total compare-only	`2012.789`	`33m 33s`

Observed result:

variant tier: equal=true
variant joined equal rows: 4096123
variant left_only=0, right_only=0, unequal=0
consequence tier: equal=false
consequence joined equal rows: 36914657
consequence left_only=392, right_only=392, unequal=0

What this means:

at the variant level VEP and vepyr are fully equal
at the consequence level there are still a small number of rows that exist only on one side
those rows are not random noise: the sampled mismatch artifact collapses cleanly into 392 transcript-level left_only/right_only pairs
there are still no cases where the same normalized consequence key exists on both sides with different payload values inside one joined key

Why this run still takes a long time:

the dominant cost is not the final diff itself, but preprocessing the annotated VCFs into bounded-memory bucket artifacts
consequence_bucketization alone is still about 16m 40s
even in fast mode, only a tiny minority of consequence buckets are proven equal by precheck, so the final diff still has to run exact compare for most buckets
the inputs are very large annotated VCFs, so parsing CSQ, normalizing transcript consequences, spilling to disk, and compacting buckets dominate wall-clock time
the run deliberately used a conservative 1024 MB memory budget, which reduces RAM pressure but also reduces concurrency

Per-chromosome view from the same summary:

all chromosomes are variant.equal=true
only chromosome 1 is fully equal on the consequence tier
the largest consequence mismatch counts are on chromosomes 2 (88/88), 9 (54/54), 6 (44/44), 3 (39/39), and 14 (37/37)
the per-chromosome timings below are attributed timings from the summary, not independent wall-clock slices, so they are best used for relative weight and hotspot analysis

Chr	Variant	Consequence	Left/Right only	Attributed total	Variant summary	Consequence bucketization	Consequence diff
1	`yes`	`yes`	`0 / 0`	`161.248s`	`40.951s`	`80.516s`	`39.459s`
2	`yes`	`no`	`88 / 88`	`166.191s`	`41.951s`	`83.156s`	`40.754s`
3	`yes`	`no`	`39 / 39`	`158.916s`	`36.532s`	`81.940s`	`40.157s`
4	`yes`	`no`	`10 / 10`	`135.640s`	`38.908s`	`64.712s`	`31.714s`
5	`yes`	`no`	`8 / 8`	`110.990s`	`33.478s`	`51.842s`	`25.407s`
6	`yes`	`no`	`44 / 44`	`129.702s`	`34.435s`	`63.752s`	`31.244s`
7	`yes`	`no`	`9 / 9`	`114.174s`	`29.694s`	`56.538s`	`27.708s`
8	`yes`	`no`	`26 / 26`	`107.026s`	`28.519s`	`52.536s`	`25.747s`
9	`yes`	`no`	`54 / 54`	`83.531s`	`22.298s`	`40.976s`	`20.082s`
10	`yes`	`no`	`8 / 8`	`105.041s`	`27.028s`	`52.212s`	`25.588s`
11	`yes`	`no`	`12 / 12`	`100.479s`	`26.187s`	`49.719s`	`24.367s`
12	`yes`	`no`	`8 / 8`	`95.336s`	`25.046s`	`47.040s`	`23.053s`
13	`yes`	`no`	`7 / 7`	`61.341s`	`20.438s`	`27.342s`	`13.400s`
14	`yes`	`no`	`37 / 37`	`68.388s`	`16.865s`	`34.488s`	`16.902s`
15	`yes`	`no`	`6 / 6`	`68.004s`	`15.850s`	`34.917s`	`17.112s`
16	`yes`	`no`	`13 / 13`	`62.537s`	`15.619s`	`31.404s`	`15.391s`
17	`yes`	`no`	`3 / 3`	`65.283s`	`13.722s`	`34.530s`	`16.923s`
18	`yes`	`no`	`1 / 1`	`56.589s`	`15.116s`	`27.753s`	`13.601s`
19	`yes`	`no`	`5 / 5`	`55.707s`	`11.484s`	`29.618s`	`14.515s`
20	`yes`	`no`	`11 / 11`	`41.981s`	`11.003s`	`20.731s`	`10.160s`
21	`yes`	`no`	`2 / 2`	`29.526s`	`7.067s`	`15.035s`	`7.368s`
22	`yes`	`no`	`1 / 1`	`35.130s`	`6.440s`	`19.220s`	`9.419s`

What the 392 / 392 / 0 mismatches actually are:

they are not missing transcripts versus extra transcripts in the old sense
each sampled mismatch row on the VEP side has a matching transcript-level partner on the vepyr side with the same chrom, pos, ref, alt, and Feature
the mismatch is therefore usually a payload drift for the same transcript consequence, not a different transcript set
the top low-level field signatures are:
- only HGVSp: 121 pairs
- only Consequence: 72
- only HGNC_ID: 63
- only HGVSc: 54
- Consequence + IMPACT: 40
the top semantic classes are:
- hgvs_payload_drift: 114
- missing_hgvs: 84
- consequence_reclassification: 79
- missing_hgnc_id: 63
- consequence_and_impact_reclassification: 55

In practice that means the 392 mismatch pairs are mostly:

rows where both sides point at the same transcript consequence, but HGVSp uses different protein coordinates or wording
rows where VEP has HGVSc or HGVSp filled and vepyr leaves it empty
rows where VEP and vepyr classify the same transcript consequence differently, for example inframe_insertion versus inframe_insertion&start_retained_variant
rows where the same transcript consequence gets a different IMPACT, for example HIGH versus LOW
rows where HGNC_ID is missing on one side, most often present in vepyr and empty in VEP

Representative raw CSQ examples confirm that these are already present in the source annotated VCFs:

missing_hgvs:
- VEP contains ENST00000944442.1:c.-111_-110ins...
- vepyr emits the same row with an empty HGVSc
consequence_and_impact_reclassification:
- VEP: frameshift_variant&splice_region_variant|HIGH
- vepyr: splice_region_variant|LOW
consequence_reclassification:
- VEP: inframe_insertion
- vepyr: inframe_insertion&start_retained_variant
missing_hgnc_id:
- VEP: empty HGNC_ID
- vepyr: populated HGNC:...
hgvs_payload_drift:
- both sides agree on Consequence, IMPACT, and HGVSc
- but HGVSp refers to different duplicated amino-acid coordinates

Follow-up optimization note:

after the older retained full run, the consequence compare path was tightened further so bucketed compare no longer writes unused per-bucket TSV files, can use cheap bucket metadata to skip the expensive eager precheck for buckets that are already provably different, and now uses in-process worker threads instead of spawning separate compare subprocesses
the fresh full rerun above confirms those changes on the full pipeline: consequence_diff is now 490.070s instead of the older retained 761.191s, which is about 35.6% faster
on the same machine and with the same full-golden vepyr.annotated.vcf, a one-side materialize_consequence_buckets(...) benchmark at bucket_count=256 and chunk_variants=8192 first dropped from 708.518s to 654.883s after switching temporary bucket part writes to lighter parquet settings (lz4, no statistics), and then to 447.560s after buffering several reduced chunks before each flush
that latest bucketization benchmark is about 36.8% faster than the original one-side baseline, and it also cuts temporary file churn dramatically because bucket parts are flushed around every 65536 input variants instead of every 8192
the fresh full rerun also reflects that in end-to-end compare wall-clock:
- older retained compare-only baseline: 49m 00s
- fresh rerun on current code: 33m 33s

Latest retained mismatch diagnosis:

all 784 sampled consequence mismatch rows collapse cleanly into 392 transcript-level left_only/right_only pairs
there are 0 unpaired rows in that sampled mismatch artifact
the top differing fields are still HGVSp, Consequence, HGNC_ID, IMPACT, and HGVSc
the top semantic drift classes are:
- hgvs_payload_drift: 114
- missing_hgvs: 84
- consequence_reclassification: 79
- missing_hgnc_id: 63
- consequence_and_impact_reclassification: 55
raw CSQ extraction on representative cases confirms that these differences already exist in the source annotated VCF payloads, not just in downstream normalization
the current debug tooling can now go from mismatch TSV -> semantic category summary -> raw left/right CSQ examples without rescanning the whole mismatch space manually

Most useful artifacts for debugging the golden mismatch:

summary.json
summary.md
consequence_mismatches.tsv
consequence_diff.parquet
runtime/compare.progress.log
/tmp/vepyr-diffly-golden-mismatch-analysis.json
/tmp/vepyr-diffly-golden-raw-csq.json

Logs and structured diagnostics added for this:

runtime/compare.progress.log: live stage and per-bucket compare progress for the full run
consequence_mismatch_analysis.json: grouped transcript-level mismatch summary, field counts, semantic categories, and representative examples
raw-csq-examples.json: raw left/right CSQ payloads for representative cases in each semantic category

Current Scope

Current verified scope:

one active preset: ensembl_everything
semantic comparison of annotated VCF INFO/CSQ
two comparison tiers: variant and consequence
bounded-memory compare path for large annotated VCFs
local runtime execution against a real local VEP installation
local runtime execution against an existing vepyr Python environment and parquet cache
compare-only reuse of existing annotated VCFs via compare-existing

Python Environment

The root requirements.txt installs:

the local package in editable mode
runtime dependencies from pyproject.toml
pytest
ruff

Recommended setup:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
cp .env.example .env

Quick verification:

source .venv/bin/activate
PYTHONPATH=src python -m vepyr_diffly.cli list-presets
PYTHONPATH=src pytest -q

Repo Configuration

Runtime paths should live in repo-local .env.

Important variables:

VEPYR_DIFFLY_EXECUTION_MODE
VEPYR_DIFFLY_INPUT_VCF
VEPYR_DIFFLY_OUTPUT_DIR
VEPYR_DIFFLY_VEP_CACHE_DIR
VEPYR_DIFFLY_VEP_CACHE_VERSION
VEPYR_DIFFLY_REFERENCE_FASTA
VEPYR_DIFFLY_VEP_BIN
VEPYR_DIFFLY_VEPYR_PYTHON
VEPYR_DIFFLY_VEPYR_CACHE_OUTPUT_DIR
VEPYR_DIFFLY_COMPARE_MODE
VEPYR_DIFFLY_MEMORY_BUDGET_MB

Notes:

VEPYR_DIFFLY_VEPYR_PYTHON should point to a working virtualenv interpreter such as .../.vepyr/bin/python
VEPYR_DIFFLY_VEPYR_CACHE_OUTPUT_DIR should point to the cache root, not directly to the deepest parquet leaf
the CLI loads .env automatically

CLI

Main commands:

python -m vepyr_diffly.cli list-presets
python -m vepyr_diffly.cli run --preset ensembl_everything --input-vcf /path/to/input.vcf --output-dir runs/demo --sample-first-n 1000 --chromosomes 1,2
python -m vepyr_diffly.cli compare-existing --preset ensembl_everything --left-vcf /path/to/vep.annotated.vcf --right-vcf /path/to/vepyr.annotated.vcf --output-dir runs/compare-only --chromosomes 1,2 --compare-mode fast --memory-budget-mb 1024
python -m vepyr_diffly.cli annotate-vepyr --input-vcf /path/to/prepared_input.vcf --output-vcf /path/to/vepyr.annotated.vcf
python -m vepyr_diffly.cli benchmark-compare --left-vcf /path/to/vep.annotated.vcf --right-vcf /path/to/vepyr.annotated.vcf --output-json /tmp/benchmark.json
python -m vepyr_diffly.cli inspect-run --run-dir runs/demo
python -m vepyr_diffly.cli analyze-consequence-mismatches --run-dir runs/full-golden-compare-fast --output-json /tmp/mismatch-analysis.json
python -m vepyr_diffly.cli extract-mismatch-csq-examples --run-dir runs/full-golden-fresh --analysis-json /tmp/mismatch-analysis.json --output-json /tmp/raw-csq-examples.json

analyze-consequence-mismatches reads consequence_mismatches.tsv, groups left_only/right_only rows back into transcript-level pairs, counts the fields that drift most often, and emits representative examples. It is the fastest way to turn a retained run into a concrete CSQ drift report.

extract-mismatch-csq-examples takes that analysis and enriches a few representative cases per semantic category with the original raw CSQ= entries from both annotated VCFs. This is the shortest path from "which categories drift?" to "what exactly did VEP and vepyr emit for the same transcript consequence?".

Expected Output

Typical successful smoke output:

Preset: ensembl_everything
Input: /Users/lukaszjezapkowicz/Downloads/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf
Sample first N: 1000

Comparison Summary

Tier         Equal  Left only  Right only  Unequal  Joined equal
variant      yes    0          0           0        1000
consequence  yes    0          0           0        34741

Main run artifacts:

summary.json
summary.md
variant_diff.parquet
consequence_diff.parquet
variant_mismatches.tsv
consequence_mismatches.tsv
runtime/effective_config.json
runtime/compare.progress.log
runtime/vep.log
runtime/vepyr.log

On large runs the TSV mismatch files are intentionally sampled, while the parquet diff artifacts remain complete.

summary.json and summary.md now include both:

overall results for the processed subset as a whole
per-chromosome variant/consequence equality and counts
attributed per-chromosome stage timings for normalization and diff

Troubleshooting

If tests fail:

source .venv/bin/activate
PYTHONPATH=src pytest -q -vv

If annotation fails before compare:

inspect runtime/vep.log
inspect runtime/vepyr.log
inspect runtime/effective_config.json

Typical causes:

wrong VEPYR_DIFFLY_VEP_BIN
wrong VEPYR_DIFFLY_VEP_CACHE_DIR
wrong VEPYR_DIFFLY_REFERENCE_FASTA
wrong VEPYR_DIFFLY_VEPYR_CACHE_OUTPUT_DIR
VEPYR_DIFFLY_VEPYR_PYTHON does not have a working vepyr install

If vepyr fails to import:

check that VEPYR_DIFFLY_VEPYR_PYTHON points to a real virtualenv interpreter like .../.vepyr/bin/python
avoid resolving that path to the system interpreter

If a long compare is still running, tail:

tail -f runs/<run>/runtime/compare.progress.log

The most important milestones are:

parsed CSQ header
materializing variant summary
bucketizing consequence rows
comparing ... buckets
writing summaries
completed successfully

Runtime Notes

canonical equality is semantic comparison of parsed CSQ
raw VCF text equality is out of scope
the currently verified execution path is local runtime
the compare path is now bounded-memory and driven by the configured memory budget
for large runs, choosing a smaller memory budget trades throughput for lower RAM pressure

Known Limitations

only the ensembl_everything preset is currently verified end-to-end
merged and refseq cache flavors are not implemented and verified yet
Docker/container execution remains scaffolded in the repo, but local execution is the currently verified path
the retained full golden result currently documents compare-existing --compare-mode fast; a fresh full end-to-end golden run with the same current codepath is still worth recording separately
the repo compares semantic normalized outputs, not byte-for-byte annotated VCF files
the current docs assume the local external assets already exist on this machine: VEP binary, VEP cache, FASTA, vepyr Python environment, and vepyr parquet cache

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
.agent		.agent
docker		docker
presets		presets
scripts		scripts
src/vepyr_diffly		src/vepyr_diffly
tests		tests
.env.example		.env.example
.gitignore		.gitignore
AGENTS.md		AGENTS.md
README.md		README.md
e2e_testing.MD		e2e_testing.MD
github_issue_check.MD		github_issue_check.MD
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

vepyr-diffly

Local Quickstart

1. Create the Python environment

2. Create .env

3. Sanity-check the repo

4. Run a local smoke test

5. Reuse existing annotated VCFs without re-annotating

5a. Run only vepyr on an existing prepared input

5b. Benchmark compare-only on an existing annotated pair

5c. Build a local vepyr cache for selected chromosomes

5d. Run a plugin round-trip validation check

5e. Create .tbi indexes for plugin source files

5f. Run a plugin annotation smoke test from built cache

5g. Issue Validation Artifact

6. What to expect on stdout

How The Comparison Works

Short Technical Summary

Input preparation

Two comparison tiers

Why diffly

Why bucketization exists

Results

A. Smoke Tests

B. Golden Test

Current Scope

Python Environment

Repo Configuration

CLI

Expected Output

Troubleshooting

Runtime Notes

Known Limitations

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

2. Create `.env`

5a. Run only `vepyr` on an existing prepared input

5c. Build a local `vepyr` cache for selected chromosomes

5e. Create `.tbi` indexes for plugin source files

Why `diffly`

Packages