vepyr-diffly compares Ensembl VEP output against vepyr output using semantic, DataFrame-based diffs.
It does four things:
- run Ensembl VEP and vepyr on the same input VCF,
- normalize both annotated VCF outputs into comparable tabular forms,
- compare them semantically with Polars DataFrames,
- print a clear console summary and write detailed diff artifacts for fixing vepyr.

This is the shortest practical path to test VEP vs vepyr locally on this machine.
```bash
cd /Users/lukaszjezapkowicz/Desktop/magisterka/praca/vepyr_diffly
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
cp .env.example .env
```

The CLI loads .env automatically. The minimum important variables are:
- `VEPYR_DIFFLY_EXECUTION_MODE=local`
- `VEPYR_DIFFLY_INPUT_VCF`
- `VEPYR_DIFFLY_OUTPUT_DIR`
- `VEPYR_DIFFLY_VEP_CACHE_DIR`
- `VEPYR_DIFFLY_VEP_CACHE_VERSION`
- `VEPYR_DIFFLY_REFERENCE_FASTA`
- `VEPYR_DIFFLY_VEP_BIN`
- `VEPYR_DIFFLY_VEPYR_PYTHON`
- `VEPYR_DIFFLY_VEPYR_CACHE_OUTPUT_DIR`
- `VEPYR_DIFFLY_VEPYR_USE_FJALL`
- `VEPYR_DIFFLY_COMPARE_MODE`
- `VEPYR_DIFFLY_BUCKET_COUNT`
- `VEPYR_DIFFLY_COMPARE_WORKERS`
- `VEPYR_DIFFLY_MEMORY_BUDGET_MB`
- `VEPYR_DIFFLY_FINGERPRINT_ONLY`
The current .env.example already points at the local paths that were verified in this workspace:
- input VCF: `/Users/lukaszjezapkowicz/Downloads/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf`
- VEP cache: `/Users/lukaszjezapkowicz/.vep`
- FASTA: `/Users/lukaszjezapkowicz/.vep/homo_sapiens/115_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa`
- VEP binary: `/Users/lukaszjezapkowicz/VepAnnotations/ensembl/ensembl-vep/vep`
- vepyr Python env: `/Users/lukaszjezapkowicz/Desktop/magisterka/praca/gdl-annotations-infra/modules/python/annotator_testing/runner/.vepyr/bin/python`
- vepyr cache root: `/Users/lukaszjezapkowicz/Desktop/magisterka/praca/vepyr_diffly/.cache/vepyr_cache`
For local plugin cache builds, configure plugin source files in .env as needed:
- `VEPYR_DIFFLY_PLUGIN_CLINVAR_SOURCE`
- `VEPYR_DIFFLY_PLUGIN_SPLICEAI_SOURCE`
- `VEPYR_DIFFLY_PLUGIN_CADD_SNV_SOURCE`
- `VEPYR_DIFFLY_PLUGIN_CADD_INDEL_SOURCE`
- `VEPYR_DIFFLY_PLUGIN_ALPHAMISSENSE_SOURCE`
- `VEPYR_DIFFLY_PLUGIN_DBNSFP_SOURCE`
If the cache directory also contains *.fjall stores, you can enable the faster vepyr backend with:
```bash
VEPYR_DIFFLY_VEPYR_USE_FJALL=true
```
```bash
source .venv/bin/activate
PYTHONPATH=src python -m vepyr_diffly.cli list-presets
PYTHONPATH=src pytest -q
```

For a full end-to-end smoke test, including annotation by both engines:
```bash
source .venv/bin/activate
PYTHONPATH=src python -m vepyr_diffly.cli run \
    --output-dir runs/local-smoke-1000 \
    --sample-first-n 1000
```

To change how many input variants are used in the smoke test, change `--sample-first-n`:
```bash
source .venv/bin/activate
PYTHONPATH=src python -m vepyr_diffly.cli run \
    --output-dir runs/local-smoke-5000 \
    --sample-first-n 5000
```

To force the vepyr fjall backend for the run:
```bash
source .venv/bin/activate
PYTHONPATH=src python -m vepyr_diffly.cli run \
    --output-dir runs/local-smoke-fjall-5000 \
    --sample-first-n 5000 \
    --use-fjall
```

This is the faster iteration path when VEP output already exists and you only want to compare:
```bash
source .venv/bin/activate
PYTHONPATH=src python -m vepyr_diffly.cli compare-existing \
    --preset ensembl_everything \
    --left-vcf runs/local-smoke-1000/runtime/vep.annotated.vcf \
    --right-vcf runs/local-smoke-1000/runtime/vepyr.annotated.vcf \
    --output-dir runs/compare-only \
    --chromosomes 1,2,3 \
    --compare-mode fast \
    --memory-budget-mb 1024
```

Useful compare tuning flags:
- `--chromosomes 1` or `--chromosomes 1,2,3`: process only the selected chromosomes; standard aliases like `1` and `chr1` are treated as the same chromosome
- `--compare-mode fast`: exact precheck per bucket, then diffly only on buckets that differ
- `--compare-mode debug`: force diffly for every bucket and keep full debug artifacts
- `--bucket-count <N>`: override the automatic consequence bucket count
- `--compare-workers <N>`: override the compare worker process count
- `--memory-budget-mb <N>`: hard memory budget used by the bounded-memory planner for chunk sizes, bucket fanout, and worker count
- `--fingerprint-only`: run the bucket precheck only and skip exact diff artifact generation
This is the safest way to produce only the missing vepyr output when prepared_input.vcf already exists and you do not want to rerun VEP:
```bash
source .venv/bin/activate
PYTHONPATH=src python -m vepyr_diffly.cli annotate-vepyr \
    --input-vcf runs/full-golden-fresh/runtime/prepared_input.vcf \
    --output-vcf runs/full-golden-fresh/runtime/vepyr.annotated.vcf \
    --cache-dir /Users/lukaszjezapkowicz/Desktop/magisterka/praca/vepyr_diffly/.cache/vepyr_cache \
    --reference-fasta /Users/lukaszjezapkowicz/.vep/homo_sapiens/115_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa
```

Optional:
- add `--use-fjall` when the cache root also contains `*.fjall` stores
- add `--log-path /path/to/vepyr.log` to override the default log destination
```bash
source .venv/bin/activate
PYTHONPATH=src python -m vepyr_diffly.cli benchmark-compare \
    --left-vcf runs/full-golden-fresh/runtime/vep.annotated.vcf \
    --right-vcf runs/full-golden-fresh/runtime/vepyr.annotated.vcf \
    --output-json /tmp/vepyr-diffly-benchmark.json
```

This command runs compare-existing in temporary directories, writes only the compact benchmark summary JSON, and cleans up the heavy temporary artifacts automatically.
Use the helper script in scripts/ when you want to materialize the local cache outside a compare run.
Build only chromosome Y plus the default plugin set:
```bash
source .venv/bin/activate
python scripts/build_chr_cache.py --chromosomes Y
```

Build chromosomes 1 and Y:

```bash
source .venv/bin/activate
python scripts/build_chr_cache.py --chromosomes 1,Y
```

Build the full cache:

```bash
source .venv/bin/activate
python scripts/build_chr_cache.py
```

By default the script:
- installs the local checkout from `VEPYR_DIFFLY_VEPYR_PATH` with `pip install -e --no-build-isolation`
- builds the core cache under `VEPYR_DIFFLY_VEPYR_CACHE_OUTPUT_DIR`
- builds core `variation.fjall` and `translation_sift.fjall` from the generated parquet cache
- derives the Ensembl source cache from `VEPYR_DIFFLY_VEP_CACHE_DIR`
- builds plugin caches for `clinvar`, `spliceai`, `cadd`, `alphamissense`, and `dbnsfp`
- requires each requested plugin to be provided as a local source file via `.env` or CLI
- does not download plugin sources from the Internet in the standard build path
- materializes `cadd` from two local source files into one shared cache root layout: `cadd/` and `cadd.fjall/`
Example local plugin source configuration:
```bash
export VEPYR_DIFFLY_PLUGIN_CLINVAR_SOURCE=/data/plugins/clinvar.vcf.gz
export VEPYR_DIFFLY_PLUGIN_CADD_SNV_SOURCE=/data/plugins/whole_genome_SNVs.tsv.gz
export VEPYR_DIFFLY_PLUGIN_CADD_INDEL_SOURCE=/data/plugins/gnomad.genomes.r4.0.indel.tsv.gz
export VEPYR_DIFFLY_PLUGIN_DBNSFP_SOURCE=/data/plugins/dbNSFP4.9c_grch38.gz
```

Useful flags:
- `--no-plugins` to skip plugin cache generation
- `--no-core-fjall` to skip `variation.fjall`/`translation_sift.fjall`
- `--only-plugins` to skip the core cache and build only plugin caches
- `--chromosomes 1,4,10` to restrict both core and plugin generation to the selected chromosomes
- `--plugins clinvar,spliceai` to limit the plugin set
- `--preview-rows 1000` to build both core and plugin caches from preview-sized source slices; for plugins this means the first N data rows, while for the binary Ensembl core cache it means copying only the first preview-sized set of whole source shard files
- when `--chromosomes` is used for plugins, preview slicing is chromosome-aware and uses `tabix` automatically when a matching `.tbi` index exists
- `--clean-plugin-output` to delete only the current-layout outputs of the requested plugins before rebuilding (`<version>/<plugin>/` and `<version>/<plugin>.fjall/`), which is useful for repeatable size and timing comparisons
- `--remove-old-layout-cache` to delete recognized stale artifacts from the previous cache layout before building (`parquet/<version>`, root-level `*.fjall`, and root-level plugin directories)
- `--assume-sorted-plugin-input` to skip the SQL `ORDER BY` for single-source plugin builds when the raw input is already sorted by `chrom,pos,ref,alt`; this is a safe opt-in and is intentionally ignored for `cadd`
- `--clinvar-source`, `--spliceai-source`, `--cadd-snv-source`, `--cadd-indel-source`, `--alphamissense-source`, `--dbnsfp-source` to override `.env` per run
- `--force-plugin-source` is retained for CLI compatibility but ignored for local-source builds
- `--skip-install` if vepyr is already installed in the active interpreter
Build only plugin caches without the core cache:
```bash
source .venv/bin/activate
python scripts/build_chr_cache.py --only-plugins --plugins clinvar,spliceai,cadd,alphamissense
```

Build only plugin preview caches from the first 1000 source rows:

```bash
source .venv/bin/activate
python scripts/build_chr_cache.py --only-plugins --plugins clinvar,spliceai,cadd,alphamissense --preview-rows 1000
```

Build repeatable chromosome-scoped preview caches and clean only the selected plugin outputs first:

```bash
source .venv/bin/activate
python scripts/build_chr_cache.py \
    --only-plugins \
    --plugins clinvar,spliceai,cadd,alphamissense,dbnsfp \
    --chromosomes 1 \
    --preview-rows 1000 \
    --clean-plugin-output \
    --skip-install
```

Use the dedicated helper when you want an explicit convert -> read back -> verify flow with row-count and schema checks for plugin caches.
Example:
```bash
source .venv/bin/activate
python scripts/plugin_round_trip_test.py --plugins clinvar,spliceai,cadd,alphamissense --preview-rows 1000
```

What this script does for each plugin:
- creates a preview source if `--preview-rows` is set
- runs plugin conversion through the local vepyr build path
- reads back the generated parquet files
- verifies row counts: expected rows from the preview source vs actual parquet rows
- verifies schema: expected plugin columns and Arrow types vs the written parquet schema
- prints explicit per-plugin stages: `convert`, `read-back`, `verify rows`, `verify schema`, and a final `round-trip`

Useful flags:
- `--plugins clinvar,spliceai` to limit the plugin set
- `--preview-rows 1000` to keep the check fast on very large sources
- `--skip-install` if the active environment already has the correct editable vepyr
- `--cache-dir /path/to/output` to keep outputs in a known location
- `--keep-cache` to keep the generated parquet files instead of deleting the temporary cache directory after the check
Use the dedicated helper when you want to prepare tabix indexes for plugin inputs in plugins/.

Dry-run all known plugin sources:

```bash
source .venv/bin/activate
python scripts/create_plugin_indexes.py --dry-run
```

Index only dbnsfp, spliceai, and cadd:

```bash
source .venv/bin/activate
python scripts/create_plugin_indexes.py --plugins dbnsfp,spliceai,cadd
```

For plain gzip inputs such as clinvar.vcf.gz and AlphaMissense_hg38.tsv.gz, create sibling BGZF files first and index those:

```bash
source .venv/bin/activate
python scripts/create_plugin_indexes.py --plugins clinvar,alphamissense --recompress-plain-gzip
```

Notes:
- BGZF inputs are indexed in place.
- Plain gzip inputs cannot be indexed directly by `tabix`; they must be converted to BGZF first.
- The recompress mode writes sibling `.bgz` files and leaves the original `.gz` untouched.
- The helper prints `[i/n]` progress and periodic elapsed-time updates while `bgzip` or `tabix` is still running.
- These `.tbi` files can help future chromosome-scoped plugin-cache builds; they do not speed up annotation from already-built parquet/fjall caches.
Use this helper when you want to verify that vepyr.annotate() really picks up
plugin values from the already-built plugin cache.
It:
- reads one variant from each selected plugin cache parquet directory
- writes a temporary mini VCF from those cache-backed variants
- runs `vepyr.annotate()` against the partitioned cache
- checks that at least one plugin-specific field is populated for each plugin
Example:
```bash
source .venv/bin/activate
python scripts/plugin_annotation_smoke_test.py --plugins clinvar,spliceai,cadd,alphamissense,dbnsfp --skip-install
```

Notes:
- The script uses `use_fjall=False` deliberately, so it exercises the parquet-backed annotation path.
- It expects plugin caches under `.cache/vepyr_cache/115_GRCh38_vep/<plugin>/`.
- It expects `VEPYR_DIFFLY_REFERENCE_FASTA` in `.env`, or you can pass `--reference-fasta`.
- If the existing local core cache contains corrupted parquet files from an interrupted older build, the smoke test can fail before plugin lookup. Rebuild the core cache into a clean cache root in that case.
For a step-by-step local proof that the current plugin-cache work satisfies the original plugin issue using reduced-scope evidence, see:
A `run` has two expensive phases:
- annotation
- comparison
The compare phase prints explicit progress and ends with a summary table like:
```
Comparison Summary
Tier         Equal  Left only  Right only  Unequal  Joined equal
variant      yes    0          0           0        1000
consequence  yes    0          0           0        34741
```
The output directory then contains:
- summary.json
- summary.md
- variant_diff.parquet
- consequence_diff.parquet
- variant_mismatches.tsv
- consequence_mismatches.tsv
- runtime/compare.progress.log
- runtime/vep.log
- runtime/vepyr.log
runtime/vep.log and runtime/vepyr.log are written incrementally while annotation is running, so they can be tailed live during long runs.
Each log also records STARTED, ENDED, and EXIT markers to make elapsed-time checks easier.
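Given those STARTED/ENDED markers, elapsed time can be computed with a few lines. This sketch assumes each marker line begins with a `YYYY-mm-dd HH:MM:SS` timestamp, which may not match the real log format:

```python
from datetime import datetime

def elapsed_seconds(log_lines, fmt="%Y-%m-%d %H:%M:%S"):
    """Seconds between the STARTED and ENDED marker lines, or None.

    Assumes marker lines look like '2026-03-30 12:00:00 STARTED ...';
    the actual vepyr-diffly log layout may differ.
    """
    stamps = {}
    for line in log_lines:
        for marker in ("STARTED", "ENDED"):
            if marker in line:
                # The first 19 characters are the assumed timestamp prefix.
                stamps[marker] = datetime.strptime(line[:19], fmt)
    if "STARTED" in stamps and "ENDED" in stamps:
        return (stamps["ENDED"] - stamps["STARTED"]).total_seconds()
    return None
```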
The comparison is semantic, not byte-for-byte.
In short, the repo does this:
- prepares one canonical input VCF for both annotators,
- runs VEP and vepyr on exactly that same prepared input,
- reads both annotated VCFs into tabular form,
- normalizes `INFO/CSQ` into stable variant and consequence tables,
- compares those tables with diffly,
- writes a compact console summary plus detailed diff artifacts.
The main libraries used by the compare path are:
- `polars-bio` to scan annotated VCF files,
- `polars` to normalize and aggregate rows,
- `diffly` to compare normalized DataFrames by primary key,
- `dataframely` to validate the normalized schema,
- `rich` to print readable summaries and progress.
```
input VCF
    |
    v
sample first N (optional)
    |
    v
split multi-allelic -> runtime/prepared_input.vcf
    |
    +--------------------+
    |                    |
    v                    v
   VEP                 vepyr
    |                    |
    v                    v
vep.annotated.vcf   vepyr.annotated.vcf
    |                    |
    +----------+---------+
               |
               v
normalize to variant tier + consequence tier
               |
               v
diffly compare by primary key
               |
               v
summary.json / summary.md / parquet diffs / mismatch TSV / progress logs
```
For small files the compare can still run directly on normalized tables. For larger files the pipeline now uses a bounded-memory path for both tiers: variant rows and consequence rows are streamed in chunks, spilled into hash buckets on disk, compacted per bucket, and then compared bucket-by-bucket. A resource planner derives chunk sizes, bucket count, worker count, and side parallelism from `--memory-budget-mb`, so working memory scales with the configured budget instead of the full input size.
The pipeline can now also be scoped to selected chromosomes. The filter is available on both `run` and `compare-existing`, is applied before annotation in `run`, and during streaming normalization in `compare-existing`. Standard naming aliases are normalized, so `1` matches both `1` and `chr1`, and `MT` matches `MT`, `M`, `chrMT`, and `chrM`.
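A minimal sketch of the alias normalization described above (the real implementation may handle more contig spellings):

```python
def normalize_chrom(name: str) -> str:
    """Map common chromosome aliases onto one canonical spelling.

    '1' and 'chr1' collapse to '1'; 'M', 'MT', 'chrM', 'chrMT' all
    collapse to 'MT', matching the behavior described above.
    """
    stripped = name[3:] if name.lower().startswith("chr") else name
    return "MT" if stripped.upper() in {"M", "MT"} else stripped
```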
The pipeline is:
- prepare the input VCF
- annotate the same prepared input with VEP and vepyr
- normalize both annotated VCFs into comparable tables
- compare those tables with diffly
- write the console summary plus file artifacts
Before annotation, `run` prepares a canonical input:
- if sampling is enabled, it writes `runtime/sampled_input.vcf`
- if chromosome filtering is enabled, it writes `runtime/filtered_input.vcf`
- it then decomposes multi-allelic rows into single-alt rows
- it writes the final annotation input to `runtime/prepared_input.vcf`
- it records overall and per-chromosome preparation stats in `runtime/input_preparation.json`
This step is required because multi-allelic input created misleading consequence-level diffs, especially for indels. After splitting to single-alt before annotation, the smoke 5000 run became fully equal again.
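The decomposition step can be illustrated with a deliberately minimal sketch: it only rewrites the ALT column, whereas a real decomposer must also adjust allele-indexed INFO fields and genotypes.

```python
def split_multiallelic(vcf_row: str) -> list[str]:
    """Decompose one multi-allelic VCF data row into single-alt rows.

    Minimal sketch: only the ALT column (index 4) is rewritten; INFO
    and sample columns are copied through untouched.
    """
    fields = vcf_row.rstrip("\n").split("\t")
    alts = fields[4].split(",")
    rows = []
    for alt in alts:
        single = fields.copy()
        single[4] = alt  # one output row per alternate allele
        rows.append("\t".join(single))
    return rows
```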
The repo compares two normalized tables.
Variant tier:
- primary key: `chrom,pos,ref,alt`
- tracks counts and basic summary columns per variant/allele
- useful to detect gross record-level divergence

Consequence tier:
- one normalized row per semantic `CSQ` consequence
- compared after parsing and normalizing `INFO/CSQ`
- this is the tier used to diagnose real annotation drift
diffly is the final diff engine. We do not compare raw VCF text directly. Instead:
- VCF is normalized into tabular rows
- rows get stable primary keys
- duplicates are counted explicitly
- then diffly compares the normalized DataFrames
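The key-plus-duplicate-count idea can be sketched without Polars; the real normalization builds DataFrames, while this pure-Python version just shows the shape of the output.

```python
from collections import Counter

def keyed_rows_with_dup_counts(rows):
    """Reduce rows to a stable primary key plus an explicit count.

    Duplicate rows become count mismatches in the diff instead of
    being silently merged away.
    """
    counts = Counter((r["chrom"], r["pos"], r["ref"], r["alt"]) for r in rows)
    return [
        {"chrom": c, "pos": p, "ref": ref, "alt": alt, "n": n}
        for (c, p, ref, alt), n in sorted(counts.items())
    ]
```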
The full golden annotated VCFs are too large for a naive eager compare.
For large runs, consequence comparison works in buckets:
- normalize consequence rows into hash buckets
- run an exact precheck per bucket
- in `fast` mode, skip diffly for buckets already proven equal
- in `debug` mode, compare every bucket with diffly
- merge bucket diff artifacts at the end
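The fast-mode skip reduces to a cheap exact precheck per bucket. A sketch, under the assumption that each bucket's rows can be loaded as plain tuples:

```python
def bucket_exactly_equal(left_rows, right_rows) -> bool:
    """Exact precheck: buckets are equal iff their row multisets match."""
    return sorted(left_rows) == sorted(right_rows)

def buckets_to_diff(left_buckets, right_buckets):
    """Bucket ids that fail the cheap precheck; in fast mode the
    expensive diffly compare runs only on these."""
    return [
        b for b in sorted(set(left_buckets) | set(right_buckets))
        if not bucket_exactly_equal(left_buckets.get(b, []),
                                    right_buckets.get(b, []))
    ]
```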
For large runs, variant comparison now follows the same model:
- stream variant rows into hash buckets on disk
- compact each bucket into a deterministic `bucket.parquet`
- compare the variant tier bucket-by-bucket instead of one global in-memory diff
- keep the old eager path only for safely small files
This keeps memory bounded and gives progress reporting for large compares.
The most recent post-fix smoke matrix that was re-verified locally covers sample sizes 10, 100, 1000, and 10000.
For each size the workflow was:
- `run --compare-mode fast`
- then `compare-existing --compare-mode debug`
- with the same annotated outputs
- under a bounded-memory budget
Observed outcome:
| Sample size | Prepared rows | fast variant equal | fast consequence equal | debug variant equal | debug consequence equal |
|---|---|---|---|---|---|
| 10 | 10 | true | true | true | true |
| 100 | 100 | true | true | true | true |
| 1000 | 1000 | true | true | true | true |
| 10000 | 10066 | true | true | true | true |
Observations:
- all currently re-verified smoke points are fully equal at both tiers
- the vepyr import path issue is fixed; the runner now prefers the working `.vepyr` environment before falling back to the local checkout
- multi-allelic decomposition is still visible at 10000, where 10000 source variants became 10066 prepared single-alt rows
- the compare path is green in both `fast` and `debug` on small and medium-sized fixtures after the RAM-safe refactor
The newest retained full compare-only result is runs/full-golden-compare-fast-per-chrom-20260330, run in fast mode against:
Effective plan from summary.json:
- `memory_budget_mb=1024`
- `bucket_count=256`
- `variant_chunk_rows=10000`
- `consequence_chunk_rows=8192`
- `compare_workers=2`
- `parallelize_sides=false`
Observed compare-only timings:
| Stage | Seconds | Approx. |
|---|---|---|
| Variant summary | 518.632 | 8m 39s |
| Consequence bucketization | 999.975 | 16m 40s |
| Schema validation | 0.031 | <1s |
| Variant diff | 4.081 | 4s |
| Consequence diff | 490.070 | 8m 10s |
| Total compare-only | 2012.789 | 33m 33s |
Observed result:
- variant tier: `equal=true`
- variant joined equal rows: 4096123
- variant `left_only=0`, `right_only=0`, `unequal=0`
- consequence tier: `equal=false`
- consequence joined equal rows: 36914657
- consequence `left_only=392`, `right_only=392`, `unequal=0`
What this means:
- at the variant level VEP and vepyr are fully equal
- at the consequence level there are still a small number of rows that exist only on one side
- those rows are not random noise: the sampled mismatch artifact collapses cleanly into 392 transcript-level left_only/right_only pairs
- there are still no cases where the same normalized consequence key exists on both sides with different payload values inside one joined key
Why this run still takes a long time:
- the dominant cost is not the final diff itself, but preprocessing the annotated VCFs into bounded-memory bucket artifacts
- `consequence_bucketization` alone is still about 16m 40s
- even in `fast` mode, only a tiny minority of consequence buckets are proven equal by precheck, so the final diff still has to run an exact compare for most buckets
- the inputs are very large annotated VCFs, so parsing CSQ, normalizing transcript consequences, spilling to disk, and compacting buckets dominate wall-clock time
- the run deliberately used a conservative 1024 MB memory budget, which reduces RAM pressure but also reduces concurrency
Per-chromosome view from the same summary:
- all chromosomes are `variant.equal=true`
- only chromosome 1 is fully equal on the consequence tier
- the largest consequence mismatch counts are on chromosomes 2 (88/88), 9 (54/54), 6 (44/44), 3 (39/39), and 14 (37/37)
- the per-chromosome timings below are attributed timings from the summary, not independent wall-clock slices, so they are best used for relative weight and hotspot analysis
| Chr | Variant | Consequence | Left/Right only | Attributed total | Variant summary | Consequence bucketization | Consequence diff |
|---|---|---|---|---|---|---|---|
| 1 | yes | yes | 0 / 0 | 161.248s | 40.951s | 80.516s | 39.459s |
| 2 | yes | no | 88 / 88 | 166.191s | 41.951s | 83.156s | 40.754s |
| 3 | yes | no | 39 / 39 | 158.916s | 36.532s | 81.940s | 40.157s |
| 4 | yes | no | 10 / 10 | 135.640s | 38.908s | 64.712s | 31.714s |
| 5 | yes | no | 8 / 8 | 110.990s | 33.478s | 51.842s | 25.407s |
| 6 | yes | no | 44 / 44 | 129.702s | 34.435s | 63.752s | 31.244s |
| 7 | yes | no | 9 / 9 | 114.174s | 29.694s | 56.538s | 27.708s |
| 8 | yes | no | 26 / 26 | 107.026s | 28.519s | 52.536s | 25.747s |
| 9 | yes | no | 54 / 54 | 83.531s | 22.298s | 40.976s | 20.082s |
| 10 | yes | no | 8 / 8 | 105.041s | 27.028s | 52.212s | 25.588s |
| 11 | yes | no | 12 / 12 | 100.479s | 26.187s | 49.719s | 24.367s |
| 12 | yes | no | 8 / 8 | 95.336s | 25.046s | 47.040s | 23.053s |
| 13 | yes | no | 7 / 7 | 61.341s | 20.438s | 27.342s | 13.400s |
| 14 | yes | no | 37 / 37 | 68.388s | 16.865s | 34.488s | 16.902s |
| 15 | yes | no | 6 / 6 | 68.004s | 15.850s | 34.917s | 17.112s |
| 16 | yes | no | 13 / 13 | 62.537s | 15.619s | 31.404s | 15.391s |
| 17 | yes | no | 3 / 3 | 65.283s | 13.722s | 34.530s | 16.923s |
| 18 | yes | no | 1 / 1 | 56.589s | 15.116s | 27.753s | 13.601s |
| 19 | yes | no | 5 / 5 | 55.707s | 11.484s | 29.618s | 14.515s |
| 20 | yes | no | 11 / 11 | 41.981s | 11.003s | 20.731s | 10.160s |
| 21 | yes | no | 2 / 2 | 29.526s | 7.067s | 15.035s | 7.368s |
| 22 | yes | no | 1 / 1 | 35.130s | 6.440s | 19.220s | 9.419s |
What the 392 / 392 / 0 mismatches actually are:
- they are not missing transcripts versus extra transcripts in the old sense
- each sampled mismatch row on the VEP side has a matching transcript-level partner on the vepyr side with the same chrom, pos, ref, alt, and Feature
- the mismatch is therefore usually a payload drift for the same transcript consequence, not a different transcript set
- the top low-level field signatures are:
  - only HGVSp: 121 pairs
  - only Consequence: 72
  - only HGNC_ID: 63
  - only HGVSc: 54
  - Consequence + IMPACT: 40
- the top semantic classes are:
  - hgvs_payload_drift: 114
  - missing_hgvs: 84
  - consequence_reclassification: 79
  - missing_hgnc_id: 63
  - consequence_and_impact_reclassification: 55
In practice that means the 392 mismatch pairs are mostly:
- rows where both sides point at the same transcript consequence, but HGVSp uses different protein coordinates or wording
- rows where VEP has HGVSc or HGVSp filled and vepyr leaves it empty
- rows where VEP and vepyr classify the same transcript consequence differently, for example `inframe_insertion` versus `inframe_insertion&start_retained_variant`
- rows where the same transcript consequence gets a different IMPACT, for example `HIGH` versus `LOW`
- rows where HGNC_ID is missing on one side, most often present in vepyr and empty in VEP
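A toy classifier for these drift classes might look like the following. It is a simplified sketch with assumed CSQ field names, not the rule set the analysis tooling actually uses:

```python
def classify_mismatch(left: dict, right: dict) -> str:
    """Assign one paired left/right consequence row to a drift class.

    Simplified sketch: checks Consequence/IMPACT reclassification first,
    then HGVS payload drift vs one-sided emptiness, then HGNC_ID.
    """
    conseq_differs = left.get("Consequence") != right.get("Consequence")
    impact_differs = left.get("IMPACT") != right.get("IMPACT")
    if conseq_differs and impact_differs:
        return "consequence_and_impact_reclassification"
    if conseq_differs:
        return "consequence_reclassification"
    for field in ("HGVSc", "HGVSp"):
        l, r = left.get(field, ""), right.get(field, "")
        if l != r:
            # Empty on one side -> missing; both populated -> payload drift.
            return "missing_hgvs" if not l or not r else "hgvs_payload_drift"
    if left.get("HGNC_ID", "") != right.get("HGNC_ID", ""):
        return "missing_hgnc_id"
    return "other"
```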
Representative raw CSQ examples confirm that these are already present in the source annotated VCFs:
- missing_hgvs:
  - VEP contains `ENST00000944442.1:c.-111_-110ins...`
  - vepyr emits the same row with an empty HGVSc
- consequence_and_impact_reclassification:
  - VEP: `frameshift_variant&splice_region_variant|HIGH`
  - vepyr: `splice_region_variant|LOW`
- consequence_reclassification:
  - VEP: `inframe_insertion`
  - vepyr: `inframe_insertion&start_retained_variant`
- missing_hgnc_id:
  - VEP: empty HGNC_ID
  - vepyr: populated `HGNC:...`
- hgvs_payload_drift:
  - both sides agree on Consequence, IMPACT, and HGVSc
  - but HGVSp refers to different duplicated amino-acid coordinates
Follow-up optimization note:
- after the older retained full run, the consequence compare path was tightened further so the bucketed compare no longer writes unused per-bucket TSV files, can use cheap bucket metadata to skip the expensive eager precheck for buckets that are already provably different, and now uses in-process worker threads instead of spawning separate compare subprocesses
- the fresh full rerun above confirms those changes on the full pipeline: `consequence_diff` is now 490.070s instead of the older retained 761.191s, which is about 35.6% faster
- on the same machine and with the same full-golden vepyr.annotated.vcf, a one-side `materialize_consequence_buckets(...)` benchmark at `bucket_count=256` and `chunk_variants=8192` first dropped from 708.518s to 654.883s after switching temporary bucket part writes to lighter parquet settings (lz4, no statistics), and then to 447.560s after buffering several reduced chunks before each flush
- that latest bucketization benchmark is about 36.8% faster than the original one-side baseline, and it also cuts temporary file churn dramatically because bucket parts are flushed around every 65536 input variants instead of every 8192
- the fresh full rerun also reflects that in end-to-end compare wall-clock:
  - older retained compare-only baseline: 49m 00s
  - fresh rerun on current code: 33m 33s
Latest retained mismatch diagnosis:
- all 784 sampled consequence mismatch rows collapse cleanly into 392 transcript-level left_only/right_only pairs
- there are 0 unpaired rows in that sampled mismatch artifact
- the top differing fields are still HGVSp, Consequence, HGNC_ID, IMPACT, and HGVSc
- the top semantic drift classes are:
  - hgvs_payload_drift: 114
  - missing_hgvs: 84
  - consequence_reclassification: 79
  - missing_hgnc_id: 63
  - consequence_and_impact_reclassification: 55
- raw CSQ extraction on representative cases confirms that these differences already exist in the source annotated VCF payloads, not just in downstream normalization
- the current debug tooling can now go from mismatch TSV -> semantic category summary -> raw left/right CSQ examples without rescanning the whole mismatch space manually
Most useful artifacts for debugging the golden mismatch:
- summary.json
- summary.md
- consequence_mismatches.tsv
- consequence_diff.parquet
- runtime/compare.progress.log
- /tmp/vepyr-diffly-golden-mismatch-analysis.json
- /tmp/vepyr-diffly-golden-raw-csq.json
Logs and structured diagnostics added for this:
- `runtime/compare.progress.log`: live stage and per-bucket compare progress for the full run
- `consequence_mismatch_analysis.json`: grouped transcript-level mismatch summary, field counts, semantic categories, and representative examples
- `raw-csq-examples.json`: raw left/right CSQ payloads for representative cases in each semantic category
Current verified scope:
- one active preset: `ensembl_everything`
- semantic comparison of annotated VCF `INFO/CSQ`
- two comparison tiers: variant and consequence
- bounded-memory compare path for large annotated VCFs
- local runtime execution against a real local VEP installation
- local runtime execution against an existing vepyr Python environment and parquet cache
- compare-only reuse of existing annotated VCFs via `compare-existing`
compare-existing
The root requirements.txt installs:
- the local package in editable mode
- runtime dependencies from pyproject.toml
- pytest
- ruff
Recommended setup:
```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
cp .env.example .env
```

Quick verification:

```bash
source .venv/bin/activate
PYTHONPATH=src python -m vepyr_diffly.cli list-presets
PYTHONPATH=src pytest -q
```

Runtime paths should live in repo-local .env.
Important variables:
- `VEPYR_DIFFLY_EXECUTION_MODE`
- `VEPYR_DIFFLY_INPUT_VCF`
- `VEPYR_DIFFLY_OUTPUT_DIR`
- `VEPYR_DIFFLY_VEP_CACHE_DIR`
- `VEPYR_DIFFLY_VEP_CACHE_VERSION`
- `VEPYR_DIFFLY_REFERENCE_FASTA`
- `VEPYR_DIFFLY_VEP_BIN`
- `VEPYR_DIFFLY_VEPYR_PYTHON`
- `VEPYR_DIFFLY_VEPYR_CACHE_OUTPUT_DIR`
- `VEPYR_DIFFLY_COMPARE_MODE`
- `VEPYR_DIFFLY_MEMORY_BUDGET_MB`
Notes:
- `VEPYR_DIFFLY_VEPYR_PYTHON` should point to a working virtualenv interpreter such as `.../.vepyr/bin/python`
- `VEPYR_DIFFLY_VEPYR_CACHE_OUTPUT_DIR` should point to the cache root, not directly to the deepest parquet leaf
- the CLI loads `.env` automatically
Main commands:
```bash
python -m vepyr_diffly.cli list-presets
python -m vepyr_diffly.cli run --preset ensembl_everything --input-vcf /path/to/input.vcf --output-dir runs/demo --sample-first-n 1000 --chromosomes 1,2
python -m vepyr_diffly.cli compare-existing --preset ensembl_everything --left-vcf /path/to/vep.annotated.vcf --right-vcf /path/to/vepyr.annotated.vcf --output-dir runs/compare-only --chromosomes 1,2 --compare-mode fast --memory-budget-mb 1024
python -m vepyr_diffly.cli annotate-vepyr --input-vcf /path/to/prepared_input.vcf --output-vcf /path/to/vepyr.annotated.vcf
python -m vepyr_diffly.cli benchmark-compare --left-vcf /path/to/vep.annotated.vcf --right-vcf /path/to/vepyr.annotated.vcf --output-json /tmp/benchmark.json
python -m vepyr_diffly.cli inspect-run --run-dir runs/demo
python -m vepyr_diffly.cli analyze-consequence-mismatches --run-dir runs/full-golden-compare-fast --output-json /tmp/mismatch-analysis.json
python -m vepyr_diffly.cli extract-mismatch-csq-examples --run-dir runs/full-golden-fresh --analysis-json /tmp/mismatch-analysis.json --output-json /tmp/raw-csq-examples.json
```

analyze-consequence-mismatches reads consequence_mismatches.tsv, groups left_only/right_only rows back into transcript-level pairs, counts the fields that drift most often, and emits representative examples. It is the fastest way to turn a retained run into a concrete CSQ drift report.
extract-mismatch-csq-examples takes that analysis and enriches a few representative cases per semantic category with the original raw CSQ= entries from both annotated VCFs. This is the shortest path from "which categories drift?" to "what exactly did VEP and vepyr emit for the same transcript consequence?".
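The transcript-level pairing that analyze-consequence-mismatches performs can be sketched like this; the column names (`side`, `Feature`, and so on) are assumptions about the mismatch TSV layout, not the real schema:

```python
def pair_mismatch_rows(rows):
    """Group left_only/right_only mismatch rows into transcript-level pairs.

    Rows are keyed by (chrom, pos, ref, alt, Feature); a key with exactly
    one row on each side becomes a paired payload drift, anything else
    stays unpaired.
    """
    by_key = {}
    for row in rows:
        key = (row["chrom"], row["pos"], row["ref"], row["alt"], row["Feature"])
        sides = by_key.setdefault(key, {"left_only": [], "right_only": []})
        sides[row["side"]].append(row)
    pairs, unpaired = [], []
    for sides in by_key.values():
        if len(sides["left_only"]) == 1 and len(sides["right_only"]) == 1:
            pairs.append((sides["left_only"][0], sides["right_only"][0]))
        else:
            unpaired.extend(sides["left_only"] + sides["right_only"])
    return pairs, unpaired
```

On the retained golden run this kind of grouping is what collapses 784 sampled rows into 392 pairs with zero left over.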
Typical successful smoke output:
```
Preset: ensembl_everything
Input: /Users/lukaszjezapkowicz/Downloads/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf
Sample first N: 1000

Comparison Summary
Tier         Equal  Left only  Right only  Unequal  Joined equal
variant      yes    0          0           0        1000
consequence  yes    0          0           0        34741
```
Main run artifacts:
- summary.json
- summary.md
- variant_diff.parquet
- consequence_diff.parquet
- variant_mismatches.tsv
- consequence_mismatches.tsv
- runtime/effective_config.json
- runtime/compare.progress.log
- runtime/vep.log
- runtime/vepyr.log
On large runs the TSV mismatch files are intentionally sampled, while the parquet diff artifacts remain complete.
summary.json and summary.md now include:
- overall results for the processed subset as a whole
- per-chromosome variant/consequence equality and counts
- attributed per-chromosome stage timings for normalization and diff
If tests fail:
```bash
source .venv/bin/activate
PYTHONPATH=src pytest -q -vv
```

If annotation fails before compare:
- inspect runtime/vep.log
- inspect runtime/vepyr.log
- inspect runtime/effective_config.json
Typical causes:
- wrong `VEPYR_DIFFLY_VEP_BIN`
- wrong `VEPYR_DIFFLY_VEP_CACHE_DIR`
- wrong `VEPYR_DIFFLY_REFERENCE_FASTA`
- wrong `VEPYR_DIFFLY_VEPYR_CACHE_OUTPUT_DIR`
- `VEPYR_DIFFLY_VEPYR_PYTHON` does not have a working vepyr install
If vepyr fails to import:
- check that `VEPYR_DIFFLY_VEPYR_PYTHON` points to a real virtualenv interpreter like `.../.vepyr/bin/python`
- avoid resolving that path to the system interpreter
If a long compare is still running, tail:
```bash
tail -f runs/<run>/runtime/compare.progress.log
```

The most important milestones are:
- `parsed CSQ header`
- `materializing variant summary`
- `bucketizing consequence rows`
- `comparing ... buckets`
- `writing summaries`
- `completed successfully`
- canonical equality is semantic comparison of parsed CSQ
- raw VCF text equality is out of scope
- the currently verified execution path is local runtime
- the compare path is now bounded-memory and driven by the configured memory budget
- for large runs, choosing a smaller memory budget trades throughput for lower RAM pressure
- only the `ensembl_everything` preset is currently verified end-to-end
- `merged` and `refseq` cache flavors are not implemented and verified yet
- Docker/container execution remains scaffolded in the repo, but local execution is the currently verified path
- the retained full golden result currently documents `compare-existing --compare-mode fast`; a fresh full end-to-end golden `run` with the same current codepath is still worth recording separately
- the repo compares semantic normalized outputs, not byte-for-byte annotated VCF files
- the current docs assume the local external assets already exist on this machine: VEP binary, VEP cache, FASTA, vepyr Python environment, and vepyr parquet cache