perf(hts): faster terminology bootstrap & re-import, steadier $lookup throughput (language filter, batch tuning, EXISTS probe, no JSON round-trip, indexing, cache eviction); fix localized $validate-code displays & lenient-display-validation #152
Draft
darthcav wants to merge 11 commits into
The pre-transaction check that decides which code systems need an immediate closure rebuild only distinguishes empty from non-empty, but counted every concept row for the system's URL. During chunked bulk loads (SNOMED RF2, LOINC) that COUNT scans an ever-growing concept set once per batch, turning the check into O(n²/batch_size) over the whole import. An EXISTS probe answers the same question in O(1). Applies to both the SQLite and PostgreSQL import paths.
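The quadratic cost described above can be checked with simple arithmetic. The sketch below is only an illustrative model, not code from this PR: it assumes a COUNT visits every row already imported for the URL and that an EXISTS probe stops at the first matching row. The function names and the 650,000-concept figure (roughly 1300 batches of 500) are assumptions made for the example.

```rust
/// Rows visited over a whole import if every batch first runs a COUNT
/// over the rows already imported (illustrative model, not PR code).
fn rows_scanned_by_count(total_concepts: u64, batch_size: u64) -> u64 {
    let batches = total_concepts.div_ceil(batch_size);
    // Before batch i (0-based) the table already holds i * batch_size rows.
    (0..batches).map(|i| i * batch_size).sum()
}

/// Rows visited if every batch runs an EXISTS probe instead,
/// assuming the probe stops at the first matching row.
fn rows_scanned_by_exists(total_concepts: u64, batch_size: u64) -> u64 {
    let batches = total_concepts.div_ceil(batch_size);
    // The first batch finds no row; each later batch touches one row.
    batches.saturating_sub(1)
}
```

Under these assumptions a 650,000-concept load in batches of 500 visits about 422 million rows with COUNT and 1299 with EXISTS.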
…TCH_SIZE) Bootstrap sync hardcoded a 500-concept batch size; every batch pays fixed overhead (transaction, CodeSystem metadata upsert, closure-row and expansion-cache invalidation, in-memory cache flushes), so a SNOMED CT load ran ~1300 batches' worth of bookkeeping. The new HTS_BOOTSTRAP_BATCH_SIZE variable (default 5000) amortizes that ~10x. The hts import CLI keeps its existing --batch-size default of 500 for memory-constrained ad-hoc runs.
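A minimal sketch of how such a variable might be resolved, and what it does to the batch count. Only the variable name and the defaults (5000 for bootstrap, 500 for the CLI) come from the commit message; the function names and the fallback on unparsable or zero values are assumptions for illustration.

```rust
/// Default taken from the commit message above.
const DEFAULT_BOOTSTRAP_BATCH_SIZE: usize = 5000;

/// Resolve the batch size from the raw value of HTS_BOOTSTRAP_BATCH_SIZE.
/// Falling back to the default on unparsable or zero input is an assumption.
fn bootstrap_batch_size(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_BOOTSTRAP_BATCH_SIZE)
}

/// Number of batches, and therefore rounds of fixed per-batch bookkeeping.
fn batch_count(total_concepts: usize, batch_size: usize) -> usize {
    total_concepts.div_ceil(batch_size)
}
```

A caller would pass `std::env::var("HTS_BOOTSTRAP_BATCH_SIZE").ok().as_deref()`. For a hypothetical 650,000-concept load, the batch count drops from 1300 at size 500 to 130 at size 5000, which is the ~10x amortization the message describes.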
…IMPORT_LANGUAGES) SNOMED RF2 archives ship one Description file per language plus language refsets, and LOINC ships per-language linguistic-variant CSVs — all of which were imported unconditionally, multiplying designation rows and dominating bulk-load time for deployments that only need a few languages. HTS_IMPORT_LANGUAGES (env, also --languages on hts import) takes a comma-separated BCP-47 tag list. Matching is BCP-47-aware in both directions via the existing language module (configured de admits de-DE; configured de-DE admits the bare de that RF2 languageCode columns carry). Filtering happens at file level where possible — excluded RF2 Description / Language-refset files and LOINC variant CSVs are never parsed — with a row-level guard for mixed-language RF2 files. English is always retained because SNOMED display selection and the en-US/en-GB preference chain depend on it. For language-sensitive formats the configured tag set is folded into the bootstrap ledger signature, so changing HTS_IMPORT_LANGUAGES re-triggers the import of affected files on the next startup; the allow-all default leaves signatures untouched, keeping existing ledgers valid.
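The two-directional matching described in this commit can be sketched as a prefix comparison on lowercased tags. This is a simplified stand-in for the existing language module, not its actual logic: the function names are hypothetical, and the empty-list and always-keep-English rules are modeled directly from the text above.

```rust
/// True when one tag equals the other or is a subtag-prefix of it,
/// in either direction (simplified stand-in for the language module).
fn tag_matches(configured: &str, candidate: &str) -> bool {
    let a = configured.to_ascii_lowercase();
    let b = candidate.to_ascii_lowercase();
    a == b || b.starts_with(&format!("{a}-")) || a.starts_with(&format!("{b}-"))
}

/// Decide whether content tagged `candidate` should be imported.
fn admits(configured: &[&str], candidate: &str) -> bool {
    // Empty list means "allow all", the historical behavior.
    if configured.is_empty() {
        return true;
    }
    // English is always retained, per the commit message.
    if tag_matches("en", candidate) {
        return true;
    }
    configured.iter().any(|c| tag_matches(c, candidate))
}
```

With this sketch a configured `de` admits `de-DE`, and a configured `de-DE` admits the bare `de` found in RF2 `languageCode` columns, while `fr` is rejected by both.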
The SNOMED RF2 and LOINC importers serialized each concept batch into a FHIR Bundle JSON payload that the backend immediately re-parsed with serde_json before writing — every term and designation was allocated, encoded, and decoded once more per import run, millions of times for a multilingual SNOMED load. BundleImportBackend gains import_parsed(), accepting the existing backend-agnostic ParsedBundle directly; import_bundle() now delegates to it after parsing (both backends), so HTTP /import semantics are unchanged. bundle_builder::build_parsed_code_system() constructs ParsedBundle values straight from the builder structs, mirroring the JSON path exactly — including the parent property row + hierarchy edge the parser derives from parent_code — with one deliberate exception: resource_json now stores the metadata-only CodeSystem resource (the seed shape) instead of whichever 500-concept chunk happened to be written last. SNOMED association-refset ConceptMaps keep the JSON path; their cost is a one-off per import.
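The delegation pattern described here can be illustrated with a toy model. Everything below except the names `BundleImportBackend`, `ParsedBundle`, `import_parsed` and `import_bundle` is hypothetical: the real signatures, the real `ParsedBundle` fields, and the real JSON parser differ, and expressing the delegation as a default trait method is an assumption (the commit only says both backends delegate).

```rust
/// Stand-in for the backend-agnostic parsed form; the real type is richer.
struct ParsedBundle {
    codes: Vec<String>,
}

/// Toy parser: one code per line stands in for the serde_json Bundle parse.
fn parse_bundle(bytes: &[u8]) -> Result<ParsedBundle, String> {
    let text = std::str::from_utf8(bytes).map_err(|e| e.to_string())?;
    let codes = text.lines().filter(|l| !l.is_empty()).map(String::from).collect();
    Ok(ParsedBundle { codes })
}

trait BundleImportBackend {
    /// Write an already-parsed bundle; filesystem importers call this directly.
    fn import_parsed(&mut self, bundle: ParsedBundle) -> Result<usize, String>;

    /// Byte-oriented entry point (the HTTP path): parse, then delegate.
    fn import_bundle(&mut self, bytes: &[u8]) -> Result<usize, String> {
        self.import_parsed(parse_bundle(bytes)?)
    }
}

/// Hypothetical in-memory backend used only to exercise the trait.
struct MemoryBackend {
    stored: Vec<String>,
}

impl BundleImportBackend for MemoryBackend {
    fn import_parsed(&mut self, bundle: ParsedBundle) -> Result<usize, String> {
        let written = bundle.codes.len();
        self.stored.extend(bundle.codes);
        Ok(written)
    }
}
```

Because both entry points converge on `import_parsed`, the write path is exercised identically whether the caller starts from bytes or from builder structs.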
…d README Align the SNOMED RF2 and LOINC module headers, the hts import help dump, and the CLAUDE.md import recipes with the HTS_IMPORT_LANGUAGES / --languages option introduced for multilingual imports.
@smunini Please do not merge this right away. I may need to patch it further, because I think the performance still needs optimizing.

@darthcav OK, no problem. Going forward, the best way to signal that a PR is still in progress is to raise it as a "Draft" PR: https://github.blog/news-insights/product-news/introducing-draft-pull-requests/ When it is ready, you can flip it to "Ready for review".
Improve first-time and repeat import performance for large terminology distributions (SNOMED CT, LOINC) and close several correctness gaps in the bootstrap and re-import paths.

Indexing
- Add a concept_designations(concept_id) index on both backends. It was the only missing access path: every per-concept delete-before-reinsert and every per-concept designation read was a full table scan, making a designation-heavy import O(n^2). It is now an index seek (O(n log n)).
- Drop the redundant idx_concepts_system_code on both backends. The UNIQUE(system_id, code) constraint already backs the same lookups, so the explicit duplicate only doubled per-insert index maintenance.

Fresh-load fast path
- Add BundleImportBackend::code_system_has_concepts and a ParsedBundle.fresh_load hint. The SNOMED RF2 and LOINC importers probe once before a bulk load; when the target system holds no concepts yet, the per-concept delete-before-reinsert is skipped entirely across all batches. Re-imports keep full replacement semantics.

Re-import replacement semantics
- Replace each touched concept's properties and designations unconditionally (previously guarded on a non-empty incoming slice). A narrowed HTS_IMPORT_LANGUAGES now drops previously imported translations instead of leaving them behind, and PostgreSQL no longer duplicates child rows on a repeat import.

Bootstrap startup
- Skip full-file hashing of unchanged archives: the bootstrap ledger now records size, mtime and the applied language filter and skips re-reading a file whose stat is unchanged, falling back to a content hash only on a mismatch. Avoids re-reading multi-GB SNOMED/LOINC ZIPs on every restart.
- Defer the SQLite FTS prebuild until after the bootstrap directory sync, and run the post-import closure and FTS rebuild that the CLI performs, so a bootstrapped server is not left with stale closure or FTS data after a chunked re-import.

Language matching
- Rework BCP-47 ranking to the precedence exact > separator-insensitive > primary/sibling, and add best-tier selection for import filtering: a configured es-ES imports only es-ES (not es-AR/es-MX), while a bare es imports every es-* variant. Applied to both import filtering and query-time display selection.

Documentation (README, CLAUDE.md) updated to match, with regression tests covering each area.
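The stat-based skip described under "Bootstrap startup" in this commit can be sketched as a small decision function. This is an illustration under stated assumptions rather than the PR's code: the struct, the enum, the function name, and the `content_hash` field are hypothetical, and the exact order of checks is a guess consistent with the text (language change re-imports, unchanged stat skips without reading, content hash only as a fallback).

```rust
/// What the ledger remembers about an already-imported file.
/// Field names are illustrative, not the actual schema.
struct LedgerEntry {
    size: u64,
    mtime_unix: i64,
    languages: String,
    content_hash: String,
}

#[derive(Debug, PartialEq)]
enum Action {
    Skip,
    Reimport,
}

/// `hash_file` is only invoked when the cheap stat comparison fails,
/// so an unchanged multi-GB archive is never re-read.
fn decide(
    entry: Option<&LedgerEntry>,
    size: u64,
    mtime_unix: i64,
    languages: &str,
    hash_file: impl FnOnce() -> String,
) -> Action {
    let Some(e) = entry else { return Action::Reimport };
    // A changed language filter invalidates the previous import.
    if e.languages != languages {
        return Action::Reimport;
    }
    // Unchanged size and mtime: trust the ledger without reading the file.
    if e.size == size && e.mtime_unix == mtime_unix {
        return Action::Skip;
    }
    // Stat differs (for example a touched file): fall back to the content hash.
    if hash_file() == e.content_hash {
        Action::Skip
    } else {
        Action::Reimport
    }
}
```

Passing the hash as a closure keeps the expensive read lazy, which is the point of the optimization.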
The SQLite backend's in-memory request caches — the $lookup and
$validate-code response caches and the per-concept abstract/inactive
status-flag caches — all used an insert-only-while-under-bound policy:
if w.len() < max { w.insert(key, value) }
Once a cache filled, it froze. The first `max` distinct keys held their
slots permanently and every later key missed forever, so the cache
became useless for any working set larger than its bound — e.g.
diverse-code $lookup traffic against a 600K-concept SNOMED system, where
only the first 4096 codes were ever cached.
Replace the policy with a shared `bounded_cache_insert` helper that, when
the map is at capacity and the key is new, evicts one existing entry
before admitting the new one. std HashMap's randomized iteration order
makes this random replacement, which keeps the same memory ceiling while
letting hot keys re-enter the cache.
Random replacement (not true LRU) is deliberate: these caches are read
under a shared lock on the hot path, and tracking per-entry recency would
require taking an exclusive lock on every read, serializing the very
lookups the cache exists to accelerate. Overwriting an existing key never
evicts, and a zero bound is a no-op.
Applied at all four call sites (lookup_response_cache,
validate_code_response_cache, and both concept-flag caches). Adds three
unit tests covering eviction-not-freeze, overwrite-without-evict, and the
zero-bound no-op. Documents the indexing and caching strategy in the HTS
README.
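
A minimal sketch of a helper with the behavior described above. Only the name `bounded_cache_insert` comes from the commit message; the signature and the `Clone` bound are assumptions, and the actual helper may be written differently.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Insert into a size-bounded map, evicting one arbitrary entry when the
/// map is full and the key is new. Signature is assumed for illustration.
fn bounded_cache_insert<K, V>(map: &mut HashMap<K, V>, max: usize, key: K, value: V)
where
    K: Eq + Hash + Clone,
{
    // A zero bound disables the cache entirely.
    if max == 0 {
        return;
    }
    // Evict only when at capacity and the key is not already present;
    // overwriting an existing key never changes the map's size.
    if map.len() >= max && !map.contains_key(&key) {
        // The first key in iteration order is effectively a random victim.
        if let Some(victim) = map.keys().next().cloned() {
            map.remove(&victim);
        }
    }
    map.insert(key, value);
}
```

Because eviction happens only on the insert path, which already holds the write lock, reads can stay under the shared lock as the message requires.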
CodeSystem/$validate-code returned the English default display (and spuriously rejected a correct localized display) when displayLanguage selected a SNOMED translation, even though $lookup resolved it correctly. The language-aware display validation built its valid-display candidate set through is_display_alternative(), which only accepted designations whose use.code was absent or "display". SNOMED RF2 designations are imported with use.system http://snomed.info/sct and use.code set to the SNOMED description-type concept id (900000000000013009 for synonyms, 900000000000003001 for FSNs), so every SNOMED translation was dropped: the language-preferred display was never found and the supplied display could not match. Accept terminology-native description types as display alternatives by recognizing designations whose use.system is the SNOMED CT system. This mirrors $lookup (which matches purely on the BCP-47 language tag) while still excluding FHIR designation-usage alternative-purpose codes such as consumer-name and olde-english, which carry a different use.system. Add regression tests covering both the resolved display value and the validation result for a German SNOMED designation.
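The widened filter can be sketched as a match on the designation's `use` coding. Only the function name `is_display_alternative`, the SNOMED system URL, and the description-type ids come from the commit message; the struct and its fields are hypothetical, and the designation-usage system URL in the usage note is the HL7 terminology URL as far as I know, not something stated in this PR.

```rust
const SNOMED_SYSTEM: &str = "http://snomed.info/sct";

/// Hypothetical view of a designation's `use` coding.
struct DesignationUse<'a> {
    system: Option<&'a str>,
    code: Option<&'a str>,
}

/// A designation counts as a display alternative when it has no use code,
/// uses the plain "display" code, or carries a SNOMED description type.
fn is_display_alternative(u: &DesignationUse<'_>) -> bool {
    match (u.system, u.code) {
        (_, None) => true,
        (_, Some("display")) => true,
        // Terminology-native description types (synonym, FSN, ...).
        (Some(SNOMED_SYSTEM), Some(_)) => true,
        // Anything else, e.g. designation-usage purpose codes, is excluded.
        _ => false,
    }
}
```

A SNOMED synonym (`use.code` 900000000000013009) now passes, while a `consumer-name` designation under `http://terminology.hl7.org/CodeSystem/designation-usage` is still rejected.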
…date-code The FHIR lenient-display-validation parameter downgrades a display mismatch from an error (result=false) to a warning with result=true. It was already wired through the request struct and honored by the Postgres CodeSystem backend and both ValueSet backends, but the SQLite CodeSystem validate_code path hardcoded the invalid-display issue to error severity and computed result solely from display equality, so a SNOMED/CodeSystem $validate-code against the default SQLite backend still returned result=false regardless of the flag. Thread req.lenient_display_validation into that path: emit the invalid-display issue as a warning and keep result=true when the flag is set. The operations-layer language-aware rewrite already preserves a backend-emitted warning severity, so the downgrade survives the language-display pass and the comma-separated displayLanguage handling. Add a regression test asserting result=true with a warning-severity invalid-display issue when lenient-display-validation=true.
…lay validation [skip ci] Note in the HTS README that display validation matches against concept designations (including SNOMED CT synonyms/FSNs via use.system) with BCP-47 displayLanguage handling, and that lenient-display-validation is an R6-ballot parameter (defined on ValueSet/$validate-code, absent from R4/R5) which HTS accepts on both CodeSystem and ValueSet $validate-code as an opt-in superset. Clarify that the default — invalid display fails validation — is the spec-mandated behaviour across all versions.
lenient-display-validation downgrades a display mismatch to a warning with result=true. It worked when the backend flagged the mismatch, but not when the language-aware operations pass surfaced it: if the supplied display matched the concept's default display, the backend accepted it and emitted no issue, yet with displayLanguage set the only valid display is the requested-language designation, so apply_language_display_validation flagged a mismatch the backend never saw. That function derived lenient solely from a backend-emitted warning severity, so with no issue to inherit it emitted an error and returned result=false, ignoring the flag. Thread req.lenient_display_validation through build_validate_response_async into apply_language_display_validation (CodeSystem, inline-ValueSet and ValueSet $validate-code handlers) and decide severity from the flag directly OR an inherited backend warning. The default (flag absent) remains a hard error, preserving the spec-mandated result=false. Add regression tests for the displayLanguage-mismatch case with and without the flag.
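The severity decision described above reduces to a small truth table. The sketch below is illustrative only: the enum and function name are hypothetical, and it models just the outcome for a detected display mismatch, not the surrounding response building.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Severity {
    Error,
    Warning,
}

/// Outcome for a display mismatch: (issue severity, `result` value).
/// `backend_severity` is the severity of a backend-emitted invalid-display
/// issue, or None when the backend raised no such issue.
fn display_mismatch_outcome(
    lenient_flag: bool,
    backend_severity: Option<Severity>,
) -> (Severity, bool) {
    // Lenient if the request asked for it OR the backend already downgraded.
    let lenient = lenient_flag || backend_severity == Some(Severity::Warning);
    if lenient {
        (Severity::Warning, true)
    } else {
        (Severity::Error, false)
    }
}
```

The case fixed by this commit is `lenient_flag = true` with `backend_severity = None`, which previously fell through to an error.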
Summary

Bootstrap imports of large multilingual terminologies (SNOMED CT RF2 with national-extension languages, LOINC with linguistic variants) are slow: every language in the archive is ingested unconditionally, every concept batch pays a fixed bookkeeping cost, and every batch is serialized to JSON only to be immediately re-parsed. This PR addresses those bottlenecks independently, each in its own commit and each verified against the full `helios-hts` test suite (including the Postgres testcontainer suite) before moving on. It then closes the loop with import-indexing and re-import-correctness fixes and a `$lookup`-path caching fix, so that steady-state read throughput holds up on the larger, correctly imported datasets.

Changes

1. perf: probe empty code systems with `EXISTS` instead of `COUNT(*)`

`import_bundle` decides which code systems need an immediate closure rebuild by checking whether they currently have zero concepts. That check ran `SELECT COUNT(*) FROM concepts JOIN code_systems WHERE url = ?`, although the answer only needs to distinguish empty from non-empty. During a chunked bulk load the concept set for the URL grows with every batch, so the per-batch `COUNT` scans all previously imported rows, making the check O(n²/batch_size) over the whole import (~1300 batches for SNOMED CT). An `EXISTS(SELECT 1 …)` probe answers the same question in O(1). Applied to both the SQLite and PostgreSQL import paths.

2. feat: configurable bootstrap batch size (`HTS_BOOTSTRAP_BATCH_SIZE`)

The `HTS_BOOTSTRAP_DIR` sync hardcoded a batch size of 500 concepts. Each batch is one transaction plus fixed bookkeeping: CodeSystem metadata upsert, `resource_json` refresh, closure-row and expansion-cache invalidation, and in-memory cache flushes. The new variable (default 5000) amortizes that overhead ~10× for big terminologies. The `hts import` CLI keeps its existing `--batch-size` default of 500 for memory-constrained ad-hoc runs, matching the existing config conventions.

3. feat: language filter for multilingual imports (`HTS_IMPORT_LANGUAGES` / `--languages`)

Designation volume dominates multilingual bulk loads: SNOMED RF2 ships one Description file per language plus language refsets, LOINC ships per-language `*LinguisticVariant.csv` files, and all of them were imported unconditionally. Deployments that only need a couple of languages had no way to opt out.

- `LanguageFilter` (comma-separated BCP-47 tags; empty = import everything, i.e. the historical behavior). Matching reuses the RFC 4647 logic in `language.rs`, with best-tier selection: a configured `es-ES` imports only `es-ES`, while a bare `es` imports every `es-*` variant.
- Excluded files are filtered out in `find_rf2_paths` and never parsed; a row-level guard covers mixed-language Description files. English is always retained because concept display selection and the `en-US`/`en-GB` preference chain depend on it.
- The configured tag set is folded into the `bootstrap_imports` signature, so changing `HTS_IMPORT_LANGUAGES` re-triggers the import of affected files on the next startup.
- Set via `HTS_IMPORT_LANGUAGES` on the server (bootstrap) and `--languages` / `HTS_IMPORT_LANGUAGES` on `hts import`.

4. perf: skip the JSON Bundle round-trip in chunked filesystem importers

The SNOMED and LOINC importers built a FHIR Bundle JSON payload per batch (`build_code_system_bundle`), which `import_bundle` immediately re-parsed with `serde_json` into the backend-agnostic `ParsedBundle` before writing. Every term and designation was therefore allocated, JSON-encoded, and JSON-decoded once more per run, millions of times for a multilingual SNOMED load.

- `BundleImportBackend` gains `import_parsed(ctx, ParsedBundle)`; `import_bundle(bytes)` now parses and delegates to it on both backends, so the HTTP `POST /import` path is semantically unchanged.
- `bundle_builder::build_parsed_code_system()` constructs `ParsedBundle` values directly from the builder structs, mirroring the JSON path exactly, including the `parent` property row and hierarchy edge that the parser derives from `parent_code`, and the FHIR `value[x]` → value-type mapping.
- `resource_json` is now the metadata-only CodeSystem resource (the same shape the seed bundle already wrote) instead of whichever ~500-concept chunk happened to be written last.

5. perf+fix: import indexing, fresh-load fast path, and re-import correctness

Closing the remaining import bottlenecks and the correctness gaps they exposed:

- Add a `concept_designations(concept_id)` index on both backends, the only missing per-concept access path. Without it, every per-concept delete-before-reinsert and every per-concept designation read was a full table scan, making a designation-heavy import O(n²); it is now an index seek. Conversely, drop the redundant `idx_concepts_system_code`: the `UNIQUE(system_id, code)` constraint already backs exactly those lookups (verified with `EXPLAIN QUERY PLAN`: both pick the constraint's index), so the explicit duplicate only doubled per-insert index maintenance.
- `BundleImportBackend::code_system_has_concepts` plus a `ParsedBundle.fresh_load` hint let the SNOMED/LOINC importers probe once before a bulk load; when the target system holds no concepts yet, the per-concept delete-before-reinsert is skipped across all batches. Re-imports keep full replacement semantics.
- Each touched concept's properties and designations are replaced unconditionally, so a narrowed `HTS_IMPORT_LANGUAGES` now drops previously imported translations instead of leaving them behind, and PostgreSQL no longer duplicates child rows on a repeat import.

6. fix: evict instead of freezing the SQLite request caches

The SQLite backend keeps small in-memory caches of assembled `$lookup` / `$validate-code` responses and per-concept abstract/inactive status flags. All used an insert-only-while-under-bound policy (`if len < max { insert }`), which froze each cache once full: the first `max` distinct keys held their slots permanently and every later key missed forever, making the cache useless for any working set larger than its bound (e.g. diverse-code `$lookup` traffic against a 600 K-concept SNOMED system, where only the first 4096 codes were ever cached). This matters more now that the import fixes above mean concepts correctly carry their full designation set.

- A shared `bounded_cache_insert` helper evicts one existing entry (random replacement via `HashMap`'s randomized iteration order) when the map is at capacity and the key is new, keeping the same memory ceiling while letting hot keys re-enter the cache.

7. fix: resolve SNOMED display-language designations in `$validate-code`

With the import fixes above ensuring concepts carry their full designation set, a `$validate-code` gap surfaced: requesting a SNOMED concept with `displayLanguage=de` returned the English default display (and spuriously rejected the correct German display), even though `$lookup` resolved the German term correctly.

The language-aware display validation built its valid-display candidate set through an `is_display_alternative()` filter that only accepted designations whose `use.code` was absent or `display`. SNOMED RF2 designations are imported with `use.system = http://snomed.info/sct` and `use.code` set to the SNOMED description-type concept id (`900000000000013009` for synonyms, `900000000000003001` for FSNs), so every SNOMED translation was dropped: the language-preferred display was never found and a supplied localized display could not match.

- Accept terminology-native description types as display alternatives by recognizing designations whose `use.system` is the SNOMED CT system. This mirrors `$lookup` (which matches purely on the BCP-47 language tag) while still excluding FHIR designation-usage alternative-purpose codes such as `consumer-name` and `olde-english`, which carry a different `use.system`.
- … `use.code`, so they already passed the filter.

8. fix: honor `lenient-display-validation` in the SQLite CodeSystem `$validate-code`

The FHIR `lenient-display-validation` boolean downgrades a display-name mismatch from an error (`result=false`) to a `warning` with `result=true`. The parameter was already parsed into `ValidateCodeRequest` and honored by the Postgres CodeSystem backend and both ValueSet backends, but the SQLite CodeSystem `validate_code` path (the default backend) hardcoded the `invalid-display` issue to `error` severity and derived `result` purely from display equality, so a CodeSystem `$validate-code` ignored the flag and still returned `result=false`.

- Thread `req.lenient_display_validation` into that path: the `invalid-display` issue is emitted as a `warning` and `result` stays `true` when the flag is set, matching the other three backend paths.
- The flag is also honored when the language-aware operations pass, rather than the backend, surfaces the mismatch: `displayLanguage` makes the requested-language designation the only valid display. Previously `apply_language_display_validation` derived "lenient" solely from a backend-emitted `warning` severity, so with no issue to inherit it emitted an `error` and returned `result=false`, ignoring the flag. `req.lenient_display_validation` is now threaded through `build_validate_response_async` into that pass (across the CodeSystem, inline-ValueSet, and ValueSet `$validate-code` handlers) and the severity is decided from the flag directly or an inherited backend warning. The default (flag absent) remains a hard `error` / `result=false`.

Spec provenance. `lenient-display-validation` is an R6-ballot parameter, defined on `ValueSet/$validate-code` and absent from the R4/R5 operation definitions. HTS accepts it on both `CodeSystem/$validate-code` and `ValueSet/$validate-code` regardless of the resources' FHIR version, as an opt-in superset of the published definitions. The default (parameter absent or `false`) is the spec-mandated behavior across all versions: an invalid display fails validation (`result=false`). This change only affects the explicit `=true` opt-in; the R6 provenance is documented in the HTS README.

For a SNOMED CT International + national-extension load restricted to one extra language: parsing and designation writes drop roughly proportionally to the languages excluded (designations outnumber concepts ~5–10× in multilingual archives); per-batch bookkeeping shrinks ~10× via the larger bootstrap batch size; the O(n²)-ish COUNT and the O(n²) designation scans disappear (EXISTS probe + `concept_designations` index); the per-batch serialize/re-parse of every term is eliminated; and the fresh-load fast path skips delete-before-reinsert entirely on first import. Unchanged archives are skipped on restart without being re-read. On the read side, the request-cache fix keeps `$lookup` / `$validate-code` throughput steady once the working set exceeds the cache bound, instead of collapsing to the cold path. The changes are independent: deployments that need every language still benefit from the others.

Compatibility

- … `resource_json` shape for chunked imports noted in item 4.
- `concept_designations(concept_id)` is created `IF NOT EXISTS`, the redundant `idx_concepts_system_code` is dropped `IF EXISTS`, and the `bootstrap_imports` ledger gains `mtime_unix` / `languages` columns via `ADD COLUMN IF NOT EXISTS` (Postgres) / duplicate-column-tolerant `ALTER` (SQLite). Existing ledgers stay valid; the allow-all language default keeps signatures byte-identical, so no spurious re-imports are triggered by upgrading.
- `import_snomed_rf2` / `import_loinc_csv` gain a `&LanguageFilter` parameter and `BundleImportBackend` gains `import_parsed` / `code_system_has_concepts` / `fresh_load`: workspace-internal API, both backends updated. The request-cache fix is a pure internal change with no API or schema impact.

Testing

- Regression tests cover `LanguageFilter` parsing/matching and best-tier selection, RF2 filename language extraction, SNOMED import with an exclusion filter (designation counts + lookup behavior), SNOMED English force-retention, LOINC variant-file skipping, re-import replacement/de-duplication, bootstrap stat-skip ledger behavior, the request-cache eviction (evict-not-freeze, overwrite-without-evict, zero-bound no-op), `$validate-code` resolution/validation of a German SNOMED designation, and `lenient-display-validation` downgrading a CodeSystem display mismatch to a warning with `result=true`, including the case where `displayLanguage` (not the backend) surfaces the mismatch, plus a guard that the mismatch still fails when the flag is absent.
- `helios-hts` suite green in both feature configurations (default/SQLite and `--features postgres`, including the Postgres testcontainer integration suite).
- `cargo fmt` and `cargo clippy --all-targets --all-features -- -D warnings` (CI flag set) clean on both feature configurations.