perf(hts): faster terminology bootstrap & re-import, steadier $lookup throughput — language filter, batch tuning, EXISTS probe, no JSON round-trip, indexing, cache eviction; fix localized $validate-code displays & lenient-display-validation by darthcav · Pull Request #152 · HeliosSoftware/hfs

darthcav · 2026-06-12T09:39:44Z

Summary

Bootstrap imports of large multilingual terminologies (SNOMED CT RF2 with national-extension languages, LOINC with linguistic variants) are slow: every language in the archive is ingested unconditionally, every concept batch pays a fixed bookkeeping cost, and every batch is serialized to JSON only to be immediately re-parsed. This PR addresses those bottlenecks independently — each in its own commit, each verified against the full helios-hts test suite (including the Postgres testcontainer suite) before moving on — and then closes the loop with import-indexing/re-import-correctness fixes and a $lookup-path caching fix so the steady-state read throughput holds up on the larger, correctly-imported datasets.

Changes

1. `perf`: probe empty code systems with `EXISTS` instead of `COUNT(*)`

import_bundle decides which code systems need an immediate closure rebuild by checking whether they currently have zero concepts. That check ran SELECT COUNT(*) FROM concepts JOIN code_systems WHERE url = ? — an answer that only needs to distinguish empty from non-empty. During a chunked bulk load the concept set for the URL grows with every batch, so the per-batch COUNT scans all previously imported rows, making the check O(n²/batch_size) over the whole import (~1300 batches for SNOMED CT). An EXISTS(SELECT 1 …) probe answers the same question in O(1). Applied to both the SQLite and PostgreSQL import paths.
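
To make the cost difference concrete, here is a rough back-of-the-envelope model in Rust. It is only a sketch: the function names are hypothetical, and it assumes the per-batch `COUNT(*)` touches every row already stored for the system while the `EXISTS` probe stops at the first row it finds.

```rust
/// Rows touched over a whole chunked import when every batch first runs a
/// COUNT(*) over the concepts already stored for the code system.
/// Assumes batch_size > 0.
fn rows_scanned_with_count(total_concepts: u64, batch_size: u64) -> u64 {
    let batches = (total_concepts + batch_size - 1) / batch_size;
    // Before batch i (0-based), i * batch_size rows are already present.
    (0..batches).map(|i| i * batch_size).sum()
}

/// Rows touched when every batch runs an EXISTS probe instead.
/// Assumes batch_size > 0.
fn rows_scanned_with_exists(total_concepts: u64, batch_size: u64) -> u64 {
    let batches = (total_concepts + batch_size - 1) / batch_size;
    // The first probe finds nothing; each later probe stops at one row.
    batches.saturating_sub(1)
}
```

Under this model, an illustrative 650,000-concept load in 500-concept batches (1300 batches) touches roughly 422 million rows with `COUNT(*)` and 1299 with `EXISTS`.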

2. `feat`: configurable bootstrap batch size — `HTS_BOOTSTRAP_BATCH_SIZE`

The HTS_BOOTSTRAP_DIR sync hardcoded a batch size of 500 concepts. Each batch is one transaction plus fixed bookkeeping: CodeSystem metadata upsert, resource_json refresh, closure-row and expansion-cache invalidation, and in-memory cache flushes. The new variable (default 5000) amortizes that overhead ~10× for big terminologies. The hts import CLI keeps its existing --batch-size default of 500 for memory-constrained ad-hoc runs, matching the existing config conventions.
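
A minimal sketch of how such a variable could be resolved, and what it does to the batch count. The helper names are hypothetical, and falling back to the default on an unparsable or zero value is an assumption of this sketch, not necessarily what HTS does.

```rust
const DEFAULT_BOOTSTRAP_BATCH_SIZE: usize = 5000;

/// Resolve the batch size from the raw environment value, if any.
/// Assumption: invalid or zero values fall back to the default.
fn resolve_batch_size(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_BOOTSTRAP_BATCH_SIZE)
}

/// Read HTS_BOOTSTRAP_BATCH_SIZE from the process environment.
fn bootstrap_batch_size_from_env() -> usize {
    let raw = std::env::var("HTS_BOOTSTRAP_BATCH_SIZE").ok();
    resolve_batch_size(raw.as_deref())
}

/// Number of transactions (and rounds of fixed bookkeeping) for a load.
/// Assumes batch_size > 0.
fn batch_count(total_concepts: usize, batch_size: usize) -> usize {
    (total_concepts + batch_size - 1) / batch_size
}
```

For an illustrative 650,000-concept load, `batch_count` gives 1300 batches at size 500 and 130 at size 5000, which is where the ~10× amortization comes from.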

3. `feat`: language filter for multilingual imports — `HTS_IMPORT_LANGUAGES` / `--languages`

Designation volume dominates multilingual bulk loads: SNOMED RF2 ships one Description file per language plus language refsets, LOINC ships per-language *LinguisticVariant.csv files, and all of them were imported unconditionally. Deployments that only need a couple of languages had no way to opt out.

New LanguageFilter (comma-separated BCP-47 tags; empty = import everything, i.e. the historical behavior). Matching reuses the RFC 4647 logic in language.rs, with best-tier selection: a configured es-ES imports only es-ES, while a bare es imports every es-* variant.
SNOMED RF2: excluded per-language Description and Language-refset files are dropped in find_rf2_paths and never parsed; a row-level guard covers mixed-language Description files. English is always retained because concept display selection and the en-US/en-GB preference chain depend on it.
LOINC: excluded linguistic-variant files are skipped before parsing; the English main table is unaffected.
Bootstrap ledger: for language-sensitive formats the configured tag set is folded into the bootstrap_imports signature, so changing HTS_IMPORT_LANGUAGES re-triggers the import of affected files on the next startup.
Exposed as HTS_IMPORT_LANGUAGES on the server (bootstrap) and --languages / HTS_IMPORT_LANGUAGES on hts import.
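
The matching rules above can be approximated with a small standalone filter. This is a simplified sketch, not the code in language.rs: the `normalize` and `primary` helpers are hypothetical, "separator-insensitive" is interpreted here as treating `_` and `-` alike, and the unconditional English rule is folded into the filter itself for brevity.

```rust
/// Lowercase a tag and treat '_' like '-' so "es_ES" and "es-ES" compare equal.
fn normalize(tag: &str) -> String {
    tag.trim().to_ascii_lowercase().replace('_', "-")
}

/// Primary language subtag of an already-normalized tag ("es-es" -> "es").
fn primary(tag: &str) -> &str {
    tag.split('-').next().unwrap_or(tag)
}

struct LanguageFilter {
    tags: Vec<String>,
}

impl LanguageFilter {
    /// Parse a comma-separated tag list; an empty list means "import everything".
    fn parse(spec: &str) -> Self {
        let tags = spec
            .split(',')
            .map(normalize)
            .filter(|t| !t.is_empty())
            .collect();
        Self { tags }
    }

    fn admits(&self, candidate: &str) -> bool {
        let cand = normalize(candidate);
        // Empty filter keeps everything; English is always retained.
        if self.tags.is_empty() || primary(&cand) == "en" {
            return true;
        }
        self.tags.iter().any(|cfg| {
            *cfg == cand
                // A bare configured tag ("es") admits every regional variant.
                || (!cfg.contains('-') && primary(&cand) == cfg.as_str())
                // A bare candidate ("de") is admitted by a regional tag ("de-DE").
                || (!cand.contains('-') && primary(cfg) == cand.as_str())
        })
    }
}
```

With this sketch, `LanguageFilter::parse("es-ES")` admits `es-ES` and `en-GB` but not `es-AR`, while `LanguageFilter::parse("es")` admits `es-MX` as well.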

4. `perf`: skip the JSON Bundle round-trip in chunked filesystem importers

The SNOMED and LOINC importers built a FHIR Bundle JSON payload per batch (build_code_system_bundle) which import_bundle immediately re-parsed with serde_json into the backend-agnostic ParsedBundle before writing. Every term and designation was therefore allocated, JSON-encoded, and JSON-decoded once more per run — millions of times for a multilingual SNOMED load.

BundleImportBackend gains import_parsed(ctx, ParsedBundle); import_bundle(bytes) now parses and delegates to it on both backends, so the HTTP POST /import path is semantically unchanged.
bundle_builder::build_parsed_code_system() constructs ParsedBundle values directly from the builder structs, mirroring the JSON path exactly — including the parent property row and hierarchy edge that the parser derives from parent_code, and the FHIR value[x] → value-type mapping.
One deliberate behavior change: for chunked imports, the stored resource_json is now the metadata-only CodeSystem resource (the same shape the seed bundle already wrote) instead of whichever ~500-concept chunk happened to be written last.
SNOMED association-refset ConceptMaps keep the JSON path; their cost is a one-off per import.
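
The delegation shape can be sketched as a trait with a provided method. Everything below is a stand-in for illustration: the real `ParsedBundle` is far richer, the real `import_parsed` also takes a context argument, and the toy `parse_bundle` (one code per line) replaces the serde_json parse. `MemoryBackend` is hypothetical.

```rust
/// Stand-in for the backend-agnostic bundle (hypothetical shape).
#[derive(Debug, Default)]
struct ParsedBundle {
    codes: Vec<String>,
}

/// Toy parser standing in for the JSON step: one code per line.
fn parse_bundle(bytes: &[u8]) -> Result<ParsedBundle, String> {
    let text = std::str::from_utf8(bytes).map_err(|e| e.to_string())?;
    let codes = text
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .map(String::from)
        .collect();
    Ok(ParsedBundle { codes })
}

trait BundleImportBackend {
    /// Write an already-parsed bundle; returns the number of concepts written.
    fn import_parsed(&mut self, bundle: ParsedBundle) -> Result<usize, String>;

    /// Byte-oriented entry point: parse, then delegate to import_parsed.
    fn import_bundle(&mut self, bytes: &[u8]) -> Result<usize, String> {
        self.import_parsed(parse_bundle(bytes)?)
    }
}

/// Hypothetical in-memory backend used only to exercise the trait.
#[derive(Default)]
struct MemoryBackend {
    stored: Vec<String>,
}

impl BundleImportBackend for MemoryBackend {
    fn import_parsed(&mut self, bundle: ParsedBundle) -> Result<usize, String> {
        let written = bundle.codes.len();
        self.stored.extend(bundle.codes);
        Ok(written)
    }
}
```

The point of the split is that byte-oriented callers keep going through `import_bundle`, while chunked importers that already hold structured data can call `import_parsed` and skip the encode/decode step.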

5. `perf` + `fix`: import indexing, fresh-load fast path, and re-import correctness

Closing the remaining import bottlenecks and the correctness gaps they exposed:

Indexing. Add a concept_designations(concept_id) index on both backends — the only missing per-concept access path. Without it, every per-concept delete-before-reinsert and every per-concept designation read was a full table scan, making a designation-heavy import O(n²); it is now an index seek. Conversely, drop the redundant idx_concepts_system_code: the UNIQUE(system_id, code) constraint already backs exactly those lookups (verified with EXPLAIN QUERY PLAN — both pick the constraint's index), so the explicit duplicate only doubled per-insert index maintenance.
Fresh-load fast path. BundleImportBackend::code_system_has_concepts + a ParsedBundle.fresh_load hint let the SNOMED/LOINC importers probe once before a bulk load; when the target system holds no concepts yet, the per-concept delete-before-reinsert is skipped across all batches. Re-imports keep full replacement semantics.
Re-import replacement semantics. Each touched concept's properties and designations are now replaced unconditionally (previously guarded on a non-empty incoming slice). A narrowed HTS_IMPORT_LANGUAGES now drops previously-imported translations instead of leaving them behind, and PostgreSQL no longer duplicates child rows on a repeat import.
Bootstrap startup. The ledger records size, mtime and the applied language filter; an unchanged file is skipped on the cheap stat alone, falling back to a content hash only on a stat mismatch — so multi-GB SNOMED/LOINC archives are no longer re-read on every restart. The FTS prebuild is deferred until after the bootstrap directory sync, and the post-import closure/FTS rebuild the CLI performs now runs on the server path too, so a bootstrapped server is never left with stale closure or FTS data after a chunked re-import.
Language matching. BCP-47 ranking reworked to precedence exact > separator-insensitive > primary/sibling, applied to both import filtering and query-time display selection.
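
The stat-skip decision described under "Bootstrap startup" can be sketched as follows. The struct fields and the `needs_import` function are hypothetical names; the sketch assumes the ledger stores the language set alongside size, mtime and a content hash, and that the hash is computed lazily through a closure.

```rust
/// One ledger row for a previously imported file (hypothetical shape).
struct LedgerEntry {
    size: u64,
    mtime_unix: i64,
    languages: String,
    content_hash: String,
}

/// Cheap file metadata obtained from a stat call.
struct FileStat {
    size: u64,
    mtime_unix: i64,
}

/// Decide whether a bootstrap file must be (re-)imported.
/// `hash_file` is only invoked when the cheap checks are inconclusive.
fn needs_import(
    entry: Option<&LedgerEntry>,
    stat: &FileStat,
    languages: &str,
    hash_file: impl FnOnce() -> String,
) -> bool {
    let entry = match entry {
        Some(e) => e,
        None => return true, // never imported before
    };
    if entry.languages != languages {
        return true; // language filter changed: re-import
    }
    if entry.size == stat.size && entry.mtime_unix == stat.mtime_unix {
        return false; // stat unchanged: skip without reading the file
    }
    // Stat mismatch: fall back to comparing content hashes.
    hash_file() != entry.content_hash
}
```

Passing the hash as a closure keeps the expensive read of a multi-GB archive off the common restart path, where the stat comparison alone settles the question.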

6. `fix`: evict instead of freezing the SQLite request caches

The SQLite backend keeps small in-memory caches of assembled $lookup / $validate-code responses and per-concept abstract/inactive status flags. All used an insert-only-while-under-bound policy (if len < max { insert }), which froze each cache once full: the first max distinct keys held their slots permanently and every later key missed forever — making the cache useless for any working set larger than its bound (e.g. diverse-code $lookup traffic against a 600 K-concept SNOMED system, where only the first 4096 codes were ever cached). This matters more now that the import fixes above mean concepts correctly carry their full designation set.

A shared bounded_cache_insert helper evicts one existing entry (random replacement via HashMap's randomized iteration order) when the map is at capacity and the key is new, keeping the same memory ceiling while letting hot keys re-enter the cache.
Random replacement, not true LRU, is deliberate: these caches are read under a shared lock on the hot path, and tracking per-entry recency would require an exclusive lock on every read, serializing the very lookups the cache exists to accelerate.
Applied at all four call sites; overwriting an existing key never evicts, and a zero bound is a no-op.
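
A minimal sketch of a helper with the semantics listed above, written against a plain `std::collections::HashMap`. The signature is an assumption for illustration; the actual helper in the SQLite backend may differ in its bounds and in how it is called under the lock.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Insert into a size-bounded cache, evicting one arbitrary entry when the
/// map is full and the key is new. A zero bound disables caching.
fn bounded_cache_insert<K, V>(map: &mut HashMap<K, V>, max: usize, key: K, value: V)
where
    K: Eq + Hash + Clone,
{
    if max == 0 {
        return; // zero bound: no-op
    }
    if !map.contains_key(&key) && map.len() >= max {
        // Take whichever key the map yields first. std's HashMap iterates in
        // an arbitrary, randomly seeded order, so this approximates random
        // replacement without tracking recency.
        if let Some(victim) = map.keys().next().cloned() {
            map.remove(&victim);
        }
    }
    // Overwriting an existing key reaches this line without evicting.
    map.insert(key, value);
}
```

Because eviction only happens on the write path, readers can keep using a shared lock; no read ever has to update recency metadata.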

7. `fix`: resolve SNOMED display-language designations in `$validate-code`

With the import fixes above ensuring concepts carry their full designation set, a $validate-code gap surfaced: requesting a SNOMED concept with displayLanguage=de returned the English default display (and spuriously rejected the correct German display), even though $lookup resolved the German term correctly.

The language-aware display validation built its valid-display candidate set through an is_display_alternative() filter that only accepted designations whose use.code was absent or display. SNOMED RF2 designations are imported with use.system = http://snomed.info/sct and use.code set to the SNOMED description-type concept id (900000000000013009 for synonyms, 900000000000003001 for FSNs), so every SNOMED translation was dropped — the language-preferred display was never found and a supplied localized display could not match.

Terminology-native description types are now accepted as display alternatives by recognizing designations whose use.system is the SNOMED CT system. This mirrors $lookup (which matches purely on the BCP-47 language tag) while still excluding FHIR designation-usage alternative-purpose codes such as consumer-name and olde-english, which carry a different use.system.
LOINC linguistic-variant designations are unaffected — they are imported with no use.code, so they already passed the filter.
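
The widened filter can be sketched like this. The `Designation` struct is a hypothetical, pared-down stand-in for the real type, and the function body is an illustration of the rule described above rather than the code in the PR.

```rust
const SNOMED_SYSTEM: &str = "http://snomed.info/sct";

/// Only the two fields the filter looks at (hypothetical shape).
struct Designation<'a> {
    use_system: Option<&'a str>,
    use_code: Option<&'a str>,
}

/// Whether a designation may serve as a valid display for its language.
fn is_display_alternative(d: &Designation) -> bool {
    match (d.use_system, d.use_code) {
        // No use.code at all (e.g. LOINC linguistic variants).
        (_, None) => true,
        // Explicitly marked as a display.
        (_, Some("display")) => true,
        // SNOMED-native description types (synonym, FSN, ...).
        (Some(SNOMED_SYSTEM), Some(_)) => true,
        // Any other coded use, e.g. consumer-name, stays excluded.
        _ => false,
    }
}
```

A SNOMED synonym with `use_code` set to `900000000000013009` now passes, while a designation whose use is coded in another system as `consumer-name` is still rejected.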

8. `fix`: honor `lenient-display-validation` in the SQLite CodeSystem `$validate-code`

The FHIR lenient-display-validation boolean downgrades a display-name mismatch from an error (result=false) to a warning with result=true. The parameter was already parsed into ValidateCodeRequest and honored by the Postgres CodeSystem backend and both ValueSet backends, but the SQLite CodeSystem validate_code path — the default backend — hardcoded the invalid-display issue to error severity and derived result purely from display equality, so a CodeSystem $validate-code ignored the flag and still returned result=false.

Thread req.lenient_display_validation into that path: the invalid-display issue is emitted as a warning and result stays true when the flag is set, matching the other three backend paths.
The flag is also honored when the language-aware operations pass, not the backend, surfaces the mismatch — e.g. the supplied display matches the concept's default display (so the backend accepts it and emits no issue) but displayLanguage makes the requested-language designation the only valid display. Previously apply_language_display_validation derived "lenient" solely from a backend-emitted warning severity, so with no issue to inherit it emitted an error and returned result=false, ignoring the flag. req.lenient_display_validation is now threaded through build_validate_response_async into that pass (across the CodeSystem, inline-ValueSet, and ValueSet $validate-code handlers) and the severity is decided from the flag directly or an inherited backend warning. The default (flag absent) remains a hard error / result=false.
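
The resulting severity rule is small enough to state as a pure function. The types and the `display_outcome` name below are hypothetical; the sketch only captures the decision described in the two points above, not the surrounding response assembly.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Severity {
    Warning,
    Error,
}

#[derive(Debug, PartialEq)]
struct DisplayOutcome {
    result: bool,
    issue: Option<Severity>,
}

/// Decide result and issue severity for a display check.
/// `lenient_flag` is the request parameter; `inherited_warning` is true when
/// the backend already emitted the mismatch at warning severity.
fn display_outcome(display_valid: bool, lenient_flag: bool, inherited_warning: bool) -> DisplayOutcome {
    if display_valid {
        return DisplayOutcome { result: true, issue: None };
    }
    if lenient_flag || inherited_warning {
        // Mismatch downgraded: still reported, but validation passes.
        DisplayOutcome { result: true, issue: Some(Severity::Warning) }
    } else {
        // Default behavior: an invalid display fails validation.
        DisplayOutcome { result: false, issue: Some(Severity::Error) }
    }
}
```

The case fixed here is `display_outcome(false, true, false)`: the backend saw no mismatch, so there is no warning to inherit, and only the flag itself can trigger the downgrade.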

Spec provenance. lenient-display-validation is an R6-ballot parameter, defined on ValueSet/$validate-code and absent from the R4/R5 operation definitions. HTS accepts it on both CodeSystem/$validate-code and ValueSet/$validate-code regardless of the resources' FHIR version — an opt-in superset of the published definitions. The default (parameter absent or false) is the spec-mandated behavior across all versions: an invalid display fails validation (result=false). This change only affects the explicit =true opt-in; the R6 provenance is documented in the HTS README.

For a SNOMED CT International + national-extension load restricted to one extra language: parsing and designation writes drop roughly proportionally to the languages excluded (designations outnumber concepts ~5–10× in multilingual archives); per-batch bookkeeping shrinks ~10× via the larger bootstrap batch size; the O(n²)-ish COUNT and the O(n²) designation scans disappear (EXISTS probe + concept_designations index); the per-batch serialize/re-parse of every term is eliminated; and the fresh-load fast path skips delete-before-reinsert entirely on first import. Unchanged archives are skipped on restart without being re-read. On the read side, the request-cache fix keeps $lookup / $validate-code throughput steady once the working set exceeds the cache bound, instead of collapsing to the cold path. The changes are independent — deployments that need every language still benefit from the others.

Compatibility

All defaults preserve current behavior except the bootstrap batch size (500 → 5000, now tunable) and the resource_json shape for chunked imports noted in item 4.
The schema changes are additive/idempotent and applied on startup: concept_designations(concept_id) is created IF NOT EXISTS, the redundant idx_concepts_system_code is dropped IF EXISTS, and the bootstrap_imports ledger gains mtime_unix / languages columns via ADD COLUMN IF NOT EXISTS (Postgres) / duplicate-column-tolerant ALTER (SQLite). Existing ledgers stay valid; the allow-all language default keeps signatures byte-identical so no spurious re-imports are triggered by upgrading.
import_snomed_rf2 / import_loinc_csv gain a &LanguageFilter parameter and BundleImportBackend gains import_parsed / code_system_has_concepts / fresh_load — workspace-internal API, both backends updated. The request-cache fix is a pure internal change with no API or schema impact.

Testing

New tests: LanguageFilter parsing/matching and best-tier selection, RF2 filename language extraction, SNOMED import with an exclusion filter (designation counts + lookup behavior), SNOMED English force-retention, LOINC variant-file skipping, re-import replacement/de-duplication, bootstrap stat-skip ledger behavior, the request-cache eviction (evict-not-freeze, overwrite-without-evict, zero-bound no-op), $validate-code resolution/validation of a German SNOMED designation, and lenient-display-validation downgrading a CodeSystem display mismatch to a warning with result=true — including the case where displayLanguage (not the backend) surfaces the mismatch, plus a guard that the mismatch still fails when the flag is absent.
Full helios-hts suite green in both feature configurations (default/SQLite and --features postgres, including the Postgres testcontainer integration suite).
cargo fmt and cargo clippy --all-targets --all-features -- -D warnings (CI flag set) clean on both feature configurations.
All commits are signed.

The pre-transaction check that decides which code systems need an immediate closure rebuild only distinguishes empty from non-empty, but counted every concept row for the system's URL. During chunked bulk loads (SNOMED RF2, LOINC) that COUNT scans an ever-growing concept set once per batch, turning the check into O(n²/batch_size) over the whole import. An EXISTS probe answers the same question in O(1). Applies to both the SQLite and PostgreSQL import paths.

…TCH_SIZE) Bootstrap sync hardcoded a 500-concept batch size; every batch pays fixed overhead (transaction, CodeSystem metadata upsert, closure-row and expansion-cache invalidation, in-memory cache flushes), so a SNOMED CT load ran ~1300 batches' worth of bookkeeping. The new HTS_BOOTSTRAP_BATCH_SIZE variable (default 5000) amortizes that ~10x. The hts import CLI keeps its existing --batch-size default of 500 for memory-constrained ad-hoc runs.

…IMPORT_LANGUAGES) SNOMED RF2 archives ship one Description file per language plus language refsets, and LOINC ships per-language linguistic-variant CSVs — all of which were imported unconditionally, multiplying designation rows and dominating bulk-load time for deployments that only need a few languages. HTS_IMPORT_LANGUAGES (env, also --languages on hts import) takes a comma-separated BCP-47 tag list. Matching is BCP-47-aware in both directions via the existing language module (configured de admits de-DE; configured de-DE admits the bare de that RF2 languageCode columns carry). Filtering happens at file level where possible — excluded RF2 Description / Language-refset files and LOINC variant CSVs are never parsed — with a row-level guard for mixed-language RF2 files. English is always retained because SNOMED display selection and the en-US/en-GB preference chain depend on it. For language-sensitive formats the configured tag set is folded into the bootstrap ledger signature, so changing HTS_IMPORT_LANGUAGES re-triggers the import of affected files on the next startup; the allow-all default leaves signatures untouched, keeping existing ledgers valid.

The SNOMED RF2 and LOINC importers serialized each concept batch into a FHIR Bundle JSON payload that the backend immediately re-parsed with serde_json before writing — every term and designation was allocated, encoded, and decoded once more per import run, millions of times for a multilingual SNOMED load. BundleImportBackend gains import_parsed(), accepting the existing backend-agnostic ParsedBundle directly; import_bundle() now delegates to it after parsing (both backends), so HTTP /import semantics are unchanged. bundle_builder::build_parsed_code_system() constructs ParsedBundle values straight from the builder structs, mirroring the JSON path exactly — including the parent property row + hierarchy edge the parser derives from parent_code — with one deliberate exception: resource_json now stores the metadata-only CodeSystem resource (the seed shape) instead of whichever 500-concept chunk happened to be written last. SNOMED association-refset ConceptMaps keep the JSON path; their cost is a one-off per import.

…d README Align the SNOMED RF2 and LOINC module headers, the hts import help dump, and the CLAUDE.md import recipes with the HTS_IMPORT_LANGUAGES / --languages option introduced for multilingual imports.

darthcav · 2026-06-12T13:19:50Z

@smunini Please do not merge it right away. I may need to patch it a bit because I think I need to optimize the performance further.

smunini · 2026-06-12T13:54:32Z

@darthcav ok no problem. Going forward, the best way to signal that a PR is still in progress is to raise it as a "Draft" PR - https://github.blog/news-insights/product-news/introducing-draft-pull-requests/ When ready, you can flip it to "Ready for review".

Improve first-time and repeat import performance for large terminology distributions (SNOMED CT, LOINC) and close several correctness gaps in the bootstrap and re-import paths.

Indexing
- Add a concept_designations(concept_id) index on both backends. It was the only missing access path: every per-concept delete-before-reinsert and every per-concept designation read was a full table scan, making a designation-heavy import O(n^2). It is now an index seek (O(n log n)).
- Drop the redundant idx_concepts_system_code on both backends. The UNIQUE(system_id, code) constraint already backs the same lookups, so the explicit duplicate only doubled per-insert index maintenance.

Fresh-load fast path
- Add BundleImportBackend::code_system_has_concepts and a ParsedBundle.fresh_load hint. The SNOMED RF2 and LOINC importers probe once before a bulk load; when the target system holds no concepts yet, the per-concept delete-before-reinsert is skipped entirely across all batches. Re-imports keep full replacement semantics.

Re-import replacement semantics
- Replace each touched concept's properties and designations unconditionally (previously guarded on a non-empty incoming slice). A narrowed HTS_IMPORT_LANGUAGES now drops previously imported translations instead of leaving them behind, and PostgreSQL no longer duplicates child rows on a repeat import.

Bootstrap startup
- Skip full-file hashing of unchanged archives: the bootstrap ledger now records size, mtime and the applied language filter and skips re-reading a file whose stat is unchanged, falling back to a content hash only on a mismatch. Avoids re-reading multi-GB SNOMED/LOINC ZIPs on every restart.
- Defer the SQLite FTS prebuild until after the bootstrap directory sync, and run the post-import closure and FTS rebuild that the CLI performs, so a bootstrapped server is not left with stale closure or FTS data after a chunked re-import.

Language matching
- Rework BCP-47 ranking to the precedence exact > separator-insensitive > primary/sibling, and add best-tier selection for import filtering: a configured es-ES imports only es-ES (not es-AR/es-MX), while a bare es imports every es-* variant. Applied to both import filtering and query-time display selection.

Documentation (README, CLAUDE.md) updated to match, with regression tests covering each area.

The SQLite backend's in-memory request caches — the $lookup and $validate-code response caches and the per-concept abstract/inactive status-flag caches — all used an insert-only-while-under-bound policy: `if w.len() < max { w.insert(key, value) }`.

Once a cache filled, it froze. The first `max` distinct keys held their slots permanently and every later key missed forever, so the cache became useless for any working set larger than its bound — e.g. diverse-code $lookup traffic against a 600K-concept SNOMED system, where only the first 4096 codes were ever cached.

Replace the policy with a shared `bounded_cache_insert` helper that, when the map is at capacity and the key is new, evicts one existing entry before admitting the new one. std HashMap's randomized iteration order makes this random replacement, which keeps the same memory ceiling while letting hot keys re-enter the cache.

Random replacement (not true LRU) is deliberate: these caches are read under a shared lock on the hot path, and tracking per-entry recency would require taking an exclusive lock on every read, serializing the very lookups the cache exists to accelerate. Overwriting an existing key never evicts, and a zero bound is a no-op.

Applied at all four call sites (lookup_response_cache, validate_code_response_cache, and both concept-flag caches). Adds three unit tests covering eviction-not-freeze, overwrite-without-evict, and the zero-bound no-op. Documents the indexing and caching strategy in the HTS README.

CodeSystem/$validate-code returned the English default display (and spuriously rejected a correct localized display) when displayLanguage selected a SNOMED translation, even though $lookup resolved it correctly. The language-aware display validation built its valid-display candidate set through is_display_alternative(), which only accepted designations whose use.code was absent or "display". SNOMED RF2 designations are imported with use.system http://snomed.info/sct and use.code set to the SNOMED description-type concept id (900000000000013009 for synonyms, 900000000000003001 for FSNs), so every SNOMED translation was dropped: the language-preferred display was never found and the supplied display could not match. Accept terminology-native description types as display alternatives by recognizing designations whose use.system is the SNOMED CT system. This mirrors $lookup (which matches purely on the BCP-47 language tag) while still excluding FHIR designation-usage alternative-purpose codes such as consumer-name and olde-english, which carry a different use.system. Add regression tests covering both the resolved display value and the validation result for a German SNOMED designation.

…date-code The FHIR lenient-display-validation parameter downgrades a display mismatch from an error (result=false) to a warning with result=true. It was already wired through the request struct and honored by the Postgres CodeSystem backend and both ValueSet backends, but the SQLite CodeSystem validate_code path hardcoded the invalid-display issue to error severity and computed result solely from display equality, so a SNOMED/CodeSystem $validate-code against the default SQLite backend still returned result=false regardless of the flag. Thread req.lenient_display_validation into that path: emit the invalid-display issue as a warning and keep result=true when the flag is set. The operations-layer language-aware rewrite already preserves a backend-emitted warning severity, so the downgrade survives the language-display pass and the comma-separated displayLanguage handling. Add a regression test asserting result=true with a warning-severity invalid-display issue when lenient-display-validation=true.

…lay validation [skip ci] Note in the HTS README that display validation matches against concept designations (including SNOMED CT synonyms/FSNs via use.system) with BCP-47 displayLanguage handling, and that lenient-display-validation is an R6-ballot parameter (defined on ValueSet/$validate-code, absent from R4/R5) which HTS accepts on both CodeSystem and ValueSet $validate-code as an opt-in superset. Clarify that the default — invalid display fails validation — is the spec-mandated behaviour across all versions.

lenient-display-validation downgrades a display mismatch to a warning with result=true. It worked when the backend flagged the mismatch, but not when the language-aware operations pass surfaced it: if the supplied display matched the concept's default display, the backend accepted it and emitted no issue, yet with displayLanguage set the only valid display is the requested-language designation — so apply_language_display_validation flagged a mismatch the backend never saw. That function derived lenient solely from a backend-emitted warning severity, so with no issue to inherit it emitted an error and returned result=false, ignoring the flag.

Thread req.lenient_display_validation through build_validate_response_async into apply_language_display_validation (CodeSystem, inline-ValueSet and ValueSet $validate-code handlers) and decide severity from the flag directly OR an inherited backend warning. The default (flag absent) remains a hard error, preserving the spec-mandated result=false.

Add regression tests for the displayLanguage-mismatch case with and without the flag.

darthcav added 5 commits June 12, 2026 10:26

docs(hts): document language filtering across importer module docs an…

2f3fca9


darthcav marked this pull request as draft June 12, 2026 13:55

darthcav added 2 commits June 12, 2026 18:19

darthcav added 2 commits June 13, 2026 18:35
