
perf(hts): faster terminology bootstrap & re-import, steadier $lookup throughput — language filter, batch tuning, EXISTS probe, no JSON round-trip, indexing, cache eviction; fix localized $validate-code displays & lenient-display-validation#152

Draft
darthcav wants to merge 11 commits into HeliosSoftware:main from darthcav:feat/hts-import-performance


@darthcav

@darthcav darthcav commented Jun 12, 2026

Contributor

Summary

Bootstrap imports of large multilingual terminologies (SNOMED CT RF2 with national-extension languages, LOINC with linguistic variants) are slow: every language in the archive is ingested unconditionally, every concept batch pays a fixed bookkeeping cost, and every batch is serialized to JSON only to be immediately re-parsed. This PR addresses those bottlenecks independently — each in its own commit, each verified against the full helios-hts test suite (including the Postgres testcontainer suite) before moving on — and then closes the loop with import-indexing/re-import-correctness fixes and a $lookup-path caching fix so the steady-state read throughput holds up on the larger, correctly-imported datasets.

Changes

1. perf: probe empty code systems with EXISTS instead of COUNT(*)

import_bundle decides which code systems need an immediate closure rebuild by checking whether they currently have zero concepts. That check ran SELECT COUNT(*) FROM concepts JOIN code_systems WHERE url = ? — an answer that only needs to distinguish empty from non-empty. During a chunked bulk load the concept set for the URL grows with every batch, so the per-batch COUNT scans all previously imported rows, making the check O(n²/batch_size) over the whole import (~1300 batches for SNOMED CT). An EXISTS(SELECT 1 …) probe answers the same question in O(1). Applied to both the SQLite and PostgreSQL import paths.
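To make the cost difference concrete, here is a minimal cost model, not the actual query planner behavior. It assumes the per-batch COUNT visits every concept row already imported for the URL and that the EXISTS probe stops at the first matching row. The function names and the 650,000-concept figure (roughly 1300 batches of 500) are illustrative assumptions.

```rust
/// Rows visited by a per-batch `COUNT(*)` that rescans every concept
/// already imported for the same URL (hypothetical cost model).
fn rows_scanned_by_count(total_concepts: u64, batch_size: u64) -> u64 {
    let batches = (total_concepts + batch_size - 1) / batch_size;
    // Before batch k (0-based) is written, k * batch_size rows already exist.
    (0..batches).map(|k| k * batch_size).sum()
}

/// Rows visited by an `EXISTS` probe under the same model: the first
/// probe finds nothing, every later probe stops at the first row.
fn rows_scanned_by_exists(total_concepts: u64, batch_size: u64) -> u64 {
    let batches = (total_concepts + batch_size - 1) / batch_size;
    batches.saturating_sub(1)
}
```

Under these assumptions, `rows_scanned_by_count(650_000, 500)` is 422,175,000 row visits, while `rows_scanned_by_exists(650_000, 500)` is 1,299.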

2. feat: configurable bootstrap batch size — HTS_BOOTSTRAP_BATCH_SIZE

The HTS_BOOTSTRAP_DIR sync hardcoded a batch size of 500 concepts. Each batch is one transaction plus fixed bookkeeping: CodeSystem metadata upsert, resource_json refresh, closure-row and expansion-cache invalidation, and in-memory cache flushes. The new variable (default 5000) amortizes that overhead ~10× for big terminologies. The hts import CLI keeps its existing --batch-size default of 500 for memory-constrained ad-hoc runs, matching the existing config conventions.
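A sketch of how such a variable might be resolved. Only the variable name and the default of 5000 come from this PR; the helper names and the fallback for missing, unparsable, or zero values are assumptions.

```rust
/// Default taken from the PR description.
const DEFAULT_BOOTSTRAP_BATCH_SIZE: usize = 5000;

/// Resolve the batch size from the raw value of HTS_BOOTSTRAP_BATCH_SIZE.
/// Assumption: missing, unparsable, or zero values fall back to the default.
fn bootstrap_batch_size(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_BOOTSTRAP_BATCH_SIZE)
}

/// Number of batches (and therefore fixed-cost transactions) for a load.
fn batches_needed(total_concepts: usize, batch_size: usize) -> usize {
    (total_concepts + batch_size - 1) / batch_size
}
```

A caller could pass `std::env::var("HTS_BOOTSTRAP_BATCH_SIZE").ok().as_deref()`. For an illustrative 650,000-concept load, `batches_needed` drops from 1300 at a batch size of 500 to 130 at 5000.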

3. feat: language filter for multilingual imports — HTS_IMPORT_LANGUAGES / --languages

Designation volume dominates multilingual bulk loads: SNOMED RF2 ships one Description file per language plus language refsets, LOINC ships per-language *LinguisticVariant.csv files, and all of them were imported unconditionally. Deployments that only need a couple of languages had no way to opt out.

  • New LanguageFilter (comma-separated BCP-47 tags; empty = import everything, i.e. the historical behavior). Matching reuses the RFC 4647 logic in language.rs, with best-tier selection: a configured es-ES imports only es-ES, while a bare es imports every es-* variant.
  • SNOMED RF2: excluded per-language Description and Language-refset files are dropped in find_rf2_paths and never parsed; a row-level guard covers mixed-language Description files. English is always retained because concept display selection and the en-US/en-GB preference chain depend on it.
  • LOINC: excluded linguistic-variant files are skipped before parsing; the English main table is unaffected.
  • Bootstrap ledger: for language-sensitive formats the configured tag set is folded into the bootstrap_imports signature, so changing HTS_IMPORT_LANGUAGES re-triggers the import of affected files on the next startup.
  • Exposed as HTS_IMPORT_LANGUAGES on the server (bootstrap) and --languages / HTS_IMPORT_LANGUAGES on hts import.
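The best-tier rule can be sketched as follows. This is one possible reading of the behavior described above, not the code in language.rs: the `Tier` enum and the `tier` and `select` functions are hypothetical, the tier ordering follows the exact > separator-insensitive > primary/sibling precedence from item 5, and the forced retention of English is not modelled.

```rust
/// Match tiers, best first.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Tier {
    Exact,
    SeparatorInsensitive,
    Primary,
}

/// Lower-cased primary subtag, e.g. "es" for "es-ES" or "es_ES".
fn primary(tag: &str) -> String {
    tag.split(&['-', '_'][..]).next().unwrap_or("").to_ascii_lowercase()
}

/// How well one candidate tag matches one configured tag, if at all.
fn tier(configured: &str, candidate: &str) -> Option<Tier> {
    if configured.eq_ignore_ascii_case(candidate) {
        Some(Tier::Exact)
    } else if configured
        .replace('_', "-")
        .eq_ignore_ascii_case(&candidate.replace('_', "-"))
    {
        Some(Tier::SeparatorInsensitive)
    } else if primary(configured) == primary(candidate) {
        Some(Tier::Primary)
    } else {
        None
    }
}

/// Tags from `available` admitted by a single configured tag.
/// Assumption: a bare primary tag admits every variant, while a tag with
/// a region admits only the candidates in its best matching tier.
fn select<'a>(configured: &str, available: &[&'a str]) -> Vec<&'a str> {
    let matched: Vec<(Tier, &'a str)> = available
        .iter()
        .filter_map(|&c| tier(configured, c).map(|t| (t, c)))
        .collect();
    let is_bare = !configured.contains(&['-', '_'][..]);
    if is_bare {
        return matched.into_iter().map(|(_, c)| c).collect();
    }
    let best = matched.iter().map(|&(t, _)| t).min();
    matched
        .into_iter()
        .filter(|&(t, _)| Some(t) == best)
        .map(|(_, c)| c)
        .collect()
}
```

With `available = ["en", "es-ES", "es-AR", "es-MX"]`, a configured `es-ES` selects only `es-ES`, while a bare `es` selects all three Spanish variants. A configured `de-DE` still admits a bare `de` when that is the best tier on offer.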

4. perf: skip the JSON Bundle round-trip in chunked filesystem importers

The SNOMED and LOINC importers built a FHIR Bundle JSON payload per batch (build_code_system_bundle) which import_bundle immediately re-parsed with serde_json into the backend-agnostic ParsedBundle before writing. Every term and designation was therefore allocated, JSON-encoded, and JSON-decoded once more per run — millions of times for a multilingual SNOMED load.

  • BundleImportBackend gains import_parsed(ctx, ParsedBundle); import_bundle(bytes) now parses and delegates to it on both backends, so the HTTP POST /import path is semantically unchanged.
  • bundle_builder::build_parsed_code_system() constructs ParsedBundle values directly from the builder structs, mirroring the JSON path exactly — including the parent property row and hierarchy edge that the parser derives from parent_code, and the FHIR value[x] → value-type mapping.
  • One deliberate behavior change: for chunked imports, the stored resource_json is now the metadata-only CodeSystem resource (the same shape the seed bundle already wrote) instead of whichever ~500-concept chunk happened to be written last.
  • SNOMED association-refset ConceptMaps keep the JSON path; their cost is a one-off per import.
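The delegation shape can be illustrated with a simplified sketch. The trait and method names come from this PR; everything else is a stand-in. The real `import_parsed` also takes a context argument, `ParsedBundle` is far richer, and the toy `code|display` line parser below replaces the serde_json parse purely so the example stays dependency-free.

```rust
/// Hypothetical, heavily simplified stand-in for the real ParsedBundle.
#[derive(Debug, Default, PartialEq)]
struct ParsedBundle {
    concepts: Vec<(String, String)>, // (code, display)
}

/// Toy parser standing in for the serde_json Bundle parse.
fn parse_bundle(bytes: &[u8]) -> Result<ParsedBundle, String> {
    let text = std::str::from_utf8(bytes).map_err(|e| e.to_string())?;
    let mut concepts = Vec::new();
    for line in text.lines().filter(|l| !l.trim().is_empty()) {
        let (code, display) = line
            .split_once('|')
            .ok_or_else(|| format!("malformed line: {line}"))?;
        concepts.push((code.to_string(), display.to_string()));
    }
    Ok(ParsedBundle { concepts })
}

trait BundleImportBackend {
    /// Write an already-parsed bundle; filesystem importers call this directly.
    fn import_parsed(&mut self, bundle: ParsedBundle) -> Result<usize, String>;

    /// Byte-oriented entry point (e.g. HTTP import): parse, then delegate.
    fn import_bundle(&mut self, bytes: &[u8]) -> Result<usize, String> {
        self.import_parsed(parse_bundle(bytes)?)
    }
}

/// In-memory backend used only for illustration.
#[derive(Default)]
struct MemoryBackend {
    rows: Vec<(String, String)>,
}

impl BundleImportBackend for MemoryBackend {
    fn import_parsed(&mut self, bundle: ParsedBundle) -> Result<usize, String> {
        let written = bundle.concepts.len();
        self.rows.extend(bundle.concepts);
        Ok(written)
    }
}
```

Because both entry points end in the same `import_parsed` call, a chunked importer that builds `ParsedBundle` values directly writes the same rows as the byte path, without the encode and decode steps.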

5. perf + fix: import indexing, fresh-load fast path, and re-import correctness

Closing the remaining import bottlenecks and the correctness gaps they exposed:

  • Indexing. Add a concept_designations(concept_id) index on both backends — the only missing per-concept access path. Without it, every per-concept delete-before-reinsert and every per-concept designation read was a full table scan, making a designation-heavy import O(n²); it is now an index seek. Conversely, drop the redundant idx_concepts_system_code: the UNIQUE(system_id, code) constraint already backs exactly those lookups (verified with EXPLAIN QUERY PLAN — both pick the constraint's index), so the explicit duplicate only doubled per-insert index maintenance.
  • Fresh-load fast path. BundleImportBackend::code_system_has_concepts + a ParsedBundle.fresh_load hint let the SNOMED/LOINC importers probe once before a bulk load; when the target system holds no concepts yet, the per-concept delete-before-reinsert is skipped across all batches. Re-imports keep full replacement semantics.
  • Re-import replacement semantics. Each touched concept's properties and designations are now replaced unconditionally (previously guarded on a non-empty incoming slice). A narrowed HTS_IMPORT_LANGUAGES now drops previously-imported translations instead of leaving them behind, and PostgreSQL no longer duplicates child rows on a repeat import.
  • Bootstrap startup. The ledger records size, mtime and the applied language filter; an unchanged file is skipped on the cheap stat alone, falling back to a content hash only on a stat mismatch — so multi-GB SNOMED/LOINC archives are no longer re-read on every restart. The FTS prebuild is deferred until after the bootstrap directory sync, and the post-import closure/FTS rebuild the CLI performs now runs on the server path too, so a bootstrapped server is never left with stale closure or FTS data after a chunked re-import.
  • Language matching. BCP-47 ranking reworked to precedence exact > separator-insensitive > primary/sibling, applied to both import filtering and query-time display selection.
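The bootstrap stat-skip decision described in the list above might look roughly like this. The `mtime_unix` and `languages` names match the ledger columns mentioned under Compatibility; the `LedgerEntry` and `Decision` types, the `decide` function, and the exact ordering of the checks are assumptions for illustration.

```rust
/// Hypothetical view of one bootstrap_imports ledger row.
#[derive(Debug, Clone, PartialEq, Eq)]
struct LedgerEntry {
    size: u64,
    mtime_unix: i64,
    languages: String,
    content_hash: String,
}

#[derive(Debug, PartialEq, Eq)]
enum Decision {
    Skip,
    Import,
}

/// Decide whether a bootstrap file needs importing. `hash_file` is only
/// invoked when the cheap stat comparison fails, so an unchanged
/// multi-GB archive is never read.
fn decide(
    previous: Option<&LedgerEntry>,
    size: u64,
    mtime_unix: i64,
    languages: &str,
    hash_file: impl FnOnce() -> String,
) -> Decision {
    let Some(prev) = previous else {
        return Decision::Import; // never imported before
    };
    if prev.languages != languages {
        return Decision::Import; // language filter changed
    }
    if prev.size == size && prev.mtime_unix == mtime_unix {
        return Decision::Skip; // cheap stat match, no read needed
    }
    // Stat mismatch: fall back to comparing content hashes.
    if prev.content_hash == hash_file() {
        Decision::Skip
    } else {
        Decision::Import
    }
}
```

Passing the hash as a closure keeps the expensive read lazy: a file that was merely touched pays for one hash and is still skipped, while an untouched file costs only a stat call.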

6. fix: evict instead of freezing the SQLite request caches

The SQLite backend keeps small in-memory caches of assembled $lookup / $validate-code responses and per-concept abstract/inactive status flags. All used an insert-only-while-under-bound policy (if len < max { insert }), which froze each cache once full: the first max distinct keys held their slots permanently and every later key missed forever. That made the cache useless for any working set larger than its bound (e.g. diverse-code $lookup traffic against a 600K-concept SNOMED system, where only the first 4096 codes were ever cached). This matters more now that the import fixes above mean concepts correctly carry their full designation set.

  • A shared bounded_cache_insert helper evicts one existing entry (random replacement via HashMap's randomized iteration order) when the map is at capacity and the key is new, keeping the same memory ceiling while letting hot keys re-enter the cache.
  • Random replacement, not true LRU, is deliberate: these caches are read under a shared lock on the hot path, and tracking per-entry recency would require an exclusive lock on every read, serializing the very lookups the cache exists to accelerate.
  • Applied at all four call sites; overwriting an existing key never evicts, and a zero bound is a no-op.
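A minimal sketch of the eviction rule described in the list above. The helper name comes from this PR, but the signature and body are assumptions, not the committed code.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Insert into a size-bounded cache, evicting one arbitrary entry when
/// the map is full and the key is new. std's HashMap iterates in a
/// randomized order, so taking the first key approximates random
/// replacement without tracking recency.
fn bounded_cache_insert<K, V>(map: &mut HashMap<K, V>, max: usize, key: K, value: V)
where
    K: Hash + Eq + Clone,
{
    if max == 0 {
        return; // a zero bound disables caching entirely
    }
    if !map.contains_key(&key) && map.len() >= max {
        // Evict exactly one existing entry to make room for the new key.
        if let Some(victim) = map.keys().next().cloned() {
            map.remove(&victim);
        }
    }
    // Overwriting an existing key lands here without evicting anything.
    map.insert(key, value);
}
```

Eviction happens only on the write path, which already needs exclusive access to the map, so reads can keep using a shared lock.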

7. fix: resolve SNOMED display-language designations in $validate-code

With the import fixes above ensuring concepts carry their full designation set, a $validate-code gap surfaced: requesting a SNOMED concept with displayLanguage=de returned the English default display (and spuriously rejected the correct German display), even though $lookup resolved the German term correctly.

The language-aware display validation built its valid-display candidate set through an is_display_alternative() filter that only accepted designations whose use.code was absent or display. SNOMED RF2 designations are imported with use.system = http://snomed.info/sct and use.code set to the SNOMED description-type concept id (900000000000013009 for synonyms, 900000000000003001 for FSNs), so every SNOMED translation was dropped — the language-preferred display was never found and a supplied localized display could not match.

  • Terminology-native description types are now accepted as display alternatives by recognizing designations whose use.system is the SNOMED CT system. This mirrors $lookup (which matches purely on the BCP-47 language tag) while still excluding FHIR designation-usage alternative-purpose codes such as consumer-name and olde-english, which carry a different use.system.
  • LOINC linguistic-variant designations are unaffected — they are imported with no use.code, so they already passed the filter.
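The widened filter can be sketched like this. The function name and the SNOMED system URL come from the description above; the `Coding` struct and the signature are simplified assumptions.

```rust
const SNOMED_SYSTEM: &str = "http://snomed.info/sct";

/// Simplified stand-in for a designation's `use` Coding.
struct Coding<'a> {
    system: Option<&'a str>,
    code: Option<&'a str>,
}

/// Whether a designation may serve as a valid display candidate.
fn is_display_alternative(use_coding: Option<&Coding<'_>>) -> bool {
    match use_coding {
        // No `use` at all (e.g. LOINC linguistic variants): accepted as before.
        None => true,
        // Terminology-native description types (SNOMED synonyms, FSNs).
        Some(c) if c.system == Some(SNOMED_SYSTEM) => true,
        // Otherwise keep the original rule: code absent or "display".
        Some(c) => matches!(c.code, None | Some("display")),
    }
}
```

A SNOMED synonym (use.code 900000000000013009) now passes, while a designation carrying a non-SNOMED use.system and a code such as consumer-name is still rejected.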

8. fix: honor lenient-display-validation in the SQLite CodeSystem $validate-code

The FHIR lenient-display-validation boolean downgrades a display-name mismatch from an error (result=false) to a warning with result=true. The parameter was already parsed into ValidateCodeRequest and honored by the Postgres CodeSystem backend and both ValueSet backends, but the SQLite CodeSystem validate_code path — the default backend — hardcoded the invalid-display issue to error severity and derived result purely from display equality, so a CodeSystem $validate-code ignored the flag and still returned result=false.

  • Thread req.lenient_display_validation into that path: the invalid-display issue is emitted as a warning and result stays true when the flag is set, matching the other three backend paths.
  • The flag is also honored when the mismatch is surfaced by the language-aware pass in the operations layer rather than by the backend. Example: the supplied display matches the concept's default display, so the backend accepts it and emits no issue, but displayLanguage makes the requested-language designation the only valid display. Previously apply_language_display_validation derived "lenient" solely from a backend-emitted warning severity, so with no issue to inherit it emitted an error and returned result=false, ignoring the flag. req.lenient_display_validation is now threaded through build_validate_response_async into that pass (across the CodeSystem, inline-ValueSet, and ValueSet $validate-code handlers), and the severity is decided from the flag directly or from an inherited backend warning. The default (flag absent) remains a hard error / result=false.
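The severity decision for a display mismatch reduces to a small truth table, sketched here. The `Severity` enum and the `display_mismatch_outcome` function are hypothetical; only the rule itself (flag set, or an inherited backend warning, yields a warning with result=true) comes from the description above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Error,
    Warning,
}

/// Outcome for a display mismatch as (issue severity, `result` value).
/// `backend_severity` is the severity of an invalid-display issue the
/// backend already emitted, if any.
fn display_mismatch_outcome(
    lenient_flag: bool,
    backend_severity: Option<Severity>,
) -> (Severity, bool) {
    // Lenient if the request asked for it, or if the backend already
    // downgraded the issue to a warning.
    let lenient = lenient_flag || backend_severity == Some(Severity::Warning);
    if lenient {
        (Severity::Warning, true)
    } else {
        (Severity::Error, false)
    }
}
```

The case fixed here is `display_mismatch_outcome(true, None)`: the backend emitted nothing, yet the flag alone must still produce a warning with result=true.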

Spec provenance. lenient-display-validation is an R6-ballot parameter, defined on ValueSet/$validate-code and absent from the R4/R5 operation definitions. HTS accepts it on both CodeSystem/$validate-code and ValueSet/$validate-code regardless of the resources' FHIR version — an opt-in superset of the published definitions. The default (parameter absent or false) is the spec-mandated behavior across all versions: an invalid display fails validation (result=false). This change only affects the explicit =true opt-in; the R6 provenance is documented in the HTS README.

For a SNOMED CT International + national-extension load restricted to one extra language: parsing and designation writes drop roughly proportionally to the languages excluded (designations outnumber concepts ~5–10× in multilingual archives); per-batch bookkeeping shrinks ~10× via the larger bootstrap batch size; the O(n²)-ish COUNT and the O(n²) designation scans disappear (EXISTS probe + concept_designations index); the per-batch serialize/re-parse of every term is eliminated; and the fresh-load fast path skips delete-before-reinsert entirely on first import. Unchanged archives are skipped on restart without being re-read. On the read side, the request-cache fix keeps $lookup / $validate-code throughput steady once the working set exceeds the cache bound, instead of collapsing to the cold path. The changes are independent — deployments that need every language still benefit from the others.

Compatibility

  • All defaults preserve current behavior except the bootstrap batch size (500 → 5000, now tunable) and the resource_json shape for chunked imports noted in item 4.
  • The schema changes are additive/idempotent and applied on startup: concept_designations(concept_id) is created IF NOT EXISTS, the redundant idx_concepts_system_code is dropped IF EXISTS, and the bootstrap_imports ledger gains mtime_unix / languages columns via ADD COLUMN IF NOT EXISTS (Postgres) / duplicate-column-tolerant ALTER (SQLite). Existing ledgers stay valid; the allow-all language default keeps signatures byte-identical so no spurious re-imports are triggered by upgrading.
  • import_snomed_rf2 / import_loinc_csv gain a &LanguageFilter parameter and BundleImportBackend gains import_parsed / code_system_has_concepts / fresh_load — workspace-internal API, both backends updated. The request-cache fix is a pure internal change with no API or schema impact.

Testing

  • New tests: LanguageFilter parsing/matching and best-tier selection, RF2 filename language extraction, SNOMED import with an exclusion filter (designation counts + lookup behavior), SNOMED English force-retention, LOINC variant-file skipping, re-import replacement/de-duplication, bootstrap stat-skip ledger behavior, the request-cache eviction (evict-not-freeze, overwrite-without-evict, zero-bound no-op), $validate-code resolution/validation of a German SNOMED designation, and lenient-display-validation downgrading a CodeSystem display mismatch to a warning with result=true — including the case where displayLanguage (not the backend) surfaces the mismatch, plus a guard that the mismatch still fails when the flag is absent.
  • Full helios-hts suite green in both feature configurations (default/SQLite and --features postgres, including the Postgres testcontainer integration suite).
  • cargo fmt and cargo clippy --all-targets --all-features -- -D warnings (CI flag set) clean on both feature configurations.
  • All commits are signed.

darthcav added 5 commits June 12, 2026 10:26
The pre-transaction check that decides which code systems need an
immediate closure rebuild only distinguishes empty from non-empty, but
counted every concept row for the system's URL. During chunked bulk
loads (SNOMED RF2, LOINC) that COUNT scans an ever-growing concept set
once per batch, turning the check into O(n²/batch_size) over the whole
import. An EXISTS probe answers the same question in O(1).

Applies to both the SQLite and PostgreSQL import paths.
…TCH_SIZE)

Bootstrap sync hardcoded a 500-concept batch size; every batch pays
fixed overhead (transaction, CodeSystem metadata upsert, closure-row
and expansion-cache invalidation, in-memory cache flushes), so a SNOMED
CT load ran ~1300 batches' worth of bookkeeping. The new
HTS_BOOTSTRAP_BATCH_SIZE variable (default 5000) amortizes that ~10x.
The hts import CLI keeps its existing --batch-size default of 500 for
memory-constrained ad-hoc runs.
…IMPORT_LANGUAGES)

SNOMED RF2 archives ship one Description file per language plus
language refsets, and LOINC ships per-language linguistic-variant CSVs
— all of which were imported unconditionally, multiplying designation
rows and dominating bulk-load time for deployments that only need a
few languages.

HTS_IMPORT_LANGUAGES (env, also --languages on hts import) takes a
comma-separated BCP-47 tag list. Matching is BCP-47-aware in both
directions via the existing language module (configured de admits
de-DE; configured de-DE admits the bare de that RF2 languageCode
columns carry). Filtering happens at file level where possible —
excluded RF2 Description / Language-refset files and LOINC variant
CSVs are never parsed — with a row-level guard for mixed-language RF2
files. English is always retained because SNOMED display selection and
the en-US/en-GB preference chain depend on it.

For language-sensitive formats the configured tag set is folded into
the bootstrap ledger signature, so changing HTS_IMPORT_LANGUAGES
re-triggers the import of affected files on the next startup; the
allow-all default leaves signatures untouched, keeping existing
ledgers valid.
The SNOMED RF2 and LOINC importers serialized each concept batch into
a FHIR Bundle JSON payload that the backend immediately re-parsed with
serde_json before writing — every term and designation was allocated,
encoded, and decoded once more per import run, millions of times for a
multilingual SNOMED load.

BundleImportBackend gains import_parsed(), accepting the existing
backend-agnostic ParsedBundle directly; import_bundle() now delegates
to it after parsing (both backends), so HTTP /import semantics are
unchanged. bundle_builder::build_parsed_code_system() constructs
ParsedBundle values straight from the builder structs, mirroring the
JSON path exactly — including the parent property row + hierarchy edge
the parser derives from parent_code — with one deliberate exception:
resource_json now stores the metadata-only CodeSystem resource (the
seed shape) instead of whichever 500-concept chunk happened to be
written last.

SNOMED association-refset ConceptMaps keep the JSON path; their cost
is a one-off per import.
…d README

Align the SNOMED RF2 and LOINC module headers, the hts import help
dump, and the CLAUDE.md import recipes with the HTS_IMPORT_LANGUAGES /
--languages option introduced for multilingual imports.
@darthcav

Contributor Author

@smunini Please do not merge this right away. I may need to patch it a bit, because I think I need to optimize the performance further.

@smunini

smunini commented Jun 12, 2026

Contributor

@darthcav ok no problem. Going forward, the best way to signal that a PR is still in progress is to raise it as a "Draft" PR - https://github.blog/news-insights/product-news/introducing-draft-pull-requests/ When ready, you can flip it to "Ready for review".

@darthcav darthcav marked this pull request as draft June 12, 2026 13:55
darthcav added 2 commits June 12, 2026 18:19
Improve first-time and repeat import performance for large terminology
distributions (SNOMED CT, LOINC) and close several correctness gaps in the
bootstrap and re-import paths.

Indexing
- Add a concept_designations(concept_id) index on both backends. It was the
  only missing access path: every per-concept delete-before-reinsert and every
  per-concept designation read was a full table scan, making a
  designation-heavy import O(n^2). It is now an index seek (O(n log n)).
- Drop the redundant idx_concepts_system_code on both backends. The
  UNIQUE(system_id, code) constraint already backs the same lookups, so the
  explicit duplicate only doubled per-insert index maintenance.

Fresh-load fast path
- Add BundleImportBackend::code_system_has_concepts and a ParsedBundle.fresh_load
  hint. The SNOMED RF2 and LOINC importers probe once before a bulk load; when
  the target system holds no concepts yet, the per-concept
  delete-before-reinsert is skipped entirely across all batches. Re-imports
  keep full replacement semantics.

Re-import replacement semantics
- Replace each touched concept's properties and designations unconditionally
  (previously guarded on a non-empty incoming slice). A narrowed
  HTS_IMPORT_LANGUAGES now drops previously imported translations instead of
  leaving them behind, and PostgreSQL no longer duplicates child rows on a
  repeat import.

Bootstrap startup
- Skip full-file hashing of unchanged archives: the bootstrap ledger now
  records size, mtime and the applied language filter and skips re-reading a
  file whose stat is unchanged, falling back to a content hash only on a
  mismatch. Avoids re-reading multi-GB SNOMED/LOINC ZIPs on every restart.
- Defer the SQLite FTS prebuild until after the bootstrap directory sync, and
  run the post-import closure and FTS rebuild that the CLI performs, so a
  bootstrapped server is not left with stale closure or FTS data after a
  chunked re-import.

Language matching
- Rework BCP-47 ranking to the precedence exact > separator-insensitive >
  primary/sibling, and add best-tier selection for import filtering: a
  configured es-ES imports only es-ES (not es-AR/es-MX), while a bare es
  imports every es-* variant. Applied to both import filtering and query-time
  display selection.

Documentation (README, CLAUDE.md) updated to match, with regression tests
covering each area.
The SQLite backend's in-memory request caches — the $lookup and
$validate-code response caches and the per-concept abstract/inactive
status-flag caches — all used an insert-only-while-under-bound policy:

    if w.len() < max { w.insert(key, value) }

Once a cache filled, it froze. The first `max` distinct keys held their
slots permanently and every later key missed forever, so the cache
became useless for any working set larger than its bound — e.g.
diverse-code $lookup traffic against a 600K-concept SNOMED system, where
only the first 4096 codes were ever cached.

Replace the policy with a shared `bounded_cache_insert` helper that, when
the map is at capacity and the key is new, evicts one existing entry
before admitting the new one. std HashMap's randomized iteration order
makes this random replacement, which keeps the same memory ceiling while
letting hot keys re-enter the cache.

Random replacement (not true LRU) is deliberate: these caches are read
under a shared lock on the hot path, and tracking per-entry recency would
require taking an exclusive lock on every read, serializing the very
lookups the cache exists to accelerate. Overwriting an existing key never
evicts, and a zero bound is a no-op.

Applied at all four call sites (lookup_response_cache,
validate_code_response_cache, and both concept-flag caches). Adds three
unit tests covering eviction-not-freeze, overwrite-without-evict, and the
zero-bound no-op. Documents the indexing and caching strategy in the HTS
README.
@darthcav darthcav changed the title perf(hts): faster multilingual terminology bootstrap — language filter, batch tuning, EXISTS probe, no JSON round-trip perf(hts): faster terminology bootstrap & re-import, steadier $lookup throughput — language filter, batch tuning, EXISTS probe, no JSON round-trip, indexing, cache eviction Jun 13, 2026
CodeSystem/$validate-code returned the English default display (and
spuriously rejected a correct localized display) when displayLanguage
selected a SNOMED translation, even though $lookup resolved it correctly.

The language-aware display validation built its valid-display candidate
set through is_display_alternative(), which only accepted designations
whose use.code was absent or "display". SNOMED RF2 designations are
imported with use.system http://snomed.info/sct and use.code set to the
SNOMED description-type concept id (900000000000013009 for synonyms,
900000000000003001 for FSNs), so every SNOMED translation was dropped:
the language-preferred display was never found and the supplied display
could not match.

Accept terminology-native description types as display alternatives by
recognizing designations whose use.system is the SNOMED CT system. This
mirrors $lookup (which matches purely on the BCP-47 language tag) while
still excluding FHIR designation-usage alternative-purpose codes such as
consumer-name and olde-english, which carry a different use.system.

Add regression tests covering both the resolved display value and the
validation result for a German SNOMED designation.
@darthcav darthcav changed the title perf(hts): faster terminology bootstrap & re-import, steadier $lookup throughput — language filter, batch tuning, EXISTS probe, no JSON round-trip, indexing, cache eviction perf(hts): faster terminology bootstrap & re-import, steadier $lookup throughput — language filter, batch tuning, EXISTS probe, no JSON round-trip, indexing, cache eviction; fix localized $validate-code displays Jun 13, 2026
…date-code

The FHIR lenient-display-validation parameter downgrades a display
mismatch from an error (result=false) to a warning with result=true.
It was already wired through the request struct and honored by the
Postgres CodeSystem backend and both ValueSet backends, but the SQLite
CodeSystem validate_code path hardcoded the invalid-display issue to
error severity and computed result solely from display equality, so a
SNOMED/CodeSystem $validate-code against the default SQLite backend
still returned result=false regardless of the flag.

Thread req.lenient_display_validation into that path: emit the
invalid-display issue as a warning and keep result=true when the flag is
set. The operations-layer language-aware rewrite already preserves a
backend-emitted warning severity, so the downgrade survives the
language-display pass and the comma-separated displayLanguage handling.

Add a regression test asserting result=true with a warning-severity
invalid-display issue when lenient-display-validation=true.
@darthcav darthcav changed the title perf(hts): faster terminology bootstrap & re-import, steadier $lookup throughput — language filter, batch tuning, EXISTS probe, no JSON round-trip, indexing, cache eviction; fix localized $validate-code displays perf(hts): faster terminology bootstrap & re-import, steadier $lookup throughput — language filter, batch tuning, EXISTS probe, no JSON round-trip, indexing, cache eviction; fix localized $validate-code displays & lenient-display-validation Jun 13, 2026
darthcav added 2 commits June 13, 2026 18:35
…lay validation [skip ci]

Note in the HTS README that display validation matches against concept
designations (including SNOMED CT synonyms/FSNs via use.system) with
BCP-47 displayLanguage handling, and that lenient-display-validation is
an R6-ballot parameter (defined on ValueSet/$validate-code, absent from
R4/R5) which HTS accepts on both CodeSystem and ValueSet $validate-code
as an opt-in superset. Clarify that the default — invalid display fails
validation — is the spec-mandated behaviour across all versions.
lenient-display-validation downgrades a display mismatch to a warning
with result=true. It worked when the backend flagged the mismatch, but
not when the language-aware operations pass surfaced it: if the supplied
display matched the concept's default display, the backend accepted it
and emitted no issue, yet with displayLanguage set the only valid
display is the requested-language designation — so apply_language_
display_validation flagged a mismatch the backend never saw. That
function derived lenient solely from a backend-emitted warning severity,
so with no issue to inherit it emitted an error and returned
result=false, ignoring the flag.

Thread req.lenient_display_validation through build_validate_response_
async into apply_language_display_validation (CodeSystem, inline-ValueSet
and ValueSet $validate-code handlers) and decide severity from the flag
directly OR an inherited backend warning. The default (flag absent)
remains a hard error, preserving the spec-mandated result=false.

Add regression tests for the displayLanguage-mismatch case with and
without the flag.
