
perf: Batch metadata query to avoid N+1 across 5 call sites#232

Open
KRRT7 wants to merge 8 commits into microsoft:main from KRRT7:perf/batch-metadata-query

Conversation

Contributor

@KRRT7 KRRT7 commented Apr 10, 2026

Stack: 4/4 — depends on #230. Merge #231, #229, #230, then this PR.


  • Five call sites used get_item() per scored ref — one SELECT and full deserialization per match (N+1 pattern)
  • Added get_metadata_multiple to ISemanticRefCollection that fetches only semref_id, range_json, knowledge_type in a single batch query
  • Replaced the N+1 loop with one get_metadata_multiple call at each site
  • Further optimized scope-filtering: binary search in contains_range, inline tuple comparisons in TextRange, skip pydantic validation in get_metadata_multiple
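A minimal sketch of what such a batched metadata query can look like, assuming an illustrative `SemanticRefs` table (the real schema and API in this PR may differ):

```python
import json
import sqlite3

def get_metadata_multiple(conn, ordinals):
    """Fetch metadata for many semantic refs in one SELECT.

    Only semref_id, range_json, and knowledge_type are read; the expensive
    knowledge_json column is never touched. Table and column names here are
    assumptions for illustration.
    """
    placeholders = ",".join("?" * len(ordinals))
    rows = conn.execute(
        f"SELECT semref_id, range_json, knowledge_type"
        f" FROM SemanticRefs WHERE semref_id IN ({placeholders})",
        list(ordinals),
    ).fetchall()
    by_id = {sid: (json.loads(range_json), ktype) for sid, range_json, ktype in rows}
    return [by_id[o] for o in ordinals]  # preserve caller ordering
```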

Call sites optimized

  1. lookup_term_filtered — batch metadata, filter by knowledge_type/range
  2. lookup_property_in_property_index — batch metadata, filter by range scope
  3. SemanticRefAccumulator.group_matches_by_type — batch metadata, group by knowledge_type
  4. SemanticRefAccumulator.get_matches_in_scope — batch metadata, filter by range scope
  5. get_scored_semantic_refs_from_ordinals_iter — two-phase: metadata filter then batch fetch

Additional optimizations

  • Binary search in TextRangeCollection.contains_range: replaced O(n) linear scan with bisect_right keyed on start, reducing scope-filtering from ~25ms to ~9ms
  • Inline tuple comparisons in TextRange: replaced TextLocation allocations in __eq__/__lt__/__contains__ with a shared _effective_end returning tuples
  • Skip pydantic validation in get_metadata_multiple: construct TextLocation/TextRange directly from JSON instead of going through __pydantic_validator__
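The `bisect_right` idea can be sketched on a simplified representation (parallel sorted start/end lists standing in for `TextRangeCollection`; assumes sorted, non-overlapping ranges):

```python
from bisect import bisect_right

def contains_range(starts, ends, probe_start, probe_end):
    """Binary-search membership test over sorted, non-overlapping ranges.

    starts/ends are parallel sorted lists describing [start, end] ranges.
    bisect_right locates the last range whose start <= probe_start in
    O(log n), replacing the old linear scan over every range.
    """
    i = bisect_right(starts, probe_start) - 1
    return i >= 0 and probe_end <= ends[i]
```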

Benchmark

Azure Standard_D2s_v5 — 2 vCPU, 8 GiB RAM, Python 3.13

Query (pytest-async-benchmark pedantic, 200 rounds)

200 matches against a 200-message indexed SQLite transcript. Only the function under test is timed.

| Function | Before (median) | After (median) | Speedup |
| --- | --- | --- | --- |
| lookup_term_filtered | 2.650 ms | 1.184 ms | 2.24x |
| group_matches_by_type | 2.428 ms | 978 μs | 2.48x |
| get_scored_semantic_refs_from_ordinals_iter | 2.541 ms | 2.946 ms | 0.86x |
| lookup_property_in_property_index | 25.306 ms | 9.365 ms | 2.70x |
| get_matches_in_scope | 25.011 ms | 9.160 ms | 2.73x |
Reproduce the benchmark locally:

```shell
pip install 'pytest-async-benchmark @ git+https://github.com/KRRT7/pytest-async-benchmark.git@feat/pedantic-mode' pytest-asyncio
python -m pytest tests/benchmarks/test_benchmark_query.py -v -s
```

Generated by codeflash optimization agent

KRRT7 added 7 commits April 10, 2026 15:49
Add add_terms_batch / add_properties_batch to the index interfaces
with executemany-based SQLite implementations. Restructure
add_metadata_to_index_from_list and add_to_property_index to collect
all items first, then batch-insert via extend() and the new batch
methods. Eliminates ~1000 individual INSERT round-trips during
indexing.
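A minimal sketch of the executemany-based batch insert this commit describes (table and column names are assumptions, not the real schema):

```python
import sqlite3

def add_terms_batch(conn, rows):
    """Insert many (term, semref_id) rows in a single executemany call
    instead of one INSERT round-trip per item. The TermIndex table here
    is illustrative only."""
    conn.executemany(
        "INSERT INTO TermIndex (term, semref_id) VALUES (?, ?)", rows
    )
```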
Rename _collect_{facet,entity,action}_{terms,properties} to drop the
leading underscore in propindex.py and semrefindex.py.
Change list to Sequence in add_terms_batch and add_properties_batch
interfaces and implementations to satisfy covariance. Add missing
add_terms_batch to FakeTermIndex in conftest.py.
lookup_term_filtered called get_item() per scored ref — one SELECT and
full deserialization per match. The filter only needs knowledge_type
(a plain column) and range (json.loads of range_json), never the
expensive knowledge_json deserialization (64% of per-row cost).

Add get_metadata_multiple to ISemanticRefCollection that fetches only
semref_id, range_json, knowledge_type in a single batch query. Replace
the N+1 loop in lookup_term_filtered with one get_metadata_multiple call.

Benchmark (200 matches, 200 rounds): 4.38ms → 1.32ms (3.3x speedup).
Apply the same get_metadata_multiple pattern from lookup_term_filtered
to four more sites that called get_item() in a loop:

- propindex.lookup_property_in_property_index: filter by .range
- SemanticRefAccumulator.group_matches_by_type: group by .knowledge_type
- SemanticRefAccumulator.get_matches_in_scope: filter by .range
- answers.get_scored_semantic_refs_from_ordinals_iter: two-phase
  metadata filter then batch get_multiple for matching full objects

All sites now use a single batch query instead of N individual SELECTs,
skipping knowledge_json deserialization where only range or
knowledge_type is needed.
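The two-phase shape used in get_scored_semantic_refs_from_ordinals_iter can be sketched generically (every name and callable below is an illustrative stand-in, not the real API):

```python
def scored_refs_two_phase(ordinals, get_metadata_multiple, get_multiple, keep):
    """Phase 1: filter on cheap metadata only. Phase 2: one batch fetch of
    full objects, limited to the survivors of the filter."""
    survivors = [
        o for o, meta in zip(ordinals, get_metadata_multiple(ordinals)) if keep(meta)
    ]
    return get_multiple(survivors)
```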
…arisons

- Use bisect_right with key=start in TextRangeCollection.contains_range
  to skip O(n) linear scan (O(log n) for non-overlapping point ranges)
- Replace TextLocation allocations in TextRange __eq__/__lt__/__contains__
  with a shared _effective_end returning tuples
- Skip pydantic validation in get_metadata_multiple by constructing
  TextLocation/TextRange directly from JSON
@KRRT7 KRRT7 force-pushed the perf/batch-metadata-query branch from f9a489b to 743d8db Compare April 10, 2026 20:50
@bmerkle bmerkle self-assigned this Apr 11, 2026
Collaborator

@bmerkle bmerkle left a comment


@KRRT7 please check my comments

```python
    assert set(rowdict) == set(arg)
    return [self._deserialize_semantic_ref_from_row(rowdict[ordl]) for ordl in arg]

async def get_metadata_multiple(
```

Missing bounds check in get_metadata_multiple

get_multiple validates ordinals are in bounds before querying. get_metadata_multiple in collections.py:344 does not — if an ordinal is missing from the DB, rowdict[o] will KeyError. Should either validate or handle missing keys gracefully, mirroring the existing get_multiple pattern.
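One way to mirror get_multiple's bounds handling, sketched with illustrative names rather than the actual collections.py API:

```python
def metadata_rows_in_order(rowdict, ordinals):
    """Validate before indexing: a missing ordinal raises a descriptive
    error instead of a bare KeyError from rowdict[o]."""
    missing = [o for o in ordinals if o not in rowdict]
    if missing:
        raise ValueError(f"semantic ref ordinal(s) out of bounds: {missing}")
    return [rowdict[o] for o in ordinals]
```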

```python
    semantic_ref_ordinal: SemanticRefOrdinal | ScoredSemanticRefOrdinal,
) -> None: ...

async def add_properties_batch(
```

In propindex.py:85 the SQLite implementation imports make_property_term_text / split_property_term_text from storage.memory.propindex. This creates a cross-layer dependency (sqlite → memory). These helpers should be extracted to a shared location (e.g., the base knowpro/propindex or a utils module).

```python
    await add_to_property_index(conversation, 0)


def collect_facet_properties(
```

collect_* functions duplicate logic from existing add_*_to_index functions

propindex.py adds collect_entity_properties, collect_action_properties, collect_facet_properties that duplicate the logic of the existing add_entity_properties_to_index, add_action_properties_to_index, add_facet_to_index. Same duplication occurs in semrefindex.py with collect_entity_terms / collect_action_terms / collect_facet_terms. The old add_* functions should be refactored to use the new collect_* functions + batch add, rather than keeping both versions.
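A sketch of the suggested refactor, with a fake index and illustrative collector logic (not the real propindex code):

```python
class FakeIndex:
    """Minimal stand-in for the property index batch API."""
    def __init__(self):
        self.rows = []

    def add_properties_batch(self, rows):
        self.rows.extend(rows)

def collect_facet_properties(entity):
    # Pure collector: derive (name, value) pairs without touching the index.
    return [("facet", f"{f['name']}={f['value']}") for f in entity.get("facets", [])]

def add_facet_to_index(index, entity, semref_id):
    # Old single-item entry point delegates to the collector plus the batch
    # API, so the facet-walking logic lives in exactly one place.
    index.add_properties_batch(
        [(name, value, semref_id) for name, value in collect_facet_properties(entity)]
    )
```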

```python
async def test_benchmark_lookup_term_filtered(async_benchmark):
    """Benchmark lookup_term_filtered with N+1 get_item pattern."""
    settings = make_settings()
    tmpdir = tempfile.mkdtemp()
```

The temp directory is created manually and only cleaned up in a finally block inside the benchmark; if create_indexed_transcript raises before that, the cleanup never runs. Use the pytest tmp_path fixture instead.
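A sketch of the tmp_path version (the body is reduced to a stand-in; the real test would call make_settings / create_indexed_transcript as before):

```python
# tmp_path is a built-in pytest fixture yielding a fresh per-test
# pathlib.Path that pytest cleans up itself, even when setup raises.
def test_benchmark_lookup_term_filtered(tmp_path):
    db_path = tmp_path / "transcript.db"  # no mkdtemp / try-finally needed
    db_path.write_text("")                # stand-in for create_indexed_transcript
    assert db_path.exists()
```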

Contributor Author

KRRT7 commented Apr 14, 2026

@bmerkle Heads up — #230 now includes a refactor that will cause merge conflicts with this PR (semrefindex.py, sqlite/propindex.py, propindex.py all overlap). We'll hold off on updating this PR until #230 lands, then rebase on top.
