perf: Cumulative startup and runtime optimizations #3

Draft

KRRT7 wants to merge 19 commits into main from optimization

Conversation

KRRT7 (Owner) commented Apr 10, 2026

Cumulative Benchmarks

Azure Standard_D2s_v5 (2 vCPU, 8 GB RAM, non-burstable), Python 3.13, Ubuntu 24.04

Startup

hyperfine (warmup 5, min-runs 30)

| Benchmark | main | optimization | Speedup |
| --- | --- | --- | --- |
| import typeagent | 791ms ±11ms | 713ms ±8ms | 1.11x |
| Offline test suite (69 tests) | 5.72s ±90ms | 5.60s ±98ms | 1.02x |

Runtime (indexing pipeline)

pytest-async-benchmark pedantic mode, 20 rounds, 3 warmup — only hot path timed (setup/teardown excluded)

| Benchmark | main (min) | optimization (min) | Speedup |
| --- | --- | --- | --- |
| add_messages_with_indexing (200 msgs) | 28.8ms | 25.0ms | 1.16x |
| add_messages_with_indexing (50 msgs) | 7.8ms | 6.7ms | 1.16x |
| VTT ingest (40 msgs) | 6.9ms | 6.1ms | 1.14x |

Query

pytest-async-benchmark pedantic mode, 200 rounds, 20 warmup

| Benchmark | main (median) | optimization (median) | Speedup |
| --- | --- | --- | --- |
| lookup_term_filtered (200 matches) | 2.650ms | 1.184ms | 2.24x |
| group_matches_by_type (200 matches) | 2.428ms | 978μs | 2.48x |
| get_scored_semantic_refs_from_ordinals_iter (200 matches) | 2.541ms | 2.946ms | 0.86x |
| lookup_property_in_property_index (200 matches) | 25.306ms | 9.365ms | 2.70x |
| get_matches_in_scope (200 matches) | 25.011ms | 9.160ms | 2.73x |

Vector Search

pytest-async-benchmark pedantic mode, 200 rounds, 20 warmup, 384-dim embeddings

| Benchmark | main (min) | optimization (min) | Speedup |
| --- | --- | --- | --- |
| fuzzy_lookup_embedding (1K vecs) | 257μs | 70μs | 3.7x |
| fuzzy_lookup_embedding (10K vecs) | 5.72ms | 559μs | 10.2x |
| fuzzy_lookup_embedding (10K + predicate) | 4.79ms | 3.41ms | 1.4x |
| fuzzy_lookup_embedding_in_subset (1K of 10K) | 3.45ms | 243μs | 14.2x |

Note: The vectorbase optimization applies to code from microsoft/typeagent-py#228 which has not yet merged to upstream main. The PR is open against the contributor's fork at shreejaykurhade/typeagent-py#1.

Optimizations (cumulative)

1. Defer black import to first use (ecbf6f5)

  • black was imported at module level but only used in two cold-path functions
  • Moved import black inside create_context_prompt() and format_code()
  • Saves ~79ms on import by avoiding black + transitive deps (pathspec, black.nodes)
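The deferral pattern, sketched with a stdlib module standing in for black (which may not be installed where this runs):

```python
import sys

sys.modules.pop("colorsys", None)  # ensure a cold start for the demo

def cold_path(hsv):
    # Deferred import: loaded on first call, not at `import typeagent`
    # time. colorsys is a stand-in for black here.
    import colorsys
    r, g, b = colorsys.hsv_to_rgb(*hsv)
    return f"{r:.2f} {g:.2f} {b:.2f}"

assert "colorsys" not in sys.modules   # startup pays nothing
print(cold_path((0.0, 0.0, 0.5)))     # prints "0.50 0.50 0.50"
assert "colorsys" in sys.modules       # module loaded on first use
```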

2. Batch SQLite INSERTs for indexing pipeline (bc9f2df)

  • Added add_terms_batch and add_properties_batch to ITermToSemanticRefIndex and IPropertyToSemanticRefIndex interfaces
  • SQLite backend uses executemany instead of individual cursor.execute() calls
  • Restructured add_metadata_to_index_from_list and add_to_property_index to collect all data first, then batch-insert
  • Consistent ~14-16% improvement across all message counts
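A minimal sketch of the batching shape (the table and column names here are hypothetical; the real schema lives in the SQLite storage backend):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE term_index (term TEXT, semref_id INTEGER)")

def add_terms_batch(db, rows):
    # One executemany round-trip instead of len(rows) execute() calls.
    db.executemany(
        "INSERT INTO term_index (term, semref_id) VALUES (?, ?)", rows
    )

add_terms_batch(db, [("alice", 1), ("bob", 2), ("alice", 3)])
count = db.execute("SELECT COUNT(*) FROM term_index").fetchone()[0]
print(count)  # 3
```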

3. Numpy vectorized fuzzy lookup (bc5b319)

  • Replaced Python-level list comprehension + sort with np.flatnonzero + np.argpartition for O(n) top-k
  • fuzzy_lookup_embedding_in_subset now uses fancy indexing to compute dot products only for subset indices
  • 3.7x–14.2x speedup on vector search hot paths
  • Optimizes code introduced in microsoft/typeagent-py#228 (not yet merged upstream)
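The no-predicate path can be sketched like this (function name and parameters simplified from the PR description; not the exact implementation):

```python
import numpy as np

def fuzzy_lookup(embeddings, query, min_score, max_hits):
    # One vectorized pass over all rows, then an O(n) partial selection.
    scores = embeddings @ query
    candidates = np.flatnonzero(scores >= min_score)   # threshold filter
    if len(candidates) > max_hits:
        # argpartition separates the top max_hits without a full sort
        top = np.argpartition(scores[candidates], -max_hits)[-max_hits:]
        candidates = candidates[top]
    order = np.argsort(scores[candidates])[::-1]       # sort only k items
    return [(int(i), float(scores[i])) for i in candidates[order]]

emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.5, 0.5]])
print(fuzzy_lookup(emb, np.array([1.0, 0.0]), 0.4, 2))
```

The key saving is that no per-vector Python object (e.g. a ScoredInt) is built for rows that never make the cut.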

4. Batch metadata query across 5 N+1 call sites

  • Five call sites used get_item() per scored ref — N+1 pattern with full knowledge_json deserialization
  • Added get_metadata_multiple to ISemanticRefCollection — fetches only semref_id, range_json, knowledge_type in one batch query
  • Skips json.loads(knowledge_json) and deserialize_knowledge() entirely (64% of per-row cost)
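The shape of the batch query, reduced to the three columns named above (the table layout is an assumption for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE SemanticRefs (semref_id INTEGER PRIMARY KEY,"
    " range_json TEXT, knowledge_type TEXT, knowledge_json TEXT)"
)
db.executemany(
    "INSERT INTO SemanticRefs VALUES (?, ?, ?, ?)",
    [(i, "{}", "entity", '{"large": "payload"}') for i in range(100)],
)

def get_metadata_multiple(ids):
    # One IN (...) query for the cheap columns; knowledge_json (64% of
    # the per-row cost in the PR's profile) is never read or parsed.
    placeholders = ",".join("?" * len(ids))
    return db.execute(
        "SELECT semref_id, range_json, knowledge_type FROM SemanticRefs"
        f" WHERE semref_id IN ({placeholders})", list(ids)
    ).fetchall()

rows = get_metadata_multiple([3, 7, 42])
print(rows)
```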

5. Speed up scope-filtering: bisect + inline tuple comparisons

  • Binary search in TextRangeCollection.contains_range — replaced O(n) linear scan with bisect_right keyed on start
  • Inline tuple comparisons in TextRange.__eq__/__lt__/__contains__ — replaced TextLocation allocations with _effective_end tuples
  • Skip pydantic validation in get_metadata_multiple — construct TextLocation/TextRange directly from JSON
  • Scope-filtering benchmarks improved from ~25ms to ~9ms (2.7x)

6. Bugfix: parse_azure_endpoint

  • parse_azure_endpoint returned the full URL including ?api-version=..., which AsyncAzureOpenAI then mangled into an invalid double path (e.g. ...?api-version=2024-06-01/openai/)
  • Now strips query string before returning. Added 6 unit tests.
  • Upstream: microsoft/typeagent-py#231
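The core of the fix, sketched (the real function's signature and return shape may differ; this shows only the query-stripping step):

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit

def parse_azure_endpoint(url):
    # Strip the query string so the SDK can't splice paths after it;
    # api-version travels as a separate value.
    parts = urlsplit(url)
    api_version = parse_qs(parts.query).get("api-version", [None])[0]
    base = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return base, api_version

base, ver = parse_azure_endpoint(
    "https://example.openai.azure.com/openai?api-version=2024-06-01"
)
print(base)  # https://example.openai.azure.com/openai
print(ver)   # 2024-06-01
```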

This PR accumulates all optimizations on the optimization branch. Benchmarks are re-run after every push. Individual PRs are opened separately for review.

black is only used in create_context_prompt() and format_code() -- both
cold paths. Moving the import inside the functions avoids loading black
and its transitive deps (pathspec, black.nodes, etc.) on every
import typeagent.
KRRT7 changed the title from "perf: Defer heavy imports for faster startup" to "perf: Cumulative startup and runtime optimizations" on Apr 10, 2026
KRRT7 added 3 commits April 10, 2026 00:11
- Combine 16 separate cursor.execute() calls in init_db_schema into a
  single db.executescript() call, reducing SQLite round-trips during
  database initialization.
- Pre-compile the whitespace regex in _prepare_term to avoid
  re-compiling on every call (552 calls during indexing).
Add add_terms_batch / add_properties_batch to the index interfaces
with executemany-based SQLite implementations. Restructure
add_metadata_to_index_from_list and add_to_property_index to collect
all items first, then batch-insert via extend() and the new batch
methods. Eliminates ~1000 individual INSERT round-trips during
indexing.
KRRT7 added 11 commits April 10, 2026 01:19
Replace hand-rolled time.perf_counter() loop with the pedantic fixture
from pytest-async-benchmark. Setup (DB/storage/transcript creation) and
teardown (close/delete) are now properly excluded from timing via the
framework instead of inline timing code.
Move repeated setup/teardown/target pattern into run_indexing_benchmark()
helper. Each test now delegates with just messages and message_type.
Rename _collect_{facet,entity,action}_{terms,properties} to drop the
leading underscore in propindex.py and semrefindex.py.
Install from fork with pedantic mode support for benchmark tests.
Change list to Sequence in add_terms_batch and add_properties_batch
interfaces and implementations to satisfy covariance. Add missing
add_terms_batch to FakeTermIndex in conftest.py.
Replace Python-level list comprehension + sort with numpy operations:
- No-predicate path: np.flatnonzero for score filtering, np.argpartition
  for O(n) top-k selection — avoids building ScoredInt for every vector
- Predicate path: numpy pre-filters by score, applies predicate only to
  candidates above threshold
- Subset lookup: numpy fancy indexing computes dot products only for
  subset indices instead of delegating to full-vector scan with predicate
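The subset path can be sketched as (names simplified; not the exact implementation):

```python
import numpy as np

def fuzzy_lookup_in_subset(embeddings, query, subset_ids, min_score):
    # Fancy indexing scores only the named rows; the full matrix is
    # never scanned and no predicate filter runs afterwards.
    ids = np.asarray(subset_ids)
    scores = embeddings[ids] @ query
    keep = scores >= min_score
    return list(zip(ids[keep].tolist(), scores[keep].tolist()))

emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]])
print(fuzzy_lookup_in_subset(emb, np.array([1.0, 0.0]), [1, 2], 0.4))
# [(2, 0.5)]
```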
lookup_term_filtered called get_item() per scored ref — one SELECT and
full deserialization per match. The filter only needs knowledge_type
(a plain column) and range (json.loads of range_json), never the
expensive knowledge_json deserialization (64% of per-row cost).

Add get_metadata_multiple to ISemanticRefCollection that fetches only
semref_id, range_json, knowledge_type in a single batch query. Replace
the N+1 loop in lookup_term_filtered with one get_metadata_multiple call.

Benchmark (200 matches, 200 rounds): 4.38ms → 1.32ms (3.3x speedup).
Apply the same get_metadata_multiple pattern from lookup_term_filtered
to four more sites that called get_item() in a loop:

- propindex.lookup_property_in_property_index: filter by .range
- SemanticRefAccumulator.group_matches_by_type: group by .knowledge_type
- SemanticRefAccumulator.get_matches_in_scope: filter by .range
- answers.get_scored_semantic_refs_from_ordinals_iter: two-phase
  metadata filter then batch get_multiple for matching full objects

All sites now use a single batch query instead of N individual SELECTs,
skipping knowledge_json deserialization where only range or
knowledge_type is needed.
parse_azure_endpoint returned the raw URL including ?api-version=...
which AsyncAzureOpenAI then mangled into invalid paths like
...?api-version=2024-06-01/openai/. Strip the query string before
returning — api_version is already returned as a separate value and
passed to the SDK independently.
KRRT7 added 3 commits April 10, 2026 05:03
…arisons

- Use bisect_right with key=start in TextRangeCollection.contains_range
  to skip O(n) linear scan (O(log n) for non-overlapping point ranges)
- Replace TextLocation allocations in TextRange __eq__/__lt__/__contains__
  with a shared _effective_end returning tuples
- Skip pydantic validation in get_metadata_multiple by constructing
  TextLocation/TextRange directly from JSON
black is only used at runtime in two cold formatting paths:
- create_context_prompt() in answers.py (LLM debug context)
- format_code()/pretty_print() in utils.py (developer terminal output)

Both format Python data structures, which is exactly what pprint does.
Replace black.format_str with pprint.pformat + ast.literal_eval,
eliminating the runtime dependency entirely.

Move black from dependencies to dev dependency-group — it remains
available for make format/check but is no longer required by library
consumers.
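A sketch of the substitution (width and exact call shape are assumptions; the commit only specifies pformat + literal_eval):

```python
import ast
import pprint

def pretty_print(value, width=60):
    # repr strings round-trip through ast.literal_eval, then
    # pprint.pformat lays them out; no black needed at runtime.
    if isinstance(value, str):
        value = ast.literal_eval(value)
    return pprint.pformat(value, width=width)

print(pretty_print("{'terms': ['alpha', 'beta'], 'count': 2}"))
```

This works because both call sites only ever format Python data structures, which is pprint's native job; arbitrary source code would still need a real formatter.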
answers, search_query_schema, searchlang, and answer_response_schema
are only used in the query() method. Move their imports from module
level into query() and use TYPE_CHECKING + __future__.annotations for
the type hints.

These modules pull in search, query, and schema initialization that
isn't needed when creating or indexing conversations.
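The pattern, with a stdlib module standing in for the deferred query modules:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Visible to the type checker only; nothing imports at runtime.
    # decimal stands in for searchlang / search_query_schema here.
    from decimal import Decimal

class Conversation:
    def query(self, text: str) -> Decimal:
        # Deferred: the heavy module loads on the first query() call,
        # not when conversations are created or indexed.
        from decimal import Decimal
        return Decimal(text)

print(Conversation().query("1.5"))  # 1.5
```

`from __future__ import annotations` keeps the `Decimal` annotation lazy, so the TYPE_CHECKING import never has to exist at runtime.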