perf: Cumulative startup and runtime optimizations #3

Draft

KRRT7 wants to merge 19 commits into main from optimization

Conversation

KRRT7 (Owner) commented Apr 10, 2026

Cumulative Benchmarks

Azure Standard_D2s_v5 (2 vCPU, 8 GB RAM, non-burstable), Python 3.13, Ubuntu 24.04

Startup

hyperfine (warmup 5, min-runs 30)

| Benchmark | main | optimization | Speedup |
| --- | --- | --- | --- |
| import typeagent | 791ms ±11ms | 713ms ±8ms | 1.11x |
| Offline test suite (69 tests) | 5.72s ±90ms | 5.60s ±98ms | 1.02x |

Runtime (indexing pipeline)

pytest-async-benchmark pedantic mode, 20 rounds, 3 warmup — only hot path timed (setup/teardown excluded)

| Benchmark | main (min) | optimization (min) | Speedup |
| --- | --- | --- | --- |
| add_messages_with_indexing (200 msgs) | 28.8ms | 25.0ms | 1.16x |
| add_messages_with_indexing (50 msgs) | 7.8ms | 6.7ms | 1.16x |
| VTT ingest (40 msgs) | 6.9ms | 6.1ms | 1.14x |

Query

pytest-async-benchmark pedantic mode, 200 rounds, 20 warmup

| Benchmark | main (median) | optimization (median) | Speedup |
| --- | --- | --- | --- |
| lookup_term_filtered (200 matches) | 2.650ms | 1.184ms | 2.24x |
| group_matches_by_type (200 matches) | 2.428ms | 978μs | 2.48x |
| get_scored_semantic_refs_from_ordinals_iter (200 matches) | 2.541ms | 2.946ms | 0.86x |
| lookup_property_in_property_index (200 matches) | 25.306ms | 9.365ms | 2.70x |
| get_matches_in_scope (200 matches) | 25.011ms | 9.160ms | 2.73x |

Vector Search

pytest-async-benchmark pedantic mode, 200 rounds, 20 warmup, 384-dim embeddings

| Benchmark | main (min) | optimization (min) | Speedup |
| --- | --- | --- | --- |
| fuzzy_lookup_embedding (1K vecs) | 257μs | 70μs | 3.7x |
| fuzzy_lookup_embedding (10K vecs) | 5.72ms | 559μs | 10.2x |
| fuzzy_lookup_embedding (10K + predicate) | 4.79ms | 3.41ms | 1.4x |
| fuzzy_lookup_embedding_in_subset (1K of 10K) | 3.45ms | 243μs | 14.2x |

Note: The vectorbase optimization applies to code from microsoft/typeagent-py#228 which has not yet merged to upstream main. The PR is open against the contributor's fork at shreejaykurhade/typeagent-py#1.

Optimizations (cumulative)

1. Defer black import to first use (ecbf6f5)

  • black was imported at module level but only used in two cold-path functions
  • Moved import black inside create_context_prompt() and format_code()
  • Saves ~79ms on import by avoiding black + transitive deps (pathspec, black.nodes)
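The deferral pattern, sketched with a stdlib module standing in for black (which may not be installed where this runs):

```python
import sys

sys.modules.pop("colorsys", None)  # ensure a cold start for the demo

def cold_path(hsv):
    # Deferred import: loaded on first call, not at `import typeagent`
    # time. colorsys is a stand-in for black here.
    import colorsys
    r, g, b = colorsys.hsv_to_rgb(*hsv)
    return f"{r:.2f} {g:.2f} {b:.2f}"

assert "colorsys" not in sys.modules   # startup pays nothing
print(cold_path((0.0, 0.0, 0.5)))     # prints "0.50 0.50 0.50"
assert "colorsys" in sys.modules       # module loaded on first use
```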

2. Batch SQLite INSERTs for indexing pipeline (bc9f2df)

  • Added add_terms_batch and add_properties_batch to ITermToSemanticRefIndex and IPropertyToSemanticRefIndex interfaces
  • SQLite backend uses executemany instead of individual cursor.execute() calls
  • Restructured add_metadata_to_index_from_list and add_to_property_index to collect all data first, then batch-insert
  • Consistent ~14-16% improvement across all message counts
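A minimal sketch of the batching shape (the table and column names here are hypothetical; the real schema lives in the SQLite storage backend):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE term_index (term TEXT, semref_id INTEGER)")

def add_terms_batch(db, rows):
    # One executemany round-trip instead of len(rows) execute() calls.
    db.executemany(
        "INSERT INTO term_index (term, semref_id) VALUES (?, ?)", rows
    )

add_terms_batch(db, [("alice", 1), ("bob", 2), ("alice", 3)])
count = db.execute("SELECT COUNT(*) FROM term_index").fetchone()[0]
print(count)  # 3
```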

3. Numpy vectorized fuzzy lookup (bc5b319)

  • Replaced Python-level list comprehension + sort with np.flatnonzero + np.argpartition for O(n) top-k
  • fuzzy_lookup_embedding_in_subset now uses fancy indexing to compute dot products only for subset indices
  • 3.7x–14.2x speedup on vector search hot paths
  • Optimizes code introduced in microsoft/typeagent-py#228 (not yet merged upstream)
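The no-predicate path can be sketched like this (function name and parameters simplified from the PR description; not the exact implementation):

```python
import numpy as np

def fuzzy_lookup(embeddings, query, min_score, max_hits):
    # One vectorized pass over all rows, then an O(n) partial selection.
    scores = embeddings @ query
    candidates = np.flatnonzero(scores >= min_score)   # threshold filter
    if len(candidates) > max_hits:
        # argpartition separates the top max_hits without a full sort
        top = np.argpartition(scores[candidates], -max_hits)[-max_hits:]
        candidates = candidates[top]
    order = np.argsort(scores[candidates])[::-1]       # sort only k items
    return [(int(i), float(scores[i])) for i in candidates[order]]

emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.5, 0.5]])
print(fuzzy_lookup(emb, np.array([1.0, 0.0]), 0.4, 2))
```

The key saving is that no per-vector Python object (e.g. a ScoredInt) is built for rows that never make the cut.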

4. Batch metadata query across 5 N+1 call sites

  • Five call sites used get_item() per scored ref — N+1 pattern with full knowledge_json deserialization
  • Added get_metadata_multiple to ISemanticRefCollection — fetches only semref_id, range_json, knowledge_type in one batch query
  • Skips json.loads(knowledge_json) and deserialize_knowledge() entirely (64% of per-row cost)
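The shape of the batch query, reduced to the three columns named above (the table layout is an assumption for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE SemanticRefs (semref_id INTEGER PRIMARY KEY,"
    " range_json TEXT, knowledge_type TEXT, knowledge_json TEXT)"
)
db.executemany(
    "INSERT INTO SemanticRefs VALUES (?, ?, ?, ?)",
    [(i, "{}", "entity", '{"large": "payload"}') for i in range(100)],
)

def get_metadata_multiple(ids):
    # One IN (...) query for the cheap columns; knowledge_json (64% of
    # the per-row cost in the PR's profile) is never read or parsed.
    placeholders = ",".join("?" * len(ids))
    return db.execute(
        "SELECT semref_id, range_json, knowledge_type FROM SemanticRefs"
        f" WHERE semref_id IN ({placeholders})", list(ids)
    ).fetchall()

rows = get_metadata_multiple([3, 7, 42])
print(rows)
```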

5. Speed up scope-filtering: bisect + inline tuple comparisons

  • Binary search in TextRangeCollection.contains_range — replaced O(n) linear scan with bisect_right keyed on start
  • Inline tuple comparisons in TextRange.__eq__/__lt__/__contains__ — replaced TextLocation allocations with _effective_end tuples
  • Skip pydantic validation in get_metadata_multiple — construct TextLocation/TextRange directly from JSON
  • Scope-filtering benchmarks improved from ~25ms to ~9ms (2.7x)

6. Bugfix: parse_azure_endpoint

  • parse_azure_endpoint returned the full URL including ?api-version=..., which AsyncAzureOpenAI then mangled into an invalid double path (e.g. ...?api-version=2024-06-01/openai/)
  • Now strips query string before returning. Added 6 unit tests.
  • Upstream: microsoft/typeagent-py#231
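The core of the fix, sketched (the real function's signature and return shape may differ; this shows only the query-stripping step):

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit

def parse_azure_endpoint(url):
    # Strip the query string so the SDK can't splice paths after it;
    # api-version travels as a separate value.
    parts = urlsplit(url)
    api_version = parse_qs(parts.query).get("api-version", [None])[0]
    base = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return base, api_version

base, ver = parse_azure_endpoint(
    "https://example.openai.azure.com/openai?api-version=2024-06-01"
)
print(base)  # https://example.openai.azure.com/openai
print(ver)   # 2024-06-01
```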

This PR accumulates all optimizations on the optimization branch. Benchmarks are re-run after every push. Individual PRs are opened separately for review.

black is only used in create_context_prompt() and format_code() -- both
cold paths. Moving the import inside the functions avoids loading black
and its transitive deps (pathspec, black.nodes, etc.) on every
import typeagent.
KRRT7 changed the title from "perf: Defer heavy imports for faster startup" to "perf: Cumulative startup and runtime optimizations" on Apr 10, 2026
KRRT7 added 3 commits April 10, 2026 00:11
- Combine 16 separate cursor.execute() calls in init_db_schema into a
  single db.executescript() call, reducing SQLite round-trips during
  database initialization.
- Pre-compile the whitespace regex in _prepare_term to avoid
  re-compiling on every call (552 calls during indexing).
Add add_terms_batch / add_properties_batch to the index interfaces
with executemany-based SQLite implementations. Restructure
add_metadata_to_index_from_list and add_to_property_index to collect
all items first, then batch-insert via extend() and the new batch
methods. Eliminates ~1000 individual INSERT round-trips during
indexing.
KRRT7 added 11 commits April 10, 2026 01:19
Replace hand-rolled time.perf_counter() loop with the pedantic fixture
from pytest-async-benchmark. Setup (DB/storage/transcript creation) and
teardown (close/delete) are now properly excluded from timing via the
framework instead of inline timing code.
Move repeated setup/teardown/target pattern into run_indexing_benchmark()
helper. Each test now delegates with just messages and message_type.
Rename _collect_{facet,entity,action}_{terms,properties} to drop the
leading underscore in propindex.py and semrefindex.py.
Install from fork with pedantic mode support for benchmark tests.
Change list to Sequence in add_terms_batch and add_properties_batch
interfaces and implementations to satisfy covariance. Add missing
add_terms_batch to FakeTermIndex in conftest.py.
Replace Python-level list comprehension + sort with numpy operations:
- No-predicate path: np.flatnonzero for score filtering, np.argpartition
  for O(n) top-k selection — avoids building ScoredInt for every vector
- Predicate path: numpy pre-filters by score, applies predicate only to
  candidates above threshold
- Subset lookup: numpy fancy indexing computes dot products only for
  subset indices instead of delegating to full-vector scan with predicate
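The subset path can be sketched as (names simplified; not the exact implementation):

```python
import numpy as np

def fuzzy_lookup_in_subset(embeddings, query, subset_ids, min_score):
    # Fancy indexing scores only the named rows; the full matrix is
    # never scanned and no predicate filter runs afterwards.
    ids = np.asarray(subset_ids)
    scores = embeddings[ids] @ query
    keep = scores >= min_score
    return list(zip(ids[keep].tolist(), scores[keep].tolist()))

emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]])
print(fuzzy_lookup_in_subset(emb, np.array([1.0, 0.0]), [1, 2], 0.4))
# [(2, 0.5)]
```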
lookup_term_filtered called get_item() per scored ref — one SELECT and
full deserialization per match. The filter only needs knowledge_type
(a plain column) and range (json.loads of range_json), never the
expensive knowledge_json deserialization (64% of per-row cost).

Add get_metadata_multiple to ISemanticRefCollection that fetches only
semref_id, range_json, knowledge_type in a single batch query. Replace
the N+1 loop in lookup_term_filtered with one get_metadata_multiple call.

Benchmark (200 matches, 200 rounds): 4.38ms → 1.32ms (3.3x speedup).
Apply the same get_metadata_multiple pattern from lookup_term_filtered
to four more sites that called get_item() in a loop:

- propindex.lookup_property_in_property_index: filter by .range
- SemanticRefAccumulator.group_matches_by_type: group by .knowledge_type
- SemanticRefAccumulator.get_matches_in_scope: filter by .range
- answers.get_scored_semantic_refs_from_ordinals_iter: two-phase
  metadata filter then batch get_multiple for matching full objects

All sites now use a single batch query instead of N individual SELECTs,
skipping knowledge_json deserialization where only range or
knowledge_type is needed.
parse_azure_endpoint returned the raw URL including ?api-version=...
which AsyncAzureOpenAI then mangled into invalid paths like
...?api-version=2024-06-01/openai/. Strip the query string before
returning — api_version is already returned as a separate value and
passed to the SDK independently.
KRRT7 added 3 commits April 10, 2026 05:03
…arisons

- Use bisect_right with key=start in TextRangeCollection.contains_range
  to skip O(n) linear scan (O(log n) for non-overlapping point ranges)
- Replace TextLocation allocations in TextRange __eq__/__lt__/__contains__
  with a shared _effective_end returning tuples
- Skip pydantic validation in get_metadata_multiple by constructing
  TextLocation/TextRange directly from JSON
black is only used at runtime in two cold formatting paths:
- create_context_prompt() in answers.py (LLM debug context)
- format_code()/pretty_print() in utils.py (developer terminal output)

Both format Python data structures, which is exactly what pprint does.
Replace black.format_str with pprint.pformat + ast.literal_eval,
eliminating the runtime dependency entirely.

Move black from dependencies to dev dependency-group — it remains
available for make format/check but is no longer required by library
consumers.
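A sketch of the substitution (width and exact call shape are assumptions; the commit only specifies pformat + literal_eval):

```python
import ast
import pprint

def pretty_print(value, width=60):
    # repr strings round-trip through ast.literal_eval, then
    # pprint.pformat lays them out; no black needed at runtime.
    if isinstance(value, str):
        value = ast.literal_eval(value)
    return pprint.pformat(value, width=width)

print(pretty_print("{'terms': ['alpha', 'beta'], 'count': 2}"))
```

This works because both call sites only ever format Python data structures, which is pprint's native job; arbitrary source code would still need a real formatter.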
answers, search_query_schema, searchlang, and answer_response_schema
are only used in the query() method. Move their imports from module
level into query() and use TYPE_CHECKING + __future__.annotations for
the type hints.

These modules pull in search, query, and schema initialization that
isn't needed when creating or indexing conversations.
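The pattern, with a stdlib module standing in for the deferred query modules:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Visible to the type checker only; nothing imports at runtime.
    # decimal stands in for searchlang / search_query_schema here.
    from decimal import Decimal

class Conversation:
    def query(self, text: str) -> Decimal:
        # Deferred: the heavy module loads on the first query() call,
        # not when conversations are created or indexed.
        from decimal import Decimal
        return Decimal(text)

print(Conversation().query("1.5"))  # 1.5
```

`from __future__ import annotations` keeps the `Decimal` annotation lazy, so the TYPE_CHECKING import never has to exist at runtime.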