Skip to content

perf: Batch SQLite INSERTs for indexing pipeline#230

Open
KRRT7 wants to merge 6 commits intomicrosoft:mainfrom
KRRT7:perf/batch-inserts
Open

perf: Batch SQLite INSERTs for indexing pipeline#230
KRRT7 wants to merge 6 commits intomicrosoft:mainfrom
KRRT7:perf/batch-inserts

Conversation

@KRRT7
Copy link
Copy Markdown
Contributor

@KRRT7 KRRT7 commented Apr 10, 2026

Stack: 3/4 — depends on #229. Merge #231, #229, then this PR.


  • Add add_terms_batch and add_properties_batch to ITermToSemanticRefIndex and IPropertyToSemanticRefIndex interfaces
  • SQLite backend uses executemany instead of individual cursor.execute() calls (~1000+ calls per indexing batch reduced to 2-3)
  • Restructure add_metadata_to_index_from_list and add_to_property_index to collect all data first (pure functions), then batch-insert
  • Memory backend implements batch methods as loops for interface compatibility

Benchmark

Azure Standard_D2s_v5 -- 2 vCPU, 8 GiB RAM, Python 3.13

Indexing Pipeline (pytest-async-benchmark pedantic, 20 rounds, 3 warmup)

Only the hot path (add_messages_with_indexing) is timed -- DB creation, storage init, and teardown are excluded.

Benchmark Before (min) After (min) Speedup
add_messages_with_indexing (200 msgs) 28.8 ms 25.0 ms 1.16x
add_messages_with_indexing (50 msgs) 7.8 ms 6.7 ms 1.16x
VTT ingest (40 msgs) 6.9 ms 6.1 ms 1.14x

Consistent ~14-16% improvement -- executemany amortizes per-call overhead.

Reproduce the benchmark locally

Save the benchmark file below as tests/benchmarks/test_benchmark_indexing.py, then:

pip install 'pytest-async-benchmark @ git+https://github.com/KRRT7/pytest-async-benchmark.git@feat/pedantic-mode' pytest-asyncio

# Run on main
git checkout main
python -m pytest tests/benchmarks/test_benchmark_indexing.py -v -s

# Run on this branch
git checkout perf/batch-inserts
python -m pytest tests/benchmarks/test_benchmark_indexing.py -v -s

Generated by codeflash optimization agent

Copy link
Copy Markdown
Collaborator

@gvanrossum gvanrossum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! It requires more review time than I have right now, so I'll keep it open until I have more time.

KRRT7 added 3 commits April 10, 2026 15:49
Add add_terms_batch / add_properties_batch to the index interfaces
with executemany-based SQLite implementations. Restructure
add_metadata_to_index_from_list and add_to_property_index to collect
all items first, then batch-insert via extend() and the new batch
methods. Eliminates ~1000 individual INSERT round-trips during
indexing.
Rename _collect_{facet,entity,action}_{terms,properties} to drop the
leading underscore in propindex.py and semrefindex.py.
Change list to Sequence in add_terms_batch and add_properties_batch
interfaces and implementations to satisfy covariance. Add missing
add_terms_batch to FakeTermIndex in conftest.py.
@KRRT7 KRRT7 force-pushed the perf/batch-inserts branch from 4030379 to 82ba650 Compare April 10, 2026 20:50
@bmerkle bmerkle self-assigned this Apr 11, 2026
Copy link
Copy Markdown
Collaborator

@bmerkle bmerkle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @KRRT7
I was asked by @gvanrossum to do a review of this PR
Please find attached some comments.

There are also some pre-existing issues in these files, so things which have not be introduced by this PR, but i would suggest that we cover those in a future, seperate PR.
I have only mentioned the code duplicate, which could possibly be fixed also in this PR.

please let me know what you think.

Comment thread src/typeagent/storage/memory/semrefindex.py
Comment thread src/typeagent/storage/memory/propindex.py
Comment thread src/typeagent/storage/memory/semrefindex.py
Comment thread src/typeagent/storage/sqlite/propindex.py Outdated
Comment thread src/typeagent/storage/memory/semrefindex.py
Comment thread src/typeagent/storage/memory/semrefindex.py Outdated
Comment thread src/typeagent/storage/memory/semrefindex.py Outdated
Comment thread src/typeagent/storage/memory/semrefindex.py Outdated
Comment thread src/typeagent/storage/memory/semrefindex.py
Comment thread src/typeagent/storage/memory/semrefindex.py
@KRRT7
Copy link
Copy Markdown
Contributor Author

KRRT7 commented Apr 11, 2026

Hi @KRRT7
I was asked by @gvanrossum to do a review of this PR

Thanks! I'll take a closer look to your reviews this afternoon.

@bmerkle
Copy link
Copy Markdown
Collaborator

bmerkle commented Apr 14, 2026

Hi @KRRT7 can you please review the open comments from me ? Thanks :-)

Copy link
Copy Markdown
Collaborator

@bmerkle bmerkle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KRRT7 please check my comments

…move imports

- Fix inverse_actions omission in add_metadata_to_index_from_list (regression)
- Fix inverse_actions omission in add_metadata_to_index (pre-existing)
- Delete duplicate add_entity_to_index, add_action_to_index, add_topic_to_index,
  text_range_from_location — unified into add_entity, add_action, add_topic
- Update all callers and tests to use unified functions
- Move function-level imports to top-level in sqlite/propindex.py per AGENTS.md
@KRRT7
Copy link
Copy Markdown
Contributor Author

KRRT7 commented Apr 14, 2026

@bmerkle Thanks for the thorough review — all comments addressed in 9ae667f. Summary:

  • Fixed inverse_actions bug — both add_metadata_to_index_from_list (regression) and add_metadata_to_index (pre-existing) now index inverse actions
  • Eliminated code duplication — deleted add_entity_to_index, add_action_to_index, add_topic_to_index, and text_range_from_location (were duplicates of add_entity/add_action/add_topic/text_range_from_message_chunk). Reduced from 3 parallel paths to 2. Updated all callers and tests.
  • Moved imports to top-level in sqlite/propindex.py per AGENTS.md

make check test format passes (479/479 offline tests green, 3 failures are API rate-limits).

Ready for re-review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants