perf: Batch SQLite INSERTs for indexing pipeline#230
perf: Batch SQLite INSERTs for indexing pipeline#230KRRT7 wants to merge 6 commits intomicrosoft:mainfrom
Conversation
19520f3 to
e7e804e
Compare
gvanrossum
left a comment
There was a problem hiding this comment.
Thanks for this! It requires more review time than I have right now, so I'll keep it open until I have more time.
Add add_terms_batch / add_properties_batch to the index interfaces with executemany-based SQLite implementations. Restructure add_metadata_to_index_from_list and add_to_property_index to collect all items first, then batch-insert via extend() and the new batch methods. Eliminates ~1000 individual INSERT round-trips during indexing.
Rename _collect_{facet,entity,action}_{terms,properties} to drop the
leading underscore in propindex.py and semrefindex.py.
Change list to Sequence in add_terms_batch and add_properties_batch interfaces and implementations to satisfy covariance. Add missing add_terms_batch to FakeTermIndex in conftest.py.
4030379 to
82ba650
Compare
bmerkle
left a comment
There was a problem hiding this comment.
Hi @KRRT7
I was asked by @gvanrossum to do a review of this PR
Please find attached some comments.
There are also some pre-existing issues in these files, so things which have not be introduced by this PR, but i would suggest that we cover those in a future, seperate PR.
I have only mentioned the code duplicate, which could possibly be fixed also in this PR.
please let me know what you think.
Thanks! I'll take a closer look to your reviews this afternoon. |
|
Hi @KRRT7 can you please review the open comments from me ? Thanks :-) |
…move imports - Fix inverse_actions omission in add_metadata_to_index_from_list (regression) - Fix inverse_actions omission in add_metadata_to_index (pre-existing) - Delete duplicate add_entity_to_index, add_action_to_index, add_topic_to_index, text_range_from_location — unified into add_entity, add_action, add_topic - Update all callers and tests to use unified functions - Move function-level imports to top-level in sqlite/propindex.py per AGENTS.md
|
@bmerkle Thanks for the thorough review — all comments addressed in
Ready for re-review. |
Stack: 3/4 — depends on #229. Merge #231, #229, then this PR.
add_terms_batchandadd_properties_batchtoITermToSemanticRefIndexandIPropertyToSemanticRefIndexinterfacesexecutemanyinstead of individualcursor.execute()calls (~1000+ calls per indexing batch reduced to 2-3)add_metadata_to_index_from_listandadd_to_property_indexto collect all data first (pure functions), then batch-insertBenchmark
Azure Standard_D2s_v5 -- 2 vCPU, 8 GiB RAM, Python 3.13
Indexing Pipeline (pytest-async-benchmark pedantic, 20 rounds, 3 warmup)
Only the hot path (
add_messages_with_indexing) is timed -- DB creation, storage init, and teardown are excluded.add_messages_with_indexing(200 msgs)add_messages_with_indexing(50 msgs)Consistent ~14-16% improvement --
executemanyamortizes per-call overhead.Reproduce the benchmark locally
Save the benchmark file below as
tests/benchmarks/test_benchmark_indexing.py, then:Generated by codeflash optimization agent