Fixes for "semcode-index --lore" #18
Open
chucklever wants to merge 10 commits into facebookexperimental:main from
Conversation
Lore indexing required scanning the email table to determine which commits had already been processed. The lore table contains parsed email records, not a direct mapping of indexed commits, making duplicate detection both slow and unreliable.

A dedicated lore_indexed_commits table now tracks processed git commit SHAs. After successful insertion of lore emails, commit SHAs are recorded in this table via merge_insert keyed on git_commit_sha, so repeated SHAs are absorbed rather than duplicated. Subsequent runs load the full table into a HashSet to skip already-processed commits, avoiding redundant downloads and parsing of mailing list archives. The table contains only short SHA strings, so reading it entirely into memory is inexpensive.

The table has a single git_commit_sha column and integrates into schema initialization and repair.

Fixes: 39ae6a3 ("semcode-index: Add --refresh-lore to update tracked archives")
Signed-off-by: Chuck Lever <cel@kernel.org>
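The skip logic reduces to a set-membership check. A minimal std-only sketch, where `load_indexed_shas` and `commits_to_process` are hypothetical stand-ins for the real LanceDB table read:

```rust
use std::collections::HashSet;

// Stand-in for reading the lore_indexed_commits table. The table holds only
// short SHA strings, so materializing it fully into memory is cheap.
fn load_indexed_shas(table_rows: &[&str]) -> HashSet<String> {
    table_rows.iter().map(|s| s.to_string()).collect()
}

// Filter the candidate commits down to those not yet processed.
fn commits_to_process<'a>(all: &'a [&'a str], indexed: &HashSet<String>) -> Vec<&'a str> {
    all.iter().copied().filter(|sha| !indexed.contains(*sha)).collect()
}

fn main() {
    let indexed = load_indexed_shas(&["aaa111", "bbb222"]);
    let pending = commits_to_process(&["aaa111", "ccc333"], &indexed);
    println!("{:?}", pending); // only the unindexed SHA remains
}
```

In the real code the recording side goes through merge_insert keyed on git_commit_sha, which gives the same absorb-on-repeat behavior as inserting into a set.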
The --lore refresh path uses buffer_unordered() to index up to four archives concurrently. Each archive pipeline spawns its own set of database inserter tasks, all sharing the same LanceDB connection and its underlying DataFusion memory pool. With large-row archives such as oe-kbuild-all, concurrent merge_insert operations from separate pipelines exhaust the memory pool simultaneously. Neither pipeline can make progress because each holds a portion of the pool while waiting for more, producing a resource deadlock visible as two frozen progress bars with unchanging "Inserted N emails" counts.

Replace buffer_unordered() with a sequential loop, matching the approach already used by the --lore <args> initial-clone path. The git fetch for each archive still runs inline, so network latency is the only cost; the database insertion -- which dominates wall-clock time -- no longer contends for the shared memory pool.

Fixes: 39ae6a3 ("semcode-index: Add --refresh-lore to update tracked archives")
Signed-off-by: Chuck Lever <cel@kernel.org>
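The fix amounts to replacing the fan-out with a plain loop. A minimal sketch with a synchronous `refresh_archive` stand-in (the real pipeline is async and does a git fetch plus database insertion per archive):

```rust
// Stand-in for one archive's refresh: fetch, then insert into the database.
fn refresh_archive(name: &str, log: &mut Vec<String>) {
    log.push(format!("fetched {name}"));
    log.push(format!("inserted {name}"));
}

// Sequential loop: each archive's database insertion completes before the
// next begins, so only one pipeline ever holds the shared memory pool.
fn refresh_all(archives: &[&str]) -> Vec<String> {
    let mut log = Vec::new();
    for name in archives {
        refresh_archive(name, &mut log);
    }
    log
}

fn main() {
    println!("{:?}", refresh_all(&["lkml", "oe-kbuild-all"]));
}
```

The log shows fetch and insert strictly interleaved per archive, never overlapping across archives -- the property that buffer_unordered() with multiple in-flight pipelines does not guarantee.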
insert_lore_emails() feeds an entire pipeline batch (up to 1024 emails) into a single LanceDB merge_insert call. Each lore email carries full headers and body text, so the resulting RecordBatch is far larger than a typical code-analysis batch. LanceDB merge_insert uses DataFusion's RepartitionExec internally, and the oversized batch exhausts the DataFusion memory pool -- particularly when two inserter tasks submit concurrently. The failure manifests as:

    Resources exhausted: Failed to allocate additional 11.6 MB for
    RepartitionExec[0] with 11.9 MB already allocated for this reservation

Split the deduplicated email indices into chunks of 128 and issue a separate merge_insert per chunk to bound peak memory per operation. When a chunk still fails (e.g. a single email is large enough to exhaust the pool on its own), fall back to inserting each email in the chunk individually so that only genuinely uninsertable messages are skipped.

Signed-off-by: Chuck Lever <cel@kernel.org>
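The chunk-then-fallback strategy can be sketched as follows. `try_insert` is a made-up stand-in for the real merge_insert call; here it fails whenever a batch's total size exceeds an arbitrary memory budget:

```rust
// Stand-in for merge_insert: succeeds only if the batch fits the budget.
fn try_insert(batch: &[usize], budget: usize) -> Result<(), ()> {
    if batch.iter().sum::<usize>() > budget { Err(()) } else { Ok(()) }
}

/// Insert email sizes in chunks of 128; return indices of emails that could
/// not be inserted even individually (genuinely uninsertable messages).
fn insert_chunked(sizes: &[usize], budget: usize) -> Vec<usize> {
    let mut skipped = Vec::new();
    for (chunk_no, chunk) in sizes.chunks(128).enumerate() {
        if try_insert(chunk, budget).is_err() {
            // The chunk as a whole exhausts the pool: retry each email on
            // its own so only the oversized ones are skipped.
            for (i, &size) in chunk.iter().enumerate() {
                if try_insert(&[size], budget).is_err() {
                    skipped.push(chunk_no * 128 + i);
                }
            }
        }
    }
    skipped
}

fn main() {
    // One pathological 1000-unit email among small ones; budget of 500 units.
    let sizes = [10, 1000, 20];
    println!("{:?}", insert_chunked(&sizes, 500)); // only index 1 is skipped
}
```

The chunk size of 128 bounds peak memory per merge_insert; the per-email fallback ensures a single oversized message cannot drag its 127 neighbors down with it.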
LanceDB compaction encounters a pathological case when a table accumulates thousands of small fragments. The compact operation enters a CPU loop where the main thread spins at 100% CPU utilization while worker threads remain idle. Any table subjected to many small appends without intervening compaction can reach this state.

A check now examines fragment count before compaction proceeds. When fragment count exceeds 500, compaction is skipped and a warning directs the user to rebuild the database with --clear. This threshold prevents the hang condition while allowing normal compaction for tables with moderate fragmentation.

Prune, index, and checkout operations remain unaffected; only the compact step is gated by this fragment limit.

Fixes: 4a16e15 ("semcode-index: optimize database periodically during long-running indexing")
Signed-off-by: Chuck Lever <cel@kernel.org>
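A minimal sketch of the gate, assuming the fragment count has already been read (in real code it would come from the table's manifest):

```rust
// Threshold from the patch: above this, compaction risks the spin-loop hang.
const MAX_FRAGMENTS_FOR_COMPACTION: usize = 500;

// Gate the compact step only; prune, index, and checkout are not affected.
fn should_compact(fragment_count: usize) -> bool {
    if fragment_count > MAX_FRAGMENTS_FOR_COMPACTION {
        eprintln!(
            "warning: {fragment_count} fragments exceeds \
             {MAX_FRAGMENTS_FOR_COMPACTION}; skipping compaction -- \
             rebuild the database with --clear"
        );
        return false;
    }
    true
}

fn main() {
    println!("{}", should_compact(120));  // moderate fragmentation: compact
    println!("{}", should_compact(5000)); // pathological: skip with warning
}
```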
VectorStore::insert_chunk() uses bare table.add() to append vector entries. When functions are re-vectorized after reindexing, entries with the same content_hash accumulate as duplicates rather than being replaced, and each insertion creates a new fragment.

The content table already handles this correctly with merge_insert keyed on blake3_hash. Apply the same pattern here: merge_insert keyed on content_hash, matching the deduplication strategy used elsewhere in the database layer.

Existing databases must be rebuilt with --clear to reclaim the space already wasted by duplicate vector entries.

Signed-off-by: Chuck Lever <cel@kernel.org>
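The difference between bare add() and merge_insert can be illustrated with a HashMap keyed on content_hash -- this is a sketch of the upsert semantics, not of the LanceDB API:

```rust
use std::collections::HashMap;

// merge_insert semantics: "when matched, update all; when not matched,
// insert". A matching key replaces the row instead of appending a duplicate,
// so re-vectorizing the same content leaves exactly one entry per hash.
fn upsert(table: &mut HashMap<String, Vec<f32>>, content_hash: &str, vector: Vec<f32>) {
    table.insert(content_hash.to_string(), vector);
}

fn main() {
    let mut table = HashMap::new();
    upsert(&mut table, "abc123", vec![0.1, 0.2]);
    upsert(&mut table, "abc123", vec![0.3, 0.4]); // re-vectorization: replaces
    println!("{}", table.len()); // one row, holding the latest vector
}
```

Bare add() behaves like pushing onto a Vec instead: every re-vectorization pass appends another copy, and in LanceDB each append also creates a new fragment on disk.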
insert_commit_vectors_batch_with_table() uses bare table.add() to append vector entries. When commits are re-vectorized after reindexing, entries with the same git_commit_sha accumulate as duplicates rather than being replaced, and each insertion creates a new fragment.

Apply the same merge_insert pattern used for the vectors and lore_indexed_commits tables: key on git_commit_sha so repeated SHAs are absorbed rather than duplicated.

Existing databases must be rebuilt with --clear to reclaim the space already wasted by duplicate vector entries.

Fixes: ba422c7 ("Start indexing git commits (database schema change)")
Signed-off-by: Chuck Lever <cel@kernel.org>
insert_lore_vectors_batch_with_table() uses bare table.add() to append vector entries. When lore emails are re-vectorized, entries with the same message_id accumulate as duplicates rather than being replaced, and each insertion creates a new fragment. update_lore_vectors() already skips message_ids present in the table, but a re-vectorization pass (after model changes or --clear of the vectors table alone) still produces duplicates.

Apply the same merge_insert pattern used for the other vector tables: key on message_id so repeated entries are absorbed rather than duplicated.

Existing databases must be rebuilt with --clear to reclaim the space already wasted by duplicate vector entries.

Fixes: 01b9399 ("semcode: add --lore for email indexing (database schema change)")
Signed-off-by: Chuck Lever <cel@kernel.org>
During lore indexing, FTS indices were created before the compaction and prune steps of database optimization. The optimization then rebuilt all FTS indices against the new compacted layout, orphaning the original set. LanceDB's prune operation removes old data versions but not old index data, leaving both sets on disk. On a kernel tree with several lore archives this wasted roughly 1.5 GB.

Reorder both the --lore and --refresh-lore code paths so that optimization runs first and FTS index creation follows.

Additionally, have create_lore_fts_indices() drop any existing FTS indices, create fresh replacements, and then prune the table. drop_index() only removes the logical reference from the manifest; the old index directory under _indices/ persists as orphaned data until a prune pass reclaims it. The same is true of directories left behind by OptimizeAction::Index, which rebuilds all indices into new directories without removing the prior set. The trailing prune removes both sources of orphaned index data.

Signed-off-by: Chuck Lever <cel@kernel.org>
The lore table stored full RFC 5322 headers alongside the individual fields already extracted into their own columns (from, date, subject, message_id, in_reply_to, references, recipients). The only consumer was write_email_as_mbox(), which emitted the raw header block when producing MBOX output. No query path ever searched or filtered on the raw headers column. The lore_search MCP function provides from_patterns, subject_patterns, and recipients_patterns, each backed by FTS indices on the individual columns. A raw-headers search would be strictly less useful than what already exists.

Reconstruct the header block from the individual columns at output time instead of storing it. This eliminates a full copy of every email's headers from the lance data files and from the FTS body index scans, reducing the per-email storage footprint of the lore table.

Existing databases are migrated automatically: migrate_lore_table() detects the column at startup and drops it via LanceDB schema evolution.

Signed-off-by: Chuck Lever <cel@kernel.org>
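Reconstruction at output time might look like the following sketch; the `LoreEmail` struct and its field set are assumptions based on the columns listed above, not the project's actual types:

```rust
// Hypothetical row type mirroring a subset of the lore table's columns.
struct LoreEmail {
    from: String,
    date: String,
    subject: String,
    message_id: String,
    in_reply_to: Option<String>,
}

// Rebuild an RFC 5322-style header block from the individual columns,
// so the raw header text no longer needs to be stored per email.
fn header_block(email: &LoreEmail) -> String {
    let mut out = String::new();
    out.push_str(&format!("From: {}\n", email.from));
    out.push_str(&format!("Date: {}\n", email.date));
    out.push_str(&format!("Subject: {}\n", email.subject));
    out.push_str(&format!("Message-ID: {}\n", email.message_id));
    if let Some(irt) = &email.in_reply_to {
        out.push_str(&format!("In-Reply-To: {}\n", irt));
    }
    out
}

fn main() {
    let email = LoreEmail {
        from: "Chuck Lever <cel@kernel.org>".into(),
        date: "Mon, 1 Jan 2024 00:00:00 +0000".into(),
        subject: "[PATCH] example".into(),
        message_id: "<msg@example.org>".into(),
        in_reply_to: None,
    };
    print!("{}", header_block(&email));
}
```

The rebuilt block is not byte-identical to the original raw headers (ordering and any headers outside the extracted columns are lost), which is acceptable here because the only consumer is MBOX output.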
The -j flag controls analysis thread count, but there is no way to set a persistent default without a shell alias.

When -j is not provided, check the SEMCODE_JOBS environment variable for a thread count before falling back to auto-detection. The explicit flag always takes precedence.

This is consistent with SEMCODE_BATCH_SIZE, which already provides an environment-variable override for vectorization batch size.

Signed-off-by: Chuck Lever <cel@kernel.org>
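The precedence order can be sketched with a hypothetical `thread_count` helper (the real flag parsing is elided; only the flag > env > auto-detect chain is shown):

```rust
use std::env;
use std::thread;

// Resolve the analysis thread count:
//   1. explicit -j flag, if given;
//   2. SEMCODE_JOBS environment variable, if set and parseable;
//   3. auto-detection from the host's available parallelism.
fn thread_count(flag: Option<usize>) -> usize {
    flag.or_else(|| {
            env::var("SEMCODE_JOBS")
                .ok()
                .and_then(|v| v.parse().ok())
        })
        .unwrap_or_else(|| thread::available_parallelism().map(|n| n.get()).unwrap_or(1))
}

fn main() {
    env::set_var("SEMCODE_JOBS", "4");
    println!("{}", thread_count(Some(8))); // explicit flag wins: 8
    println!("{}", thread_count(None));    // falls back to SEMCODE_JOBS: 4
}
```

An unparseable SEMCODE_JOBS value silently falls through to auto-detection here; the real implementation may prefer to report the error instead.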
Address several recently introduced inefficiencies and a few long-standing bugs in the "--lore" command line option.
I'm not certain if I've completely worked out the CLA issues. Let me know.