Fixes for "semcode-index --lore"#18

Open
chucklever wants to merge 10 commits into facebookexperimental:main from chucklever:main

@chucklever
Contributor

Address several recently introduced inefficiencies and a few long-standing bugs in the "--lore" command line option.

I'm not certain if I've completely worked out the CLA issues. Let me know.

@meta-cla meta-cla bot added the CLA Signed label Feb 11, 2026

Lore indexing required scanning the email table to determine which
commits had already been processed. The lore table contains parsed
email records, not a direct mapping of indexed commits, making
duplicate detection both slow and unreliable.

A dedicated lore_indexed_commits table now tracks processed git
commit SHAs. After successful insertion of lore emails, commit
SHAs are recorded in this table via merge_insert keyed on
git_commit_sha, so repeated SHAs are absorbed rather than
duplicated. Subsequent runs load the full table into a HashSet
to skip already-processed commits, avoiding redundant downloads
and parsing of mailing list archives. The table contains only
short SHA strings, so reading it entirely into memory is
inexpensive. The table has a single git_commit_sha column and
integrates into schema initialization and repair.
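
The skip logic described above can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: `commits_to_index` and `record_indexed` are hypothetical names, and the in-memory `HashSet` stands in for the contents of the lore_indexed_commits table.

```rust
use std::collections::HashSet;

/// Return the commits that still need indexing, preserving input order.
/// `indexed` stands in for the SHAs loaded from lore_indexed_commits.
fn commits_to_index<'a>(candidates: &[&'a str], indexed: &HashSet<String>) -> Vec<&'a str> {
    candidates
        .iter()
        .copied()
        .filter(|sha| !indexed.contains(*sha))
        .collect()
}

/// Record SHAs after a successful insert. HashSet::insert ignores
/// repeats, which mirrors a merge_insert keyed on git_commit_sha.
fn record_indexed(indexed: &mut HashSet<String>, shas: &[&str]) {
    for sha in shas {
        indexed.insert((*sha).to_string());
    }
}
```

Because the set holds only short SHA strings, a membership test per candidate commit is cheap compared with downloading and parsing an archive.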

Fixes: 39ae6a3 ("semcode-index: Add --refresh-lore to update tracked archives")
Signed-off-by: Chuck Lever <cel@kernel.org>

The --lore refresh path uses buffer_unordered() to index
up to four archives concurrently.  Each archive pipeline
spawns its own set of database inserter tasks, all sharing
the same LanceDB connection and its underlying DataFusion
memory pool.

With large-row archives such as oe-kbuild-all, concurrent
merge_insert operations from separate pipelines exhaust
the memory pool simultaneously.  Neither pipeline can
make progress because each holds a portion of the pool
while waiting for more, producing a resource deadlock
visible as two frozen progress bars with unchanging
"Inserted N emails" counts.

Replace buffer_unordered() with a sequential loop,
matching the approach already used by the --lore <args>
initial-clone path.  The git fetch for each archive still
runs inline, so the only cost is that fetches no longer
overlap and their network latency adds up; the database
insertion -- which dominates wall-clock time -- no longer
contends for the shared memory pool.
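
As a rough, synchronous sketch of the sequential approach (the real code is async and is not reproduced here), the loop below processes one archive at a time. `refresh_sequentially` and the `index_one` callback are hypothetical stand-ins for the per-archive fetch-and-insert pipeline.

```rust
/// Index archives one at a time. `index_one` is a hypothetical stand-in
/// for the fetch-and-insert pipeline of a single archive and returns the
/// number of emails inserted.
fn refresh_sequentially<F>(
    archives: &[&str],
    mut index_one: F,
) -> Vec<(String, Result<usize, String>)>
where
    F: FnMut(&str) -> Result<usize, String>,
{
    let mut results = Vec::with_capacity(archives.len());
    for &name in archives {
        // The next archive does not start until this call returns, so at
        // most one pipeline is inserting at any moment.
        let outcome = index_one(name);
        results.push((name.to_string(), outcome));
    }
    results
}
```

In this sketch a failure in one archive is recorded and the loop moves on; whether the actual code continues or aborts on error is not stated in the commit message.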

Fixes: 39ae6a3 ("semcode-index: Add --refresh-lore to update tracked archives")
Signed-off-by: Chuck Lever <cel@kernel.org>

insert_lore_emails() feeds an entire pipeline batch (up to
1024 emails) into a single LanceDB merge_insert call. Each
lore email carries full headers and body text, so the
resulting RecordBatch is far larger than a typical
code-analysis batch. LanceDB merge_insert uses DataFusion's
RepartitionExec internally, and the oversized batch exhausts
the DataFusion memory pool -- particularly when two inserter
tasks submit concurrently.

The failure manifests as:

  Resources exhausted: Failed to allocate additional
  11.6 MB for RepartitionExec[0] with 11.9 MB already
  allocated for this reservation

Split the deduplicated email indices into chunks of 128 and
issue a separate merge_insert per chunk to bound peak memory
per operation. When a chunk still fails (e.g. a single
email is large enough to exhaust the pool on its own), fall
back to inserting each email in the chunk individually so
that only genuinely uninsertable messages are skipped.
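
The chunk-then-fallback strategy can be sketched generically. This is an illustration under stated assumptions, not the PR's implementation: `insert_chunked`, `InsertStats`, and the `insert` callback (standing in for one merge_insert call) are hypothetical names.

```rust
/// Outcome of a chunked insert: how many items went in, how many were skipped.
#[derive(Debug, PartialEq)]
struct InsertStats {
    inserted: usize,
    skipped: usize,
}

/// Insert `items` in chunks of `chunk_size` (must be nonzero). If a chunk
/// fails, retry its items one at a time so that only items which fail on
/// their own are skipped. `insert` is a hypothetical stand-in for a single
/// merge_insert call.
fn insert_chunked<T, F>(items: &[T], chunk_size: usize, mut insert: F) -> InsertStats
where
    F: FnMut(&[T]) -> Result<(), String>,
{
    let mut stats = InsertStats { inserted: 0, skipped: 0 };
    for chunk in items.chunks(chunk_size) {
        if insert(chunk).is_ok() {
            stats.inserted += chunk.len();
            continue;
        }
        // Fallback path: one item per call bounds memory as tightly as possible.
        for item in chunk {
            match insert(std::slice::from_ref(item)) {
                Ok(()) => stats.inserted += 1,
                Err(_) => stats.skipped += 1,
            }
        }
    }
    stats
}
```

With a chunk size of 128, a 1024-email batch costs eight calls in the common case, and the per-item path runs only for chunks that fail.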

Signed-off-by: Chuck Lever <cel@kernel.org>

LanceDB compaction encounters a pathological case when a table
accumulates thousands of small fragments. The compact operation
enters a busy loop in which the main thread spins at 100% CPU
while worker threads remain idle. Any table
subjected to many small appends without intervening compaction
can reach this state.

A check now examines fragment count before compaction proceeds.
When fragment count exceeds 500, compaction is skipped and a
warning directs the user to rebuild the database with --clear.
This threshold prevents the hang condition while allowing normal
compaction for tables with moderate fragmentation. Prune, index,
and checkout operations remain unaffected; only the compact step
is gated by this fragment limit.
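
A minimal sketch of such a gate follows. The threshold of 500 comes from the text above; the names `MAX_FRAGMENTS_FOR_COMPACTION`, `CompactDecision`, and `decide_compaction` are assumptions made for illustration.

```rust
/// Threshold taken from the commit message; the constant name is hypothetical.
const MAX_FRAGMENTS_FOR_COMPACTION: usize = 500;

#[derive(Debug, PartialEq)]
enum CompactDecision {
    /// Fragment count is moderate; run compaction as usual.
    Compact,
    /// Too many fragments; skip compaction and warn the user.
    Skip { fragments: usize },
}

/// Decide whether the compact step should run for a table with
/// `fragment_count` fragments. Other optimization steps are not gated.
fn decide_compaction(fragment_count: usize) -> CompactDecision {
    if fragment_count > MAX_FRAGMENTS_FOR_COMPACTION {
        CompactDecision::Skip { fragments: fragment_count }
    } else {
        CompactDecision::Compact
    }
}
```

The comparison is strict, so a table with exactly 500 fragments is still compacted, matching "exceeds 500" in the description.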

Fixes: 4a16e15 ("semcode-index: optimize database periodically during long-running indexing")
Signed-off-by: Chuck Lever <cel@kernel.org>

VectorStore::insert_chunk() uses bare table.add() to append
vector entries. When functions are re-vectorized after
reindexing, entries with the same content_hash accumulate as
duplicates rather than being replaced, and each insertion
creates a new fragment. The content table already handles this
correctly with merge_insert keyed on blake3_hash.

Apply the same pattern here: merge_insert keyed on
content_hash, matching the deduplication strategy used
elsewhere in the database layer.
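
The difference between a bare append and a keyed upsert can be modelled in memory. This sketch does not call LanceDB; a `Vec` stands in for a table written with add(), a `HashMap` for one written with merge_insert, and `VectorEntry`, `append`, and `upsert` are hypothetical names.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct VectorEntry {
    content_hash: String,
    vector: Vec<f32>,
}

/// Bare append: every call adds a row, even for a repeated content_hash.
fn append(table: &mut Vec<VectorEntry>, entry: VectorEntry) {
    table.push(entry);
}

/// Upsert keyed on content_hash: a repeated key replaces the existing row.
fn upsert(table: &mut HashMap<String, VectorEntry>, entry: VectorEntry) {
    table.insert(entry.content_hash.clone(), entry);
}
```

Re-vectorizing the same function twice leaves two rows in the appended table but one row, holding the newer vector, in the keyed table.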

Existing databases must be rebuilt with --clear to reclaim
the space already wasted by duplicate vector entries.

Signed-off-by: Chuck Lever <cel@kernel.org>

insert_commit_vectors_batch_with_table() uses bare table.add()
to append vector entries. When commits are re-vectorized after
reindexing, entries with the same git_commit_sha accumulate as
duplicates rather than being replaced, and each insertion
creates a new fragment.

Apply the same merge_insert pattern used for the vectors and
lore_indexed_commits tables: key on git_commit_sha so repeated
SHAs are absorbed rather than duplicated.

Existing databases must be rebuilt with --clear to reclaim
the space already wasted by duplicate vector entries.

Fixes: ba422c7 ("Start indexing git commits (database schema change)")
Signed-off-by: Chuck Lever <cel@kernel.org>

insert_lore_vectors_batch_with_table() uses bare table.add()
to append vector entries. When lore emails are re-vectorized,
entries with the same message_id accumulate as duplicates
rather than being replaced, and each insertion creates a new
fragment. update_lore_vectors() already skips message_ids
present in the table, but a re-vectorization pass (after
model changes or --clear of the vectors table alone) still
produces duplicates.

Apply the same merge_insert pattern used for the other
vector tables: key on message_id so repeated entries are
absorbed rather than duplicated.

Existing databases must be rebuilt with --clear to reclaim
the space already wasted by duplicate vector entries.

Fixes: 01b9399 ("semcode: add --lore for email indexing (database schema change)")
Signed-off-by: Chuck Lever <cel@kernel.org>

During lore indexing, FTS indices were created before the
compaction and prune steps of database optimization. The
optimization then rebuilt all FTS indices against the new
compacted layout, orphaning the original set. LanceDB's
prune operation removes old data versions but not old
index data, leaving both sets on disk. On a kernel tree
with several lore archives this wasted roughly 1.5 GB.

Reorder both the --lore and --refresh-lore code paths so
that optimization runs first and FTS index creation
follows.

Additionally, have create_lore_fts_indices() drop any
existing FTS indices, create fresh replacements, and then
prune the table. drop_index() only removes the logical
reference from the manifest; the old index directory
under _indices/ persists as orphaned data until a prune
pass reclaims it. The same is true of directories left
behind by OptimizeAction::Index, which rebuilds all
indices into new directories without removing the prior
set. The trailing prune removes both sources of orphaned
index data.

Signed-off-by: Chuck Lever <cel@kernel.org>

The lore table stored full RFC 5322 headers alongside the
individual fields already extracted into their own columns
(from, date, subject, message_id, in_reply_to, references,
recipients). The only consumer was write_email_as_mbox(),
which emitted the raw header block when producing MBOX
output.

No query path ever searched or filtered on the raw headers
column. The lore_search MCP function provides from_patterns,
subject_patterns, and recipients_patterns, each backed by
FTS indices on the individual columns. A raw-headers search
would be strictly less useful than what already exists.

Reconstruct the header block from the individual columns
at output time instead of storing it. This eliminates a
full copy of every email's headers from the lance data
files and from the FTS body index scans, reducing the
per-email storage footprint of the lore table.
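
Rebuilding a header block from the extracted columns might look like the sketch below. The field names follow the columns listed above; the `LoreEmail` struct, the helper names, the header order, and the mapping of recipients to a single "To" header are all assumptions, not the PR's actual output format.

```rust
/// Hypothetical row type holding the individually stored columns.
struct LoreEmail {
    from: String,
    date: String,
    subject: String,
    message_id: String,
    in_reply_to: Option<String>,
    references: Option<String>,
    recipients: String,
}

/// Append "Name: value\n" unless the value is empty.
fn push_header(out: &mut String, name: &str, value: &str) {
    if !value.is_empty() {
        out.push_str(name);
        out.push_str(": ");
        out.push_str(value);
        out.push('\n');
    }
}

/// Reconstruct a header block from the columns at output time.
/// Assumption: recipients are emitted as a single "To" header.
fn build_header_block(email: &LoreEmail) -> String {
    let mut out = String::new();
    push_header(&mut out, "From", &email.from);
    push_header(&mut out, "Date", &email.date);
    push_header(&mut out, "Subject", &email.subject);
    push_header(&mut out, "Message-ID", &email.message_id);
    if let Some(value) = &email.in_reply_to {
        push_header(&mut out, "In-Reply-To", value);
    }
    if let Some(value) = &email.references {
        push_header(&mut out, "References", value);
    }
    push_header(&mut out, "To", &email.recipients);
    out
}
```

A reconstructed block carries only the headers that have their own columns, so any header that was never extracted cannot be reproduced in the MBOX output.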

Existing databases are migrated automatically:
migrate_lore_table() detects the column at startup and
drops it via LanceDB schema evolution.

Signed-off-by: Chuck Lever <cel@kernel.org>

The -j flag controls analysis thread count, but there is no
way to set a persistent default without a shell alias. When
-j is not provided, check the SEMCODE_JOBS environment
variable for a thread count before falling back to
auto-detection. The explicit flag always takes precedence.

This is consistent with SEMCODE_BATCH_SIZE, which already
provides an environment-variable override for vectorization
batch size.
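
The precedence order can be sketched as a pure function. `resolve_jobs` is a hypothetical name, and treating an unparsable or zero SEMCODE_JOBS value as "not set" is an assumption; the commit message does not say how invalid values are handled.

```rust
/// Resolve the analysis thread count: the explicit -j value wins, then
/// SEMCODE_JOBS, then the auto-detected default.
/// Assumption: an unparsable or zero environment value is ignored.
fn resolve_jobs(cli_jobs: Option<usize>, env_value: Option<&str>, detected: usize) -> usize {
    if let Some(n) = cli_jobs {
        return n;
    }
    env_value
        .and_then(|v| v.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(detected)
}
```

In a real binary the second argument could come from `std::env::var("SEMCODE_JOBS").ok()` and the third from `std::thread::available_parallelism()`; passing them in as parameters keeps the precedence logic easy to test.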

Signed-off-by: Chuck Lever <cel@kernel.org>
