Propagate element metadata to chunks in MEDI chunkers#7516
Propagate element metadata to chunks in MEDI chunkers#7516luisquintanilla wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses MEDI ingestion metadata loss by propagating IngestionDocumentElement.Metadata into IngestionChunk<string>.Metadata across the built-in chunkers, so downstream components (e.g., VectorStoreWriter) can persist element-derived metadata on produced chunks.
Changes:
- Added element-metadata accumulation/application logic to
ElementsChunker(affectingSectionChunker,HeaderChunker, andSemanticSimilarityChunker). - Added similar metadata accumulation/application to
DocumentTokenChunkerduring element iteration and chunk finalization. - Introduced a new test suite validating metadata propagation behavior for several chunkers and scenarios.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/ChunkerMetadataPropagationTests.cs | Adds tests asserting element metadata is propagated to chunk metadata under various conditions. |
| src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/ElementsChunker.cs | Accumulates element metadata while building chunks and applies it when committing chunks. |
| src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs | Accumulates element metadata during token chunking and applies it when finalizing chunks. |
| AccumulateMetadata(element, ref accumulatedMetadata); | ||
|
|
||
| int elementTokenCount = CountTokens(semanticContent.AsSpan()); | ||
| if (elementTokenCount + totalTokenCount <= _maxTokensPerChunk) | ||
| { |
There was a problem hiding this comment.
Accepted and fixed.
Good catch — this was a real timing bug. In the original code, AccumulateMetadata ran unconditionally before the branch logic, so when a table pre-commit or overflow triggered FinalizeCurrentChunk, the element's metadata attached to the previous chunk instead of the one receiving the content.
Fix (commit 8bfe9bd): Moved AccumulateMetadata into each of the 3 branches with flag-based deferred accumulation:
- Fits branch (L72-78): Accumulate immediately before append — straightforward, element stays in current chunk.
- Table branch (L79-163):
tableMetadataAccumulatedflag — defer until the first actualAppendNewLineAndSpancall to_currentChunk. This handles pre-commit (header doesn't fit),rowIndex==1edge case (first data row doesn't fit), and mid-table row splits. - Non-table overflow (L164-213):
elementMetadataAccumulatedflag — accumulate only whenindex > 0(content actually appended to a chunk).
| { | ||
| continue; | ||
| } | ||
|
|
||
| AccumulateMetadata(element, ref accumulatedMetadata); | ||
|
|
||
| int contentToProcessTokenCount = _tokenizer.CountTokens(elementContent!, considerNormalization: false); | ||
| ReadOnlyMemory<char> contentToProcess = elementContent.AsMemory(); | ||
| while (stringBuilderTokenCount + contentToProcessTokenCount >= _maxTokensPerChunk) | ||
| { | ||
| int index = _tokenizer.GetIndexByTokenCount( | ||
| text: contentToProcess.Span, | ||
| maxTokenCount: _maxTokensPerChunk - stringBuilderTokenCount, | ||
| out string? _, | ||
| out int _, | ||
| considerNormalization: false); | ||
|
|
||
| unsafe | ||
| { | ||
| fixed (char* ptr = &MemoryMarshal.GetReference(contentToProcess.Span)) | ||
| { | ||
| _ = stringBuilder.Append(ptr, index); | ||
| } | ||
| } | ||
| yield return FinalizeChunk(); | ||
| yield return FinalizeChunk(ref accumulatedMetadata); | ||
|
|
There was a problem hiding this comment.
Accepted and fixed.
Correct — in the original code, AccumulateMetadata ran at line 86 before the while loop. If index == 0 and the buffer was already full, the while loop would call FinalizeCurrentChunk first, attaching the element's metadata to the wrong (previous) chunk.
Fix (commit 8bfe9bd): Introduced elementMetadataAccumulated flag:
- Inside the while loop: accumulate only when
index > 0(meaning content has actually been appended to a chunk from this element) - After the while loop: if the flag is still false (remaining content goes to buffer), accumulate then
This ensures metadata always follows the content, regardless of whether a chunk boundary is crossed at the start of the element.
| private static IngestionChunker<string> CreateDocumentTokenChunker(int maxTokensPerChunk = 2_000) | ||
| { | ||
| var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o"); | ||
| return new DocumentTokenChunker(new(tokenizer) { MaxTokensPerChunk = maxTokensPerChunk, OverlapTokens = 0 }); | ||
| } | ||
|
|
||
| [Fact] |
There was a problem hiding this comment.
Partially accepted.
Agree that test coverage should be broader. Here's what was added:
Boundary/overlap tests (commit 8bfe9bd — 6 tests):
ElementsChunker_TablePreCommit_MetadataGoesToCorrectChunk— table element triggers pre-commit; metadata follows content to new chunkElementsChunker_ExactFill_MetadataStaysOnCurrentChunk— element exactly fills remaining capacity; metadata stays on current chunkDocumentTokenChunker_OverlapTokens_MetadataOnlyOnOriginalChunks— overlap content doesn't duplicate metadataDocumentTokenChunker_ExactFill_MetadataAttachesToCorrectChunk— boundary precisionElementsChunker_TableSplit_MetadataGoesToFirstTableChunk— large table split across chunksElementsChunker_NonTableOverflow_MetadataGoesToNewChunk— non-table overflow triggers new chunk
SemanticSimilarityChunker tests (commit bd31e18 — 2 tests):
SemanticSimilarityChunker_SingleElementWithMetadata_PropagatesMetadata— basic metadata flowSemanticSimilarityChunker_MultipleElementsDifferentKeys_AllKeysAppear— per-element metadata preserved across chunks
All 22 metadata propagation tests pass across net8.0, net9.0, net10.0, and net462.
|
MEDI is an acronym regularly used to reference Microsoft.Extensions.DependencyInjection - using it for this library is another reason why these AI technology packages should not be in the root Microsoft.Extensions namespace. |
|
@adamsitnik I confirmed this wasn't a design choice. Can I please get a review before merging. Thanks! |
🎉 Good job! The coverage increased 🎉
Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1454486&view=codecoverage-tab |
6c9bd82 to
8bfe9bd
Compare
🎉 Good job! The coverage increased 🎉
Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1454834&view=codecoverage-tab |
Fix #7465: All four IngestionChunker implementations (SectionChunker, HeaderChunker, SemanticSimilarityChunker, DocumentTokenChunker) now propagate IngestionDocumentElement.Metadata to IngestionChunk.Metadata. Design decisions: - First-wins merge strategy (TryAdd) for conflicting keys - Null metadata values skipped (element allows object?, chunk requires object) - Split elements: metadata goes to the first chunk only - Lazy allocation: dictionary only created when elements have metadata ElementsChunker (fixes SectionChunker, HeaderChunker, SemanticSimilarityChunker): - Added AccumulateMetadata/ApplyMetadata static helpers - Accumulates metadata as elements are processed - Applies to chunk on commit, then clears accumulator DocumentTokenChunker: - Added AccumulateMetadata static helper - Accumulates metadata during element iteration - Applies in FinalizeChunk, then clears accumulator Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix metadata accumulation timing bugs in ElementsChunker and DocumentTokenChunker where AccumulateMetadata was called before determining which chunk the element's content contributes to. When a Commit/FinalizeChunk happens before the new element adds content (table pre-commit, non-table overflow, exact-fill boundary), the metadata was incorrectly applied to the previous chunk. ElementsChunker fixes: - Branch 1 (fits): accumulate right before appending - Branch 2 (table): use flag, accumulate before first table content append to _currentChunk, after any pre-commit or row-level commit - Branch 3 (non-table too big): use flag, accumulate when index > 0 (first content contribution in the while loop) DocumentTokenChunker fixes: - Use flag to defer accumulation until first content contribution - In while loop: accumulate only when index > 0 - After while loop: accumulate if not yet done (element fits entirely) New boundary tests (6 tests): - Previous element fills chunk, next element metadata on new chunk - Non-table element too large, metadata on correct chunks - Table pre-commit: table metadata not on pre-committed chunk - DocumentTokenChunker boundary with large filler element - DocumentTokenChunker with overlap enabled - Table split across chunks: first chunk gets metadata Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 2 tests covering SemanticSimilarityChunker metadata flow: - Single element with metadata propagates to chunk - Multiple elements with different keys each carry metadata Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bd31e18 to
8bac32d
Compare
🎉 Good job! The coverage increased 🎉
Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1454942&view=codecoverage-tab |
Summary
Fixes #7465
All four built-in
IngestionChunkerimplementations (SectionChunker,HeaderChunker,SemanticSimilarityChunker,DocumentTokenChunker) now propagateIngestionDocumentElement.MetadatatoIngestionChunk<T>.Metadata.Problem
The chunkers never read element metadata, so any metadata attached to document elements (e.g., page numbers, source URIs, element types) was silently dropped during chunking. This meant
VectorStoreWriter- which already correctly persists chunk metadata - had nothing to write.Solution
ElementsChunker (internal, fixes 3 public chunkers)
AccumulateMetadata/ApplyMetadatastatic helpersDocumentTokenChunker (independent chunker)
AccumulateMetadatastatic helperFinalizeChunk, then clears the accumulatorDesign Decisions
Testing
ChunkerMetadataPropagationTestscovering all scenariosMicrosoft Reviewers: Open in CodeFlow