Propagate element metadata to chunks in MEDI chunkers #7516

Open
luisquintanilla wants to merge 3 commits into main from fix/chunker-metadata-propagation

luisquintanilla commented May 7, 2026

Contributor

Summary

Fixes #7465

All four built-in IngestionChunker implementations (SectionChunker, HeaderChunker, SemanticSimilarityChunker, DocumentTokenChunker) now propagate IngestionDocumentElement.Metadata to IngestionChunk<T>.Metadata.

Problem

The chunkers never read element metadata, so any metadata attached to document elements (e.g., page numbers, source URIs, element types) was silently dropped during chunking. As a result, VectorStoreWriter, which already persists chunk metadata correctly, had nothing to write.

Solution

ElementsChunker (internal, fixes 3 public chunkers)

  • Added AccumulateMetadata / ApplyMetadata static helpers
  • As elements are processed, their metadata is accumulated into a lazily-allocated dictionary
  • When a chunk is committed, accumulated metadata is applied to the chunk and the accumulator is cleared

DocumentTokenChunker (independent chunker)

  • Added AccumulateMetadata static helper
  • Accumulates metadata during element iteration
  • Applies metadata in FinalizeChunk, then clears the accumulator

Design Decisions

| Decision | Rationale |
| --- | --- |
| First-wins merge (TryAdd) | When multiple elements share a key, the first element's value prevails; predictable and deterministic |
| Null values skipped | Element metadata allows object? but chunk metadata requires object; nulls are meaningless for downstream consumers |
| Split elements -> first chunk only | When an element is split across chunks, metadata goes to the first chunk. This avoids duplication and matches the semantic intent |
| Lazy allocation | Dictionary only allocated when the first element with metadata is encountered |
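
The merge rules above can be sketched with plain BCL dictionaries. This is a minimal illustration, not the code in this PR: the class name `MetadataSketch`, the method names, and the parameter types are hypothetical stand-ins, and clearing the accumulator with `Clear()` (rather than resetting it to null) is an assumption. `Dictionary.TryAdd` is not available on .NET Framework, so a net462 build would need an equivalent check.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the helpers described above; not the library's API.
public static class MetadataSketch
{
    // Copies non-null entries into the accumulator.
    // The accumulator is created lazily, and the first value seen for a key wins.
    public static void Accumulate(
        IReadOnlyDictionary<string, object?> elementMetadata,
        ref Dictionary<string, object>? accumulated)
    {
        foreach (KeyValuePair<string, object?> pair in elementMetadata)
        {
            if (pair.Value is not { } value)
            {
                continue; // null values are skipped
            }

            accumulated ??= new Dictionary<string, object>();

            // TryAdd leaves an existing entry untouched (first-wins).
            accumulated.TryAdd(pair.Key, value);
        }
    }

    // Copies accumulated entries onto the chunk's metadata, then clears the accumulator.
    public static void Apply(
        IDictionary<string, object> chunkMetadata,
        ref Dictionary<string, object>? accumulated)
    {
        if (accumulated is null)
        {
            return; // no element carried metadata, so nothing was allocated
        }

        foreach (KeyValuePair<string, object> pair in accumulated)
        {
            chunkMetadata[pair.Key] = pair.Value;
        }

        accumulated.Clear();
    }
}
```

Under these assumptions, accumulating `{ page = 1, source = null }` followed by `{ page = 2, type = "table" }` and then applying leaves the chunk with `page = 1` and `type = "table"`: the null `source` is skipped and the second `page` loses to the first.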

Testing

  • 14 new tests in ChunkerMetadataPropagationTests covering all scenarios
  • All 128 existing DataIngestion tests pass with no regressions
  • Verified on net8.0 and net9.0 (builds clean on all 5 TFMs)

Copilot AI left a comment

Contributor

Pull request overview

This PR addresses MEDI ingestion metadata loss by propagating IngestionDocumentElement.Metadata into IngestionChunk<string>.Metadata across the built-in chunkers, so downstream components (e.g., VectorStoreWriter) can persist element-derived metadata on produced chunks.

Changes:

  • Added element-metadata accumulation/application logic to ElementsChunker (affecting SectionChunker, HeaderChunker, and SemanticSimilarityChunker).
  • Added similar metadata accumulation/application to DocumentTokenChunker during element iteration and chunk finalization.
  • Introduced a new test suite validating metadata propagation behavior for several chunkers and scenarios.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/ChunkerMetadataPropagationTests.cs | Adds tests asserting element metadata is propagated to chunk metadata under various conditions. |
| src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/ElementsChunker.cs | Accumulates element metadata while building chunks and applies it when committing chunks. |
| src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs | Accumulates element metadata during token chunking and applies it when finalizing chunks. |

Comment on lines 71 to 75
AccumulateMetadata(element, ref accumulatedMetadata);

int elementTokenCount = CountTokens(semanticContent.AsSpan());
if (elementTokenCount + totalTokenCount <= _maxTokensPerChunk)
{

@luisquintanilla luisquintanilla Jun 8, 2026

Contributor Author

Accepted and fixed.

Good catch — this was a real timing bug. In the original code, AccumulateMetadata ran unconditionally before the branch logic, so when a table pre-commit or overflow triggered FinalizeCurrentChunk, the element's metadata attached to the previous chunk instead of the one receiving the content.

Fix (commit 8bfe9bd): Moved AccumulateMetadata into each of the 3 branches with flag-based deferred accumulation:

  • Fits branch (L72-78): Accumulate immediately before append — straightforward, element stays in current chunk.
  • Table branch (L79-163): tableMetadataAccumulated flag — defer until the first actual AppendNewLineAndSpan call to _currentChunk. This handles pre-commit (header doesn't fit), rowIndex==1 edge case (first data row doesn't fit), and mid-table row splits.
  • Non-table overflow (L164-213): elementMetadataAccumulated flag — accumulate only when index > 0 (content actually appended to a chunk).

Comment on lines 55 to 80
 {
     continue;
 }

+AccumulateMetadata(element, ref accumulatedMetadata);

 int contentToProcessTokenCount = _tokenizer.CountTokens(elementContent!, considerNormalization: false);
 ReadOnlyMemory<char> contentToProcess = elementContent.AsMemory();
 while (stringBuilderTokenCount + contentToProcessTokenCount >= _maxTokensPerChunk)
 {
     int index = _tokenizer.GetIndexByTokenCount(
         text: contentToProcess.Span,
         maxTokenCount: _maxTokensPerChunk - stringBuilderTokenCount,
         out string? _,
         out int _,
         considerNormalization: false);

     unsafe
     {
         fixed (char* ptr = &MemoryMarshal.GetReference(contentToProcess.Span))
         {
             _ = stringBuilder.Append(ptr, index);
         }
     }
-    yield return FinalizeChunk();
+    yield return FinalizeChunk(ref accumulatedMetadata);

@luisquintanilla luisquintanilla Jun 8, 2026

Contributor Author

Accepted and fixed.

Correct. In the original code, AccumulateMetadata ran at line 86 before the while loop. If index == 0 and the buffer was already full, the while loop would call FinalizeChunk first, attaching the element's metadata to the wrong (previous) chunk.

Fix (commit 8bfe9bd): Introduced elementMetadataAccumulated flag:

  • Inside the while loop: accumulate only when index > 0 (meaning content has actually been appended to a chunk from this element)
  • After the while loop: if the flag is still false (remaining content goes to buffer), accumulate then

This ensures metadata always follows the content, regardless of whether a chunk boundary is crossed at the start of the element.
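
The deferred accumulation described above can be illustrated with a small self-contained model. It is a sketch under simplifying assumptions, not the DocumentTokenChunker code: it counts characters instead of tokens, each element carries a single key/value pair, and the names `Element`, `Chunk`, `DeferredAccumulationSketch`, `Split`, and `Flush` are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical types for illustration only.
public sealed record Element(string Text, string Key, string Value);
public sealed record Chunk(string Text, Dictionary<string, string> Metadata);

public static class DeferredAccumulationSketch
{
    // Splits element text into chunks of at most maxChars characters.
    // Characters stand in for tokens to keep the example dependency-free.
    public static List<Chunk> Split(IEnumerable<Element> elements, int maxChars)
    {
        if (maxChars <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(maxChars));
        }

        var chunks = new List<Chunk>();
        var buffer = new StringBuilder();
        var pending = new Dictionary<string, string>();

        // Emits the buffered text with the pending metadata, then resets both.
        void Flush()
        {
            chunks.Add(new Chunk(buffer.ToString(), new Dictionary<string, string>(pending)));
            buffer.Clear();
            pending.Clear();
        }

        foreach (Element element in elements)
        {
            string remaining = element.Text;
            bool accumulated = false; // has this element's metadata been recorded yet?

            while (buffer.Length + remaining.Length > maxChars)
            {
                int index = maxChars - buffer.Length; // characters that still fit

                // Record metadata only if this element contributes to the chunk being closed.
                if (index > 0 && !accumulated)
                {
                    pending.TryAdd(element.Key, element.Value);
                    accumulated = true;
                }

                buffer.Append(remaining, 0, index);
                remaining = remaining.Substring(index);
                Flush();
            }

            if (remaining.Length > 0)
            {
                // The rest goes into the open buffer; record metadata now if still deferred.
                if (!accumulated)
                {
                    pending.TryAdd(element.Key, element.Value);
                }

                buffer.Append(remaining);
            }
        }

        if (buffer.Length > 0)
        {
            Flush();
        }

        return chunks;
    }
}
```

With `maxChars = 4` and the elements `("abcd", "a", "1")` and `("ef", "b", "2")`, the first element fills the buffer exactly, so the second element sees `index == 0`. Its metadata is held back until after the flush, and the result is a first chunk carrying only `a` and a second chunk carrying only `b`. Accumulating eagerly at the top of the loop would have put `b` on the first chunk.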

Comment on lines +26 to +32
private static IngestionChunker<string> CreateDocumentTokenChunker(int maxTokensPerChunk = 2_000)
{
    var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
    return new DocumentTokenChunker(new(tokenizer) { MaxTokensPerChunk = maxTokensPerChunk, OverlapTokens = 0 });
}

[Fact]

@luisquintanilla luisquintanilla Jun 8, 2026

Contributor Author

Partially accepted.

Agree that test coverage should be broader. Here's what was added:

Boundary/overlap tests (commit 8bfe9bd — 6 tests):

  • ElementsChunker_TablePreCommit_MetadataGoesToCorrectChunk — table element triggers pre-commit; metadata follows content to new chunk
  • ElementsChunker_ExactFill_MetadataStaysOnCurrentChunk — element exactly fills remaining capacity; metadata stays on current chunk
  • DocumentTokenChunker_OverlapTokens_MetadataOnlyOnOriginalChunks — overlap content doesn't duplicate metadata
  • DocumentTokenChunker_ExactFill_MetadataAttachesToCorrectChunk — boundary precision
  • ElementsChunker_TableSplit_MetadataGoesToFirstTableChunk — large table split across chunks
  • ElementsChunker_NonTableOverflow_MetadataGoesToNewChunk — non-table overflow triggers new chunk

SemanticSimilarityChunker tests (commit bd31e18 — 2 tests):

  • SemanticSimilarityChunker_SingleElementWithMetadata_PropagatesMetadata — basic metadata flow
  • SemanticSimilarityChunker_MultipleElementsDifferentKeys_AllKeysAppear — per-element metadata preserved across chunks

All 22 metadata propagation tests pass across net8.0, net9.0, net10.0, and net462.

@CZEMacLeod

MEDI is an acronym regularly used for Microsoft.Extensions.DependencyInjection. Using it for this library is another reason why these AI technology packages should not be in the root Microsoft.Extensions namespace.

@luisquintanilla

Contributor Author

@adamsitnik I confirmed this wasn't a design choice. Can I please get a review before merging? Thanks!

@dotnet-comment-bot

Collaborator

‼️ Found issues ‼️

| Project | Coverage Type | Expected | Actual |
| --- | --- | --- | --- |
| Microsoft.Extensions.Diagnostics.Testing | Line | 99 | 98.65 🔻 |
| Microsoft.Extensions.Telemetry | Line | 93 | 91.95 🔻 |
| Microsoft.Extensions.AI | Line | 89 | 88.51 🔻 |
| Microsoft.Extensions.AI | Branch | 89 | 88.53 🔻 |
| Microsoft.Extensions.AI.OpenAI | Line | 75 | 62.62 🔻 |
| Microsoft.Extensions.AI.OpenAI | Branch | 75 | 49.63 🔻 |
| Microsoft.Extensions.DataIngestion.MarkItDown | Line | 75 | 4.46 🔻 |
| Microsoft.Extensions.DataIngestion.MarkItDown | Branch | 75 | 0 🔻 |
| Microsoft.Extensions.Diagnostics.ResourceMonitoring | Line | 99 | 96.03 🔻 |
| Microsoft.Extensions.Diagnostics.ResourceMonitoring | Branch | 99 | 94.39 🔻 |
| Microsoft.Extensions.Diagnostics.ResourceMonitoring.Kubernetes | Line | 99 | 97.73 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Dns | Line | 75 | 69.93 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Abstractions | Line | 75 | 42.11 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Abstractions | Branch | 75 | 42.86 🔻 |
| Microsoft.Extensions.ServiceDiscovery | Line | 75 | 67.51 🔻 |
| Microsoft.Extensions.ServiceDiscovery | Branch | 75 | 71.43 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Yarp | Line | 75 | 73.85 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Yarp | Branch | 75 | 70 🔻 |
| Microsoft.Extensions.VectorData.Abstractions | Line | 75 | 37.39 🔻 |
| Microsoft.Extensions.VectorData.Abstractions | Branch | 75 | 22.73 🔻 |

🎉 Good job! The coverage increased 🎉
Update MinCodeCoverage in the project files.

| Project | Expected | Actual |
| --- | --- | --- |
| Microsoft.Gen.BuildMetadata | 97 | 100 |
| Microsoft.Gen.MetadataExtractor | 57 | 73 |
| Microsoft.Gen.MetricsReports | 67 | 69 |
| Microsoft.Extensions.AI.Abstractions | 82 | 85 |
| Microsoft.Extensions.AI.Evaluation.NLP | 0 | 78 |
| Microsoft.Extensions.Caching.Hybrid | 82 | 89 |
| Microsoft.Extensions.DataIngestion.Abstractions | 75 | 91 |
| Microsoft.Extensions.DataIngestion | 75 | 89 |
| Microsoft.Extensions.DataIngestion.Markdig | 75 | 90 |
| Microsoft.Extensions.Http.Resilience | 97 | 100 |

Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1454486&view=codecoverage-tab

@luisquintanilla luisquintanilla force-pushed the fix/chunker-metadata-propagation branch from 6c9bd82 to 8bfe9bd Compare June 8, 2026 18:18
@dotnet-comment-bot

Collaborator

‼️ Found issues ‼️

| Project | Coverage Type | Expected | Actual |
| --- | --- | --- | --- |
| Microsoft.Extensions.Diagnostics.Testing | Line | 99 | 98.65 🔻 |
| Microsoft.Extensions.Telemetry | Line | 93 | 91.95 🔻 |
| Microsoft.Extensions.AI | Line | 89 | 88.59 🔻 |
| Microsoft.Extensions.AI | Branch | 89 | 88.53 🔻 |
| Microsoft.Extensions.AI.OpenAI | Line | 75 | 62.62 🔻 |
| Microsoft.Extensions.AI.OpenAI | Branch | 75 | 49.63 🔻 |
| Microsoft.Extensions.DataIngestion.MarkItDown | Line | 75 | 4.46 🔻 |
| Microsoft.Extensions.DataIngestion.MarkItDown | Branch | 75 | 0 🔻 |
| Microsoft.Extensions.Diagnostics.ResourceMonitoring | Line | 99 | 96.03 🔻 |
| Microsoft.Extensions.Diagnostics.ResourceMonitoring | Branch | 99 | 94.39 🔻 |
| Microsoft.Extensions.Diagnostics.ResourceMonitoring.Kubernetes | Line | 99 | 97.73 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Dns | Line | 75 | 68.32 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Abstractions | Line | 75 | 42.11 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Abstractions | Branch | 75 | 42.86 🔻 |
| Microsoft.Extensions.ServiceDiscovery | Line | 75 | 67.21 🔻 |
| Microsoft.Extensions.ServiceDiscovery | Branch | 75 | 71.43 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Yarp | Line | 75 | 73.85 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Yarp | Branch | 75 | 70 🔻 |
| Microsoft.Extensions.VectorData.Abstractions | Line | 75 | 37.39 🔻 |
| Microsoft.Extensions.VectorData.Abstractions | Branch | 75 | 22.73 🔻 |

🎉 Good job! The coverage increased 🎉
Update MinCodeCoverage in the project files.

| Project | Expected | Actual |
| --- | --- | --- |
| Microsoft.Gen.BuildMetadata | 97 | 100 |
| Microsoft.Gen.MetadataExtractor | 57 | 73 |
| Microsoft.Gen.MetricsReports | 67 | 69 |
| Microsoft.Extensions.AI.Abstractions | 82 | 85 |
| Microsoft.Extensions.AI.Evaluation.NLP | 0 | 78 |
| Microsoft.Extensions.Caching.Hybrid | 82 | 84 |
| Microsoft.Extensions.DataIngestion.Abstractions | 75 | 91 |
| Microsoft.Extensions.DataIngestion | 75 | 89 |
| Microsoft.Extensions.DataIngestion.Markdig | 75 | 90 |
| Microsoft.Extensions.Http.Resilience | 97 | 100 |

Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1454834&view=codecoverage-tab

luisquintanilla and others added 3 commits June 8, 2026 16:23
Fix #7465: All four IngestionChunker implementations (SectionChunker,
HeaderChunker, SemanticSimilarityChunker, DocumentTokenChunker) now
propagate IngestionDocumentElement.Metadata to IngestionChunk.Metadata.

Design decisions:
- First-wins merge strategy (TryAdd) for conflicting keys
- Null metadata values skipped (element allows object?, chunk requires object)
- Split elements: metadata goes to the first chunk only
- Lazy allocation: dictionary only created when elements have metadata

ElementsChunker (fixes SectionChunker, HeaderChunker, SemanticSimilarityChunker):
- Added AccumulateMetadata/ApplyMetadata static helpers
- Accumulates metadata as elements are processed
- Applies to chunk on commit, then clears accumulator

DocumentTokenChunker:
- Added AccumulateMetadata static helper
- Accumulates metadata during element iteration
- Applies in FinalizeChunk, then clears accumulator

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fix metadata accumulation timing bugs in ElementsChunker and
DocumentTokenChunker where AccumulateMetadata was called before
determining which chunk the element's content contributes to.

When a Commit/FinalizeChunk happens before the new element adds
content (table pre-commit, non-table overflow, exact-fill boundary),
the metadata was incorrectly applied to the previous chunk.

ElementsChunker fixes:
- Branch 1 (fits): accumulate right before appending
- Branch 2 (table): use flag, accumulate before first table content
  append to _currentChunk, after any pre-commit or row-level commit
- Branch 3 (non-table too big): use flag, accumulate when index > 0
  (first content contribution in the while loop)

DocumentTokenChunker fixes:
- Use flag to defer accumulation until first content contribution
- In while loop: accumulate only when index > 0
- After while loop: accumulate if not yet done (element fits entirely)

New boundary tests (6 tests):
- Previous element fills chunk, next element metadata on new chunk
- Non-table element too large, metadata on correct chunks
- Table pre-commit: table metadata not on pre-committed chunk
- DocumentTokenChunker boundary with large filler element
- DocumentTokenChunker with overlap enabled
- Table split across chunks: first chunk gets metadata

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add 2 tests covering SemanticSimilarityChunker metadata flow:
- Single element with metadata propagates to chunk
- Multiple elements with different keys each carry metadata

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@luisquintanilla luisquintanilla force-pushed the fix/chunker-metadata-propagation branch from bd31e18 to 8bac32d Compare June 8, 2026 20:23
@dotnet-comment-bot

Collaborator

‼️ Found issues ‼️

| Project | Coverage Type | Expected | Actual |
| --- | --- | --- | --- |
| Microsoft.Extensions.Diagnostics.Testing | Line | 99 | 98.65 🔻 |
| Microsoft.Extensions.Telemetry | Line | 93 | 91.95 🔻 |
| Microsoft.Extensions.AI | Line | 89 | 88.57 🔻 |
| Microsoft.Extensions.AI | Branch | 89 | 88.53 🔻 |
| Microsoft.Extensions.AI.OpenAI | Line | 75 | 62.62 🔻 |
| Microsoft.Extensions.AI.OpenAI | Branch | 75 | 49.63 🔻 |
| Microsoft.Extensions.DataIngestion.MarkItDown | Line | 75 | 4.46 🔻 |
| Microsoft.Extensions.DataIngestion.MarkItDown | Branch | 75 | 0 🔻 |
| Microsoft.Extensions.Diagnostics.ResourceMonitoring | Line | 99 | 96.03 🔻 |
| Microsoft.Extensions.Diagnostics.ResourceMonitoring | Branch | 99 | 94.39 🔻 |
| Microsoft.Extensions.Diagnostics.ResourceMonitoring.Kubernetes | Line | 99 | 97.73 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Dns | Line | 75 | 69.93 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Abstractions | Line | 75 | 42.11 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Abstractions | Branch | 75 | 42.86 🔻 |
| Microsoft.Extensions.ServiceDiscovery | Line | 75 | 68.88 🔻 |
| Microsoft.Extensions.ServiceDiscovery | Branch | 75 | 71.43 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Yarp | Line | 75 | 73.85 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Yarp | Branch | 75 | 70 🔻 |
| Microsoft.Extensions.VectorData.Abstractions | Line | 75 | 37.39 🔻 |
| Microsoft.Extensions.VectorData.Abstractions | Branch | 75 | 22.73 🔻 |

🎉 Good job! The coverage increased 🎉
Update MinCodeCoverage in the project files.

| Project | Expected | Actual |
| --- | --- | --- |
| Microsoft.Gen.BuildMetadata | 97 | 100 |
| Microsoft.Gen.MetadataExtractor | 57 | 73 |
| Microsoft.Gen.MetricsReports | 67 | 69 |
| Microsoft.Extensions.AI.Abstractions | 82 | 85 |
| Microsoft.Extensions.AI.Evaluation.NLP | 0 | 78 |
| Microsoft.Extensions.Caching.Hybrid | 82 | 89 |
| Microsoft.Extensions.DataIngestion.Abstractions | 75 | 91 |
| Microsoft.Extensions.DataIngestion | 75 | 89 |
| Microsoft.Extensions.DataIngestion.Markdig | 75 | 90 |
| Microsoft.Extensions.Http.Resilience | 97 | 100 |

Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1454942&view=codecoverage-tab

Development

Successfully merging this pull request may close these issues.

[MEDI] Design feedback: Built-in chunkers don't propagate element metadata to chunks

4 participants