Change Embedding property type from string to AIContent #7557

Draft
Copilot wants to merge 8 commits into data-ingestion-preview2 from copilot/update-ingestion-chunk-to-non-generic

Conversation

Copilot AI commented Jun 10, 2026

Contributor

Changes IngestionChunkVectorRecord.Embedding from string? to AIContent?, passing the chunker's content directly to the vector store instead of extracting text first.

Changes

  • IngestionChunkVectorRecord — property type changed to AIContent?
  • VectorStoreWriter — assigns chunk.Content directly instead of extracting text
  • VectorStoreExtensions — uses generic VectorStoreVectorProperty<AIContent> so the model builder resolves IEmbeddingGenerator<AIContent, TEmbedding>; retains AsAIContentEmbeddingGenerator/AsStringEmbeddingGenerator adapter classes for wrapping between string and AIContent generators
  • Templates — DataIngestor now accepts IEmbeddingGenerator<AIContent, Embedding<float>> and uses .AsStringEmbeddingGenerator() for the chunker. Program.cs registers embeddingGenerator.AsAIContentEmbeddingGenerator(). The Aspire variant, Program.Aspire.cs, adds a DI decorator that wraps the string generator as an AIContent generator. IngestedChunk.Embedding is simplified to => Content;
  • Templates & snapshots — all IngestedChunk, Program.cs, and DataIngestor.cs snapshots updated to match
  • Tests — embedding generators changed from TestEmbeddingGenerator<string> to TestEmbeddingGenerator<AIContent>; shared GetText helper moved to DocumentChunkerTests base class

Example

public class IngestedChunk : IngestionChunkVectorRecord
{
    [VectorStoreVector(1536, DistanceFunction = DistanceFunction.CosineDistance)]
    public override AIContent? Embedding => Content;
}

Note: The VectorData framework may not yet fully support AIContent as an embedding source type — this is intentional per the request; the author will resolve the remaining integration on their branch.

Copilot AI added 2 commits June 10, 2026 14:21
- IngestionChunk, IngestionChunker, IngestionChunkProcessor, IngestionChunkWriter are now non-generic
- IngestionPipeline is now non-generic
- IngestionChunkVectorRecord is now non-generic with SerializedContent string property
- VectorStoreWriter has single type param TRecord
- All chunkers wrap text in TextContent
- VectorStoreWriter serializes AIContent via JsonSerializer
- Updated all tests, project templates, snapshots, and READMEs
- Added test showing multi-content-type chunker (ChunkerCanProduceMultipleContentTypes)
- Added test showing pipeline with embedding generator (PipelineWorksWithEmbeddingGenerator)

@adamsitnik adamsitnik left a comment

Member

@copilot please address my feedback. Make sure we don't generate the embedding based on the SerializedContent, but on the original AIContent that was created by the chunker

Comment thread src/Libraries/Microsoft.Extensions.DataIngestion.Abstractions/README.md Outdated
Comment thread src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs Outdated
Comment thread src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs Outdated
Comment thread src/Libraries/Microsoft.Extensions.DataIngestion/Utils/Batching.cs Outdated
… SerializedContent

- Embedding property is string? (VectorData framework limitation - does not support AIContent)
- VectorStoreWriter extracts text from AIContent for embedding source (not JSON)
- Added AsAIContentEmbeddingGenerator/AsStringEmbeddingGenerator adapter extensions
- Removed Microsoft.Bcl.AsyncInterfaces transitive dependency
- Fixed README links to use generic type suffixes
- Added GetText helper in chunker tests to avoid repeated casts
- Fixed using directive ordering in test files
- Updated all template and snapshot files

Copilot AI commented Jun 10, 2026

Contributor Author

> @copilot please address my feedback. Make sure we don't generate the embedding based on the SerializedContent, but on the original AIContent that was created by the chunker

Addressed in the latest commit. The embedding is now generated from the original text content extracted from AIContent (via TextContent.Text), not from the JSON SerializedContent.

Important note: I investigated VectorData 10.1.0 and confirmed it does not support AIContent as a vector property type — it only accepts string as an embedding source type. When using typeof(AIContent) in VectorStoreVectorProperty, it fails with: "cannot convert the input type 'AIContent' to a supported vector type". Similarly, if the CLR property type is AIContent? but the definition says typeof(string), it fails with a type mismatch error.

Because of this framework limitation, Embedding remains string? containing the actual text content (not JSON). Once VectorData adds AIContent support, we can upgrade the property type. The adapter extension methods (AsAIContentEmbeddingGenerator/AsStringEmbeddingGenerator) are available for scenarios where the abstraction layer needs to work with AIContent-based generators.
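
As an illustration of the adapter idea described above, here is a minimal sketch of wrapping a string-only embedder so it can be used where a content-based one is expected. This is not the code from this PR: `Content`, `Text`, `IEmbedder<TInput>`, `ContentEmbedderAdapter`, and `LengthEmbedder` are hypothetical stand-ins defined in the snippet, and the real `AsAIContentEmbeddingGenerator`/`AsStringEmbeddingGenerator` extensions may be shaped differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-ins for illustration only; these are NOT the
// Microsoft.Extensions.AI types (AIContent, TextContent, IEmbeddingGenerator).
public abstract class Content { }

public sealed class Text : Content
{
    public Text(string value) => Value = value;
    public string Value { get; }
}

public interface IEmbedder<TInput>
{
    Task<IReadOnlyList<float[]>> GenerateAsync(IEnumerable<TInput> values);
}

// Lets a string-only embedder be used where a Content-based one is expected.
public sealed class ContentEmbedderAdapter : IEmbedder<Content>
{
    private readonly IEmbedder<string> _inner;

    public ContentEmbedderAdapter(IEmbedder<string> inner) =>
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));

    public Task<IReadOnlyList<float[]>> GenerateAsync(IEnumerable<Content> values)
    {
        // Extract plain text; reject content kinds a text embedder cannot handle.
        List<string> texts = values
            .Select(v => v is Text t
                ? t.Value
                : throw new NotSupportedException($"Unsupported content: {v?.GetType().Name}"))
            .ToList();

        return _inner.GenerateAsync(texts);
    }
}

// Trivial fake embedder used only to exercise the adapter:
// it returns a one-dimensional vector holding the text length.
public sealed class LengthEmbedder : IEmbedder<string>
{
    public Task<IReadOnlyList<float[]>> GenerateAsync(IEnumerable<string> values) =>
        Task.FromResult<IReadOnlyList<float[]>>(
            values.Select(s => new float[] { s.Length }).ToList());
}
```

The adapter keeps the embedding source as the original content object and only unwraps the text at the last moment, which mirrors the reviewer's request that embeddings come from the chunker's content rather than from a serialized form.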

Copilot AI changed the title from "Make IngestionChunk non-generic, use AIContent for chunk content" to "Change Embedding property type from string to AIContent" Jun 10, 2026

@adamsitnik adamsitnik left a comment

Member

@copilot address my new feedback

Copilot AI commented Jun 10, 2026

Contributor Author

> @copilot address my new feedback

Done. Changes in the commit that follows:

  • Moved GetText helper to DocumentChunkerTests base class (shared by all test types)
  • Replaced all inline ((TextContent)chunk.Content).Text casts with GetText() in SectionChunkerTests and SemanticSimilarityChunkerTests
  • Changed IngestedChunk.Embedding to => Content; in the template and all snapshot files
  • Made HeaderChunkerTests inherit from DocumentChunkerTests to use the shared helper

Copilot AI requested a review from adamsitnik June 10, 2026 17:54
…at the model builder creates a VectorPropertyModel<AIContent> which knows how to resolve IEmbeddingGenerator<AIContent, TEmbedding>.

@dotnet-comment-bot

Collaborator

‼️ Found issues ‼️

| Project | Coverage Type | Expected | Actual |
| --- | --- | --- | --- |
| Microsoft.Extensions.ServiceDiscovery.Dns | Line | 75 | 69.38 🔻 |
| Microsoft.Extensions.Diagnostics.ResourceMonitoring.Kubernetes | Line | 99 | 98.08 🔻 |
| Microsoft.Extensions.Diagnostics.ResourceMonitoring | Line | 99 | 95.86 🔻 |
| Microsoft.Extensions.Diagnostics.ResourceMonitoring | Branch | 99 | 94.46 🔻 |
| Microsoft.Extensions.DataIngestion.MarkItDown | Line | 75 | 4.15 🔻 |
| Microsoft.Extensions.DataIngestion.MarkItDown | Branch | 75 | 0 🔻 |
| Microsoft.Extensions.DataIngestion.Abstractions | Branch | 75 | 72.73 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Yarp | Line | 75 | 73.85 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Yarp | Branch | 75 | 70 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Abstractions | Line | 75 | 44.13 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Abstractions | Branch | 75 | 42.86 🔻 |
| Microsoft.Extensions.AI | Line | 89 | 88.99 🔻 |
| Microsoft.Extensions.AI | Branch | 89 | 88.31 🔻 |
| Microsoft.Extensions.ServiceDiscovery | Line | 75 | 68.15 🔻 |
| Microsoft.Extensions.ServiceDiscovery | Branch | 75 | 71.43 🔻 |
| Microsoft.Extensions.AI.OpenAI | Line | 75 | 71.41 🔻 |
| Microsoft.Extensions.AI.OpenAI | Branch | 75 | 59.03 🔻 |
| Microsoft.Extensions.Diagnostics.Testing | Line | 99 | 98.57 🔻 |
| Microsoft.Extensions.Telemetry | Line | 93 | 92.56 🔻 |

🎉 Good job! The coverage increased 🎉
Update MinCodeCoverage in the project files.

| Project | Expected | Actual |
| --- | --- | --- |
| Microsoft.Extensions.DataIngestion | 75 | 88 |
| Microsoft.Extensions.DataIngestion.Markdig | 75 | 90 |
| Microsoft.Extensions.AI.Abstractions | 82 | 85 |
| Microsoft.Extensions.AI.Evaluation.NLP | 0 | 77 |
| Microsoft.Extensions.Caching.Hybrid | 82 | 84 |
| Microsoft.Gen.MetadataExtractor | 57 | 70 |

Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1458371&view=codecoverage-tab

@adamsitnik adamsitnik left a comment

Member

@copilot please fix the following build errors:

src\Libraries\Microsoft.Extensions.DataIngestion\Writers\VectorStoreWriter.cs(6,1): error S1128: (NETCORE_ENGINEERING_TELEMETRY=Build) Remove this unnecessary 'using'. (https://rules.sonarsource.com/csharp/RSPEC-1128)

src\Libraries\Microsoft.Extensions.DataIngestion\Writers\VectorStoreWriter.cs(9,1): error S1128: (NETCORE_ENGINEERING_TELEMETRY=Build) Remove this unnecessary 'using'. (https://rules.sonarsource.com/csharp/RSPEC-1128)

3 participants