Change Embedding property type from string to AIContent#7557
Conversation
- IngestionChunk, IngestionChunker, IngestionChunkProcessor, IngestionChunkWriter are now non-generic - IngestionPipeline is now non-generic - IngestionChunkVectorRecord is now non-generic with SerializedContent string property - VectorStoreWriter has single type param TRecord - All chunkers wrap text in TextContent - VectorStoreWriter serializes AIContent via JsonSerializer - Updated all tests, project templates, snapshots, and READMEs - Added test showing multi-content-type chunker (ChunkerCanProduceMultipleContentTypes) - Added test showing pipeline with embedding generator (PipelineWorksWithEmbeddingGenerator)
adamsitnik
left a comment
There was a problem hiding this comment.
@copilot please address my feedback. Make sure we don't generate the embedding based on the SerializedContent, but on the original AIContent that was created by the chunker
… SerializedContent - Embedding property is string? (VectorData framework limitation - does not support AIContent) - VectorStoreWriter extracts text from AIContent for embedding source (not JSON) - Added AsAIContentEmbeddingGenerator/AsStringEmbeddingGenerator adapter extensions - Removed Microsoft.Bcl.AsyncInterfaces transitive dependency - Fixed README links to use generic type suffixes - Added GetText helper in chunker tests to avoid repeated casts - Fixed using directive ordering in test files - Updated all template and snapshot files
Addressed in the latest commit. The embedding is now generated from the original text content extracted from Important note: I investigated VectorData 10.1.0 and confirmed it does not support Because of this framework limitation, |
…ds making it build
adamsitnik
left a comment
There was a problem hiding this comment.
@copilot address my new feedback
… all tests, change Embedding to => Content
Done. Changes in commit that follows:
|
…at the model builder creates a VectorPropertyModel<AIContent> which knows how to resolve IEmbeddingGenerator<AIContent, TEmbedding>.
🎉 Good job! The coverage increased 🎉
Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1458371&view=codecoverage-tab |
adamsitnik
left a comment
There was a problem hiding this comment.
@copilot please fix following build errors:
src\Libraries\Microsoft.Extensions.DataIngestion\Writers\VectorStoreWriter.cs(6,1): error S1128: (NETCORE_ENGINEERING_TELEMETRY=Build) Remove this unnecessary 'using'. (https://rules.sonarsource.com/csharp/RSPEC-1128)
src\Libraries\Microsoft.Extensions.DataIngestion\Writers\VectorStoreWriter.cs(9,1): error S1128: (NETCORE_ENGINEERING_TELEMETRY=Build) Remove this unnecessary 'using'. (https://rules.sonarsource.com/csharp/RSPEC-1128)
Changes
IngestionChunkVectorRecord.Embeddingfromstring?toAIContent?, passing the chunker's content directly to the vector store instead of extracting text first.Changes
IngestionChunkVectorRecord— property type changed toAIContent?VectorStoreWriter— assignschunk.Contentdirectly instead of extracting textVectorStoreExtensions— uses genericVectorStoreVectorProperty<AIContent>so the model builder resolvesIEmbeddingGenerator<AIContent, TEmbedding>; retainsAsAIContentEmbeddingGenerator/AsStringEmbeddingGeneratoradapter classes for wrapping between string and AIContent generatorsDataIngestornow acceptsIEmbeddingGenerator<AIContent, Embedding<float>>and uses.AsStringEmbeddingGenerator()for the chunker;Program.csregistersembeddingGenerator.AsAIContentEmbeddingGenerator(); AspireProgram.Aspire.csadds a DI decorator wrapping the string generator to AIContent;IngestedChunk.Embeddingsimplified to=> Content;IngestedChunk,Program.cs, andDataIngestor.cssnapshots updated to matchTestEmbeddingGenerator<string>toTestEmbeddingGenerator<AIContent>; sharedGetTexthelper moved toDocumentChunkerTestsbase classExample