Add public SentencePieceTokenizer factory methods for Unigram from vocab list and tokenizer.json #7625
Copilot wants to merge 4 commits
…erJson APIs Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
ericstj left a comment
Thanks for adding this — the in-memory Create(vocab, ...) overload is clean and the ID-preservation (JSON vocab index = token id) is the right call. I implemented this same JSON-only Unigram capability recently against real Hugging Face models, and hit two correctness issues that this PR's single test model happens to mask. Details inline; summary here.
Bugs (untested by the current suite)
- BOS/EOS positional fallback corrupts real pieces. `FindSpecialTokenId(pieces, "<s>", 1)` / `("</s>", 2)` fall back to positions 1/2 when the vocab has no piece literally named `<s>` / `</s>`. Many HF Unigram tokenizers don't use those names, e.g. `minishlab/potion-multilingual-128M` (bge-m3 family) has `unk_id=1, vocab[0]="[PAD]", [1]="[UNK]", [2]=","`. There, `eosId`→2 marks `","` as `Control` and drops it from the Viterbi trie (it can never be emitted), and `bosId`→1 collides with `unkId` and clobbers the unknown entry. This is structural (independent of `addBos`/`addEos`).
- Normalizer steps beyond `Precompiled` are silently dropped. `ExtractPrecompiledCharsMap` extracts only the charsmap and discards sibling normalizers. Real Unigram models often have a richer chain (potion/bge-m3: `Sequence[Precompiled, Replace(punctuation spacing), Replace("\\s+"->" "), Strip]`), which `SentencePieceNormalizer` cannot reproduce, so `CreateFromTokenizerJson` silently yields different tokens than HF. Since the charsmap must run before those `Replace` steps, they can't just be reordered into SP; at minimum this should throw on unrecognized normalizer types rather than silently ignore them.
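One way to avoid the positional fallback described in the first bug is a name-only lookup that reports "absent" instead of guessing a position. The sketch below is illustrative only: `FindSpecialTokenIdByName` is a hypothetical helper, not the PR's `FindSpecialTokenId`, and the scores and the `▁the` piece are made up. Only the first three pieces follow the potion example above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not the PR's FindSpecialTokenId): match a special token by exact
// piece name and return -1 when it is absent, rather than falling back to a position.
static int FindSpecialTokenIdByName(IReadOnlyList<(string Piece, float Score)> pieces, string name)
{
    for (int i = 0; i < pieces.Count; i++)
    {
        if (string.Equals(pieces[i].Piece, name, StringComparison.Ordinal))
        {
            return i;
        }
    }

    return -1; // Absent: the caller decides what to do; no real piece is reclassified.
}

// Vocab head shaped like the potion example: no "<s>" or "</s>" pieces.
// Scores and the last piece are assumptions for illustration.
var potionLike = new List<(string Piece, float Score)>
{
    ("[PAD]", 0f), ("[UNK]", 0f), (",", -3.5f), ("▁the", -4.0f),
};

int bosId = FindSpecialTokenIdByName(potionLike, "<s>");
int eosId = FindSpecialTokenIdByName(potionLike, "</s>");
Console.WriteLine($"bos={bosId}, eos={eosId}"); // prints "bos=-1, eos=-1"
```

With this shape, `","` at index 2 stays an ordinary piece and index 1 remains the unknown token; how a model should behave when BOS/EOS are absent but `addBos`/`addEos` are requested is a separate design decision.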
Why the test stays green
Paraphrase-multilingual-MiniLM-L12-v2 names its specials (<s>=0, </s>=2) so the fallback never fires, and its normalizer is a single Precompiled, so the dropped-sibling path is never hit. Recommend adding fixtures that (a) place specials at non-conventional positions / omit <s>/</s>, and (b) use a Sequence normalizer with Replace/Strip, asserting against HF reference ids.
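The `Sequence` fixture suggested in (b) is also the case where a loader could fail loudly instead of dropping steps. A minimal sketch of that check with `System.Text.Json` follows; `ThrowOnUnsupportedNormalizer` is a hypothetical name, the JSON is an abbreviated stand-in for a real `normalizer` section, and treating only `Precompiled` as supported is an assumption taken from the review comment above.

```csharp
using System;
using System.IO;
using System.Text.Json;

// Hypothetical validation helper: accept a Precompiled normalizer, or a Sequence whose
// members are all acceptable; any other normalizer type throws instead of being ignored.
static void ThrowOnUnsupportedNormalizer(JsonElement normalizer)
{
    string type = string.Empty;
    if (normalizer.ValueKind == JsonValueKind.Object
        && normalizer.TryGetProperty("type", out JsonElement typeElement)
        && typeElement.ValueKind == JsonValueKind.String)
    {
        type = typeElement.GetString() ?? string.Empty;
    }

    if (type == "Precompiled")
    {
        return;
    }

    if (type == "Sequence"
        && normalizer.TryGetProperty("normalizers", out JsonElement items)
        && items.ValueKind == JsonValueKind.Array)
    {
        foreach (JsonElement item in items.EnumerateArray())
        {
            ThrowOnUnsupportedNormalizer(item); // Recurse into each step of the chain.
        }

        return;
    }

    throw new InvalidDataException($"Unsupported normalizer type '{type}'.");
}

// Abbreviated, illustrative normalizer: a Precompiled step followed by Strip.
const string SequenceJson =
    "{\"type\":\"Sequence\",\"normalizers\":[" +
    "{\"type\":\"Precompiled\",\"precompiled_charsmap\":\"\"}," +
    "{\"type\":\"Strip\",\"strip_left\":true,\"strip_right\":true}]}";

using (JsonDocument doc = JsonDocument.Parse(SequenceJson))
{
    try
    {
        ThrowOnUnsupportedNormalizer(doc.RootElement);
        Console.WriteLine("normalizer chain is supported");
    }
    catch (InvalidDataException ex)
    {
        Console.WriteLine(ex.Message);
    }
}
```

A fixture of this kind would then assert either the exception or, if the extra steps are implemented, parity with HF reference ids.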
Minor
- `removeExtraWhitespaces` is hard-coded `true` in both factories rather than derived from the JSON.
- `added_tokens` from the JSON aren't auto-wired; correctness depends on the caller passing `specialTokens`. Worth documenting or reading them.
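For the `added_tokens` point, reading the entries out of the JSON might look roughly like the sketch below. `ReadSpecialAddedTokens` is a hypothetical helper, the sample JSON is illustrative, and the assumption that `specialTokens` takes a string-to-id map is not confirmed by this PR.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper: collect the entries of "added_tokens" that are flagged "special"
// into a piece -> id map. Entries missing "content" or "id" are skipped.
static Dictionary<string, int> ReadSpecialAddedTokens(JsonElement root)
{
    var result = new Dictionary<string, int>(StringComparer.Ordinal);
    if (root.TryGetProperty("added_tokens", out JsonElement added)
        && added.ValueKind == JsonValueKind.Array)
    {
        foreach (JsonElement token in added.EnumerateArray())
        {
            bool isSpecial = token.TryGetProperty("special", out JsonElement special)
                && special.ValueKind == JsonValueKind.True;
            if (isSpecial
                && token.TryGetProperty("content", out JsonElement content)
                && content.ValueKind == JsonValueKind.String
                && token.TryGetProperty("id", out JsonElement id)
                && id.ValueKind == JsonValueKind.Number)
            {
                result[content.GetString() ?? string.Empty] = id.GetInt32();
            }
        }
    }

    return result;
}

// Illustrative fragment only; a real tokenizer.json carries more fields per entry.
const string Json =
    "{\"added_tokens\":[" +
    "{\"id\":0,\"content\":\"[PAD]\",\"special\":true}," +
    "{\"id\":1,\"content\":\"[UNK]\",\"special\":true}," +
    "{\"id\":5,\"content\":\"extra\",\"special\":false}]}";

Dictionary<string, int> specials;
using (JsonDocument doc = JsonDocument.Parse(Json))
{
    specials = ReadSpecialAddedTokens(doc.RootElement);
}

foreach (KeyValuePair<string, int> pair in specials)
{
    Console.WriteLine($"{pair.Key} -> {pair.Value}");
}
```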
(Posting as comments only — not an approval or change request.)
@copilot please address feedback
…end_scheme handling Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
@copilot address feedback
…tion, and add tests Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #7625 +/- ##
==========================================
+ Coverage 69.59% 69.61% +0.01%
==========================================
Files 1484 1484
Lines 273606 274013 +407
Branches 27949 27999 +50
==========================================
+ Hits 190410 190745 +335
- Misses 75832 75885 +53
- Partials 7364 7383 +19
Pull request overview
This PR extends SentencePieceTokenizer to support Hugging Face JSON-only Unigram tokenizers by adding new public factory APIs that can construct a Unigram tokenizer from either an in-memory vocab list or a tokenizer.json stream, avoiding the current requirement for a SentencePiece .model protobuf.
Changes:
- Add `SentencePieceTokenizer.Create(IEnumerable<(string Piece, float Score)> vocab, ...)` for constructing a Unigram tokenizer directly from a vocab list.
- Add `SentencePieceTokenizer.CreateFromTokenizerJson(Stream tokenizerJsonStream, ...)` for parsing HF `tokenizer.json` (Unigram) including vocab, `unk_id`, precompiled charsmap, and Metaspace settings.
- Add internal constructors/refactoring to build a `SentencePieceUnigramModel` from vocab pieces and config values, plus new tests covering these creation paths.
| File | Description |
|---|---|
| test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs | Adds unit tests for vocab-based and tokenizer.json-based Unigram construction and behavior parity checks. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs | Adds Unigram model constructors that build vocab/trie from (piece, score) inputs and detect special tokens by name. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs | Adds public factories for vocab and tokenizer.json, plus JSON parsing helpers for normalizer/pre-tokenizer extraction. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs | Adds a new base-model constructor taking explicit config/token IDs instead of ModelProto. |
Copilot's findings
- Files reviewed: 4/4 changed files
- Comments generated: 4
AddBeginningOfSentence = addBos;
AddEndOfSentence = addEos;
BeginningOfSentenceToken = bosToken;
BeginningOfSentenceId = Math.Max(0, bosId);
EndOfSentenceToken = eosToken;
EndOfSentenceId = Math.Max(0, eosId);
UnknownToken = unkToken;
UnknownId = Math.Max(0, unkId);
AddDummyPrefix = addDummyPrefix;

/// The beginning-of-sentence and end-of-sentence token IDs are auto-detected by looking for pieces
/// named <c>&lt;s&gt;</c> and <c>&lt;/s&gt;</c> in <paramref name="vocab"/>. If not found, positions 1 and 2
/// are used as fallbacks (the SentencePiece convention). Similarly, a <c>&lt;pad&gt;</c> piece is
/// detected automatically if present.

// Validate model type
if (!root.TryGetProperty("model", out JsonElement modelElement))
{
    throw new InvalidDataException("The tokenizer.json does not contain a 'model' property.");
}

if (modelElement.TryGetProperty("type", out JsonElement modelTypeElement) &&
    !string.Equals(modelTypeElement.GetString(), "Unigram", StringComparison.OrdinalIgnoreCase))
{
    throw new InvalidDataException($"Expected model type 'Unigram' but found '{modelTypeElement.GetString()}'.");
}

// Extract pre_tokenizer settings
bool escapeWhiteSpaces = true;
bool treatWhitespaceAsSuffix = false;
if (root.TryGetProperty("pre_tokenizer", out JsonElement preTokenizerElement))
{
    ExtractMetaspaceSettings(preTokenizerElement, ref addDummyPrefix, ref escapeWhiteSpaces, ref treatWhitespaceAsSuffix);
}

`SentencePieceTokenizer` only exposed `Create(Stream)` requiring a SentencePiece protobuf (`.model`), making it impossible to load Hugging Face JSON-only Unigram tokenizers that have no `.model` file.

New public APIs

From in-memory vocab:

From `tokenizer.json`:

`CreateFromTokenizerJson` reads `model.vocab`, `model.unk_id`, extracts `precompiled_charsmap` from a `Precompiled` or `Sequence` normalizer, and reads Metaspace pre-tokenizer settings (`add_prefix_space`, `replacement`, `prepend_scheme`). It validates `model.type == "Unigram"`.

Internal changes

- `SentencePieceBaseModel`: new constructor taking individual config parameters instead of `ModelProto`
- `SentencePieceUnigramModel`: new constructors building vocab from `IReadOnlyList<(string, float)>`; BOS/EOS/PAD IDs auto-detected by piece name (`<s>`, `</s>`, `<pad>`) with SentencePiece-conventional positional fallbacks

Note on token IDs

HF `tokenizer.json` typically uses a different special-token ordering than the SentencePiece protobuf (e.g. `<s>=0, <pad>=1, </s>=2, <unk>=3` vs. `<unk>=0, <s>=1, </s>=2`). Piece strings produced are identical; numeric IDs will differ by the vocab offset introduced by the extra special tokens.
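
To make the ID note concrete, the sketch below lines up the two orderings from the example and maps IDs by piece string. The special-token heads come from the note; the trailing `▁hello` piece and its positions are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Special-token heads taken from the note above; "▁hello" is a made-up regular piece.
string[] hfOrder = { "<s>", "<pad>", "</s>", "<unk>", "▁hello" };
string[] spOrder = { "<unk>", "<s>", "</s>", "▁hello" };

// Index the SentencePiece-style ordering by piece string.
var spIdByPiece = new Dictionary<string, int>(StringComparer.Ordinal);
for (int i = 0; i < spOrder.Length; i++)
{
    spIdByPiece[spOrder[i]] = i;
}

// Same piece strings, different numeric IDs; "<pad>" has no counterpart on the SP side.
for (int hfId = 0; hfId < hfOrder.Length; hfId++)
{
    string mapped = spIdByPiece.TryGetValue(hfOrder[hfId], out int spId)
        ? spId.ToString()
        : "(none)";
    Console.WriteLine($"{hfOrder[hfId]}: HF id {hfId} -> SP id {mapped}");
}
```

In this toy layout the regular piece shifts by one (HF id 4, SP id 3), which is the offset introduced by the extra `<pad>` entry.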