[API] Public way to construct SentencePieceTokenizer (Unigram) from tokenizer.json / in-memory pieces+scores
Summary
Microsoft.ML.Tokenizers fully implements the SentencePiece Unigram model
(SentencePieceUnigramModel), but the only public way to obtain a
SentencePieceTokenizer is SentencePieceTokenizer.Create(Stream), which parses a
SentencePiece protobuf (.model / ModelProto). There is no public API to build a
Unigram tokenizer from a Hugging Face tokenizer.json (or from in-memory pieces + scores).
Many modern HF models ship a JSON-only Unigram tokenizer (tokenizer.json with
model.type == "Unigram", model.vocab as [piece, score] pairs, model.unk_id, and a
normalizer/pre_tokenizer) and no sentencepiece.model/spiece.model/tokenizer.model
protobuf. For these models there is currently no supported way to construct the tokenizer.
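For reference, the fields named above can be pulled out of a tokenizer.json with a few lines of standard-library Python. This is a minimal sketch, not part of any library: `flatten` and `read_unigram_fields` are hypothetical helper names, the `normalizers` / `pretokenizers` child keys and the base64 encoding of `precompiled_charsmap` are assumptions about how Hugging Face serializes these components, and the Metaspace settings are returned uninterpreted because their key names differ between tokenizers versions.

```python
import base64
import json


def flatten(node, children_key):
    """Expand nested Sequence wrappers into a flat list of leaf components."""
    if node is None:
        return []
    if node.get("type") == "Sequence":
        leaves = []
        for child in node.get(children_key, []):
            leaves.extend(flatten(child, children_key))
        return leaves
    return [node]


def read_unigram_fields(tokenizer_json: str) -> dict:
    """Collect the inputs a Unigram tokenizer needs from tokenizer.json text."""
    doc = json.loads(tokenizer_json)
    model = doc["model"]
    if model.get("type") != "Unigram":
        raise ValueError("not a Unigram tokenizer.json")

    # model.vocab is a list of [piece, score] pairs; the list index is the token id.
    vocab = [(piece, float(score)) for piece, score in model["vocab"]]

    charsmap = b""
    for norm in flatten(doc.get("normalizer"), "normalizers"):
        if norm.get("type") == "Precompiled":
            # Assumption: the charsmap is stored as a base64 string.
            charsmap = base64.b64decode(norm["precompiled_charsmap"])

    metaspace = None
    for pre in flatten(doc.get("pre_tokenizer"), "pretokenizers"):
        if pre.get("type") == "Metaspace":
            metaspace = pre  # left uninterpreted on purpose

    return {
        "vocab": vocab,
        "unk_id": model.get("unk_id"),
        "charsmap": charsmap,
        "metaspace": metaspace,
    }
```

Everything a Unigram tokenizer needs is already present in this dictionary, which is why a JSON or in-memory factory would be sufficient for these models.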
Current behavior
- SentencePieceTokenizer's only constructor is
  internal SentencePieceTokenizer(ModelProto modelProto, ...).
- The Sentencepiece.* generated protobuf types (ModelProto, TrainerSpec,
  NormalizerSpec, …) are internal, so callers can't build a ModelProto directly.
- The public factory
  SentencePieceTokenizer.Create(Stream, bool addBeginOfSentence, bool addEndOfSentence, ...)
  requires a serialized SentencePiece protobuf stream.
- By contrast, BpeTokenizer / WordPieceTokenizer expose vocab-file/stream factories, so the
  asymmetry is Unigram-specific.
Request
A public factory to construct a Unigram SentencePieceTokenizer without a protobuf, e.g. one of:
- From tokenizer.json: SentencePieceTokenizer.CreateFromTokenizerJson(Stream json, ...)
  (parse model.vocab pieces+scores, model.unk_id, and the normalizer/pre_tokenizer
  precompiled charsmap + metaspace settings).
- From in-memory pieces+scores:
  Create(IEnumerable<(string Piece, float Score)> vocab, int unkId, ReadOnlySpan<byte> precompiledCharsMap, bool addDummyPrefix, bool escapeWhitespaces, ...).
Either would let callers load JSON-Unigram models that have no .model protobuf.
Workaround we're using
Since the only public entry is a protobuf stream, we synthesize a SentencePiece ModelProto
on the fly from tokenizer.json and feed the bytes to Create(Stream):
- pieces ← model.vocab [piece, score] (mapping special tokens to CONTROL/UNKNOWN types).
- trainer_spec.model_type = UNIGRAM, trainer_spec.unk_id = model.unk_id (+ bos/eos/pad ids).
- normalizer_spec.precompiled_charsmap ← the Precompiled normalizer's precompiled_charsmap
  bytes from tokenizer.json (gives byte-exact NFKC parity), plus
  add_dummy_prefix / escape_whitespaces from the Metaspace pre-tokenizer.
This works, but it requires hand-writing a SentencePiece protobuf encoder and re-deriving the
wire schema, which is exactly the kind of thing the library could expose directly. It's also
fragile across schema changes.
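To make the cost concrete, the synthesis described above comes down to something like the following, shown in stdlib-only Python because the byte layout is language-neutral. It is a minimal sketch under stated assumptions, not our actual workaround code and not a claim about what the library requires. The helper names are hypothetical, and the field numbers (pieces = 1, trainer_spec = 2, normalizer_spec = 3, model_type = 3, unk_id = 40, and so on) are as we understand them from sentencepiece_model.proto and should be checked against that schema. The bos/eos/pad ids are omitted, and a real model may need more fields than these.

```python
import struct

NORMAL, UNKNOWN, CONTROL = 1, 2, 3  # assumed SentencePiece.Type enum values
UNIGRAM = 1                         # assumed TrainerSpec.ModelType enum value


def varint(n: int) -> bytes:
    """Protobuf base-128 varint for a non-negative integer."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)
            return bytes(out)


def f_varint(field: int, value: int) -> bytes:
    # Wire type 0. Negative int32 values are sent as 64-bit two's complement.
    return varint(field << 3) + varint(value & 0xFFFFFFFFFFFFFFFF)


def f_bytes(field: int, payload: bytes) -> bytes:
    # Wire type 2: length-delimited (strings, bytes, nested messages).
    return varint((field << 3) | 2) + varint(len(payload)) + payload


def f_float(field: int, value: float) -> bytes:
    # Wire type 5: little-endian 32-bit float.
    return varint((field << 3) | 5) + struct.pack("<f", value)


def encode_piece(piece: str, score: float, piece_type: int) -> bytes:
    return (f_bytes(1, piece.encode("utf-8"))
            + f_float(2, score)
            + f_varint(3, piece_type))


def build_model_proto(vocab, unk_id, charsmap, add_dummy_prefix,
                      escape_whitespaces, control_ids=()):
    """Serialize pieces, trainer_spec and normalizer_spec into ModelProto bytes."""
    out = bytearray()
    for token_id, (piece, score) in enumerate(vocab):
        if token_id == unk_id:
            piece_type = UNKNOWN
        elif token_id in control_ids:
            piece_type = CONTROL
        else:
            piece_type = NORMAL
        out += f_bytes(1, encode_piece(piece, score, piece_type))
    trainer = f_varint(3, UNIGRAM) + f_varint(40, unk_id)
    out += f_bytes(2, trainer)
    normalizer = (f_bytes(2, charsmap)
                  + f_varint(3, int(add_dummy_prefix))
                  + f_varint(5, int(escape_whitespaces)))
    out += f_bytes(3, normalizer)
    return bytes(out)
```

Every constant in this sketch is a piece of the SentencePiece wire schema that the caller has to re-derive and keep in sync by hand, which is the fragility described above.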
Why it matters
JSON-only Unigram tokenizers are common (multilingual static-embedding models, several HF
encoder models). Without a JSON/in-memory factory, every consumer must either ship a converted
.model or reimplement the protobuf synthesis above.
Repro context
- Model: a potion-multilingual-128M-style static embedding model. It ships tokenizer.json only,
  with model.type == "Unigram", a ~500k-entry custom vocab, [PAD]=0 / [UNK]=1, a
  Sequence→Precompiled normalizer (precompiled_charsmap present), and a Metaspace
  pre-tokenizer. No sentencepiece.model.
- Microsoft.ML.Tokenizers main (commit 901da3e).