Conversation
@MarkusRabe I'm not 100% sure this is the root cause, but it's possible if your input data contains these special tokens. Could you also double-check which tokenizer was used to generate the pre-tokenized dataset? While unlikely, it's worth verifying just to be safe!
Interesting. I don't think it is a serious issue with the existing experiments.
I also don't see how the fix changes anything. I ran some experiments and it doesn't change the number of tokens mapped to `<unk>`.
MarkusRabe
left a comment
See other comment above.
I suggest we just document this behavior and don't change it.
Makes sense. I was trying to investigate the issue you mentioned last time, so I was looking at the code around tokenization. But I might be misunderstanding the problem itself. Is the issue fixed? If not, do you mind sharing it again here so I can make sure I understand it correctly? Also, what are the latest commands you use to run the experiments? I'd like to run something on my end while investigating to validate my assumptions.
Summary
Updated `get_default_tokenizer_vocab` in `language_model_dataloader.py` to use tiktoken's `decode_single_token_bytes` method. This ensures that special tokens (specifically FIM tokens like `<|fim_prefix|>`, `<|fim_middle|>`, etc.) are correctly decoded into their string representations instead of being blindly replaced with `<unk>`.

Issue
Before: The standard `decode()` method fails on special tokens, raising an exception. Our try/except block caught this and assigned the string `"<unk>"` to all such tokens.

Impact: In the SpellingBeeEmbedding, all distinct special tokens (IDs 100258-100260, etc.) were receiving the exact same character embedding (based on the characters "u", "n", "k"). This effectively blinded the model to the differences between these control tokens.
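The before/after decoding logic described above can be sketched as follows. This is an illustrative reconstruction, not the actual PR code: `TinyEncoding` is a hypothetical stand-in exposing the two tiktoken methods involved (so the sketch runs without tiktoken or network access), and `token_string` shows the fixed per-token path.

```python
class TinyEncoding:
    """Stand-in mimicking the relevant slice of tiktoken's Encoding API."""

    def __init__(self):
        # A couple of ordinary tokens plus two FIM special tokens, using
        # the cl100k_base ids mentioned in the PR description.
        self._bytes = {0: b"a", 1: b"b"}
        self._special = {100258: b"<|fim_prefix|>", 100259: b"<|fim_middle|>"}

    def decode(self, ids):
        # Models the failing path: only ordinary tokens are known here, so
        # special-token ids raise, which the old try/except turned into
        # the string "<unk>" for every special token.
        return b"".join(self._bytes[i] for i in ids).decode("utf-8")

    def decode_single_token_bytes(self, token_id):
        # Like tiktoken's method of the same name: returns the raw bytes
        # for special tokens as well as ordinary ones.
        if token_id in self._special:
            return self._special[token_id]
        return self._bytes[token_id]


def token_string(enc, token_id):
    """Fixed version: per-token byte decoding keeps special tokens distinct."""
    try:
        return enc.decode_single_token_bytes(token_id).decode(
            "utf-8", errors="replace"
        )
    except KeyError:
        # Only genuinely unknown ids fall back to "<unk>" now.
        return "<unk>"
```

With this change, `token_string(enc, 100258)` yields `"<|fim_prefix|>"` and `token_string(enc, 100259)` yields `"<|fim_middle|>"`, so the two FIM tokens no longer collapse onto the same "unk" character embedding.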