Fix: Correctly decode special tokens #14

Open
zherendong wants to merge 1 commit into main from zheren/tokenization-debugging

Conversation

@zherendong
Collaborator

Summary

Updated get_default_tokenizer_vocab in language_model_dataloader.py to use tiktoken's decode_single_token_bytes method. This ensures that special tokens (specifically FIM tokens like <|fim_prefix|>, <|fim_middle|>, etc.) are correctly decoded into their string representations instead of being blindly replaced with <unk>.

Issue

Before: The standard decode() method fails on special tokens, raising an exception. Our try/except block caught this and assigned the string "<unk>" to all such tokens.

Impact: In the SpellingBeeEmbedding, all distinct special tokens (IDs 100258-100260, etc.) were receiving the exact same character embedding (based on the characters "u", "n", "k"). This effectively blinded the model to the differences between these control tokens.
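The change described above can be sketched as follows. This is an illustrative reconstruction, not the actual code in `language_model_dataloader.py`; the function name `build_vocab_strings` and the stub-style interface are assumptions. The key point is that tiktoken's `Encoding.decode_single_token_bytes` resolves special tokens such as `<|fim_prefix|>` to their byte strings, so they no longer all collapse to `"<unk>"`:

```python
# Sketch of the fixed vocab-building logic (names are illustrative,
# not the actual code in language_model_dataloader.py).

def build_vocab_strings(enc):
    """Map every token id to a printable string.

    `enc` is any tiktoken-style Encoding exposing `n_vocab` and
    `decode_single_token_bytes`. Special tokens like <|fim_prefix|>
    are resolved to their literal strings instead of "<unk>".
    """
    vocab = {}
    for token_id in range(enc.n_vocab):
        try:
            token_bytes = enc.decode_single_token_bytes(token_id)
            vocab[token_id] = token_bytes.decode("utf-8", errors="replace")
        except KeyError:
            # Ids with no entry in the vocabulary (unassigned ranges).
            vocab[token_id] = "<unk>"
    return vocab


# With tiktoken installed, usage would look like:
#   enc = tiktoken.get_encoding("cl100k_base")
#   vocab = build_vocab_strings(enc)
#   vocab[100258] is then "<|fim_prefix|>" rather than "<unk>"
```

Note the `errors="replace"` when converting bytes to a string: many BPE tokens are partial UTF-8 sequences, so a strict decode would raise on ordinary (non-special) tokens as well.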


@zherendong
Collaborator Author

@MarkusRabe I'm not 100% sure this is the root cause, but it could be if your input data contains these special tokens.

Could you also double-check which tokenizer was used to generate the pre-tokenized dataset? While unlikely, it's worth verifying just to be safe!

Owner

Interesting. I don't think it is a serious issue with the existing experiments.

  • We tokenize the dataset online for each training run. There is no pre-tokenization.
  • We only use this method in the spelling bee embeddings.
  • If the model doesn't know how to spell the special tokens, that's not an issue.

I also don't see how the fix changes anything. I ran some experiments and it doesn't change the number of tokens mapped to `<unk>`.

Owner

@MarkusRabe MarkusRabe left a comment


See other comment above.

I suggest we just document this behavior and don't change it.

@zherendong
Collaborator Author

zherendong commented Dec 26, 2025

> Interesting. I don't think it is a serious issue with the existing experiments.
>
> • We tokenize the dataset online for each training run. There is no pre-tokenization.
> • We only use this method in the spelling bee embeddings.
> • If the model doesn't know how to spell the special tokens, that's not an issue.
>
> I also don't see how the fix changes anything. I ran some experiments and it doesn't change the number of tokens mapped to `<unk>`.

Makes sense.

I was trying to investigate the issue you mentioned last time, so I was looking at the code around tokenization. But I might be misunderstanding the problem itself.

Is the issue fixed? If not, do you mind restating it here so I can make sure I understand it correctly? Also, what are the latest commands you use to run the experiments? I'd like to run something on my end while investigating to validate my assumptions.
