Conversation
@MarkusRabe I'm not 100% sure this is the root cause, but it's possible if your input data contains these special tokens. Could you also double-check which tokenizer was used to generate the pre-tokenized dataset? While unlikely, it's worth verifying just to be safe!
Interesting. I don't think it is a serious issue with the existing experiments.
I also don't see how the fix changes anything. I ran some experiments and it doesn't change the number of tokens mapped to `<unk>`.
MarkusRabe
left a comment
See other comment above.
I suggest we just document this behavior and don't change it.
Makes sense. I was trying to investigate the issue you mentioned last time, so I was looking at the code around tokenization. But I might be misunderstanding the problem itself. Is the issue fixed? If not, do you mind sharing it again here so I can make sure I understand it correctly? Also, what are the latest commands you use to run the experiments? I'd like to run something on my end while investigating to validate my assumptions.
Summary
Updated `get_default_tokenizer_vocab` in `language_model_dataloader.py` to use tiktoken's `decode_single_token_bytes` method. This ensures that special tokens (specifically FIM tokens like `<|fim_prefix|>`, `<|fim_middle|>`, etc.) are correctly decoded into their string representations instead of being blindly replaced with `<unk>`.

Issue
Before: The standard `decode()` method fails on special tokens, raising an exception. Our try/except block caught this and assigned the string `"<unk>"` to all such tokens.

Impact: In the SpellingBeeEmbedding, all distinct special tokens (IDs 100258-100260, etc.) were receiving the exact same character embedding (based on the characters "u", "n", "k"). This effectively blinded the model to the differences between these control tokens.
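The before/after decoding logic described above can be sketched as follows. This is an illustrative reconstruction, not the actual PR code: `TinyEncoding` is a hypothetical stand-in exposing the two tiktoken methods involved (so the sketch runs without tiktoken or network access), and `token_string` shows the fixed per-token path.

```python
class TinyEncoding:
    """Stand-in mimicking the relevant slice of tiktoken's Encoding API."""

    def __init__(self):
        # A couple of ordinary tokens plus two FIM special tokens, using
        # the cl100k_base ids mentioned in the PR description.
        self._bytes = {0: b"a", 1: b"b"}
        self._special = {100258: b"<|fim_prefix|>", 100259: b"<|fim_middle|>"}

    def decode(self, ids):
        # Models the failing path: only ordinary tokens are known here, so
        # special-token ids raise, which the old try/except turned into
        # the string "<unk>" for every special token.
        return b"".join(self._bytes[i] for i in ids).decode("utf-8")

    def decode_single_token_bytes(self, token_id):
        # Like tiktoken's method of the same name: returns the raw bytes
        # for special tokens as well as ordinary ones.
        if token_id in self._special:
            return self._special[token_id]
        return self._bytes[token_id]


def token_string(enc, token_id):
    """Fixed version: per-token byte decoding keeps special tokens distinct."""
    try:
        return enc.decode_single_token_bytes(token_id).decode(
            "utf-8", errors="replace"
        )
    except KeyError:
        # Only genuinely unknown ids fall back to "<unk>" now.
        return "<unk>"
```

With this change, `token_string(enc, 100258)` yields `"<|fim_prefix|>"` and `token_string(enc, 100259)` yields `"<|fim_middle|>"`, so the two FIM tokens no longer collapse onto the same "unk" character embedding.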