Support OPD when teacher tokenization differs by hhnqqq · Pull Request #2032 · THUDM/slime

hhnqqq · 2026-06-08T10:04:55Z

SGLang OPD previously assumed that teacher and student shared token IDs, so it could send student token IDs directly to the teacher server and trim returned logprobs by student response length. Cross-vocabulary distillation needs the teacher to render the original chat messages with its own chat template, then align teacher response logprobs back to student response positions only where token boundaries match exactly.

This adds explicit cross-vocabulary hooks, preserves raw prompt messages in dataset metadata when requested, and documents the SGLang configuration. Non-aligned response positions keep the student rollout logprob so the OPD delta is zero there.

Constraint: Official Sample.prompt may already be a student-rendered string after --apply-chat-template, so raw messages are preserved through an opt-in metadata key instead of relying on local origin_prompt fields.
Rejected: Send student token IDs to cross-vocabulary teachers | token IDs are tokenizer-local and cannot represent the same prompt across vocabularies.
Rejected: Apply teacher logprobs to every decoded-text span | many-to-one token spans do not map to a single student logprob position cleanly.
Confidence: medium
Scope-risk: moderate
Directive: Do not replace the metadata messages path with origin_prompt-style private fields; official datasets need an explicit portable way to carry raw messages.
Tested: python -m py_compile on modified Python files; manual execution of tests/test_on_policy_distillation.py test functions; git diff --check.
Not-tested: Full pytest suite unavailable because pytest is not installed; parser smoke blocked by missing sglang_router dependency; live SGLang teacher request not exercised.

SGLang OPD previously assumed that teacher and student shared token IDs, so it could send student token IDs directly to the teacher server and trim returned logprobs by student response length. Cross-vocabulary distillation needs the teacher to render the original chat messages with its own chat template, then align teacher response logprobs back to student response positions only where token boundaries match exactly. This adds explicit cross-vocabulary hooks, preserves raw prompt messages in dataset metadata when requested, documents the SGLang configuration, and includes a conservative 8xH200 example using a Qwen3-8B student with a Qwen3.5-35B-A3B teacher. Non-aligned response positions keep the student rollout logprob so the OPD delta is zero there. The test file also includes an opt-in live SGLang teacher API check guarded by environment variables. Constraint: Official Sample.prompt may already be a student-rendered string after --apply-chat-template, so raw messages are preserved through an opt-in metadata key instead of relying on local origin_prompt fields. Constraint: The example reserves GPU 6,7 for the teacher and exposes only GPU 0-5 to slime/Ray so colocated rollout and Megatron engines do not contend with the teacher server. Rejected: Send student token IDs to cross-vocabulary teachers | token IDs are tokenizer-local and cannot represent the same prompt across vocabularies. Rejected: Apply teacher logprobs to every decoded-text span | many-to-one token spans do not map to a single student logprob position cleanly. Confidence: medium Scope-risk: moderate Directive: Do not replace the metadata messages path with origin_prompt-style private fields; official datasets need an explicit portable way to carry raw messages. Tested: bash -n examples/on_policy_distillation/run-qwen3-8B-qwen3.5-35B-A3B-cross-vocab-opd.sh; python -m py_compile on modified Python files; manual production-function checks for render/dataset/alignment/postprocess; git diff --check. Not-tested: Full pytest suite unavailable because pytest is not installed locally; parser smoke blocked by missing sglang_router dependency; live SGLang teacher API test and example script not run locally because no teacher endpoint was provided.

hhnqqq · 2026-06-08T11:53:31Z

An example of distillating a Qwen3-30B-A3B-based model to Qwen3.5-35B-A3B.

hhnqqq force-pushed the add-cross-vocab-opd branch from 2b01c5b to 1fb0c44 Compare June 8, 2026 11:30

hhnqqq force-pushed the add-cross-vocab-opd branch from 1fb0c44 to 76fde4f Compare June 8, 2026 11:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support OPD when teacher tokenization differs#2032

Support OPD when teacher tokenization differs#2032
hhnqqq wants to merge 1 commit into
THUDM:mainfrom
hhnqqq:add-cross-vocab-opd

hhnqqq commented Jun 8, 2026

Uh oh!

hhnqqq commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hhnqqq commented Jun 8, 2026

Uh oh!

hhnqqq commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant