Support OPD when teacher tokenization differs #2032
Open
hhnqqq wants to merge 1 commit into
SGLang OPD previously assumed that teacher and student shared token IDs, so it could send student token IDs directly to the teacher server and trim returned logprobs by student response length. Cross-vocabulary distillation needs the teacher to render the original chat messages with its own chat template, then align teacher response logprobs back to student response positions only where token boundaries match exactly.
This adds explicit cross-vocabulary hooks, preserves raw prompt messages in dataset metadata when requested, documents the SGLang configuration, and includes a conservative 8xH200 example using a Qwen3-8B student with a Qwen3.5-35B-A3B teacher. Non-aligned response positions keep the student rollout logprob so the OPD delta is zero there. The test file also includes an opt-in live SGLang teacher API check guarded by environment variables.
Constraint: Official Sample.prompt may already be a student-rendered string after --apply-chat-template, so raw messages are preserved through an opt-in metadata key instead of relying on local origin_prompt fields.
Constraint: The example reserves GPU 6,7 for the teacher and exposes only GPU 0-5 to slime/Ray so colocated rollout and Megatron engines do not contend with the teacher server.
Rejected: Send student token IDs to cross-vocabulary teachers | token IDs are tokenizer-local and cannot represent the same prompt across vocabularies.
Rejected: Apply teacher logprobs to every decoded-text span | many-to-one token spans do not map to a single student logprob position cleanly.
Confidence: medium
Scope-risk: moderate
Directive: Do not replace the metadata messages path with origin_prompt-style private fields; official datasets need an explicit portable way to carry raw messages.
Tested: bash -n examples/on_policy_distillation/run-qwen3-8B-qwen3.5-35B-A3B-cross-vocab-opd.sh; python -m py_compile on modified Python files; manual production-function checks for render/dataset/alignment/postprocess; git diff --check.
Not-tested: Full pytest suite unavailable because pytest is not installed locally; parser smoke blocked by missing sglang_router dependency; live SGLang teacher API test and example script not run locally because no teacher endpoint was provided.

SGLang OPD previously assumed that teacher and student shared token IDs, so it could send student token IDs directly to the teacher server and trim returned logprobs by student response length. Cross-vocabulary distillation needs the teacher to render the original chat messages with its own chat template, then align teacher response logprobs back to student response positions only where token boundaries match exactly.
This adds explicit cross-vocabulary hooks, preserves raw prompt messages in dataset metadata when requested, and documents the SGLang configuration. Non-aligned response positions keep the student rollout logprob so the OPD delta is zero there.
Constraint: Official Sample.prompt may already be a student-rendered string after --apply-chat-template, so raw messages are preserved through an opt-in metadata key instead of relying on local origin_prompt fields.
Rejected: Send student token IDs to cross-vocabulary teachers | token IDs are tokenizer-local and cannot represent the same prompt across vocabularies.
Rejected: Apply teacher logprobs to every decoded-text span | many-to-one token spans do not map to a single student logprob position cleanly.
Confidence: medium
Scope-risk: moderate
Directive: Do not replace the metadata messages path with origin_prompt-style private fields; official datasets need an explicit portable way to carry raw messages.
Tested: python -m py_compile on modified Python files; manual execution of tests/test_on_policy_distillation.py test functions; git diff --check.
Not-tested: Full pytest suite unavailable because pytest is not installed; parser smoke blocked by missing sglang_router dependency; live SGLang teacher request not exercised.
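
For reviewers, the exact-boundary alignment rule described above can be illustrated with a minimal sketch. This is not the code in this PR: the function names (`char_spans`, `align_teacher_logprobs`) are hypothetical, and the sketch assumes each token can be decoded to a text piece on its own and that both tokenizations concatenate to the same response text. A teacher logprob is used only where a student token covers exactly the same character span as a teacher token; every other position keeps the student rollout logprob, so the delta there is zero.

```python
from itertools import accumulate


def char_spans(pieces):
    """Return (start, end) character offsets for each decoded token piece."""
    ends = list(accumulate(len(p) for p in pieces))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def align_teacher_logprobs(student_pieces, student_logprobs,
                           teacher_pieces, teacher_logprobs):
    """Hypothetical helper: map teacher logprobs onto student positions.

    Returns (aligned_logprobs, mask). mask[i] is True only where student
    token i shares exact character boundaries with a teacher token.
    """
    if "".join(student_pieces) != "".join(teacher_pieces):
        raise ValueError("student and teacher responses decode to different text")

    # Index teacher logprobs by span; zero-length pieces are skipped because
    # their spans are ambiguous.
    teacher_by_span = {
        span: lp
        for span, lp in zip(char_spans(teacher_pieces), teacher_logprobs)
        if span[0] != span[1]
    }

    aligned, mask = [], []
    for span, student_lp in zip(char_spans(student_pieces), student_logprobs):
        if span in teacher_by_span:
            aligned.append(teacher_by_span[span])
            mask.append(True)
        else:
            # Fallback: keep the student rollout logprob, so the delta is zero.
            aligned.append(student_lp)
            mask.append(False)
    return aligned, mask


# Student splits "Hello" into two tokens; the teacher uses one.
student_pieces = ["Hel", "lo", " world"]
student_logprobs = [-0.5, -0.2, -1.0]
teacher_pieces = ["Hello", " world"]
teacher_logprobs = [-0.3, -0.7]

aligned, mask = align_teacher_logprobs(
    student_pieces, student_logprobs, teacher_pieces, teacher_logprobs
)
deltas = [t - s for t, s in zip(aligned, student_logprobs)]
```

In this toy input only `" world"` has matching boundaries, so it is the only position that receives a teacher logprob. The many-to-one span `"Hel" + "lo"` versus `"Hello"` is left untouched, which matches the rejected alternative noted above.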