[draft] Add support for multi teacher OPD and memory-efficient topk level OPD#2033
Draft
hhnqqq wants to merge 2 commits into
Draft
[draft] Add support for multi teacher OPD and memory-efficient topk level OPD#2033hhnqqq wants to merge 2 commits into
hhnqqq wants to merge 2 commits into
Conversation
Top-k OPD needs a separate actor flow because it prepares old-policy top-k indices, teacher log-probs on those indices, and a top-k/tail loss before normal Megatron training. The normal train.py path now stays on the existing actor unless a dedicated entry registers the top-k actor subclass. Constraint: Teachers are assumed homogeneous with the student so top-k token ids are shared. Constraint: The top-k CP implementation follows the existing zigzag CP layout and rejects --allgather-cp. Rejected: Trigger top-k actor selection implicitly from generic train.py | user requested an explicit topkopd_train.py entry for the actor subclass. Confidence: medium Scope-risk: moderate Directive: Keep top-k actor selection explicit through topkopd_train.py unless generic train.py gains a documented actor plugin API. Tested: python3 -m py_compile on modified Python files Tested: bash -n examples/on_policy_distillation/run-qwen3-8B-topk-opd-megatron.sh Tested: git diff --check Not-tested: Distributed Megatron runtime with TP/CP GPUs and real teacher checkpoints Not-tested: --allgather-cp top-k OPD, intentionally rejected by argument validation
Top-k OPD computes its training signal from Megatron teacher top-k and tail distributions, so the example should not require a task-specific reward model just to exercise the distillation path. Add minimal zero-reward and placeholder-advantage helpers and wire the top-k example to them. Constraint: Keep the helper independent from self-OPD-specific reward, EMA, PRM, and mixed RL logic. Confidence: high Scope-risk: narrow Tested: PYTHONPYCACHEPREFIX=/tmp/slime-pycache python3 -m py_compile examples/on_policy_distillation/topk_opd_helpers.py examples/on_policy_distillation/topkopd_train.py slime/backends/megatron_utils/topk_opd_actor.py slime/backends/megatron_utils/loss.py slime/utils/arguments.py Tested: bash -n examples/on_policy_distillation/run-qwen3-8B-topk-opd-megatron.sh Tested: git diff --check Not-tested: Full distributed top-k OPD runtime with real teacher checkpoints
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Top-k OPD needs a separate actor flow because it prepares old-policy top-k indices, teacher log-probs on those indices, and a top-k/tail loss before normal Megatron training. The normal train.py path now stays on the existing actor unless a dedicated entry registers the top-k actor subclass.
Constraint: Teachers are assumed homogeneous with the student so top-k token ids are shared.
Constraint: The top-k CP implementation follows the existing zigzag CP layout and rejects --allgather-cp.
Rejected: Trigger top-k actor selection implicitly from generic train.py | user requested an explicit topkopd_train.py entry for the actor subclass.
Confidence: medium
Scope-risk: moderate
Directive: Keep top-k actor selection explicit through topkopd_train.py unless generic train.py gains a documented actor plugin API.
Tested: python3 -m py_compile on modified Python files
Tested: bash -n examples/on_policy_distillation/run-qwen3-8B-topk-opd-megatron.sh
Tested: git diff --check
Not-tested: Distributed Megatron runtime with TP/CP GPUs and real teacher checkpoints
Not-tested: --allgather-cp top-k OPD, intentionally rejected by argument validation