[draft] Add support for multi teacher OPD and memory-efficient topk level OPD#2033

Draft
hhnqqq wants to merge 2 commits into THUDM:main from hhnqqq:topk-level-opd

hhnqqq commented Jun 8, 2026


Top-k OPD needs a separate actor flow because it prepares old-policy top-k indices, teacher log-probs on those indices, and a top-k/tail loss before normal Megatron training. The normal train.py path now stays on the existing actor unless a dedicated entry registers the top-k actor subclass.
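
As a rough illustration of the top-k/tail idea described above, here is a minimal pure-Python sketch for a single token position. It is not the code in this PR: the function names, the reverse-KL direction, and the single "tail" bucket holding all probability mass outside the top-k are assumptions made for illustration. The real implementation operates on Megatron tensors under TP/CP.

```python
import math


def topk_indices(logprobs, k):
    """Indices of the k largest log-probs, highest first (old-policy top-k)."""
    order = sorted(range(len(logprobs)), key=lambda i: logprobs[i], reverse=True)
    return order[:k]


def tail_logprob(selected_logprobs, eps=1e-12):
    """Log of the probability mass that lies outside the selected tokens."""
    head_mass = sum(math.exp(lp) for lp in selected_logprobs)
    return math.log(max(1.0 - head_mass, eps))


def topk_tail_kl(student_logprobs, teacher_topk_logprobs, indices):
    """Assumed loss: KL(student || teacher) over k top-k buckets plus one tail bucket.

    student_logprobs:      full-vocab log-probs of the student at one position
    teacher_topk_logprobs: teacher log-probs gathered on `indices` only
    """
    s = [student_logprobs[i] for i in indices]
    t = list(teacher_topk_logprobs)
    # Collapse everything outside the top-k into a single tail bucket.
    s.append(tail_logprob(s))
    t.append(tail_logprob(t))
    return sum(math.exp(si) * (si - ti) for si, ti in zip(s, t))


student = [math.log(p) for p in (0.5, 0.3, 0.1, 0.1)]
idx = topk_indices(student, k=2)
teacher_on_idx = [math.log(0.4), math.log(0.4)]
loss = topk_tail_kl(student, teacher_on_idx, idx)
```

Because only k teacher log-probs plus one tail value are kept per position, the teacher signal scales with k rather than with the vocabulary size, which is the memory saving the PR title refers to.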

Constraint: Teachers are assumed homogeneous with the student so top-k token ids are shared.

Constraint: The top-k CP implementation follows the existing zigzag CP layout and rejects --allgather-cp.

Rejected: Trigger top-k actor selection implicitly from generic train.py | user requested an explicit topkopd_train.py entry for the actor subclass.

Confidence: medium

Scope-risk: moderate

Directive: Keep top-k actor selection explicit through topkopd_train.py unless generic train.py gains a documented actor plugin API.

Tested: python3 -m py_compile on modified Python files

Tested: bash -n examples/on_policy_distillation/run-qwen3-8B-topk-opd-megatron.sh

Tested: git diff --check

Not-tested: Distributed Megatron runtime with TP/CP GPUs and real teacher checkpoints

Not-tested: --allgather-cp top-k OPD, intentionally rejected by argument validation
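
For readers unfamiliar with the validation mentioned above, a check of this kind could look roughly like the sketch below. The function name and error message are hypothetical and not taken from `slime/utils/arguments.py`. The only facts relied on are that the flag is `--allgather-cp` and that argparse would expose it as `args.allgather_cp`.

```python
from types import SimpleNamespace


def reject_allgather_cp(args):
    """Hypothetical check, meant to run only on the top-k OPD path."""
    # argparse maps the flag --allgather-cp to the attribute allgather_cp.
    if getattr(args, "allgather_cp", False):
        raise ValueError(
            "top-k OPD follows the zigzag CP layout; --allgather-cp is not supported"
        )
    return args


# Stand-in for parsed arguments; a real run would pass the argparse namespace.
ok_args = reject_allgather_cp(SimpleNamespace(allgather_cp=False))
```

Failing at argument-parsing time keeps the unsupported combination from reaching distributed initialization, where the error would be harder to diagnose.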

Top-k OPD computes its training signal from Megatron teacher top-k and tail distributions, so the example should not require a task-specific reward model just to exercise the distillation path. Add minimal zero-reward and placeholder-advantage helpers and wire the top-k example to them.
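
A minimal sketch of what such helpers might look like is shown here. The names and signatures are hypothetical and do not reflect the actual contents of `topk_opd_helpers.py`. The sketch assumes one scalar reward per sample and one advantage value per response token.

```python
def zero_rewards(num_samples):
    """Hypothetical helper: one zero reward per sample, so no reward model is needed."""
    return [0.0] * num_samples


def placeholder_advantages(response_lengths):
    """Hypothetical helper: per-token zero advantages with the expected shapes."""
    return [[0.0] * length for length in response_lengths]


rewards = zero_rewards(3)
advantages = placeholder_advantages([2, 4, 1])
```

With both quantities fixed at zero, any policy-gradient term that consumes them contributes nothing, leaving the distillation loss as the only training signal.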

Constraint: Keep the helper independent from self-OPD-specific reward, EMA, PRM, and mixed RL logic.

Confidence: high

Scope-risk: narrow

Tested: PYTHONPYCACHEPREFIX=/tmp/slime-pycache python3 -m py_compile examples/on_policy_distillation/topk_opd_helpers.py examples/on_policy_distillation/topkopd_train.py slime/backends/megatron_utils/topk_opd_actor.py slime/backends/megatron_utils/loss.py slime/utils/arguments.py

Tested: bash -n examples/on_policy_distillation/run-qwen3-8B-topk-opd-megatron.sh

Tested: git diff --check

Not-tested: Full distributed top-k OPD runtime with real teacher checkpoints

hhnqqq changed the title from "[draft] Add support for multi teacher opd and topk level opd" to "[draft] Add support for multi teacher OPD and memory-efficient topk level OPD" on Jun 8, 2026