Add LoRA co-training support for HF EAGLE speculative decoding #1060
yeyu-nvidia wants to merge 12 commits into main
Walkthrough
Adds configurable PEFT LoRA support for the EAGLE base model: new config fields, runtime LoRA adapter injection/toggling, a KL-based preservation loss, export of LoRA artifacts, dependency addition, and unit tests covering injection, training constraints, loss, incompatibility, and export.
Sequence Diagram(s)
sequenceDiagram
participant Trainer
participant HFEagleModel
participant BaseModel
participant LoRAAdapter
participant Exporter
Trainer->>HFEagleModel: init(config with eagle_base_lora)
HFEagleModel->>LoRAAdapter: _inject_base_lora() (create LoraConfig, inject adapters)
LoRAAdapter-->>HFEagleModel: adapters installed, LoRA params unfrozen
Trainer->>HFEagleModel: training step
HFEagleModel->>BaseModel: _set_base_lora_enabled(False)
HFEagleModel->>BaseModel: forward -> ref_logits
BaseModel-->>HFEagleModel: ref_logits
HFEagleModel->>BaseModel: _set_base_lora_enabled(True)
HFEagleModel->>BaseModel: forward -> lora_logits, hidden_states
BaseModel-->>HFEagleModel: lora_logits, hidden_states
HFEagleModel->>HFEagleModel: _preservation_loss(ref_logits, lora_logits)
HFEagleModel-->>Trainer: combined loss
Trainer->>Exporter: export request
Exporter->>LoRAAdapter: export weights -> `lora_adapter_model.safetensors`
Exporter->>Exporter: write `lora_adapter_config.json`
Exporter-->>Trainer: exported artifacts
Estimated code review effort: 3 (Moderate), ~20 minutes. Pre-merge checks: 4 passed.
Actionable comments posted: 3
🧹 Nitpick comments (3)
modelopt/torch/speculative/config.py (1)
122-145: Strengthen LoRA config typing and value bounds.
Consider constraining invalid user input at config parse time (e.g., rank <= 0, negative preservation weight) and avoid mutable list defaults.
Proposed patch
    -    eagle_base_lora_rank: int = ModeloptField(
    +    eagle_base_lora_rank: int = ModeloptField(
             default=64,
    +        ge=1,
             description="LoRA rank for the base model adapters.",
         )
         eagle_base_lora_alpha: float = ModeloptField(
             default=16.0,
    +        gt=0.0,
             description="LoRA alpha (scaling) for the base model adapters.",
         )
    -    eagle_base_lora_target_modules: list = ModeloptField(
    -        default=[],
    +    eagle_base_lora_target_modules: tuple[str, ...] = ModeloptField(
    +        default=(),
             description=(
                 "List of module name patterns to apply LoRA to in the base model "
                 "(e.g. ['q_proj', 'v_proj']). Empty list uses peft defaults."
             ),
         )
         eagle_base_lora_preservation_loss_weight: float = ModeloptField(
             default=0.1,
    +        ge=0.0,
             description=(
                 "Weight for the preservation loss that minimizes the KL divergence between "
                 "the LoRA-adapted base model output and the original base model output."
             ),
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/speculative/config.py` around lines 122 - 145, The config fields eagle_base_lora_rank, eagle_base_lora_alpha, eagle_base_lora_target_modules, and eagle_base_lora_preservation_loss_weight use permissive types and a mutable list default; update their ModeloptField definitions to enforce proper typing and validate bounds at parse/validation time: require eagle_base_lora_rank to be an int > 0, eagle_base_lora_alpha to be a float >= 0, eagle_base_lora_preservation_loss_weight to be a float >= 0 (reject negatives), and replace the mutable default for eagle_base_lora_target_modules with an immutable default (e.g., None or tuple) and coerce/validate it into a list of strings; implement these checks using the config validation hook or the ModeloptField's validator callbacks so invalid inputs raise a clear parsing error.
tests/unit/torch/speculative/plugins/test_hf_speculative_lora.py (1)
93-100: Strengthen export test by validating LoRA config contents, not just file existence.
Existence checks can pass with malformed config. Assert expected `r`, `lora_alpha`, and `target_modules` in `lora_adapter_config.json`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/torch/speculative/plugins/test_hf_speculative_lora.py` around lines 93 - 100, In test_export_lora_artifacts, after exporting via lora_eagle_model.get_exporter().export(export_dir), open and parse export_dir / "lora_adapter_config.json" as JSON and assert the config contains the keys "r", "lora_alpha", and "target_modules"; further validate that "r" and "lora_alpha" are positive integers and that "target_modules" is a non-empty list of strings (or matches the expected module names for this model), so the test checks semantic correctness not just file existence.
modelopt/torch/speculative/plugins/transformers.py (1)
581-582: Prefer `F.kl_div` for preservation loss clarity/stability.
Current expression is a manual cross-entropy form; `F.kl_div` makes intent explicit and is less error-prone to maintain.
Proposed patch
    +import torch.nn.functional as F
     ...
    -        loss = nn.Softmax(dim=-1)(ref_logits.detach()) * nn.LogSoftmax(dim=-1)(lora_logits)
    -        return -loss.sum(dim=-1).mean() * self.eagle_base_lora_preservation_loss_weight
    +        ref_prob = F.softmax(ref_logits.detach(), dim=-1)
    +        lora_log_prob = F.log_softmax(lora_logits, dim=-1)
    +        kl = F.kl_div(lora_log_prob, ref_prob, reduction="batchmean")
    +        return kl * self.eagle_base_lora_preservation_loss_weight
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/speculative/plugins/transformers.py` around lines 581 - 582, Replace the manual cross-entropy-like expression with torch.nn.functional.kl_div to make intent explicit and numerically stable: compute log-probs from lora_logits with F.log_softmax, compute target probs from ref_logits.detach() with F.softmax, call F.kl_div(log_probs, target_probs, reduction='none'), sum over the last dim, take the mean, and multiply by self.eagle_base_lora_preservation_loss_weight (no leading negative). Update the expression that currently uses nn.Softmax/nn.LogSoftmax and returns -loss.sum(...).mean()*self.eagle_base_lora_preservation_loss_weight to the F.kl_div-based sequence using the same dimensions and detachment of ref_logits.
📒 Files selected for processing (5)
examples/speculative_decoding/requirements.txt
modelopt/torch/export/plugins/hf_spec_export.py
modelopt/torch/speculative/config.py
modelopt/torch/speculative/plugins/transformers.py
tests/unit/torch/speculative/plugins/test_hf_speculative_lora.py
    lora_sd = {k: v for k, v in full_sd.items() if "lora_A" in k or "lora_B" in k}
    save_file(lora_sd, export_dir / "lora_adapter_model.safetensors")
Fail fast when no LoRA tensors are found during export.
If LoRA injection/reg-key filtering regresses, this currently emits an empty adapter file and still reports success. Add an explicit guard.
Proposed patch
     def _export_lora(self, export_dir: Path, full_sd: dict):
         """Export base model LoRA adapter weights alongside the eagle module artifacts."""
         lora_sd = {k: v for k, v in full_sd.items() if "lora_A" in k or "lora_B" in k}
    +    if not lora_sd:
    +        raise RuntimeError("No LoRA adapter tensors found in state_dict; refusing empty export.")
         save_file(lora_sd, export_dir / "lora_adapter_model.safetensors")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/export/plugins/hf_spec_export.py` around lines 191 - 193, The
export currently builds lora_sd = {k: v for k, v in full_sd.items() if "lora_A"
in k or "lora_B" in k} and calls save_file(...) even if lora_sd is empty; add a
guard after constructing lora_sd in the hf_spec_export export routine to fail
fast: check if lora_sd is empty and if so, raise a clear exception (or call
processLogger.error and raise RuntimeError) indicating no LoRA tensors found
instead of writing an empty file; reference the lora_sd variable, full_sd
source, save_file call and export_dir/"lora_adapter_model.safetensors" target so
the change is applied in the right spot.
    if self.eagle_base_lora:
        assert not self.eagle_offline, "eagle_base_lora is incompatible with eagle_offline=True"
        self._inject_base_lora()
Do not use assert for runtime config validation.
assert can be optimized out; use an explicit exception so the incompatibility check always executes.
Proposed patch
     # Inject HF PEFT LoRA adapters into the base model for co-training
     if self.eagle_base_lora:
    -    assert not self.eagle_offline, "eagle_base_lora is incompatible with eagle_offline=True"
    +    if self.eagle_offline:
    +        raise ValueError("eagle_base_lora is incompatible with eagle_offline=True")
         self._inject_base_lora()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/speculative/plugins/transformers.py` around lines 648 - 650,
The code currently uses an assert to enforce that eagle_base_lora and
eagle_offline are not both set (in the block that calls
self._inject_base_lora()); replace the assert with an explicit runtime exception
(e.g., raise ValueError or RuntimeError) so the check always runs in production.
Locate the conditional that checks self.eagle_base_lora and the incompatible
flag self.eagle_offline, and throw a clear exception with a descriptive message
instead of using assert before calling self._inject_base_lora().
    if self.eagle_base_lora:
        self._set_base_lora_enabled(False)
        ref_logits = _run_forward(no_grad=True).logits
        if hasattr(self, "_aux_hidden_states"):
            self._aux_hidden_states.clear()
        self._set_base_lora_enabled(True)
Restore LoRA adapter state with try/finally around the reference forward.
If the reference forward raises, adapters stay disabled and subsequent training behavior becomes incorrect.
Proposed patch
     ref_logits = None
     if self.eagle_base_lora:
    -    self._set_base_lora_enabled(False)
    -    ref_logits = _run_forward(no_grad=True).logits
    -    if hasattr(self, "_aux_hidden_states"):
    -        self._aux_hidden_states.clear()
    -    self._set_base_lora_enabled(True)
    +    self._set_base_lora_enabled(False)
    +    try:
    +        ref_logits = _run_forward(no_grad=True).logits
    +        if hasattr(self, "_aux_hidden_states"):
    +            self._aux_hidden_states.clear()
    +    finally:
    +        self._set_base_lora_enabled(True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/speculative/plugins/transformers.py` around lines 812 - 818,
The block that disables LoRA adapters using self._set_base_lora_enabled(False)
before calling _run_forward can leave adapters disabled if _run_forward raises;
wrap the reference forward in a try/finally so that
self._set_base_lora_enabled(True) always executes, still clearing
self._aux_hidden_states when present and returning/using ref_logits from
_run_forward; specifically, call _set_base_lora_enabled(False), run ref_logits =
_run_forward(no_grad=True).logits inside try, then in finally re-enable via
_set_base_lora_enabled(True) and clear self._aux_hidden_states if present.
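The try/finally suggestion above can also be packaged as a context manager. The following is a minimal, framework-free sketch of that pattern: `ToyModel`, `set_lora_enabled`, and `lora_disabled` are hypothetical stand-ins for illustration, not names from the PR or from PEFT.

```python
from contextlib import contextmanager


class ToyModel:
    """Hypothetical stand-in for a model whose adapters can be toggled."""

    def __init__(self) -> None:
        self.lora_enabled = True

    def set_lora_enabled(self, enabled: bool) -> None:
        self.lora_enabled = enabled


@contextmanager
def lora_disabled(model: ToyModel):
    """Disable adapters for the duration of the block, then re-enable them.

    The finally clause runs even if the body raises, so the adapters
    cannot be left disabled by a failed reference forward.
    """
    model.set_lora_enabled(False)
    try:
        yield model
    finally:
        model.set_lora_enabled(True)


model = ToyModel()
try:
    with lora_disabled(model):
        # Simulate a reference forward pass that fails midway.
        raise RuntimeError("reference forward failed")
except RuntimeError:
    pass
```

As in the proposed patch, this sketch unconditionally re-enables the adapters rather than restoring whatever state was in effect before the block.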
What would the base model quality and AL look like with this LoRA co-training?
Haven't tested yet. Will report later.
Introduces eagle_base_lora training mode where HF PEFT LoRA adapters are injected into the base model and co-trained with the EAGLE draft module. A preservation loss (KL divergence between original and LoRA-adapted base model outputs) is added to prevent the base model from drifting during training. LoRA adapter weights are exported separately alongside the EAGLE draft model artifacts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
The LoRA co-training config fields (eagle_base_lora, eagle_base_lora_rank, eagle_base_lora_alpha, eagle_base_lora_target_modules, eagle_base_lora_preservation_loss_weight) were defined in the config but never assigned in EagleModel.modify(), causing DynamicModule.__getattr__ to raise AttributeError when HFEagleModel.modify() accessed them. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
Set num_key_value_heads=16 (matching num_attention_heads) to avoid GQA, which triggers enable_gqa=True in SDPA — unsupported on CPU backends. Set use_last_layernorm=True so the norm layer is created and norm.weight is present in the export state dict as required by the export validator. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
- enable_cp_ttt_patch: add SDPBackend.MATH alongside CUDNN_ATTENTION so the math kernel is available as fallback on CPU (fixes test_forward_returns_loss)
- _check_valid_sd: skip fc/hidden_norm from required keys when use_aux_hidden_state=False, as these layers only exist in EAGLE-3 (fixes test_export_lora_artifacts)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
- Handle None dtype in _compute_ttt_attention_mask by falling back to torch.float32 when HF config.dtype is unset
- Fix JSON serialization of LoRA config by converting set target_modules to sorted list in _export_lora
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
bcee9bd to d933f42
♻️ Duplicate comments (3)
modelopt/torch/speculative/plugins/transformers.py (2)
647-650: ⚠️ Potential issue | 🟠 Major
Replace `assert` with explicit exception for runtime config validation.
`assert` statements can be optimized out with the `-O` flag. Use an explicit exception to ensure the incompatibility check always executes.
Proposed patch
     # Inject HF PEFT LoRA adapters into the base model for co-training
     if self.eagle_base_lora:
    -    assert not self.eagle_offline, "eagle_base_lora is incompatible with eagle_offline=True"
    +    if self.eagle_offline:
    +        raise ValueError("eagle_base_lora is incompatible with eagle_offline=True")
         self._inject_base_lora()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/speculative/plugins/transformers.py` around lines 647 - 650, Replace the runtime config assertion in the block that injects HF PEFT LoRA adapters: instead of using assert not self.eagle_offline, raise an explicit exception (e.g., raise RuntimeError or ValueError) when self.eagle_base_lora is true and self.eagle_offline is true so the check always runs; keep the existing call to self._inject_base_lora() when the check passes and reference the same symbols (self.eagle_base_lora, self.eagle_offline, self._inject_base_lora) when making the change.
811-817: ⚠️ Potential issue | 🟠 Major
Wrap reference forward in `try/finally` to ensure LoRA re-enablement.
If `_run_forward` raises an exception, LoRA adapters remain disabled, causing incorrect behavior in subsequent training steps.
Proposed patch
     ref_logits = None
     if self.eagle_base_lora:
         self._set_base_lora_enabled(False)
    -    ref_logits = _run_forward(no_grad=True).logits
    -    if hasattr(self, "_aux_hidden_states"):
    -        self._aux_hidden_states.clear()
    -    self._set_base_lora_enabled(True)
    +    try:
    +        ref_logits = _run_forward(no_grad=True).logits
    +        if hasattr(self, "_aux_hidden_states"):
    +            self._aux_hidden_states.clear()
    +    finally:
    +        self._set_base_lora_enabled(True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/speculative/plugins/transformers.py` around lines 811 - 817, The block that disables LoRA adapters (guarded by self.eagle_base_lora) should wrap the call to _run_forward in a try/finally so _set_base_lora_enabled(True) always runs even if _run_forward raises; keep the current behavior of clearing self._aux_hidden_states and assigning ref_logits from _run_forward().logits, but move those operations into the try (or after a successful call) and perform re-enablement in the finally block to guarantee LoRA is restored.
modelopt/torch/export/plugins/hf_spec_export.py (1)
195-212: ⚠️ Potential issue | 🟠 Major
Missing guard for empty LoRA state dict.
If LoRA injection regresses or filtering fails, `lora_sd` will be empty, resulting in an empty adapter file being written without error. Add a guard to fail fast.
Proposed patch
     def _export_lora(self, export_dir: Path, full_sd: dict):
         """Export base model LoRA adapter weights alongside the eagle module artifacts."""
         lora_sd = {k: v for k, v in full_sd.items() if "lora_A" in k or "lora_B" in k}
    +    if not lora_sd:
    +        raise RuntimeError(
    +            "No LoRA adapter tensors found in state_dict; refusing to export empty adapter."
    +        )
         save_file(lora_sd, export_dir / "lora_adapter_model.safetensors")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/plugins/hf_spec_export.py` around lines 195 - 212, The _export_lora method currently builds lora_sd and unconditionally writes it and a config; add a fail-fast guard that checks if lora_sd is empty and raises a clear exception (e.g., RuntimeError/ValueError) before calling save_file or constructing/writing the LoraConfig, so you don't create an empty adapter file if LoRA filtering/injection failed; reference lora_sd, _export_lora, save_file, LoraConfig and export_dir when adding the check and error.
🧹 Nitpick comments (1)
modelopt/torch/speculative/plugins/transformers.py (1)
574-582: Prefer `F.softmax`/`F.log_softmax` over module instantiation.
Creating `nn.Softmax` and `nn.LogSoftmax` module instances on each call adds unnecessary overhead. Use the functional API instead.
Proposed patch
    +import torch.nn.functional as F
    +
     def _preservation_loss(
         self, ref_logits: torch.Tensor, lora_logits: torch.Tensor
     ) -> torch.Tensor:
         """KL divergence encouraging LoRA output to stay close to the original base model.

         KL(softmax(ref) || log_softmax(lora)) weighted by eagle_base_lora_preservation_loss_weight.
         """
    -    loss = nn.Softmax(dim=-1)(ref_logits.detach()) * nn.LogSoftmax(dim=-1)(lora_logits)
    +    loss = F.softmax(ref_logits.detach(), dim=-1) * F.log_softmax(lora_logits, dim=-1)
         return -loss.sum(dim=-1).mean() * self.eagle_base_lora_preservation_loss_weight
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/speculative/plugins/transformers.py` around lines 574 - 582, In _preservation_loss, avoid instantiating nn.Softmax and nn.LogSoftmax on each call; use the functional API (torch.nn.functional.softmax and torch.nn.functional.log_softmax) to compute softmax(ref_logits.detach(), dim=-1) and log_softmax(lora_logits, dim=-1) respectively, then compute the elementwise product, sum over vocab dim and return the weighted mean exactly as before using self.eagle_base_lora_preservation_loss_weight; ensure you keep the ref_logits.detach() semantics and the dim=-1 argument.
📒 Files selected for processing (2)
modelopt/torch/export/plugins/hf_spec_export.py
modelopt/torch/speculative/plugins/transformers.py
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #1060 +/- ##
==========================================
+ Coverage 70.29% 70.34% +0.04%
==========================================
Files 227 227
Lines 25860 25864 +4
==========================================
+ Hits 18179 18193 +14
+ Misses 7681 7671 -10
peft is not available in all test environments. Move the LoraConfig import inside _export_lora where it is actually used. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
Use getattr with bfloat16 fallback instead of direct attribute access, which raises AttributeError in transformers 4.53.3. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
Use getattr with None default combined with or-fallback to bfloat16, handling both: attribute missing (tf_min/4.53.3) and attribute present but None (tf_latest). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
Use getattr(..., None) or torch.bfloat16 to handle both absent attribute (transformers tf_min) and attribute-exists-but-None (tf_latest) cases. Signed-off-by: Ye Yu <yeyu@nvidia.com>
Use torch.get_default_dtype() as fallback instead of torch.bfloat16 when config.dtype is None (transformers >= 4.53 sets LlamaConfig.dtype=None). This prevents RuntimeError from scaled_dot_product_attention when the model computes in float32 but the attention mask was created as bfloat16. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
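The dtype commits above converge on one lookup that handles both a missing attribute and an attribute that is present but `None`. A small sketch of that idiom follows, kept free of torch so it runs anywhere: `resolve_dtype` is a hypothetical helper name, and plain strings stand in for torch dtypes and for the value `torch.get_default_dtype()` would return.

```python
from types import SimpleNamespace


def resolve_dtype(config, default):
    """Return config.dtype, or `default` if the attribute is absent or None.

    getattr(..., None) covers configs without a dtype attribute, and the
    `or` covers configs where the attribute exists but is unset.
    """
    return getattr(config, "dtype", None) or default


# Strings are placeholders for real dtype objects in this sketch.
config_without_attr = SimpleNamespace()
config_with_none = SimpleNamespace(dtype=None)
config_with_value = SimpleNamespace(dtype="bfloat16")

fallback = "float32"  # stand-in for the framework's default dtype
resolved = [
    resolve_dtype(config_without_attr, fallback),
    resolve_dtype(config_with_none, fallback),
    resolve_dtype(config_with_value, fallback),
]
```

Falling back to the framework default rather than a fixed dtype keeps the attention mask in the same precision the model actually computes in, which is the mismatch the last commit describes.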
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
ChenhanYu left a comment
This PR adds LoRA co-training support for HF EAGLE speculative decoding — reasonable feature design with good test coverage. However, peft is now a hard import-time dependency for the entire transformers.py plugin due to top-level imports. This will break anyone importing modelopt.torch.speculative.plugins.transformers without peft installed, even if they don't use LoRA. This needs to be fixed before merging. Several other issues below.
    @@ -37,6 +37,9 @@
     import torch
     import transformers
     from packaging.version import Version
Blocker: These top-level from peft import ... lines make peft a hard dependency for the entire transformers plugin. Anyone importing this module without peft installed will get an ImportError, even if they don't use LoRA.
These must be lazy imports inside _inject_base_lora, _set_base_lora_enabled, and the export method. For example:
    def _inject_base_lora(self):
        from peft import LoraConfig
        from peft.mapping import inject_adapter_in_model
        ...

    def _set_base_lora_enabled(self, enabled: bool):
        from peft.tuners.lora import LoraLayer
        ...

    eagle_base_lora_alpha: float = ModeloptField(
        default=16.0,
        description="LoRA alpha (scaling) for the base model adapters.",
    )
Mutable default: default=[] means all config instances share the same list object. Use default_factory=list or default=None with a note that None uses peft defaults.
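The hazard named in this comment, and the `default_factory` fix, can be illustrated with the standard library alone. This sketch uses `dataclasses` as a stand-in; whether `ModeloptField` accepts `default_factory` or copies defaults per instance is not shown in this thread, so treat the mapping to the real config class as an assumption. `SharedDefault` and `LoraOptions` are hypothetical names.

```python
from dataclasses import dataclass, field


class SharedDefault:
    # Class-level list: every instance sees the same object.
    target_modules = []


@dataclass
class LoraOptions:
    # default_factory builds a fresh list for each instance.
    target_modules: list[str] = field(default_factory=list)


a, b = SharedDefault(), SharedDefault()
a.target_modules.append("q_proj")  # also visible through b

x, y = LoraOptions(), LoraOptions()
x.target_modules.append("q_proj")  # y is unaffected
```

The alternative mentioned in the comment, `default=None` with `None` meaning "use peft defaults", avoids the question entirely because `None` is immutable.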
    def _set_base_lora_enabled(self, enabled: bool) -> None:
        """Enable or disable LoRA adapters in the base model."""
        for module in self._base_model.modules():
The docstring says "KL divergence" but this computes cross-entropy: -softmax(ref) * log_softmax(lora). The missing entropy term is constant w.r.t. LoRA params so gradients are correct, but the naming is misleading. Either rename or add a comment clarifying this is KL up to a constant.
This is logit KL divergence
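For readers following this exchange, the relationship between the two quantities can be checked numerically. The sketch below is pure Python rather than torch, and `softmax`, `cross_entropy`, `entropy`, and `kl_divergence` are illustrative helpers, not code from the PR. It verifies that the cross-entropy of the LoRA distribution against the reference distribution equals the KL divergence plus the entropy of the reference, a term that does not depend on the LoRA logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


def entropy(p):
    """H(p) = -sum_i p_i * log(p_i)."""
    return -sum(pi * math.log(pi) for pi in p)


def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Arbitrary example logits for the reference and LoRA-adapted models.
ref = softmax([2.0, 0.5, -1.0])
lora = softmax([1.5, 0.7, -0.8])

ce = cross_entropy(ref, lora)
kl = kl_divergence(ref, lora)
h_ref = entropy(ref)
# Identity: ce == kl + h_ref, up to floating-point error.
```

Because `h_ref` is fixed once the reference logits are detached, minimizing either quantity yields the same gradients with respect to the LoRA parameters; only the reported loss value differs by that offset.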
    ) -> BlockMask | torch.Tensor:
        """Return TTT attention_mask tensor of type BlockMask or Tensor depends on eagle attn impl."""
        msk_func = get_ttt_msk_func(seq_length, ttt_step)
        dtypemin = torch.finfo(self._base_llm_config.dtype).min
The reference forward with LoRA disabled has no try/finally. If the forward throws, LoRA stays disabled for all subsequent calls:
    self._set_base_lora_enabled(False)
    try:
        ref_logits = _run_forward(no_grad=True).logits
        if hasattr(self, "_aux_hidden_states"):
            self._aux_hidden_states.clear()
    finally:
        self._set_base_lora_enabled(True)

    @@ -610,6 +644,11 @@ def modify(
                 if layer_idx in self.eagle_config.eagle_aux_hidden_state_layer_ids:
                     layer.register_forward_hook(self._collect_aux_hidden_states_forward_hook)
assert not self.eagle_offline can be optimized out with python -O. Use if self.eagle_offline: raise ValueError(...) for runtime config validation.
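A minimal sketch of the explicit check this comment asks for, written as a standalone function so it can be run in isolation. `check_lora_config` is a hypothetical helper; in the PR the check lives inline in `modify()`.

```python
def check_lora_config(eagle_base_lora: bool, eagle_offline: bool) -> None:
    """Reject the unsupported combination with an exception, not an assert.

    Unlike an assert statement, this check still runs when Python is
    started with -O, which strips assert statements from the bytecode.
    """
    if eagle_base_lora and eagle_offline:
        raise ValueError("eagle_base_lora is incompatible with eagle_offline=True")


# Valid combinations pass silently.
check_lora_config(eagle_base_lora=True, eagle_offline=False)
check_lora_config(eagle_base_lora=False, eagle_offline=True)

# The invalid combination raises regardless of interpreter flags.
try:
    check_lora_config(eagle_base_lora=True, eagle_offline=True)
    rejected = False
except ValueError:
    rejected = True
```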
    required_keys = set(expected_keys_single_layer["required"])
    if not use_aux:
        required_keys -= aux_only_keys
    # Check that export sd has required keys
LoRA key filtering ("lora_A" in k or "lora_B" in k) is fragile. PEFT may use other key patterns (lora_embedding_A, lora_magnitude_vector, etc.). Consider using peft's own utilities to identify adapter parameters, or add a warning if zero LoRA tensors are found.
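One way to combine this concern with the earlier empty-export guard is to match a wider set of key markers and fail when nothing matches. The sketch below is a plain-Python illustration: `select_lora_tensors` and `LORA_KEY_MARKERS` are hypothetical names, and the marker list is an assumption built from the patterns named in this comment, not an exhaustive or authoritative list of PEFT key names.

```python
# Assumed marker substrings; extend as needed for the adapter types in use.
LORA_KEY_MARKERS = (
    "lora_A",
    "lora_B",
    "lora_embedding_A",
    "lora_embedding_B",
    "lora_magnitude_vector",
)


def select_lora_tensors(full_sd: dict) -> dict:
    """Return the adapter entries of a state dict, refusing an empty result."""
    lora_sd = {
        key: value
        for key, value in full_sd.items()
        if any(marker in key for marker in LORA_KEY_MARKERS)
    }
    if not lora_sd:
        raise RuntimeError("No LoRA adapter tensors found in state_dict; refusing empty export.")
    return lora_sd


# Toy state dict; the values stand in for tensors.
toy_sd = {
    "layers.0.q_proj.base_layer.weight": 0,
    "layers.0.q_proj.lora_A.default.weight": 1,
    "layers.0.q_proj.lora_B.default.weight": 2,
    "embed.lora_embedding_A.default": 3,
}
selected = select_lora_tensors(toy_sd)
```

Substring matching remains a heuristic; using PEFT's own utilities to identify adapter parameters, as the comment suggests, would avoid maintaining a marker list by hand.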
    import modelopt.torch.speculative.plugins.transformers

    modelopt.torch.speculative.plugins.transformers.ENABLE_CP_TTT_PATCH = True
    with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
Adding SDPBackend.MATH fallback and the dtype getattr changes in transformers.py look unrelated to LoRA co-training. Consider splitting into a separate PR or calling them out in the description.
This is needed for tests
### What does this PR do?
Type of change: New feature
Adds LoRA co-training support for HF EAGLE speculative decoding. When
`eagle_base_lora=True`, HF PEFT LoRA adapters are injected into the base
model and co-trained alongside the EAGLE draft module in a single online
training pass. A preservation loss (KL divergence between the original
frozen base model output and the LoRA-adapted output) is added to prevent
the base model from drifting during training. LoRA adapter weights are
exported separately alongside the EAGLE draft model artifacts.
Usage