Conversation
Important: Review skipped. Auto reviews are disabled on this repository; please check the settings in the CodeRabbit UI or the ⚙️ Run configuration. Configuration used: path `.coderabbit.yaml`, review profile CHILL, plan Pro.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be configured in the repository's CodeRabbit settings.
📝 Walkthrough

Adds a complete CodonFM feature set: codon tokenizer, datasets/collators, TransformerEngine-native model and training recipe (FSDP2), quantization/FP8 tooling and debug configs, checkpointing, perf logging, extensive tests and golden fixtures, CI sync rules, and dependency manifests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Trainer as Trainer (Hydra)
    participant Tokenizer as CodonTokenizer
    participant Dataloader as Dataloader (BSHD/THD)
    participant Model as CodonFM (TE on CUDA)
    participant Optimizer as Optimizer/Scheduler
    participant Checkpoint as Checkpoint (FSDP2)
    participant Perf as PerfLogger
    participant Debug as nvdlfw_inspect
    Trainer->>Tokenizer: load/instantiate tokenizer
    Trainer->>Dataloader: create dataloaders (DistributedSampler)
    Dataloader->>Tokenizer: tokenize & collate batch
    Trainer->>Model: forward(batch) on CUDA/TE
    Model->>Perf: emit loss/logits/metrics
    Trainer->>Optimizer: backward + step
    Trainer->>Checkpoint: conditional save/load
    Trainer->>Perf: log_step(metrics)
    Trainer->>Debug: init/collect quant-stats (optional)
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
```python
        An available TCP port number.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
```
Check warning: Code scanning / CodeQL: Binding a socket to all network interfaces (Medium, test)
Copilot Autofix (AI, 27 days ago)
In general, to avoid binding a socket to all interfaces, replace uses of '' or "0.0.0.0" in bind((host, port)) with a specific interface address (often 127.0.0.1 for local‑only services), or create separate sockets for each intended interface. For an ephemeral test port, binding only to the loopback interface is sufficient and safer.
In this file, the best minimal change is to adjust the unused_tcp_port fixture so the socket is bound to the loopback address instead of all interfaces. Replace s.bind(("", 0)) with s.bind(("127.0.0.1", 0)). This keeps the behavior of selecting an available ephemeral port while ensuring the temporary binding is only on localhost. No new imports or helper methods are required, and no other code paths in this file need changes.
Specifically:
- In `bionemo-recipes/models/codonfm/tests/common/fixtures.py`, within the `unused_tcp_port` fixture (around line 41), change the bind call as described.
- Keep `SO_REUSEADDR` and the rest of the function unchanged.
```diff
@@ -38,7 +38,7 @@
         An available TCP port number.
     """
     with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
-        s.bind(("", 0))
+        s.bind(("127.0.0.1", 0))
         s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
         return s.getsockname()[1]
```
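For context, the full fixed fixture as a minimal, self-contained sketch. Only the `s.bind` target changes relative to the snippet above; as one further tweak (an observation of mine, not part of the autofix), `SO_REUSEADDR` is set before the bind here, since setting it afterwards cannot influence the bind:

```python
import socket


def unused_tcp_port() -> int:
    """Return an available TCP port number, bound only to the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # SO_REUSEADDR must be set before bind() to affect it
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("127.0.0.1", 0))  # loopback only; port 0 lets the OS pick
        return s.getsockname()[1]
```

The behavior of selecting an ephemeral port is unchanged; only the temporary binding is restricted to localhost.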
```python
def unused_tcp_port():
    """Find and return an unused TCP port for torchrun rendezvous."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
```
Check warning: Code scanning / CodeQL: Binding a socket to all network interfaces (Medium, test)
Copilot Autofix (AI, 28 days ago)
In general, to fix this class of issue you should avoid binding sockets to all interfaces ('' or 0.0.0.0) unless absolutely necessary. Instead, bind to a specific interface (often 127.0.0.1 for local-only communication) or create multiple sockets if you genuinely need to listen on multiple interfaces.
For this specific fixture, the goal is to allocate an unused TCP port for use by torchrun or similar tooling, not to serve external clients. We can achieve the same behavior by binding to ("127.0.0.1", 0) instead of ("", 0). The OS will still choose an unused ephemeral port on the loopback interface, and returning that port number is sufficient for local rendezvous in tests. No additional imports are needed; only a single line change in bionemo-recipes/models/codonfm/tests/conftest.py at the s.bind call.
```diff
@@ -39,7 +39,7 @@
 def unused_tcp_port():
     """Find and return an unused TCP port for torchrun rendezvous."""
     with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
-        s.bind(("", 0))
+        s.bind(("127.0.0.1", 0))
         s.listen(1)
         port = s.getsockname()[1]
         return port
```
@coderabbitai review
Tip: For best results, initiate chat on the files or code changes. Or feel free to ask any specific questions about the changes!
✅ Actions performed: Review triggered.
Actionable comments posted: 18
🧹 Nitpick comments (8)
bionemo-recipes/recipes/codonfm_native_te/checkpoint.py (1)
193-197: Potential race: non-main ranks proceed while main rank prunes.

After `save_checkpoint_fsdp2`, the main process logs and potentially runs `prune_checkpoints` (which deletes directories), while other ranks continue immediately. If subsequent operations on non-main ranks try to access the pruned checkpoint before the deletion completes, there could be issues. Consider adding a barrier after pruning if downstream code might read checkpoints:

```python
if max_checkpoints is not None and dist_config.is_main_process():
    prune_checkpoints(ckpt_path, max_checkpoints)
torch.distributed.barrier(group=process_group)  # Ensure pruning completes before continuing
```

This may not be an issue if checkpoints are only read at the start of training, but worth considering for robustness.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/checkpoint.py` around lines 193-197: The current flow after save_checkpoint_fsdp2 lets non-main ranks continue while the main rank may delete old checkpoints via prune_checkpoints(ckpt_path, max_checkpoints), risking a race if other ranks access those files; modify the shutdown so that after the conditional prune_checkpoints call (guarded by dist_config.is_main_process()) you add a distributed synchronization (e.g., torch.distributed.barrier with the appropriate process_group) so all ranks wait until pruning completes before proceeding; update calls around save_checkpoint_fsdp2/prune_checkpoints and ensure the barrier uses the same process group used for training to avoid deadlocks.

bionemo-recipes/recipes/codonfm_native_te/requirements.txt (1)
1-12: Consider pinning dependency versions for reproducibility.

All dependencies are unpinned, which can lead to reproducibility issues across environments and over time. Note that `bionemo-recipes/models/codonfm/requirements.txt` pins minimum versions (e.g., `torch>=2.6.0`). Consider aligning the approach:

- Pin minimum versions for critical dependencies (`torch`, `transformer_engine`)
- Move `pytest` to a separate dev requirements file or `pyproject.toml` test extras, as it's a test dependency rather than runtime

Example with version constraints

```diff
 hydra-core
 nvdlfw_inspect
 pandas
 pyarrow
-pytest
 pyyaml
 safetensors
-torch
+torch>=2.6.0
 torchmetrics
 tqdm
-transformer_engine[pytorch]
+transformer_engine[pytorch]>=1.14.0
 wandb
```
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/requirements.txt` around lines 1-12: Update the requirements handling to improve reproducibility: pin minimum (or exact) versions for critical runtime packages such as torch and transformer_engine (e.g., replace "torch" and "transformer_engine[pytorch]" with constrained versions consistent with bionemo-recipes/models/codonfm/requirements.txt), and move test-only dependencies like pytest out of this runtime requirements list into a dev/test requirements file or test extras in pyproject.toml; ensure pandas/pyarrow/safetensors/torchmetrics/tqdm/wandb also have at least minimum version constraints to avoid drift.

bionemo-recipes/recipes/codonfm_native_te/distributed_config.py (1)
36-40: Be aware: `setdefault` modifies the environment.

Using `os.environ.setdefault()` has the side effect of setting the environment variable if it doesn't exist. This is intentional per the docstring ("we set them to default values"), but callers should be aware that instantiating `DistributedConfig` can modify the process environment.

If read-only behavior is ever needed, consider using `os.environ.get()` instead:

```python
rank: int = field(default_factory=lambda: int(os.environ.get("RANK", "0")))
```

This is informational; the current behavior appears intentional for ensuring consistent env vars across the training process.
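The difference between the two calls can be demonstrated in isolation (the `EXAMPLE_RANK` variable name is made up for this demo):

```python
import os

os.environ.pop("EXAMPLE_RANK", None)  # start from a clean slate

# setdefault() returns the fallback AND writes it into the process environment
rank = int(os.environ.setdefault("EXAMPLE_RANK", "0"))
mutated = "EXAMPLE_RANK" in os.environ  # True: the env was modified

os.environ.pop("EXAMPLE_RANK", None)

# get() returns the fallback without touching the environment
rank = int(os.environ.get("EXAMPLE_RANK", "0"))
untouched = "EXAMPLE_RANK" not in os.environ  # True: read-only
```

Either way the field resolves to the same default; the only difference is whether child processes inherit the variable.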
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/distributed_config.py` around lines 36-40: The DistributedConfig dataclass currently uses os.environ.setdefault in the default_factory for fields rank, local_rank, world_size, _master_addr, and _master_port which mutates the process environment; if you want read-only behavior change each default_factory to use os.environ.get (e.g., replace setdefault with get) so instantiating DistributedConfig does not modify environment variables while preserving the same default values.

bionemo-recipes/models/codonfm/tests/golden_values.json (1)
1-1: Consider formatting JSON for improved maintainability.

This golden values file is a single line with ~87KB of data. While functionally correct, multi-line formatted JSON would make it easier to:
- Review changes in diffs
- Debug failing tests by inspecting specific values
- Maintain over time
This is a low-priority suggestion since the file is auto-generated by `generate_golden_values.py`.

The test data structure looks correct:
- `seed: 42` for reproducibility
- `config` matches a small test model configuration
- `labels` correctly uses `-100` for masked positions (PyTorch cross-entropy ignore index)
- `attention_mask` properly indicates padding (0 at position 13 in sequence index 1)
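The suggested change to the generator is a one-argument tweak to the serialization call; a small sketch (the `golden` dict here is illustrative, not the real fixture):

```python
import json

golden = {"seed": 42, "config": {"num_layers": 2}, "labels": [[-100, 7, -100]]}

compact = json.dumps(golden)                           # current: one long line
pretty = json.dumps(golden, indent=2, sort_keys=True)  # diff-friendly, one value per line

# Both forms round-trip to the same data, so existing tests are unaffected
assert json.loads(compact) == json.loads(pretty)
```

`sort_keys=True` additionally keeps the key order stable across regenerations, which further shrinks diffs.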
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tests/golden_values.json` at line 1: Golden JSON file is a single long line; update the generator to emit pretty-printed JSON for easier diffs and review. Modify generate_golden_values.py (the code that writes the golden values) to call json.dump/json.dumps with indent=2 (or similar) and ensure separators/default serialization remain unchanged, then rewrite the output file (golden_values.json) using that formatted output so subsequent runs produce multi-line indented JSON.

bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py (1)
22-22: Replace the private API usage with defensive error handling to maintain forward compatibility.

The imports and usage of `_mesh_resources` (lines 22 and 54) rely on a module-private API that is not exported in `torch.distributed.device_mesh.__all__` and may change across PyTorch versions without notice. Real-world integrations explicitly warn that these internals are subject to change.

Wrap the `_mesh_resources.mesh_stack.clear()` call with a try/except to handle potential API shifts:

Suggested fix

```python
# NOTE: _mesh_resources is a private API that may change in future PyTorch versions
try:
    _mesh_resources.mesh_stack.clear()
except AttributeError:
    pass  # API changed, mesh cleanup may be incomplete
```
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py` at line 22: The test setup currently uses the private symbol _mesh_resources.mesh_stack.clear() which can break across PyTorch versions; modify the conftest cleanup to defensively handle missing or changed internals by wrapping the call to _mesh_resources.mesh_stack.clear() in a try/except (catch AttributeError) so tests won't crash if mesh_stack or _mesh_resources shape changes, and leave a brief comment noting this is a private API; keep usage of init_device_mesh untouched but ensure any cleanup uses the guarded call to avoid relying on non-public APIs.

bionemo-recipes/recipes/codonfm_native_te/scheduler.py (1)
19-43: Add type hints and complete docstring.

The scheduler logic is correct. Consider adding type hints and completing the Google-style docstring with Args/Returns sections for better documentation.
📝 Suggested improvements
```diff
+from torch.optim import Optimizer
 from torch.optim.lr_scheduler import LambdaLR


 def get_linear_schedule_with_warmup(
-    optimizer,
-    num_warmup_steps=2_000,
-    num_training_steps=500_000,
-    last_epoch=-1,
-):
+    optimizer: Optimizer,
+    num_warmup_steps: int = 2_000,
+    num_training_steps: int = 500_000,
+    last_epoch: int = -1,
+) -> LambdaLR:
     """Linear warmup and decay scheduler for CodonFM pretraining.

     The learning rate is warmed up over the first warmup steps to a peak value,
     then linearly decayed to one tenth of its peak value over 90% of training duration.
+
+    Args:
+        optimizer: The optimizer to schedule.
+        num_warmup_steps: Number of warmup steps.
+        num_training_steps: Total number of training steps.
+        last_epoch: The index of last epoch.
+
+    Returns:
+        LambdaLR scheduler with the warmup and decay schedule.
     """
```
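The schedule the docstring describes can be sanity-checked without torch. Below is a pure-Python sketch of a plausible `lr_lambda` matching that description (warmup to peak, then linear decay to one tenth of peak over 90% of training); the exact formula in the recipe's implementation may differ:

```python
def lr_lambda(step: int, num_warmup_steps: int = 2_000, num_training_steps: int = 500_000) -> float:
    """Multiplicative LR factor: linear warmup to 1.0, then linear decay to 0.1."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    decay_steps = int(0.9 * num_training_steps)  # decay over 90% of training
    progress = min(1.0, (step - num_warmup_steps) / max(1, decay_steps))
    return 1.0 - 0.9 * progress  # interpolates from 1.0 down to 0.1
```

With torch available, a function of this shape is what gets passed to `LambdaLR(optimizer, lr_lambda)`.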
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/scheduler.py` around lines 19-43: Add type hints and a complete Google-style docstring to get_linear_schedule_with_warmup: annotate parameters (optimizer: torch.optim.Optimizer, num_warmup_steps: int, num_training_steps: int, last_epoch: int) and the return type (torch.optim.lr_scheduler.LambdaLR), and expand the existing docstring to include Args (describe each param), Returns (describe the LambdaLR scheduler), and Raises (if any). Mention lr_lambda in the docstring as the internal schedule function and that the returned object is an instance of LambdaLR constructed with lr_lambda.

bionemo-recipes/recipes/codonfm_native_te/fp8_debugging_stats.yaml (1)
7-23: Minor indentation inconsistency in YAML.

The `LogFp8TensorStats` section uses 12-space indentation (lines 8-18) while `LogTensorStats` uses 10-space indentation (lines 20-23). While YAML will parse this correctly, consistent indentation improves readability.

💅 Suggested fix for consistent indentation

```diff
 LogTensorStats:
-  enabled: True
-  stats: [max, min, mean, std, l1_norm]
-  tensors: [dgrad, wgrad]
-  freq: 1
+    enabled: True
+    stats: [max, min, mean, std, l1_norm]
+    tensors: [dgrad, wgrad]
+    freq: 1
```
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/fp8_debugging_stats.yaml` around lines 7-23: The YAML has inconsistent indentation between the LogFp8TensorStats block and the LogTensorStats block; update the indentation so both top-level keys and their child mappings align consistently (e.g., make LogTensorStats use the same indent width as LogFp8TensorStats), ensuring the keys "LogFp8TensorStats", "tensors_struct", "LogTensorStats", "stats", "tensors", and "freq" are vertically aligned for improved readability.

bionemo-recipes/models/codonfm/tests/common/test_modeling_common.py (1)
1081-1097: Make `create_inference_params()` concrete, or move it to an autoregressive-only mixin.

The docstring says only autoregressive subclasses must override this, but `@abstractmethod` means Python will refuse to instantiate any subclass until it implements the method. That forces masked-LM suites to carry a dummy override even though every call site is already gated by `is_autoregressive`. (docs.python.org)

♻️ Proposed fix

```diff
-    @abstractmethod
     def create_inference_params(self, config, batch_size=1, max_seq_len=256, num_beams=1) -> Any:
         """Create inference params for KV-cache generation tests.
 @@
-        pass
+        raise NotImplementedError(
+            "Autoregressive subclasses must implement create_inference_params()"
+        )
```
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tests/common/test_modeling_common.py` around lines 1081 - 1097, The abstract create_inference_params method forces non-autoregressive subclasses to implement a no-op even though callers guard by is_autoregressive; make create_inference_params concrete by providing a default implementation that raises a clear runtime error (or returns None) when called, or move the method into an autoregressive-only mixin used by autoregressive tests; update references to create_inference_params and is_autoregressive accordingly so masked-LM test suites no longer need dummy overrides.
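The instantiation behavior referenced in this finding is standard `abc` semantics, demonstrated here with throwaway class names (these are not the actual test classes):

```python
from abc import ABC, abstractmethod


class CommonTestMixin(ABC):
    @abstractmethod
    def create_inference_params(self):
        """Only meaningful for autoregressive suites."""


class MaskedLMSuite(CommonTestMixin):
    pass  # no override of the abstract method


try:
    MaskedLMSuite()  # Python refuses: abstract method not implemented
    instantiable = True
except TypeError:
    instantiable = False
```

Replacing the decorator with a concrete body that raises `NotImplementedError` keeps the contract for autoregressive suites while letting masked-LM suites instantiate normally.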
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bionemo-recipes/models/codonfm/tests/common/test_modeling_common.py`:
- Around line 643-670: get_converted_te_model_checkpoint always rebuilds the
same converted checkpoint and causes repeated heavy HF load/convert/save cycles;
change it to cache the converted checkpoint once per test class (e.g., a class
attribute like _cached_converted_te_checkpoint or use functools.cached_property)
so subsequent calls simply return the cached Path if set and exists. Implement:
at start of get_converted_te_model_checkpoint (and similarly in
get_converted_te_model) check the cache attribute and return it if present,
otherwise run the existing conversion logic, set the class-level cache to
checkpoint_path before returning; ensure the cache is validated (exists on disk)
to recover if cleanup removed it.
- Around line 427-435: Add an explicit length check before the zip-based
hidden-state comparison: ensure te_outputs.hidden_states and
hf_outputs.hidden_states have the same length (e.g., compare
len(te_outputs.hidden_states) vs len(hf_outputs.hidden_states)) and raise/assert
with a clear message if they differ, then proceed to the existing loop that
iterates over them (the code using compare_hidden_states and the for i,
(te_hidden, hf_hidden) in enumerate(zip(...)) block). This ensures structural
mismatches are detected instead of silently ignored.
- Around line 300-307: The guard for tied-weight skipping assumes
model._tied_weights_keys is iterable and calls set(model._tied_weights_keys)
which can raise if it's None; change that to use a safe default like
set(getattr(model, "_tied_weights_keys", None) or []) when checking membership
in the if should_be_fp8 block so tied-weight handling mirrors the defensive
pattern used in test_modeling_codonfm_te.py; leave the rest (the
_do_not_quantize check using fnmatch and the final assertion asserting
module.weight is a QuantizedTensor) unchanged.
- Around line 787-798: cu_seq_lens_q[-1] is a 0-dim tensor and must be converted
to a Python int before calling torch.arange; change how num_real_tokens is
computed to use int(cu_seq_lens_q[-1]) and then simplify torch.arange(0,
num_real_tokens, device="cuda") to torch.arange(num_real_tokens, device="cuda")
when building real_idx and the other arange usage, so functions/variables like
num_real_tokens, cu_seq_lens_q, torch.arange, real_idx, input_data_thd and
input_data_thd_padded all operate with a Python integer end value; apply the
same change at both arange sites in the test.
In `@bionemo-recipes/models/codonfm/tests/conftest.py`:
- Around line 48-52: The fixture use_te_debug currently sets env vars with
monkeypatch.setenv and calls importlib.reload(transformer_engine.pytorch) but
never restores module state; convert it to a yield-style fixture that sets the
env vars (via monkeypatch.setenv as now), capture the original module object
(e.g., orig = sys.modules.get("transformer_engine.pytorch")) before calling
importlib.reload(transformer_engine.pytorch), yield to run the test, and after
yield restore the original module state by replacing
sys.modules["transformer_engine.pytorch"] with the saved orig (or reload it
again) so the importlib.reload side effect is undone while leaving monkeypatch
cleanup to handle the env vars.
In `@bionemo-recipes/models/codonfm/tests/generate_golden_values.py`:
- Around line 207-211: The code currently calls
native_model.load_state_dict(native_sd, strict=False) and only prints missing
keys, allowing unintended missing keys to silently pass; change this to
explicitly fail on any missing keys other than the expected rotary inv_freq key:
capture (missing, unexpected) from native_model.load_state_dict(native_sd,
strict=False), assert that unexpected is empty (as already done), then compute
unexpected_missing = set(missing) - {"rotary_emb.inv_freq"} (use the exact key
string shown in your codebase) and raise/assert if unexpected_missing is
non-empty so the test fails when any unexpected key is missing; keep the print
for the allowed missing key if you want but do not let other missing keys pass.
In `@bionemo-recipes/models/codonfm/tokenizer.py`:
- Around line 19-125: The CodonTokenizer class is missing the HuggingFace
save/load contract; implement save_pretrained(self, save_directory) and
`@classmethod` from_pretrained(cls, load_directory) on CodonTokenizer so the
tokenizer can be serialized and reloaded with model artifacts: persist/restore
seq_type, SPECIAL_TOKENS and the full encoder/decoder (or the codons list) as a
JSON (or vocab) file; on save_pretrained write a small metadata JSON containing
seq_type and SPECIAL_TOKENS and the encoder (or codons) and on from_pretrained
read that file to reconstruct self.seq_type, self.codons, self.encoder and
self.decoder exactly as done in __init__; ensure from_pretrained returns an
instance equivalent to calling CodonTokenizer(seq_type=...) so HF-style
round-tripping works.
In `@bionemo-recipes/recipes/codonfm_native_te/checkpoint.py`:
- Line 1: Update the SPDX copyright header year from 2026 to 2025 in the
top-of-file SPDX comment (the SPDX-FileCopyrightText line) so the file uses the
correct copyright year.
In `@bionemo-recipes/recipes/codonfm_native_te/dataset.py`:
- Around line 62-68: The __init__ currently does eager loading with
pd.read_parquet(path); change it to keep only the path/metadata and create a
lazy reader so rows are loaded on-demand (e.g., store self.path or a
pyarrow.dataset.Dataset created via pyarrow.dataset.dataset(path) or a
pyarrow.parquet.ParquetFile instance) and remove self.df assignment;
implement/adjust __len__ and __getitem__ (or an iterator) to read specific rows
or batches from the underlying pyarrow reader (using row_group access or
Scanner.to_batches with a small batch_size) so memory is not multiplied per
rank; apply the same change to the mirrored implementation in
bionemo-recipes/models/codonfm/dataset.py and ensure places that previously used
self.df now call the on-demand reader methods.
In `@bionemo-recipes/recipes/codonfm_native_te/distributed_config.py`:
- Line 1: Update the SPDX copyright header year from 2026 to 2025 in the file's
top comment (the SPDX-FileCopyrightText line) so it matches the repository
standard; locate the SPDX header string and replace "2026" with "2025".
In `@bionemo-recipes/recipes/codonfm_native_te/perf_logger.py`:
- Around line 105-115: The min_loss update is gated by the logging_frequency
check so it never runs on step 0 (and stays inf if total steps <=
logging_frequency); move the update self.min_loss = torch.minimum(self.min_loss,
outputs.loss) out of the if block so it runs unconditionally each step (i.e.,
place it before the if that checks step % self.logging_frequency) while leaving
the timing/logging code (elapsed_time, previous_step_time, step_time, token
counts) inside the existing conditional; ensure references to min_loss,
outputs.loss, logging_frequency and step remain unchanged.
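The control-flow change this prompt describes can be sketched without torch (`min` stands in for `torch.minimum`, a plain float for `outputs.loss`; the class name is illustrative):

```python
import math


class PerfLoggerSketch:
    """Tracks min loss on every step; heavier logging only at the frequency."""

    def __init__(self, logging_frequency: int):
        self.logging_frequency = logging_frequency
        self.min_loss = math.inf
        self.logged_steps = []

    def log_step(self, step: int, loss: float) -> None:
        # Unconditional: runs before the frequency gate, so no loss is missed
        self.min_loss = min(self.min_loss, loss)
        if step % self.logging_frequency == 0:
            self.logged_steps.append(step)  # timing/throughput logging lives here
```

With the update outside the gate, `min_loss` is correct even when training ends before the first logging boundary.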
In `@bionemo-recipes/recipes/codonfm_native_te/quantization.py`:
- Around line 190-243: In resolve_layer_precision(), reject invalid layer
selections up-front: treat an explicit empty list (fp8_layers == [] or
fp4_layers == []) as an error and raise a ValueError explaining that an explicit
empty selection is not allowed; also validate that every entry in fp8_layers and
fp4_layers is an integer within 1..num_layers and raise a ValueError listing any
out-of-range or non-integer entries. Do these checks before the logic that fills
defaults or computes overlaps (i.e., before the fp8_enabled/fp4_enabled
conditional blocks and before computing claimed_by_fp4/claimed_by_fp8) so bad
configs are rejected rather than silently ignored.
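The up-front validation this prompt asks for can be sketched as a standalone helper (the function name and error wording are illustrative; `resolve_layer_precision` itself is not reproduced here):

```python
from typing import List, Optional


def validate_layer_selection(name: str, layers: Optional[List[int]], num_layers: int) -> None:
    """Reject explicit empty selections and out-of-range or non-integer entries."""
    if layers is None:
        return  # None means "use defaults"; only explicit selections are checked
    if len(layers) == 0:
        raise ValueError(f"{name}: an explicit empty layer selection is not allowed")
    bad = [x for x in layers if not isinstance(x, int) or not 1 <= x <= num_layers]
    if bad:
        raise ValueError(f"{name}: invalid entries (must be ints in 1..{num_layers}): {bad}")
```

Calling this for both `fp8_layers` and `fp4_layers` before any default-filling or overlap computation means bad configs fail loudly instead of being silently ignored.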
In `@bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py`:
- Line 1: Update the SPDX copyright header year from 2026 to 2025 in the file's
top comment (the line starting with "# SPDX-FileCopyrightText"), matching the
other file (distributed_config.py); replace "2026 NVIDIA CORPORATION &
AFFILIATES" with "2025 NVIDIA CORPORATION & AFFILIATES".
In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_model.py`:
- Around line 135-154: The test reads a stale loss and is flaky due to
randomness; in test_loss_decreases ensure you recompute the loss after the final
optimizer.step by doing one more forward pass (call model(...) again) to obtain
final_loss, and remove randomness by fixing the RNG (torch.manual_seed(...)) at
the start of the test and switching the model to evaluation mode (model.eval())
before that final forward; reference the test function test_loss_decreases, the
model variable (CodonFMForMaskedLM), and outputs.loss to locate the changes.
- Around line 83-154: The tests in TestCodonFMForMaskedLM assume CUDA and will
fail on CPU; add a skip guard to the whole class by decorating
TestCodonFMForMaskedLM with `@pytest.mark.skipif`(not torch.cuda.is_available(),
reason="CUDA required for these tests") (ensure pytest and torch are imported
where the class is defined) so the entire class is skipped when CUDA is not
available instead of trying to call .cuda() or allocate CUDA tensors.
In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_tokenizer.py`:
- Around line 96-132: Move the declaration of ptl_te_path outside/above the try
so it always exists for the finally cleanup (or guard the finally removal with a
check like "if 'ptl_te_path' in locals() and str(ptl_te_path) in sys.path");
also replace the hardcoded special-token cutoff "i >= 5" with a computed value
using the tokenizer's special-tokens length (e.g., use
len(CodonTokenizer.SPECIAL_TOKENS) or derive the cutoff from
PTLTokenizer/CodonTokenizer special token IDs) so the comparison uses a stable
symbol (CodonTokenizer.SPECIAL_TOKENS or PTLTokenizer special-tokens) instead of
the magic number.
In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py`:
- Around line 52-57: The distributed process group and logger initialization in
the DistributedConfig block (DistributedConfig(),
torch.distributed.init_process_group, torch.cuda.set_device) must be torn down
in a finally block: wrap the setup + training/IO code in try/finally and in
finally call torch.distributed.destroy_process_group() to ensure the group is
always cleaned up, and also call your logger teardown method (e.g.,
logger.finish() or logger.close() — use the actual method used by your logging
instance) so the logger is always finished even on exceptions; keep the
DistributedConfig and device setup at the top of the try so the finally can
unconditionally clean up.
- Around line 168-212: The training loop can hang if a rank receives zero
batches; modify the loop around the for batch in train_dataloader to detect and
fail fast: introduce a local counter (e.g., batches_in_epoch = 0) before the
for, increment it inside the loop, and after the for (before incrementing epoch
and calling sampler.set_epoch) check if batches_in_epoch == 0 and raise a
RuntimeError with a clear message referencing step, epoch and
args.num_train_steps; update references around step, train_dataloader,
sampler.set_epoch and the outer while to ensure the error breaks the run instead
of spinning forever.
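Stripped of the training machinery, the guard looks like this (names mirror the prompt; the dataloader is any iterable and the surrounding while loop is omitted):

```python
def consume_epoch(train_dataloader, step: int, epoch: int, num_train_steps: int) -> int:
    """Iterate one epoch, failing fast if this rank received zero batches."""
    batches_in_epoch = 0
    for batch in train_dataloader:
        batches_in_epoch += 1
        step += 1  # real code would also do forward/backward/optimizer here
        if step >= num_train_steps:
            break
    if batches_in_epoch == 0:
        raise RuntimeError(
            f"Rank got zero batches at step {step}, epoch {epoch}; "
            f"training would spin forever before reaching num_train_steps={num_train_steps}"
        )
    return step
```

Placing the check after the for loop but before `epoch += 1` and `sampler.set_epoch(epoch)` turns an infinite spin into an immediate, diagnosable failure.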
---
Nitpick comments:
In `@bionemo-recipes/models/codonfm/tests/common/test_modeling_common.py`:
- Around line 1081-1097: The abstract create_inference_params method forces
non-autoregressive subclasses to implement a no-op even though callers guard by
is_autoregressive; make create_inference_params concrete by providing a default
implementation that raises a clear runtime error (or returns None) when called,
or move the method into an autoregressive-only mixin used by autoregressive
tests; update references to create_inference_params and is_autoregressive
accordingly so masked-LM test suites no longer need dummy overrides.
In `@bionemo-recipes/models/codonfm/tests/golden_values.json`:
- Line 1: Golden JSON file is a single long line; update the generator to emit
pretty-printed JSON for easier diffs and review. Modify
generate_golden_values.py (the code that writes the golden values) to call
json.dump/json.dumps with indent=2 (or similar) and ensure separators/default
serialization remain unchanged, then rewrite the output file
(golden_values.json) using that formatted output so subsequent runs produce
multi-line indented JSON.
In `@bionemo-recipes/recipes/codonfm_native_te/checkpoint.py`:
- Around line 193-197: The current flow after save_checkpoint_fsdp2 lets
non-main ranks continue while the main rank may delete old checkpoints via
prune_checkpoints(ckpt_path, max_checkpoints), risking a race if other ranks
access those files; modify the shutdown so that after the conditional
prune_checkpoints call (guarded by dist_config.is_main_process()) you add a
distributed synchronization (e.g., torch.distributed.barrier with the
appropriate process_group) so all ranks wait until pruning completes before
proceeding; update calls around save_checkpoint_fsdp2/prune_checkpoints and
ensure the barrier uses the same process group used for training to avoid
deadlocks.
In `@bionemo-recipes/recipes/codonfm_native_te/distributed_config.py`:
- Around line 36-40: The DistributedConfig dataclass currently uses
os.environ.setdefault in the default_factory for fields rank, local_rank,
world_size, _master_addr, and _master_port which mutates the process
environment; if you want read-only behavior change each default_factory to use
os.environ.get (e.g., replace setdefault with get) so instantiating
DistributedConfig does not modify environment variables while preserving the
same default values.
In `@bionemo-recipes/recipes/codonfm_native_te/fp8_debugging_stats.yaml`:
- Around line 7-23: The YAML has inconsistent indentation between the
LogFp8TensorStats block and the LogTensorStats block; update the indentation so
both top-level keys and their child mappings align consistently (e.g., make
LogTensorStats use the same indent width as LogFp8TensorStats), ensuring the
keys "LogFp8TensorStats", "tensors_struct", "LogTensorStats", "stats",
"tensors", and "freq" are vertically aligned for improved readability.
In `@bionemo-recipes/recipes/codonfm_native_te/requirements.txt`:
- Around line 1-12: Update the requirements handling to improve reproducibility:
pin minimum (or exact) versions for critical runtime packages such as torch and
transformer_engine (e.g., replace "torch" and "transformer_engine[pytorch]" with
constrained versions consistent with
bionemo-recipes/models/codonfm/requirements.txt), and move test-only
dependencies like pytest out of this runtime requirements list into a dev/test
requirements file or test extras in pyproject.toml; ensure
pandas/pyarrow/safetensors/torchmetrics/tqdm/wandb also have at least minimum
version constraints to avoid drift.
In `@bionemo-recipes/recipes/codonfm_native_te/scheduler.py`:
- Around line 19-43: Add type hints and a complete Google-style docstring to
get_linear_schedule_with_warmup: annotate parameters (optimizer:
torch.optim.Optimizer, num_warmup_steps: int, num_training_steps: int,
last_epoch: int) and the return type (torch.optim.lr_scheduler.LambdaLR), and
expand the existing docstring to include Args (describe each param), Returns
(describe the LambdaLR scheduler), and Raises (if any). Mention lr_lambda in the
docstring as the internal schedule function and that the returned object is an
instance of LambdaLR constructed with lr_lambda.
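Applying those suggestions, a sketch of the annotated signature and docstring; the linear warmup/decay body is the standard formulation and may differ in detail from the recipe's:

```python
from torch.optim import Optimizer
from torch.optim.lr_scheduler import LambdaLR


def get_linear_schedule_with_warmup(
    optimizer: Optimizer,
    num_warmup_steps: int = 2_000,
    num_training_steps: int = 500_000,
    last_epoch: int = -1,
) -> LambdaLR:
    """Create a schedule with linear warmup followed by linear decay.

    Args:
        optimizer: The wrapped optimizer whose learning rate is scheduled.
        num_warmup_steps: Steps to ramp the LR linearly from 0 to its base value.
        num_training_steps: Total steps; the LR decays linearly to 0 by this point.
        last_epoch: Index of the last step when resuming training.

    Returns:
        A LambdaLR scheduler built from the internal lr_lambda function.
    """

    def lr_lambda(current_step: int) -> float:
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        return max(
            0.0,
            float(num_training_steps - current_step)
            / float(max(1, num_training_steps - num_warmup_steps)),
        )

    return LambdaLR(optimizer, lr_lambda, last_epoch)
```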
In `@bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py`:
- Line 22: The test setup currently uses the private symbol
_mesh_resources.mesh_stack.clear() which can break across PyTorch versions;
modify the conftest cleanup to defensively handle missing or changed internals
by wrapping the call to _mesh_resources.mesh_stack.clear() in a try/except
(catch AttributeError) so tests won’t crash if mesh_stack or _mesh_resources
shape changes, and leave a brief comment noting this is a private API; keep
usage of init_device_mesh untouched but ensure any cleanup uses the guarded call
to avoid relying on non-public APIs.
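The guarded cleanup can be written as a small helper. Here `mesh_resources` stands in for torch's private `torch.distributed.device_mesh._mesh_resources` object, which is exactly why the attribute access is wrapped rather than assumed:

```python
def reset_mesh_stack(mesh_resources) -> bool:
    """Best-effort cleanup of the device-mesh registry between tests.

    `mesh_resources.mesh_stack` is a private PyTorch API, so tolerate it
    disappearing or changing shape in future releases instead of crashing
    the whole test session.
    """
    try:
        mesh_resources.mesh_stack.clear()
        return True
    except AttributeError:
        # Private internals moved or were renamed; skip cleanup.
        return False
```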
ℹ️ Review info
⚙️ Run configuration
- Configuration used: Path: `.coderabbit.yaml`
- Review profile: CHILL
- Plan: Pro
- Run ID: 105ed43a-d143-4b91-acbc-4df8273fd6ad
⛔ Files ignored due to path filters (1)
`bionemo-recipes/recipes/codonfm_native_te/train.parquet` is excluded by `!**/*.parquet`
📒 Files selected for processing (35)
- `.gitignore`
- `bionemo-recipes/models/codonfm/dataset.py`
- `bionemo-recipes/models/codonfm/modeling_codonfm_te.py`
- `bionemo-recipes/models/codonfm/requirements.txt`
- `bionemo-recipes/models/codonfm/tests/common/README.md`
- `bionemo-recipes/models/codonfm/tests/common/__init__.py`
- `bionemo-recipes/models/codonfm/tests/common/fixtures.py`
- `bionemo-recipes/models/codonfm/tests/common/test_modeling_common.py`
- `bionemo-recipes/models/codonfm/tests/conftest.py`
- `bionemo-recipes/models/codonfm/tests/generate_golden_values.py`
- `bionemo-recipes/models/codonfm/tests/golden_state_dict.safetensors`
- `bionemo-recipes/models/codonfm/tests/golden_values.json`
- `bionemo-recipes/models/codonfm/tests/test_modeling_codonfm_te.py`
- `bionemo-recipes/models/codonfm/tokenizer.py`
- `bionemo-recipes/recipes/codonfm_native_te/checkpoint.py`
- `bionemo-recipes/recipes/codonfm_native_te/dataset.py`
- `bionemo-recipes/recipes/codonfm_native_te/distributed_config.py`
- `bionemo-recipes/recipes/codonfm_native_te/fp4_debugging_stats.yaml`
- `bionemo-recipes/recipes/codonfm_native_te/fp8_debugging_stats.yaml`
- `bionemo-recipes/recipes/codonfm_native_te/hydra_config/L0_sanity.yaml`
- `bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml`
- `bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py`
- `bionemo-recipes/recipes/codonfm_native_te/perf_logger.py`
- `bionemo-recipes/recipes/codonfm_native_te/quantization.py`
- `bionemo-recipes/recipes/codonfm_native_te/requirements.txt`
- `bionemo-recipes/recipes/codonfm_native_te/scheduler.py`
- `bionemo-recipes/recipes/codonfm_native_te/tests/__init__.py`
- `bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py`
- `bionemo-recipes/recipes/codonfm_native_te/tests/test_model.py`
- `bionemo-recipes/recipes/codonfm_native_te/tests/test_quantization.py`
- `bionemo-recipes/recipes/codonfm_native_te/tests/test_tokenizer.py`
- `bionemo-recipes/recipes/codonfm_native_te/tests/test_train.py`
- `bionemo-recipes/recipes/codonfm_native_te/tokenizer.py`
- `bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py`
- `ci/scripts/check_copied_files.py`
`@coderabbitai resolve`
✅ Actions performed: Comments resolved.
`@coderabbitai re-review`
✅ Actions performed: Full review triggered.
Actionable comments posted: 13
♻️ Duplicate comments (4)
bionemo-recipes/recipes/codonfm_native_te/tests/test_model.py (2)
83-96: ⚠️ Potential issue | 🟡 Minor
Add CUDA skip guard to the test class.
Every test in `TestCodonFMForMaskedLM` calls `.cuda()` unconditionally. On a CPU-only runner, this will cause hard failures instead of clean skips.
🛡️ Suggested fix

```diff
+@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA required")
 class TestCodonFMForMaskedLM:
     @pytest.mark.parametrize("preset_name", list(MODEL_PRESETS.keys()))
     def test_forward_all_presets(self, preset_name):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_model.py` around lines 83 - 96, The tests in TestCodonFMForMaskedLM (notably test_forward_all_presets) call .cuda() unconditionally which will fail on CPU-only runners; add a CUDA skip guard by marking the class or the test with pytest.mark.skipif(not torch.cuda.is_available(), reason="requires CUDA") or conditionally skip inside test using torch.cuda.is_available() so the test suite cleanly skips when CUDA is unavailable.
135-154: ⚠️ Potential issue | 🟠 Major
Fix the loss comparison to use a fresh forward pass.
Line 153 reads the stale loss from the final training iteration before `optimizer.step()` was applied. Additionally, the test doesn't fix the RNG or disable dropout, making the assertion flaky.
🔧 Suggested fix

```diff
 def test_loss_decreases(self, small_config):
+    torch.manual_seed(0)
+    torch.cuda.manual_seed_all(0)
+    small_config.hidden_dropout_prob = 0.0
+    small_config.attention_probs_dropout_prob = 0.0
     model = CodonFMForMaskedLM(small_config).cuda()
     optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
     batch_size, seq_len = 4, 32
     input_ids = torch.randint(5, small_config.vocab_size, (batch_size, seq_len), device="cuda")
     attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long, device="cuda")
     labels = input_ids.clone()
     initial_loss = None
     for _ in range(20):
         outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
         if initial_loss is None:
             initial_loss = outputs.loss.item()
         outputs.loss.backward()
         optimizer.step()
         optimizer.zero_grad()
-    final_loss = outputs.loss.item()
+    model.eval()
+    with torch.no_grad():
+        final_loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss.item()
     assert final_loss < initial_loss, f"Loss did not decrease: {initial_loss} -> {final_loss}"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_model.py` around lines 135 - 154, The test reads a stale loss from the last training iteration (outputs.loss) and is flaky due to nondeterminism/dropout; update test_loss_decreases to set torch.manual_seed(...) for reproducibility, run training loop on CodonFMForMaskedLM as-is, then switch the model to eval() and perform a fresh forward pass (e.g., with torch.no_grad()) using the same input_ids/attention_mask/labels to compute final_loss for comparison; ensure initial_loss is captured before any backward/step and final_loss is computed from the new forward pass (not from the cached outputs), and keep optimizer.step()/zero_grad() unchanged inside the loop.
bionemo-recipes/models/codonfm/tests/common/fixtures.py (1)
40-43: ⚠️ Potential issue | 🟡 Minor
Bind the probe socket to loopback only.
This helper only needs a local rendezvous port. Binding `("", 0)` exposes it on all interfaces unnecessarily on shared CI hosts; `("127.0.0.1", 0)` is sufficient.
Suggested fix

```diff
-        s.bind(("", 0))
+        s.bind(("127.0.0.1", 0))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tests/common/fixtures.py` around lines 40 - 43, Change the helper that binds a probe socket (the with socket.socket(...) block that returns s.getsockname()[1]) to bind to loopback only by replacing s.bind(("", 0)) with s.bind(("127.0.0.1", 0)) so the rendezvous port is only exposed locally instead of on all interfaces.
bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py (1)
52-60: ⚠️ Potential issue | 🟠 Major
Start the `try` before initializing CUDA/distributed state.
`init_process_group()` and `torch.cuda.set_device()` still execute before the `finally`. If anything in that block throws after the process group is created, the job leaks a live NCCL group.
Suggested structure

```diff
-    dist_config = DistributedConfig()
-    logger.info("Initializing distributed training: %s", dist_config)
-    device = torch.device(f"cuda:{dist_config.local_rank}")
-    torch.distributed.init_process_group(backend="nccl", device_id=device)
-    torch.cuda.set_device(dist_config.local_rank)
-
-    perf_logger = None
-    try:
+    perf_logger = None
+    try:
+        dist_config = DistributedConfig()
+        logger.info("Initializing distributed training: %s", dist_config)
+        device = torch.device(f"cuda:{dist_config.local_rank}")
+        torch.distributed.init_process_group(backend="nccl", device_id=device)
+        torch.cuda.set_device(dist_config.local_rank)
     ...
     finally:
         if perf_logger is not None:
             perf_logger.finish()
-        torch.distributed.destroy_process_group()
+        if torch.distributed.is_initialized():
+            torch.distributed.destroy_process_group()
```

Also applies to: 234-237
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py` around lines 52 - 60, The distributed/CUDA init must be wrapped in the try so the finally can clean up a partially-created NCCL group; move the try: to start before creating DistributedConfig and calling torch.distributed.init_process_group and torch.cuda.set_device (i.e., ensure DistributedConfig(), torch.distributed.init_process_group(backend="nccl", device_id=device), torch.cuda.set_device(dist_config.local_rank) are all inside the try) and keep the finally logic to destroy the process group; make the same change for the second initialization block that also creates DistributedConfig and calls init_process_group.
🧹 Nitpick comments (14)
bionemo-recipes/recipes/codonfm_native_te/tokenizer.py (1)
19-125: Avoid a second independent `CodonTokenizer` implementation.
This class encodes checkpoint-critical vocab and special-token ordering, but the model package already has a `CodonTokenizer`. If the two drift even slightly, the recipe can silently train or load against the wrong token IDs. Please reuse the model tokenizer here, or add a direct equivalence test that locks the two implementations together.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/tokenizer.py` around lines 19 - 125, This file introduces a second, divergent CodonTokenizer implementation (class CodonTokenizer, methods tokenize/encode/decode, SPECIAL_TOKENS) which can drift from the model's canonical tokenizer; either remove this duplicate and import and use the model's tokenizer (e.g., replace the class with an import of the model package's CodonTokenizer and update call sites to use that single symbol), or if import is infeasible add a unit test that asserts exact equivalence between the local implementation and the model tokenizer (compare SPECIAL_TOKENS, vocab_size, encoder keys/values, decoder mappings and the results of tokenize/encode/decode on representative sequences) so the two remain locked together.
bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py (1)
37-44: Make the NCCL/device-mesh setup opt-in.
With `autouse=True`, every test in this package initializes distributed CUDA state via `init_process_group()` and `cuda.set_device()`, even tests that never use it. Tests like `test_tokenizer.py`, `test_model.py`, `test_quantization.py`, and `test_train.py` don't reference the fixture, forcing them to depend on GPU and NCCL availability unnecessarily.
Remove `autouse=True` to make the fixture opt-in. Tests that need the mocked distributed functions (patched at lines 46-50) can explicitly request the fixture by adding `device_mesh` as a parameter.
♻️ Remove autouse to make fixture opt-in

```diff
-@pytest.fixture(scope="session", autouse=True)
+@pytest.fixture(scope="session")
 def device_mesh():
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py` around lines 37 - 44, The device_mesh fixture currently runs for every test due to autouse=True which forces NCCL/CUDA initialization; remove autouse=True from the pytest.fixture decorator on the device_mesh function and keep the existing setup calls (DistributedConfig(), torch.distributed.init_process_group, torch.cuda.set_device, init_device_mesh) intact so it only runs when tests explicitly request the fixture by adding device_mesh as a parameter (update any tests that need the distributed environment to include device_mesh in their function signature).
bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py (2)
317-318: THD format squeeze assumes batch dimension of 1.
The squeeze operation `hidden_states.squeeze(0)` only triggers when `hidden_states.size(0) == 1`. This is correct for THD format but the assumption should be documented or asserted.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py` around lines 317 - 318, The code assumes THD attention input has a singleton batch dim when doing hidden_states.squeeze(0) inside the block checking self.config.attn_input_format == "thd"; add an explicit assertion or documented guard before squeezing (e.g., assert hidden_states.size(0) == 1 or raise a clear error) and/or a short comment explaining the THD layout expectation so the squeeze on hidden_states in modeling_codonfm_te.py is safe and self-explanatory.
212-222: Inconsistent error handling for FP8 vs FP4 without layer_precision.
When `layer_precision` is None with only `fp8_recipe`, the code auto-populates it with `["fp8"] * num_layers` and issues a warning. However, when only `fp4_recipe` is provided, it raises a `RuntimeError`. This asymmetry may be confusing.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py` around lines 212 - 222, The branch handling self.config.layer_precision is inconsistent: when fp8_recipe is provided it warns and auto-sets self.config.layer_precision = ["fp8"] * self.config.num_hidden_layers, but when only fp4_recipe is provided it raises a RuntimeError. Make the behavior consistent by changing the fp4_recipe branch to mirror the fp8 branch—emit a UserWarning and auto-populate self.config.layer_precision = ["fp4"] * self.config.num_hidden_layers (or alternatively make both branches raise); update the code paths around fp8_recipe, fp4_recipe, and self.config.layer_precision accordingly so the logic is symmetric.
bionemo-recipes/recipes/codonfm_native_te/checkpoint.py (2)
188-191: Return value mismatch between sync and async save.
`dcp_save` returns `None` while `dcp_async_save` returns a `Future`. Assigning the result unconditionally to `_ckpt_futures["fsdp2"]` means the sync path stores `None`, which works but is semantically unclear.
🔧 Proposed clarification

```diff
     ckpt_save_func = dcp_async_save if async_save else dcp_save
-    _ckpt_futures["fsdp2"] = ckpt_save_func(state_dict, checkpoint_id=checkpoint_path, process_group=process_group)
+    result = ckpt_save_func(state_dict, checkpoint_id=checkpoint_path, process_group=process_group)
+    if async_save:
+        _ckpt_futures["fsdp2"] = result
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/checkpoint.py` around lines 188 - 191, The sync and async save functions return different types but the code unconditionally assigns their return to _ckpt_futures["fsdp2"]; update the logic around ckpt_save_func(state_dict, checkpoint_id=checkpoint_path, process_group=process_group) so that when using dcp_async_save you store the returned Future in _ckpt_futures["fsdp2"], and when using dcp_save you either set _ckpt_futures["fsdp2"] to None (or remove the key) to make the semantic difference explicit; locate the assignment around state_dict, ckpt_save_func, dcp_async_save, dcp_save and change it to a conditional that only assigns the Future for the async path.
42-42: Global mutable state for checkpoint futures could cause issues.
The module-level `_ckpt_futures` dict works for single-checkpoint-type scenarios but may cause subtle bugs if extended to track multiple checkpoint types concurrently. Consider documenting this limitation or using a more structured approach.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/checkpoint.py` at line 42, The module-level mutable _ckpt_futures dict can cause cross-type collisions; replace it with a structured manager or namespaced mapping and update call sites to use it: create a CheckpointManager (or change _ckpt_futures to a dict-of-dicts keyed by checkpoint type) with methods like get_future(checkpoint_type, key) and set_future(checkpoint_type, key, future), then refactor any functions that reference _ckpt_futures to call these methods (or pass an explicit manager instance) so futures are isolated per checkpoint type; alternatively, if you keep the module-level variable, add clear documentation and a typed alias to signal the limitation.
bionemo-recipes/recipes/codonfm_native_te/dataset.py (1)
131-138: MLM masking uses unseeded `random.random()`, causing non-deterministic results across workers and epochs.
The collator uses the global `random` module without seeding, so different DataLoader workers and different epochs will produce different masks for the same samples. This may be intentional for training diversity but makes debugging and reproducibility harder.
🔧 Proposed fix: Pass and use a generator

```diff
 class CodonMLMCollator:
     def __init__(
         self,
         tokenizer: CodonTokenizer,
         max_seq_length: int = 512,
         mlm_probability: float = 0.15,
+        seed: int | None = None,
     ):
         self.tokenizer = tokenizer
         self.max_seq_length = max_seq_length
         self.mlm_probability = mlm_probability
+        self.rng = random.Random(seed)

     def __call__(self, batch: list[dict[str, str]]) -> dict[str, torch.Tensor]:
         # ... in the masking loop:
-        if random.random() < self.mlm_probability:
+        if self.rng.random() < self.mlm_probability:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/dataset.py` around lines 131 - 138, The MLM masking uses the global random module (random.random()/random.randint()) which causes non-determinism; modify the collator/dataset to accept and use a passed RNG (e.g., an instance named rng or self.rng) instead of the global random, and replace calls random.random() and random.randint(...) with rng.random() and rng.randint(5, self.tokenizer.vocab_size - 1); ensure the RNG is seeded per worker (using torch.utils.data.get_worker_info or by creating per-worker rng seeds) so that mlm_probability, ids assignment, tokenizer.mask_token_id and labels behavior are reproducible across workers and epochs.
bionemo-recipes/recipes/codonfm_native_te/README.md (1)
49-66: Add language specifier to fenced code block.
The directory structure code block is missing a language identifier. While `text` or `plaintext` would work, this improves markdown rendering consistency.
♻️ Suggested fix

````diff
-```
+```text
 codonfm_native_te/
 ├── modeling_codonfm_te.py — HF-compatible CodonFM model with TE layers
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/README.md` around lines 49 - 66, Edit the fenced code block in README.md that displays the codonfm_native_te/ directory tree (the block containing lines like "codonfm_native_te/" and "modeling_codonfm_te.py") and add a language specifier (e.g., change ``` to ```text or ```plaintext) so the block is rendered as plain text.
bionemo-recipes/recipes/codonfm_native_te/scheduler.py (1)
19-24: Consider adding type hints for the function signature.
The function parameters lack type annotations, which would improve readability and enable better static analysis.
♻️ Suggested improvement

```diff
+from torch.optim import Optimizer
+
 def get_linear_schedule_with_warmup(
-    optimizer,
+    optimizer: Optimizer,
     num_warmup_steps: int = 2_000,
     num_training_steps: int = 500_000,
     last_epoch: int = -1,
-):
+) -> LambdaLR:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/scheduler.py` around lines 19 - 24, Add type hints to get_linear_schedule_with_warmup by annotating parameters and return type: change the signature to include optimizer: Optimizer, num_warmup_steps: int = 2_000, num_training_steps: int = 500_000, last_epoch: int = -1 and a return type of _LRScheduler (or LambdaLR) and import Optimizer and _LRScheduler (or LambdaLR) from torch.optim / torch.optim.lr_scheduler; update any internal references if needed to satisfy the new annotations for get_linear_schedule_with_warmup.
bionemo-recipes/models/codonfm/modeling_codonfm_te.py (2)
206-216: Asymmetric handling of FP8 vs FP4 when no layer_precision is provided.
When `layer_precision` is None:
- FP8 recipe alone → auto-populates `layer_precision` with `["fp8"] * num_layers` (Line 211)
- FP4 recipe alone → raises `RuntimeError` (Lines 213-216)

This asymmetry could be confusing. Consider either making FP4 behave consistently with FP8 (auto-populate), or documenting why FP4 requires explicit layer specification.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/modeling_codonfm_te.py` around lines 206 - 216, The code currently auto-populates self.config.layer_precision when fp8_recipe is present but raises for fp4_recipe; change this to handle FP4 symmetrically by warning and setting self.config.layer_precision = ["fp4"] * self.config.num_hidden_layers when fp4_recipe is provided and fp8_recipe is None (use the same warnings.warn pattern as for fp8), or alternatively add a clear comment explaining why FP4 requires explicit layer_precision; update the conditional branch around fp8_recipe/fp4_recipe in the layer-precision initialization (referencing self.config.layer_precision, fp8_recipe, fp4_recipe, and self.config.num_hidden_layers) accordingly.
251-291: Complex nested context managers for layer-wise precision.
The `get_autocast_context` method handles multiple scenarios (init vs forward, outer vs per-layer, fp8 vs fp4 vs bf16). While functional, the logic is intricate:
- `outer=True` wraps the entire encoder stack for FP8 only
- `init=True` with `use_quantized_model_init` applies `quantized_model_init`
- Per-layer forward uses `te.autocast` with the appropriate recipe

The multiple conditional branches could benefit from inline comments explaining the control flow for maintainability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/modeling_codonfm_te.py` around lines 251 - 291, Add concise inline comments inside get_autocast_context to document each control path: note the early return when self.config.layer_precision is None; explain the outer branch (outer True) only enables FP8 autocast and warns when self._fp8_recipe is missing before calling transformer_engine.pytorch.autocast; annotate the init branch (init True and self.config.use_quantized_model_init) to state it uses transformer_engine.pytorch.quantized_model_init for FP8/FP4 and raises if FP4 recipe (_fp4_recipe) is missing; and comment the per-layer forward branches that choose between FP8/FP4 autocast using recipe variables (_fp8_recipe/_fp4_recipe) or return a disabled autocast by default.
bionemo-recipes/models/codonfm/tests/test_modeling_codonfm_te.py (1)
149-170: Same masking randomness issue as in generate_golden_values.py.
`torch.manual_seed(42)` is called inside the loop at Line 156 for each sample, causing identical masking probabilities across samples at the same positions. This matches the generator script behavior, so the tests should pass, but it's worth noting the limitation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tests/test_modeling_codonfm_te.py` around lines 149 - 170, The test currently calls torch.manual_seed(42) inside the per-sample loop, causing each sample to use the same RNG state and produce identical mask positions; move the torch.manual_seed(42) call out of the loop (place it once before iterating over encoded) so masking remains deterministic across test runs but varies between samples (reference variables: torch.manual_seed, encoded loop, ids/labels generation; related file generate_golden_values.py).
bionemo-recipes/recipes/codonfm_native_te/quantization.py (1)
101-109: Temporary file is not cleaned up.
The `tempfile.NamedTemporaryFile` is created with `delete=False`, but the file is never cleaned up. This could lead to temporary file accumulation over multiple runs.
Consider returning a cleanup callback or documenting the caller's responsibility
The caller (`initialize_quant_stats_logging`) passes this path to `debug_api.initialize()`, so the file needs to persist during the debug session. Document that cleanup is the caller's responsibility, or return a context manager/cleanup function.
Alternatively, if the temp file should persist for the duration of the training run, this is acceptable but should be documented.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/quantization.py` around lines 101 - 109, The temporary file created with tempfile.NamedTemporaryFile(delete=False) (temp_file) in the quantization function is never removed; update the function (the one that returns temp_file.name and is used by initialize_quant_stats_logging -> debug_api.initialize) to either (A) return a paired cleanup callback or context-manager that deletes the file when the caller is done, or (B) document clearly that the caller is responsible for removing the returned path; implement option (A) by returning a small function like cleanup() that unlinks temp_file.name (or a contextmanager that yields the path and unlinks on __exit__), and keep logger lines as-is to preserve debug output.
bionemo-recipes/models/codonfm/tests/generate_golden_values.py (1)
111-135: Masking randomness is not deterministic per-sample due to repeated seeding.
`torch.manual_seed(SEED)` is called inside the loop at Line 122 for each sample. This resets the RNG state before processing each sequence, meaning the masking pattern depends only on position, not on which sample is being processed. While this may be intentional for reproducibility, it creates identical masking probabilities for all samples at the same position indices.
If truly independent-but-reproducible masking per sample is desired:

```diff
 for enc in encoded:
     pad_len = max_len - len(enc)
     ids = enc + [tokenizer.pad_token_id] * pad_len
     mask = [1] * len(enc) + [0] * pad_len
     lbl = [-100] * max_len
-    torch.manual_seed(SEED)
+    torch.manual_seed(SEED + len(input_ids))  # per-sample seed
     for i in range(1, len(enc) - 1):
```

However, if this behavior is intentional for exact reproducibility with the test file, no change is needed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tests/generate_golden_values.py` around lines 111 - 135, The prepare_test_inputs function currently calls torch.manual_seed(SEED) inside the per-sample loop which resets RNG each iteration and yields identical position-wise masking across samples; move the seeding so it produces the intended behavior: either seed once before the loop (keep torch.manual_seed(SEED) above the for enc in encoded: line) to get reproducible but varied masks across samples, or implement per-sample deterministic seeds (e.g., torch.manual_seed(SEED + i) or use a per-sample torch.Generator seeded from SEED) so each sample has independent yet reproducible masking; update the code around torch.manual_seed(SEED) and the masking loop in prepare_test_inputs accordingly.
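The per-sample seeding idea above, shown with the stdlib `random` module for brevity (the generator script itself uses `torch.manual_seed`): seeding each stream with `base_seed + sample_index` keeps the whole set reproducible while giving every sample its own mask.

```python
import random


def per_sample_masks(num_samples: int, seq_len: int, base_seed: int = 42,
                     p: float = 0.15) -> list[list[bool]]:
    """One Bernoulli mask per sample, each from its own seeded RNG stream."""
    masks = []
    for idx in range(num_samples):
        rng = random.Random(base_seed + idx)  # per-sample seed, not a constant
        masks.append([rng.random() < p for _ in range(seq_len)])
    return masks
```

Reseeding with the same constant inside the loop would instead collapse all streams onto one, which is exactly the position-only masking pattern the comment describes.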
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bionemo-recipes/models/codonfm/dataset.py`:
- Around line 125-139: The MLM masking loop uses the global random module
(random.random(), random.randint()) which makes masks non-reproducible across
DataLoader workers; fix by switching to a per-instance RNG seeded from a
deterministic seed (and worker id) and use that RNG inside the loop. Concretely:
add a RNG attribute (e.g., self.rng = random.Random(base_seed) or create a
torch.Generator seeded from base_seed in the dataset/collator __init__), derive
a per-worker seed via torch.utils.data.get_worker_info() when the
collator/dataset is used in a worker, and replace calls to random.random() and
random.randint(...) in the MLM masking block (the code referencing
self.mlm_probability, tokenizer.mask_token_id, tokenizer.vocab_size, labels,
ids) with self.rng.random() and self.rng.randint(...) (or generator-based
equivalents) so masking is deterministic and reproducible across runs and
workers.
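A minimal torch-free sketch of the per-instance RNG described above. Deriving the stream from `(base_seed, worker_id)` is the key point; in the real collator `worker_id` would come from `torch.utils.data.get_worker_info()`, and the token ids here are illustrative placeholders:

```python
import random


class SeededMasker:
    """MLM-style masking driven by an instance RNG instead of the global one."""

    def __init__(self, base_seed: int, worker_id: int = 0,
                 mlm_probability: float = 0.15, mask_token_id: int = 4):
        # Distinct but deterministic stream per (seed, worker) pair.
        # A string seed is used because random.Random no longer accepts tuples.
        self.rng = random.Random(f"{base_seed}-{worker_id}")
        self.mlm_probability = mlm_probability
        self.mask_token_id = mask_token_id

    def __call__(self, ids: list[int]) -> tuple[list[int], list[int]]:
        masked, labels = list(ids), [-100] * len(ids)
        for i in range(len(ids)):
            if self.rng.random() < self.mlm_probability:
                labels[i] = ids[i]
                masked[i] = self.mask_token_id
        return masked, labels
```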
In `@bionemo-recipes/models/codonfm/modeling_codonfm_te.py`:
- Around line 432-470: In _magneto_init_weights, the name-based Q/K detection
("query" or "key" in name) misses fused TE QKV modules (e.g. self_attention.qkv)
so their weights get the scaled MAGNETO gain instead of 1.0; update the
detection used when computing is_qk in the loop over named_modules() (in the
_magneto_init_weights method) to also treat "qkv" module names as Q/K (for
example add a check for "qkv" in name) so that the fused QKV module and its
fused weight attributes are initialized with w_gain=1.0.
In `@bionemo-recipes/models/codonfm/requirements.txt`:
- Around line 1-3: The requirements file pins a non-existent minimum for
transformer_engine (transformer_engine>=1.14.0); change this to a valid released
minimum such as transformer_engine>=2.1.0 (or >=1.13.0 if targeting 1.x) in the
requirements.txt entry for transformer_engine, and confirm the other entries
torch>=2.6.0 and transformers>=4.48.0 are correct for the codebase (especially
if your code uses FP8/FP4/QuantizedTensor features prefer >=2.1.0).
In `@bionemo-recipes/models/codonfm/tests/common/__init__.py`:
- Around line 33-41: Update the docstring example to reference CodonFM classes
instead of ESM2: rename the example class ESM2ModelTester to CodonFMModelTester
and change the returned class from NVEsmForMaskedLM to CodonFMForMaskedLM in the
get_model_class implementation so the sample reflects CodonFM test utilities.
In `@bionemo-recipes/models/codonfm/tests/common/fixtures.py`:
- Around line 28-30: The fixture currently sets NVTE_DEBUG after importing
transformer_engine.pytorch so the setting is ineffective; change the fixture to
set os.environ['NVTE_DEBUG'] (use monkeypatch.setenv in tests) before importing
transformer_engine modules and then reload the affected module(s) with
importlib.reload (e.g., reload transformer_engine.pytorch or recipe_module) so
the new env var takes effect; update the copied source in
bionemo-recipes/models/esm2/tests/common/fixtures.py to match this pattern used
in conftest.py (set env var via monkeypatch, then import/reload
transformer_engine modules).
In `@bionemo-recipes/recipes/codonfm_native_te/checkpoint.py`:
- Around line 80-91: prune_checkpoints currently deletes older checkpoint dirs
and can race with other ranks that may still be writing; update the caller or
prune_checkpoints to synchronize before removing files: if a distributed backend
is initialized (e.g., torch.distributed.is_initialized()), call
torch.distributed.barrier() (or your project's sync_barrier wrapper) so all
ranks finish writing before running prune_checkpoints, then proceed to
list/sort/delete; alternatively, add a clear docstring to prune_checkpoints and
its call site noting it must only run after a global barrier. Ensure you
reference the prune_checkpoints function and the place where it is invoked so
reviewers can find and apply the barrier.
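A minimal sketch of the barrier-then-prune ordering, with the barrier injected as a callable so the example stays runnable without torch; in the real recipe this would be `torch.distributed.barrier()` guarded by `torch.distributed.is_initialized()`, and the function/parameter names here are illustrative:

```python
import os
import shutil
import tempfile


def prune_checkpoints(ckpt_dir: str, keep_last: int, barrier=None) -> None:
    """Delete all but the newest ``keep_last`` checkpoint directories.

    ``barrier`` is a hypothetical injection point for torch.distributed.barrier;
    calling it first ensures every rank finishes writing before any rank deletes.
    """
    if barrier is not None:
        barrier()  # e.g. torch.distributed.barrier() when a backend is initialized
    entries = sorted(
        (e for e in os.scandir(ckpt_dir) if e.is_dir()),
        key=lambda e: e.stat().st_mtime,
    )
    for stale in entries[:-keep_last] if keep_last else entries:
        shutil.rmtree(stale.path)


# Minimal check on a throwaway directory.
root = tempfile.mkdtemp()
for name in ("step_10", "step_20", "step_30"):
    os.makedirs(os.path.join(root, name))
prune_checkpoints(root, keep_last=2, barrier=lambda: None)
remaining = sorted(os.listdir(root))
```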
In `@bionemo-recipes/recipes/codonfm_native_te/distributed_config.py`:
- Around line 36-40: The field default factories in DistributedConfig (rank,
local_rank, world_size, _master_addr, _master_port) currently call
os.environ.setdefault(...) which mutates the process environment; change each
default_factory to use os.getenv(...) (e.g., os.getenv("RANK", "0")) so the
defaults are read without writing to os.environ, preserving process/test
isolation and avoiding leaking defaults into child processes—do this for the
lambda bodies that initialize rank, local_rank, world_size, _master_addr, and
_master_port.
In `@bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml`:
- Around line 41-44: Replace the hard-coded scheduler horizon in
lr_scheduler_kwargs (num_training_steps: 500_000) so it derives from the
top-level training steps value (e.g., num_train_steps / args.num_train_steps)
instead of a fixed constant; update the defaults.yaml to reference the
global/train config variable for num_train_steps and ensure num_warmup_steps
remains relative to that horizon if needed so the scheduler matches the actual
training loop.
In `@bionemo-recipes/recipes/codonfm_native_te/requirements.txt`:
- Around line 1-12: requirements.txt is missing the transformers dependency
(imported in modeling_codonfm_te.py) and lacks version constraints; update
bionemo-recipes/recipes/codonfm_native_te/requirements.txt to include a
transformers entry and add >= style version bounds for key packages (e.g.,
transformers, torch, pandas, pyarrow, hydra-core, etc.) consistent with
models/codonfm/requirements.txt so installs are reproducible—ensure the package
name is "transformers" and use the same minimum versions used in models/codonfm
to keep compatibility.
In `@bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py`:
- Around line 46-49: The existing mocks don't intercept the directly imported
init_device_mesh used in train_fsdp2 (imported as from
torch.distributed.device_mesh import init_device_mesh), so add a patch for the
recipe module's namespace
(mock.patch("codonfm_native_te.train_fsdp2.init_device_mesh",
return_value=device_mesh)) to the with(...) context alongside the existing
patches; update the mocked targets in conftest.py to include this new patch so
calls inside train_fsdp2 use the mocked device_mesh.
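The reason the extra patch target is needed is that `from ... import name` binds the symbol into the importing module's own namespace, so the patch must target that namespace. This can be sketched with a stand-in module (the module object and return values are hypothetical):

```python
import types
from unittest import mock

# Hypothetical stand-in for train_fsdp2, which binds the name at import time via
# `from torch.distributed.device_mesh import init_device_mesh`.
train_fsdp2 = types.ModuleType("train_fsdp2")
train_fsdp2.init_device_mesh = lambda *args, **kwargs: "real mesh"


def build_mesh():
    # Inside the recipe this call resolves through train_fsdp2's own namespace,
    # so patching torch.distributed.device_mesh.init_device_mesh would not reach it.
    return train_fsdp2.init_device_mesh("cuda", (2,))


with mock.patch.object(train_fsdp2, "init_device_mesh", return_value="mocked mesh"):
    patched = build_mesh()

unpatched = build_mesh()  # the original binding is restored on exit
```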
In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_train.py`:
- Around line 32-110: The tests currently inherit a fixed MASTER_PORT from
DistributedConfig which can cause cross-test rendezvous collisions; add a
per-test fixture that uses pytest's unused_tcp_port (or similar) to pick an
unused port and sets MASTER_PORT (and optionally MASTER_ADDR) in the test
environment before calling main_fsdp2. Modify conftest.py to include a
function-scoped fixture (used by tests like test_sanity_convergence_fsdp2,
test_sanity_convergence_fsdp2_no_meta_device, test_sanity_convergence_fsdp2_thd,
test_sanity_convergence_fsdp2_fp8,
test_sanity_convergence_fsdp2_fp32_master_weights) that monkeypatches
os.environ["MASTER_PORT"]=str(unused_tcp_port) (and
os.environ["MASTER_ADDR"]="127.0.0.1") so each test gets a unique rendezvous
port instead of relying on the session-scoped init_process_group mock.
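A minimal sketch of the port-picking helper such a fixture could use; the fixture shown in the comment is an assumption about how conftest.py would wire it up, and the names are hypothetical:

```python
import socket


def free_tcp_port() -> int:
    """Bind an ephemeral loopback socket and return its port number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


# In conftest.py this could back a function-scoped fixture, e.g.:
#
# @pytest.fixture(autouse=True)
# def rendezvous_port(monkeypatch):
#     monkeypatch.setenv("MASTER_ADDR", "127.0.0.1")
#     monkeypatch.setenv("MASTER_PORT", str(free_tcp_port()))

port = free_tcp_port()
```

Note the port is released before the rendezvous actually binds it, so a rare reuse race remains possible; per-test ephemeral ports still remove the systematic collisions a fixed `MASTER_PORT` causes.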
In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py`:
- Line 233: The function main() currently returns perf_logger.min_loss which is
a tensor; change the return to a Python float or None by converting the tensor
to a scalar (e.g. use perf_logger.min_loss.item() and wrap with float(...))
before returning so callers receive a CPU Python float consistent with the
declared return type float | None; update the return statement in main() that
references perf_logger.min_loss accordingly.
- Around line 82-92: The debug session started by debug_api.initialize() inside
initialize_quant_stats_logging() must be torn down even if an exception occurs
before PerfLogger is constructed; modify the code so debug_api.end_debug() is
called independently of PerfLogger.finish() — either move debug initialization
to occur after PerfLogger is created, or wrap the region between
initialize_quant_stats_logging(...) and perf_logger = PerfLogger(...) in a
try/finally (or register debug_api.end_debug with atexit) and call
debug_api.end_debug() in the finally/cleanup path; reference
initialize_quant_stats_logging, debug_api.initialize, debug_api.end_debug, and
PerfLogger.finish/PerfLogger to locate and implement the change.
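The try/finally shape can be sketched with a stub debug API; the stub is hypothetical, and only the initialize/end_debug pairing mirrors the real `nvdlfw_inspect` interface described above:

```python
class FakeDebugApi:
    """Stub mirroring the initialize()/end_debug() pair of debug_api."""

    def __init__(self):
        self.active = False

    def initialize(self):
        self.active = True

    def end_debug(self):
        self.active = False


debug_api = FakeDebugApi()


def run_training(fail_before_perf_logger: bool) -> None:
    debug_api.initialize()
    try:
        if fail_before_perf_logger:
            # Simulates an exception raised before PerfLogger is constructed.
            raise RuntimeError("PerfLogger construction failed")
        # ... build PerfLogger, run the training loop, PerfLogger.finish() ...
    finally:
        # Runs even when the exception fires before PerfLogger.finish() exists.
        debug_api.end_debug()


try:
    run_training(fail_before_perf_logger=True)
except RuntimeError:
    pass
```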
---
Duplicate comments:
In `@bionemo-recipes/models/codonfm/tests/common/fixtures.py`:
- Around line 40-43: Change the helper that binds a probe socket (the with
socket.socket(...) block that returns s.getsockname()[1]) to bind to loopback
only by replacing s.bind(("", 0)) with s.bind(("127.0.0.1", 0)) so the
rendezvous port is only exposed locally instead of on all interfaces.
In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_model.py`:
- Around line 83-96: The tests in TestCodonFMForMaskedLM (notably
test_forward_all_presets) call .cuda() unconditionally which will fail on
CPU-only runners; add a CUDA skip guard by marking the class or the test with
pytest.mark.skipif(not torch.cuda.is_available(), reason="requires CUDA") or
conditionally skip inside test using torch.cuda.is_available() so the test suite
cleanly skips when CUDA is unavailable.
- Around line 135-154: The test reads a stale loss from the last training
iteration (outputs.loss) and is flaky due to nondeterminism/dropout; update
test_loss_decreases to set torch.manual_seed(...) for reproducibility, run
training loop on CodonFMForMaskedLM as-is, then switch the model to eval() and
perform a fresh forward pass (e.g., with torch.no_grad()) using the same
input_ids/attention_mask/labels to compute final_loss for comparison; ensure
initial_loss is captured before any backward/step and final_loss is computed
from the new forward pass (not from the cached outputs), and keep
optimizer.step()/zero_grad() unchanged inside the loop.
In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py`:
- Around line 52-60: The distributed/CUDA init must be wrapped in the try so the
finally can clean up a partially-created NCCL group; move the try: to start
before creating DistributedConfig and calling
torch.distributed.init_process_group and torch.cuda.set_device (i.e., ensure
DistributedConfig(), torch.distributed.init_process_group(backend="nccl",
device_id=device), torch.cuda.set_device(dist_config.local_rank) are all inside
the try) and keep the finally logic to destroy the process group; make the same
change for the second initialization block that also creates DistributedConfig
and calls init_process_group.
---
Nitpick comments:
In `@bionemo-recipes/models/codonfm/modeling_codonfm_te.py`:
- Around line 206-216: The code currently auto-populates
self.config.layer_precision when fp8_recipe is present but raises for
fp4_recipe; change this to handle FP4 symmetrically by warning and setting
self.config.layer_precision = ["fp4"] * self.config.num_hidden_layers when
fp4_recipe is provided and fp8_recipe is None (use the same warnings.warn
pattern as for fp8), or alternatively add a clear comment explaining why FP4
requires explicit layer_precision; update the conditional branch around
fp8_recipe/fp4_recipe in the layer-precision initialization (referencing
self.config.layer_precision, fp8_recipe, fp4_recipe, and
self.config.num_hidden_layers) accordingly.
- Around line 251-291: Add concise inline comments inside get_autocast_context
to document each control path: note the early return when
self.config.layer_precision is None; explain the outer branch (outer True) only
enables FP8 autocast and warns when self._fp8_recipe is missing before calling
transformer_engine.pytorch.autocast; annotate the init branch (init True and
self.config.use_quantized_model_init) to state it uses
transformer_engine.pytorch.quantized_model_init for FP8/FP4 and raises if FP4
recipe (_fp4_recipe) is missing; and comment the per-layer forward branches that
choose between FP8/FP4 autocast using recipe variables (_fp8_recipe/_fp4_recipe)
or return a disabled autocast by default.
In `@bionemo-recipes/models/codonfm/tests/generate_golden_values.py`:
- Around line 111-135: The prepare_test_inputs function currently calls
torch.manual_seed(SEED) inside the per-sample loop which resets RNG each
iteration and yields identical position-wise masking across samples; move the
seeding so it produces the intended behavior: either seed once before the loop
(keep torch.manual_seed(SEED) above the for enc in encoded: line) to get
reproducible but varied masks across samples, or implement per-sample
deterministic seeds (e.g., torch.manual_seed(SEED + i) or use a per-sample
torch.Generator seeded from SEED) so each sample has independent yet
reproducible masking; update the code around torch.manual_seed(SEED) and the
masking loop in prepare_test_inputs accordingly.
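The difference between the two seeding placements can be demonstrated with the stdlib RNG, a stand-in for the torch RNG used in the script:

```python
import random

SEED = 42
positions = range(8)


def draw_mask(rng):
    """Pick positions to mask with probability 0.5 (illustrative rate)."""
    return [p for p in positions if rng.random() < 0.5]


# Reseeding inside the loop resets the RNG state, so every sample gets the
# identical mask -- the bug described above.
masks_reseeded = []
for _ in range(3):
    rng = random.Random(SEED)
    masks_reseeded.append(draw_mask(rng))

# Seeding once before the loop keeps the run reproducible while letting the
# mask vary from sample to sample.
rng = random.Random(SEED)
masks_once = [draw_mask(rng) for _ in range(3)]
```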
In `@bionemo-recipes/models/codonfm/tests/test_modeling_codonfm_te.py`:
- Around line 149-170: The test currently calls torch.manual_seed(42) inside the
per-sample loop, causing each sample to use the same RNG state and produce
identical mask positions; move the torch.manual_seed(42) call out of the loop
(place it once before iterating over encoded) so masking remains deterministic
across test runs but varies between samples (reference variables:
torch.manual_seed, encoded loop, ids/labels generation; related file
generate_golden_values.py).
In `@bionemo-recipes/recipes/codonfm_native_te/checkpoint.py`:
- Around line 188-191: The sync and async save functions return different types
but the code unconditionally assigns their return to _ckpt_futures["fsdp2"];
update the logic around ckpt_save_func(state_dict,
checkpoint_id=checkpoint_path, process_group=process_group) so that when using
dcp_async_save you store the returned Future in _ckpt_futures["fsdp2"], and when
using dcp_save you either set _ckpt_futures["fsdp2"] to None (or remove the key)
to make the semantic difference explicit; locate the assignment around
state_dict, ckpt_save_func, dcp_async_save, dcp_save and change it to a
conditional that only assigns the Future for the async path.
- Line 42: The module-level mutable _ckpt_futures dict can cause cross-type
collisions; replace it with a structured manager or namespaced mapping and
update call sites to use it: create a CheckpointManager (or change _ckpt_futures
to a dict-of-dicts keyed by checkpoint type) with methods like
get_future(checkpoint_type, key) and set_future(checkpoint_type, key, future),
then refactor any functions that reference _ckpt_futures to call these methods
(or pass an explicit manager instance) so futures are isolated per checkpoint
type; alternatively, if you keep the module-level variable, add clear
documentation and a typed alias to signal the limitation.
In `@bionemo-recipes/recipes/codonfm_native_te/dataset.py`:
- Around line 131-138: The MLM masking uses the global random module
(random.random()/random.randint()) which causes non-determinism; modify the
collator/dataset to accept and use a passed RNG (e.g., an instance named rng or
self.rng) instead of the global random, and replace calls random.random() and
random.randint(...) with rng.random() and rng.randint(5,
self.tokenizer.vocab_size - 1); ensure the RNG is seeded per worker (using
torch.utils.data.get_worker_info or by creating per-worker rng seeds) so that
mlm_probability, ids assignment, tokenizer.mask_token_id and labels behavior are
reproducible across workers and epochs.
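A sketch of a collator that owns a seeded RNG and derives a per-worker stream; the class and method names are hypothetical, and the real code would obtain the worker id from `torch.utils.data.get_worker_info()` (e.g. in a `worker_init_fn`):

```python
import random


class MaskingCollator:
    """Sketch of a collator that owns its RNG instead of using the global random module."""

    def __init__(self, seed: int):
        self.seed = seed
        self.rng = random.Random(seed)

    def reseed_for_worker(self, worker_id: int) -> None:
        # Derive an independent, reproducible stream per dataloader worker.
        self.rng = random.Random(self.seed + worker_id)

    def draw_mask(self, length: int, mlm_probability: float = 0.15):
        return [i for i in range(length) if self.rng.random() < mlm_probability]


# Two workers with the same factory seed now produce different masks,
# while re-running the same worker reproduces its stream exactly.
collator_a = MaskingCollator(seed=123)
collator_b = MaskingCollator(seed=123)
collator_a.reseed_for_worker(0)
collator_b.reseed_for_worker(1)
mask_a = collator_a.draw_mask(64)
mask_b = collator_b.draw_mask(64)
```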
In `@bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py`:
- Around line 317-318: The code assumes THD attention input has a singleton
batch dim when doing hidden_states.squeeze(0) inside the block checking
self.config.attn_input_format == "thd"; add an explicit assertion or documented
guard before squeezing (e.g., assert hidden_states.size(0) == 1 or raise a clear
error) and/or a short comment explaining the THD layout expectation so the
squeeze on hidden_states in modeling_codonfm_te.py is safe and self-explanatory.
- Around line 212-222: The branch handling self.config.layer_precision is
inconsistent: when fp8_recipe is provided it warns and auto-sets
self.config.layer_precision = ["fp8"] * self.config.num_hidden_layers, but when
only fp4_recipe is provided it raises a RuntimeError. Make the behavior
consistent by changing the fp4_recipe branch to mirror the fp8 branch—emit a
UserWarning and auto-populate self.config.layer_precision = ["fp4"] *
self.config.num_hidden_layers (or alternatively make both branches raise);
update the code paths around fp8_recipe, fp4_recipe, and
self.config.layer_precision accordingly so the logic is symmetric.
In `@bionemo-recipes/recipes/codonfm_native_te/quantization.py`:
- Around line 101-109: The temporary file created with
tempfile.NamedTemporaryFile(delete=False) (temp_file) in the quantization
function is never removed; update the function (the one that returns
temp_file.name and is used by initialize_quant_stats_logging ->
debug_api.initialize) to either (A) return a paired cleanup callback or
context-manager that deletes the file when the caller is done, or (B) document
clearly that the caller is responsible for removing the returned path; implement
option (A) by returning a small function like cleanup() that unlinks
temp_file.name (or a contextmanager that yields the path and unlinks on
__exit__), and keep logger lines as-is to preserve debug output.
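Option (A) can be sketched as a small context manager (names hypothetical):

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_config_file(contents: str):
    """Yield the path of a temporary config file and unlink it on exit.

    A sketch of option (A): the caller (e.g. initialize_quant_stats_logging)
    consumes the path inside the with-block, and cleanup is automatic.
    """
    tf = tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False)
    try:
        tf.write(contents)
        tf.close()
        yield tf.name
    finally:
        os.unlink(tf.name)


with temp_config_file("enabled: true\n") as path:
    existed_inside = os.path.exists(path)

exists_after = os.path.exists(path)
```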
In `@bionemo-recipes/recipes/codonfm_native_te/README.md`:
- Around line 49-66: Edit the fenced code block in README.md that displays the
codonfm_native_te/ directory tree (the block containing lines like
"codonfm_native_te/" and "modeling_codonfm_te.py") and add a language specifier
(e.g., change ``` to ```text or ```plaintext) so the block is rendered as plain
text.
In `@bionemo-recipes/recipes/codonfm_native_te/scheduler.py`:
- Around line 19-24: Add type hints to get_linear_schedule_with_warmup by
annotating parameters and return type: change the signature to include
optimizer: Optimizer, num_warmup_steps: int = 2_000, num_training_steps: int =
500_000, last_epoch: int = -1 and a return type of _LRScheduler (or LambdaLR)
and import Optimizer and _LRScheduler (or LambdaLR) from torch.optim /
torch.optim.lr_scheduler; update any internal references if needed to satisfy
the new annotations for get_linear_schedule_with_warmup.
In `@bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py`:
- Around line 37-44: The device_mesh fixture currently runs for every test due
to autouse=True which forces NCCL/CUDA initialization; remove autouse=True from
the pytest.fixture decorator on the device_mesh function and keep the existing
setup calls (DistributedConfig(), torch.distributed.init_process_group,
torch.cuda.set_device, init_device_mesh) intact so it only runs when tests
explicitly request the fixture by adding device_mesh as a parameter (update any
tests that need the distributed environment to include device_mesh in their
function signature).
In `@bionemo-recipes/recipes/codonfm_native_te/tokenizer.py`:
- Around line 19-125: This file introduces a second, divergent CodonTokenizer
implementation (class CodonTokenizer, methods tokenize/encode/decode,
SPECIAL_TOKENS) which can drift from the model's canonical tokenizer; either
remove this duplicate and import and use the model's tokenizer (e.g., replace
the class with an import of the model package's CodonTokenizer and update call
sites to use that single symbol), or if import is infeasible add a unit test
that asserts exact equivalence between the local implementation and the model
tokenizer (compare SPECIAL_TOKENS, vocab_size, encoder keys/values, decoder
mappings and the results of tokenize/encode/decode on representative sequences)
so the two remain locked together.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8b5818f7-f406-4483-a2d3-2205cf9c7c24
⛔ Files ignored due to path filters (1)
`bionemo-recipes/recipes/codonfm_native_te/train.parquet` is excluded by `!**/*.parquet`
📒 Files selected for processing (40)
.gitignore
bionemo-recipes/models/codonfm/dataset.py
bionemo-recipes/models/codonfm/modeling_codonfm_te.py
bionemo-recipes/models/codonfm/requirements.txt
bionemo-recipes/models/codonfm/tests/common/README.md
bionemo-recipes/models/codonfm/tests/common/__init__.py
bionemo-recipes/models/codonfm/tests/common/fixtures.py
bionemo-recipes/models/codonfm/tests/common/test_modeling_common.py
bionemo-recipes/models/codonfm/tests/conftest.py
bionemo-recipes/models/codonfm/tests/generate_golden_values.py
bionemo-recipes/models/codonfm/tests/golden_state_dict.safetensors
bionemo-recipes/models/codonfm/tests/golden_values.json
bionemo-recipes/models/codonfm/tests/test_modeling_codonfm_te.py
bionemo-recipes/models/codonfm/tokenizer.py
bionemo-recipes/models/esm2/tests/common/test_modeling_common.py
bionemo-recipes/models/llama3/tests/common/test_modeling_common.py
bionemo-recipes/models/mixtral/tests/common/test_modeling_common.py
bionemo-recipes/models/qwen/tests/common/test_modeling_common.py
bionemo-recipes/recipes/codonfm_native_te/README.md
bionemo-recipes/recipes/codonfm_native_te/checkpoint.py
bionemo-recipes/recipes/codonfm_native_te/dataset.py
bionemo-recipes/recipes/codonfm_native_te/distributed_config.py
bionemo-recipes/recipes/codonfm_native_te/fp4_debugging_stats.yaml
bionemo-recipes/recipes/codonfm_native_te/fp8_debugging_stats.yaml
bionemo-recipes/recipes/codonfm_native_te/hydra_config/L0_sanity.yaml
bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml
bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py
bionemo-recipes/recipes/codonfm_native_te/perf_logger.py
bionemo-recipes/recipes/codonfm_native_te/quantization.py
bionemo-recipes/recipes/codonfm_native_te/requirements.txt
bionemo-recipes/recipes/codonfm_native_te/scheduler.py
bionemo-recipes/recipes/codonfm_native_te/tests/__init__.py
bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py
bionemo-recipes/recipes/codonfm_native_te/tests/test_model.py
bionemo-recipes/recipes/codonfm_native_te/tests/test_quantization.py
bionemo-recipes/recipes/codonfm_native_te/tests/test_tokenizer.py
bionemo-recipes/recipes/codonfm_native_te/tests/test_train.py
bionemo-recipes/recipes/codonfm_native_te/tokenizer.py
bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py
ci/scripts/check_copied_files.py
@coderabbitai re-review
✅ Actions performed: Full review triggered.
Actionable comments posted: 8
♻️ Duplicate comments (8)
bionemo-recipes/recipes/codonfm_native_te/requirements.txt (1)
1-13: ⚠️ Potential issue | 🟡 Minor
Keep version bounds on the recipe stack.
The missing `transformers` entry is fixed, but this manifest is still completely unbounded. For a training recipe that couples `torch`, `transformer_engine[pytorch]`, and `transformers`, upstream releases can change numerics or break installs unexpectedly. Please align the minimum versions with `bionemo-recipes/models/codonfm/requirements.txt` (or pin this recipe separately) so the environment stays reproducible.
Run this read-only check to compare bounded vs. unbounded entries:
```bash
#!/bin/bash
set -euo pipefail
python - <<'PY'
from pathlib import Path

ops = ("==", ">=", "<=", "~=", "!=", ">", "<")
paths = [
    Path("bionemo-recipes/recipes/codonfm_native_te/requirements.txt"),
    Path("bionemo-recipes/models/codonfm/requirements.txt"),
]
for path in paths:
    print(f"\n== {path} ==")
    for i, raw in enumerate(path.read_text().splitlines(), 1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        bounded = any(op in line for op in ops)
        print(f"{i:>3}: {line:<35} {'bounded' if bounded else 'UNBOUNDED'}")
PY
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/requirements.txt` around lines 1 - 13, The requirements.txt for the codonfm_native_te recipe contains unbounded package entries (e.g., hydra-core, torch, transformers, transformer_engine[pytorch]) which risks unreproducible breaks; update this recipe's requirements.txt to include the same minimum version bounds (or exact pins) used in the models/codonfm/requirements.txt so torch, transformer_engine[pytorch], transformers, and related packages match the model stack; specifically add version specifiers (>= or ==) for transformers, torch, transformer_engine[pytorch], and any other unbounded names listed (hydra-core, pandas, pyarrow, etc.) to mirror the model requirements and ensure consistent installs.
bionemo-recipes/models/codonfm/requirements.txt (1)
1-3: ⚠️ Potential issue | 🔴 Critical
Fix invalid `transformer_engine>=1.14.0` constraint.
The specified version `1.14.0` does not exist on PyPI (versions go from 1.13.0 directly to 2.1.0). Update to either `>=1.13.0` or `>=2.1.0` depending on the required features. Given the FP8/FP4 quantization features used in this codebase, `>=2.1.0` is likely appropriate.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/requirements.txt` around lines 1 - 3, The transformer_engine version constraint is invalid (1.14.0 doesn't exist); update the requirement for the package named transformer_engine in requirements.txt to a valid release that supports FP8/FP4 quantization (e.g., change the spec from transformer_engine>=1.14.0 to transformer_engine>=2.1.0) so the installer can resolve a real PyPI version while preserving needed features.
bionemo-recipes/recipes/codonfm_native_te/distributed_config.py (1)
36-40: ⚠️ Potential issue | 🟠 Major
Replace `setdefault()` with `getenv()` to prevent environment mutation.
Using `os.environ.setdefault()` in field default factories writes values to the global process environment, which can leak into child processes and break test isolation.
🛠️ Suggested fix
```diff
- rank: int = field(default_factory=lambda: int(os.environ.setdefault("RANK", "0")))
- local_rank: int = field(default_factory=lambda: int(os.environ.setdefault("LOCAL_RANK", "0")))
- world_size: int = field(default_factory=lambda: int(os.environ.setdefault("WORLD_SIZE", "1")))
- _master_addr: str = field(default_factory=lambda: os.environ.setdefault("MASTER_ADDR", "localhost"))
- _master_port: str = field(default_factory=lambda: os.environ.setdefault("MASTER_PORT", "12355"))
+ rank: int = field(default_factory=lambda: int(os.getenv("RANK", "0")))
+ local_rank: int = field(default_factory=lambda: int(os.getenv("LOCAL_RANK", "0")))
+ world_size: int = field(default_factory=lambda: int(os.getenv("WORLD_SIZE", "1")))
+ _master_addr: str = field(default_factory=lambda: os.getenv("MASTER_ADDR", "localhost"))
+ _master_port: str = field(default_factory=lambda: os.getenv("MASTER_PORT", "12355"))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/distributed_config.py` around lines 36 - 40, The field default factories (rank, local_rank, world_size, _master_addr, _master_port) currently call os.environ.setdefault(...) which mutates the global environment; change each lambda to use os.getenv(...) (or os.environ.get(...)) instead and keep the same conversions (int(...) for rank/local_rank/world_size and str/identity for _master_addr/_master_port) so defaults are read without writing to os.environ. Ensure you update the five factories: rank, local_rank, world_size, _master_addr, and _master_port to use getenv/get instead of setdefault.
bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml (1)
41-44: ⚠️ Potential issue | 🟠 Major
Scheduler horizon not tied to training steps.
The `num_training_steps: 500_000` is hardcoded while `num_train_steps` at line 4 is a separate required field. This mismatch causes incorrect warmup/decay curves when only the top-level step count is overridden.
🔧 Suggested fix using Hydra interpolation
```diff
  lr_scheduler_kwargs:
    num_warmup_steps: 2_000
-   num_training_steps: 500_000
+   num_training_steps: ${num_train_steps}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml` around lines 41 - 44, The lr_scheduler_kwargs currently hardcodes num_training_steps to 500_000 while the top-level training step count is defined as num_train_steps; change num_training_steps inside lr_scheduler_kwargs to reference the top-level num_train_steps via Hydra interpolation (replace the literal 500_000 with an expression that points to num_train_steps) so the scheduler horizon follows overrides to num_train_steps; keep num_warmup_steps as-is and ensure the key names lr_scheduler_kwargs, num_warmup_steps, num_training_steps and num_train_steps are used exactly to locate and update the values.
bionemo-recipes/recipes/codonfm_native_te/tests/test_model.py (1)
136-155: ⚠️ Potential issue | 🟡 Minor
Stale loss value in assertion.
Line 154 reads `outputs.loss.item()` from the last forward pass before the final `optimizer.step()`, so `final_loss` reflects the state before the last weight update. Additionally, without fixing the RNG seed or disabling dropout within this test, results may be flaky.
Consider recomputing the loss after the training loop completes:
🧪 Suggested fix
```diff
+ torch.manual_seed(0)
+ torch.cuda.manual_seed_all(0)
  model = CodonFMForMaskedLM(small_config).cuda()
+ model.train()
  optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
@@
- final_loss = outputs.loss.item()
+ model.eval()
+ with torch.no_grad():
+     final_loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss.item()
  assert final_loss < initial_loss, f"Loss did not decrease: {initial_loss} -> {final_loss}"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_model.py` around lines 136 - 155, The test uses the last cached outputs (outputs.loss) from the final forward pass before the final optimizer.step, so final_loss is stale and may be flaky due to nondeterminism; fix test_loss_decreases by (1) after the training loop perform a fresh forward pass to recompute the loss (call model(input_ids=..., attention_mask=..., labels=...) again and use that loss for final_loss), and (2) make the final evaluation deterministic by setting a seed (e.g., torch.manual_seed(...)) and/or switching the model to eval mode (model.eval()) for that final forward pass; reference the test function test_loss_decreases and the model class CodonFMForMaskedLM and ensure you compare the recomputed final_loss against initial_loss.
bionemo-recipes/recipes/codonfm_native_te/dataset.py (1)
85-103: ⚠️ Potential issue | 🟠 Major
Masking RNG still duplicates across workers.
Each worker receives a collator copy with the same `random.Random(seed)` state, and `create_*_dataloader()` never forwards the factory `seed` into either collator anyway. With `num_workers > 1`, masking patterns repeat across workers, and changing the dataloader `seed` does not actually change masking. Seed the collator from the factory input and derive a per-worker stream via `torch.utils.data.get_worker_info()`. Keep the mirror in `bionemo-recipes/models/codonfm/dataset.py` aligned.
Also applies to: 158-176, 263-267, 318-322
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/dataset.py` around lines 85 - 103, The collator's RNG is created with a fixed seed in the dataset __init__ (self.rng = random.Random(seed)) and not re-seeded per worker, causing duplicate masking across dataloader workers; modify the code to accept and store the factory seed (pass seed through the create_*_dataloader() call into the dataset/collator constructor) and in the collator/dataset masking code detect worker via torch.utils.data.get_worker_info() and derive a per-worker RNG (e.g., combine the stored seed with worker_info.id to create a new Random instance) instead of using the single shared self.rng; apply the same change to the mirrored implementation in bionemo-recipes/models/codonfm/dataset.py so both remain aligned.
bionemo-recipes/recipes/codonfm_native_te/quantization.py (1)
191-223: ⚠️ Potential issue | 🟠 Major
Reject `[]` for explicit layer selections.
`fp8_layers=[]` / `fp4_layers=[]` still bypass the validator and silently mean “enabled, but on zero layers”. That makes an invalid quantization config look successful while actually leaving those layers in BF16.
Suggested guard
```diff
  # Validate layer entries are integers within valid range.
  for name, layers in [("fp8_layers", fp8_layers), ("fp4_layers", fp4_layers)]:
-     if layers is not None and len(layers) > 0:
+     if layers == []:
+         raise ValueError(f"{name} cannot be an explicit empty list; use null to mean 'unspecified'.")
+     if layers is not None:
          invalid = [x for x in layers if not isinstance(x, int) or x < 1 or x > num_layers]
          if invalid:
              raise ValueError(
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/quantization.py` around lines 191 - 223, The code currently allows fp8_layers=[] or fp4_layers=[] which silently means "enabled but zero layers"; update the validation for fp8_layers and fp4_layers so empty lists are rejected when provided: in the loop that inspects ("fp8_layers", fp8_layers) and ("fp4_layers", fp4_layers) treat layers is not None (not layers and not len>0) as invalid—i.e., if layers is not None then check both that len(layers) > 0 and that every entry is an int in 1..num_layers, and raise a ValueError (mentioning fp8_layers/fp4_layers and num_layers) if the list is empty or contains out-of-range/non-integer entries; keep the later logic that fills defaults when layers is None unchanged (so explicit empty lists are rejected but omitted lists still get auto-assigned).
bionemo-recipes/recipes/codonfm_native_te/perf_logger.py (1)
107-145: ⚠️ Potential issue | 🟠 Major
The logging window is still one step late.
`step` is zero-based here, so `step % self.logging_frequency == 0 and step > 0` first fires after `logging_frequency + 1` optimizer steps. On a 100-step run with frequency 10, that emits only 9 updates and closes the progress bar at 90/100. Switch the gate to `(step + 1) % self.logging_frequency == 0`, and flush the trailing partial window in `finish()`.
Also applies to: 150-158
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/perf_logger.py` around lines 107 - 145, The logging gate is off-by-one: replace the condition "if step % self.logging_frequency == 0 and step > 0" in PerfLogger's per-step logging block (the code referencing self.logging_frequency, outputs, self.metrics, wandb.log, and self._progress_bar) with "(step + 1) % self.logging_frequency == 0" (and remove the step > 0 requirement), and make the identical change to the other logging gate in the file (the second block that updates metrics and progress bar). Additionally, implement flushing in finish(): when finishing training, if there are any accumulated metrics in self.metrics (a partial window), compute and convert them the same way you do in the per-step block, call wandb.log(..., step=last_step), update the progress bar by the remaining count, set_postfix with the latest loss, then reset metrics so the final partial window is emitted.
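The off-by-one can be checked directly by counting how often each gate fires over a 100-step run (the gate expressions are taken from the comment; everything else is scaffolding):

```python
def fired_steps(total_steps: int, freq: int, gate):
    """Return the zero-based step indices at which the gate fires."""
    return [step for step in range(total_steps) if gate(step, freq)]


# Current gate: zero-based step, skips step 0 and misses the final window.
late = fired_steps(100, 10, lambda step, f: step % f == 0 and step > 0)

# Proposed gate: fires once per completed window, including the last one.
fixed = fired_steps(100, 10, lambda step, f: (step + 1) % f == 0)
```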
🧹 Nitpick comments (7)
bionemo-recipes/recipes/codonfm_native_te/scheduler.py (2)
19-29: Use a Google-style docstring on this helper.
This new public function’s docstring is missing the standard `Args`/`Returns` sections used elsewhere in the repo.
Docstring shape
```diff
  def get_linear_schedule_with_warmup(
      optimizer,
      num_warmup_steps=2_000,
      num_training_steps=500_000,
      last_epoch=-1,
  ):
-     """Linear warmup and decay scheduler for CodonFM pretraining.
-
-     The learning rate is warmed up over the first warmup steps to a peak value, then linearly decayed to
-     one tenth of its peak value over 90% of training duration.
-     """
+     """Linear warmup and decay scheduler for CodonFM pretraining.
+
+     Args:
+         optimizer: Optimizer whose learning rate will be scheduled.
+         num_warmup_steps: Number of warmup steps.
+         num_training_steps: Total number of training steps.
+         last_epoch: Last completed epoch when resuming training.
+
+     Returns:
+         A `LambdaLR` implementing warmup, decay, and the final 0.1 floor.
+     """
```
As per coding guidelines, "Ensure all Python files follow Google-style docstrings (pydocstyle convention)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/scheduler.py` around lines 19 - 29, The function get_linear_schedule_with_warmup has a free-form docstring; replace it with a Google-style docstring that includes a short summary line, an Args section documenting optimizer (type), num_warmup_steps, num_training_steps, and last_epoch (with defaults and meanings), and a Returns section describing the returned scheduler object (type and behavior: linear warmup to peak then linear decay to one tenth over 90% of training). Also include any Raises if applicable (e.g., invalid step counts) to match the repo's pydocstyle conventions.
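For context on the scheduler comments in this file, the warmup/decay shape can be sketched as a plain multiplier function. This is a hypothetical re-implementation of the `lr_lambda` behavior described here (linear warmup to 1.0, linear decay to 0.1 at 90% of training, then flat), with the fail-fast guard included; it is not the recipe's actual code.

```python
def lr_multiplier(step, num_warmup_steps=2_000, num_training_steps=500_000):
    # Hypothetical re-implementation of the scheduler's lambda: linear warmup
    # to 1.0, linear decay to 0.1 at 90% of training, then flat at 0.1.
    decay_steps = int(num_training_steps * 0.9)
    if num_warmup_steps > decay_steps:
        # Fail fast instead of silently cliffing from warmup straight to 0.1.
        raise ValueError(
            f"num_warmup_steps={num_warmup_steps} exceeds decay cutoff "
            f"{decay_steps} (num_training_steps={num_training_steps})"
        )
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    if step < decay_steps:
        # Interpolate from 1.0 down to 0.1 across the decay window.
        frac = (step - num_warmup_steps) / (decay_steps - num_warmup_steps)
        return 1.0 - 0.9 * frac
    return 0.1
```

Plotting this multiplier for a bad warmup value (e.g. 460,000 with the defaults) is exactly the case the guard rejects; without it, the LR would jump from near-peak warmup straight to the 0.1 floor.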
30-41: Handle warmup values above the decay cutoff explicitly. If `num_warmup_steps` falls between `int(num_training_steps * 0.9)` and `num_training_steps`, the linear-decay branch is never reached and the LR cliffs straight from warmup to `0.1`. That makes a bad Hydra value silently change the training regime instead of failing fast.

Suggested guard

```diff
 def get_linear_schedule_with_warmup(
     optimizer,
     num_warmup_steps=2_000,
     num_training_steps=500_000,
     last_epoch=-1,
 ):
 @@
+    if num_training_steps <= 0:
+        raise ValueError("num_training_steps must be positive")
+    if num_warmup_steps < 0:
+        raise ValueError("num_warmup_steps must be non-negative")
     decay_steps = int(num_training_steps * 0.9)
+    if decay_steps < num_warmup_steps < num_training_steps:
+        raise ValueError(
+            "num_warmup_steps cannot exceed the decay cutoff unless warmup spans the full run"
+        )
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/scheduler.py` around lines 30 - 41, The lr scheduling logic in lr_lambda allows num_warmup_steps to fall into (decay_steps, num_training_steps] which causes a sudden cliff from warmup to 0.1; add an explicit guard that validates num_warmup_steps <= decay_steps (where decay_steps = int(num_training_steps * 0.9)) and fails fast by raising a clear ValueError (mentioning num_warmup_steps, decay_steps and num_training_steps) before defining or returning lr_lambda; alternatively, if you prefer runtime checking, detect the case inside lr_lambda (using current_step and num_warmup_steps vs decay_steps) and raise an error instead of silently producing the cliff. Ensure references to lr_lambda, num_warmup_steps, decay_steps and num_training_steps are used in the error message to aid debugging.

bionemo-recipes/recipes/codonfm_native_te/README.md (1)
49-66: Add language specifier to fenced code block. The repository structure code block should have a language specifier (e.g., `text` or `plaintext`) to satisfy markdown linting rules.

📝 Suggested fix

````diff
-```
+```text
 codonfm_native_te/
 ├── modeling_codonfm_te.py — HF-compatible CodonFM model with TE layers
````

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/README.md` around lines 49 - 66, The fenced repository-structure code block in README.md (the block starting with the line containing "codonfm_native_te/") lacks a language specifier; update the opening fence from ``` to ```text (or ```plaintext) so the block becomes a plaintext code block and satisfies markdown linting rules while preserving the same content and indentation.

bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py (1)
54-57: Using private API `_mesh_resources.mesh_stack`.

`_mesh_resources.mesh_stack.clear()` accesses a private API that may change in future PyTorch versions. Consider adding a comment explaining why this is necessary, or wrapping it in a try/except to handle potential API changes.

💡 Defensive approach

```diff
 torch.distributed.destroy_process_group()
-_mesh_resources.mesh_stack.clear()
+# Clean up mesh resources (private API, may change in future PyTorch versions)
+try:
+    _mesh_resources.mesh_stack.clear()
+except AttributeError:
+    pass  # API changed, cleanup not critical
 torch.cuda.empty_cache()
 torch.cuda.synchronize()
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py` around lines 54 - 57, The test teardown calls private API _mesh_resources.mesh_stack.clear(); to make this robust, wrap that call in a try/except that falls back gracefully (e.g., log/debug a warning) if the attribute or method does not exist, and add a brief comment explaining why the private mesh stack must be cleared; keep the surrounding cleanup calls (torch.distributed.destroy_process_group(), torch.cuda.empty_cache(), torch.cuda.synchronize()) as-is so the overall cleanup order is preserved and only protect or document the _mesh_resources.mesh_stack.clear() usage.

bionemo-recipes/models/codonfm/tests/test_modeling_codonfm_te.py (2)
287-288: `strict=False` may silently ignore missing keys.

Using `strict=False` in `load_state_dict` could mask issues where the golden state dict doesn't match the model architecture. Consider logging or asserting on the returned missing/unexpected keys for debugging purposes.

💡 Example verification

```diff
-model.load_state_dict(golden_sd, strict=False)
+missing, unexpected = model.load_state_dict(golden_sd, strict=False)
+if missing or unexpected:
+    import warnings
+    warnings.warn(f"State dict mismatch - missing: {missing}, unexpected: {unexpected}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tests/test_modeling_codonfm_te.py` around lines 287 - 288, The call model.load_state_dict(golden_sd, strict=False) can silently ignore mismatches; change it to capture the returned result (e.g., result = model.load_state_dict(golden_sd, strict=False)) and then assert or log result.missing_keys and result.unexpected_keys (or raise if either is non-empty) before calling model.eval(), so any missing/unexpected parameters in golden_sd are surfaced during tests; reference model.load_state_dict, golden_sd, and model.eval when making this change.
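The same key-surfacing check can be sketched without torch. This hypothetical helper mirrors the information torch's `load_state_dict(strict=False)` already returns as `missing_keys`/`unexpected_keys`, showing the set difference that would otherwise be silently dropped:

```python
def check_state_dict_keys(model_keys, loaded_keys):
    # Plain-set sketch of what torch's load_state_dict(strict=False) reports
    # via its returned missing_keys / unexpected_keys, so mismatches are
    # surfaced instead of silently ignored.
    missing = sorted(set(model_keys) - set(loaded_keys))
    unexpected = sorted(set(loaded_keys) - set(model_keys))
    return missing, unexpected
```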
156-160: RNG reset produces identical masking per sequence.

`torch.manual_seed(42)` is called inside the loop for each sequence, so every sequence gets the same masking pattern (same random positions). If you intend different masking per sequence, seed once before the loop. If identical masking is intentional for reproducibility, this is fine.

Also applies to: 179-183
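The effect is easy to reproduce with the standard library (using `random.Random` in place of torch's global RNG; the `range(10)`/sample sizes are illustrative, not the test's actual values):

```python
import random

# Reseeding inside the loop: every "sequence" draws identical mask positions.
masks_reseeded = []
for _ in range(3):
    rng = random.Random(42)  # fresh RNG with the same seed each iteration
    masks_reseeded.append(rng.sample(range(10), 3))

# Seeding once before the loop: each sequence draws its own positions.
rng = random.Random(42)
masks_once = [rng.sample(range(10), 3) for _ in range(3)]
```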
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tests/test_modeling_codonfm_te.py` around lines 156 - 160, The fixed code should stop reseeding the global RNG inside each sequence loop: move the torch.manual_seed(42) call out of the per-sequence loop (e.g., place it once before iterating sequences) so each sequence gets a different masking pattern, or remove/make the seed conditional if identical masking per sequence is desired; update the occurrences around the loop that set lbl, ids, and tokenizer.mask_token_id (and the similar block at lines 179-183) accordingly.

bionemo-recipes/models/codonfm/tests/conftest.py (1)
38-45: Minor inconsistency with shared fixture. This `unused_tcp_port` fixture differs slightly from the one in `tests/common/fixtures.py` (Lines 33-42) by adding `s.listen(1)`, which is unnecessary for port discovery. Consider using the shared fixture from the plugin instead to avoid duplication.

Note: The static analysis warning about binding to `''` is expected behavior for ephemeral port allocation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tests/conftest.py` around lines 38 - 45, The fixture unused_tcp_port duplicates the shared fixture in tests/common/fixtures.py and adds an unnecessary s.listen(1); remove this duplication by importing and using the shared unused_tcp_port fixture from tests.common.fixtures (or delete this local definition), or if you must keep a local version, drop the s.listen(1) call and keep the socket bind-to-0 logic so it matches the shared implementation.
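For reference, a self-contained sketch of the no-`listen()` variant discussed above, in plain-function form (the shared fixture presumably wraps the same bind-to-port-0 logic; binding loopback also addresses the static-analysis note):

```python
import socket


def unused_tcp_port() -> int:
    """Return a free ephemeral TCP port, probing loopback only."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Port 0 asks the OS for any free port; no listen() call is needed
        # just to discover the number, and binding 127.0.0.1 avoids opening
        # a socket on every interface.
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```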
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bionemo-recipes/models/codonfm/dataset.py`:
- Around line 22-23: Replace the bare/top-level imports by using a local
relative tokenizer import and a structural type for the distributed config:
remove "from distributed_config import DistributedConfig" and "from tokenizer
import CodonTokenizer", import the tokenizer relatively (e.g., a local
CodonTokenizer from the package using a relative import), and define a small
Protocol (e.g., DistConfigProtocol with attributes/methods rank and world_size)
to use in type annotations and function/class signatures that previously
referenced DistributedConfig (replace references to DistributedConfig with
DistConfigProtocol and update call sites to rely only on rank/world_size).
Ensure all imports are relative to the package and update any code that
constructed or accessed DistributedConfig to accept the protocol-compatible
object instead.
In `@bionemo-recipes/models/codonfm/tokenizer.py`:
- Around line 114-130: The decode method currently assumes special token IDs are
0..N-1 by building special_ids = set(range(len(self.SPECIAL_TOKENS))) but
from_pretrained() never reloads SPECIAL_TOKENS or the exact special token ID
mapping saved by save_pretrained(), breaking round-trip decoding; fix by
rehydrating the full tokenizer state in from_pretrained() (restore
SPECIAL_TOKENS, special token -> id mapping, and any saved encoder/decoder
mappings) so decode() can compute special_ids from the actual saved special
token IDs (e.g., use the saved special_token_to_id map or stored
SPECIAL_TOKENS_IDS) instead of range(len(...)), update any uses in decode,
save_pretrained, and from_pretrained to read/write the same metadata keys to
ensure consistency between encoder/decoder and special token IDs.
In `@bionemo-recipes/recipes/codonfm_native_te/checkpoint.py`:
- Around line 182-191: The current async save may remain in-flight because we
only wait for the previous future before starting a new one; ensure the final
future is awaited before exit by checking _ckpt_futures["fsdp2"] after
scheduling (and in save_final_model_fsdp2) and calling .result() if it is not
None, and repeat the same pattern for the other async slots handled in the
200-218 region; update usages around dcp_async_save / dcp_save and
save_final_model_fsdp2 to join outstanding futures (e.g., call
_ckpt_futures["fsdp2"].result()) so the last async checkpoint completes before
process exit.
In `@bionemo-recipes/recipes/codonfm_native_te/fp8_debugging_stats.yaml`:
- Around line 10-18: Update the LogFp8TensorStats entries in
fp8_debugging_stats.yaml to use explicit recipe-prefixed stat names to match
fp4_debugging_stats.yaml and the fp8_recipe in defaults.yaml; replace the
unprefixed stats under the tensor blocks (activation, gradient, weight) with
fp8_delayed_scaling_underflows%, fp8_delayed_scaling_scale_inv_min,
fp8_delayed_scaling_scale_inv_max, and fp8_delayed_scaling_mse (referencing the
fp8_recipe: transformer_engine.common.recipe.DelayedScaling and the
LogFp8TensorStats hook) so all stats are consistently prefixed.
In `@bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py`:
- Around line 358-395: The LM head always uses GELU in forward but CodonFMConfig
supports config.hidden_act (e.g., "relu" or "silu"), causing encoder/head
mismatch; fix by storing the configured activation in __init__ (e.g.,
self.hidden_act = config.hidden_act or map it to a callable) and then use that
activation instead of torch.nn.functional.gelu in forward (while keeping the
transformer_engine.pytorch.autocast(enabled=False) context and existing
operations on self.dense and self.layer_norm_linear); ensure supported names
("gelu","relu","silu") map to the correct torch.nn.functional functions and
raise a clear error for unsupported values.
- Around line 322-325: The code builds RoPE using
self.rotary_embeddings(max_seq_len=self.config.max_position_embeddings) which
uses the config max instead of the current batch length; change it to compute
max_seq_len from the active input (e.g., derive from hidden_states.shape[1] or,
if using packed sequences, from cu_seq_lens_q[-1] like the PTL reference) and
pass that to self.rotary_embeddings so rotary positions match the current batch;
keep the subsequent .to(hidden_states.device, non_blocking=True) call.
In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_quantization.py`:
- Around line 160-168: The positive assertions in test_matches_correct_sublayers
are too loose because they use suffixes like "_something" and will still pass if
generate_layer_regex overmatches prefixes; update the test (function
test_matches_correct_sublayers) to assert against the exact sublayer names
produced by generate_layer_regex (e.g., use
"encoder.layers.1.self_attention.layernorm_qkv",
"encoder.layers.1.self_attention.proj", "encoder.layers.1.layernorm_mlp.fc1",
"encoder.layers.1.layernorm_mlp.fc2") or include a strict boundary (dot or
end-of-string) after the sublayer name so only the intended sublayers match, and
keep the negative case "encoder.layers.1.self_attention.some_other_thing"
unchanged.
In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py`:
- Around line 224-231: The final_model export is missing tokenizer metadata
required by CodonTokenizer.from_pretrained; update save_final_model_fsdp2 to
also write the tokenizer artifacts (tokenizer_config.json and any vocab/merges
or special files) into the final_model directory. Specifically, ensure
save_final_model_fsdp2 (and its caller in train_fsdp2) either accepts a
tokenizer argument or locates the original tokenizer source from the
model/config and then calls tokenizer.save_pretrained/save files or copies
tokenizer_config.json alongside the existing model.safetensors and config.json
so CodonTokenizer.from_pretrained can load the tokenizer from the self-contained
final_model directory.
---
Duplicate comments:
In `@bionemo-recipes/models/codonfm/requirements.txt`:
- Around line 1-3: The transformer_engine version constraint is invalid (1.14.0
doesn't exist); update the requirement for the package named transformer_engine
in requirements.txt to a valid release that supports FP8/FP4 quantization (e.g.,
change the spec from transformer_engine>=1.14.0 to transformer_engine>=2.1.0) so
the installer can resolve a real PyPI version while preserving needed features.
In `@bionemo-recipes/recipes/codonfm_native_te/dataset.py`:
- Around line 85-103: The collator's RNG is created with a fixed seed in the
dataset __init__ (self.rng = random.Random(seed)) and not re-seeded per worker,
causing duplicate masking across dataloader workers; modify the code to accept
and store the factory seed (pass seed through the create_*_dataloader() call
into the dataset/collator constructor) and in the collator/dataset masking code
detect worker via torch.utils.data.get_worker_info() and derive a per-worker RNG
(e.g., combine the stored seed with worker_info.id to create a new Random
instance) instead of using the single shared self.rng; apply the same change to
the mirrored implementation in bionemo-recipes/models/codonfm/dataset.py so both
remain aligned.
In `@bionemo-recipes/recipes/codonfm_native_te/distributed_config.py`:
- Around line 36-40: The field default factories (rank, local_rank, world_size,
_master_addr, _master_port) currently call os.environ.setdefault(...) which
mutates the global environment; change each lambda to use os.getenv(...) (or
os.environ.get(...)) instead and keep the same conversions (int(...) for
rank/local_rank/world_size and str/identity for _master_addr/_master_port) so
defaults are read without writing to os.environ. Ensure you update the five
factories: rank, local_rank, world_size, _master_addr, and _master_port to use
getenv/get instead of setdefault.
In `@bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml`:
- Around line 41-44: The lr_scheduler_kwargs currently hardcodes
num_training_steps to 500_000 while the top-level training step count is defined
as num_train_steps; change num_training_steps inside lr_scheduler_kwargs to
reference the top-level num_train_steps via Hydra interpolation (replace the
literal 500_000 with an expression that points to num_train_steps) so the
scheduler horizon follows overrides to num_train_steps; keep num_warmup_steps
as-is and ensure the key names lr_scheduler_kwargs, num_warmup_steps,
num_training_steps and num_train_steps are used exactly to locate and update the
values.
In `@bionemo-recipes/recipes/codonfm_native_te/perf_logger.py`:
- Around line 107-145: The logging gate is off-by-one: replace the condition "if
step % self.logging_frequency == 0 and step > 0" in PerfLogger's per-step
logging block (the code referencing self.logging_frequency, outputs,
self.metrics, wandb.log, and self._progress_bar) with "(step + 1) %
self.logging_frequency == 0" (and remove the step > 0 requirement), and make the
identical change to the other logging gate in the file (the second block that
updates metrics and progress bar). Additionally, implement flushing in finish():
when finishing training, if there are any accumulated metrics in self.metrics (a
partial window), compute and convert them the same way you do in the per-step
block, call wandb.log(..., step=last_step), update the progress bar by the
remaining count, set_postfix with the latest loss, then reset metrics so the
final partial window is emitted.
In `@bionemo-recipes/recipes/codonfm_native_te/quantization.py`:
- Around line 191-223: The code currently allows fp8_layers=[] or fp4_layers=[]
which silently means "enabled but zero layers"; update the validation for
fp8_layers and fp4_layers so empty lists are rejected when provided: in the loop
that inspects ("fp8_layers", fp8_layers) and ("fp4_layers", fp4_layers) treat
layers is not None (not layers and not len>0) as invalid—i.e., if layers is not
None then check both that len(layers) > 0 and that every entry is an int in
1..num_layers, and raise a ValueError (mentioning fp8_layers/fp4_layers and
num_layers) if the list is empty or contains out-of-range/non-integer entries;
keep the later logic that fills defaults when layers is None unchanged (so
explicit empty lists are rejected but omitted lists still get auto-assigned).
In `@bionemo-recipes/recipes/codonfm_native_te/requirements.txt`:
- Around line 1-13: The requirements.txt for the codonfm_native_te recipe
contains unbounded package entries (e.g., hydra-core, torch, transformers,
transformer_engine[pytorch]) which risks unreproducible breaks; update this
recipe's requirements.txt to include the same minimum version bounds (or exact
pins) used in the models/codonfm/requirements.txt so torch,
transformer_engine[pytorch], transformers, and related packages match the model
stack; specifically add version specifiers (>= or ==) for transformers, torch,
transformer_engine[pytorch], and any other unbounded names listed (hydra-core,
pandas, pyarrow, etc.) to mirror the model requirements and ensure consistent
installs.
In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_model.py`:
- Around line 136-155: The test uses the last cached outputs (outputs.loss) from
the final forward pass before the final optimizer.step, so final_loss is stale
and may be flaky due to nondeterminism; fix test_loss_decreases by (1) after the
training loop perform a fresh forward pass to recompute the loss (call
model(input_ids=..., attention_mask=..., labels=...) again and use that loss for
final_loss), and (2) make the final evaluation deterministic by setting a seed
(e.g., torch.manual_seed(...)) and/or switching the model to eval mode
(model.eval()) for that final forward pass; reference the test function
test_loss_decreases and the model class CodonFMForMaskedLM and ensure you
compare the recomputed final_loss against initial_loss.
---
Nitpick comments:
In `@bionemo-recipes/models/codonfm/tests/conftest.py`:
- Around line 38-45: The fixture unused_tcp_port duplicates the shared fixture
in tests/common/fixtures.py and adds an unnecessary s.listen(1); remove this
duplication by importing and using the shared unused_tcp_port fixture from
tests.common.fixtures (or delete this local definition), or if you must keep a
local version, drop the s.listen(1) call and keep the socket bind-to-0 logic so
it matches the shared implementation.
In `@bionemo-recipes/models/codonfm/tests/test_modeling_codonfm_te.py`:
- Around line 287-288: The call model.load_state_dict(golden_sd, strict=False)
can silently ignore mismatches; change it to capture the returned result (e.g.,
result = model.load_state_dict(golden_sd, strict=False)) and then assert or log
result.missing_keys and result.unexpected_keys (or raise if either is non-empty)
before calling model.eval(), so any missing/unexpected parameters in golden_sd
are surfaced during tests; reference model.load_state_dict, golden_sd, and
model.eval when making this change.
- Around line 156-160: The fixed code should stop reseeding the global RNG
inside each sequence loop: move the torch.manual_seed(42) call out of the
per-sequence loop (e.g., place it once before iterating sequences) so each
sequence gets a different masking pattern, or remove/make the seed conditional
if identical masking per sequence is desired; update the occurrences around the
loop that set lbl, ids, and tokenizer.mask_token_id (and the similar block at
lines 179-183) accordingly.
In `@bionemo-recipes/recipes/codonfm_native_te/README.md`:
- Around line 49-66: The fenced repository-structure code block in README.md
(the block starting with the line containing "codonfm_native_te/") lacks a
language specifier; update the opening fence from ``` to ```text (or
```plaintext) so the block becomes a plaintext code block and satisfies markdown
linting rules while preserving the same content and indentation.
In `@bionemo-recipes/recipes/codonfm_native_te/scheduler.py`:
- Around line 19-29: The function get_linear_schedule_with_warmup has a
free-form docstring; replace it with a Google-style docstring that includes a
short summary line, an Args section documenting optimizer (type),
num_warmup_steps, num_training_steps, and last_epoch (with defaults and
meanings), and a Returns section describing the returned scheduler object (type
and behavior: linear warmup to peak then linear decay to one tenth over 90% of
training). Also include any Raises if applicable (e.g., invalid step counts) to
match the repo's pydocstyle conventions.
- Around line 30-41: The lr scheduling logic in lr_lambda allows
num_warmup_steps to fall into (decay_steps, num_training_steps] which causes a
sudden cliff from warmup to 0.1; add an explicit guard that validates
num_warmup_steps <= decay_steps (where decay_steps = int(num_training_steps *
0.9)) and fails fast by raising a clear ValueError (mentioning num_warmup_steps,
decay_steps and num_training_steps) before defining or returning lr_lambda;
alternatively, if you prefer runtime checking, detect the case inside lr_lambda
(using current_step and num_warmup_steps vs decay_steps) and raise an error
instead of silently producing the cliff. Ensure references to lr_lambda,
num_warmup_steps, decay_steps and num_training_steps are used in the error
message to aid debugging.
In `@bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py`:
- Around line 54-57: The test teardown calls private API
_mesh_resources.mesh_stack.clear(); to make this robust, wrap that call in a
try/except that falls back gracefully (e.g., log/debug a warning) if the
attribute or method does not exist, and add a brief comment explaining why the
private mesh stack must be cleared; keep the surrounding cleanup calls
(torch.distributed.destroy_process_group(), torch.cuda.empty_cache(),
torch.cuda.synchronize()) as-is so the overall cleanup order is preserved and
only protect or document the _mesh_resources.mesh_stack.clear() usage.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: de4fdcbb-4a47-4730-9c0c-fcfdc219106f
⛔ Files ignored due to path filters (1)
`bionemo-recipes/recipes/codonfm_native_te/train.parquet` is excluded by `!**/*.parquet`
📒 Files selected for processing (40)
- .gitignore
- bionemo-recipes/models/codonfm/dataset.py
- bionemo-recipes/models/codonfm/modeling_codonfm_te.py
- bionemo-recipes/models/codonfm/requirements.txt
- bionemo-recipes/models/codonfm/tests/common/README.md
- bionemo-recipes/models/codonfm/tests/common/__init__.py
- bionemo-recipes/models/codonfm/tests/common/fixtures.py
- bionemo-recipes/models/codonfm/tests/common/test_modeling_common.py
- bionemo-recipes/models/codonfm/tests/conftest.py
- bionemo-recipes/models/codonfm/tests/generate_golden_values.py
- bionemo-recipes/models/codonfm/tests/golden_state_dict.safetensors
- bionemo-recipes/models/codonfm/tests/golden_values.json
- bionemo-recipes/models/codonfm/tests/test_modeling_codonfm_te.py
- bionemo-recipes/models/codonfm/tokenizer.py
- bionemo-recipes/models/esm2/tests/common/test_modeling_common.py
- bionemo-recipes/models/llama3/tests/common/test_modeling_common.py
- bionemo-recipes/models/mixtral/tests/common/test_modeling_common.py
- bionemo-recipes/models/qwen/tests/common/test_modeling_common.py
- bionemo-recipes/recipes/codonfm_native_te/README.md
- bionemo-recipes/recipes/codonfm_native_te/checkpoint.py
- bionemo-recipes/recipes/codonfm_native_te/dataset.py
- bionemo-recipes/recipes/codonfm_native_te/distributed_config.py
- bionemo-recipes/recipes/codonfm_native_te/fp4_debugging_stats.yaml
- bionemo-recipes/recipes/codonfm_native_te/fp8_debugging_stats.yaml
- bionemo-recipes/recipes/codonfm_native_te/hydra_config/L0_sanity.yaml
- bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml
- bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py
- bionemo-recipes/recipes/codonfm_native_te/perf_logger.py
- bionemo-recipes/recipes/codonfm_native_te/quantization.py
- bionemo-recipes/recipes/codonfm_native_te/requirements.txt
- bionemo-recipes/recipes/codonfm_native_te/scheduler.py
- bionemo-recipes/recipes/codonfm_native_te/tests/__init__.py
- bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py
- bionemo-recipes/recipes/codonfm_native_te/tests/test_model.py
- bionemo-recipes/recipes/codonfm_native_te/tests/test_quantization.py
- bionemo-recipes/recipes/codonfm_native_te/tests/test_tokenizer.py
- bionemo-recipes/recipes/codonfm_native_te/tests/test_train.py
- bionemo-recipes/recipes/codonfm_native_te/tokenizer.py
- bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py
- ci/scripts/check_copied_files.py
@coderabbitai re-review
✅ Actions performed

Full review triggered.
Actionable comments posted: 12
♻️ Duplicate comments (10)
bionemo-recipes/recipes/codonfm_native_te/distributed_config.py (1)
36-40: ⚠️ Potential issue | 🟠 Major

Replace `setdefault()` with `getenv()` to prevent environment mutation.

Using `os.environ.setdefault()` in field default factories writes to the global process environment on every `DistributedConfig()` instantiation, which can leak defaults into child processes and break test isolation.

🛠️ Proposed fix

```diff
-    rank: int = field(default_factory=lambda: int(os.environ.setdefault("RANK", "0")))
-    local_rank: int = field(default_factory=lambda: int(os.environ.setdefault("LOCAL_RANK", "0")))
-    world_size: int = field(default_factory=lambda: int(os.environ.setdefault("WORLD_SIZE", "1")))
-    _master_addr: str = field(default_factory=lambda: os.environ.setdefault("MASTER_ADDR", "localhost"))
-    _master_port: str = field(default_factory=lambda: os.environ.setdefault("MASTER_PORT", "12355"))
+    rank: int = field(default_factory=lambda: int(os.getenv("RANK", "0")))
+    local_rank: int = field(default_factory=lambda: int(os.getenv("LOCAL_RANK", "0")))
+    world_size: int = field(default_factory=lambda: int(os.getenv("WORLD_SIZE", "1")))
+    _master_addr: str = field(default_factory=lambda: os.getenv("MASTER_ADDR", "localhost"))
+    _master_port: str = field(default_factory=lambda: os.getenv("MASTER_PORT", "12355"))
```
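The behavioral difference is easy to demonstrate with a plain environment variable (`DEMO_RANK` below is a placeholder, not a variable the recipe uses):

```python
import os


def read_rank_mutating() -> bool:
    """setdefault returns the default AND writes it into os.environ."""
    os.environ.pop("DEMO_RANK", None)
    rank = int(os.environ.setdefault("DEMO_RANK", "0"))
    leaked = "DEMO_RANK" in os.environ  # True: default leaked into the env
    os.environ.pop("DEMO_RANK", None)
    return rank == 0 and leaked


def read_rank_clean() -> bool:
    """getenv reads the default without touching os.environ."""
    os.environ.pop("DEMO_RANK", None)
    rank = int(os.getenv("DEMO_RANK", "0"))
    untouched = "DEMO_RANK" not in os.environ  # True: env left as-is
    return rank == 0 and untouched
```

Both reads produce the same value; only `setdefault` leaves the default behind for child processes to inherit.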
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/distributed_config.py` around lines 36 - 40, The field default factories in DistributedConfig (rank, local_rank, world_size, _master_addr, _master_port) currently call os.environ.setdefault(...) which mutates the global environment; change each lambda to use os.getenv(...) (or os.environ.get(...)) instead so defaults are read without writing to os.environ, e.g. replace the default_factory lambdas for rank, local_rank, world_size, _master_addr and _master_port to call os.getenv/ .get with the same default strings.

bionemo-recipes/models/codonfm/requirements.txt (1)
1-3: ⚠️ Potential issue | 🔴 Critical

Fix invalid `transformer_engine>=1.14.0` constraint.

Version 1.14.0 of `transformer_engine` does not exist on PyPI. The version progression goes from 1.13.0 directly to 2.x. Given the FP8/FP4 quantization features used in this codebase, update to `>=2.1.0`.

🛠️ Proposed fix

```diff
 torch>=2.6.0
-transformer_engine>=1.14.0
+transformer_engine>=2.1.0
 transformers>=4.48.0
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/requirements.txt` around lines 1 - 3, Update the invalid dependency constraint for transformer_engine in requirements.txt: replace the non-existent "transformer_engine>=1.14.0" with "transformer_engine>=2.1.0" so the project uses a PyPI-available release that includes the FP8/FP4 quantization features referenced by the code; ensure the line in requirements.txt containing transformer_engine is edited accordingly and keep the existing torch and transformers lines unchanged.

bionemo-recipes/recipes/codonfm_native_te/checkpoint.py (1)
196-197: ⚠️ Potential issue | 🟡 Minor

Race condition risk when pruning checkpoints.

`prune_checkpoints` is called only on the main process, but other ranks may still be writing to checkpoint directories. The barrier at line 187 synchronizes before saving, but there's no barrier after saving completes. For async saves, this is especially risky since the save may not be complete when pruning runs.

Consider adding a barrier after the save completes (for sync saves) or documenting that async_save and max_checkpoints should not be used together.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/checkpoint.py` around lines 196 - 197, Race condition: pruning (prune_checkpoints) runs only on the main rank (dist_config.is_main_process()) while other ranks may still be saving; add a synchronization point after all ranks finish saving or disallow async saves with max_checkpoints. Specifically, after the checkpoint save completes (the code path that calls save_checkpoint / async_save), insert a distributed barrier (use the existing dist_config barrier primitive) so all ranks have finished writing before the main rank calls prune_checkpoints(ckpt_path, max_checkpoints); alternatively, if async_save is used, validate and raise or log an error when max_checkpoints is set to prevent pruning during in-progress async saves and document this constraint in the function that handles checkpoint saving and pruning.

bionemo-recipes/models/codonfm/tests/conftest.py (2)
39-45: ⚠️ Potential issue | 🟡 Minor

Bind the probe socket to loopback only.

This helper only needs a local rendezvous port. `s.bind(("", 0))` listens on every interface and is the reason CodeQL flagged this line; binding to `127.0.0.1` keeps the fixture local without changing its contract.

🔒 Minimal change

```diff
-    s.bind(("", 0))
+    s.bind(("127.0.0.1", 0))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tests/conftest.py` around lines 39 - 45, The probe socket in unused_tcp_port currently binds to all interfaces (s.bind(("", 0))) which is flagged by CodeQL; change the bind call in the unused_tcp_port function to bind only to the loopback interface (e.g., s.bind(("127.0.0.1", 0))) so the fixture remains local while keeping the same behavior of returning an unused TCP port.
48-58: ⚠️ Potential issue | 🟡 Minor

The `orig` restore does not undo the reload.
`importlib.reload()` re-executes the existing module object in place. Here `orig` and `transformer_engine.pytorch` are the same object, so reassigning `sys.modules["transformer_engine.pytorch"] = orig` after `yield` leaves the debug-loaded module in place. Restore the env vars first and reload again in teardown.

♻️ Minimal cleanup
```diff
 @pytest.fixture(autouse=True)
-def use_te_debug(monkeypatch):
-    monkeypatch.setenv("NVTE_DEBUG", "1")
-    monkeypatch.setenv("NVTE_DEBUG_LEVEL", "2")
-    orig = sys.modules.get("transformer_engine.pytorch")
-    importlib.reload(transformer_engine.pytorch)
-    yield
-    if orig is not None:
-        sys.modules["transformer_engine.pytorch"] = orig
-    else:
-        importlib.reload(transformer_engine.pytorch)
+def use_te_debug():
+    prev_debug = os.environ.get("NVTE_DEBUG")
+    prev_level = os.environ.get("NVTE_DEBUG_LEVEL")
+    os.environ["NVTE_DEBUG"] = "1"
+    os.environ["NVTE_DEBUG_LEVEL"] = "2"
+    importlib.reload(transformer_engine.pytorch)
+    try:
+        yield
+    finally:
+        if prev_debug is None:
+            os.environ.pop("NVTE_DEBUG", None)
+        else:
+            os.environ["NVTE_DEBUG"] = prev_debug
+        if prev_level is None:
+            os.environ.pop("NVTE_DEBUG_LEVEL", None)
+        else:
+            os.environ["NVTE_DEBUG_LEVEL"] = prev_level
+        importlib.reload(transformer_engine.pytorch)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tests/conftest.py` around lines 48 - 58, The teardown must restore the original environment variables before reloading the transformer_engine.pytorch module because orig is a reference to the same module object and assigning it back doesn't undo the reload; change the fixture use_te_debug to capture the original NVTE_DEBUG and NVTE_DEBUG_LEVEL (or their absence) and the orig module reference, set the env vars and call importlib.reload(transformer_engine.pytorch) to enable debug, then in the teardown restore the env vars to their previous values (using monkeypatch.setenv or monkeypatch.delenv as appropriate) and finally call importlib.reload(transformer_engine.pytorch) (if orig was present) or delete sys.modules["transformer_engine.pytorch"] (if orig was None) instead of reassigning sys.modules to orig.

bionemo-recipes/recipes/codonfm_native_te/requirements.txt (1)
1-13: ⚠️ Potential issue | 🟠 Major

Add version floors for the core ML stack.
`transformers` is listed now, but the recipe is still completely unbounded. This code relies on newer Torch/TransformerEngine/Transformers APIs, so a fresh install can resolve to combinations the recipe here does not support. Please mirror the minimum versions from `bionemo-recipes/models/codonfm/requirements.txt` for at least `torch`, `transformer_engine[pytorch]`, `transformers`, `pandas`, and `pyarrow`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/requirements.txt` around lines 1 - 13, The requirements.txt has unpinned core ML packages; update it to include minimum version floors for torch, transformer_engine[pytorch], transformers, pandas, and pyarrow to match the floors used in bionemo-recipes/models/codonfm/requirements.txt so the recipe uses compatible APIs; locate the package entries (torch, transformer_engine[pytorch], transformers, pandas, pyarrow) in bionemo-recipes/recipes/codonfm_native_te/requirements.txt and replace them with the same ">=<version>" specifiers found in bionemo-recipes/models/codonfm/requirements.txt.

bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml (1)
42-44: ⚠️ Potential issue | 🟠 Major

Keep the scheduler horizon tied to `num_train_steps`.

The training loop exits on `args.num_train_steps`, so overriding only the top-level step count leaves the LR schedule on a 500k-step curve. Interpolate `lr_scheduler_kwargs.num_training_steps` from the top-level value instead of hard-coding it.

Suggested fix
```diff
 lr_scheduler_kwargs:
   num_warmup_steps: 2_000
-  num_training_steps: 500_000
+  num_training_steps: ${num_train_steps}
```

As per coding guidelines,
`**/bionemo-recipes/recipes/**/*.yaml`: Use Hydra configs for recipe configuration in bionemo-recipes/recipes/.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml` around lines 42 - 44, The LR scheduler horizon is hard-coded to 500_000; change lr_scheduler_kwargs.num_training_steps to interpolate from the top-level training steps so it tracks the real exit condition (e.g., replace the literal 500_000 with a Hydra interpolation referencing the top-level num_train_steps), keeping num_warmup_steps as-is; update the lr_scheduler_kwargs block (look for lr_scheduler_kwargs, num_warmup_steps, num_training_steps) to use the interpolated value instead of the hard-coded number.

bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py (1)
322-325: ⚠️ Potential issue | 🟠 Major

Build RoPE from the active batch length, not the config max.
In BSHD this over-allocates every forward; in THD it becomes wrong once the packed token count exceeds `max_position_embeddings`, because later packed tokens never get matching rotary positions. Derive `max_seq_len` from the current input and land the fix in `bionemo-recipes/models/codonfm/modeling_codonfm_te.py` before syncing this copy.

Suggested fix
```diff
 with torch.autocast(device_type="cuda", enabled=False):
-    te_rope_emb = self.rotary_embeddings(max_seq_len=self.config.max_position_embeddings)
+    if self.config.attn_input_format == "bshd":
+        max_seq_len = hidden_states.shape[1]
+    else:
+        max_seq_len = int(kwargs["cu_seq_lens_q"][-1])
+    te_rope_emb = self.rotary_embeddings(max_seq_len=max_seq_len)
     te_rope_emb = te_rope_emb.to(hidden_states.device, non_blocking=True)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py` around lines 322 - 325, The rotary positional embedding is being built using the config max_position_embeddings which over-allocates and breaks packed sequences; instead compute max_seq_len from the active input length (e.g., seq_len = hidden_states.size(1) or hidden_states.shape[1]) and pass that into self.rotary_embeddings (replace the existing self.rotary_embeddings(max_seq_len=self.config.max_position_embeddings) call), then keep the .to(hidden_states.device, non_blocking=True) call; update the torch.autocast block around that code in the method where hidden_states is available (the block containing self.rotary_embeddings and te_rope_emb) so RoPE is sized to the current batch.

bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py (1)
224-231: ⚠️ Potential issue | 🟠 Major

Final model export is missing tokenizer metadata.
`save_final_model_fsdp2()` writes `model.safetensors` and `config.json`, but `CodonTokenizer.from_pretrained()` requires `tokenizer_config.json`. The resulting `final_model` directory won't be self-contained for inference.

🐛 Suggested fix to include tokenizer
```diff
     # Save final model
     if args.checkpoint.save_final_model and ckpt_path:
+        tokenizer = CodonTokenizer()
         save_final_model_fsdp2(
             model=model,
             config=config,
             save_directory=ckpt_path / "final_model",
             dist_config=dist_config,
         )
+        if dist_config.is_main_process():
+            tokenizer.save_pretrained(ckpt_path / "final_model")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py` around lines 224 - 231, The final model export written by save_final_model_fsdp2() only emits model.safetensors and config.json, but CodonTokenizer.from_pretrained() expects tokenizer files (e.g., tokenizer_config.json and vocab/merges or tokenizer.json); update save_final_model_fsdp2 to also load the tokenizer used during training and write its artifacts into the save_directory (use the tokenizer.save_pretrained or explicitly write tokenizer_config.json and any vocab files) so the final_model directory is self-contained and can be loaded by CodonTokenizer.from_pretrained; reference save_final_model_fsdp2, CodonTokenizer.from_pretrained, and the save_directory/ckpt_path "final_model" target when implementing this change.

bionemo-recipes/recipes/codonfm_native_te/dataset.py (1)
66-79: ⚠️ Potential issue | 🟠 Major

`pq.read_table()` is still eager here.

Switching away from pandas removes one copy, but `pq.read_table(path, columns=["sequence"], memory_map=True)` still materializes the whole `sequence` column in `__init__`. Every rank will preload the full dataset before training starts, so the original host-memory scaling problem is still present. A lazy `ParquetFile` / dataset scanner is needed to actually fix that.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/dataset.py` around lines 66 - 79, The constructor currently eagerly materializes the entire column via pq.read_table; change __init__ to open a pyarrow.parquet.ParquetFile (store as self._pqfile) and precompute metadata like total rows (self._num_rows) and row-group row counts so you can map an index to a row-group and local offset; update __len__ to return self._num_rows; change __getitem__ to lazily read only the needed row (use self._pqfile.read_row_group(row_group, columns=["sequence"]) or read with slice/batch for that row-group and extract the single sequence) instead of using self._table/_sequences, so no full-column materialization occurs and memory stays constant per process.
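The index arithmetic such a lazy reader needs — mapping a flat dataset index to a row-group and local offset before calling `ParquetFile.read_row_group` — can be sketched without pyarrow. The `locate_row` helper and its argument names are illustrative, not the recipe's actual code:

```python
import bisect

def locate_row(row_group_sizes, index):
    """Map a flat row index to (row_group, local_offset) given per-group row counts."""
    # Cumulative end offsets, e.g. sizes [3, 5, 2] -> [3, 8, 10].
    ends = []
    total = 0
    for n in row_group_sizes:
        total += n
        ends.append(total)
    if index < 0 or index >= total:
        raise IndexError(index)
    group = bisect.bisect_right(ends, index)
    start = ends[group - 1] if group > 0 else 0
    return group, index - start

print(locate_row([3, 5, 2], 0))  # → (0, 0)
print(locate_row([3, 5, 2], 4))  # → (1, 1)
print(locate_row([3, 5, 2], 9))  # → (2, 1)
```

With this mapping, `__getitem__` only reads the single row-group containing the requested index, so per-process memory stays bounded by one row-group rather than the full column.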
🧹 Nitpick comments (13)
bionemo-recipes/recipes/codonfm_native_te/checkpoint.py (1)
189-191: Potential `None` assignment when `async_save=False`.

`dcp_save` (synchronous) returns `None`, while `dcp_async_save` returns a future. Assigning the result unconditionally to `_ckpt_futures["fsdp2"]` means subsequent code checking for a pending future will find `None` (which is fine), but the pattern is confusing and could lead to bugs if the check logic changes.

♻️ Clearer approach
```diff
-    ckpt_save_func = dcp_async_save if async_save else dcp_save
-    _ckpt_futures["fsdp2"] = ckpt_save_func(state_dict, checkpoint_id=checkpoint_path, process_group=process_group)
+    if async_save:
+        _ckpt_futures["fsdp2"] = dcp_async_save(state_dict, checkpoint_id=checkpoint_path, process_group=process_group)
+    else:
+        dcp_save(state_dict, checkpoint_id=checkpoint_path, process_group=process_group)
+        _ckpt_futures["fsdp2"] = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/checkpoint.py` around lines 189 - 191, The current assignment stores the return of dcp_save or dcp_async_save into _ckpt_futures["fsdp2"], but dcp_save is synchronous and returns None which makes the stored value ambiguous; change the logic so that when async_save is True you assign the future returned by dcp_async_save to _ckpt_futures["fsdp2"], and when async_save is False you either remove any existing _ckpt_futures["fsdp2"] key or explicitly set it to a sentinel (e.g., False or not present) to avoid storing None; update the block that calls dcp_save/dcp_async_save (refer to the variables async_save, dcp_save, dcp_async_save, _ckpt_futures["fsdp2"], and checkpoint_path) to use this explicit branch so later checks for a pending future are unambiguous.

bionemo-recipes/recipes/codonfm_native_te/tokenizer.py (1)
86-91: Minor: Remove redundant length check in `tokenize()`.

The `if len(codon) == 3` check is redundant. Given the loop `range(0, len(sequence) - 2, 3)`, the slice `sequence[i:i+3]` will always produce exactly 3 characters for all valid loop iterations.

♻️ Proposed simplification
```diff
 def tokenize(self, sequence: str) -> list[str]:
     sequence = sequence.upper()
-    tokens = []
-    for i in range(0, len(sequence) - 2, 3):
-        codon = sequence[i : i + 3]
-        if len(codon) == 3:
-            tokens.append(codon)
-    return tokens
+    return [sequence[i : i + 3] for i in range(0, len(sequence) - 2, 3)]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/tokenizer.py` around lines 86 - 91, The redundant length check in tokenize() should be removed: inside the loop that iterates with range(0, len(sequence) - 2, 3) simply append the sliced codon to tokens without the if len(codon) == 3 guard; update the body of tokenize() to directly do tokens.append(codon) for each codon (references: function tokenize, local variables tokens and codon, and the for loop using range(0, len(sequence) - 2, 3)).

bionemo-recipes/recipes/codonfm_native_te/scheduler.py (1)
19-24: Add type hint for `optimizer` parameter.

The `optimizer` parameter is missing a type annotation, which would improve IDE support and type checking consistency.

♻️ Proposed fix
```diff
+from torch.optim import Optimizer
+
 def get_linear_schedule_with_warmup(
-    optimizer,
+    optimizer: Optimizer,
     num_warmup_steps=2_000,
     num_training_steps=500_000,
     last_epoch=-1,
-):
+) -> LambdaLR:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/scheduler.py` around lines 19 - 24, The get_linear_schedule_with_warmup function is missing a type annotation for the optimizer parameter; update its signature to annotate optimizer as torch.optim.Optimizer (or Optimizer if you add "from torch.optim import Optimizer" at the top) so IDEs and type checkers recognize it, e.g. def get_linear_schedule_with_warmup(optimizer: torch.optim.Optimizer, ...), and add the corresponding import if not already present.

bionemo-recipes/recipes/codonfm_native_te/perf_logger.py (1)
36-37: Hardcoded PAD_TOKEN_ID may diverge from tokenizer.

The constant `PAD_TOKEN_ID = 3` is hardcoded here but should match `CodonTokenizer.pad_token_id`. If the tokenizer's special token ordering ever changes, this would silently break.

♻️ Import from tokenizer or document the coupling
```diff
+# PAD_TOKEN_ID must match CodonTokenizer().pad_token_id
 PAD_TOKEN_ID = 3
```

Or better, pass the pad token ID as a parameter to `PerfLogger.__init__`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/perf_logger.py` around lines 36 - 37, PAD_TOKEN_ID is hardcoded (PAD_TOKEN_ID = 3) and can drift from the tokenizer; update PerfLogger to accept the pad token id from the tokenizer instead of a hardcoded constant: remove or stop using PAD_TOKEN_ID, add a pad_token_id parameter to PerfLogger.__init__ (or read CodonTokenizer.pad_token_id where PerfLogger is constructed) and propagate that value into any methods that reference the pad id so the logger always uses CodonTokenizer.pad_token_id rather than a literal 3.

bionemo-recipes/models/codonfm/tokenizer.py (2)
147-164: `from_pretrained` doesn't restore `codons` list.

The `from_pretrained` method restores `encoder`/`decoder` but leaves `self.codons` with the default generated codons. While this doesn't break functionality (since encoding/decoding use `encoder`/`decoder`), it could cause subtle bugs if code later iterates over `self.codons`.

♻️ Restore codons from encoder
```diff
 @classmethod
 def from_pretrained(cls, load_directory: str) -> "CodonTokenizer":
     with open(os.path.join(load_directory, TOKENIZER_CONFIG_FILE)) as f:
         config = json.load(f)
     tokenizer = cls(seq_type=config["seq_type"])
     # Restore the exact encoder/decoder from the saved config to guarantee round-trip fidelity.
     tokenizer.encoder = {tok: int(i) for tok, i in config["encoder"].items()}
     tokenizer.decoder = {i: tok for tok, i in tokenizer.encoder.items()}
+    # Restore codons list from encoder (tokens not in special tokens)
+    special_set = set(tokenizer.SPECIAL_TOKENS)
+    tokenizer.codons = [
+        tok for tok, _ in sorted(tokenizer.encoder.items(), key=lambda x: x[1])
+        if tok not in special_set
+    ]
     return tokenizer
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tokenizer.py` around lines 147 - 164, The from_pretrained method restores encoder/decoder but leaves self.codons as the default; update from_pretrained (in class CodonTokenizer) to reconstruct tokenizer.codons from the saved encoder mapping (config["encoder"]) by ordering tokens by their integer id (i.e., create a list where index == token id), then assign that list to tokenizer.codons before returning; keep the existing restoration of tokenizer.encoder and tokenizer.decoder.
114-130: Decode uses class constant instead of instance state for special token IDs.

The `decode` method computes `special_ids` from `self.SPECIAL_TOKENS` (class constant) but lookups use `self.encoder` (instance state). If `from_pretrained` ever loads a tokenizer with different special tokens, this could cause inconsistencies. However, since `from_pretrained` currently only restores the encoder without modifying `SPECIAL_TOKENS`, this works correctly.

♻️ Consider storing special_tokens as instance attribute for future-proofing
```diff
 class CodonTokenizer:
     SPECIAL_TOKENS: list[str] = ["<CLS>", "<SEP>", "<UNK>", "<PAD>", "<MASK>"]

     def __init__(self, seq_type="dna"):
         ...
+        self.special_tokens = list(self.SPECIAL_TOKENS)
         all_tokens = self.SPECIAL_TOKENS + self.codons
         ...

     def decode(self, ids: list[int], skip_special_tokens: bool = True) -> str:
-        special_ids = {self.encoder[tok] for tok in self.SPECIAL_TOKENS if tok in self.encoder}
+        special_ids = {self.encoder[tok] for tok in self.special_tokens if tok in self.encoder}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tokenizer.py` around lines 114 - 130, The decode method computes special_ids from the class constant SPECIAL_TOKENS which can diverge from instance state; change decode to use an instance attribute (e.g., self.special_tokens) instead of self.SPECIAL_TOKENS, compute special_ids from self.special_tokens and self.encoder, and ensure the tokenizer loader (from_pretrained) restores/sets self.special_tokens when deserializing so special token lookups remain consistent between encoder/decoder and special token configuration.

bionemo-recipes/models/codonfm/dataset.py (1)
263-267: Collator seed is not passed from dataloader factory.

The `seed` parameter is passed to `SyntheticCodonDataset` and `DistributedSampler` but not to the collators. This means all collators use the default seed `42`, which could cause identical masking patterns across different dataloader instances.

♻️ Pass seed to collators
```diff
 collator = CodonMLMCollator(
     tokenizer=tokenizer,
     max_seq_length=max_seq_length,
     mlm_probability=mlm_probability,
+    seed=seed,
 )
```

Also applies to: 318-322
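The intended behavior can be shown with a minimal sketch, assuming the collator takes a `seed` argument. The class and method names here are illustrative, not the recipe's actual API:

```python
import random

class SeededCollator:
    """Sketch: each collator instance derives its masking RNG from an explicit
    seed instead of a shared default, so two dataloaders built with different
    seeds produce different masking streams."""

    def __init__(self, seed: int = 42):
        self.rng = random.Random(seed)

    def pick_mask_positions(self, seq_len: int, mlm_probability: float = 0.15):
        # Select positions to mask; deterministic given the instance seed.
        return [i for i in range(seq_len) if self.rng.random() < mlm_probability]

# Two fresh instances with the same seed replay identical masking decisions.
print(SeededCollator(seed=1).pick_mask_positions(100)
      == SeededCollator(seed=1).pick_mask_positions(100))  # → True
```

The factory change above then just needs to thread its `seed` argument through to this constructor instead of letting every instance fall back to the default.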
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/dataset.py` around lines 263 - 267, The collator is being instantiated without the dataset/dataloader seed, causing all collators to use the default seed; update the CodonMLMCollator instantiations to accept and use the same seed value passed to SyntheticCodonDataset/DistributedSampler by adding the seed argument when constructing collator (e.g., change CodonMLMCollator(tokenizer=..., max_seq_length=..., mlm_probability=...) to include seed=seed), and make the analogous update for the second collator creation later in the file so each dataloader instance receives the intended seed.

bionemo-recipes/recipes/codonfm_native_te/quantization.py (2)
191-199: Validation allows empty list `[]` to silently disable quantization.

The validation at lines 192-199 only checks non-empty lists for invalid entries. An empty list `[]` passes validation but results in no layers being quantized. This is intentional per the test `test_fp8_enabled_empty_list`, but the docstring doesn't document this behavior clearly.

The past review suggested rejecting empty lists, but the current implementation treats `[]` as "explicitly no layers", which is a valid use case. Consider documenting this behavior.

📝 Document the empty list behavior in the docstring
```diff
 Args:
     num_layers: Total number of transformer layers in the model.
     fp8_enabled: Whether FP8 quantization is enabled.
     fp4_enabled: Whether FP4 quantization is enabled.
-    fp8_layers: 1-indexed list of layers for FP8, or None if not specified.
-    fp4_layers: 1-indexed list of layers for FP4, or None if not specified.
+    fp8_layers: 1-indexed list of layers for FP8. None means "use default"
+        (all layers when only FP8 is enabled, remaining layers otherwise).
+        An empty list ``[]`` explicitly means "no FP8 layers".
+    fp4_layers: 1-indexed list of layers for FP4. Same semantics as fp8_layers.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/quantization.py` around lines 191 - 199, Update the docstring for the quantization configuration (near the fp8_layers/fp4_layers validation in quantization.py) to explicitly state that passing an empty list ([]) for fp8_layers or fp4_layers is treated as "explicitly no layers" and is allowed (it disables quantization for that dtype), while non-empty lists must contain integers in 1..num_layers; reference the existing test test_fp8_enabled_empty_list for expected behavior. Ensure the docstring mentions the accepted types (None, list of ints, or empty list) and the semantic difference between None (use defaults) and [] (disable quantization).
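The None-versus-empty-list semantics being documented can be illustrated with a standalone sketch. This simplifies to a single dtype, and `resolve_quant_layers` is a hypothetical helper, not the module's actual function:

```python
def resolve_quant_layers(num_layers, enabled, layers):
    """Sketch of the documented semantics for a single quantization dtype:
    None -> default (all layers when enabled), [] -> explicitly no layers,
    otherwise a validated 1-indexed list."""
    if not enabled:
        return set()
    if layers is None:
        # "Use default": quantize every layer (the real code splits defaults
        # between FP8 and FP4 when both are enabled).
        return set(range(1, num_layers + 1))
    for layer in layers:
        if not 1 <= layer <= num_layers:
            raise ValueError(f"layer {layer} outside 1..{num_layers}")
    return set(layers)  # [] passes through as "explicitly no layers"

print(resolve_quant_layers(4, True, None))    # → {1, 2, 3, 4}
print(resolve_quant_layers(4, True, []))      # → set()
print(resolve_quant_layers(4, True, [2, 4]))  # → {2, 4}
```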
101-109: Temp file is not cleaned up after use.
`tempfile.NamedTemporaryFile(delete=False)` creates a file that persists until explicitly deleted. While this is intentional (the file path is returned for later use), the caller should document or handle cleanup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/quantization.py` around lines 101 - 109, The temp file created with tempfile.NamedTemporaryFile(delete=False) (temp_file / temp_file.name returned by this function) is left on disk; either change the function to not persist the file (write to a temporary file with delete=True or return the YAML string instead of returning temp_file.name) or explicitly document/implement cleanup responsibility: add a clear docstring note that the caller must unlink temp_file.name (or add an optional parameter like cleanup=False that when True deletes the file before returning or registers an atexit.unlink for temp_file.name). Ensure references to temp_file, temp_file.name, yaml.dump and the return value are updated accordingly.

bionemo-recipes/models/codonfm/tests/generate_golden_values.py (1)
36-49: Consider documenting the path setup rationale.

The `sys.path` manipulation is necessary to import both PTL and native TE modules, but a brief comment explaining why this is needed would improve maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tests/generate_golden_values.py` around lines 36 - 49, Add a concise explanatory comment above the path setup block describing why we modify sys.path and set CODON_FM_TE_IMPL: explain that we need to prepend PTL_TE_DIR, PTL_TE_DIR/src, and NATIVE_TE_DIR to sys.path so tests can import the PTL-based recipe model and the native codonfm model side-by-side, and note that CODON_FM_TE_IMPL="nonexact" forces the non-exact TETransformerLayer implementation for deterministic golden-value generation; place this comment adjacent to the existing REPO_ROOT/PTL_TE_DIR/NATIVE_TE_DIR and os.environ lines so future readers understand the intent.

bionemo-recipes/models/codonfm/tests/common/fixtures.py (2)
117-134: Environment variables not restored on fixture teardown in error cases.

If a test using `te_attn_backend` raises an exception before yielding, the cleanup code after `yield` won't run, leaving env vars and the backend flag in an inconsistent state.

♻️ Use try/finally for cleanup
```diff
 @pytest.fixture(params=["flash_attn", "fused_attn"])
 def te_attn_backend(request):
     """Fixture to parametrize the attention implementation."""
-    if request.param == "flash_attn":
-        os.environ["NVTE_FUSED_ATTN"] = "0"
-        os.environ["NVTE_FLASH_ATTN"] = "1"
-        _attention_backends["backend_selection_requires_update"] = True
-
-    else:
-        os.environ["NVTE_FUSED_ATTN"] = "1"
-        os.environ["NVTE_FLASH_ATTN"] = "0"
+    try:
+        if request.param == "flash_attn":
+            os.environ["NVTE_FUSED_ATTN"] = "0"
+            os.environ["NVTE_FLASH_ATTN"] = "1"
+        else:
+            os.environ["NVTE_FUSED_ATTN"] = "1"
+            os.environ["NVTE_FLASH_ATTN"] = "0"
         _attention_backends["backend_selection_requires_update"] = True
-    yield request.param
-
-    os.environ.pop("NVTE_FUSED_ATTN", None)
-    os.environ.pop("NVTE_FLASH_ATTN", None)
-    _attention_backends["backend_selection_requires_update"] = True
+        yield request.param
+    finally:
+        os.environ.pop("NVTE_FUSED_ATTN", None)
+        os.environ.pop("NVTE_FLASH_ATTN", None)
+        _attention_backends["backend_selection_requires_update"] = True
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tests/common/fixtures.py` around lines 117 - 134, The te_attn_backend fixture sets NVTE_FUSED_ATTN/NVTE_FLASH_ATTN and flips _attention_backends["backend_selection_requires_update"] but if an exception occurs before the yield the post-yield cleanup won't run; wrap the setup/yield in a try/finally (or register a finalizer via request.addfinalizer) so you first capture original values of os.environ.get("NVTE_FUSED_ATTN") and os.environ.get("NVTE_FLASH_ATTN"), set the desired values based on request.param inside the try, yield request.param, and in the finally restore the original env values (pop if None) and reset _attention_backends["backend_selection_requires_update"] = True to ensure consistent state even on errors in te_attn_backend.
40-43: Socket binds to all interfaces.

The empty string `""` binds the socket to all network interfaces. While this is acceptable for finding an ephemeral port in a test environment, CodeQL flagged it as a potential security concern.

♻️ Consider binding to localhost only
```diff
 def unused_tcp_port() -> int:
     with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
-        s.bind(("", 0))
+        s.bind(("127.0.0.1", 0))
         s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
         return s.getsockname()[1]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tests/common/fixtures.py` around lines 40 - 43, The test helper currently binds the ephemeral socket to all interfaces via s.bind(("", 0)); change the bind call to bind only to localhost (e.g., s.bind(("127.0.0.1", 0)) or ("localhost", 0)) while keeping the surrounding socket.socket(...), s.setsockopt(...), and return s.getsockname()[1] logic intact so the function still discovers an ephemeral port but only on the loopback interface.

bionemo-recipes/models/codonfm/tests/common/test_modeling_common.py (1)
185-189: Broaden the `input_data` type contract for THD metadata.

This base class annotates `get_test_input_data()` and `compare_outputs()` as `Dict[str, torch.Tensor]`, but the current THD paths already rely on non-tensor entries like `max_length_q`, `max_length_k`, and `pad_between_seqs`. Keeping the narrower type hides Pyright signal and forces downstream workarounds. A union or `TypedDict` for the mixed tensor/int/bool payload would be more accurate.

🛠️ Proposed fix
```diff
-    ) -> Dict[str, torch.Tensor]:
+    ) -> Dict[str, torch.Tensor | int | bool]:
@@
-    input_data: Dict[str, torch.Tensor],
+    input_data: Dict[str, torch.Tensor | int | bool],
```

As per coding guidelines,
`**/*.py`: Use Pyright for type checking.

Also applies to: 381-381
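A `TypedDict` variant of the suggestion can be sketched without torch (field names are taken from the comment above; `List[int]` stands in for `torch.Tensor` so the sketch runs anywhere):

```python
from typing import List, TypedDict

class THDInputData(TypedDict, total=False):
    # Tensor-valued entries (List[int] stands in for torch.Tensor here).
    input_ids: List[int]
    cu_seq_lens_q: List[int]
    # Non-tensor THD metadata that Dict[str, torch.Tensor] cannot express.
    max_length_q: int
    max_length_k: int
    pad_between_seqs: bool

batch: THDInputData = {
    "input_ids": [5, 6, 7, 8],
    "cu_seq_lens_q": [0, 2, 4],
    "max_length_q": 2,
    "max_length_k": 2,
    "pad_between_seqs": False,
}
print(batch["max_length_q"])  # → 2
```

Unlike the plain union, a `TypedDict` lets Pyright check each key's type individually, so `batch["pad_between_seqs"]` is known to be `bool` rather than `torch.Tensor | int | bool`.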
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tests/common/test_modeling_common.py` around lines 185 - 189, The type annotation for get_test_input_data and compare_outputs is too narrow (Dict[str, torch.Tensor]) because THD paths include non-tensor metadata (e.g., max_length_q, max_length_k, pad_between_seqs); update the type contract to reflect a mixed payload by introducing a TypedDict or a Union type (e.g., InputDataTypedDict with torch.Tensor fields and int/bool fields) and replace occurrences of Dict[str, torch.Tensor] on get_test_input_data and compare_outputs (and any other mentions like input_data in tests) with the new type so Pyright correctly recognizes the int/bool metadata used by THD code paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bionemo-recipes/models/codonfm/modeling_codonfm_te.py`:
- Around line 109-125: Add upfront validation for self.attn_input_format in the
model initializer: ensure its value is one of the supported options (e.g.,
"bshd" or "thd") and raise a clear ValueError if not. Locate the validation
block near the existing checks (the same area that validates self.hidden_size,
self.hidden_act, and self.layer_precision) inside the ModelingCodonfmTe
initializer and add a check that inspects self.attn_input_format and raises a
descriptive error referencing the invalid value.
- Around line 447-466: The value parameter is being initialized with w_gain
derived from is_qk, which wrongly makes value share Q/K's gain; change the
initialization so query and key weights still use w_gain (as computed from
is_qk) but module.value_weight uses the scaled gain variable (gain) instead, and
ensure module.value_bias is zeroed when present; locate the block iterating
self.named_modules() (checking isinstance(module, (nn.Linear,
transformer_engine.pytorch.Linear))) and update the module.value_weight and
module.value_bias initialization to use gain rather than w_gain while leaving
query_weight and key_weight logic unchanged.
In `@bionemo-recipes/models/codonfm/tests/common/fixtures.py`:
- Around line 28-30: The fixture sets NVTE_DEBUG after Transformer Engine
modules are imported (recipe_module, fp8, _attention_backends), so move the env
var setup to run before those imports: either set NVTE_DEBUG via
monkeypatch.setenv() and then import/reload transformer_engine modules inside
the fixture (use importlib.reload on recipe_module, fp8 module, and
transformer_engine.pytorch.attention.dot_product_attention) or delay the
original top-level imports by moving them into the fixture after the env is set;
apply the identical fix to the original copied test fixture so both files are in
sync.
In `@bionemo-recipes/models/codonfm/tests/common/test_modeling_common.py`:
- Around line 398-439: The guards using hasattr(te_outputs,
"...")/hasattr(hf_outputs, "...") are wrong because dataclass fields exist but
may be None; replace them with explicit None checks: e.g., for loss use checks
like compare_loss and getattr(te_outputs, "loss", None) is not None and
getattr(hf_outputs, "loss", None) is not None; for logits ensure
input_data["attention_mask"] handling still only indexes when te_outputs.logits
and hf_outputs.logits are not None; and for hidden states replace the hasattr
guards with te_outputs.hidden_states is not None and hf_outputs.hidden_states is
not None before calling len(...) or iterating in the loop to avoid len(None) and
assert_close(None, ...).
In `@bionemo-recipes/models/codonfm/tests/test_modeling_codonfm_te.py`:
- Line 287: The test is loading the golden checkpoint with
model.load_state_dict(golden_sd, strict=False) which hides missing/unexpected
keys; change each call (the one shown and the other two occurrences around the
file) to capture the return value like missing, unexpected =
model.load_state_dict(golden_sd, strict=False) and add assertions that missing
== [] and unexpected == [] (or use assert not missing and assert not unexpected)
so the regression test fails when the checkpoint doesn't fully match the model
architecture; apply this change to the call in test_modeling_codonfm_te (the
shown model.load_state_dict invocation) and the two other similar calls in the
file.
In `@bionemo-recipes/recipes/codonfm_native_te/checkpoint.py`:
- Line 27: The import in checkpoint.py should use a relative import to ensure
the package-local module is resolved; replace the bare "from distributed_config
import DistributedConfig" with a package-relative import (e.g. "from
.distributed_config import DistributedConfig") so the DistributedConfig symbol
is imported from the local module rather than relying on sys.path.
In `@bionemo-recipes/recipes/codonfm_native_te/dataset.py`:
- Around line 118-122: The truncation currently calls tokenizer.encode(...,
add_special_tokens=True) then slices ids which can drop the terminal SEP;
instead call tokenizer.encode(..., add_special_tokens=False) or otherwise obtain
the raw codon token ids, truncate that raw list to self.max_seq_length -
N_reserved (reserve slots for CLS/SEP), then append the special tokens (CLS and
SEP) explicitly so SEP is preserved; update the loop that builds ids in the
batch processing (the for sample in batch block using tokenizer.encode and ids =
ids[: self.max_seq_length] and seq_len = len(ids)) and make the identical change
in the second collator loop (the other block similar to lines 191-194) so both
collators truncate raw codon IDs first and then add CLS/SEP.
- Around line 85-103: The collators use a hardcoded Random(42) and the
dataloader factories don't forward the seed, causing identical masking across
workers; update create_bshd_dataloader() and create_thd_dataloader() to accept
and pass the caller seed into CodonMLMCollator and CodonTHDCollator
constructors, remove reliance on a single class-level rng (e.g., the
Random(seed) stored in __init__), and inside CodonMLMCollator and
CodonTHDCollator derive a worker-local RNG for masking using
torch.utils.data.get_worker_info().seed (or torch.initial_seed() if
get_worker_info() is None) to seed a per-worker random.Random/torch.Generator so
each worker produces independent masking streams and worker respawns don't
replay the same masks.
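The truncate-then-append ordering asked for in the first bullet above can be sketched with plain lists; the CLS/SEP ids below are hypothetical placeholders, not the real tokenizer's values.

```python
# Hypothetical special-token ids for illustration (not the real tokenizer's).
CLS_ID, SEP_ID = 0, 2


def truncate_with_specials(raw_ids, max_seq_length):
    """Truncate raw codon ids first, then append CLS/SEP so SEP survives."""
    body = raw_ids[: max_seq_length - 2]  # reserve two slots for CLS and SEP
    return [CLS_ID] + body + [SEP_ID]


ids = truncate_with_specials(list(range(100, 120)), max_seq_length=8)
assert len(ids) == 8
assert ids[0] == CLS_ID and ids[-1] == SEP_ID  # SEP is never sliced away

# Contrast: encoding with specials first and slicing afterwards drops SEP.
naive = ([CLS_ID] + list(range(100, 120)) + [SEP_ID])[:8]
assert naive[-1] != SEP_ID
```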
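For the per-worker RNG in the second bullet, the core idea is deriving an independent seed per worker. A dependency-free sketch, assuming a plain integer worker id; in the real collator the worker identity would come from `torch.utils.data.get_worker_info()` (or `torch.initial_seed()` in the main process):

```python
import random


def make_worker_rng(base_seed, worker_id):
    """Derive an independent masking RNG per dataloader worker.

    The seed mix (base_seed * 100_003 + worker_id) is an illustrative
    choice; any injective combination of base seed and worker id works.
    """
    return random.Random(base_seed * 100_003 + worker_id)


rng0 = make_worker_rng(42, 0)
rng1 = make_worker_rng(42, 1)

masks0 = [rng0.random() < 0.15 for _ in range(1000)]
masks1 = [rng1.random() < 0.15 for _ in range(1000)]

# Different workers draw different masking streams...
assert masks0 != masks1
# ...while the same (seed, worker) pair stays reproducible.
rng0_again = make_worker_rng(42, 0)
assert masks0 == [rng0_again.random() < 0.15 for _ in range(1000)]
```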
In `@bionemo-recipes/recipes/codonfm_native_te/README.md`:
- Around line 49-66: The fenced repository-tree code block (the triple-backtick
block showing the codonfm_native_te directory tree) is missing a language tag
and triggers MD040; update the opening fence to use a language tag by changing
``` to ```text so the tree is treated as plain text in the README's
codonfm_native_te section.
In `@bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py`:
- Around line 37-42: The device_mesh fixture currently constructs
DistributedConfig() which may default MASTER_PORT and collide; before calling
DistributedConfig() pick an ephemeral free loopback port (e.g., bind a temporary
socket to 127.0.0.1:0, get the assigned port, close the socket) and set
os.environ["MASTER_PORT"] to that port string, then instantiate
DistributedConfig(), create the torch.device and call
torch.distributed.init_process_group as before; update the device_mesh fixture
to perform this pre-seed so each test session uses a unique rendezvous port.
In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_train.py`:
- Around line 49-56: The test composes the L0_sanity config but leaves
checkpoint.ckpt_dir null, risking reuse of stale checkpoints; update the
overrides passed to compose (the overrides list in the block that calls compose
for L0_sanity) to include "+checkpoint.ckpt_dir={tmp_path}" so the checkpoint
directory is explicitly set to the temporary path (consistent with other sanity
variants) and ensures test isolation when resume_from_checkpoint is true.
- Around line 81-94: The FP8 test test_sanity_convergence_fsdp2_fp8 currently
lacks a proper GPU capability guard; replace the imprecise major-version check
with the existing helper transformer_engine.pytorch.fp8.check_fp8_support() to
skip the test when FP8 isn't available (following the pattern used in
esm2_native_te/tests/test_train.py), i.e., import check_fp8_support and apply it
as a pytest skip/marker guard around test_sanity_convergence_fsdp2_fp8 so the
test runs only when check_fp8_support() returns True.
---
Duplicate comments:
In `@bionemo-recipes/models/codonfm/requirements.txt`:
- Around line 1-3: Update the invalid dependency constraint for
transformer_engine in requirements.txt: replace the non-existent
"transformer_engine>=1.14.0" with "transformer_engine>=2.1.0" so the project
uses a PyPI-available release that includes the FP8/FP4 quantization features
referenced by the code; ensure the line in requirements.txt containing
transformer_engine is edited accordingly and keep the existing torch and
transformers lines unchanged.
In `@bionemo-recipes/models/codonfm/tests/conftest.py`:
- Around line 39-45: The probe socket in unused_tcp_port currently binds to all
interfaces (s.bind(("", 0))) which is flagged by CodeQL; change the bind call in
the unused_tcp_port function to bind only to the loopback interface (e.g.,
s.bind(("127.0.0.1", 0))) so the fixture remains local while keeping the same
behavior of returning an unused TCP port.
- Around line 48-58: The teardown must restore the original environment
variables before reloading the transformer_engine.pytorch module because orig is
a reference to the same module object and assigning it back doesn't undo the
reload; change the fixture use_te_debug to capture the original NVTE_DEBUG and
NVTE_DEBUG_LEVEL (or their absence) and the orig module reference, set the env
vars and call importlib.reload(transformer_engine.pytorch) to enable debug, then
in the teardown restore the env vars to their previous values (using
monkeypatch.setenv or monkeypatch.delenv as appropriate) and finally call
importlib.reload(transformer_engine.pytorch) (if orig was present) or delete
sys.modules["transformer_engine.pytorch"] (if orig was None) instead of
reassigning sys.modules to orig.
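The teardown ordering demanded above can be sketched generically. This is a minimal, library-agnostic illustration using `json` as a reload-safe stand-in for `transformer_engine.pytorch`: `importlib.reload()` re-executes a module in place and returns the same object, so restoring the old object reference undoes nothing — the env vars must be restored first, then the module reloaded again.

```python
import importlib
import json  # any already-imported, reload-safe module works for the demo
import os
import sys
from contextlib import contextmanager


@contextmanager
def reloaded_with_env(module_name, **env):
    """Reload an already-imported module under temporary env vars.

    Assigning the old module object back into sys.modules does NOT undo a
    reload, because importlib.reload() mutates the module in place. So the
    teardown restores the env vars first, then reloads once more under the
    original environment.
    """
    saved = {key: os.environ.get(key) for key in env}
    os.environ.update(env)
    try:
        yield importlib.reload(sys.modules[module_name])
    finally:
        for key, value in saved.items():
            if value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = value
        importlib.reload(sys.modules[module_name])


before = os.environ.get("NVTE_DEBUG")
with reloaded_with_env("json", NVTE_DEBUG="1") as mod:
    assert os.environ["NVTE_DEBUG"] == "1"
    assert mod is json  # reload mutates in place; the object is unchanged
assert os.environ.get("NVTE_DEBUG") == before
```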
In `@bionemo-recipes/recipes/codonfm_native_te/checkpoint.py`:
- Around line 196-197: Race condition: pruning (prune_checkpoints) runs only on
the main rank (dist_config.is_main_process()) while other ranks may still be
saving; add a synchronization point after all ranks finish saving or disallow
async saves with max_checkpoints. Specifically, after the checkpoint save
completes (the code path that calls save_checkpoint / async_save), insert a
distributed barrier (use the existing dist_config barrier primitive) so all
ranks have finished writing before the main rank calls
prune_checkpoints(ckpt_path, max_checkpoints); alternatively, if async_save is
used, validate and raise or log an error when max_checkpoints is set to prevent
pruning during in-progress async saves and document this constraint in the
function that handles checkpoint saving and pruning.
In `@bionemo-recipes/recipes/codonfm_native_te/dataset.py`:
- Around line 66-79: The constructor currently eagerly materializes the entire
column via pq.read_table; change __init__ to open a pyarrow.parquet.ParquetFile
(store as self._pqfile) and precompute metadata like total rows (self._num_rows)
and row-group row counts so you can map an index to a row-group and local
offset; update __len__ to return self._num_rows; change __getitem__ to lazily
read only the needed row (use self._pqfile.read_row_group(row_group,
columns=["sequence"]) or read with slice/batch for that row-group and extract
the single sequence) instead of using self._table/_sequences, so no full-column
materialization occurs and memory stays constant per process.
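The index-to-row-group mapping that makes these lazy reads possible is plain arithmetic over cumulative row counts. A dependency-free sketch — in the real dataset the row-group sizes would come from `pyarrow.parquet.ParquetFile` metadata rather than a hand-written list:

```python
import bisect
from itertools import accumulate


class RowGroupIndex:
    """Map a flat row index to (row_group, local_offset)."""

    def __init__(self, row_group_sizes):
        # starts[i] is the global index of the first row in group i.
        self.starts = [0] + list(accumulate(row_group_sizes))

    def __len__(self):
        return self.starts[-1]

    def locate(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        group = bisect.bisect_right(self.starts, idx) - 1
        return group, idx - self.starts[group]


index = RowGroupIndex([4, 4, 2])  # three row groups, 10 rows total
assert len(index) == 10
assert index.locate(0) == (0, 0)
assert index.locate(5) == (1, 1)
assert index.locate(9) == (2, 1)
```

With this mapping, `__getitem__` only has to read the one row group containing the requested index, keeping memory constant per process.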
In `@bionemo-recipes/recipes/codonfm_native_te/distributed_config.py`:
- Around line 36-40: The field default factories in DistributedConfig (rank,
local_rank, world_size, _master_addr, _master_port) currently call
os.environ.setdefault(...) which mutates the global environment; change each
lambda to use os.getenv(...) (or os.environ.get(...)) instead so defaults are
read without writing to os.environ, e.g. replace the default_factory lambdas for
rank, local_rank, world_size, _master_addr and _master_port to call os.getenv/
.get with the same default strings.
In `@bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml`:
- Around line 42-44: The LR scheduler horizon is hard-coded to 500_000; change
lr_scheduler_kwargs.num_training_steps to interpolate from the top-level
training steps so it tracks the real exit condition (e.g., replace the literal
500_000 with a Hydra interpolation referencing the top-level num_train_steps),
keeping num_warmup_steps as-is; update the lr_scheduler_kwargs block (look for
lr_scheduler_kwargs, num_warmup_steps, num_training_steps) to use the
interpolated value instead of the hard-coded number.
In `@bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py`:
- Around line 322-325: The rotary positional embedding is being built using the
config max_position_embeddings which over-allocates and breaks packed sequences;
instead compute max_seq_len from the active input length (e.g., seq_len =
hidden_states.size(1) or hidden_states.shape[1]) and pass that into
self.rotary_embeddings (replace the existing
self.rotary_embeddings(max_seq_len=self.config.max_position_embeddings) call),
then keep the .to(hidden_states.device, non_blocking=True) call; update the
torch.autocast block around that code in the method where hidden_states is
available (the block containing self.rotary_embeddings and te_rope_emb) so RoPE
is sized to the current batch.
In `@bionemo-recipes/recipes/codonfm_native_te/requirements.txt`:
- Around line 1-13: The requirements.txt has unpinned core ML packages; update
it to include minimum version floors for torch, transformer_engine[pytorch],
transformers, pandas, and pyarrow to match the floors used in
bionemo-recipes/models/codonfm/requirements.txt so the recipe uses compatible
APIs; locate the package entries (torch, transformer_engine[pytorch],
transformers, pandas, pyarrow) in
bionemo-recipes/recipes/codonfm_native_te/requirements.txt and replace them with
the same ">=<version>" specifiers found in
bionemo-recipes/models/codonfm/requirements.txt.
In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py`:
- Around line 224-231: The final model export written by
save_final_model_fsdp2() only emits model.safetensors and config.json, but
CodonTokenizer.from_pretrained() expects tokenizer files (e.g.,
tokenizer_config.json and vocab/merges or tokenizer.json); update
save_final_model_fsdp2 to also load the tokenizer used during training and write
its artifacts into the save_directory (use the tokenizer.save_pretrained or
explicitly write tokenizer_config.json and any vocab files) so the final_model
directory is self-contained and can be loaded by CodonTokenizer.from_pretrained;
reference save_final_model_fsdp2, CodonTokenizer.from_pretrained, and the
save_directory/ckpt_path "final_model" target when implementing this change.
---
Nitpick comments:
In `@bionemo-recipes/models/codonfm/dataset.py`:
- Around line 263-267: The collator is being instantiated without the
dataset/dataloader seed, causing all collators to use the default seed; update
the CodonMLMCollator instantiations to accept and use the same seed value passed
to SyntheticCodonDataset/DistributedSampler by adding the seed argument when
constructing collator (e.g., change CodonMLMCollator(tokenizer=...,
max_seq_length=..., mlm_probability=...) to include seed=seed), and make the
analogous update for the second collator creation later in the file so each
dataloader instance receives the intended seed.
In `@bionemo-recipes/models/codonfm/tests/common/fixtures.py`:
- Around line 117-134: The te_attn_backend fixture sets
NVTE_FUSED_ATTN/NVTE_FLASH_ATTN and flips
_attention_backends["backend_selection_requires_update"] but if an exception
occurs before the yield the post-yield cleanup won't run; wrap the setup/yield
in a try/finally (or register a finalizer via request.addfinalizer) so you first
capture original values of os.environ.get("NVTE_FUSED_ATTN") and
os.environ.get("NVTE_FLASH_ATTN"), set the desired values based on request.param
inside the try, yield request.param, and in the finally restore the original env
values (pop if None) and reset
_attention_backends["backend_selection_requires_update"] = True to ensure
consistent state even on errors in te_attn_backend.
- Around line 40-43: The test helper currently binds the ephemeral socket to all
interfaces via s.bind(("", 0)); change the bind call to bind only to localhost
(e.g., s.bind(("127.0.0.1", 0)) or ("localhost", 0)) while keeping the
surrounding socket.socket(...), s.setsockopt(...), and return s.getsockname()[1]
logic intact so the function still discovers an ephemeral port but only on the
loopback interface.
In `@bionemo-recipes/models/codonfm/tests/common/test_modeling_common.py`:
- Around line 185-189: The type annotation for get_test_input_data and
compare_outputs is too narrow (Dict[str, torch.Tensor]) because THD paths
include non-tensor metadata (e.g., max_length_q, max_length_k,
pad_between_seqs); update the type contract to reflect a mixed payload by
introducing a TypedDict or a Union type (e.g., InputDataTypedDict with
torch.Tensor fields and int/bool fields) and replace occurrences of Dict[str,
torch.Tensor] on get_test_input_data and compare_outputs (and any other mentions
like input_data in tests) with the new type so Pyright correctly recognizes the
int/bool metadata used by THD code paths.
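A `TypedDict` along the lines suggested above could look like the sketch below; lists stand in for `torch.Tensor` so the example stays dependency-free, and the field names simply mirror the ones named in the comment.

```python
from typing import List, TypedDict


class THDInputData(TypedDict, total=False):
    """Mixed THD payload: tensor-like fields plus int/bool metadata.

    total=False makes every field optional, matching payloads that only
    carry the THD metadata on packed-sequence code paths.
    """

    input_ids: List[int]       # torch.Tensor in the real tests
    attention_mask: List[int]  # torch.Tensor in the real tests
    max_length_q: int
    max_length_k: int
    pad_between_seqs: bool


batch: THDInputData = {
    "input_ids": [5, 6, 7],
    "max_length_q": 3,
    "max_length_k": 3,
    "pad_between_seqs": False,
}

# A type checker now accepts the int/bool metadata that a blanket
# Dict[str, torch.Tensor] annotation would reject; at runtime it is
# still an ordinary dict.
assert batch["max_length_q"] == 3
```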
In `@bionemo-recipes/models/codonfm/tests/generate_golden_values.py`:
- Around line 36-49: Add a concise explanatory comment above the path setup
block describing why we modify sys.path and set CODON_FM_TE_IMPL: explain that
we need to prepend PTL_TE_DIR, PTL_TE_DIR/src, and NATIVE_TE_DIR to sys.path so
tests can import the PTL-based recipe model and the native codonfm model
side-by-side, and note that CODON_FM_TE_IMPL="nonexact" forces the non-exact
TETransformerLayer implementation for deterministic golden-value generation;
place this comment adjacent to the existing REPO_ROOT/PTL_TE_DIR/NATIVE_TE_DIR
and os.environ lines so future readers understand the intent.
In `@bionemo-recipes/models/codonfm/tokenizer.py`:
- Around line 147-164: The from_pretrained method restores encoder/decoder but
leaves self.codons as the default; update from_pretrained (in class
CodonTokenizer) to reconstruct tokenizer.codons from the saved encoder mapping
(config["encoder"]) by ordering tokens by their integer id (i.e., create a list
where index == token id), then assign that list to tokenizer.codons before
returning; keep the existing restoration of tokenizer.encoder and
tokenizer.decoder.
- Around line 114-130: The decode method computes special_ids from the class
constant SPECIAL_TOKENS which can diverge from instance state; change decode to
use an instance attribute (e.g., self.special_tokens) instead of
self.SPECIAL_TOKENS, compute special_ids from self.special_tokens and
self.encoder, and ensure the tokenizer loader (from_pretrained) restores/sets
self.special_tokens when deserializing so special token lookups remain
consistent between encoder/decoder and special token configuration.
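The reconstruction asked for in the first bullet above is a simple inversion of the encoder mapping: order tokens by their integer id so that list index equals token id. A sketch with a hypothetical encoder dict (the real one would come from the saved config):

```python
def codons_from_encoder(encoder):
    """Rebuild the ordered token list so that index == token id."""
    return [token for token, _ in sorted(encoder.items(), key=lambda kv: kv[1])]


# Hypothetical encoder mapping as it might appear in a saved config.
encoder = {"<cls>": 0, "<pad>": 1, "AAA": 2, "AAC": 3}
codons = codons_from_encoder(encoder)

assert codons == ["<cls>", "<pad>", "AAA", "AAC"]
# Round-trip invariant: every token sits at the index of its own id.
assert all(encoder[tok] == i for i, tok in enumerate(codons))
```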
In `@bionemo-recipes/recipes/codonfm_native_te/checkpoint.py`:
- Around line 189-191: The current assignment stores the return of dcp_save or
dcp_async_save into _ckpt_futures["fsdp2"], but dcp_save is synchronous and
returns None which makes the stored value ambiguous; change the logic so that
when async_save is True you assign the future returned by dcp_async_save to
_ckpt_futures["fsdp2"], and when async_save is False you either remove any
existing _ckpt_futures["fsdp2"] key or explicitly set it to a sentinel (e.g.,
False or not present) to avoid storing None; update the block that calls
dcp_save/dcp_async_save (refer to the variables async_save, dcp_save,
dcp_async_save, _ckpt_futures["fsdp2"], and checkpoint_path) to use this
explicit branch so later checks for a pending future are unambiguous.
In `@bionemo-recipes/recipes/codonfm_native_te/perf_logger.py`:
- Around line 36-37: PAD_TOKEN_ID is hardcoded (PAD_TOKEN_ID = 3) and can drift
from the tokenizer; update PerfLogger to accept the pad token id from the
tokenizer instead of a hardcoded constant: remove or stop using PAD_TOKEN_ID,
add a pad_token_id parameter to PerfLogger.__init__ (or read
CodonTokenizer.pad_token_id where PerfLogger is constructed) and propagate that
value into any methods that reference the pad id so the logger always uses
CodonTokenizer.pad_token_id rather than a literal 3.
In `@bionemo-recipes/recipes/codonfm_native_te/quantization.py`:
- Around line 191-199: Update the docstring for the quantization configuration
(near the fp8_layers/fp4_layers validation in quantization.py) to explicitly
state that passing an empty list ([]) for fp8_layers or fp4_layers is treated as
"explicitly no layers" and is allowed (it disables quantization for that dtype),
while non-empty lists must contain integers in 1..num_layers; reference the
existing test test_fp8_enabled_empty_list for expected behavior. Ensure the
docstring mentions the accepted types (None, list of ints, or empty list) and
the semantic difference between None (use defaults) and [] (disable
quantization).
- Around line 101-109: The temp file created with
tempfile.NamedTemporaryFile(delete=False) (temp_file / temp_file.name returned
by this function) is left on disk; either change the function to not persist the
file (write to a temporary file with delete=True or return the YAML string
instead of returning temp_file.name) or explicitly document/implement cleanup
responsibility: add a clear docstring note that the caller must unlink
temp_file.name (or add an optional parameter like cleanup=False that when True
deletes the file before returning or registers an atexit.unlink for
temp_file.name). Ensure references to temp_file, temp_file.name, yaml.dump and
the return value are updated accordingly.
In `@bionemo-recipes/recipes/codonfm_native_te/scheduler.py`:
- Around line 19-24: The get_linear_schedule_with_warmup function is missing a
type annotation for the optimizer parameter; update its signature to annotate
optimizer as torch.optim.Optimizer (or Optimizer if you add "from torch.optim
import Optimizer" at the top) so IDEs and type checkers recognize it, e.g. def
get_linear_schedule_with_warmup(optimizer: torch.optim.Optimizer, ...), and add
the corresponding import if not already present.
In `@bionemo-recipes/recipes/codonfm_native_te/tokenizer.py`:
- Around line 86-91: The redundant length check in tokenize() should be removed:
inside the loop that iterates with range(0, len(sequence) - 2, 3) simply append
the sliced codon to tokens without the if len(codon) == 3 guard; update the body
of tokenize() to directly do tokens.append(codon) for each codon (references:
function tokenize, local variables tokens and codon, and the for loop using
range(0, len(sequence) - 2, 3)).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 36a2c98d-c4b7-4c5a-a41c-dd19dde3fdce
⛔ Files ignored due to path filters (1)

- `bionemo-recipes/recipes/codonfm_native_te/train.parquet` is excluded by `!**/*.parquet`

📒 Files selected for processing (40)

- .gitignore
- bionemo-recipes/models/codonfm/dataset.py
- bionemo-recipes/models/codonfm/modeling_codonfm_te.py
- bionemo-recipes/models/codonfm/requirements.txt
- bionemo-recipes/models/codonfm/tests/common/README.md
- bionemo-recipes/models/codonfm/tests/common/__init__.py
- bionemo-recipes/models/codonfm/tests/common/fixtures.py
- bionemo-recipes/models/codonfm/tests/common/test_modeling_common.py
- bionemo-recipes/models/codonfm/tests/conftest.py
- bionemo-recipes/models/codonfm/tests/generate_golden_values.py
- bionemo-recipes/models/codonfm/tests/golden_state_dict.safetensors
- bionemo-recipes/models/codonfm/tests/golden_values.json
- bionemo-recipes/models/codonfm/tests/test_modeling_codonfm_te.py
- bionemo-recipes/models/codonfm/tokenizer.py
- bionemo-recipes/models/esm2/tests/common/test_modeling_common.py
- bionemo-recipes/models/llama3/tests/common/test_modeling_common.py
- bionemo-recipes/models/mixtral/tests/common/test_modeling_common.py
- bionemo-recipes/models/qwen/tests/common/test_modeling_common.py
- bionemo-recipes/recipes/codonfm_native_te/README.md
- bionemo-recipes/recipes/codonfm_native_te/checkpoint.py
- bionemo-recipes/recipes/codonfm_native_te/dataset.py
- bionemo-recipes/recipes/codonfm_native_te/distributed_config.py
- bionemo-recipes/recipes/codonfm_native_te/fp4_debugging_stats.yaml
- bionemo-recipes/recipes/codonfm_native_te/fp8_debugging_stats.yaml
- bionemo-recipes/recipes/codonfm_native_te/hydra_config/L0_sanity.yaml
- bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml
- bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py
- bionemo-recipes/recipes/codonfm_native_te/perf_logger.py
- bionemo-recipes/recipes/codonfm_native_te/quantization.py
- bionemo-recipes/recipes/codonfm_native_te/requirements.txt
- bionemo-recipes/recipes/codonfm_native_te/scheduler.py
- bionemo-recipes/recipes/codonfm_native_te/tests/__init__.py
- bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py
- bionemo-recipes/recipes/codonfm_native_te/tests/test_model.py
- bionemo-recipes/recipes/codonfm_native_te/tests/test_quantization.py
- bionemo-recipes/recipes/codonfm_native_te/tests/test_tokenizer.py
- bionemo-recipes/recipes/codonfm_native_te/tests/test_train.py
- bionemo-recipes/recipes/codonfm_native_te/tokenizer.py
- bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py
- ci/scripts/check_copied_files.py
Actionable comments posted: 4
♻️ Duplicate comments (21)
bionemo-recipes/models/codonfm/requirements.txt (1)
2-2: ⚠️ Potential issue | 🔴 Critical
Use a published Transformer Engine minimum.

`transformer_engine>=1.14.0` is not satisfiable from PyPI: the published release history goes from `1.13.0` to `2.1.0`, so there is no `1.14.0` release for pip to install. Pick a released floor on the API line you actually support (`>=1.13.0` for the last 1.x line or `>=2.1.0` for 2.x), otherwise clean installs fail before the CodonFM model or recipe tests even start. (pypi.org)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/requirements.txt` at line 2, the entry "transformer_engine>=1.14.0" is invalid because PyPI has no 1.14.0 release; update that line to a published minimum you support (for example "transformer_engine>=1.13.0" if you want the last 1.x line or "transformer_engine>=2.1.0" for 2.x) so installs succeed; locate the package line "transformer_engine>=1.14.0" in requirements.txt and replace it with the chosen published floor.

bionemo-recipes/recipes/codonfm_native_te/distributed_config.py (1)
26-40: ⚠️ Potential issue | 🟠 Major
Stop mutating `os.environ` during config construction.

These default factories write fallback values back into the process environment instead of just reading launcher-provided state. `os.environ` is a live environment mapping, and `dict.setdefault()` inserts the default when the key is missing, so constructing `DistributedConfig()` has global side effects here. Use `os.getenv()` for these defaults, and update the wording at Line 28 so the docstring says the class falls back to single-process defaults rather than setting them. (docs.python.org)

🛠️ Suggested change

```diff
 @dataclass(frozen=True)
 class DistributedConfig:
     """Class to track distributed ranks and handle basic distributed training setup.

-    If torch distributed environment variables are not set, we set them to default values for single-process training.
+    If torch distributed environment variables are not set, we fall back to single-process defaults.

     Attributes:
         rank: The rank of the process.
         local_rank: The local rank of the process.
         world_size: The total number of processes.
     """

-    rank: int = field(default_factory=lambda: int(os.environ.setdefault("RANK", "0")))
-    local_rank: int = field(default_factory=lambda: int(os.environ.setdefault("LOCAL_RANK", "0")))
-    world_size: int = field(default_factory=lambda: int(os.environ.setdefault("WORLD_SIZE", "1")))
-    _master_addr: str = field(default_factory=lambda: os.environ.setdefault("MASTER_ADDR", "localhost"))
-    _master_port: str = field(default_factory=lambda: os.environ.setdefault("MASTER_PORT", "12355"))
+    rank: int = field(default_factory=lambda: int(os.getenv("RANK", "0")))
+    local_rank: int = field(default_factory=lambda: int(os.getenv("LOCAL_RANK", "0")))
+    world_size: int = field(default_factory=lambda: int(os.getenv("WORLD_SIZE", "1")))
+    _master_addr: str = field(default_factory=lambda: os.getenv("MASTER_ADDR", "localhost"))
+    _master_port: str = field(default_factory=lambda: os.getenv("MASTER_PORT", "12355"))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/distributed_config.py` around lines 26-40, the dataclass currently mutates os.environ via dict.setdefault in the default_factories for rank, local_rank, world_size, _master_addr, and _master_port; change those default_factory lambdas to use os.getenv(..., default) so they only read environment values and do not write them, and update the class docstring sentence that says the class "sets" torch distributed env vars to instead state it "falls back to" single-process defaults (refer to DistributedConfig and the fields rank, local_rank, world_size, _master_addr, _master_port).

bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py (1)
37-44: ⚠️ Potential issue | 🟡 Minor
Pick a free `MASTER_PORT` to avoid test collisions.

`DistributedConfig` defaults to `MASTER_PORT=12355`. Concurrent pytest sessions on the same host can collide on the same NCCL rendezvous port, causing nondeterministic failures.

🛠️ Suggested fix using ephemeral port

```diff
+import os
+import socket
 import sys
 from pathlib import Path
 ...
 @pytest.fixture(scope="session", autouse=True)
 def device_mesh():
     """Create a re-usable device mesh for testing."""
+    if "MASTER_PORT" not in os.environ:
+        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+            s.bind(("127.0.0.1", 0))
+            os.environ["MASTER_PORT"] = str(s.getsockname()[1])
     dist_config = DistributedConfig()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py` around lines 37-44, the fixture device_mesh can collide across concurrent test runs because DistributedConfig defaults MASTER_PORT to 12355; reserve an ephemeral TCP port and set MASTER_PORT before creating DistributedConfig / calling torch.distributed.init_process_group so each test session uses a unique port. Specifically, in the device_mesh fixture obtain a free port (bind to port 0 on localhost to get the OS-assigned port), set os.environ["MASTER_PORT"] to that port string before instantiating DistributedConfig or before calling torch.distributed.init_process_group, then proceed with creating the device, setting CUDA device, and calling init_device_mesh("cuda", mesh_shape=(1,), mesh_dim_names=("dp",)); ensure the socket is closed after reading the assigned port so the port is available for the process.

bionemo-recipes/models/codonfm/tests/common/__init__.py (1)
32-42: ⚠️ Potential issue | 🟡 Minor
Docstring example references ESM2 instead of CodonFM.

The example code mentions `ESM2ModelTester` and `NVEsmForMaskedLM`, which are ESM2-specific. Since this is a copied file for CodonFM, consider updating the example to reference CodonFM classes for clarity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tests/common/__init__.py` around lines 32-42, update the docstring example to reference CodonFM-specific classes instead of ESM2; replace the ESM2 identifiers (ESM2ModelTester, NVEsmForMaskedLM) with the CodonFM equivalents used in this codebase (e.g., a CodonFMModelTester subclass of BaseModelTest and the model class name used for CodonFM), ensuring the example imports (from tests.common import BaseModelTest, TestTolerances) and method get_model_class remain consistent with the CodonFM test helpers.

bionemo-recipes/recipes/codonfm_native_te/README.md (1)
49-66: ⚠️ Potential issue | 🟡 Minor
Add language tag to fenced code block.

The directory tree block uses bare triple backticks, which triggers MD040 (fenced-code-language) in pre-commit checks.

🔧 Add `text` language tag

````diff
-```
+```text
 codonfm_native_te/
 ├── modeling_codonfm_te.py — HF-compatible CodonFM model with TE layers
 ...
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/README.md` around lines 49-66, the fenced code block containing the directory tree in the README (the block that starts with the triple backticks showing codonfm_native_te/ and its files) is missing a language tag and triggers MD040; update that opening fence from ``` to ```text so the block becomes a text-coded fenced block (i.e., add the "text" language tag to the directory-tree code fence in README.md).

bionemo-recipes/recipes/codonfm_native_te/tests/test_train.py (2)
47-59: ⚠️ Potential issue | 🟡 Minor
Add checkpoint directory override for test isolation.

This test disables meta device but doesn't set `checkpoint.ckpt_dir`, unlike other tests. With `resume_from_checkpoint: true` in defaults, this could resume from a stale checkpoint instead of testing a fresh non-meta-device initialization.

🔧 Suggested fix

```diff
     with initialize_config_dir(config_dir=str(recipe_path / "hydra_config"), version_base="1.2"):
         sanity_config = compose(
             config_name="L0_sanity",
             overrides=[
                 f"+wandb_init_args.dir={tmp_path}",
+                f"checkpoint.ckpt_dir={tmp_path}",
                 "use_meta_device=false",
             ],
         )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_train.py` around lines 47-59, the test test_sanity_convergence_fsdp2_no_meta_device can accidentally resume from an existing checkpoint because it doesn't override checkpoint.ckpt_dir while defaults include resume_from_checkpoint=true; update the compose overrides in the test (the compose call inside test_sanity_convergence_fsdp2_no_meta_device) to set a per-test checkpoint directory (e.g. add an override that points checkpoint.ckpt_dir to tmp_path or a tmp subdir) so the run is isolated and cannot pick up stale checkpoints.
81-94: ⚠️ Potential issue | 🟠 Major
Add GPU capability guard for FP8 test.

FP8 requires compute capability 8.9+ (Ada Lovelace and newer). Without a guard, this test will fail on unsupported hardware instead of being skipped.

🔧 Suggested fix using TransformerEngine's check_fp8_support

```diff
+from transformer_engine.pytorch.fp8 import check_fp8_support
+
+requires_fp8 = pytest.mark.skipif(
+    not torch.cuda.is_available() or not check_fp8_support()[0],
+    reason="FP8 requires compute capability 8.9+",
+)
+
+@requires_fp8
 def test_sanity_convergence_fsdp2_fp8(tmp_path, recipe_path):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_train.py` around lines 81-94, add a GPU capability guard to the test_sanity_convergence_fsdp2_fp8 function so it skips on hardware that doesn't support FP8: call TransformerEngine's check_fp8_support() (or equivalent) at the start of test_sanity_convergence_fsdp2_fp8 and if it returns False use pytest.skip(...) (or apply pytest.mark.skipif with that predicate) to avoid running main_fsdp2 when FP8 is unsupported; ensure the skip message mentions FP8/compute capability so test output is clear.

bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml (1)
41-44:⚠️ Potential issue | 🟠 MajorScheduler horizon should derive from
num_train_steps.The LR scheduler uses a hardcoded
num_training_steps: 500_000, but the training loop stops atargs.num_train_steps. If a config overrides onlynum_train_steps, the warmup/decay curve will be incorrect.🔧 Suggested fix using Hydra interpolation
lr_scheduler_kwargs: num_warmup_steps: 2_000 - num_training_steps: 500_000 + num_training_steps: ${num_train_steps}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml` around lines 41 - 44, The lr_scheduler_kwargs currently hardcodes num_training_steps to 500_000 which can diverge from the actual training loop parameter (args.num_train_steps); update the config so num_training_steps is interpolated from the main training steps value instead of a literal (use Hydra/OmegaConf interpolation to reference the training steps field, e.g. point lr_scheduler_kwargs.num_training_steps to the top-level num_train_steps), and ensure num_warmup_steps remains consistent (can be expressed as an absolute or derived value via interpolation if desired) so the scheduler horizon follows args.num_train_steps at runtime.
bionemo-recipes/recipes/codonfm_native_te/tests/test_model.py (1)
136-155: ⚠️ Potential issue | 🟡 Minor
Recompute loss after training loop for accurate comparison.
Line 154 reads `outputs.loss.item()` from the last forward pass before the final `optimizer.step()`. This checks pre-step loss, not post-step loss. Additionally, without fixed RNG seeds and with dropout enabled (default 0.1), this test can be flaky.
🔧 Suggested fix for deterministic testing
```diff
 def test_loss_decreases(self, small_config):
+    torch.manual_seed(42)
+    torch.cuda.manual_seed_all(42)
+    small_config.hidden_dropout_prob = 0.0
+    small_config.attention_probs_dropout_prob = 0.0
     model = CodonFMForMaskedLM(small_config).cuda()
     optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
     ...
-    final_loss = outputs.loss.item()
+    # Recompute loss after final optimizer step
+    model.eval()
+    with torch.no_grad():
+        final_outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
+        final_loss = final_outputs.loss.item()
     assert final_loss < initial_loss, f"Loss did not decrease: {initial_loss} -> {final_loss}"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_model.py` around lines 136 - 155, The test_loss_decreases test reads outputs.loss from the last forward pass before the final optimizer.step, producing an incorrect pre-step final_loss and making the check flaky; update the test to perform one additional forward pass after the training loop (call model(...) again with input_ids/attention_mask/labels) and read loss from that fresh outputs.loss for final_loss, and make the test deterministic by fixing RNG seeds (torch.manual_seed, torch.cuda.manual_seed_all) and disabling dropout (set model.eval() or set dropout to 0) before training so CodonFMForMaskedLM behavior is stable.
bionemo-recipes/models/codonfm/modeling_codonfm_te.py (2)
109-117: ⚠️ Potential issue | 🟡 Minor
Validate `attn_input_format` in `CodonFMConfig`.
Any value other than `"bshd"` currently falls through the THD branch later, so a typo turns into a missing `cu_seq_lens_*` runtime failure instead of a clear config error.
🛠️ Suggested fix
```diff
 if self.hidden_act not in ("gelu", "relu", "silu"):
     raise ValueError(f"hidden_act must be one of: gelu, relu, silu, got {self.hidden_act}")
+if self.attn_input_format not in {"bshd", "thd"}:
+    raise ValueError(f'attn_input_format must be "bshd" or "thd", got {self.attn_input_format!r}')
 if self.layer_precision is not None:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/modeling_codonfm_te.py` around lines 109 - 117, The config field attn_input_format is not validated, so typos (anything other than "bshd") silently fall into the THD branch and cause runtime failures; add an explicit validation in the same validation block (near the existing checks for hidden_size, hidden_act, layer_precision) that checks self.attn_input_format (or CodonFMConfig.attn_input_format) and raises a ValueError if it is not exactly "bshd" (with a clear message listing the allowed value) so misconfigured values fail fast with a clear error.
447-466: ⚠️ Potential issue | 🟠 Major
Use the scaled MAGNETO gain for `value_weight`.
The docstring says only Q/K should use `gain=1.0`, but `value_weight` currently inherits `w_gain`. In the split-QKV path, that initializes V as if it were Q/K.
🛠️ Suggested fix
```diff
-        is_qk = "query" in name or "key" in name or "qkv" in name
-        w_gain = 1.0 if is_qk else gain
+        is_qk_module = "query" in name or "key" in name or "qkv" in name
+        w_gain = 1.0 if is_qk_module else gain
         if getattr(module, "weight", None) is not None:
             nn.init.xavier_normal_(module.weight, gain=w_gain)
         if getattr(module, "bias", None) is not None:
             module.bias.data.zero_()
         if getattr(module, "query_weight", None) is not None:
             nn.init.xavier_normal_(module.query_weight, gain=w_gain)
         if getattr(module, "query_bias", None) is not None:
             module.query_bias.data.zero_()
         if getattr(module, "key_weight", None) is not None:
             nn.init.xavier_normal_(module.key_weight, gain=w_gain)
         if getattr(module, "key_bias", None) is not None:
             module.key_bias.data.zero_()
         if getattr(module, "value_weight", None) is not None:
-            nn.init.xavier_normal_(module.value_weight, gain=w_gain)
+            nn.init.xavier_normal_(module.value_weight, gain=gain)
         if getattr(module, "value_bias", None) is not None:
             module.value_bias.data.zero_()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/modeling_codonfm_te.py` around lines 447 - 466, The value_weight branch incorrectly uses w_gain (which is set to 1.0 for Q/K) so V gets Q/K scaling; in the named_modules loop inside modeling_codonfm_te.py (look for is_qk, w_gain, gain) change the initialization of module.value_weight to use the unmodified gain variable (and leave value_bias zeroing as-is) so that only query/key weights use gain=1.0 while value_weight uses the scaled MAGNETO gain.
bionemo-recipes/models/codonfm/tests/test_modeling_codonfm_te.py (1)
286-287: ⚠️ Potential issue | 🟠 Major
Assert that the golden state dict loads cleanly.
All three regression paths ignore the `missing_keys`/`unexpected_keys` result from `load_state_dict(..., strict=False)`. That can hide fixture drift and still let the test compare partially initialized models.
🛠️ Suggested fix
```diff
-model.load_state_dict(golden_sd, strict=False)
+missing_keys, unexpected_keys = model.load_state_dict(golden_sd, strict=False)
+assert not missing_keys and not unexpected_keys
```
Apply the same change to the other two `load_state_dict(golden_sd, strict=False)` call sites in this file.
```bash
#!/bin/bash
python - <<'PY'
try:
    import torch
    from torch import nn
    result = nn.Linear(1, 1).load_state_dict({"weight": torch.zeros(1, 1)}, strict=False)
    print(type(result).__name__)
    print("missing_keys:", result.missing_keys)
    print("unexpected_keys:", result.unexpected_keys)
except Exception as exc:
    print(f"PyTorch inspection unavailable: {exc}")
PY
rg -n "load_state_dict\(golden_sd, strict=False\)" bionemo-recipes/models/codonfm/tests/test_modeling_codonfm_te.py
```
Also applies to: 352-353, 365-366
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tests/test_modeling_codonfm_te.py` around lines 286 - 287, The test currently calls load_state_dict(golden_sd, strict=False) on CodonFMForMaskedLM instances and ignores the returned result, which can hide missing/unexpected keys; update each load_state_dict(golden_sd, strict=False) call (the ones used when instantiating CodonFMForMaskedLM and the two other occurrences) to capture the return value (e.g., result = model.load_state_dict(...)) and add assertions that result.missing_keys == [] and result.unexpected_keys == [] (or use assert not result.missing_keys and assert not result.unexpected_keys) so the golden state dict must load cleanly. Ensure the assertions are applied to all three call sites referenced in the test.
bionemo-recipes/recipes/codonfm_native_te/checkpoint.py (1)
182-191: ⚠️ Potential issue | 🔴 Critical
Only store async futures in `_ckpt_futures`.
`dcp_save()` is the synchronous path, but its return value is still written into `_ckpt_futures["fsdp2"]`. `save_final_model_fsdp2()` later calls `.result()` on whatever is stored there, so a run that saved a synchronous checkpoint can fail during final export.
🛠️ Suggested fix
```diff
-    if async_save and "fsdp2" in _ckpt_futures and _ckpt_futures["fsdp2"] is not None:
-        _ckpt_futures["fsdp2"].result()
+    if "fsdp2" in _ckpt_futures and _ckpt_futures["fsdp2"] is not None:
+        _ckpt_futures["fsdp2"].result()
+        _ckpt_futures["fsdp2"] = None
@@
     state_dict = {"app": AppState(model=model, optimizer=optimizer, scheduler=scheduler, step=step, epoch=epoch)}
     ckpt_save_func = dcp_async_save if async_save else dcp_save
-    _ckpt_futures["fsdp2"] = ckpt_save_func(state_dict, checkpoint_id=checkpoint_path, process_group=process_group)
+    save_result = ckpt_save_func(state_dict, checkpoint_id=checkpoint_path, process_group=process_group)
+    _ckpt_futures["fsdp2"] = save_result if async_save else None
```
```bash
#!/bin/bash
python - <<'PY'
import inspect
try:
    from torch.distributed.checkpoint.state_dict_saver import save, async_save
    print("save signature:", inspect.signature(save))
    print("async_save signature:", inspect.signature(async_save))
    print("--- save source tail ---")
    for line in inspect.getsource(save).splitlines()[-8:]:
        print(line)
    print("--- async_save source tail ---")
    for line in inspect.getsource(async_save).splitlines()[-8:]:
        print(line)
except Exception as exc:
    print(f"PyTorch checkpoint API inspection unavailable: {exc}")
PY
rg -n "_ckpt_futures\\[\"fsdp2\"\\]|save_final_model_fsdp2|dcp_async_save|dcp_save" bionemo-recipes/recipes/codonfm_native_te/checkpoint.py
```
Also applies to: 207-210
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/checkpoint.py` around lines 182 - 191, The code is storing the return value of the synchronous saver into the _ckpt_futures map, causing save_final_model_fsdp2 to later call .result() on a non-future; change the assignment so that only the async path writes a future into _ckpt_futures["fsdp2"]: call dcp_save(state_dict, ...) without assigning its return when async_save is False, and only assign the result of dcp_async_save(...) to _ckpt_futures["fsdp2"] when async_save is True; update the block around ckpt_save_func/dcp_save/dcp_async_save and any other occurrences (e.g., the similar 207-210 region) to follow the same pattern so save_final_model_fsdp2 always sees either a Future or None in _ckpt_futures["fsdp2"].
bionemo-recipes/models/codonfm/tokenizer.py (1)
124-145: ⚠️ Potential issue | 🟡 Minor
Reload the serialized special-token state in `from_pretrained()`.
`save_pretrained()` persists `special_tokens`, but `from_pretrained()` ignores them and `decode()` still derives special IDs from the class constant. That makes the saved metadata ineffective and breaks exact round-tripping as soon as token ordering changes.
🛠️ Suggested fix
```diff
-        special_ids = {self.encoder[tok] for tok in self.SPECIAL_TOKENS if tok in self.encoder}
+        special_tokens = getattr(self, "special_tokens", self.SPECIAL_TOKENS)
+        special_ids = {self.encoder[tok] for tok in special_tokens if tok in self.encoder}
@@
         config = {
             "seq_type": self.seq_type,
-            "special_tokens": self.SPECIAL_TOKENS,
+            "special_tokens": getattr(self, "special_tokens", self.SPECIAL_TOKENS),
             "encoder": self.encoder,
         }
@@
         tokenizer = cls(seq_type=config["seq_type"])
+        tokenizer.special_tokens = list(config.get("special_tokens", tokenizer.SPECIAL_TOKENS))
         # Restore the exact encoder/decoder from the saved config to guarantee round-trip fidelity.
         tokenizer.encoder = {tok: int(i) for tok, i in config["encoder"].items()}
         tokenizer.decoder = {i: tok for tok, i in tokenizer.encoder.items()}
+        tokenizer.codons = [
+            tok
+            for tok, _ in sorted(tokenizer.encoder.items(), key=lambda item: item[1])
+            if tok not in tokenizer.special_tokens
+        ]
         return tokenizer
```
Also applies to: 157-163
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tokenizer.py` around lines 124 - 145, from_pretrained currently ignores the serialized "special_tokens" saved by save_pretrained, so reload the saved special-token state in from_pretrained by reading the "special_tokens" entry from the config JSON and using it to set the instance's SPECIAL_TOKENS (and rebuild any dependent structures like special_ids); ensure from_pretrained populates self.SPECIAL_TOKENS (or an instance attribute used by decode) before creating encoder/decoder mappings so decode() derives special IDs from the loaded config rather than the class constant, and similarly handle the alternate location referenced in lines ~157-163.
bionemo-recipes/recipes/codonfm_native_te/tests/test_quantization.py (1)
160-168: ⚠️ Potential issue | 🟡 Minor
Make `test_matches_correct_sublayers` reject prefix overmatches.
These positives still pass if `generate_layer_regex()` starts matching prefixes too broadly, because the assertions use `*_something` names. Assert the exact sublayer names and explicitly reject the suffixed variants.
🛠️ Suggested fix
```diff
 def test_matches_correct_sublayers():
     """Regex should only match layernorm_qkv, proj, fc1, fc2."""
     regex = generate_layer_regex([1])
-    assert re.search(regex, "encoder.layers.1.self_attention.layernorm_qkv_something")
-    assert re.search(regex, "encoder.layers.1.self_attention.proj_something")
-    assert re.search(regex, "encoder.layers.1.layernorm_mlp.fc1_something")
-    assert re.search(regex, "encoder.layers.1.layernorm_mlp.fc2_something")
+    assert re.search(regex, "encoder.layers.1.self_attention.layernorm_qkv")
+    assert re.search(regex, "encoder.layers.1.self_attention.proj")
+    assert re.search(regex, "encoder.layers.1.layernorm_mlp.fc1")
+    assert re.search(regex, "encoder.layers.1.layernorm_mlp.fc2")
+    assert not re.search(regex, "encoder.layers.1.self_attention.layernorm_qkv_something")
+    assert not re.search(regex, "encoder.layers.1.self_attention.proj_something")
+    assert not re.search(regex, "encoder.layers.1.layernorm_mlp.fc1_something")
+    assert not re.search(regex, "encoder.layers.1.layernorm_mlp.fc2_something")
     # Should not match unrelated sublayer names
     assert not re.search(regex, "encoder.layers.1.self_attention.some_other_thing")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_quantization.py` around lines 160 - 168, The test currently accepts suffixes like "_something" which hides over-broad prefix matches from generate_layer_regex; update test_matches_correct_sublayers to assert matches against the exact sublayer names ("encoder.layers.1.self_attention.layernorm_qkv", "encoder.layers.1.self_attention.proj", "encoder.layers.1.layernorm_mlp.fc1", "encoder.layers.1.layernorm_mlp.fc2") and add negative assertions that the regex does NOT match the suffixed variants ("...layernorm_qkv_something", "...proj_something", "...fc1_something", "...fc2_something") as well as the unrelated sublayer ("encoder.layers.1.self_attention.some_other_thing") so the test fails if generate_layer_regex only matches prefixes.
bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py (1)
224-231: ⚠️ Potential issue | 🟠 Major
Final export is still not self-contained.
This call only serializes weights and `config.json` through `save_final_model_fsdp2()` in `checkpoint.py`. `CodonTokenizer.from_pretrained()` expects tokenizer metadata too, so `final_model/` will not round-trip until the tokenizer is saved alongside the model.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py` around lines 224 - 231, The final export only writes weights and config via save_final_model_fsdp2 (checkpoint.py) but misses tokenizer metadata required by CodonTokenizer.from_pretrained; update the checkpoint saving block (where args.checkpoint.save_final_model and ckpt_path are checked) to also save the tokenizer into the same save_directory (e.g., call the tokenizer's save method such as tokenizer.save_pretrained or tokenizer.save to write tokenizer files into ckpt_path / "final_model") so the final_model directory is self-contained for from_pretrained loading.
bionemo-recipes/models/codonfm/tests/common/test_modeling_common.py (1)
398-439: ⚠️ Potential issue | 🟡 Minor
Use `None` checks for optional `ModelOutput` fields.
`MaskedLMOutput` attributes exist even when their value is `None`, so `hasattr(...)` still lets `None` into the comparison paths and the hidden-state branch can hit `len(None)`/`assert_close(None, ...)`. Since this is a copied file, please fix the source and re-sync.
Suggested fix
```diff
-    if compare_loss and hasattr(te_outputs, "loss") and hasattr(hf_outputs, "loss"):
+    if (
+        compare_loss
+        and getattr(te_outputs, "loss", None) is not None
+        and getattr(hf_outputs, "loss", None) is not None
+    ):
         torch.testing.assert_close(
             te_outputs.loss,
             hf_outputs.loss,
@@
-    if compare_logits and hasattr(te_outputs, "logits") and hasattr(hf_outputs, "logits"):
+    if (
+        compare_logits
+        and getattr(te_outputs, "logits", None) is not None
+        and getattr(hf_outputs, "logits", None) is not None
+    ):
         # Only compare logits where attention mask is True
         if "attention_mask" in input_data:
@@
-    if compare_hidden_states and hasattr(te_outputs, "hidden_states") and hasattr(hf_outputs, "hidden_states"):
+    if (
+        compare_hidden_states
+        and getattr(te_outputs, "hidden_states", None) is not None
+        and getattr(hf_outputs, "hidden_states", None) is not None
+    ):
         assert len(te_outputs.hidden_states) == len(hf_outputs.hidden_states), (
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tests/common/test_modeling_common.py` around lines 398 - 439, The comparisons currently use hasattr(...) which returns True even when ModelOutput fields are present but set to None; update the conditions to also guard against None before comparing. Specifically, in the loss branch (compare_loss) require te_outputs.loss is not None and hf_outputs.loss is not None; in the logits branch (compare_logits) require te_outputs.logits is not None and hf_outputs.logits is not None (and still apply the attention_mask boolean indexing only after confirming logits are not None); and in the hidden-states branch (compare_hidden_states) require te_outputs.hidden_states is not None and hf_outputs.hidden_states is not None before calling len(...) or iterating through hidden states. Use the existing symbols (compare_loss, compare_logits, compare_hidden_states, te_outputs, hf_outputs, input_data, tolerances) to locate and update the conditional checks.
bionemo-recipes/recipes/codonfm_native_te/quantization.py (1)
191-205: ⚠️ Potential issue | 🟠 Major
Reject explicit empty layer lists.
`[]` bypasses the current validation and is treated as an explicit assignment, so `fp8_enabled=True, fp8_layers=[]` returns an all-`None` precision plan and silently disables quantization. Fail fast here before the defaulting logic.
Suggested fix
```diff
 # Validate layer entries are integers within valid range.
 for name, layers in [("fp8_layers", fp8_layers), ("fp4_layers", fp4_layers)]:
-    if layers is not None and len(layers) > 0:
+    if layers == []:
+        raise ValueError(f"{name} cannot be an explicit empty list.")
+    if layers is not None:
         invalid = [x for x in layers if not isinstance(x, int) or x < 1 or x > num_layers]
         if invalid:
             raise ValueError(
```
Also applies to: 207-233
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/codonfm_native_te/quantization.py` around lines 191 - 205, The validator currently allows explicit empty lists for fp8_layers/fp4_layers which silently disables quantization; update the validation in the loop that inspects ("fp8_layers", fp8_layers) and ("fp4_layers", fp4_layers) to treat an empty list as invalid: if layers is not None and len(layers) == 0 raise a ValueError stating the layer list must not be empty, and otherwise keep the existing integer/range checks for non-empty lists; ensure this change also addresses the logic around fp8_enabled and fp4_enabled (the fp8_enabled and fp4_enabled check that follows should no longer need to handle empty lists because they will be rejected) and apply the same empty-list rejection in the similar validation block referenced later (lines ~207-233) for fp8_layers/fp4_layers.
bionemo-recipes/models/codonfm/dataset.py (2)
22-23: ⚠️ Potential issue | 🟠 Major
These imports break the model package outside recipe-specific path hacks.
`bionemo-recipes/models/codonfm/tokenizer.py` lives in this package, so it should be imported relatively, and the only `DistributedConfig` shown in this PR lives under `recipes/codonfm_native_te/`. Importing this module as part of the model package will fail unless callers mutate `sys.path`.
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/dataset.py` around lines 22 - 23, The top-level imports in bionemo-recipes/models/codonfm/dataset.py break package usage; change the tokenizer import to a relative import (use from .tokenizer import CodonTokenizer) and stop importing DistributedConfig from a recipe-specific location inside the model package—either remove the DistributedConfig import from this module and pass a config object into the functions/classes that need it, or import it from the recipe-level module that actually defines it (e.g., move any DistributedConfig usage out of dataset.py into the recipes/codonfm_native_te code and pass the config in), ensuring references to CodonTokenizer and any config are located via .tokenizer and not top-level names.
85-103: ⚠️ Potential issue | 🟠 Major
Masking RNG still replays across workers.
Both collators keep a single `random.Random(seed)` instance, and the dataloader factories never pass their `seed` argument into them. With `num_workers > 0`, workers start from identical RNG state and worker respawns repeat the same masking stream. Seed per worker from `torch.utils.data.get_worker_info().seed`/`torch.initial_seed()` and thread the caller seed through both factories.
Also applies to: 159-177, 266-270, 321-325
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/dataset.py` around lines 85 - 103, The masking RNG should not be a single shared random.Random(seed) stored on initialization (self.rng in __init__); instead derive a per-worker RNG at mask-time using torch.utils.data.get_worker_info().seed (or torch.initial_seed() when worker_info is None) combined with the caller/base seed, e.g., create random.Random(worker_seed ^ base_seed) inside the collator or Dataset.masking method; also update the dataloader factory functions to accept and forward the base seed into the collator/Dataset constructors so the per-worker RNG can be deterministically derived from that base seed. Ensure the change is applied to the Dataset.__init__ (remove/stop using self.rng), the collator classes that currently hold a random.Random(seed), and both dataloader factory functions so every worker obtains a unique RNG.
bionemo-recipes/recipes/codonfm_native_te/dataset.py (1)
85-103: ⚠️ Potential issue | 🟠 Major
Masking RNG still replays across workers.
Both collators keep a single `random.Random(seed)` instance, and the dataloader factories never pass their `seed` argument into them. With `num_workers > 0`, workers start from identical RNG state and worker respawns repeat the same masking stream. Seed per worker from `torch.utils.data.get_worker_info().seed`/`torch.initial_seed()` and thread the caller seed through both factories.
Also applies to: 159-177, 266-270, 321-325
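The per-worker derivation both findings ask for (`random.Random(worker_seed ^ base_seed)`) can be sketched with the standard library alone. In this sketch, `worker_seed` is a stand-in for the value `torch.utils.data.get_worker_info().seed` would supply inside a real dataloader worker, so treat the harness as illustrative rather than dataloader code:

```python
import random


def make_worker_rng(base_seed: int, worker_seed: int) -> random.Random:
    """Derive an independent masking RNG for one dataloader worker.

    XOR-ing the per-worker seed with the caller's base seed keeps runs
    reproducible while giving each worker a distinct masking stream.
    `worker_seed` stands in for torch.utils.data.get_worker_info().seed.
    """
    return random.Random(worker_seed ^ base_seed)


base_seed = 42
rng_a = make_worker_rng(base_seed, worker_seed=1000)
rng_b = make_worker_rng(base_seed, worker_seed=1001)

stream_a = [rng_a.random() for _ in range(4)]
stream_b = [rng_b.random() for _ in range(4)]

# Different workers draw different masking streams...
assert stream_a != stream_b
# ...but the same (base_seed, worker_seed) pair replays deterministically.
rng_a_again = make_worker_rng(base_seed, worker_seed=1000)
assert stream_a == [rng_a_again.random() for _ in range(4)]
```

Threading `base_seed` through the dataloader factories then makes the whole masking stream a deterministic function of the run's seed plus the worker identity.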
🧹 Nitpick comments (2)
bionemo-recipes/models/codonfm/tests/conftest.py (1)
38-45: Minor inconsistency between local and shared `unused_tcp_port` fixtures.
This fixture differs from the one in `tests/common/fixtures.py` (lines 33-42): it adds `s.listen(1)` and omits `SO_REUSEADDR`. The `listen()` call is unnecessary for port discovery and slightly changes behavior. Consider aligning with the shared implementation for consistency.
Regarding the CodeQL warning about binding to `''`: this is standard practice for discovering an available port and is safe in this test context since the socket is immediately closed.
♻️ Align with shared fixture
```diff
 @pytest.fixture
 def unused_tcp_port():
     """Find and return an unused TCP port for torchrun rendezvous."""
     with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
         s.bind(("", 0))
-        s.listen(1)
-        port = s.getsockname()[1]
-        return port
+        return s.getsockname()[1]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tests/conftest.py` around lines 38 - 45, The unused_tcp_port fixture should match the shared implementation: remove the unnecessary s.listen(1) call and set SO_REUSEADDR on the socket before binding so behavior is consistent with tests/common/fixtures.py; update the fixture named unused_tcp_port to create the socket, call s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1), bind to ("", 0), get the port from s.getsockname()[1], close the socket, and return the port.
bionemo-recipes/models/codonfm/tests/generate_golden_values.py (1)
111-135: Consider adding documentation about the masking strategy.
The function applies 15% random masking (line 124) but re-seeds the RNG per sequence (line 122), which affects reproducibility across sequences. This is fine for golden value generation but worth documenting.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/models/codonfm/tests/generate_golden_values.py` around lines 111 - 135, The prepare_test_inputs function applies 15% random masking but calls torch.manual_seed(SEED) inside the per-sequence loop, which re-seeds the RNG each iteration and causes the same mask pattern for every sequence; either move torch.manual_seed(SEED) to just before the for enc in encoded: loop (or before encoding) to get a reproducible but varied mask across sequences, or if you intentionally want identical per-sequence masks, update the function docstring to state the masking rate (0.15), that masking uses torch.rand with torch.manual_seed(SEED) called per sequence, and mention tokenizer.mask_token_id and labels use -100 for non-masked positions so reviewers understand the behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bionemo-recipes/recipes/codonfm_native_te/perf_logger.py`:
- Around line 107-128: The metrics loss, grad_norm, learning_rate and perplexity
are only updated inside the logging gate; move their updates to run every step
so the MeanMetric/Perplexity instances accumulate over the window: for each step
(before the if step % self.logging_frequency == 0 block) call
self.metrics["train/loss"].update(outputs.loss),
self.metrics["train/learning_rate"].update(lr),
self.metrics["train/grad_norm"].update(grad_norm if isinstance(grad_norm,
torch.Tensor) else torch.tensor(grad_norm)), and update perplexity using logits
(apply the same logits.dim() < 3 -> logits = logits.unsqueeze(0) handling) with
self.metrics["train/perplexity"].update(logits, batch["labels"]); keep the
timing/tokens-per-second and the reset/report logic inside the existing
logging_frequency gate so only reporting/reset happens there.
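The update-every-step / report-at-frequency split this prompt describes can be illustrated with a stdlib stand-in for `torchmetrics.MeanMetric`; the class name `WindowedMean` and the example loss values are invented for the sketch and do not mirror perf_logger.py's exact implementation:

```python
class WindowedMean:
    """Minimal stand-in for torchmetrics.MeanMetric: accumulate on every
    step, report and reset only at the logging boundary."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def update(self, value: float) -> None:
        self.total += value
        self.count += 1

    def compute_and_reset(self) -> float:
        mean = self.total / self.count
        self.total, self.count = 0.0, 0
        return mean


logging_frequency = 5
loss_metric = WindowedMean()
reports = []
losses = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
for step, loss in enumerate(losses):
    loss_metric.update(loss)  # runs on every step, outside the logging gate
    if (step + 1) % logging_frequency == 0:
        reports.append(loss_metric.compute_and_reset())  # report + reset here only

# Each report is the mean over its 5-step window, not a single-step sample.
assert [round(r, 6) for r in reports] == [0.8, 0.3]
```

If the update stayed inside the logging gate, each report would instead sample only the boundary step's loss, which is the bug the finding calls out.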
In `@bionemo-recipes/recipes/codonfm_native_te/scheduler.py`:
- Around line 30-41: Check and enforce the warmup/decay boundary before
constructing lr_lambda: validate that num_warmup_steps < decay_steps (where
decay_steps = int(num_training_steps * 0.9)) and either raise a clear ValueError
or clamp decay_steps (e.g., set decay_steps = max(decay_steps, num_warmup_steps
+ 1)) so the linear-decay region exists; update the logic around decay_steps and
num_warmup_steps used in lr_lambda to rely on the validated/clamped value and
fail fast on invalid configs instead of silently producing a clipped schedule.
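The fail-fast boundary check can be sketched as below. The warmup-then-linear-decay shape and the 0.9 decay-horizon fraction are assumptions drawn from this comment, not the recipe's exact `scheduler.py` formula:

```python
def make_lr_lambda(num_warmup_steps: int, num_training_steps: int):
    """Warmup-then-linear-decay LR multiplier with an explicit boundary check.

    Schedule shape is illustrative; only the validation pattern is the point.
    """
    decay_steps = int(num_training_steps * 0.9)
    # Fail fast instead of silently producing a clipped schedule.
    if num_warmup_steps >= decay_steps:
        raise ValueError(
            f"num_warmup_steps ({num_warmup_steps}) must be < decay horizon ({decay_steps})"
        )

    def lr_lambda(step: int) -> float:
        if step < num_warmup_steps:
            return step / max(1, num_warmup_steps)  # linear warmup to 1.0
        if step < decay_steps:
            # linear decay from 1.0 at the end of warmup to 0.0 at decay_steps
            return (decay_steps - step) / (decay_steps - num_warmup_steps)
        return 0.0

    return lr_lambda


fn = make_lr_lambda(num_warmup_steps=10, num_training_steps=100)
assert fn(0) == 0.0 and fn(10) == 1.0 and fn(90) == 0.0

try:
    make_lr_lambda(num_warmup_steps=95, num_training_steps=100)
except ValueError:
    pass  # invalid config rejected up front, as the comment requires
else:
    raise AssertionError("expected ValueError for warmup >= decay horizon")
```

Clamping (`decay_steps = max(decay_steps, num_warmup_steps + 1)`) is the quieter alternative the comment mentions; raising keeps misconfigured runs from training on a degenerate schedule.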
In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py`:
- Around line 190-210: The logging and checkpointing use a 0-based step because
step is incremented after perf_logger.log_step and should_save_checkpoint,
causing off-by-one behavior; update the flow to use a 1-based global_step for
both perf_logger.log_step and should_save_checkpoint/save_checkpoint_fsdp2 by
deriving global_step = step + 1 (or increment step before these calls) and pass
global_step into perf_logger.log_step and
should_save_checkpoint/save_checkpoint_fsdp2 (and to any saved metadata like
step/epoch) so logs and periodic checkpoints occur at the intended 1-based
boundaries; keep the internal step variable consistent by either using the new
global_step only for logging/checkpointing or by making step itself 1-based.
- Around line 152-166: After restoring state with load_checkpoint_fsdp2(...) you
must call sampler.set_epoch(epoch) to align DistributedSampler with the restored
epoch; add sampler.set_epoch(epoch) immediately after the checkpoint branch that
sets start_step and epoch (i.e., right after load_checkpoint_fsdp2 returns) and
before entering the training loop, and ensure any existing mid-epoch resume
logic uses start_step to skip already-processed batches so indices are not
duplicated.
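The 1-based boundary fix from the first train_fsdp2.py prompt can be shown in isolation; the names `logging_frequency` and `ckpt_frequency` are illustrative, not the recipe's exact config keys:

```python
def boundaries(num_steps: int, logging_frequency: int, ckpt_frequency: int):
    """Derive a 1-based global_step from the 0-based loop index before gating
    logging and checkpointing, so both fire after N completed steps."""
    logged, saved = [], []
    for step in range(num_steps):      # step is the usual 0-based loop index
        global_step = step + 1         # 1-based count of completed steps
        if global_step % logging_frequency == 0:
            logged.append(global_step)
        if global_step % ckpt_frequency == 0:
            saved.append(global_step)
    return logged, saved


logged, saved = boundaries(num_steps=10, logging_frequency=3, ckpt_frequency=5)
assert logged == [3, 6, 9]  # fires after 3 completed steps, never at index 0
assert saved == [5, 10]
```

Gating on the raw 0-based `step` would instead fire at step index 0 and one iteration early at every boundary, which is the off-by-one the comment describes.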
---
Duplicate comments:
In `@bionemo-recipes/models/codonfm/dataset.py`:
- Around line 22-23: The top-level imports in
bionemo-recipes/models/codonfm/dataset.py break package usage; change the
tokenizer import to a relative import (use from .tokenizer import
CodonTokenizer) and stop importing DistributedConfig from a recipe-specific
location inside the model package—either remove the DistributedConfig import
from this module and pass a config object into the functions/classes that need
it, or import it from the recipe-level module that actually defines it (e.g.,
move any DistributedConfig usage out of dataset.py into the
recipes/codonfm_native_te code and pass the config in), ensuring references to
CodonTokenizer and any config are located via .tokenizer and not top-level
names.
- Around line 85-103: The masking RNG should not be a single shared
random.Random(seed) stored on initialization (self.rng in __init__); instead
derive a per-worker RNG at mask-time using
torch.utils.data.get_worker_info().seed (or torch.initial_seed() when
worker_info is None) combined with the caller/base seed, e.g., create
random.Random(worker_seed ^ base_seed) inside the collator or Dataset.masking
method; also update the dataloader factory functions to accept and forward the
base seed into the collator/Dataset constructors so the per-worker RNG can be
deterministically derived from that base seed. Ensure the change is applied to
the Dataset.__init__ (remove/stop using self.rng), the collator classes that
currently hold a random.Random(seed), and both dataloader factory functions so
every worker obtains a unique RNG.
In `@bionemo-recipes/models/codonfm/modeling_codonfm_te.py`:
- Around line 109-117: The config field attn_input_format is not validated, so
typos (anything other than "bshd") silently fall into the THD branch and cause
runtime failures; add an explicit validation in the same validation block (near
the existing checks for hidden_size, hidden_act, layer_precision) that checks
self.attn_input_format (or CodonFMConfig.attn_input_format) and raises a
ValueError if it is not exactly "bshd" (with a clear message listing the allowed
value) so misconfigured values fail fast with a clear error.
- Around line 447-466: The value_weight branch incorrectly uses w_gain (which is
set to 1.0 for Q/K) so V gets Q/K scaling; in the named_modules loop inside
modeling_codonfm_te.py (look for is_qk, w_gain, gain) change the initialization
of module.value_weight to use the unmodified gain variable (and leave value_bias
zeroing as-is) so that only query/key weights use gain=1.0 while value_weight
uses the scaled MAGNETO gain.
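The fail-fast validation suggested for `attn_input_format` above can be sketched with a stand-in dataclass (`CodonFMConfigSketch` is illustrative, not the real `CodonFMConfig`):

```python
from dataclasses import dataclass

_ALLOWED_ATTN_INPUT_FORMATS = ("bshd",)  # extend if THD support lands


@dataclass
class CodonFMConfigSketch:
    """Stand-in for CodonFMConfig showing only the fail-fast check."""

    attn_input_format: str = "bshd"

    def __post_init__(self):
        if self.attn_input_format not in _ALLOWED_ATTN_INPUT_FORMATS:
            raise ValueError(
                f"attn_input_format must be one of {_ALLOWED_ATTN_INPUT_FORMATS}, "
                f"got {self.attn_input_format!r}"
            )


CodonFMConfigSketch(attn_input_format="bshd")  # valid value passes
try:
    CodonFMConfigSketch(attn_input_format="bsdh")  # a typo now fails fast
except ValueError as err:
    assert "attn_input_format" in str(err)
```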
In `@bionemo-recipes/models/codonfm/requirements.txt`:
- Line 2: The entry "transformer_engine>=1.14.0" in requirements.txt is invalid
because PyPI has no 1.14.0 release; replace it with a published minimum you
support (for example "transformer_engine>=1.13.0" for the last 1.x line, or
"transformer_engine>=2.1.0" for 2.x) so installs succeed.
In `@bionemo-recipes/models/codonfm/tests/common/__init__.py`:
- Around line 32-42: Update the docstring example to reference CodonFM-specific
classes instead of ESM2; replace the ESM2 identifiers (ESM2ModelTester,
NVEsmForMaskedLM) with the CodonFM equivalents used in this codebase (e.g., a
CodonFMModelTester subclass of BaseModelTest and the model class name used for
CodonFM), ensuring the example imports (from tests.common import BaseModelTest,
TestTolerances) and method get_model_class remain consistent with the CodonFM
test helpers.
In `@bionemo-recipes/models/codonfm/tests/common/test_modeling_common.py`:
- Around line 398-439: The comparisons currently use hasattr(...) which returns
True even when ModelOutput fields are present but set to None; update the
conditions to also guard against None before comparing. Specifically, in the
loss branch (compare_loss) require te_outputs.loss is not None and
hf_outputs.loss is not None; in the logits branch (compare_logits) require
te_outputs.logits is not None and hf_outputs.logits is not None (and still apply
the attention_mask boolean indexing only after confirming logits are not None);
and in the hidden-states branch (compare_hidden_states) require
te_outputs.hidden_states is not None and hf_outputs.hidden_states is not None
before calling len(...) or iterating through hidden states. Use the existing
symbols (compare_loss, compare_logits, compare_hidden_states, te_outputs,
hf_outputs, input_data, tolerances) to locate and update the conditional checks.
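The `hasattr` pitfall is easy to reproduce with a stand-in output object (`FakeModelOutput` is illustrative, not the real `ModelOutput`):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeModelOutput:
    """Illustrative HF-style output: fields always exist, but may be None."""

    loss: Optional[float] = None
    logits: Optional[List[float]] = None


te_outputs = FakeModelOutput(loss=None, logits=[0.1, 0.2])
hf_outputs = FakeModelOutput(loss=0.5, logits=[0.1, 0.2])

# hasattr is True for an unset field, which is why the old guard was wrong:
assert hasattr(te_outputs, "loss")

# The fixed guard compares only when both sides are populated:
loss_compared = te_outputs.loss is not None and hf_outputs.loss is not None
assert not loss_compared  # loss comparison is correctly skipped here

if te_outputs.logits is not None and hf_outputs.logits is not None:
    assert te_outputs.logits == hf_outputs.logits
```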
In `@bionemo-recipes/models/codonfm/tests/test_modeling_codonfm_te.py`:
- Around line 286-287: The test currently calls load_state_dict(golden_sd,
strict=False) on CodonFMForMaskedLM instances and ignores the returned result,
which can hide missing/unexpected keys; update each load_state_dict(golden_sd,
strict=False) call (the ones used when instantiating CodonFMForMaskedLM and the
two other occurrences) to capture the return value (e.g., result =
model.load_state_dict(...)) and add assertions that result.missing_keys == []
and result.unexpected_keys == [] (or use assert not result.missing_keys and
assert not result.unexpected_keys) so the golden state dict must load cleanly.
Ensure the assertions are applied to all three call sites referenced in the
test.
In `@bionemo-recipes/models/codonfm/tokenizer.py`:
- Around line 124-145: from_pretrained currently ignores the serialized
"special_tokens" saved by save_pretrained, so reload the saved special-token
state in from_pretrained by reading the "special_tokens" entry from the config
JSON and using it to set the instance's SPECIAL_TOKENS (and rebuild any
dependent structures like special_ids); ensure from_pretrained populates
self.SPECIAL_TOKENS (or an instance attribute used by decode) before creating
encoder/decoder mappings so decode() derives special IDs from the loaded config
rather than the class constant, and similarly handle the alternate location
referenced in lines ~157-163.
In `@bionemo-recipes/recipes/codonfm_native_te/checkpoint.py`:
- Around line 182-191: The code is storing the return value of the synchronous
saver into the _ckpt_futures map, causing save_final_model_fsdp2 to later call
.result() on a non-future; change the assignment so that only the async path
writes a future into _ckpt_futures["fsdp2"]: call dcp_save(state_dict, ...)
without assigning its return when async_save is False, and only assign the
result of dcp_async_save(...) to _ckpt_futures["fsdp2"] when async_save is True;
update the block around ckpt_save_func/dcp_save/dcp_async_save and any other
occurrences (e.g., the similar 207-210 region) to follow the same pattern so
save_final_model_fsdp2 always sees either a Future or None in
_ckpt_futures["fsdp2"].
In `@bionemo-recipes/recipes/codonfm_native_te/distributed_config.py`:
- Around line 26-40: The dataclass currently mutates os.environ via
dict.setdefault in the default_factories for rank, local_rank, world_size,
_master_addr, and _master_port; change those default_factory lambdas to use
os.getenv(..., default) so they only read environment values and do not write
them, and update the class docstring sentence that says the class "sets" torch
distributed env vars to instead state it "falls back to" single-process defaults
(refer to DistributedConfig and the fields rank, local_rank, world_size,
_master_addr, _master_port).
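A sketch of the read-only defaults, assuming the same env-var names; `DistributedConfigSketch` stands in for the real dataclass:

```python
import os
from dataclasses import dataclass, field


@dataclass
class DistributedConfigSketch:
    """Reads torch-distributed env vars without mutating os.environ;
    falls back to single-process defaults when they are unset."""

    rank: int = field(default_factory=lambda: int(os.getenv("RANK", "0")))
    local_rank: int = field(default_factory=lambda: int(os.getenv("LOCAL_RANK", "0")))
    world_size: int = field(default_factory=lambda: int(os.getenv("WORLD_SIZE", "1")))


had_rank = "RANK" in os.environ
cfg = DistributedConfigSketch()
# Constructing the config must not write anything back to the environment,
# unlike os.environ.setdefault(...) which silently mutates it:
assert ("RANK" in os.environ) == had_rank
assert cfg.world_size >= 1
```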
In `@bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml`:
- Around line 41-44: The lr_scheduler_kwargs currently hardcodes
num_training_steps to 500_000 which can diverge from the actual training loop
parameter (args.num_train_steps); update the config so num_training_steps is
interpolated from the main training steps value instead of a literal (use
Hydra/Omega conf interpolation to reference the training steps field, e.g. point
lr_scheduler_kwargs.num_training_steps to the top-level num_train_steps), and
ensure num_warmup_steps remains consistent (can be expressed as an absolute or
derived value via interpolation if desired) so the scheduler horizon follows
args.num_train_steps at runtime.
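A hypothetical defaults.yaml fragment showing the interpolation (field names assumed from the comment, values illustrative):

```yaml
# Scheduler horizon follows the top-level training-steps value instead of
# a hardcoded 500_000 literal.
num_train_steps: 500000

lr_scheduler_kwargs:
  num_warmup_steps: 2000
  num_training_steps: ${num_train_steps}
```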
In `@bionemo-recipes/recipes/codonfm_native_te/quantization.py`:
- Around line 191-205: The validator currently allows explicit empty lists for
fp8_layers/fp4_layers which silently disables quantization; update the
validation in the loop that inspects ("fp8_layers", fp8_layers) and
("fp4_layers", fp4_layers) to treat an empty list as invalid: if layers is not
None and len(layers) == 0 raise a ValueError stating the layer list must not be
empty, and otherwise keep the existing integer/range checks for non-empty lists;
ensure this change also addresses the logic around fp8_enabled and fp4_enabled
(the fp8_enabled and fp4_enabled check that follows should no longer need to
handle empty lists because they will be rejected) and apply the same empty-list
rejection in the similar validation block referenced later (lines ~207-233) for
fp8_layers/fp4_layers.
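The tightened validation can be sketched as a free function (`validate_quant_layers` is illustrative; the real check lives in the validator loop):

```python
def validate_quant_layers(fp8_layers, fp4_layers):
    """Reject explicit empty layer lists, which would silently disable
    quantization; None means 'feature not requested' and stays valid."""
    for name, layers in (("fp8_layers", fp8_layers), ("fp4_layers", fp4_layers)):
        if layers is None:
            continue
        if len(layers) == 0:
            raise ValueError(f"{name} must not be an empty list; use None to disable")
        for idx in layers:
            if not isinstance(idx, int) or idx < 0:
                raise ValueError(f"{name} entries must be non-negative ints, got {idx!r}")


validate_quant_layers(fp8_layers=[0, 1, 2], fp4_layers=None)  # valid config
try:
    validate_quant_layers(fp8_layers=[], fp4_layers=None)  # silently-disabling config
except ValueError as err:
    assert "must not be an empty list" in str(err)
```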
In `@bionemo-recipes/recipes/codonfm_native_te/README.md`:
- Around line 49-66: The fenced code block containing the directory tree in the
README (the block that starts with the triple backticks showing
codonfm_native_te/ and its files) is missing a language tag and triggers MD040;
update that opening fence from ``` to ```text so the block becomes a text-coded
fenced block (i.e., add the "text" language tag to the directory-tree code fence
in README.md).
In `@bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py`:
- Around line 37-44: The fixture device_mesh can collide across concurrent test
runs because DistributedConfig defaults MASTER_PORT to 12355; reserve an
ephemeral TCP port and set MASTER_PORT before creating DistributedConfig /
calling torch.distributed.init_process_group so each test session uses a unique
port. Specifically, in the device_mesh fixture obtain a free port (bind to port
0 on localhost to get the OS-assigned port), set os.environ["MASTER_PORT"] to
that port string before instantiating DistributedConfig or before calling
torch.distributed.init_process_group, then proceed with creating the device,
setting CUDA device, and calling init_device_mesh("cuda", mesh_shape=(1,),
mesh_dim_names=("dp",)); ensure the socket is closed after reading the assigned
port so the port is available for the process.
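A stdlib sketch of reserving the port (`reserve_master_port` is a hypothetical helper); the socket is closed before the port string is exported so the OS-assigned port is free for the process:

```python
import os
import socket


def reserve_master_port() -> str:
    """Bind to port 0 so the OS assigns a free ephemeral port, then release
    it for torch.distributed to claim; avoids collisions across runs."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("localhost", 0))
        port = s.getsockname()[1]
    return str(port)


# In the device_mesh fixture this runs before DistributedConfig is created
# or torch.distributed.init_process_group is called:
os.environ["MASTER_PORT"] = reserve_master_port()
assert 1024 <= int(os.environ["MASTER_PORT"]) <= 65535
```

There is a small window between closing the socket and `init_process_group` binding it, but that race is far rarer than every test session defaulting to the same 12355.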
In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_model.py`:
- Around line 136-155: The test_loss_decreases test reads outputs.loss from the
last forward pass before the final optimizer.step, producing an incorrect
pre-step final_loss and making the check flaky; update the test to perform one
additional forward pass after the training loop (call model(...) again with
input_ids/attention_mask/labels) and read loss from that fresh outputs.loss for
final_loss, and make the test deterministic by fixing RNG seeds
(torch.manual_seed, torch.cuda.manual_seed_all) and disabling dropout (set
model.eval() or set dropout to 0) before training so CodonFMForMaskedLM behavior
is stable.
In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_quantization.py`:
- Around line 160-168: The test currently accepts suffixes like "_something"
which hides over-broad prefix matches from generate_layer_regex; update
test_matches_correct_sublayers to assert matches against the exact sublayer
names ("encoder.layers.1.self_attention.layernorm_qkv",
"encoder.layers.1.self_attention.proj", "encoder.layers.1.layernorm_mlp.fc1",
"encoder.layers.1.layernorm_mlp.fc2") and add negative assertions that the regex
does NOT match the suffixed variants ("...layernorm_qkv_something",
"...proj_something", "...fc1_something", "...fc2_something") as well as the
unrelated sublayer ("encoder.layers.1.self_attention.some_other_thing") so the
test fails if generate_layer_regex only matches prefixes.
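The stricter assertions rely on the regex being anchored; `layer_regex` below is a hypothetical example of what `generate_layer_regex` might emit for layer 1, not its actual output:

```python
import re

# The trailing $ (or use of re.fullmatch) is what the stricter test enforces.
layer_regex = re.compile(
    r"encoder\.layers\.1\.(self_attention\.(layernorm_qkv|proj)|layernorm_mlp\.(fc1|fc2))$"
)

exact_names = [
    "encoder.layers.1.self_attention.layernorm_qkv",
    "encoder.layers.1.self_attention.proj",
    "encoder.layers.1.layernorm_mlp.fc1",
    "encoder.layers.1.layernorm_mlp.fc2",
]
for name in exact_names:
    assert layer_regex.match(name), name

# Negative cases: suffixed variants and unrelated sublayers must NOT match,
# which is exactly what a prefix-only regex gets wrong.
for name in [n + "_something" for n in exact_names] + [
    "encoder.layers.1.self_attention.some_other_thing"
]:
    assert not layer_regex.match(name), name
```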
In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_train.py`:
- Around line 47-59: The test test_sanity_convergence_fsdp2_no_meta_device can
accidentally resume from an existing checkpoint because it doesn't override
checkpoint.ckpt_dir while defaults include resume_from_checkpoint=true; update
the compose overrides in the test (the compose call inside
test_sanity_convergence_fsdp2_no_meta_device) to set a per-test checkpoint
directory (e.g. add an override that points checkpoint.ckpt_dir to tmp_path or a
tmp subdir) so the run is isolated and cannot pick up stale checkpoints.
- Around line 81-94: Add a GPU capability guard to the
test_sanity_convergence_fsdp2_fp8 function so it skips on hardware that doesn't
support FP8: call TransformerEngine.check_fp8_support() (or equivalent) at the
start of test_sanity_convergence_fsdp2_fp8 and if it returns False use
pytest.skip(...) (or apply pytest.mark.skipif with that predicate) to avoid
running main_fsdp2 when FP8 is unsupported; ensure the skip message mentions
FP8/compute capability so test output is clear.
In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py`:
- Around line 224-231: The final export only writes weights and config via
save_final_model_fsdp2 (checkpoint.py) but misses tokenizer metadata required by
CodonTokenizer.from_pretrained; update the checkpoint saving block (where
args.checkpoint.save_final_model and ckpt_path are checked) to also save the
tokenizer into the same save_directory (e.g., call the tokenizer's save method
such as tokenizer.save_pretrained or tokenizer.save to write tokenizer files
into ckpt_path / "final_model") so the final_model directory is self-contained
for from_pretrained loading.
---
Nitpick comments:
In `@bionemo-recipes/models/codonfm/tests/conftest.py`:
- Around line 38-45: The unused_tcp_port fixture should match the shared
implementation: remove the unnecessary s.listen(1) call and set SO_REUSEADDR on
the socket before binding so behavior is consistent with
tests/common/fixtures.py; update the fixture named unused_tcp_port to create the
socket, call s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1), bind to
("", 0), get the port from s.getsockname()[1], close the socket, and return the
port.
In `@bionemo-recipes/models/codonfm/tests/generate_golden_values.py`:
- Around line 111-135: The prepare_test_inputs function applies 15% random
masking but calls torch.manual_seed(SEED) inside the per-sequence loop, which
re-seeds the RNG each iteration and causes the same mask pattern for every
sequence; either move torch.manual_seed(SEED) to just before the for enc in
encoded: loop (or before encoding) to get a reproducible but varied mask across
sequences, or if you intentionally want identical per-sequence masks, update the
function docstring to state the masking rate (0.15), that masking uses
torch.rand with torch.manual_seed(SEED) called per sequence, and mention
tokenizer.mask_token_id and labels use -100 for non-masked positions so
reviewers understand the behavior.
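The re-seeding pitfall is easy to demonstrate with the stdlib random module (function names are illustrative; the real code uses torch.manual_seed and torch.rand):

```python
import random

SEED = 1234


def mask_positions_reseed_inside(num_sequences: int, seq_len: int = 64):
    """Re-seeding inside the loop (the current behavior) gives every
    sequence the identical mask pattern."""
    masks = []
    for _ in range(num_sequences):
        random.seed(SEED)  # re-seeds the RNG each iteration
        masks.append([i for i in range(seq_len) if random.random() < 0.15])
    return masks


def mask_positions_seed_once(num_sequences: int, seq_len: int = 64):
    """Seeding once before the loop is reproducible but varied per sequence."""
    random.seed(SEED)
    return [
        [i for i in range(seq_len) if random.random() < 0.15]
        for _ in range(num_sequences)
    ]


inside = mask_positions_reseed_inside(4)
assert all(m == inside[0] for m in inside)  # identical masks everywhere

once = mask_positions_seed_once(4)
assert len(set(map(tuple, once))) > 1       # masks differ across sequences
assert once == mask_positions_seed_once(4)  # still reproducible
```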
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a8b9a7b4-2bf1-4951-80a8-52fcd8990444
⛔ Files ignored due to path filters (1)
- bionemo-recipes/recipes/codonfm_native_te/train.parquet is excluded by !**/*.parquet
📒 Files selected for processing (40)
- .gitignore
- bionemo-recipes/models/codonfm/dataset.py
- bionemo-recipes/models/codonfm/modeling_codonfm_te.py
- bionemo-recipes/models/codonfm/requirements.txt
- bionemo-recipes/models/codonfm/tests/common/README.md
- bionemo-recipes/models/codonfm/tests/common/__init__.py
- bionemo-recipes/models/codonfm/tests/common/fixtures.py
- bionemo-recipes/models/codonfm/tests/common/test_modeling_common.py
- bionemo-recipes/models/codonfm/tests/conftest.py
- bionemo-recipes/models/codonfm/tests/generate_golden_values.py
- bionemo-recipes/models/codonfm/tests/golden_state_dict.safetensors
- bionemo-recipes/models/codonfm/tests/golden_values.json
- bionemo-recipes/models/codonfm/tests/test_modeling_codonfm_te.py
- bionemo-recipes/models/codonfm/tokenizer.py
- bionemo-recipes/models/esm2/tests/common/test_modeling_common.py
- bionemo-recipes/models/llama3/tests/common/test_modeling_common.py
- bionemo-recipes/models/mixtral/tests/common/test_modeling_common.py
- bionemo-recipes/models/qwen/tests/common/test_modeling_common.py
- bionemo-recipes/recipes/codonfm_native_te/README.md
- bionemo-recipes/recipes/codonfm_native_te/checkpoint.py
- bionemo-recipes/recipes/codonfm_native_te/dataset.py
- bionemo-recipes/recipes/codonfm_native_te/distributed_config.py
- bionemo-recipes/recipes/codonfm_native_te/fp4_debugging_stats.yaml
- bionemo-recipes/recipes/codonfm_native_te/fp8_debugging_stats.yaml
- bionemo-recipes/recipes/codonfm_native_te/hydra_config/L0_sanity.yaml
- bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml
- bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py
- bionemo-recipes/recipes/codonfm_native_te/perf_logger.py
- bionemo-recipes/recipes/codonfm_native_te/quantization.py
- bionemo-recipes/recipes/codonfm_native_te/requirements.txt
- bionemo-recipes/recipes/codonfm_native_te/scheduler.py
- bionemo-recipes/recipes/codonfm_native_te/tests/__init__.py
- bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py
- bionemo-recipes/recipes/codonfm_native_te/tests/test_model.py
- bionemo-recipes/recipes/codonfm_native_te/tests/test_quantization.py
- bionemo-recipes/recipes/codonfm_native_te/tests/test_tokenizer.py
- bionemo-recipes/recipes/codonfm_native_te/tests/test_train.py
- bionemo-recipes/recipes/codonfm_native_te/tokenizer.py
- bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py
- ci/scripts/check_copied_files.py
@coderabbitai resolve
✅ Actions performed: comments resolved.

/ok to test c05e3a2

/ok to test 8dbcb98

Force-pushed 02a1501 to 8c00221 (Compare)

/ok to test ef905b5

@coderabbitai resolve

/ok to test bc82beb
✅ Actions performed: comments resolved.

Force-pushed bc82beb to bdb5cf2 (Compare)
Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>
Force-pushed bdb5cf2 to 93cdb62 (Compare)
The models/codonfm nightly CI job started failing after transformers 5.4.0 was released on March 27, 2026. The PR CI (NVIDIA#1531) passed on March 25 when the latest version was 5.3.0. The 5.4.0 release introduced significant breaking changes including dataclass-based PretrainedConfig and weight tying behavior modifications. Pin to <5.4.0 to restore the working configuration. A follow-up should investigate compatibility with transformers>=5.4.0. Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Summary

models/codonfm/ (HF-compatible model)
- The model code in models/codonfm/modeling_codonfm_te.py is the source of truth, auto-synced to the recipe via check_copied_files.py.
- 73 tests (36 pass, 10 skip, 25 xfail, 2 xpass); cross-model logit equivalence verified.

recipes/codonfm_native_te/ (training recipe)
- Self-contained FSDP2 training recipe.
- 63 tests covering model, tokenizer, quantization, and end-to-end training.

CI integration

What's left out / future work
- Extra post-attention/post-MLP LayerNorms not present in the standard TETransformerLayer. Golden values will be updated once a native_te checkpoint is trained and uploaded.

Test plan
---------- END OF DESCRIPTION----
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources.
Commits will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123), and an authorized user must leave an /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Triggering Code Rabbit AI Review
To trigger a code review from code rabbit, comment on a pull request with one of these commands:
See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.
Pre-submit Checklist
Summary by CodeRabbit
New Features
Tests
Documentation
Chores