docs: add prerequisites, troubleshooting, and build verification for GRPO quickstart #2051

brluobt wants to merge 2 commits into NVIDIA-NeMo:main from
Conversation
> [!NOTE]
> **cuDNN in Ray worker virtual environments**: Even when cuDNN is installed on the host system, Ray worker processes run in isolated virtual environments (`/opt/ray_venvs/`) that may not have access to system CUDA libraries. If you see errors like `libcudnn.so.9: cannot open shared object file`, ensure that the CUDA library paths are included in `LD_LIBRARY_PATH`:
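As a sketch of the fix the note describes (the venv path below is illustrative only; the real directory under `/opt/ray_venvs/` depends on the image build):

```shell
# Illustrative worker-venv cuDNN path; adjust to your actual build before use.
CUDNN_DIR="/opt/ray_venvs/example/lib/python3.12/site-packages/nvidia/cudnn/lib"
export LD_LIBRARY_PATH="${CUDNN_DIR}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
echo "$LD_LIBRARY_PATH"
```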
Did you see this libcudnn.so.9 even in a container? Or was this just something when you ran bare-metal outside of our docker image?
Hi terrykong!
I went back and re-tested inside the container (nemo-rl:latest built from
main at commit b1ab8ab). The result:
- `libcudnn.so.9` is available at `/usr/lib/x86_64-linux-gnu/` (via the base image)
- It is also installed via pip (`nvidia-cudnn`) in the vLLM worker venv at `/opt/ray_venvs/.../site-packages/nvidia/cudnn/lib/`
- `ldconfig -p` resolves it correctly
- Even with `LD_LIBRARY_PATH` unset, `ctypes.CDLL("libcudnn.so.9")` succeeds from both the driver and worker venv Python
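The `ctypes.CDLL` check mentioned above can be packaged as a quick probe. This is a minimal sketch (`cudnn_loadable` is a hypothetical helper, not part of NeMo-RL); it reports rather than fails when the library is absent:

```python
import ctypes
import os


def cudnn_loadable(name: str = "libcudnn.so.9") -> bool:
    """Return True if the given shared library can be dlopen'd by this process."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    if cudnn_loadable():
        print("cuDNN is visible to this interpreter")
    else:
        print("cuDNN not found; LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", "<unset>"))
```

Running it from both the driver Python and a worker venv Python distinguishes a host-level problem from a venv-isolation problem.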
So I was unable to reproduce this inside the current container. My original encounter
was during a sequence of cascading errors, and I unfortunately did not preserve the exact
build logs or image. It is possible that I had a corrupted build cache at the time.
I'm happy to remove this note from installation.md if you think it adds confusion.
```shell
# For vLLM inference backend (recommended):
uv sync --extra automodel --extra vllm

# For SGLang inference backend:
uv sync --extra sglang
```
I'm curious why this was necessary for you. Were you running in a container when you needed to run this? The container should have synced all the dependencies https://github.com/NVIDIA-NeMo/RL/blob/main/docker/Dockerfile#L140-L149 which warms the uv cache and then we later create all the virtual envs for all the worker types here https://github.com/NVIDIA-NeMo/RL/blob/main/docker/Dockerfile#L193C3-L198. So the container shouldn't need this.
By default, the uv sync here is downloading/installing the dependencies for the driver process, which doesn't update the worker venvs, which is why I'm a little surprised this worked for you.
Yes, I was running inside a container built from docker/Dockerfile. Looking at it now,
the Dockerfile does run uv sync --extra vllm, --extra automodel, etc. at build time
(lines 140-149) and then prefetches worker venvs (lines 193-198), so these steps should
not be needed in a clean build.
I suspect my original issues stemmed from a corrupted build cache — which would also
explain the ray._private.node error I hit. Unfortunately I didn't save the build logs
from that session.
To reproduce, I re-ran the full test suite inside the current nemo-rl:latest container:
- OmegaConf, Ray, vllm.logger, uvloop, tabulate all import correctly
- `nemo_automodel` is correctly installed in the DTensorPolicyWorkerV2 venv (not in the driver venv, which is expected)
Given this, I agree the Prerequisites section should not recommend uv sync --extra as a
default step. I'll revise it to focus on the container workflow and only mention uv sync
as a fallback for bare-metal / custom builds.
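The per-module import checks described in this thread can be scripted. A hypothetical helper (`check_imports` is not a NeMo-RL utility; the module names come from the comments above):

```python
import importlib.util


def check_imports(modules):
    """Map each module name to whether it is importable in this environment."""
    results = {}
    for mod in modules:
        try:
            results[mod] = importlib.util.find_spec(mod) is not None
        except ModuleNotFoundError:
            # find_spec on a dotted name raises if the parent package is absent.
            results[mod] = False
    return results


if __name__ == "__main__":
    # Run inside the container; these are the modules discussed above.
    for mod, ok in check_imports(
        ["omegaconf", "ray", "vllm.logger", "uvloop", "tabulate"]
    ).items():
        print(f"{mod}: {'OK' if ok else 'MISSING'}")
```

Note that this only tests the environment of the interpreter it runs in, so it must be executed separately inside a worker venv to say anything about that venv.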
```shell
# For Llama models, also run: huggingface-cli login

# Recommended: Disable DeepGEMM if not available on your GPU (e.g., H20)
export VLLM_USE_DEEP_GEMM=0
```
Thanks! I ran a full GRPO training pass (2 steps, Qwen2.5-1.5B, 8x H20) with
use_deep_gemm=True inside the current container — it completed successfully
(exit code 0).
However, I noted that with the default bfloat16 config, DeepGEMM is never
actually invoked since it only applies to FP8 inference paths. My original
error message specifically mentioned "FP8 kernels", so I likely hit this
with an FP8-related configuration or during a cascading fix session where
the deep_gemm package version became inconsistent.
That said, VLLM_USE_DEEP_GEMM=0 is still a useful safety net for users
who enable FP8 on non-H100 GPUs, so I think it's worth keeping in the docs
as a troubleshooting tip.
| Error | Likely cause | Fix |
| --- | --- | --- |
| `No module named 'nemo_automodel'` | Missing `automodel` extra | `uv sync --extra automodel` |
| `No module named 'vllm'` or `No module named 'vllm.logger'` | Missing or incomplete `vllm` extra | `uv sync --extra vllm` |
| `cannot import name 'OmegaConf'` | Virtual environment not synced | `uv venv && uv sync --extra automodel --extra vllm` |
| `No module named 'ray._private.node'` | Corrupted Ray installation | `uv pip install "ray[default]==2.49.2" --force-reinstall` |
Actually, this one

> `No module named 'ray._private.node'` | Corrupted Ray installation | `uv pip install "ray[default]==2.49.2" --force-reinstall`

could also explain why you needed to run `uv sync --extra vllm` if you somehow had a corrupted cache. I wonder if the docker build wasn't failing for you or something. I'd be interested to know what your docker build logs looked like on the potentially corrupted container. We could probably improve our build process to catch whatever you observed.
I would actually prefer if we recommend these instructions (assuming they work for you too):

```shell
uv run nemo_rl/utils/prefetch_venvs.py
```

which downloads all the venvs, or if you are just trying to install a select few:

```shell
uv run nemo_rl/utils/prefetch_venvs.py vllm policy
```

See the args for more details:

```python
parser.add_argument(
    "filters",
    nargs="*",
    help="Filter strings to match against actor FQNs. Only actors whose FQN "
    "contains at least one of these strings will be prefetched. "
    "If not provided, all venvs are prefetched.",
)
parser.add_argument(
    "--negative-filters",
    nargs="*",
    help="Filter strings to exclude from prefetching. Actors whose FQN "
    "contains any of these strings will be skipped.",
)
```
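For illustration, the filter semantics that help text describes could be sketched as follows (`should_prefetch` and the example FQNs are hypothetical, not the actual implementation in `prefetch_venvs.py`):

```python
def should_prefetch(fqn, filters=None, negative_filters=None):
    """Decide whether an actor FQN is prefetched, per the argparse help text (sketch)."""
    # Negative filters win: skip any FQN containing an excluded substring.
    if negative_filters and any(neg in fqn for neg in negative_filters):
        return False
    # No positive filters means every venv is prefetched.
    if not filters:
        return True
    # Otherwise at least one filter substring must match.
    return any(f in fqn for f in filters)


# Hypothetical FQNs: `prefetch_venvs.py vllm policy` keeps vLLM/policy workers only.
print(should_prefetch("nemo_rl.generation.vllm.VllmWorker", ["vllm", "policy"]))  # True
print(should_prefetch("nemo_rl.environments.MathEnvironment", ["vllm", "policy"]))  # False
```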
Great suggestion — I wasn't aware of `prefetch_venvs.py`. I'll update the troubleshooting table to recommend:

```shell
uv run nemo_rl/utils/prefetch_venvs.py               # all workers
uv run nemo_rl/utils/prefetch_venvs.py vllm policy   # selective
```

instead of `uv sync --extra`. This is a much better fit since it directly rebuilds the worker venvs rather than just the driver environment.
I'll also revise the Prerequisites section accordingly and push an updated commit.
…GRPO quickstart

- `docs/guides/grpo.md`: Add Prerequisites section with required `uv sync` extras and environment variables (`HF_HOME`, `WANDB_API_KEY`, `VLLM_USE_DEEP_GEMM`, `LD_LIBRARY_PATH`). Add Troubleshooting table covering common setup errors encountered on single-node GPU setups (e.g., missing extras, cuDNN visibility, DeepGEMM availability).
- `docs/docker.md`: Add "Verifying the Build" section with quick smoke-test commands to validate core and optional dependencies after building the Docker image.
- `docs/about/installation.md`: Add note about cuDNN not being visible in Ray worker virtual environments (`/opt/ray_venvs/`) and how to fix via `LD_LIBRARY_PATH`.

Closes NVIDIA-NeMo#2043

Made-with: Cursor
Signed-off-by: brluo <brluo@nvidia.com>

…are-metal

Per @terrykong's review:

- Replace `uv sync --extra` recommendations with `uv run nemo_rl/utils/prefetch_venvs.py`, which directly rebuilds worker venvs (the correct fix for missing modules).
- Simplify Prerequisites to focus on env vars; move venv prefetch to a bare-metal note since containers already have everything pre-built.
- Update troubleshooting table: consolidate missing-module fixes under `prefetch_venvs.py`, link cuDNN issue to installation.md (bare-metal only).
- Add `prefetch_venvs.py` to the venv rebuild instructions.

Made-with: Cursor
Signed-off-by: brluo <brluo@nvidia.com>
847f2a0 to 37ed124
Summary
- `docs/guides/grpo.md`: Add Prerequisites section with required `uv sync` extras and environment variables (`HF_HOME`, `WANDB_API_KEY`, `VLLM_USE_DEEP_GEMM`, `LD_LIBRARY_PATH`). Add Troubleshooting table covering 6 common setup errors encountered on single-node GPU setups.
- `docs/docker.md`: Add "Verifying the Build" section with quick smoke-test commands to validate core and optional dependencies after building the Docker image.
- `docs/about/installation.md`: Add note about cuDNN not being visible in Ray worker virtual environments (`/opt/ray_venvs/`) and how to fix via `LD_LIBRARY_PATH`.

Motivation
Following the GRPO Quick Start guide on an H20 single-node (8 GPUs) setup, I encountered 9 separate dependency/environment errors before successfully running training. These documentation additions aim to save future users from the same experience.
See issue #2043 for the full list of errors and their root causes.
Closes #2043
Test plan