docs: add prerequisites, troubleshooting, and build verification for GRPO quickstart #2051

brluobt wants to merge 2 commits into NVIDIA-NeMo:main from
Conversation
> [!NOTE]
> **cuDNN in Ray worker virtual environments**: Even when cuDNN is installed on the host system, Ray worker processes run in isolated virtual environments (`/opt/ray_venvs/`) that may not have access to system CUDA libraries. If you see errors like `libcudnn.so.9: cannot open shared object file`, ensure that the CUDA library paths are included in `LD_LIBRARY_PATH`:
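As a sketch of the fix the note describes (the venv path below is illustrative only; the real directory under `/opt/ray_venvs/` depends on the image build):

```shell
# Illustrative worker-venv cuDNN path; adjust to your actual build before use.
CUDNN_DIR="/opt/ray_venvs/example/lib/python3.12/site-packages/nvidia/cudnn/lib"
export LD_LIBRARY_PATH="${CUDNN_DIR}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
echo "$LD_LIBRARY_PATH"
```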
Did you see this libcudnn.so.9 even in a container? Or was this just something when you ran bare-metal outside of our docker image?
Hi terrykong!
I went back and re-tested inside the container (nemo-rl:latest built from
main at commit b1ab8ab). The result:
- `libcudnn.so.9` is available at `/usr/lib/x86_64-linux-gnu/` (via the base image)
- It is also installed via pip (`nvidia-cudnn`) in the vLLM worker venv at `/opt/ray_venvs/.../site-packages/nvidia/cudnn/lib/`
- `ldconfig -p` resolves it correctly
- Even with `LD_LIBRARY_PATH` unset, `ctypes.CDLL("libcudnn.so.9")` succeeds from both the driver and worker venv Python
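The `ctypes.CDLL` check mentioned above can be packaged as a quick probe. This is a minimal sketch (`cudnn_loadable` is a hypothetical helper, not part of NeMo-RL); it reports rather than fails when the library is absent:

```python
import ctypes
import os


def cudnn_loadable(name: str = "libcudnn.so.9") -> bool:
    """Return True if the given shared library can be dlopen'd by this process."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    if cudnn_loadable():
        print("cuDNN is visible to this interpreter")
    else:
        print("cuDNN not found; LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", "<unset>"))
```

Running it from both the driver Python and a worker venv Python distinguishes a host-level problem from a venv-isolation problem.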
So I was unable to reproduce this inside the current container. My original encounter
was during a sequence of cascading errors, and I unfortunately did not preserve the exact
build logs or image. It is possible that I had a corrupted build cache at the time.
I'm happy to remove this note from installation.md if you think it adds confusion.
```shell
# For vLLM inference backend (recommended):
uv sync --extra automodel --extra vllm

# For SGLang inference backend:
uv sync --extra sglang
```
I'm curious why this was necessary for you. Were you running in a container when you needed to run this? The container should have synced all the dependencies https://github.com/NVIDIA-NeMo/RL/blob/main/docker/Dockerfile#L140-L149 which warms the uv cache and then we later create all the virtual envs for all the worker types here https://github.com/NVIDIA-NeMo/RL/blob/main/docker/Dockerfile#L193C3-L198. So the container shouldn't need this.
By default, the uv sync here is downloading/installing the dependencies for the driver process, which doesn't update the worker venvs, which is why I'm a little surprised this worked for you.
Yes, I was running inside a container built from docker/Dockerfile. Looking at it now,
the Dockerfile does run uv sync --extra vllm, --extra automodel, etc. at build time
(lines 140-149) and then prefetches worker venvs (lines 193-198), so these steps should
not be needed in a clean build.
I suspect my original issues stemmed from a corrupted build cache — which would also
explain the ray._private.node error I hit. Unfortunately I didn't save the build logs
from that session.
To reproduce, I re-ran the full test suite inside the current nemo-rl:latest container:
- OmegaConf, Ray, vllm.logger, uvloop, tabulate all import correctly
- `nemo_automodel` is correctly installed in the DTensorPolicyWorkerV2 venv (not in the driver venv, which is expected)
Given this, I agree the Prerequisites section should not recommend uv sync --extra as a
default step. I'll revise it to focus on the container workflow and only mention uv sync
as a fallback for bare-metal / custom builds.
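The per-module import checks described in this thread can be scripted. A hypothetical helper (`check_imports` is not a NeMo-RL utility; the module names come from the comments above):

```python
import importlib.util


def check_imports(modules):
    """Map each module name to whether it is importable in this environment."""
    results = {}
    for mod in modules:
        try:
            results[mod] = importlib.util.find_spec(mod) is not None
        except ModuleNotFoundError:
            # find_spec on a dotted name raises if the parent package is absent.
            results[mod] = False
    return results


if __name__ == "__main__":
    # Run inside the container; these are the modules discussed above.
    for mod, ok in check_imports(
        ["omegaconf", "ray", "vllm.logger", "uvloop", "tabulate"]
    ).items():
        print(f"{mod}: {'OK' if ok else 'MISSING'}")
```

Note that this only tests the environment of the interpreter it runs in, so it must be executed separately inside a worker venv to say anything about that venv.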
```shell
# For Llama models, also run: huggingface-cli login

# Recommended: Disable DeepGEMM if not available on your GPU (e.g., H20)
export VLLM_USE_DEEP_GEMM=0
```
Thanks! I ran a full GRPO training pass (2 steps, Qwen2.5-1.5B, 8x H20) with
use_deep_gemm=True inside the current container — it completed successfully
(exit code 0).
However, I noted that with the default bfloat16 config, DeepGEMM is never
actually invoked since it only applies to FP8 inference paths. My original
error message specifically mentioned "FP8 kernels", so I likely hit this
with an FP8-related configuration or during a cascading fix session where
the deep_gemm package version became inconsistent.
That said, VLLM_USE_DEEP_GEMM=0 is still a useful safety net for users
who enable FP8 on non-H100 GPUs, so I think it's worth keeping in the docs
as a troubleshooting tip.
| Error | Likely cause | Fix |
| --- | --- | --- |
| `No module named 'nemo_automodel'` | Missing `automodel` extra | `uv sync --extra automodel` |
| `No module named 'vllm'` or `No module named 'vllm.logger'` | Missing or incomplete `vllm` extra | `uv sync --extra vllm` |
| `cannot import name 'OmegaConf'` | Virtual environment not synced | `uv venv && uv sync --extra automodel --extra vllm` |
| `No module named 'ray._private.node'` | Corrupted Ray installation | `uv pip install "ray[default]==2.49.2" --force-reinstall` |
Actually, this one

> `No module named 'ray._private.node'` | Corrupted Ray installation | `uv pip install "ray[default]==2.49.2" --force-reinstall`

could also explain why you needed to run `uv sync --extra vllm` if you somehow had a corrupted cache. I wonder if the docker build wasn't failing for you or something. I'd be interested to know what your docker build logs looked like on the potentially corrupted container. We could probably improve our build process to catch whatever you observed.
I would actually prefer if we recommend these instructions (assuming they work for you too):

```shell
uv run nemo_rl/utils/prefetch_venvs.py
```

which downloads all the venvs, or if you are just trying to install a select few:

```shell
uv run nemo_rl/utils/prefetch_venvs.py vllm policy
```

See the args for more details:

```python
parser.add_argument(
    "filters",
    nargs="*",
    help="Filter strings to match against actor FQNs. Only actors whose FQN "
    "contains at least one of these strings will be prefetched. "
    "If not provided, all venvs are prefetched.",
)
parser.add_argument(
    "--negative-filters",
    nargs="*",
    help="Filter strings to exclude from prefetching. Actors whose FQN "
    "contains any of these strings will be skipped.",
)
```
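For illustration, the filter semantics that help text describes could be sketched as follows (`should_prefetch` and the example FQNs are hypothetical, not the actual implementation in `prefetch_venvs.py`):

```python
def should_prefetch(fqn, filters=None, negative_filters=None):
    """Decide whether an actor FQN is prefetched, per the argparse help text (sketch)."""
    # Negative filters win: skip any FQN containing an excluded substring.
    if negative_filters and any(neg in fqn for neg in negative_filters):
        return False
    # No positive filters means every venv is prefetched.
    if not filters:
        return True
    # Otherwise at least one filter substring must match.
    return any(f in fqn for f in filters)


# Hypothetical FQNs: `prefetch_venvs.py vllm policy` keeps vLLM/policy workers only.
print(should_prefetch("nemo_rl.generation.vllm.VllmWorker", ["vllm", "policy"]))  # True
print(should_prefetch("nemo_rl.environments.MathEnvironment", ["vllm", "policy"]))  # False
```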
Great suggestion — I wasn't aware of `prefetch_venvs.py`. I'll update the troubleshooting table to recommend:

```shell
uv run nemo_rl/utils/prefetch_venvs.py               # all workers
uv run nemo_rl/utils/prefetch_venvs.py vllm policy   # selective
```

instead of `uv sync --extra`. This is a much better fit since it directly rebuilds the worker venvs rather than just the driver environment.
I'll also revise the Prerequisites section accordingly and push an updated commit.
…GRPO quickstart

- `docs/guides/grpo.md`: Add Prerequisites section with required `uv sync` extras and environment variables (`HF_HOME`, `WANDB_API_KEY`, `VLLM_USE_DEEP_GEMM`, `LD_LIBRARY_PATH`). Add Troubleshooting table covering common setup errors encountered on single-node GPU setups (e.g., missing extras, cuDNN visibility, DeepGEMM availability).
- `docs/docker.md`: Add "Verifying the Build" section with quick smoke-test commands to validate core and optional dependencies after building the Docker image.
- `docs/about/installation.md`: Add note about cuDNN not being visible in Ray worker virtual environments (`/opt/ray_venvs/`) and how to fix via `LD_LIBRARY_PATH`.

Closes NVIDIA-NeMo#2043

Made-with: Cursor
Signed-off-by: brluo <brluo@nvidia.com>

…are-metal

Per @terrykong's review:

- Replace `uv sync --extra` recommendations with `uv run nemo_rl/utils/prefetch_venvs.py`, which directly rebuilds worker venvs (the correct fix for missing modules).
- Simplify Prerequisites to focus on env vars; move venv prefetch to a bare-metal note since containers already have everything pre-built.
- Update troubleshooting table: consolidate missing-module fixes under `prefetch_venvs.py`, link cuDNN issue to installation.md (bare-metal only).
- Add `prefetch_venvs.py` to the venv rebuild instructions.

Made-with: Cursor
Signed-off-by: brluo <brluo@nvidia.com>
847f2a0 to 37ed124
Summary
- `docs/guides/grpo.md`: Add Prerequisites section with required `uv sync` extras and environment variables (`HF_HOME`, `WANDB_API_KEY`, `VLLM_USE_DEEP_GEMM`, `LD_LIBRARY_PATH`). Add Troubleshooting table covering 6 common setup errors encountered on single-node GPU setups.
- `docs/docker.md`: Add "Verifying the Build" section with quick smoke-test commands to validate core and optional dependencies after building the Docker image.
- `docs/about/installation.md`: Add note about cuDNN not being visible in Ray worker virtual environments (`/opt/ray_venvs/`) and how to fix via `LD_LIBRARY_PATH`.

Motivation
Following the GRPO Quick Start guide on an H20 single-node (8 GPUs) setup, I encountered 9 separate dependency/environment errors before successfully running training. These documentation additions aim to save future users from the same experience.
See issue #2043 for the full list of errors and their root causes.
Closes #2043
Test plan