
docs: add prerequisites, troubleshooting, and build verification for GRPO quickstart#2051

Open
brluobt wants to merge 2 commits into NVIDIA-NeMo:main from brluobt:docs/fix-grpo-quickstart-prerequisites

Conversation


@brluobt brluobt commented Mar 3, 2026

Summary

  • docs/guides/grpo.md: Add Prerequisites section with required uv sync extras and environment variables (HF_HOME, WANDB_API_KEY, VLLM_USE_DEEP_GEMM, LD_LIBRARY_PATH). Add Troubleshooting table covering 6 common setup errors encountered on single-node GPU setups.
  • docs/docker.md: Add "Verifying the Build" section with quick smoke-test commands to validate core and optional dependencies after building the Docker image.
  • docs/about/installation.md: Add note about cuDNN not being visible in Ray worker virtual environments (/opt/ray_venvs/) and how to fix via LD_LIBRARY_PATH.
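
The environment variables named in the first bullet can be sketched as follows. This is an illustrative snippet, not the exact doc text; the paths and key values are placeholders:

```shell
# Illustrative values only; set these to your own paths and keys.
export HF_HOME=/path/to/hf-cache         # Hugging Face model/dataset cache
export WANDB_API_KEY=your-wandb-key      # enables Weights & Biases logging
export VLLM_USE_DEEP_GEMM=0              # disable DeepGEMM on GPUs without FP8 support (e.g., H20)
# Make cuDNN visible to Ray worker venvs (path from the container discussion below)
export LD_LIBRARY_PATH="/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}"
```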

Motivation

Following the GRPO Quick Start guide on an H20 single-node (8 GPUs) setup, I encountered 9 separate dependency/environment errors before successfully running training. These documentation additions aim to save future users from the same experience.

See issue #2043 for the full list of errors and their root causes.

Closes #2043

Test plan

  • Verify markdown renders correctly on GitHub
  • Confirm the troubleshooting steps match actual error messages
  • Maintainer review for accuracy and style consistency

Summary by CodeRabbit

Documentation

  • Expanded installation and setup guides with enhanced environment configuration details for cuDNN paths
  • Added Docker build verification commands to validate dependencies at runtime
  • Comprehensive GRPO setup documentation including prerequisites, environment variables, launch guidance, and troubleshooting for common setup issues

@brluobt brluobt requested a review from a team as a code owner March 3, 2026 02:45
@github-actions github-actions Bot added the `Documentation` (Improvements or additions to documentation) label on Mar 3, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Mar 3, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a426896 and 1403c00.

📒 Files selected for processing (3)
  • docs/about/installation.md
  • docs/docker.md
  • docs/guides/grpo.md

📝 Walkthrough

Walkthrough

Three documentation files are updated to address setup and troubleshooting gaps in the GRPO Quick Start guide. docs/about/installation.md adds cuDNN environment path notes, docs/docker.md adds build verification steps, and docs/guides/grpo.md adds prerequisites, environment variables, and troubleshooting sections.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| cuDNN Setup Guidance<br>`docs/about/installation.md` | Adds two user notes detailing `LD_LIBRARY_PATH` configuration for Ray worker virtual environments and `FLASH_ATTENTION` fallback instructions for cuDNN visibility issues. |
| Docker Build Verification<br>`docs/docker.md` | Adds "Verifying the Build" subsection with `docker run` commands to verify core and optional dependencies at runtime, with rebuild guidance if checks fail. |
| GRPO Quick Start Enhancements<br>`docs/guides/grpo.md` | Adds Prerequisites section (environment extras and `uv sync` commands), environment variables guidance (HF_HOME, WANDB_API_KEY, VLLM_USE_DEEP_GEMM, LD_LIBRARY_PATH), Launch subsection, and Troubleshooting Common Setup Issues section with a symptom-cause-fix table for common errors (missing packages, Ray worker venv issues, missing dependencies). |
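
The Docker build verification mentioned above boils down to import smoke tests inside the image. A minimal sketch of the idea, with stdlib modules standing in for the real dependencies so the snippet runs anywhere:

```shell
# Stand-in smoke test: replace json/os with the real dependencies
# (e.g., ray, vllm, omegaconf) when running inside the built image.
result=$(python3 -c "import json, os; print('core deps OK')")
echo "$result"
```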

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 6 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and accurately describes the main changes: adding prerequisites, troubleshooting, and build verification documentation for the GRPO quickstart guide. |
| Linked Issues check | ✅ Passed | All code changes directly address the objectives from issue #2043: Prerequisites section with uv sync extras and env vars, Troubleshooting table for common errors, and build verification steps are implemented across the three documentation files. |
| Out of Scope Changes check | ✅ Passed | All changes are focused on documentation improvements directly related to issue #2043 requirements; no unrelated or out-of-scope code changes are present. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Test Results For Major Changes | ✅ Passed | PR contains only minor documentation changes across three Markdown files with no code modifications, and includes appropriate test plan for documentation verification. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




@terrykong terrykong left a comment


thanks for improving our docs @brluobt

```markdown
> [!NOTE]
> **cuDNN in Ray worker virtual environments**: Even when cuDNN is installed on the host system, Ray worker processes run in isolated virtual environments (`/opt/ray_venvs/`) that may not have access to system CUDA libraries. If you see errors like `libcudnn.so.9: cannot open shared object file`, ensure that the CUDA library paths are included in `LD_LIBRARY_PATH`:
```
Collaborator


Did you see this libcudnn.so.9 even in a container? Or was this just something when you ran bare-metal outside of our docker image?

Author


Hi terrykong!
I went back and re-tested inside the container (nemo-rl:latest built from
main at commit b1ab8ab). The result:

  • libcudnn.so.9 is available at /usr/lib/x86_64-linux-gnu/ (via the base image)
  • It is also installed via pip (nvidia-cudnn) in the vLLM worker venv at
    /opt/ray_venvs/.../site-packages/nvidia/cudnn/lib/
  • ldconfig -p resolves it correctly
  • Even with LD_LIBRARY_PATH unset, ctypes.CDLL("libcudnn.so.9") succeeds from
    both the driver and worker venv Python

So I was unable to reproduce this inside the current container. My original encounter
was during a sequence of cascading errors, and I unfortunately did not preserve the exact
build logs or image. It is possible that I had a corrupted build cache at the time.

I'm happy to remove this note from installation.md if you think it adds confusion.
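
The resolution check described above can be reproduced with a one-liner. In this sketch `libc.so.6` stands in so the snippet runs outside the container; inside it, substitute `libcudnn.so.9`:

```shell
# ctypes.CDLL raises OSError when the dynamic linker cannot resolve the library.
status=$(python3 -c "
import ctypes
try:
    ctypes.CDLL('libc.so.6')  # stand-in; use 'libcudnn.so.9' in the container
    print('resolved')
except OSError:
    print('missing')
")
echo "$status"
```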

Comment thread docs/guides/grpo.md Outdated
Comment on lines +14 to +18

```sh
# For vLLM inference backend (recommended):
uv sync --extra automodel --extra vllm

# For SGLang inference backend:
uv sync --extra sglang
```
Collaborator


I'm curious why this was necessary for you. Were you running in a container when you needed to run this? The container should have synced all the dependencies https://github.com/NVIDIA-NeMo/RL/blob/main/docker/Dockerfile#L140-L149 which warms the uv cache and then we later create all the virtual envs for all the worker types here https://github.com/NVIDIA-NeMo/RL/blob/main/docker/Dockerfile#L193C3-L198. So the container shouldn't need this.

By default, the uv sync here is downloading/installing the dependencies for the driver process, which doesn't update the worker venvs, which is why I'm a little surprised this worked for you.

Author


Yes, I was running inside a container built from docker/Dockerfile. Looking at it now,
the Dockerfile does run uv sync --extra vllm, --extra automodel, etc. at build time
(lines 140-149) and then prefetches worker venvs (lines 193-198), so these steps should
not be needed in a clean build.

I suspect my original issues stemmed from a corrupted build cache — which would also
explain the ray._private.node error I hit. Unfortunately I didn't save the build logs
from that session.

To reproduce, I re-ran the full test suite inside the current nemo-rl:latest container:

  • OmegaConf, Ray, vllm.logger, uvloop, tabulate all import correctly
  • nemo_automodel is correctly installed in the DTensorPolicyWorkerV2 venv (not in the
    driver venv, which is expected)

Given this, I agree the Prerequisites section should not recommend uv sync --extra as a
default step. I'll revise it to focus on the container workflow and only mention uv sync
as a fallback for bare-metal / custom builds.

Comment thread docs/guides/grpo.md
```sh
# For Llama models, also run: huggingface-cli login

# Recommended: Disable DeepGEMM if not available on your GPU (e.g., H20)
export VLLM_USE_DEEP_GEMM=0
```
Collaborator


good catch

Author

@brluobt brluobt Mar 3, 2026


Thanks! I ran a full GRPO training pass (2 steps, Qwen2.5-1.5B, 8x H20) with
use_deep_gemm=True inside the current container — it completed successfully
(exit code 0).

However, I noted that with the default bfloat16 config, DeepGEMM is never
actually invoked since it only applies to FP8 inference paths. My original
error message specifically mentioned "FP8 kernels", so I likely hit this
with an FP8-related configuration or during a cascading fix session where
the deep_gemm package version became inconsistent.

That said, VLLM_USE_DEEP_GEMM=0 is still a useful safety net for users
who enable FP8 on non-H100 GPUs, so I think it's worth keeping in the docs
as a troubleshooting tip.

Comment thread docs/guides/grpo.md Outdated
Comment on lines +55 to +58

```markdown
| `No module named 'nemo_automodel'` | Missing `automodel` extra | `uv sync --extra automodel` |
| `No module named 'vllm'` or `No module named 'vllm.logger'` | Missing or incomplete `vllm` extra | `uv sync --extra vllm` |
| `cannot import name 'OmegaConf'` | Virtual environment not synced | `uv venv && uv sync --extra automodel --extra vllm` |
| `No module named 'ray._private.node'` | Corrupted Ray installation | `uv pip install "ray[default]==2.49.2" --force-reinstall` |
```
Collaborator


Actually, this one

```markdown
| `No module named 'ray._private.node'` | Corrupted Ray installation | `uv pip install "ray[default]==2.49.2" --force-reinstall` |
```

could also explain why you needed to run `uv sync --extra vllm` if somehow you had a corrupted cache. I wonder if somehow the docker build wasn't failing for you or something. I'd be interested to know what your docker build logs looked like on the potentially corrupted container. We could probably improve our build process to catch whatever you observed.

I would actually prefer if we recommend these instructions (assuming they work for you too):

```sh
uv run nemo_rl/utils/prefetch_venvs.py
```

which downloads all the venvs, or if you are just trying to install a select few:

```sh
uv run nemo_rl/utils/prefetch_venvs.py vllm policy
```

See the args for more details:

```python
parser.add_argument(
    "filters",
    nargs="*",
    help="Filter strings to match against actor FQNs. Only actors whose FQN "
    "contains at least one of these strings will be prefetched. "
    "If not provided, all venvs are prefetched.",
)
parser.add_argument(
    "--negative-filters",
    nargs="*",
    help="Filter strings to exclude from prefetching. Actors whose FQN "
    "contains any of these strings will be skipped.",
)
```

Author


Great suggestion — I wasn't aware of prefetch_venvs.py. I'll update the troubleshooting
table to recommend:

```sh
uv run nemo_rl/utils/prefetch_venvs.py              # all workers
uv run nemo_rl/utils/prefetch_venvs.py vllm policy  # selective
```

instead of uv sync --extra. This is a much better fit since it directly rebuilds the
worker venvs rather than just the driver environment.

I'll also revise the Prerequisites section accordingly and push an updated commit.

brluo added 2 commits March 4, 2026 15:03
…GRPO quickstart

- docs/guides/grpo.md: Add Prerequisites section with required `uv sync` extras and
  environment variables (HF_HOME, WANDB_API_KEY, VLLM_USE_DEEP_GEMM, LD_LIBRARY_PATH).
  Add Troubleshooting table covering common setup errors encountered on single-node
  GPU setups (e.g., missing extras, cuDNN visibility, DeepGEMM availability).
- docs/docker.md: Add "Verifying the Build" section with quick smoke-test commands
  to validate core and optional dependencies after building the Docker image.
- docs/about/installation.md: Add note about cuDNN not being visible in Ray worker
  virtual environments (/opt/ray_venvs/) and how to fix via LD_LIBRARY_PATH.

Closes NVIDIA-NeMo#2043

Made-with: Cursor
Signed-off-by: brluo <brluo@nvidia.com>
…are-metal

Per @terrykong's review:
- Replace `uv sync --extra` recommendations with `uv run nemo_rl/utils/prefetch_venvs.py`
  which directly rebuilds worker venvs (the correct fix for missing modules).
- Simplify Prerequisites to focus on env vars; move venv prefetch to a
  bare-metal note since containers already have everything pre-built.
- Update troubleshooting table: consolidate missing module fixes under
  prefetch_venvs.py, link cuDNN issue to installation.md (bare-metal only).
- Add prefetch_venvs.py to the venv rebuild instructions.

Made-with: Cursor
Signed-off-by: brluo <brluo@nvidia.com>
@brluobt brluobt force-pushed the docs/fix-grpo-quickstart-prerequisites branch from 847f2a0 to 37ed124 on March 4, 2026 07:04

Labels

community-request, Documentation (Improvements or additions to documentation)


Development

Successfully merging this pull request may close these issues.

GRPO Quick Start guide missing prerequisites, env vars, and troubleshooting for single-node H20 setup

2 participants