Nsight Systems SQLite exports are powerful but tedious to inspect by hand when you need an answer quickly.
This tool turns a `trace.sqlite` export into a prioritized report with explicit evidence:
- Top CUDA kernels and launch storms
- Top CPU↔GPU barriers, including blocking memcpy and CPU launcher gaps
- Top NCCL ops, with overlap against non-NCCL compute kernels
- Per-process breakdowns for vLLM-style multi-process traces
- NVLink during NCCL when NVLink counters are present
- Exact capture instructions when required counters are missing
The repo is intentionally conservative:
- It only claims NCCL/NVLink correlation when the exported SQLite data supports it.
- If NVLink counters are missing, it prints `NVLink counters not found` and tells you exactly how to re-capture.
- If only NCCL kernel names are available, it degrades to kernel-name-based NCCL detection instead of pretending it saw higher-level collectives.
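As an illustration, the kernel-name fallback can be as simple as prefix and substring matching on demangled kernel names. This is a minimal sketch, not the tool's actual implementation; the prefixes and op patterns below are assumptions:

```python
from typing import Optional

# Hypothetical name patterns; the real tool may match differently.
NCCL_KERNEL_PREFIXES = ("ncclKernel", "ncclDevKernel")

def looks_like_nccl_kernel(demangled_name: str) -> bool:
    """Return True when a kernel name suggests NCCL activity."""
    name = demangled_name.strip()
    return name.startswith(NCCL_KERNEL_PREFIXES) or "nccl" in name.lower()

def guess_collective(demangled_name: str) -> Optional[str]:
    """Best-effort mapping from a kernel name to a collective op name."""
    lowered = demangled_name.lower().replace("_", "")
    for op in ("allreduce", "allgather", "reducescatter", "broadcast", "sendrecv"):
        if op in lowered:
            return op
    return None
```

Note that this can only recover the collective kind, not message sizes or communicator layout, which is why it is treated as the lowest-fidelity detection tier.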
The fastest way to verify the current report shape is the committed synthetic example:
- Example report with all current section headers: `examples/synthetic/report.md`
- The example is generated from the synthetic SQLite fixture in `tests/test_synthetic_sqlite.py`, so it is small and reproducible.
- Exact visible section names: `Global critical path suspects`, `Top NCCL ops`, `NVLink during NCCL`, `Top CPU↔GPU barriers`, `Per-process breakdown`
- If the report says `NVLink counters not found`, jump to the NVLink counters guidance for the re-capture command.
Install:

```bash
python3 -m pip install -e .
```

For tests:

```bash
python3 -m pip install -e .[dev]
```

Run:

```bash
nsys-llm-explain trace.sqlite --out artifacts/run_YYYYMMDD_HHMMSS/
```

Useful flags:
- `--print-schema`: dump the detected SQLite tables/columns first
- `--phase-map phases.json`: optional NVTX phase grouping
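For illustration only, a phase map might group NVTX range names into named phases like this. The keys and matching semantics here are assumptions, not the tool's documented schema — check `nsys-llm-explain --help` for the real format:

```json
{
  "prefill": ["prefill", "prompt"],
  "decode": ["decode", "generation"]
}
```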
The generated output directory contains:
- `report.md`, `report.json`
- `tables/kernels.csv`
- `tables/barriers.csv`
- `tables/nccl_ops.csv`
- `tables/nccl_by_pid.csv`
- `tables/nvlink_during_nccl.csv`
- `tables/gpu_idle_gaps.csv`
- `tables/kernels_by_pid.csv`, `tables/sync_by_pid.csv`, `tables/nvtx_by_pid.csv`
New report sections:
- Global critical path suspects
- Top NCCL ops
- NVLink during NCCL
- Top CPU↔GPU barriers
- Per-process breakdown
Use this when you want kernels, launch gaps, barriers, NVTX, and CUDA-graphs-aware capture:
```bash
nsys profile \
  --trace=cuda,nvtx,osrt \
  --sample=none \
  --cpuctxsw=none \
  --cuda-trace-scope=process-tree \
  --cuda-graph-trace=node \
  -o trace \
  python your_workload.py
```
```bash
nsys export \
  --type sqlite \
  --output trace.sqlite \
  --force-overwrite=true \
  --lazy=false \
  trace.nsys-rep
```

```bash
nsys-llm-explain trace.sqlite --out artifacts/run_YYYYMMDD_HHMMSS/
```

Notes:
- `--cuda-trace-scope=process-tree` keeps child processes in the trace, which matters for vLLM-style worker processes.
- `--cuda-graph-trace=node` is recommended when the workload uses CUDA Graphs.
Use this when you want NCCL ops to survive export cleanly:
```bash
nsys profile \
  --trace=nccl,cuda,nvtx,osrt \
  --nccl-trace=all \
  --sample=none \
  --cpuctxsw=none \
  --cuda-trace-scope=process-tree \
  --cuda-graph-trace=node \
  -o trace \
  torchrun --nproc_per_node=... your_workload.py
```
```bash
nsys export \
  --type sqlite \
  --include-json true \
  --output trace.sqlite \
  --force-overwrite=true \
  --lazy=false \
  trace.nsys-rep
```

Why `--include-json true` matters:
- Nsight Systems exports some event classes, including NVTX events with user-defined payloads, only when JSON export is included.
- Recent NCCL tracing support in Nsight Systems surfaces NCCL activity as NVTX-backed events in the exported database.
This tool will still fall back to runtime API names or NCCL kernel names when those richer NCCL events are absent.
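That fallback amounts to probing the exported database for the richest available NCCL evidence. A minimal sketch of the idea, assuming standard `sqlite3` access — the preference order shown is an assumption mirroring the prose above, not the tool's exact logic:

```python
import sqlite3

def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    """Check whether a table is present in the exported database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?", (name,)
    ).fetchone()
    return row is not None

def pick_nccl_source(conn: sqlite3.Connection) -> str:
    # Prefer NVTX-backed NCCL events, then runtime API names,
    # then bare kernel names (assumed ordering, for illustration).
    if table_exists(conn, "NVTX_EVENTS"):
        return "nvtx"
    if table_exists(conn, "CUPTI_ACTIVITY_KIND_RUNTIME"):
        return "runtime_api"
    return "kernel_name"
```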
The tool can only report NVLink during NCCL when the SQLite export contains GPU Metrics tables with NVLink-related metrics.
First, list the supported GPU metric sets on your machine:
```bash
nsys profile --gpu-metrics-devices=all --gpu-metrics-set=help
```

Then re-capture with a supported metric set:
```bash
sudo nsys profile \
  --trace=nccl,cuda,nvtx,osrt \
  --nccl-trace=all \
  --sample=none \
  --cpuctxsw=none \
  --cuda-trace-scope=process-tree \
  --cuda-graph-trace=node \
  --gpu-metrics-devices=all \
  --gpu-metrics-set=<supported-set> \
  --gpu-metrics-frequency=10000 \
  -o trace \
  torchrun --nproc_per_node=... your_workload.py
```
```bash
nsys export \
  --type sqlite \
  --include-json true \
  --output trace.sqlite \
  --force-overwrite=true \
  --lazy=false \
  trace.nsys-rep
```

Notes:
- On Linux, GPU Metrics collection typically requires elevated permissions.
- When those counters are missing, the report prints `NVLink counters not found` instead of inventing a correlation result.
The snippets below are generated from the synthetic SQLite fixture used in tests, not from a real GPU capture.
CLI sample:
```
Wrote report to: /tmp/.../out/report.md
Top kernel: computeKernel (42.6% of kernel time, 2.6 ms, 3 calls)
Top barrier: cudaStreamSynchronize [sync_api] (0.8 ms, 1 events)
Top NCCL op: allreduce (2.0 ms total, 2.0 ms max, overlap 50.0%)
Launch storm: 5 launches over 0.005s = 909.1 launches/s; median kernel 1000.00 us
GPU idle estimate: GPU 0: 18.2% idle (1.0 ms / 5.5 ms window)
NVLink during NCCL: NVLink counters not found
```
Report sample:
## Global critical path suspects

| kind | name | total_ms | count | details |
|------|------|----------|-------|---------|
| kernel | computeKernel | 2.600 | 3 | 42.6% of kernel time |
| nccl | allreduce | 2.000 | 1 | max 2.000 ms |

## Top NCCL ops

| op_name | source | total_time_ms | max_duration_ms | count | compute_overlap_ms | compute_overlap_pct | straggler |
|---------|--------|---------------|-----------------|-------|--------------------|---------------------|-----------|
| allreduce | kernel | 2.000 | 2.000 | 1 | 1.000 | 50.0 | pid:111 |
| broadcast | kernel | 1.500 | 1.500 | 1 | 0.600 | 40.0 | pid:222 |
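The `compute_overlap_ms` and `compute_overlap_pct` columns can be understood as interval intersection: how much of each NCCL op's duration is covered by non-NCCL compute kernels. A simplified sketch of the idea (not the tool's actual code):

```python
def overlap_ms(op, kernels):
    """Time (ms) of interval `op` covered by any interval in `kernels`.

    `op` is a (start, end) pair and `kernels` a list of (start, end)
    pairs, all in ms. Kernel intervals are merged first so overlapping
    kernels are not double-counted.
    """
    start, end = op
    merged = []
    for ks, ke in sorted(kernels):
        if merged and ks <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], ke))
        else:
            merged.append((ks, ke))
    return sum(max(0.0, min(end, ke) - max(start, ks)) for ks, ke in merged)

def overlap_pct(op, kernels):
    """Percentage of the op's duration that overlaps compute kernels."""
    dur = op[1] - op[0]
    return 100.0 * overlap_ms(op, kernels) / dur if dur > 0 else 0.0
```

With this reading, the `allreduce` row above says 1.0 ms of its 2.0 ms duration coincided with compute kernels, hence 50.0%.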
## Top CPU↔GPU barriers

| barrier_kind | api_name | total_time_ms | count | avg_duration_us | max_duration_us |
|--------------|----------|---------------|-------|-----------------|-----------------|
| sync_api | cudaStreamSynchronize | 0.800 | 1 | 800.00 | 800.00 |
| sync_api | cudaDeviceSynchronize | 0.700 | 1 | 700.00 | 700.00 |
| blocking_memcpy | cudaMemcpy | 0.600 | 1 | 600.00 | 600.00 |
| cpu_launcher_gap | cpu_launcher_gap | 0.200 | 1 | 200.00 | 200.00 |
When NVLink metrics are present, the synthetic fixture produces a row like:
```
metric_source_id: 0
metric_names: NVLink bytes received, NVLink bytes transmitted
avg_metric_during_nccl: 76.67
avg_metric_outside_nccl: 5.83
nccl_activity_correlation: 0.990
```
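The `nccl_activity_correlation` value can be read as a Pearson correlation between the sampled NVLink metric and a 0/1 NCCL-activity indicator. A hedged sketch of one way to compute it — the tool's real implementation may differ:

```python
import math

def nccl_activity_correlation(samples, nccl_intervals):
    """Pearson correlation between a metric time series and a 0/1
    NCCL-activity indicator (illustrative sketch, not the tool's code).

    samples: list of (timestamp, metric_value)
    nccl_intervals: list of (start, end) timestamps with NCCL active
    """
    xs = [v for _, v in samples]
    ys = [
        1.0 if any(s <= t < e for s, e in nccl_intervals) else 0.0
        for t, _ in samples
    ]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    # Degenerate series (constant metric or no/all NCCL) get 0.0.
    return cov / (sx * sy) if sx and sy else 0.0
```

A value near 1.0, as in the fixture above, means the metric spikes almost exactly when NCCL is active.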
- examples/synthetic/report.md: synthetic, fixture-generated, and guaranteed to show the new NCCL/NVLink/barrier/per-process sections.
- examples/a100_vllm/: a historical, output-only example from a real A100 vLLM run.
Nsight Systems SQLite schema varies by version and by capture options.
This tool probes the schema at runtime and degrades gracefully:
- String table: prefers `StringIds(id, value)`
- Kernels: prefers `CUPTI_ACTIVITY_KIND_KERNEL`, falls back to concurrent-kernel variants
- Runtime API: prefers `CUPTI_ACTIVITY_KIND_RUNTIME`
- NVTX: prefers `NVTX_EVENTS`
- GPU Metrics: looks for `GPU_METRICS` and `TARGET_INFO_GPU_METRICS`
- CUDA Graphs capture awareness: the README and report assume `--cuda-graph-trace=node` for graph-heavy workloads
If a section cannot be computed, the report says so explicitly instead of silently omitting the limitation.
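As one concrete illustration of this graceful degradation, resolving names through the string table could look like the following. This is a sketch assuming the common `StringIds(id, value)` layout; actual column names vary by Nsight Systems version:

```python
import sqlite3

def load_string_table(conn: sqlite3.Connection) -> dict:
    """Return {id: value} from StringIds, or {} when it is absent."""
    try:
        return dict(conn.execute("SELECT id, value FROM StringIds"))
    except sqlite3.OperationalError:
        # Table missing or a different schema version: degrade
        # gracefully rather than crash the whole report.
        return {}
```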
Install:
```bash
python3 -m pip install -e .[dev]
```

Run on a trace:

```bash
nsys-llm-explain trace.sqlite --out artifacts/run_YYYYMMDD_HHMMSS/
```

Run tests:

```bash
python3 -m pytest -q
```

Primary NVIDIA documentation used for the capture/export guidance:
- Nsight Systems User Guide: GPU Metrics collection, `--gpu-metrics-set=help`, and required permissions
- Nsight Systems User Guide: `--cuda-graph-trace=node`
- Nsight Systems User Guide: `--cuda-trace-scope=process-tree`
- Nsight Systems User Guide: NCCL tracing and `--nccl-trace=all`
- Nsight Systems User Guide: SQLite export and `--include-json true`