feat(slinky): allow gpu.clique to override provider accelerator domains#342
feat(slinky): allow gpu.clique to override provider accelerator domains#342dmitsh wants to merge 1 commit into
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #342 +/- ##
==========================================
+ Coverage 68.46% 69.96% +1.50%
==========================================
Files 82 84 +2
Lines 4842 5084 +242
==========================================
+ Hits 3315 3557 +242
+ Misses 1395 1307 -88
- Partials 132 220 +88 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Greptile SummaryThis PR adds a
Confidence Score: 5/5Safe to merge; the new code path is well-tested and does not affect existing behaviour when the flag is off. The caching mechanism, graph cloning, and error propagation are all correct. The two observations are minor inconsistencies that do not affect runtime correctness for any realistic deployment. No files require special attention; pkg/engines/slinky/engine.go has minor readability nits but no logic defects. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[GenerateOutput called\nwith graph + config] --> B[resolveTopologies\ngetTranslateConfig]
B --> C{UseGPUCliqueLabel\nAND usesBlockTopology?}
C -- Yes --> D[loadClusterNodes\nlazy fetch + cache]
D --> E[withGPUCliqueDomains\nbuild domains from\nnvidia.com/gpu.clique labels]
E --> F{domains empty?}
F -- Yes --> G[Return 502 error]
F -- No --> H[Clone graph,\nreplace Domains]
C -- No --> I[translate.NewNetworkTopology\nwith original graph]
H --> I
I --> J[GenerateTopologyConfig]
J --> K{ConfigUpdateMode\n!= none?}
K -- Yes --> L[UpdateTopologyConfigmap]
K -- No --> M{UseDynamicNodes?}
L --> M
M -- Yes --> N[loadClusterNodes\nreturns cached result]
N --> O[performReconciliation\nannotate k8s nodes]
M -- No --> P[Return OK]
O --> P
Reviews (3): Last reviewed commit: "feat(slinky): allow gpu.clique to overri..." | Re-trigger Greptile |
| instance, ok := node.Annotations[topology.KeyNodeInstance] | ||
| if !ok { | ||
| klog.Warningf("missing %q annotation in node %s", topology.KeyNodeInstance, node.Name) | ||
| continue | ||
| } | ||
|
|
||
| domains.AddHost(gpuClique, instance, slurmName) |
There was a problem hiding this comment.
Nodes with
gpu.clique label but missing instance annotation are silently dropped
A node that carries nvidia.com/gpu.clique (meaning the GPU Operator confirmed NVLink fabric is complete) but is missing the topograph.nvidia.com/instance annotation is skipped with only a klog.Warningf. If a clique spans, say, 8 nodes and one is missing the annotation, the generated block will contain only 7 nodes — no error is surfaced to the caller. The existing len(domains) == 0 guard catches the fully-empty case, but a partially populated domain that silently excludes valid GPU nodes could confuse Slurm topology scheduling without any user-visible signal beyond a log line.
|
🌿 Preview your docs: https://nvidia-preview-pull-request-342.docs.buildwithfern.com/topograph |
Signed-off-by: Dmitry Shmulevich <dshmulevich@nvidia.com>
Description
Checklist
git commit -s).