Skip to content

feat(slinky): allow gpu.clique to override provider accelerator domains#342

Open
dmitsh wants to merge 1 commit into
mainfrom
ds-slinky-acc
Open

feat(slinky): allow gpu.clique to override provider accelerator domains#342
dmitsh wants to merge 1 commit into
mainfrom
ds-slinky-acc

Conversation

@dmitsh
Copy link
Copy Markdown
Collaborator

@dmitsh dmitsh commented May 27, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • All commits are signed off per DCO (git commit -s).

@dmitsh dmitsh requested a review from ravisoundar May 27, 2026 18:40
@codecov
Copy link
Copy Markdown

codecov Bot commented May 27, 2026

Codecov Report

❌ Patch coverage is 55.38462% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.96%. Comparing base (1875ab8) to head (b1acd9a).
⚠️ Report is 68 commits behind head on main.

Files with missing lines Patch % Lines
pkg/engines/slinky/engine.go 55.38% 9 Missing and 20 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #342      +/-   ##
==========================================
+ Coverage   68.46%   69.96%   +1.50%     
==========================================
  Files          82       84       +2     
  Lines        4842     5084     +242     
==========================================
+ Hits         3315     3557     +242     
+ Misses       1395     1307      -88     
- Partials      132      220      +88     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 27, 2026

Greptile Summary

This PR adds a useGpuCliqueLabel option to the Slinky engine that lets operators source topology/block domains from the GPU Operator's existing nvidia.com/gpu.clique node label instead of provider accelerator-domain data, which is useful when the cloud provider's accelerator domain is broader than the actual NVLink clique.

  • Engine logic (pkg/engines/slinky/engine.go): Adds withGPUCliqueDomains which builds a new DomainMap from gpu.clique labels and replaces graph.Domains (leaving graph.Tiers intact for tree topology). A lazy-loading closure avoids the double Kubernetes API round-trip when both UseGPUCliqueLabel and UseDynamicNodes are active. performReconciliation now accepts the cached *clusterNodes instead of fetching again.
  • Tests (pkg/engines/slinky/engine_test.go): Four new test cases cover the happy path (whitespace trimming, node filtering, graph non-mutation), the empty-domains error path, end-to-end GenerateOutput integration, and usesBlockTopology unit coverage.
  • Charts and docs: New values.slinky.ib.block-example.yaml example, updated schema with an untyped params object on the engine block, and documentation in docs/engines/slinky.md and docs/reference/node-labels.md.

Confidence Score: 5/5

Safe to merge; the new code path is well-tested and does not affect existing behaviour when the flag is off.

The caching mechanism, graph cloning, and error propagation are all correct. The two observations are minor inconsistencies that do not affect runtime correctness for any realistic deployment.

No files require special attention; pkg/engines/slinky/engine.go has minor readability nits but no logic defects.

Important Files Changed

Filename Overview
pkg/engines/slinky/engine.go Core logic change: introduces UseGPUCliqueLabel flag, withGPUCliqueDomains function, lazy-loading clusterNodes cache, and refactored getClusterNodes returning a struct.
pkg/engines/slinky/engine_test.go Adds comprehensive tests covering happy path, error path, end-to-end GenerateOutput integration, and usesBlockTopology unit tests.
charts/topograph/values.slinky.ib.block-example.yaml New example values file for infiniband-k8s provider with useGpuCliqueLabel enabled on both provider and engine.
charts/topograph/values.schema.json Adds a params object field to the engine schema.
docs/engines/slinky.md Documents the new useGpuCliqueLabel option clearly, including the 502 failure behavior.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[GenerateOutput called\nwith graph + config] --> B[resolveTopologies\ngetTranslateConfig]
    B --> C{UseGPUCliqueLabel\nAND usesBlockTopology?}
    C -- Yes --> D[loadClusterNodes\nlazy fetch + cache]
    D --> E[withGPUCliqueDomains\nbuild domains from\nnvidia.com/gpu.clique labels]
    E --> F{domains empty?}
    F -- Yes --> G[Return 502 error]
    F -- No --> H[Clone graph,\nreplace Domains]
    C -- No --> I[translate.NewNetworkTopology\nwith original graph]
    H --> I
    I --> J[GenerateTopologyConfig]
    J --> K{ConfigUpdateMode\n!= none?}
    K -- Yes --> L[UpdateTopologyConfigmap]
    K -- No --> M{UseDynamicNodes?}
    L --> M
    M -- Yes --> N[loadClusterNodes\nreturns cached result]
    N --> O[performReconciliation\nannotate k8s nodes]
    M -- No --> P[Return OK]
    O --> P
Loading

Reviews (3): Last reviewed commit: "feat(slinky): allow gpu.clique to overri..." | Re-trigger Greptile

Comment thread pkg/engines/slinky/engine.go Outdated
Comment on lines +271 to +277
instance, ok := node.Annotations[topology.KeyNodeInstance]
if !ok {
klog.Warningf("missing %q annotation in node %s", topology.KeyNodeInstance, node.Name)
continue
}

domains.AddHost(gpuClique, instance, slurmName)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Nodes with gpu.clique label but missing instance annotation are silently dropped

A node that carries nvidia.com/gpu.clique (meaning the GPU Operator confirmed NVLink fabric is complete) but is missing the topograph.nvidia.com/instance annotation is skipped with only a klog.Warningf. If a clique spans, say, 8 nodes and one is missing the annotation, the generated block will contain only 7 nodes — no error is surfaced to the caller. The existing len(domains) == 0 guard catches the fully-empty case, but a partially populated domain that silently excludes valid GPU nodes could confuse Slurm topology scheduling without any user-visible signal beyond a log line.

@github-actions
Copy link
Copy Markdown

Signed-off-by: Dmitry Shmulevich <dshmulevich@nvidia.com>
Copy link
Copy Markdown
Collaborator

@ravisoundar ravisoundar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants