What's changing and why?
The EFA device plugin DaemonSet uses `nodeAffinity` with `supportedInstanceLabels` to determine scheduling targets. The current list in `helm_chart/HyperPodHelmChart/values.yaml` includes entire instance families (i3en, m7i, r7i) without accounting for the fact that EFA support is size-dependent within a family: only the largest sizes in these families expose EFA network interfaces.
When the plugin is scheduled on a non-EFA node, it calls `Fetching EFA devices` → fails with `No valid EFA devices found` → exits non-zero → kubelet restarts it → CrashLoopBackOff.
This PR aligns the `supportedInstanceLabels` with the actual EFA capability matrix, verified against `aws ec2 describe-instance-types --filters Name=network-info.efa-supported,Values=true` and AWS EFA Supported Instance Types.
Affected instance families:
Before/After UX
Before:
The EFA device plugin pod is scheduled on non-EFA nodes (e.g. ml.i3en.6xlarge), crashes immediately with `No valid EFA devices found`, and enters CrashLoopBackOff (keeps restarting). Customers need to manually diagnose the issue, identify the DaemonSet misconfiguration, and either edit the DaemonSet `nodeAffinity` or tolerate the noise in monitoring/alerting.
After:
EFA device plugin pod is only scheduled on nodes that actually have EFA hardware. No unnecessary CrashLoopBackOff, no manual intervention needed.
How was this change tested?
Verified the `supportedInstanceLabels` list against the output of `aws ec2 describe-instance-types --filters Name=network-info.efa-supported,Values=true` in us-west-2.
Are unit tests added?
No — this is a configuration-only change (values.yaml).
Are integration tests added?
No — this is a configuration-only change (values.yaml).
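Although no automated tests are added, the manual comparison described under "How was this change tested?" could be scripted roughly as follows. This is a minimal sketch, not part of the PR: the function names `efa_capable_types` and `labels_without_efa` are hypothetical, the sample data is illustrative only, and it assumes the chart's labels are EC2 instance types with an `ml.` prefix (as in `ml.i3en.6xlarge`). It expects the JSON that `aws ec2 describe-instance-types` prints with `--output json`.

```python
import json
from typing import List, Set


def efa_capable_types(describe_output: str) -> Set[str]:
    """Collect instance types marked EFA-capable in describe-instance-types JSON."""
    data = json.loads(describe_output)
    return {
        item["InstanceType"]
        for item in data.get("InstanceTypes", [])
        if item.get("NetworkInfo", {}).get("EfaSupported", False)
    }


def labels_without_efa(labels: List[str], capable: Set[str], prefix: str = "ml.") -> List[str]:
    """Return labels whose underlying EC2 type is absent from the EFA-capable set.

    Assumption: labels carry an "ml." prefix that must be stripped before comparing.
    """
    stale = []
    for label in labels:
        ec2_type = label[len(prefix):] if label.startswith(prefix) else label
        if ec2_type not in capable:
            stale.append(label)
    return stale


# Illustrative input only; real input would be the saved CLI output and the
# supportedInstanceLabels list read from values.yaml.
sample_output = json.dumps({
    "InstanceTypes": [
        {"InstanceType": "i3en.24xlarge", "NetworkInfo": {"EfaSupported": True}},
    ]
})
sample_labels = ["ml.i3en.6xlarge", "ml.i3en.24xlarge"]

stale = labels_without_efa(sample_labels, efa_capable_types(sample_output))
print(stale)  # prints ['ml.i3en.6xlarge']
```

Any label reported by such a check would be a candidate for removal from `supportedInstanceLabels`, since the plugin pod would be scheduled onto a node type without EFA devices.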
Reviewer Guidelines