Skip to content

Remove non-EFA instance types from EFA device plugin nodeAffinity to prevent CrashLoopBackOff#389

Open
haozhx23 wants to merge 1 commit intoaws:mainfrom
haozhx23:fix/efa-device-plugin-non-efa-instance-types
Open

Remove non-EFA instance types from EFA device plugin nodeAffinity to prevent CrashLoopBackOff#389
haozhx23 wants to merge 1 commit intoaws:mainfrom
haozhx23:fix/efa-device-plugin-non-efa-instance-types

Conversation

@haozhx23
Copy link

What's changing and why?

The EFA device plugin DaemonSet uses nodeAffinity with supportedInstanceLabels to determine scheduling targets. The current list in helm_chart/HyperPodHelmChart/values.yaml includes entire instance families (i3en, m7i, r7i) without accounting for the fact that EFA support is size-dependent within a family — only the largest sizes in these families expose EFA network interfaces.

When the plugin is scheduled on a non-EFA node, it calls Fetching EFA devices → fails with No valid EFA devices found → exits non-zero → kubelet restarts it → CrashLoopBackOff.

This PR aligns the supportedInstanceLabels with the actual EFA capability matrix, verified against aws ec2 describe-instance-types --filters Name=network-info.efa-supported,Values=true and AWS EFA Supported Instance Types:

Affected instance families:

  • i3en: large, xlarge, 2xlarge, 3xlarge, 6xlarge (only 12xlarge+ supports EFA)
  • m7i: large ~ 24xlarge (only 48xlarge supports EFA)
  • r7i: large ~ 24xlarge (only 48xlarge supports EFA)

Before/After UX

Before:
EFA device plugin pod is scheduled on non-EFA nodes (e.g. ml.i3en.6xlarge), crashes immediately with No valid EFA devices found, and enters CrashLoopBackOff (keeps restarting). Customers need to manually diagnose the issue, identify the DaemonSet misconfiguration, and either edit the DaemonSet nodeAffinity or tolerate the noise in monitoring/alerting.

After:
EFA device plugin pod is only scheduled on nodes that actually have EFA hardware. No unnecessary CrashLoopBackOff, no manual intervention needed.

How was this change tested?

  • Compared the full supportedInstanceLabels list against aws ec2 describe-instance-types --filters Name=network-info.efa-supported,Values=true output in us-west-2
  • Cross-referenced with AWS EFA documentation

Are unit tests added?

No — this is a configuration-only change (values.yaml).

Are integration tests added?

No — this is a configuration-only change (values.yaml).

Reviewer Guidelines

  • All automated PR checks pass

@haozhx23 haozhx23 requested a review from a team as a code owner March 17, 2026 01:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant