ice: backport XSK queue disable/enable fixes from upstream Linux #62
Open
Shivam279Chaudhary wants to merge 1 commit into
Backport 4 upstream commits that fix race conditions during AF_XDP/XSK
socket setup/teardown causing false TX watchdog timeouts, NULL pointer
dereferences, and workqueue deadlocks:
- 99099c6bc75a ("ice: reorder disabling IRQ and NAPI in ice_qp_dis")
- 405d9999aa0b ("ice: replace synchronize_rcu with synchronize_net")
- 9da75a511c55 ("ice: toggle netif_carrier when setting up XSK pool")
- 7e3b407ccbea ("ice: remove ICE_CFG_BUSY locking from AF_XDP code")
All fix the original 2d4238f55697 ("ice: Add support for AF_XDP").
These commits were taken from torvalds/linux but required manual
adaptation due to API differences between upstream and the out-of-tree
driver.
Without this patch, hosts crash within 1-4 XSK lifecycle iterations
under TX load. With the patch applied, hosts survive 200+ iterations
under sustained outbound TX pressure (~200 Gbps).
Tested on:
- Intel E810 (PCI 8086:1592)
- Kernel 5.10.252/253 (Amazon Linux 2)
Signed-off-by: Shivam Chaudhary <shivam.chaudhary279@gmail.com>
Backport 4 upstream commits that fix race conditions during AF_XDP/XSK
socket setup/teardown causing false TX watchdog timeouts, NULL pointer
dereferences, and workqueue deadlocks on ice 2.5.4 (the same four commits
listed above). All four fix the original 2d4238f55697 ("ice: Add support for AF_XDP").
Problem
During AF_XDP/XSK socket setup/teardown under TX load, the driver's queue
reconfiguration takes longer than the 5-second watchdog timeout allows. The
kernel incorrectly concludes the NIC has hung and triggers a PF reset. The
reset frees ring pointers while concurrent operations are still accessing
them, causing a NULL pointer dereference followed by a workqueue deadlock.
The NIC never recovers, and hosts become permanently unreachable.
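To make the failure mode concrete, here is a minimal userspace model of the condition the kernel's TX watchdog (`dev_watchdog()`) checks. It is a sketch under assumptions, not driver code: `struct model_txq`, `struct model_netdev` and `model_watchdog_fires()` are hypothetical names, time is a plain millisecond counter instead of jiffies, and only the conditions relevant here are modeled (carrier up, queue stopped, no progress for longer than the timeout).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one TX queue; field names are illustrative and
 * do not match struct netdev_queue. Times are in milliseconds. */
struct model_txq {
	bool stopped;              /* queue stopped, as after netif_tx_stop_queue() */
	unsigned long trans_start; /* last time the queue made TX progress */
};

/* Hypothetical model of the net_device state the watchdog consults. */
struct model_netdev {
	bool carrier_ok;              /* mirrors netif_carrier_ok() */
	unsigned long watchdog_timeo; /* e.g. 5000 for a 5 s timeout */
};

/* Returns true when this model would report a TX timeout for txq at
 * time "now". Like dev_watchdog(), it examines nothing while the
 * carrier is down; otherwise a stopped queue with no progress for
 * longer than watchdog_timeo is treated as hung. */
bool model_watchdog_fires(const struct model_netdev *dev,
			  const struct model_txq *txq, unsigned long now)
{
	assert(dev != NULL && txq != NULL);
	if (!dev->carrier_ok)
		return false;
	return txq->stopped && now - txq->trans_start > dev->watchdog_timeo;
}
```

In this model, with the carrier up, a queue held stopped for 6 s during reconfiguration is reported as hung. With the carrier down, the same queue is ignored. The `netif_carrier_off/on` toggle relies on this effect.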
Fix
These 4 commits collectively:
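- reorder IRQ and NAPI disabling in ice_qp_dis()
- call synchronize_net() (replacing synchronize_rcu()) early in teardown (drains in-flight TX)
- toggle netif_carrier_off/on (prevents the watchdog from firing during reconfiguration)
- remove the ICE_CFG_BUSY busy-wait (it was never protecting the queue pair)
Backport methodology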
Cherry-pick was not possible due to API differences between upstream and
the out-of-tree driver. The combined intent of all four patches was applied
manually, preserving ice-2.5.4-specific code.
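The teardown ordering the four commits aim for can be sketched as a userspace model that records the order of steps. This is an illustration under assumptions, not the backported code. The enum values, `model_qp_dis()` and `step_index()` are hypothetical stand-ins for the kernel calls named in the comments. The sequence is inferred from the upstream commit titles and the Fix list rather than copied from `ice_qp_dis()`. No `ICE_CFG_BUSY` wait appears in the model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical step identifiers; each stands in for the kernel call
 * named in its comment. This is a userspace model, not driver code. */
enum qp_dis_step {
	STEP_SYNCHRONIZE_NET, /* synchronize_net(): drain in-flight TX */
	STEP_CARRIER_OFF,     /* netif_carrier_off(): keep the watchdog quiet */
	STEP_STOP_TX_QUEUE,   /* netif_tx_stop_queue() on this queue */
	STEP_DISABLE_IRQ,     /* mask the queue vector's interrupt */
	STEP_DISABLE_NAPI,    /* napi_disable() on the queue vector */
	STEP_STOP_RINGS,      /* stop the TX/XDP/RX rings */
	STEP_COUNT
};

/* Records the order in which the model performs its steps. */
struct step_log {
	enum qp_dis_step steps[STEP_COUNT];
	size_t len;
};

static void log_step(struct step_log *trace, enum qp_dis_step s)
{
	assert(trace->len < (size_t)STEP_COUNT);
	trace->steps[trace->len++] = s;
}

/* Hypothetical model of the queue-pair disable path: drain first,
 * carrier off, IRQ before NAPI, and only then touch the rings. */
void model_qp_dis(struct step_log *trace)
{
	trace->len = 0;
	log_step(trace, STEP_SYNCHRONIZE_NET);
	log_step(trace, STEP_CARRIER_OFF);
	log_step(trace, STEP_STOP_TX_QUEUE);
	log_step(trace, STEP_DISABLE_IRQ);
	log_step(trace, STEP_DISABLE_NAPI);
	log_step(trace, STEP_STOP_RINGS);
}

/* Returns the position of a step in the trace, or -1 if absent. */
int step_index(const struct step_log *trace, enum qp_dis_step s)
{
	for (size_t i = 0; i < trace->len; i++)
		if (trace->steps[i] == s)
			return (int)i;
	return -1;
}
```

This model captures the following intent. Once the queue vector's IRQ is masked, no new poll is triggered for it. Disabling NAPI afterwards waits out any poll already running, so the rings are stopped only after the datapath for that vector is quiet.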
Testing
Related: #58, #61