The sparx5_stats_init() function starts a worker thread which needs to be cleaned up. Move the initialization code to probe() and add a deinit() function for proper teardown. Also, rename sparx_stats_init() to sparx5_stats_init() to match the driver naming convention. Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-4-10ba54ccf005@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move the calendar initialization from sparx5_start() to probe() by creating a new sparx5_calendar_init() wrapper function that calls both sparx5_config_auto_calendar() and sparx5_config_dsm_calendar(). Calendar initialization does not require cleanup. Also, make the individual calendar config functions static since they are now only called from within sparx5_calendar.c. Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-5-10ba54ccf005@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move sparx5_pgid_init(), sparx5_vlan_init(), and sparx5_board_init() from sparx5_start() to probe(). These functions do not require cleanup. Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-6-10ba54ccf005@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move the PTP IRQ request into sparx5_ptp_init() so all PTP setup is done in one place. Also move the sparx5_ptp_init() call to right before sparx5_register_netdevs() and add a cleanup_ptp label. Update remove() to disable the PTP IRQ and reorder ptp_deinit accordingly. Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-7-10ba54ccf005@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move the Frame DMA and register-based extraction initialization out of sparx5_start() and into a new sparx5_frame_io_init() function, called from probe(). Also, add sparx5_frame_io_deinit() for the cleanup path. Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-8-10ba54ccf005@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With all subsystem initializations moved out, sparx5_start() only sets up forwarding (UPSIDs, CPU ports, masks, PGIDs, FCS, watermarks). Rename it to sparx5_forwarding_init() and make it void since it cannot fail. This removes sparx5_start() entirely. Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-9-10ba54ccf005@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Machon says:
====================
net: sparx5: clean up probe/remove init and deinit paths
This series refactors the sparx5 init and deinit code out of
sparx5_start() and into probe(), adding proper per-subsystem cleanup
labels and deinit functions.
Currently, the sparx5 driver initializes most subsystems inside
sparx5_start(), which is called from probe(). This includes registering
netdevs, starting worker threads for stats and MAC table polling,
requesting PTP IRQs, and initializing VCAP. The function has grown to
handle many unrelated subsystems, and has no granular error handling —
it either succeeds entirely or returns an error, leaving cleanup to a
single catch-all label in probe().
The remove() path has a similar problem: teardown is not structured as
the reverse of initialization, and several subsystems lack proper deinit
functions. For example, the stats workqueue has no corresponding
cleanup, and the mact workqueue is destroyed without first cancelling
its delayed work.
Refactor this by moving each init function out of sparx5_start() and
into probe(), with a corresponding goto-based cleanup label. Add deinit
functions for subsystems that allocate resources, to properly cancel
work and destroy workqueues. Ensure that cleanup order in both error
paths and remove() follows the reverse of initialization order.
sparx5_start() is eliminated entirely — its hardware register setup
is renamed to sparx5_forwarding_init() and its FDMA/XTR setup is
extracted to sparx5_frame_io_init().
Before this series, most init functions live inside sparx5_start() with
no individual cleanup:
probe():
  sparx5_start():                    <- no granular error handling
    sparx5_mact_init()
    sparx_stats_init()               <- starts worker, no cleanup
    mact_queue setup                 <- no cancel on teardown
    sparx5_register_netdevs()
    sparx5_register_notifier_blocks()
    sparx5_vcap_init()
    sparx5_ptp_init()

probe() error path:
  cleanup_ports:
    sparx5_cleanup_ports()
    destroy_workqueue(mact_queue)
After this series, probe() initializes subsystems in order with
matching cleanup labels, and remove() tears down in reverse:
probe():
  sparx5_pgid_init()
  sparx5_vlan_init()
  sparx5_board_init()
  sparx5_forwarding_init()
  sparx5_calendar_init()            -> cleanup_ports
  sparx5_qos_init()                 -> cleanup_ports
  sparx5_vcap_init()                -> cleanup_ports
  sparx5_mact_init()                -> cleanup_vcap
  sparx5_stats_init()               -> cleanup_mact
  sparx5_frame_io_init()            -> cleanup_stats
  sparx5_ptp_init()                 -> cleanup_frame_io
  sparx5_register_netdevs()         -> cleanup_ptp
  sparx5_register_notifier_blocks() -> cleanup_netdevs

remove():
  sparx5_unregister_notifier_blocks()
  sparx5_unregister_netdevs()
  sparx5_ptp_deinit()
  sparx5_frame_io_deinit()
  sparx5_stats_deinit()
  sparx5_mact_deinit()
  sparx5_vcap_deinit()
  sparx5_destroy_netdevs()
====================
Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-0-10ba54ccf005@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipv6_stub is never NULL, let's remove this test. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260228175715.1195536-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The new test exercises paths where RTNL is needed, to
catch lockdep splats:
setsockopt
MRT_INIT / MRT_DONE
MRT_ADD_VIF / MRT_DEL_VIF
MRT_ADD_MFC / MRT_DEL_MFC / MRT_ADD_MFC_PROXY / MRT_DEL_MFC_PROXY
MRT_TABLE
MRT_FLUSH
rtnetlink
RTM_NEWROUTE
RTM_DELROUTE
NETDEV_UNREGISTER
I will extend this to cover IPv6 setsockopt() later.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These fields in struct mr_table are updated in ip_mroute_setsockopt() under RTNL: * mroute_do_pim * mroute_do_assert * mroute_do_wrvifwhole However, ip_mroute_getsockopt() does not hold RTNL and reads the first two fields locklessly, and ip_mr_forward() reads all three under RCU. pim_rcv_v1() also reads mroute_do_pim locklessly. Let's use WRITE_ONCE() and READ_ONCE() for them. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net->ipv4.mr_tables is updated under RTNL and can be read safely under RCU. Once created, the multicast route tables are not removed until netns dismantle. ipmr_rtm_dumplink() does not need RTNL protection for ipmr_for_each_table() and ipmr_fill_table() if RCU is held. Even if mrt->maxvif changes concurrently, ipmr_fill_vif() returns true to continue dumping the next table. Let's convert it to RCU. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
mroute_msgsize() calculates skb size needed for ipmr_fill_mroute(). The size differs based on mrt->maxvif. We will drop RTNL for ipmr_rtm_getroute() and mrt->maxvif may change under RCU. To avoid -EMSGSIZE, let's calculate the size with the maximum value of mrt->maxvif, MAXVIFS. struct rtnexthop is 8 bytes and MAXVIFS is 32, so the maximum delta is 256 bytes, which is small enough. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipmr_rtm_getroute() calls __ipmr_get_table(), ipmr_cache_find(), and ipmr_fill_mroute(). The table is not removed until netns dismantle, and net->ipv4.mr_tables is managed with RCU list API, so __ipmr_get_table() is safe under RCU. struct mfc_cache is freed by mr_cache_put() after RCU grace period, so we can use ipmr_cache_find() under RCU. rcu_read_lock() around it was just to avoid lockdep splat for rhl_for_each_entry_rcu(). ipmr_fill_mroute() calls mr_fill_mroute(), which properly uses RCU. Let's drop RTNL for ipmr_rtm_getroute() and use RCU instead. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipmr_rtm_dumproute() calls mr_table_dump() or mr_rtm_dumproute(), and mr_rtm_dumproute() finally calls mr_table_dump(). mr_table_dump() calls the passed function, _ipmr_fill_mroute(). _ipmr_fill_mroute() is a wrapper of ipmr_fill_mroute() to cast struct mr_mfc * to struct mfc_cache *. ipmr_fill_mroute() can be already called safely under RCU. Let's convert ipmr_rtm_dumproute() to RCU. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-7-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is a prep commit to convert ipmr_net_exit_batch() to ->exit_rtnl(). Let's move unregister_netdevice_many() in mroute_clean_tables() to its callers. As a bonus, mrtsock_destruct() can do batching for all tables. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-8-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is a prep commit to convert ipmr_net_exit_batch() to ->exit_rtnl(). Let's move unregister_netdevice_many() in ipmr_free_table() to its callers. Now ipmr_rules_exit() can do batching for all tables per netns. Note that later we will remove RTNL and unregister_netdevice_many() in ipmr_rules_init(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-9-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipmr_net_ops uses ->exit_batch() to acquire RTNL only once for dying network namespaces. ipmr does not depend on the ordering of ->exit_rtnl() and ->exit_batch() of other pernet_operations (unlike fib_net_ops). Once ipmr_free_table() is called and all devices are queued for destruction in ->exit_rtnl(), later during NETDEV_UNREGISTER, ipmr_device_event() will not see anything in vif table and just do nothing. Let's convert ipmr_net_exit_batch() to ->exit_rtnl(). Note that fib_rules_unregister() does not need RTNL and we will remove RTNL and unregister_netdevice_many() in ipmr_net_init(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-10-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When ipmr_free_table() is called from ipmr_rules_init() or ipmr_net_init(), the netns is not yet published. Thus, no device should have been registered, and mroute_clean_tables() will not call vif_delete(), so unregister_netdevice_many() is unnecessary. unregister_netdevice_many() does nothing if the list is empty, but it requires RTNL due to the unconditional ASSERT_RTNL() at the entry of unregister_netdevice_many_notify(). Let's remove unnecessary RTNL and ASSERT_RTNL() and instead add WARN_ON_ONCE() in ipmr_free_table(). Note that we use a local list for the new WARN_ON_ONCE() because dev_kill_list passed from ipmr_rules_exit_rtnl() may have some devices when another ops->init() fails after ipmr during setup_net(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-11-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
fib_rules_unregister() removes ops from net->rules_ops under spinlock, calls ops->delete() for each rule, and frees the ops. ipmr_rules_ops_template does not have ->delete(), and no operation there requires RTNL. Let's move fib_rules_unregister() from ipmr_rules_exit_rtnl() to ipmr_net_exit(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-12-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
…ROUTE. net->ipv4.ipmr_notifier_ops and net->ipv4.ipmr_seq are used only in net/ipv4/ipmr.c. Let's move these definitions under CONFIG_IP_MROUTE. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-13-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will no longer hold RTNL for ipmr_mfc_add() and ipmr_mfc_delete(). An MFC entry can be loosely connected with a VIF by its index for mrt->vif_table[] (stored in mfc_parent), but the two tables are not synchronised; i.e., even if VIF 1 is removed, the MFC entry for VIF 1 is not automatically removed. The only field that the MFC/VIF interfaces share is net->ipv[46].ipmr_seq, which is protected by RTNL. Adding a new mutex for both just to protect a single field is overkill. Let's convert the field to atomic_t. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-14-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will no longer hold RTNL for ipmr_rtm_route() to modify the MFC hash table. Only __dev_get_by_index() in rtm_to_ipmr_mfcc() depends on RTNL; otherwise, we just need protection for mrt->mfc_hash and mrt->mfc_cache_list. Let's add a new mutex for ipmr_mfc_add(), ipmr_mfc_delete(), and mroute_clean_tables() (setsockopt(MRT_FLUSH or MRT_DONE)). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-15-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipmr_mfc_add() and ipmr_mfc_delete() are already protected by a dedicated mutex. rtm_to_ipmr_mfcc() calls __ipmr_get_table(), __dev_get_by_index(), and ipmr_find_vif(). Once __dev_get_by_index() is converted to dev_get_by_index_rcu(), we can move the other two functions under that same RCU section and drop RTNL for ipmr_rtm_route(). Let's do that conversion and drop ASSERT_RTNL() in mr_call_mfc_notifiers(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-16-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima says: ==================== ipmr: No RTNL for RTNL_FAMILY_IPMR rtnetlink. This series removes RTNL from ipmr rtnetlink handlers. After this series, there are a few RTNL left in net/ipv4/ipmr.c and such users will be converted to per-netns RTNL in another series. Patch 1 adds a selftest to exercise most? of the RTNL paths in net/ipv4/ipmr.c Patches 2 - 6 convert RTM_GETLINK / RTM_GETROUTE handlers to RCU. Patches 7 - 9 convert ->exit_batch() to ->exit_rtnl() to save one RTNL in cleanup_net(). Patches 10 - 11 remove unnecessary RTNL during setup_net() failure. Patch 12 is a random cleanup. Patches 13 - 15 drop RTNL for RTM_NEWROUTE and RTM_DELROUTE. ==================== Link: https://patch.msgid.link/20260228221800.1082070-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit c92c81d ("net: dccp: fix kernel crash on module load") added inet_hashinfo2_init_mod() for DCCP. Commit 22d6c9e ("net: Unexport shared functions for DCCP.") removed its EXPORT_SYMBOL_GPL() but forgot to remove the function itself. Let's remove inet_hashinfo2_init_mod(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260301063756.1581685-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
kernel-doc reports function parameters not described for parameters that are not named. Add parameter names for these functions and then describe the function parameters in kernel-doc format. Fixes these warnings: Warning: include/linux/atmdev.h:316 function parameter '' not described in 'register_atm_ioctl' Warning: include/linux/atmdev.h:321 function parameter '' not described in 'deregister_atm_ioctl' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20260228220845.2978547-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
…WC timeout The GF stats periodic query is used as a mechanism to monitor HWC health. If this HWC command times out, it is a strong indication that the device/SoC is in a faulty state and requires recovery. Today, when a timeout is detected, the driver marks hwc_timeout_occurred, clears cached stats, and stops rescheduling the periodic work. However, the device itself is left in the same failing state. Extend the timeout handling path to trigger the existing MANA VF recovery service by queueing a GDMA_EQE_HWC_RESET_REQUEST work item. This is expected to initiate the appropriate recovery flow by attempting suspend/resume first and, if that fails, triggering a bus rescan. This change is intentionally limited to HWC command timeouts and does not trigger recovery for errors reported by the SoC as a normal command response. Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/aaFShvKnwR5FY8dH@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
EPPKE authentication can use SAE (via PASN), so give the AP more time to respond in the EPPKE case, just like for SAE. Link: https://patch.msgid.link/20260128132414.881741-2-johannes@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Replace the five per-variant zl3073x_chip_info structures and their exported symbol definitions with a single consolidated chip ID lookup table. The chip variant is now detected at runtime by reading the chip ID register from hardware and looking it up in the table, rather than being selected at compile time via the bus driver match data. Repurpose struct zl3073x_chip_info to hold a single chip ID, its channel count, and a flags field. Introduce enum zl3073x_flags with ZL3073X_FLAG_REF_PHASE_COMP_32 to replace the chip_id switch statement in zl3073x_dev_is_ref_phase_comp_32bit(). Store a pointer to the detected chip_info entry in struct zl3073x_dev for runtime access. This simplifies the bus drivers by removing per-variant .data and .driver_data references from the I2C/SPI match tables, and makes adding support for new chip variants a single-line table addition. Signed-off-by: Ivan Vecera <ivecera@redhat.com> Link: https://patch.msgid.link/20260227105300.710272-2-ivecera@redhat.com Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Some zl3073x chip variants (0x1Exx, 0x2Exx and 0x3FC4) provide a die temperature status register with 0.1 C resolution. Add a ZL3073X_FLAG_DIE_TEMP chip flag to identify these variants and implement zl3073x_dpll_temp_get() as the dpll_device_ops.temp_get callback. The register value is converted from 0.1 C units to millidegrees as expected by the DPLL subsystem. To support per-instance ops selection, copy the base dpll_device_ops into struct zl3073x_dpll and conditionally set .temp_get during device registration based on the chip flag. Signed-off-by: Ivan Vecera <ivecera@redhat.com> Link: https://patch.msgid.link/20260227105300.710272-3-ivecera@redhat.com Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
bitmap_empty() is less verbose and more efficient, as it stops traversing
{r,t}xq_ena as soon as the first set bit is found.
Signed-off-by: Yury Norov <ynorov@nvidia.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Fix spelling errors in code comments: - e1000_nvm.c: 'likley' -> 'likely' - e1000_mac.c: 'auto-negotitation' -> 'auto-negotiation' - e1000_mbx.h: 'exra' -> 'extra' - e1000_defines.h: 'Aserted' -> 'Asserted' Signed-off-by: Maximilian Pezzullo <maximilianpezzullo@gmail.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Joe Damato <joe@dama.to>
Fix spelling errors in code comments: - igc_diag.c: 'autonegotioation' -> 'autonegotiation' - igc_main.c: 'revisons' -> 'revisions' (two occurrences) Signed-off-by: Maximilian Pezzullo <maximilianpezzullo@gmail.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Joe Damato <joe@dama.to>
Many ethernet drivers report xdp Rx queue frag size as being the same as DMA write size. However, the only user of this field, namely bpf_xdp_frags_increase_tail(), clearly expects a truesize. Such difference leads to unspecific memory corruption issues under certain circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses all DMA-writable space in 2 buffers. This would be fine, if only rxq->frag_size was properly set to 4K, but value of 3K results in a negative tailroom, because there is a non-zero page offset. We are supposed to return -EINVAL and be done with it in such case, but due to tailroom being stored as an unsigned int, it is reported to be somewhere near UINT_MAX, resulting in a tail being grown, even if the requested offset is too much (it is around 2K in the abovementioned test). This later leads to all kinds of unspecific calltraces. [ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6 [ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4 [ 7340.338179] in libc.so.6[61c9d,7f4161aaf000+160000] [ 7340.339230] in xskxceiver[42b5,400000+69000] [ 7340.340300] likely on CPU 6 (core 0, socket 6) [ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe [ 7340.340888] likely on CPU 3 (core 0, socket 3) [ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7 [ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI [ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 
PREEMPT(lazy) [ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014 [ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80 [ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89 [ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202 [ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010 [ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff [ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0 [ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0 [ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500 [ 7340.418229] FS: 0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000 [ 7340.419489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0 [ 7340.421237] PKRU: 55555554 [ 7340.421623] Call Trace: [ 7340.421987] <TASK> [ 7340.422309] ? softleaf_from_pte+0x77/0xa0 [ 7340.422855] swap_pte_batch+0xa7/0x290 [ 7340.423363] zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270 [ 7340.424102] zap_pte_range+0x281/0x580 [ 7340.424607] zap_pmd_range.isra.0+0xc9/0x240 [ 7340.425177] unmap_page_range+0x24d/0x420 [ 7340.425714] unmap_vmas+0xa1/0x180 [ 7340.426185] exit_mmap+0xe1/0x3b0 [ 7340.426644] __mmput+0x41/0x150 [ 7340.427098] exit_mm+0xb1/0x110 [ 7340.427539] do_exit+0x1b2/0x460 [ 7340.427992] do_group_exit+0x2d/0xc0 [ 7340.428477] get_signal+0x79d/0x7e0 [ 7340.428957] arch_do_signal_or_restart+0x34/0x100 [ 7340.429571] exit_to_user_mode_loop+0x8e/0x4c0 [ 7340.430159] do_syscall_64+0x188/0x6b0 [ 7340.430672] ? __do_sys_clone3+0xd9/0x120 [ 7340.431212] ? switch_fpu_return+0x4e/0xd0 [ 7340.431761] ? 
arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0 [ 7340.432498] ? do_syscall_64+0xbb/0x6b0 [ 7340.433015] ? __handle_mm_fault+0x445/0x690 [ 7340.433582] ? count_memcg_events+0xd6/0x210 [ 7340.434151] ? handle_mm_fault+0x212/0x340 [ 7340.434697] ? do_user_addr_fault+0x2b4/0x7b0 [ 7340.435271] ? clear_bhb_loop+0x30/0x80 [ 7340.435788] ? clear_bhb_loop+0x30/0x80 [ 7340.436299] ? clear_bhb_loop+0x30/0x80 [ 7340.436812] ? clear_bhb_loop+0x30/0x80 [ 7340.437323] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 7340.437973] RIP: 0033:0x7f4161b14169 [ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f. [ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca [ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169 [ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990 [ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff [ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0 [ 7340.444586] </TASK> [ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg [ 7340.449650] ---[ end trace 0000000000000000 ]--- The issue can be fixed in all in-tree drivers, but we cannot just trust OOT drivers to not do this. Therefore, make tailroom a signed int and produce a warning when it is negative to prevent such mistakes in the future. 
Fixes: bf25146 ("bpf: add frags support to the bpf_xdp_adjust_tail() API") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Currently, the generic XDP hook uses xdp_rxq_info from netstack Rx queues, which do not have an XDP memory model registered. This matters when an XDP program calls the bpf_xdp_adjust_tail() BPF helper, which releases the underlying memory once it consumes enough bytes and the XDP buffer has fragments. For this action, the memory model knowledge passed to the XDP program is crucial, so that the core can call a suitable function for freeing/recycling the page. For netstack queues it defaults to MEM_TYPE_PAGE_SHARED (0) due to the lack of mem model registration. The problem fixed here occurs when the kernel has copied the skb to a new buffer backed by the system's page_pool and the XDP buffer is built around it. When bpf_xdp_adjust_tail() then calls __xdp_return(), it acts incorrectly because the mem type is not set to MEM_TYPE_PAGE_POOL, causing a page leak. Pull the code that inits/prepares the xdp_buff out of bpf_prog_run_generic_xdp() into a new helper, xdp_convert_skb_to_buff(), and embed there the rxq mem_type initialization that is assigned to the xdp_buff. Make it agnostic to the current skb->data position. This problem was triggered by syzbot as well as by the AF_XDP test suite, which is about to be integrated into BPF CI. Reported-by: syzbot+ff145014d6b0ce64a173@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6756c37b.050a0220.a30f1.019a.GAE@google.com/ Fixes: e6d5dbd ("xdp: add multi-buff support for xdp running in generic mode") Tested-by: Ihor Solodrai <ihor.solodrai@linux.dev> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Co-developed-by: Octavian Purdila <tavip@google.com> Signed-off-by: Octavian Purdila <tavip@google.com> # whole analysis, testing, initiating a fix Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # commit msg and proposed more robust fix
When an skb's headroom is not sufficient for XDP purposes,
skb_pp_cow_data() returns a new skb with the requested headroom. This
skb is provided by page_pool.
With CONFIG_DEBUG_VM=y and an XDP program that uses bpf_xdp_adjust_tail()
against an skb with frags, where the helper consumed enough bytes
that the page was released, the following splat was observed:
[ 32.204881] BUG: Bad page state in process test_progs pfn:11c98b
[ 32.207167] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11c98b
[ 32.210084] flags: 0x1fffe0000000000(node=0|zone=1|lastcpupid=0x7fff)
[ 32.212493] raw: 01fffe0000000000 dead000000000040 ff11000123c9b000 0000000000000000
[ 32.218056] raw: 0000000000000000 0000000000000001 00000000ffffffff 0000000000000000
[ 32.220900] page dumped because: page_pool leak
[ 32.222636] Modules linked in: bpf_testmod(O) bpf_preload
[ 32.224632] CPU: 6 UID: 0 PID: 3612 Comm: test_progs Tainted: G O 6.17.0-rc5-gfec474d29325 #6969 PREEMPT
[ 32.224638] Tainted: [O]=OOT_MODULE
[ 32.224639] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 32.224641] Call Trace:
[ 32.224644] <IRQ>
[ 32.224646] dump_stack_lvl+0x4b/0x70
[ 32.224653] bad_page.cold+0xbd/0xe0
[ 32.224657] __free_frozen_pages+0x838/0x10b0
[ 32.224660] ? skb_pp_cow_data+0x782/0xc30
[ 32.224665] bpf_xdp_shrink_data+0x221/0x530
[ 32.224668] ? skb_pp_cow_data+0x6d1/0xc30
[ 32.224671] bpf_xdp_adjust_tail+0x598/0x810
[ 32.224673] ? xsk_destruct_skb+0x321/0x800
[ 32.224678] bpf_prog_004ac6bb21de57a7_xsk_xdp_adjust_tail+0x52/0xd6
[ 32.224681] veth_xdp_rcv_skb+0x45d/0x15a0
[ 32.224684] ? get_stack_info_noinstr+0x16/0xe0
[ 32.224688] ? veth_set_channels+0x920/0x920
[ 32.224691] ? get_stack_info+0x2f/0x80
[ 32.224693] ? unwind_next_frame+0x3af/0x1df0
[ 32.224697] veth_xdp_rcv.constprop.0+0x38a/0xbe0
[ 32.224700] ? common_startup_64+0x13e/0x148
[ 32.224703] ? veth_xdp_rcv_one+0xcd0/0xcd0
[ 32.224706] ? stack_trace_save+0x84/0xa0
[ 32.224709] ? stack_depot_save_flags+0x28/0x820
[ 32.224713] ? __resched_curr.constprop.0+0x332/0x3b0
[ 32.224716] ? timerqueue_add+0x217/0x320
[ 32.224719] veth_poll+0x115/0x5e0
[ 32.224722] ? veth_xdp_rcv.constprop.0+0xbe0/0xbe0
[ 32.224726] ? update_load_avg+0x1cb/0x12d0
[ 32.224730] ? update_cfs_group+0x121/0x2c0
[ 32.224733] __napi_poll+0xa0/0x420
[ 32.224736] net_rx_action+0x901/0xe90
[ 32.224740] ? run_backlog_napi+0x50/0x50
[ 32.224743] ? clockevents_program_event+0x1cc/0x280
[ 32.224746] ? hrtimer_interrupt+0x31e/0x7c0
[ 32.224749] handle_softirqs+0x151/0x430
[ 32.224752] do_softirq+0x3f/0x60
[ 32.224755] </IRQ>
This is because an xdp_rxq with its mem model set to MEM_TYPE_PAGE_SHARED
was used when initializing the xdp_buff.
Fix this by using the new helper xdp_convert_skb_to_buff() which, besides
initializing and preparing the xdp_buff, checks whether the page used for
the linear part of the xdp_buff comes from a page_pool. We assume that the
linear data and the frags have the same memory provider, as the current
XDP API gives us no way to distinguish them (the mem model is registered
for the *whole* Rx queue, while here we are dealing with single-buffer
granularity).
Before releasing the xdp_buff out of veth via XDP_{TX,REDIRECT}, the mem
type on the xdp_rxq associated with the xdp_buff is restored to its
original model. We need to respect the previous setting at least until
the buff is converted to a frame, as the frame carries the mem_type. Add
a page_pool variant of veth_xdp_get() so that we avoid a refcount
underflow when draining a page frag.
Fixes: 0ebab78 ("net: veth: add page_pool for page recycling")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reported-by: Alexei Starovoitov <ast@kernel.org>
Closes: https://lore.kernel.org/bpf/CAADnVQ+bBofJDfieyOYzSmSujSfJwDTQhiz3aJw7hE+4E2_iPA@mail.gmail.com/
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Similarly as in commit 5384467 ("iavf: kill "legacy-rx" for good"), drop skb construction logic in favor of only using napi_build_skb() as a superior option that reduces the need to allocate and copy memory. As IXGBEVF_PRIV_FLAGS_LEGACY_RX is the only private flag in ixgbevf, entirely remove private flags support from the driver. When compared to iavf changes, ixgbevf has a single complication: MAC type 82599 cannot finely limit the DMA write size with RXDCTL.RLPML, only 1024 increments through SRRCTL are available, see commit fe68195 ("ixgbevf: Require large buffers for build_skb on 82599VF") and commit 2bafa8f ("ixgbe: don't set RXDCTL.RLPML for 82599"). Therefore, this is a special case requiring legacy RX unless large buffers are used. For now, solve this by always using large buffers for this MAC type. Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Again, same as in the related iavf commit 920d86f ("iavf: drop page splitting and recycling"), as an intermediate step, drop the page sharing and recycling logic in a preparation to offload it to page_pool. Instead of the previous sharing and recycling, just allocate a new page every time. Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Use page_pool buffers by the means of libeth in the Rx queues, this significantly reduces code complexity of the driver itself. Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Add likely/unlikely markers for better branch prediction. While touching some functions, clean up the code a little. This patch is not supposed to make any logic changes. Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Implement XDP support for received fragmented packets; this requires using some helpers from libeth_xdp. Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Use libeth to support XDP_TX action for segmented packets. Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
To fully support XDP_REDIRECT, utilize more libeth helpers in XDP Rx path, hence save cached_ntu in the ring structure instead of stack. ixgbevf-supported VFs usually have few queues, so use libeth_xdpsq_lock functionality for XDP queue sharing. Adjust filling-in of XDP Tx descriptors to use data from xdp frame. Otherwise, simply use libeth helpers to implement .ndo_xdp_xmit(). While at it, fix a typo in libeth docs. Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Introduce pseudo header split support in the ixgbevf driver, specifically targeting ixgbe_mac_82599_vf. On older hardware (e.g. ixgbe_mac_82599_vf), the RX DMA write size can only be limited in 1K increments. This causes issues when attempting to fit multiple packets per page, as a DMA write may overwrite the headroom of the next packet. To address this, introduce pseudo header split support, where the hardware copies the full L2 header into a dedicated header buffer. This avoids the need for HR/TR alignment and allows safe skb construction from the header buffer without risking overwrites. Given that, once a packet is too big to fit into a single page, the behaviour is the same for all supported HW, use pseudo header split only for smaller packets. Signed-off-by: Natalia Wochtman <natalia.wochtman@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Co-developed-by: Larysa Zaremba <larysa.zaremba@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Currently, when MTU is changed, page pool is not reconfigured, which leads to usage of suboptimal buffer sizes. Always destroy page pool when cleaning the ring up and create it anew when we first allocate Rx buffers. Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
xskxceiver attempts to change the MTU after attaching an XDP program; ixgbevf rejects the request, causing the test to fail. Support the MTU change operation even when an XDP program is already attached, performing the same frame size check as when attaching a program. Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
The same register write operation is already used twice in code, it will be used again by AF_XDP configuration. Wrap it in a helper function. Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
AF_XDP ZC Rx path is also required to implement skb creation. Move all common functions to a header file as inlines. Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Plenty of code can be shared between ZC and normal XDP Tx queues. Expose such code through the previously added header file. Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Before starting transmission, the XDP queue first fills a single context descriptor, on which we cannot check the DD bit later. This is not a problem in the case of XDP_TX and .ndo_xdp_xmit(), because preparation happens only if we already have packets to send. This is different for ZC, though: wakeup must trigger queue preparation even if no new packets are queued, hence a lone context descriptor can block completions. Modify the RS-setting logic to handle such a case. Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Implement xsk_buff_pool configuration and supporting functionality, such as a single queue pair reconfiguration. Also, properly initialize Rx buffers. Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Add code that handles Tx ZC queues inside of napi_poll(), utilizing libeth. As the NIC's multi-buffer conventions do not play nicely with AF_XDP's, leave handling of segments for later. Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Add code that handles AF_XDP ZC Rx queues inside of napi_poll(), utilize libeth helpers. Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
To finalize basic AF_XDP implementation, set features and add .ndo_xsk_wakeup() handler. TMP NOTE: IPI variant is incomplete, works through interrupts. Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Transmitting multi-buffer AF_XDP packets is not very straightforward given HW limitations in ixgbevf, namely that the first data descriptor must contain the length of the whole packet. Use private data of an sqe to store the length of an unfinished packet so far and the first descriptor index. Once EoP zero-copy descriptor is processed, write the accumulated length into the saved first descriptor. Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>