fix(vmm): close cold-boot BSP/AP race on dual-socket NUMA hosts#5897
fix(vmm): close cold-boot BSP/AP race on dual-socket NUMA hosts#5897JackThomson2 wants to merge 1 commit into
Conversation
a63e73c to
c3a7c46
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5897 +/- ##
=======================================
Coverage 83.00% 83.00%
=======================================
Files 277 277
Lines 30132 30139 +7
=======================================
+ Hits 25010 25016 +6
- Misses 5122 5123 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
ca3c633 to
19975b2
Compare
|
After taking a closer look I think this PR add too much complexity to the boot process while it does not provide a lot of value, especially considering that the original issue seems to be fixed by newer versions of FC. I guess we can close this and the original issue. @JackThomson2 wdyt? |
|
@ShadowCurse I think we should take just 85fa43a |
|
@JackThomson2 yeah, this one seems reasonable. Will you update this PR or create a new one with this one commit? |
Set the BSP to RUNNABLE and APs to INIT_RECEIVED during x86_64 vCPU configuration. This mirrors the expected wait-for-SIPI startup state and avoids KVM's UNINITIALIZED short-circuit before APs process INIT/SIPI on their first KVM_RUN. Refs: firecracker-microvm#5744 Signed-off-by: Jack Thomson <jackabt@amazon.com>
4e4ea18 to
a0fc29e
Compare
|
@ShadowCurse updated this PR |
| crate::arch::x86_64::regs::setup_sregs(guest_mem, &self.fd, kernel_entry_point.protocol)?; | ||
| crate::arch::x86_64::interrupts::set_lint(&self.fd)?; | ||
|
|
||
| // BSP runnable, APs wait-for-SIPI. |
There was a problem hiding this comment.
The commit message doesn't explain why we're doing this and what are we fixing.
There was a problem hiding this comment.
This mirrors the expected wait-for-SIPI startup
state and avoids KVM's UNINITIALIZED short-circuit before APs
process INIT/SIPI on their first KVM_RUN.
Doesn't this explain it?
There was a problem hiding this comment.
It doesn't explain why we never had to do this and we now do. Are we fixing something? And if so when is it triggered and what is the failure mode?
avoids KVM's UNINITIALIZED short-circuit before APs process INIT/SIPI on their first KVM_RUN
I don't understand what this means.
From context from offline discussions I think you mean that there's a race condition between the BSP sending an INIT and the APs entering the guest for the first time but I wouldn't understand this from the line above.
There was a problem hiding this comment.
I now see that you have a better explanation in the PR description but that is not part of git history.
There was a problem hiding this comment.
So do I understand correctly that this fix can be phrased as: "We set secondary vcpus to KVM_MP_STATE_INIT_RECEIVED manually here, even though they will go into this state on the very first KVM_RUN. The reason we do this is to prevent a situation when the main vcpu tries to wake up secondary vcpu which has not yet entered the KVM_RUN loop and is still in the UNINITIALIZED state."?
There was a problem hiding this comment.
which has not yet entered the KVM_RUN loop and is still in the UNINITIALIZED state
And why is that a problem? The very definition of KVM_MP_STATE_UNINITIALIZED is:
the vcpu is an application processor (AP) which has not yet received an INIT signal [x86]
So it is the state the vCPU should be in.
There was a problem hiding this comment.
Ok, in that case there is nothing for us to do for secondary vcpus. And for the BSP it seems there is no reason to do anything as well since everything already works right now.
| mp_state: if self.index == 0 { | ||
| KVM_MP_STATE_RUNNABLE | ||
| } else { | ||
| KVM_MP_STATE_INIT_RECEIVED |
There was a problem hiding this comment.
Will the BSP deliver the INIT again and what happens then?
There was a problem hiding this comment.
It can do, its just going to set state to KVM_MP_STATE_INIT_RECEIVED so a no-op.
There was a problem hiding this comment.
Say the BSP sends both the INIT and the SIPI before we ever schedule the AP (which is the race condition you are describing IIUC). And then we enter the AP for the first time with the state being KVM_MP_STATE_INIT_RECEIVED. What happens then? Won't the AP be blocked waiting for a SIPI that will never come?
There was a problem hiding this comment.
As you say KVM does latch to the INIT + SIPI, so this is a non issue, the INIT is processed first then the SIPI is. The issue is if the SIPI arrives before the INIT. In this case KVM discards it: https://github.com/torvalds/linux/blob/9ecfb2f7287a967b418ba69f10d45ead0d360593/arch/x86/kvm/lapic.c#L3566
For reference QEMU explicitly set the state to INIT_RECEIVED as well: https://github.com/qemu/qemu/blob/3b50303f9563a42538a1fd5c0ea7f952e23016e1/target/i386/kvm/kvm.c#L2611
There was a problem hiding this comment.
The issue is if the SIPI arrives before the INIT.
But that would be a buggy guest, wouldn't it?
For reference QEMU explicitly set the state to INIT_RECEIVED as well
According to kiro that code is used when QEMU emulates the irqchip (rather than the kernel)
There was a problem hiding this comment.
Yes but they get upgraded in the path I linked here: https://github.com/qemu/qemu/blob/3b50303f9563a42538a1fd5c0ea7f952e23016e1/target/i386/kvm/kvm.c#L2611
This can't be for the emulated irq as it's never in that state
There was a problem hiding this comment.
Like I said before I'm not very familiar with that code, but I had both kiro and claude look at it and they're both telling me that that kvm_arch_do_init_vcpu() function is not on the cold-boot path for an in-kernel irqchip.
- At cold boot, QEMU's kvm_arch_reset_vcpu does: if (kvm_irqchip_in_kernel()) { mp_state = bsp ? RUNNABLE : UNINITIALIZED }
(kvm.c:2558-2560). So with an in-kernel irqchip, QEMU sets APs to UNINITIALIZED — identical to current Firecracker, and the opposite of
what the PR does.- The INIT_RECEIVED flip the author points to lives in kvm_arch_do_init_vcpu, whose only caller is do_cpu_init, gated on
CPU_INTERRUPT_INIT being pending (kvm.c:6071-6075).- In kvm.c, CPU_INTERRUPT_INIT is raised in exactly one place: from events.smi.latched_init (kvm.c:5591-5593) — an INIT latched while in
SMM, retrieved via KVM_GET_VCPU_EVENTS. That is not a cold-boot event.- For an in-kernel LAPIC, INIT/SIPI are handled entirely inside KVM and never exit to userspace; kvm_arch_process_async_events even
early-returns at kvm.c:6077 (if (kvm_irqchip_in_kernel()) return 0;) before any SIPI handling.
Do you have any evidence to the contrary? I.e. a concrete code path that shows this is getting called when the LAPIC is emulated by KVM? Firecracker doesn't expose SMM either.
This can't be for the emulated irq as it's never in that state
I don't understand what this means
There was a problem hiding this comment.
Well in the emulated irq it's set to runnable, not uninitialised. So the section linked can't be for emulated right?
There was a problem hiding this comment.
What does one function have to do with the other? Again, can you list the concrete code path that leads to that call?
There was a problem hiding this comment.
Or alternatively you can start QEMU (with kernel_irqchip=on,smm=off) in gdb, set a breakpoint at kvm_arch_do_init_vcpu() and see if you hit it.
|
|
||
| // BSP runnable, APs wait-for-SIPI. | ||
| let mp_state = kvm_mp_state { | ||
| mp_state: if self.index == 0 { |
There was a problem hiding this comment.
Is the BSP == 0 a convention we use elsewhere in the code? Do we need a constant for this?
|
KVM latches the INIT/SIPI, so I don't know that there's really a race condition there |
Changes
With the current approach of vCPU initialisation it leaves a potential race condition
in which the BP enters the kernel and attempts to init AP's before they are initialised
and ready to start up.
Update the flow so that AP's are started into vCPU run before start BP to eliminate
this race condition.
Reason
Fixing #5744
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.PR Checklist
tools/devtool checkbuild --allto verify that the PR passesbuild checks on all supported architectures.
tools/devtool checkstyleto verify that the PR passes theautomated style checks.
how they are solving the problem in a clear and encompassing way.
in the PR.
CHANGELOG.md.Runbook for Firecracker API changes.
integration tests.
TODO.rust-vmm.