Skip to content

fix(vmm): close cold-boot BSP/AP race on dual-socket NUMA hosts#5897

Open
JackThomson2 wants to merge 1 commit into
firecracker-microvm:mainfrom
JackThomson2:fix/cold_boot_smp
Open

fix(vmm): close cold-boot BSP/AP race on dual-socket NUMA hosts#5897
JackThomson2 wants to merge 1 commit into
firecracker-microvm:mainfrom
JackThomson2:fix/cold_boot_smp

Conversation

@JackThomson2

@JackThomson2 JackThomson2 commented May 19, 2026

Copy link
Copy Markdown
Contributor

Changes

With the current approach of vCPU initialisation it leaves a potential race condition
in which the BP enters the kernel and attempts to init AP's before they are initialised
and ready to start up.

Update the flow so that AP's are started into vCPU run before start BP to eliminate
this race condition.

Reason

Fixing #5744

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkbuild --all to verify that the PR passes
    build checks on all supported architectures.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

@codecov

codecov Bot commented May 20, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 83.00%. Comparing base (7dff88a) to head (a0fc29e).
⚠️ Report is 27 commits behind head on main.

Files with missing lines Patch % Lines
src/vmm/src/arch/x86_64/vcpu.rs 85.71% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5897   +/-   ##
=======================================
  Coverage   83.00%   83.00%           
=======================================
  Files         277      277           
  Lines       30132    30139    +7     
=======================================
+ Hits        25010    25016    +6     
- Misses       5122     5123    +1     
Flag Coverage Δ
5.10-m5n.metal 83.30% <85.71%> (-0.01%) ⬇️
5.10-m6a.metal 82.65% <85.71%> (+<0.01%) ⬆️
5.10-m6g.metal 79.94% <ø> (ø)
5.10-m6i.metal 83.30% <85.71%> (-0.01%) ⬇️
5.10-m7a.metal-48xl 82.64% <85.71%> (+<0.01%) ⬆️
5.10-m7g.metal 79.93% <ø> (-0.01%) ⬇️
5.10-m7i.metal-24xl 83.28% <85.71%> (-0.01%) ⬇️
5.10-m7i.metal-48xl 83.27% <85.71%> (-0.02%) ⬇️
5.10-m8g.metal-24xl 79.93% <ø> (-0.01%) ⬇️
5.10-m8g.metal-48xl 79.94% <ø> (ø)
5.10-m8i.metal-48xl 83.28% <85.71%> (-0.01%) ⬇️
5.10-m8i.metal-96xl 83.28% <85.71%> (+<0.01%) ⬆️
6.1-m5n.metal 83.32% <85.71%> (-0.02%) ⬇️
6.1-m6a.metal 82.67% <85.71%> (-0.01%) ⬇️
6.1-m6g.metal 79.93% <ø> (-0.01%) ⬇️
6.1-m6i.metal 83.32% <85.71%> (-0.01%) ⬇️
6.1-m7a.metal-48xl 82.66% <85.71%> (+<0.01%) ⬆️
6.1-m7g.metal 79.93% <ø> (-0.01%) ⬇️
6.1-m7i.metal-24xl 83.34% <85.71%> (+<0.01%) ⬆️
6.1-m7i.metal-48xl 83.34% <85.71%> (-0.01%) ⬇️
6.1-m8g.metal-24xl 79.93% <ø> (-0.01%) ⬇️
6.1-m8g.metal-48xl 79.94% <ø> (+<0.01%) ⬆️
6.1-m8i.metal-48xl 83.34% <85.71%> (-0.01%) ⬇️
6.1-m8i.metal-96xl 83.35% <85.71%> (+<0.01%) ⬆️
6.18-m5n.metal 83.32% <85.71%> (-0.01%) ⬇️
6.18-m6a.metal 82.67% <85.71%> (+<0.01%) ⬆️
6.18-m6g.metal 79.93% <ø> (-0.01%) ⬇️
6.18-m6i.metal 83.32% <85.71%> (+<0.01%) ⬆️
6.18-m7a.metal-48xl 82.66% <85.71%> (+<0.01%) ⬆️
6.18-m7g.metal 79.94% <ø> (ø)
6.18-m7i.metal-24xl 83.34% <85.71%> (+<0.01%) ⬆️
6.18-m7i.metal-48xl 83.34% <85.71%> (-0.01%) ⬇️
6.18-m8g.metal-24xl 79.93% <ø> (ø)
6.18-m8g.metal-48xl 79.93% <ø> (ø)
6.18-m8i.metal-48xl 83.35% <85.71%> (+<0.01%) ⬆️
6.18-m8i.metal-96xl 83.34% <85.71%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ShadowCurse

Copy link
Copy Markdown
Contributor

After taking a closer look I think this PR add too much complexity to the boot process while it does not provide a lot of value, especially considering that the original issue seems to be fixed by newer versions of FC. I guess we can close this and the original issue. @JackThomson2 wdyt?

@JackThomson2

Copy link
Copy Markdown
Contributor Author

@ShadowCurse I think we should take just 85fa43a

@ShadowCurse

Copy link
Copy Markdown
Contributor

@JackThomson2 yeah, this one seems reasonable. Will you update this PR or create a new one with this one commit?

Set the BSP to RUNNABLE and APs to INIT_RECEIVED during x86_64
vCPU configuration. This mirrors the expected wait-for-SIPI startup
state and avoids KVM's UNINITIALIZED short-circuit before APs
process INIT/SIPI on their first KVM_RUN.

Refs: firecracker-microvm#5744
Signed-off-by: Jack Thomson <jackabt@amazon.com>
@JackThomson2

Copy link
Copy Markdown
Contributor Author

@ShadowCurse updated this PR

@JackThomson2 JackThomson2 added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Jun 18, 2026
crate::arch::x86_64::regs::setup_sregs(guest_mem, &self.fd, kernel_entry_point.protocol)?;
crate::arch::x86_64::interrupts::set_lint(&self.fd)?;

// BSP runnable, APs wait-for-SIPI.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit message doesn't explain why we're doing this and what are we fixing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mirrors the expected wait-for-SIPI startup
state and avoids KVM's UNINITIALIZED short-circuit before APs
process INIT/SIPI on their first KVM_RUN.

Doesn't this explain it?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't explain why we never had to do this and we now do. Are we fixing something? And if so when is it triggered and what is the failure mode?

avoids KVM's UNINITIALIZED short-circuit before APs process INIT/SIPI on their first KVM_RUN

I don't understand what this means.

From context from offline discussions I think you mean that there's a race condition between the BSP sending an INIT and the APs entering the guest for the first time but I wouldn't understand this from the line above.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now see that you have a better explanation in the PR description but that is not part of git history.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So do I understand correctly that this fix can be phrased as: "We set secondary vcpus to KVM_MP_STATE_INIT_RECEIVED manually here, even though they will go into this state on the very first KVM_RUN. The reason we do this is to prevent a situation when the main vcpu tries to wake up secondary vcpu which has not yet entered the KVM_RUN loop and is still in the UNINITIALIZED state."?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which has not yet entered the KVM_RUN loop and is still in the UNINITIALIZED state

And why is that a problem? The very definition of KVM_MP_STATE_UNINITIALIZED is:

the vcpu is an application processor (AP) which has not yet received an INIT signal [x86]

So it is the state the vCPU should be in.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, in that case there is nothing for us to do for secondary vcpus. And for the BSP it seems there is no reason to do anything as well since everything already works right now.

mp_state: if self.index == 0 {
KVM_MP_STATE_RUNNABLE
} else {
KVM_MP_STATE_INIT_RECEIVED

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the BSP deliver the INIT again and what happens then?

@JackThomson2 JackThomson2 Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can do, its just going to set state to KVM_MP_STATE_INIT_RECEIVED so a no-op.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Say the BSP sends both the INIT and the SIPI before we ever schedule the AP (which is the race condition you are describing IIUC). And then we enter the AP for the first time with the state being KVM_MP_STATE_INIT_RECEIVED. What happens then? Won't the AP be blocked waiting for a SIPI that will never come?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you say KVM does latch to the INIT + SIPI, so this is a non issue, the INIT is processed first then the SIPI is. The issue is if the SIPI arrives before the INIT. In this case KVM discards it: https://github.com/torvalds/linux/blob/9ecfb2f7287a967b418ba69f10d45ead0d360593/arch/x86/kvm/lapic.c#L3566

For reference QEMU explicitly set the state to INIT_RECEIVED as well: https://github.com/qemu/qemu/blob/3b50303f9563a42538a1fd5c0ea7f952e23016e1/target/i386/kvm/kvm.c#L2611

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is if the SIPI arrives before the INIT.

But that would be a buggy guest, wouldn't it?

For reference QEMU explicitly set the state to INIT_RECEIVED as well

According to kiro that code is used when QEMU emulates the irqchip (rather than the kernel)

@JackThomson2 JackThomson2 Jun 19, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but they get upgraded in the path I linked here: https://github.com/qemu/qemu/blob/3b50303f9563a42538a1fd5c0ea7f952e23016e1/target/i386/kvm/kvm.c#L2611

This can't be for the emulated irq as it's never in that state

@ilstam ilstam Jun 19, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I said before I'm not very familiar with that code, but I had both kiro and claude look at it and they're both telling me that that kvm_arch_do_init_vcpu() function is not on the cold-boot path for an in-kernel irqchip.

  • At cold boot, QEMU's kvm_arch_reset_vcpu does: if (kvm_irqchip_in_kernel()) { mp_state = bsp ? RUNNABLE : UNINITIALIZED }
    (kvm.c:2558-2560). So with an in-kernel irqchip, QEMU sets APs to UNINITIALIZED — identical to current Firecracker, and the opposite of
    what the PR does.
  • The INIT_RECEIVED flip the author points to lives in kvm_arch_do_init_vcpu, whose only caller is do_cpu_init, gated on
    CPU_INTERRUPT_INIT being pending (kvm.c:6071-6075).
  • In kvm.c, CPU_INTERRUPT_INIT is raised in exactly one place: from events.smi.latched_init (kvm.c:5591-5593) — an INIT latched while in
    SMM, retrieved via KVM_GET_VCPU_EVENTS. That is not a cold-boot event.
  • For an in-kernel LAPIC, INIT/SIPI are handled entirely inside KVM and never exit to userspace; kvm_arch_process_async_events even
    early-returns at kvm.c:6077 (if (kvm_irqchip_in_kernel()) return 0;) before any SIPI handling.

Do you have any evidence to the contrary? I.e. a concrete code path that shows this is getting called when the LAPIC is emulated by KVM? Firecracker doesn't expose SMM either.

This can't be for the emulated irq as it's never in that state

I don't understand what this means

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well in the emulated irq it's set to runnable, not uninitialised. So the section linked can't be for emulated right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does one function have to do with the other? Again, can you list the concrete code path that leads to that call?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or alternatively you can start QEMU (with kernel_irqchip=on,smm=off) in gdb, set a breakpoint at kvm_arch_do_init_vcpu() and see if you hit it.


// BSP runnable, APs wait-for-SIPI.
let mp_state = kvm_mp_state {
mp_state: if self.index == 0 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the BSP == 0 a convention we use elsewhere in the code? Do we need a constant for this?

@ilstam

ilstam commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

KVM latches the INIT/SIPI, so I don't know that there's really a race condition there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Status: Awaiting review Indicates that a pull request is ready to be reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants