Fix guest induced reboot/ shutdown migration failure #87

Draft
Coffeeri wants to merge 6 commits into cyberus-technology:gardenlinux from Coffeeri:fix-guest-induced-shutdown-migration-failure

@Coffeeri Coffeeri commented Feb 17, 2026

This PR fixes https://github.com/cobaltcore-dev/cobaltcore/issues/374, where the VMM fails with VmError::VmMigrating if the guest triggers a reboot or shutdown during an active live migration.
When this happens, the reset or shutdown request is handled by the VMM control loop while the VM ownership is MaybeVmOwnership::Migration. Calling vm_reboot() or vmm_shutdown() in that state returns VmMigrating, which causes the normal lifecycle handling to abort.

The long-term idea is to refactor vm_reboot() so that it can survive a migration, by giving the VM's components a reset capability.
This PR introduces a first workaround: intercept the reboot/shutdown event, pause the VM, migrate, resume the VM, and finally re-emit the reboot/shutdown event.

What this change does

  • Introduces a small PostMigrationLifecycleEvent enum field in VmSnapshot.

  • Migration Source

    • The control loop still consumes reset_evt / exit_evt.
    • If a migration is in progress, it latches the first lifecycle event (reset or shutdown) instead of executing it.
    • The latched event is written into the VM snapshot before sending state.
    • If a lifecycle event is latched during pre-copy, the migration switches to the downtime phase at the next iteration boundary.
    • On migration failure, once ownership returns to the VMM, the latched event is replayed locally via the existing eventfds.
  • Migration Destination

    • The lifecycle enum field is read from the received snapshot.
    • After Command::Complete, the VM is resumed and the lifecycle action is replayed through the existing eventfds:
      • VmReboot → reset_evt
      • VmmShutdown → exit_evt
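
The replay mapping in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `FakeEventFd`, `ControlLoopEvents` and `replay_lifecycle_event` are hypothetical names, and the real eventfds are modelled here by plain atomic counters so the example runs with the standard library alone. Only the variant names `VmReboot` and `VmmShutdown` and the eventfd names `reset_evt` and `exit_evt` come from the description.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Lifecycle action to replay once the migration has completed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PostMigrationLifecycleEvent {
    VmReboot,
    VmmShutdown,
}

/// Hypothetical stand-in for an eventfd: a counter bumped on every write.
#[derive(Default)]
struct FakeEventFd(AtomicU64);

impl FakeEventFd {
    fn write(&self, value: u64) {
        self.0.fetch_add(value, Ordering::SeqCst);
    }

    fn pending(&self) -> u64 {
        self.0.load(Ordering::SeqCst)
    }
}

/// Hypothetical bundle of the two eventfds the control loop listens on.
#[derive(Default)]
struct ControlLoopEvents {
    reset_evt: FakeEventFd,
    exit_evt: FakeEventFd,
}

/// Re-emit the latched lifecycle event through the existing control-loop paths.
/// With no latched event nothing is written, so migration behaves as before.
fn replay_lifecycle_event(
    events: &ControlLoopEvents,
    latched: Option<PostMigrationLifecycleEvent>,
) {
    match latched {
        Some(PostMigrationLifecycleEvent::VmReboot) => events.reset_evt.write(1),
        Some(PostMigrationLifecycleEvent::VmmShutdown) => events.exit_evt.write(1),
        None => {}
    }
}

fn main() {
    let events = ControlLoopEvents::default();
    replay_lifecycle_event(&events, Some(PostMigrationLifecycleEvent::VmReboot));
    println!(
        "reset_evt={} exit_evt={}",
        events.reset_evt.pending(),
        events.exit_evt.pending()
    );
}
```

Routing the replay through the eventfds, rather than calling the lifecycle handlers directly, means the control loop handles the replayed event exactly like a fresh guest request.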

Behavior and edge cases

  • No latched event → migration behaves exactly as before.
  • Multiple lifecycle signals during one migration → deterministic first-event-wins.
  • Downtime flow remains unchanged (throttling stop, pause, final transfer, snapshot, complete).
  • Latch state is cleared on migration start and after success or failure to avoid stale state.

@Coffeeri Coffeeri force-pushed the fix-guest-induced-shutdown-migration-failure branch 5 times, most recently from 9dfc444 to 5141abb Compare February 17, 2026 15:02
@Coffeeri Coffeeri self-assigned this Feb 17, 2026
@Coffeeri Coffeeri changed the title Fix guest induced shutdown migration failure Fix guest induced reboot/ shutdown migration failure Feb 17, 2026
@Coffeeri Coffeeri force-pushed the fix-guest-induced-shutdown-migration-failure branch from 5141abb to 24498cf Compare February 18, 2026 07:26
@Coffeeri Coffeeri force-pushed the fix-guest-induced-shutdown-migration-failure branch 2 times, most recently from 1ce7d22 to cf9dfa1 Compare February 19, 2026 14:51
During live migration, VM ownership is moved away from the VMM thread.
To preserve guest-triggered reboot and shutdown lifecycle intent
across that ownership handover, we need a small lifecycle marker to
travel with the migrated VM state.

This change introduces `PostMigrationLifecycleEvent` and stores it in
`VmSnapshot` with `#[serde(default)]` for backward compatibility.
`Vm::snapshot()` now serializes the marker, and VM construction from a
snapshot restores it.

No control-loop behavior is changed in this commit. This is only the
data model/plumbing needed by follow-up commits.

On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
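
As a rough illustration of the backward-compatibility point in the commit message above, the sketch below shows a snapshot whose lifecycle marker falls back to a default when the field is absent. It makes several assumptions: the field is modelled as an `Option` with a hypothetical name, the real `VmSnapshot` has other fields that are left out, and a hand-written `decode_snapshot` over a key/value map stands in for serde so the example needs only the standard library. With serde, `#[serde(default)]` gives the same effect by calling `Default::default()` for a missing field.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PostMigrationLifecycleEvent {
    VmReboot,
    VmmShutdown,
}

/// Heavily reduced snapshot; the field name and Option type are assumptions.
#[derive(Debug, Default, PartialEq, Eq)]
struct VmSnapshot {
    post_migration_lifecycle_event: Option<PostMigrationLifecycleEvent>,
}

/// Hypothetical decoder standing in for serde deserialization.
/// A missing key yields the default (None), which is how snapshots
/// written before the field existed keep loading.
fn decode_snapshot(fields: &HashMap<&str, &str>) -> Result<VmSnapshot, String> {
    let event = match fields.get("post_migration_lifecycle_event").copied() {
        None => None,
        Some("VmReboot") => Some(PostMigrationLifecycleEvent::VmReboot),
        Some("VmmShutdown") => Some(PostMigrationLifecycleEvent::VmmShutdown),
        Some(other) => return Err(format!("unknown lifecycle event: {other}")),
    };
    Ok(VmSnapshot {
        post_migration_lifecycle_event: event,
    })
}

fn main() {
    // Snapshot produced by an older sender: the field is simply absent.
    let old_snapshot: HashMap<&str, &str> = HashMap::new();
    println!("{:?}", decode_snapshot(&old_snapshot));

    // Snapshot produced by a sender that latched a guest reboot.
    let mut new_snapshot = HashMap::new();
    new_snapshot.insert("post_migration_lifecycle_event", "VmReboot");
    println!("{:?}", decode_snapshot(&new_snapshot));
}
```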
While a live migration is running, the migration worker owns the VM and
the VMM control loop cannot execute vm_reboot()/vmm_shutdown() directly.
Guest-triggered reset/exit events in that window currently hit
VmMigrating and fail.

This change makes the control loop consume reset/exit as before, but
when ownership is `MaybeVmOwnership::Migration` it latches a
post-migration lifecycle intent instead of calling lifecycle handlers
directly.

The latch is first-event-wins and is cleared when a new send migration
starts, preventing stale lifecycle intent from leaking between
migrations.

This commit only introduces source-side latching behavior and does not
yet apply or replay the latched event.

On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
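
The latching behaviour described in the commit message above might look roughly like this. It is a simplified sketch under stated assumptions: `Ownership` is a reduced stand-in for `MaybeVmOwnership`, and `LifecycleLatch`, `Outcome` and `handle_lifecycle_event` are hypothetical names. The latch is passed by `&mut` here for brevity, whereas a latch shared with the migration worker would need synchronization.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PostMigrationLifecycleEvent {
    VmReboot,
    VmmShutdown,
}

/// Reduced stand-in for MaybeVmOwnership (assumption: two states suffice here).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Ownership {
    Vmm,
    Migration,
}

/// What the control loop did with a consumed reset/exit event.
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Executed(PostMigrationLifecycleEvent),
    Latched,
    Ignored,
}

/// First-event-wins latch for the post-migration lifecycle intent.
#[derive(Default)]
struct LifecycleLatch(Option<PostMigrationLifecycleEvent>);

impl LifecycleLatch {
    /// Stores the event only if nothing is latched yet; returns whether it was stored.
    fn latch(&mut self, event: PostMigrationLifecycleEvent) -> bool {
        if self.0.is_none() {
            self.0 = Some(event);
            true
        } else {
            false
        }
    }

    /// Called when a new send migration starts, so stale intent cannot leak.
    fn clear(&mut self) {
        self.0 = None;
    }

    fn get(&self) -> Option<PostMigrationLifecycleEvent> {
        self.0
    }
}

/// The event is always consumed; what happens next depends on ownership.
fn handle_lifecycle_event(
    ownership: Ownership,
    latch: &mut LifecycleLatch,
    event: PostMigrationLifecycleEvent,
) -> Outcome {
    match ownership {
        // Normal case: the VMM owns the VM and runs the lifecycle handler.
        Ownership::Vmm => Outcome::Executed(event),
        // During migration: remember the first event instead of failing.
        Ownership::Migration => {
            if latch.latch(event) {
                Outcome::Latched
            } else {
                Outcome::Ignored
            }
        }
    }
}

fn main() {
    let mut latch = LifecycleLatch::default();
    latch.clear(); // a new send migration starts
    let first = handle_lifecycle_event(
        Ownership::Migration,
        &mut latch,
        PostMigrationLifecycleEvent::VmReboot,
    );
    let second = handle_lifecycle_event(
        Ownership::Migration,
        &mut latch,
        PostMigrationLifecycleEvent::VmmShutdown,
    );
    println!("{:?} {:?} latched={:?}", first, second, latch.get());
}
```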
Add migration plumbing to carry the latched lifecycle intent from source
to destination and replay it through the existing control-loop paths.

The migration worker now passes the shared lifecycle latch into the send
path, and the sender writes the selected `PostMigrationLifecycleEvent`
into the VM snapshot before transmitting state.

On the receiving side, migration state restore extracts that snapshot
field and stores it in VMM state. After `Command::Complete`, the target
resumes the VM and replays the lifecycle action by writing to the
existing eventfds:
- VmReboot -> reset_evt
- VmmShutdown -> exit_evt

On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
When a lifecycle event like reset or shutdown is latched during
pre-copy, switch to downtime at the next iteration boundary.
This keeps the current iteration send intact and then transitions
into the existing graceful downtime path (`stop_vcpu_throttling()`,
`pause()`, final transfer, snapshot).

To keep behavior deterministic on source migration failure,
replay the latched lifecycle event locally after ownership is
returned:
- VmReboot -> reset_evt
- VmmShutdown -> exit_evt

Latch state is cleared on both success and failure paths to avoid stale
state across migrations.

On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
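
The pre-copy cut-over and the failure replay from the commit message above can be sketched as below. The function names `run_precopy` and `finish_source_migration`, the `SharedLatch` alias and the iteration callback are all hypothetical; the actual send path transfers memory and does considerably more. The sketch only shows the ordering: the current iteration is sent in full, the latch is checked at the iteration boundary, and the latch is emptied on both the success and the failure path.

```rust
use std::sync::{Arc, Mutex};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PostMigrationLifecycleEvent {
    VmReboot,
    VmmShutdown,
}

/// Assumed shape of the latch shared by control loop and migration worker.
type SharedLatch = Arc<Mutex<Option<PostMigrationLifecycleEvent>>>;

/// Runs pre-copy iterations and returns how many were sent before
/// the migration moves on to the downtime phase.
fn run_precopy(
    latch: &SharedLatch,
    max_iterations: usize,
    mut send_iteration: impl FnMut(usize),
) -> usize {
    let mut sent = 0;
    for iteration in 0..max_iterations {
        // The current iteration is never cut short.
        send_iteration(iteration);
        sent += 1;
        // Iteration boundary: a latched lifecycle event ends pre-copy early.
        if latch.lock().unwrap().is_some() {
            break;
        }
    }
    sent
}

/// Empties the latch on every path. On failure the event is handed back so
/// the caller can replay it locally (VmReboot -> reset_evt, VmmShutdown -> exit_evt).
/// On success the destination replays it, so nothing is returned.
fn finish_source_migration(
    latch: &SharedLatch,
    succeeded: bool,
) -> Option<PostMigrationLifecycleEvent> {
    let latched = latch.lock().unwrap().take();
    if succeeded {
        None
    } else {
        latched
    }
}

fn main() {
    let latch: SharedLatch = Arc::new(Mutex::new(None));
    let guest = Arc::clone(&latch);
    // Simulate the guest requesting a reboot while iteration 1 is being sent.
    let sent = run_precopy(&latch, 5, |iteration| {
        if iteration == 1 {
            *guest.lock().unwrap() = Some(PostMigrationLifecycleEvent::VmReboot);
        }
    });
    let replay = finish_source_migration(&latch, false);
    println!("iterations sent: {sent}, replay locally: {replay:?}");
}
```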
Legacy reset/poweroff paths waited only for vcpus_kill_signalled before
leaving their spin loops.

During migration, vCPUs are pause-signalled, not kill-signalled. This
could stall reset/poweroff handling and block migration completion.

Also break these waits on vcpus_pause_signalled and wire the new flag
through device construction paths.

Updated device paths:
  - i8042: guest reboot/reset write path wait loop
  - CMOS: reset register write (0x0f) wait loop
  - ACPI shutdown device:
    - reboot/reset write path wait loop
    - poweroff/shutdown write path wait loop

On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
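
The wait-loop change described in the commit message above amounts to something like the following. The flag names `vcpus_kill_signalled` and `vcpus_pause_signalled` are taken from the commit message; the function name, the `WaitEnd` return type and the 1 ms sleep are assumptions made for this sketch, and the real device code may spin differently.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Which flag ended the wait (hypothetical, for illustration only).
#[derive(Debug, PartialEq, Eq)]
enum WaitEnd {
    Kill,
    Pause,
}

/// Waits until the vCPUs have been kill-signalled or pause-signalled.
/// Before the fix only the kill flag was checked, so a vCPU that was
/// pause-signalled for migration could wait here indefinitely.
fn wait_until_vcpus_signalled(
    vcpus_kill_signalled: &AtomicBool,
    vcpus_pause_signalled: &AtomicBool,
) -> WaitEnd {
    loop {
        if vcpus_kill_signalled.load(Ordering::SeqCst) {
            return WaitEnd::Kill;
        }
        if vcpus_pause_signalled.load(Ordering::SeqCst) {
            return WaitEnd::Pause;
        }
        thread::sleep(Duration::from_millis(1));
    }
}

fn main() {
    let kill = Arc::new(AtomicBool::new(false));
    let pause = Arc::new(AtomicBool::new(false));

    // Simulate the migration path pausing the vCPUs shortly after the
    // guest wrote to a reset/poweroff device.
    let pause_setter = Arc::clone(&pause);
    let migration = thread::spawn(move || {
        thread::sleep(Duration::from_millis(10));
        pause_setter.store(true, Ordering::SeqCst);
    });

    let end = wait_until_vcpus_signalled(&kill, &pause);
    migration.join().unwrap();
    println!("wait ended by {:?}", end);
}
```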
Align CMOS reset wait logic with i8042/ACPI by using direct pause/kill
flags instead of optional wrappers and Option checks.

On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
@Coffeeri Coffeeri force-pushed the fix-guest-induced-shutdown-migration-failure branch from cf161ea to ac45988 Compare February 20, 2026 07:50