Fix guest induced reboot/shutdown migration failure #87
Draft
Coffeeri wants to merge 6 commits into cyberus-technology:gardenlinux from
tpressure reviewed Feb 17, 2026
Coffeeri commented Feb 18, 2026
During live migration, VM ownership is moved away from the VMM thread. To preserve guest-triggered reboot and shutdown lifecycle intent across that ownership handover, we need a small lifecycle marker to travel with the migrated VM state.

This change introduces `PostMigrationLifecycleEvent` and stores it in `VmSnapshot` with `#[serde(default)]` for backward compatibility. `Vm::snapshot()` now serializes the marker, and VM construction from a snapshot restores it.

No control-loop behavior is changed in this commit. This is only the data model/plumbing needed by follow-up commits.

On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
While a live migration is running, the migration worker owns the VM and the VMM control loop cannot execute `vm_reboot()`/`vmm_shutdown()` directly. Guest-triggered reset/exit events in that window currently hit `VmMigrating` and fail.

This change makes the control loop consume reset/exit as before, but when ownership is `MaybeVmOwnership::Migration` it latches a post-migration lifecycle intent instead of calling lifecycle handlers directly.

The latch is first-event-wins and is cleared when a new send migration starts, preventing stale lifecycle intent from leaking between migrations.

This commit only introduces source-side latching behavior and does not yet apply or replay the latched event.

On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
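The first-event-wins latching described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the variant names `VmReboot` and `VmmShutdown` are inferred from the mapping listed elsewhere in this PR, while `Ownership` (a simplified stand-in for `MaybeVmOwnership`), `LifecycleLatch`, `Handling`, and `on_guest_event` are hypothetical names introduced only for this example.

```rust
use std::sync::Mutex;

/// Lifecycle action to replay after migration (variant names inferred from the PR).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PostMigrationLifecycleEvent {
    VmReboot,
    VmmShutdown,
}

/// Hypothetical, simplified stand-in for `MaybeVmOwnership`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Ownership {
    Vmm,
    Migration,
}

/// What the control loop decided to do with a guest event (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum Handling {
    RunHandlerNow,
    Latched,
    AlreadyLatched,
}

/// Hypothetical latch: stores at most one event, the first one recorded.
#[derive(Default)]
pub struct LifecycleLatch(Mutex<Option<PostMigrationLifecycleEvent>>);

impl LifecycleLatch {
    /// Store `event` unless one is already latched. Returns true if stored.
    pub fn latch(&self, event: PostMigrationLifecycleEvent) -> bool {
        let mut slot = self.0.lock().unwrap();
        if slot.is_some() {
            return false;
        }
        *slot = Some(event);
        true
    }

    /// Read the latched event without clearing it.
    pub fn get(&self) -> Option<PostMigrationLifecycleEvent> {
        *self.0.lock().unwrap()
    }

    /// Clear the latch, for example when a new send migration starts.
    pub fn clear(&self) {
        *self.0.lock().unwrap() = None;
    }
}

/// Decide how to treat a guest reset/exit event based on who owns the VM.
pub fn on_guest_event(
    ownership: Ownership,
    latch: &LifecycleLatch,
    event: PostMigrationLifecycleEvent,
) -> Handling {
    match ownership {
        // The VMM thread owns the VM: the normal lifecycle handler can run.
        Ownership::Vmm => Handling::RunHandlerNow,
        // The migration worker owns the VM: remember the intent instead.
        Ownership::Migration => {
            if latch.latch(event) {
                Handling::Latched
            } else {
                Handling::AlreadyLatched
            }
        }
    }
}
```

In this sketch, a reboot followed by a shutdown during the same migration leaves `VmReboot` in the latch, and calling `clear()` before the next migration prevents that intent from being carried over.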
Add migration plumbing to carry the latched lifecycle intent from source to destination and replay it through the existing control-loop paths.

The migration worker now passes the shared lifecycle latch into the send path, and the sender writes the selected `PostMigrationLifecycleEvent` into the VM snapshot before transmitting state.

On the receiving side, migration state restore extracts that snapshot field and stores it in VMM state. After `Command::Complete`, the target resumes the VM and replays the lifecycle action by writing to the existing eventfds:
- VmReboot -> reset_evt
- VmmShutdown -> exit_evt

On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
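The replay step on the destination boils down to a mapping from the carried event to one of the two existing event sources. The sketch below illustrates only that mapping under stated assumptions: `CountingEvent` is a hypothetical stand-in that counts writes instead of being a real eventfd, and `replay_lifecycle_event` is a name made up for this example.

```rust
/// Lifecycle action carried in the snapshot (variant names inferred from the PR).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PostMigrationLifecycleEvent {
    VmReboot,
    VmmShutdown,
}

/// Hypothetical stand-in for an eventfd: it only accumulates written values.
#[derive(Debug, Default)]
pub struct CountingEvent {
    pub value: u64,
}

impl CountingEvent {
    /// Mimics an eventfd write, which adds `v` to the counter.
    pub fn write(&mut self, v: u64) {
        self.value += v;
    }
}

/// Replay a carried lifecycle event by signalling the matching event source.
/// Returns true if an event was replayed, false if nothing was latched.
pub fn replay_lifecycle_event(
    event: Option<PostMigrationLifecycleEvent>,
    reset_evt: &mut CountingEvent,
    exit_evt: &mut CountingEvent,
) -> bool {
    match event {
        Some(PostMigrationLifecycleEvent::VmReboot) => {
            reset_evt.write(1);
            true
        }
        Some(PostMigrationLifecycleEvent::VmmShutdown) => {
            exit_evt.write(1);
            true
        }
        None => false,
    }
}
```

Routing the replay through the same event sources the guest would normally trigger means the control loop handles the action on its usual path, with no separate reboot or shutdown code for the post-migration case.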
When a lifecycle event like reset or shutdown is latched during pre-copy, switch to downtime at the next iteration boundary. This keeps the current iteration send intact and then transitions into the existing graceful downtime path (`stop_vcpu_throttling()`, `pause()`, final transfer, snapshot).

To keep behavior deterministic on source migration failure, replay the latched lifecycle event locally after ownership is returned:
- VmReboot -> reset_evt
- VmmShutdown -> exit_evt

Latch state is cleared on both success and failure paths to avoid stale state across migrations.

On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
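The "switch at the next iteration boundary" rule can be illustrated with a small loop that checks the latch only after an iteration has been sent in full. This is a sketch under assumptions, not the migration code itself: `run_precopy`, its parameters, and the fixed iteration limit are hypothetical, and the real pre-copy loop has additional exit conditions that are not modelled here.

```rust
/// Run pre-copy iterations until the limit is reached or a lifecycle event
/// has been latched. Returns the number of iterations that were sent in full.
///
/// `send_iteration` and `lifecycle_latched` are hypothetical callbacks that
/// stand in for the real send path and the shared latch.
pub fn run_precopy<S, L>(max_iterations: u32, mut send_iteration: S, lifecycle_latched: L) -> u32
where
    S: FnMut(u32),
    L: Fn() -> bool,
{
    let mut completed = 0;
    while completed < max_iterations {
        // The current iteration is always sent completely.
        send_iteration(completed);
        completed += 1;
        // The latch is checked only at the iteration boundary.
        if lifecycle_latched() {
            break;
        }
    }
    // The caller would now enter the downtime path.
    completed
}
```

If the guest triggers a reset while the third iteration is being sent, that iteration still completes, and the loop exits before a fourth one starts.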
Legacy reset/poweroff paths waited only for vcpus_kill_signalled before
leaving their spin loops.
During migration, vCPUs are pause-signalled, not kill-signalled. This
could stall reset/poweroff handling and block migration completion.
Also break these waits on vcpus_pause_signalled and wire the new flag
through device construction paths.
Updated device paths:
- i8042: guest reboot/reset write path wait loop
- CMOS: reset register write (0x0f) wait loop
- ACPI shutdown device:
  - reboot/reset write path wait loop
  - poweroff/shutdown write path wait loop
On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
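The changed wait condition can be sketched as a loop that returns as soon as either flag is set. The flag names mirror `vcpus_kill_signalled` and `vcpus_pause_signalled` from the commit message. The function name `wait_for_vcpu_signal`, the 1 ms sleep, and the `SeqCst` ordering are assumptions made for this illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

/// Block until the vCPUs have been either kill-signalled or pause-signalled.
///
/// Waiting on the kill flag alone would never return during migration,
/// because vCPUs are only pause-signalled in that case.
pub fn wait_for_vcpu_signal(vcpus_kill_signalled: &AtomicBool, vcpus_pause_signalled: &AtomicBool) {
    while !vcpus_kill_signalled.load(Ordering::SeqCst)
        && !vcpus_pause_signalled.load(Ordering::SeqCst)
    {
        // Sleep briefly instead of busy-spinning (interval is an assumption).
        thread::sleep(Duration::from_millis(1));
    }
}
```

With this condition, a device write path that triggers a reset or poweroff during migration can leave its wait loop once the migration worker pauses the vCPUs.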
Align CMOS reset wait logic with i8042/ACPI by using direct pause/kill flags instead of optional wrappers and Option checks.

On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
This PR fixes an issue (https://github.com/cobaltcore-dev/cobaltcore/issues/374) where the VMM fails with `VmError::VmMigrating` if the guest triggers a reboot or shutdown during an active live migration.

When this happens, the reset or shutdown request is handled by the VMM control loop while the VM ownership is `MaybeVmOwnership::Migration`. Calling `vm_reboot()` or `vmm_shutdown()` in that state returns `VmMigrating`, which causes the normal lifecycle handling to abort.

The long-term idea is to refactor the `vm_reboot` function so that it can outlive a migration, by introducing a reset capability for the VM's components.

This PR introduces a first workaround: intercept the reboot/shutdown event, pause the VM, migrate, resume the VM, and finally re-emit the reboot/shutdown event.

What this change does
Introduces a small `PostMigrationLifecycleEvent` enum field in `VmSnapshot`:

Migration Source
... `reset_evt`/`exit_evt`.

Migration Destination
After `Command::Complete`, the VM is resumed and the lifecycle action is replayed through the existing eventfds:
- `VmReboot` → `reset_evt`
- `VmmShutdown` → `exit_evt`

Behavior and edge cases