XXX migration: fetch dirty_log in thread #82

Draft
phip1611 wants to merge 16 commits into cyberus-technology:gardenlinux from phip1611:poc-dirty-rate-thread

Conversation

phip1611 (Member) commented on Feb 12, 2026

Closes https://github.com/cobaltcore-dev/cobaltcore/issues/280

This log shows that the dirty-rate calculation is now much more accurate and correlates well with the vCPU throttling:

cloud-hypervisor:   6.023682s: <dirty-log-worker> INFO:vmm/src/dirty_log_worker.rs:122 -- starting thread
cloud-hypervisor:  26.450666s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=0,dur=20426ms,overhead=0ms,throttle=0%,size=10240MiB,dirtyrate=0pps,bandwidth=501.30MiBs,downtime(expected)=100ms
cloud-hypervisor:  55.537373s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=1,dur=29086ms,overhead=0ms,throttle=0%,size=9263MiB,dirtyrate=2546198pps,bandwidth=318.46MiBs,downtime(expected)=18477ms
cloud-hypervisor:  55.537506s: <migration> INFO:vmm/src/lib.rs:2231 -- Increasing auto-converge: 10%
cloud-hypervisor:  58.599796s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 10ms to signal - retrying
cloud-hypervisor:  87.012421s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 10ms to signal - retrying
cloud-hypervisor:  87.604584s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=2,dur=32067ms,overhead=0ms,throttle=10%,size=9261MiB,dirtyrate=2264056pps,bandwidth=288.80MiBs,downtime(expected)=29079ms
cloud-hypervisor:  93.001717s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 10ms to signal - retrying
cloud-hypervisor: 110.924957s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=3,dur=23319ms,overhead=0ms,throttle=10%,size=9245MiB,dirtyrate=2281225pps,bandwidth=396.42MiBs,downtime(expected)=32010ms
cloud-hypervisor: 110.925598s: <migration> INFO:vmm/src/lib.rs:2231 -- Increasing auto-converge: 20%
cloud-hypervisor: 124.872177s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 10ms to signal - retrying
cloud-hypervisor: 127.918579s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 10ms to signal - retrying
cloud-hypervisor: 127.929123s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 20ms to signal - retrying
cloud-hypervisor: 127.939674s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 30ms to signal - retrying
cloud-hypervisor: 128.357773s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=4,dur=17432ms,overhead=0ms,throttle=20%,size=9244MiB,dirtyrate=2233198pps,bandwidth=530.28MiBs,downtime(expected)=23318ms
cloud-hypervisor: 139.619446s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 10ms to signal - retrying
cloud-hypervisor: 148.046097s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=5,dur=19685ms,overhead=0ms,throttle=20%,size=9254MiB,dirtyrate=2223147pps,bandwidth=470.03MiBs,downtime(expected)=17449ms
cloud-hypervisor: 148.049988s: <migration> INFO:vmm/src/lib.rs:2231 -- Increasing auto-converge: 30%
cloud-hypervisor: 170.608948s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=6,dur=22554ms,overhead=0ms,throttle=30%,size=9244MiB,dirtyrate=2193089pps,bandwidth=409.85MiBs,downtime(expected)=19665ms
cloud-hypervisor: 193.213226s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=7,dur=22588ms,overhead=0ms,throttle=30%,size=9244MiB,dirtyrate=2275898pps,bandwidth=409.21MiBs,downtime(expected)=22553ms
cloud-hypervisor: 193.214600s: <migration> INFO:vmm/src/lib.rs:2231 -- Increasing auto-converge: 40%
cloud-hypervisor: 208.498703s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=8,dur=15282ms,overhead=0ms,throttle=40%,size=9244MiB,dirtyrate=2276342pps,bandwidth=604.83MiBs,downtime(expected)=22588ms
cloud-hypervisor: 224.536972s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=9,dur=16036ms,overhead=0ms,throttle=40%,size=9243MiB,dirtyrate=2152596pps,bandwidth=576.36MiBs,downtime(expected)=15280ms
cloud-hypervisor: 224.537323s: <migration> INFO:vmm/src/lib.rs:2231 -- Increasing auto-converge: 50%
cloud-hypervisor: 241.172592s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=10,dur=16635ms,overhead=0ms,throttle=50%,size=9244MiB,dirtyrate=2167918pps,bandwidth=555.66MiBs,downtime(expected)=16037ms
cloud-hypervisor: 241.334137s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 10ms to signal - retrying
cloud-hypervisor: 252.334900s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 10ms to signal - retrying
cloud-hypervisor: 259.005432s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=11,dur=17832ms,overhead=0ms,throttle=50%,size=9244MiB,dirtyrate=2040923pps,bandwidth=518.34MiBs,downtime(expected)=16634ms
cloud-hypervisor: 259.005524s: <migration> INFO:vmm/src/lib.rs:2231 -- Increasing auto-converge: 60%
cloud-hypervisor: 270.615387s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=12,dur=11609ms,overhead=0ms,throttle=60%,size=9243MiB,dirtyrate=1884611pps,bandwidth=796.11MiBs,downtime(expected)=17830ms
cloud-hypervisor: 272.575134s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 10ms to signal - retrying
cloud-hypervisor: 274.280182s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=13,dur=3664ms,overhead=0ms,throttle=60%,size=5209MiB,dirtyrate=1706834pps,bandwidth=1421.46MiBs,downtime(expected)=6542ms
cloud-hypervisor: 274.280212s: <migration> INFO:vmm/src/lib.rs:2231 -- Increasing auto-converge: 70%
cloud-hypervisor: 274.557281s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=14,dur=277ms,overhead=0ms,throttle=70%,size=2522MiB,dirtyrate=1524275pps,bandwidth=9105.73MiBs,downtime(expected)=1773ms
cloud-hypervisor: 274.557312s: <migration> INFO:vmm/src/lib.rs:2277 -- Memory delta transmission stopping - cutoff condition reached!
cloud-hypervisor: 274.557587s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=15,dur=277ms,overhead=0ms,throttle=70%,size=0MiB,dirtyrate=1524275pps,bandwidth=9105.73MiBs,downtime(expected)=0ms
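
For orientation only (this is not the code from this PR, and every name except the thread name is illustrative): the `<dirty-log-worker>` visible at the top of the log can be pictured as a thread that periodically fetches the dirty log and derives a pages-per-second rate for the migration loop to consume, roughly along these lines:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Latest dirty-rate sample in pages per second, shared with the migration loop.
pub struct DirtyRate(pub AtomicU64);

/// Spawn a worker that periodically queries the dirty log and publishes a rate.
pub fn spawn_dirty_log_worker(
    rate: Arc<DirtyRate>,
    fetch_dirty_pages: impl Fn() -> u64 + Send + 'static,
) -> thread::JoinHandle<()> {
    thread::Builder::new()
        .name("dirty-log-worker".into())
        .spawn(move || {
            let interval = Duration::from_millis(500);
            let mut last = Instant::now();
            loop {
                thread::sleep(interval);
                // Number of pages dirtied since the previous query
                // (querying the dirty log also clears it).
                let pages = fetch_dirty_pages();
                let now = Instant::now();
                let elapsed = now.duration_since(last).as_secs_f64();
                last = now;
                rate.0.store((pages as f64 / elapsed) as u64, Ordering::Relaxed);
            }
        })
        .expect("failed to spawn dirty-log worker")
}
```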

Eventually, the VMM shuts down in the case of a successful migration.
We need to prevent "migration ongoing" errors in the shutdown path.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
The logging is neither spammy nor costly (iterations take seconds to
dozens of minutes) and is a clear win for debugging.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is the first commit in a series to introduce a new API endpoint in
Cloud Hypervisor that reports progress and live insights about an
ongoing live migration.

# Motivation

Having live, frequently refreshed statistics/metrics about an
ongoing live migration is especially interesting for debugging and
monitoring, such as checking the actual network throughput. With the
proposed changes, we will for the first time be able to see how live
migrations behave and build benchmarking infrastructure around them.

The ch driver in libvirt will use this information to populate its
`virsh domjobinfo` output.

# Design

We will add a new API endpoint to query information about ongoing live
migrations. The endpoint will also serve to query information about
previously failed or canceled migrations. The SendMigration call
will no longer block until the migration is done; instead, it will
just dispatch the migration.

This aligns the behavior with QEMU's and simplifies management
software.

When one queries the endpoint, a frequently refreshed snapshot of the
migration statistics and progress will be returned. The data will not be
assembled on the fly.
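
To make the design concrete, here is a rough sketch of what such a snapshot could carry; the type and field names are illustrative (loosely mirroring the per-iteration log line above), not the ones introduced by this series:

```rust
use serde::Serialize;

/// Illustrative snapshot shape (hypothetical field names).
#[derive(Clone, Debug, Serialize)]
pub struct MigrationProgress {
    pub state: MigrationState,
    pub iteration: u64,
    pub iteration_duration_ms: u64,
    pub throttle_percent: u8,
    pub transferred_mib: u64,
    pub dirty_rate_pps: u64,
    pub bandwidth_mib_per_s: f64,
    pub expected_downtime_ms: u64,
}

/// Illustrative lifecycle states for a migration.
#[derive(Clone, Copy, Debug, Serialize)]
pub enum MigrationState {
    Ongoing,
    Completed,
    Failed,
    Canceled,
}
```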

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is part of the commit series to enable live updates about an
ongoing live migration. See the first commit for an introduction.

We decided to use an Option<> rather than a Result<> as
there isn't really an error that can happen when querying this endpoint:
a previous snapshot is either there or it is not. It also doesn't make
sense to check whether the current VM is running, as users should always
be able to query information about a past (failed or canceled) live
migration.
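
A minimal sketch of that decision (`Vmm` and its field are stand-ins, reusing the `MigrationProgress` shape sketched earlier):

```rust
/// Sketch only; `Vmm` here is a stand-in, not the real struct.
struct Vmm {
    /// Latest snapshot of the current or most recent migration, if any.
    migration_progress: Option<MigrationProgress>,
}

impl Vmm {
    /// Option rather than Result: "nothing to report yet" is not an error.
    fn vm_migration_progress(&self) -> Option<MigrationProgress> {
        self.migration_progress.clone()
    }
}
```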

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is part of the commit series to enable live updates about an
ongoing live migration. See the first commit for an introduction.

In this commit, we add the HTTP endpoint to export ongoing VM
live-migration progress.
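
Purely as an illustration of the request/response mapping (the actual endpoint path, handler plumbing, and status codes in this commit may differ; this assumes serde_json and the `MigrationProgress` sketch from above):

```rust
/// Hypothetical mapping: Some(snapshot) -> 200 with a JSON body, None -> 404.
fn migration_progress_response(progress: Option<MigrationProgress>) -> (u16, String) {
    match progress {
        Some(p) => (200, serde_json::to_string(&p).unwrap_or_default()),
        None => (404, String::new()),
    }
}
```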

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is part of the commit series to enable live updates about an
ongoing live migration. See the first commit for an introduction.

This commit prepares for avoiding naming clashes in the following commits.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is part of the commit series to enable live updates about an
ongoing live migration. See the first commit for an introduction.

This commit brings all the functionality together. The first version has
the limitation that the latest snapshot is only populated once per
memory iteration, even though this phase is by far the most interesting
one. In a follow-up, we can make this more fine-grained.

We guarantee that migration progress can be fetched as soon as
SendMigration returns, because the underlying data source is populated by then.
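
A sketch of how such a guarantee can be provided (names are illustrative, reusing the `MigrationProgress` sketch from above): the snapshot lives in a shared slot that the migration loop writes once per iteration and that is seeded before SendMigration returns, so a query can never observe a missing data source.

```rust
use std::sync::{Arc, Mutex};

/// Shared slot: the migration loop writes, the API/query side reads.
type ProgressSlot = Arc<Mutex<Option<MigrationProgress>>>;

/// Called once per memory iteration to publish the latest snapshot.
fn publish_iteration(slot: &ProgressSlot, snapshot: MigrationProgress) {
    *slot.lock().unwrap() = Some(snapshot);
}
```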

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Time has proven that the previous design was not optimal. The
SendMigration call no longer blocks for the duration of the migration;
instead, it just triggers it. Using the new MigrationProgress
endpoint, management software can track the state of the migration and
also retrieve information about failed migrations.

A new `keep_alive` parameter for SendMigration will keep the VMM alive
and usable after the migration to ensure management software can fetch
the final state. The management software is then supposed to send a
ShutdownVmm command.

With this, we are finally able to query the migration progress API
endpoint during an ongoing live migration.
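
From the management-software side, the intended flow could look roughly like the sketch below (the trait and method names are placeholders, not actual Cloud Hypervisor API identifiers, and the polling interval is arbitrary; it reuses the `MigrationProgress`/`MigrationState` sketch from above):

```rust
use std::{thread, time::Duration};

/// Placeholder client interface for the three calls described above.
trait MigrationApi {
    type Error;
    fn send_migration(&self, destination: &str, keep_alive: bool) -> Result<(), Self::Error>;
    fn migration_progress(&self) -> Result<Option<MigrationProgress>, Self::Error>;
    fn shutdown_vmm(&self) -> Result<(), Self::Error>;
}

fn drive_migration<C: MigrationApi>(client: &C, destination: &str) -> Result<(), C::Error> {
    // Dispatch only: returns as soon as the migration has been started.
    client.send_migration(destination, /* keep_alive = */ true)?;

    // Poll the progress endpoint until a terminal state is reported.
    loop {
        if let Some(p) = client.migration_progress()? {
            if !matches!(p.state, MigrationState::Ongoing) {
                break;
            }
        }
        thread::sleep(Duration::from_secs(1));
    }

    // With keep_alive the VMM stays up, so the caller shuts it down explicitly.
    client.shutdown_vmm()
}
```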

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
We preserve the old behavior in ch-remote: SendMigration is blocking. A
new `--dispatch` flag, however, ensures that one can just dispatch the
migration without waiting for it to finish (or fail).

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Eventually, the VMM shuts down in the case of a successful migration.
We need to prevent "migration ongoing" errors in the shutdown path.

So far, I have only triggered this with `ch-remote`; we did not observe
it in the (test) production environment.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
If a failure happens fairly late in the migration, the VM would otherwise
remain unusable. This commit uses the generic migration-result check
code path to also resume() the VM if it was running before.

This allowed me to test various scenarios nicely via `ch-remote`.
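
A minimal sketch of the idea (the error type and resume hook are stand-ins, not the actual Cloud Hypervisor types):

```rust
/// Stand-in error type for the sketch.
#[derive(Debug)]
struct MigrationError;

/// On failure, bring the guest back if it was running before the migration,
/// instead of leaving it paused/throttled and unusable.
fn handle_migration_result(
    result: Result<(), MigrationError>,
    was_running: bool,
    resume_vm: impl FnOnce(),
) -> Result<(), MigrationError> {
    if result.is_err() && was_running {
        resume_vm();
    }
    result
}
```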

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Cloud Hypervisor only supports migration of running VMs. There are too
many implicit assumptions in the code to fix this easily, and with our
current knowledge this restriction is perfectly acceptable.

This check makes the failure case explicit instead of surfacing it as
deeply nested errors.
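
A sketch of such an explicit precondition check (the state enum and error text are illustrative):

```rust
/// Illustrative VM lifecycle states.
#[derive(Debug, PartialEq)]
enum VmState {
    Created,
    Running,
    Paused,
    Shutdown,
}

/// Fail fast with a clear message instead of deep inside the migration code.
fn check_migratable(state: &VmState) -> Result<(), String> {
    if *state != VmState::Running {
        return Err(format!("live migration requires a running VM (state: {state:?})"));
    }
    Ok(())
}
```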

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Mostly works, but sometimes the guest page faults (for non-idle workloads).
phip1611 force-pushed the poc-dirty-rate-thread branch from f82765c to 4e9ab64 on February 12, 2026, 21:04