XXX migration: fetch dirty_log in thread #82

Draft
phip1611 wants to merge 16 commits into cyberus-technology:gardenlinux from phip1611:poc-dirty-rate-thread

Conversation

phip1611 (Member) commented on Feb 12, 2026

Closes https://github.com/cobaltcore-dev/cobaltcore/issues/280

This log shows that the dirty-rate calculation is now much more accurate and correlates well with the vCPU throttling:

cloud-hypervisor:   6.023682s: <dirty-log-worker> INFO:vmm/src/dirty_log_worker.rs:122 -- starting thread
cloud-hypervisor:  26.450666s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=0,dur=20426ms,overhead=0ms,throttle=0%,size=10240MiB,dirtyrate=0pps,bandwidth=501.30MiBs,downtime(expected)=100ms
cloud-hypervisor:  55.537373s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=1,dur=29086ms,overhead=0ms,throttle=0%,size=9263MiB,dirtyrate=2546198pps,bandwidth=318.46MiBs,downtime(expected)=18477ms
cloud-hypervisor:  55.537506s: <migration> INFO:vmm/src/lib.rs:2231 -- Increasing auto-converge: 10%
cloud-hypervisor:  58.599796s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 10ms to signal - retrying
cloud-hypervisor:  87.012421s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 10ms to signal - retrying
cloud-hypervisor:  87.604584s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=2,dur=32067ms,overhead=0ms,throttle=10%,size=9261MiB,dirtyrate=2264056pps,bandwidth=288.80MiBs,downtime(expected)=29079ms
cloud-hypervisor:  93.001717s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 10ms to signal - retrying
cloud-hypervisor: 110.924957s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=3,dur=23319ms,overhead=0ms,throttle=10%,size=9245MiB,dirtyrate=2281225pps,bandwidth=396.42MiBs,downtime(expected)=32010ms
cloud-hypervisor: 110.925598s: <migration> INFO:vmm/src/lib.rs:2231 -- Increasing auto-converge: 20%
cloud-hypervisor: 124.872177s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 10ms to signal - retrying
cloud-hypervisor: 127.918579s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 10ms to signal - retrying
cloud-hypervisor: 127.929123s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 20ms to signal - retrying
cloud-hypervisor: 127.939674s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 30ms to signal - retrying
cloud-hypervisor: 128.357773s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=4,dur=17432ms,overhead=0ms,throttle=20%,size=9244MiB,dirtyrate=2233198pps,bandwidth=530.28MiBs,downtime(expected)=23318ms
cloud-hypervisor: 139.619446s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 10ms to signal - retrying
cloud-hypervisor: 148.046097s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=5,dur=19685ms,overhead=0ms,throttle=20%,size=9254MiB,dirtyrate=2223147pps,bandwidth=470.03MiBs,downtime(expected)=17449ms
cloud-hypervisor: 148.049988s: <migration> INFO:vmm/src/lib.rs:2231 -- Increasing auto-converge: 30%
cloud-hypervisor: 170.608948s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=6,dur=22554ms,overhead=0ms,throttle=30%,size=9244MiB,dirtyrate=2193089pps,bandwidth=409.85MiBs,downtime(expected)=19665ms
cloud-hypervisor: 193.213226s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=7,dur=22588ms,overhead=0ms,throttle=30%,size=9244MiB,dirtyrate=2275898pps,bandwidth=409.21MiBs,downtime(expected)=22553ms
cloud-hypervisor: 193.214600s: <migration> INFO:vmm/src/lib.rs:2231 -- Increasing auto-converge: 40%
cloud-hypervisor: 208.498703s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=8,dur=15282ms,overhead=0ms,throttle=40%,size=9244MiB,dirtyrate=2276342pps,bandwidth=604.83MiBs,downtime(expected)=22588ms
cloud-hypervisor: 224.536972s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=9,dur=16036ms,overhead=0ms,throttle=40%,size=9243MiB,dirtyrate=2152596pps,bandwidth=576.36MiBs,downtime(expected)=15280ms
cloud-hypervisor: 224.537323s: <migration> INFO:vmm/src/lib.rs:2231 -- Increasing auto-converge: 50%
cloud-hypervisor: 241.172592s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=10,dur=16635ms,overhead=0ms,throttle=50%,size=9244MiB,dirtyrate=2167918pps,bandwidth=555.66MiBs,downtime(expected)=16037ms
cloud-hypervisor: 241.334137s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 10ms to signal - retrying
cloud-hypervisor: 252.334900s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 10ms to signal - retrying
cloud-hypervisor: 259.005432s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=11,dur=17832ms,overhead=0ms,throttle=50%,size=9244MiB,dirtyrate=2040923pps,bandwidth=518.34MiBs,downtime(expected)=16634ms
cloud-hypervisor: 259.005524s: <migration> INFO:vmm/src/lib.rs:2231 -- Increasing auto-converge: 60%
cloud-hypervisor: 270.615387s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=12,dur=11609ms,overhead=0ms,throttle=60%,size=9243MiB,dirtyrate=1884611pps,bandwidth=796.11MiBs,downtime(expected)=17830ms
cloud-hypervisor: 272.575134s: <throttle-vcpu> WARN:vmm/src/cpu.rs:758 -- vCPU thread did not respond in 10ms to signal - retrying
cloud-hypervisor: 274.280182s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=13,dur=3664ms,overhead=0ms,throttle=60%,size=5209MiB,dirtyrate=1706834pps,bandwidth=1421.46MiBs,downtime(expected)=6542ms
cloud-hypervisor: 274.280212s: <migration> INFO:vmm/src/lib.rs:2231 -- Increasing auto-converge: 70%
cloud-hypervisor: 274.557281s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=14,dur=277ms,overhead=0ms,throttle=70%,size=2522MiB,dirtyrate=1524275pps,bandwidth=9105.73MiBs,downtime(expected)=1773ms
cloud-hypervisor: 274.557312s: <migration> INFO:vmm/src/lib.rs:2277 -- Memory delta transmission stopping - cutoff condition reached!
cloud-hypervisor: 274.557587s: <migration> INFO:vmm/src/lib.rs:2192 -- iter=15,dur=277ms,overhead=0ms,throttle=70%,size=0MiB,dirtyrate=1524275pps,bandwidth=9105.73MiBs,downtime(expected)=0ms
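
For orientation only (this is not the code from this PR, and every name except the thread name is illustrative): the `<dirty-log-worker>` visible at the top of the log can be pictured as a thread that periodically fetches the dirty log and derives a pages-per-second rate for the migration loop to consume, roughly along these lines:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Latest dirty-rate sample in pages per second, shared with the migration loop.
pub struct DirtyRate(pub AtomicU64);

/// Spawn a worker that periodically queries the dirty log and publishes a rate.
pub fn spawn_dirty_log_worker(
    rate: Arc<DirtyRate>,
    fetch_dirty_pages: impl Fn() -> u64 + Send + 'static,
) -> thread::JoinHandle<()> {
    thread::Builder::new()
        .name("dirty-log-worker".into())
        .spawn(move || {
            let interval = Duration::from_millis(500);
            let mut last = Instant::now();
            loop {
                thread::sleep(interval);
                // Number of pages dirtied since the previous query
                // (querying the dirty log also clears it).
                let pages = fetch_dirty_pages();
                let now = Instant::now();
                let elapsed = now.duration_since(last).as_secs_f64();
                last = now;
                rate.0.store((pages as f64 / elapsed) as u64, Ordering::Relaxed);
            }
        })
        .expect("failed to spawn dirty-log worker")
}
```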

Eventually, the VMM shuts down in the case of a successful migration.
We need to prevent "migration ongoing" errors in the shutdown path.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
The logging is neither spammy nor costly (iterations take seconds to
dozens of minutes) and is a clear win for debugging.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is the first commit in a series to introduce a new API endpoint in
Cloud Hypervisor that reports progress and live insights about an
ongoing live migration.

# Motivation

Having live, frequently refreshed statistics/metrics about an
ongoing live migration is especially interesting for debugging and
monitoring, such as checking the actual network throughput. With the
proposed changes, we will for the first time be able to see how live
migrations behave and build benchmarking infrastructure around them.

The ch driver in libvirt will use this information to populate its
`virsh domjobinfo` output.

# Design

We will add a new API endpoint to query information about ongoing live
migrations. The endpoint will also serve to query information about
previously failed or canceled migrations. The SendMigration call
will no longer block until the migration is done; instead, it will
just dispatch the migration.

This aligns the behavior with QEMU's and simplifies management
software.

When one queries the endpoint, a frequently refreshed snapshot of the
migration statistics and progress will be returned. The data will not be
assembled on the fly.
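
To make the design concrete, here is a rough sketch of what such a snapshot could carry; the type and field names are illustrative (loosely mirroring the per-iteration log line above), not the ones introduced by this series:

```rust
use serde::Serialize;

/// Illustrative snapshot shape (hypothetical field names).
#[derive(Clone, Debug, Serialize)]
pub struct MigrationProgress {
    pub state: MigrationState,
    pub iteration: u64,
    pub iteration_duration_ms: u64,
    pub throttle_percent: u8,
    pub transferred_mib: u64,
    pub dirty_rate_pps: u64,
    pub bandwidth_mib_per_s: f64,
    pub expected_downtime_ms: u64,
}

/// Illustrative lifecycle states for a migration.
#[derive(Clone, Copy, Debug, Serialize)]
pub enum MigrationState {
    Ongoing,
    Completed,
    Failed,
    Canceled,
}
```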

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is part of the commit series to enable live updates about an
ongoing live migration. See the first commit for an introduction.

We decided to use an Option<> rather than a Result<> as
there isn't really an error that can happen when querying this endpoint:
a previous snapshot is either there or it is not. It also doesn't make
sense to check whether the current VM is running, as users should always
be able to query information about a past (failed or canceled) live
migration.
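
A minimal sketch of that decision (`Vmm` and its field are stand-ins, reusing the `MigrationProgress` shape sketched earlier):

```rust
/// Sketch only; `Vmm` here is a stand-in, not the real struct.
struct Vmm {
    /// Latest snapshot of the current or most recent migration, if any.
    migration_progress: Option<MigrationProgress>,
}

impl Vmm {
    /// Option rather than Result: "nothing to report yet" is not an error.
    fn vm_migration_progress(&self) -> Option<MigrationProgress> {
        self.migration_progress.clone()
    }
}
```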

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is part of the commit series to enable live updates about an
ongoing live migration. See the first commit for an introduction.

In this commit, we add the HTTP endpoint to export ongoing VM
live-migration progress.
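
Purely as an illustration of the request/response mapping (the actual endpoint path, handler plumbing, and status codes in this commit may differ; this assumes serde_json and the `MigrationProgress` sketch from above):

```rust
/// Hypothetical mapping: Some(snapshot) -> 200 with a JSON body, None -> 404.
fn migration_progress_response(progress: Option<MigrationProgress>) -> (u16, String) {
    match progress {
        Some(p) => (200, serde_json::to_string(&p).unwrap_or_default()),
        None => (404, String::new()),
    }
}
```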

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is part of the commit series to enable live updates about an
ongoing live migration. See the first commit for an introduction.

This commit prepares for avoiding naming clashes in the following commits.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is part of the commit series to enable live updates about an
ongoing live migration. See the first commit for an introduction.

This commit brings all the functionality together. The first version has
the limitation that the latest snapshot is only populated once per
memory iteration, even though this phase is by far the most interesting
one. In a follow-up, we can make this more fine-grained.

We guarantee that migration progress can be fetched as soon as
SendMigration returns, because the underlying data source is populated by then.
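
A sketch of how such a guarantee can be provided (names are illustrative, reusing the `MigrationProgress` sketch from above): the snapshot lives in a shared slot that the migration loop writes once per iteration and that is seeded before SendMigration returns, so a query can never observe a missing data source.

```rust
use std::sync::{Arc, Mutex};

/// Shared slot: the migration loop writes, the API/query side reads.
type ProgressSlot = Arc<Mutex<Option<MigrationProgress>>>;

/// Called once per memory iteration to publish the latest snapshot.
fn publish_iteration(slot: &ProgressSlot, snapshot: MigrationProgress) {
    *slot.lock().unwrap() = Some(snapshot);
}
```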

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Time has proven that the previous design was not optimal. The
SendMigration call no longer blocks for the duration of the migration;
instead, it just triggers it. Using the new MigrationProgress
endpoint, management software can track the state of the migration and
also retrieve information about failed migrations.

A new `keep_alive` parameter for SendMigration will keep the VMM alive
and usable after the migration to ensure management software can fetch
the final state. The management software is then supposed to send a
ShutdownVmm command.

With this, we are finally able to query the migration progress API
endpoint during an ongoing live migration.
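
From the management-software side, the intended flow could look roughly like the sketch below (the trait and method names are placeholders, not actual Cloud Hypervisor API identifiers, and the polling interval is arbitrary; it reuses the `MigrationProgress`/`MigrationState` sketch from above):

```rust
use std::{thread, time::Duration};

/// Placeholder client interface for the three calls described above.
trait MigrationApi {
    type Error;
    fn send_migration(&self, destination: &str, keep_alive: bool) -> Result<(), Self::Error>;
    fn migration_progress(&self) -> Result<Option<MigrationProgress>, Self::Error>;
    fn shutdown_vmm(&self) -> Result<(), Self::Error>;
}

fn drive_migration<C: MigrationApi>(client: &C, destination: &str) -> Result<(), C::Error> {
    // Dispatch only: returns as soon as the migration has been started.
    client.send_migration(destination, /* keep_alive = */ true)?;

    // Poll the progress endpoint until a terminal state is reported.
    loop {
        if let Some(p) = client.migration_progress()? {
            if !matches!(p.state, MigrationState::Ongoing) {
                break;
            }
        }
        thread::sleep(Duration::from_secs(1));
    }

    // With keep_alive the VMM stays up, so the caller shuts it down explicitly.
    client.shutdown_vmm()
}
```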

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
We preserve the old behavior in ch-remote: SendMigration is blocking. A
new `--dispatch` flag, however, ensures that one can just dispatch the
migration without waiting for it to finish (or fail).

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Eventually, the VMM shuts down in the case of a successful migration.
We need to prevent "migration ongoing" errors in the shutdown path.

So far, I have only triggered this with `ch-remote`; we did not observe
it in the (test) production environment.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
If a failure happens fairly late in the migration, the VM would otherwise
remain unusable. This commit uses the generic migration-result check
code path to also resume() the VM if it was running before.

This allowed me to test various scenarios nicely via `ch-remote`.
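
A minimal sketch of the idea (the error type and resume hook are stand-ins, not the actual Cloud Hypervisor types):

```rust
/// Stand-in error type for the sketch.
#[derive(Debug)]
struct MigrationError;

/// On failure, bring the guest back if it was running before the migration,
/// instead of leaving it paused/throttled and unusable.
fn handle_migration_result(
    result: Result<(), MigrationError>,
    was_running: bool,
    resume_vm: impl FnOnce(),
) -> Result<(), MigrationError> {
    if result.is_err() && was_running {
        resume_vm();
    }
    result
}
```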

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Cloud Hypervisor only supports migration of running VMs. There are too
many implicit assumptions in the code to fix this easily, and with our
current knowledge this restriction is perfectly acceptable.

This check makes the failure case explicit instead of surfacing it as
deeply nested errors.
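
A sketch of such an explicit precondition check (the state enum and error text are illustrative):

```rust
/// Illustrative VM lifecycle states.
#[derive(Debug, PartialEq)]
enum VmState {
    Created,
    Running,
    Paused,
    Shutdown,
}

/// Fail fast with a clear message instead of deep inside the migration code.
fn check_migratable(state: &VmState) -> Result<(), String> {
    if *state != VmState::Running {
        return Err(format!("live migration requires a running VM (state: {state:?})"));
    }
    Ok(())
}
```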

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Mostly works, but sometimes the guest page faults (for non-idle workloads).
phip1611 force-pushed the poc-dirty-rate-thread branch from f82765c to 4e9ab64 on February 12, 2026, 21:04