XXX migration: fetch dirty_log in thread #82
Draft
phip1611 wants to merge 16 commits into cyberus-technology:gardenlinux from
Eventually, the VMM shuts down in the case of a successful migration. We need to prevent "migration ongoing" errors in the shutdown path. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
The logging is neither spammy nor costly (iterations take seconds to dozens of minutes) and is clearly a win for us when debugging. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is the first commit in a series of commits to introduce a new API endpoint in Cloud Hypervisor to report progress and live insights about an ongoing live migration.

# Motivation

Having live and frequently refreshed statistics/metrics about an ongoing live migration is especially useful for debugging and monitoring, such as checking the actual network throughput. With the proposed changes, we will be able to see for the first time how live migrations behave and to build benchmarking infrastructure around them. The ch driver in libvirt will use this information to populate its `virsh domjobinfo` output.

# Design

We will add a new API endpoint to query information about ongoing live migrations. The new endpoint will also serve to query information about any previously failed or canceled migrations. The SendMigration call will no longer block until the migration is done but will instead just dispatch the migration. This aligns the behavior with QEMU and simplifies management software.

When one queries the endpoint, a frequently refreshed snapshot of the migration statistics and progress will be returned. The data will not be assembled on the fly.

On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is part of the commit series to enable live updates about an ongoing live migration. See the first commit for an introduction. We decided to use an Option<> rather than a Result<> as there isn't really an error that can happen when we query this endpoint. A previous snapshot may either be there or not. It also doesn't make sense here to check if the current VM is running, as users should always be able to query information about the past (failed or canceled) live migration. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
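The choice of `Option<>` over `Result<>` described in this commit message can be illustrated with a minimal sketch. All names here (`Vmm`, `MigrationState`, `migration_progress`, the fields) are hypothetical and are not taken from the Cloud Hypervisor code base; the sketch only shows the shape of the idea: the query returns whatever snapshot exists and deliberately ignores whether a VM is currently running.

```rust
/// Hypothetical, simplified view of a migration's last known state.
#[derive(Clone, Debug, PartialEq)]
pub enum MigrationState {
    Ongoing { iteration: u64 },
    Finished,
    Failed(String),
    Cancelled,
}

/// Hypothetical VMM state; only the fields needed for this sketch.
pub struct Vmm {
    pub vm_running: bool,
    pub last_migration: Option<MigrationState>,
}

impl Vmm {
    /// Returns the last snapshot, if any.
    ///
    /// There is no error case: a snapshot is either present or absent.
    /// `vm_running` is intentionally not consulted, so information about
    /// a past failed or canceled migration stays queryable.
    pub fn migration_progress(&self) -> Option<MigrationState> {
        self.last_migration.clone()
    }
}

fn main() {
    let vmm = Vmm {
        vm_running: false,
        last_migration: Some(MigrationState::Failed("connection lost".to_string())),
    };
    println!("{:?}", vmm.migration_progress());
}
```

A caller only has to distinguish "no migration was ever dispatched" (`None`) from "here is the latest state" (`Some`), which keeps the endpoint handler free of error mapping.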
This is part of the commit series to enable live updates about an ongoing live migration. See the first commit for an introduction. In this commit, we add the HTTP endpoint to export ongoing VM live-migration progress. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is part of the commit series to enable live updates about an ongoing live migration. See the first commit for an introduction. This commit prepares the avoidance of naming clashes in the following. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is part of the commit series to enable live updates about an ongoing live migration. See the first commit for an introduction. This commit actually brings all the functionality together. The first version has the limitation that we populate the latest snapshot only once per memory iteration, even though the memory transfer is by far the most interesting part. In a follow-up, we can make this more fine-grained. We guarantee that as soon as SendMigration returns, migration progress can be fetched, as the underlying data source is already populated. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
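The following is a minimal sketch of the pattern this commit message describes, not the actual implementation: a dispatch call that returns immediately, a shared slot that holds the latest snapshot, and a worker that refreshes the slot once per (simulated) memory iteration. The names `ProgressSnapshot`, `SharedProgress`, `dispatch_migration`, and `query_progress` are assumptions made for illustration, as is the use of `Arc<Mutex<...>>` for sharing.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical snapshot type; the field names are illustrative only.
#[derive(Clone, Debug, PartialEq)]
pub struct ProgressSnapshot {
    pub iteration: u64,
    pub bytes_sent: u64,
}

/// Shared slot holding the most recent snapshot, if any.
pub type SharedProgress = Arc<Mutex<Option<ProgressSnapshot>>>;

/// Dispatches a simulated migration and returns immediately.
///
/// The slot is populated before the worker thread is spawned, so a query
/// issued right after this function returns already sees a snapshot.
pub fn dispatch_migration(progress: SharedProgress, iterations: u64) -> thread::JoinHandle<()> {
    *progress.lock().unwrap() = Some(ProgressSnapshot { iteration: 0, bytes_sent: 0 });
    thread::spawn(move || {
        for i in 1..=iterations {
            // A real worker would transfer dirty memory here.
            let snapshot = ProgressSnapshot { iteration: i, bytes_sent: i * 4096 };
            // Refresh the stored snapshot once per iteration.
            *progress.lock().unwrap() = Some(snapshot);
        }
    })
}

/// Query path: clone the stored snapshot instead of assembling one on the fly.
pub fn query_progress(progress: &SharedProgress) -> Option<ProgressSnapshot> {
    progress.lock().unwrap().clone()
}

fn main() {
    let progress: SharedProgress = Arc::new(Mutex::new(None));
    let worker = dispatch_migration(Arc::clone(&progress), 3);
    println!("right after dispatch: {:?}", query_progress(&progress));
    worker.join().unwrap();
    println!("after completion: {:?}", query_progress(&progress));
}
```

Because the query path only clones stored data, it stays cheap no matter how often management software polls it; the trade-off is that the data is only as fresh as the last iteration boundary.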
On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Time has proven that the previous design was not optimal. Now, the SendMigration call no longer blocks for the duration of the migration. Instead, it just triggers the migration. Using the new MigrationProgress endpoint, management software can track the state of the migration and also find information about failed migrations. A new `keep_alive` parameter for SendMigration keeps the VMM alive and usable after the migration so that management software can fetch the final state. The management software is then supposed to send a ShutdownVmm command. With this, we are finally able to query the migration progress API endpoint during an ongoing live migration. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
We preserve the old behavior in ch-remote: SendMigration is blocking. A new `--dispatch` flag, however, ensures that one can just dispatch the migration without waiting for it to finish (or fail). On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Eventually, the VMM shuts down in the case of a successful migration. We need to prevent "migration ongoing" errors in the shutdown path. So far, I only triggered this with `ch-remote` but we didn't observe it in the (test) production environment. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
If a failure happens fairly late in the migration, the VM will remain unusable. This commit uses the generic migration result check code path to resume() the VM when the VM was running before as well. I could nicely test various scenarios via `ch-remote`. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
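A rough sketch of the recovery idea in this commit message, under stated assumptions: the `Vm` struct, its `pause`/`resume` methods, and `migrate_with_recovery` are hypothetical stand-ins, and the real code path in Cloud Hypervisor is more involved. The point is only that one common result check resumes the VM if it was running before the migration and the migration failed.

```rust
/// Hypothetical, reduced VM state for this sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum VmState {
    Running,
    Paused,
}

/// Hypothetical VM handle.
pub struct Vm {
    pub state: VmState,
}

impl Vm {
    pub fn pause(&mut self) {
        self.state = VmState::Paused;
    }

    pub fn resume(&mut self) {
        self.state = VmState::Running;
    }
}

/// Runs `send` and, on failure, resumes the VM if it was running before.
pub fn migrate_with_recovery<F>(vm: &mut Vm, send: F) -> Result<(), String>
where
    F: FnOnce(&mut Vm) -> Result<(), String>,
{
    // Remember the state before the migration touches the VM.
    let was_running = vm.state == VmState::Running;
    let result = send(vm);
    // Generic result check: a late failure may have left the VM paused.
    if result.is_err() && was_running && vm.state != VmState::Running {
        vm.resume();
    }
    result
}

fn main() {
    let mut vm = Vm { state: VmState::Running };
    // Simulate a failure that happens after the VM was already paused.
    let result = migrate_with_recovery(&mut vm, |vm| {
        vm.pause();
        Err("late failure".to_string())
    });
    println!("result: {:?}, state: {:?}", result, vm.state);
}
```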
Cloud Hypervisor only supports migration of running VMs. There are too many implicit assumptions in the code to fix them easily. Further, to our current knowledge, this restriction is perfectly acceptable. This check makes the failure case explicit instead of surfacing it as deeply nested errors. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
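An explicit up-front check of this kind could look roughly like the sketch below. The `VmState` enum is a simplified, hypothetical stand-in, and `MigrationError::VmNotRunning` and `check_migratable` are illustrative names, not identifiers from the actual patch.

```rust
use std::fmt;

/// Simplified, hypothetical VM states for this sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum VmState {
    Created,
    Running,
    Paused,
    Shutdown,
}

/// Hypothetical error type with one dedicated, self-explanatory variant.
#[derive(Debug, PartialEq)]
pub enum MigrationError {
    VmNotRunning(VmState),
}

impl fmt::Display for MigrationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MigrationError::VmNotRunning(state) => {
                write!(f, "cannot migrate: VM is {:?}, expected Running", state)
            }
        }
    }
}

/// Rejects the migration early unless the VM is running.
pub fn check_migratable(state: VmState) -> Result<(), MigrationError> {
    match state {
        VmState::Running => Ok(()),
        other => Err(MigrationError::VmNotRunning(other)),
    }
}

fn main() {
    for state in [VmState::Created, VmState::Running, VmState::Paused, VmState::Shutdown] {
        match check_migratable(state) {
            Ok(()) => println!("{:?}: migration may proceed", state),
            Err(e) => println!("{:?}: {}", state, e),
        }
    }
}
```

Failing at the entry point gives the caller a single clear error rather than one that bubbles up from deep inside the migration code.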
On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Mostly works, but sometimes the guest page-faults (for non-idle workloads).
f82765c to 4e9ab64
Closes https://github.com/cobaltcore-dev/cobaltcore/issues/280
This log nicely shows how the dirty rate calculation is now much better and nicely correlates with the throttling of the vCPU: