@changlei-li
Contributor

@changlei-li changlei-li commented Dec 30, 2025

There is a race condition on the VM cache between pool_migrate_complete and VM events.
In the cross-pool migration case, the VM is deliberately created with power_state Halted in the XAPI DB. In pool_migrate_complete, add_caches creates an empty xenops_cache entry for the VM, and refresh_vm then compares the cached power state (None) with the real state (Running) and writes the correct power state to the XAPI DB.
In the failing case, the observed sequence is:
-> VM event 1 update_vm
-> pool_migrate_complete add_caches (cache power_state None)
-> pool_migrate_complete refresh_vm
-> VM event 1 update cache (cache power_state Running)
-> VM event 2 update_vm (Running <-> Running, XAPI DB not updated)
VM event 1 is still in flight when pool_migrate_complete calls add_caches, so its late cache update overwrites the empty entry and breaks the design intention.

This commit adds a wait in pool_migrate_complete to ensure that all in-flight events complete before add_caches, which removes the race condition.
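
To make the failing interleaving concrete, here is a minimal toy model in OCaml. It is only a sketch: the `world` record, `late_cache_write`, `intended`, and `racy` are hypothetical names, and `update_vm` here is an assumed simplification of the real handler (compare the cache with the actual state, write the DB on a difference, then update the cache). The description does not say why VM event 1 leaves the DB untouched, so the model simply assumes that only its late cache write matters.

```ocaml
(* Toy model of the interleaving described above.
   Names mirror the PR text; none of this is the real XAPI code. *)
type power_state = Halted | Running

type world = {
  mutable db : power_state;            (* power_state stored in the XAPI DB *)
  mutable cache : power_state option;  (* cached state; None = empty entry *)
  actual : power_state;                (* state the VM is really in *)
}

let fresh () = { db = Halted; cache = None; actual = Running }

(* add_caches: install an empty cache entry for the VM. *)
let add_caches w = w.cache <- None

(* Assumed handler logic: write the DB only when the cache disagrees with
   the actual state, then record the actual state in the cache. *)
let update_vm w =
  if w.cache <> Some w.actual then w.db <- w.actual;
  w.cache <- Some w.actual

(* Tail of an event that started before add_caches. Only its cache write
   is modelled (assumption: it does not touch the DB). *)
let late_cache_write w = w.cache <- Some w.actual

(* Intended order: the update following refresh_vm sees None vs Running. *)
let intended () =
  let w = fresh () in
  add_caches w;
  update_vm w;
  w.db

(* Failing order: event 1 fills the cache before event 2 compares. *)
let racy () =
  let w = fresh () in
  add_caches w;
  late_cache_write w;   (* VM event 1 update cache *)
  update_vm w;          (* VM event 2: Running vs Running, no DB write *)
  w.db

let show = function Halted -> "Halted" | Running -> "Running"

let () =
  Printf.printf "intended: %s\nracy: %s\n" (show (intended ())) (show (racy ()))
```

In this model the intended order leaves the DB at Running, while the racy order leaves it at Halted, which matches the symptom described.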

@robhoes
Member

robhoes commented Dec 30, 2025

Nice find!

The diff is large and it is hard to see if the Events_from_xenopsd module has just been moved or modified as well. Has it changed?

@changlei-li changlei-li force-pushed the private/changleli/fix-xenops-cache branch from baf60ce to 0c82f5e Compare December 31, 2025 01:28
@changlei-li
Contributor Author

> The diff is large and it is hard to see if the Events_from_xenopsd module has just been moved or modified as well. Has it changed?

It was only moved, with no changes. I split the move into a separate commit to make it easier to review.

@changlei-li
Contributor Author

changlei-li commented Dec 31, 2025

I reran the cross-pool migrate case 10 times and it passed every time, but some other migrate cases in BST failed. Let me think more about the fix. I am converting the PR to draft in the meantime.

@changlei-li changlei-li marked this pull request as draft December 31, 2025 03:03
There is a race condition on the VM cache between
pool_migrate_complete and VM events.
In the cross-pool migration case, the VM is deliberately created with
power_state Halted in the XAPI DB. In pool_migrate_complete, add_caches
creates an empty xenops_cache entry for the VM, and refresh_vm then
compares the cached power state (None) with the real state (Running)
and writes the correct power state to the XAPI DB.
In the failing case, the observed sequence is:
-> VM event 1 update_vm
-> pool_migrate_complete add_caches (cache power_state None)
-> pool_migrate_complete refresh_vm
-> VM event 1 update cache (cache power_state Running)
-> VM event 2 update_vm (Running <-> Running, XAPI DB not updated)
VM event 1 is still in flight when pool_migrate_complete calls
add_caches, so its late cache update overwrites the empty entry and
breaks the design intention.

This commit adds a wait in pool_migrate_complete to ensure that all
in-flight events complete before add_caches, which removes the race
condition.

Signed-off-by: Changlei Li <changlei.li@cloud.com>
@changlei-li changlei-li force-pushed the private/changleli/fix-xenops-cache branch from 0c82f5e to d308a91 Compare December 31, 2025 08:25
@changlei-li
Contributor Author

Reopening the PR. The previous approach used with_suppressed on the destination host, which leads to a deadlock in the same-host migration case. The fix now only adds a call to Xapi_xenops.Events_from_xenopsd.wait before add_caches in pool_migrate_complete, to ensure that all in-flight events complete.
Reran CrossPoolMigrate 5 times: all passed.
ring3 BST: passed.
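
The effect of waiting before add_caches can be sketched with a single-threaded queue model in OCaml. This is an illustration under stated assumptions, not the real Xapi_xenops code: the `event` type, the barrier-based `wait`, and the body of `pool_migrate_complete` are hypothetical stand-ins, and the model assumes that waiting means "process everything queued before a barrier marker". It does not model with_suppressed or the same-host deadlock.

```ocaml
(* Single-threaded toy model: pending events sit in a queue, and wait
   drains the queue up to a barrier it has just injected.
   All names are illustrative, not the real XAPI definitions. *)
type power_state = Halted | Running
type event = Run of (unit -> unit) | Barrier of int

let db = ref Halted                            (* power_state in the XAPI DB *)
let cache : power_state option ref = ref None  (* cache entry for the VM *)
let actual = Running                           (* state the VM is really in *)
let queue : event Queue.t = Queue.create ()
let next_id = ref 0

let add_caches () = cache := None

(* Assumed handler logic: write the DB only when the cache disagrees. *)
let update_vm () =
  if !cache <> Some actual then db := actual;
  cache := Some actual

(* Inject a barrier, then process events until that barrier is reached. *)
let wait () =
  let id = !next_id in
  incr next_id;
  Queue.push (Barrier id) queue;
  let rec drain () =
    match Queue.pop queue with
    | Barrier b when b = id -> ()
    | Barrier _ -> drain ()
    | Run f -> f (); drain ()
  in
  drain ()

(* Hypothetical shape of the fixed sequence. *)
let pool_migrate_complete () =
  wait ();                           (* in-flight events finish first *)
  add_caches ();                     (* the cache is now reliably empty *)
  Queue.push (Run update_vm) queue;  (* update requested by refresh_vm *)
  wait ()

let () =
  (* VM event 1 is still pending; only its cache write is modelled. *)
  Queue.push (Run (fun () -> cache := Some actual)) queue;
  pool_migrate_complete ();
  assert (!db = Running)
```

Because event 1 is drained before add_caches runs, its cache write can no longer land after the reset, so the later update sees None versus Running and writes the DB.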

@changlei-li changlei-li marked this pull request as ready for review December 31, 2025 08:30
@changlei-li changlei-li added this pull request to the merge queue Jan 4, 2026
Merged via the queue into xapi-project:master with commit 37c017b Jan 4, 2026
16 checks passed
@changlei-li changlei-li deleted the private/changleli/fix-xenops-cache branch January 4, 2026 01:37