Skip to content

fix(lockservice): fence stale binds by allocator epoch#24905

Open
iamlinjunhong wants to merge 6 commits into
matrixorigin:4.0-devfrom
iamlinjunhong:d4-24896
Open

fix(lockservice): fence stale binds by allocator epoch#24905
iamlinjunhong wants to merge 6 commits into
matrixorigin:4.0-devfrom
iamlinjunhong:d4-24896

Conversation

@iamlinjunhong

Copy link
Copy Markdown
Contributor

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24896

What this PR does / why we need it:

fix(lockservice): fence stale binds by allocator epoch

@qodo-code-review

Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@matrix-meow matrix-meow added the size/L Denotes a PR that changes [500,999] lines label Jun 9, 2026

@XuPeng-SH XuPeng-SH left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not confirm a blocking logic bug in the implementation itself, but there is still one important unhappy-path regression gap that should be covered before this is approved.

The new fix combines two recovery actions off the same keepalive failure response: local-bind cleanup when OK=false, and stale-bind purge when the allocator epoch increases. Current tests cover allocator-epoch purge and OK=false handling separately, but I do not see a test that drives a single keepalive response with both OK=false and a newer AllocatorVersion, then verifies that:

  • stale local binds are removed
  • stale remote/proxy binds from the old epoch are removed
  • lastAllocatorVersion advances
  • current-epoch binds are preserved

That combined failure path is the core unhappy-path this PR is hardening after allocator restart/failover. Without one regression test for the combined case, a future refactor could short-circuit one branch while the other still fires and leave stale bindings behind. Please add a focused test around doKeepLockTableBind / keepalive handling for this case.

@iamlinjunhong

Copy link
Copy Markdown
Contributor Author

I did not confirm a blocking logic bug in the implementation itself, but there is still one important unhappy-path regression gap that should be covered before this is approved.

The new fix combines two recovery actions off the same keepalive failure response: local-bind cleanup when OK=false, and stale-bind purge when the allocator epoch increases. Current tests cover allocator-epoch purge and OK=false handling separately, but I do not see a test that drives a single keepalive response with both OK=false and a newer AllocatorVersion, then verifies that:

  • stale local binds are removed
  • stale remote/proxy binds from the old epoch are removed
  • lastAllocatorVersion advances
  • current-epoch binds are preserved

That combined failure path is the core unhappy-path this PR is hardening after allocator restart/failover. Without one regression test for the combined case, a future refactor could short-circuit one branch while the other still fires and leave stale bindings behind. Please add a focused test around doKeepLockTableBind / keepalive handling for this case.

fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Denotes a PR that changes [1000, 1999] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants