Skip to content

fix(lockservice): fence stale binds by allocator epoch#24906

Open
iamlinjunhong wants to merge 6 commits into
matrixorigin:mainfrom
iamlinjunhong:m-24896
Open

fix(lockservice): fence stale binds by allocator epoch#24906
iamlinjunhong wants to merge 6 commits into
matrixorigin:mainfrom
iamlinjunhong:m-24896

Conversation

@iamlinjunhong

Copy link
Copy Markdown
Contributor

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24896

What this PR does / why we need it:

fix(lockservice): fence stale binds by allocator epoch

@qodo-code-review

Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@matrix-meow matrix-meow added the size/L Denotes a PR that changes [500,999] lines label Jun 9, 2026

@aptend aptend left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Result

Approved. I did not find any blocking issues in this PR.

What I Checked

  • GetBind and KeepLockTableBind propagate the allocator epoch to CN.
  • CN stale-bind purge follows the existing LockTable.Version semantics: cached binds with Version < observed allocator epoch are stale allocator state.
  • main-branch bind-change handling was reviewed against the epoch purge path, including waiter cleanup and commit-side bind validation.
  • The added tests cover get-bind refresh, keepalive failure/epoch refresh, concurrent observation, remote/proxy bind purge, and waiter cleanup.

Local Verification

git diff --check origin/main...HEAD
go test ./pkg/lockservice -run 'Test(CommitDetectsStaleLocalBindAfterAllocatorRestart|AllocatorVersionZeroKeepsLocalBinds|AllocatorObserverDoesNotPurgeSameEpochBindVersions|GetBindAllowsRegressedAllocatorVersionWithoutPurging|KeepaliveEpochPurgeKeepsGroupMovePop|AllocatorObserverPurgesRemoteAndProxyLockTables|AllocatorObserverConcurrentKeepaliveAndGetBind|AllocatorObserverCloseWaitersOnStaleLocalBind)$'
timeout 180s go test ./pkg/lockservice

Note: the full go test ./pkg/lockservice hit the 180s local timeout without output. The PR-specific test set passed locally, and CI is green.

@iamlinjunhong

Copy link
Copy Markdown
Contributor Author

fixed the lockservice allocator-restart fencing issue across d4-24896, d3-24896, and m-24896.
Main fixes:
Added an allocator instance ID to lock table binds and allocator responses, so CNs can detect allocator replacement even when wall-clock-based versions regress or stay the same.
Updated CN-side stale bind detection to use allocator identity, not only monotonic allocator version.
Made allocator replacement purge go through normal bind-change fencing semantics, so active transactions get ErrLockTableBindChanged instead of lower-level closed/not-found errors.
Covered the combined keepalive unhappy path where OK=false and a newer/different allocator state arrive in the same response.
Ensured stale local, remote, and proxy lock tables are purged while current allocator-instance binds are preserved.
Added/updated regression tests for allocator ID changes, regressed allocator versions, keepalive cleanup, active txn fencing, remote/proxy stale bind cleanup, and concurrent observer paths.
For old branches, fixed remote-lock timeout handling to avoid recording phantom locks when the request likely never reached the remote owner.
Updated branch-specific tests whose old expectations conflicted with the new bind-change fencing behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Denotes a PR that changes [1000, 1999] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants