fix(lockservice): fence stale binds by allocator epoch by iamlinjunhong · Pull Request #24903 · matrixorigin/matrixone

iamlinjunhong · 2026-06-09T09:39:58Z

What type of PR is this?

Which issue(s) this PR fixes:

issue #24896

What this PR does / why we need it:

fix(lockservice): fence stale binds by allocator epoch

qodo-code-review · 2026-06-09T09:40:03Z

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

aptend

Review Result

Approved. I did not find any blocking issues in this PR.

What I Checked

GetBind and KeepLockTableBind now return the allocator epoch to CN.
CN stale-bind purge uses the existing LockTable.Version semantics consistently: binds with Version < observed allocator epoch are old allocator state and can be removed.
The added tests cover get-bind refresh, keepalive failure/epoch refresh, concurrent observation, remote/proxy bind purge, and waiter cleanup.

Local Verification

git diff --check HEAD^ HEAD
go test ./pkg/lockservice -run 'Test(CommitDetectsStaleLocalBindAfterAllocatorRestart|AllocatorVersionZeroKeepsLocalBinds|AllocatorObserverDoesNotPurgeSameEpochBindVersions|GetBindAllowsRegressedAllocatorVersionWithoutPurging|KeepaliveEpochPurgeKeepsGroupMovePop|AllocatorObserverPurgesRemoteAndProxyLockTables|AllocatorObserverConcurrentKeepaliveAndGetBind|AllocatorObserverCloseWaitersOnStaleLocalBind)$'

Note: I started go test ./pkg/lockservice, but stopped it after it ran for more than 5 minutes without output and then verified the PR-specific test set above.

XuPeng-SH

I understand the bug this PR is trying to fix, but I found one substantive correctness issue in the epoch-fencing design itself.

The allocator epoch is initialized from time.Now().UnixNano() in pkg/lockservice/lock_table_allocator.go, while the observer side in service.go explicitly ignores any non-increasing allocator version (observedVersion < oldVersion or == oldVersion). That means a real allocator restart can fail to purge stale cached binds if the new process comes up with the same or a lower wall-clock timestamp (for example after a clock step backward / NTP correction / VM time skew). In that case getLockTableWithCreate() keeps returning the stale entry from tableGroups without ever re-contacting the allocator, which is exactly the class of stale-bind problem this PR is trying to fence off.

What makes this higher confidence is that the new test TestGetBindAllowsRegressedAllocatorVersionWithoutPurging explicitly locks in that behavior. I think the fix needs an epoch source that is guaranteed monotonic/unique across allocator restarts (persisted term/counter, restart generation, etc.), or the observer must still invalidate cache on allocator replacement even when the observed epoch is non-increasing.

XuPeng-SH

I found one blocking issue in the current head.

In remoteLockTable.lock, the new skipTrackLockOnError path skips txn.lockAdded(...) for context.Canceled, context.DeadlineExceeded, and ErrRPCTimeout. But those errors do not prove the request never reached the remote owner — they can also happen after the remote side already acquired the lock and before the response got back.

In that case, the local txn no longer records the remote lock in txn.lockHolders, so the normal Unlock() / txn.close() path will never send the remote unlock. The code comment says the lock will eventually be cleaned by the keep-remote/orphan timeout, but that is still a real behavioral regression: a lock that should be released immediately at txn end can now remain blocking until timeout-based cleanup fires.

So I do not think it is safe to skip local bookkeeping for timeout/cancel based only on the client-side error class. This should only be done when the code can prove the request was never observed remotely, or else it needs an explicit reconciliation path before deciding not to record the remote lock.

XuPeng-SH

I found two blocking correctness issues in the current head (6e8ca4b).

pkg/lockservice/service.go:786-814 / pkg/lockservice/lock_table_keeper.go:229-236: the keepalive OK=false path still removes this service's local binds via holders.removeWithFilter(..., closeReasonKeepBindFailed) without going through removeLockTablesWithFence / fenceByBindChanged. That means an active txn that already touched the old bind is not marked bindChanged. After the purge, the same txn can fetch a fresh bind and acquire more locks successfully instead of getting ErrLockTableBindChanged. I reproduced this locally with a temporary regression test: lock table in txn T, reset allocator state so keepalive returns OK=false, call doKeepLockTableBind, then call Lock again on the same table and same txn. Current code returns nil, leaving T with old tableBinds plus new locks, so the failure is deferred to commit/valid/unlock instead of being fenced at the bind change.
pkg/lockservice/service.go:689-741 plus the unconditional tableGroups.set in getLockTableWithCreate: allocatorChanged treats any different non-empty AllocatorID as the next allocator. A delayed response from the old allocator can arrive after this CN has already accepted a newer allocator ID; current code will accept that old ID as another replacement, purge current binds, update lastAllocatorID back to the old allocator, and then getLockTableWithCreate can install the stale bind from that old response. UUID distinguishes allocator processes, but it does not order them. The stale-response unhappy path needs a monotonic term/generation, or a guard that prevents an already superseded allocator response from rolling CN state backward.

Please also audit identity consistency around allocator IDs. LockTable.Changed now includes AllocatorID, but getRemoteLockBindKey and LockTable.Equal still omit it, so remote bind heartbeat/stale cleanup can conflate different allocator instances when service/table/version match.

Local verification on the PR-specific tests still passes:

GOCACHE=/tmp/matrixone-pr24903-gocache go test ./pkg/lockservice -run 'Test(CommitDetectsStaleLocalBindAfterAllocatorRestart|AllocatorVersionZeroKeepsLocalBinds|AllocatorObserverDoesNotPurgeSameEpochBindVersions|GetBindPurgesStaleBindWhenAllocatorIDChangesWithRegressedVersion|KeepaliveEpochPurgeKeepsGroupMovePop|KeepaliveOKFalseWithNewAllocatorVersionPurgesStaleBinds|AllocatorObserverPurgesRemoteAndProxyLockTables|AllocatorObserverConcurrentKeepaliveAndGetBind|AllocatorObserverCloseWaitersOnStaleLocalBind|LockRemoteWithContextTimeoutTracksLockForUnlock|LockWithBindTimeout|LockWithBindNotFound|Issue3538)$'

But the temporary OK=false active-txn regression described above fails on current head, so I am requesting changes.

iamlinjunhong · 2026-06-11T03:37:10Z

I found two blocking correctness issues in the current head (6e8ca4b).

pkg/lockservice/service.go:786-814 / pkg/lockservice/lock_table_keeper.go:229-236: the keepalive OK=false path still removes this service's local binds via holders.removeWithFilter(..., closeReasonKeepBindFailed) without going through removeLockTablesWithFence / fenceByBindChanged. That means an active txn that already touched the old bind is not marked bindChanged. After the purge, the same txn can fetch a fresh bind and acquire more locks successfully instead of getting ErrLockTableBindChanged. I reproduced this locally with a temporary regression test: lock table in txn T, reset allocator state so keepalive returns OK=false, call doKeepLockTableBind, then call Lock again on the same table and same txn. Current code returns nil, leaving T with old tableBinds plus new locks, so the failure is deferred to commit/valid/unlock instead of being fenced at the bind change.

pkg/lockservice/service.go:689-741 plus the unconditional tableGroups.set in getLockTableWithCreate: allocatorChanged treats any different non-empty AllocatorID as the next allocator. A delayed response from the old allocator can arrive after this CN has already accepted a newer allocator ID; current code will accept that old ID as another replacement, purge current binds, update lastAllocatorID back to the old allocator, and then getLockTableWithCreate can install the stale bind from that old response. UUID distinguishes allocator processes, but it does not order them. The stale-response unhappy path needs a monotonic term/generation, or a guard that prevents an already superseded allocator response from rolling CN state backward.

Please also audit identity consistency around allocator IDs. LockTable.Changed now includes AllocatorID, but getRemoteLockBindKey and LockTable.Equal still omit it, so remote bind heartbeat/stale cleanup can conflate different allocator instances when service/table/version match.

Local verification on the PR-specific tests still passes:
GOCACHE=/tmp/matrixone-pr24903-gocache go test ./pkg/lockservice -run 'Test(CommitDetectsStaleLocalBindAfterAllocatorRestart|AllocatorVersionZeroKeepsLocalBinds|AllocatorObserverDoesNotPurgeSameEpochBindVersions|GetBindPurgesStaleBindWhenAllocatorIDChangesWithRegressedVersion|KeepaliveEpochPurgeKeepsGroupMovePop|KeepaliveOKFalseWithNewAllocatorVersionPurgesStaleBinds|AllocatorObserverPurgesRemoteAndProxyLockTables|AllocatorObserverConcurrentKeepaliveAndGetBind|AllocatorObserverCloseWaitersOnStaleLocalBind|LockRemoteWithContextTimeoutTracksLockForUnlock|LockWithBindTimeout|LockWithBindNotFound|Issue3538)$'
But the temporary OK=false active-txn regression described above fails on current head, so I am requesting changes.

fixed

fix(lockservice): fence stale binds by allocator epoch

fead819

iamlinjunhong requested review from XuPeng-SH and aptend as code owners June 9, 2026 09:39

iamlinjunhong temporarily deployed to ci June 9, 2026 09:40 — with GitHub Actions Inactive

matrix-meow added the size/L Denotes a PR that changes [500,999] lines label Jun 9, 2026

aptend approved these changes Jun 9, 2026

View reviewed changes

XuPeng-SH requested changes Jun 9, 2026

View reviewed changes

test(lockservice): cover keepalive stale bind purge

3e51aa7

iamlinjunhong temporarily deployed to ci June 10, 2026 02:22 — with GitHub Actions Inactive

iamlinjunhong temporarily deployed to ci June 10, 2026 08:02 — with GitHub Actions Inactive

XuPeng-SH requested changes Jun 10, 2026

View reviewed changes

fix(lockservice): keep remote lock bookkeeping on timeout

6e8ca4b

iamlinjunhong temporarily deployed to ci June 10, 2026 09:19 — with GitHub Actions Inactive

iamlinjunhong had a problem deploying to ci June 10, 2026 09:19 — with GitHub Actions Failure

iamlinjunhong temporarily deployed to ci June 10, 2026 10:14 — with GitHub Actions Inactive

XuPeng-SH requested changes Jun 10, 2026

View reviewed changes

fix(lockservice): fence keepalive purge and reject stale allocator

30a092f

iamlinjunhong temporarily deployed to ci June 11, 2026 02:52 — with GitHub Actions Inactive

iamlinjunhong had a problem deploying to ci June 11, 2026 02:52 — with GitHub Actions Failure

iamlinjunhong temporarily deployed to ci June 11, 2026 02:52 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(lockservice): fence stale binds by allocator epoch#24903

fix(lockservice): fence stale binds by allocator epoch#24903
iamlinjunhong wants to merge 6 commits into
matrixorigin:3.0-devfrom
iamlinjunhong:d3-24896

iamlinjunhong commented Jun 9, 2026

Uh oh!

qodo-code-review Bot commented Jun 9, 2026

Uh oh!

aptend left a comment •

edited

Loading

Uh oh!

XuPeng-SH left a comment

Uh oh!

XuPeng-SH left a comment

Uh oh!

XuPeng-SH left a comment

Uh oh!

iamlinjunhong commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

iamlinjunhong commented Jun 9, 2026

What type of PR is this?

Which issue(s) this PR fixes:

What this PR does / why we need it:

Uh oh!

qodo-code-review Bot commented Jun 9, 2026

Qodo reviews are paused for this user.

Uh oh!

aptend left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

XuPeng-SH left a comment

Choose a reason for hiding this comment

Uh oh!

XuPeng-SH left a comment

Choose a reason for hiding this comment

Uh oh!

XuPeng-SH left a comment

Choose a reason for hiding this comment

Uh oh!

iamlinjunhong commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

aptend left a comment •

edited

Loading