fix(lockservice): fence stale binds by allocator epoch#24903
fix(lockservice): fence stale binds by allocator epoch#24903iamlinjunhong wants to merge 6 commits into
Conversation
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
There was a problem hiding this comment.
Review Result
Approved. I did not find any blocking issues in this PR.
What I Checked
GetBindandKeepLockTableBindnow return the allocator epoch to CN.- CN stale-bind purge uses the existing
LockTable.Versionsemantics consistently: binds withVersion < observed allocator epochare old allocator state and can be removed. - The added tests cover get-bind refresh, keepalive failure/epoch refresh, concurrent observation, remote/proxy bind purge, and waiter cleanup.
Local Verification
git diff --check HEAD^ HEAD
go test ./pkg/lockservice -run 'Test(CommitDetectsStaleLocalBindAfterAllocatorRestart|AllocatorVersionZeroKeepsLocalBinds|AllocatorObserverDoesNotPurgeSameEpochBindVersions|GetBindAllowsRegressedAllocatorVersionWithoutPurging|KeepaliveEpochPurgeKeepsGroupMovePop|AllocatorObserverPurgesRemoteAndProxyLockTables|AllocatorObserverConcurrentKeepaliveAndGetBind|AllocatorObserverCloseWaitersOnStaleLocalBind)$'Note: I started go test ./pkg/lockservice, but stopped it after it ran for more than 5 minutes without output and then verified the PR-specific test set above.
XuPeng-SH
left a comment
There was a problem hiding this comment.
I understand the bug this PR is trying to fix, but I found one substantive correctness issue in the epoch-fencing design itself.
The allocator epoch is initialized from time.Now().UnixNano() in pkg/lockservice/lock_table_allocator.go, while the observer side in service.go explicitly ignores any non-increasing allocator version (observedVersion < oldVersion or == oldVersion). That means a real allocator restart can fail to purge stale cached binds if the new process comes up with the same or a lower wall-clock timestamp (for example after a clock step backward / NTP correction / VM time skew). In that case getLockTableWithCreate() keeps returning the stale entry from tableGroups without ever re-contacting the allocator, which is exactly the class of stale-bind problem this PR is trying to fence off.
What makes this higher confidence is that the new test TestGetBindAllowsRegressedAllocatorVersionWithoutPurging explicitly locks in that behavior. I think the fix needs an epoch source that is guaranteed monotonic/unique across allocator restarts (persisted term/counter, restart generation, etc.), or the observer must still invalidate cache on allocator replacement even when the observed epoch is non-increasing.
XuPeng-SH
left a comment
There was a problem hiding this comment.
I found one blocking issue in the current head.
In remoteLockTable.lock, the new skipTrackLockOnError path skips txn.lockAdded(...) for context.Canceled, context.DeadlineExceeded, and ErrRPCTimeout. But those errors do not prove the request never reached the remote owner — they can also happen after the remote side already acquired the lock and before the response got back.
In that case, the local txn no longer records the remote lock in txn.lockHolders, so the normal Unlock() / txn.close() path will never send the remote unlock. The code comment says the lock will eventually be cleaned by the keep-remote/orphan timeout, but that is still a real behavioral regression: a lock that should be released immediately at txn end can now remain blocking until timeout-based cleanup fires.
So I do not think it is safe to skip local bookkeeping for timeout/cancel based only on the client-side error class. This should only be done when the code can prove the request was never observed remotely, or else it needs an explicit reconciliation path before deciding not to record the remote lock.
XuPeng-SH
left a comment
There was a problem hiding this comment.
I found two blocking correctness issues in the current head (6e8ca4b).
-
pkg/lockservice/service.go:786-814/pkg/lockservice/lock_table_keeper.go:229-236: the keepaliveOK=falsepath still removes this service's local binds viaholders.removeWithFilter(..., closeReasonKeepBindFailed)without going throughremoveLockTablesWithFence/fenceByBindChanged. That means an active txn that already touched the old bind is not markedbindChanged. After the purge, the same txn can fetch a fresh bind and acquire more locks successfully instead of gettingErrLockTableBindChanged. I reproduced this locally with a temporary regression test: lock table in txn T, reset allocator state so keepalive returnsOK=false, calldoKeepLockTableBind, then callLockagain on the same table and same txn. Current code returnsnil, leaving T with oldtableBindsplus new locks, so the failure is deferred to commit/valid/unlock instead of being fenced at the bind change. -
pkg/lockservice/service.go:689-741plus the unconditionaltableGroups.setingetLockTableWithCreate:allocatorChangedtreats any different non-emptyAllocatorIDas the next allocator. A delayed response from the old allocator can arrive after this CN has already accepted a newer allocator ID; current code will accept that old ID as another replacement, purge current binds, updatelastAllocatorIDback to the old allocator, and thengetLockTableWithCreatecan install the stale bind from that old response. UUID distinguishes allocator processes, but it does not order them. The stale-response unhappy path needs a monotonic term/generation, or a guard that prevents an already superseded allocator response from rolling CN state backward.
Please also audit identity consistency around allocator IDs. LockTable.Changed now includes AllocatorID, but getRemoteLockBindKey and LockTable.Equal still omit it, so remote bind heartbeat/stale cleanup can conflate different allocator instances when service/table/version match.
Local verification on the PR-specific tests still passes:
GOCACHE=/tmp/matrixone-pr24903-gocache go test ./pkg/lockservice -run 'Test(CommitDetectsStaleLocalBindAfterAllocatorRestart|AllocatorVersionZeroKeepsLocalBinds|AllocatorObserverDoesNotPurgeSameEpochBindVersions|GetBindPurgesStaleBindWhenAllocatorIDChangesWithRegressedVersion|KeepaliveEpochPurgeKeepsGroupMovePop|KeepaliveOKFalseWithNewAllocatorVersionPurgesStaleBinds|AllocatorObserverPurgesRemoteAndProxyLockTables|AllocatorObserverConcurrentKeepaliveAndGetBind|AllocatorObserverCloseWaitersOnStaleLocalBind|LockRemoteWithContextTimeoutTracksLockForUnlock|LockWithBindTimeout|LockWithBindNotFound|Issue3538)$'But the temporary OK=false active-txn regression described above fails on current head, so I am requesting changes.
fixed |
What type of PR is this?
Which issue(s) this PR fixes:
issue #24896
What this PR does / why we need it:
fix(lockservice): fence stale binds by allocator epoch