Skip to content

fix(lockservice): fence stale binds by allocator epoch#24903

Open
iamlinjunhong wants to merge 6 commits into
matrixorigin:3.0-devfrom
iamlinjunhong:d3-24896
Open

fix(lockservice): fence stale binds by allocator epoch#24903
iamlinjunhong wants to merge 6 commits into
matrixorigin:3.0-devfrom
iamlinjunhong:d3-24896

Conversation

@iamlinjunhong

Copy link
Copy Markdown
Contributor

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24896

What this PR does / why we need it:

fix(lockservice): fence stale binds by allocator epoch

@qodo-code-review

Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@aptend aptend left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Result

Approved. I did not find any blocking issues in this PR.

What I Checked

  • GetBind and KeepLockTableBind now return the allocator epoch to CN.
  • CN stale-bind purge uses the existing LockTable.Version semantics consistently: binds with Version < observed allocator epoch are old allocator state and can be removed.
  • The added tests cover get-bind refresh, keepalive failure/epoch refresh, concurrent observation, remote/proxy bind purge, and waiter cleanup.

Local Verification

git diff --check HEAD^ HEAD
go test ./pkg/lockservice -run 'Test(CommitDetectsStaleLocalBindAfterAllocatorRestart|AllocatorVersionZeroKeepsLocalBinds|AllocatorObserverDoesNotPurgeSameEpochBindVersions|GetBindAllowsRegressedAllocatorVersionWithoutPurging|KeepaliveEpochPurgeKeepsGroupMovePop|AllocatorObserverPurgesRemoteAndProxyLockTables|AllocatorObserverConcurrentKeepaliveAndGetBind|AllocatorObserverCloseWaitersOnStaleLocalBind)$'

Note: I started go test ./pkg/lockservice, but stopped it after it ran for more than 5 minutes without output and then verified the PR-specific test set above.

@XuPeng-SH XuPeng-SH left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the bug this PR is trying to fix, but I found one substantive correctness issue in the epoch-fencing design itself.

The allocator epoch is initialized from time.Now().UnixNano() in pkg/lockservice/lock_table_allocator.go, while the observer side in service.go explicitly ignores any non-increasing allocator version (observedVersion < oldVersion or == oldVersion). That means a real allocator restart can fail to purge stale cached binds if the new process comes up with the same or a lower wall-clock timestamp (for example after a clock step backward / NTP correction / VM time skew). In that case getLockTableWithCreate() keeps returning the stale entry from tableGroups without ever re-contacting the allocator, which is exactly the class of stale-bind problem this PR is trying to fence off.

What makes this higher confidence is that the new test TestGetBindAllowsRegressedAllocatorVersionWithoutPurging explicitly locks in that behavior. I think the fix needs an epoch source that is guaranteed monotonic/unique across allocator restarts (persisted term/counter, restart generation, etc.), or the observer must still invalidate cache on allocator replacement even when the observed epoch is non-increasing.

@XuPeng-SH XuPeng-SH left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one blocking issue in the current head.

In remoteLockTable.lock, the new skipTrackLockOnError path skips txn.lockAdded(...) for context.Canceled, context.DeadlineExceeded, and ErrRPCTimeout. But those errors do not prove the request never reached the remote owner — they can also happen after the remote side already acquired the lock and before the response got back.

In that case, the local txn no longer records the remote lock in txn.lockHolders, so the normal Unlock() / txn.close() path will never send the remote unlock. The code comment says the lock will eventually be cleaned by the keep-remote/orphan timeout, but that is still a real behavioral regression: a lock that should be released immediately at txn end can now remain blocking until timeout-based cleanup fires.

So I do not think it is safe to skip local bookkeeping for timeout/cancel based only on the client-side error class. This should only be done when the code can prove the request was never observed remotely, or else it needs an explicit reconciliation path before deciding not to record the remote lock.

@XuPeng-SH XuPeng-SH left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found two blocking correctness issues in the current head (6e8ca4b).

  1. pkg/lockservice/service.go:786-814 / pkg/lockservice/lock_table_keeper.go:229-236: the keepalive OK=false path still removes this service's local binds via holders.removeWithFilter(..., closeReasonKeepBindFailed) without going through removeLockTablesWithFence / fenceByBindChanged. That means an active txn that already touched the old bind is not marked bindChanged. After the purge, the same txn can fetch a fresh bind and acquire more locks successfully instead of getting ErrLockTableBindChanged. I reproduced this locally with a temporary regression test: lock table in txn T, reset allocator state so keepalive returns OK=false, call doKeepLockTableBind, then call Lock again on the same table and same txn. Current code returns nil, leaving T with old tableBinds plus new locks, so the failure is deferred to commit/valid/unlock instead of being fenced at the bind change.

  2. pkg/lockservice/service.go:689-741 plus the unconditional tableGroups.set in getLockTableWithCreate: allocatorChanged treats any different non-empty AllocatorID as the next allocator. A delayed response from the old allocator can arrive after this CN has already accepted a newer allocator ID; current code will accept that old ID as another replacement, purge current binds, update lastAllocatorID back to the old allocator, and then getLockTableWithCreate can install the stale bind from that old response. UUID distinguishes allocator processes, but it does not order them. The stale-response unhappy path needs a monotonic term/generation, or a guard that prevents an already superseded allocator response from rolling CN state backward.

Please also audit identity consistency around allocator IDs. LockTable.Changed now includes AllocatorID, but getRemoteLockBindKey and LockTable.Equal still omit it, so remote bind heartbeat/stale cleanup can conflate different allocator instances when service/table/version match.

Local verification on the PR-specific tests still passes:

GOCACHE=/tmp/matrixone-pr24903-gocache go test ./pkg/lockservice -run 'Test(CommitDetectsStaleLocalBindAfterAllocatorRestart|AllocatorVersionZeroKeepsLocalBinds|AllocatorObserverDoesNotPurgeSameEpochBindVersions|GetBindPurgesStaleBindWhenAllocatorIDChangesWithRegressedVersion|KeepaliveEpochPurgeKeepsGroupMovePop|KeepaliveOKFalseWithNewAllocatorVersionPurgesStaleBinds|AllocatorObserverPurgesRemoteAndProxyLockTables|AllocatorObserverConcurrentKeepaliveAndGetBind|AllocatorObserverCloseWaitersOnStaleLocalBind|LockRemoteWithContextTimeoutTracksLockForUnlock|LockWithBindTimeout|LockWithBindNotFound|Issue3538)$'

But the temporary OK=false active-txn regression described above fails on current head, so I am requesting changes.

@iamlinjunhong

Copy link
Copy Markdown
Contributor Author

I found two blocking correctness issues in the current head (6e8ca4b).

  1. pkg/lockservice/service.go:786-814 / pkg/lockservice/lock_table_keeper.go:229-236: the keepalive OK=false path still removes this service's local binds via holders.removeWithFilter(..., closeReasonKeepBindFailed) without going through removeLockTablesWithFence / fenceByBindChanged. That means an active txn that already touched the old bind is not marked bindChanged. After the purge, the same txn can fetch a fresh bind and acquire more locks successfully instead of getting ErrLockTableBindChanged. I reproduced this locally with a temporary regression test: lock table in txn T, reset allocator state so keepalive returns OK=false, call doKeepLockTableBind, then call Lock again on the same table and same txn. Current code returns nil, leaving T with old tableBinds plus new locks, so the failure is deferred to commit/valid/unlock instead of being fenced at the bind change.
  2. pkg/lockservice/service.go:689-741 plus the unconditional tableGroups.set in getLockTableWithCreate: allocatorChanged treats any different non-empty AllocatorID as the next allocator. A delayed response from the old allocator can arrive after this CN has already accepted a newer allocator ID; current code will accept that old ID as another replacement, purge current binds, update lastAllocatorID back to the old allocator, and then getLockTableWithCreate can install the stale bind from that old response. UUID distinguishes allocator processes, but it does not order them. The stale-response unhappy path needs a monotonic term/generation, or a guard that prevents an already superseded allocator response from rolling CN state backward.

Please also audit identity consistency around allocator IDs. LockTable.Changed now includes AllocatorID, but getRemoteLockBindKey and LockTable.Equal still omit it, so remote bind heartbeat/stale cleanup can conflate different allocator instances when service/table/version match.

Local verification on the PR-specific tests still passes:

GOCACHE=/tmp/matrixone-pr24903-gocache go test ./pkg/lockservice -run 'Test(CommitDetectsStaleLocalBindAfterAllocatorRestart|AllocatorVersionZeroKeepsLocalBinds|AllocatorObserverDoesNotPurgeSameEpochBindVersions|GetBindPurgesStaleBindWhenAllocatorIDChangesWithRegressedVersion|KeepaliveEpochPurgeKeepsGroupMovePop|KeepaliveOKFalseWithNewAllocatorVersionPurgesStaleBinds|AllocatorObserverPurgesRemoteAndProxyLockTables|AllocatorObserverConcurrentKeepaliveAndGetBind|AllocatorObserverCloseWaitersOnStaleLocalBind|LockRemoteWithContextTimeoutTracksLockForUnlock|LockWithBindTimeout|LockWithBindNotFound|Issue3538)$'

But the temporary OK=false active-txn regression described above fails on current head, so I am requesting changes.

fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Denotes a PR that changes [1000, 1999] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants