fix(lockservice): fence stale binds by allocator epoch by iamlinjunhong · Pull Request #24905 · matrixorigin/matrixone

iamlinjunhong · 2026-06-09T10:55:17Z

What type of PR is this?

Which issue(s) this PR fixes:

issue #24896

What this PR does / why we need it:

fix(lockservice): fence stale binds by allocator epoch

qodo-code-review · 2026-06-09T10:55:21Z

Qodo reviews are paused for this user.


XuPeng-SH

I did not confirm a blocking logic bug in the implementation itself, but there is still one important unhappy-path regression gap that should be covered before this is approved.

The new fix combines two recovery actions off the same keepalive failure response: local-bind cleanup when OK=false, and stale-bind purge when the allocator epoch increases. Current tests cover allocator-epoch purge and OK=false handling separately, but I do not see a test that drives a single keepalive response with both OK=false and a newer AllocatorVersion, then verifies that:

- stale local binds are removed
- stale remote/proxy binds from the old epoch are removed
- lastAllocatorVersion advances
- current-epoch binds are preserved

That combined failure path is the core unhappy-path this PR is hardening after allocator restart/failover. Without one regression test for the combined case, a future refactor could short-circuit one branch while the other still fires and leave stale bindings behind. Please add a focused test around doKeepLockTableBind / keepalive handling for this case.
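As a rough illustration of the combined case described above, here is a minimal, self-contained Go sketch. It does not use the real lockservice types: `bind`, `keepaliveResponse`, `bindCache`, `handleKeepalive`, and `combinedCase` are hypothetical stand-ins, and the exact cleanup rules (OK=false drops local binds known before the response; a newer AllocatorVersion drops every older-epoch bind) are assumptions made for the example, not the PR's implementation.

```go
package main

import "fmt"

// bind is a hypothetical, simplified stand-in for a cached lock-table bind.
type bind struct {
	table            uint64
	serviceID        string // service that owns the lock table
	allocatorVersion uint64 // allocator epoch under which the bind was issued
}

// keepaliveResponse models only the two fields named in the review.
type keepaliveResponse struct {
	OK               bool
	AllocatorVersion uint64
}

// bindCache is a hypothetical holder of the binds cached by one service.
type bindCache struct {
	serviceID            string
	lastAllocatorVersion uint64
	binds                map[uint64]bind
}

// handleKeepalive applies both recovery actions to a single response.
// Assumed rules, for illustration only.
func (c *bindCache) handleKeepalive(resp keepaliveResponse) {
	prev := c.lastAllocatorVersion

	// Action 1: OK=false invalidates local binds issued under any epoch
	// the cache already knew about before this response.
	if !resp.OK {
		for table, b := range c.binds {
			if b.serviceID == c.serviceID && b.allocatorVersion <= prev {
				delete(c.binds, table)
			}
		}
	}

	// Action 2: a newer allocator epoch fences out every older-epoch bind,
	// local or remote, and advances the remembered epoch.
	if resp.AllocatorVersion > prev {
		for table, b := range c.binds {
			if b.allocatorVersion < resp.AllocatorVersion {
				delete(c.binds, table)
			}
		}
		c.lastAllocatorVersion = resp.AllocatorVersion
	}
}

// combinedCase drives one response carrying both OK=false and a newer epoch.
func combinedCase() *bindCache {
	c := &bindCache{
		serviceID:            "cn-1",
		lastAllocatorVersion: 1,
		binds: map[uint64]bind{
			1: {table: 1, serviceID: "cn-1", allocatorVersion: 1}, // stale local
			2: {table: 2, serviceID: "cn-2", allocatorVersion: 1}, // stale remote
			3: {table: 3, serviceID: "cn-2", allocatorVersion: 2}, // current epoch
		},
	}
	c.handleKeepalive(keepaliveResponse{OK: false, AllocatorVersion: 2})
	return c
}

func main() {
	c := combinedCase()
	_, staleLocal := c.binds[1]
	_, staleRemote := c.binds[2]
	_, current := c.binds[3]
	fmt.Printf("stale local kept=%v, stale remote kept=%v, current kept=%v, epoch=%d\n",
		staleLocal, staleRemote, current, c.lastAllocatorVersion)
}
```

A regression test in the real package would presumably build the same three kinds of binds, feed one combined response through the keepalive path, and assert the four conditions listed above, so that a refactor which short-circuits either action fails the test.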

iamlinjunhong · 2026-06-11T08:39:51Z

fixed

iamlinjunhong requested review from XuPeng-SH and aptend as code owners June 9, 2026 10:55

matrix-meow added the size/L Denotes a PR that changes [500,999] lines label Jun 9, 2026

aptend approved these changes Jun 9, 2026

XuPeng-SH requested changes Jun 9, 2026

iamlinjunhong added 6 commits June 10, 2026 18:54

cd59718 fix(lockservice): fence stale binds by allocator epoch
4132fe7 test(lockservice): cover keepalive stale bind purge
902a800 fix(lockservice): fence allocator replacements by instance id
9942532 fix(lockservice): remove unused allocator observer helper
4222f58 fix(lockservice): keep remote lock bookkeeping on timeout
0849cf7 fix(lockservice): fence keepalive purge and reject stale allocator

iamlinjunhong wants to merge 6 commits into matrixorigin:4.0-dev from iamlinjunhong:d4-24896
