fix(logservice): match replicas by replica id#24912
Conversation
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
There was a problem hiding this comment.
Pull request overview
This PR fixes a correctness bug in HAKeeper LogService scheduling/step completion logic where “replica started” checks could incorrectly match by ShardID only, causing different replica generations on the same store to be treated as equivalent.
Changes:
- Update operator step finish checks to require an exact
(ShardID, ReplicaID)match when determining whether a replica has started or a zombie has been killed. - Update LogService checker start-scheduling logic to require an exact
(ShardID, ReplicaID)match, with a compatibility fallback for heartbeats/fixtures that omitReplicaID(ReplicaID == 0). - Extend and adjust unit tests to cover mixed-generation scenarios and ensure tests use consistent replica IDs.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| pkg/hakeeper/operator/steps.go | Tightens IsFinish checks for start/bootstrap/zombie-kill steps to match by (ShardID, ReplicaID). |
| pkg/hakeeper/operator/steps_test.go | Adds coverage for “same shard, different replica ID” cases for relevant operator steps. |
| pkg/hakeeper/checkers/logservice/parse.go | Makes start-scheduling require an exact (ShardID, ReplicaID) match; adds ReplicaID==0 compatibility fallback. |
| pkg/hakeeper/checkers/logservice/parse_test.go | Adds a unit test for the new replica-ID-based scheduling behavior (needs one more case for the compatibility fallback). |
| pkg/hakeeper/checkers/logservice/check_test.go | Updates test fixtures to include distinct ReplicaID values consistent with expected membership. |
| pkg/hakeeper/checkers/coordinator_test.go | Updates fixture ReplicaID values to align with replica identity expectations. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| stores["log-1"] = pb.LogStoreInfo{ | ||
| Replicas: []pb.LogReplicaInfo{{ | ||
| LogShardInfo: pb.LogShardInfo{ShardID: 1}, | ||
| ReplicaID: 262145, | ||
| }}, | ||
| } | ||
| assert.Empty(t, getReplicasToStart(1, replicas, stores)) |
aptend
left a comment
There was a problem hiding this comment.
Review Result
Approved. I did not find any blocking issues in this PR.
What I Checked
StartLogService,StartNonVotingLogService,BootstrapShard, andKillLogZombienow require the expectedReplicaID, avoiding false completion when another replica for the same shard is already present on the store.- The checker-side
toStartdetection now matches byshardID + replicaID, with compatibility for heartbeat records whose localReplicaIDis zero but whose membership maps contain the target replica. StopLogServiceandStopNonVotingLogServicestill match by shard only, which is appropriate because those steps finish when the store no longer reports any replica for that shard.- Added/updated tests cover the same-shard/different-replica cases for start, bootstrap, and zombie handling.
Local Verification
git diff --check origin/3.0-dev...HEAD
go test ./pkg/hakeeper/operator ./pkg/hakeeper/checkers/logservice ./pkg/hakeeper/checkers
go test ./pkg/hakeeper/operator -run 'Test(StartLogService|StartNonVotingLogService|KillLogZombie|BootstrapShard)$'git diff --check passed. The Go test commands did not run to completion locally because the environment is missing usearch.h from github.com/unum-cloud/usearch/golang; CI for this PR is green.
What type of PR is this?
Which issue(s) does this PR fix or relate to?
Fixes #24911
What this PR does / why we need it:
HAKeeper LogService scheduling checked whether a store had any replica for a shard in a few places. When the same store had multiple generations of the same shard, for example desired
shard=1/replica=262145while the heartbeat reportedshard=1/replica=275385, HAKeeper could treat the desired replica as started or an operator step as finished.This patch makes start scheduling and relevant operator finish checks match the exact
(shardID, replicaID)pair. It keeps compatibility with old heartbeat fixtures that omit localReplicaIDin the checker path by falling back to the reported membership map only whenReplicaID == 0.This prevents shard-level identity confusion, but it does not by itself recover already-persisted inconsistent production data where HAKeeper state, local metadata, and Dragonboat WAL/bootstrap records have diverged.
Test
CGO_CFLAGS="-I/Users/shenjiangwei/Work/code/matrixone/thirdparties/install/include" CGO_LDFLAGS="-L/Users/shenjiangwei/Work/code/matrixone/thirdparties/install/lib" DYLD_LIBRARY_PATH="/Users/shenjiangwei/Work/code/matrixone/thirdparties/install/lib" go test ./pkg/hakeeper/... -count=1CGO_CFLAGS="-I/Users/shenjiangwei/Work/code/matrixone/thirdparties/install/include" CGO_LDFLAGS="-L/Users/shenjiangwei/Work/code/matrixone/thirdparties/install/lib" DYLD_LIBRARY_PATH="/Users/shenjiangwei/Work/code/matrixone/thirdparties/install/lib" go test ./pkg/logservice -count=1