Skip to content

fix(logservice): match replicas by replica id#24912

Open
LeftHandCold wants to merge 1 commit into
matrixorigin:3.0-devfrom
LeftHandCold:fix-logservice-replica-id-match
Open

fix(logservice): match replicas by replica id#24912
LeftHandCold wants to merge 1 commit into
matrixorigin:3.0-devfrom
LeftHandCold:fix-logservice-replica-id-match

Conversation

@LeftHandCold

Copy link
Copy Markdown
Contributor

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) does this PR fix or relate to?

Fixes #24911

What this PR does / why we need it:

HAKeeper LogService scheduling checked whether a store had any replica for a shard in a few places. When the same store had multiple generations of the same shard, for example desired shard=1/replica=262145 while the heartbeat reported shard=1/replica=275385, HAKeeper could treat the desired replica as started or an operator step as finished.

This patch makes start scheduling and relevant operator finish checks match the exact (shardID, replicaID) pair. It keeps compatibility with old heartbeat fixtures that omit local ReplicaID in the checker path by falling back to the reported membership map only when ReplicaID == 0.

This prevents shard-level identity confusion, but it does not by itself recover already-persisted inconsistent production data where HAKeeper state, local metadata, and Dragonboat WAL/bootstrap records have diverged.

Test

  • CGO_CFLAGS="-I/Users/shenjiangwei/Work/code/matrixone/thirdparties/install/include" CGO_LDFLAGS="-L/Users/shenjiangwei/Work/code/matrixone/thirdparties/install/lib" DYLD_LIBRARY_PATH="/Users/shenjiangwei/Work/code/matrixone/thirdparties/install/lib" go test ./pkg/hakeeper/... -count=1
  • CGO_CFLAGS="-I/Users/shenjiangwei/Work/code/matrixone/thirdparties/install/include" CGO_LDFLAGS="-L/Users/shenjiangwei/Work/code/matrixone/thirdparties/install/lib" DYLD_LIBRARY_PATH="/Users/shenjiangwei/Work/code/matrixone/thirdparties/install/lib" go test ./pkg/logservice -count=1

Copilot AI review requested due to automatic review settings June 10, 2026 03:45
@qodo-code-review

Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a correctness bug in HAKeeper LogService scheduling/step completion logic where “replica started” checks could incorrectly match by ShardID only, causing different replica generations on the same store to be treated as equivalent.

Changes:

  • Update operator step finish checks to require an exact (ShardID, ReplicaID) match when determining whether a replica has started or a zombie has been killed.
  • Update LogService checker start-scheduling logic to require an exact (ShardID, ReplicaID) match, with a compatibility fallback for heartbeats/fixtures that omit ReplicaID (ReplicaID == 0).
  • Extend and adjust unit tests to cover mixed-generation scenarios and ensure tests use consistent replica IDs.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
pkg/hakeeper/operator/steps.go Tightens IsFinish checks for start/bootstrap/zombie-kill steps to match by (ShardID, ReplicaID).
pkg/hakeeper/operator/steps_test.go Adds coverage for “same shard, different replica ID” cases for relevant operator steps.
pkg/hakeeper/checkers/logservice/parse.go Makes start-scheduling require an exact (ShardID, ReplicaID) match; adds ReplicaID==0 compatibility fallback.
pkg/hakeeper/checkers/logservice/parse_test.go Adds a unit test for the new replica-ID-based scheduling behavior (needs one more case for the compatibility fallback).
pkg/hakeeper/checkers/logservice/check_test.go Updates test fixtures to include distinct ReplicaID values consistent with expected membership.
pkg/hakeeper/checkers/coordinator_test.go Updates fixture ReplicaID values to align with replica identity expectations.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1064 to +1070
stores["log-1"] = pb.LogStoreInfo{
Replicas: []pb.LogReplicaInfo{{
LogShardInfo: pb.LogShardInfo{ShardID: 1},
ReplicaID: 262145,
}},
}
assert.Empty(t, getReplicasToStart(1, replicas, stores))

@aptend aptend left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Result

Approved. I did not find any blocking issues in this PR.

What I Checked

  • StartLogService, StartNonVotingLogService, BootstrapShard, and KillLogZombie now require the expected ReplicaID, avoiding false completion when another replica for the same shard is already present on the store.
  • The checker-side toStart detection now matches by shardID + replicaID, with compatibility for heartbeat records whose local ReplicaID is zero but whose membership maps contain the target replica.
  • StopLogService and StopNonVotingLogService still match by shard only, which is appropriate because those steps finish when the store no longer reports any replica for that shard.
  • Added/updated tests cover the same-shard/different-replica cases for start, bootstrap, and zombie handling.

Local Verification

git diff --check origin/3.0-dev...HEAD
go test ./pkg/hakeeper/operator ./pkg/hakeeper/checkers/logservice ./pkg/hakeeper/checkers
go test ./pkg/hakeeper/operator -run 'Test(StartLogService|StartNonVotingLogService|KillLogZombie|BootstrapShard)$'

git diff --check passed. The Go test commands did not run to completion locally because the environment is missing usearch.h from github.com/unum-cloud/usearch/golang; CI for this PR is green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Something isn't working kind/test-ci size/M Denotes a PR that changes [100,499] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants