Skip to content

Subscription: add topic owner epoch fencing#17780

Open
VGalaxies wants to merge 7 commits into
masterfrom
subscription-topic-owner-fencing
Open

Subscription: add topic owner epoch fencing#17780
VGalaxies wants to merge 7 commits into
masterfrom
subscription-topic-owner-fencing

Conversation

@VGalaxies

Copy link
Copy Markdown
Contributor

Summary

  • Add topic-level owner epoch metadata for subscription fencing.
  • Propagate owner id and epoch from subscription consumers.
  • Reject stale owners during heartbeat, subscribe, poll, and commit.
  • Add focused tests for serialization and owner transfer fencing.

Tests

  • mvn test -pl iotdb-core/relational-grammar,iotdb-client/service-rpc,iotdb-client/subscription,iotdb-core/node-commons,iotdb-core/datanode -Dtest=SubscriptionReceiverV1Test -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
  • mvn test -pl iotdb-client/service-rpc,iotdb-client/subscription,iotdb-core/node-commons -Dtest=TopicDeSerTest -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false

@Caideyipi

Copy link
Copy Markdown
Collaborator

Findings

  • P1: TopicMeta 的新尾部字段破坏旧格式兼容。deserialize(InputStream) 用 inputStream.available() 判断是否还有 owner
    字段,但 TopicMeta 在 snapshot 里是连续写入的;旧版本数据没有 owner flag,多个 topic 时这里会把下一个 topic
    的首字节当成 owner flag 消费,导致后续反序列化错位。ByteBuffer.hasRemaining() 在旧 procedure/plan
    列表里也有同类问题。见 TopicMeta.java#L260

(

if (inputStream.available() > 0 && ReadWriteIOUtils.readBool(inputStream)) {
)
和 TopicMetaKeeper.java#L112

(https://github.com/apache/iotdb/blob/76f2a882411ba790f7db290740f0232a1e5c4022/iotdb-core/node-commons/src/main/java/org/apache/iotdb/commons/subscription/meta/topic/TopicMetaKeeper.java#L112)。建议改成版本化或长度前缀格式,或只从
config 属性恢复 owner,不在非自描述对象尾部追加可选字段。

  • P1: owner epoch 单调性没有在真实元数据替换路径上保证。transferOwner() 只在同一个对象上检查递增,但 AlterTopicPlan /
    metadata replace 路径直接 remove + add 新 TopicMeta,新对象从 -1 初始化,所以较小 epoch 也能覆盖当前 epoch,使旧
    owner 重新合法。见 TopicMeta.java#L135

(

if (isOwnerFencingEnabled() && ownerEpoch <= this.ownerEpoch) {
)
和 SubscriptionInfo.java#L340

(https://github.com/apache/iotdb/blob/76f2a882411ba790f7db290740f0232a1e5c4022/iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/persistence/subscription/SubscriptionInfo.java#L340)。建议在
ConfigNode alter/handle meta change 时对比 existing owner epoch 并拒绝回退。

  • P2: 新增的 owner status code 没有接入客户端错误分类。1913-1916 会落到 default,变成 generic critical
    exception;heartbeat 还会把 provider 标成 unavailable,导致 stale owner 场景反复重连/重试且错误信息不稳定。见
    TSStatusCode.java#L319

(https://github.com/apache/iotdb/blob/76f2a882411ba790f7db290740f0232a1e5c4022/iotdb-client/service-rpc/src/main/java/org/apache/iotdb/rpc/TSStatusCode.java#L319)、AbstractSubscriptionProvider.java#L445

(https://github.com/apache/iotdb/blob/76f2a882411ba790f7db290740f0232a1e5c4022/iotdb-client/subscription/src/main/java/org/apache/iotdb/session/subscription/consumer/base/AbstractSubscriptionProvider.java#L445)。建议显式处理
owner fenced/required/lease expired,映射成明确的不可重试业务异常。

@VGalaxies VGalaxies force-pushed the subscription-topic-owner-fencing branch from 76f2a88 to b6086b2 Compare May 28, 2026 10:35
@VGalaxies

Copy link
Copy Markdown
Contributor Author

Thanks @Caideyipi, addressed the three findings in b6086b2.

  • P1 serialization compatibility: removed the optional owner fields from the non-self-describing TopicMeta serialization tail. Owner state is now restored from topic attributes after deserialization, so sequential TopicMeta snapshot/procedure streams are not consumed out of boundary. Added a sequential deserialization regression test.
  • P1 epoch rollback: added TopicMeta.validateOwnerProgression(...) and applied it on both ConfigNode alter-topic metadata replacement and DataNode topic-meta update handling, rejecting owner clears/rollbacks/stale same-epoch owner changes. Added ConfigNode rollback coverage and extended the network-partition old-SN test to verify stale topic meta cannot make the old owner valid again.
  • P2 client classification: mapped 1913-1917 explicitly to SubscriptionOwnerFencedException, a SubscriptionRuntimeNonCriticalException subclass, so stale-owner/business fencing errors no longer fall into the generic critical default path.

Local verification passed:

  • mvn spotless:apply -pl iotdb-client/service-rpc,iotdb-client/subscription,iotdb-core/node-commons,iotdb-core/datanode,iotdb-core/confignode -DskipTests
  • mvn test -pl iotdb-client/service-rpc,iotdb-client/subscription,iotdb-core/node-commons -Dtest=TopicDeSerTest -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
  • mvn test -pl iotdb-client/service-rpc,iotdb-client/subscription,iotdb-core/node-commons,iotdb-core/confignode -Dtest=SubscriptionInfoTest -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
  • mvn test -pl iotdb-core/relational-grammar,iotdb-client/service-rpc,iotdb-client/subscription,iotdb-core/node-commons,iotdb-core/datanode -Dtest=SubscriptionReceiverV1Test -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
  • mvn test -pl iotdb-client/service-rpc,iotdb-client/subscription -Dtest=TSStatusCodeTest -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
  • git diff --check

The Sonar duplication check is queued again on the new commit; I will follow up if the rerun still fails.

@codecov

codecov Bot commented May 28, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 33.33333% with 458 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.74%. Comparing base (07b9cb0) to head (8621c5d).
⚠️ Report is 7 commits behind head on master.

Files with missing lines Patch % Lines
.../subscription/receiver/SubscriptionReceiverV1.java 0.00% 67 Missing ⚠️
...che/iotdb/confignode/manager/ProcedureManager.java 0.00% 53 Missing ⚠️
...sion/subscription/AbstractSubscriptionSession.java 0.00% 47 Missing ⚠️
.../manager/subscription/SubscriptionCoordinator.java 4.16% 46 Missing ⚠️
...tdb/commons/subscription/meta/topic/TopicMeta.java 84.73% 29 Missing ⚠️
.../db/subscription/agent/SubscriptionTopicAgent.java 50.00% 25 Missing ⚠️
...ueryengine/plan/relational/sql/ast/AlterTopic.java 30.76% 18 Missing ⚠️
...ion/config/executor/ClusterConfigTaskExecutor.java 0.00% 17 Missing ⚠️
...plan/relational/sql/util/DataNodeSqlFormatter.java 0.00% 15 Missing ⚠️
...on/consumer/base/AbstractSubscriptionConsumer.java 0.00% 12 Missing ⚠️
... and 25 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17780      +/-   ##
============================================
+ Coverage     40.68%   40.74%   +0.05%     
+ Complexity     2620      318    -2302     
============================================
  Files          5244     5249       +5     
  Lines        362374   363866    +1492     
  Branches      46653    46886     +233     
============================================
+ Hits         147419   148241     +822     
- Misses       214955   215625     +670     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jt2594838 jt2594838 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an IT to show a complete use case.

@VGalaxies VGalaxies force-pushed the subscription-topic-owner-fencing branch from b6086b2 to 25c34c2 Compare June 2, 2026 09:46
@VGalaxies

Copy link
Copy Markdown
Contributor Author

Addressed the latest review comments in 25c34c2.

  • Added IoTDBSubscriptionTopicOwnerIT as a complete local subscription use case: a stale owner (sn1/epoch 5) is fenced after the topic is owned by sn2/epoch 6, while the current owner can subscribe, poll data, commit, and unsubscribe.
  • Kept the new IT aligned with the current subscription IT convention: it is @ignore because SubscriptionConfig#getSubscriptionEnabled() is currently hard-coded false in this codebase, but the class compiles and is discovered under the with-integration-tests profile.
  • Preserved SubscriptionOwnerFencedException through subscribe redirection so the stale-owner path remains a business fencing error.
  • Merged the SubscriptionTopicAgent contains/get check into one getTopicMeta(...) null check.

Local verification:

  • mvn spotless:apply -pl integration-test -P with-integration-tests -DskipTests
  • mvn verify -DskipUTs -Dit.test=IoTDBSubscriptionTopicOwnerIT -DfailIfNoTests=false -Dfailsafe.failIfNoSpecifiedTests=false -pl integration-test -am -P with-integration-tests
  • mvn test -pl iotdb-client/service-rpc,iotdb-client/subscription -Dtest=TSStatusCodeTest -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
  • mvn test -pl iotdb-core/relational-grammar,iotdb-client/service-rpc,iotdb-client/subscription,iotdb-core/node-commons,iotdb-core/datanode -Dtest=SubscriptionReceiverV1Test -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
  • git diff --check

@jt2594838 jt2594838 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need one more test to show how to update the ownership and how to show the ownership.

@jt2594838 jt2594838 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you waited for all leases to expire before transferring the ownership?

@VGalaxies

Copy link
Copy Markdown
Contributor Author

Addressed the lease-transfer review comment in 82049d7.

Topic ownership transfer now checks the current topic owner lease before replacing owner metadata: if the current owner has an unexpired owner-lease-expire-time-ms and the incoming owner id is different, the transfer is rejected. Once the lease has expired, the transfer can proceed. Added SubscriptionInfoTest coverage for both paths.

Also addressed the inline comments in the same commit:

  • public session alterTopic(...) rejects owner attributes; alterTopicOwner(...) is the session API entry for owner updates.
  • removed the extra deep copy from the alter-topic metadata build path.
  • moved the two hard-coded messages to en/zh i18n constants.

Local verification passed:

  • mvn spotless:apply -pl iotdb-client/subscription,iotdb-core/node-commons,iotdb-core/confignode,iotdb-core/datanode -DskipTests
  • mvn test -pl iotdb-protocol/thrift-confignode,iotdb-client/service-rpc,iotdb-client/subscription,iotdb-core/node-commons,iotdb-core/confignode -Dtest=SubscriptionInfoTest -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
  • mvn test -pl iotdb-protocol/thrift-confignode,iotdb-core/relational-grammar,iotdb-client/service-rpc,iotdb-client/subscription,iotdb-core/node-commons,iotdb-core/datanode -Dtest=StatementGeneratorTest -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
  • mvn compile -pl iotdb-protocol/thrift-confignode,iotdb-client/service-rpc,iotdb-client/subscription,iotdb-core/relational-grammar,iotdb-core/node-commons,iotdb-core/confignode,iotdb-core/datanode -P with-zh-locale -DskipTests
  • git diff --check

@sonarqubecloud

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants