PHOENIX-7872 :- HA observability metrics for poller, CRR refresh/age, failover, mutation-block + RegionServer bypass counter #2502

Open
lokiore wants to merge 1 commit into apache:PHOENIX-7562-feature-new from lokiore:PHOENIX-7872-ha-observability-metrics

lokiore commented Jun 8, 2026

Contributor

What changes were proposed in this pull request?

Adds client-side and server-side observability metrics for the Consistent Failover (CCF) high-availability path.

JIRA: https://issues.apache.org/jira/browse/PHOENIX-7872

Tier-1 client-side counters (in GlobalClientMetrics + MetricType):

  • HA_POLLER_TICK_COUNT — total poller ticks across all HA groups (incremented in GetClusterRoleRecordUtil's polling task)
  • HA_POLLER_TICK_FAILURES — per-tick CRR fetch failures
  • HA_FAILOVER_COUNT — failover transitions executed by the client. Emitted from HighAvailabilityGroup.applyClusterRoleRecord only on actual ACTIVE → STANDBY or STANDBY → ACTIVE role transitions
  • HA_MUTATION_BLOCKED_COUNT — MutationBlockedIOException occurrences detected via the wrap-and-propagate path in FailoverPhoenixConnection.wrapActionWhileFailover
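
As an illustration of the role-transition guard described for HA_FAILOVER_COUNT, here is a minimal sketch, not the code in this PR. The class `FailoverTransitionCounter`, its `Role` enum, and `applyRole` are hypothetical names; the real logic lives in HighAvailabilityGroup.applyClusterRoleRecord and reports through GlobalClientMetrics rather than a local AtomicLong.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: count a failover only on ACTIVE <-> STANDBY flips. */
public class FailoverTransitionCounter {
    /** Stand-in for the cluster roles named in the description. */
    public enum Role { ACTIVE, STANDBY, UNKNOWN }

    // Stand-in for the HA_FAILOVER_COUNT global metric.
    private final AtomicLong failoverCount = new AtomicLong();
    private Role current = Role.UNKNOWN;

    /** Records the new role and increments the counter only on a real flip. */
    public synchronized void applyRole(Role next) {
        boolean flip = (current == Role.ACTIVE && next == Role.STANDBY)
                || (current == Role.STANDBY && next == Role.ACTIVE);
        if (flip) {
            failoverCount.incrementAndGet();
        }
        current = next;
    }

    public long failoverCount() {
        return failoverCount.get();
    }
}
```

With this shape, re-applying an unchanged record or moving out of an initial unknown state leaves the counter untouched, which is the property the guard is meant to provide.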

Tier-2 client-side metrics:

  • HA_FAILOVER_DURATION_MS — failover end-to-end latency histogram (try/finally wrapper in FailoverPhoenixConnection.failover)
  • HA_STALE_CRR_DETECTED_COUNT — StaleClusterRoleRecordException occurrences detected in the wrap path
  • HA_CRR_CACHE_AGE_MS — gauge of staleness of the in-memory CRR cache, set on every successful CRR refresh in HighAvailabilityGroup
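
The try/finally wrapper mentioned for HA_FAILOVER_DURATION_MS could look roughly like the sketch below. Everything here is an assumption for illustration: `FailoverTimer`, `timed`, and the injected nanosecond clock are hypothetical, and a plain list stands in for the histogram that the PR updates through GlobalClientMetrics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical sketch: record elapsed time whether the action succeeds or throws. */
public class FailoverTimer {
    // Stand-in for a latency histogram; one entry per timed call, in milliseconds.
    private final List<Long> durationsMs = new CopyOnWriteArrayList<>();
    // Injected clock so the timing can be tested deterministically.
    private final LongSupplier nanoClock;

    public FailoverTimer(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
    }

    /** Runs the action and records its duration in the finally block. */
    public <T> T timed(Supplier<T> action) {
        long start = nanoClock.getAsLong();
        try {
            return action.get();
        } finally {
            long elapsedMs = (nanoClock.getAsLong() - start) / 1_000_000L;
            durationsMs.add(elapsedMs);
        }
    }

    /** Returns a copy of the recorded durations. */
    public List<Long> recorded() {
        return new ArrayList<>(durationsMs);
    }
}
```

In production code the clock would typically be `System::nanoTime`. Recording in `finally` means failed failovers also contribute a latency sample, which may or may not match what the PR's histogram is intended to capture.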

Tier-2 server-side counter (new 3-file Hadoop-metrics2 source under phoenix-core-server/.../hbase/index/metrics):

  • BYPASSED_MUTATION_BLOCK_COUNT — emitted from IndexRegionObserver when a mutation bypasses the mutation-block check because no log group is present for the data table. Implemented as MetricsHaBypassSource (interface) + MetricsHaBypassSourceFactory (DefaultMetricsSystem-anchored, double-checked lock) + MetricsHaBypassSourceImpl.
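
For readers unfamiliar with the double-checked lock mentioned for MetricsHaBypassSourceFactory, a minimal sketch of the pattern follows. `BypassSourceFactory` and `BypassSource` are hypothetical stand-ins; the actual factory registers its source with Hadoop's DefaultMetricsSystem, which this sketch does not attempt to reproduce.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch of a lazily initialized singleton using double-checked locking. */
public final class BypassSourceFactory {

    /** Stand-in for the metrics source holding the bypass counter. */
    public static final class BypassSource {
        private final LongAdder bypassed = new LongAdder();

        public void incrementBypassed() {
            bypassed.increment();
        }

        public long bypassedCount() {
            return bypassed.sum();
        }
    }

    // volatile is required so a fully constructed instance is visible to all threads.
    private static volatile BypassSource instance;

    private BypassSourceFactory() {
    }

    public static BypassSource getInstance() {
        BypassSource local = instance;           // first check, no lock
        if (local == null) {
            synchronized (BypassSourceFactory.class) {
                local = instance;                // second check, under the lock
                if (local == null) {
                    local = new BypassSource();
                    instance = local;
                }
            }
        }
        return local;
    }
}
```

The lock is taken only until the instance exists, so the hot path (one call per bypassed mutation) stays a single volatile read.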

Why are the changes needed?

The CCF HA path previously had no observability for client-side polling cadence, failover frequency, failover latency, mutation-block fail-fast counts, stale-CRR detection, or CRR cache age. Operators investigating slow failovers or unexpected mutation rejections had to reconstruct event timelines from scattered DEBUG logs.

These metrics close the gap on the dimensions the platform team needs for HA SLO tracking and incident triage:

  • Poller liveness & health (tick count + failures)
  • Failover frequency, duration, and trigger (CRR transition vs. MutationBlockedIOException-driven; abbreviated MBIOE below)
  • Stale-CRR detection rate as a leading indicator of failover-in-progress windows
  • Server-side rate of mutation-block bypass (regions without an HA log group attached)

Does this PR introduce any user-facing change?

No

The new metrics are emitted via the existing GlobalClientMetrics (client-side) and Hadoop metrics2 (server-side) pipelines. No public-API change, no SQL surface change, no behavior change on the failover/poller paths beyond getMetric().increment() / .update() / .set() calls.

How was this patch tested?

New unit tests:

  • HighAvailabilityUtilTest — covers RetriesExhaustedWithDetailsException cause-chain MBIOE detection (the wrap-and-propagate path that fires HA_MUTATION_BLOCKED_COUNT)
  • MetricsHaBypassSourceFactoryTest — covers factory thread-safety (single-instance under concurrent getInstance())
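
The cause-chain detection that HighAvailabilityUtilTest exercises can be pictured with the sketch below. It is an assumption-laden illustration: `CauseChainScan`, `hasCause`, and `BlockedException` are hypothetical names, and the sketch only walks `getCause()` links. HBase's RetriesExhaustedWithDetailsException also carries per-action causes, which are not modeled here.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

/** Hypothetical sketch: find an exception type anywhere in a cause chain. */
public class CauseChainScan {

    /** Stand-in for MutationBlockedIOException. */
    public static class BlockedException extends IOException {
        public BlockedException(String message) {
            super(message);
        }
    }

    /** Returns true if t or any of its transitive causes is an instance of type. */
    public static boolean hasCause(Throwable t, Class<? extends Throwable> type) {
        // Identity-based set guards against cyclic cause chains.
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Throwable cur = t; cur != null && seen.add(cur); cur = cur.getCause()) {
            if (type.isInstance(cur)) {
                return true;
            }
        }
        return false;
    }
}
```

A caller would increment the blocked-mutation counter when `hasCause(e, BlockedException.class)` returns true and then rethrow the original exception unchanged.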

New ITs:

  • HAGroupMetricsIT — covers the client-side metrics listed above across the CCF failover lifecycle (poller ticks, failover transitions, CRR cache age gauge, stale-CRR detection, MBIOE detection on the wrap path)
  • BypassedMutationBlockMetricsIT — covers server-side BYPASSED_MUTATION_BLOCK_COUNT emission when a mutation hits a region without an HA log group

Local 13/13 PASS reproduction:

[INFO] Running org.apache.phoenix.jdbc.HighAvailabilityUtilTest
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0
[INFO] Running org.apache.phoenix.hbase.index.metrics.MetricsHaBypassSourceFactoryTest
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0
[INFO] Running org.apache.phoenix.jdbc.HAGroupMetricsIT
[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0
[INFO] Running org.apache.phoenix.end2end.index.BypassedMutationBlockMetricsIT
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

… failover, mutation-block + RegionServer bypass counter

Adds client-side and server-side observability metrics for the
Consistent Failover (CCF) high-availability path:

Tier-1 client-side counters (4):
- HA_POLLER_TICK_COUNT — total poller ticks across all HA groups
- HA_POLLER_TICK_FAILURES — per-tick CRR fetch failures
- HA_FAILOVER_COUNT — failover transitions executed by the client
- HA_MUTATION_BLOCKED_COUNT — MutationBlockedIOException occurrences
  detected via the wrap-and-propagate path

Tier-2 client-side metrics (4):
- HA_FAILOVER_DURATION_MS — failover end-to-end latency histogram
- HA_STALE_CRR_DETECTED_COUNT — StaleClusterRoleRecordException occurrences
- HA_CRR_CACHE_AGE_MS — gauge of staleness of the in-memory CRR cache
- (HA_FAILOVER_COUNT moved to applyClusterRoleRecord with role-transition guard
  so it only fires on actual ACTIVE -> STANDBY or STANDBY -> ACTIVE transitions)

Tier-2 server-side counter (1):
- BYPASSED_MUTATION_BLOCK_COUNT — emitted from IndexRegionObserver when a
  mutation bypasses the mutation-block check because no log group is present.
  Implemented as 3-file Hadoop-metrics2 source: interface +
  static factory (DefaultMetricsSystem.instance()) + impl.

Tests:
- HAGroupMetricsIT — IT covering all 8 client-side metrics
- BypassedMutationBlockMetricsIT — IT covering server-side bypass counter
- HighAvailabilityUtilTest — UT covering RetriesExhaustedWithDetailsException
  + IOException cause-chain MBIOE detection
- MetricsHaBypassSourceFactoryTest — UT covering factory thread-safety

Generated-by: Claude Code (Opus 4.7)