PHOENIX-7872 :- HA observability metrics for poller, CRR refresh/age, failover, mutation-block + RegionServer bypass counter #2502
Open
lokiore wants to merge 1 commit into
… failover, mutation-block + RegionServer bypass counter
Adds client-side and server-side observability metrics for the Consistent Failover (CCF) high-availability path:
Tier-1 client-side counters (4):
- HA_POLLER_TICK_COUNT: total poller ticks across all HA groups
- HA_POLLER_TICK_FAILURES: per-tick CRR fetch failures
- HA_FAILOVER_COUNT: failover transitions executed by the client
- HA_MUTATION_BLOCKED_COUNT: MutationBlockedIOException occurrences detected via the wrap-and-propagate path
Tier-2 client-side metrics (4):
- HA_FAILOVER_DURATION_MS: failover end-to-end latency histogram
- HA_STALE_CRR_DETECTED_COUNT: StaleClusterRoleRecordException occurrences
- HA_CRR_CACHE_AGE_MS: gauge of staleness of the in-memory CRR cache
- (HA_FAILOVER_COUNT moved to applyClusterRoleRecord with role-transition guard so it only fires on actual ACTIVE -> STANDBY or STANDBY -> ACTIVE transitions)
Tier-2 server-side counter (1):
- BYPASSED_MUTATION_BLOCK_COUNT: emitted from IndexRegionObserver when a mutation bypasses the mutation-block check because no log group is present. Implemented as 3-file Hadoop-metrics2 source: interface + static factory (DefaultMetricsSystem.instance()) + impl.
Tests:
- HAGroupMetricsIT: IT covering all 8 client-side metrics
- BypassedMutationBlockMetricsIT: IT covering server-side bypass counter
- HighAvailabilityUtilTest: UT covering RetriesExhaustedWithDetailsException + IOException cause-chain MBIOE detection
- MetricsHaBypassSourceFactoryTest: UT covering factory thread-safety
Generated-by: Claude Code (Opus 4.7)
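The role-transition guard mentioned in the commit message can be pictured with a small sketch. This is illustrative only and not the code from this patch: FailoverCounterSketch, its nested Role enum (including the OFFLINE value), and applyRole are hypothetical names, and a plain AtomicLong stands in for the Phoenix metric.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch only: counts a failover when the role flips between ACTIVE and STANDBY. */
final class FailoverCounterSketch {
    /** Hypothetical stand-in for the cluster roles named in the commit message. */
    enum Role { ACTIVE, STANDBY, OFFLINE }

    // Stand-in for the HA_FAILOVER_COUNT metric.
    private final AtomicLong failoverCount = new AtomicLong();
    private Role currentRole;

    FailoverCounterSketch(Role initialRole) {
        this.currentRole = initialRole;
    }

    /** Applies a newly fetched role and bumps the counter only on a real transition. */
    synchronized void applyRole(Role newRole) {
        boolean activeToStandby = currentRole == Role.ACTIVE && newRole == Role.STANDBY;
        boolean standbyToActive = currentRole == Role.STANDBY && newRole == Role.ACTIVE;
        if (activeToStandby || standbyToActive) {
            failoverCount.incrementAndGet();
        }
        // The role is always recorded, even when the counter is not touched.
        currentRole = newRole;
    }

    long failoverCount() {
        return failoverCount.get();
    }
}
```

With a guard of this shape, re-applying an unchanged role, or moving through a role other than ACTIVE or STANDBY, leaves the count untouched, which is the property the commit message describes.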
What changes were proposed in this pull request?
Adds client-side and server-side observability metrics for the Consistent Failover (CCF) high-availability path.
JIRA: https://issues.apache.org/jira/browse/PHOENIX-7872
Tier-1 client-side counters (in GlobalClientMetrics + MetricType):
- HA_POLLER_TICK_COUNT: total poller ticks across all HA groups (incremented in GetClusterRoleRecordUtil's polling task)
- HA_POLLER_TICK_FAILURES: per-tick CRR fetch failures
- HA_FAILOVER_COUNT: failover transitions executed by the client. Emitted from HighAvailabilityGroup.applyClusterRoleRecord only on actual ACTIVE → STANDBY or STANDBY → ACTIVE role transitions
- HA_MUTATION_BLOCKED_COUNT: MutationBlockedIOException occurrences detected via the wrap-and-propagate path in FailoverPhoenixConnection.wrapActionWhileFailover
Tier-2 client-side metrics:
- HA_FAILOVER_DURATION_MS: failover end-to-end latency histogram (try/finally wrapper in FailoverPhoenixConnection.failover)
- HA_STALE_CRR_DETECTED_COUNT: StaleClusterRoleRecordException occurrences detected in the wrap path
- HA_CRR_CACHE_AGE_MS: gauge of staleness of the in-memory CRR cache, set on every successful CRR refresh in HighAvailabilityGroup
Tier-2 server-side counter (new 3-file Hadoop-metrics2 source under phoenix-core-server/.../hbase/index/metrics):
- BYPASSED_MUTATION_BLOCK_COUNT: emitted from IndexRegionObserver when a mutation bypasses the mutation-block check because no log group is present for the data table. Implemented as MetricsHaBypassSource (interface) + MetricsHaBypassSourceFactory (DefaultMetricsSystem-anchored, double-checked lock) + MetricsHaBypassSourceImpl.
Why are the changes needed?
The CCF HA path previously had no observability for client-side polling cadence, failover frequency, failover latency, mutation-block fail-fast counts, stale-CRR detection, or CRR cache age. Operators investigating slow failovers or unexpected mutation rejections had to reconstruct event timelines from scattered DEBUG logs.
These metrics close the gap on the dimensions the platform team needs for HA SLO tracking and incident triage.
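As an illustration of the double-checked lock that the description attributes to MetricsHaBypassSourceFactory, the sketch below shows the general pattern in plain Java. It is a stand-in under stated assumptions, not the patch's code: BypassSourceFactorySketch and its nested Source interface are hypothetical names, an AtomicLong replaces the Hadoop metrics2 counter, and registration with DefaultMetricsSystem is omitted entirely.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch only: lazily creates one shared source using a double-checked lock. */
final class BypassSourceFactorySketch {
    /** Hypothetical minimal source; the real interface is not shown in this description. */
    interface Source {
        void incrementBypassed();
        long bypassedCount();
    }

    // volatile ensures a fully constructed instance is visible to other threads.
    private static volatile Source instance;

    private BypassSourceFactorySketch() { }

    static Source getInstance() {
        Source local = instance;               // first check, without the lock
        if (local == null) {
            synchronized (BypassSourceFactorySketch.class) {
                local = instance;              // second check, under the lock
                if (local == null) {
                    local = new AtomicSource();
                    instance = local;
                }
            }
        }
        return local;
    }

    /** AtomicLong-backed stand-in for a metrics2 counter. */
    private static final class AtomicSource implements Source {
        private final AtomicLong bypassed = new AtomicLong();

        @Override
        public void incrementBypassed() {
            bypassed.incrementAndGet();
        }

        @Override
        public long bypassedCount() {
            return bypassed.get();
        }
    }
}
```

The point of the pattern for a RegionServer-wide counter is that every caller increments the same object, so concurrent first calls to getInstance() must not create two sources.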
Does this PR introduce any user-facing change?
No
The new metrics are emitted via the existing GlobalClientMetrics (client-side) and Hadoop metrics2 (server-side) pipelines. No public-API change, no SQL surface change, no behavior change on the failover/poller paths beyond getMetric().increment() / .update() / .set() calls.
How was this patch tested?
New unit tests:
- HighAvailabilityUtilTest: covers RetriesExhaustedWithDetailsException cause-chain MBIOE detection (the wrap-and-propagate path that fires HA_MUTATION_BLOCKED_COUNT)
- MetricsHaBypassSourceFactoryTest: covers factory thread-safety (single-instance under concurrent getInstance())
New ITs:
- HAGroupMetricsIT: covers all 8 client-side metrics across the CCF failover lifecycle (poller ticks, failover transitions, CRR cache age gauge, stale-CRR detection, MBIOE detection on the wrap path)
- BypassedMutationBlockMetricsIT: covers server-side BYPASSED_MUTATION_BLOCK_COUNT emission when a mutation hits a region without an HA log group
Local 13/13 PASS reproduction:
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.7)
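For readers unfamiliar with the cause-chain MBIOE detection that HighAvailabilityUtilTest is described as covering, a minimal sketch of the idea follows. It is not the patch's implementation: CauseChainSketch, MutationBlockedSketchException, and hasMutationBlockedCause are hypothetical names, and the walk only follows Throwable.getCause(). HBase's RetriesExhaustedWithDetailsException also carries a per-action list of causes, which this sketch does not model.

```java
import java.io.IOException;

/** Sketch only: detects a "mutation blocked" exception anywhere in a cause chain. */
final class CauseChainSketch {
    /** Hypothetical stand-in for Phoenix's MutationBlockedIOException. */
    static final class MutationBlockedSketchException extends IOException {
        MutationBlockedSketchException(String message) {
            super(message);
        }
    }

    private static final int MAX_DEPTH = 32;

    private CauseChainSketch() { }

    /** Returns true if t, or any exception reachable through getCause(), is the blocked type. */
    static boolean hasMutationBlockedCause(Throwable t) {
        int depth = 0;
        // Bound the walk so a cyclic cause chain cannot loop forever.
        while (t != null && depth < MAX_DEPTH) {
            if (t instanceof MutationBlockedSketchException) {
                return true;
            }
            t = t.getCause();
            depth++;
        }
        return false;
    }
}
```

A caller on the wrap path would increment its mutation-blocked counter only when this check returns true, and rethrow the original exception either way.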