Gracefully skip mongo collection on non-readable replica states #24017
Draft
HadhemiDD wants to merge 2 commits into
When a mongod node is in a replica-set state that is neither PRIMARY nor SECONDARY (recovering, startup, rollback, down, ...), read commands fail with NotPrimaryError (code 13436). Because NotPrimaryError subclasses ConnectionFailure, the check treated it as a CRITICAL connection failure and crashed with a traceback, even though the node is reachable.

Add a centralized ReplicaSetDeployment.is_readable helper and use it to skip collection across the DBM jobs and database autodiscovery (replacing the scattered replset_state == 3 checks), and handle NotPrimaryError gracefully in the synchronous collector loop and at the check level so a state transition no longer marks the whole instance CRITICAL.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
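The readability test described in this commit can be sketched as follows. This is a minimal illustration, not the code in the PR: `ReplicaSetDeploymentSketch` and `should_collect` are hypothetical names, and only the `replset_state` field is modeled. The numeric state codes are the ones listed in the PR description.

```python
from dataclasses import dataclass

# Replica-set member state codes, as listed in the PR description.
STARTUP = 0
PRIMARY = 1
SECONDARY = 2
RECOVERING = 3
STARTUP2 = 5
DOWN = 8
ROLLBACK = 9

# Only these two states can serve read commands.
READABLE_STATES = frozenset({PRIMARY, SECONDARY})


@dataclass
class ReplicaSetDeploymentSketch:
    """Hypothetical stand-in for ReplicaSetDeployment (assumption: one int field)."""

    replset_state: int

    @property
    def is_readable(self) -> bool:
        # True only for PRIMARY and SECONDARY; every other state is skipped.
        return self.replset_state in READABLE_STATES


def should_collect(deployment: ReplicaSetDeploymentSketch) -> bool:
    """Hypothetical guard a collection job could call before issuing reads."""
    return deployment.is_readable


def old_guard(deployment: ReplicaSetDeploymentSketch) -> bool:
    """The previous style of check: skip only when the state is RECOVERING (3)."""
    return deployment.replset_state != RECOVERING
```

Compared with a `replset_state == 3` guard, a membership test against the readable set also skips STARTUP, STARTUP2, ROLLBACK, and DOWN, which the old guard would have let through.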
Validation Report: All 21 validations passed.
What does this PR do?
Makes the `mongo` check tolerate replica-set nodes that are not in a readable (PRIMARY/SECONDARY) state instead of crashing.

When a monitored mongod is in a state that is neither PRIMARY(1) nor SECONDARY(2), e.g. RECOVERING(3), STARTUP(0), STARTUP2(5), ROLLBACK(9), DOWN(8), read commands fail with `pymongo.errors.NotPrimaryError` (code 13436, `NotPrimaryOrSecondary`). Since `NotPrimaryError` subclasses `ConnectionFailure`, it was caught by the check's `CRITICAL_FAILURE` handling, marking the whole instance CRITICAL and dumping a traceback even though the node is reachable.

Changes:
- Add a `ReplicaSetDeployment.is_readable` property (`replset_state in {PRIMARY, SECONDARY}`).
- Replace the `replset_state == 3` guards in `operation_samples.py`, `query_metrics.py`, and `discovery.py` with `is_readable`, so all non-readable states are skipped (not just RECOVERING).
- Treat `NotPrimaryError` as a non-critical, transient condition in the synchronous collector loop (`_collect_metrics`) and at the `check()` level: the run is skipped and the service check stays OK instead of going CRITICAL.
- Add `NotPrimaryError` handling to the `$queryStats` read path, mirroring the existing handling in operation samples' `$currentOp`.

Motivation
Users running the check against a replica-set member that briefly enters a non-readable state (recovering, rollback, etc.) saw the whole check fail with a `NotPrimaryError` traceback and a CRITICAL service check, when the correct behavior is to skip collection for that run and recover automatically.

Review checklist (to be filled by reviewers)
- Add the `qa/required` label if this PR needs QA validation, or `qa/skip-qa` if it does not. Exactly one of the two is required.
- To backport this PR to another branch, add the `backport/<branch-name>` label to the PR and it will automatically open a backport PR once this one is merged.

🤖 Generated with Claude Code
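The exception handling described under Changes depends on catching the subclass before its base class. The sketch below illustrates that ordering under stated assumptions: the two exception classes are local stand-ins that mirror the `pymongo.errors` hierarchy described above (real code would import them from `pymongo.errors`), and `run_check`, `OK`, and `CRITICAL` are hypothetical names, not the check's actual API.

```python
from typing import Callable, List, Tuple


class ConnectionFailure(Exception):
    """Stand-in for pymongo.errors.ConnectionFailure (assumption for this sketch)."""


class NotPrimaryError(ConnectionFailure):
    """Stand-in for pymongo.errors.NotPrimaryError, a ConnectionFailure subclass."""


# Hypothetical service-check statuses.
OK, CRITICAL = 0, 2


def run_check(collectors: List[Tuple[str, Callable[[], None]]]) -> Tuple[int, List[str]]:
    """Run collectors in order; return (status, names that completed)."""
    collected: List[str] = []
    try:
        for name, collect in collectors:
            collect()
            collected.append(name)
    except NotPrimaryError:
        # The node answered but is not in a readable state: skip the rest
        # of this run and keep the service check OK.
        return OK, collected
    except ConnectionFailure:
        # A genuine connection problem is still reported as CRITICAL.
        return CRITICAL, collected
    return OK, collected


def ok() -> None:
    pass


def not_primary() -> None:
    raise NotPrimaryError("node is recovering")


def unreachable() -> None:
    raise ConnectionFailure("connection refused")
```

If the `except ConnectionFailure` clause came first, it would also match `NotPrimaryError`, which is the behavior this PR sets out to avoid.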