Gracefully skip mongo collection on non-readable replica states#24017

Draft
HadhemiDD wants to merge 2 commits into master from mongo-not-primary-graceful

@HadhemiDD

What does this PR do?

Makes the mongo check tolerate replica-set nodes that are not in a readable (PRIMARY/SECONDARY) state instead of crashing.

When a monitored mongod is in a state other than PRIMARY (1) or SECONDARY (2), for example RECOVERING (3), STARTUP (0), STARTUP2 (5), ROLLBACK (9), or DOWN (8), read commands fail with pymongo.errors.NotPrimaryError (code 13436, NotPrimaryOrSecondary). Because NotPrimaryError subclasses ConnectionFailure (through AutoReconnect), the check's CRITICAL_FAILURE handling caught it, marked the whole instance CRITICAL, and dumped a traceback even though the node was reachable.
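
The subclass relationship described above can be illustrated with a minimal sketch. It assumes pymongo's exception hierarchy (NotPrimaryError → AutoReconnect → ConnectionFailure) and falls back to stand-in classes when pymongo is not installed. The `classify` helpers are hypothetical and are not the check's actual code.

```python
try:
    from pymongo.errors import ConnectionFailure, NotPrimaryError
except ImportError:
    # Stand-ins mirroring pymongo's hierarchy so the sketch runs without pymongo.
    class ConnectionFailure(Exception):
        pass

    class AutoReconnect(ConnectionFailure):
        pass

    class NotPrimaryError(AutoReconnect):
        pass


def classify_without_specific_handler(exc):
    """Only a broad handler: NotPrimaryError is treated as a connection failure."""
    try:
        raise exc
    except ConnectionFailure:
        return "critical"


def classify(exc):
    """Specific handler first: NotPrimaryError is treated as a transient skip."""
    try:
        raise exc
    except NotPrimaryError:
        return "skip"
    except ConnectionFailure:
        return "critical"
```

Because Python picks the first matching `except` clause, the NotPrimaryError handler has to come before the ConnectionFailure handler for the skip path to be reached.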

Changes:

  • Add a centralized ReplicaSetDeployment.is_readable property (replset_state in {PRIMARY, SECONDARY}).
  • Replace the scattered replset_state == 3 guards in operation_samples.py, query_metrics.py, and discovery.py with is_readable, so all non-readable states are skipped (not just RECOVERING).
  • Catch NotPrimaryError as a non-critical, transient condition in the synchronous collector loop (_collect_metrics) and at the check() level — the run is skipped and the service check stays OK instead of going CRITICAL.
  • Add NotPrimaryError handling to the $queryStats read path, mirroring the existing handling in operation samples' $currentOp.
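
As a rough illustration of the first two changes, the sketch below compares the old `replset_state == 3` guard with a centralized `is_readable` property. The `ReplicaSetDeployment` class here is a simplified stand-in with assumed fields, and the two guard functions are hypothetical helpers. The real class and call sites in the integration may look different.

```python
from dataclasses import dataclass

# Replica-set member state codes named in the PR description.
STARTUP, PRIMARY, SECONDARY, RECOVERING = 0, 1, 2, 3
STARTUP2, DOWN, ROLLBACK = 5, 8, 9

READABLE_STATES = frozenset({PRIMARY, SECONDARY})


@dataclass
class ReplicaSetDeployment:
    """Simplified stand-in; the real class carries more fields."""

    replset_name: str
    replset_state: int

    @property
    def is_readable(self) -> bool:
        # Readable means the member can serve reads: PRIMARY or SECONDARY.
        return self.replset_state in READABLE_STATES


def old_guard_skips(deployment) -> bool:
    # The scattered guard the PR replaces: only RECOVERING was skipped.
    return deployment.replset_state == 3


def new_guard_skips(deployment) -> bool:
    # The centralized guard: every non-readable state is skipped.
    return not deployment.is_readable
```

With the old guard, a member in ROLLBACK or STARTUP2 would still attempt collection. With `is_readable`, it is skipped like a RECOVERING member.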

Motivation

Users running the check against a replica-set member that briefly enters a non-readable state (recovering, rollback, etc.) saw the whole check fail with a NotPrimaryError traceback and a CRITICAL service check, when the correct behavior is to skip collection for that run and recover automatically.
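
The intended "skip this run, recover on the next" behavior can be sketched as follows. Everything here is illustrative: `NotPrimaryError` is a local stand-in for the pymongo exception, and `make_reader` and `run_check` are hypothetical helpers, not functions from the integration.

```python
class NotPrimaryError(Exception):
    """Stand-in for pymongo.errors.NotPrimaryError, so the sketch runs without pymongo."""


def make_reader(states):
    """Return a fake read command that walks through the given state codes."""
    remaining = iter(states)

    def read_metrics():
        state = next(remaining)
        if state not in (1, 2):  # neither PRIMARY nor SECONDARY
            raise NotPrimaryError(f"node is in state {state}")
        return {"replset_state": state}

    return read_metrics


def run_check(read_metrics):
    """Return (service_check_status, metrics or None) for a single run."""
    try:
        return "OK", read_metrics()
    except NotPrimaryError:
        # Transient condition: skip this run and leave the service check OK.
        return "OK", None


read = make_reader([3, 2])  # RECOVERING on the first run, SECONDARY on the second
first = run_check(read)
second = run_check(read)
```

In this sketch the first run collects nothing but stays OK, and the second run collects normally once the member is readable again.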

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add qa/required if this PR needs QA validation, or qa/skip-qa if it does not. Exactly one of the two is required.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

🤖 Generated with Claude Code

When a mongod node is in a replica-set state that is neither PRIMARY nor
SECONDARY (recovering, startup, rollback, down, ...), read commands fail with
NotPrimaryError (code 13436). Because NotPrimaryError subclasses
ConnectionFailure, the check treated it as a CRITICAL connection failure and
crashed with a traceback, even though the node is reachable.

Add a centralized ReplicaSetDeployment.is_readable helper and use it to skip
collection across the DBM jobs and database autodiscovery (replacing the
scattered replset_state == 3 checks), and handle NotPrimaryError gracefully in
the synchronous collector loop and at the check level so a state transition
no longer marks the whole instance CRITICAL.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@HadhemiDD HadhemiDD added the qa/skip-qa Automatically skip this PR for the next QA label Jun 11, 2026
@datadog-official

datadog-official Bot commented Jun 11, 2026

Pipelines

⚠️ Warnings

🚦 43 Pipeline jobs failed

PR | test / test (linux, ubuntu-22.04, mongo, MongoDB (py3.13-4.4-auth), py3.13-4.4-auth) / MongoDB (py3.13-4.4-auth)-py3.13-4.4-auth

PR | test / test (linux, ubuntu-22.04, mongo, MongoDB (py3.13-4.4-shard), py3.13-4.4-shard) / MongoDB (py3.13-4.4-shard)-py3.13-4.4-shard

PR | test / test (linux, ubuntu-22.04, mongo, MongoDB (py3.13-4.4-standalone), py3.13-4.4-standalone) / MongoDB (py3.13-4.4-standalone)-py3.13-4.4-standalone

🔗 Commit SHA: 79337b6

@dd-octo-sts

dd-octo-sts Bot commented Jun 11, 2026

Validation Report

All 21 validations passed.

| Validation | Description | Status |
| --- | --- | --- |
| agent-reqs | Verify check versions match the Agent requirements file | Passed |
| ci | Validate CI configuration and code coverage settings | Passed |
| codeowners | Validate every integration has a CODEOWNERS entry | Passed |
| config | Validate default configuration files against spec.yaml | Passed |
| dep | Verify dependency pins are consistent and Agent-compatible | Passed |
| http | Validate integrations use the HTTP wrapper correctly | Passed |
| imports | Validate check imports do not use deprecated modules | Passed |
| integration-style | Validate check code style conventions | Passed |
| jmx-metrics | Validate JMX metrics definition files and config | Passed |
| labeler | Validate PR labeler config matches integration directories | Passed |
| legacy-signature | Validate no integration uses the legacy Agent check signature | Passed |
| license-headers | Validate Python files have proper license headers | Passed |
| licenses | Validate third-party license attribution list | Passed |
| metadata | Validate metadata.csv metric definitions | Passed |
| models | Validate configuration data models match spec.yaml | Passed |
| openmetrics | Validate OpenMetrics integrations disable the metric limit | Passed |
| package | Validate Python package metadata and naming | Passed |
| qa-label | Validate the pull request declares whether it needs QA for the next Agent release | Passed |
| readmes | Validate README files have required sections | Passed |
| saved-views | Validate saved view JSON file structure and fields | Passed |
| version | Validate version consistency between package and changelog | Passed |

Labels

integration/mongo qa/skip-qa Automatically skip this PR for the next QA
