Skip to content

Default mysql read_timeout to 10 seconds#24009

Open
ian28223 wants to merge 1 commit into
masterfrom
ian.bucad/mysql_read_timeout_default
Open

Default mysql read_timeout to 10 seconds#24009
ian28223 wants to merge 1 commit into
masterfrom
ian.bucad/mysql_read_timeout_default

Conversation

@ian28223

Copy link
Copy Markdown
Contributor

What does this PR do?

Defaults read_timeout to 10 seconds for the MySQL check (previously unset).

Motivation

Without a read_timeout, pymysql calls sock.settimeout(None) immediately after
the TCP handshake, stripping all timeout protection from subsequent reads — the
server greeting, auth handshake, and queries all block indefinitely. During a MySQL
restart with InnoDB recovery, the check hangs silently: mysql.can_connect never
fires CRITICAL even though the database is unreachable.

Confirmed in AGENT-16350: against a paused MySQL 8.0.36 container, the check hung
for the entire observation window with zero service checks emitted. With
read_timeout: 10, CRITICAL fires within the configured seconds.

10 seconds matches the existing connect_timeout default and is consistent with
SQLServer (command_timeout: 10) and Elasticsearch (timeout: 10). It is ~50x
the observed p99 check completion time across 46 instances (193ms), making false
positives from normal operation essentially impossible.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add qa/required if this PR needs QA validation, or qa/skip-qa if it does not. Exactly one of the two is required.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

@datadog-prod-us1-4

datadog-prod-us1-4 Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Tests  Code Coverage

🎉 All green!

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 90.42% (+2.62%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 3bc6ae5 | Docs | Datadog PR Page | Give us feedback!

@ian28223 ian28223 force-pushed the ian.bucad/mysql_read_timeout_default branch from fa21cfe to a0b4ed7 Compare June 11, 2026 02:25
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ian28223 ian28223 force-pushed the ian.bucad/mysql_read_timeout_default branch from a8c73fd to 3bc6ae5 Compare June 11, 2026 03:47
@ian28223 ian28223 marked this pull request as ready for review June 11, 2026 03:54
@ian28223 ian28223 requested review from a team as code owners June 11, 2026 03:54
@dd-octo-sts

dd-octo-sts Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Validation Report

All 21 validations passed.

Show details
Validation Description Status
agent-reqs Verify check versions match the Agent requirements file
ci Validate CI configuration and code coverage settings
codeowners Validate every integration has a CODEOWNERS entry
config Validate default configuration files against spec.yaml
dep Verify dependency pins are consistent and Agent-compatible
http Validate integrations use the HTTP wrapper correctly
imports Validate check imports do not use deprecated modules
integration-style Validate check code style conventions
jmx-metrics Validate JMX metrics definition files and config
labeler Validate PR labeler config matches integration directories
legacy-signature Validate no integration uses the legacy Agent check signature
license-headers Validate Python files have proper license headers
licenses Validate third-party license attribution list
metadata Validate metadata.csv metric definitions
models Validate configuration data models match spec.yaml
openmetrics Validate OpenMetrics integrations disable the metric limit
package Validate Python package metadata and naming
qa-label Validate the pull request declares whether it needs QA for the next Agent release
readmes Validate README files have required sections
saved-views Validate saved view JSON file structure and fields
version Validate version consistency between package and changelog

View full run

Comment thread mysql/tests/test_unit.py
Comment on lines +929 to +943
start = time.time()
with pytest.raises(Exception):
check.check(None)
elapsed = time.time() - start

server.close()
for conn in accepted:
conn.close()

assert any(name == 'mysql.can_connect' and status == MySql.CRITICAL for name, status in service_checks), (
f"Expected mysql.can_connect CRITICAL, got {service_checks}"
)

# Should fire within a reasonable multiple of read_timeout, not hang forever
assert elapsed < 10, f"Check took {elapsed:.1f}s — read_timeout not enforced"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: could we mock time here? Otherwise, it might be better to move this out of the unit tests, since this test can take up to 10s while waiting for the timeout.

After a quick look, it seems the read_timeout parameter is passed through to the pymysql library here: https://github.com/DataDog/integrations-core/blob/master/mysql/datadog_checks/mysql/util.py#L52-L63

So we may effectively be testing the library behavior itself in this test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point on test duration — though to clarify, the test uses read_timeout=2 so it runs in ~2s, not 10s (the 10s default is what would apply in production).

On testing library behavior: the fake endpoint is key here — it's a raw TCP socket that accepts the connection but never sends a MySQL greeting, simulating exactly the freeze scenario (e.g. docker pause) that caused the 2-hour silent outage. The assertion we care about is that mysql.can_connect fires CRITICAL, not that pymysql respects timeouts in general.

That said, the assert elapsed < 10 line is redundant — if the CRITICAL fires, the timeout already did its job. Happy to remove it.

@eric-weaver eric-weaver left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ian28223 Thank you for flagging this issue. I agree that this default seems incorrect and a 10s read timeout makes sense. Before we can merge this I need to do some deeper digging into whether we can currently safely apply this or not which is why Im currently marking this request for changes to block merge but we will look into this on our side.

For context 10s was the default when this config options was initially exposed here in datadog-agent v7.53.0 but then reverted back to the null timeout here in datadog-agent v7.56.0 due to a connection leak uncovered in parts of the integration when this read timeout occurs.

A good deal of changes have been made including around connection handling so this might no longer be an issue but I'll need to spend some time digging through our connection handling to make sure this is safe to apply directly or if we need some additional changes to unlock it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants