Default mysql read_timeout to 10 seconds#24009
Conversation
🎉 All green!🧪 All tests passed 🎯 Code Coverage (details) 🔗 Commit SHA: 3bc6ae5 | Docs | Datadog PR Page | Give us feedback! |
fa21cfe to
a0b4ed7
Compare
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
a8c73fd to
3bc6ae5
Compare
Validation ReportAll 21 validations passed. Show details
|
| start = time.time() | ||
| with pytest.raises(Exception): | ||
| check.check(None) | ||
| elapsed = time.time() - start | ||
|
|
||
| server.close() | ||
| for conn in accepted: | ||
| conn.close() | ||
|
|
||
| assert any(name == 'mysql.can_connect' and status == MySql.CRITICAL for name, status in service_checks), ( | ||
| f"Expected mysql.can_connect CRITICAL, got {service_checks}" | ||
| ) | ||
|
|
||
| # Should fire within a reasonable multiple of read_timeout, not hang forever | ||
| assert elapsed < 10, f"Check took {elapsed:.1f}s — read_timeout not enforced" |
There was a problem hiding this comment.
Suggestion: could we mock time here? Otherwise, it might be better to move this out of the unit tests, since this test can take up to 10s while waiting for the timeout.
After a quick look, it seems the read_timeout parameter is passed through to the pymysql library here: https://github.com/DataDog/integrations-core/blob/master/mysql/datadog_checks/mysql/util.py#L52-L63
So we may effectively be testing the library behavior itself in this test.
There was a problem hiding this comment.
Good point on test duration — though to clarify, the test uses read_timeout=2 so it runs in ~2s, not 10s (the 10s default is what would apply in production).
On testing library behavior: the fake endpoint is key here — it's a raw TCP socket that accepts the connection but never sends a MySQL greeting, simulating exactly the freeze scenario (e.g. docker pause) that caused the 2-hour silent outage. The assertion we care about is that mysql.can_connect fires CRITICAL, not that pymysql respects timeouts in general.
That said, the assert elapsed < 10 line is redundant — if the CRITICAL fires, the timeout already did its job. Happy to remove it.
eric-weaver
left a comment
There was a problem hiding this comment.
@ian28223 Thank you for flagging this issue. I agree that this default seems incorrect and a 10s read timeout makes sense. Before we can merge this I need to do some deeper digging into whether we can currently safely apply this or not which is why Im currently marking this request for changes to block merge but we will look into this on our side.
For context 10s was the default when this config options was initially exposed here in datadog-agent v7.53.0 but then reverted back to the null timeout here in datadog-agent v7.56.0 due to a connection leak uncovered in parts of the integration when this read timeout occurs.
A good deal of changes have been made including around connection handling so this might no longer be an issue but I'll need to spend some time digging through our connection handling to make sure this is safe to apply directly or if we need some additional changes to unlock it.
What does this PR do?
Defaults
read_timeoutto 10 seconds for the MySQL check (previously unset).Motivation
Without a
read_timeout, pymysql callssock.settimeout(None)immediately afterthe TCP handshake, stripping all timeout protection from subsequent reads — the
server greeting, auth handshake, and queries all block indefinitely. During a MySQL
restart with InnoDB recovery, the check hangs silently:
mysql.can_connectneverfires CRITICAL even though the database is unreachable.
Confirmed in AGENT-16350: against a paused MySQL 8.0.36 container, the check hung
for the entire observation window with zero service checks emitted. With
read_timeout: 10, CRITICAL fires within the configured seconds.10 seconds matches the existing
connect_timeoutdefault and is consistent withSQLServer (
command_timeout: 10) and Elasticsearch (timeout: 10). It is ~50xthe observed p99 check completion time across 46 instances (193ms), making false
positives from normal operation essentially impossible.
Review checklist (to be filled by reviewers)
qa/requiredif this PR needs QA validation, orqa/skip-qaif it does not. Exactly one of the two is required.backport/<branch-name>label to the PR and it will automatically open a backport PR once this one is merged