fix(ci): Make integration tests on Linux arm64 not flaky#4313

Draft
aldy505 wants to merge 8 commits into master from aldy505/fix/ci-arm64-flakes

Conversation

@aldy505
Collaborator

@aldy505 aldy505 commented Apr 29, 2026

Closes SELF-93

@linear-code

linear-code Bot commented Apr 29, 2026

Comment thread sentry/sentry.conf.example.py Outdated
  "message.max.bytes": 50000000,
- "socket.timeout.ms": 1000,
+ "socket.timeout.ms": 10000, # Timeout for individual socket operations (send/recv)
+ "request.timeout.ms": 30000, # Max time to wait for a broker response before failing
Collaborator

From logs:

  uptime-results-1                         | %4|1778036634.157|SESSTMOUT|rdkafka#consumer-1| [thrd:main]: Consumer group session timed out (in join-state steady) after 30400 ms without a successful response from the group coordinator (broker 1001, last error was Success): revoking assignment and rejoining group

Searching for "Consumer group session timed out" shows a couple more Kafka timeouts after 30 s as well. Maybe we should increase the Kafka timeout even further?

Collaborator Author

@aminvakil holy sh*t you're right

Collaborator

To be honest, I didn't catch this myself; I've also been using AI to review stuff for a while now :)

Collaborator Author

But the problem persists. I have no idea what to blame other than GitHub's arm64 runners.

Collaborator

Yeah, they are too slow; this happens to me in other projects as well.

I searched for "Consumer group session timed out" in the latest run:

  process-spans-1                            | %4|1779437870.346|SESSTMOUT|rdkafka#consumer-1| [thrd:main]: Consumer group session timed out (in join-state steady) after 31691 ms without a successful response from the group coordinator (broker -1, last error was Success): revoking assignment and rejoining group

Seems like they still have a 30-second timeout.
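
The log search described above could be scripted roughly as follows. This is only a sketch: the regular expression is inferred from the two SESSTMOUT lines quoted in this thread, and `SESSTMOUT_RE` and `find_session_timeouts` are hypothetical names, not part of any existing tooling.

```python
import re

# Pattern inferred from the docker compose log lines quoted in this thread
# (an assumption; other librdkafka versions may format the line differently).
SESSTMOUT_RE = re.compile(
    r"^\s*(?P<service>\S+)\s*\|.*\|SESSTMOUT\|.*"
    r"Consumer group session timed out .*? after (?P<ms>\d+) ms"
)


def find_session_timeouts(lines):
    """Return (service, elapsed_ms) for every session-timeout log line."""
    hits = []
    for line in lines:
        match = SESSTMOUT_RE.search(line)
        if match:
            hits.append((match.group("service"), int(match.group("ms"))))
    return hits
```

Feeding it the lines of a saved CI log would list which services timed out and after how many milliseconds, which makes it easy to see whether they all cluster around 30 s.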

Collaborator Author

This is according to AI:

Root cause summary:

  1. Silent config drop (Bug 1): KAFKA_CLUSTERS["default"] = DEFAULT_KAFKA_OPTIONS uses the legacy flat format. get_kafka_consumer_cluster_options() detects this and, because it's called with only_bootstrap=True, extracts only bootstrap.servers. All other settings, including session.timeout.ms and heartbeat.interval.ms, are silently discarded.
  2. HACK override (Bug 2): Even if the settings survived, build_consumer_config in consumers/__init__.py unconditionally overrides session.timeout.ms to match max_poll_interval_ms when that value is < 45000 ms. Since --max-poll-interval-ms defaults to 30000 in run.py, session.timeout.ms is force-set to 30 s, which is exactly what the ~31.7 s timeout error reflects.

Recommended fix (no code change required): Pass --max-poll-interval-ms 300000 (or any value ≥ 45000) to the sentry run consumer process-spans command in the self-hosted docker-compose.yml. This disables the HACK and allows the session timeout to stay at the safe default of 45 s.
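
To make the summary above concrete, here is a small model of the behaviour it describes. It is a sketch based only on that summary, not on Sentry's actual code: `effective_session_timeout_ms`, `configured_ms`, and `DEFAULT_SESSION_TIMEOUT_MS` are hypothetical names, and the 45000 ms default is taken from the text above.

```python
DEFAULT_SESSION_TIMEOUT_MS = 45_000  # the "safe default" cited in the summary


def effective_session_timeout_ms(max_poll_interval_ms, configured_ms=None):
    """Model the session timeout a consumer would end up with.

    Hypothetical illustration of the two bugs as described; not Sentry's code.
    """
    # Bug 1 (as described): the value set in KAFKA_CLUSTERS never reaches the
    # consumer, so configured_ms is deliberately ignored here.
    del configured_ms
    # Bug 2 (as described): a max poll interval below the default session
    # timeout forces the session timeout down to the same value.
    if max_poll_interval_ms < DEFAULT_SESSION_TIMEOUT_MS:
        return max_poll_interval_ms
    return DEFAULT_SESSION_TIMEOUT_MS
```

Under this model, `effective_session_timeout_ms(30_000, configured_ms=60_000)` gives 30000, which would match the roughly 30 s timeouts in the logs, while passing 300000 leaves the timeout at 45000.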

Let me try to mess things up again

@aldy505
Collaborator Author

aldy505 commented May 22, 2026

image

I'm happy already, but let me rerun the test one more time

@aldy505
Collaborator Author

aldy505 commented May 22, 2026

@aldy505
Collaborator Author

aldy505 commented May 22, 2026

image

Sigh, still flaky.

3 participants