adapter: add LaunchDarkly reconnect integration test by jasonhernandez · Pull Request #37026 · MaterializeInc/materialize

jasonhernandez · 2026-06-13T00:13:50Z

Stacked on #37025

This is stacked on #37025 (the LaunchDarkly upstream migration) and branches off it, so the diff currently includes that PR's commit too. Review the second commit only; once #37025 merges to main, this collapses to just the reconnect test. Merge after #37025.

Motivation

incident-984 was a runtime failure — the LaunchDarkly data source stopped reconnecting after its streaming connection dropped with a non-Eof error, silently wedging flag sync. The existing test/launchdarkly nightly covers value sync, persistence, targeting, and the kill switch, but nothing exercises reconnect after a mid-stream drop, which is the behavior the upstream fixes (rust-server-sdk#168, rust-eventsource-client#134/#135) actually repair. A prior attempt to add such a test was abandoned because the failure couldn't be reproduced against real LaunchDarkly ("the test would need to cut the connection"). This reproduces it deterministically against a mock.

What changed

Production: a hidden --launchdarkly-base-uri flag (env MZ_LAUNCHDARKLY_BASE_URI) that overrides the SDK's streaming/polling/events endpoints with a single base URL, via the SDK's ServiceEndpointsBuilder::relay_proxy. This is generally useful for relay-proxy setups; here it lets tests point the SDK at a mock. Threaded through SystemParameterSyncClientConfig::LaunchDarkly → ld_config.

Test (test/launchdarkly-reconnect):

mock_ld.py — a minimal mock of the LD streaming API. The first streaming client gets an initial flag value (2 GiB), then the connection is reset mid-stream with a TCP RST (SO_LINGER 0) so the SDK sees a non-Eof transport error (the incident class), not the Eof it always recovered from. Every reconnecting client gets an updated value (3 GiB).
mzcompose.py — boots environmentd pointed at the mock (mapping max_result_size to the flag) and asserts SHOW max_result_size reaches 3GB. That can only happen if the data source reconnected after the reset; a regressed SDK stays stuck at 2GB and the assertion times out.
Wired into the nightly pipeline. Needs no real LaunchDarkly credentials (unlike test/launchdarkly).
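
For reviewers unfamiliar with the SO_LINGER trick, here is a minimal standalone sketch of the reset mechanism. It is not the actual mock_ld.py: the payload is a simplified placeholder rather than the real LaunchDarkly flag JSON, and the function names and the `release` event are hypothetical, added only so the demo is deterministic. It assumes a POSIX `struct linger` layout (two C ints).

```python
import socket
import struct
import threading

# Placeholder SSE frame; the real mock sends LaunchDarkly-shaped flag JSON.
INITIAL_EVENT = b'event: put\ndata: {"max_result_size": 2147483648}\n\n'


def serve_once_then_reset(listener, release):
    """Serve one SSE event, then abort the connection with a TCP RST."""
    conn, _ = listener.accept()
    conn.recv(4096)  # consume the client's GET request
    conn.sendall(
        b"HTTP/1.1 200 OK\r\nContent-Type: text/event-stream\r\n\r\n"
        + INITIAL_EVENT
    )
    release.wait(timeout=5)  # demo-only: wait until the client has read the event
    # l_onoff=1, l_linger=0: close() discards the connection and sends RST
    # instead of the normal FIN handshake.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    conn.close()


def demo():
    """Return (bytes received, how the stream ended) as seen by a raw client."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    release = threading.Event()
    server = threading.Thread(
        target=serve_once_then_reset, args=(listener, release), daemon=True
    )
    server.start()

    client = socket.create_connection(listener.getsockname(), timeout=5)
    client.sendall(b"GET /all HTTP/1.1\r\nHost: mock\r\n\r\n")
    received = b""
    while INITIAL_EVENT not in received:
        chunk = client.recv(4096)
        if not chunk:
            break
        received += chunk
    release.set()  # let the server reset the connection now
    try:
        outcome = "eof" if client.recv(4096) == b"" else "data"
    except ConnectionResetError:
        outcome = "reset"  # transport error, not a clean end of stream
    finally:
        client.close()
        listener.close()
    server.join(timeout=5)
    return received, outcome
```

A plain `close()` without the linger option would normally end the stream with a clean EOF, which is the case the SDK always recovered from; the linger setting is what turns the close into a transport error on the client side.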

Validation

Green (fixed SDK) — empirically passes. Ran locally against a full materialized build on the migrated 3.1.1 SDK. Container logs show the intended sequence end-to-end: SDK connects to the mock (GET /all) → mock resets the first stream mid-flight (resetting first streaming connection) → SDK reconnects (GET /all again) → SHOW max_result_size reaches 3GB. So the mock protocol (SSE framing + flag JSON), the endpoint-override flag, and the reconnect path all work.

Red (pre-fix SDK) — proven at the source level. The reconnect fix is rust-server-sdk#168 (v3.0.3), which changed the streaming data source's non-Eof error handling from error!("unhandled error..."); break; (v3.0.2 — aborts the task, killing the data source) to continue (keeps it alive). The mock's disruption is a TCP RST, i.e. exactly such a non-Eof error (confirmed firing in the green-run logs), so on v3.0.2 the data source dies and max_result_size stays at 2GB — the assertion times out. I attempted to confirm this empirically by pinning to v3.0.2, but the run was blocked by an unrelated local colima/buildx image-resolution quirk (compose couldn't resolve the freshly-built manifest-list images); the assertion itself was never reached. CI builds always run on the fixed SDK, so the nightly job is the ongoing guard; the source diff above is what establishes that a future regression would turn it red.
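
The source-level argument can be illustrated with a toy model. This is not the SDK's code: `final_flag_value`, the `(kind, payload)` item shape, and the `continue_on_error` switch are all hypothetical, chosen only to mirror the `break` versus `continue` difference described above.

```python
GIB = 1024 ** 3


def final_flag_value(stream_items, continue_on_error):
    """Toy model of the streaming loop: return the last flag value applied.

    stream_items is a sequence of (kind, payload) pairs, where kind is
    "put" (payload is the new value) or "error" (payload names the error).
    """
    value = None
    for kind, payload in stream_items:
        if kind == "put":
            value = payload
        elif kind == "error":
            # Eof was always survivable; other errors only survive when the
            # loop continues instead of breaking (the v3.0.3 behavior).
            if payload == "eof" or continue_on_error:
                continue
            break  # v3.0.2 behavior: the task ends and later updates are lost
    return value


# What the mock produces: initial value, a reset, then the updated value.
ITEMS = [("put", 2 * GIB), ("error", "reset"), ("put", 3 * GIB)]

pre_fix = final_flag_value(ITEMS, continue_on_error=False)
post_fix = final_flag_value(ITEMS, continue_on_error=True)
```

Under this model the pre-fix loop ends on the 2 GiB value and the post-fix loop reaches 3 GiB, which is the distinction the mzcompose assertion relies on.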

Move launchdarkly-server-sdk from the MaterializeInc/rust-server-sdk fork back to upstream crates.io 3.1.1, restoring the launchdarkly-sdk-transport + MetricsTransport setup and dropping the [patch.crates-io] override.

The fork existed for launchdarkly/rust-server-sdk#116: a StreamingDataSource/eventsource StreamClosed bug where a non-Eof stream error left the data source stuck with no reconnect, silently breaking LD sync. A prior upgrade to upstream 3.0.1 had to be reverted (incident-984) because that bug was still unfixed upstream. The fixes have since landed (rust-server-sdk#168 and rust-eventsource-client#134/#135), and 3.1.1 resolves eventsource-client to 0.17.5, which carries them.

Use the rustls + aws-lc-rs features (hyper-rustls-native-roots, crypto-aws-lc-rs), now the upstream defaults, instead of the prior attempt's native-tls/crypto-openssl, avoiding the OpenSSL path. The transport build_https() call is identical either way.

deny.toml gains skips for the duplicate versions the transport stack pulls (older tower/rustls-native-certs; newer rand/rand_core/getrandom/cpufeatures) and re-adds the launchdarkly-sdk-transport wrapper.

Adds a test, test_metric_frozen_on_midstream_error, modeling the exact incident-984 failure mode (200 OK then a mid-stream timeout): it asserts the last_sse_time_seconds gauge freezes so the staleness alert can detect a stuck data source.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Add an integration test that reproduces incident-984: the LaunchDarkly data source must reconnect after its streaming connection drops with a non-Eof error, so flag updates keep syncing.

To make this testable against a controlled server, add a hidden `--launchdarkly-base-uri` flag (env `MZ_LAUNCHDARKLY_BASE_URI`) that overrides the SDK's streaming/polling/events endpoints with a single base URL via the SDK's relay-proxy support. This is also generally useful for pointing at a LaunchDarkly relay proxy.

The test (test/launchdarkly-reconnect) runs a mock LaunchDarkly streaming server that serves an initial flag value, resets the first streaming connection mid-stream with a TCP RST (a non-Eof transport error, as in the incident), and serves an updated value to every reconnecting client. environmentd is pointed at the mock, so reaching the updated value proves the data source reconnected; a regressed SDK stays stuck on the initial value.

Unlike test/launchdarkly, this needs no real LaunchDarkly credentials, and runs in the nightly pipeline.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

jasonhernandez and others added 2 commits June 12, 2026 16:46

jasonhernandez wants to merge 2 commits into MaterializeInc:main from jasonhernandez:jason/ld-sync-reconnect-test
