Skip to content

adapter: add LaunchDarkly reconnect integration test#37026

Draft
jasonhernandez wants to merge 2 commits into
MaterializeInc:mainfrom
jasonhernandez:jason/ld-sync-reconnect-test
Draft

adapter: add LaunchDarkly reconnect integration test#37026
jasonhernandez wants to merge 2 commits into
MaterializeInc:mainfrom
jasonhernandez:jason/ld-sync-reconnect-test

Conversation

@jasonhernandez

@jasonhernandez jasonhernandez commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Stacked on #37025

This is stacked on #37025 (the LaunchDarkly upstream migration) and branches off it, so the diff currently includes that PR's commit too. Review the second commit only; once #37025 merges to main, this collapses to just the reconnect test. Merge after #37025.

Motivation

incident-984 was a runtime failure — the LaunchDarkly data source stopped reconnecting after its streaming connection dropped with a non-Eof error, silently wedging flag sync. The existing test/launchdarkly nightly covers value sync, persistence, targeting, and the kill switch, but nothing exercises reconnect after a mid-stream drop, which is the behavior the upstream fixes (rust-server-sdk#168, rust-eventsource-client#134/#135) actually repair. A prior attempt to add such a test was abandoned because the failure couldn't be reproduced against real LaunchDarkly ("the test would need to cut the connection"). This reproduces it deterministically against a mock.

What changed

Production: a hidden --launchdarkly-base-uri flag (env MZ_LAUNCHDARKLY_BASE_URI) that overrides the SDK's streaming/polling/events endpoints with a single base URL, via the SDK's ServiceEndpointsBuilder::relay_proxy. This is generally useful for relay-proxy setups; here it lets tests point the SDK at a mock. Threaded through SystemParameterSyncClientConfig::LaunchDarklyld_config.

Test (test/launchdarkly-reconnect):

  • mock_ld.py — a minimal mock of the LD streaming API. The first streaming client gets an initial flag value (2 GiB), then the connection is reset mid-stream with a TCP RST (SO_LINGER 0) so the SDK sees a non-Eof transport error (the incident class), not the Eof it always recovered from. Every reconnecting client gets an updated value (3 GiB).
  • mzcompose.py — boots environmentd pointed at the mock (mapping max_result_size to the flag) and asserts SHOW max_result_size reaches 3GB. That can only happen if the data source reconnected after the reset; a regressed SDK stays stuck at 2GB and the assertion times out.
  • Wired into the nightly pipeline. Needs no real LaunchDarkly credentials (unlike test/launchdarkly).

Validation

Green (fixed SDK) — empirically passes. Ran locally against a full materialized build on the migrated 3.1.1 SDK. Container logs show the intended sequence end-to-end: SDK connects to the mock (GET /all) → mock resets the first stream mid-flight (resetting first streaming connection) → SDK reconnects (GET /all again) → SHOW max_result_size reaches 3GB. So the mock protocol (SSE framing + flag JSON), the endpoint-override flag, and the reconnect path all work.

Red (pre-fix SDK) — proven at the source level. The reconnect fix is rust-server-sdk#168 (v3.0.3), which changed the streaming data source's non-Eof error handling from error!("unhandled error..."); break; (v3.0.2 — aborts the task, killing the data source) to continue (keeps it alive). The mock's disruption is a TCP RST, i.e. exactly such a non-Eof error (confirmed firing in the green-run logs), so on v3.0.2 the data source dies and max_result_size stays at 2GB — the assertion times out. I attempted to confirm this empirically by pinning to v3.0.2, but the run was blocked by an unrelated local colima/buildx image-resolution quirk (compose couldn't resolve the freshly-built manifest-list images); the assertion itself was never reached. CI builds always run on the fixed SDK, so the nightly job is the ongoing guard; the source diff above is what establishes that a future regression would turn it red.

jasonhernandez and others added 2 commits June 12, 2026 16:46
Move launchdarkly-server-sdk from the MaterializeInc/rust-server-sdk
fork back to upstream crates.io 3.1.1, restoring the launchdarkly-sdk-
transport + MetricsTransport setup and dropping the [patch.crates-io]
override.

The fork existed for launchdarkly/rust-server-sdk#116: a StreamingData
Source/eventsource StreamClosed bug where a non-Eof stream error left
the data source stuck with no reconnect, silently breaking LD sync. A
prior upgrade to upstream 3.0.1 had to be reverted (incident-984) because
that bug was still unfixed upstream. The fixes have since landed —
rust-server-sdk#168 and rust-eventsource-client#134/#135 — and 3.1.1
resolves eventsource-client to 0.17.5, which carries them.

Use the rustls + aws-lc-rs features (hyper-rustls-native-roots,
crypto-aws-lc-rs), now the upstream defaults, instead of the prior
attempt's native-tls/crypto-openssl, avoiding the OpenSSL path. The
transport build_https() call is identical either way.

deny.toml gains skips for the duplicate versions the transport stack
pulls (older tower/rustls-native-certs; newer rand/rand_core/getrandom/
cpufeatures) and re-adds the launchdarkly-sdk-transport wrapper.

Adds a test, test_metric_frozen_on_midstream_error, modeling the exact
incident-984 failure mode (200 OK then a mid-stream timeout): it asserts
the last_sse_time_seconds gauge freezes so the staleness alert can detect
a stuck data source.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add an integration test that reproduces incident-984: the LaunchDarkly data
source must reconnect after its streaming connection drops with a non-Eof
error, so flag updates keep syncing.

To make this testable against a controlled server, add a hidden
`--launchdarkly-base-uri` flag (env `MZ_LAUNCHDARKLY_BASE_URI`) that overrides
the SDK's streaming/polling/events endpoints with a single base URL via the
SDK's relay-proxy support. This is also generally useful for pointing at a
LaunchDarkly relay proxy.

The test (test/launchdarkly-reconnect) runs a mock LaunchDarkly streaming
server that serves an initial flag value, resets the first streaming connection
mid-stream with a TCP RST (a non-Eof transport error, as in the incident), and
serves an updated value to every reconnecting client. environmentd is pointed
at the mock, so reaching the updated value proves the data source reconnected;
a regressed SDK stays stuck on the initial value. Unlike test/launchdarkly,
this needs no real LaunchDarkly credentials, and runs in the nightly pipeline.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant