adapter: add LaunchDarkly reconnect integration test#37026
Draft
jasonhernandez wants to merge 2 commits into
Draft
adapter: add LaunchDarkly reconnect integration test#37026jasonhernandez wants to merge 2 commits into
jasonhernandez wants to merge 2 commits into
Conversation
Move launchdarkly-server-sdk from the MaterializeInc/rust-server-sdk fork back to upstream crates.io 3.1.1, restoring the launchdarkly-sdk- transport + MetricsTransport setup and dropping the [patch.crates-io] override. The fork existed for launchdarkly/rust-server-sdk#116: a StreamingData Source/eventsource StreamClosed bug where a non-Eof stream error left the data source stuck with no reconnect, silently breaking LD sync. A prior upgrade to upstream 3.0.1 had to be reverted (incident-984) because that bug was still unfixed upstream. The fixes have since landed — rust-server-sdk#168 and rust-eventsource-client#134/#135 — and 3.1.1 resolves eventsource-client to 0.17.5, which carries them. Use the rustls + aws-lc-rs features (hyper-rustls-native-roots, crypto-aws-lc-rs), now the upstream defaults, instead of the prior attempt's native-tls/crypto-openssl, avoiding the OpenSSL path. The transport build_https() call is identical either way. deny.toml gains skips for the duplicate versions the transport stack pulls (older tower/rustls-native-certs; newer rand/rand_core/getrandom/ cpufeatures) and re-adds the launchdarkly-sdk-transport wrapper. Adds a test, test_metric_frozen_on_midstream_error, modeling the exact incident-984 failure mode (200 OK then a mid-stream timeout): it asserts the last_sse_time_seconds gauge freezes so the staleness alert can detect a stuck data source. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add an integration test that reproduces incident-984: the LaunchDarkly data source must reconnect after its streaming connection drops with a non-Eof error, so flag updates keep syncing. To make this testable against a controlled server, add a hidden `--launchdarkly-base-uri` flag (env `MZ_LAUNCHDARKLY_BASE_URI`) that overrides the SDK's streaming/polling/events endpoints with a single base URL via the SDK's relay-proxy support. This is also generally useful for pointing at a LaunchDarkly relay proxy. The test (test/launchdarkly-reconnect) runs a mock LaunchDarkly streaming server that serves an initial flag value, resets the first streaming connection mid-stream with a TCP RST (a non-Eof transport error, as in the incident), and serves an updated value to every reconnecting client. environmentd is pointed at the mock, so reaching the updated value proves the data source reconnected; a regressed SDK stays stuck on the initial value. Unlike test/launchdarkly, this needs no real LaunchDarkly credentials, and runs in the nightly pipeline. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stacked on #37025
This is stacked on #37025 (the LaunchDarkly upstream migration) and branches off it, so the diff currently includes that PR's commit too. Review the second commit only; once #37025 merges to
main, this collapses to just the reconnect test. Merge after #37025.Motivation
incident-984 was a runtime failure — the LaunchDarkly data source stopped reconnecting after its streaming connection dropped with a non-
Eoferror, silently wedging flag sync. The existingtest/launchdarklynightly covers value sync, persistence, targeting, and the kill switch, but nothing exercises reconnect after a mid-stream drop, which is the behavior the upstream fixes (rust-server-sdk#168, rust-eventsource-client#134/#135) actually repair. A prior attempt to add such a test was abandoned because the failure couldn't be reproduced against real LaunchDarkly ("the test would need to cut the connection"). This reproduces it deterministically against a mock.What changed
Production: a hidden
--launchdarkly-base-uriflag (envMZ_LAUNCHDARKLY_BASE_URI) that overrides the SDK's streaming/polling/events endpoints with a single base URL, via the SDK'sServiceEndpointsBuilder::relay_proxy. This is generally useful for relay-proxy setups; here it lets tests point the SDK at a mock. Threaded throughSystemParameterSyncClientConfig::LaunchDarkly→ld_config.Test (
test/launchdarkly-reconnect):mock_ld.py— a minimal mock of the LD streaming API. The first streaming client gets an initial flag value (2 GiB), then the connection is reset mid-stream with a TCP RST (SO_LINGER0) so the SDK sees a non-Eoftransport error (the incident class), not theEofit always recovered from. Every reconnecting client gets an updated value (3 GiB).mzcompose.py— boots environmentd pointed at the mock (mappingmax_result_sizeto the flag) and assertsSHOW max_result_sizereaches3GB. That can only happen if the data source reconnected after the reset; a regressed SDK stays stuck at2GBand the assertion times out.test/launchdarkly).Validation
Green (fixed SDK) — empirically passes. Ran locally against a full
materializedbuild on the migrated 3.1.1 SDK. Container logs show the intended sequence end-to-end: SDK connects to the mock (GET /all) → mock resets the first stream mid-flight (resetting first streaming connection) → SDK reconnects (GET /allagain) →SHOW max_result_sizereaches3GB. So the mock protocol (SSE framing + flag JSON), the endpoint-override flag, and the reconnect path all work.Red (pre-fix SDK) — proven at the source level. The reconnect fix is rust-server-sdk#168 (v3.0.3), which changed the streaming data source's non-
Eoferror handling fromerror!("unhandled error..."); break;(v3.0.2 — aborts the task, killing the data source) tocontinue(keeps it alive). The mock's disruption is a TCP RST, i.e. exactly such a non-Eoferror (confirmed firing in the green-run logs), so on v3.0.2 the data source dies andmax_result_sizestays at2GB— the assertion times out. I attempted to confirm this empirically by pinning to v3.0.2, but the run was blocked by an unrelated local colima/buildx image-resolution quirk (compose couldn't resolve the freshly-built manifest-list images); the assertion itself was never reached. CI builds always run on the fixed SDK, so the nightly job is the ongoing guard; the source diff above is what establishes that a future regression would turn it red.