From 6b6a803e3432bb5f76d2547290d25e0527a4b41f Mon Sep 17 00:00:00 2001 From: glasstiger Date: Wed, 13 May 2026 18:34:33 +0100 Subject: [PATCH] QWP HA docs --- .../client-failover/concepts.md | 246 ++++++++++++++ .../client-failover/configuration.md | 162 +++++++++ .../store-and-forward/concepts.md | 297 +++++++++++++++++ .../store-and-forward/configuration.md | 205 ++++++++++++ .../store-and-forward/operating-and-tuning.md | 309 ++++++++++++++++++ .../store-and-forward/when-to-use.md | 221 +++++++++++++ documentation/sidebars.js | 18 + 7 files changed, 1458 insertions(+) create mode 100644 documentation/high-availability/client-failover/concepts.md create mode 100644 documentation/high-availability/client-failover/configuration.md create mode 100644 documentation/high-availability/store-and-forward/concepts.md create mode 100644 documentation/high-availability/store-and-forward/configuration.md create mode 100644 documentation/high-availability/store-and-forward/operating-and-tuning.md create mode 100644 documentation/high-availability/store-and-forward/when-to-use.md diff --git a/documentation/high-availability/client-failover/concepts.md b/documentation/high-availability/client-failover/concepts.md new file mode 100644 index 000000000..8a64a104f --- /dev/null +++ b/documentation/high-availability/client-failover/concepts.md @@ -0,0 +1,246 @@ +--- +title: Client failover concepts +sidebar_label: Concepts +description: + How QuestDB clients detect a failed primary and transparently switch to a + healthy peer using multi-host addr lists, host-health classification, role + filtering, and zone-aware selection. +--- + +import { EnterpriseNote } from "@site/src/components/EnterpriseNote" + + + Client failover is most useful with QuestDB Enterprise primary-replica + replication. OSS users with a single instance gain limited benefit from + multi-host configuration. + + +:::note Java-only today + +Client-side failover support is currently available in the Java client. +Additional language clients are on the roadmap. + +::: + +When a QuestDB cluster fails over from one primary to another — whether through +a planned promotion, a rolling upgrade, or an unplanned outage — clients with a +single hard-coded address must be reconfigured and restarted. A failover-aware +client instead carries the full list of peers and walks that list automatically +when the current connection breaks. + +This page explains the model. The user-facing knobs and worked examples live in +the [Configuration](/docs/high-availability/client-failover/configuration/) +page. + +## What failover does + +You give the client a comma-separated list of endpoints: + +``` +addr=node-a:9000,node-b:9000,node-c:9000 +``` + +The client picks one, connects, and uses it until that connection breaks. When +it breaks, the client walks the rest of the list, classifies what it found at +each host, and either reconnects or surfaces a failure to your code. The exact +loop that drives this depends on whether you are ingesting (long-lived +background reconnect) or querying (per-request retry budget). Both loops share +the same primitives described here. + +## Host health model + +For every entry in `addr`, the client tracks two attributes: a **state** and a +**zone tier**. + +### State + +The state records what the client most recently observed when it tried that +host. + +| State | When the client moves a host here | +|---|---| +| `Healthy` | The last connect attempt succeeded. | +| `Unknown` | The host has not been tried in this round, or its classification was reset. | +| `TransientReject` | The server returned `421` with `X-QuestDB-Role: PRIMARY_CATCHUP` — it is a primary that is still catching up after promotion. Expected to recover. | +| `TransportError` | TCP/TLS handshake failed, an HTTP upgrade returned a transient error code, or an established connection broke mid-stream. | +| `TopologyReject` | The server returned `421` with a role that cannot satisfy the requested `target=` filter — for example, a `REPLICA` when you asked for `target=primary`. The host will not become writable without a topology change. | + +A lower state in the table above is preferred when the client picks the next +host to try. + +### Zone tier + +Each host is also classified relative to the client's configured `zone=`: + +| Zone tier | Meaning | +|---|---| +| `Same` | Server's advertised zone matches the client's `zone=` (case-insensitive), or `zone=` is unset, or `target=primary`. | +| `Unknown` | Server has not advertised a zone yet. | +| `Other` | Server advertised a different zone. | + +Zone information is advertised by the server on a successful upgrade and +(starting in QWP v2) on `421` rejects. The client remembers it for the lifetime +of the connection. + +`target=primary` collapses every host's zone tier to `Same` — writers must +follow the primary regardless of geography. Ingress is currently zone-blind in +both storage modes, so the `zone=` key is silently accepted on ingress +connections and only takes effect on egress. + +### Selection priority + +When the client needs to pick the next host, it sorts by the tuple `(state, +zone_tier)` lexicographically — state first, zone second. So a known-good host +in another zone wins against an untried local host. Within a tied bucket, the +order in your `addr=` list is preserved verbatim. + +The client does **not** shuffle, randomise, or load-balance across peers. +Cluster-level load balancing is the responsibility of QuestDB's server-side +coordinators. If you need a different first-pick distribution across many +simultaneously-starting clients, rotate the connect string at deployment time. + +## Sticky-Healthy across rounds + +Once the client lands on a `Healthy` host, that host stays the priority pick on +the next round of failover — provided its zone tier is still `Same`. This +avoids unnecessary churn after a short blip: a momentary network glitch +doesn't promote a different node into the active slot just because it +happened to be probed first. + +`Healthy` hosts in another zone are reset to `Unknown` between outages rather +than kept sticky. Otherwise a once-healthy cross-zone host would lock the +client out of probing local hosts after they recover. + +## Role filter (`target=`) + +The `target=` key controls which server role the client is willing to bind to: + +| `target=` | STANDALONE | PRIMARY | REPLICA | PRIMARY_CATCHUP | +|---|---|---|---|---| +| `any` (default) | accept | accept | accept | accept (transient) | +| `primary` | accept | accept | reject (topology) | accept (transient) | +| `replica` | reject (topology) | reject (topology) | accept | reject (topology) | + +`PRIMARY_CATCHUP` is a primary that has been promoted but has not yet caught +up to its predecessor's WAL — the client treats it as transient and retries +the same host (with a fresh round, no exponential backoff) until it either +becomes a full `PRIMARY` or the outage budget expires. + +A `421 Misdirected Request` response **without** an `X-QuestDB-Role` header +is treated as a generic transport error, not a role reject — the client walks +to the next host but does not pin the rejecting host as topology-unreachable. + +`target=replica` is intended for read-side workloads that explicitly want to +spread query load across read-only peers (see the egress flow below). + +## Two failover contexts + +Failover applies to both directions of QWP traffic, but the two contexts have +very different goals. + +### Ingress (writes) + +The ingress reconnect loop sits inside the store-and-forward I/O thread. It +runs continuously in the background, retrying through outages while the +producer keeps appending to the local buffer. The defaults are tuned for +throughput-oriented workloads that can tolerate minutes of server unavailability: + +- Initial backoff: `100 ms` +- Maximum backoff: `5 s` +- Per-outage budget: `5 minutes` (`reconnect_max_duration_millis`) +- Jitter: **equal-jitter** `[base, 2·base)` — non-zero lower bound damps + reconnect storms when many producers share a cluster +- Inter-host pause within a round: **none** — the client walks the full + address list as fast as `auth_timeout_ms` allows, paying one backoff + sleep at round exhaustion + +See the [store-and-forward concepts](/docs/high-availability/store-and-forward/concepts/) +page for how the reconnect loop interacts with the disk-backed segment ring. + +### Egress (queries) + +The egress failover loop wraps each `Execute()` call on the read-side query +client. It is interactive: a slow failover is worse than a clear error, so +the budget is short: + +- Initial backoff: `50 ms` +- Maximum backoff: `1 s` +- Total wall-clock budget: `30 s` (`failover_max_duration_ms`) +- Attempt cap: `8` (`failover_max_attempts`) +- Jitter: **full-jitter** `[0, base)` — a single-user query benefits from the + lowest expected recovery time, and one client per workload removes the + thundering-herd concern + +The egress loop also respects the `target=` role filter and prefers same-zone +hosts when `zone=` is set. + +## Error classification + +Every error the client encounters falls into one of three buckets, which drives +the loop's response: + +### Terminal — bypass failover + +The client surfaces the error to your code immediately. Retrying every host +will not help. + +| Condition | Why terminal | +|---|---| +| HTTP `401` / `403` on upgrade | Credentials are cluster-wide; retrying floods server logs without recovery. | +| Server-status reject (SF) | Application-layer reject; replay reproduces the same response. | + +### Topology — handled inside the round + +The host is demoted in the priority lattice; the client walks to the next host +within the same round. No exponential backoff is consumed. + +- `421` + `X-QuestDB-Role: PRIMARY_CATCHUP` → `TransientReject` +- `421` + any other recognised role → `TopologyReject` +- `SERVER_INFO.Role` does not match the requested `target=` + +If every host in a round role-rejects, ingress pays one fixed backoff sleep +(reset to `InitialBackoff`, no doubling) and starts a fresh round; egress +fails the current `Execute()` call. + +### Transient — enter backoff + +Everything else: TCP/TLS errors, `auth_timeout_ms` expiry, mid-stream send or +receive failures, `404` / `426` / `503` on upgrade, version mismatches +(per-endpoint — a rolling upgrade in flight does not lock out compatible +peers), and generic frame-decode errors. The client records `TransportError` +and walks to the next host. + +When a round exhausts with transient errors, the client sleeps for the +backoff interval (clamped to the remaining outage budget) and starts the +next round. + +## Mid-stream demotion + +If a connection breaks mid-stream — for example, the receive pump throws after +a successful upgrade — the client marks the failed host as `TransportError` +**before** picking the next host. Without this ordering, the sticky-Healthy +rule would re-pick the same just-failed host as the priority candidate, and +the next attempt would target the broken node again. + +This invariant only matters when you are reading client source code or +debugging a custom implementation. As a user, you observe it as "failover +moves off a broken node on the very next attempt, with no exponential delay +when at least one peer is healthy." + +## Authentication is cluster-wide + +A `401` or `403` on the HTTP upgrade is terminal — the client does not retry +other hosts. The assumption is that auth credentials are configured +identically across the cluster, so a credential failure against one node is +a credential failure against all of them. Retrying would spam every peer's +audit log without recovering. + +If your deployment has per-host credentials, that is unsupported and outside +the failover model — split the workload into one connect string per credential. + +## Next steps + +- [Configuration](/docs/high-availability/client-failover/configuration/) — + the connect-string keys and worked examples for each context. +- [Store-and-forward concepts](/docs/high-availability/store-and-forward/concepts/) — + how the ingress failover loop interacts with the disk-backed substrate. diff --git a/documentation/high-availability/client-failover/configuration.md b/documentation/high-availability/client-failover/configuration.md new file mode 100644 index 000000000..6da5a703f --- /dev/null +++ b/documentation/high-availability/client-failover/configuration.md @@ -0,0 +1,162 @@ +--- +title: Client failover configuration +sidebar_label: Configuration +description: + Connect-string keys that configure multi-host failover for QuestDB clients, + including addr lists, zone preference, role filtering, and the ingress and + egress retry budgets. +--- + +:::note Java-only today + +Client-side failover support is currently available in the Java client. +Additional language clients are on the roadmap. + +::: + +This page is the configuration reference for client failover. For the model +behind these keys — host-health states, zone tiers, role filtering, and the +two retry loops — read [Concepts](/docs/high-availability/client-failover/concepts/) +first. + +## Common keys + +These keys apply to every WS / WSS / HTTP / HTTPS client. They are documented +in full on the +[connect-string reference](/docs/client-configuration/connect-string#failover-keys); +the table below summarises the failover-relevant subset. + +| Key | Type | Default | Notes | +|---|---|---|---| +| `addr` | `host:port[,host:port…]` | required | Comma-separated peer list. The two syntactic forms (`addr=h1,h2` and repeated `addr=h1;addr=h2`) accumulate. Empty entries are rejected. | +| `zone` | string | unset | Client's zone identifier (opaque, case-insensitive — `eu-west-1a`, `dc-amsterdam`, etc.). Egress prefers same-zone peers when `target` is `any` or `replica`. Silently ignored on ingress. | +| `target` | `any` \| `primary` \| `replica` | `any` | Which server role the client accepts. See [Role filter](/docs/high-availability/client-failover/concepts/#role-filter-target) for the role table. | +| `auth_timeout_ms` | int (ms) | `15000` | Upper bound on the HTTP-upgrade response read per host. Does **not** cover the TCP connect or TLS handshake — those use the OS default. Set lower if you have well-known network paths and want faster failover; set higher only if upgrade is genuinely slow. | + +`addr` syntax — both of these are equivalent and produce the same three-peer +list: + +``` +addr=node-a:9000,node-b:9000,node-c:9000 +addr=node-a:9000;addr=node-b:9000;addr=node-c:9000 +``` + +## Ingress (write) + +The ingress reconnect loop is driven by store-and-forward connect-string +keys. See +[Store-and-forward configuration](/docs/high-availability/store-and-forward/configuration/#reconnect-keys) +and the +[connect-string reference](/docs/client-configuration/connect-string#sf-keys) +for the full list. The failover-relevant keys are: + +| Key | Type | Default | Notes | +|---|---|---|---| +| `reconnect_max_duration_millis` | int (ms) | `300000` (5 min) | Per-outage wall-clock budget. Resets on every successful reconnect. Size this to span your largest expected failover window, but short enough to surface permanent topology issues. | +| `reconnect_initial_backoff_millis` | int (ms) | `100` | Starting backoff sleep at round exhaustion. Doubles up to `reconnect_max_backoff_millis`. | +| `reconnect_max_backoff_millis` | int (ms) | `5000` | Cap on the exponential backoff. With equal-jitter, the actual sleep lands in `[max, 2·max)` once the base saturates. | +| `initial_connect_retry` | `off` \| `on` \| `async` | `off` | Whether to apply the same retry loop to the very first connect attempt. See below. | + +### `initial_connect_retry` + +By default, the first connect failure is **terminal** — typically the first +attempt failing means a misconfiguration (wrong host, wrong port, no +network), and retrying for five minutes only hides it. + +| Value | Behaviour | +|---|---| +| `off` (default; alias `false`) | First-connect failure is terminal. The producer's call to build the sender throws immediately. | +| `on` (aliases `sync`, `true`) | First-connect failures enter the same reconnect loop as mid-stream failures. The constructor blocks until success or the per-outage budget expires. | +| `async` | The constructor returns immediately; the background I/O thread drives the reconnect loop. The producer experiences backpressure if it tries to publish before the connection comes up. Intended for unattended producers where the SF directory may already carry segments from a prior process and the server may come up later. | + +## Egress (query) + +The egress failover loop wraps each `Execute()` call on the read-side query +client. The full key list lives on the +[connect-string reference](/docs/client-configuration/connect-string#egress-flow); +the user-visible knobs are: + +| Key | Type | Default | Notes | +|---|---|---|---| +| `failover` | `on` \| `off` | `on` | Global on/off. With `failover=off`, a single failed `Execute()` call surfaces the underlying error without walking the address list. | +| `failover_max_attempts` | int | `8` | Hard cap on attempts within a single `Execute()` call. | +| `failover_max_duration_ms` | int (ms) | `30000` | Wall-clock budget for failover eligibility. Bounds **when failover stops**, not the wall-clock of `Execute()` itself — a final `WalkTracker` round can still cost up to `hostCount × auth_timeout_ms` after the budget expires. | +| `failover_backoff_initial_ms` | int (ms) | `50` | Starting backoff sleep. Doubles up to the cap. | +| `failover_backoff_max_ms` | int (ms) | `1000` | Cap on the exponential backoff. With full-jitter, the actual sleep lands in `[0, max)`. | + +## Worked examples + +### Three-node Enterprise cluster, default failover + +Most users need only the `addr` list — defaults cover the rest. + +```java +try (Sender sender = Sender.fromConfig( + "ws::addr=node-a:9000,node-b:9000,node-c:9000;sf_dir=/var/lib/qdb-sender;")) { + sender.table("events") + .symbol("source", "edge-42") + .longColumn("count", 1) + .atNow(); +} +``` + +The `ws::` scheme picks the QWP WebSocket transport. `sf_dir` enables the +disk-backed store-and-forward substrate, which keeps unacked data across +sender restarts; see +[Store-and-forward concepts](/docs/high-availability/store-and-forward/concepts/). + +### Zone-aware read replicas + +For read-only queries spread across same-zone replicas, with a primary as +final fallback: + +```java +try (QueryClient client = QueryClient.fromConfig( + "ws::addr=replica-eu-1a:9000,replica-eu-1b:9000,primary:9000;" + + "zone=eu-west-1a;target=any;")) { + try (ResultSet rs = client.execute("SELECT * FROM trades WHERE ts > now() - 1h")) { + // ... + } +} +``` + +Setting `target=replica` would skip the primary entirely; `target=any` is +usually preferable so the query still completes after a replica outage. + +### Long-tolerated ingest with async first connect + +Useful for unattended ingest processes (edge sensors, ETL jobs) that may +restart before the server comes up: + +```java +try (Sender sender = Sender.fromConfig( + "ws::addr=primary:9000;sf_dir=/var/lib/qdb-sender;" + + "initial_connect_retry=async;" + + "reconnect_max_duration_millis=1800000;")) { + // appendBlocking() will absorb up to sf_max_total_bytes of writes + // while the I/O thread retries the initial connect. +} +``` + +The 30-minute reconnect budget gives a wide failover window; the `async` +initial-connect policy lets the producer thread proceed immediately. + +### Tight egress failover for an interactive dashboard + +```java +try (QueryClient client = QueryClient.fromConfig( + "ws::addr=node-a:9000,node-b:9000;" + + "failover_max_duration_ms=5000;failover_max_attempts=3;")) { + // Surfaces an error within a few seconds if the cluster is unreachable. +} +``` + +## Where each key is documented + +| Key | Concept | Reference | +|---|---|---| +| `addr`, `zone`, `target`, `auth_timeout_ms` | Host selection, role filter | [connect-string #failover-keys](/docs/client-configuration/connect-string#failover-keys) | +| `reconnect_*`, `initial_connect_retry` | Ingress retry budget | [connect-string #reconnect-keys](/docs/client-configuration/connect-string#reconnect-keys) | +| `failover`, `failover_*` | Egress retry budget | [connect-string #egress-flow](/docs/client-configuration/connect-string#egress-flow) | +| `username` / `password` / `token` | Authentication | [connect-string #auth](/docs/client-configuration/connect-string#auth) | +| `tls_*` | TLS configuration | [connect-string #tls](/docs/client-configuration/connect-string#tls) | diff --git a/documentation/high-availability/store-and-forward/concepts.md b/documentation/high-availability/store-and-forward/concepts.md new file mode 100644 index 000000000..7f070d33d --- /dev/null +++ b/documentation/high-availability/store-and-forward/concepts.md @@ -0,0 +1,297 @@ +--- +title: Store-and-forward concepts +sidebar_label: Concepts +description: + How the QuestDB store-and-forward client substrate decouples the producer + from the wire, masks network outages and server restarts, and replays + unacknowledged frames against a fresh connection. +--- + +:::note Java-only today + +Client-side store-and-forward support is currently available in the Java +client. Additional language clients are on the roadmap. + +::: + +Store-and-forward (SF) is the client-side substrate that sits between your +application code and the QWP wire transport. It absorbs publishes into a +local ring of fixed-size segments, drains them over a WebSocket connection +on a dedicated I/O thread, and replays any unacknowledged frames after a +disconnect or restart. + +The goal is **producer-never-blocks-on-the-wire**. Your call to `flush()` +returns as soon as data is published into the substrate. Acknowledgements +arrive asynchronously. A network outage, a server restart, even a JVM +crash leaves your producer code unaffected — the I/O thread quietly +reconnects and replays what remains. + +## Two modes + +SF runs in either of two modes selected by the connect string: + +| Aspect | Memory mode | SF mode | +|---|---|---| +| Trigger | `sf_dir` is **unset** | `sf_dir` is set | +| Storage | malloc'd ring in process RAM | mmap'd files under `//` | +| Default capacity | `128 MiB` | `10 GiB` | +| Survives JVM exit | No | Yes | +| Survives JVM crash | No | Yes — replay on next start | +| Tolerates transient network blips | Yes | Yes | +| Tolerates multi-minute server outages | Bounded by RAM cap | Bounded by disk cap | +| Recovers another sender's stale slot | n/a | Opt-in via `drain_orphans=on` | + +Both modes share the same reconnect loop, the same backoff and retry +budgets, and the same on-the-wire behaviour. The only difference is +where unacked data lives. + +## What "frame" means here + +A **frame** is one encoded QWP message — typically a batch of rows for one +or more tables. The SF substrate treats frames as opaque payloads with two +properties: a length, and a CRC32C checksum. The append protocol writes the +payload first, the checksum last, and a partial write left behind by a +crash is detected and discarded by the recovery scanner on next start. + +Frames in SF mode are **self-sufficient**: every frame carries the full +schema for every table it touches and the full symbol-dictionary delta +from id 0. That makes a frame replayable against any server connection, +weeks or months later, even after a process restart that wiped all +in-memory schema state. The cost is a small per-batch overhead which is +accepted for correctness. + +## The FSN model + +Two distinct counters track frame identity: + +- **FSN** (frame-sequence-number) — a monotonic counter assigned when a + frame is appended to the substrate. FSN survives reconnects and (in SF + mode) restarts. It is the substrate's permanent identifier for a frame. +- **wireSeq** — the per-connection counter the server uses for + deduplication, reset to `0` on every successful WebSocket upgrade. + +On every (re)connect the relationship is pinned: + +``` +fsn = fsnAtZero + wireSeq +``` + +where `fsnAtZero` is `ackedFsn + 1` (i.e. the next un-acked FSN). The +client streams frames from disk to the wire in strict FSN order, one frame +per WebSocket binary message, incrementing `wireSeq`. The server echoes +back the same `wireSeq` in its OK frames, and the client maps that back to +the original FSN to advance the trim watermark. + +Two consequences: + +- Frames **must** be sent in strict order. The wire format does not + serialise `wireSeq` — the server assigns it implicitly from receive + order. Reordering breaks the FSN mapping. +- After a reconnect, the server sees the **same payloads** at new + `wireSeq` values. Server-side dedup keys off `messageSequence` inside + the payload, not `wireSeq`, so replay does not produce double-writes. + +## Trim: how unacked data is reclaimed + +The substrate holds frames until the server confirms it has received and +processed them. Each confirmation advances the **acked FSN**, which +allows the manager thread to unlink sealed segment files (in SF mode) or +release ring memory (in memory mode) up to that watermark. + +Two trim drivers exist: + +### Default — OK-driven trim + +Each successful batch produces an **OK frame** carrying the highest +`wireSeq` it acknowledges and the per-table `seqTxn` watermarks that +batch updated. On receipt: + +1. The substrate translates `wireSeq` back to FSN. +2. `ackedFsn` advances to the new value. +3. Any segment whose last FSN is `≤ ackedFsn` is unlinked and its bytes + returned to the available pool. + +This is the default and is sufficient when "data is in the server's WAL" +is the durability bar you need. + +### `request_durable_ack=on` — WAL-durable trim + +When the connect string sets `request_durable_ack=on`, trim is driven by +a separate frame: `STATUS_DURABLE_ACK`. These carry per-table watermarks +for data the server has **already uploaded from the WAL to the configured +object store** (S3, Azure Blob, GCS, or NFS). + +- OK frames still arrive on every batch, but they no longer advance the + trim watermark. Instead, they are stashed alongside their per-table + `seqTxn` values. +- A `STATUS_DURABLE_ACK` frame names tables and their durable `seqTxn` + watermarks. The client matches the head of the OK queue against these + watermarks; each fully-covered head entry pops, and `ackedFsn` + advances to the highest covered wireSeq. +- The client opt-in is mandatory — the connect fails loudly if the server + does not echo `X-QWP-Durable-Ack: enabled` on the upgrade response. + This avoids the silent failure mode where the producer waits forever + for ack frames that will never arrive. + +Durable-ack mode is the right choice when "data is in the object store" +is the durability bar, but it has two costs: a longer time-to-trim (so +larger steady-state disk usage in SF mode), and a small WebSocket PING +sent every `durable_ack_keepalive_interval_millis` to nudge the server's +flush path when the client is idle but has pending confirmations. + +See [When to use](/docs/high-availability/store-and-forward/when-to-use/) +for the decision. + +## Reconnect and replay + +When the wire connection breaks — for any reason — the I/O thread enters +the reconnect loop documented in +[Client failover concepts](/docs/high-availability/client-failover/concepts/). +The producer is **not notified**: it keeps publishing into the substrate, +bounded by `sf_max_total_bytes` (see backpressure below). + +On every successful (re)connect: + +1. `fsnAtZero = ackedFsn + 1`. +2. `wireSeq` resets to `0`. +3. The read cursor rewinds to the first un-acked frame on disk (or in + memory). +4. Frames stream to the wire in FSN order. The server's dedup window + absorbs any frames that landed before the disconnect. +5. New frames appended by the producer during replay are picked up + automatically — the I/O loop watches a volatile `publishedFsn` + cursor. + +Frames sent before the disconnect and re-sent after a reconnect count +in the `getTotalFramesReplayed` observability counter. + +## Backpressure + +The substrate enforces `sf_max_total_bytes` as a hard cap on resident +storage. When the cap is hit, the producer's `appendBlocking` call +busy-spins (with cooperative yield) up to `sf_append_deadline_millis` +waiting for ACK-driven trim to free space. If the deadline fires, the +call throws a typed exception. + +The exception message distinguishes the two scenarios: + +- **Backpressure while the wire is publishing** — the server is acking + but the producer is faster than the server can absorb. Solutions: + raise `sf_max_total_bytes`, slow the producer, or scale the server. +- **Backpressure while reconnecting** — the I/O loop is in the retry + loop and the substrate is filling. The message includes attempt count + and outage start time. Solutions: address the cluster outage, raise + `sf_max_total_bytes`, or accept that the producer will start throwing + once the cap is exhausted. + +## Close and shutdown + +`close()` waits up to `close_flush_timeout_millis` (default 5 s) for +`ackedFsn` to reach `publishedFsn` — i.e. for the server to acknowledge +everything the producer has handed in. If the wait succeeds, all data is +acked. If the timeout fires, a `WARN` is logged and: + +- in **SF mode**, the un-acked tail is left on disk and recovered by the + next sender on the same slot; +- in **memory mode**, the un-acked tail is lost. + +Setting `close_flush_timeout_millis=0` (or `-1`) skips the drain wait +entirely — useful for fast shutdown paths where you do not want to block. +Even in this branch, the slot lock is released and segments are unmapped +cleanly, and a non-blocking safety-net check rethrows any latched +terminal error that has not already been delivered through an async +handler or a synchronous producer call. + +## Crash recovery (SF mode) + +When the engine opens an SF-mode sender, it scans the slot directory: + +1. **Acquire the slot lock.** Two senders pointing at the same + `//` will collide here and the second one fails to + start, naming the holder's PID in the error message. +2. **Validate every segment file.** Headers are checked, frames are walked + forward verifying each CRC. The first invalid or torn frame becomes + the file's end-of-data; anything past it is discarded. +3. **Reconcile gaps.** Segments are sorted by their `baseSeq` and adjacent + pairs must satisfy `prev.baseSeq + prev.frameCount == curr.baseSeq`. + A gap is a fatal recovery error — the engine refuses to start. +4. **Seed the ack watermark.** Either from `.ack-watermark` (if your + client maintains it; see below) or from the lowest surviving FSN minus + one. +5. **Bump the connection generation** so the I/O loop, on first connect, + replays from disk against a fresh wireSeq window. + +After recovery the producer publishes new frames as normal; the I/O +thread replays the un-acked tail and then drains forward. + +### `.ack-watermark` + +An optional 16-byte file under the slot directory persists the cumulative +durable-ack FSN across process restarts. Without it, recovery seeds the +ack watermark from the lowest surviving segment's `baseSeq - 1` — which +guarantees no data loss, but cannot distinguish which frames inside that +lowest segment the previous sender had already received durable acks +for. Replay therefore re-sends every frame in that segment, producing +row-level duplicates against a still-alive server unless deduplication is +enabled on the target table. + +With `.ack-watermark`, recovery clamps the seed to the higher of the +on-disk and watermarked values, so already-durable-acked frames inside +the lowest surviving segment are not re-replayed. + +The file is **optional** — a conformant client may choose not to maintain +it. The Java reference client does. + +## Orphan adoption + +When the foreground sender's connect string sets `drain_orphans=on`, the +engine scans `/*` at startup for **sibling slot directories** — +other `sender_id`s under the same group root that contain unacked data +and are not marked `.failed`. For each one, up to +`max_background_drainers` at a time, a background drainer spawns, +acquires the orphan slot's lock (skipping if another process holds it), +opens a separate WebSocket connection, runs the same recovery + replay +flow, and exits when the orphan is fully drained. + +This is the rescue path for a sender that died without draining cleanly +— a JVM crash, an OOM kill, a host reboot. The replacement process picks +the orphan's slot lock and clears its disk footprint. Without +`drain_orphans=on` the dead sender's data persists on disk indefinitely +until an operator intervenes. + +The orphan flow is opt-in because in a multi-tenant deployment with +shared `sf_dir`, blindly draining unknown slots may be surprising. + +## Error frames + +Not every server response is an OK. Server errors fall into six +categories, each with a default policy: + +| Category | Default | Meaning | +|---|---|---| +| `SCHEMA_MISMATCH` | `DROP_AND_CONTINUE` | The batch's schema doesn't match the server. Replay won't help — the substrate logs and advances trim past the rejected span. | +| `WRITE_ERROR` | `DROP_AND_CONTINUE` | Per-batch write failure (e.g. table is not currently accepting writes). | +| `PARSE_ERROR` | `HALT` | Almost certainly a client bug. The substrate preserves on-disk frames for postmortem. | +| `INTERNAL_ERROR` | `HALT` | Catch-all server fault. | +| `SECURITY_ERROR` | `HALT` | Cluster-wide auth / authorization failure. | +| `PROTOCOL_VIOLATION` | `HALT` (forced) | Connection is gone after a terminal WebSocket close code; no choice. | + +Errors are also delivered to an **error inbox** — a bounded queue +consumed by a daemon dispatcher that invokes your registered handler. +Overflow drops the oldest entry rather than the newest (watermarks are +monotonic; the latest entry is the most informative). The default +handler logs every received error: silence is forbidden by the contract, +because a buggy or no-op handler would hide data loss +indistinguishably from a healthy connection. + +## Next steps + +- [When to use](/docs/high-availability/store-and-forward/when-to-use/) — + decision guide for memory vs SF mode, and when to opt into + durable-ack and orphan adoption. +- [Operating and tuning](/docs/high-availability/store-and-forward/operating-and-tuning/) — + slot directory layout, lock semantics, sizing, observability. +- [Configuration](/docs/high-availability/store-and-forward/configuration/) — + connect-string key reference. +- [Client failover concepts](/docs/high-availability/client-failover/concepts/) — + how the reconnect loop selects hosts and classifies errors. diff --git a/documentation/high-availability/store-and-forward/configuration.md b/documentation/high-availability/store-and-forward/configuration.md new file mode 100644 index 000000000..db13ff71b --- /dev/null +++ b/documentation/high-availability/store-and-forward/configuration.md @@ -0,0 +1,205 @@ +--- +title: Store-and-forward configuration +sidebar_label: Configuration +description: + Connect-string keys that configure the QuestDB store-and-forward client + substrate — storage, reconnect, durable-ack, and error-handling. +--- + +:::note Java-only today + +Client-side store-and-forward support is currently available in the Java +client. Additional language clients are on the roadmap. + +::: + +This page is the configuration reference for the SF connect-string keys. +For the model behind each knob, read +[Concepts](/docs/high-availability/store-and-forward/concepts/); for +operational guidance read +[Operating and tuning](/docs/high-availability/store-and-forward/operating-and-tuning/). + +Shared keys (authentication, TLS, address list) are documented on the +[connect-string reference](/docs/client-configuration/connect-string). +The keys below are the SF-specific subset. + +## Storage keys + +These keys select between memory mode and SF mode and govern on-disk +layout. The single switch is `sf_dir`: unset → memory mode, set → SF +mode. + +| Key | Type | Default | Description | +|---|---|---|---| +| `sf_dir` | path | unset | Group root directory. When set, the slot lives at `//` and unacked data is durable across process restarts. When unset, the substrate runs in memory mode. | +| `sender_id` | string | `default` | Slot subdirectory name. Two senders sharing the same `sender_id` and `sf_dir` will collide on the slot lock. Must not contain path separators or be empty. | +| `sf_max_bytes` | size | `4M` | Per-segment file size; rotation threshold. | +| `sf_max_total_bytes` | size | `128M` (memory) / `10G` (SF) | Hard cap on resident SF storage. Triggers producer backpressure when full. | +| `sf_durability` | enum | `memory` | Reserved for future per-batch / per-frame fsync modes. Only `memory` is currently implemented; `flush` and `append` parse but are rejected at build time. | +| `sf_append_deadline_millis` | int (ms) | `30000` | How long a producer `appendBlocking` call waits for ACK-driven trim to free space before throwing. | +| `drain_orphans` | bool | `off` | Scan `/*` at startup and spawn drainers for sibling slots that contain unacked data. See [orphan adoption](/docs/high-availability/store-and-forward/concepts/#orphan-adoption). | +| `max_background_drainers` | int | `4` | Cap on concurrent orphan drainers. | + +Size values accept integer bytes or unit suffixes (`K`, `M`, `G`, `T`) +using binary multipliers. + +These keys are also documented on the central +[connect-string reference](/docs/client-configuration/connect-string#sf-keys). + +## Reconnect keys + +Govern the in-flight reconnect loop after the wire breaks. Backoff math +and host-walk semantics are documented in +[Client failover concepts](/docs/high-availability/client-failover/concepts/). + +| Key | Type | Default | Description | +|---|---|---|---| +| `reconnect_max_duration_millis` | int (ms) | `300000` (5 min) | Per-outage wall-clock budget. Resets on every successful reconnect. | +| `reconnect_initial_backoff_millis` | int (ms) | `100` | Initial backoff sleep at round exhaustion. | +| `reconnect_max_backoff_millis` | int (ms) | `5000` | Cap on the exponential backoff. With equal-jitter the actual sleep lands in `[max, 2·max)`. | +| `initial_connect_retry` | enum | `off` | `off` (alias `false`): first-connect failure is terminal. `on` (aliases `sync`, `true`): same retry loop as reconnect, blocking the constructor. `async`: same retry loop in the I/O thread, non-blocking. | +| `close_flush_timeout_millis` | int (ms) | `5000` | `close()` blocks up to this long waiting for `ackedFsn ≥ publishedFsn`. `0` or `-1` skips the drain wait. The safety-net `checkError()` still runs. | + +Cross-reference: +[connect-string #reconnect-keys](/docs/client-configuration/connect-string#reconnect-keys). + +## Durable-ack keys + +Opt in to object-store-durable trim. See +[Durable-ack: when to opt in](/docs/high-availability/store-and-forward/when-to-use/#durable-ack-when-to-opt-in). + +| Key | Type | Default | Description | +|---|---|---|---| +| `request_durable_ack` | bool | `off` | Opt-in via the upgrade header `X-QWP-Request-Durable-Ack: true`. Trim is then driven by `STATUS_DURABLE_ACK` frames only; OK frames no longer advance the trim watermark. Connect fails loudly if the server does not echo `X-QWP-Durable-Ack: enabled`. WebSocket transports only. | +| `durable_ack_keepalive_interval_millis` | int (ms) | `200` | Cadence of WebSocket PING the I/O loop sends while there are pending durable confirmations and the producer is idle. `0` or negative disables. | + +## Error-handling keys + +| Key | Type | Default | Description | +|---|---|---|---| +| `error_inbox_capacity` | int (≥16) | `256` | Bounded SPSC queue capacity for async error notifications. Overflow drops the oldest entry and increments `getDroppedErrorNotifications`. | +| `on_server_error`, `on_schema_error`, `on_parse_error`, `on_internal_error`, `on_security_error`, `on_write_error` | enum | per category | Override the default policy (`HALT` or `DROP_AND_CONTINUE`) for a category. Reserved in the spec but not yet recognised by the Java connect-string parser — use the fluent `LineSenderBuilder` API today. | + +The per-category defaults are documented in +[Concepts § Error frames](/docs/high-availability/store-and-forward/concepts/#error-frames). +`PROTOCOL_VIOLATION` and `UNKNOWN` are forced `HALT` and not user-overridable. + +## Other relevant keys + +These keys are not SF-specific but affect SF behaviour. See the +[connect-string reference](/docs/client-configuration/connect-string) for the +canonical entries. + +| Key | Type | Default | Description | +|---|---|---|---| +| `addr` | `host[:port][,host[:port]…]` | required | Multi-host failover list. See [Client failover configuration](/docs/high-availability/client-failover/configuration/). | +| `username` / `password` | string | unset | HTTP Basic auth on the upgrade request. | +| `token` | string | unset | Bearer token on the upgrade request. | +| `tls_verify` | enum | `on` | `on` or `unsafe_off`. Applies to `wss::` / TLS connections. | +| `tls_roots` | path | system trust | Custom CA trust store. | +| `tls_roots_password` | string | unset | Trust store password. | +| `auto_flush` | bool | `on` | Global on/off for auto-flush triggers. | +| `auto_flush_rows` | int / `off` | `1000` | Row-count flush trigger. | +| `auto_flush_bytes` | int / `off` | `0` (off) | Byte-size flush trigger. | +| `auto_flush_interval` | int (ms) / `off` | `100` | Time-since-first-row flush trigger. | +| `init_buf_size` | size | `64K` | Initial encode buffer capacity. | +| `max_buf_size` | size | `100M` | Max encode buffer capacity. | +| `max_name_len` | int | `127` | Local validation cap for table / column names. | +| `max_schemas_per_connection` | int | `65535` | Per-connection schema-id ceiling. | + +## Validation + +The parser rejects: + +- Unknown keys (forward compatibility is via the spec, not silent + acceptance). +- `sf_durability` values other than `memory`, `flush`, `append`. `flush` + and `append` parse but are rejected at build time today. +- `sender_id` containing path separators or empty. +- `request_durable_ack=on` on non-WebSocket transports. + +## Worked examples + +### Single-node memory-mode producer + +```java +try (Sender sender = Sender.fromConfig("ws::addr=localhost:9000;")) { + sender.table("events") + .stringColumn("source", "edge-42") + .longColumn("count", 1) + .atNow(); +} +``` + +No `sf_dir`, so memory mode. The default `128 MiB` cap absorbs short +network blips. A JVM crash loses the unacked tail. + +### Single-node durable producer + +```java +try (Sender sender = Sender.fromConfig( + "ws::addr=localhost:9000;sf_dir=/var/lib/qdb-sender;")) { + // ... +} +``` + +Same producer code; SF mode is enabled by the one extra key. Unacked +data persists at `/var/lib/qdb-sender/default/` across crashes. + +### Multi-host with object-store durability + +```java +try (Sender sender = Sender.fromConfig( + "wss::addr=node-a:9000,node-b:9000,node-c:9000;" + + "sf_dir=/var/lib/qdb-sender;sender_id=ingest-svc;" + + "request_durable_ack=on;" + + "username=ingest;password=…;")) { + // ... +} +``` + +`wss::` for TLS, three-host failover, durable-ack opt-in. Slot lives at +`/var/lib/qdb-sender/ingest-svc/`. The connect fails loudly if any peer +returns an upgrade without `X-QWP-Durable-Ack: enabled`. + +### Multi-tenant host with orphan rescue + +```java +try (Sender sender = Sender.fromConfig( + "ws::addr=node-a:9000;sf_dir=/var/lib/qdb-sender;" + + "sender_id=worker-" + workerInstanceId + ";" + + "drain_orphans=on;max_background_drainers=8;")) { + // ... +} +``` + +Each worker instance has a unique `sender_id`. When a worker crashes and +a new instance comes up under a different `sender_id`, the new +instance's foreground sender adopts the dead worker's slot in the +background and drains it. + +### Long-outage tolerance for unattended ingest + +```java +try (Sender sender = Sender.fromConfig( + "ws::addr=primary:9000;sf_dir=/var/lib/qdb-sender;" + + "sf_max_total_bytes=50G;" + + "reconnect_max_duration_millis=3600000;" + + "initial_connect_retry=async;")) { + // ... +} +``` + +50 GB of buffer space, a one-hour reconnect budget, async initial +connect so the constructor returns immediately even if the server is +down. Suitable for edge / IoT producers on unreliable links. + +## Where each key is documented + +| Group | Connect-string reference | +|---|---| +| Storage (`sf_dir`, `sender_id`, …) | [#sf-keys](/docs/client-configuration/connect-string#sf-keys) | +| Reconnect (`reconnect_*`, `initial_connect_retry`, `close_flush_timeout_millis`) | [#reconnect-keys](/docs/client-configuration/connect-string#reconnect-keys) | +| Failover (`addr`, `zone`, `target`, `auth_timeout_ms`) | [#failover-keys](/docs/client-configuration/connect-string#failover-keys) | +| Auth (`username`, `password`, `token`) | [#auth](/docs/client-configuration/connect-string#auth) | +| TLS (`tls_*`) | [#tls](/docs/client-configuration/connect-string#tls) | diff --git a/documentation/high-availability/store-and-forward/operating-and-tuning.md b/documentation/high-availability/store-and-forward/operating-and-tuning.md new file mode 100644 index 000000000..8a0d3e481 --- /dev/null +++ b/documentation/high-availability/store-and-forward/operating-and-tuning.md @@ -0,0 +1,309 @@ +--- +title: Operating and tuning store-and-forward +sidebar_label: Operating & tuning +description: + Operational guidance for QuestDB store-and-forward producers — slot + directory layout, locks, capacity sizing, recovery, backpressure, + observability, and orphan adoption. +--- + +:::note Java-only today + +Client-side store-and-forward support is currently available in the Java +client. Additional language clients are on the roadmap. + +::: + +This page is the operator-facing guide for SF in production: how to +provision the slot directory, what to watch, and how to tune the limits. +For the underlying model see +[Concepts](/docs/high-availability/store-and-forward/concepts/); for the +choice between memory mode and SF mode see +[When to use](/docs/high-availability/store-and-forward/when-to-use/). + +## Slot directory layout + +In SF mode every sender owns one **slot directory**: + +``` +// +├── .lock # advisory exclusive lock (kernel-released on process exit) +├── .lock.pid # UTF-8 text: holder PID + '\n' (diagnostic only) +├── .failed # optional drainer-failure sentinel (UTF-8 reason text) +├── .ack-watermark # optional 16-byte durable-ack high-water mark +├── sf-0000000000000001.sfa +├── sf-0000000000000002.sfa +└── ... +``` + +`` is the **group root** — the directory you point the connect +string at. `` is the slot subdirectory; it defaults to +`default` but should be set explicitly when more than one sender shares +the host. + +### `.lock` and `.lock.pid` + +The `.lock` file is held under an advisory exclusive lock for the engine's +lifetime — POSIX clients use `flock` / `fcntl`, Windows uses +`LockFileEx`. The lock is released automatically when the file descriptor +closes, including on hard process exit (kernel cleanup). + +A second sender pointing at the same slot directory will fail to start +with an error that names the holder's PID, read from `.lock.pid`. The +PID file is overwritten on every successful acquire; an absent or empty +`.lock.pid` reports `holder=unknown` rather than failing the lookup. + +Neither `.lock` nor `.lock.pid` is deleted on clean shutdown. Stale +files are harmless — the next acquirer silently overwrites them. + +**Cross-platform interop:** a POSIX client and a Windows client must +**not** share a slot on a network filesystem. Their lock primitives are +incompatible. + +### `.failed` + +Present iff a previous drainer attempt gave up on the slot — reconnect +budget exhausted, terminal auth failure, or irrecoverable corruption. +The file contents are a UTF-8 reason for human operators; the **presence** +is the signal that the orphan scanner uses to exclude the slot from +auto-drain on subsequent scans. + +**Operator action:** read the reason, fix the underlying cause (rotate +credentials, restore the missing peer, etc.), then delete `.failed`. The +next sender that scans `` will pick the slot up again. + +### Segment files + +Segments are named `sf-.sfa` where `` is a 16-character +zero-padded hexadecimal generation counter. The number reflects +allocation order, **not** the FSN range — that lives in the file header +and is read at recovery time. + +Pre-allocation reserves real disk blocks at file creation. On Linux this +is `posix_fallocate`; on macOS, `F_PREALLOCATE` / `F_ALLOCATEALL`. The +substrate refuses to fall back silently to `ftruncate` on filesystems +where these are unsupported — sparse files would risk a `SIGBUS` later +when the mmap'd region writes into a hole. On filesystems where the +native layer **must** fall back to `ftruncate`, size `sf_max_bytes` +conservatively against free space. + +## Lock collisions in practice + +Two `sender_id`s in the same `sf_dir` never collide — they are +independent slots. The same `sender_id` started twice **will** collide, +and the second start fails loudly. + +A common cause is a redeploy where the old process hasn't fully exited +when the new one comes up. Solutions: + +- Wait for the old process to release the lock (the kernel releases on + exit; `kill -9` is sufficient). +- Use a deployment unit that orders shutdown before startup. +- For containerised deployments, set `sender_id` from a per-pod stable + identity so two pods with the same template name don't collide. + +`drain_orphans=on` does **not** override the lock — a busy orphan slot +is skipped, not stolen. + +## Sizing capacity + +Two limits matter: + +### `sf_max_bytes` — per-segment file size (default `4 MiB`) + +This is the rotation threshold and the unit of trim. Segments that are +smaller release disk faster but waste more space on the active tail; +larger segments waste less on the active tail but hold acked frames in +the same file as the still-unacked tail until every frame in the segment +is acked. + +For most workloads `4 MiB` is fine. Raise it if you are appending very +large batches and pre-allocation cost matters; lower it if you observe +disk usage staying high under slow ack cadence. + +### `sf_max_total_bytes` — slot capacity (default `128 MiB` memory / `10 GiB` SF) + +This is the **hard cap** on resident SF storage — sealed segments plus +the active segment. When this fills, producer `appendBlocking` calls +block (with cooperative yield) for up to `sf_append_deadline_millis` +waiting for ACK-driven trim to free space; on timeout the call throws. + +Size this against your **worst expected outage** times your ingest +rate: + +``` +sf_max_total_bytes ≥ ingest_rate × max_tolerated_outage +``` + +A 5-minute reconnect budget at 10 MB/s of compressed frames implies at +least 3 GB. Add safety margin for trim latency — in particular, +`request_durable_ack=on` extends time-to-trim by the WAL→object-store +upload window. + +In memory mode the default `128 MiB` is deliberately small: it forces +you to think about backpressure rather than letting an outage silently +balloon process RSS. + +## Backpressure observability + +`appendBlocking` distinguishes two reasons it can stall: + +- **Wire-publishing backpressure.** The server is acking but the + producer is faster than ack throughput. The exception message names + this state. Solutions: scale the server, slow the producer, or raise + `sf_max_total_bytes`. +- **Reconnect backpressure.** The I/O loop is in the retry loop and the + substrate is filling. The exception message includes the attempt + count and outage start time. Solutions: address the cluster outage, + raise `sf_max_total_bytes`, or accept that the producer will start + throwing once the cap is exhausted. + +The `getTotalBackpressureStalls()` counter (see Observability below) +records every producer thread that hit the cap. + +## Recovery on restart + +When an SF-mode sender opens, it runs this sequence: + +1. Acquire `//.lock`. Fail loudly on contention. +2. Scan every `*.sfa` file: + - Validate magic, version, header. + - Walk frames forward verifying each CRC32C-Castagnoli. + - The first invalid frame becomes end-of-data; any non-zero bytes + past that point are logged as a torn-tail count. +3. Sort segments by `baseSeq` and verify no gaps. A gap is a fatal + recovery error. +4. Open `.ack-watermark` (if present) and read the cumulative + durable-ack FSN. Reject a watermark that exceeds the on-disk + ceiling — it would seed `ackedFsn` past every un-acked frame and + silently drop the un-acked tail. +5. Seed `ackedFsn = max(lowestBaseSeq - 1, watermark)`. +6. Allocate the next segment generation as `max(existing-gen) + 1`. +7. Bump the connection generation so the I/O loop replays from disk + against a fresh wireSeq window. + +A clean shutdown that drained everything is indistinguishable from a +fresh start: no segments, no replay. + +### Recovery failures + +| Symptom | Likely cause | Operator action | +|---|---|---| +| "Slot held by PID ``" | Two processes claiming the same `sender_id`. | Stop the duplicate. The lock releases on its exit. | +| "Gap between segments" | Corruption — a segment was deleted out of band. | Restore from backup or accept data loss; the substrate refuses to start. | +| "Watermark exceeds publishedFsn" | `.ack-watermark` is corrupt; the engine falls back to the no-watermark seed. | Logged as `WARN`. Replay will re-send the lowest segment's frames; rely on server deduplication. | +| Torn tail count > 0 | The previous process crashed mid-frame-write. | Informational; the CRC + zero-fill design discards the partial frame. | + +## Close and shutdown + +`close()` semantics depend on `close_flush_timeout_millis`: + +| Value | Behaviour | +|---|---| +| `5000` (default) | Block up to 5 s waiting for `ackedFsn ≥ publishedFsn`. Log `WARN` on timeout; un-acked tail stays on disk (SF) or is lost (memory). | +| `0` or `-1` | Skip the drain wait. Pending data persists on disk (SF) for the next sender, or is lost (memory). | +| any other positive value | That timeout in milliseconds. | + +In every branch `close()`: + +- Performs a non-blocking safety-net check that rethrows any latched + terminal error not already delivered through an async handler or a + synchronous producer call. +- Releases the slot lock and unmaps segment files. + +The safety-net check is what makes "close-and-forget" callers safe: if +the only API your code uses is `close()`, terminal errors still surface +rather than silently sinking into a no-op handler. + +## Orphan adoption in operations + +With `drain_orphans=on`, the foreground sender — after acquiring its own +lock — scans `/*` for siblings that: + +- are not its own `sender_id`, +- contain at least one `*.sfa` file, +- do not have a `.failed` sentinel. + +Up to `max_background_drainers` drainers run concurrently. Each drainer +opens its own engine and WebSocket connection, runs recovery + replay, +and exits when the orphan's `ackedFsn ≥ publishedFsn`. + +### Drainer failure modes + +- **Reconnect budget exhausted.** Drainer writes `.failed` with reason, + releases the lock, exits. +- **Auth-terminal upgrade error.** Same. +- **Irrecoverable corruption.** Same. + +`.failed` slots are excluded from auto-drain on subsequent scans — +operator action is required to clear the sentinel. + +### Observing drainers + +- `getActiveBackgroundDrainers()` — count of currently-running drainers + (best-effort: a just-finished drainer may still count for a few ms). +- `getTotalBackgroundDrainersSucceeded()` / `…Failed()` — cumulative + outcomes since process start. +- The `BackgroundDrainerListener` callback delivers per-drainer + events (progress watermark, durable-ack-mismatch escalation, terminal + outcome) for richer dashboards. +- On-disk `.failed` sentinels are the canonical record of giveup + events surviving sender restart. + +## Observability counters + +A conformant client exposes at minimum: + +| Counter | What it tells you | +|---|---| +| `getTotalReconnectAttempts()` | How often the wire has broken across the sender's lifetime. | +| `getTotalReconnectsSucceeded()` | How many of those recovered. | +| `getTotalFramesReplayed()` | Volume re-sent after reconnects. A spike usually means a fresh outage; sustained growth means a flapping wire. | +| `getTotalServerErrors()` | Count of error frames received (any category). | +| `getDroppedErrorNotifications()` | Error-inbox overflow count. Non-zero means a busy error stream or a slow handler. | +| `getTotalErrorNotificationsDelivered()` | Errors delivered to the user handler. | +| `getTotalBackpressureStalls()` | Producer threads that hit `sf_max_total_bytes`. | +| `getLastTerminalError()` | The latched `SenderError`, or null. | +| `getActiveBackgroundDrainers()` | Running orphan drainers right now. | +| `getTotalBackgroundDrainersSucceeded()` / `…Failed()` | Cumulative drainer outcomes. | + +### Suggested dashboards + +- **Reconnect health:** `reconnect_attempts - reconnect_succeeded` over + time. A non-zero difference for more than a few seconds means the + wire is currently down. Alert if it stays elevated past your + `reconnect_max_duration_millis`. +- **Replay volume:** `frames_replayed` rate. Bursts are expected; + sustained replay means a chronic instability. +- **Backpressure:** `backpressure_stalls` rate. Any non-zero rate is a + capacity signal. +- **Error rate by category:** instrument your error handler to bucket + by category. Background `SCHEMA_MISMATCH` is usually a schema-drift + symptom worth alerting on. + +The default error handler logs every received `SenderError` — +`ERROR`-level for HALT, `WARN`-level for DROP. Replace it only if you +are also routing the errors somewhere else (Sentry, structured logs): +silence is forbidden by the contract. + +## Multi-sender deployments + +When several senders share a host and a `sf_dir`: + +- Give each one a unique `sender_id`. The defaults `sender_id=default` + is fine for a single-sender host but collides for any second + sender. +- Consider `drain_orphans=on` if dynamic sender identities mean dead + instances can leave permanent orphans. +- Size `sf_max_total_bytes × number_of_senders` against available disk. +- Plan for the worst-case lock-collision recovery: a misconfigured + fleet that all share `sender_id=default` will leave only one sender + alive on each host. That is the design — fail loudly rather than + silently corrupt overlapping slots. + +## Next steps + +- [Configuration](/docs/high-availability/store-and-forward/configuration/) — + the full connect-string key reference. +- [Client failover concepts](/docs/high-availability/client-failover/concepts/) — + what the reconnect loop does between disconnects. diff --git a/documentation/high-availability/store-and-forward/when-to-use.md b/documentation/high-availability/store-and-forward/when-to-use.md new file mode 100644 index 000000000..931831e42 --- /dev/null +++ b/documentation/high-availability/store-and-forward/when-to-use.md @@ -0,0 +1,221 @@ +--- +title: When to use store-and-forward +sidebar_label: When to use +description: + Decision guide for choosing between memory mode and disk-backed + store-and-forward, when to opt into durable-ack trim, and when to enable + orphan adoption. +--- + +:::note Java-only today + +Client-side store-and-forward support is currently available in the Java +client. Additional language clients are on the roadmap. + +::: + +The QWP WebSocket transport always uses a store-and-forward (SF) substrate. +What changes between deployments is **where** that substrate keeps unacked +data and **what durability bar** it acknowledges against. This page is the +decision guide. + +If you are new to SF, start with +[Concepts](/docs/high-availability/store-and-forward/concepts/). + +## Memory mode vs SF mode + +The single switch that decides this is whether you set `sf_dir` in the +connect string. + +### Memory mode — `sf_dir` unset + +Unacked frames live in a malloc'd ring in process memory. Default cap is +`128 MiB`. + +**Choose memory mode when:** + +- The producer process is short-lived or ephemeral (a CLI job, a CI + worker, a serverless function). +- A process restart is acceptable as a fresh start — you don't need + in-flight data to survive a crash. +- You only need to tolerate **transient** network blips and short server + outages (think: rolling upgrades, brief network partitions). +- Your data volume comfortably fits in RAM during the longest outage you + care about. + +### SF mode — `sf_dir=/path/to/slot-root` + +Unacked frames are written to mmap'd files under +`//`. Default cap is `10 GiB`. + +**Choose SF mode when:** + +- The producer process is long-running and outage budgets are measured + in minutes (the default `reconnect_max_duration_millis` is 5 minutes + for a reason). +- You need in-flight data to survive process restarts — JVM crash, OOM + kill, host reboot, planned redeploy. +- You ingest at rates where minutes of buffering exceeds RAM you can + spare. +- You operate unattended at the edge (sensors, ETL jobs) where the + server may sometimes be unreachable for extended periods. + +Both modes share the same wire behaviour, the same failover loop, and +the same connect-string keys for everything other than storage. You can +switch between them without changing application code — only the connect +string. + +## Comparison at a glance + +| Question | Memory mode | SF mode | +|---|---|---| +| Where is buffered data? | Process RAM | Disk (`//`) | +| Default capacity | `128 MiB` | `10 GiB` | +| Survives a JVM crash? | No | Yes | +| Survives `kill -9`? | No | Yes | +| Survives a host reboot? | No | Yes (if the disk does) | +| Cross-sender rescue (orphan adoption) | n/a | Yes (opt-in) | +| Setup cost | Zero | Provisioning a writable directory | +| Operational cost | Zero | Sizing, monitoring, lock collisions | + +## Durable-ack: when to opt in + +By default the substrate trims unacked data on OK ack from the server. +That means the substrate releases a frame once the server has acknowledged +it into the WAL. The frame is durable on the **primary's** disk; whether +it has been replicated to the object store or to replicas is a separate +matter. + +When the connect string sets `request_durable_ack=on`, trim is held back +until a separate `STATUS_DURABLE_ACK` frame confirms the data has been +uploaded from the WAL to the **configured object store** (S3, Azure Blob, +GCS, or NFS). + +### Choose durable-ack when + +- You require object-store durability before considering a write + acknowledged — e.g. compliance requirements, end-to-end exactly-once + pipelines with cross-region recovery. +- Loss of an entire primary node (and its local disk) must not lose + in-flight data — replicas haven't downloaded the WAL yet, only the + object store has. +- You are willing to trade later trim (and so larger steady-state SF + disk usage) for the stronger guarantee. + +### Stay on the default OK trim when + +- WAL-local durability on the primary is sufficient. +- You want minimum steady-state disk usage. +- You are running OSS or a build that does not support durable-ack. + (The handshake fails loudly if you opt in but the server cannot + deliver — see below.) + +### Caveats + +- **Server support is required.** The client sends + `X-QWP-Request-Durable-Ack: true` on the upgrade. The server must echo + back `X-QWP-Durable-Ack: enabled`. If it does not — OSS build, + uninitialised primary, missing registry, hitting a replica — the + connect **fails loudly**, by design. Silently waiting for ack frames + that never arrive would let the SF disk fill up. +- **Idle keepalive.** The OSS server only flushes pending durable-ack + frames during inbound recv events. The client sends a WebSocket PING + every `durable_ack_keepalive_interval_millis` (default 200 ms) when + there are pending confirmations and the producer is idle. +- **Disk pressure.** Steady-state SF disk usage is roughly + `ingest_rate × time_to_object_store_durability`. Size + `sf_max_total_bytes` accordingly. + +## Orphan adoption: when to enable + +A sender that exits without draining its slot leaves unacked data on +disk. If another process restarts under the same `sender_id` and same +`sf_dir`, it picks up the orphan automatically as part of normal +recovery. But if no process ever uses that `sender_id` again, the data +sits on disk forever. + +Setting `drain_orphans=on` tells the **foreground sender** to scan +`/*` at startup for sibling `sender_id`s with unacked data and +spawn background drainers to clear them. + +### Enable orphan adoption when + +- You have a fleet of senders writing to a shared `sf_dir` (multi-tenant + host, container restart) and want any survivor to rescue dead + siblings' data. +- Your deployment can dynamically allocate `sender_id` (e.g. one per + process instance), so dead instances leave permanent orphans that no + natural restart will adopt. +- You prefer "automatic eventual delivery" over "operator manually + reattaches the slot." + +### Leave it off when + +- Each `sender_id` is statically pinned to a specific process — there + are no orphans by construction; a restart of the same process + recovers its own slot. +- You want explicit operator control over data movement in a shared + `sf_dir`. +- You run a single producer per host. + +Drainer concurrency is capped by `max_background_drainers` (default +`4`). Each drainer opens its own connection — they share the network +path but not the WebSocket. + +`drain_orphans=on` does not interfere with regular recovery: the +foreground sender still recovers its own `sender_id` first, then +drainers spawn for sibling slots. + +## Migrating from HTTP/TCP ILP + +If you are currently using HTTP or TCP ILP ingest, the comparison is: + +| Capability | HTTP ILP | TCP ILP | QWP WebSocket + SF | +|---|---|---|---| +| Non-blocking producer | No (request waits) | No (TCP backpressure) | Yes (buffer absorbs publishes) | +| Survives process crash | No | No | Yes (SF mode) | +| Server outage tolerance | Best-effort retry | None | Reconnect loop with multi-minute budget | +| Multi-host failover | Yes (HTTP only) | No | Yes | +| Cross-region durability ack | No | No | Yes (`request_durable_ack=on`) | +| Cluster-wide ordering | Best-effort | Best-effort | FSN-driven, server-deduplicated | + +The transition is application-transparent — `Sender.fromConfig` accepts +a `ws::` or `wss::` connect string and the public builder API is the +same. The most common migration is HTTP ILP → QWP WS+SF, with `sf_dir` +set, retaining HTTP for backward compatibility while the QWP path +becomes the primary. + +For specifically the multi-host HA path on HTTP ILP, see the existing +[ILP overview "Multiple URLs for High Availability"](/docs/ingestion/ilp/overview/#multiple-urls-for-high-availability) +section. QWP failover (documented in +[Client failover concepts](/docs/high-availability/client-failover/concepts/)) +replaces and extends it. + +## Decision flowchart + +```mermaid +graph TD + Q1{Will the producer outlive any single outage you care about?} + Q2{Does data need to survive a JVM crash or kill -9?} + Q3{Is object-store durability required before ack?} + Q4{Multiple senders share sf_dir, with dynamic sender_id?} + + Q1 -->|"No (ephemeral job)"| Memory[Memory mode — leave sf_dir unset] + Q1 -->|"Yes (long-running service)"| Q2 + Q2 -->|No| Memory + Q2 -->|Yes| SF[SF mode — set sf_dir] + SF --> Q3 + Q3 -->|Yes| Durable[Add request_durable_ack=on] + Q3 -->|No| Q4 + Durable --> Q4 + Q4 -->|Yes| Orphans[Add drain_orphans=on] + Q4 -->|No| Done[Configuration complete] + Orphans --> Done +``` + +## Next steps + +- [Configuration](/docs/high-availability/store-and-forward/configuration/) — + the connect-string keys. +- [Operating and tuning](/docs/high-availability/store-and-forward/operating-and-tuning/) — + slot layout, sizing, observability. diff --git a/documentation/sidebars.js b/documentation/sidebars.js index f5bd297a2..fe4972542 100644 --- a/documentation/sidebars.js +++ b/documentation/sidebars.js @@ -654,6 +654,24 @@ module.exports = { type: "doc", label: "WAL Cleanup", }, + { + type: "category", + label: "Client Failover", + items: [ + "high-availability/client-failover/concepts", + "high-availability/client-failover/configuration", + ], + }, + { + type: "category", + label: "Store-and-Forward", + items: [ + "high-availability/store-and-forward/concepts", + "high-availability/store-and-forward/when-to-use", + "high-availability/store-and-forward/operating-and-tuning", + "high-availability/store-and-forward/configuration", + ], + }, ], },