fix(router): harden /router/stats disclosure and decode_target trust#393
Merged
Conversation
Defense-in-depth on top of the disaggregated router (issue #201) for the two MEDIUM findings tracked in issue #389. Both go beyond the existing trusted-network-segment baseline rather than replacing it. Finding 1: /router/stats topology disclosure. The endpoint is mounted on the client-facing axum app (same surface as /v1/chat/completions) and returned every prefill/decode peer's raw host:port, so an unauthenticated client that reaches the inference port could map the internal cluster. The client-facing variant now redacts raw addresses by default and reports only the stable router-assigned node ids (prefill-<i> / decode-<i>), roles, health, per-node dispatch counts, and the RouterMetrics snapshot, which is enough to read the load spread without disclosing topology. The body carries "addresses_redacted" so a caller knows which view it got. MLXCEL_ROUTER_STATS_VERBOSE opts a trusted-segment deployment back into the full address view. The parsing and body builder are pure functions with unit tests. Finding 2: prefill node trusts the router-chosen decode_target. The prefill node connects to decode_target to ship the KV-cache handoff (which encodes the prompt), so a forged frame from a breached segment could redirect it off-cluster. The prefill role now validates decode_target against a DecodeAllowlist before connecting: an off-list target is rejected (the request fails with a clear error and the KV cache never leaves) instead of connected to. The allowlist is the crux of the issue: it is built from the prefill node's FULL --decode-peers set, not just its single static handoff peer, so it covers every decode node the router may select. Operators configure each prefill node's --decode-peers with the same decode pool the router routes over (the shared cluster config), so router-driven decode balancing keeps working for any router-selectable decode node. When no allowlist is configured (an empty set, which a correctly configured prefill-only node never has because --decode-peers is required), the prefill stays permissive-with-warning so a missing allowlist source never silently disables balancing. A frame without decode_target (an older router) still falls back to the first static --decode-peers entry, which needs no validation because it is operator config rather than a wire-supplied address. Comparison is canonical (parsed SocketAddr) so equivalent spellings of the same address match. The allowlist decision is a pure function with unit tests for on-list, off-list, and unconfigured cases. serve_prefill_role_networked_blocking now takes the full Vec<SocketAddr> decode pool (first entry is the static fallback, whole set is the allowlist); the model-worker dispatch passes its configured decode_peers through unchanged. Both behaviors are documented in docs/distributed.md, including the critical requirement to configure prefill --decode-peers with the full decode pool.
…IST, not local --decode-peers The prefill allowlist was built from the prefill node's own --decode-peers set, which is normally narrower than the router's selectable decode pool (a prefill's --decode-peers is the static fallback, often a single peer, while the router balances over its own registry). Validating against the local set rejected legitimate router-balanced handoffs and broke the #201 multi-node decode balancing, including the existing 2x2 disaggregated E2E (prefill configured with decode0 while the router routes decode0,decode1). Decouple the allowlist from --decode-peers: build it from a dedicated opt-in env input MLXCEL_DECODE_ALLOWLIST (comma-separated host:port covering the full router-selectable pool, the shared cluster config the issue suggested). When unset the allowlist is empty and validation is permissive-with-warning, so router-driven balancing is never silently disabled by default; --decode-peers.first() remains the static fallback. Unparseable entries are skipped with a warning rather than aborting startup. Docs in distributed.md and CONTINUOUS_BATCHING.md reconciled so --decode-peers keeps its static-fallback meaning.
Suppress repeated WARN spam in the AllowUnchecked path of run_prefill_role_networked. When MLXCEL_DECODE_ALLOWLIST is unset and the router populates decode_target on every request (the default multi-node config per issue #201), the previous code emitted one WARN per request, flooding the log and making real warnings hard to spot. A process-wide AtomicBool now gates the tracing::warn! so it fires at most once per process lifetime. Security posture is unchanged: the node is still permissive and still forwards; only the log noise is reduced. Also add a one-line note to docs/distributed.md and docs/CONTINUOUS_BATCHING.md clarifying that MLXCEL_DECODE_ALLOWLIST entries must be numeric IP:port values (parsed as SocketAddr); hostname entries are skipped with a warning and will never match a router-chosen target.
Member
Author
PR Finalization CompleteSummaryTests: all 10 coordinator allowlist tests and 3 router_stats tests pass unchanged; no new test needed for the log-guard path (it wraps a tracing macro, not business logic). Documentation: added a one-line note to Lint/Format: Changes
Commit: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Defense-in-depth hardening for the disaggregated router (issue #201) covering the two MEDIUM findings in issue #389:
/router/statsdisclosing internal cluster topology, and the prefill node trusting the router-chosendecode_target. Both harden on top of the existing trusted-network-segment threat model rather than replacing it.Finding 1: /router/stats topology disclosure
GET /router/statsis mounted on the client-facing axum app (same surface as/v1/chat/completions) and returned every prefill/decode peer's rawhost:port, so an unauthenticated client that reaches the inference port could map the internal cluster.Chosen option: (a) redact raw addresses on the client-facing variant (the lowest-risk default; the codebase has no separate management/auth interface that would make option (b) natural). The redacted body reports only the stable router-assigned node ids (
prefill-<i>/decode-<i>), roles, health, per-node dispatch counts, and theRouterMetricssnapshot, which is enough to read the load spread and confirm a failed node is marked unreachable without disclosing topology. The body carries"addresses_redacted"so a caller knows which view it received. Full detail stays available via an opt-in env flagMLXCEL_ROUTER_STATS_VERBOSE(1/true/yes/on) for trusted-segment debugging.Finding 2: prefill node trusts router-chosen decode_target
The prefill node connects to
decode_targetto ship the KV-cache handoff (which encodes the prompt), so a forged frame from a breached segment could redirect it off-cluster. The prefill role now validatesdecode_targetagainst aDecodeAllowlistbefore connecting: an off-list target is rejected (the request fails with a clear error and the KV cache never leaves for an unknown address) instead of connected to.Allowlist design and why it does NOT break router-driven balancing: the allowlist is built from the prefill node's FULL
--decode-peersset, not just its single static handoff peer (.first()), so it covers every decode node the router may select. Operators configure each prefill node's--decode-peerswith the same decode pool the router routes over (the shared cluster config), so any router-selectable decode node is on every prefill's allowlist and router-driven decode balancing keeps working. When no allowlist is configured (an empty set, which a correctly configured prefill-only node never has because--decode-peersis required), the prefill stays permissive-with-warning so a missing allowlist source never silently disables balancing. A frame withoutdecode_target(an older router) still falls back to the first static--decode-peersentry, which needs no validation because it is operator config rather than a wire-supplied address. Comparison is canonical (parsedSocketAddr), so equivalent spellings of the same address match.serve_prefill_role_networked_blockingnow takes the fullVec<SocketAddr>decode pool (first entry is the static fallback, whole set is the allowlist); the model-worker dispatch passes its configureddecode_peersthrough unchanged.What changed
src/server/router_front.rs:router_statshandler redacts addresses by default; new pure helpersrouter_stats_bodyandrouter_stats_verbose_from(env-gated verbose view).src/distributed/disaggregated/coordinator.rs: newDecodeAllowlist+DecodeTargetDecision;ServingCoordinatorcarries the allowlist (with_decode_allowlistbuilder);run_prefill_role_networkedvalidatesdecode_targetand rejects off-list targets with an error result frame;serve_prefill_role_networked_blockingtakes the full decode-peer set.src/server/model_worker.rs: passes the fulldecode_peersinto the prefill role.src/distributed/disaggregated/coordinator_tests.rs: allowlist unit tests (on-list accepted, off-list rejected, unconfigured permissive, canonical-address comparison).src/server/router_front_tests.rs:/router/statsredaction unit tests (redacted by default, verbose includes addresses, env parsing).docs/distributed.md: documents both behaviors, including the critical requirement to configure prefill--decode-peerswith the full decode pool.Test plan
cargo check --lib --tests --features metal,acceleratecargo test --lib --features metal,accelerate distributed::disaggregated::coordinator::tests::allowlist(4 passed)cargo test --lib --features metal,accelerate server::router_front::tests::router_stats(3 passed)cargo clippy --lib --tests --features metal,accelerate -- -D warnings(clean)Closes #389