Add EdgeZero canary routing (rollout_pct flag + FNV-1a bucket assignment) (PR 19)#727
Add EdgeZero canary routing (rollout_pct flag + FNV-1a bucket assignment) (PR 19)#727prk-Jr wants to merge 16 commits into
Conversation
…outes_to_edgezero
When edgezero_enabled=true, reads edgezero_rollout_pct (0-100) from the trusted_server_config store. Routes each request to EdgeZero if fnv1a_bucket(client_ip) < rollout_pct, giving the ops team sticky per-user canary control without a deploy. Absent key defaults to 100 (full rollout, backward compatible with edgezero_enabled=true deployments that predate this PR). Invalid values and read errors default to 0 (all legacy, fail-safe).
Set to "0" so local dev and CI stay on the legacy path by default. Ops changes this in the production config store — no re-deploy required.
Documents the two config store keys, canary progression stages (1% → 10% → 50% → 100%), hold-point durations, pass/fail thresholds, and instant rollback procedure.
…ro-pr19-cutover-canary
ChristianPavilonis
left a comment
There was a problem hiding this comment.
Summary
One low-severity docs/ops note: the canary runbook references routing log lines that are emitted at debug level, so production verification may require a debug/log-level deployment or equivalent observability setup.
The absent-key branch logged at warn on every request when edgezero_enabled=true but edgezero_rollout_pct was unset, flooding logs at production QPS while traffic flowed fine. Demote to debug; the setup procedure and migration runbook remain the operational safety net. The absent-key default stays 100 (backward compat) so an existing edgezero_enabled=true deployment is not silently rolled back.
For rollout_pct 0 (full rollback) and 100 (full cutover) — which together cover most of the canary's lifetime — the routing decision is fixed, yet every request still allocated the client-IP routing key string and ran the FNV hash. Match on the rollout value and only compute the bucket for a partial rollout. Also rename canary_routes_to_edgezero to routes_to_edgezero: the predicate is just `bucket < rollout_pct` and applies at any rollout value, so "canary" read misleadingly at the call site.
The four documented branches (absent -> 100, valid -> parsed, invalid -> 0, read error -> 0) are safety-critical but only the parse step was indirectly covered. ConfigStoreHandle::new(Arc::new(...)) over the public ConfigStore trait is constructible in unit tests, so add an in-memory stub store and pin every branch, including the fail-safe defaults that matter during an incident.
The routing verification log lines are emitted at log::debug!, but the Fastly
logger is capped at Info in production, so verifying the canary via log tailing
needs a debug-level deployment or another observability signal (the runbook now
says so and points at the real-time stats split as the alternative).
Also reconcile the documented log format with the adapter: at the degenerate
rollout values (0 and 100) the bucket is no longer computed, so those lines read
`routing request through {legacy,EdgeZero} path (rollout_pct=N)` without a bucket.
ChristianPavilonis
left a comment
There was a problem hiding this comment.
Automated review:
Review Summary
Reviewed PR #727 against feature/edgezero-pr19-spin-adapter, focusing on the Fastly entry-point canary routing, config-store failure behavior, rollout docs, and CI/review context. I did not find any blocking correctness, security, data-loss, authorization, or severe compatibility issues. I left one P2 inline comment about strengthening coverage for the safety-critical entry-point dispatch matrix.
Findings
P0 / Blockers
None.
P1 / High
None.
P2 / Medium
1 inline comment submitted.
P3 / Low
None.
CI / Existing Reviews
All reported PR checks are passing. Existing inline feedback about absent-key logging, config-store branch tests, degenerate rollout hashing, naming, and debug-level log documentation appears to have been addressed in later commits. Local cargo test-fastly rollout -- --nocapture could not run in this environment because Viceroy failed to satisfy fastly_http_downstream::downstream_bot_analyzed; CI's Fastly job is green.
Pull the rollout-percentage routing matrix out of main() into a pure should_route_to_edgezero(rollout_pct, client_ip) helper so the safety-critical wiring from rollout state to the EdgeZero/legacy choice can be unit-tested directly, not just the parsing/hashing helpers it composes. main() keeps the edgezero_enabled gate and config reads as thin I/O glue. Add tests covering the full dispatch matrix: rollout_pct=0 (always legacy), rollout_pct=100 (always EdgeZero), and partial-rollout bucket hit/miss for both present and absent client IPs.
…er' into feature/edgezero-pr19-cutover-canary
The base branch refactored IpAddr out of main.rs and dropped the import. Merging base in lost the import that should_route_to_edgezero and its tests rely on, breaking the wasm32-wasip1 build. Re-add it.
ChristianPavilonis
left a comment
There was a problem hiding this comment.
Automated review:
Review Summary
Reviewed PR #727 against feature/edgezero-pr19-spin-adapter, focusing on the Fastly canary dispatch, rollout percentage parsing/defaults, config-store failure behavior, tests, local Fastly config, and the migration runbook. I did not find any blocking correctness, security, data-loss, authorization, or severe compatibility issues. I left one P2 inline comment about making the production canary observability path explicit before operators rely on the runbook.
Findings
P0 / Blockers
None.
P1 / High
None.
P2 / Medium
1 inline comment submitted.
P3 / Low
None.
CI / Existing Reviews
All reported PR checks are passing. Existing review feedback about absent-key log volume, degenerate rollout hashing, config-store branch coverage, naming, and debug-level log documentation appears addressed by later commits. Local cargo test-fastly rollout -- --nocapture compiled the changed Fastly test binary but could not execute under the installed Viceroy because it lacks fastly_http_downstream::downstream_bot_analyzed; CI's Fastly test job is green.
| > **The routing log lines are emitted at `log::debug!`, while the Fastly logger | ||
| > is capped at `Info` in production.** Verifying the canary via log tailing | ||
| > therefore requires a debug-level logging deployment (or another equivalent | ||
| > observability signal, such as the Fastly real-time stats traffic split). With |
There was a problem hiding this comment.
Automated review: P2: Make the production canary route signal explicit before relying on it operationally.
This paragraph now correctly calls out that the route-decision logs are debug! while the Fastly logger is capped at Info, but the fallback it names (Fastly real-time stats traffic split) is not something this PR appears to create. The code only emits the EdgeZero/legacy decision at debug level, and the runbook also says the x-edgezero-path marker does not exist yet. Without a production-visible route marker (or an exact documented dashboard/query that already distinguishes the two branches), operators cannot reliably confirm that rollout_pct=0 is actually all legacy, that 1%/10% traffic is flowing on the intended path, or attribute latency/error changes to the canary branch.
Suggested fix: either add/document a concrete production-safe signal before running the canary (for example a sampled/rate-limited info log field, a metrics/log field, or an internally gated diagnostic response header), or change the runbook to make a debug-level logging deployment an explicit precondition and remove the unsupported real-time-stats fallback unless there is a specific existing split/dashboard to reference.
Summary
edgezero_rollout_pct(0–100) to thetrusted_server_configstore so EdgeZero traffic can be shifted incrementally without a deployclient_ipfor deterministic, sticky per-user bucket assignment (0–99); routing predicate isbucket < rollout_pctrollout_pct = 0is the immediate no-deploy rollbackChanges
crates/trusted-server-adapter-fastly/src/main.rsparse_rollout_pct,fnv1a_bucket,canary_routes_to_edgezero,read_rollout_pct; updatedmain()with canary dispatch; 11 unit tests (pinned FNV-1a vectors, distribution, boundary routing)fastly.tomledgezero_rollout_pct = "0"to local Viceroy config store with ops commentdocs/internal/EDGEZERO_MIGRATION.mdCloses
Closes #500
Test plan
cargo test-fastly— 65 + 956 + 21 + 3 tests greencargo test-axum— 31 tests greencargo clippy-fastly && cargo clippy-axum— no warningscargo fmt --all -- --check— cleancd crates/js/lib && npx vitest run— 291 tests greencd crates/js/lib && npm run format— cleancd docs && npm run format— cleancargo build --package trusted-server-adapter-fastly --release --target wasm32-wasip1fastly compute servewithedgezero_rollout_pct = "1"and"0"(rollback)Checklist
unwrap()in production code — useexpect("should ...")logmacros (notprintln!)