
feat(config): nano-replica memory profile — run the replica in 512 MiB–1 GiB VMs#10491

Draft
Dfinity-Bjoern wants to merge 14 commits into bjoern/dockernet from bjoern/nano-replica

Conversation

@Dfinity-Bjoern

Contributor

Status: DRAFT / experimental — shared for discussion, not for merge.
Branched off bjoern/dockernet (the local 4-node dev/local-net); this PR is just the nano-profile config + load-driver work on top of it.

What this is

An experiment to run the ICP replica in 512 MiB–1 GiB VMs (mainnet uses ~512 GiB), accepting a much smaller subnet: 2 execution threads, 1 query thread, a few-hundred-MB subnet state, best-effort-leaning messaging. The premise is that the 512 GiB figure is mostly worst-case capacity bounds and reservations, not steady-state resident memory — so the work is shrinking every pool/limit by ~100–1000× and fixing the few places where a single execution could OOM a tiny node.

All changes are gated to the nano profile's constants in rs/config; nothing here is meant for mainnet defaults as-is.

Config changes (rs/config)

  • execution_environment.rs: subnet memory capacity 2 TiB→512 MiB; exec threads 4→2, query threads 4→1; heap-delta capacity 140 GiB→96 MiB; guaranteed-response msg mem 15 GiB→64 MiB, best-effort 5 GiB→32 MiB; ingress history/custom sections/caches shrunk; callback soft-limit 1M→4096; memory reservation 2560→8 MiB/thread; SUBNET_MEMORY_THRESHOLD = capacity (disables storage cycle-reservation so canisters can use the full cap).
  • embedders.rs (the OOM-cliff fix): per-message stable dirty/accessed page limits 1–8 GiB → 32/128 MiB; sandbox count 10000→32, idle 30m→2m; rayon threads 10/8→2/2.
  • subnet_config.rs: MAX_HEAP_DELTA_PER_ITERATION 200 MB→64 MB (so a single round can't overshoot the 96 MiB heap-delta cap, which bounds the unreclaimable resident spike under writes); heap-delta initial reserve 32 GiB→32 MiB; max paused (DTS) execs 4→1; per-canister heap-delta rate-limit 75→32 MiB.
  • canister_sandbox/.../sandboxed_execution_controller.rs: decoupled max sandbox RSS from heap-delta via a 128 MiB floor (so shrinking heap delta doesn't starve sandboxes); eviction batch 1 GiB→64 MiB.
  • message_routing.rs: XNet stream size 10→2 MiB, max stream messages 10000→1000.
  • dev/local-net/prep.sh: bakes the nano hypervisor overrides into generated configs and shortens the DKG/checkpoint interval to ~50 rounds.

Two correctness fixes were found by actually running it: the DTS scheduler requires ≥2 cores ((cores-1)*100% capacity → 1 core trips an invariant), and MAX_HEAP_DELTA_PER_ITERATION must stay ≤ the heap-delta cap.
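
The first fix can be illustrated with a minimal sketch of the capacity arithmetic. It only mirrors the formula and invariant quoted in this PR; the function names are illustrative and are not the scheduler's actual API.

```rust
/// Allocatable compute capacity in percent, following the formula quoted in
/// this PR: (scheduler_cores - 1) * 100. Illustrative helper, not the real API.
fn compute_capacity_percent(scheduler_cores: u64) -> u64 {
    scheduler_cores.saturating_sub(1) * 100
}

/// The invariant described in the PR:
/// total_compute_allocation + 1% <= compute_capacity.
fn invariant_holds(total_compute_allocation_percent: u64, scheduler_cores: u64) -> bool {
    total_compute_allocation_percent + 1 <= compute_capacity_percent(scheduler_cores)
}

fn main() {
    for cores in 1..=4u64 {
        println!(
            "cores={} capacity={}% invariant_with_zero_allocation={}",
            cores,
            compute_capacity_percent(cores),
            invariant_holds(0, cores)
        );
    }
}
```

With one core the capacity is 0%, so the invariant fails even when no canister has a compute allocation; two cores is the smallest value that satisfies it.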

Load driver (rs/canister_client/examples/hammer.rs)

A self-contained stress tool driven over the public endpoint via the in-repo ic-canister-client (no dfx). Deploys universal canisters and hammers them. Modes: read (stable reads), heap/heapread (heap memory), calls (inter-canister chains), fanout (parallel calls), hybrid (reads+writes+messaging at once), plus a per-message dirty/accessed-limit probe. Run e.g.:

UNIVERSAL_CANISTER_WASM_PATH=/path/to/universal_canister.wasm \
  HAMMER_MODE=hybrid HAMMER_CANISTERS=4 cargo run -p ic-canister-client --example hammer -- http://localhost:8080
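
For readers who want a picture of how the environment-driven mode selection might look, here is a hedged sketch. The mode names are the ones listed in this PR; the `HammerMode` enum, its parsing, and the behaviour when the variable is unset are assumptions for illustration, not the code in hammer.rs.

```rust
use std::str::FromStr;

/// Hypothetical enum covering the mode names mentioned in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HammerMode {
    Read,
    Heap,
    HeapRead,
    Calls,
    Fanout,
    Hybrid,
    Probe,
}

impl FromStr for HammerMode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "read" => Ok(Self::Read),
            "heap" => Ok(Self::Heap),
            "heapread" => Ok(Self::HeapRead),
            "calls" => Ok(Self::Calls),
            "fanout" => Ok(Self::Fanout),
            "hybrid" => Ok(Self::Hybrid),
            "probe" => Ok(Self::Probe),
            other => Err(format!("unknown HAMMER_MODE: {other}")),
        }
    }
}

fn main() {
    // Assumption: an unset variable means "run the default phases".
    match std::env::var("HAMMER_MODE") {
        Ok(raw) => match raw.parse::<HammerMode>() {
            Ok(mode) => println!("selected mode: {mode:?}"),
            Err(e) => eprintln!("{e}"),
        },
        Err(_) => println!("HAMMER_MODE unset: running default phases"),
    }
}
```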

Key findings (from dev/local-net, container RAM hard-capped)

  • Idle/light load fits 512 MiB (~172–215 MiB anon). Heavy mixed load wants 1 GiB. Under a heavy hybrid storm, 512 MiB had 2 sandbox-OOM restarts (recovered, consensus never stopped); 1 GiB ran clean (0 restarts).
  • Non-reclaimable (anon) floor ≈ 340 MiB under load (replica + sandboxes + bounded heap delta); the rest of resident memory is reclaimable page cache of the checkpoint (read working set). The MAX_HEAP_DELTA_PER_ITERATION fix keeps anon bounded (no 200 MB overshoot).
  • Per-message 32 MiB stable limit verified: a single message touching >32 MiB of stable memory traps cleanly (canister-level), not an OOM-kill — the protection that makes a tiny node safe. Heap memory has no such per-execution cap and is ~2.5× more expensive to store, so stable is the right place for large state.
  • Inter-canister: ~200 msgs/s, latency = ~1 consensus round/hop; the reduced 64 MiB guaranteed-response cap is never hit under realistic patterns (execution-rate + ingress backpressure keep outstanding calls low) and is enforced gracefully when pushed.
  • In every tested scenario the node degrades gracefully (throttling, backpressure, page-cache eviction, recovery after restart) rather than failing hard.

Not done / caveats

  • Structural items from the plan are not included: rejecting guaranteed-response calls at the system-API boundary, consensus/p2p pool sizing for the 4-node target, HTTP-endpoint concurrency, disabling BTC/HTTP-outcalls adapters.
  • Dependent-crate value-assertion tests (execution_environment, messaging, scheduler) will need expected-constant updates; the full bazel test sweep has not been run (CI-scale). Compile + ic-config unit tests + targeted bazel builds pass.

🤖 Generated with Claude Code

Bjoern Tackmann and others added 14 commits June 12, 2026 16:19
Scale down the replica's memory capacities, reservations and limits so it
can run on a 512 MiB–1 GiB VM (down from the 512 GiB mainnet footprint),
accepting a substantially reduced subnet capacity.

execution_environment.rs:
- subnet memory capacity 2 TiB -> 512 MiB, threshold -> 384 MiB
- guaranteed-response msg mem 15 GiB -> 64 MiB, best-effort 5 GiB -> 32 MiB
- ingress history 4 GiB -> 32 MiB, wasm custom sections 2 GiB -> 16 MiB
- execution threads 4 -> 1, query threads 4 -> 1
- subnet memory reservation 2560 -> 64 MiB per thread
- callback soft limit 1,000,000 -> 4,096
- subnet heap delta capacity 140 GiB -> 96 MiB
- query cache 200 -> 16 MiB, compilation cache 10 GiB -> 64 MiB

embedders.rs (OOM-cliff fix — bound a single execution's resident set):
- stable dirty/accessed page limits 1-8 GiB -> 32/128 MiB
- max dirty pages without optimization 1 GiB -> 32 MiB
- sandbox count 10,000 -> 32, idle time 30m -> 2m
- rayon compilation/page-allocator threads 10/8 -> 2/2
- query threads per canister 2 -> 1

subnet_config.rs:
- heap delta initial reserve 32 GiB -> 32 MiB (must be <= capacity)
- max paused (DTS) executions 4 -> 1
- per-canister heap delta rate limit 75 -> 32 MiB

sandboxed_execution_controller.rs:
- decouple max sandbox RSS from heap delta via a 128 MiB floor
  (MIN_SANDBOXES_RSS), so a tiny heap delta no longer starves sandboxes
- eviction batch 1 GiB -> 64 MiB
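
The floor can be sketched as a simple `max`. This is a minimal illustration under the assumption that the sandbox RSS budget was previously derived from the heap-delta capacity alone; how the real controller derives that budget is not shown in this PR, so the `heap_delta_derived_budget` parameter is hypothetical.

```rust
const MIB: u64 = 1024 * 1024;

/// Floor named in the commit message above.
const MIN_SANDBOXES_RSS: u64 = 128 * MIB;

/// Hypothetical helper: whatever budget is derived from the heap delta,
/// never let the sandbox RSS limit drop below the floor.
fn max_sandboxes_rss(heap_delta_derived_budget: u64) -> u64 {
    heap_delta_derived_budget.max(MIN_SANDBOXES_RSS)
}

fn main() {
    // Nano profile: a 96 MiB heap-delta capacity would otherwise yield a
    // budget below the floor.
    println!("nano: {} MiB", max_sandboxes_rss(96 * MIB) / MIB);
    // A large budget passes through unchanged.
    println!("large: {} MiB", max_sandboxes_rss(140 * 1024 * MIB) / MIB);
}
```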

message_routing.rs:
- XNet stream target size 10 -> 2 MiB, max stream messages 10,000 -> 1,000

Verified: rustfmt, clippy (clean), cargo test -p ic-config (19 passed),
bazel build //rs/config:config //rs/canister_sandbox:backend_lib.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The DTS scheduler computes allocatable compute capacity as
`(scheduler_cores - 1) * 100%` (round_schedule::compute_capacity_percent).
With NUMBER_OF_EXECUTION_THREADS = 1 this is 0%, so the invariant
`total_compute_allocation + 1% <= compute_capacity` fails on every round
and the replica panics in the MR Batch Processor on restart.

Bump to 2 (the scheduler floor). Memory cost is negligible: the extra
execution thread's Wasm address space is virtual, resident usage stays
bounded by the per-message dirty-page limits and the shared sandbox-RSS
budget, and SUBNET_MEMORY_RESERVATION is 64 MiB x 2 = 128 MiB (< the
512 MiB subnet cap).

Found by running a 4-node local-net subnet with the nano profile.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Standalone stress driver for a local subnet, driven over the public
endpoint with the in-repo ic-canister-client Agent (no dfx needed):
deploys N universal canisters via provisional_create_canister_with_cycles,
then runs throughput / compute / dirty-page / memory-growth phases and
reports throughput, latency and error classes.

Run:
  UNIVERSAL_CANISTER_WASM_PATH=/path/to/universal_canister.wasm \
    cargo run -p ic-canister-client --example hammer -- http://localhost:8080

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Disable the storage cycle-reservation mechanism on the nano profile so
canisters can freely allocate up to the subnet memory capacity:

- SUBNET_MEMORY_THRESHOLD = SUBNET_MEMORY_CAPACITY (512 MiB). When the
  threshold is >= capacity the subnet is never "high usage", so growth
  never triggers cycle reservations (whose mainnet-calibrated pricing
  otherwise rejects growth on a tiny subnet, hitting the reserved-cycles
  limit).
- SUBNET_MEMORY_RESERVATION = 8 MiB/thread (was 64), so the response-
  callback reservation no longer caps usable storage well below capacity.

Also bake the matching hypervisor override into dev/local-net/prep.sh so
the local 4-node net inherits it across resets.
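
A rough sketch of why threshold = capacity switches the mechanism off. It assumes "high usage" simply means usage strictly above the threshold and that usage can never exceed the subnet capacity; the real reservation logic is more involved and is not reproduced here.

```rust
const MIB: u64 = 1024 * 1024;

/// Assumed simplification: the subnet is "high usage" only when memory usage
/// is strictly above the threshold.
fn is_high_usage(usage: u64, threshold: u64) -> bool {
    usage > threshold
}

fn main() {
    let capacity = 512 * MIB;
    // Threshold from the first commit (384 MiB): growth past it counts as high usage.
    println!("384 MiB threshold, 400 MiB used: {}", is_high_usage(400 * MIB, 384 * MIB));
    // Threshold equal to capacity: usage is capped at capacity, so never high.
    println!("512 MiB threshold, full subnet: {}", is_high_usage(capacity, capacity));
}
```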

Verified on the local-net: with reservation disabled, a single message
writing 24 MiB of stable memory succeeds while 48 MiB traps with
"Exceeded the limit for the number of accessed pages ... limit 32768 KB"
(the nano 32 MiB per-message stable limit), and the subnet keeps
finalizing with no replica panic — i.e. the per-message limit, not an
OOM-kill, bounds a single execution's working set.
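
The probe's pass/fail behaviour can be modelled with a minimal sketch. The 32 MiB limit and the 24/48 MiB probe sizes come from this PR; the function, its error text, and the choice of a strict "greater than" comparison are assumptions for illustration and do not reproduce the embedder's actual check.

```rust
const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;

/// Nano per-message stable accessed limit (32 MiB = 32768 KB).
const STABLE_ACCESSED_LIMIT: u64 = 32 * MIB;

/// Hypothetical check: reject a single message that touches more stable
/// memory than the limit, instead of letting the working set grow unbounded.
fn check_stable_access(bytes_touched: u64) -> Result<(), String> {
    if bytes_touched > STABLE_ACCESSED_LIMIT {
        Err(format!(
            "touched {} KB, limit {} KB",
            bytes_touched / KIB,
            STABLE_ACCESSED_LIMIT / KIB
        ))
    } else {
        Ok(())
    }
}

fn main() {
    for mib in [24u64, 48] {
        match check_stable_access(mib * MIB) {
            Ok(()) => println!("{mib} MiB: ok"),
            Err(e) => println!("{mib} MiB: trap ({e})"),
        }
    }
}
```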

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- HAMMER_MODE=probe runs only the per-message dirty/accessed-page-limit
  probe (skips the throughput/compute/growth storms).
- Grow stable memory in its own committed message, then fill 24 MiB
  (under the 32 MiB limit, expect OK) and 48 MiB (over, expect trap), so
  the limit is isolated from subnet-capacity effects.
- Widen error-class output so full canister reject reasons are visible.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The nano heap-delta capacity (96 MiB) is small relative to the default
~500-round checkpoint interval, so a memory-write-heavy workload fills the
heap delta in a few rounds and then execution stalls until the next
checkpoint flushes it (consensus keeps finalizing throughout — graceful,
but execution duty-cycle collapses).

Pass --dkg-interval-length 49 to ic-prep so checkpoints happen every ~50
rounds. Measured effect under the same hammer workload:
  heap-delta round-skips during the run: ~880 -> ~150
  compute phase drains ~3x faster; execution advances in short bursts
  instead of multi-minute stalls.

Checkpoint cadence follows the DKG interval (CUP heights); cheap here
because the nano subnet state is only a few hundred MB.
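
A back-of-the-envelope model of the stall described above, under loud assumptions: sustained writes produce a fixed heap delta per round (32 MiB here is a made-up figure), execution stops once the capacity is reached, and only a checkpoint resets it. It is a simplification and does not reproduce the measured skip counts.

```rust
const MIB: u64 = 1024 * 1024;

/// An interval length of N yields a checkpoint every N + 1 rounds
/// (49 -> 50 rounds, as described above).
fn rounds_per_interval(dkg_interval_length: u64) -> u64 {
    dkg_interval_length + 1
}

/// Rounds per checkpoint interval in which execution can still run before the
/// heap delta reaches the capacity (simplified model, see assumptions above).
fn executing_rounds(capacity: u64, delta_per_round: u64, interval_rounds: u64) -> u64 {
    assert!(delta_per_round > 0, "delta_per_round must be positive");
    let rounds_to_fill = (capacity + delta_per_round - 1) / delta_per_round;
    rounds_to_fill.min(interval_rounds)
}

fn main() {
    let capacity = 96 * MIB;
    let delta_per_round = 32 * MIB; // hypothetical write rate
    for length in [499u64, 49] {
        let interval = rounds_per_interval(length);
        let active = executing_rounds(capacity, delta_per_round, interval);
        println!("interval={interval} rounds: executing {active}, stalled {}", interval - active);
    }
}
```

In this model the number of executing rounds per interval stays the same, but a tenfold shorter interval gives roughly tenfold more executing rounds per unit of time.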

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
HAMMER_MODE=read populates N canisters with large stable state, then runs
read-heavy 24 MiB stable_read calls — updates on all-but-one canister and
queries on the last — concurrently, plus a 48 MiB single-execution read
probe to exercise the per-message/query stable accessed-page limit.
storm() gains an is_query flag to drive query calls via execute_query.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
HAMMER_MODE=heap mirrors the stable-memory tests on Wasm heap memory:
per-message heap-write probe (24/48/96 MiB in one message), heap-write
storm (8 MiB/call), and a heap-read storm (40 MiB get_global_data reads,
updates + queries). Demonstrates that heap has no per-execution
dirty/accessed cap (the 32 MiB limits are stable-only): all three
single-message heap writes and the 40 MiB heap reads succeed, whereas the
stable equivalents trap at 32 MiB.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- read mode: cycle read offsets across the FULL populated range (not 4
  fixed windows) and error-check the populate, so reads pull distinct
  state and all canisters are actually large.
- heapread mode: build a large per-canister heap global via
  append_to_global_data and query-read it (96 MiB/read). Surfaces that
  large heap state is ~2.5x more expensive than stable (wasm heap never
  shrinks + realloc on build), so 3x96 MiB heap globals OOM the 512 MiB
  subnet while 3x128 MiB stable fits, and that large heap reads via
  update OOM (the get_global_data copy grows heap).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- read storm now queries ALL canisters cycling the full populated range
  (clean read pressure; queries don't replicate or dirty).
- populate grows+fills in 24 MiB increments (a single 128 MiB grow can be
  rejected; small incremental grows reliably build the state).
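
The incremental populate can be sketched as a chunking helper. The 24 MiB increment and the 128 MiB target come from this PR; the `populate_chunks` function is hypothetical and only computes the (offset, length) pairs, leaving the actual grow and fill calls out.

```rust
const MIB: u64 = 1024 * 1024;

/// Split `total` bytes into consecutive (offset, length) chunks of at most
/// `chunk` bytes, so each grow+fill message stays under the per-message limit.
fn populate_chunks(total: u64, chunk: u64) -> Vec<(u64, u64)> {
    assert!(chunk > 0, "chunk size must be positive");
    let mut out = Vec::new();
    let mut offset = 0;
    while offset < total {
        let len = chunk.min(total - offset);
        out.push((offset, len));
        offset += len;
    }
    out
}

fn main() {
    for (offset, len) in populate_chunks(128 * MIB, 24 * MIB) {
        println!("grow+fill at {} MiB, {} MiB", offset / MIB, len / MIB);
    }
}
```

For 128 MiB this yields five 24 MiB chunks followed by one 8 MiB chunk, each below the 32 MiB per-message stable limit.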

Used to measure read memory/perf under a container RAM cap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
HAMMER_MODE=calls: each ingress makes the target canister start a
HAMMER_CALL_DEPTH-hop chain of update calls around the canister ring
(nested via call_args().other_side), generating ~2*depth inter-canister
messages per ingress. Used to stress message routing, callbacks and the
guaranteed-response memory reservation under the nano profile.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…iplier)

HAMMER_MODE=fanout: each ingress fires N parallel fire-and-forget update
calls (no-op callbacks), leaving N outstanding inter-canister calls per
in-flight ingress to stress the guaranteed-response memory reservation and
callback limits. HAMMER_FANOUT_MULT repeats the fan-out so a single message
issues N*mult calls (all reservations taken before any drain), which
exposes the 64 MiB guaranteed-response cap (~32 simultaneous calls).
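
The "~32 simultaneous calls" figure is consistent with a simple division, assuming each outstanding guaranteed-response call reserves about 2 MiB for its response. That per-call size is inferred from the numbers above and is an assumption, as are the helper names and the example fan-out values.

```rust
const MIB: u64 = 1024 * 1024;

/// How many calls can hold a reservation at once under the cap.
fn max_outstanding_calls(cap: u64, reservation_per_call: u64) -> u64 {
    assert!(reservation_per_call > 0, "reservation must be positive");
    cap / reservation_per_call
}

/// Calls a single message issues with fan-out `n` repeated `mult` times.
fn calls_per_message(n: u64, mult: u64) -> u64 {
    n * mult
}

fn main() {
    let cap = 64 * MIB;
    let per_call = 2 * MIB; // assumed reservation per outstanding call
    let limit = max_outstanding_calls(cap, per_call);
    let issued = calls_per_message(8, 8); // hypothetical N=8, mult=8
    println!("limit={limit} issued={issued} over_cap={}", issued.saturating_sub(limit));
}
```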

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
HAMMER_MODE=hybrid runs three storms concurrently over the canister pool:
query reads (24 MiB stable_read), update writes (8 MiB stable_fill), and
3-hop inter-canister call chains — splitting the concurrency budget. Shows
read/update path isolation and update-path contention under mixed load.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MAX_HEAP_DELTA_PER_ITERATION was 200 MB > SUBNET_HEAP_DELTA_CAPACITY
(96 MiB), so a single execution round could push the in-memory heap delta
far past the cap before the next round's skip-check — a transient spike of
unreclaimable (anonymous) resident memory (~200-300 MB) that threatens a
512 MiB VM under write load. Lower it to 64 MB so one round cannot
overshoot the cap, tightening the anonymous-memory ceiling.
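
One way to keep this relationship from regressing is a compile-time assertion. The sketch below is a suggestion, not part of the PR; it uses plain `u64` constants instead of `NumBytes` and assumes `M` denotes 10^6 bytes, as the "MB" wording suggests.

```rust
const M: u64 = 1_000_000; // assumption: decimal megabyte
const MIB: u64 = 1024 * 1024;

const SUBNET_HEAP_DELTA_CAPACITY: u64 = 96 * MIB;
const MAX_HEAP_DELTA_PER_ITERATION: u64 = 64 * M;

// Fails the build if the per-iteration limit ever exceeds the capacity.
const _: () = assert!(MAX_HEAP_DELTA_PER_ITERATION <= SUBNET_HEAP_DELTA_CAPACITY);

/// Runtime form of the same check, handy for comparing candidate values.
fn fits_under_cap(per_iteration: u64, capacity: u64) -> bool {
    per_iteration <= capacity
}

fn main() {
    println!("200 MB fits: {}", fits_under_cap(200 * M, SUBNET_HEAP_DELTA_CAPACITY));
    println!("64 MB fits: {}", fits_under_cap(MAX_HEAP_DELTA_PER_ITERATION, SUBNET_HEAP_DELTA_CAPACITY));
}
```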

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Contributor


Pull request overview

Experimental “nano-replica” configuration intended to shrink replica memory footprints (targeting 512 MiB–1 GiB VMs) by aggressively reducing subnet memory caps, heap-delta limits, sandbox/resource limits, and XNet stream sizes, plus adding a standalone Rust “hammer” example to stress the local 4-node network.

Changes:

  • Reduce multiple replica/subnet memory and concurrency limits (heap delta, message routing streams, sandbox resources, query/execution threads).
  • Add hammer.rs load driver example and wire in universal canister dependency for it.
  • Adjust dev/local-net prep to bake nano hypervisor overrides and shorten DKG interval.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| rs/config/src/subnet_config.rs | Shrinks heap-delta iteration cap, reserve, paused DTS executions, and per-canister heap-delta rate limit. |
| rs/config/src/message_routing.rs | Reduces XNet stream target size and max messages per stream. |
| rs/config/src/execution_environment.rs | Cuts subnet memory/message capacities and execution/query parallelism; lowers caches and reservations. |
| rs/config/src/embedders.rs | Lowers stable-memory per-message dirty/accessed limits, sandbox counts/idle time, and compilation/page-copying parallelism. |
| rs/canister_sandbox/src/replica_controller/sandboxed_execution_controller.rs | Adds a minimum sandbox RSS floor and reduces eviction RSS batch size. |
| rs/canister_client/examples/hammer.rs | New stress-test tool for deploying universal canisters and generating mixed load patterns. |
| rs/canister_client/Cargo.toml | Adds ic-universal-canister as a dev-dependency for the new example. |
| dev/local-net/prep.sh | Applies nano hypervisor overrides and sets a shorter DKG interval for local-net. |
| Cargo.lock | Lockfile update for the added dev-dependency. |


Comment on lines +106 to +110
// Nano-replica profile: keep a single round's heap-delta production below the
// SUBNET_HEAP_DELTA_CAPACITY (96 MiB) so one round cannot overshoot the cap and
// spike unreclaimable (anonymous) resident memory. This bounds the per-round
// dirty working set so writes stay safe on a 512 MiB - 1 GiB VM.
const MAX_HEAP_DELTA_PER_ITERATION: NumBytes = NumBytes::new(64 * M);
Comment on lines 33 to +41
/// This specifies the threshold in bytes at which the subnet memory usage is
/// considered to be high. If this value is greater or equal to the subnet
/// capacity, then the subnet is never considered to have high usage.
const SUBNET_MEMORY_THRESHOLD: NumBytes = NumBytes::new(750 * GIB);
// Nano-replica profile: set equal to the subnet memory capacity so the subnet
// is never considered "high usage" and the storage cycle-reservation mechanism
// stays disabled — canisters can allocate freely up to the subnet capacity
// without reserving cycles (reservation pricing is calibrated for mainnet and
// would otherwise reject growth on a tiny subnet).
const SUBNET_MEMORY_THRESHOLD: NumBytes = NumBytes::new(512 * MIB);
Comment on lines +593 to +596
// ---- Heap-read storm (analogue of the stable READ test) ----
// get_global_data reads the whole 40 MiB global in one execution — more
// than the 32 MiB stable per-message accessed limit would ever allow.
let qry_cans = Arc::new(vec![canisters[canisters.len() - 1]]);
Comment on lines +606 to +607
us.report("HEAP-READ-UPDATE (40 MiB heap read)", t.elapsed());
qs.report("HEAP-READ-QUERY (40 MiB heap read)", t.elapsed());
Comment on lines +439 to +442
// Populate each canister with ~120 MiB of real stable data (written in
// <=24 MiB chunks to respect the 32 MiB per-message dirty limit).
const BIG_MIB: u32 = 128;
let chunk: u32 = 24 * MIB;
Comment on lines +666 to +668
println!("\n[5/5] MEMORY-GROWTH storm: grow 16 MiB + fill per call across all canisters until rejected");
let grow = Arc::new(Stats::default());
let total_mib = Arc::new(AtomicU64::new(0));
for h in handles {
let _ = h.await;
}
grow.report("MEMORY-GROWTH", Duration::from_secs(1));
Comment on lines 68 to 73
/// The number of sandbox processes to evict in one go in order to amortize
/// for the eviction cost. A large number could lead to the eviction
/// of many sandboxes and increased system load. The number was chosen
/// based on the assumption of 800 canister executions per round
/// distributed across 4 execution cores.
const SANDBOX_PROCESSES_TO_EVICT: usize = 200;

2 participants