
perf: use commix concurrent GC for Native CLI binary #881

Open. He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/native-commix-gc

He-Pin commented May 30, 2026

Summary

Switch the Native CLI release binary from the default immix GC to commix (concurrent immix). commix collects on background threads, overlapping collection with evaluation. This is a one-line build.mill change with byte-identical output and unchanged RSS — it still frees memory, so it stays bounded and safe on small machines.
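For readers unfamiliar with Mill, the kind of change described above might look roughly like the fragment below. This is only a sketch, not the PR's actual diff: the module name `nativeCli` and the version placeholders are hypothetical, and whether `nativeGC` takes a plain string or an `Option[String]` depends on the Mill version in use, so check the `ScalaNativeModule` definition before copying it.

```scala
// build.mill (fragment). Illustrative only: the module name and the version
// strings are placeholders, not the values used in this repository.
import mill._, scalalib._, scalanativelib._

object nativeCli extends ScalaNativeModule {
  def scalaVersion = "<scala version>"               // placeholder
  def scalaNativeVersion = "<scala native version>"  // placeholder

  // Select the garbage collector linked into the binary.
  // Assumption: this Mill version models nativeGC as an optional string;
  // if it expects a plain string, drop the Some(...) wrapper.
  def nativeGC = Some("commix")
}
```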

Why commix

While profiling the Native binary against jrsonnet, the remaining wall-clock gap turned out to be GC/allocation cost, not interpreter dispatch. immix's stop-the-world collections dominate on allocation-heavy configs and cause large latency variance. I benchmarked all four Scala Native GCs (none, immix, commix, boehm) on jrsonnet's realworld config suite (min ms, interleaved + cooled, 20 runs each):

| config | boehm | immix (old default) | commix (this PR) | none (leaks) | jrsonnet |
|---|---|---|---|---|---|
| kube-prometheus | slow | 141.8 | 122.1 | 99.0 | 96.1 |
| loki | - | 43.9 | 40.5 | 33.4 | 21.8 |
| mimir | 72.0 | 51.6 | 47.6 | 39.7 | 26.1 |
| tempo | 66.8 | 45.4 | 45.7 | 36.8 | 23.0 |
| RSS (kube-prometheus) | - | 168 MB | 169 MB | 199 MB | 98 MB |

Ranking is consistent (lower is faster): none < commix < immix < boehm, with commix and immix effectively tied on tempo.

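  • commix is a free win over immix on allocation-heavy configs (about 14% less wall-clock time on kube-prometheus, i.e. 1.16×, and about 8% on loki and mimir), neutral on the lightest (tempo), with the same RSS as immix (it still frees, so there is no leak).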
  • It also collapses immix's stop-the-world latency variance: kube-prometheus went from ±55 ms (max 326 ms) under immix to ±5 ms under commix — far more predictable latency.
  • boehm is the slowest of all four (eliminated).
  • none is fastest (and on kube-prometheus essentially ties jrsonnet) but never frees → OOM risk on huge/adversarial inputs, so it is not a safe default. commix gives most of the safe win.
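The ratios behind these bullets can be rechecked directly from the table. The snippet below is a small sketch for doing so; the object and method names are made up for illustration, and the numbers are copied from the table (boehm is omitted because two of its cells are not reported).

```scala
object GcSpeedups {
  // Minimum wall-clock times in ms, copied from the benchmark table.
  final case class Row(config: String, immix: Double, commix: Double, none: Double, jrsonnet: Double)

  val rows: List[Row] = List(
    Row("kube-prometheus", 141.8, 122.1, 99.0, 96.1),
    Row("loki", 43.9, 40.5, 33.4, 21.8),
    Row("mimir", 51.6, 47.6, 39.7, 26.1),
    Row("tempo", 45.4, 45.7, 36.8, 23.0)
  )

  // Speedup factor: old time divided by new time (> 1 means the new one is faster).
  def speedup(oldMs: Double, newMs: Double): Double = oldMs / newMs

  // Share of the old wall-clock time that was removed, in percent.
  def timeSavedPct(oldMs: Double, newMs: Double): Double = (oldMs - newMs) / oldMs * 100.0

  def main(args: Array[String]): Unit =
    rows.foreach { r =>
      println(
        f"${r.config}%-16s commix vs immix: ${speedup(r.immix, r.commix)}%.2fx " +
          f"(${timeSavedPct(r.immix, r.commix)}%.1f%% saved), " +
          f"gap of none to jrsonnet: ${speedup(r.none, r.jrsonnet)}%.2fx"
      )
    }
}
```

Keeping "speedup factor" and "percent of time saved" as separate functions avoids mixing the two conventions when quoting results.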

The residual gap to jrsonnet on the smaller configs (~1.5×, even with none) is interpreter + allocation throughput, which no GC choice addresses.
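For anyone wanting to reproduce numbers of this kind, a measurement loop in the spirit of "min ms, interleaved + cooled" could look like the sketch below. It is not the harness used for this PR: the object name, the cool-down parameter, and the choice of population standard deviation as the spread metric are all assumptions (the PR does not say how its ± figures were computed).

```scala
import scala.sys.process._

object InterleavedBench {
  final case class Stats(min: Double, max: Double, stddev: Double)

  /** Wall-clock time of one run of `cmd`, in milliseconds. */
  def timeOnceMs(cmd: Seq[String]): Double = {
    val start = System.nanoTime()
    // Discard stdout/stderr so terminal I/O does not distort the timing.
    val exit = cmd.!(ProcessLogger(_ => (), _ => ()))
    val elapsedMs = (System.nanoTime() - start) / 1e6
    require(exit == 0, s"command failed with exit code $exit: ${cmd.mkString(" ")}")
    elapsedMs
  }

  /** Runs every candidate once per round (interleaved), sleeping `coolMs` before each run. */
  def run(candidates: Map[String, Seq[String]], rounds: Int, coolMs: Long): Map[String, Vector[Double]] = {
    var samples = candidates.keys.map(_ -> Vector.empty[Double]).toMap
    for (_ <- 1 to rounds; (name, cmd) <- candidates) {
      Thread.sleep(coolMs)
      samples = samples.updated(name, samples(name) :+ timeOnceMs(cmd))
    }
    samples
  }

  /** Minimum, maximum, and population standard deviation of the samples. */
  def summarize(xs: Seq[Double]): Stats = {
    require(xs.nonEmpty, "need at least one sample")
    val mean = xs.sum / xs.size
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / xs.size
    Stats(xs.min, xs.max, math.sqrt(variance))
  }
}
```

Interleaving the candidates within each round, rather than running all rounds of one binary back to back, spreads thermal and background-load drift evenly across them; reporting the minimum filters out most of the remaining noise.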

Trade-off

commix uses background GC threads (higher total CPU). On a genuinely single-core machine it could in theory regress; on any multi-core machine it is a clear win. The module keeps nativeMultithreading = None and commix still manages its own GC threads correctly.

Verification

  • Native test suite: 462/462 pass with commix.
  • Output byte-identical to immix on all four realworld configs (sha256 verified) and on trivial inputs.
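A byte-identity check of this kind can be sketched with the JDK's `MessageDigest`. The helper below is illustrative only; the object name and the idea of comparing two saved output files are assumptions about how such a check might be scripted, not a description of what was actually run.

```scala
import java.nio.file.{Files, Path}
import java.security.MessageDigest

object OutputDigest {
  /** Lower-case hex SHA-256 digest of the given bytes. */
  def sha256Hex(bytes: Array[Byte]): String =
    MessageDigest
      .getInstance("SHA-256")
      .digest(bytes)
      .map(b => f"${b & 0xff}%02x") // mask to 0..255 so negative bytes format correctly
      .mkString

  /** True when the two files hash to the same SHA-256 digest. */
  def sameOutput(a: Path, b: Path): Boolean =
    sha256Hex(Files.readAllBytes(a)) == sha256Hex(Files.readAllBytes(b))
}
```

A typical use would be to write the output of the immix build and the commix build for the same config to two files and assert `OutputDigest.sameOutput(immixOut, commixOut)`.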

Motivation:
The Native CLI defaulted to immix, whose stop-the-world collections
dominate wall-clock on allocation-heavy configs and cause large latency
variance. Profiling against jrsonnet showed the remaining gap was
GC/allocation cost, not interpreter dispatch.

Modification:
Set nativeGC = "commix" (concurrent immix) on the Native release module.
commix collects on background threads, overlapping collection with
evaluation. Output is byte-identical and RSS is unchanged vs immix
(it still frees -- bounded, safe on small machines).

Result:
jrsonnet realworld suite (min ms, interleaved, cooled):
  kube-prometheus  141.8 -> 122.1  (1.16x)
  loki              43.9 ->  40.5
  mimir             51.6 ->  47.6
  tempo             45.4 ->  45.7  (neutral)
STW variance collapses (kube-prometheus +-55ms -> +-5ms).
RSS 168 -> 169 MB. Native test suite 462/462 pass.
He-Pin commented May 30, 2026

@stephenamar-db FYI @WojciechMazur cc

@He-Pin He-Pin closed this May 30, 2026
@He-Pin He-Pin reopened this May 30, 2026
@He-Pin He-Pin closed this May 30, 2026
@He-Pin He-Pin reopened this May 30, 2026