perf(stdlib): skip getBytes(UTF_8) for ASCII-safe base64 inputs by He-Pin · Pull Request #863 · databricks/sjsonnet

He-Pin · 2026-05-23T00:20:09Z

Motivation

std.base64 on ~4 KB ASCII inputs was the largest synthetic gap to jrsonnet at the time this
PR was first written. Profiling pointed to two costs that are paid before the SIMD encoder runs:

1. String#getBytes(UTF_8): a full pre-count scan over every char looking for non-ASCII,
   followed by an Array[Byte] heap allocation.
2. On Scala Native, the intermediate Array[Byte] is then copied into a zone buffer for the C
   library, so we pay two passes plus an extra allocation.

For pure-ASCII inputs we can skip the UTF-8 codec entirely.

Key Design Decision

Val.AsciiSafeStr (introduced earlier on master) tags strings that the parser has already proven
to contain only chars in 0x20–0x7F, excluding " and \ (via CharSWAR.isAsciiJsonSafe). When
std.base64 receives such a Val.Str, we know the entire input encodes as one byte per char.

I added PlatformBase64.encodeStringToString(input: String, asciiSafe: Boolean): String so each
platform can pick the cheapest path:

- Scala Native: bypass the heap Array[Byte] entirely; copy directly from the String into
  the zone-allocated source buffer with input.charAt(i).toByte in a tight loop, then call
  libbase64.base64_encode. Skips both the UTF-8 codec and the intermediate array.
- JVM / JS: use ISO_8859_1 instead of UTF_8. With Java 9+ compact strings a pure-ASCII
  string is LATIN1-tagged, so getBytes(ISO_8859_1) is approximately Arrays.copyOf; we skip
  the UTF-8 pre-count scan. We still allocate the Array[Byte] because java.util.Base64 only
  accepts byte arrays.
- Non-ASCII inputs (asciiSafe=false) keep the existing getBytes(UTF_8) path.
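
The JVM / JS choice described above can be sketched as follows. This is a minimal illustration, not the code in the PR: the object name Base64Sketch is hypothetical, and only the signature encodeStringToString(input, asciiSafe) and the two charsets are taken from the description.

```scala
import java.nio.charset.StandardCharsets.{ISO_8859_1, UTF_8}
import java.util.Base64

object Base64Sketch {
  // Hypothetical stand-in for the JVM variant of PlatformBase64.encodeStringToString.
  def encodeStringToString(input: String, asciiSafe: Boolean): String = {
    val bytes =
      if (asciiSafe) input.getBytes(ISO_8859_1) // one byte per char, no UTF-8 pre-count
      else input.getBytes(UTF_8)                // general path for non-ASCII input
    // java.util.Base64 only accepts byte arrays, so the array is still allocated.
    Base64.getEncoder.encodeToString(bytes)
  }
}
```

For chars at or below 0x7F both charsets produce the same bytes, so the flag only selects a cheaper route to an identical result. Passing asciiSafe = true for a string that is not actually ASCII would produce different output, which is why the flag has to come from a trusted tag.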

The dispatch is a single isInstanceOf check. The pattern changed from case Val.Str(_, value) => to
case s: Val.Str => because the extractor drops the subclass identity.
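
To illustrate why a typed pattern is needed, here is a small sketch with simplified stand-in classes. Str and AsciiSafeStr below are hypothetical reductions of the sjsonnet types (the real Val.Str also carries a position, among other things), so treat this only as a model of the matching behaviour.

```scala
object DispatchSketch {
  // Simplified stand-ins for Val.Str and Val.AsciiSafeStr (assumption, not the real classes).
  class Str(val value: String)
  object Str {
    def unapply(s: Str): Option[String] = Some(s.value)
  }
  final class AsciiSafeStr(v: String) extends Str(v)

  // Extractor pattern: only the String is bound, so the subclass tag is no longer visible.
  def extractValue(x: Any): Option[String] = x match {
    case Str(value) => Some(value)
    case _          => None
  }

  // Typed pattern: the instance itself is bound, so one isInstanceOf check can pick the path.
  def takesFastPath(x: Any): Boolean = x match {
    case s: Str => s.isInstanceOf[AsciiSafeStr]
    case _      => false
  }
}
```

With the extractor form, a plain Str and an AsciiSafeStr holding the same text are indistinguishable once matched; the typed form keeps the information the fast path depends on.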

Modification

- sjsonnet/src/sjsonnet/stdlib/EncodingModule.scala: call the new API for std.base64.
- sjsonnet/src-jvm/sjsonnet/stdlib/PlatformBase64.scala: JVM/Scala-JS variant uses ISO-8859-1.
- sjsonnet/src-js/sjsonnet/stdlib/PlatformBase64.scala: same as JVM.
- sjsonnet/src-native/sjsonnet/stdlib/PlatformBase64.scala: new encodeStringToString that
  writes chars directly into the zone buffer; encodeToString(Array[Byte]) retained for
  Val.ByteArr inputs.
- sjsonnet/test/src/sjsonnet/Base64Tests.scala: new asciiSafeFastPath test group:
  large ASCII literal roundtrip, string-path vs UTF-8-bytes-path byte equivalence, and a large
  unicode string to keep the slow path covered.
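
The byte-equivalence property that the string path relies on can be checked in isolation. The sketch below is not taken from Base64Tests; the names NarrowingSketch, narrow, and pathsAgree are hypothetical, and a heap array stands in for the zone-allocated buffer used on Scala Native.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object NarrowingSketch {
  // Narrow each char to one byte, mirroring the input.charAt(i).toByte loop
  // described for the Native fast path (heap array used here instead of a zone buffer).
  def narrow(input: String): Array[Byte] = {
    val out = new Array[Byte](input.length)
    var i = 0
    while (i < input.length) {
      out(i) = input.charAt(i).toByte
      i += 1
    }
    out
  }

  // Every char in 0x20..0x7F except '"' and '\\', the range stated for AsciiSafeStr.
  val asciiSafeSample: String =
    (0x20 to 0x7f).map(_.toChar).filter(c => c != '"' && c != '\\').mkString

  // True when per-char narrowing yields exactly the UTF-8 bytes.
  def pathsAgree(s: String): Boolean =
    java.util.Arrays.equals(narrow(s), s.getBytes(UTF_8))
}
```

The two paths agree on the whole ASCII-safe range and diverge as soon as a char above 0x7F appears, which is the case the retained slow path exists for.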

Benchmark Results

Update (re-validated on current master): the original "3.28× → 1.13×" headline below was
measured against an older baseline (pre #861/#862). Since those PRs propagated
AsciiSafeStr through the parser and string operations, much of the win is already on master.
On the current baseline this PR contributes an additional ~3% mean / ~5% min on the
synthetic ASCII benchmark.

Re-run on the current baseline (upstream/master @ fcd444cc), Apple M3 Pro, Scala Native
immix+full LTO, hyperfine -N -w 30 --runs 300:

| Variant | mean (ms) | σ | min (ms) | vs jrsonnet |
|---|---|---|---|---|
| jrsonnet 0.5.0-pre98 | 5.8 | 5.9 (outlier-skewed) | 2.7 | baseline |
| sjsonnet with this PR | 6.2 | 0.6 | 5.2 | 1.07× |
| sjsonnet master (no #863) | 6.4 | 0.6 | 5.5 | 1.11× |

The slow path for non-ASCII inputs is unchanged. JVM/JS receive a smaller win (skip the UTF-8
pre-count scan) but no Native FFI improvement. All tests pass.

Original measurement (kept for historical reference — pre #861/#862 baseline)

Benchmark 1: sjsonnet  std.base64
  Time (mean ± σ):     7.2 ms ±  1.0 ms    [User: 2.4 ms, System: 1.7 ms]
  Range (min … max):   4.6 ms … 13.1 ms    434 runs

Benchmark 2: jrsonnet  std.base64
  Time (mean ± σ):     6.4 ms ±  2.3 ms    [User: 1.4 ms, System: 1.0 ms]
  Range (min … max):   3.6 ms … 15.4 ms    506 runs

| Variant | before | after | speedup |
|---|---|---|---|
| sjsonnet native | 10.57 ms | 7.20 ms | 1.47× |
| ratio vs jrsonnet | 3.28× | 1.13× | gap effectively closed |

./mill __.test runs 514 JVM + 447 cross-platform tests; all green after the change.

Analysis

The original PR closed a 3.28× gap by skipping the UTF-8 codec on ASCII-safe inputs. Subsequent
ASCII-safe propagation work on master independently absorbed most of that win. The remaining
~3-5% delta this PR contributes is real but modest. The change is contained: only inputs already
proven pure-ASCII by the parser take the fast path; non-ASCII paths (Chinese / emoji literals
tested in Base64Tests.unicode) keep the original behaviour byte-for-byte.

The min-time gap to jrsonnet (5.2 ms vs 2.7 ms) suggests the remaining bottleneck is no longer
in getBytes but in the encoder itself (SIMD vs java.util.Base64 / aklomp libbase64) or in
result string allocation; closing it is out of scope for this PR.

References

Gap analysis: docs/perf-gap-vs-jrsonnet.md
Upstream Val.AsciiSafeStr introduction commit: https://github.com/databricks/sjsonnet/blob/master/sjsonnet/src/sjsonnet/Val.scala
jrsonnet benchmark methodology: https://github.com/CertainLach/jrsonnet/blob/master/nix/benchmarks.nix
aklomp/base64 (Native FFI): https://github.com/aklomp/base64

Result

For std.base64 on ASCII inputs, this PR delivers an additional ~3-5% over the current master
baseline, on top of the much larger win already landed via #861/#862. Non-ASCII slow path is
unchanged byte-for-byte. All multi-platform tests pass.

When std.base64 is called on a Val.AsciiSafeStr (a parser-tagged string that is proven to contain only chars in 0x20-0x7F with no quote/backslash), the prior code path materialised an intermediate Array[Byte] via String#getBytes(UTF_8). On Scala Native that step walks the entire input checking for non-ASCII codepoints and allocates an extra heap array before the SIMD encoder even sees the data; on a 3.5 KB Lorem-ipsum-style input that pre-pass dominated the encode cost.

This change adds PlatformBase64.encodeStringToString(input, asciiSafe). On Native, the ASCII fast path writes input.charAt(i).toByte straight into the zone-allocated source buffer for the SIMD encoder, skipping both the UTF-8 codec and the heap Array[Byte]. On JVM/JS the fast path uses ISO-8859-1 (single byte-narrowing pass, no encoder branches; identical bytes to UTF-8 for chars <= 0x7F) for parity.

For non-ASCII inputs (e.g. user literals with Chinese / emoji content) the slow path remains correct via getBytes(UTF_8); the unicode test cases in Base64Tests cover that branch.

Hyperfine, std.base64 on go-jsonnet benchmark (Apple M3 Pro, Scala Native 0.6.4-SNAPSHOT, immix GC, full LTO; jrsonnet 0.5.0-pre99):

  sjsonnet before: 10.57 ms (3.28x slower than jrsonnet)
  sjsonnet after : 7.20 ms (1.13x slower than jrsonnet) -- effectively tied
  jrsonnet       : 6.40 ms

Motivation: A review of PR databricks#863 found inconsistent comments describing the AsciiSafeStr character range.

Modification: Align Val.AsciiSafeStr, std.base64, and the Native base64 fast path comments with the actual CharSWAR contract: chars in 0x20-0x7F excluding quote and backslash.

Result: The Scala 3 JVM tests pass locally.

References: databricks#863

## Motivation

`async-profiler` on the Scala Native kube-prometheus workload shows `HeapCharBuffer.wrap` accounting for **40.3%** of GC-allocation parents (GC itself is ~25–30% of native runtime). The wrap site is `String.getBytes(UTF_8)`, called once per long (≥128 char) JSON string inside `BaseByteRenderer.visitLongString`. Each call also allocates an output `byte[]`.

In K8s manifest output, the overwhelming majority of these long values (descriptions, annotations, base64 blobs, paths) are pure printable ASCII with no JSON-escape characters.

## Key Design Decision

Use the existing `Platform.isAsciiJsonSafe` SWAR scan (16 chars/Long word, no allocation) as a cheap probe up-front. On a positive probe, route to the existing `renderAsciiSafeString` fast path, which uses `Platform.copyAsciiStringToBytes` for direct char→byte memcpy. On a negative probe, fall through unchanged to the byte-SWAR path.

The extra cost paid by the non-ASCII branch is one SWAR scan over chars (~bLen/16 Long reads), which is dominated by the encode cost that branch already performs.

## Modification

At the top of `visitLongString` in `BaseByteRenderer.scala`, probe with `Platform.isAsciiJsonSafe(str)` and delegate to `renderAsciiSafeString(str)` when clean. Otherwise the original code runs unchanged.

## Benchmark Results

`./mill bench.runRegressions` completes across all cpp/go/sjsonnet suites with no anomalies.

**hyperfine** (Scala Native AOT, kube-prometheus, 60 runs, 8 warmup):

| Binary | Mean | σ | Range |
|---|---|---|---|
| before (master HEAD `fcd444cc`) | 150.7 ms | ±8.3 ms | 140.5 – 177.3 |
| after | 145.9 ms | ±6.2 ms | 138.6 – 168.1 |

**After is 1.03× faster** (−4.8 ms mean, −3.2%). σ ratio = 0.07, improvement is reproducible across runs.

## Analysis

Modest but real. `visitLongString` is one call per long output string, so on a 72k-line kube-prom output we hit it on the order of 10⁴ times. Each spared call avoids two heap allocations and a `CharsetEncoder` dispatch, which directly reduces the `HeapCharBuffer.wrap` GC pressure that profiling identified as the largest single allocation source.

Larger gains require attacking the remaining UTF-8 path itself; follow-up commits target the escape-needing branch and `PlatformBase64` zero-copy.

## References

- async-profiler CPU + allocation-parent analysis on `/tmp/sjsonnet-yaml-fix`
- Existing helpers reused: `Platform.isAsciiJsonSafe`, `renderAsciiSafeString`, `Platform.copyAsciiStringToBytes`
- Related: #863 (base64 ASCII input fast path), #864 (BaseRenderer.escape bulk write)

## Result

- ✅ `./mill 'sjsonnet.jvm[3.3.7]'.test`: 444/444 pass
- ✅ Byte-identical output on kube-prometheus (1.5 MB / 72k lines diff -q clean)
- ✅ +3.2% wall-clock on Scala Native kube-prom
- ✅ No regression on `bench.runRegressions`

Co-authored-by: He-Pin <kerr.hepin@gmail.com>
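
The probe-then-delegate step in the renderer commit above can be modelled with a scalar check. This is only a sketch: the real `Platform.isAsciiJsonSafe` is described as a SWAR scan, whereas the loop below tests one char at a time, and `AsciiProbeSketch` and `renderPath` are hypothetical names that return a label instead of rendering anything. The accepted range (0x20–0x7F minus quote and backslash) is the contract stated elsewhere in this PR.

```scala
object AsciiProbeSketch {
  // Scalar stand-in for the SWAR probe: true when every char is in 0x20..0x7F
  // and is neither '"' nor '\\', i.e. needs no JSON escaping and fits in one byte.
  def isAsciiJsonSafe(s: String): Boolean = {
    var i = 0
    while (i < s.length) {
      val c = s.charAt(i)
      if (c < 0x20 || c > 0x7f || c == '"' || c == '\\') return false
      i += 1
    }
    true
  }

  // Hypothetical dispatcher: reports which branch a long string would take.
  def renderPath(s: String): String =
    if (isAsciiJsonSafe(s)) "ascii-fast" else "utf8-fallback"
}
```

A string that fails the probe costs one extra scan before falling through, which matches the trade-off the commit describes for the non-ASCII branch.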

Motivation: Scala Native kube-prometheus rendering still showed write/output overhead after the renderer and strict JSON import stack. `NativeOutputStream` already bypasses the JVM-compatible `PrintStream` path by writing through C `fwrite`, but stdout still used the platform default stdio buffering.

Key Design Decision: Keep this optimization Native-only and local to stdout buffering. Instead of changing renderer flush thresholds or `ByteBuilder` behavior globally, configure the C stdio stream with full buffering before any `NativeOutputStream` writes occur. Passing a null buffer lets libc own the buffer lifetime, so the Scala object does not need to retain native memory.

Modification:
- Configure `NativeOutputStream` with `setvbuf(file, null, _IOFBF, 256 KiB)` during construction.
- Leave JVM, JS, YAML, expect-string, and file-output code paths unchanged.
- Preserve existing explicit `flush()` behavior for trailing newline and close handling.

Benchmark Results:

Workload: `jrsonnet/tests/realworld/entry-kube-prometheus.jsonnet -J vendor`

Candidate was benchmarked on the Scala Native 0.5.12 stacked exploration branch against clean `cf7b8af9`.

| Order | Clean | Candidate | Result |
| --- | ---: | ---: | ---: |
| Forward mean | 218.848 ms | 188.528 ms | -13.9% |
| Forward median | 215.517 ms | 187.368 ms | -13.1% |
| Reverse mean | 224.045 ms | 183.701 ms | -18.0% |
| Reverse median | 224.281 ms | 182.914 ms | -18.4% |

Output equality matched by `cmp`.

Validation:
- `./mill --no-server --ticker false --color false __.reformat`
- `./mill --no-server --ticker false --color false -j 1 __.test`: 444 passed, 0 failed
- `./mill --no-server --ticker false --color false bench.runRegressions`

Analysis: This is a lower-risk write/flush optimization than increasing `ByteBuilder` thresholds: it does not alter rendering order, JSON escaping, object materialization, or JVM/JS behavior. It only changes the buffering policy of the Native stdout `FILE*`, and explicit flushes still happen at the same public boundaries.

References:
- Scala Native 0.5.12 migration PR: #867
- Related performance stack context: #863, #864, #865, #866, #868

Result: Native stdout rendering writes fewer/smoother buffered chunks for large JSON output while preserving byte-identical output and the existing flush contract.

Motivation: The Native stdout buffering follow-up showed that downstream buffering can materially reduce large-output write overhead. JSON `-o` output still sent `ByteRenderer` chunks directly to the file output stream, relying only on `ByteBuilder`'s internal flush threshold.

Key Design Decision: Keep the change local to the JSON output-file fast path. Rather than changing `ByteBuilder` thresholds globally, wrap the file output stream in a `BufferedOutputStream` with the same 256 KiB output buffer size used for the Native stdout buffering follow-up. YAML, expect-string, stdout, and renderer semantics stay unchanged.

Modification:
- Add `OutputBufferSize = 256 * 1024` in `SjsonnetMainBase`.
- Wrap JSON output-file `ByteRenderer` targets in `BufferedOutputStream(out, OutputBufferSize)`.
- Flush the buffered stream at the same completion boundary before closing the underlying file output stream.

Benchmark Results:

Workload: `jrsonnet/tests/realworld/entry-kube-prometheus.jsonnet -J vendor -o /tmp/fileout-*.json`

Candidate was benchmarked on the Scala Native 0.5.12 stacked exploration branch after the Native stdout buffering commit.

| Order | Clean | Candidate | Result |
| --- | ---: | ---: | ---: |
| Forward mean | 217.372 ms | 205.062 ms | -5.7% |
| Forward median | 196.625 ms | 183.491 ms | -6.7% |
| Reverse mean | 210.517 ms | 177.174 ms | -15.8% |
| Reverse median | 193.394 ms | 175.878 ms | -9.1% |

Output equality matched by `cmp`.

Validation:
- `./mill --no-server --ticker false --color false __.reformat`
- `./mill --no-server --ticker false --color false -j 1 __.test`: 444 passed, 0 failed
- `./mill --no-server --ticker false --color false bench.runRegressions`

Analysis: This preserves the existing rendering pipeline and only changes the buffering layer for file output. It avoids global `ByteBuilder` threshold changes, keeps stdout behavior separate, and does not affect YAML or expect-string paths.

References:
- Native stdout buffering PR: #869
- Scala Native 0.5.12 migration PR: #867
- Related performance stack context: #863, #864, #865, #866, #868

Result: Large JSON file output writes are buffered more effectively while preserving byte-identical output and the existing flush/close contract.
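
The file-output buffering described in the commit above amounts to wrapping the target stream and flushing at completion. The sketch below shows that shape under stated assumptions: `FileBufferSketch`, `writeChunks`, and `demo` are hypothetical names, an in-memory `ByteArrayOutputStream` stands in for the file stream, and only `OutputBufferSize = 256 * 1024` and the use of `BufferedOutputStream` come from the commit message.

```scala
import java.io.{BufferedOutputStream, ByteArrayOutputStream, OutputStream}
import java.nio.charset.StandardCharsets.UTF_8

object FileBufferSketch {
  // Same buffer size the commit message names.
  val OutputBufferSize: Int = 256 * 1024

  // Write renderer-style chunks through a buffered wrapper, then flush at the
  // completion boundary so the underlying stream holds every byte before it is closed.
  def writeChunks(target: OutputStream, chunks: Seq[Array[Byte]]): Unit = {
    val buffered = new BufferedOutputStream(target, OutputBufferSize)
    chunks.foreach(chunk => buffered.write(chunk))
    buffered.flush()
  }

  // In-memory demonstration: the sink stands in for the file output stream.
  def demo(): String = {
    val sink = new ByteArrayOutputStream()
    writeChunks(sink, Seq("{\"a\":".getBytes(UTF_8), "1}".getBytes(UTF_8)))
    sink.toString("UTF-8")
  }
}
```

Small chunks accumulate in the wrapper and reach the underlying stream in fewer, larger writes; the explicit flush is what keeps the output byte-identical to the unbuffered version.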

He-Pin marked this pull request as ready for review May 23, 2026 00:23

He-Pin marked this pull request as draft May 23, 2026 06:21

He-Pin marked this pull request as ready for review May 23, 2026 06:24

He-Pin mentioned this pull request May 23, 2026

perf: skip UTF-8 encode for clean-ASCII long strings in renderer #866

Merged

This was referenced May 23, 2026

perf: buffer native stdout writes #869

Merged

perf: buffer json file output #870

Merged

stephenamar-db merged commit 2cae1e5 into databricks:master May 27, 2026
5 checks passed

stephenamar-db merged 2 commits into databricks:master from He-Pin:perf/base64-ascii-safe-input