perf(stdlib): skip getBytes(UTF_8) for ASCII-safe base64 inputs #863

Merged
stephenamar-db merged 2 commits into databricks:master from He-Pin:perf/base64-ascii-safe-input on May 27, 2026

@He-Pin He-Pin commented May 23, 2026

Motivation

std.base64 on ~4 KB ASCII inputs was the largest synthetic gap to jrsonnet at the time this
PR was first written. Profiling pointed to two costs that are paid before the SIMD encoder runs:

  1. String#getBytes(UTF_8) — a full pre-count scan over every char looking for non-ASCII,
    followed by an Array[Byte] heap allocation.
  2. On Scala Native, the intermediate Array[Byte] is then copied into a zone buffer for the C
    library; we pay two passes plus an extra allocation.

For pure-ASCII inputs we can skip the UTF-8 codec entirely.

Key Design Decision

Val.AsciiSafeStr (introduced earlier on master) tags strings the parser has already proven
contain only chars in 0x20–0x7F minus " and \ (via CharSWAR.isAsciiJsonSafe). When
std.base64 receives such a Val.Str, we know the entire input is one byte per char.

I added PlatformBase64.encodeStringToString(input: String, asciiSafe: Boolean): String so each
platform can pick the cheapest path:

  • Scala Native — bypass the heap Array[Byte] entirely; copy directly from the String into
    the zone-allocated source buffer with input.charAt(i).toByte in a tight loop, then call
    libbase64.base64_encode. Skips both the UTF-8 codec and the intermediate array.
  • JVM / JS — use ISO_8859_1 instead of UTF_8. With Java 9+ compact strings a pure-ASCII
    string is LATIN1-tagged, so getBytes(ISO_8859_1) is approximately Arrays.copyOf; we skip
    the UTF-8 pre-count scan. We still allocate the Array[Byte] because java.util.Base64 only
    accepts byte arrays.
  • Non-ASCII inputs (asciiSafe=false) keep the existing getBytes(UTF_8) path.
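
The JVM / JS branch of the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: the object name Base64Sketch is hypothetical, and only the method name and parameters (encodeStringToString, input, asciiSafe) come from the description.

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

// Hypothetical stand-in for the JVM PlatformBase64; only the method
// signature is taken from the PR description.
object Base64Sketch {
  def encodeStringToString(input: String, asciiSafe: Boolean): String = {
    val bytes =
      if (asciiSafe) input.getBytes(StandardCharsets.ISO_8859_1) // one byte per char
      else input.getBytes(StandardCharsets.UTF_8)                // general path
    // java.util.Base64 only accepts byte arrays, so the array is still allocated.
    Base64.getEncoder.encodeToString(bytes)
  }
}
```

For pure-ASCII input both branches produce the same bytes, so the flag changes cost and not output. Passing asciiSafe = true for a string that is not actually ASCII would be incorrect, since ISO_8859_1 encodes chars in 0x80–0xFF differently from UTF-8 and replaces anything above 0xFF with '?'.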

The dispatch is a single isInstanceOf check. The pattern changed from case Val.Str(_, value) => to
case s: Val.Str => because the extractor form binds only the fields and drops the subclass identity.
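
To illustrate why the typed pattern matters, here is a small sketch with simplified, hypothetical stand-ins for Val.Str and Val.AsciiSafeStr (the real classes carry more fields, such as a position). It only shows the matching behaviour, not sjsonnet's actual class hierarchy.

```scala
object DispatchSketch {
  // Hypothetical, simplified stand-ins for Val.Str and Val.AsciiSafeStr.
  class Str(val value: String)
  object Str { def unapply(s: Str): Some[String] = Some(s.value) }
  final class AsciiSafeStr(value: String) extends Str(value)

  // Extractor pattern: only the String is bound, the subclass tag is no longer visible.
  def viaExtractor(v: Str): String = v match {
    case Str(value) => value
  }

  // Typed pattern: the original reference is kept, so the subclass can be tested.
  def isFastPath(v: Any): Boolean = v match {
    case s: Str => s.isInstanceOf[AsciiSafeStr]
    case _      => false
  }
}
```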

Modification

  • sjsonnet/src/sjsonnet/stdlib/EncodingModule.scala — call the new API for std.base64.
  • sjsonnet/src-jvm/sjsonnet/stdlib/PlatformBase64.scala — JVM/Scala-JS variant uses ISO-8859-1.
  • sjsonnet/src-js/sjsonnet/stdlib/PlatformBase64.scala — same as JVM.
  • sjsonnet/src-native/sjsonnet/stdlib/PlatformBase64.scala — new encodeStringToString that
    writes chars directly into the zone buffer; encodeToString(Array[Byte]) retained for
    Val.ByteArr inputs.
  • sjsonnet/test/src/sjsonnet/Base64Tests.scala — new asciiSafeFastPath test group:
    large ASCII literal roundtrip, string-path vs UTF-8-bytes-path byte equivalence, and a large
    unicode string to keep the slow path covered.

Benchmark Results

Update (re-validated on current master): the original "3.28× → 1.13×" headline below was
measured against an older baseline (pre #861/#862). Since those PRs propagated
AsciiSafeStr through the parser and string operations, much of the win is already on master.
On the current baseline this PR contributes an additional ~3% mean / ~5% min on the
synthetic ASCII benchmark.

Re-run on the current baseline (upstream/master @ fcd444cc), Apple M3 Pro, Scala Native
immix+full LTO, hyperfine -N -w 30 --runs 300:

| Variant | mean (ms) | σ | min (ms) | vs jrsonnet |
|---|---|---|---|---|
| jrsonnet 0.5.0-pre98 | 5.8 | 5.9 (outlier-skewed) | 2.7 | baseline |
| sjsonnet with this PR | 6.2 | 0.6 | 5.2 | 1.07× |
| sjsonnet master (no #863) | 6.4 | 0.6 | 5.5 | 1.11× |

The slow path for non-ASCII inputs is unchanged. JVM/JS receive a smaller win (skip the UTF-8
pre-count scan) but no Native FFI improvement. All tests pass.

Original measurement (kept for historical reference — pre #861/#862 baseline)

Benchmark 1: sjsonnet  std.base64
  Time (mean ± σ):     7.2 ms ±  1.0 ms    [User: 2.4 ms, System: 1.7 ms]
  Range (min … max):   4.6 ms … 13.1 ms    434 runs

Benchmark 2: jrsonnet  std.base64
  Time (mean ± σ):     6.4 ms ±  2.3 ms    [User: 1.4 ms, System: 1.0 ms]
  Range (min … max):   3.6 ms … 15.4 ms    506 runs

| Variant | before | after | speedup |
|---|---|---|---|
| sjsonnet native | 10.57 ms | 7.20 ms | 1.47× |
| ratio vs jrsonnet | 3.28× | 1.13× | gap effectively closed |

./mill __.test runs 514 JVM + 447 cross-platform tests; all green after the change.

Analysis

The original PR closed a 3.28× gap by skipping the UTF-8 codec on ASCII-safe inputs. Subsequent
ASCII-safe propagation work on master independently absorbed most of that win. The remaining
~3-5% delta this PR contributes is real but modest. The change is contained: only inputs already
proven pure-ASCII by the parser take the fast path; non-ASCII paths (Chinese / emoji literals
tested in Base64Tests.unicode) keep the original behaviour byte-for-byte.

The min-time gap to jrsonnet (5.2 ms vs 2.7 ms) suggests the remaining bottleneck is no longer
in getBytes but in the encoder itself (SIMD vs java.util.Base64 / aklomp libbase64) or in
result string allocation; closing it is out of scope for this PR.

References

Result

For std.base64 on ASCII inputs, this PR delivers an additional ~3-5% over the current master
baseline, on top of the much larger win already landed via #861/#862. Non-ASCII slow path is
unchanged byte-for-byte. All multi-platform tests pass.

When std.base64 is called on a Val.AsciiSafeStr (a parser-tagged string
that is proven to contain only chars in 0x20-0x7F with no quote/backslash),
the prior code path materialised an intermediate Array[Byte] via
String#getBytes(UTF_8). On Scala Native that step walks the entire input
checking for non-ASCII codepoints and allocates an extra heap array before
the SIMD encoder even sees the data; on a 3.5 KB Lorem-ipsum-style input
that pre-pass dominated the encode cost.

This change adds PlatformBase64.encodeStringToString(input, asciiSafe).
On Native, the ASCII fast path writes input.charAt(i).toByte straight into
the zone-allocated source buffer for the SIMD encoder, skipping both the
UTF-8 codec and the heap Array[Byte]. On JVM/JS the fast path uses
ISO-8859-1 (single byte-narrowing pass, no encoder branches; identical
bytes to UTF-8 for chars <= 0x7F) for parity.
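
The char-to-byte narrowing used by the Native fast path can be sketched on the JVM as below. This is an illustration only: a heap Array[Byte] stands in for the zone-allocated source buffer, the FFI call into the encoder is omitted, and the names NarrowSketch and narrowAscii are hypothetical.

```scala
import java.nio.charset.StandardCharsets

object NarrowSketch {
  // Copies each char as a single byte. Only valid when every char is <= 0x7F,
  // which is what the AsciiSafeStr tag guarantees.
  def narrowAscii(input: String): Array[Byte] = {
    val out = new Array[Byte](input.length) // stand-in for the zone buffer
    var i = 0
    while (i < input.length) {
      out(i) = input.charAt(i).toByte
      i += 1
    }
    out
  }

  // For ASCII input the narrowed bytes match the UTF-8 encoding exactly.
  def matchesUtf8(input: String): Boolean =
    java.util.Arrays.equals(narrowAscii(input), input.getBytes(StandardCharsets.UTF_8))
}
```

For a non-ASCII string the two encodings diverge (UTF-8 emits multiple bytes per char), which is why the fast path is gated on the ASCII-safe tag.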

For non-ASCII inputs (e.g. user literals with Chinese / emoji content) the
slow path remains correct via getBytes(UTF_8); the unicode test cases in
Base64Tests cover that branch.

Hyperfine, std.base64 on go-jsonnet benchmark (Apple M3 Pro, Scala Native
0.6.4-SNAPSHOT, immix GC, full LTO; jrsonnet 0.5.0-pre99):
  sjsonnet before: 10.57 ms  (3.28x slower than jrsonnet)
  sjsonnet after :  7.20 ms  (1.13x slower than jrsonnet) -- effectively tied
  jrsonnet       :  6.40 ms
@He-Pin He-Pin marked this pull request as ready for review May 23, 2026 00:23
@He-Pin He-Pin marked this pull request as draft May 23, 2026 06:21
@He-Pin He-Pin marked this pull request as ready for review May 23, 2026 06:24
Motivation:
A review of PR databricks#863 found inconsistent comments describing the AsciiSafeStr character range.

Modification:
Align Val.AsciiSafeStr, std.base64, and the Native base64 fast path comments with the actual CharSWAR contract: chars in 0x20-0x7F excluding quote and backslash.

Result:
The Scala 3 JVM tests pass locally.

References:
databricks#863
@stephenamar-db stephenamar-db merged commit 2cae1e5 into databricks:master May 27, 2026
5 checks passed
stephenamar-db pushed a commit that referenced this pull request May 27, 2026
## Motivation

`async-profiler` on the Scala Native kube-prometheus workload shows
`HeapCharBuffer.wrap` accounting for **40.3%** of GC-allocation parents
(GC itself is ~25–30% of native runtime). The wrap site is
`String.getBytes(UTF_8)` called once per long (≥128 char) JSON string
inside `BaseByteRenderer.visitLongString`. Each call also allocates an
output `byte[]`. In K8s manifest output, the overwhelming majority of
these long values (descriptions, annotations, base64 blobs, paths) are
pure printable ASCII with no JSON-escape characters.

## Key Design Decision

Use the existing `Platform.isAsciiJsonSafe` SWAR scan (16 chars/Long
word, no allocation) as a cheap probe up-front. On positive probe, route
to the existing `renderAsciiSafeString` fast path which uses
`Platform.copyAsciiStringToBytes` for direct char→byte memcpy. On a
negative probe, fall through unchanged to the byte-SWAR path. The extra
cost paid by the non-ASCII branch is one SWAR scan over chars (~bLen/16
Long reads), which is dominated by the encode cost that branch already
performs.
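
A rough sketch of the probe-then-route idea, under stated assumptions: the scalar `isAsciiJsonSafe` below is a stand-in for the SWAR-based `Platform.isAsciiJsonSafe`, using the 0x20–0x7F range (minus quote and backslash) stated for the contract in #863; `renderAsciiSafe` and `renderGeneral` are hypothetical simplifications of the renderer's fast and slow paths, not the real `BaseByteRenderer` code.

```scala
import java.nio.charset.StandardCharsets

object LongStringSketch {
  // Scalar stand-in for the SWAR probe: printable ASCII, no quote, no backslash.
  def isAsciiJsonSafe(s: String): Boolean =
    s.forall(c => c >= 0x20 && c <= 0x7f && c != '"' && c != '\\')

  // Fast path: nothing to escape, one byte per char, no charset encoder involved.
  private def renderAsciiSafe(s: String): Array[Byte] = {
    val out = new Array[Byte](s.length + 2)
    out(0) = '"'.toByte
    var i = 0
    while (i < s.length) { out(i + 1) = s.charAt(i).toByte; i += 1 }
    out(s.length + 1) = '"'.toByte
    out
  }

  // Slow path (simplified): minimal JSON escaping followed by UTF-8 encoding.
  private def renderGeneral(s: String): Array[Byte] = {
    val sb = new java.lang.StringBuilder(s.length + 2)
    sb.append('"')
    s.foreach {
      case '"'           => sb.append("\\\"")
      case '\\'          => sb.append("\\\\")
      case c if c < 0x20 => sb.append("\\u").append(String.format("%04x", Integer.valueOf(c.toInt)))
      case c             => sb.append(c)
    }
    sb.append('"')
    sb.toString.getBytes(StandardCharsets.UTF_8)
  }

  // Probe first; fall through to the general path unchanged on a negative probe.
  def visitLongString(s: String): Array[Byte] =
    if (isAsciiJsonSafe(s)) renderAsciiSafe(s) else renderGeneral(s)
}
```

Both paths must produce identical bytes for any input the probe accepts, which is what makes the routing safe to add without changing output.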

## Modification

At the top of `visitLongString` in `BaseByteRenderer.scala`, probe with
`Platform.isAsciiJsonSafe(str)` and delegate to
`renderAsciiSafeString(str)` when clean. Otherwise the original code
runs unchanged.

## Benchmark Results

`./mill bench.runRegressions` completes across all cpp/go/sjsonnet
suites with no anomalies.

**hyperfine** (Scala Native AOT, kube-prometheus, 60 runs, 8 warmup):

| Binary | Mean | σ | Range |
|---|---|---|---|
| before (master HEAD `fcd444cc`) | 150.7 ms | ±8.3 ms | 140.5 – 177.3 |
| after | 145.9 ms | ±6.2 ms | 138.6 – 168.1 |

**After is 1.03× faster** (−4.8 ms mean, −3.2%). σ ratio = 0.07; the
improvement is reproducible across runs.

## Analysis

Modest but real. `visitLongString` is one call per long output string,
so on a 72k-line kube-prom output we hit it on the order of 10⁴ times.
Each spared call avoids two heap allocations and a `CharsetEncoder`
dispatch, which directly reduces the `HeapCharBuffer.wrap` GC pressure
that profiling identified as the largest single allocation source.

Larger gains require attacking the remaining UTF-8 path itself —
follow-up commits target the escape-needing branch and `PlatformBase64`
zero-copy.

## References

- async-profiler CPU + allocation-parent analysis on
`/tmp/sjsonnet-yaml-fix`
- Existing helpers reused: `Platform.isAsciiJsonSafe`,
`renderAsciiSafeString`, `Platform.copyAsciiStringToBytes`
- Related: #863 (base64 ASCII input fast path), #864
(BaseRenderer.escape bulk write)

## Result

- ✅ `./mill 'sjsonnet.jvm[3.3.7]'.test` — 444/444 pass
- ✅ Byte-identical output on kube-prometheus (1.5 MB / 72k lines diff -q
clean)
- ✅ 3.2% faster wall-clock on Scala Native kube-prom
- ✅ No regression on `bench.runRegressions`

Co-authored-by: He-Pin <kerr.hepin@gmail.com>
stephenamar-db pushed a commit that referenced this pull request May 27, 2026
Motivation:

Scala Native kube-prometheus rendering still showed write/output
overhead after the renderer and strict JSON import stack.
`NativeOutputStream` already bypasses the JVM-compatible `PrintStream`
path by writing through C `fwrite`, but stdout still used the platform
default stdio buffering.

Key Design Decision:

Keep this optimization Native-only and local to stdout buffering.
Instead of changing renderer flush thresholds or `ByteBuilder` behavior
globally, configure the C stdio stream with full buffering before any
`NativeOutputStream` writes occur. Passing a null buffer lets libc own
the buffer lifetime, so the Scala object does not need to retain native
memory.

Modification:

- Configure `NativeOutputStream` with `setvbuf(file, null, _IOFBF, 256
KiB)` during construction.
- Leave JVM, JS, YAML, expect-string, and file-output code paths
unchanged.
- Preserve existing explicit `flush()` behavior for trailing newline and
close handling.

Benchmark Results:

Workload: `jrsonnet/tests/realworld/entry-kube-prometheus.jsonnet -J
vendor`

Candidate was benchmarked on the Scala Native 0.5.12 stacked exploration
branch against clean `cf7b8af9`.

| Order | Clean | Candidate | Result |
| --- | ---: | ---: | ---: |
| Forward mean | 218.848 ms | 188.528 ms | -13.9% |
| Forward median | 215.517 ms | 187.368 ms | -13.1% |
| Reverse mean | 224.045 ms | 183.701 ms | -18.0% |
| Reverse median | 224.281 ms | 182.914 ms | -18.4% |

Output equality matched by `cmp`.

Validation:

- `./mill --no-server --ticker false --color false __.reformat`
- `./mill --no-server --ticker false --color false -j 1 __.test` — 444
passed, 0 failed
- `./mill --no-server --ticker false --color false bench.runRegressions`

Analysis:

This is a lower-risk write/flush optimization than increasing
`ByteBuilder` thresholds: it does not alter rendering order, JSON
escaping, object materialization, or JVM/JS behavior. It only changes
the buffering policy of the Native stdout `FILE*`, and explicit flushes
still happen at the same public boundaries.

References:

- Scala Native 0.5.12 migration PR: #867
- Related performance stack context: #863, #864, #865, #866, #868

Result:

Native stdout rendering writes fewer, larger buffered chunks for large
JSON output while preserving byte-identical output and the existing
flush contract.
stephenamar-db pushed a commit that referenced this pull request May 27, 2026
Motivation:

The Native stdout buffering follow-up showed that downstream buffering
can materially reduce large-output write overhead. JSON `-o` output
still sent `ByteRenderer` chunks directly to the file output stream,
relying only on `ByteBuilder`'s internal flush threshold.

Key Design Decision:

Keep the change local to the JSON output-file fast path. Rather than
changing `ByteBuilder` thresholds globally, wrap the file output stream
in a `BufferedOutputStream` with the same 256 KiB output buffer size
used for the Native stdout buffering follow-up. YAML, expect-string,
stdout, and renderer semantics stay unchanged.

Modification:

- Add `OutputBufferSize = 256 * 1024` in `SjsonnetMainBase`.
- Wrap JSON output-file `ByteRenderer` targets in
`BufferedOutputStream(out, OutputBufferSize)`.
- Flush the buffered stream at the same completion boundary before
closing the underlying file output stream.
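
The wrapping described in the list above might look roughly like this. It is a sketch under assumptions: `FileOutputSketch` and `writeChunks` are hypothetical names, and a sequence of byte arrays stands in for the chunks a `ByteRenderer` would emit; only `OutputBufferSize = 256 * 1024` and the use of `BufferedOutputStream` come from the commit message.

```scala
import java.io.{BufferedOutputStream, OutputStream}

object FileOutputSketch {
  // Same 256 KiB size the commit message gives for OutputBufferSize.
  val OutputBufferSize: Int = 256 * 1024

  // Writes renderer chunks through a buffer, then flushes at the completion
  // boundary. Closing `out` is left to the caller, as in the description.
  def writeChunks(out: OutputStream, chunks: Seq[Array[Byte]]): Unit = {
    val buffered = new BufferedOutputStream(out, OutputBufferSize)
    try chunks.foreach(chunk => buffered.write(chunk))
    finally buffered.flush()
  }
}
```

Small chunks are coalesced in the buffer and reach the underlying stream in fewer writes; the final `flush()` ensures nothing is left behind before the file stream is closed.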

Benchmark Results:

Workload: `jrsonnet/tests/realworld/entry-kube-prometheus.jsonnet -J
vendor -o /tmp/fileout-*.json`

Candidate was benchmarked on the Scala Native 0.5.12 stacked exploration
branch after the Native stdout buffering commit.

| Order | Clean | Candidate | Result |
| --- | ---: | ---: | ---: |
| Forward mean | 217.372 ms | 205.062 ms | -5.7% |
| Forward median | 196.625 ms | 183.491 ms | -6.7% |
| Reverse mean | 210.517 ms | 177.174 ms | -15.8% |
| Reverse median | 193.394 ms | 175.878 ms | -9.1% |

Output equality matched by `cmp`.

Validation:

- `./mill --no-server --ticker false --color false __.reformat`
- `./mill --no-server --ticker false --color false -j 1 __.test` — 444
passed, 0 failed
- `./mill --no-server --ticker false --color false bench.runRegressions`

Analysis:

This preserves the existing rendering pipeline and only changes the
buffering layer for file output. It avoids global `ByteBuilder`
threshold changes, keeps stdout behavior separate, and does not affect
YAML or expect-string paths.

References:

- Native stdout buffering PR: #869
- Scala Native 0.5.12 migration PR: #867
- Related performance stack context: #863, #864, #865, #866, #868

Result:

Large JSON file output writes are buffered more effectively while
preserving byte-identical output and the existing flush/close contract.