perf: skip UTF-8 encode for clean-ASCII long strings in renderer by He-Pin · Pull Request #866 · databricks/sjsonnet

He-Pin · 2026-05-23T09:16:27Z

Motivation

async-profiler on the Scala Native kube-prometheus workload shows HeapCharBuffer.wrap accounting for 40.3% of GC-allocation parents (GC itself is ~25–30% of native runtime). The wrap site is String.getBytes(UTF_8) called once per long (≥128 char) JSON string inside BaseByteRenderer.visitLongString. Each call also allocates an output byte[]. In K8s manifest output, the overwhelming majority of these long values (descriptions, annotations, base64 blobs, paths) are pure printable ASCII with no JSON-escape characters.

Key Design Decision

Use the existing Platform.isAsciiJsonSafe SWAR (SIMD-within-a-register) scan (16 chars/Long word, no allocation) as a cheap up-front probe. On a positive probe, route to the existing renderAsciiSafeString fast path, which uses Platform.copyAsciiStringToBytes for a direct char→byte memcpy. On a negative probe, fall through unchanged to the byte-SWAR path. The only extra cost on the non-ASCII branch is one SWAR scan over chars (~bLen/16 Long reads), which is dominated by the encode cost that branch already pays.
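To make the probe concrete, here is a minimal sketch of a SWAR-style "ASCII and JSON-safe" scan over a String. It is not the Platform.isAsciiJsonSafe implementation: the object name, the constants, and the packing of four 16-bit chars into each Long are all assumptions made for illustration, and the real helper's word width and unrolling are not reproduced. "Safe" is assumed here to mean every char is in 0x20..0x7F and is neither `"` nor `\`.

```scala
object AsciiProbeSketch {
  // One 16-bit lane per char, four lanes per Long (illustrative layout).
  private val HighBits    = 0x8000800080008000L // bit 15 of every lane
  private val NonAscii    = 0xFF80FF80FF80FF80L // any bit above 0x7F
  private val AtLeast0x20 = 0x7FE07FE07FE07FE0L // lane + 0x7FE0 sets bit 15 iff lane >= 0x20
  private val NonZero     = 0x7FFF7FFF7FFF7FFFL // lane + 0x7FFF sets bit 15 iff lane != 0
  private val Quotes      = 0x0022002200220022L // '"' in every lane
  private val Backslashes = 0x005C005C005C005CL // '\' in every lane

  // Valid only when every lane is <= 0x7F, so the additions cannot carry across lanes.
  private def lanesAllNonZero(x: Long): Boolean =
    ((x + NonZero) & HighBits) == HighBits

  private def wordIsSafe(w: Long): Boolean =
    (w & NonAscii) == 0L &&                      // all lanes are 7-bit ASCII
      ((w + AtLeast0x20) & HighBits) == HighBits && // no control characters
      lanesAllNonZero(w ^ Quotes) &&             // no lane equals '"'
      lanesAllNonZero(w ^ Backslashes)           // no lane equals '\'

  private def charIsSafe(c: Char): Boolean =
    c >= 0x20 && c < 0x80 && c != '"' && c != '\\'

  def isAsciiJsonSafe(s: String): Boolean = {
    val n = s.length
    var i = 0
    // Main loop: test four chars per Long without allocating.
    while (i + 4 <= n) {
      val w = s.charAt(i).toLong |
        (s.charAt(i + 1).toLong << 16) |
        (s.charAt(i + 2).toLong << 32) |
        (s.charAt(i + 3).toLong << 48)
      if (!wordIsSafe(w)) return false
      i += 4
    }
    // Scalar tail for the last 0 to 3 chars.
    while (i < n) {
      if (!charIsSafe(s.charAt(i))) return false
      i += 1
    }
    true
  }
}
```

The ASCII check runs first and short-circuits, which is what makes the later lane additions safe: once every lane is known to be at most 0x7F, adding a per-lane constant cannot overflow into the neighbouring lane.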

Modification

At the top of visitLongString in BaseByteRenderer.scala, probe with Platform.isAsciiJsonSafe(str) and delegate to renderAsciiSafeString(str) when clean. Otherwise the original code runs unchanged.
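The shape of that change can be sketched as a probe followed by a two-way dispatch. The code below is an illustration under stated assumptions, not the BaseByteRenderer source: the object, the scalar probe standing in for the SWAR scan, the ByteArrayOutputStream target, and the simplified `\u00XX` escaping in the general path are all hypothetical. It only shows that a clean-ASCII string can be copied char by char with no String.getBytes(UTF_8) call while producing the same bytes as the general path.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets.UTF_8

object LongStringDispatchSketch {
  private val Quote = '"'.toInt
  private val Backslash = '\\'.toInt
  private val Hex = "0123456789abcdef"

  // Scalar stand-in for the SWAR probe (assumption: safe = 0x20..0x7F, no '"' or '\').
  def isAsciiJsonSafe(s: String): Boolean = {
    var i = 0
    while (i < s.length) {
      val c = s.charAt(i)
      if (c < 0x20 || c >= 0x80 || c == '"' || c == '\\') return false
      i += 1
    }
    true
  }

  // Fast path: each char is one ASCII byte, so narrow it directly; no encoder involved.
  def renderAsciiSafe(s: String, out: ByteArrayOutputStream): Unit = {
    val n = s.length
    val buf = new Array[Byte](n + 2)
    buf(0) = Quote.toByte
    var i = 0
    while (i < n) { buf(i + 1) = s.charAt(i).toByte; i += 1 }
    buf(n + 1) = Quote.toByte
    out.write(buf, 0, buf.length)
  }

  // General path: encode to UTF-8 first, then escape byte by byte (simplified escaping).
  def renderGeneral(s: String, out: ByteArrayOutputStream): Unit = {
    out.write(Quote)
    val bytes = s.getBytes(UTF_8)
    var i = 0
    while (i < bytes.length) {
      val b = bytes(i).toInt
      if (b == Quote || b == Backslash) { out.write(Backslash); out.write(b) }
      else if (b >= 0 && b < 0x20) {
        out.write(Backslash); out.write('u'.toInt); out.write('0'.toInt); out.write('0'.toInt)
        out.write(Hex.charAt(b >> 4).toInt); out.write(Hex.charAt(b & 0xF).toInt)
      } else out.write(b) // write(Int) keeps only the low 8 bits, so UTF-8 bytes pass through
      i += 1
    }
    out.write(Quote)
  }

  // Probe first; fall through to the unchanged general path on a negative probe.
  def visitLongString(s: String, out: ByteArrayOutputStream): Unit =
    if (isAsciiJsonSafe(s)) renderAsciiSafe(s, out) else renderGeneral(s, out)
}
```

Because the fast path is only taken when no char needs escaping or multi-byte encoding, both paths agree on every input the probe accepts, which is the property behind the byte-identical output check.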

Benchmark Results

./mill bench.runRegressions completes across all cpp/go/sjsonnet suites with no anomalies.

hyperfine (Scala Native AOT, kube-prometheus, 60 runs, 8 warmup):

| Binary | Mean | σ | Range |
| --- | ---: | ---: | ---: |
| before (master HEAD `fcd444cc`) | 150.7 ms | ±8.3 ms | 140.5 – 177.3 ms |
| after | 145.9 ms | ±6.2 ms | 138.6 – 168.1 ms |

After is 1.03× faster (−4.8 ms mean, −3.2%). σ ratio = 0.07; the improvement is reproducible across runs.

Analysis

Modest but real. visitLongString is one call per long output string, so on a 72k-line kube-prom output we hit it on the order of 10⁴ times. Each spared call avoids two heap allocations and a CharsetEncoder dispatch, which directly reduces the HeapCharBuffer.wrap GC pressure that profiling identified as the largest single allocation source.

Larger gains require attacking the remaining UTF-8 path itself — follow-up commits target the escape-needing branch and PlatformBase64 zero-copy.

References

- async-profiler CPU + allocation-parent analysis on /tmp/sjsonnet-yaml-fix
- Existing helpers reused: Platform.isAsciiJsonSafe, renderAsciiSafeString, Platform.copyAsciiStringToBytes
- Related: #863 "perf(stdlib): skip getBytes(UTF_8) for ASCII-safe base64 inputs" (base64 ASCII input fast path), #864 "perf: bulk-write safe runs in BaseRenderer.escape" (BaseRenderer.escape bulk write)

Result

✅ ./mill 'sjsonnet.jvm[3.3.7]'.test — 444/444 pass
✅ Byte-identical output on kube-prometheus (1.5 MB / 72k lines diff -q clean)
✅ 3.2% lower mean wall-clock time on Scala Native kube-prom (150.7 ms → 145.9 ms)
✅ No regression on bench.runRegressions

Motivation:

async-profiler on the Scala Native kube-prometheus workload shows HeapCharBuffer.wrap accounting for 40.3% of GC-allocation parents (GC itself is ~25-30% of native runtime). The wrap site is String.getBytes(UTF_8) called once per long (>=128 char) JSON string inside BaseByteRenderer.visitLongString. Each call also allocates an output byte[]. In K8s manifest output the overwhelming majority of these long values (descriptions, annotations, base64 blobs, paths) are pure printable ASCII with no JSON-escape characters.

Modification:

At the top of visitLongString, probe the string with the existing Platform.isAsciiJsonSafe SWAR scan (16 chars/Long word, no allocation). On a positive probe, delegate to renderAsciiSafeString, which uses Platform.copyAsciiStringToBytes for a direct char->byte memcpy and skips the CharsetEncoder, HeapCharBuffer, and intermediate byte[] entirely. Strings that contain any escape-requiring char or any non-ASCII codepoint fall through to the existing byte-SWAR path unchanged; they pay one SWAR scan over chars (~bLen/16 Long reads) on top of the existing work, which is dominated by the encode cost they already perform.

Result:

- ./mill 'sjsonnet.jvm[3.3.7]'.test: 444/444 pass
- Byte-identical output on kube-prometheus (1.5 MB / 72k lines)
- hyperfine (Scala Native, kube-prom, 60 runs, warmup 8):
  before: 150.7 ms ± 8.3 ms
  after: 145.9 ms ± 6.2 ms
  => 1.03x faster (-4.8 ms mean, -3.2%)
- ./mill bench.runRegressions: completes successfully across all cpp/go/sjsonnet suites with no anomalies.

Analysis:

Modest but real: visitLongString is one call per long output string, so even on a 72k-line kube-prom output we hit it on the order of ~10^4 times. Each spared call avoids two heap allocations and a CharsetEncoder dispatch. Larger gains require attacking the remaining UTF-8 path itself (next commits target the escape-needing branch and the PlatformBase64 zero-copy).

References:

- async-profiler GC-parent analysis on /tmp/sjsonnet-yaml-fix
- Platform.isAsciiJsonSafe / CharSWAR.isAsciiJsonSafe (existing SWAR helper)
- renderAsciiSafeString / Platform.copyAsciiStringToBytes (existing fast path)

Motivation:

Scala Native kube-prometheus rendering still showed write/output overhead after the renderer and strict JSON import stack. `NativeOutputStream` already bypasses the JVM-compatible `PrintStream` path by writing through C `fwrite`, but stdout still used the platform default stdio buffering.

Key Design Decision:

Keep this optimization Native-only and local to stdout buffering. Instead of changing renderer flush thresholds or `ByteBuilder` behavior globally, configure the C stdio stream with full buffering before any `NativeOutputStream` writes occur. Passing a null buffer lets libc own the buffer lifetime, so the Scala object does not need to retain native memory.

Modification:

- Configure `NativeOutputStream` with `setvbuf(file, null, _IOFBF, 256 KiB)` during construction.
- Leave JVM, JS, YAML, expect-string, and file-output code paths unchanged.
- Preserve existing explicit `flush()` behavior for trailing newline and close handling.

Benchmark Results:

Workload: `jrsonnet/tests/realworld/entry-kube-prometheus.jsonnet -J vendor`

Candidate was benchmarked on the Scala Native 0.5.12 stacked exploration branch against clean `cf7b8af9`.

| Order | Clean | Candidate | Result |
| --- | ---: | ---: | ---: |
| Forward mean | 218.848 ms | 188.528 ms | -13.9% |
| Forward median | 215.517 ms | 187.368 ms | -13.1% |
| Reverse mean | 224.045 ms | 183.701 ms | -18.0% |
| Reverse median | 224.281 ms | 182.914 ms | -18.4% |

Output equality matched by `cmp`.

Validation:

- `./mill --no-server --ticker false --color false __.reformat`
- `./mill --no-server --ticker false --color false -j 1 __.test`: 444 passed, 0 failed
- `./mill --no-server --ticker false --color false bench.runRegressions`

Analysis:

This is a lower-risk write/flush optimization than increasing `ByteBuilder` thresholds: it does not alter rendering order, JSON escaping, object materialization, or JVM/JS behavior. It only changes the buffering policy of the Native stdout `FILE*`, and explicit flushes still happen at the same public boundaries.

References:

- Scala Native 0.5.12 migration PR: #867
- Related performance stack context: #863, #864, #865, #866, #868

Result:

Native stdout rendering writes fewer/smoother buffered chunks for large JSON output while preserving byte-identical output and the existing flush contract.

Motivation:

The Native stdout buffering follow-up showed that downstream buffering can materially reduce large-output write overhead. JSON `-o` output still sent `ByteRenderer` chunks directly to the file output stream, relying only on `ByteBuilder`'s internal flush threshold.

Key Design Decision:

Keep the change local to the JSON output-file fast path. Rather than changing `ByteBuilder` thresholds globally, wrap the file output stream in a `BufferedOutputStream` with the same 256 KiB output buffer size used for the Native stdout buffering follow-up. YAML, expect-string, stdout, and renderer semantics stay unchanged.

Modification:

- Add `OutputBufferSize = 256 * 1024` in `SjsonnetMainBase`.
- Wrap JSON output-file `ByteRenderer` targets in `BufferedOutputStream(out, OutputBufferSize)`.
- Flush the buffered stream at the same completion boundary before closing the underlying file output stream.

Benchmark Results:

Workload: `jrsonnet/tests/realworld/entry-kube-prometheus.jsonnet -J vendor -o /tmp/fileout-*.json`

Candidate was benchmarked on the Scala Native 0.5.12 stacked exploration branch after the Native stdout buffering commit.

| Order | Clean | Candidate | Result |
| --- | ---: | ---: | ---: |
| Forward mean | 217.372 ms | 205.062 ms | -5.7% |
| Forward median | 196.625 ms | 183.491 ms | -6.7% |
| Reverse mean | 210.517 ms | 177.174 ms | -15.8% |
| Reverse median | 193.394 ms | 175.878 ms | -9.1% |

Output equality matched by `cmp`.

Validation:

- `./mill --no-server --ticker false --color false __.reformat`
- `./mill --no-server --ticker false --color false -j 1 __.test`: 444 passed, 0 failed
- `./mill --no-server --ticker false --color false bench.runRegressions`

Analysis:

This preserves the existing rendering pipeline and only changes the buffering layer for file output. It avoids global `ByteBuilder` threshold changes, keeps stdout behavior separate, and does not affect YAML or expect-string paths.

References:

- Native stdout buffering PR: #869
- Scala Native 0.5.12 migration PR: #867
- Related performance stack context: #863, #864, #865, #866, #868

Result:

Large JSON file output writes are buffered more effectively while preserving byte-identical output and the existing flush/close contract.
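The effect of the file-output buffering described above can be illustrated with a small, self-contained sketch. It is not the `SjsonnetMainBase` code: the object, the `CountingOutputStream` helper, the 4 KiB chunk size, and the in-memory sink standing in for a file stream are all assumptions for illustration. Only `java.io.BufferedOutputStream` and the 256 KiB buffer size come from the description.

```scala
import java.io.{BufferedOutputStream, ByteArrayOutputStream, OutputStream}

object BufferedOutputSketch {
  // Same size as the description's OutputBufferSize (256 KiB).
  val OutputBufferSize: Int = 256 * 1024

  // Hypothetical helper: counts how many write calls reach the underlying stream.
  final class CountingOutputStream(underlying: OutputStream) extends OutputStream {
    var writes: Int = 0
    override def write(b: Int): Unit = { writes += 1; underlying.write(b) }
    override def write(b: Array[Byte], off: Int, len: Int): Unit = {
      writes += 1
      underlying.write(b, off, len)
    }
  }

  // Stand-in for a renderer flushing fixed-size chunks downstream.
  def emitChunks(out: OutputStream, chunkSize: Int, chunks: Int): Unit = {
    val chunk = Array.fill[Byte](chunkSize)('x'.toByte)
    var i = 0
    while (i < chunks) { out.write(chunk, 0, chunk.length); i += 1 }
  }

  // Returns (write calls seen by the sink, total bytes delivered).
  def downstreamWrites(buffered: Boolean): (Int, Int) = {
    val bytes = new ByteArrayOutputStream()
    val sink = new CountingOutputStream(bytes)
    val out: OutputStream =
      if (buffered) new BufferedOutputStream(sink, OutputBufferSize) else sink
    emitChunks(out, 4096, 100)
    out.flush() // flush at the completion boundary, before the sink would be closed
    (sink.writes, bytes.size())
  }
}
```

Calling `BufferedOutputSketch.downstreamWrites(buffered = true)` delivers the same number of bytes as the unbuffered run but with far fewer calls into the sink, which is the mechanism the benchmark above attributes the file-output gain to. The final `flush()` matters: without it, bytes still held in the buffer would never reach the sink.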

He-Pin marked this pull request as ready for review May 23, 2026 09:38

This was referenced May 23, 2026

perf: buffer native stdout writes #869

Merged

perf: buffer json file output #870

Merged

stephenamar-db merged commit fe37922 into databricks:master May 27, 2026
5 checks passed

stephenamar-db merged 1 commit into databricks:master from He-Pin:perf/renderer-long-string-ascii-fastpath
