From 5feaadf37300c273d8a440a92f84e99b09cbd78a Mon Sep 17 00:00:00 2001
From: Sam Barker
Date: Fri, 1 May 2026 16:24:05 +1200
Subject: [PATCH 01/25] Add first draft of operator-focused benchmarking post

Covers methodology, test environment, passthrough proxy results, encryption latency and throughput ceiling, the per-connection scaling insight, and sizing guidance. Includes a TODO placeholder for the connection sweep results before publication.

Assisted-by: Claude Sonnet 4.6
Signed-off-by: Sam Barker
---
 _posts/2026-05-01-benchmarking-the-proxy.md | 153 ++++++++++++++++++++
 1 file changed, 153 insertions(+)
 create mode 100644 _posts/2026-05-01-benchmarking-the-proxy.md

diff --git a/_posts/2026-05-01-benchmarking-the-proxy.md b/_posts/2026-05-01-benchmarking-the-proxy.md
new file mode 100644
index 00000000..bd656487
--- /dev/null
+++ b/_posts/2026-05-01-benchmarking-the-proxy.md
@@ -0,0 +1,153 @@
+---
+layout: post
+title: "Does my proxy look big in this cluster?"
+date: 2026-05-01 00:00:00 +0000
+author: "Sam Barker"
+author_url: "https://github.com/SamBarker"
+categories: benchmarking performance
+---
+
+All good benchmarking stories start with a hunch. Mine was that Kroxylicious is cheap to run — I'd stake my career on it, in fact — but it turns out that "trust me, I wrote it" is not a widely accepted unit of measurement. People want proof. Sensibly.
+
+There's a practical question underneath the hunch too. The most common thing operators ask us is some variation of: "How many cores does the proxy need?" Which is really just "is this thing going to slow down my Kafka?" in a polite engineering hat. We'd been giving the classic answer: "it depends on your workload and traffic patterns, so you'll need to test in your environment." Which is true. And also deeply unsatisfying for everyone involved, including us.
+
+So we stopped saying "it depends", and got off the fence: we built something you can run **yourselves** on your own infrastructure with your own workload, and measured it. Here are some representative numbers from ours.
+
+## What we measured
+
+We ran three scenarios against the same Apache Kafka® cluster on the same hardware:
+
+- **Baseline** — producers and consumers talking directly to Kafka, no proxy in the path
+- **Passthrough proxy** — traffic routed through Kroxylicious with no filter chain configured
+- **Record encryption** — traffic through Kroxylicious with AES-256-GCM record encryption enabled, using HashiCorp Vault as the KMS
+
+We used [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) rather than Kafka's own `kafka-producer-perf-test`. OMB is an industry-standard tool that coordinates producers and consumers together, measures end-to-end latency (not just publish latency), and produces structured JSON that makes comparison straightforward. More on why we built a whole harness around it in the [companion engineering post]({% post_url 2026-05-08-benchmarking-the-proxy-under-the-hood %}).
+
+## Test environment
+
+All results were collected on a 6-node OpenShift cluster on Fyre, IBM's internal cloud environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit — one core.
+
+| Component | Details |
+|-----------|---------|
+| CPU | AMD EPYC-Rome, 2 GHz |
+| Cluster | 6-node OpenShift, RHCOS 9.6 |
+| Kafka | 3-broker Strimzi cluster, replication factor 3 |
+| Kroxylicious | 0.20.0, single proxy pod, 1000m CPU limit |
+| KMS | HashiCorp Vault (in-cluster) |
+
+The primary workload used 1 topic, 1 partition, 1 KB messages. We chose single-partition deliberately: it concentrates all traffic on one broker, so you hit ceilings quickly and any proxy overhead is easy to isolate. We also ran 10-topic and 100-topic workloads to make sure the results hold when load is spread more realistically across brokers.
+
+One important caveat: this Kafka cluster is deliberately untuned. We're not trying to squeeze every message-per-second out of Kafka — we're using it as a fixed baseline to measure what the proxy adds on top. Kafka experts will find obvious headroom to improve on our baseline numbers; that's fine and expected. The deltas are what matter here, not the absolutes.
+
+---
+
+## The passthrough proxy: negligible overhead
+
+Good news first. The proxy itself — with no filter chain, just routing traffic — adds almost nothing.
+
+**10 topics, 1 KB messages (5,000 msg/sec per topic):**
+
+| Metric | Baseline | Proxy | Delta |
+|--------|----------|-------|-------|
+| Publish latency avg | 2.62 ms | 2.79 ms | +0.17 ms (+7%) |
+| Publish latency p99 | 14.09 ms | 15.17 ms | +1.08 ms (+8%) |
+| E2E latency avg | 94.87 ms | 95.34 ms | +0.47 ms (+0.5%) |
+| E2E latency p99 | 185.00 ms | 186.00 ms | +1.00 ms (+0.5%) |
+| Publish rate | 5,002 msg/s | 5,002 msg/s | 0 |
+
+**100 topics, 1 KB messages (500 msg/sec per topic):**
+
+| Metric | Baseline | Proxy | Delta |
+|--------|----------|-------|-------|
+| Publish latency avg | 2.66 ms | 2.82 ms | +0.16 ms (+6%) |
+| Publish latency p99 | 5.54 ms | 6.07 ms | +0.53 ms (+10%) |
+| E2E latency avg | 253.16 ms | 253.76 ms | +0.60 ms (+0.2%) |
+| E2E latency p99 | 499.00 ms | 499.00 ms | 0 |
+| Publish rate | 500 msg/s | 500 msg/s | 0 |
+
+**The headline: ~0.2 ms additional average publish latency. Throughput is unaffected.**
+
+What did I take away from this entirely unsurprising result? Not much, honestly — without filters the proxy is little more than a couple of hops through the TCP stack, but we now have data rather than a hunch.
+The end-to-end (E2E) p99 figure is dominated by the Kafka consumer fetch timeouts, as it should be. That said, it is reassuring to see the proxy add 1 ms or less to the E2E p99.
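
The Delta columns in the tables above are plain arithmetic on the baseline and proxy figures. As a minimal sketch, the calculation can be reproduced with `awk`. The `delta` helper is a hypothetical name used for illustration, not part of the benchmark harness. The published percentages were presumably computed from unrounded measurements, so recomputing them from the rounded table values can differ by a percentage point on some rows.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the benchmark suite): prints the absolute
# and relative difference between a baseline and a proxy latency figure.
delta() {
  local baseline="$1" proxy="$2"
  awk -v b="$baseline" -v p="$proxy" 'BEGIN {
    d = p - b
    # absolute delta in ms, then the delta as a percentage of the baseline
    printf "%+.2f ms (%+.0f%%)\n", d, (d / b) * 100
  }'
}

# 10-topic publish latency p99: baseline 14.09 ms, proxy 15.17 ms
delta 14.09 15.17
# 100-topic publish latency p99: baseline 5.54 ms, proxy 6.07 ms
delta 5.54 6.07
```

Running the same helper over your own baseline and proxy results gives a like-for-like comparison with the tables here.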
+
+---
+
+## Record encryption: now we're doing real work
+
+Ok, so let's make the proxy smarter — make it do something people actually care about! [Record encryption](https://kroxylicious.io/documentation/0.20.0/html/record-encryption-guide) uses AES-256-GCM to encrypt each record passing through the proxy. AES-256-GCM is going to ask the CPU to work relatively hard on its own, but it's also going to push the proxy to understand each record it receives, unpack it, copy it, encrypt it, and re-pack it before sending it on to the broker. With all that work going on we expect some impact to latency and throughput. To answer our original question we need to identify two things: the latency when everything is going smoothly, and the reduction in throughput all this work causes. Monitoring latency once we go past the throughput inflection point isn't very helpful — it's dominated by the throughput limits and their erratic impacts on the latency of individual requests (a big hello to batching and buffering effects).
+
+### Latency at sub-saturation rates
+
+A quick note on percentiles for anyone not steeped in performance benchmarking: p99 latency is the value that 99% of requests complete within — meaning 1 in 100 requests takes longer. Averages flatter; the p99 is what your slowest clients actually experience, and it's usually the number that matters.
+
+So we know encryption is doing a lot of work, but to find out the real impact we need to compare it to a plain Kafka cluster (and yes, people do run Kroxylicious without filters — TLS termination, stable client endpoints, virtual clusters — but that's a different post). The table below tells us that above a certain inflection point the numbers get really, really noisy — especially in the p99 range.
+
+**1 topic, 1 KB messages — baseline vs encryption:**
+
+| Rate | Metric | Baseline | Encryption | Delta |
+|------|--------|----------|------------|-------|
+| 34,000 msg/s | Publish avg | 8.00 ms | 8.19 ms | +0.19 ms (+2%) |
+| 34,000 msg/s | Publish p99 | 48.65 ms | 64.01 ms | +15.35 ms (+32%) |
+| 36,000 msg/s | Publish avg | 9.38 ms | 10.46 ms | +1.08 ms (+12%) |
+| 36,000 msg/s | Publish p99 | 63.92 ms | 88.98 ms | +25.06 ms (+39%) |
+| 37,200 msg/s | Publish avg | 9.12 ms | 12.19 ms | +3.07 ms (+34%) |
+| 37,200 msg/s | Publish p99 | 74.88 ms | 113.15 ms | +38.27 ms (+51%) |
+
+So we know that somewhere above 34k we're hitting a limit. Time to hunt out exactly where — enter the rate-sweep.
+
+### Throughput ceiling
+
+A rate-sweep is exactly what it sounds like: pick a starting rate, let OMB run long enough to get a stable measurement, then step up by a fixed percentage and repeat until the system can't keep up. We defined "can't keep up" as the sustained throughput dropping by more than 5% below the target rate — at that point, something has saturated.
+
+We started at 34k (right where the latency table started getting interesting) and stepped up in 5% increments. The results:
+
+- **Baseline**: sustained up to ~50,000–52,000 msg/sec (the ceiling we observed on our test cluster)
+- **Encryption**: sustained up to **~37,200 msg/sec**, then started intermittently saturating
+- **Cost: approximately 26% fewer messages per second per partition**
+
+The transition wasn't a clean cliff edge — between 37,600 and 42,000 msg/sec the proxy alternated between sustaining and saturating. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Above ~39,000 msg/sec, p99 latency regularly spiked above 1,700 ms. Stay below 37k and you're fine. Creep above it and you'll notice.
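
The stepping rule and the saturation test described above can be sketched in a few lines of shell. This is an illustration of the logic only: the real sweep is driven by the harness's own scripts, and the function names `sweep_rates` and `is_saturated` are hypothetical. The 5% step and the 95%-of-target threshold are taken from the description above.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; function names are hypothetical.

# Print the target rates for a sweep: start at the minimum rate and multiply
# by (1 + step%) until the maximum rate is exceeded.
sweep_rates() {
  local min="$1" max="$2" step_percent="$3"
  awk -v r="$min" -v max="$max" -v s="$step_percent" 'BEGIN {
    while (r <= max) {
      printf "%d\n", r
      r = r * (1 + s / 100)
    }
  }'
}

# Exit status 0 (true) when the achieved rate is more than 5% below target.
is_saturated() {
  local target="$1" achieved="$2"
  awk -v t="$target" -v a="$achieved" 'BEGIN { exit !(a < t * 0.95) }'
}

sweep_rates 34000 42000 5

if is_saturated 37200 36900; then
  echo "saturated"
else
  echo "sustained"
fi
```

A multiplicative step keeps the relative resolution constant across the sweep, which is useful when you do not know in advance where the knee sits.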
The numbers are not absolute — they are just what we measured on our cluster; your mileage **will vary**.
+
+### The thing that surprised us: per-connection, not per-pod
+
+The fact the proxy is low latency didn't surprise me, but this did — and it matters when we think about scaling. We maxed out a single connection, but that didn't mean we'd maxed out the proxy.
+
+Once we had the single-producer encryption ceiling at ~37k msg/sec, the obvious question was: is that the limit for the whole proxy pod, or just for one connection? The answer changes how you scale.
+
+We ran the same test with 4 producers sharing the same single partition. With 4 connections the proxy sustained well past the single-producer ceiling — the Netty event loop queues stayed empty throughout, confirming the proxy had capacity to spare. The reason is how Netty works: each client connection is bound to a single event loop thread, and encryption happens synchronously on that thread. One producer connection saturates at ~37k msg/sec, but a second producer on a different connection can be served by a different thread with its own headroom. The proxy's aggregate capacity grows with each additional connection.
+
+**The practical implication**: if you're hitting the encryption ceiling, add producers before adding proxy pods. We haven't yet measured exactly where the per-pod ceiling sits — that's the next experiment — but the single-connection limit of ~37k is not the whole story.
+
+---
+
+## Sizing guidance
+
+**Passthrough proxy**: size your Kafka cluster as you normally would. The proxy won't be the bottleneck — but if you want to verify that on your own hardware, the rate sweep is exactly the tool for it. Run the baseline and passthrough scenarios back-to-back and you'll have your own numbers.
+
+**With record encryption:**
+
+1. **Throughput budget**: encryption imposes a per-connection throughput ceiling driven by the CPU cost of AES-256-GCM on your hardware.
On ours (AMD EPYC-Rome, 2 GHz) that ceiling was about 26% lower than Kafka alone could sustain per producer connection — run the rate sweep on your own infrastructure to find yours.
+
+2. **Latency budget**: below saturation, expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99. The overhead scales with how hard you're pushing — give yourself headroom and you'll barely notice it.
+
+3. **Scaling**: the bottleneck is per-connection CPU (crypto, buffer management, and network I/O combined). Spread load across more producer connections first; then scale proxy pods horizontally.
+
+4. **KMS overhead**: DEK caching means Vault isn't on the hot path for every record. Our tests triggered only 5–19 DEK generation calls per benchmark run. The KMS is not the thing to worry about.
+
+---
+
+## Caveats and next steps
+
+These results come from a single proxy pod, a single partition, and single-pass measurements at each rate point. We know what the gaps are:
+
+- **Connection sweep**: we tested 1 and 4 producers — we haven't yet swept 2, 8, 16 to characterise the full per-pod ceiling
+- **Horizontal scaling**: we expect more proxy pods to scale linearly, but haven't measured it yet
+- **Larger message sizes**: encryption overhead is almost certainly smaller in percentage terms for larger messages
+
+For the engineering story — why we built a custom harness on top of OMB, what the CPU flamegraphs actually show, and the bugs we found in our own tooling along the way — that's in the [companion post]({% post_url 2026-05-08-benchmarking-the-proxy-under-the-hood %}).
+
+The full benchmark suite, quickstart guide, and sizing reference are in `kroxylicious-openmessaging-benchmarks/` in the [main Kroxylicious repository](https://github.com/kroxylicious/kroxylicious).
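
To make the "spread load across more producer connections" guidance from the sizing section concrete, here is a back-of-envelope sketch. It is not a sizing tool: `connections_needed` is a hypothetical helper, the 37,200 msg/sec ceiling is the figure measured on the test cluster in this post, and the 20% headroom is an arbitrary assumption. It also ignores the per-partition and per-pod limits, the latter of which has not been measured yet.

```bash
#!/usr/bin/env bash
# Back-of-envelope sketch. All inputs are assumptions that should be replaced
# with figures from a rate sweep on your own infrastructure.
connections_needed() {
  local target_rate="$1" per_conn_ceiling="$2" headroom_percent="$3"
  awk -v t="$target_rate" -v c="$per_conn_ceiling" -v h="$headroom_percent" 'BEGIN {
    usable = c * (1 - h / 100)   # rate each connection is allowed to carry
    n = int(t / usable)
    if (n * usable < t) n++      # round up to a whole connection
    print n
  }'
}

# 100,000 msg/sec target, 37,200 msg/sec per-connection ceiling, 20% headroom
connections_needed 100000 37200 20
```

Treat the result as a starting point for a connection sweep rather than a guarantee.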
From a864e8fc49d6b70bd06e2cc3cf685855efe11ab9 Mon Sep 17 00:00:00 2001
From: Sam Barker
Date: Fri, 1 May 2026 16:24:17 +1200
Subject: [PATCH 02/25] Add first draft of engineering deep-dive benchmarking post

Covers why we chose OMB over Kafka's own tools, the benchmark harness we built (Helm chart, orchestration scripts, JBang result processors), workload design rationale, CPU flamegraphs with embedded interactive iframes, the per-connection ceiling discovery, bugs found in our own tooling, and the cluster recovery incident.

Assisted-by: Claude Sonnet 4.6
Signed-off-by: Sam Barker
---
 ...8-benchmarking-the-proxy-under-the-hood.md | 204 +
 .../encryption-cpu-profile-36k.html | 28030 ++++++++++++++++
 .../proxy-no-filters-cpu-profile.html | 15382 +++++++++
 3 files changed, 43616 insertions(+)
 create mode 100644 _posts/2026-05-08-benchmarking-the-proxy-under-the-hood.md
 create mode 100644 assets/blog/flamegraphs/benchmarking-the-proxy/encryption-cpu-profile-36k.html
 create mode 100644 assets/blog/flamegraphs/benchmarking-the-proxy/proxy-no-filters-cpu-profile.html

diff --git a/_posts/2026-05-08-benchmarking-the-proxy-under-the-hood.md b/_posts/2026-05-08-benchmarking-the-proxy-under-the-hood.md
new file mode 100644
index 00000000..fa378dfa
--- /dev/null
+++ b/_posts/2026-05-08-benchmarking-the-proxy-under-the-hood.md
@@ -0,0 +1,204 @@
+---
+layout: post
+title: "Benchmarking a Kafka proxy: the engineering story"
+date: 2026-05-08 00:00:00 +0000
+author: "Sam Barker"
+author_url: "https://github.com/SamBarker"
+categories: benchmarking performance engineering
+---
+
+The [first post]({% post_url 2026-05-01-benchmarking-the-proxy %}) covered what we measured and what the numbers mean for operators. This one is for the people who want to know how we measured it, what the flamegraphs actually show, and what we found when we started looking carefully at our own tooling.
+
+## Why not Kafka's own tools?
+
+Kafka ships with `kafka-producer-perf-test` and `kafka-consumer-perf-test`. We'd used them before. The problems:
+
+- **Too noisy**: individual runs produced widely varying results depending on JVM warm-up, scheduling jitter, and GC behaviour. Results were hard to trust and harder to compare across scenarios.
+- **Producer-only view**: `kafka-producer-perf-test` gives you publish latency, but nothing about the consumer side. You can't see end-to-end latency — which is what operators actually care about.
+- **Awkward to sweep**: running parametric rate sweeps requires scripting around these tools, and comparing results across scenarios requires manual work.
+
+[OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) is a better fit. It's an industry-standard tool used by Confluent, the Pulsar team, and others for their published performance comparisons. OMB coordinates producers and consumers across separate worker pods, runs a configurable warmup phase before taking measurements, and outputs structured JSON that's straightforward to process programmatically.
+
+Using OMB also means our numbers are directly comparable to other published Kafka benchmarks — that credibility matters when you're trying to make the case that your proxy doesn't break things.
+
+## What we built on top of OMB
+
+OMB handles the measurement. We built everything around it: deployment, teardown, diagnostics collection, and result processing. All of it lives in `kroxylicious-openmessaging-benchmarks/` in the main repo.
+
+### Helm chart
+
+A Helm chart (`helm/kroxylicious-benchmark/`) deploys the full benchmark stack into Kubernetes:
+
+- OMB coordinator and worker pods
+- A Strimzi Kafka cluster
+- The Kroxylicious proxy (via the Kroxylicious Kubernetes operator)
+- HashiCorp Vault (for the KMS in the encryption scenario)
+
+Scenario-specific configuration lives in `helm/kroxylicious-benchmark/scenarios/` as YAML overrides:
+
+| Scenario file | What it deploys |
+|---------------|-----------------|
+| `baseline-values.yaml` | Direct Kafka, no proxy |
+| `proxy-no-filters-values.yaml` | Proxy with empty filter chain |
+| `encryption-values.yaml` | Proxy with AES-256-GCM encryption and Vault |
+| `rate-sweep-values.yaml` | Extended run profiles for sweep experiments |
+
+Separating scenarios into override files means the base chart stays stable while each scenario adds only what it needs. Switching between scenarios doesn't require touching the chart itself.
+
+### Orchestration scripts
+
+**`scripts/run-benchmark.sh`** orchestrates a single benchmark run:
+
+1. Deploys the Helm chart for the requested scenario
+2. Waits for the OMB Job to complete
+3. Collects results: OMB JSON, a JFR recording, an async-profiler flamegraph, and a Prometheus metrics snapshot
+4. Tears down
+
+The `--skip-deploy` flag lets you re-run a probe against an already-deployed cluster — essential for rate sweeps where you want to deploy once and probe many times.
+
+**`scripts/rate-sweep.sh`** wraps `run-benchmark.sh` to drive parametric sweeps. It takes `--min-rate`, `--max-rate`, `--step-percent`, and one or more `--scenario` flags. The first probe deploys; subsequent probes use `--skip-deploy`.
+
+### Result processing
+
+Three JBang-runnable Java programs handle result analysis:
+
+- **`RunMetadata.java`**: generates `run-metadata.json` alongside each result.
Captures git commit, timestamp, cluster node specs (architecture, CPU, RAM), and — on OpenShift — NIC speed read from the host via the MachineConfigDaemon pod.
+- **`ResultComparator.java`**: reads two scenario result directories and produces a markdown comparison table.
+- **`ResultSummariser.java`**: reads a rate-sweep result directory and prints a saturation table: target rate, achieved rate, p99, and whether the probe saturated.
+
+Getting NIC speed from a Kubernetes node turned out to be non-trivial — you need host filesystem access to read `/sys/class/net//speed`. On OpenShift, the MachineConfigDaemon pods mount the host at `/rootfs`, so we `kubectl exec` into the MCD pod and `chroot /rootfs` to read the speed file without creating any new privileged resources.
+
+## Workload design
+
+The primary workload used **1 topic, 1 partition, 1 KB messages**. This is deliberate. Concentrating all traffic on a single partition pushes things to their limits at lower absolute rates, which makes the proxy overhead easier to isolate: when the system saturates, it's the proxy, not a spread-out broker fleet.
+
+Multi-topic workloads (10 topics, 100 topics) were used to verify that the overhead characteristics hold when load is distributed. At 5,000 msg/sec per topic across 10 topics, every topic-partition pair is well below any saturation point — so what you're measuring is steady-state overhead, not ceiling behaviour.
+
+For throughput ceiling testing we used rate sweeps: start at 34,000 msg/sec, step up by 5% until achieved rate drops below 95% of target. The knee of that curve is the saturation point.
+
+## The flamegraph: where the CPU actually goes
+
+We captured CPU profiles using async-profiler attached to the proxy JVM via `jcmd JVMTI.agent_load`, during the steady-state measurement phase at 36,000 msg/sec. These are self-time percentages — where the CPU is actually spending cycles, not inclusive call-tree time.
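
The difference between self time and inclusive time is easiest to see on collapsed stacks, where each line is a semicolon-separated call stack followed by a sample count. The sketch below assumes a profile exported in that collapsed text format and credits each sample only to the leaf frame of its stack. The `self_time` helper and the three sample stacks are made up for illustration; they are not taken from the profiles in this post.

```bash
#!/usr/bin/env bash
# Sketch: compute self-time share per frame from collapsed stacks.
# Each input line looks like "frame1;frame2;...;leaf <samples>".
self_time() {
  awk '{
    samples = $NF                  # last field is the sample count
    stack = $0
    sub(/ [0-9]+$/, "", stack)     # drop the trailing count
    n = split(stack, frames, ";")
    self[frames[n]] += samples     # only the leaf frame gets self time
    total += samples
  }
  END {
    for (f in self) printf "%.1f%% %s\n", 100 * self[f] / total, f
  }' | sort -rn
}

# Made-up example stacks
self_time <<'EOF'
main;handleRequest;encrypt 30
main;handleRequest;send 60
main;handleRequest 10
EOF
```

In this toy input `handleRequest` appears in every stack, so its inclusive share would be 100%, but only the samples where it is the leaf count towards its self time.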
+
+The flamegraphs below are fully interactive: hover over a frame to see its name and percentage, click to zoom in, Ctrl+F to search. Scroll within the frame to explore the full stack depth.
+
+### No-filter proxy
+
+
+ +
CPU flamegraph — passthrough proxy (no filters), 36,000 msg/sec, 1 topic, 1 KB messages. Open full screen ↗
+
+
+| Category | CPU share |
+|----------|-----------|
+| Syscalls (send/recv) | 59.2% |
+| Native/VM | 16.7% |
+| Netty I/O | 10.5% |
+| Memory operations | 4.7% |
+| JDK libraries | 2.9% |
+| Kroxylicious proxy | 1.4% |
+| GC | 0.1% |
+
+The proxy is overwhelmingly I/O-bound. 59% of CPU is in `send`/`recv` syscalls — the inherent cost of maintaining two TCP connections (client→proxy, proxy→Kafka) with data flowing through the JVM. The proxy itself accounts for 1.4%. It really is a TCP relay with protocol awareness.
+
+### Encryption proxy (same 36,000 msg/sec rate)
+
+
+ +
CPU flamegraph — encryption proxy (AES-256-GCM), 36,000 msg/sec, 1 topic, 1 KB messages. Open full screen ↗
+
+
+| Category | No-filters | Encryption | Delta |
+|----------|-----------|------------|-------|
+| Syscalls (send/recv) | 59.2% | 23.5% | −35.7%* |
+| Native/VM | 16.7% | 18.9% | +2.2% |
+| JCA/AES-GCM crypto | 0.0% | 11.3% | **+11.3%** |
+| Memory operations | 4.7% | 10.4% | **+5.8%** |
+| JDK libraries | 2.9% | 9.3% | **+6.4%** |
+| GC / JVM housekeeping | 0.1% | 5.0% | **+4.9%** |
+| Netty I/O | 10.5% | 5.1% | −5.4%* |
+| Kafka protocol re-encoding | 0.4% | 3.5% | **+3.1%** |
+| Kroxylicious encryption filter | 0.0% | 2.0% | **+2.0%** |
+
+*\* Send/recv and Netty I/O appear to shrink as a percentage share because encryption adds CPU work that grows the total pie. The absolute I/O cost is similar in both scenarios.*
+
+The direct crypto cost is 13.3% (11.3% AES-GCM + 2.0% Kroxylicious filter logic). But encryption adds indirect costs too:
+
+- **Buffer management (+5.8%)**: encrypted records need to be read into buffers, encrypted, and written to new buffers — more allocation, more copying
+- **GC pressure (+4.9%)**: more short-lived objects from encryption buffers and crypto operations
+- **JDK security infrastructure (+6.4%)**: security provider lookups, key spec handling, parameter generation
+- **Kafka protocol re-encoding (+3.1%)**: encrypted records are different sizes and must be re-serialised into Kafka protocol format
+
+Total additional CPU share: ~33% (13.3% direct plus 20.2% indirect). That is broadly in line with the ~26% throughput reduction.
+
+If you wanted to optimise this, the highest-impact areas would be: reducing buffer copies (encrypt in-place or use composite buffers), pooling encryption buffers to reduce GC pressure, and caching `Cipher` instances to reduce per-record JDK security overhead.
+
+## The per-connection ceiling
+
+The single-producer encryption ceiling at ~37k msg/sec raised the question of whether that was a per-pod limit or a per-connection limit.
+
+The answer came from a 4-producer rate sweep.
Four producers sharing the same partition drove 47k+ msg/sec aggregate through the proxy while proxy CPU held at 570m/1000m — well below pod saturation. The Kafka partition became the bottleneck first.
+
+The explanation: Netty assigns each client connection to a single event loop thread. Encryption happens synchronously on that thread. A single connection is bounded by one event loop's throughput, but additional connections can be served by other threads. The proxy's aggregate capacity is the sum of its event loop threads' individual capacities — until something else (the Kafka partition, the NIC, pod CPU) saturates first.
+
+Worth noting: with replication factor 3, every message the Kafka leader receives goes out to 2 follower replicas plus potentially one consumer. At 50k msg/sec with 1 KB messages that's ~1.2 Gbps outbound from the leader alone — confirming why the Fyre cluster nodes need 10 Gbps NICs.
+
+## Bugs we found in our own tooling
+
+During the 4-producer rate sweep, we noticed that JFR recordings and flamegraphs from probes 2 onwards all looked identical to probe 1. They were stale copies. Three bugs.
+
+**Bug 1 — wrong JFR settings**: When restarting JFR for a subsequent probe in `--skip-deploy` mode, the script was using `settings=default` instead of `settings=profile`. The default profile omits I/O events including `jdk.NetworkUtilization` — the event we were using to read network throughput from JFR. Fixed to always use `settings=profile`.
+
+**Bug 2 — async-profiler not restarted**: The restart block restarted JFR but never restarted async-profiler. All probes after the first had a flamegraph from probe 1 only.
+
+**Bug 3 — wrong guard variable**: The async-profiler restart was guarded by checking `AGENT_LIB` (the path to the native library). `AGENT_LIB` is always set when the library exists on the image — even when profiling was intentionally skipped on clusters where the `Unconfined` seccomp profile couldn't be applied.
The correct guard is `ASYNC_PROFILER_FLAGS`, which is only set when the seccomp patch was successfully applied.
+
+Spotting these required noticing that two different probe flamegraphs were pixel-for-pixel identical, then working back through the restart logic. The lesson: when reusing a deployed cluster across multiple probes, validate that diagnostic collection is actually running fresh for each one.
+
+## The cluster that wouldn't upgrade
+
+Midway through a benchmark campaign, the Fyre OpenShift cluster got stuck mid-upgrade. All 3 worker nodes were `SchedulingDisabled`, meaning benchmark pods couldn't schedule, meaning cluster operators (image-registry, ingress, monitoring, storage) went degraded, which blocked the upgrade from completing.
+
+The root cause was a MachineConfigOperator bug. Each worker's MachineConfigDaemon had finished its upgrade and set `desiredDrain=uncordon-...` — signalling it was ready to be uncordoned — but the MCO never acted on that signal. Workers sat cordoned indefinitely.
+
+Fix: `kubectl uncordon worker0 worker1 worker2`. Once uncordoned, pods scheduled, operators recovered, and the upgrade completed.
+
+Not a Kroxylicious bug, but it cost several hours of cluster recovery time during an active benchmark campaign. Worth knowing about if you're running OCP on Fyre.
+
+## Run it yourself
+
+Everything is in `kroxylicious-openmessaging-benchmarks/` in the [main Kroxylicious repository](https://github.com/kroxylicious/kroxylicious). See `QUICKSTART.md` for step-by-step instructions. You'll need a Kubernetes or OpenShift cluster, the Kroxylicious operator installed, and Helm 3. Minikube works for local runs — the quickstart covers recommended CPU and memory settings.
+
+```bash
+# Run a baseline vs encryption comparison
+./scripts/run-benchmark.sh --scenario baseline
+./scripts/run-benchmark.sh --scenario encryption
+
+# Compare results
+jbang src/main/java/io/kroxylicious/benchmarks/results/ResultComparator.java \
+  results/baseline results/encryption
+```
+
+## What's still open
+
+The gaps we know about and plan to fill:
+
+1. **Connection sweep**: run 1, 2, 4, 8, 16 producers simultaneously at a fixed per-producer rate to characterise the per-pod aggregate ceiling with encryption. The plan is in `CONNECTION-SWEEP-PLAN.md`.
+
+2. **Horizontal scaling**: verify that adding proxy pods scales aggregate throughput linearly.
+
+3. **Multi-partition workloads**: isolate encryption cost without being bounded by Kafka's per-partition ceiling.
+
+4. **Multi-pass sweeps**: each rate point was measured once. Running each probe three times and taking the median would give tighter bounds, particularly in the saturation transition zone.
+
+5. **Message size variation**: larger messages should show lower encryption overhead as a percentage; smaller messages may show higher overhead. 1 KB is a reasonable middle ground but not the whole picture.
+
+The operator-facing sizing reference and all the key tables are in `SIZING-GUIDE.md` in the benchmarks directory.
\ No newline at end of file
diff --git a/assets/blog/flamegraphs/benchmarking-the-proxy/encryption-cpu-profile-36k.html b/assets/blog/flamegraphs/benchmarking-the-proxy/encryption-cpu-profile-36k.html
new file mode 100644
index 00000000..89215a71
--- /dev/null
+++ b/assets/blog/flamegraphs/benchmarking-the-proxy/encryption-cpu-profile-36k.html
@@ -0,0 +1,28030 @@

encryption/1topic-1kb_2026-04-20T11:38:12Z
diff --git a/assets/blog/flamegraphs/benchmarking-the-proxy/proxy-no-filters-cpu-profile.html b/assets/blog/flamegraphs/benchmarking-the-proxy/proxy-no-filters-cpu-profile.html
new file mode 100644
index 00000000..c921470d
--- /dev/null
+++ b/assets/blog/flamegraphs/benchmarking-the-proxy/proxy-no-filters-cpu-profile.html
@@ -0,0 +1,15382 @@

proxy-no-filters/1topic-1kb_2026-04-15T21:44:15Z
From 4ce85a666caa738a3ccda87296d1e83e4f4dbc8e Mon Sep 17 00:00:00 2001
From: Sam Barker
Date: Fri, 1 May 2026 16:24:26 +1200
Subject: [PATCH 03/25] Add performance reference page and update overview

Adds /performance/ as a dedicated quick-reference page with headline benchmark numbers, comparison tables, and sizing guidance, linked from both blog posts. Updates the existing Performance section in overview.markdown with the key headline numbers and a link to the full reference page.

Assisted-by: Claude Sonnet 4.6
Signed-off-by: Sam Barker
---
 overview.markdown | 7 +++-
 performance.markdown | 91 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+), 1 deletion(-)
 create mode 100644 performance.markdown

diff --git a/overview.markdown b/overview.markdown
index 8af9ae22..b6b42abb 100644
--- a/overview.markdown
+++ b/overview.markdown
@@ -66,5 +66,10 @@ Kroxylicious is careful to decode only the Kafka RPCs that the filters actually
 interested in a particular RPC, its bytes will pass straight through Kroxylicious. This approach helps keep Kroxylicious fast.
 
-The actual performance overhead of using Kroxylicious depends on the particular use-case.
+The actual performance overhead of using Kroxylicious depends on the particular use-case. As a guide:
+
+- **Passthrough proxy (no filters)**: ~0.2 ms additional average publish latency, no throughput impact
+- **Record encryption (AES-256-GCM)**: ~26% throughput reduction per partition; 15–40 ms additional p99 latency at sub-saturation rates
+
+See the [performance reference page]({{ '/performance/' | absolute_url }}) for full benchmark results, methodology, and sizing guidance.
 
diff --git a/performance.markdown b/performance.markdown
new file mode 100644
index 00000000..4ccd5194
--- /dev/null
+++ b/performance.markdown
@@ -0,0 +1,91 @@
+---
+layout: overview
+title: Performance
+permalink: /performance/
+toc: true
+---
+
+This page summarises the measured performance overhead of Kroxylicious.
Numbers come from [benchmarks run on real hardware](/blog/2026/05/01/benchmarking-the-proxy/) using [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark), an industry-standard Kafka performance tool. + +## Test environment + +| Component | Details | +|-----------|---------| +| CPU | AMD EPYC-Rome, 2 GHz | +| Cluster | 6-node OpenShift, RHCOS 9.6 | +| Kafka | 3-broker Strimzi cluster, replication factor 3 | +| Kroxylicious | 0.20.0, single proxy pod, 1000m CPU limit | +| KMS | HashiCorp Vault (in-cluster) | + +All primary results used 1 KB messages on a single partition. Multi-topic workloads (10 and 100 topics) confirmed that overhead characteristics hold when load is distributed. + +--- + +## Passthrough proxy (no filters) + +The proxy layer itself adds negligible overhead. At sub-saturation rates the additional latency is sub-millisecond on average, with no measurable throughput impact. + +**10 topics, 1 KB messages (5,000 msg/sec per topic):** + +| Metric | Baseline | Proxy | Delta | +|--------|----------|-------|-------| +| Publish latency avg | 2.62 ms | 2.79 ms | +0.17 ms (+7%) | +| Publish latency p99 | 14.09 ms | 15.17 ms | +1.08 ms (+8%) | +| E2E latency avg | 94.87 ms | 95.34 ms | +0.47 ms (+0.5%) | +| Publish rate | 5,002 msg/s | 5,002 msg/s | no change | + +**100 topics, 1 KB messages (500 msg/sec per topic):** + +| Metric | Baseline | Proxy | Delta | +|--------|----------|-------|-------| +| Publish latency avg | 2.66 ms | 2.82 ms | +0.16 ms (+6%) | +| Publish latency p99 | 5.54 ms | 6.07 ms | +0.53 ms (+10%) | +| Publish rate | 500 msg/s | 500 msg/s | no change | + +--- + +## Record encryption (AES-256-GCM) + +Encryption adds measurable but predictable overhead. The cost scales with producer rate — well below saturation the overhead is small; approaching the saturation point, latency rises sharply. 
+ +### Latency at sub-saturation rates + +**1 topic, 1 KB messages — baseline vs encryption:** + +| Rate | Metric | Baseline | Encryption | Delta | +|------|--------|----------|------------|-------| +| 34,000 msg/s | Publish avg | 8.00 ms | 8.19 ms | +0.19 ms (+2%) | +| 34,000 msg/s | Publish p99 | 48.65 ms | 64.01 ms | +15.35 ms (+32%) | +| 36,000 msg/s | Publish avg | 9.38 ms | 10.46 ms | +1.08 ms (+12%) | +| 36,000 msg/s | Publish p99 | 63.92 ms | 88.98 ms | +25.06 ms (+39%) | +| 37,200 msg/s | Publish avg | 9.12 ms | 12.19 ms | +3.07 ms (+34%) | +| 37,200 msg/s | Publish p99 | 74.88 ms | 113.15 ms | +38.27 ms (+51%) | + +### Throughput ceiling + +| Scenario | Throughput ceiling (1 topic, 1 KB, 1 partition) | +|----------|------------------------------------------------| +| Baseline (direct Kafka) | ~50,000–52,000 msg/sec | +| Encryption (proxy + AES-256-GCM) | ~37,200 msg/sec | +| **Cost** | **~26% fewer messages per second per partition** | + +--- + +## Sizing guidance + +**Passthrough proxy**: size your Kafka cluster as you normally would. The proxy will not be the bottleneck. + +**With record encryption:** + +- **Throughput**: plan for ~25% lower throughput per partition compared to direct Kafka +- **Latency**: expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99, scaling with how close to saturation you operate +- **Scaling**: the throughput ceiling is per-connection (one Netty event loop per client connection). Spreading load across more producers is the first scaling lever; adding proxy pods comes next +- **KMS**: DEK caching means the KMS is not on the hot path. 
In testing, each benchmark run triggered only 5–19 DEK generation calls — the KMS is not a bottleneck + +--- + +## Further reading + +- [Operator guide: results, methodology, and sizing recommendations](/blog/2026/05/01/benchmarking-the-proxy/) — the full benchmark story for operators +- [Engineering deep dive: tooling, flamegraphs, and what we discovered](/blog/2026/05/08/benchmarking-the-proxy-under-the-hood/) — how we measured it, where the CPU goes, and what surprised us +- [Benchmark quickstart](https://github.com/kroxylicious/kroxylicious/tree/main/kroxylicious-openmessaging-benchmarks/QUICKSTART.md) — run the benchmarks yourself \ No newline at end of file From 1f443cb79263c66e034e376352ee8df6c944206b Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Fri, 15 May 2026 16:28:22 +1200 Subject: [PATCH 04/25] Update benchmarking posts with validated coefficient and corrected framing MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Shift publication dates to May 21 and May 28 - Replace speculative per-connection ceiling explanation with empirical finding: encryption throughput ceiling scales linearly with CPU budget (validated at 1000m, 2000m, 4000m) - Add sizing formula: CPU (mc) = 20 × produce_MB_per_s, with worked example - Add RF=3 masking caveat: initial 1-topic sweeps conflated Kafka replication ceiling with proxy CPU ceiling; coefficient derived from RF=1 multi-topic workloads - Post 2: add full investigation narrative — workload isolation approach, coefficient derivation, 4-core confirmation, and 2-core prediction/validation - Drop stale "future work" items that are now complete Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- ...d => 2026-05-21-benchmarking-the-proxy.md} | 39 +++++----- ...-benchmarking-the-proxy-under-the-hood.md} | 75 ++++++++++++++----- performance.markdown | 10 +-- 3 files changed, 83 insertions(+), 41 deletions(-) rename _posts/{2026-05-01-benchmarking-the-proxy.md => 
2026-05-21-benchmarking-the-proxy.md} (77%) rename _posts/{2026-05-08-benchmarking-the-proxy-under-the-hood.md => 2026-05-28-benchmarking-the-proxy-under-the-hood.md} (71%) diff --git a/_posts/2026-05-01-benchmarking-the-proxy.md b/_posts/2026-05-21-benchmarking-the-proxy.md similarity index 77% rename from _posts/2026-05-01-benchmarking-the-proxy.md rename to _posts/2026-05-21-benchmarking-the-proxy.md index bd656487..3b2cd959 100644 --- a/_posts/2026-05-01-benchmarking-the-proxy.md +++ b/_posts/2026-05-21-benchmarking-the-proxy.md @@ -1,7 +1,7 @@ --- layout: post title: "Does my proxy look big in this cluster?" -date: 2026-05-01 00:00:00 +0000 +date: 2026-05-21 00:00:00 +0000 author: "Sam Barker" author_url: "https://github.com/SamBarker" categories: benchmarking performance @@ -21,11 +21,11 @@ We ran three scenarios against the same Apache Kafka® cluster on the same hardw - **Passthrough proxy** — traffic routed through Kroxylicious with no filter chain configured - **Record encryption** — traffic through Kroxylicious with AES-256-GCM record encryption enabled, using HashiCorp Vault as the KMS -We used [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) rather than Kafka's own `kafka-producer-perf-test`. OMB is an industry-standard tool that coordinates producers and consumers together, measures end-to-end latency (not just publish latency), and produces structured JSON that makes comparison straightforward. More on why we built a whole harness around it in the [companion engineering post]({% post_url 2026-05-08-benchmarking-the-proxy-under-the-hood %}). +We used [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) rather than Kafka's own `kafka-producer-perf-test`. OMB is an industry-standard tool that coordinates producers and consumers together, measures end-to-end latency (not just publish latency), and produces structured JSON that makes comparison straightforward. 
More on why we built a whole harness around it in the [companion engineering post]({% post_url 2026-05-28-benchmarking-the-proxy-under-the-hood %}). ## Test environment -All results were collected on a 6-node OpenShift cluster on Fyre, IBM's internal cloud environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit — one core. +All results were collected on a 6-node OpenShift cluster on Fyre, IBM's internal cloud environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit. | Component | Details | |-----------|---------| @@ -107,20 +107,15 @@ We started at 34k (right where the latency table started getting interesting) an The transition wasn't a clean cliff edge — between 37,600 and 42,000 msg/sec the proxy alternated between sustaining and saturating. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Above ~39,000 msg/sec, p99 latency regularly spiked above 1,700 ms. Stay below 37k and you're fine. Creep above it and you'll notice. The numbers are not absolute — they are just what we measured on our cluster; your mileage **will vary**. -### The thing that surprised us: per-connection, not per-pod +### The ceiling scales with CPU budget The fact the proxy is low latency didn't surprise me, but this did — and it matters when we think about scaling. We maxed out a single connection, but that didn't mean we'd maxed out the proxy. -Once we had the single-producer encryption ceiling at ~37k msg/sec, the obvious question was: is that the limit for the whole proxy pod, or just for one connection? The answer changes how you scale. +Once we had the single-producer encryption ceiling at ~37k msg/sec, the obvious question was: is that the limit for the whole proxy pod, or just for one connection? We ran the same test with 4 producers. 
With 4 connections the proxy sustained well past the single-producer ceiling — proxy CPU had headroom to spare, and Kafka's partition became the bottleneck first. -We ran the same test with 4 producers sharing the same single partition. With 4 connections the proxy sustained well past the single-producer ceiling — the Netty event loop queues stayed empty throughout, confirming the proxy had capacity to spare. The reason is how Netty works: each client connection gets its own event loop thread, and encryption happens synchronously on that thread. One producer connection saturates at ~37k msg/sec, but a second producer on a different connection gets its own thread and its own headroom. The proxy's aggregate capacity compounds with each connection. +Going further: we swept the same workload at 1000m, 2000m, and 4000m CPU. The throughput ceiling scaled linearly with the CPU budget — 1000m at ~40k msg/sec, 2000m at ~80k, 4000m at ~160k. The proxy isn't hitting a fixed architectural wall; it's hitting a CPU budget wall, and that wall moves when you give it more CPU. - - -**The practical implication**: if you're hitting the encryption ceiling, add producers before adding proxy pods. We haven't yet measured exactly where the per-pod ceiling sits — that's the next experiment — but the single-connection limit of ~37k is not the whole story. +**The practical implication**: the throughput ceiling is not a fixed number — it's a function of the CPU you allocate. Set `requests` equal to `limits` in your pod spec; this makes the CPU budget deterministic and the ceiling predictable. The companion engineering post has the full story of how we found this, including the workload design choices needed to isolate proxy CPU from Kafka's own limits. --- @@ -130,11 +125,17 @@ We ran the same test with 4 producers sharing the same single partition. With 4 **With record encryption:** -1. 
**Throughput budget**: encryption imposes a per-connection throughput ceiling driven by the CPU cost of AES-256-GCM on your hardware. On ours (AMD EPYC-Rome, 2GHz) that ceiling was about 26% lower than Kafka alone could sustain per producer connection — run the rate sweep on your own infrastructure to find yours. +1. **Throughput budget**: encryption imposes a CPU-driven throughput ceiling. As a planning formula: + + > **`proxy CPU (millicores) = 20 × produce throughput (MB/s)`** + + Add ×1.3 headroom for GC pauses and burst. This assumes matched consumer load (1:1 produce:consume) and was measured on AMD EPYC-Rome 2 GHz with AES-NI — calibrate on your own hardware using the rate sweep. + + Worked example: 100k msg/s at 1 KB = 100 MB/s produce → 100 × 20 = 2000m, plus headroom → ~2600m (~2.6 cores). 2. **Latency budget**: well below saturation, expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99. The overhead scales with how hard you're pushing — give yourself headroom and you'll barely notice it. -3. **Scaling**: the bottleneck is per-connection CPU (crypto, buffer management, and network I/O combined). Spread load across more producer connections first; then scale proxy pods horizontally. +3. **Scaling**: set `requests` equal to `limits` in your pod spec — this makes the CPU budget deterministic, which makes the throughput ceiling predictable. To increase throughput, raise the CPU limit. For redundancy, add proxy pods. 4. **KMS overhead**: DEK caching means Vault isn't on the hot path for every record. Our tests triggered only 5–19 DEK generation calls per benchmark run. The KMS is not the thing to worry about. @@ -142,12 +143,12 @@ We ran the same test with 4 producers sharing the same single partition. With 4 ## Caveats and next steps -These results come from a single proxy pod, a single partition, and single-pass measurements at each rate point. 
We know what the gaps are: +These results come from a single proxy pod and single-pass measurements at each rate point. A few things to keep in mind: -- **Connection sweep**: we saw 1 and 4 producers — we haven't yet swept 2, 8, 16 to characterise the full per-pod ceiling -- **Horizontal scaling**: we expect more proxy pods to scale linearly, but haven't measured it yet -- **Larger message sizes**: encryption overhead is almost certainly smaller in percentage terms for larger messages +- **Message size**: all results use 1 KB messages. The coefficient is message-size-dependent — encryption overhead as a percentage is likely lower for larger messages. +- **Replication factor**: the 1-topic rate sweep ran at RF=3. At that replication factor, Kafka's ISR replication traffic creates a per-partition ceiling that sits close to where proxy CPU also saturates — the two limits are entangled in those results. The sizing coefficient was derived from RF=1 multi-topic workloads specifically to isolate proxy CPU. The [companion engineering post]({% post_url 2026-05-28-benchmarking-the-proxy-under-the-hood %}) has that detail. +- **Horizontal scaling**: linear scaling has been validated across CPU allocations on a single pod; multi-pod horizontal scaling hasn't been measured but is expected to follow the same coefficient. -For the engineering story — why we built a custom harness on top of OMB, what the CPU flamegraphs actually show, and the bugs we found in our own tooling along the way — that's in the [companion post]({% post_url 2026-05-08-benchmarking-the-proxy-under-the-hood %}). +For the engineering story — why we built a custom harness on top of OMB, what the CPU flamegraphs actually show, and the bugs we found in our own tooling along the way — that's in the [companion post]({% post_url 2026-05-28-benchmarking-the-proxy-under-the-hood %}). 
The full benchmark suite, quickstart guide, and sizing reference are in `kroxylicious-openmessaging-benchmarks/` in the [main Kroxylicious repository](https://github.com/kroxylicious/kroxylicious). diff --git a/_posts/2026-05-08-benchmarking-the-proxy-under-the-hood.md b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md similarity index 71% rename from _posts/2026-05-08-benchmarking-the-proxy-under-the-hood.md rename to _posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md index fa378dfa..4aff656c 100644 --- a/_posts/2026-05-08-benchmarking-the-proxy-under-the-hood.md +++ b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md @@ -1,13 +1,13 @@ --- layout: post title: "Benchmarking a Kafka proxy: the engineering story" -date: 2026-05-08 00:00:00 +0000 +date: 2026-05-28 00:00:00 +0000 author: "Sam Barker" author_url: "https://github.com/SamBarker" categories: benchmarking performance engineering --- -The [first post]({% post_url 2026-05-01-benchmarking-the-proxy %}) covered what we measured and what the numbers mean for operators. This one is for the people who want to know how we measured it, what the flamegraphs actually show, and what we found when we started looking carefully at our own tooling. +The [first post]({% post_url 2026-05-21-benchmarking-the-proxy %}) covered what we measured and what the numbers mean for operators. This one is for the people who want to know how we measured it, what the flamegraphs actually show, and what we found when we started looking carefully at our own tooling. ## Why not Kafka's own tools? @@ -141,15 +141,62 @@ Total additional CPU: ~33%. This aligns closely with the ~26% throughput reducti If you wanted to optimise this, the highest-impact areas would be: reducing buffer copies (encrypt in-place or use composite buffers), pooling encryption buffers to reduce GC pressure, and caching `Cipher` instances to reduce per-record JDK security overhead. 
-## The per-connection ceiling +## Following the ceiling -The single-producer encryption ceiling at ~37k msg/sec raised the question of whether that was a per-pod limit or a per-connection limit. +### A problem with the workload -The answer came from a 4-producer rate sweep. Four producers sharing the same partition drove 47k+ msg/sec aggregate through the proxy while proxy CPU held at 570m/1000m — well below pod saturation. The Kafka partition became the bottleneck first. +The single-producer rate sweep hit a ceiling at ~37k msg/sec. Before drawing conclusions, we had to ask whether that was actually a proxy CPU ceiling — or something else. -The explanation: Netty assigns each client connection to its own event loop thread. Encryption happens synchronously on that thread. A single connection is bounded by one event loop's throughput, but additional connections get their own threads. The proxy's aggregate capacity is the sum of its event loop threads' individual capacities — until something else (the Kafka partition, the NIC, pod CPU) saturates first. +Our initial sweeps ran with replication factor 3, the standard production default. At RF=3, every message the Kafka leader receives goes out to 2 follower replicas. With 1 KB messages and 37k msg/sec, that's ~37 MB/s inbound to the leader and ~74 MB/s of replication traffic outbound (~111 MB/s outbound in total once a matched consumer is reading too). The Fyre cluster nodes had 10 GbE NICs, so the ceiling wasn't the NIC. But RF=3 does create a real per-partition I/O ceiling on the Kafka leader, and it sits right around where we were measuring. -Worth noting: with replication factor 3, every message the Kafka leader receives goes out to 2 follower replicas plus potentially one consumer. At 50k msg/sec with 1 KB messages that's ~1.2 Gbps outbound from the leader alone — confirming why the Fyre cluster nodes need 10 Gbps NICs. +The fix: RF=1, 10-topic workload. 
Dropping to RF=1 removes replication overhead; spreading across 10 partitions distributes load so no single partition hits its ceiling. We validated the fix with the passthrough proxy scenario: at 160k msg/sec total (16k per topic), proxy-no-filters matched baseline — Kafka was not the bottleneck. The sweep scaled to 640k msg/sec before hitting some uninvestigated ceiling well above where encryption constrains anything. + +### Is the encryption ceiling per-pod or per-connection? + +With a clean workload that isolates proxy CPU, we re-examined the ~37k figure. Running the same workload with 4 producers: proxy CPU had headroom to spare, and Kafka's partition became the bottleneck first. So the single-producer ceiling is not the pod ceiling. + +### The coefficient + +With the workload isolation in place, we swept encryption across CPU allocations. The throughput ceiling scaled linearly: + +| CPU limit | Encryption ceiling | +|-----------|-------------------| +| 1000m | ~40k msg/sec | +| 2000m | ~80k msg/sec | +| 4000m | ~160k msg/sec | + +From the 4-core sweep: safe at 160k msg/sec (p99: 447 ms), catastrophic at 320k msg/sec (p99: 537,000 ms). The saturation point is predictably between those two steps. + +Deriving the coefficient: at 4000m and 160k msg/sec with 1 KB messages — + +``` +160k msg/s × 1 KB = 160 MB/s produce throughput +With matched consumer load: 160 MB/s encrypt + 160 MB/s decrypt +→ 4000 mc / 320 MB/s bidirectional ≈ 12–13 mc per MB/s bidirectional +→ equivalently: 4000 mc / 160 MB/s produce ≈ 25 mc per MB/s produce +``` + +We measured the coefficient at mid-utilisation (80k msg/sec, 2000m) at ~10 mc/MB/s bidirectional — lower, because of fixed per-connection overhead that's amortised at higher load. The operator-facing formula uses 20 mc/MB/s of produce throughput (= 10 bidirectional × 2 for produce+consume), which sits between mid-utilisation and saturation and provides inherent conservatism. 
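The arithmetic behind the coefficient can be sketched in a few lines of Python. This is an illustrative calculation only, not part of the benchmark harness: the function name `mc_per_mb_s` is hypothetical, and it follows the text in treating 1,000 msg/sec of 1 KB messages as 1 MB/s.

```python
# Illustrative only: reproduces the coefficient arithmetic from the text.
# Assumes 1 KB messages, so 1,000 msg/s is treated as 1 MB/s.

def mc_per_mb_s(cpu_millicores: float, msgs_per_sec: float,
                msg_kb: float = 1.0, bidirectional: bool = False) -> float:
    """CPU cost in millicores per MB/s at a measured throughput ceiling."""
    produce_mb_s = msgs_per_sec * msg_kb / 1000.0
    # With matched consumer load the proxy encrypts and decrypts the same volume.
    mb_s = produce_mb_s * 2 if bidirectional else produce_mb_s
    return cpu_millicores / mb_s

# Ceilings reported in the CPU-limit table (limit in millicores -> msg/sec).
ceilings = {1000: 40_000, 2000: 80_000, 4000: 160_000}

for cpu, rate in ceilings.items():
    produce = mc_per_mb_s(cpu, rate)
    both = mc_per_mb_s(cpu, rate, bidirectional=True)
    print(f"{cpu}m: {produce:.1f} mc per MB/s produce, {both:.1f} bidirectional")
```

Because the reported ceilings scale linearly, every row yields the same saturation coefficient. The 20 mc/MB/s planning figure sits below that value because, as described above, it blends the saturation and mid-utilisation measurements.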
+ +One thing we observed: the proxy had 4 Netty event loop threads regardless of CPU limit. The throughput scaling isn't explained by a changing thread count, because the thread count stays the same. What changes is the CPU time budget available to those threads. The detailed relationship between CPU limit, thread scheduling, and throughput ceiling is more subtle than a simple thread-count model; what we can say empirically is that throughput scales linearly with the CPU limit, and the formula holds. + +### The prediction + +Rather than just reporting the 4-core result, we used the 1-core ceiling to make a falsifiable prediction: if the ceiling scales linearly, a 2-core pod should saturate at ~80k msg/sec. + +The 2-core sweep: + +| Rate | p99 | Verdict | +|------|-----|---------| +| 40k msg/sec | 626 ms | Comfortable | +| 80k msg/sec | 1,660 ms | Elevated — right at predicted ceiling | +| 160k msg/sec | 175,277 ms | Catastrophic | + +The prediction held. The ceiling is real, linear, and predictable — which is exactly what you want from a sizing model. + +Setting `requests` equal to `limits` makes this predictability practical: a pod that can burst above its CPU request introduces headroom uncertainty that breaks the model. With `requests == limits`, the CPU budget is fixed, the ceiling is fixed, and your capacity planning can rely on the coefficient. + +Worth noting: with RF=3 in production, every message the Kafka leader receives goes out to 2 follower replicas, plus one consumer under matched consumer load. At 50k msg/sec with 1 KB messages that's ~1.2 Gbps outbound from the leader alone — confirming why the Fyre cluster nodes need 10 GbE NICs, and why the replication ceiling matters for the benchmarking workload design. ## Bugs we found in our own tooling @@ -189,16 +236,10 @@ jbang src/main/java/io/kroxylicious/benchmarks/results/ResultComparator.java \ ## What's still open -The gaps we know about and plan to fill: - -1. 
**Connection sweep**: run 1, 2, 4, 8, 16 producers simultaneously at a fixed per-producer rate to characterise the per-pod aggregate ceiling with encryption. The plan is in `CONNECTION-SWEEP-PLAN.md`. - -2. **Horizontal scaling**: verify that adding proxy pods scales aggregate throughput linearly. - -3. **Multi-partition workloads**: isolate encryption cost without being bounded by Kafka's per-partition ceiling. - -4. **Multi-pass sweeps**: each rate point was measured once. Running each probe three times and taking the median would give tighter bounds, particularly in the saturation transition zone. +The coefficient is validated at 1, 2, and 4 cores for 1 KB messages. Known gaps: -5. **Message size variation**: larger messages should show lower encryption overhead as a percentage; smaller messages may show higher overhead. 1 KB is a reasonable middle ground but not the whole picture. +- **Message size variation**: larger messages should show lower overhead as a percentage; smaller messages may show higher. 1 KB is a reasonable middle ground but not the whole picture. +- **Horizontal scaling**: multiple proxy pods haven't been measured; linear scaling is expected but not confirmed. +- **Multi-pass sweeps**: each rate point was measured once. Running each probe three times and taking the median would give tighter bounds in the saturation transition zone. The operator-facing sizing reference and all the key tables are in `SIZING-GUIDE.md` in the benchmarks directory. \ No newline at end of file diff --git a/performance.markdown b/performance.markdown index 4ccd5194..269faeda 100644 --- a/performance.markdown +++ b/performance.markdown @@ -5,7 +5,7 @@ permalink: /performance/ toc: true --- -This page summarises the measured performance overhead of Kroxylicious. 
Numbers come from [benchmarks run on real hardware](/blog/2026/05/01/benchmarking-the-proxy/) using [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark), an industry-standard Kafka performance tool. +This page summarises the measured performance overhead of Kroxylicious. Numbers come from [benchmarks run on real hardware](/blog/2026/05/21/benchmarking-the-proxy/) using [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark), an industry-standard Kafka performance tool. ## Test environment @@ -77,15 +77,15 @@ Encryption adds measurable but predictable overhead. The cost scales with produc **With record encryption:** -- **Throughput**: plan for ~25% lower throughput per partition compared to direct Kafka +- **Throughput**: use `proxy CPU (millicores) = 20 × produce throughput (MB/s)` as a planning formula, then add ×1.3 headroom. Assumes matched consumer load and AMD EPYC-Rome 2 GHz with AES-NI — calibrate on your hardware. Validated at 1000m, 2000m, and 4000m. Example: 100k msg/s at 1 KB = 100 MB/s produce → 2000m + headroom → ~2600m. - **Latency**: expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99, scaling with how close to saturation you operate -- **Scaling**: the throughput ceiling is per-connection (one Netty event loop per client connection). Spreading load across more producers is the first scaling lever; adding proxy pods comes next +- **Scaling**: set `requests` equal to `limits` in your pod spec to make the CPU budget — and therefore the throughput ceiling — deterministic. Increase the CPU limit to raise throughput; add proxy pods for redundancy. - **KMS**: DEK caching means the KMS is not on the hot path. 
In testing, each benchmark run triggered only 5–19 DEK generation calls — the KMS is not a bottleneck --- ## Further reading -- [Operator guide: results, methodology, and sizing recommendations](/blog/2026/05/01/benchmarking-the-proxy/) — the full benchmark story for operators -- [Engineering deep dive: tooling, flamegraphs, and what we discovered](/blog/2026/05/08/benchmarking-the-proxy-under-the-hood/) — how we measured it, where the CPU goes, and what surprised us +- [Operator guide: results, methodology, and sizing recommendations](/blog/2026/05/21/benchmarking-the-proxy/) — the full benchmark story for operators +- [Engineering deep dive: tooling, flamegraphs, and what we discovered](/blog/2026/05/28/benchmarking-the-proxy-under-the-hood/) — how we measured it, where the CPU goes, and what surprised us - [Benchmark quickstart](https://github.com/kroxylicious/kroxylicious/tree/main/kroxylicious-openmessaging-benchmarks/QUICKSTART.md) — run the benchmarks yourself \ No newline at end of file From b2790b709976fa4168fa6eed3ed61b48fc8f100a Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Fri, 15 May 2026 17:04:04 +1200 Subject: [PATCH 05/25] Strengthen flamegraph narrative with selective L7 proxy story MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The proxy is selectively L7: default infrastructure filters do genuine Kafka protocol work (address rewriting, API version negotiation, metadata caching) while high-volume produce/consume traffic bypasses full deserialisation via the decode predicate. The 1.4% proxy CPU share validates this design, not just reflects it. Also drop the Fyre cluster upgrade section — OCP-internal incident with no relevance to readers. 
Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- ...5-28-benchmarking-the-proxy-under-the-hood.md | 16 +++++----------- 1 file changed, 5 insertions(+), 11 deletions(-) diff --git a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md index 4aff656c..dfed01f2 100644 --- a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md +++ b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md @@ -103,7 +103,11 @@ The flamegraphs below are fully interactive: hover over a frame to see its name | Kroxylicious proxy | 1.4% | | GC | 0.1% | -The proxy is overwhelmingly I/O-bound. 59% of CPU is in `send`/`recv` syscalls — the inherent cost of maintaining two TCP connections (client→proxy, proxy→Kafka) with data flowing through the JVM. The proxy itself accounts for 1.4%. It really is a TCP relay with protocol awareness. +The proxy is overwhelmingly I/O-bound. 59% of CPU is in `send`/`recv` syscalls — the inherent cost of maintaining two TCP connections (client→proxy, proxy→Kafka) with data flowing through the JVM. The proxy itself accounts for 1.4% — and understanding *why* that number is so small is the interesting part. + +Kroxylicious decodes Kafka RPCs selectively: each filter declares which API keys it cares about, and the proxy only deserialises messages that at least one filter needs. Even in the no-filter scenario, the default infrastructure filters are doing genuine L7 work — broker address rewriting, API version negotiation, topic name caching — which means metadata, FindCoordinator, and API version exchanges are fully decoded. But the high-volume produce and consume traffic? The decode predicate skips full deserialisation for those entirely, passing them through at close to L4 speed. + +The 1.4% is the cost of a proxy that is *selectively* L7: doing real Kafka protocol work where it matters, and treating the hot path like a TCP relay where it doesn't. 
That's not a side-effect — it's what the decode predicate design is for, and this flamegraph validates it. ### Encryption proxy (same 36,000 msg/sec rate) @@ -210,16 +214,6 @@ During the 4-producer rate sweep, we noticed that JFR recordings and flamegraphs Spotting these required noticing that two different probe flamegraphs were pixel-for-pixel identical, then working back through the restart logic. The lesson: when reusing a deployed cluster across multiple probes, validate that diagnostic collection is actually running fresh for each one. -## The cluster that wouldn't upgrade - -Midway through a benchmark campaign, the Fyre OpenShift cluster got stuck mid-upgrade. All 3 worker nodes were `SchedulingDisabled`, meaning benchmark pods couldn't schedule, meaning cluster operators (image-registry, ingress, monitoring, storage) went degraded, which blocked the upgrade from completing. - -The root cause was a MachineConfigOperator bug. Each worker's MachineConfigDaemon had finished its upgrade and set `desiredDrain=uncordon-...` — signalling it was ready to be uncordoned — but the MCO never acted on that signal. Workers sat cordoned indefinitely. - -Fix: `kubectl uncordon worker0 worker1 worker2`. Once uncordoned, pods scheduled, operators recovered, and the upgrade completed. - -Not a Kroxylicious bug, but it cost several hours of cluster recovery time during an active benchmark campaign. Worth knowing about if you're running OCP on Fyre. - ## Run it yourself Everything is in `kroxylicious-openmessaging-benchmarks/` in the [main Kroxylicious repository](https://github.com/kroxylicious/kroxylicious). See `QUICKSTART.md` for step-by-step instructions. You'll need a Kubernetes or OpenShift cluster, the Kroxylicious operator installed, and Helm 3. Minikube works for local runs — the quickstart covers recommended CPU and memory settings. 
From afe6b8fbd46033e9435a9bd571548d9925f16014 Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Fri, 15 May 2026 17:22:18 +1200 Subject: [PATCH 06/25] Tone pass on Post 1 and performance reference page - Warm up test environment intro: realistic deployment framing - Add conversational lead-in to sizing guidance in both documents - Improve caveats opener in Post 1 - Add caveats section to performance page (RF=3 masking, message size, horizontal scaling) Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- _posts/2026-05-21-benchmarking-the-proxy.md | 6 ++++-- performance.markdown | 16 +++++++++++++++- 2 files changed, 19 insertions(+), 3 deletions(-) diff --git a/_posts/2026-05-21-benchmarking-the-proxy.md b/_posts/2026-05-21-benchmarking-the-proxy.md index 3b2cd959..c9602299 100644 --- a/_posts/2026-05-21-benchmarking-the-proxy.md +++ b/_posts/2026-05-21-benchmarking-the-proxy.md @@ -25,7 +25,7 @@ We used [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchma ## Test environment -All results were collected on a 6-node OpenShift cluster on Fyre, IBM's internal cloud environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit. +No, we didn't run this on a laptop: the results come from a 6-node OpenShift cluster on Fyre, IBM's internal cloud platform, which gives a realistic deployment in a controlled environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit. | Component | Details | |-----------|---------| @@ -121,6 +121,8 @@ Going further: we swept the same workload at 1000m, 2000m, and 4000m CPU. The th ## Sizing guidance +Numbers without guidance aren't very useful, so here's how to translate these results into pod specs. + **Passthrough proxy**: size your Kafka cluster as you normally would. The proxy won't be the bottleneck — but if you want to verify that on your own hardware, the rate sweep is exactly the tool for it. Run the baseline and passthrough scenarios back-to-back and you'll have your own numbers. 
**With record encryption:** @@ -143,7 +145,7 @@ Going further: we swept the same workload at 1000m, 2000m, and 4000m CPU. The th ## Caveats and next steps -These results come from a single proxy pod and single-pass measurements at each rate point. A few things to keep in mind: +These are real results from real hardware, but they don't tell a story for your workload. A few things worth knowing before you put these numbers in a slide deck: - **Message size**: all results use 1 KB messages. The coefficient is message-size-dependent — encryption overhead as a percentage is likely lower for larger messages. - **Replication factor**: the 1-topic rate sweep ran at RF=3. At that replication factor, Kafka's ISR replication traffic creates a per-partition ceiling that sits close to where proxy CPU also saturates — the two limits are entangled in those results. The sizing coefficient was derived from RF=1 multi-topic workloads specifically to isolate proxy CPU. The [companion engineering post]({% post_url 2026-05-28-benchmarking-the-proxy-under-the-hood %}) has that detail. diff --git a/performance.markdown b/performance.markdown index 269faeda..fc70891d 100644 --- a/performance.markdown +++ b/performance.markdown @@ -5,7 +5,7 @@ permalink: /performance/ toc: true --- -This page summarises the measured performance overhead of Kroxylicious. Numbers come from [benchmarks run on real hardware](/blog/2026/05/21/benchmarking-the-proxy/) using [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark), an industry-standard Kafka performance tool. +This page summarises the measured performance overhead of Kroxylicious. Numbers come from [benchmarks run on real hardware](/blog/2026/05/21/benchmarking-the-proxy/) using [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark), an industry-standard Kafka performance tool. 
No, we didn't run this on a laptop — it's a realistic deployment: a 6-node OpenShift cluster on Fyre, IBM's internal cloud platform — a controlled environment. ## Test environment @@ -73,6 +73,8 @@ Encryption adds measurable but predictable overhead. The cost scales with produc ## Sizing guidance +Numbers without guidance aren't very useful, so here's how to translate these results into pod specs. + **Passthrough proxy**: size your Kafka cluster as you normally would. The proxy will not be the bottleneck. **With record encryption:** @@ -84,6 +86,18 @@ Encryption adds measurable but predictable overhead. The cost scales with produc --- +## Caveats + +These numbers come from a single proxy pod, 1 KB messages, and single-pass measurements. A few things that matter when applying them to your workload: + +- **Message size**: the sizing coefficient is message-size-dependent — encryption overhead as a percentage is likely lower for larger messages +- **Replication factor**: the 1-topic latency and ceiling results ran at RF=3; at that replication factor Kafka's ISR replication creates a per-partition ceiling that sits close to where proxy CPU saturates. The sizing coefficient was derived from RF=1 multi-topic workloads to isolate proxy CPU +- **Horizontal scaling**: linear scaling has been validated across CPU allocations on a single pod; multi-pod scaling hasn't been measured but is expected to follow the same coefficient + +The [engineering post](/blog/2026/05/28/benchmarking-the-proxy-under-the-hood/) has the full methodology detail. 
+ +--- + ## Further reading - [Operator guide: results, methodology, and sizing recommendations](/blog/2026/05/21/benchmarking-the-proxy/) — the full benchmark story for operators From ee44ce74606bc240f605fb27e383ae5da904911d Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Fri, 15 May 2026 22:29:57 +1200 Subject: [PATCH 07/25] WIP: Redrafting the engineering deep dive Signed-off-by: Sam Barker --- ...8-benchmarking-the-proxy-under-the-hood.md | 32 ++++++++++++------- performance.markdown | 2 +- 2 files changed, 22 insertions(+), 12 deletions(-) diff --git a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md index dfed01f2..40f2b9ce 100644 --- a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md +++ b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md @@ -1,45 +1,55 @@ --- layout: post -title: "Benchmarking a Kafka proxy: the engineering story" +title: "How hard can it be??? Maxing out a Kroxylicious instance" date: 2026-05-28 00:00:00 +0000 author: "Sam Barker" author_url: "https://github.com/SamBarker" categories: benchmarking performance engineering --- -The [first post]({% post_url 2026-05-21-benchmarking-the-proxy %}) covered what we measured and what the numbers mean for operators. This one is for the people who want to know how we measured it, what the flamegraphs actually show, and what we found when we started looking carefully at our own tooling. +How hard can it be? We started with a laptop, a codebase, and a lot of confidence it was fast. We ended up with a benchmark harness, a six-node cluster, and a much more nuanced answer. + +Harder than expected. More interesting too. + +We gave everyone [the numbers]({% post_url 2026-05-21-benchmarking-the-proxy %}) in a bland, but slide worthy way, already. 
This one is the engineering story: how we built the harness, what the flamegraphs actually show, the workload design choices that changed the answers, and the bugs we found in our own tooling. ## Why not Kafka's own tools? Kafka ships with `kafka-producer-perf-test` and `kafka-consumer-perf-test`. We'd used them before. The problems: - **Too noisy**: individual runs produced widely varying results depending on JVM warm-up, scheduling jitter, and GC behaviour. Results were hard to trust and harder to compare across scenarios. -- **Producer-only view**: `kafka-producer-perf-test` gives you publish latency, but nothing about the consumer side. You can't see end-to-end latency — which is what operators actually care about. +- **Producer-only view**: `kafka-producer-perf-test` gives you publish latency, but nothing about the consumer side. You can't see end-to-end latency — which is something operators actually care about. - **Awkward to sweep**: running parametric rate sweeps requires scripting around these tools, and comparing results across scenarios requires manual work. +- **Coordinated omission**: under load, `kafka-producer-perf-test` only measures requests it actually sends! So when things start backing up and applying back pressure, the send rate drops and the latency stays looking nice and healthy. Only it's not healthy in reality: things are queuing up in your producer. -[OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) is a better fit. It's an industry-standard tool used by Confluent, the Pulsar team, and others for their published performance comparisons. OMB coordinates producers and consumers across separate worker pods, runs a configurable warmup phase before taking measurements, and outputs structured JSON that's straightforward to process programmatically. +And critically, it's never heard of Kroxylicious... You have though, you're here!
-Using OMB also means our numbers are directly comparable to other published Kafka benchmarks — that credibility matters when you're trying to make the case that your proxy doesn't break things. +[OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) is a better fit. It's an industry-standard tool used by Confluent, the Pulsar team, and others for their published performance comparisons - so who am I to argue? OMB coordinates producers and consumers across separate worker pods, runs a configurable warmup phase before taking measurements, takes latency measurement seriously by accounting for coordinated omission, and outputs structured JSON that's straightforward to process programmatically. What's not to like? + +Using OMB also means our methodology is directly comparable to other published Kafka benchmarks. The numbers aren't comparable, of course: it's not the same hardware, network conditions or phase of the moon. ## What we built on top of OMB -OMB handles the measurement. We built everything around it: deployment, teardown, diagnostics collection, and result processing. All of it lives in `kroxylicious-openmessaging-benchmarks/` in the main repo. +So we just fire up OMB and get some numbers, right? Errr no. OMB just does the measurement part. I work really hard at being lazy, I hate clicking things with a mouse, and I knew these tests needed to be repeatable. So we scripted deployment (of all the things), teardown (for isolation), diagnostic collection (WHAT BROKE NOW??), and last but not least result processing (what does this wall of JSON mean?). + +So now all of that lives in [`kroxylicious-openmessaging-benchmarks`](https://github.com/kroxylicious/kroxylicious/tree/main/kroxylicious-openmessaging-benchmarks) in the main tree (mono repo FTW).
### Helm chart A Helm chart (`helm/kroxylicious-benchmark/`) deploys the full benchmark stack into Kubernetes: - OMB coordinator and worker pods -- A Strimzi Kafka cluster -- The Kroxylicious proxy (via the Kroxylicious Kubernetes operator) -- HashiCorp Vault (for the KMS in the encryption scenario) +- A Strimzi Kafka cluster - deploying Kafka on K8s what else are you going to use? (answers to /dev/null) +- The Kroxylicious operator +- The Kroxylicious proxy +- HashiCorp Vault (for the KMS in the encryption scenario). Importantly if you have your own KMS (and you will run this yourself for your workload, right?!) you can plug that in instead. Scenario-specific configuration lives in `helm/kroxylicious-benchmark/scenarios/` as YAML overrides: | Scenario file | What it deploys | |---------------|-----------------| | `baseline-values.yaml` | Direct Kafka, no proxy | -| `proxy-no-filters-values.yaml` | Proxy with empty filter chain | +| `proxy-no-filters-values.yaml` | Proxy with no user filters | | `encryption-values.yaml` | Proxy with AES-256-GCM encryption and Vault | | `rate-sweep-values.yaml` | Extended run profiles for sweep experiments | @@ -236,4 +246,4 @@ The coefficient is validated at 1, 2, and 4 cores for 1 KB messages. Known gaps: - **Horizontal scaling**: multiple proxy pods haven't been measured; linear scaling is expected but not confirmed. - **Multi-pass sweeps**: each rate point was measured once. Running each probe three times and taking the median would give tighter bounds in the saturation transition zone. -The operator-facing sizing reference and all the key tables are in `SIZING-GUIDE.md` in the benchmarks directory. \ No newline at end of file +The operator-facing sizing reference and all the key tables are in `SIZING-GUIDE.md` in the benchmarks directory. 
diff --git a/performance.markdown b/performance.markdown index fc70891d..058f650c 100644 --- a/performance.markdown +++ b/performance.markdown @@ -101,5 +101,5 @@ The [engineering post](/blog/2026/05/28/benchmarking-the-proxy-under-the-hood/) ## Further reading - [Operator guide: results, methodology, and sizing recommendations](/blog/2026/05/21/benchmarking-the-proxy/) — the full benchmark story for operators -- [Engineering deep dive: tooling, flamegraphs, and what we discovered](/blog/2026/05/28/benchmarking-the-proxy-under-the-hood/) — how we measured it, where the CPU goes, and what surprised us +- [How hard can it be??? Maxing out a Kroxylicious instance](/blog/2026/05/28/benchmarking-the-proxy-under-the-hood/) — how we measured it, where the CPU goes, and what surprised us - [Benchmark quickstart](https://github.com/kroxylicious/kroxylicious/tree/main/kroxylicious-openmessaging-benchmarks/QUICKSTART.md) — run the benchmarks yourself \ No newline at end of file From 207808b30ef601909983e39c43863011a296b5c0 Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Mon, 18 May 2026 11:57:16 +1200 Subject: [PATCH 08/25] Tone pass and narrative restructure on engineering post MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - New opening: laptop/codebase/confidence → harness/cluster/nuance - Why not Kafka tools: add coordinated omission bullet with voice - What we built: reframe around two experimental questions (rate sweep, connection sweep) before tooling details; add two-dimensions framing - Banishing click-ops: replace dry Helm section with Red Hat/operator motivation and all-your-CRs joke - JSON always comes in megabytes: replace docs dump with signal/noise framing; sharpen Comparator vs Summariser distinction - Following the ceiling: rewrite as investigation arc (spare CPU → what were we hitting? → RF=3 masking → connection sweep → coefficient) - Rename Post 2 title to "How hard can it be??? 
Maxing out a Kroxylicious instance" - Revert slug rename (benchmarking-the-proxy-under-the-hood stays) - Update performance.markdown cross-links to match Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- ...8-benchmarking-the-proxy-under-the-hood.md | 95 ++++++++++--------- 1 file changed, 52 insertions(+), 43 deletions(-) diff --git a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md index 40f2b9ce..cecb09bc 100644 --- a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md +++ b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md @@ -34,49 +34,52 @@ So we just fire up OMB and get some numbers, right? Errr no. OMB just does the m So now all of that lives in [`kroxylicious-openmessaging-benchmarks`](https://github.com/kroxylicious/kroxylicious/tree/main/kroxylicious-openmessaging-benchmarks) in the main tree (mono repo FTW). -### Helm chart +So we have a tool and we think Kroxylicious is fast — but how do we turn that into something we can actually show management? "Fast" is shorthand for "low impact", and the impact of a proxy shows up along two dimensions: -A Helm chart (`helm/kroxylicious-benchmark/`) deploys the full benchmark stack into Kubernetes: +- **Latency**: how much extra time does this additional hop add? +- **Throughput**: how much does routing traffic through the proxy cost my topic throughput? -- OMB coordinator and worker pods -- A Strimzi Kafka cluster - deploying Kafka on K8s what else are you going to use? (answers to /dev/null) -- The Kroxylicious operator -- The Kroxylicious proxy -- HashiCorp Vault (for the KMS in the encryption scenario). Importantly if you have your own KMS (and you will run this yourself for your workload, right?!) you can plug that in instead. +Two dimensions, two questions — and it turns out they need quite different experimental approaches to answer. 
-Scenario-specific configuration lives in `helm/kroxylicious-benchmark/scenarios/` as YAML overrides: +**Rate sweep — where does latency start to bite?** +`scripts/rate-sweep.sh` holds the connection count fixed and steps the producer rate up in fixed increments, letting the cluster stabilise at each step. We defined saturation as the sustained throughput dropping more than 5% below the target rate. The rate sweep tells you where the cliff edge is and what latency looks like as you approach it. -| Scenario file | What it deploys | -|---------------|-----------------| -| `baseline-values.yaml` | Direct Kafka, no proxy | -| `proxy-no-filters-values.yaml` | Proxy with no user filters | -| `encryption-values.yaml` | Proxy with AES-256-GCM encryption and Vault | -| `rate-sweep-values.yaml` | Extended run profiles for sweep experiments | - -Separating scenarios into override files means the base chart stays stable while each scenario adds only what it needs. Switching between scenarios doesn't require touching the chart itself. +**Connection sweep — is the ceiling per-connection or per-pod?** +`scripts/connection-sweep.sh` holds the per-producer rate fixed and steps up the number of producers (1, 2, 4, 8, 16 by default) — consumers scale to match. This tells you the aggregate throughput ceiling of a single proxy pod (need more? help out!): the point where adding more connections stops increasing total throughput. -### Orchestration scripts - -**`scripts/run-benchmark.sh`** orchestrates a single benchmark run: +Both sweeps use `scripts/run-benchmark.sh` under the hood, which: 1. Deploys the Helm chart for the requested scenario 2. Waits for the OMB Job to complete 3. Collects results: OMB JSON, a JFR recording, an async-profiler flamegraph, and a Prometheus metrics snapshot 4. Tears down -The `--skip-deploy` flag lets you re-run a probe against an already-deployed cluster — essential for rate sweeps where you want to deploy once and probe many times. 
+The `--skip-deploy` flag lets you re-run a probe against an already-deployed cluster — both sweep scripts deploy once and probe many times. -**`scripts/rate-sweep.sh`** wraps `run-benchmark.sh` to drive parametric sweeps. It takes `--min-rate`, `--max-rate`, `--step-percent`, and one or more `--scenario` flags. The first probe deploys; subsequent probes use `--skip-deploy`. +### Banishing click-ops -### Result processing +Coming from Red Hat, my instinct is to reach for an operator — but operators are great at managing cohesive things. The stack we needed to deploy is anything but cohesive: an OMB coordinator, worker pods, a Strimzi-managed Kafka cluster, the Kroxylicious operator, the proxy itself, and HashiCorp Vault for the KMS. It's less "managed application" and more *all your ~~base~~ CRs belong to us*. -Three JBang-runnable Java programs handle result analysis: +We could have dumped some YAML in a directory and used `kubectl apply -k`. But I am lazy, and that's a lot of typing. Helm handles this beautifully — one chart, scenario-specific overrides, and a single command to deploy the whole thing. Scenario-specific configuration lives in `helm/kroxylicious-benchmark/scenarios/` as YAML overrides — the base chart stays stable and each scenario adds only what it needs: -- **`RunMetadata.java`**: generates `run-metadata.json` alongside each result. Captures git commit, timestamp, cluster node specs (architecture, CPU, RAM), and — on OpenShift — NIC speed read from the host via the MachineConfigDaemon pod. -- **`ResultComparator.java`**: reads two scenario result directories and produces a markdown comparison table. -- **`ResultSummariser.java`**: reads a rate-sweep result directory and prints a saturation table: target rate, achieved rate, p99, and whether the probe saturated.
+| Scenario file | What it deploys | +|---------------|-----------------| +| `baseline-values.yaml` | Direct Kafka, no proxy | +| `proxy-no-filters-values.yaml` | Proxy with no user filters | +| `encryption-values.yaml` | Proxy with AES-256-GCM encryption and Vault | +| `rate-sweep-values.yaml` | Extended run profiles for sweep experiments | + +If you have your own KMS — and you will run this on your own infrastructure, right?! — you can swap Vault out without touching the base chart. + +### JSON always comes in megabytes -Getting NIC speed from a Kubernetes node turned out to be non-trivial — you need host filesystem access to read `/sys/class/net//speed`. On OpenShift, the MachineConfigDaemon pods mount the host at `/rootfs`, so we `kubectl exec` into the MCD pod and `chroot /rootfs` to read the speed file without creating any new privileged resources. +Each benchmark run produces a blob of structured JSON. Useful in principle; a wall of noise in practice. Three [JBang](https://www.jbang.dev/)-runnable Java programs (I'm a dyed-in-the-wool Java dev, sue me) pull out the signal: + +- **`RunMetadata`**: captures the run context — git commit, timestamp, cluster node specs (architecture, CPU, RAM), and on OpenShift, NIC speed read from the host via the MachineConfigDaemon pod. Generates `run-metadata.json` alongside each result so you can always tell what conditions produced a number. This is what makes run-to-run comparisons meaningful — and when a run takes 12 hours, trust me, you don't want to re-run it without good reason. +- **`ResultComparator`**: answers "did this change hurt?" — reads two scenario result directories and produces a markdown comparison table. Baseline vs encryption is the obvious use, but the tool is generic. Already running a proxy? proxy-no-filters vs encryption tells you the cost of the filter itself, not the proxy hop. Building your own filter? That's your comparison — measure the chain with and without it.
+- **`ResultSummariser`**: answers "where does it fall over?" — reads a rate-sweep result directory and prints a summary table: target rate, achieved rate, p99, and whether the probe saturated. Where ResultComparator compares two scenarios at a fixed rate, ResultSummariser tracks one scenario across a range of rates. + +Getting NIC speed from a Kubernetes node turned out to be non-trivial — you need host filesystem access to read `/sys/class/net//speed`. On OpenShift, the MachineConfigDaemon pods mount the host at `/rootfs`, so we `kubectl exec` into the MCD pod and `chroot /rootfs` to read the speed file without creating any new privileged resources. Fiddly, but worth it — knowing your NIC speed is the difference between "the ceiling was the NIC" and "the ceiling wasn't the NIC". ## Workload design @@ -157,21 +160,29 @@ If you wanted to optimise this, the highest-impact areas would be: reducing buff ## Following the ceiling -### A problem with the workload +We had a rate-sweep result. On our test cluster, the encryption scenario hit a ceiling — the proxy was saturating around 37k msg/sec. We'd maxed out the proxy, right? + +Well. The proxy had spare CPU cycles. -The single-producer rate sweep hit a ceiling at ~37k msg/sec. Before drawing conclusions, we had to ask whether that was actually a proxy CPU ceiling — or something else. +That's interesting. If the proxy isn't CPU-saturated, then whatever we hit isn't the proxy's ceiling — it's something else's. Time to work out what. -Our initial sweeps ran with replication factor 3, the standard production default. At RF=3, every message the Kafka leader receives goes out to 2 follower replicas. With 1 KB messages and 37k msg/sec, that's ~37 MB/s inbound to the leader and ~111 MB/s total replication traffic outbound — and the Fyre cluster nodes had 10 GbE NICs, so the ceiling wasn't the NIC. But RF=3 does create a real per-partition I/O ceiling on the Kafka leader, and it sits right around where we were measuring. 
+### What were we actually hitting? -The fix: RF=1, 10-topic workload. Dropping to RF=1 removes replication overhead; spreading across 10 partitions distributes load so no single partition hits its ceiling. We validated the fix with the passthrough proxy scenario: at 160k msg/sec total (16k per topic), proxy-no-filters matched baseline — Kafka was not the bottleneck. The sweep scaled to 640k msg/sec before hitting some uninvestigated ceiling well above where encryption constrains anything. +Our initial sweeps ran with replication factor 3 — the standard production default, and for good reason. But RF=3 means every message the Kafka leader receives gets replicated to 2 followers. At 37k msg/sec with 1 KB messages, that's ~111 MB/s of replication traffic outbound from the leader alone. The Fyre nodes have 10 GbE NICs so the network wasn't saturated, but RF=3 creates a real per-partition I/O ceiling on the Kafka leader — and it sits right around where we were measuring. -### Is the encryption ceiling per-pod or per-connection? +The ceiling on our hardware wasn't the proxy. It was Kafka. -With a clean workload that isolates proxy CPU, we re-examined the ~37k figure. Running the same workload with 4 producers: proxy CPU had headroom to spare, and Kafka's partition became the bottleneck first. So the single-producer ceiling is not the pod ceiling. +The fix: RF=1, 10-topic workload. Drop replication overhead; spread load across 10 partitions so no single partition hits its ceiling. We validated it with the passthrough proxy: at 160k msg/sec total the proxy matched baseline, and the sweep scaled past 640k before hitting some uninvestigated ceiling far above where encryption constrains anything. -### The coefficient +### We maxed out the proxy, right? -With the workload isolation in place, we swept encryption across CPU allocations. The throughput ceiling scaled linearly: +With a clean workload that actually isolates proxy CPU, we looked again. 
The connection sweep answered the question: with 4 producers at a fixed per-producer rate, aggregate throughput climbed well past the single-producer ceiling — and proxy CPU still had headroom. Kafka's partition ran out first. + +So the single-producer ceiling on our cluster isn't the pod ceiling. It's what one connection could push on that hardware. The proxy had more to give. + +### How much more? + +We swept the CPU limit: 1000m, 2000m, 4000m. The throughput ceiling scaled linearly with the CPU budget: | CPU limit | Encryption ceiling | |-----------|-------------------| @@ -179,7 +190,9 @@ With the workload isolation in place, we swept encryption across CPU allocations | 2000m | ~80k msg/sec | | 4000m | ~160k msg/sec | -From the 4-core sweep: safe at 160k msg/sec (p99: 447 ms), catastrophic at 320k msg/sec (p99: 537,000 ms). The saturation point is predictably between those two steps. +At 4000m: comfortable at 160k msg/sec (p99: 447 ms), catastrophic at 320k (p99: 537,000 ms). The proxy isn't hitting a fixed architectural wall — it's hitting a CPU budget wall, and that wall moves when you give it more CPU. + +One thing we noticed along the way: the proxy ran 4 Netty event loop threads regardless of CPU limit. The throughput scaling isn't explained by thread count changing — it doesn't. What changes is the CPU time budget available to those threads. The relationship between CPU limit, thread scheduling, and throughput ceiling is more subtle than a simple thread-count model; what we can say empirically is that throughput scales linearly with the CPU limit. Deriving the coefficient: at 4000m and 160k msg/sec with 1 KB messages — @@ -190,13 +203,11 @@ With matched consumer load: 160 MB/s encrypt + 160 MB/s decrypt → equivalently: 4000 mc / 160 MB/s produce ≈ 25 mc per MB/s produce ``` -We measured the coefficient at mid-utilisation (80k msg/sec, 2000m) at ~10 mc/MB/s bidirectional — lower, because of fixed per-connection overhead that's amortised at higher load. 
The operator-facing formula uses 20 mc/MB/s of produce throughput (= 10 bidirectional × 2 for produce+consume), which sits between mid-utilisation and saturation and provides inherent conservatism. - -One thing we observed: the proxy had 4 Netty event loop threads regardless of CPU limit. The throughput scaling isn't explained by thread count changing — it doesn't. What changes is the CPU time budget available to those threads. The detailed relationship between CPU limit, thread scheduling, and throughput ceiling is more subtle than a simple thread-count model; what we can say empirically is that throughput scales linearly with the CPU limit, and the formula holds. +We measured the coefficient at mid-utilisation (80k msg/sec, 2000m) at ~10 mc/MB/s bidirectional — lower, because fixed per-connection overhead gets amortised at higher load. The operator-facing formula uses 20 mc/MB/s of produce throughput, which sits between mid-utilisation and saturation and gives inherent conservatism. ### The prediction -Rather than just reporting the 4-core result, we used the 1-core ceiling to make a falsifiable prediction: if the ceiling scales linearly, a 2-core pod should saturate at ~80k msg/sec. +Rather than just report the results, we used the 1-core ceiling to make a falsifiable prediction: if the ceiling scales linearly with CPU budget, a 2-core pod should saturate at ~80k msg/sec. The 2-core sweep: @@ -208,9 +219,7 @@ The 2-core sweep: The prediction held. The ceiling is real, linear, and predictable — which is exactly what you want from a sizing model. -Setting `requests` equal to `limits` makes this predictability practical: a pod that can burst above its CPU limit introduces headroom uncertainty that breaks the model. With `requests == limits`, the CPU budget is fixed, the ceiling is fixed, and your capacity planning can rely on the coefficient. - -Worth noting: with RF=3 in production, every message the Kafka leader receives goes out to 2 follower replicas. 
At 50k msg/sec with 1 KB messages that's ~1.2 Gbps outbound from the leader alone — confirming why the Fyre cluster nodes need 10 GbE NICs, and why the replication ceiling matters for the benchmarking workload design. +Setting `requests` equal to `limits` is what makes this practical: a pod that can burst above its CPU limit introduces headroom uncertainty that breaks the model. With `requests == limits`, the CPU budget is fixed, the ceiling is fixed, and your capacity planning can rely on the coefficient. ## Bugs we found in our own tooling From 837e3d6349677a2aae42144023266595adca950d Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Mon, 18 May 2026 12:02:24 +1200 Subject: [PATCH 09/25] Standardise on msg/s throughout (was mixed msg/s and msg/sec) Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- _posts/2026-05-21-benchmarking-the-proxy.md | 14 +++---- ...8-benchmarking-the-proxy-under-the-hood.md | 42 +++++++++---------- performance.markdown | 8 ++-- 3 files changed, 32 insertions(+), 32 deletions(-) diff --git a/_posts/2026-05-21-benchmarking-the-proxy.md b/_posts/2026-05-21-benchmarking-the-proxy.md index c9602299..fc9dba25 100644 --- a/_posts/2026-05-21-benchmarking-the-proxy.md +++ b/_posts/2026-05-21-benchmarking-the-proxy.md @@ -45,7 +45,7 @@ One important caveat: this Kafka cluster is deliberately untuned. We're not tryi Good news first. The proxy itself — with no filter chain, just routing traffic — adds almost nothing. -**10 topics, 1 KB messages (5,000 msg/sec per topic):** +**10 topics, 1 KB messages (5,000 msg/s per topic):** | Metric | Baseline | Proxy | Delta | |--------|----------|-------|-------| @@ -55,7 +55,7 @@ Good news first. 
The proxy itself — with no filter chain, just routing traffic | E2E latency p99 | 185.00 ms | 186.00 ms | +1.00 ms (+0.5%) | | Publish rate | 5,002 msg/s | 5,002 msg/s | 0 | -**100 topics, 1 KB messages (500 msg/sec per topic):** +**100 topics, 1 KB messages (500 msg/s per topic):** | Metric | Baseline | Proxy | Delta | |--------|----------|-------|-------| @@ -101,19 +101,19 @@ A rate-sweep is exactly what it sounds like: pick a starting rate, let OMB run l We started at 34k (right where the latency table started getting interesting) and stepped up in 5% increments. The results: -- **Baseline**: sustained up to ~50,000–52,000 msg/sec (the ceiling we observed on our test cluster) -- **Encryption**: sustained up to **~37,200 msg/sec**, then started intermittently saturating +- **Baseline**: sustained up to ~50,000–52,000 msg/s (the ceiling we observed on our test cluster) +- **Encryption**: sustained up to **~37,200 msg/s**, then started intermittently saturating - **Cost: approximately 26% fewer messages per second per partition** -The transition wasn't a clean cliff edge — between 37,600 and 42,000 msg/sec the proxy alternated between sustaining and saturating. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Above ~39,000 msg/sec, p99 latency regularly spiked above 1,700 ms. Stay below 37k and you're fine. Creep above it and you'll notice. The numbers are not absolute — they are just what we measured on our cluster; your mileage **will vary**. +The transition wasn't a clean cliff edge — between 37,600 and 42,000 msg/s the proxy alternated between sustaining and saturating. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. 
Above ~39,000 msg/s, p99 latency regularly spiked above 1,700 ms. Stay below 37k and you're fine. Creep above it and you'll notice. The numbers are not absolute — they are just what we measured on our cluster; your mileage **will vary**. ### The ceiling scales with CPU budget The fact the proxy is low latency didn't surprise me, but this did — and it matters when we think about scaling. We maxed out a single connection, but that didn't mean we'd maxed out the proxy. -Once we had the single-producer encryption ceiling at ~37k msg/sec, the obvious question was: is that the limit for the whole proxy pod, or just for one connection? We ran the same test with 4 producers. With 4 connections the proxy sustained well past the single-producer ceiling — proxy CPU had headroom to spare, and Kafka's partition became the bottleneck first. +Once we had the single-producer encryption ceiling at ~37k msg/s, the obvious question was: is that the limit for the whole proxy pod, or just for one connection? We ran the same test with 4 producers. With 4 connections the proxy sustained well past the single-producer ceiling — proxy CPU had headroom to spare, and Kafka's partition became the bottleneck first. -Going further: we swept the same workload at 1000m, 2000m, and 4000m CPU. The throughput ceiling scaled linearly with the CPU budget — 1000m at ~40k msg/sec, 2000m at ~80k, 4000m at ~160k. The proxy isn't hitting a fixed architectural wall; it's hitting a CPU budget wall, and that wall moves when you give it more CPU. +Going further: we swept the same workload at 1000m, 2000m, and 4000m CPU. The throughput ceiling scaled linearly with the CPU budget — 1000m at ~40k msg/s, 2000m at ~80k, 4000m at ~160k. The proxy isn't hitting a fixed architectural wall; it's hitting a CPU budget wall, and that wall moves when you give it more CPU. **The practical implication**: the throughput ceiling is not a fixed number — it's a function of the CPU you allocate. 
Set `requests` equal to `limits` in your pod spec; this makes the CPU budget deterministic and the ceiling predictable. The companion engineering post has the full story of how we found this, including the workload design choices needed to isolate proxy CPU from Kafka's own limits. diff --git a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md index cecb09bc..055856d9 100644 --- a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md +++ b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md @@ -85,13 +85,13 @@ Getting NIC speed from a Kubernetes node turned out to be non-trivial — you ne The primary workload used **1 topic, 1 partition, 1 KB messages**. This is deliberate. Concentrating all traffic on a single partition pushes things to their limits at lower absolute rates, which makes the proxy overhead easier to isolate: when the system saturates, it's the proxy, not a spread-out broker fleet. -Multi-topic workloads (10 topics, 100 topics) were used to verify that the overhead characteristics hold when load is distributed. At 5,000 msg/sec per topic across 10 topics, every topic-partition pair is well below any saturation point — so what you're measuring is steady-state overhead, not ceiling behaviour. +Multi-topic workloads (10 topics, 100 topics) were used to verify that the overhead characteristics hold when load is distributed. At 5,000 msg/s per topic across 10 topics, every topic-partition pair is well below any saturation point — so what you're measuring is steady-state overhead, not ceiling behaviour. -For throughput ceiling testing we used rate sweeps: start at 34,000 msg/sec, step up by 5% until achieved rate drops below 95% of target. The knee of that curve is the saturation point. +For throughput ceiling testing we used rate sweeps: start at 34,000 msg/s, step up by 5% until achieved rate drops below 95% of target. The knee of that curve is the saturation point. 
## The flamegraph: where the CPU actually goes -We captured CPU profiles using async-profiler attached to the proxy JVM via `jcmd JVMTI.agent_load`, during the steady-state measurement phase at 36,000 msg/sec. These are self-time percentages — where the CPU is actually spending cycles, not inclusive call-tree time. +We captured CPU profiles using async-profiler attached to the proxy JVM via `jcmd JVMTI.agent_load`, during the steady-state measurement phase at 36,000 msg/s. These are self-time percentages — where the CPU is actually spending cycles, not inclusive call-tree time. The flamegraphs below are fully interactive: hover over a frame to see its name and percentage, click to zoom in, Ctrl+F to search. Scroll within the frame to explore the full stack depth. @@ -101,9 +101,9 @@ The flamegraphs below are fully interactive: hover over a frame to see its name -
CPU flamegraph — passthrough proxy (no filters), 36,000 msg/sec, 1 topic, 1 KB messages. Open full screen ↗
+
CPU flamegraph — passthrough proxy (no filters), 36,000 msg/s, 1 topic, 1 KB messages. Open full screen ↗
| Category | CPU share | @@ -122,15 +122,15 @@ Kroxylicious decodes Kafka RPCs selectively: each filter declares which API keys The 1.4% is the cost of a proxy that is *selectively* L7: doing real Kafka protocol work where it matters, and treating the hot path like a TCP relay where it doesn't. That's not a side-effect — it's what the decode predicate design is for, and this flamegraph validates it. -### Encryption proxy (same 36,000 msg/sec rate) +### Encryption proxy (same 36,000 msg/s rate)
-
CPU flamegraph — encryption proxy (AES-256-GCM), 36,000 msg/sec, 1 topic, 1 KB messages. Open full screen ↗
+
CPU flamegraph — encryption proxy (AES-256-GCM), 36,000 msg/s, 1 topic, 1 KB messages. Open full screen ↗
| Category | No-filters | Encryption | Delta | @@ -160,7 +160,7 @@ If you wanted to optimise this, the highest-impact areas would be: reducing buff ## Following the ceiling -We had a rate-sweep result. On our test cluster, the encryption scenario hit a ceiling — the proxy was saturating around 37k msg/sec. We'd maxed out the proxy, right? +We had a rate-sweep result. On our test cluster, the encryption scenario hit a ceiling — the proxy was saturating around 37k msg/s. We'd maxed out the proxy, right? Well. The proxy had spare CPU cycles. @@ -168,11 +168,11 @@ That's interesting. If the proxy isn't CPU-saturated, then whatever we hit isn't ### What were we actually hitting? -Our initial sweeps ran with replication factor 3 — the standard production default, and for good reason. But RF=3 means every message the Kafka leader receives gets replicated to 2 followers. At 37k msg/sec with 1 KB messages, that's ~111 MB/s of replication traffic outbound from the leader alone. The Fyre nodes have 10 GbE NICs so the network wasn't saturated, but RF=3 creates a real per-partition I/O ceiling on the Kafka leader — and it sits right around where we were measuring. +Our initial sweeps ran with replication factor 3 — the standard production default, and for good reason. But RF=3 means every message the Kafka leader receives gets replicated to 2 followers. At 37k msg/s with 1 KB messages, that's ~111 MB/s of replication traffic outbound from the leader alone. The Fyre nodes have 10 GbE NICs so the network wasn't saturated, but RF=3 creates a real per-partition I/O ceiling on the Kafka leader — and it sits right around where we were measuring. The ceiling on our hardware wasn't the proxy. It was Kafka. -The fix: RF=1, 10-topic workload. Drop replication overhead; spread load across 10 partitions so no single partition hits its ceiling. 
We validated it with the passthrough proxy: at 160k msg/sec total the proxy matched baseline, and the sweep scaled past 640k before hitting some uninvestigated ceiling far above where encryption constrains anything. +The fix: RF=1, 10-topic workload. Drop replication overhead; spread load across 10 partitions so no single partition hits its ceiling. We validated it with the passthrough proxy: at 160k msg/s total the proxy matched baseline, and the sweep scaled past 640k before hitting some uninvestigated ceiling far above where encryption constrains anything. ### We maxed out the proxy, right? @@ -186,15 +186,15 @@ We swept the CPU limit: 1000m, 2000m, 4000m. The throughput ceiling scaled linea | CPU limit | Encryption ceiling | |-----------|-------------------| -| 1000m | ~40k msg/sec | -| 2000m | ~80k msg/sec | -| 4000m | ~160k msg/sec | +| 1000m | ~40k msg/s | +| 2000m | ~80k msg/s | +| 4000m | ~160k msg/s | -At 4000m: comfortable at 160k msg/sec (p99: 447 ms), catastrophic at 320k (p99: 537,000 ms). The proxy isn't hitting a fixed architectural wall — it's hitting a CPU budget wall, and that wall moves when you give it more CPU. +At 4000m: comfortable at 160k msg/s (p99: 447 ms), catastrophic at 320k (p99: 537,000 ms). The proxy isn't hitting a fixed architectural wall — it's hitting a CPU budget wall, and that wall moves when you give it more CPU. One thing we noticed along the way: the proxy ran 4 Netty event loop threads regardless of CPU limit. The throughput scaling isn't explained by thread count changing — it doesn't. What changes is the CPU time budget available to those threads. The relationship between CPU limit, thread scheduling, and throughput ceiling is more subtle than a simple thread-count model; what we can say empirically is that throughput scales linearly with the CPU limit. 
-Deriving the coefficient: at 4000m and 160k msg/sec with 1 KB messages — +Deriving the coefficient: at 4000m and 160k msg/s with 1 KB messages — ``` 160k msg/s × 1 KB = 160 MB/s produce throughput @@ -203,19 +203,19 @@ With matched consumer load: 160 MB/s encrypt + 160 MB/s decrypt → equivalently: 4000 mc / 160 MB/s produce ≈ 25 mc per MB/s produce ``` -We measured the coefficient at mid-utilisation (80k msg/sec, 2000m) at ~10 mc/MB/s bidirectional — lower, because fixed per-connection overhead gets amortised at higher load. The operator-facing formula uses 20 mc/MB/s of produce throughput, which sits between mid-utilisation and saturation and gives inherent conservatism. +We measured the coefficient at mid-utilisation (80k msg/s, 2000m) at ~10 mc/MB/s bidirectional — lower, because fixed per-connection overhead gets amortised at higher load. The operator-facing formula uses 20 mc/MB/s of produce throughput, which sits between mid-utilisation and saturation and gives inherent conservatism. ### The prediction -Rather than just report the results, we used the 1-core ceiling to make a falsifiable prediction: if the ceiling scales linearly with CPU budget, a 2-core pod should saturate at ~80k msg/sec. +Rather than just report the results, we used the 1-core ceiling to make a falsifiable prediction: if the ceiling scales linearly with CPU budget, a 2-core pod should saturate at ~80k msg/s. The 2-core sweep: | Rate | p99 | Verdict | |------|-----|---------| -| 40k msg/sec | 626 ms | Comfortable | -| 80k msg/sec | 1,660 ms | Elevated — right at predicted ceiling | -| 160k msg/sec | 175,277 ms | Catastrophic | +| 40k msg/s | 626 ms | Comfortable | +| 80k msg/s | 1,660 ms | Elevated — right at predicted ceiling | +| 160k msg/s | 175,277 ms | Catastrophic | The prediction held. The ceiling is real, linear, and predictable — which is exactly what you want from a sizing model. 
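The coefficient arithmetic in the hunk above can be sketched as a small calculation. This is a minimal illustration, not code from the benchmark harness: the function names are hypothetical, 1 KB is treated as 1,000 bytes to match the post's "160k msg/s × 1 KB = 160 MB/s" rounding, and the 20 mc per MB/s default is simply the operator-facing figure quoted in the post.

```python
# Hypothetical helpers mirroring the post's sizing arithmetic.
# Assumption: 1 KB = 1,000 bytes, so 1,000 msg/s of 1 KB messages = 1 MB/s.

def produce_mb_per_s(msgs_per_s, msg_kb=1.0):
    """Produce throughput in MB/s for a given message rate and size."""
    return msgs_per_s * msg_kb / 1000.0


def millicores_per_mb(cpu_millicores, msgs_per_s, msg_kb=1.0, bidirectional=False):
    """CPU cost coefficient at a measured ceiling.

    With bidirectional=True the matched consumer load is counted too
    (encrypt on produce plus decrypt on consume), doubling the MB/s.
    """
    mb = produce_mb_per_s(msgs_per_s, msg_kb)
    if bidirectional:
        mb *= 2
    return cpu_millicores / mb


def estimate_millicores(msgs_per_s, msg_kb=1.0, coefficient=20.0):
    """CPU budget estimate from a coefficient in mc per MB/s of produce."""
    return produce_mb_per_s(msgs_per_s, msg_kb) * coefficient


if __name__ == "__main__":
    # The 4000m / 160k msg/s ceiling reported in the post.
    print(millicores_per_mb(4000, 160_000))                      # mc per MB/s produce
    print(millicores_per_mb(4000, 160_000, bidirectional=True))  # mc per MB/s bidirectional
    print(estimate_millicores(40_000))                           # estimate at the default coefficient
```

Keeping the coefficient as a parameter rather than a constant reflects the post's own caveat that these numbers come from one cluster; substitute a value measured on your own hardware.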
diff --git a/performance.markdown b/performance.markdown index 058f650c..f500ed37 100644 --- a/performance.markdown +++ b/performance.markdown @@ -25,7 +25,7 @@ All primary results used 1 KB messages on a single partition. Multi-topic worklo The proxy layer itself adds negligible overhead. At sub-saturation rates the additional latency is sub-millisecond on average, with no measurable throughput impact. -**10 topics, 1 KB messages (5,000 msg/sec per topic):** +**10 topics, 1 KB messages (5,000 msg/s per topic):** | Metric | Baseline | Proxy | Delta | |--------|----------|-------|-------| @@ -34,7 +34,7 @@ The proxy layer itself adds negligible overhead. At sub-saturation rates the add | E2E latency avg | 94.87 ms | 95.34 ms | +0.47 ms (+0.5%) | | Publish rate | 5,002 msg/s | 5,002 msg/s | no change | -**100 topics, 1 KB messages (500 msg/sec per topic):** +**100 topics, 1 KB messages (500 msg/s per topic):** | Metric | Baseline | Proxy | Delta | |--------|----------|-------|-------| @@ -65,8 +65,8 @@ Encryption adds measurable but predictable overhead. 
The cost scales with produc | Scenario | Throughput ceiling (1 topic, 1 KB, 1 partition) | |----------|------------------------------------------------| -| Baseline (direct Kafka) | ~50,000–52,000 msg/sec | -| Encryption (proxy + AES-256-GCM) | ~37,200 msg/sec | +| Baseline (direct Kafka) | ~50,000–52,000 msg/s | +| Encryption (proxy + AES-256-GCM) | ~37,200 msg/s | | **Cost** | **~26% fewer messages per second per partition** | --- From cb8dfcbebfafa570826d3fbdc53d9d34d880c492 Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Mon, 18 May 2026 13:31:04 +1200 Subject: [PATCH 10/25] Rewrite workload design section with context and reasoning MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replaces dry methodology notes with a fuller narrative arc: - Opens with the representative vs repeatable tension in benchmarking - Explains the single-partition choice and why it makes the author wince - Justifies RF=3: proxy adds one real hop, but RF=1 would double the hop count — not a fair production comparison - Multi-topic runs reconnect to representative: baseline tax at normal load - Rate sweep methodology explained as technique, not run-specific numbers Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- ...026-05-28-benchmarking-the-proxy-under-the-hood.md | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md index 055856d9..572d842f 100644 --- a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md +++ b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md @@ -83,11 +83,16 @@ Getting NIC speed from a Kubernetes node turned out to be non-trivial — you ne ## Workload design -The primary workload used **1 topic, 1 partition, 1 KB messages**. This is deliberate. 
Concentrating all traffic on a single partition pushes things to their limits at lower absolute rates, which makes the proxy overhead easier to isolate: when the system saturates, it's the proxy, not a spread-out broker fleet. +Benchmarks are artificial constructs. Your traffic patterns are never stable — message sizes vary, topic counts grow, producers burst — so there's always a tension between numbers that are *representative* and numbers that are actually *repeatable*. We leaned towards repeatable. -Multi-topic workloads (10 topics, 100 topics) were used to verify that the overhead characteristics hold when load is distributed. At 5,000 msg/s per topic across 10 topics, every topic-partition pair is well below any saturation point — so what you're measuring is steady-state overhead, not ceiling behaviour. +The primary workload will make Kafka experts wince *(I had to squirm to type it)* — **1 topic, 1 partition, 1 KB messages**. Concentrating everything onto a single TopicPartition means we hit the limits earlier, at lower absolute volumes, which makes the proxy's contribution easier to isolate. Isolating the proxy is, after all, the goal. -For throughput ceiling testing we used rate sweeps: start at 34,000 msg/s, step up by 5% until achieved rate drops below 95% of target. The knee of that curve is the saturation point. +But Kafka is often described as a distributed append-only log, and we can't ignore the word "distributed" when it comes to latency. With RF=1, the proxy doubles the sequential hops in the critical path: one becomes two. That's not wrong, but it's not a fair picture either — nobody runs RF=1 in production. With RF=3, the leader waits for ISR acknowledgements before confirming the produce, so there's already replication latency in the critical path. The proxy adds a real, sequential hop — we're not trying to bury that — but it lands alongside a cost that's already there. 
One extra hop on top of a multi-hop round trip is a different picture from doubling a single-hop one. Three brokers, hot partition replicated across all of them. + +We leaned towards repeatable — but we didn't abandon representative entirely. The multi-topic runs (10 and 100 topics) are the reconnection point: load spread across more topics, closer to what production actually looks like, at rates well below any saturation point. You're measuring the proxy's baseline tax — the cost you always pay, not just the cost when you're pushing hard. It holds. + + +That covers the first dimension — the proxy's latency tax at normal load. For the second, throughput, the question is: how much does routing through the proxy reduce your maximum sustainable rate? That needs a different approach. We used rate sweeps: hold the connection count fixed, step the rate up incrementally, and watch what happens. Below the ceiling, achieved throughput tracks the target — the system keeps up. Above it, it can't, and falls behind. The point where achieved throughput diverges from the target rate — where we defined that as dropping below 95% — is the saturation point. That's the knee of the curve, and that's what we were hunting. 
## The flamegraph: where the CPU actually goes From 3067e05b091006173c85a3e246147d157cc194b8 Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Mon, 18 May 2026 13:34:40 +1200 Subject: [PATCH 11/25] Voice consistency pass on engineering post MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Format all narrator asides as *(italic brackets)* to distinguish narrator voice from main text - Fix coordinated omission bullet missing bold formatting - Fix "tracking...tracking" redundancy in OMB paragraph - "it made me wince" → "*(I had to squirm to type it)*" — more honest, author reached for single-partition deliberately Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- ...26-05-28-benchmarking-the-proxy-under-the-hood.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md index 572d842f..fee5be7a 100644 --- a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md +++ b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md @@ -20,19 +20,19 @@ Kafka ships with `kafka-producer-perf-test` and `kafka-consumer-perf-test`. We'd - **Too noisy**: individual runs produced widely varying results depending on JVM warm-up, scheduling jitter, and GC behaviour. Results were hard to trust and harder to compare across scenarios. - **Producer-only view**: `kafka-producer-perf-test` gives you publish latency, but nothing about the consumer side. You can't see end-to-end latency — which is something operators actually care about. - **Awkward to sweep**: running parametric rate sweeps requires scripting around these tools, and comparing results across scenarios requires manual work. -- Coordinated omission: under load, kafka-producer-perf-test only measures requests it actually sends! 
So when things start loading up and applying back pressure the send rate drops and the latency stays looking nice and healthy. Only it's not healthy in reality, things are queuing up in your producer. +- **Coordinated omission**: under load, kafka-producer-perf-test only measures requests it actually sends! So when things start loading up and applying back pressure the send rate drops and the latency stays looking nice and healthy. Only it's not healthy in reality, things are queuing up in your producer. And critically, it's never heard of Kroxylicious... You have though, you're here! -[OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) is a better fit. It's an industry-standard tool used by Confluent, the Pulsar team, and others for their published performance comparisons - so who am I to argue? OMB coordinates producers and consumers across separate worker pods, runs a configurable warmup phase before taking measurements, takes its latency tracking seriously by tracking coordinated omission, and outputs structured JSON that's straightforward to process programmatically. What's not to like? +[OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) is a better fit. It's an industry-standard tool used by Confluent, the Pulsar team, and others for their published performance comparisons - so who am I to argue? OMB coordinates producers and consumers across separate worker pods, runs a configurable warmup phase before taking measurements, takes latency tracking seriously — correcting for coordinated omission, and outputs structured JSON that's straightforward to process programmatically. What's not to like? Using OMB also means our methodology is directly comparable to other published Kafka benchmarks. The numbers aren't comparable of course it's not the same hardware, network conditions or phase of the moon. ## What we built on top of OMB -So we just fire up OMB and get some numbers, right? Errr no. 
OMB just does the measurement part. I work really hard at being lazy, I hate clicking things with a mouse and I knew these tests needed to be repeatable. So we scripted deployment (of all the things) teardown (for isolation), diagnostic collection (WHAT BROKE NOW??), and last but not least result processing (what does this wall of JSON mean?) +So we just fire up OMB and get some numbers, right? Errr no. OMB just does the measurement part. I work really hard at being lazy, I hate clicking things with a mouse and I knew these tests needed to be repeatable. So we scripted deployment (of all the things) teardown (for isolation), diagnostic collection *(WHAT BROKE NOW??)*, and last but not least result processing (what does this wall of JSON mean?) -So now all of that lives in [`kroxylicious-openmessaging-benchmarks`](https://github.com/kroxylicious/kroxylicious/tree/main/kroxylicious-openmessaging-benchmarks) in the main tree (mono repo FTW). +So now all of that lives in [`kroxylicious-openmessaging-benchmarks`](https://github.com/kroxylicious/kroxylicious/tree/main/kroxylicious-openmessaging-benchmarks) in the main tree *(mono repo FTW)*. So we have a tool and we think Kroxylicious is fast — but how do we turn that into something we can actually show management? "Fast" is shorthand for "low impact", and the impact of a proxy shows up along two dimensions: @@ -45,7 +45,7 @@ Two dimensions, two questions — and it turns out they need quite different exp `scripts/rate-sweep.sh` holds the connection count fixed and steps the producer rate up in fixed increments, letting the cluster stabilise at each step. We defined saturation as the sustained throughput dropping more than 5% below the target rate. The rate sweep tells you where the cliff edge is and what latency looks like as you approach it. 
**Connection sweep — is the ceiling per-connection or per-pod?** -`scripts/connection-sweep.sh` holds the per-producer rate fixed and steps up the number of producers (1, 2, 4, 8, 16 by default) — consumers scale to match. This tells you the aggregate throughput ceiling of a single proxy pod (need more? help out!): the point where adding more connections stops increasing total throughput. +`scripts/connection-sweep.sh` holds the per-producer rate fixed and steps up the number of producers (1, 2, 4, 8, 16 by default) — consumers scale to match. This tells you the aggregate throughput ceiling of a single proxy pod *(need more? help out!)*: the point where adding more connections stops increasing total throughput. Both sweeps use `scripts/run-benchmark.sh` under the hood, which: @@ -73,7 +73,7 @@ If you have your own KMS — and you will run this on your own infrastructure, r ### JSON always comes in megabytes -Each benchmark run produces a blob of structured JSON. Useful in principle; a wall of noise in practice. Three [JBang](https://www.jbang.dev/)-runnable Java programs (I'm a died in the wool java dev, sue me) pull out the signal: +Each benchmark run produces a blob of structured JSON. Useful in principle; a wall of noise in practice. Three [JBang](https://www.jbang.dev/)-runnable Java programs *(I'm a died in the wool java dev, sue me)* pull out the signal: - **`RunMetadata`**: captures the run context — git commit, timestamp, cluster node specs (architecture, CPU, RAM), and on OpenShift, NIC speed read from the host via the MachineConfigDaemon pod. Generates `run-metadata.json` alongside each result so you can always tell what conditions produced a number. This is what makes run-to-run comparisons meaningful — and when a run takes 12 hours, trust me, you don't want to re-run it without good reason. - **`ResultComparator`**: answers "did this change hurt?" — reads two scenario result directories and produces a markdown comparison table. 
Baseline vs encryption is the obvious use, but the tool is generic. Already running a proxy? proxy-no-filters vs encryption tells you the cost of the filter itself, not the proxy hop. Building your own filter? That's your comparison — measure the chain with and without it. From 211a134e7db2f11f00af928f00da6f16bf6dc74c Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Tue, 19 May 2026 12:40:26 +1200 Subject: [PATCH 12/25] Strengthen passthrough proxy section with L7 and contention insights MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Reframe the takeaway: the proxy boils the latency-sensitive path to near-TCP-stack overhead while operating at Layer 7 — that's the win - Add paragraph explaining why overhead holds across 10/100 topics: the proxy doesn't contend between topics (unlike a broker which juggles disk I/O, partition leaders, and replication); the connection sweep validates linear throughput scaling Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- _posts/2026-05-21-benchmarking-the-proxy.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/_posts/2026-05-21-benchmarking-the-proxy.md b/_posts/2026-05-21-benchmarking-the-proxy.md index fc9dba25..1352af81 100644 --- a/_posts/2026-05-21-benchmarking-the-proxy.md +++ b/_posts/2026-05-21-benchmarking-the-proxy.md @@ -67,8 +67,11 @@ Good news first. The proxy itself — with no filter chain, just routing traffic **The headline: ~0.2 ms additional average publish latency. Throughput is unaffected.** -What did I take away from this entirely unsurprising result? Not much, honestly — without filters the proxy is little more than a couple of hops through the TCP stack, but we now have data rather than a hunch. -The end-to-end (E2E) p99 figure is dominated by the Kafka consumer fetch timeouts, as it should be. That said, it is reassuring to have a sub-ms impact on the p99. +What did I take away from this entirely unsurprising result? 
Not much, honestly — without filters the proxy boils the latency-sensitive path down to little more than a couple of hops through the TCP stack. We replaced a hunch with data. The remarkable part: the proxy is doing this at Layer 7. + +The overhead holding across 10 and 100 topics makes sense for the same reason: the proxy doesn't contend between topics. A Kafka broker juggles disk I/O, partition leaders, and replication across everything it manages; the proxy treats each connection independently. Topics don't contend for shared resources: throughput scales linearly across them, and the connection sweep validates it. + +The end-to-end p99 figure is dominated by Kafka consumer fetch timeouts, as it should be. That said, it is reassuring to have a sub-ms impact on the p99. --- From ae5afd87fe2f9b012f8523106e8d3ca9a7bb60c7 Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Wed, 20 May 2026 11:09:57 +1200 Subject: [PATCH 13/25] Rewrite False summit section as detective story MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Full investigation arc: spare CPU shock → NIC elimination → 4-producer test → anti-affinity attempt (3 nodes, 3 brokers, nowhere to go) → new cluster → baseline shock → RTT math reveals co-location → second penny drops on OMB scheduling → RF=1 unlocks proxy CPU ceiling → coefficient → prediction. Corrects several issues in the prior draft: Netty theory discarded (proxy metrics showed minimal back pressure); co-location framed at pod/node level not VM level; 37k flagged as the only figure from the original cluster; all coefficient and sweep numbers confirmed as coming from the new distributed cluster. 
Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- ...8-benchmarking-the-proxy-under-the-hood.md | 147 ++++++++++-------- 1 file changed, 84 insertions(+), 63 deletions(-) diff --git a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md index fee5be7a..4b892de3 100644 --- a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md +++ b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md @@ -94,6 +94,90 @@ We leaned towards repeatable — but we didn't abandon representative entirely. That covers the first dimension — the proxy's latency tax at normal load. For the second, throughput, the question is: how much does routing through the proxy reduce your maximum sustainable rate? That needs a different approach. We used rate sweeps: hold the connection count fixed, step the rate up incrementally, and watch what happens. Below the ceiling, achieved throughput tracks the target — the system keeps up. Above it, it can't, and falls behind. The point where achieved throughput diverges from the target rate — where we defined that as dropping below 95% — is the saturation point. That's the knee of the curve, and that's what we were hunting. +## False summit + +The rate-sweep result was in: the encryption scenario hit a ceiling on our original cluster at around 37k msg/s. Summit reached. + +Except — the proxy had spare CPU cycles. Not a little: meaningful headroom. If the proxy isn't CPU-saturated, whatever we hit isn't the proxy's ceiling. + +**Was it the NIC?** At 37k msg/s and 1 KB messages, produce traffic alone is 37 MB/s. Add RF=3 replication: the leader ships two copies outbound, ~74 MB/s more. 111 MB/s total — fine for 10 GbE, obviously broken for 1 GbE. If the NICs had been gigabit, replication traffic would have saturated them long before we got to 37k. Network eliminated. + +**Was it the proxy pod, or just one connection?** The rate sweep runs with a single producer. 
We ran four at the same per-producer rate. Aggregate throughput climbed higher than one producer alone could push — the pod had headroom the single connection wasn't using. We checked proxy metrics: back pressure was minimal. The proxy wasn't the constraint. Whatever was limiting one connection, it wasn't us. + +### We tried anti-affinity + +Then a curveball: could it be node saturation? The original cluster had three worker nodes — and three Kafka brokers. Strimzi, being sensible, spreads brokers evenly: one per node. If the proxy had landed on the same node as a busy broker, that node could be the bottleneck rather than the proxy pod itself. + +We added a hard anti-affinity rule to keep the proxy off broker nodes. It wouldn't schedule. + +The penny drops: three worker nodes, three brokers, one per node — there is nowhere for the proxy to go that isn't already co-located with a broker. Obvious in hindsight. We needed a bigger cluster. + +We provisioned one: five workers, three masters, 16 vCPU per node. + +### The baseline shock + +Baseline first. Direct Kafka, no proxy. + +~17,000 msg/s. The original cluster had been sustaining ~50,000. + +The proxy wasn't in the picture. We checked the obvious suspects: disk I/O — fine, local and unsaturated. OMB worker scaling — correct. Broker CPU: ~1.2 vCPU. Nothing was at a limit. + +The answer was in the pipeline arithmetic. A Kafka producer has a maximum number of in-flight requests — batches sent but not yet acknowledged. With real round-trip times between nodes, that in-flight window bounds throughput. We measured: 0.87 ms between worker nodes, with three replication hops before the leader can confirm a produce at RF=3 — roughly 3–4 ms total. Five in-flight requests across that round trip gives a ceiling that matched ~17k msg/s almost exactly. + +On the original cluster, those nodes were almost certainly co-located on the same physical host. Inter-node RTTs at that scale are sub-millisecond — effectively free. 
The original cluster's 50k baseline wasn't what a 3-broker Kafka cluster does. It was what a 3-broker Kafka cluster does when the network is a memcpy. + +The new cluster was genuinely distributed. Real latency, real pipeline limits, real Kafka — and the cluster we used for everything from here. + +*(The ~37k ceiling is the only figure in this post from the original cluster. Everything that follows — the coefficient, the CPU sweep, the prediction — was measured on the new cluster. The physics are part of what makes those numbers honest.)* + +Another penny dropped. We'd had the same scheduling problem with OMB all along. The producer and consumer worker pods were landing on broker nodes — and when pods share a node, the SDN detects that traffic doesn't need to leave the node and bypasses the NIC entirely. The producers and consumers weren't paying for network transit at all. + +The proxy pod was on a different node, but on a 3-node cluster where every node already had a broker, the odds of those nodes sharing a physical host on Fyre were high. Almost certainly getting the same benefit, just one layer down. + +### Now push harder + +The new cluster had an honest baseline — but RF=3 pipeline limits meant we couldn't push a single topic past ~17k msg/s. There was no room to find the proxy's CPU ceiling when Kafka's pipeline hits the wall first. + +RF=1, 10 topics. With no replication hops, the round-trip drops to producer→leader only: 0.87 ms. Spread across 10 partitions, no single one becomes the bottleneck before the proxy does. We validated the workload with the passthrough proxy: throughput scaled well past anything encryption constrains. The ceiling we were now measuring was proxy CPU. + +### How much more? + +The initial RF=1 run at 1000m CPU gave us a ceiling: ~40k msg/s. 
From that one measurement we could derive the coefficient: + +``` +40k msg/s × 1 KB = 40 MB/s produce +Matched consumer load: 40 MB/s encrypt + 40 MB/s decrypt = 80 MB/s bidirectional +1000m / 80 MB/s ≈ 12.5 mc per MB/s bidirectional +→ operator formula: ~20 mc per MB/s of produce throughput (conservative margin between mid-load and saturation) +``` + +If the ceiling scales linearly with CPU, a 4-core pod should give ~160k msg/s. We ran it. + +| CPU limit | Encryption ceiling | +|-----------|-------------------| +| 1000m | ~40k msg/s | +| 4000m | ~160k msg/s | + +Linear. At 4000m: comfortable at 160k (p99: 447 ms), catastrophic at 320k (p99: 537,000 ms). The proxy isn't hitting a fixed architectural wall — it's hitting a CPU budget wall, and that wall moves when you give it more CPU. + +*(The proxy ran 4 Netty event loop threads regardless of CPU limit. Thread count doesn't change — what changes is the CPU time budget available to those threads. Empirically linear, even if the thread-scheduling mechanics are more subtle.)* + +### The prediction + +One validated data point isn't a sizing model. We used the coefficient to make a falsifiable prediction: a 2-core pod should saturate at ~80k msg/s. + +The 2-core sweep: + +| Rate | p99 | Verdict | +|------------|------------|---------------------------------------| +| 40k msg/s | 626 ms | Comfortable | +| 80k msg/s | 1,660 ms | Elevated — right at predicted ceiling | +| 160k msg/s | 175,277 ms | Catastrophic | + +Held. The ceiling is real, linear, and predictable — which is exactly what you want from a sizing model. + +Setting `requests` equal to `limits` makes this practical: a pod that can burst above its CPU limit introduces headroom uncertainty that breaks the model. Fix the CPU budget; fix the ceiling. + ## The flamegraph: where the CPU actually goes We captured CPU profiles using async-profiler attached to the proxy JVM via `jcmd JVMTI.agent_load`, during the steady-state measurement phase at 36,000 msg/s. 
These are self-time percentages — where the CPU is actually spending cycles, not inclusive call-tree time. @@ -163,69 +247,6 @@ Total additional CPU: ~33%. This aligns closely with the ~26% throughput reducti If you wanted to optimise this, the highest-impact areas would be: reducing buffer copies (encrypt in-place or use composite buffers), pooling encryption buffers to reduce GC pressure, and caching `Cipher` instances to reduce per-record JDK security overhead. -## Following the ceiling - -We had a rate-sweep result. On our test cluster, the encryption scenario hit a ceiling — the proxy was saturating around 37k msg/s. We'd maxed out the proxy, right? - -Well. The proxy had spare CPU cycles. - -That's interesting. If the proxy isn't CPU-saturated, then whatever we hit isn't the proxy's ceiling — it's something else's. Time to work out what. - -### What were we actually hitting? - -Our initial sweeps ran with replication factor 3 — the standard production default, and for good reason. But RF=3 means every message the Kafka leader receives gets replicated to 2 followers. At 37k msg/s with 1 KB messages, that's ~111 MB/s of replication traffic outbound from the leader alone. The Fyre nodes have 10 GbE NICs so the network wasn't saturated, but RF=3 creates a real per-partition I/O ceiling on the Kafka leader — and it sits right around where we were measuring. - -The ceiling on our hardware wasn't the proxy. It was Kafka. - -The fix: RF=1, 10-topic workload. Drop replication overhead; spread load across 10 partitions so no single partition hits its ceiling. We validated it with the passthrough proxy: at 160k msg/s total the proxy matched baseline, and the sweep scaled past 640k before hitting some uninvestigated ceiling far above where encryption constrains anything. - -### We maxed out the proxy, right? - -With a clean workload that actually isolates proxy CPU, we looked again. 
The connection sweep answered the question: with 4 producers at a fixed per-producer rate, aggregate throughput climbed well past the single-producer ceiling — and proxy CPU still had headroom. Kafka's partition ran out first. - -So the single-producer ceiling on our cluster isn't the pod ceiling. It's what one connection could push on that hardware. The proxy had more to give. - -### How much more? - -We swept the CPU limit: 1000m, 2000m, 4000m. The throughput ceiling scaled linearly with the CPU budget: - -| CPU limit | Encryption ceiling | -|-----------|-------------------| -| 1000m | ~40k msg/s | -| 2000m | ~80k msg/s | -| 4000m | ~160k msg/s | - -At 4000m: comfortable at 160k msg/s (p99: 447 ms), catastrophic at 320k (p99: 537,000 ms). The proxy isn't hitting a fixed architectural wall — it's hitting a CPU budget wall, and that wall moves when you give it more CPU. - -One thing we noticed along the way: the proxy ran 4 Netty event loop threads regardless of CPU limit. The throughput scaling isn't explained by thread count changing — it doesn't. What changes is the CPU time budget available to those threads. The relationship between CPU limit, thread scheduling, and throughput ceiling is more subtle than a simple thread-count model; what we can say empirically is that throughput scales linearly with the CPU limit. - -Deriving the coefficient: at 4000m and 160k msg/s with 1 KB messages — - -``` -160k msg/s × 1 KB = 160 MB/s produce throughput -With matched consumer load: 160 MB/s encrypt + 160 MB/s decrypt -→ 4000 mc / 320 MB/s bidirectional ≈ 12–13 mc per MB/s bidirectional -→ equivalently: 4000 mc / 160 MB/s produce ≈ 25 mc per MB/s produce -``` - -We measured the coefficient at mid-utilisation (80k msg/s, 2000m) at ~10 mc/MB/s bidirectional — lower, because fixed per-connection overhead gets amortised at higher load. 
The operator-facing formula uses 20 mc/MB/s of produce throughput, which sits between mid-utilisation and saturation and gives inherent conservatism. - -### The prediction - -Rather than just report the results, we used the 1-core ceiling to make a falsifiable prediction: if the ceiling scales linearly with CPU budget, a 2-core pod should saturate at ~80k msg/s. - -The 2-core sweep: - -| Rate | p99 | Verdict | -|------|-----|---------| -| 40k msg/s | 626 ms | Comfortable | -| 80k msg/s | 1,660 ms | Elevated — right at predicted ceiling | -| 160k msg/s | 175,277 ms | Catastrophic | - -The prediction held. The ceiling is real, linear, and predictable — which is exactly what you want from a sizing model. - -Setting `requests` equal to `limits` is what makes this practical: a pod that can burst above its CPU limit introduces headroom uncertainty that breaks the model. With `requests == limits`, the CPU budget is fixed, the ceiling is fixed, and your capacity planning can rely on the coefficient. - ## Bugs we found in our own tooling During the 4-producer rate sweep, we noticed that JFR recordings and flamegraphs from probes 2 onwards all looked identical to probe 1. They were stale copies. Three bugs. 
From 40154dbe6e7070df4373f953c4c182be92f77eec Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Wed, 20 May 2026 12:56:57 +1200 Subject: [PATCH 14/25] Update benchmark numbers to 8-node reference cluster MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Applies accurate numbers from the distributed 8-node cluster (5 workers, 3 masters) across all three files, replacing figures from the original co-located cluster: - Cluster description: 6-node → 8-node (5 workers, 3 masters) - RF=3 throughput ceiling: 37.2k→14,600 msg/s (encryption), 50-52k→19,400 msg/s (baseline), 26%→25% reduction - Coefficient: 12.5 mc/MB/s → 9.7 measured / 10 mc/MB/s operator formula - Formula: expose general form (10 × total proxy MB/s) with fan-out explanation; 20 × produce MB/s remains the 1:1 shorthand - 1-core RF=1: ~40k ceiling replaced with safe at 80k (91ms p99), saturating at ~126k - 4-core validation: 447ms→247ms at 160k; catastrophic→elevated at 321k (1,706ms); saturation above 321k - 2-core: comfortable at 80k (850ms), sustaining at 160k (720ms) — saturation not yet measured, consistent with model - Netty aside corrected: thread count scales with availableProcessors() (CPU limit), not fixed at 4 Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- _posts/2026-05-21-benchmarking-the-proxy.md | 30 ++++++++----- ...8-benchmarking-the-proxy-under-the-hood.md | 43 +++++++++---------- performance.markdown | 12 +++--- 3 files changed, 45 insertions(+), 40 deletions(-) diff --git a/_posts/2026-05-21-benchmarking-the-proxy.md b/_posts/2026-05-21-benchmarking-the-proxy.md index 1352af81..6622205c 100644 --- a/_posts/2026-05-21-benchmarking-the-proxy.md +++ b/_posts/2026-05-21-benchmarking-the-proxy.md @@ -25,12 +25,12 @@ We used [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchma ## Test environment -No, we didn't run this on a laptop — it's a realistic deployment: a 6-node OpenShift cluster on Fyre, IBM's internal cloud 
platform — a controlled environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit. +No, we didn't run this on a laptop — it's a realistic deployment: an 8-node OpenShift cluster on Fyre (5 workers, 3 masters), IBM's internal cloud platform — a controlled environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit. | Component | Details | |-----------|---------| | CPU | AMD EPYC-Rome, 2 GHz | -| Cluster | 6-node OpenShift, RHCOS 9.6 | +| Cluster | 8-node OpenShift (5 workers, 3 masters), RHCOS 9.6 | | Kafka | 3-broker Strimzi cluster, replication factor 3 | | Kroxylicious | 0.20.0, single proxy pod, 1000m CPU limit | | KMS | HashiCorp Vault (in-cluster) | @@ -104,19 +104,25 @@ A rate-sweep is exactly what it sounds like: pick a starting rate, let OMB run l We started at 34k (right where the latency table started getting interesting) and stepped up in 5% increments. The results: -- **Baseline**: sustained up to ~50,000–52,000 msg/s (the ceiling we observed on our test cluster) -- **Encryption**: sustained up to **~37,200 msg/s**, then started intermittently saturating -- **Cost: approximately 26% fewer messages per second per partition** +- **Baseline**: sustained up to ~19,400 msg/s (the ceiling at RF=3 on our test cluster) +- **Encryption**: sustained up to **~14,600 msg/s**, then started intermittently saturating +- **Cost: approximately 25% fewer messages per second per partition** -The transition wasn't a clean cliff edge — between 37,600 and 42,000 msg/s the proxy alternated between sustaining and saturating. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Above ~39,000 msg/s, p99 latency regularly spiked above 1,700 ms. Stay below 37k and you're fine. Creep above it and you'll notice. The numbers are not absolute — they are just what we measured on our cluster; your mileage **will vary**. 
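As a quick sanity check on the cost figure quoted above, the following sketch recomputes the relative reduction from the two measured ceilings. The helper name `throughput_cost` is hypothetical and not part of the benchmark harness; the inputs are the RF=3 single-partition ceilings reported in this post.

```python
def throughput_cost(baseline_msg_s, with_filter_msg_s):
    """Fractional throughput lost relative to the baseline ceiling."""
    if baseline_msg_s <= 0:
        raise ValueError("baseline must be positive")
    return (baseline_msg_s - with_filter_msg_s) / baseline_msg_s


if __name__ == "__main__":
    # RF=3, 1 topic, 1 partition, 1 KB messages on the reference cluster.
    print(f"{throughput_cost(19_400, 14_600):.1%}")
```

The result lands a little under a quarter, which is consistent with the "approximately 25%" stated in the bullet list.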
+The transition wasn't a clean cliff edge — the proxy alternated between sustaining and saturating in a narrow band just above the ceiling. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Stay below 14k and you're fine. Creep above it and you'll notice. The numbers are not absolute — they are just what we measured on our cluster; your mileage **will vary**. ### The ceiling scales with CPU budget The fact the proxy is low latency didn't surprise me, but this did — and it matters when we think about scaling. We maxed out a single connection, but that didn't mean we'd maxed out the proxy. -Once we had the single-producer encryption ceiling at ~37k msg/s, the obvious question was: is that the limit for the whole proxy pod, or just for one connection? We ran the same test with 4 producers. With 4 connections the proxy sustained well past the single-producer ceiling — proxy CPU had headroom to spare, and Kafka's partition became the bottleneck first. +The single-producer ceiling at RF=3 is Kafka-limited, not proxy-limited — the ISR replication round-trip caps single-partition throughput regardless of how much CPU the proxy has. The proxy still had meaningful headroom: we ran four producers and aggregate throughput climbed higher, while proxy CPU sat at 570m/1000m. The proxy wasn't the constraint. -Going further: we swept the same workload at 1000m, 2000m, and 4000m CPU. The throughput ceiling scaled linearly with the CPU budget — 1000m at ~40k msg/s, 2000m at ~80k, 4000m at ~160k. The proxy isn't hitting a fixed architectural wall; it's hitting a CPU budget wall, and that wall moves when you give it more CPU. +To find the proxy's real ceiling, you need a workload that doesn't hit the Kafka partition limit first: RF=1, spread across multiple topics. 
With that workload, the ceiling is squarely in the proxy — and it scales linearly with CPU. The mechanism: CPU limit controls `availableProcessors()`, which controls how many Netty event loop threads the proxy creates. More threads, more concurrent connections handled in parallel, higher aggregate ceiling. + +| CPU limit | Comfortable ceiling | Saturation point | +|-----------|--------------------|--------------------| +| 1000m | ~80k msg/s | ~126k msg/s | +| 2000m | ~80k msg/s | above 160k msg/s | +| 4000m | ~160k msg/s | above 321k msg/s | **The practical implication**: the throughput ceiling is not a fixed number — it's a function of the CPU you allocate. Set `requests` equal to `limits` in your pod spec; this makes the CPU budget deterministic and the ceiling predictable. The companion engineering post has the full story of how we found this, including the workload design choices needed to isolate proxy CPU from Kafka's own limits. @@ -132,11 +138,13 @@ Numbers without guidance aren't very useful, so here's how to translate these re 1. **Throughput budget**: encryption imposes a CPU-driven throughput ceiling. As a planning formula: - > **`proxy CPU (millicores) = 20 × produce throughput (MB/s)`** + > **`proxy CPU (millicores) = 10 × total proxy throughput (MB/s)`** + > + > where *total* = produce MB/s + (each consumer group's consume MB/s independently) - Add ×1.3 headroom for GC pauses and burst. This assumes matched consumer load (1:1 produce:consume) and was measured on AMD EPYC-Rome 2 GHz with AES-NI — calibrate on your own hardware using the rate sweep. + For a single produce:consume pair this simplifies to `20 × produce MB/s`. Fan-out multiplies: 100 MB/s produce to 3 consumer groups = 100 + 300 = 400 MB/s total → 4,000m. Add ×1.3 headroom for GC pauses and burst. Measured on AMD EPYC-Rome 2 GHz with AES-NI — calibrate on your hardware using the rate sweep. 
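As a rough illustration of the planning formula above, here is a minimal Python sketch. The function name `proxy_cpu_millicores` and its parameters are hypothetical and not part of Kroxylicious or OMB. The 10 mc per MB/s coefficient and the ×1.3 headroom are the figures quoted in this post, measured on the authors' hardware, so treat the output as a starting estimate to calibrate against your own rate sweep.

```python
def proxy_cpu_millicores(produce_mb_s, consumer_groups=1,
                         mc_per_mb_s=10.0, headroom=1.3):
    """Estimate proxy CPU (millicores) for record encryption.

    Hypothetical helper. Assumes every consumer group reads the full
    produce stream, so total proxy traffic is the produce rate plus
    one copy of it per consumer group.
    """
    if produce_mb_s < 0 or consumer_groups < 0:
        raise ValueError("throughput and consumer group count must be non-negative")
    total_mb_s = produce_mb_s * (1 + consumer_groups)
    return total_mb_s * mc_per_mb_s * headroom


if __name__ == "__main__":
    # 100k msg/s at 1 KB is roughly 100 MB/s of produce traffic.
    print(round(proxy_cpu_millicores(100)))                     # one consumer group
    print(round(proxy_cpu_millicores(100, consumer_groups=3)))  # fan-out to three groups
```

With the defaults, 100 MB/s of produce traffic and one consumer group comes out at about 2,600m. The three-group fan-out case (400 MB/s total, 4,000m before headroom) comes out at about 5,200m once the ×1.3 margin is applied.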
- Worked example: 100k msg/s at 1 KB = 100 MB/s produce → 100 × 20 = 2000m, plus headroom → ~2600m (~2.6 cores). + Worked example: 100k msg/s at 1 KB, 1 consumer group = 100 MB/s produce + 100 MB/s consume = 200 MB/s × 10 = 2,000m, plus headroom → ~2,600m (~2.6 cores). 2. **Latency budget**: well below saturation, expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99. The overhead scales with how hard you're pushing — give yourself headroom and you'll barely notice it. diff --git a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md index 4b892de3..ed880a5d 100644 --- a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md +++ b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md @@ -7,7 +7,7 @@ author_url: "https://github.com/SamBarker" categories: benchmarking performance engineering --- -How hard can it be? We started with a laptop, a codebase, and a lot of confidence it was fast. We ended up with a benchmark harness, a six-node cluster, and a much more nuanced answer. +How hard can it be? We started with a laptop, a codebase, and a lot of confidence it was fast. We ended up with a benchmark harness, an eight-node cluster, and a much more nuanced answer. Harder than expected. More interesting too. @@ -142,39 +142,36 @@ RF=1, 10 topics. With no replication hops, the round-trip drops to producer→le ### How much more? -The initial RF=1 run at 1000m CPU gave us a ceiling: ~40k msg/s. From that one measurement we could derive the coefficient: +The RF=1 10-topic workload spread load across partitions. At 1000m, the run tells us: safe at 80k msg/s (91 ms p99), saturating at around 126k. 
The coefficient comes from JFR CPU data across the non-saturated probes: ``` -40k msg/s × 1 KB = 40 MB/s produce -Matched consumer load: 40 MB/s encrypt + 40 MB/s decrypt = 80 MB/s bidirectional -1000m / 80 MB/s ≈ 12.5 mc per MB/s bidirectional -→ operator formula: ~20 mc per MB/s of produce throughput (conservative margin between mid-load and saturation) +Measured: 9.7 mc per MB/s of total proxy traffic (±6.6 stdev, n=4 non-saturated probes) +→ operator formula: 10 mc per MB/s of total proxy traffic +→ for 1:1 produce:consume at 1 KB: 20 mc per MB/s of produce throughput ``` -If the ceiling scales linearly with CPU, a 4-core pod should give ~160k msg/s. We ran it. +The mechanism: `cpu: 1000m → availableProcessors()=1 → one Netty event loop thread`. At 4000m that's four threads, each handling its share of connections in parallel. If the ceiling scales linearly with thread count, a 4-core pod should handle roughly four times as much. We ran it. -| CPU limit | Encryption ceiling | -|-----------|-------------------| -| 1000m | ~40k msg/s | -| 4000m | ~160k msg/s | +| CPU limit | Rate | p99 | Verdict | +|-----------|------|-----|---------| +| 1000m | 80k msg/s | 91 ms | Comfortable | +| 1000m | ~126k msg/s | — | Saturating | +| 4000m | 160k msg/s | 247 ms | Comfortable | +| 4000m | 321k msg/s | 1,706 ms | Elevated | +| 4000m | above 321k | — | Saturated | -Linear. At 4000m: comfortable at 160k (p99: 447 ms), catastrophic at 320k (p99: 537,000 ms). The proxy isn't hitting a fixed architectural wall — it's hitting a CPU budget wall, and that wall moves when you give it more CPU. - -*(The proxy ran 4 Netty event loop threads regardless of CPU limit. Thread count doesn't change — what changes is the CPU time budget available to those threads. Empirically linear, even if the thread-scheduling mechanics are more subtle.)* +At 4000m: comfortable at 160k (p99: 247 ms), elevated at 321k (p99: 1,706 ms). Above that — 64 producers matched 32-producer throughput: ceiling reached. 
The proxy isn't hitting a fixed architectural wall — it's hitting a CPU budget wall, and that wall moves when you give it more CPU. ### The prediction -One validated data point isn't a sizing model. We used the coefficient to make a falsifiable prediction: a 2-core pod should saturate at ~80k msg/s. - -The 2-core sweep: +One validated scaling point isn't a sizing model. The coefficient predicts that 2-core should sustain well past 80k msg/s and not saturate until well above 160k. We ran 2-core next. -| Rate | p99 | Verdict | -|------------|------------|---------------------------------------| -| 40k msg/s | 626 ms | Comfortable | -| 80k msg/s | 1,660 ms | Elevated — right at predicted ceiling | -| 160k msg/s | 175,277 ms | Catastrophic | +| Rate | p99 | Verdict | +|------------|---------|--------------------------------------------------| +| 80k msg/s | 850 ms | Comfortable | +| 160k msg/s | 720 ms | Sustaining — not yet saturated | -Held. The ceiling is real, linear, and predictable — which is exactly what you want from a sizing model. +At 160k across 10 partitions, each partition carries 16k msg/s — well within the budget of a single Netty thread. The 2-core saturation point sits above 160k; the model is consistent. Setting `requests` equal to `limits` makes this practical: a pod that can burst above its CPU limit introduces headroom uncertainty that breaks the model. Fix the CPU budget; fix the ceiling. diff --git a/performance.markdown b/performance.markdown index f500ed37..b6f24869 100644 --- a/performance.markdown +++ b/performance.markdown @@ -5,14 +5,14 @@ permalink: /performance/ toc: true --- -This page summarises the measured performance overhead of Kroxylicious. Numbers come from [benchmarks run on real hardware](/blog/2026/05/21/benchmarking-the-proxy/) using [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark), an industry-standard Kafka performance tool. 
No, we didn't run this on a laptop — it's a realistic deployment: a 6-node OpenShift cluster on Fyre, IBM's internal cloud platform — a controlled environment. +This page summarises the measured performance overhead of Kroxylicious. Numbers come from [benchmarks run on real hardware](/blog/2026/05/21/benchmarking-the-proxy/) using [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark), an industry-standard Kafka performance tool. No, we didn't run this on a laptop — it's a realistic deployment: an 8-node OpenShift cluster on Fyre (5 workers, 3 masters), IBM's internal cloud platform — a controlled environment. ## Test environment | Component | Details | |-----------|---------| | CPU | AMD EPYC-Rome, 2 GHz | -| Cluster | 6-node OpenShift, RHCOS 9.6 | +| Cluster | 8-node OpenShift (5 workers, 3 masters), RHCOS 9.6 | | Kafka | 3-broker Strimzi cluster, replication factor 3 | | Kroxylicious | 0.20.0, single proxy pod, 1000m CPU limit | | KMS | HashiCorp Vault (in-cluster) | @@ -65,9 +65,9 @@ Encryption adds measurable but predictable overhead. The cost scales with produc | Scenario | Throughput ceiling (1 topic, 1 KB, 1 partition) | |----------|------------------------------------------------| -| Baseline (direct Kafka) | ~50,000–52,000 msg/s | -| Encryption (proxy + AES-256-GCM) | ~37,200 msg/s | -| **Cost** | **~26% fewer messages per second per partition** | +| Baseline (direct Kafka) | ~19,400 msg/s | +| Encryption (proxy + AES-256-GCM) | ~14,600 msg/s | +| **Cost** | **~25% fewer messages per second per partition** | --- @@ -79,7 +79,7 @@ Numbers without guidance aren't very useful, so here's how to translate these re **With record encryption:** -- **Throughput**: use `proxy CPU (millicores) = 20 × produce throughput (MB/s)` as a planning formula, then add ×1.3 headroom. Assumes matched consumer load and AMD EPYC-Rome 2 GHz with AES-NI — calibrate on your hardware. Validated at 1000m, 2000m, and 4000m. 
Example: 100k msg/s at 1 KB = 100 MB/s produce → 2000m + headroom → ~2600m. +- **Throughput**: use `CPU (mc) = 10 × total proxy throughput (MB/s)` where total = produce MB/s + each consumer group's consume MB/s. For 1:1 produce:consume this simplifies to `20 × produce MB/s`. Add ×1.3 headroom. Measured on AMD EPYC-Rome 2 GHz with AES-NI — calibrate on your hardware. Validated at 1000m, 2000m, and 4000m. Example: 100k msg/s at 1 KB, 1 consumer group = 200 MB/s total → 2000m + headroom → ~2600m. - **Latency**: expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99, scaling with how close to saturation you operate - **Scaling**: set `requests` equal to `limits` in your pod spec to make the CPU budget — and therefore the throughput ceiling — deterministic. Increase the CPU limit to raise throughput; add proxy pods for redundancy. - **KMS**: DEK caching means the KMS is not on the hot path. In testing, each benchmark run triggered only 5–19 DEK generation calls — the KMS is not a bottleneck From a0be0699e7deeaf29edcc8e0380d7f6994a873b3 Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Wed, 20 May 2026 16:51:17 +1200 Subject: [PATCH 15/25] Humanise flamegraph section and add OSS data sharing to run it yourself - Rewrites flamegraph intro with personal motivation: hot path minimalism, Amdahl's law framing, and honest admission that the full sweep story didn't come together - Adds forward reference to bugs section to stitch the structure together - Moves OSS transparency point into "Run it yourself" where it naturally belongs, with a TODO placeholder for the raw data link - Drops duplicate "we share our workings" phrase from flamegraph prose Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- ...2026-05-28-benchmarking-the-proxy-under-the-hood.md | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md 
index ed880a5d..35bcfab2 100644 --- a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md +++ b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md @@ -177,7 +177,11 @@ Setting `requests` equal to `limits` makes this practical: a pod that can burst ## The flamegraph: where the CPU actually goes -We captured CPU profiles using async-profiler attached to the proxy JVM via `jcmd JVMTI.agent_load`, during the steady-state measurement phase at 36,000 msg/s. These are self-time percentages — where the CPU is actually spending cycles, not inclusive call-tree time. +I care deeply that the proxy does as little work as possible on the hot path. Optimization is often less about swapping algorithms — if you only ever have five items, who cares how you sort them — and more about realising what work not to do, or finding a better time to do it. [Amdahl's law](https://en.wikipedia.org/wiki/Amdahl%27s_law) governs this: the maximum speedup you can get from optimizing a component is bounded by how much of total execution time that component actually owns. If the proxy accounts for 2% of CPU, you can't optimize your way to a 10% win — not there. + +That framing is exactly why flamegraphs matter to me. Not as a debugging tool, but as a way of seeing the shape of the work. I was also hoping to tell a fuller story here — profiles across the full rate sweep, watching the mix shift as the proxy approaches saturation. Getting stable, reproducible numbers turned out to be harder than expected, and the bugs described in the next section cost us more runs than I'd like. So these are two snapshots at a single rate, not the sweep-correlated picture I had in mind. Still enough to see where the CPU goes. I hope to revisit this properly in the future — but right now the proxy's performance is good enough that I'm focused on functionality, and the benchmarking harness itself still has room to mature. 
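The Amdahl's law bound mentioned above can be made concrete with a small sketch. The function names are illustrative only, and the 2% share is the hypothetical from the text rather than a measured proxy figure.

```python
def amdahl_speedup(fraction, component_speedup):
    """Overall speedup when `fraction` of runtime is made `component_speedup` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if component_speedup <= 0:
        raise ValueError("component_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


def max_speedup(fraction):
    """Upper bound on overall speedup: the component's cost is removed entirely."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - fraction)


if __name__ == "__main__":
    # A component that owns 2% of CPU time, eliminated completely.
    print(f"{max_speedup(0.02):.4f}")
    # The same component made twice as fast.
    print(f"{amdahl_speedup(0.02, 2.0):.4f}")
```

Even in the limiting case the overall gain stays close to 2%, which is why the profile has to show a component owning a meaningful share of CPU before optimising it is worth the effort.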
+ +We captured CPU profiles using async-profiler attached to the proxy JVM via `jcmd JVMTI.agent_load`, during the steady-state measurement phase. These are self-time percentages — where the CPU is actually spending cycles, not inclusive call-tree time. The flamegraphs below are fully interactive: hover over a frame to see its name and percentage, click to zoom in, Ctrl+F to search. Scroll within the frame to explore the full stack depth. @@ -258,7 +262,9 @@ Spotting these required noticing that two different probe flamegraphs were pixel ## Run it yourself -Everything is in `kroxylicious-openmessaging-benchmarks/` in the [main Kroxylicious repository](https://github.com/kroxylicious/kroxylicious). See `QUICKSTART.md` for step-by-step instructions. You'll need a Kubernetes or OpenShift cluster, the Kroxylicious operator installed, and Helm 3. Minikube works for local runs — the quickstart covers recommended CPU and memory settings. +We're an open source project — we share our workings. The raw OMB result JSON, JFR recordings, and flamegraph files that back this post are available [TODO: link to raw data]. If you want to verify the numbers, reproduce the analysis, or compare against your own runs, everything you need is there. + +If you want to run it against your own cluster, everything is in `kroxylicious-openmessaging-benchmarks/` in the [main Kroxylicious repository](https://github.com/kroxylicious/kroxylicious). See `QUICKSTART.md` for step-by-step instructions. You'll need a Kubernetes or OpenShift cluster, the Kroxylicious operator installed, and Helm 3. Minikube works for local runs — the quickstart covers recommended CPU and memory settings. 
```bash # Run a baseline vs encryption comparison From e925422bf5d6c2ed0c4527f96cd262fb05d15d41 Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Wed, 20 May 2026 17:17:12 +1200 Subject: [PATCH 16/25] Polish voice, fix typos, and mark stale flamegraph references - Fix punctuation on OMB methodology comparability sentence - Fix repeated "We leaned towards repeatable" in workload design section - Fix tense: "will make" -> "makes" for workload design aside - Fix typo: "died in the wool" -> "dyed in the wool" - Add closing paragraph to flamegraph section: proxy wins are real but we aren't going to make AES faster - Replace stale 36k msg/s flamegraph references with FIXME pending new profiler runs Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- ...8-benchmarking-the-proxy-under-the-hood.md | 22 ++++++++++--------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md index 35bcfab2..e775be30 100644 --- a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md +++ b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md @@ -26,7 +26,7 @@ And critically, it's never heard of Kroxylicious... You have though, you're here [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) is a better fit. It's an industry-standard tool used by Confluent, the Pulsar team, and others for their published performance comparisons - so who am I to argue? OMB coordinates producers and consumers across separate worker pods, runs a configurable warmup phase before taking measurements, takes latency tracking seriously — correcting for coordinated omission, and outputs structured JSON that's straightforward to process programmatically. What's not to like? -Using OMB also means our methodology is directly comparable to other published Kafka benchmarks. 
The numbers aren't comparable of course it's not the same hardware, network conditions or phase of the moon. +Using OMB also means our methodology is directly comparable to other published Kafka benchmarks. The numbers aren't comparable, of course — it's not the same hardware, network conditions or phase of the moon. ## What we built on top of OMB @@ -73,7 +73,7 @@ If you have your own KMS — and you will run this on your own infrastructure, r ### JSON always comes in megabytes -Each benchmark run produces a blob of structured JSON. Useful in principle; a wall of noise in practice. Three [JBang](https://www.jbang.dev/)-runnable Java programs *(I'm a died in the wool java dev, sue me)* pull out the signal: +Each benchmark run produces a blob of structured JSON. Useful in principle; a wall of noise in practice. Three [JBang](https://www.jbang.dev/)-runnable Java programs *(I'm a dyed in the wool java dev, sue me)* pull out the signal: - **`RunMetadata`**: captures the run context — git commit, timestamp, cluster node specs (architecture, CPU, RAM), and on OpenShift, NIC speed read from the host via the MachineConfigDaemon pod. Generates `run-metadata.json` alongside each result so you can always tell what conditions produced a number. This is what makes run-to-run comparisons meaningful — and when a run takes 12 hours, trust me, you don't want to re-run it without good reason. - **`ResultComparator`**: answers "did this change hurt?" — reads two scenario result directories and produces a markdown comparison table. Baseline vs encryption is the obvious use, but the tool is generic. Already running a proxy? proxy-no-filters vs encryption tells you the cost of the filter itself, not the proxy hop. Building your own filter? That's your comparison — measure the chain with and without it. @@ -85,11 +85,11 @@ Getting NIC speed from a Kubernetes node turned out to be non-trivial — you ne Benchmarks are artificial constructs. 
Your traffic patterns are never stable — message sizes vary, topic counts grow, producers burst — so there's always a tension between numbers that are *representative* and numbers that are actually *repeatable*. We leaned towards repeatable. -The primary workload will make Kafka experts wince *(I had to squirm to type it)* — **1 topic, 1 partition, 1 KB messages**. Concentrating everything onto a single TopicPartition means we hit the limits earlier, at lower absolute volumes, which makes the proxy's contribution easier to isolate. Isolating the proxy is, after all, the goal. +The primary workload makes Kafka experts wince *(I had to squirm to type it)* — **1 topic, 1 partition, 1 KB messages**. Concentrating everything onto a single TopicPartition means we hit the limits earlier, at lower absolute volumes, which makes the proxy's contribution easier to isolate. Isolating the proxy is, after all, the goal. But Kafka is often described as a distributed append-only log, and we can't ignore the word "distributed" when it comes to latency. With RF=1, the proxy doubles the sequential hops in the critical path: one becomes two. That's not wrong, but it's not a fair picture either — nobody runs RF=1 in production. With RF=3, the leader waits for ISR acknowledgements before confirming the produce, so there's already replication latency in the critical path. The proxy adds a real, sequential hop — we're not trying to bury that — but it lands alongside a cost that's already there. One extra hop on top of a multi-hop round trip is a different picture from doubling a single-hop one. Three brokers, hot partition replicated across all of them. -We leaned towards repeatable — but we didn't abandon representative entirely. The multi-topic runs (10 and 100 topics) are the reconnection point: load spread across more topics, closer to what production actually looks like, at rates well below any saturation point. 
You're measuring the proxy's baseline tax — the cost you always pay, not just the cost when you're pushing hard. It holds. +But we didn't abandon representative entirely. The multi-topic runs (10 and 100 topics) are the reconnection point: load spread across more topics, closer to what production actually looks like, at rates well below any saturation point. You're measuring the proxy's baseline tax — the cost you always pay, not just the cost when you're pushing hard. It holds. That covers the first dimension — the proxy's latency tax at normal load. For the second, throughput, the question is: how much does routing through the proxy reduce your maximum sustainable rate? That needs a different approach. We used rate sweeps: hold the connection count fixed, step the rate up incrementally, and watch what happens. Below the ceiling, achieved throughput tracks the target — the system keeps up. Above it, it can't, and falls behind. The point where achieved throughput diverges from the target rate — where we defined that as dropping below 95% — is the saturation point. That's the knee of the curve, and that's what we were hunting. @@ -191,9 +191,9 @@ The flamegraphs below are fully interactive: hover over a frame to see its name -
CPU flamegraph — passthrough proxy (no filters), 36,000 msg/s, 1 topic, 1 KB messages. Open full screen ↗
+
CPU flamegraph — passthrough proxy (no filters), FIXME msg/s, 1 topic, 1 KB messages. Open full screen ↗
| Category | CPU share | @@ -212,15 +212,15 @@ Kroxylicious decodes Kafka RPCs selectively: each filter declares which API keys The 1.4% is the cost of a proxy that is *selectively* L7: doing real Kafka protocol work where it matters, and treating the hot path like a TCP relay where it doesn't. That's not a side-effect — it's what the decode predicate design is for, and this flamegraph validates it. -### Encryption proxy (same 36,000 msg/s rate) +### Encryption proxy (same FIXME msg/s rate)
- -
CPU flamegraph — encryption proxy (AES-256-GCM), 36,000 msg/s, 1 topic, 1 KB messages. Open full screen ↗
+
CPU flamegraph — encryption proxy (AES-256-GCM), FIXME msg/s, 1 topic, 1 KB messages. Open full screen ↗
| Category | No-filters | Encryption | Delta | @@ -248,6 +248,8 @@ Total additional CPU: ~33%. This aligns closely with the ~26% throughput reducti If you wanted to optimise this, the highest-impact areas would be: reducing buffer copies (encrypt in-place or use composite buffers), pooling encryption buffers to reduce GC pressure, and caching `Cipher` instances to reduce per-record JDK security overhead. +There are wins inside the proxy we haven't chased yet — serialisation and deserialisation we could avoid, buffer copies imposed by how memory records are structured. Some would be straightforward; others would require rethinking how Kafka records are modelled in memory. We haven't gone after them. But to put it plainly: we can optimise all we like inside the proxy, and we're still not going to make AES faster. + ## Bugs we found in our own tooling During the 4-producer rate sweep, we noticed that JFR recordings and flamegraphs from probes 2 onwards all looked identical to probe 1. They were stale copies. Three bugs. From 5f6ade95080c0beabc59bc02b7b4cd5b1526289d Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Thu, 21 May 2026 14:42:45 +1200 Subject: [PATCH 17/25] Add TL;DR, fix opening sentence, and address Bob's review suggestions - Change "All good" to "Every good benchmarking story starts" (Bob's suggestion) - Add TL;DR paragraph with key numbers and sizing formula; flagged with FIXME comment pending final benchmark run Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- _posts/2026-05-21-benchmarking-the-proxy.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/_posts/2026-05-21-benchmarking-the-proxy.md b/_posts/2026-05-21-benchmarking-the-proxy.md index 6622205c..b4a3bfde 100644 --- a/_posts/2026-05-21-benchmarking-the-proxy.md +++ b/_posts/2026-05-21-benchmarking-the-proxy.md @@ -7,12 +7,15 @@ author_url: "https://github.com/SamBarker" categories: benchmarking performance --- -All good benchmarking stories start with a hunch. 
Mine was that Kroxylicious is cheap to run — I'd stake my career on it, in fact — but it turns out that "trust me, I wrote it" is not a widely accepted unit of measurement. People want proof. Sensibly. +Every good benchmarking story starts with a hunch. Mine was that Kroxylicious is cheap to run — I'd stake my career on it, in fact — but it turns out that "trust me, I wrote it" is not a widely accepted unit of measurement. People want proof. Sensibly. There's a practical question underneath the hunch too. The most common thing operators ask us is some variation of: "How many cores does the proxy need?" Which is really just "is this thing going to slow down my Kafka?" in a polite engineering hat. We'd been giving the classic answer: "it depends on your workload and traffic patterns, so you'll need to test in your environment." Which is true. And also deeply unsatisfying for everyone involved, including us. So we stopped saying "it depends", and got off the fence: we built something you can run **yourselves** on your own infrastructure with your own workload, and measured it. Here are some representative numbers from ours. + +**TL;DR**: A passthrough Kroxylicious proxy adds ~0.2 ms to average publish latency with no throughput impact. Add record encryption and expect a ~25% throughput reduction and 0.2–3 ms of additional latency at comfortable rates. The throughput ceiling scales linearly with CPU: budget 10 millicores per MB/s of total proxy traffic. The full benchmark harness is open source — run it on your own cluster for numbers that reflect your workload. 
+ ## What we measured We ran three scenarios against the same Apache Kafka® cluster on the same hardware: From 852799294b2bbfbd247eb551abcfd1dac4033803 Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Thu, 21 May 2026 14:48:28 +1200 Subject: [PATCH 18/25] Add runtime estimate, gist link, and fix em dash in post 2 - Fix lone space-hyphen-space to em dash in OMB description - Add runtime warning (~14 hours) before benchmark commands with link to the full blog post reproduction script as a gist Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- _posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md index e775be30..7c9bea6a 100644 --- a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md +++ b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md @@ -24,7 +24,7 @@ Kafka ships with `kafka-producer-perf-test` and `kafka-consumer-perf-test`. We'd And critically, it's never heard of Kroxylicious... You have though, you're here! -[OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) is a better fit. It's an industry-standard tool used by Confluent, the Pulsar team, and others for their published performance comparisons - so who am I to argue? OMB coordinates producers and consumers across separate worker pods, runs a configurable warmup phase before taking measurements, takes latency tracking seriously — correcting for coordinated omission, and outputs structured JSON that's straightforward to process programmatically. What's not to like? +[OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) is a better fit. It's an industry-standard tool used by Confluent, the Pulsar team, and others for their published performance comparisons — so who am I to argue? 
OMB coordinates producers and consumers across separate worker pods, runs a configurable warmup phase before taking measurements, takes latency tracking seriously — correcting for coordinated omission, and outputs structured JSON that's straightforward to process programmatically. What's not to like? Using OMB also means our methodology is directly comparable to other published Kafka benchmarks. The numbers aren't comparable, of course — it's not the same hardware, network conditions or phase of the moon. @@ -268,6 +268,8 @@ We're an open source project — we share our workings. The raw OMB result JSON, If you want to run it against your own cluster, everything is in `kroxylicious-openmessaging-benchmarks/` in the [main Kroxylicious repository](https://github.com/kroxylicious/kroxylicious). See `QUICKSTART.md` for step-by-step instructions. You'll need a Kubernetes or OpenShift cluster, the Kroxylicious operator installed, and Helm 3. Minikube works for local runs — the quickstart covers recommended CPU and memory settings. +I got so bored re-evaluating everything as I explored anti-affinity that I even scripted the whole exercise for this post — but brace yourself, it has about a 14 hour runtime. tmux and a control node or jump host are your friends here. The [full blog post script](https://gist.github.com/SamBarker/19fd06ac9a8614cc6be89b76a90e006a) is available as a gist if you want to reproduce the exact run. + ```bash # Run a baseline vs encryption comparison ./scripts/run-benchmark.sh --scenario baseline From cea960f2723d7f26c371135b57e81d7904dc837f Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Thu, 21 May 2026 16:05:22 +1200 Subject: [PATCH 19/25] Add statistical significance narrative to coefficient section Explains why MWU testing was added (PhD teammate asked "is the difference real?"), how check-significance.sh works (per-window p99, ~30 samples, p < 0.05), and the honest caveat that per-window samples aren't fully uncorrelated.
Distinguishes clearly between what MWU covers (latency delta realness) and what the coefficient derivation doesn't (n=4, no significance test, untested across message sizes). Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- _posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md index 7c9bea6a..30f2dc6c 100644 --- a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md +++ b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md @@ -150,6 +150,12 @@ Measured: 9.7 mc per MB/s of total proxy traffic (±6.6 stdev, n=4 non-saturated → for 1:1 produce:consume at 1 KB: 20 mc per MB/s of produce throughput ``` +I was proudly showing off some early numbers — baseline vs proxy, looking good — when one of the computer science PhDs on the team asked, "is the difference real?" Best answer I could come up with at the time: "Good question." So I went and added statistical significance testing. + +`check-significance.sh` runs Mann-Whitney U at p < 0.05, comparing per-window p99 latency samples between baseline and candidate at each rate step. OMB slices the test phase into time windows and records a p99 per window — ~30 samples per 5-minute run — so MWU has enough data to distinguish real signal from noise. It's not perfect: those per-window samples aren't entirely uncorrelated — a GC pause can drag multiple adjacent windows — but it gives a principled answer to "is this overhead real, or am I chasing noise?" + +The coefficient is a different matter. It's derived from JFR CPU data across n=4 non-saturated probes; the ±6.6 stdev reflects measurement noise, not a tested confidence interval. It holds at 1, 2, and 4 cores — the linear scaling claim is consistent — but its validity across message sizes or workload shapes is untested. 
+ The mechanism: `cpu: 1000m → availableProcessors()=1 → one Netty event loop thread`. At 4000m that's four threads, each handling its share of connections in parallel. If the ceiling scales linearly with thread count, a 4-core pod should handle roughly four times as much. We ran it. | CPU limit | Rate | p99 | Verdict | From 45b351cff1f184c4dad00539bde2266c0f51094c Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Thu, 21 May 2026 16:28:19 +1200 Subject: [PATCH 20/25] Address claude.ai review suggestions for post 1 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Move p99 explanation before first passthrough table where percentiles are first encountered; remove duplicate from encryption section - Expand Layer 7 point with one sentence of context for non-technical readers: most Kafka proxies operate at L4, Kroxylicious parses every message yet still adds only 0.2 ms - Add distribution board analogy for independent connection handling vs broker shared resource contention - Simplify replication factor caveat to one sentence, linking to companion post for detail - Fix "Most proxies" → "Most proxies operate on Kafka" for accuracy Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- _posts/2026-05-21-benchmarking-the-proxy.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/_posts/2026-05-21-benchmarking-the-proxy.md b/_posts/2026-05-21-benchmarking-the-proxy.md index b4a3bfde..9ac78bb2 100644 --- a/_posts/2026-05-21-benchmarking-the-proxy.md +++ b/_posts/2026-05-21-benchmarking-the-proxy.md @@ -48,6 +48,8 @@ One important caveat: this Kafka cluster is deliberately untuned. We're not tryi Good news first. The proxy itself — with no filter chain, just routing traffic — adds almost nothing. +A quick note on percentiles for anyone not steeped in performance benchmarking: p99 latency is the value that 99% of requests complete within — meaning 1 in 100 requests takes longer. 
Averages flatter; the p99 is what your slowest clients actually experience, and it's usually the number that matters. + **10 topics, 1 KB messages (5,000 msg/s per topic):** | Metric | Baseline | Proxy | Delta | @@ -70,9 +72,9 @@ Good news first. The proxy itself — with no filter chain, just routing traffic **The headline: ~0.2 ms additional average publish latency. Throughput is unaffected.** -What did I take away from this entirely unsurprising result? Not much, honestly — without filters the proxy boils the latency-sensitive path down to little more than a couple of hops through the TCP stack. We replaced a hunch with data. The remarkable part: the proxy is doing this at Layer 7. +What did I take away from this entirely unsurprising result? Not much, honestly — without filters the proxy boils the latency-sensitive path down to little more than a couple of hops through the TCP stack. We replaced a hunch with data. The remarkable part: the proxy is doing this at Layer 7. Most proxies operate on Kafka at Layer 4 — they shuffle bytes without ever understanding what those bytes mean. Kroxylicious works at Layer 7, parsing every Kafka message, yet still adds only 0.2 ms. That's the design working. -The overhead holding across 10 and 100 topics makes sense for the same reason: the proxy doesn't contend between topics. A Kafka broker juggles disk I/O, partition leaders, and replication across everything it manages; the proxy treats each connection independently. Topics don't contend for shared resources: throughput scales linearly across them, and the connection sweep validates it. +The overhead holding across 10 and 100 topics makes sense for the same reason: the proxy doesn't contend between topics. Think of the proxy as independent circuits on a distribution board — switching the breaker for lights doesn't cut power to the fridge. 
A Kafka broker is more like the mains supply itself — every circuit draws from the same source, so heavy load anywhere reduces what's available everywhere. Topics don't contend for shared resources: throughput scales linearly across them, and the connection sweep validates it. The end-to-end p99 figure is dominated by Kafka consumer fetch timeouts, as it should be. That said, it is reassuring to have a sub-ms impact on the p99. @@ -84,8 +86,6 @@ Ok, so let's make the proxy smarter — make it do something people actually car ### Latency at sub-saturation rates -A quick note on percentiles for anyone not steeped in performance benchmarking: p99 latency is the value that 99% of requests complete within — meaning 1 in 100 requests takes longer. Averages flatter; the p99 is what your slowest clients actually experience, and it's usually the number that matters. - So we know encryption is doing a lot of work, but to find out the real impact we need to compare it to a plain Kafka cluster (and yes, people do run Kroxylicious without filters — TLS termination, stable client endpoints, virtual clusters — but that's a different post). The table below tells us that above a certain inflection point the numbers get really, really noisy — especially in the p99 range. **1 topic, 1 KB messages — baseline vs encryption:** @@ -162,7 +162,7 @@ Numbers without guidance aren't very useful, so here's how to translate these re These are real results from real hardware, but they don't tell a story for your workload. A few things worth knowing before you put these numbers in a slide deck: - **Message size**: all results use 1 KB messages. The coefficient is message-size-dependent — encryption overhead as a percentage is likely lower for larger messages. -- **Replication factor**: the 1-topic rate sweep ran at RF=3. 
At that replication factor, Kafka's ISR replication traffic creates a per-partition ceiling that sits close to where proxy CPU also saturates — the two limits are entangled in those results. The sizing coefficient was derived from RF=1 multi-topic workloads specifically to isolate proxy CPU. The [companion engineering post]({% post_url 2026-05-28-benchmarking-the-proxy-under-the-hood %}) has that detail. +- **Replication factor**: the encryption numbers assume traffic isn't already hitting Kafka's own replication limits — the [companion post]({% post_url 2026-05-28-benchmarking-the-proxy-under-the-hood %}) explains why that matters. - **Horizontal scaling**: linear scaling has been validated across CPU allocations on a single pod; multi-pod horizontal scaling hasn't been measured but is expected to follow the same coefficient. For the engineering story — why we built a custom harness on top of OMB, what the CPU flamegraphs actually show, and the bugs we found in our own tooling along the way — that's in the [companion post]({% post_url 2026-05-28-benchmarking-the-proxy-under-the-hood %}). From a7a2f4e5318a188de1a27452a38d50222bf23408 Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Thu, 21 May 2026 16:47:58 +1200 Subject: [PATCH 21/25] Remove post_url links to companion post pending its publish date MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Post 2 is dated 2026-05-28 — Jekyll skips future posts by default, causing post_url resolution to fail at build time. Replace linked references with plain "companion post" text; links will be restored via a follow-up PR when Post 2 goes live. 
Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- _posts/2026-05-21-benchmarking-the-proxy.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_posts/2026-05-21-benchmarking-the-proxy.md b/_posts/2026-05-21-benchmarking-the-proxy.md index 9ac78bb2..1d16ed4e 100644 --- a/_posts/2026-05-21-benchmarking-the-proxy.md +++ b/_posts/2026-05-21-benchmarking-the-proxy.md @@ -24,7 +24,7 @@ We ran three scenarios against the same Apache Kafka® cluster on the same hardw - **Passthrough proxy** — traffic routed through Kroxylicious with no filter chain configured - **Record encryption** — traffic through Kroxylicious with AES-256-GCM record encryption enabled, using HashiCorp Vault as the KMS -We used [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) rather than Kafka's own `kafka-producer-perf-test`. OMB is an industry-standard tool that coordinates producers and consumers together, measures end-to-end latency (not just publish latency), and produces structured JSON that makes comparison straightforward. More on why we built a whole harness around it in the [companion engineering post]({% post_url 2026-05-28-benchmarking-the-proxy-under-the-hood %}). +We used [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) rather than Kafka's own `kafka-producer-perf-test`. OMB is an industry-standard tool that coordinates producers and consumers together, measures end-to-end latency (not just publish latency), and produces structured JSON that makes comparison straightforward. More on why we built a whole harness around it in a companion engineering post. ## Test environment @@ -162,9 +162,9 @@ Numbers without guidance aren't very useful, so here's how to translate these re These are real results from real hardware, but they don't tell a story for your workload. A few things worth knowing before you put these numbers in a slide deck: - **Message size**: all results use 1 KB messages. 
The coefficient is message-size-dependent — encryption overhead as a percentage is likely lower for larger messages. -- **Replication factor**: the encryption numbers assume traffic isn't already hitting Kafka's own replication limits — the [companion post]({% post_url 2026-05-28-benchmarking-the-proxy-under-the-hood %}) explains why that matters. +- **Replication factor**: the encryption numbers assume traffic isn't already hitting Kafka's own replication limits — a companion post explains why that matters. - **Horizontal scaling**: linear scaling has been validated across CPU allocations on a single pod; multi-pod horizontal scaling hasn't been measured but is expected to follow the same coefficient. -For the engineering story — why we built a custom harness on top of OMB, what the CPU flamegraphs actually show, and the bugs we found in our own tooling along the way — that's in the [companion post]({% post_url 2026-05-28-benchmarking-the-proxy-under-the-hood %}). +For the engineering story — why we built a custom harness on top of OMB, what the CPU flamegraphs actually show, and the bugs we found in our own tooling along the way — that's in a companion post. The full benchmark suite, quickstart guide, and sizing reference are in `kroxylicious-openmessaging-benchmarks/` in the [main Kroxylicious repository](https://github.com/kroxylicious/kroxylicious). 
From 72b96470e89b8e9024d39a2141180c79e1227254 Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Fri, 22 May 2026 09:40:18 +1200 Subject: [PATCH 22/25] Address showuon review comments on post 1 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Convert TL;DR from prose to bulleted list (S1) - Soften "dominated by Kafka consumer fetch timeouts" to "likely dominated by" — this is an inference, not a measured fact (S5) - Inline definition of rate sweep at first use in sizing guidance (S9) - Broaden "With record encryption" to "With filters (record encryption is the representative example here)" (S10) Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- _posts/2026-05-21-benchmarking-the-proxy.md | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/_posts/2026-05-21-benchmarking-the-proxy.md b/_posts/2026-05-21-benchmarking-the-proxy.md index 1d16ed4e..310be76a 100644 --- a/_posts/2026-05-21-benchmarking-the-proxy.md +++ b/_posts/2026-05-21-benchmarking-the-proxy.md @@ -9,12 +9,16 @@ categories: benchmarking performance Every good benchmarking story starts with a hunch. Mine was that Kroxylicious is cheap to run — I'd stake my career on it, in fact — but it turns out that "trust me, I wrote it" is not a widely accepted unit of measurement. People want proof. Sensibly. -There's a practical question underneath the hunch too. The most common thing operators ask us is some variation of: "How many cores does the proxy need?" Which is really just "is this thing going to slow down my Kafka?" in a polite engineering hat. We'd been giving the classic answer: "it depends on your workload and traffic patterns, so you'll need to test in your environment." Which is true. And also deeply unsatisfying for everyone involved, including us. +There's a practical question underneath the hunch too. The most common thing operators ask us is some variation of: "How many cores does the proxy need?"
Which is really just "is this thing going to slow down my Kafka?" in a polite engineering hat. We'd been giving the classic answer: "it depends on your workload and traffic patterns, so you'll need to test in your environment." Which is true. And also deeply unsatisfying for everyone involved, including us. So we stopped saying "it depends", and got off the fence: we built something you can run **yourselves** on your own infrastructure with your own workload, and measured it. Here are some representative numbers from ours. -**TL;DR**: A passthrough Kroxylicious proxy adds ~0.2 ms to average publish latency with no throughput impact. Add record encryption and expect a ~25% throughput reduction and 0.2–3 ms of additional latency at comfortable rates. The throughput ceiling scales linearly with CPU: budget 10 millicores per MB/s of total proxy traffic. The full benchmark harness is open source — run it on your own cluster for numbers that reflect your workload. +**TL;DR**: +- A passthrough proxy adds ~0.2 ms to average publish latency with no throughput impact +- Add record encryption and expect a ~25% throughput reduction and 0.2–3 ms of additional latency at comfortable rates +- The throughput ceiling scales linearly with CPU: budget 10 millicores per MB/s of total proxy traffic +- The full benchmark harness is open source — run it on your own cluster for numbers that reflect your workload ## What we measured @@ -76,7 +80,7 @@ What did I take away from this entirely unsurprising result? Not much, honestly The overhead holding across 10 and 100 topics makes sense for the same reason: the proxy doesn't contend between topics. Think of the proxy as independent circuits on a distribution board — switching the breaker for lights doesn't cut power to the fridge. A Kafka broker is more like the mains supply itself — every circuit draws from the same source, so heavy load anywhere reduces what's available everywhere. 
Topics don't contend for shared resources: throughput scales linearly across them, and the connection sweep validates it. -The end-to-end p99 figure is dominated by Kafka consumer fetch timeouts, as it should be. That said, it is reassuring to have a sub-ms impact on the p99. +The end-to-end p99 figure is likely dominated by Kafka consumer fetch timeouts, as it should be. That said, it is reassuring to have a sub-ms impact on the p99. --- @@ -135,9 +139,9 @@ To find the proxy's real ceiling, you need a workload that doesn't hit the Kafka Numbers without guidance aren't very useful, so here's how to translate these results into pod specs. -**Passthrough proxy**: size your Kafka cluster as you normally would. The proxy won't be the bottleneck — but if you want to verify that on your own hardware, the rate sweep is exactly the tool for it. Run the baseline and passthrough scenarios back-to-back and you'll have your own numbers. +**Passthrough proxy**: size your Kafka cluster as you normally would. The proxy won't be the bottleneck — but if you want to verify that on your own hardware, the rate sweep — which steps the producer rate up incrementally until the system can't keep up — is exactly the tool for it. Run the baseline and passthrough scenarios back-to-back and you'll have your own numbers. -**With record encryption:** +**With filters (record encryption is the representative example here):** 1. **Throughput budget**: encryption imposes a CPU-driven throughput ceiling. As a planning formula: From 429c63552a183b768af90a0288c2914937b52bbb Mon Sep 17 00:00:00 2001 From: Sam Barker Date: Fri, 22 May 2026 09:44:24 +1200 Subject: [PATCH 23/25] Reschedule posts to 26 May and 2 June Rename files and update front matter dates; update post_url reference in Post 2 to match Post 1's new filename. 
Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- .DS_Store | Bin 0 -> 10244 bytes .op/plugins/gh.json | 15 +++ _data/documentation/0_21_0-SNAPSHOT.yaml | 124 ++++++++++++++++++ _data/documentation/0_22_0-SNAPSHOT.yaml | 115 ++++++++++++++++ _data/release/0_21_0-SNAPSHOT.yaml | 34 +++++ _data/release/0_22_0-SNAPSHOT.yaml | 34 +++++ ...d => 2026-05-26-benchmarking-the-proxy.md} | 2 +- ...-benchmarking-the-proxy-under-the-hood.md} | 4 +- assets/.DS_Store | Bin 0 -> 8196 bytes assets/blog/.DS_Store | Bin 0 -> 6148 bytes assets/pages/.DS_Store | Bin 0 -> 8196 bytes assets/pages/images/.DS_Store | Bin 0 -> 6148 bytes assets/theme/.DS_Store | Bin 0 -> 6148 bytes download/0.21.0-SNAPSHOT/index.md | 3 + download/0.22.0-SNAPSHOT/index.md | 3 + 15 files changed, 331 insertions(+), 3 deletions(-) create mode 100644 .DS_Store create mode 100644 .op/plugins/gh.json create mode 100644 _data/documentation/0_21_0-SNAPSHOT.yaml create mode 100644 _data/documentation/0_22_0-SNAPSHOT.yaml create mode 100644 _data/release/0_21_0-SNAPSHOT.yaml create mode 100644 _data/release/0_22_0-SNAPSHOT.yaml rename _posts/{2026-05-21-benchmarking-the-proxy.md => 2026-05-26-benchmarking-the-proxy.md} (99%) rename _posts/{2026-05-28-benchmarking-the-proxy-under-the-hood.md => 2026-06-02-benchmarking-the-proxy-under-the-hood.md} (99%) create mode 100644 assets/.DS_Store create mode 100644 assets/blog/.DS_Store create mode 100644 assets/pages/.DS_Store create mode 100644 assets/pages/images/.DS_Store create mode 100644 assets/theme/.DS_Store create mode 100644 download/0.21.0-SNAPSHOT/index.md create mode 100644 download/0.22.0-SNAPSHOT/index.md diff --git a/.DS_Store b/.DS_Store new file mode 100644 index 0000000000000000000000000000000000000000..00dd5ca205f8e9850e6c8185d20e5709cf71276c Binary files /dev/null and b/.DS_Store differ
diff --git a/.op/plugins/gh.json b/.op/plugins/gh.json new file mode 100644 index 00000000..09012c6c --- /dev/null +++ b/.op/plugins/gh.json @@ -0,0 +1,15 @@ +{ + "account_id": "ARMR3S6DY5DV7F7K3C5LGNCRLU", + "entrypoint": [ + "gh" + ], + "credentials": [ + { + "plugin": "github", + "credential_type": "personal_access_token", + "usage_id": "personal_access_token", + "vault_id": "3tkwphno4pf4qvczjrkm53lsga", + "item_id": "y2ayrvc4qiyvz6xq4t77tyhmai" + } + ] +} \ No newline at end of file diff --git a/_data/documentation/0_21_0-SNAPSHOT.yaml b/_data/documentation/0_21_0-SNAPSHOT.yaml new file mode 100644 index 00000000..5e7edc56 --- /dev/null +++ b/_data/documentation/0_21_0-SNAPSHOT.yaml @@ -0,0 +1,124 @@ +docs: + - title: Proxy Quick Start + description: Start here if you're experimenting with the proxy for the first time. + tags: + - proxy + rank: '000' + path: html/proxy-quick-start + - title: Proxy Guide + description: "Using the Proxy, including configuration, security and operation."
+ tags: + - proxy + - security + rank: '010' + path: html/kroxylicious-proxy + - title: Record Encryption Quick Start + description: Start here for an encryption-at-rest solution for Apache Kafka®. + tags: + - security + - filter + rank: '011' + path: html/record-encryption-quick-start + - title: Kroxylicious Operator for Kubernetes + description: Using the Kroxylicious Operator to deploy and run the Proxy in a + Kubernetes environment. + tags: + - kubernetes + rank: '020' + path: html/kroxylicious-operator + - title: Record Encryption Guide + description: Using the record encryption filter to provide encryption-at-rest + for Apache Kafka®. + tags: + - security + - filter + rank: '020' + path: html/record-encryption-guide + - title: Kroxylicious Admission Webhook + description: Using the Kroxylicious Admission Webhook to inject proxy sidecars + into application pods in a Kubernetes environment. + tags: + - kubernetes + rank: '021' + path: html/admission-webhook-guide + - title: Record Validation Guide + description: "Using the record validation filter to ensure records follow certain\ + \ rules, including schema and signature validity." + tags: + - governance + - filter + rank: '021' + path: html/record-validation-guide + - title: Multi-tenancy Guide + description: Using the multi-tenancy filter to present a single Kafka® cluster + as if it were multiple clusters. + tags: + - filter + rank: '022' + path: html/multi-tenancy-guide + - title: Oauth Bearer Validation guide + description: "Using the Oauth Bearer validation filter to validate JWT tokens\ + \ received \nfrom Kafka® clients during authentication.\n" + tags: + - filter + - security + rank: '023' + path: html/oauth-bearer-validation + - title: SASL Inspection Guide + description: Using the SASL Inspection filter to infer the client's subject from + its successful authentication exchange with a broker. 
+ tags: + - filter + - security + rank: '023' + path: html/sasl-inspection-guide + - title: Authorization Guide + description: Using the Authorization filter to provide Kafka®-equivalent access + controls within the proxy. + tags: + - security + - filter + rank: '024' + path: html/authorization-guide + - title: Entity Isolation Guide + description: Using the entity isolation filter to give authenticated Kafka® clients + a private namespace within a Kafka cluster. + tags: + - filter + rank: '025' + path: html/entity-isolation-guide + - title: Performance and Sizing Guide + description: Sizing Kroxylicious for production deployments — passthrough overhead + and record encryption CPU budgeting. + tags: + - performance + - sizing + - record-encryption + rank: '025' + path: html/performance-guide + - title: Connection Expiration Guide + description: Using the connection expiration filter to avoid connection skew in + Kubernetes environments. + tags: + - kubernetes + - filter + rank: '030' + path: html/connection-expiration-guide + - title: Developer Quick Start + description: Start here if you're developing a filter for the first time. + tags: + - developer + rank: '031' + path: html/developer-quick-start + - title: Kroxylicious Developer Guide + description: Writing plugins for the proxy in the Java programming language. + tags: + - developer + rank: '032' + path: html/developer-guide + - title: Kroxylicious Javadocs + description: The Java API documentation for plugin developers. + tags: + - developer + path: javadoc/index.html + rank: '033' diff --git a/_data/documentation/0_22_0-SNAPSHOT.yaml b/_data/documentation/0_22_0-SNAPSHOT.yaml new file mode 100644 index 00000000..866f62ed --- /dev/null +++ b/_data/documentation/0_22_0-SNAPSHOT.yaml @@ -0,0 +1,115 @@ +docs: + - title: Proxy Quick Start + description: Start here if you're experimenting with the proxy for the first time. 
+ tags: + - proxy + rank: '000' + path: html/proxy-quick-start + - title: Proxy Guide + description: "Using the Proxy, including configuration, security and operation." + tags: + - proxy + - security + rank: '010' + path: html/kroxylicious-proxy + - title: Record Encryption Quick Start + description: Start here for an encryption-at-rest solution for Apache Kafka®. + tags: + - security + - filter + rank: '011' + path: html/record-encryption-quick-start + - title: Kroxylicious Operator for Kubernetes + description: Using the Kroxylicious Operator to deploy and run the Proxy in a + Kubernetes environment. + tags: + - kubernetes + rank: '020' + path: html/kroxylicious-operator + - title: Record Encryption Guide + description: Using the record encryption filter to provide encryption-at-rest + for Apache Kafka®. + tags: + - security + - filter + rank: '020' + path: html/record-encryption-guide + - title: Kroxylicious Admission Webhook + description: Using the Kroxylicious Admission Webhook to inject proxy sidecars + into application pods in a Kubernetes environment. + tags: + - kubernetes + rank: '021' + path: html/admission-webhook-guide + - title: Record Validation Guide + description: "Using the record validation filter to ensure records follow certain\ + \ rules, including schema and signature validity." + tags: + - governance + - filter + rank: '021' + path: html/record-validation-guide + - title: Multi-tenancy Guide + description: Using the multi-tenancy filter to present a single Kafka® cluster + as if it were multiple clusters. 
+ tags: + - filter + rank: '022' + path: html/multi-tenancy-guide + - title: Oauth Bearer Validation guide + description: "Using the Oauth Bearer validation filter to validate JWT tokens\ + \ received \nfrom Kafka® clients during authentication.\n" + tags: + - filter + - security + rank: '023' + path: html/oauth-bearer-validation + - title: SASL Inspection Guide + description: Using the SASL Inspection filter to infer the client's subject from + its successful authentication exchange with a broker. + tags: + - filter + - security + rank: '023' + path: html/sasl-inspection-guide + - title: Authorization Guide + description: Using the Authorization filter to provide Kafka®-equivalent access + controls within the proxy. + tags: + - security + - filter + rank: '024' + path: html/authorization-guide + - title: Entity Isolation Guide + description: Using the entity isolation filter to give authenticated Kafka® clients + a private namespace within a Kafka cluster. + tags: + - filter + rank: '025' + path: html/entity-isolation-guide + - title: Connection Expiration Guide + description: Using the connection expiration filter to avoid connection skew in + Kubernetes environments. + tags: + - kubernetes + - filter + rank: '030' + path: html/connection-expiration-guide + - title: Developer Quick Start + description: Start here if you're developing a filter for the first time. + tags: + - developer + rank: '031' + path: html/developer-quick-start + - title: Kroxylicious Developer Guide + description: Writing plugins for the proxy in the Java programming language. + tags: + - developer + rank: '032' + path: html/developer-guide + - title: Kroxylicious Javadocs + description: The Java API documentation for plugin developers. 
+ tags: + - developer + path: javadoc/index.html + rank: '033' diff --git a/_data/release/0_21_0-SNAPSHOT.yaml b/_data/release/0_21_0-SNAPSHOT.yaml new file mode 100644 index 00000000..0b80c9f2 --- /dev/null +++ b/_data/release/0_21_0-SNAPSHOT.yaml @@ -0,0 +1,34 @@ +# +# Copyright Kroxylicious Authors. +# +# Licensed under the Apache Software License version 2.0, available at http://www.apache.org/licenses/LICENSE-2.0 +# + +releaseNotesUrl: https://github.com/kroxylicious/kroxylicious/releases/tag/v$(VERSION)/ +assetBaseUrl: https://github.com/kroxylicious/kroxylicious/releases/download/v$(VERSION)/ +assets: + - name: Proxy + description: The proxy application. + downloads: + - format: zip + path: kroxylicious-app-$(VERSION)-bin.zip + - format: tar.gz + path: kroxylicious-app-$(VERSION)-bin.tar.gz + - name: Operator + description: The Kubernetes operator. + downloads: + - format: zip + path: kroxylicious-operator-$(VERSION).zip + - format: tar.gz + path: kroxylicious-operator-$(VERSION).tar.gz +images: + - name: Proxy + url: https://quay.io/repository/kroxylicious/proxy?tab=tags + registry: quay.io/kroxylicious/proxy + tag: $(VERSION) + digest: sha256:REPLACE_WITH_SHA_AFTER_IMAGE_RELEASE + - name: Operator + url: https://quay.io/repository/kroxylicious/operator?tab=tags + registry: quay.io/kroxylicious/operator + tag: $(VERSION) + digest: sha256:REPLACE_WITH_SHA_AFTER_IMAGE_RELEASE \ No newline at end of file diff --git a/_data/release/0_22_0-SNAPSHOT.yaml b/_data/release/0_22_0-SNAPSHOT.yaml new file mode 100644 index 00000000..0b80c9f2 --- /dev/null +++ b/_data/release/0_22_0-SNAPSHOT.yaml @@ -0,0 +1,34 @@ +# +# Copyright Kroxylicious Authors. 
+# +# Licensed under the Apache Software License version 2.0, available at http://www.apache.org/licenses/LICENSE-2.0 +# + +releaseNotesUrl: https://github.com/kroxylicious/kroxylicious/releases/tag/v$(VERSION)/ +assetBaseUrl: https://github.com/kroxylicious/kroxylicious/releases/download/v$(VERSION)/ +assets: + - name: Proxy + description: The proxy application. + downloads: + - format: zip + path: kroxylicious-app-$(VERSION)-bin.zip + - format: tar.gz + path: kroxylicious-app-$(VERSION)-bin.tar.gz + - name: Operator + description: The Kubernetes operator. + downloads: + - format: zip + path: kroxylicious-operator-$(VERSION).zip + - format: tar.gz + path: kroxylicious-operator-$(VERSION).tar.gz +images: + - name: Proxy + url: https://quay.io/repository/kroxylicious/proxy?tab=tags + registry: quay.io/kroxylicious/proxy + tag: $(VERSION) + digest: sha256:REPLACE_WITH_SHA_AFTER_IMAGE_RELEASE + - name: Operator + url: https://quay.io/repository/kroxylicious/operator?tab=tags + registry: quay.io/kroxylicious/operator + tag: $(VERSION) + digest: sha256:REPLACE_WITH_SHA_AFTER_IMAGE_RELEASE \ No newline at end of file diff --git a/_posts/2026-05-21-benchmarking-the-proxy.md b/_posts/2026-05-26-benchmarking-the-proxy.md similarity index 99% rename from _posts/2026-05-21-benchmarking-the-proxy.md rename to _posts/2026-05-26-benchmarking-the-proxy.md index 310be76a..ea626f08 100644 --- a/_posts/2026-05-21-benchmarking-the-proxy.md +++ b/_posts/2026-05-26-benchmarking-the-proxy.md @@ -1,7 +1,7 @@ --- layout: post title: "Does my proxy look big in this cluster?" 
-date: 2026-05-21 00:00:00 +0000 +date: 2026-05-26 00:00:00 +0000 author: "Sam Barker" author_url: "https://github.com/SamBarker" categories: benchmarking performance diff --git a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md b/_posts/2026-06-02-benchmarking-the-proxy-under-the-hood.md similarity index 99% rename from _posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md rename to _posts/2026-06-02-benchmarking-the-proxy-under-the-hood.md index 30f2dc6c..e4199a24 100644 --- a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md +++ b/_posts/2026-06-02-benchmarking-the-proxy-under-the-hood.md @@ -1,7 +1,7 @@ --- layout: post title: "How hard can it be??? Maxing out a Kroxylicious instance" -date: 2026-05-28 00:00:00 +0000 +date: 2026-06-02 00:00:00 +0000 author: "Sam Barker" author_url: "https://github.com/SamBarker" categories: benchmarking performance engineering @@ -11,7 +11,7 @@ How hard can it be? We started with a laptop, a codebase, and a lot of confidenc Harder than expected. More interesting too. -We gave everyone [the numbers]({% post_url 2026-05-21-benchmarking-the-proxy %}) in a bland, but slide worthy way, already. This one is the engineering story: how we built the harness, what the flamegraphs actually show, the workload design choices that changed the answers, and the bugs we found in our own tooling. +We gave everyone [the numbers]({% post_url 2026-05-26-benchmarking-the-proxy %}) in a bland, but slide worthy way, already. This one is the engineering story: how we built the harness, what the flamegraphs actually show, the workload design choices that changed the answers, and the bugs we found in our own tooling. ## Why not Kafka's own tools? 
diff --git a/assets/.DS_Store b/assets/.DS_Store new file mode 100644 index 0000000000000000000000000000000000000000..8a06495e59117900a5c044932aed98d9f1b17c39 GIT binary patch literal 8196 zcmeI1&u-H|5XNWQggB~@asa6ukSz6DAVEb4^Q4C6HY?UjL)$% z3!9+`n|KIcyORiva?J#oz$O8eySFH&oFq-0{QWh?p8nCM1J{1>qMX`1|5rTCv$E>< ze~X>nR{LJZ>v((K=kQ!E!YZof&)wFX&r1D{@&((5L*{G^L0plv3#+`H9gcl7{EDKxr%~DaUz? z-_2p-(CSc0W13M0YL!b&Ja&jV-rGXWS(byS4KU}}P8XD3-%vtr)9V{*?YS*z8jEjY zF$XxG|JVR*OyCX(?3y|ERsR3>=J)@1@XdHiCcp%4i-2gIgeODHcJ-|6POOe* qGUZnr)gh?#X%5xtICTAoA=*A%RZL*DHi{k?zX;G6xMBjgO5hU54>?Q# literal 0 HcmV?d00001 diff --git a/assets/blog/.DS_Store b/assets/blog/.DS_Store new file mode 100644 index 0000000000000000000000000000000000000000..9f0b59938d06ec0d24e624595dc83ded90726a35 GIT binary patch literal 6148 zcmeHK%}T>S5T32IO%b671-%8l`jd(x;w9F4@M=U4Dm5`hgK4%jtvQrJ?)omCeGy+r zXLh$@s9wd&49tGBvop!=x6A$jfM^cG20$GEEL6f$4U2Dt+DWIRWISa=p}x_9^dy$I zK`eu*Xmq|#68Te7s>8pe&>|n^%MU;dxYsz5~ zrYhId4OY#nwd<>+(azqEvu7}A?_yfNp%|pE>Ijo&bn+jZ}g*T^lLKHs~76HDU{j(5Z+zRhTD+(CO&6O`K~m z*QnD$n8$}Oe-`G2BGljU`?d-P;cDcP8DIvA49w|jozDN`pWpw*B(5<7%)q~5K$Kfv ztBFH0XY179=&ZF-Z%|1nF4y>(f`&SZF_w!?Bm0U{S9OHjliX`K>;DlQ?E16P9J04Uh8Ni7^Z*hvG5BIOM4 zz$B1INyh9V?7>`XbGL|thuH3OP~RR%bB-$TxBq267{-zyZ8M**r$31kIxT2w=| z(Dr0cM90uEqz^r$UC0mAhDCrL7b7+ep9E3c2_gqbd^sPY?*5boEutORgs_R}BlV~c zy9mz}SoR^=hXkz_#$t>((xYxBq_0azw=*IozzJwD^CiQs#Em7a>+}{o39tt9Q^Ed@ zup=Kk%d*<%rPVGktd?bkA4@B20=o+jI?U(diR3!Jo9hgdUV3`poL=B3eareTavS*z zn_I?~v2DC=9!0}u-|i2Rh8@4;r)NH1@V-^m;4Wg<&0T*E43hX&L&p$$qDprKhvt+--jT7}7np#A3 zem<@ulZcFPQ@L2!n>{z**++&mCkOWA81W14cNZlEfg7;MkzE(HCqgga^y>{tEnwC%0;vJ&^%eQ zLs35+`xjp>T0N4W`-Br1nq>x$6)4BmR!g z>~4fmy@^Pff!Q~iok_B9!%hYOM0?U}0@MJ&LM1FzaQH@Oo^(M9##2NT>K=WF&qMj( zhtgk&X2*YIfc9<`0vLh^1IX_meH16CyB|u_>*r$qX*@D}{pOFtAWj;McTp&oR#wYa z*{WK1?xUQ#iI+^`wl}_^t4k?EKeIjmD(FwUwe?dOC0-Eq$0{M{^)cl3I*595+Ln{3 z7pq)XPgoVJ(yeXHX05|kqi*js=XHB_)IhV@+&`GlE7r#L-tk5EDI7-fMZGo%{6Lkg z8O-1njohN1-EkDkXn_7KCyQld2ABb6U`ZG-$D37MlJ)R1m;q+s_ZXn_!A2!?45k{@ 
z(Sd_%0T4^*R)RMDHd2ml&@q^5#2FN!QxSElFfE4A>FBpjoMSN6sMA51=0lh#3)7(p z^>n Date: Fri, 22 May 2026 09:47:47 +1200 Subject: [PATCH 24/25] Fix typo, gitignore DS_Store and .op, remove accidentally tracked files MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Fix "opevators" → "operators" - Add .DS_Store and .op/ to .gitignore - Remove accidentally committed macOS metadata and 1Password plugin files Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- .DS_Store | Bin 10244 -> 0 bytes .gitignore | 4 +++- .op/plugins/gh.json | 15 --------------- _posts/2026-05-26-benchmarking-the-proxy.md | 2 +- assets/.DS_Store | Bin 8196 -> 0 bytes assets/blog/.DS_Store | Bin 6148 -> 0 bytes assets/pages/.DS_Store | Bin 8196 -> 0 bytes assets/pages/images/.DS_Store | Bin 6148 -> 0 bytes assets/theme/.DS_Store | Bin 6148 -> 0 bytes 9 files changed, 4 insertions(+), 17 deletions(-) delete mode 100644 .DS_Store delete mode 100644 .op/plugins/gh.json delete mode 100644 assets/.DS_Store delete mode 100644 assets/blog/.DS_Store delete mode 100644 assets/pages/.DS_Store delete mode 100644 assets/pages/images/.DS_Store delete mode 100644 assets/theme/.DS_Store diff --git a/.DS_Store b/.DS_Store deleted file mode 100644 index 00dd5ca205f8e9850e6c8185d20e5709cf71276c..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 10244 zcmeHM%Wl+26up&KyJ-U9wP1l{2?@=DG#wO#5Su0>LP+zFXaoyjGW{@#iQ|ge?KCI~ z>OFh|zrdOwna`O0teA6eIibpK2H1=SQPnD6w|&mN*Qah(tK3AS*7Cs$QAk7qI@_g_ zI814LpL?av6j!dmD)5Q6X+*n}+)s4Bg7z1@0$u^HfLFjP;1&2kD1hH=Zb8FXfAk7? 
z1-t^s3h??6qO)xmc4Dk79XQAq0Co>gA>RmB!V89MHV z42NwOc4DldlQMKtMPya%P?SUmFBCYbwz2-`74QmF72v&lK&#Y3Z`%9!UZQ)rhv}sL zhJE)@nmB0CCiO{EO3x{#E%06)?fO5l=nk+=i!~3{fVy}?(B=077V>=wSpzbnAwC*C zbJ6Nj30(&|MlO9yC`W{-!gy0+1Yjh{nE(RsJ`D2~RPOXu(E zbkvR0Xefku5CQV^ahwkHcunW&AQNuK+z9GHeRJvJ!NKay)z;N;dFAkGcyOzQ)0LI$ zHx3W$!G+7$Za>)ENp@4s1%RfC-ws+d-;a-PapnlH-x;QfPI+LNvq(J*zm`(pg-}D< zL7{0HQ|b-{53f3975!N-H9im(eF*NJ3^>c2vF9+kr5>F%Ivv<{EIA@+z~(#ZA6#o?e&0IlJ7N2V|@5J|aU zMWxr4N^_zna&dk06i_y9vB2e6ZX!&1EG6hM;+4ihG_je%(Rj%?c4^>75@l?B14~<% z_Tk$sbWL+sVzE*->l|3TWc6k04RPF(Jb5wWOrvSzCkonLYIG0CU0GbGYxkF>-ANaX z2wbVq-_2FDw0WoSL@88>Fs4mJ`LafUrxW&Gk3)yQSe0qD5S%P=S}tWl^J# zpTk&kDLo-;>F`g3pNkq9jy%UPtMMEoM6b+mR?}fG6zpkO%vc9|9{$`2%U@fmj-RtU zZt|qZKM-G|v*Y7<`Sd4V0k42pz$@St@CtlC3e1{R!v^fe=l}ozejdnY>J{(`e6s>- zZoRX<2HMg4k+KJV*B+w(fzFL_6Jv#fgS?K1mDln3!ms0hLT7A?EjOIE3p+6uSJ3|Z SKLbiTLcjm_`~M$o|Njqs*cP+^ diff --git a/.gitignore b/.gitignore index b92b781b..c8803978 100644 --- a/.gitignore +++ b/.gitignore @@ -7,4 +7,6 @@ vendor *.iml */bootstrap/ _config-overrides.yml -.ruby-version \ No newline at end of file +.ruby-version +.DS_Store +.op/ \ No newline at end of file diff --git a/.op/plugins/gh.json b/.op/plugins/gh.json deleted file mode 100644 index 09012c6c..00000000 --- a/.op/plugins/gh.json +++ /dev/null @@ -1,15 +0,0 @@ -{ - "account_id": "ARMR3S6DY5DV7F7K3C5LGNCRLU", - "entrypoint": [ - "gh" - ], - "credentials": [ - { - "plugin": "github", - "credential_type": "personal_access_token", - "usage_id": "personal_access_token", - "vault_id": "3tkwphno4pf4qvczjrkm53lsga", - "item_id": "y2ayrvc4qiyvz6xq4t77tyhmai" - } - ] -} \ No newline at end of file diff --git a/_posts/2026-05-26-benchmarking-the-proxy.md b/_posts/2026-05-26-benchmarking-the-proxy.md index ea626f08..7dc615e0 100644 --- a/_posts/2026-05-26-benchmarking-the-proxy.md +++ b/_posts/2026-05-26-benchmarking-the-proxy.md @@ -9,7 +9,7 @@ categories: benchmarking performance Every good benchmarking story starts with a hunch. 
Mine was that Kroxylicious is cheap to run — I'd stake my career on it, in fact — but it turns out that "trust me, I wrote it" is not a widely accepted unit of measurement. People want proof. Sensibly. -There's a practical question underneath the hunch too. The most common thing opevators ask us is some variation of: "How many cores does the proxy need?" Which is really just "is this thing going to slow down my Kafka?" in a polite engineering hat. We'd been giving the classic answer: "it depends on your workload and traffic patterns, so you'll need to test in your environment." Which is true. And also deeply unsatisfying for everyone involved, including us. +There's a practical question underneath the hunch too. The most common thing operators ask us is some variation of: "How many cores does the proxy need?" Which is really just "is this thing going to slow down my Kafka?" in a polite engineering hat. We'd been giving the classic answer: "it depends on your workload and traffic patterns, so you'll need to test in your environment." Which is true. And also deeply unsatisfying for everyone involved, including us. So we stopped saying "it depends", and got off the fence: we built something you can run **yourselves** on your own infrastructure with your own workload, and measured it. Here are some representative numbers from ours. diff --git a/assets/.DS_Store b/assets/.DS_Store deleted file mode 100644 index 8a06495e59117900a5c044932aed98d9f1b17c39..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 8196 zcmeI1&u-H|5XNWQggB~@asa6ukSz6DAVEb4^Q4C6HY?UjL)$% z3!9+`n|KIcyORiva?J#oz$O8eySFH&oFq-0{QWh?p8nCM1J{1>qMX`1|5rTCv$E>< ze~X>nR{LJZ>v((K=kQ!E!YZof&)wFX&r1D{@&((5L*{G^L0plv3#+`H9gcl7{EDKxr%~DaUz? 
z-_2p-(CSc0W13M0YL!b&Ja&jV-rGXWS(byS4KU}}P8XD3-%vtr)9V{*?YS*z8jEjY zF$XxG|JVR*OyCX(?3y|ERsR3>=J)@1@XdHiCcp%4i-2gIgeODHcJ-|6POOe* qGUZnr)gh?#X%5xtICTAoA=*A%RZL*DHi{k?zX;G6xMBjgO5hU54>?Q# diff --git a/assets/blog/.DS_Store b/assets/blog/.DS_Store deleted file mode 100644 index 9f0b59938d06ec0d24e624595dc83ded90726a35..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 6148 zcmeHK%}T>S5T32IO%b671-%8l`jd(x;w9F4@M=U4Dm5`hgK4%jtvQrJ?)omCeGy+r zXLh$@s9wd&49tGBvop!=x6A$jfM^cG20$GEEL6f$4U2Dt+DWIRWISa=p}x_9^dy$I zK`eu*Xmq|#68Te7s>8pe&>|n^%MU;dxYsz5~ zrYhId4OY#nwd<>+(azqEvu7}A?_yfNp%|pE>Ijo&bn+jZ}g*T^lLKHs~76HDU{j(5Z+zRhTD+(CO&6O`K~m z*QnD$n8$}Oe-`G2BGljU`?d-P;cDcP8DIvA49w|jozDN`pWpw*B(5<7%)q~5K$Kfv ztBFH0XY179=&ZF-Z%|1nF4y>(f`&SZF_w!?Bm0U{S9OHjliX`K>;DlQ?E16P9J04Uh8Ni7^Z*hvG5BIOM4 zz$B1INyh9V?7>`XbGL|thuH3OP~RR%bB-$TxBq267{-zyZ8M**r$31kIxT2w=| z(Dr0cM90uEqz^r$UC0mAhDCrL7b7+ep9E3c2_gqbd^sPY?*5boEutORgs_R}BlV~c zy9mz}SoR^=hXkz_#$t>((xYxBq_0azw=*IozzJwD^CiQs#Em7a>+}{o39tt9Q^Ed@ zup=Kk%d*<%rPVGktd?bkA4@B20=o+jI?U(diR3!Jo9hgdUV3`poL=B3eareTavS*z zn_I?~v2DC=9!0}u-|i2Rh8@4;r)NH1@V-^m;4Wg<&0T*E43hX&L&p$$qDprKhvt+--jT7}7np#A3 zem<@ulZcFPQ@L2!n>{z**++&mCkOWA81W14cNZlEfg7;MkzE(HCqgga^y>{tEnwC%0;vJ&^%eQ zLs35+`xjp>T0N4W`-Br1nq>x$6)4BmR!g z>~4fmy@^Pff!Q~iok_B9!%hYOM0?U}0@MJ&LM1FzaQH@Oo^(M9##2NT>K=WF&qMj( zhtgk&X2*YIfc9<`0vLh^1IX_meH16CyB|u_>*r$qX*@D}{pOFtAWj;McTp&oR#wYa z*{WK1?xUQ#iI+^`wl}_^t4k?EKeIjmD(FwUwe?dOC0-Eq$0{M{^)cl3I*595+Ln{3 z7pq)XPgoVJ(yeXHX05|kqi*js=XHB_)IhV@+&`GlE7r#L-tk5EDI7-fMZGo%{6Lkg z8O-1njohN1-EkDkXn_7KCyQld2ABb6U`ZG-$D37MlJ)R1m;q+s_ZXn_!A2!?45k{@ z(Sd_%0T4^*R)RMDHd2ml&@q^5#2FN!QxSElFfE4A>FBpjoMSN6sMA51=0lh#3)7(p z^>n Date: Fri, 22 May 2026 10:11:03 +1200 Subject: [PATCH 25/25] Remove locally-generated SNAPSHOT files from tracking These files are generated by running Jekyll locally and should not be committed. Add glob patterns to .gitignore to prevent recurrence. 
Assisted-by: Claude Sonnet 4.6 Signed-off-by: Sam Barker --- .gitignore | 5 +- _data/documentation/0_21_0-SNAPSHOT.yaml | 124 ----------------------- _data/documentation/0_22_0-SNAPSHOT.yaml | 115 --------------------- _data/release/0_21_0-SNAPSHOT.yaml | 34 ------- _data/release/0_22_0-SNAPSHOT.yaml | 34 ------- download/0.21.0-SNAPSHOT/index.md | 3 - download/0.22.0-SNAPSHOT/index.md | 3 - 7 files changed, 4 insertions(+), 314 deletions(-) delete mode 100644 _data/documentation/0_21_0-SNAPSHOT.yaml delete mode 100644 _data/documentation/0_22_0-SNAPSHOT.yaml delete mode 100644 _data/release/0_21_0-SNAPSHOT.yaml delete mode 100644 _data/release/0_22_0-SNAPSHOT.yaml delete mode 100644 download/0.21.0-SNAPSHOT/index.md delete mode 100644 download/0.22.0-SNAPSHOT/index.md diff --git a/.gitignore b/.gitignore index c8803978..8eac2024 100644 --- a/.gitignore +++ b/.gitignore @@ -9,4 +9,7 @@ vendor _config-overrides.yml .ruby-version .DS_Store -.op/ \ No newline at end of file +.op/ +_data/documentation/*-SNAPSHOT.yaml +_data/release/*-SNAPSHOT.yaml +download/*-SNAPSHOT/ \ No newline at end of file diff --git a/_data/documentation/0_21_0-SNAPSHOT.yaml b/_data/documentation/0_21_0-SNAPSHOT.yaml deleted file mode 100644 index 5e7edc56..00000000 --- a/_data/documentation/0_21_0-SNAPSHOT.yaml +++ /dev/null @@ -1,124 +0,0 @@ -docs: - - title: Proxy Quick Start - description: Start here if you're experimenting with the proxy for the first time. - tags: - - proxy - rank: '000' - path: html/proxy-quick-start - - title: Proxy Guide - description: "Using the Proxy, including configuration, security and operation." - tags: - - proxy - - security - rank: '010' - path: html/kroxylicious-proxy - - title: Record Encryption Quick Start - description: Start here for an encryption-at-rest solution for Apache Kafka®. 
- tags: - - security - - filter - rank: '011' - path: html/record-encryption-quick-start - - title: Kroxylicious Operator for Kubernetes - description: Using the Kroxylicious Operator to deploy and run the Proxy in a - Kubernetes environment. - tags: - - kubernetes - rank: '020' - path: html/kroxylicious-operator - - title: Record Encryption Guide - description: Using the record encryption filter to provide encryption-at-rest - for Apache Kafka®. - tags: - - security - - filter - rank: '020' - path: html/record-encryption-guide - - title: Kroxylicious Admission Webhook - description: Using the Kroxylicious Admission Webhook to inject proxy sidecars - into application pods in a Kubernetes environment. - tags: - - kubernetes - rank: '021' - path: html/admission-webhook-guide - - title: Record Validation Guide - description: "Using the record validation filter to ensure records follow certain\ - \ rules, including schema and signature validity." - tags: - - governance - - filter - rank: '021' - path: html/record-validation-guide - - title: Multi-tenancy Guide - description: Using the multi-tenancy filter to present a single Kafka® cluster - as if it were multiple clusters. - tags: - - filter - rank: '022' - path: html/multi-tenancy-guide - - title: Oauth Bearer Validation guide - description: "Using the Oauth Bearer validation filter to validate JWT tokens\ - \ received \nfrom Kafka® clients during authentication.\n" - tags: - - filter - - security - rank: '023' - path: html/oauth-bearer-validation - - title: SASL Inspection Guide - description: Using the SASL Inspection filter to infer the client's subject from - its successful authentication exchange with a broker. - tags: - - filter - - security - rank: '023' - path: html/sasl-inspection-guide - - title: Authorization Guide - description: Using the Authorization filter to provide Kafka®-equivalent access - controls within the proxy. 
- tags: - - security - - filter - rank: '024' - path: html/authorization-guide - - title: Entity Isolation Guide - description: Using the entity isolation filter to give authenticated Kafka® clients - a private namespace within a Kafka cluster. - tags: - - filter - rank: '025' - path: html/entity-isolation-guide - - title: Performance and Sizing Guide - description: Sizing Kroxylicious for production deployments — passthrough overhead - and record encryption CPU budgeting. - tags: - - performance - - sizing - - record-encryption - rank: '025' - path: html/performance-guide - - title: Connection Expiration Guide - description: Using the connection expiration filter to avoid connection skew in - Kubernetes environments. - tags: - - kubernetes - - filter - rank: '030' - path: html/connection-expiration-guide - - title: Developer Quick Start - description: Start here if you're developing a filter for the first time. - tags: - - developer - rank: '031' - path: html/developer-quick-start - - title: Kroxylicious Developer Guide - description: Writing plugins for the proxy in the Java programming language. - tags: - - developer - rank: '032' - path: html/developer-guide - - title: Kroxylicious Javadocs - description: The Java API documentation for plugin developers. - tags: - - developer - path: javadoc/index.html - rank: '033' diff --git a/_data/documentation/0_22_0-SNAPSHOT.yaml b/_data/documentation/0_22_0-SNAPSHOT.yaml deleted file mode 100644 index 866f62ed..00000000 --- a/_data/documentation/0_22_0-SNAPSHOT.yaml +++ /dev/null @@ -1,115 +0,0 @@ -docs: - - title: Proxy Quick Start - description: Start here if you're experimenting with the proxy for the first time. - tags: - - proxy - rank: '000' - path: html/proxy-quick-start - - title: Proxy Guide - description: "Using the Proxy, including configuration, security and operation." 
- tags: - - proxy - - security - rank: '010' - path: html/kroxylicious-proxy - - title: Record Encryption Quick Start - description: Start here for an encryption-at-rest solution for Apache Kafka®. - tags: - - security - - filter - rank: '011' - path: html/record-encryption-quick-start - - title: Kroxylicious Operator for Kubernetes - description: Using the Kroxylicious Operator to deploy and run the Proxy in a - Kubernetes environment. - tags: - - kubernetes - rank: '020' - path: html/kroxylicious-operator - - title: Record Encryption Guide - description: Using the record encryption filter to provide encryption-at-rest - for Apache Kafka®. - tags: - - security - - filter - rank: '020' - path: html/record-encryption-guide - - title: Kroxylicious Admission Webhook - description: Using the Kroxylicious Admission Webhook to inject proxy sidecars - into application pods in a Kubernetes environment. - tags: - - kubernetes - rank: '021' - path: html/admission-webhook-guide - - title: Record Validation Guide - description: "Using the record validation filter to ensure records follow certain\ - \ rules, including schema and signature validity." - tags: - - governance - - filter - rank: '021' - path: html/record-validation-guide - - title: Multi-tenancy Guide - description: Using the multi-tenancy filter to present a single Kafka® cluster - as if it were multiple clusters. - tags: - - filter - rank: '022' - path: html/multi-tenancy-guide - - title: Oauth Bearer Validation guide - description: "Using the Oauth Bearer validation filter to validate JWT tokens\ - \ received \nfrom Kafka® clients during authentication.\n" - tags: - - filter - - security - rank: '023' - path: html/oauth-bearer-validation - - title: SASL Inspection Guide - description: Using the SASL Inspection filter to infer the client's subject from - its successful authentication exchange with a broker. 
- tags: - - filter - - security - rank: '023' - path: html/sasl-inspection-guide - - title: Authorization Guide - description: Using the Authorization filter to provide Kafka®-equivalent access - controls within the proxy. - tags: - - security - - filter - rank: '024' - path: html/authorization-guide - - title: Entity Isolation Guide - description: Using the entity isolation filter to give authenticated Kafka® clients - a private namespace within a Kafka cluster. - tags: - - filter - rank: '025' - path: html/entity-isolation-guide - - title: Connection Expiration Guide - description: Using the connection expiration filter to avoid connection skew in - Kubernetes environments. - tags: - - kubernetes - - filter - rank: '030' - path: html/connection-expiration-guide - - title: Developer Quick Start - description: Start here if you're developing a filter for the first time. - tags: - - developer - rank: '031' - path: html/developer-quick-start - - title: Kroxylicious Developer Guide - description: Writing plugins for the proxy in the Java programming language. - tags: - - developer - rank: '032' - path: html/developer-guide - - title: Kroxylicious Javadocs - description: The Java API documentation for plugin developers. - tags: - - developer - path: javadoc/index.html - rank: '033' diff --git a/_data/release/0_21_0-SNAPSHOT.yaml b/_data/release/0_21_0-SNAPSHOT.yaml deleted file mode 100644 index 0b80c9f2..00000000 --- a/_data/release/0_21_0-SNAPSHOT.yaml +++ /dev/null @@ -1,34 +0,0 @@ -# -# Copyright Kroxylicious Authors. -# -# Licensed under the Apache Software License version 2.0, available at http://www.apache.org/licenses/LICENSE-2.0 -# - -releaseNotesUrl: https://github.com/kroxylicious/kroxylicious/releases/tag/v$(VERSION)/ -assetBaseUrl: https://github.com/kroxylicious/kroxylicious/releases/download/v$(VERSION)/ -assets: - - name: Proxy - description: The proxy application. 
- downloads: - - format: zip - path: kroxylicious-app-$(VERSION)-bin.zip - - format: tar.gz - path: kroxylicious-app-$(VERSION)-bin.tar.gz - - name: Operator - description: The Kubernetes operator. - downloads: - - format: zip - path: kroxylicious-operator-$(VERSION).zip - - format: tar.gz - path: kroxylicious-operator-$(VERSION).tar.gz -images: - - name: Proxy - url: https://quay.io/repository/kroxylicious/proxy?tab=tags - registry: quay.io/kroxylicious/proxy - tag: $(VERSION) - digest: sha256:REPLACE_WITH_SHA_AFTER_IMAGE_RELEASE - - name: Operator - url: https://quay.io/repository/kroxylicious/operator?tab=tags - registry: quay.io/kroxylicious/operator - tag: $(VERSION) - digest: sha256:REPLACE_WITH_SHA_AFTER_IMAGE_RELEASE \ No newline at end of file diff --git a/_data/release/0_22_0-SNAPSHOT.yaml b/_data/release/0_22_0-SNAPSHOT.yaml deleted file mode 100644 index 0b80c9f2..00000000 --- a/_data/release/0_22_0-SNAPSHOT.yaml +++ /dev/null @@ -1,34 +0,0 @@ -# -# Copyright Kroxylicious Authors. -# -# Licensed under the Apache Software License version 2.0, available at http://www.apache.org/licenses/LICENSE-2.0 -# - -releaseNotesUrl: https://github.com/kroxylicious/kroxylicious/releases/tag/v$(VERSION)/ -assetBaseUrl: https://github.com/kroxylicious/kroxylicious/releases/download/v$(VERSION)/ -assets: - - name: Proxy - description: The proxy application. - downloads: - - format: zip - path: kroxylicious-app-$(VERSION)-bin.zip - - format: tar.gz - path: kroxylicious-app-$(VERSION)-bin.tar.gz - - name: Operator - description: The Kubernetes operator. 
- downloads: - - format: zip - path: kroxylicious-operator-$(VERSION).zip - - format: tar.gz - path: kroxylicious-operator-$(VERSION).tar.gz -images: - - name: Proxy - url: https://quay.io/repository/kroxylicious/proxy?tab=tags - registry: quay.io/kroxylicious/proxy - tag: $(VERSION) - digest: sha256:REPLACE_WITH_SHA_AFTER_IMAGE_RELEASE - - name: Operator - url: https://quay.io/repository/kroxylicious/operator?tab=tags - registry: quay.io/kroxylicious/operator - tag: $(VERSION) - digest: sha256:REPLACE_WITH_SHA_AFTER_IMAGE_RELEASE \ No newline at end of file diff --git a/download/0.21.0-SNAPSHOT/index.md b/download/0.21.0-SNAPSHOT/index.md deleted file mode 100644 index 9b9d975c..00000000 --- a/download/0.21.0-SNAPSHOT/index.md +++ /dev/null @@ -1,3 +0,0 @@ ---- -layout: download-release ---- diff --git a/download/0.22.0-SNAPSHOT/index.md b/download/0.22.0-SNAPSHOT/index.md deleted file mode 100644 index 9b9d975c..00000000 --- a/download/0.22.0-SNAPSHOT/index.md +++ /dev/null @@ -1,3 +0,0 @@ ---- -layout: download-release ----