Benchmarks v2 #60

Open: podocarp wants to merge 5 commits into main from jxd/benchmarks

Conversation

@podocarp (Contributor) commented Feb 2, 2026

Summary:

  • All of our integration/e2e tests can now be run as benchmarks.
  • There is a new script, ./run-all-benchmarks.sh, that behaves like ./run-all-e2e-tests.sh.
  • You can either call the script with the -f arg to specify which instrumentation to run, or go into the e2e folder under the instrumentation and run BENCHMARKS=1 ./run.sh.
  • This PR hijacks the make_request function to check for the env var; if it's set, make_request forks out to benchmarking code (among other things). A sketch of this fork follows the list.
  • Tests are unbearably long. The default config is 10s duration and 3s warmup per endpoint. I've found the warmup to be essential for starting up system caches and the like; without it we get nonsensical results, like the SDK being faster enabled than disabled.
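
A minimal sketch of that fork, assuming a callable that performs one request. BENCHMARKS and BENCHMARK_DURATION are the env vars this PR reads; the warmup knob and helper shape here are hypothetical:

```python
import os
import time

BENCHMARKS = os.environ.get("BENCHMARKS") == "1"
DURATION_S = float(os.environ.get("BENCHMARK_DURATION", "10"))  # 10s default
WARMUP_S = 3.0  # 3s default warmup; the real knob's name may differ

def make_request(send):
    """send() performs one request. In benchmark mode, time it in a loop."""
    if not BENCHMARKS:
        return send()
    # Warmup: prime connection pools, caches, JITs, etc. Without this,
    # early measurements are noise (see the note on warmup above).
    deadline = time.monotonic() + WARMUP_S
    while time.monotonic() < deadline:
        send()
    # Timed loop: count completed requests over the configured duration.
    iterations = 0
    start = time.monotonic()
    while time.monotonic() - start < DURATION_S:
        send()
        iterations += 1
    elapsed = time.monotonic() - start
    print(f"{iterations / elapsed:.2f}/s over {iterations} iterations")
```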

The tradeoff vs. structured benchmarks is that there's no standardized test suite, so the results can't be interpreted easily without context on the instrumentation and how each test is written. But it should make this very extensible in the future -- just add more tests and the benchmarks will automatically follow.

Note: (~) indicates low iteration counts (unreliable). A negative diff means slower with the SDK enabled.


aiohttp

| Endpoint | Baseline | With SDK | Diff |
| --- | --- | --- | --- |
| GET /health | 2548.47/s | 753.89/s | -70.4% |
| GET /api/get-json | 13.05/s | 12.16/s | -6.8% |
| GET /api/get-with-params | 13.02/s | 12.28/s | -5.7% |
| GET /api/get-with-headers | 13.73/s | 11.34/s | -17.4% |
| GET /api/chain | 7.50/s | 6.92/s | -7.7% |
| GET /api/parallel | 9.20/s | 8.71/s | -5.3% |
| GET /test/streaming | 12.47/s | 11.57/s | -7.2% |
| GET /test/timeout | 13.40/s | 12.08/s | -9.9% |
| PUT /api/put-json | 2.95/s | 2.99/s | +1.4% |

django

| Endpoint | Baseline | With SDK | Diff |
| --- | --- | --- | --- |
| GET /health | 3302.22/s | 812.16/s | -75.4% |
| GET /api/csrf-form | 3134.84/s | 676.84/s | -78.4% |
| GET /api/post/1 | 10.44/s | 9.97/s | -4.5% |
| GET /api/user/test123 | 2.54/s | 2.51/s | -1.2% (~) |
| GET /api/weather | 1.02/s | 0.96/s | -5.9% (~) |
| POST /api/post | 2.63/s | 2.56/s | -2.7% (~) |
| DELETE /api/post/1/delete | 2.93/s | 2.71/s | -7.5% (~) |

fastapi

| Endpoint | Baseline | With SDK | Diff |
| --- | --- | --- | --- |
| GET /health | 3654.35/s | 807.29/s | -77.9% |
| GET /api/post/1 | 7.74/s | 10.61/s | +37.1% |
| GET /api/activity | 2.47/s | 2.35/s | -4.9% (~) |
| GET /api/user/test123 | 2.56/s | 2.03/s | -20.7% (~) |
| GET /api/weather | 0.99/s | 0.90/s | -9.1% (~) |
| POST /api/post | 2.89/s | 2.59/s | -10.4% (~) |
| DELETE /api/post/1 | 2.63/s | 2.87/s | +9.1% (~) |

flask

| Endpoint | Baseline | With SDK | Diff |
| --- | --- | --- | --- |
| GET /health | 2551.63/s | 693.08/s | -72.8% |
| GET /api/post/1 | 10.79/s | 10.00/s | -7.3% |
| GET /api/user/test123 | 2.46/s | 2.42/s | -1.6% (~) |
| GET /api/weather-activity | 0.57/s | 0.59/s | +3.5% (~) |
| POST /api/post | 2.37/s | 2.42/s | +2.1% (~) |
| POST /api/user | 2.25/s | 2.37/s | +5.3% (~) |
| DELETE /api/post/1 | 2.92/s | 2.61/s | -10.6% (~) |

grpc

| Endpoint | Baseline | With SDK | Diff |
| --- | --- | --- | --- |
| GET /health | 2553.36/s | 733.84/s | -71.3% |
| GET /api/greet?name=TestUser | 1871.13/s | 444.69/s | -76.2% |
| GET /api/greet?name=AnotherUser | 1875.43/s | 444.96/s | -76.3% |
| POST /api/greet-with-info | 1804.09/s | 402.96/s | -77.7% |
| GET /api/greet-chain | 1410.86/s | 206.30/s | -85.4% |
| GET /api/greet-with-call | 1889.44/s | 442.44/s | -76.6% |
| GET /test/future-call | 1632.54/s | 416.79/s | -74.5% |
| GET /test/stream-unary | 1287.08/s | 283.99/s | -77.9% |
| GET /test/stream-stream | 5.97/s | 5.55/s | -7.0% |

httpx

| Endpoint | Baseline | With SDK | Diff |
| --- | --- | --- | --- |
| GET /health | 2590.27/s | 746.12/s | -71.2% |
| GET /api/sync/get-json | 10.73/s | 7.95/s | -25.9% |
| GET /api/sync/get-with-params | 9.45/s | 8.37/s | -11.4% |
| GET /api/sync/get-with-headers | 8.89/s | 8.05/s | -9.4% |
| GET /api/sync/chain | 4.38/s | 5.36/s | +22.4% |
| GET /api/async/get-json | 8.68/s | 7.89/s | -9.1% |
| GET /api/async/get-with-params | 8.45/s | 7.82/s | -7.5% |
| GET /api/async/chain | 6.62/s | 4.52/s | -31.7% |
| GET /api/async/parallel | 7.19/s | 7.78/s | +8.2% |
| GET /test/async-send | 8.97/s | 9.56/s | +6.6% |
| GET /test/async-stream | 8.69/s | 9.96/s | +14.6% |
| GET /test/streaming | 9.14/s | 8.31/s | -9.1% |
| GET /test/toplevel-stream | 9.20/s | 10.74/s | +16.7% |

psycopg (psycopg3)

| Endpoint | Baseline | With SDK | Diff |
| --- | --- | --- | --- |
| GET /health | 2576.48/s | 739.02/s | -71.3% |
| GET /db/query | 733.02/s | 230.57/s | -68.5% |
| GET /test/cursor-stream | 690.21/s | 282.39/s | -59.1% |
| POST /db/transaction | 657.52/s | 270.67/s | -58.8% |

psycopg2

| Endpoint | Baseline | With SDK | Diff |
| --- | --- | --- | --- |
| GET /health | 2551.87/s | 725.39/s | -71.6% |
| GET /db/register-jsonb | 914.30/s | 532.20/s | -41.8% |
| GET /db/query | 678.14/s | 242.88/s | -64.2% |
| POST /db/insert | 650.75/s | 467.80/s | -28.1% |

redis

| Endpoint | Baseline | With SDK | Diff |
| --- | --- | --- | --- |
| GET /health | 2569.44/s | 748.55/s | -70.9% |
| POST /redis/set | 2220.36/s | 537.95/s | -75.8% |
| GET /redis/get/test_key | 2350.11/s | 579.62/s | -75.3% |
| GET /redis/get/test_key_expiry | 2351.98/s | 584.51/s | -75.1% |
| POST /redis/incr/counter | 2354.10/s | 586.23/s | -75.1% |
| GET /redis/keys/* | 2364.17/s | 554.42/s | -76.5% |
| DELETE /redis/delete/test_key | 2346.24/s | 584.40/s | -75.1% |
| GET /test/mget-mset | 2137.24/s | 327.05/s | -84.7% |
| GET /test/pipeline-basic | 2147.54/s | 388.45/s | -81.9% |
| GET /test/transaction-watch | 2004.84/s | 262.95/s | -86.9% |
| GET /test/async-pipeline | 1040.48/s | 252.89/s | -75.7% |
| GET /test/binary-data | 1023.66/s | 323.36/s | -68.4% |

requests

| Endpoint | Baseline | With SDK | Diff |
| --- | --- | --- | --- |
| GET /health | 2525.20/s | 739.19/s | -70.7% |
| GET /api/get-json | 9.45/s | 4.34/s | -54.1% |
| GET /api/get-with-params | 9.96/s | 4.35/s | -56.3% |
| GET /api/chain | 2.91/s | 1.31/s | -55.0% (~) |
| GET /api/parallel | 7.63/s | 3.93/s | -48.5% |
| GET /api/with-timeout | 8.57/s | 5.00/s | -41.7% |
| GET /test/session-send-direct | 7.68/s | 5.74/s | -25.3% |

urllib

| Endpoint | Baseline | With SDK | Diff |
| --- | --- | --- | --- |
| GET /health | 2559.23/s | 742.94/s | -71.0% |
| GET /api/get-json | 8.55/s | 6.14/s | -28.2% |
| GET /api/get-with-params | 7.39/s | 4.59/s | -37.9% |
| GET /api/get-with-request-object | 6.57/s | 6.56/s | -0.2% |
| GET /api/custom-opener | 7.22/s | 4.99/s | -30.9% |
| GET /api/with-timeout | 8.30/s | 5.97/s | -28.1% |
| GET /api/parallel | 6.24/s | 5.84/s | -6.4% |
| GET /test/getheader | 7.57/s | 3.96/s | -47.7% |
| GET /test/getcode | 7.04/s | 3.79/s | -46.2% |
| GET /test/head-request | 7.92/s | 4.77/s | -39.8% |
| GET /test/no-context-manager | 6.53/s | 4.19/s | -35.8% |
| GET /test/ssl-context | 5.90/s | 4.17/s | -29.3% |

urllib3

| Endpoint | Baseline | With SDK | Diff |
| --- | --- | --- | --- |
| GET /health | 2551.05/s | 759.58/s | -70.2% |
| GET /api/poolmanager/get-json | 17.03/s | 16.75/s | -1.6% |
| GET /api/poolmanager/get-with-params | 18.87/s | 15.93/s | -15.6% |
| GET /api/poolmanager/get-with-headers | 22.45/s | 17.08/s | -23.9% |
| GET /api/poolmanager/chain | 7.19/s | 5.88/s | -18.2% |
| GET /api/connectionpool/get-json | 4.79/s | 4.66/s | -2.7% |
| GET /test/timeout | 18.07/s | 17.38/s | -3.8% |
| GET /test/retries | 18.55/s | 18.11/s | -2.4% |
| GET /test/new-poolmanager | 5.69/s | 4.75/s | -16.5% |
| GET /test/multiple-requests | 8.93/s | 6.95/s | -22.2% |
| GET /test/requests-lib | 5.92/s | 5.52/s | -6.8% |
| POST /api/poolmanager/post-form | 2.90/s | 3.13/s | +7.9% |
| DELETE /api/poolmanager/delete | 2.84/s | 3.19/s | +12.3% (~) |

@cubic-dev-ai (bot) left a comment

2 issues found across 8 files



<file name="benchmarks2/benchmark.py">

<violation number="1" location="benchmarks2/benchmark.py:41">
P2: Baseline parsing reads the "ops/s" label instead of the numeric ops/s value, so `parse_results` will fail on valid benchmark output. Use `parts[4]` for the numeric ops/s token.</violation>

<violation number="2" location="benchmarks2/benchmark.py:113">
P3: The error return code from `main()` is ignored, so the script exits successfully even when the server is unreachable. Propagate the exit status from `main()`.</violation>
</file>
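
For reference, a hedged sketch of what the two fixes could look like; the token layout and function shapes are inferred from the comments above, not from the actual file:

```python
import sys

def parse_results(line: str) -> float:
    # Violation 1: the numeric ops/s value is the token at index 4;
    # reading the adjacent "ops/s" label instead makes float() blow up.
    # (Token positions taken from the review comment, not verified.)
    parts = line.split()
    return float(parts[4])

def main() -> int:
    ...  # run the benchmarks; return non-zero if the server is unreachable
    return 0

if __name__ == "__main__":
    # Violation 2: propagate main()'s status instead of discarding it,
    # so a failed run exits non-zero.
    sys.exit(main())
```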


@podocarp added the Tusk - Generate Tests label on Feb 3, 2026
@tusk-dev (bot) commented Feb 3, 2026

@podocarp does not have an active Tusk seat. Activate it before triggering test generation.

@tusk-dev (bot) removed the Tusk - Generate Tests label on Feb 3, 2026
@cubic-dev-ai (bot) left a comment

1 issue found across 40 files (changes from recent commits).



<file name="drift/instrumentation/e2e_common/test_utils.py">

<violation number="1" location="drift/instrumentation/e2e_common/test_utils.py:89">
P2: Guard against zero iterations before computing per-op stats; BENCHMARK_DURATION=0 (or a too-short duration) leaves iterations at 0 and causes a ZeroDivisionError here.</violation>
</file>
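
A minimal sketch of the suggested guard, with hypothetical names:

```python
def per_op_stats(elapsed: float, iterations: int) -> dict:
    # Guard against a timed loop that never completed an iteration,
    # e.g. BENCHMARK_DURATION=0 or a duration shorter than one request.
    if iterations == 0:
        return {"error": "no iterations completed within the benchmark duration"}
    return {
        "ops_per_sec": iterations / elapsed,
        "mean_latency_s": elapsed / iterations,
    }
```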


@sohil-kshirsagar (Contributor) left a comment

nice, ty for this. can we update docs in this repo to reflect how to use this? maybe a new doc on benchmarks specifically is best, and if there are any docs on how to write e2e tests, make sure to indicate how they must be written to support benchmarks.

looks like some lint/type failures as well

@podocarp (Contributor, Author) commented Feb 6, 2026

Added BENCHMARKS.md. Writing benchmarks is really easy: since we modify the shared helper functions to do the benchmarking, any endpoints/test suites you add in the same format as the rest are auto-discovered as new benchmark cases (illustrated below).
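
A purely hypothetical illustration of that auto-discovery; the real suites use whatever case format the existing e2e tests for each instrumentation already define:

```python
# If a suite lists its cases in a shared structure that the helpers
# iterate over, a new entry in the same shape is exercised by both the
# e2e run and the benchmark run with no extra wiring.
ENDPOINTS = [
    ("GET", "/api/get-json"),
    ("POST", "/api/post"),
    ("GET", "/api/new-endpoint"),  # adding a case here is all it takes
]
```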

@cubic-dev-ai (bot) left a comment

1 issue found across 1 file (changes from recent commits).



<file name="BENCHMARKS.md">

<violation number="1" location="BENCHMARKS.md:12">
P2: The benchmarking guide points to `./run-all-e2e-tests.sh`, which only runs E2E/stack tests and doesn't enable benchmark mode. This will mislead users who follow the benchmarking instructions. Use the benchmark runner script instead.</violation>
</file>



Diff context (BENCHMARKS.md, line 12):

```
The simplest way to get started it simply
./run-all-e2e-tests.sh
```

@cubic-dev-ai (bot) commented Feb 6, 2026

P2: The benchmarking guide points to ./run-all-e2e-tests.sh, which only runs E2E/stack tests and doesn't enable benchmark mode. This will mislead users who follow the benchmarking instructions. Use the benchmark runner script instead.


@jy-tan (Contributor) commented Feb 6, 2026

@podocarp what do you think about running and publishing benchmarks only from stack tests? it doesn't hurt for them to be in individual instrumentations should we need to optimize something, but stack tests would be more meaningful and slightly more representative of actual apps than testing say just redis in isolation.

stack tests then fulfill 3 purposes:

  1. ensure correctness in instrumentation interactions
  2. realistic benchmarks
  3. (future) serve as stack-based demo apps, effectively replacing the current python demo repo

3 participants