Commit a24de1f

Expand cost analysis with per-query breakdown, scaling projections, and breakeven calculations
Add detailed cost comparison showing per-query costs at benchmark scale, monthly projections at 1K and 100K queries/day, breakeven utilization analysis per workload, and honest assessment of when owned H100 vs OpenAI API is cost-effective. Key finding: gpt-4.1-mini is priced aggressively enough that H100 only wins with concurrent batched serving (except embeddings, where H100 wins even sequentially).
1 parent aa465a8 commit a24de1f

1 file changed

Lines changed: 125 additions & 8 deletions

scripts/staging/llm-bench/README.md

@@ -891,14 +891,131 @@ of whether the GPU is actively processing queries.

- amortization = $2.00/hr × 0.02655hr = **$0.0531**
- total = **$0.0559** (matches metrics.json)

**Why local GPU appears more expensive here:** The H100 amortizes at
$2.00/hr regardless of utilization. This benchmark runs only 250 sequential
queries totaling ~3 minutes of inference — the GPU is idle most of the time.
OpenAI's per-token pricing only charges for actual usage, which wins at low
volume. At higher utilization (concurrent requests, continuous serving), the
H100's per-query cost drops significantly: at full throughput (~21 req/s on
embeddings), the amortized cost is ~$0.00003/query vs OpenAI's
~$0.0004/query — making owned hardware ~13x cheaper at scale.
#### Per-query cost breakdown

The two pricing models work fundamentally differently:

- **OpenAI**: Pay per token. Cost scales linearly with number of queries.
  No cost when idle.
- **H100**: Pay per hour (amortization + electricity). Cost is fixed
  regardless of how many queries you run. More queries = lower per-query cost.
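
The difference can be sketched as two cost functions (a minimal sketch; the constants are the rates used throughout this section, and the function names are illustrative):

```python
# Minimal sketch of the two pricing models. Rates are the ones used in this
# section: gpt-4.1-mini token prices and the $2.105/hr all-in H100 rate.

OPENAI_INPUT_PER_M = 0.40   # $ per 1M input tokens
OPENAI_OUTPUT_PER_M = 1.60  # $ per 1M output tokens
H100_HOURLY = 2.105         # $ per hour (amortization + electricity)

def openai_cost(input_tokens: int, output_tokens: int) -> float:
    """Per-token pricing: linear in usage, zero when idle."""
    return (input_tokens * OPENAI_INPUT_PER_M
            + output_tokens * OPENAI_OUTPUT_PER_M) / 1e6

def h100_cost(wall_clock_seconds: float) -> float:
    """Hourly pricing: fixed by elapsed time, independent of query count."""
    return wall_clock_seconds / 3600 * H100_HOURLY
```

Dividing `h100_cost` over the queries served in that window yields the per-query figures below.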

Per-query costs from the benchmark (n=50, sequential):

| Workload | OpenAI per query | H100 per query | H100 / OpenAI |
|----------|-----------------|----------------|---------------|
| math | $0.000446 | $0.001119 | 2.5× more expensive |
| reasoning | $0.000200 | $0.000649 | 3.2× more expensive |
| summarization | $0.000151 | $0.000213 | 1.4× more expensive |
| json_extraction | $0.000122 | $0.000156 | 1.3× more expensive |
| embeddings | $0.000038 | $0.000027 | **0.7× (H100 cheaper)** |

At the benchmark scale (50 sequential queries), the H100 is more expensive
for all generation-heavy workloads. The GPU amortizes at $2.00/hr regardless
of utilization — even though most of the wall-clock time is spent on actual
inference, sequential serving uses only a fraction of the GPU's capacity,
so the fixed hourly cost dominates the per-query price.

Embeddings is the exception: queries are so fast (~47 ms each) that the
GPU processes all 50 in 2.3 seconds, making the amortized cost per query
very small.
#### Why OpenAI appears cheaper: the utilization gap

The benchmark runs 50 queries sequentially (c=1). The GPU processes one
query at a time, leaving most of its capacity unused. Meanwhile, OpenAI
only charges for the tokens actually consumed — idle time costs nothing.

**Example calculation (math workload):**

- 50 queries take 95.6 seconds sequentially → throughput = 0.52 req/s
- H100 hourly cost = $2.105/hr → cost for 95.6s = $0.0560
- Per query = $0.0560 / 50 = **$0.001119**
- OpenAI charges $0.40/M input + $1.60/M output → 50 queries = **$0.0223**
- Per query = $0.0223 / 50 = **$0.000446**
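
The bullet arithmetic above can be reproduced in a few lines (inputs copied from this benchmark run; last-digit differences vs. the text come from rounding the reported wall-clock time):

```python
# Reproduce the math-workload example. Inputs are the benchmark's reported
# numbers; tiny last-digit differences vs. the text come from rounding.

n_queries = 50
wall_clock_s = 95.6        # sequential wall-clock for 50 math queries
h100_hourly = 2.105        # $/hr, amortization + electricity
openai_total = 0.0223      # $ billed by OpenAI for the same 50 queries

throughput = n_queries / wall_clock_s                           # ~0.52 req/s
h100_per_query = wall_clock_s / 3600 * h100_hourly / n_queries  # ~$0.00112
openai_per_query = openai_total / n_queries                     # $0.000446

print(f"{throughput:.2f} req/s, H100 ${h100_per_query:.6f}/query, "
      f"ratio {h100_per_query / openai_per_query:.1f}x")
```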

The H100 is 2.5× more expensive per query because it costs $2.105/hr
whether it runs 1 query or 1000 queries in that hour. At 0.52 req/s, the
GPU serves only 1,879 queries/hr — far below its potential with batching.
#### Scaling projections

As query volume increases, the H100's fixed hourly cost gets amortized
across more queries. OpenAI's cost stays linear. Here are projections at
different daily volumes:
**At 1,000 queries/day (our benchmark is ~50/day):**

| Workload | OpenAI/month | H100/month | Winner |
|----------|-------------|-----------|--------|
| math | $13.39 | $33.60 | OpenAI (2.5×) |
| reasoning | $5.99 | $19.47 | OpenAI (3.2×) |
| summarization | $4.52 | $6.39 | OpenAI (1.4×) |
| json_extraction | $3.65 | $4.67 | OpenAI (1.3×) |
| embeddings | $1.14 | $0.82 | **H100** (0.7×) |

At low volume, OpenAI wins on most workloads because per-token pricing
avoids paying for idle GPU time.
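
A one-line helper reproduces these monthly figures (a sketch assuming a 30-day month and, for the H100 column, that the GPU is billed only for time actually spent serving):

```python
# Monthly projection sketch: per-query costs (from the tables in this
# section) scaled by volume. Assumes a 30-day month and, for the H100,
# that the GPU is billed only for hours spent serving.

def monthly_cost(per_query: float, queries_per_day: int, days: int = 30) -> float:
    return per_query * queries_per_day * days

# math workload at 1,000 queries/day
print(round(monthly_cost(0.000446, 1_000), 2))  # OpenAI: ~$13.4/month
print(round(monthly_cost(0.001119, 1_000), 2))  # H100:   ~$33.6/month
```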
**At 100,000 queries/day (production scale):**

| Workload | OpenAI/month | H100/month | GPU hr/day needed | Winner |
|----------|-------------|-----------|-------------------|--------|
| math | $1,339 | $3,360 | 53.2 hr (needs 3 GPUs) | OpenAI (2.5×) |
| reasoning | $599 | $1,947 | 30.8 hr (needs 2 GPUs) | OpenAI (3.2×) |
| summarization | $452 | $639 | 10.1 hr | OpenAI (1.4×) |
| json_extraction | $365 | $467 | 7.4 hr | OpenAI (1.3×) |
| embeddings | $114 | $82 | 1.3 hr | **H100** (0.7×) |

Even at 100K queries/day with sequential processing, OpenAI's gpt-4.1-mini
remains cheaper for generation-heavy workloads. This is because gpt-4.1-mini
is aggressively priced ($0.40/M input, $1.60/M output) — far below what the
equivalent compute would cost on owned hardware.
**The key factor: concurrent batching.** The projections above assume
sequential processing (c=1, one query at a time). In production, vLLM
uses **continuous batching** — processing many requests simultaneously,
sharing GPU memory via PagedAttention. A 3B model on H100 can typically
handle 10-50× higher throughput with batching than sequential processing.
With batched throughput, the GPU hours/day drop proportionally, shifting
the cost balance toward owned hardware.
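
The effect of batching on per-query cost is simple arithmetic (a hypothetical illustration: only the 0.52 req/s sequential rate, the $2.105/hr cost, and the 10-50× range come from the benchmark):

```python
# Hypothetical illustration of batching leverage. Only the 0.52 req/s
# sequential rate, the $2.105/hr fixed cost, and the 10-50x speedup range
# come from the benchmark; the rest is arithmetic.

h100_hourly = 2.105
sequential_rps = 0.52        # math workload at c=1
openai_per_query = 0.000446  # gpt-4.1-mini, math workload

for speedup in (1, 10, 50):
    per_query = h100_hourly / (sequential_rps * speedup * 3600)
    cheaper = "H100" if per_query < openai_per_query else "OpenAI"
    print(f"{speedup:>2}x batching: ${per_query:.6f}/query -> {cheaper} wins")
```

Already at a 10× batching speedup, the H100's per-query cost drops below gpt-4.1-mini's for this workload.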
#### When does owned hardware win?

The breakeven point depends on the workload's output length:

| Workload | Sequential throughput | Breakeven utilization | Achievable? |
|----------|----------------------|----------------------|-------------|
| math | 0.52 req/s (1,879/hr) | 4,718 req/hr (251% of max) | Only with batching |
| reasoning | 0.90 req/s (3,244/hr) | 10,542 req/hr (325% of max) | Only with batching |
| summarization | 2.75 req/s (9,882/hr) | 13,961 req/hr (141% of max) | Only with batching |
| json_extraction | 3.76 req/s (13,518/hr) | 17,316 req/hr (128% of max) | Only with batching |
| embeddings | 21.30 req/s (76,691/hr) | 55,570 req/hr (72% of max) | **Yes, sequential** |

Breakeven = queries/hr where H100 monthly cost equals OpenAI monthly cost.

- **Embeddings**: H100 already wins at 72% sequential utilization — no
  batching needed. At full sequential throughput (21 req/s), the H100
  costs $0.000027/query vs OpenAI's $0.000038/query.
- **Generation workloads**: Breakeven requires 128-325% of sequential
  throughput — impossible without concurrent batching. With vLLM's
  continuous batching, these throughputs are achievable for a 3B model
  on H100.
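
The breakeven column can be recomputed from the per-query prices (a sketch; prices and max rates are the rounded values from the tables above, so last digits may differ slightly):

```python
# Breakeven sketch: the request rate at which the H100's $2.105/hr equals
# OpenAI's bill for the same queries. Per-query prices and sequential max
# rates are the rounded values from the tables in this section.

h100_hourly = 2.105
workloads = {  # name: (OpenAI $/query, sequential max req/hr)
    "math": (0.000446, 1_879),
    "embeddings": (0.000038, 76_691),
}

for name, (openai_per_query, max_rph) in workloads.items():
    breakeven_rph = h100_hourly / openai_per_query
    print(f"{name}: breakeven {breakeven_rph:,.0f} req/hr "
          f"= {breakeven_rph / max_rph:.0%} of sequential max")
```

A breakeven above 100% of sequential max means the H100 can only win with concurrent batching; below 100% means it wins even at c=1.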
#### Summary

| Scale | Winner | Why |
|-------|--------|-----|
| Low volume (<1K queries/day) | OpenAI | Per-token pricing avoids idle GPU cost |
| Medium volume (1K-50K/day, sequential) | OpenAI | gpt-4.1-mini pricing is below H100 sequential cost for generation workloads |
| High volume (batched serving) | H100 | Continuous batching multiplies throughput 10-50×, amortizing fixed cost across many more queries |
| Embeddings (any volume) | H100 | Fast enough that GPU utilization is high even sequentially |
| Data privacy / no rate limits | H100 | Owned hardware means no data sharing, no API rate limits, no vendor lock-in |

**Bottom line:** OpenAI's gpt-4.1-mini is priced aggressively enough that
owned H100 hardware only becomes cost-competitive with concurrent batched
serving. For sequential inference (as in this benchmark), OpenAI is
1.3-3.2× cheaper per query on generation workloads. The non-cost advantages
of owned hardware (privacy, no rate limits, custom models, no vendor
dependency) may justify the premium regardless.

### ROUGE Scores (Summarization)