of whether the GPU is actively processing queries.
- amortization = $2.00/hr × 0.02655 hr = **$0.0531**
- total = **$0.0559** (matches metrics.json)

#### Per-query cost breakdown

The two pricing models work fundamentally differently:
- **OpenAI**: Pay per token. Cost scales linearly with the number of
  queries. No cost when idle.
- **H100**: Pay per hour (amortization + electricity). Cost is fixed
  regardless of how many queries you run. More queries = lower per-query cost.

Per-query costs from the benchmark (n=50, sequential):

| Workload | OpenAI per query | H100 per query | H100 / OpenAI |
|----------|------------------|----------------|---------------|
| math | $0.000446 | $0.001119 | 2.5× more expensive |
| reasoning | $0.000200 | $0.000649 | 3.2× more expensive |
| summarization | $0.000151 | $0.000213 | 1.4× more expensive |
| json_extraction | $0.000122 | $0.000156 | 1.3× more expensive |
| embeddings | $0.000038 | $0.000027 | **0.7× (H100 cheaper)** |

At the benchmark scale (50 sequential queries), the H100 is more expensive
for all generation-heavy workloads. The GPU amortizes at $2.00/hr
regardless of utilization, and with only 50 queries processed one at a
time, each query carries a large share of that fixed hourly cost.

Embeddings is the exception: queries are so fast (~47 ms each) that the
GPU processes all 50 in 2.3 seconds, making the amortized cost per query
very small.
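The embeddings figure can be sanity-checked directly from the latency. A minimal sketch, assuming the report's ~47 ms/query latency and the $2.105/hr total hourly cost used below:

```python
# Sketch: amortized H100 cost for the embeddings workload, computed from
# the per-query latency reported above. Constants come from this report.
H100_HOURLY_USD = 2.105   # amortization + electricity
LATENCY_S = 0.047         # ~47 ms per embedding query
N_QUERIES = 50

wall_clock_s = LATENCY_S * N_QUERIES                   # ~2.3 s for all 50
cost_per_query = H100_HOURLY_USD / 3600 * LATENCY_S    # ~$0.000027

print(f"wall clock: {wall_clock_s:.1f} s, per query: ${cost_per_query:.6f}")
```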

#### Why OpenAI appears cheaper: the utilization gap

The benchmark runs 50 queries sequentially (c=1). The GPU processes one
query at a time, leaving most of its capacity unused. Meanwhile, OpenAI
only charges for the tokens actually consumed — idle time costs nothing.

**Example calculation (math workload):**
- 50 queries take 95.6 seconds sequentially → throughput = 0.52 req/s
- H100 hourly cost = $2.105/hr → cost for 95.6 s = $0.0560
- Per query = $0.0560 / 50 = **$0.001119**
- OpenAI charges $0.40/M input + $1.60/M output → 50 queries = **$0.0223**
- Per query = $0.0223 / 50 = **$0.000446**

The H100 is 2.5× more expensive per query because it costs $2.105/hr
whether it runs 1 query or 1,000 queries in that hour. At 0.52 req/s, the
GPU serves only 1,879 queries/hr — far below its potential with batching.
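The worked example above can be reproduced in a few lines. This is a sketch, not the benchmark harness itself; the OpenAI total ($0.0223) is taken from the report, since the per-query token counts are not reproduced here:

```python
# Sketch: reproduce the math-workload per-query cost comparison.
# All input figures are from the report; tiny differences from the
# prose above are rounding.
H100_HOURLY_USD = 2.105
N_QUERIES = 50
WALL_CLOCK_S = 95.6
OPENAI_TOTAL_USD = 0.0223   # total for 50 queries, from the report

throughput = N_QUERIES / WALL_CLOCK_S                   # ~0.52 req/s
h100_total = H100_HOURLY_USD * WALL_CLOCK_S / 3600      # ~$0.056
h100_per_query = h100_total / N_QUERIES                 # ~$0.00112
openai_per_query = OPENAI_TOTAL_USD / N_QUERIES         # $0.000446
ratio = h100_per_query / openai_per_query               # ~2.5×
```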

#### Scaling projections

As query volume increases, the H100's fixed hourly cost gets amortized
across more queries. OpenAI's cost stays linear. Here are projections at
different daily volumes:

**At 1,000 queries/day (our benchmark is ~50/day):**

| Workload | OpenAI/month | H100/month | Winner |
|----------|--------------|------------|--------|
| math | $13.39 | $33.60 | OpenAI (2.5×) |
| reasoning | $5.99 | $19.47 | OpenAI (3.2×) |
| summarization | $4.52 | $6.39 | OpenAI (1.4×) |
| json_extraction | $3.65 | $4.67 | OpenAI (1.3×) |
| embeddings | $1.14 | $0.82 | **H100** (0.7×) |

At low volume, OpenAI wins on most workloads because per-token pricing
avoids paying for idle GPU time.
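The monthly figures follow mechanically from the per-query table. A small sketch, assuming a 30-day month and a hypothetical `monthly_costs` helper (not part of the benchmark code):

```python
# Sketch: project monthly cost at a given daily volume from the
# per-query prices in the table above (sequential H100 serving).
PER_QUERY_USD = {  # (openai, h100) dollars per query, from this report
    "math": (0.000446, 0.001119),
    "reasoning": (0.000200, 0.000649),
    "summarization": (0.000151, 0.000213),
    "json_extraction": (0.000122, 0.000156),
    "embeddings": (0.000038, 0.000027),
}

def monthly_costs(workload: str, queries_per_day: int, days: int = 30):
    """Return (openai_monthly, h100_monthly) in dollars."""
    openai, h100 = PER_QUERY_USD[workload]
    n = queries_per_day * days
    return openai * n, h100 * n

openai_m, h100_m = monthly_costs("math", 1_000)   # ~$13 vs ~$34
```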

**At 100,000 queries/day (production scale):**

| Workload | OpenAI/month | H100/month | GPU hr/day needed | Winner |
|----------|--------------|------------|-------------------|--------|
| math | $1,339 | $3,360 | 53.2 hr (needs 3 GPUs) | OpenAI (2.5×) |
| reasoning | $599 | $1,947 | 30.8 hr (needs 2 GPUs) | OpenAI (3.2×) |
| summarization | $452 | $639 | 10.1 hr | OpenAI (1.4×) |
| json_extraction | $365 | $467 | 7.4 hr | OpenAI (1.3×) |
| embeddings | $114 | $82 | 1.3 hr | **H100** (0.7×) |

Even at 100K queries/day with sequential processing, OpenAI's gpt-4.1-mini
remains cheaper for generation-heavy workloads. This is because gpt-4.1-mini
is aggressively priced ($0.40/M input, $1.60/M output) — far below what the
equivalent compute would cost on owned hardware.

**The key factor: concurrent batching.** The projections above assume
sequential processing (c=1, one query at a time). In production, vLLM
uses **continuous batching** — processing many requests simultaneously,
sharing GPU memory via PagedAttention. A 3B model on an H100 can
typically sustain 10-50× higher throughput with batching than with
sequential processing. Batched throughput cuts the required GPU hours/day
proportionally, shifting the cost balance toward owned hardware.
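The "GPU hr/day needed" column, and the effect of batching on it, can be sketched from the sequential throughputs. The `gpu_needs` helper is hypothetical, and the 10× batching factor is an illustrative assumption within the 10-50× range quoted above:

```python
import math

# Sketch: GPU hours/day and GPU count required at a given daily volume,
# from the sequential throughputs measured in this benchmark.
SEQ_RPS = {"math": 0.52, "reasoning": 0.90, "summarization": 2.75,
           "json_extraction": 3.76, "embeddings": 21.30}

def gpu_needs(workload: str, queries_per_day: int, batching_factor: float = 1.0):
    """Return (gpu_hours_per_day, gpus_needed) for the given volume."""
    rps = SEQ_RPS[workload] * batching_factor
    gpu_hours = queries_per_day / (rps * 3600)
    return gpu_hours, math.ceil(gpu_hours / 24)

hours, gpus = gpu_needs("math", 100_000)            # sequential: ~53 hr, 3 GPUs
hours_b, gpus_b = gpu_needs("math", 100_000, 10.0)  # 10× batching: ~5 hr, 1 GPU
```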

#### When does owned hardware win?

The breakeven point depends on the workload's output length:

| Workload | Sequential throughput | Breakeven utilization | Achievable? |
|----------|-----------------------|-----------------------|-------------|
| math | 0.52 req/s (1,879/hr) | 4,718 req/hr (251% of max) | Only with batching |
| reasoning | 0.90 req/s (3,244/hr) | 10,542 req/hr (325% of max) | Only with batching |
| summarization | 2.75 req/s (9,882/hr) | 13,961 req/hr (141% of max) | Only with batching |
| json_extraction | 3.76 req/s (13,518/hr) | 17,316 req/hr (128% of max) | Only with batching |
| embeddings | 21.30 req/s (76,691/hr) | 55,570 req/hr (72% of max) | **Yes, sequential** |

Breakeven = queries/hr at which the H100's monthly cost equals OpenAI's.

- **Embeddings**: The H100 already wins at 72% of sequential utilization — no
  batching needed. At full sequential throughput (21 req/s), the H100
  costs $0.000027/query vs OpenAI's $0.000038/query.
- **Generation workloads**: Breakeven requires 128-325% of sequential
  throughput — impossible without concurrent batching. With vLLM's
  continuous batching, these throughputs are achievable for a 3B model
  on an H100.
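The breakeven column can be recomputed directly from the hourly cost and the per-query prices. A minimal sketch; small differences from the table come from using the rounded per-query prices:

```python
# Sketch: breakeven = req/hr at which the H100's hourly cost equals
# OpenAI's per-token cost for the same volume, compared to sequential
# capacity. All inputs are from the tables in this report.
H100_HOURLY_USD = 2.105
OPENAI_PER_QUERY = {"math": 0.000446, "reasoning": 0.000200,
                    "summarization": 0.000151, "json_extraction": 0.000122,
                    "embeddings": 0.000038}
SEQ_PER_HR = {"math": 1879, "reasoning": 3244, "summarization": 9882,
              "json_extraction": 13518, "embeddings": 76691}

results = {}
for w, price in OPENAI_PER_QUERY.items():
    breakeven = H100_HOURLY_USD / price            # req/hr where costs match
    results[w] = (breakeven, 100 * breakeven / SEQ_PER_HR[w])
    # only embeddings lands below 100% of sequential capacity
```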

#### Summary

| Scale | Winner | Why |
|-------|--------|-----|
| Low volume (<1K queries/day) | OpenAI | Per-token pricing avoids idle GPU cost |
| Medium volume (1K-50K/day, sequential) | OpenAI | gpt-4.1-mini pricing is below H100 sequential cost for generation workloads |
| High volume (batched serving) | H100 | Continuous batching multiplies throughput 10-50×, amortizing the fixed cost across many more queries |
| Embeddings (any volume) | H100 | Fast enough that GPU utilization is high even sequentially |
| Data privacy / no rate limits | H100 | Owned hardware means no data sharing, no API rate limits, no vendor lock-in |

**Bottom line:** OpenAI's gpt-4.1-mini is priced aggressively enough that
owned H100 hardware only becomes cost-competitive with concurrent batched
serving. For sequential inference (as in this benchmark), OpenAI is
1.3-3.2× cheaper per query on generation workloads. The non-cost advantages
of owned hardware (privacy, no rate limits, custom models, no vendor
dependency) may justify the premium regardless.

### ROUGE Scores (Summarization)
