
feat(vllm): upgrade to 0.16.0 with single-GPU validation#583

Closed
vivekkalyan wants to merge 1 commit into main from feat/vllm-0.16.0-upgrade-main

Conversation

@vivekkalyan
Collaborator

Summary

Upgrades vllm from 0.15.1 to 0.16.0 (via uv) and refreshes uv.lock.

Scope

  • Models evaluated:
    • OpenPipe/Qwen3-14B-Instruct
    • Qwen/Qwen3-30B-A3B-Instruct-2507
  • Modes evaluated:
    • inference-only
    • strict replay
    • ART-E
  • Experiment discipline:
    • uv run sky ...
    • H200 single-GPU
    • one fresh VM/cluster per run
    • non-FP8 focus

Compatibility notes

  • Upstream protocol paths changed around reasoning_content; thinking-model flows should be exercised for ART paths that still emit it.
  • Tinker renderer paths are Tinker API-specific and not the primary local vLLM path.
  • enable_dbo=true is not used in the recommended single-GPU config here (requires DeepEP all2all backend/kernels in this setup).

Benchmark highlights

Inference-only (single GPU, c=8)

  • 14B is effectively flat:
    • throughput: 621.88 -> 620.69 tok/s (-0.19%)
    • latency avg: 1.2766s -> 1.2775s (+0.07%)
  • 30B regresses with the default config, but a tuned config removes the regression:
    • default: 660.95 -> 549.96 tok/s (-16.79%)
    • tuned best (max_num_batched_tokens):
      • 0.16.0: 620.38 tok/s, 1.1545s latency avg
      • 0.15.1: 618.06 tok/s, 1.1585s latency avg
      • best-vs-best: +0.38% throughput, -0.34% latency for 0.16.0
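The percentage deltas above are plain relative changes between the 0.15.1 and 0.16.0 runs; a minimal sketch to reproduce them (helper name `pct_delta` is illustrative, not part of the repo):

```python
def pct_delta(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# 14B inference-only throughput (tok/s), 0.15.1 -> 0.16.0
print(round(pct_delta(621.88, 620.69), 2))  # -> -0.19
# 30B default-config throughput, 0.15.1 -> 0.16.0
print(round(pct_delta(660.95, 549.96), 2))  # -> -16.79
# 30B best-vs-best throughput after tuning max_num_batched_tokens
print(round(pct_delta(618.06, 620.38), 2))  # -> 0.38
```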

Replay follow-up (single GPU, c=8)

For 30B replay, defaults were best for both versions.

  • 0.15.1 default: 358.66 tok/s, 0.14798s latency mean
  • 0.16.0 default: 364.74 tok/s, 0.14284s latency mean
  • best-vs-best: +1.70% throughput, -3.47% latency mean for 0.16.0

Forcing max_num_batched_tokens (8192/16384) reduced replay performance on both versions.
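The replay best-vs-best figures follow from the same relative-change arithmetic (helper name `pct_delta` is illustrative):

```python
def pct_delta(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# 30B replay, default configs, 0.15.1 -> 0.16.0
print(round(pct_delta(358.66, 364.74), 2))    # throughput -> 1.7
print(round(pct_delta(0.14798, 0.14284), 2))  # latency mean -> -3.47
```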

ART-E

Task-quality metrics stayed stable across both models; 30B ART-E showed slight latency/throughput improvement on 0.16.0.

Recommendation

  • Proceed with vllm==0.16.0 for single-GPU non-FP8 with mode-specific config:
    • inference-only (30B): set max_num_batched_tokens explicitly (8192 or 16384)
    • replay (30B): keep default server settings (do not force max_num_batched_tokens)
  • Do not assume one server config is optimal across modes.
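As a sketch of the mode-specific recommendation, the two configs could be expressed as server launch commands. This is a hypothetical helper, not code from this PR; the `--max-num-batched-tokens` flag is vLLM's CLI spelling of `max_num_batched_tokens`, and the concrete value is one of the two evaluated here:

```python
def server_args(model: str, mode: str) -> list[str]:
    """Build a vLLM serve command per the mode-specific recommendation."""
    args = ["vllm", "serve", model]
    if mode == "inference-only":
        # 30B inference-only: set max_num_batched_tokens explicitly (8192 or 16384)
        args += ["--max-num-batched-tokens", "16384"]
    elif mode == "replay":
        # replay: keep default server settings; do not force max_num_batched_tokens
        pass
    return args

print(server_args("Qwen/Qwen3-30B-A3B-Instruct-2507", "inference-only"))
```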

@vivekkalyan
Collaborator Author

Superseded by #584 after branch rename to feat/vllm-0.16.0.

@vivekkalyan vivekkalyan deleted the feat/vllm-0.16.0-upgrade-main branch February 27, 2026 02:35