
feat(vllm): upgrade to 0.16.0 with single-GPU validation#583

Closed
vivekkalyan wants to merge 1 commit into main from feat/vllm-0.16.0-upgrade-main

Conversation

@vivekkalyan
Collaborator

Summary

Upgrades vllm from 0.15.1 to 0.16.0 (via uv) and refreshes uv.lock.

Scope

  • Models evaluated:
    • OpenPipe/Qwen3-14B-Instruct
    • Qwen/Qwen3-30B-A3B-Instruct-2507
  • Modes evaluated:
    • inference-only
    • strict replay
    • ART-E
  • Experiment discipline:
    • uv run sky ...
    • H200 single-GPU
    • one fresh VM/cluster per run
    • non-FP8 focus

Compatibility notes

  • Upstream protocol paths changed around reasoning_content; thinking-model flows should be exercised for ART paths that still emit it.
  • Tinker renderer paths are Tinker API-specific and not the primary local vLLM path.
  • enable_dbo=true is not used in the recommended single-GPU config here (requires DeepEP all2all backend/kernels in this setup).

Benchmark highlights

Inference-only (single GPU, c=8)

  • 14B is effectively flat:
    • throughput: 621.88 -> 620.69 tok/s (-0.19%)
    • latency avg: 1.2766s -> 1.2775s (+0.07%)
  • 30B regresses with the default config, but a tuned config removes the regression:
    • default: 660.95 -> 549.96 tok/s (-16.79%)
    • tuned best (max_num_batched_tokens):
      • 0.16.0: 620.38 tok/s, 1.1545s latency avg
      • 0.15.1: 618.06 tok/s, 1.1585s latency avg
      • best-vs-best: +0.38% throughput, -0.34% latency for 0.16.0
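The percentage deltas above are plain relative changes between the 0.15.1 and 0.16.0 runs; a minimal sketch to reproduce them (helper name `pct_delta` is illustrative, not part of the repo):

```python
def pct_delta(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# 14B inference-only throughput (tok/s), 0.15.1 -> 0.16.0
print(round(pct_delta(621.88, 620.69), 2))  # -> -0.19
# 30B default-config throughput, 0.15.1 -> 0.16.0
print(round(pct_delta(660.95, 549.96), 2))  # -> -16.79
# 30B best-vs-best throughput after tuning max_num_batched_tokens
print(round(pct_delta(618.06, 620.38), 2))  # -> 0.38
```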

Replay follow-up (single GPU, c=8)

For 30B replay, defaults were best for both versions.

  • 0.15.1 default: 358.66 tok/s, 0.14798s latency mean
  • 0.16.0 default: 364.74 tok/s, 0.14284s latency mean
  • best-vs-best: +1.70% throughput, -3.47% latency mean for 0.16.0

Forcing max_num_batched_tokens (8192/16384) reduced replay performance on both versions.
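The replay best-vs-best figures follow from the same relative-change arithmetic (helper name `pct_delta` is illustrative):

```python
def pct_delta(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# 30B replay, default configs, 0.15.1 -> 0.16.0
print(round(pct_delta(358.66, 364.74), 2))    # throughput -> 1.7
print(round(pct_delta(0.14798, 0.14284), 2))  # latency mean -> -3.47
```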

ART-E

Task-quality metrics stayed stable across both models; 30B ART-E showed slight latency/throughput improvement on 0.16.0.

Recommendation

  • Proceed with vllm==0.16.0 for single-GPU non-FP8 with mode-specific config:
    • inference-only (30B): set max_num_batched_tokens explicitly (8192 or 16384)
    • replay (30B): keep default server settings (do not force max_num_batched_tokens)
  • Do not assume one server config is optimal across modes.
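As a sketch of the mode-specific recommendation, the two configs could be expressed as server launch commands. This is a hypothetical helper, not code from this PR; the `--max-num-batched-tokens` flag is vLLM's CLI spelling of `max_num_batched_tokens`, and the concrete value is one of the two evaluated here:

```python
def server_args(model: str, mode: str) -> list[str]:
    """Build a vLLM serve command per the mode-specific recommendation."""
    args = ["vllm", "serve", model]
    if mode == "inference-only":
        # 30B inference-only: set max_num_batched_tokens explicitly (8192 or 16384)
        args += ["--max-num-batched-tokens", "16384"]
    elif mode == "replay":
        # replay: keep default server settings; do not force max_num_batched_tokens
        pass
    return args

print(server_args("Qwen/Qwen3-30B-A3B-Instruct-2507", "inference-only"))
```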

@vivekkalyan
Collaborator Author

Superseded by #584 after branch rename to feat/vllm-0.16.0.

@vivekkalyan vivekkalyan deleted the feat/vllm-0.16.0-upgrade-main branch February 27, 2026 02:35