Your current environment
The output of python collect_env.py
Your output of `python collect_env.py` here
🐛 Describe the bug
This regression occurs only on B200 machines.
The regression is caused by #29624.
I use commit 75eb302a as the baseline.
At commit 75eb302:
============ Serving Benchmark Result ============
Successful requests: 2560
Benchmark duration (s): 1597.51
Total input tokens: 2621440
Total generated tokens: 20971520
Request throughput (req/s): 1.60
Output token throughput (tok/s): 13127.61
Total Token throughput (tok/s): 14768.56
---------------Time to First Token----------------
Mean TTFT (ms): 902.42
Median TTFT (ms): 230.73
P99 TTFT (ms): 6494.69
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 36.55
Median TPOT (ms): 37.21
P99 TPOT (ms): 51.62
---------------Inter-token Latency----------------
Mean ITL (ms): 749.75
Median ITL (ms): 780.64
P99 ITL (ms): 1314.17
At commit 75eb302 with #29624 reverted:
============ Serving Benchmark Result ============
Successful requests: 2560
Benchmark duration (s): 1268.58
Total input tokens: 2621440
Total generated tokens: 20971520
Request throughput (req/s): 2.02
Output token throughput (tok/s): 16531.44
Total Token throughput (tok/s): 18597.87
---------------Time to First Token----------------
Mean TTFT (ms): 935.92
Median TTFT (ms): 229.24
P99 TTFT (ms): 6617.85
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 28.76
Median TPOT (ms): 29.07
P99 TPOT (ms): 43.97
---------------Inter-token Latency----------------
Mean ITL (ms): 599.24
Median ITL (ms): 577.93
P99 ITL (ms): 1128.54
With #29624, output token throughput drops from 16531 to 13127 tok/s (roughly 20%).
Repro commands
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
server-side:
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8087 --model openai/gpt-oss-120b --tokenizer openai/gpt-oss-120b --dtype auto --kv-cache-dtype fp8 --tensor-parallel-size 1 --pipeline-parallel-size 1 --data-parallel-size 1 --swap-space 16 --max-num-seqs 1024 --trust-remote-code --max-model-len 9226 --gpu-memory-utilization 0.9 --max-num-batched-tokens 8192 --no-enable-prefix-caching --async-scheduling --stream-interval 20 --compilation_config.pass_config.fuse_allreduce_rms true --compilation_config.pass_config.eliminate_noops true --compilation_config.max_cudagraph_capture_size 2048 --speculative_config.method eagle3 --speculative_config.model nvidia/gpt-oss-120b-Eagle3-v2 --speculative_config.num_speculative_tokens 3
client-side:
python3 benchmark_serving.py --backend vllm --host 0.0.0.0 --port 8087 --model openai/gpt-oss-120b --num-prompts 2560 --trust-remote-code --ignore-eos --max-concurrency 512 --random-input-len 1024 --random-output-len 8192 --random-range-ratio 1.0 --use-chat-template --dataset-name random --save-result --result-filename benchmark_serving_results.json
Note: benchmark_serving.py comes from the following repo:
git clone https://github.com/kimbochen/bench_serving.git
pip install pandas datasets --break-system-packages
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.