Conversation
| 16 | 6.2 | | |
- **Service Returns to Responsiveness within Seconds**: To test extreme resilience, we evaluated DeepSeek V3.2 on 4 nodes (32 GPUs in total, with ep_size=dp_size=32) and 256 redundant experts, which allows us to tolerate up to 2 full node failures. Measuring the service interruption caused by sudden rank failures, Elastic EP reduces downtime by over 90%, from 2–3 minutes to under 10 seconds.
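The fault-tolerance claim above can be sanity-checked with back-of-envelope arithmetic. This is only a sketch: the expert counts for DeepSeek V3.2 and the even, replica-preserving placement policy are assumptions, not details taken from this PR.

```python
# Back-of-envelope check of the "tolerate up to 2 full node failures" setup.
# Constants below are assumptions for illustration, not from the PR itself.

NODES = 4
GPUS_PER_NODE = 8
EP_SIZE = NODES * GPUS_PER_NODE           # ep_size = dp_size = 32
LOGICAL_EXPERTS = 256                     # routed experts per MoE layer (assumed)
REDUNDANT_EXPERTS = 256                   # extra replicas added for redundancy

total_instances = LOGICAL_EXPERTS + REDUNDANT_EXPERTS    # 512 expert instances
instances_per_rank = total_instances // EP_SIZE          # 16 per GPU rank

failed_nodes = 2
failed_ranks = failed_nodes * GPUS_PER_NODE              # 16 ranks lost
surviving_ranks = EP_SIZE - failed_ranks                 # 16 ranks remain
surviving_instances = surviving_ranks * instances_per_rank

# If placement keeps at least one replica of every logical expert on the
# surviving ranks, the remaining instances still cover all logical
# experts, so decoding can continue after losing 2 full nodes.
assert surviving_instances >= LOGICAL_EXPERTS
print(instances_per_rank, failed_ranks, surviving_instances)
```

Note that this only shows the capacity is sufficient; actually surviving two arbitrary node failures additionally depends on how replicas are spread across nodes.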
|
|
| Number of failed ranks | Interruption time with Elastic EP (sec) | Throughput with remaining ranks (tokens/sec) | Mean TPOT with remaining ranks (ms) |
It would be better to add an explanation of why Mean TPOT decreases. I assume it is because the total batch size becomes smaller as the number of DP ranks decreases?
Do we decrease the request rate here? If not, then each EP rank may get more tokens per batch. IIUC, a lower Mean TPOT usually indicates higher per-request token throughput; if the total number of requests stays the same, each request should get fewer tokens per second, since the computing resources are reduced?
Maybe you should provide the benchmark settings so readers can better understand the workload and reproduce the results.
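The relationship being discussed here can be made concrete with a rough model, assuming decode capacity is shared evenly across concurrent requests. All numbers below are illustrative, not from the PR's benchmark.

```python
# Illustrative model of how Mean TPOT relates to aggregate throughput
# and concurrency. Purely hypothetical numbers, not the PR's benchmark.

def mean_tpot_ms(total_throughput_tok_s: float, concurrent_requests: int) -> float:
    """Time per output token for one request, assuming decode capacity
    is shared evenly across all concurrent requests."""
    per_request_tok_s = total_throughput_tok_s / concurrent_requests
    return 1000.0 / per_request_tok_s

# Same request load, fewer ranks -> lower total throughput -> higher TPOT:
print(mean_tpot_ms(total_throughput_tok_s=8000, concurrent_requests=64))  # 8.0 ms
print(mean_tpot_ms(total_throughput_tok_s=4000, concurrent_requests=64))  # 16.0 ms

# A smaller concurrent batch can instead lower TPOT even with reduced
# capacity, which is the alternative explanation raised above:
print(mean_tpot_ms(total_throughput_tok_s=4000, concurrent_requests=16))  # 4.0 ms
```

So whether Mean TPOT rises or falls after rank failures depends on whether the request rate (and thus per-rank batch size) is held constant, which is why the benchmark settings matter.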
The 4 decode-node setup used to evaluate recovery time was not optimized for the best throughput/latency, so those results were not convincing. However, I cannot acquire enough GPU resources to conduct a more thorough evaluation at this stage 😢 So I reverted the throughput/latency performance data from this table.
Nevertheless, I have revised the writing to describe the reproduction steps more clearly.