feat(turbomind): support priority schedule policy #4614
Add request priority validation and plumbing through Python configs, OpenAI protocol, pybind, and TurboMind GenerationConfig. Introduce schedule_policy for TurboMind engines, implement FIFO and non-preemptive priority request queues, and use request priority in engine materialization ordering while preserving already scheduled requests. Add focused Python validation tests, C++ request queue tests, and priority scheduling docs.
Hi @windreamer, to move this PR forward, could you please let me know if any further work or adjustments are needed? Thanks!
cc @lvhan028
Do you think there is any risk of request starvation?
Yes, there is a potential starvation risk for lower-priority requests under sustained high-priority traffic. The current policy is strict priority when admitting requests from the waiting queue: requests with lower priority values are always selected first. Once a request has started, we try to keep it ahead of new requests, so an already-running low-priority request should not be starved by newly arriving high-priority requests. The main risk is for low-priority requests that are still queued and have not entered the engine yet.
This is opt-in because the default policy remains fifo. With schedule_policy='priority', the behavior is intentional, but it does not currently include aging, quotas, deadlines, or weighted fairness. So if we need eventual-service guarantees for low-priority traffic, we should add one of those mechanisms or enforce admission limits for high-priority requests.
We intentionally kept the initial implementation as strict priority scheduling. We did consider the fairness mechanisms above, but did not include them in the first version because they would add scheduler complexity. These mechanisms are also highly business-dependent; fully adapting them to different traffic patterns may require quite a few tunable parameters. If we need stronger eventual-service guarantees, I can help implement one of those strategies.
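The starvation scenario described in this comment can be reproduced with a small Python simulation. This is only a sketch under stated assumptions, not the TurboMind scheduler: the function `simulate`, the single-slot engine, and the step-based timing are all hypothetical. It assumes only what the comment states, namely that lower priority values are admitted first and that a request that has already started is not evicted by newer arrivals.

```python
import heapq
import itertools


def simulate(arrivals_per_step, slots=1):
    """Admit waiting requests by strict priority; never evict running ones.

    arrivals_per_step: one list per step of (request_id, priority, steps_needed).
    Returns a dict mapping request_id to the step at which it was admitted.
    Hypothetical model: each running request needs a fixed number of steps.
    """
    waiting = []        # min-heap of (priority, arrival_seq, request_id, steps)
    running = {}        # request_id -> remaining steps
    admitted_at = {}
    seq = itertools.count()  # arrival counter, keeps equal priorities in FIFO order
    step = 0
    while step < len(arrivals_per_step) or waiting or running:
        if step < len(arrivals_per_step):
            for rid, prio, steps in arrivals_per_step[step]:
                heapq.heappush(waiting, (prio, next(seq), rid, steps))
        # Fill free slots only: running requests keep their slots (non-preemptive).
        while waiting and len(running) < slots:
            _, _, rid, steps = heapq.heappop(waiting)
            running[rid] = steps
            admitted_at[rid] = step
        # Advance every running request by one step and retire finished ones.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] <= 0:
                del running[rid]
        step += 1
    return admitted_at


# Two low-priority requests arrive first; one high-priority request follows per step.
arrivals = [
    [("low-a", 10, 2), ("low-b", 10, 1)],
    [("high-1", 0, 1)],
    [("high-2", 0, 1)],
    [("high-3", 0, 1)],
    [("high-4", 0, 1)],
]
admitted = simulate(arrivals, slots=1)
```

In this run, low-a is admitted at step 0 and finishes even though high-1 arrives while it is running, whereas low-b, queued since step 0, is admitted only at step 6, after every high-priority request. With an unbounded stream of high-priority arrivals, low-b would never be admitted, which is the gap that aging, quotas, or admission limits would address.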
Quite reasonable, thanks for the detailed analysis. The maintainers are pretty swamped with the next LLM release right now, so we’ll likely need a little time to sync up and push this forward. To help us prioritize when we do pick this up: do you have a specific deadline or target timeline on your end? Also, could you share a bit more about the use case or business context driving this? For example, is there a particular workload or SLA constraint this is meant to address? That would help us focus the discussion when we get to it.
I will discuss with @lzhangzz ASAP.
Thanks, there is no hard deadline on our side, so it is fine to wait until the maintainers have bandwidth. The main use case is mixed online/offline deployment on the same inference cluster. Online requests are latency-sensitive, while offline batch jobs are throughput-oriented and only need to meet a loose T+1 requirement. Priority scheduling lets us give online traffic higher priority without splitting clusters or over-provisioning resources.
Motivation
TurboMind currently schedules requests in FIFO order, which makes it hard to differentiate latency-sensitive online traffic from background or lower-priority workloads in shared serving scenarios.
This PR introduces request-level priority scheduling for the TurboMind backend, allowing high-priority requests to be admitted earlier while keeping the default FIFO behavior unchanged.
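The difference between the two policies can be illustrated with a minimal Python sketch. It is not the TurboMind C++ request queue: the class `RequestQueue`, the helper `drain`, and the request names are hypothetical. It assumes that lower priority values are served first, that equal priorities fall back to arrival order, and that the priority value is simply ignored under fifo.

```python
import heapq
import itertools
from collections import deque


class RequestQueue:
    """Hypothetical waiting queue supporting 'fifo' and 'priority' policies."""

    def __init__(self, schedule_policy="fifo"):
        if schedule_policy not in ("fifo", "priority"):
            raise ValueError(f"unknown schedule_policy: {schedule_policy!r}")
        self.schedule_policy = schedule_policy
        self._fifo = deque()
        self._heap = []
        self._seq = itertools.count()  # arrival counter used as a tie-breaker

    def push(self, request_id, priority=0):
        if self.schedule_policy == "fifo":
            # Assumption: priority has no effect under the default policy.
            self._fifo.append(request_id)
        else:
            heapq.heappush(self._heap, (priority, next(self._seq), request_id))

    def pop(self):
        if self.schedule_policy == "fifo":
            return self._fifo.popleft()
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._fifo) + len(self._heap)


def drain(policy, arrivals):
    """Push all arrivals, then return the order in which they are popped."""
    queue = RequestQueue(policy)
    for request_id, priority in arrivals:
        queue.push(request_id, priority)
    return [queue.pop() for _ in range(len(queue))]


arrivals = [("offline-1", 10), ("online-1", 0), ("offline-2", 10), ("online-2", 0)]
fifo_order = drain("fifo", arrivals)
priority_order = drain("priority", arrivals)
```

Under fifo the requests come out in arrival order, while under priority both online requests are popped before either offline request, with arrival order preserved inside each priority level.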
Modification
priority validation and plumbing through GenerationConfig, OpenAI-compatible request protocols, API server handling, pybind, and TurboMind GenerationConfig.
schedule_policy support for TurboMind engines, with fifo as the default policy and priority as the new priority scheduling policy.
BC-breaking (Optional)
No backward compatibility break is introduced.
The default scheduling policy remains fifo, so existing users keep the same scheduling behavior unless they explicitly enable schedule_policy='priority'.
Use cases (Optional)
This feature is useful for serving workloads that mix different traffic classes, for example:
Documentation has been added to describe how to enable --schedule-policy priority and how to set request-level priority.
Checklist
Validation
Ran the focused tests added by this PR, including Python validation tests for request priority and C++ request queue tests.
The full unit test suite was also attempted, but some existing tests failed. These failures are not introduced by this PR and are unrelated to the priority scheduling changes.