Your current environment
vLLM master.
🐛 Describe the bug
Description
AsyncLLM.encode() accepts truncate_prompt_tokens as a separate parameter but ignores pooling_params.truncate_prompt_tokens. This causes confusion and is inconsistent with generate(), which reads truncate_prompt_tokens directly from sampling_params.
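For contrast, a minimal sketch of the generate() path, where the same field set on SamplingParams is honored (engine, prompt, and request_id are placeholders, not from a specific test):

```python
from vllm import SamplingParams

async def run_generate(engine, prompt, request_id):
    # generate() reads truncate_prompt_tokens from sampling_params, so an
    # over-long prompt is truncated to max_model_len instead of raising.
    sampling_params = SamplingParams(truncate_prompt_tokens=-1)
    async for output in engine.generate(prompt, sampling_params, request_id):
        ...
```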
Impact
Users setting truncate_prompt_tokens in PoolingParams expect it to work:
```python
pooling_params = PoolingParams(truncate_prompt_tokens=-1)  # Has no effect!
async for output in engine.encode(prompt, pooling_params, request_id):
    ...
```
This causes long prompts to fail with `ValueError: prompt is longer than max_model_len` instead of being truncated.
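A sketch of the current workaround, assuming the separate truncate_prompt_tokens parameter described above is passed to encode() directly (same placeholder names as before):

```python
from vllm import PoolingParams

async def run_encode(engine, prompt, request_id):
    pooling_params = PoolingParams(truncate_prompt_tokens=-1)  # ignored today
    # Workaround: pass truncate_prompt_tokens to encode() as a keyword argument,
    # since the separate parameter is the one that is actually applied.
    async for output in engine.encode(
        prompt, pooling_params, request_id, truncate_prompt_tokens=-1
    ):
        ...
```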