What would you like to happen?
Issue Title
[Feature]: Implement Smart Bucketing Strategies for RunInference
Motivation
Currently, RunInference creates batches based solely on element count or arrival time. For generic models (e.g., BERT, ResNet, standard Hugging Face pipelines) that rely on static padding or dynamic padding (via DataCollator), this random batching leads to significant padding overhead when sequence lengths vary widely.
This results in wasted GPU memory and compute resources, as short sequences are forced to carry heavy padding to match the longest sequence in the batch.
Proposal
I propose implementing "Smart Bucketing" strategies upstream of the model handler to group similar-sized elements together. Based on community feedback, I will implement two distinct strategies to cater to different workload constraints:
1. Stateless Strategy (Latency-Sensitive)
Sorts and batches elements within a single bundle using StartBundle/FinishBundle.
- Pros: Zero shuffle cost, minimal latency overhead.
- Use Case: High-throughput pipelines with sufficient bundle density (enough data per bundle to sort effectively).
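For illustration, here is a minimal plain-Python sketch of the stateless idea, independent of Beam APIs: buffer a bundle's elements, sort by length, and cut fixed-size batches so each batch pads only to its own local maximum. In a real DoFn the buffer would fill in process() and flush in finish_bundle(); the helper names and the toy bundle below are illustrative, not an existing API.

```python
# Stateless sketch: sort a bundle by length, then cut fixed-size batches
# so each batch pads only to its own maximum length.

def sort_and_batch(bundle, batch_size):
    """Group a bundle of variable-length sequences into length-sorted batches."""
    ordered = sorted(bundle, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def padding_tokens(batch):
    """Padding needed to bring every sequence up to the batch max length."""
    longest = max(len(seq) for seq in batch)
    return sum(longest - len(seq) for seq in batch)

bundle = [[0] * n for n in (3, 128, 5, 120, 4, 125)]
sorted_batches = sort_and_batch(bundle, batch_size=3)
naive_batches = [bundle[:3], bundle[3:]]  # arrival order, for comparison

print(sum(padding_tokens(b) for b in sorted_batches))  # → 14
print(sum(padding_tokens(b) for b in naive_batches))   # → 374
```

Even on this toy bundle, grouping the three short and three long sequences together cuts total padding by more than an order of magnitude.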
2. Stateful Strategy (Cost-Sensitive)
Pre-keys elements by length to leverage the existing BatchElements state machinery.
- Pros: Global optimization (minimum padding across all workers), prevention of OOM on expensive models.
- Use Case: Large/expensive models where GPU efficiency outweighs network shuffle cost.
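A plain-Python sketch of the stateful idea: derive a bucket key from each element's length so that a downstream keyed batching transform (the BatchElements/GroupIntoBatches machinery) only ever groups similarly sized elements. The bucket boundaries and helper names below are illustrative assumptions, not part of any existing API.

```python
# Stateful sketch: pre-key elements into length buckets ahead of keyed batching.

BUCKET_BOUNDARIES = (16, 64, 256, 1024)  # assumed; tunable per workload

def bucket_key(seq):
    """Smallest boundary that fits the sequence; the long tail shares the last bucket."""
    for boundary in BUCKET_BOUNDARIES:
        if len(seq) <= boundary:
            return boundary
    return BUCKET_BOUNDARIES[-1]

def key_by_length(elements):
    """Mimics a Map(lambda e: (bucket_key(e), e)) step ahead of keyed batching."""
    keyed = {}
    for seq in elements:
        keyed.setdefault(bucket_key(seq), []).append(seq)
    return keyed

elements = [[0] * n for n in (10, 500, 60, 12, 300, 700)]
buckets = key_by_length(elements)
print(sorted(buckets))                              # → [16, 64, 1024]
print([len(buckets[k]) for k in sorted(buckets)])   # → [2, 1, 3]
```

Because the key travels with the element, batching by key works globally across workers, which is what enables the minimum-padding guarantee and the OOM protection described above.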
Note: This feature targets generic upstream batching for standard models. It is not intended to replace native continuous batching solutions like vLLM for supported models.
Implementation Plan
I will deliver this work via 3 atomic PRs to ensure manageable review scope:
1. PR 1: The stateless transform (SortAndBatchElements) with benchmarks demonstrating padding reduction and throughput.
2. PR 2: The stateful strategy built on the existing BatchElements, ensuring correctness across distributed workers.
3. PR 3: Integration into the RunInference API with end-to-end success tests.
Success Metrics
- Padding Efficiency: Significant reduction in padding tokens per batch (verified by benchmarks using Pareto/Log-normal distributions).
- Throughput: Comparison (elements/sec) vs. standard BatchElements to ensure sorting overhead is negligible.
- Latency: Improvement in P95 latency by reducing the "straggler effect" in batches.
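A sketch of how the padding-efficiency metric could be measured: draw sequence lengths from a log-normal distribution, as suggested above, and compare total padding for length-sorted batching vs. arrival-order batching. The distribution parameters and batch size below are illustrative assumptions.

```python
# Benchmark sketch: padding under sorted vs. arrival-order batching
# for log-normally distributed sequence lengths.
import random

random.seed(0)
lengths = [max(1, int(random.lognormvariate(4.0, 0.8))) for _ in range(1000)]

def total_padding(lengths, batch_size):
    """Padding tokens summed over fixed-size contiguous batches."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

naive = total_padding(lengths, batch_size=32)
bucketed = total_padding(sorted(lengths), batch_size=32)
print(f"padding reduction: {1 - bucketed / naive:.1%}")
```

Sorting before cutting fixed-size contiguous batches minimizes total padding for a given batch size, so the bucketed figure bounds what the real transform can achieve on the same length distribution.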
Issue Priority
Priority: 2 (default / most feature requests should be filed as P2)
Issue Components