Voxtral Realtime: unify SDPA classes and dtype-aware attention masks (#17997)
mergennachin wants to merge 1 commit into main
Conversation
See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17997
CI status as of commit e62e797 (merge base 122fdef): 1 Awaiting Approval, 2 New Failures.
Unify CudaSDPA and StandardEncoderSDPA into a single StandardSDPA class with a transpose_kv parameter, mirroring the MetalSDPA unification. This gives a symmetric design: MetalSDPA and StandardSDPA share the same interface (n_heads, n_kv_heads, head_dim, transpose_kv).

Make _build_attn_mask and create_causal_mask dtype-aware: masks are now created in the model dtype instead of always float32. This is required because the Metal SDPA kernel reads the mask buffer as `device T*` (the same type as Q/K/V), so a float32 mask with bf16 Q/K/V would be misinterpreted.
Pull request overview
This PR unifies the Voxtral Realtime SDPA implementations to align CUDA and encoder behavior with the existing Metal design, and updates attention mask creation to be dtype-aware to satisfy Metal SDPA kernel requirements.
Changes:
- Replaced `CudaSDPA` and `StandardEncoderSDPA` with a unified `StandardSDPA` that supports both decoder and streaming-encoder layouts via `transpose_kv`.
- Made `_build_attn_mask` and `StandardEncoderRingKVCache.create_causal_mask` dtype-aware so additive masks are created in the model's dtype (required for Metal SDPA).
- Updated model documentation to reflect the new SDPA classes and mask dtype constraints.
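A unified SDPA class along the lines described above might look like the following sketch. The constructor parameters (`n_heads`, `n_kv_heads`, `head_dim`, `transpose_kv`) come from the PR; the internal layout handling and GQA head expansion are assumptions for illustration, not the actual `model.py` implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StandardSDPA(nn.Module):
    """Sketch of a transpose_kv-parameterized SDPA module.

    Assumed layouts (hypothetical): Q is (batch, n_heads, seq, head_dim);
    when transpose_kv=True, K/V arrive as (batch, seq, n_kv_heads, head_dim)
    and are transposed to (batch, n_kv_heads, seq, head_dim) first.
    """

    def __init__(self, n_heads: int, n_kv_heads: int, head_dim: int, transpose_kv: bool):
        super().__init__()
        self.n_heads = n_heads
        self.n_kv_heads = n_kv_heads
        self.head_dim = head_dim
        self.transpose_kv = transpose_kv

    def forward(self, q, k, v, attn_mask=None):
        if self.transpose_kv:
            # Move the KV head axis in front of the sequence axis.
            k = k.transpose(1, 2)
            v = v.transpose(1, 2)
        if self.n_kv_heads < self.n_heads:
            # Expand grouped KV heads to match the query head count (GQA).
            rep = self.n_heads // self.n_kv_heads
            k = k.repeat_interleave(rep, dim=1)
            v = v.repeat_interleave(rep, dim=1)
        return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```

With one `transpose_kv` flag, the same class covers both the decoder layout and the streaming-encoder layout, matching the MetalSDPA interface.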
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| examples/models/voxtral_realtime/model.py | Unifies SDPA modules (adds transpose_kv) and makes additive mask generation dtype-aware for Metal compatibility. |
| examples/models/voxtral_realtime/model.md | Updates architecture/docs to reflect StandardSDPA and dtype-matching additive masks for Metal. |
> Metal AOTI doesn't support bool tensor allocation on MPS, so we use
> integer arithmetic: clamp(curr_pos - k_pos + 1, 0, 1) gives 1 for
> valid positions (k <= curr_pos) and 0 for invalid, then convert to
> additive mask (0.0 = attend, -1e9 = don't attend).
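The integer-arithmetic trick in that docstring can be sketched in isolation as follows. The function name and signature are hypothetical; only the `clamp(curr_pos - k_pos + 1, 0, 1)` formula and the 0.0 / -1e9 additive encoding come from the quoted docstring.

```python
import torch


def additive_causal_mask(curr_pos: int, cache_len: int, dtype=torch.float32):
    """Build an additive causal mask without bool tensors (MPS-friendly sketch).

    clamp(curr_pos - k_pos + 1, 0, 1) is 1 for k_pos <= curr_pos and 0
    otherwise; (valid - 1) * 1e9 then maps 1 -> 0.0 and 0 -> -1e9.
    """
    k_pos = torch.arange(cache_len)
    valid = torch.clamp(curr_pos - k_pos + 1, 0, 1).to(dtype)
    return (valid - 1.0) * 1e9
```

For `curr_pos=2, cache_len=4` this yields `[0.0, 0.0, 0.0, -1e9]`: positions 0 to 2 are attendable, position 3 is masked.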
The docstring says the additive mask uses "-1e9 = don't attend", but the implementation now casts to an arbitrary dtype. For float16 in particular, multiplying by 1e9 overflows to +/-inf, so the mask semantics become 0 and -inf (and the doc becomes inaccurate). Consider either updating the doc to reflect dtype-dependent behavior, or computing the masked value from torch.finfo(dtype) (e.g., a representable large negative) to avoid overflow/magic constants while still matching Q/K/V dtype for Metal.
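The `torch.finfo`-based alternative the reviewer suggests could be sketched like this (the helper name is hypothetical; the overflow behavior of -1e9 in float16 is standard, since float16's maximum finite magnitude is 65504):

```python
import torch


def masked_fill_value(dtype: torch.dtype) -> torch.Tensor:
    """Pick a representable large-negative mask value for the given dtype,
    instead of a hard-coded -1e9 that overflows narrow float types."""
    return torch.tensor(torch.finfo(dtype).min, dtype=dtype)


# -1e9 exceeds the float16 range, so it rounds to -inf on cast:
overflowed = torch.tensor(-1e9, dtype=torch.float16)
# finfo-based value stays finite in every float dtype:
safe = masked_fill_value(torch.float16)
```

This keeps the mask in the Q/K/V dtype (as the Metal kernel requires) while avoiding both overflow and magic constants.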
```python
return torch.where(
    valid,
    torch.zeros(1, dtype=dtype, device=start_pos.device),
    torch.tensor(float("-inf"), dtype=dtype, device=start_pos.device),
)
```
create_causal_mask now materializes new tensors (torch.zeros(1, ...) and torch.tensor(-inf, ...)) every call and uses a (1,) shape that relies on broadcasting. In streaming this mask is computed frequently, so these per-call allocations can add overhead. Prefer 0-dim scalars (shape ()) and/or reuse cached scalar constants (or buffers) to reduce allocations while keeping the requested dtype/device.
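One way to apply the reviewer's suggestion is to cache 0-dim scalar buffers on the module. This is a sketch under assumed names (`MaskConstants`, `causal_mask` are hypothetical, not the PR's code):

```python
import torch
import torch.nn as nn


class MaskConstants(nn.Module):
    """Cache 0-dim scalar mask constants once, instead of allocating
    torch.zeros(1, ...) and torch.tensor(-inf, ...) on every call."""

    def __init__(self, dtype: torch.dtype, device=None):
        super().__init__()
        # Buffers move with .to()/.cuda() along with the rest of the model.
        self.register_buffer("zero", torch.zeros((), dtype=dtype, device=device))
        self.register_buffer(
            "neg_inf", torch.full((), float("-inf"), dtype=dtype, device=device)
        )

    def causal_mask(self, valid: torch.Tensor) -> torch.Tensor:
        # torch.where broadcasts the 0-dim scalars over valid's shape,
        # so no fresh constant tensors are created per call.
        return torch.where(valid, self.zero, self.neg_inf)
```

The shape-() scalars broadcast exactly like the original shape-(1,) tensors, so callers are unchanged.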