[Performance] LSTM/GRU scan: canonical strides + cuDNN flat-storage clones + thread-local recurrent mode by vmoens · Pull Request #3754 · pytorch/rl

vmoens · 2026-05-15T07:20:33Z

Stack from ghstack (oldest at bottom):

Three intertwined fixes to the scan / triton recurrent backends.

- Canonical-stride check. A `[1, 4, 5]` tensor with strides `(5, 5, 1)`
  passes `is_contiguous()`, but `torch._higher_order_ops.scan` and the
  triton kernels read strides directly and reject non-canonical layouts.
  Add `_canonical_stride` + `_canonical_contiguous` and re-materialize
  inputs / hidden buffers when strides drift off the C-canonical layout.
- cuDNN flat-storage aliasing. With cuDNN, `nn.LSTM` / `nn.GRU` flatten
  all per-layer parameters into a single storage; the scan HOP tracer
  walks the FakeTensor graph and rejects the aliased per-layer views as
  inputs. Clone the weight views before closing the scan body. The
  per-layer carry now also clones `x_t` and the transpose+flatten output
  (the only remaining aliasing edge), so the existing `.clone()` on the
  full `torch.stack(...)` carry can be dropped.
- Thread-local `recurrent_mode`. `_ContextManager` was a single mutable
  module-level flag, so a newly spawned collector worker thread saw the
  parent's `recurrent_mode` setting. Wrap it in `_RecurrentModeContextManager`
  using `contextvars.ContextVar` so per-thread state is isolated.
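To make the first item concrete, here is a minimal pure-Python sketch of the stride arithmetic. It is not the `_canonical_stride` / `_canonical_contiguous` code from this PR: the function names are hypothetical, and `passes_contiguity_check` only mimics a relaxed check that skips size-1 dimensions, which is an assumption about why the example tensor is reported as contiguous.

```python
from typing import Sequence, Tuple


def canonical_stride(shape: Sequence[int]) -> Tuple[int, ...]:
    """Row-major (C-order) strides for `shape`, counted in elements."""
    strides = []
    running = 1
    for size in reversed(shape):
        strides.append(running)
        running *= max(size, 1)
    return tuple(reversed(strides))


def passes_contiguity_check(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """Relaxed check: size-1 dims are skipped, since their stride is never used."""
    expected = 1
    for size, stride in zip(reversed(shape), reversed(strides)):
        if size == 1:
            continue
        if stride != expected:
            return False
        expected *= size
    return True


def is_canonical(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """Strict check: every stride must match the C-order value exactly."""
    return tuple(strides) == canonical_stride(shape)


shape, strides = (1, 4, 5), (5, 5, 1)
print(passes_contiguity_check(shape, strides))  # True
print(is_canonical(shape, strides))             # False
print(canonical_stride(shape))                  # (20, 5, 1)
```

The two checks disagree only on the stride of the size-1 leading dimension, which is exactly the case a kernel that reads strides directly would trip over.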

Tests cover: `_canonical_contiguous` no-op vs re-materialize, scan/triton
output parity under non-canonical strides, scan-under-`torch.compile`
aliasing pin for both LSTM and GRU, LSTM scan/pad PPO-advantage parity,
non-canonical hidden-buffer regression for scan, and a thread-local
`set_recurrent_mode` test.

[ghstack-poisoned]

pytorch-bot · 2026-05-15T07:20:38Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/3754

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

Run pull request jobs on OSDC runners in shadow mode

❌ 3 New Failures, 1 Cancelled Job

As of commit 244f61c with merge base 0a01ee8:

NEW FAILURES - The following jobs have failed:

Build Windows Wheels / pytorch/rl / build-wheel-py3_10-cpu (gh)
Process completed with exit code 1.
Build Windows Wheels / pytorch/rl / upload / upload-wheel-py3_10-cpu (gh)
Unable to download artifact(s): Artifact not found for name: pytorch_rl__3.10_cpu_x64
Unit-tests on Linux / tests-stable-gpu (3.12, 13.0, 2) / linux-job (gh)
test/test_collectors.py::TestUpdateParams::test_param_sync_mixed_device[SharedMemWeightSyncScheme-cpu-env_device0-False-MultiAsyncCollector]

CANCELLED JOB - The following job was cancelled. Please retry:

Unit-tests on Windows / unittests-cpu (3.10, windows.4xlarge, cpu) / windows-job (gh)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

vmoens · 2026-05-15T07:47:37Z

+            hidden_per_step = _canonical_contiguous(hidden0_in[..., layer, :])
+            cell_per_step = _canonical_contiguous(hidden1_in[..., layer, :])


What if we did that for hidden0_in once, before the loop? Would the per-layer view still be contiguous?
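A quick stride calculation suggests the answer depends on the number of layers. The sketch below is pure Python and assumes the buffer is laid out as `[batch, num_layers, hidden]` and that `hidden0_in[..., layer, :]` simply drops the layer dimension while keeping the remaining strides; the helper names are hypothetical.

```python
from typing import Tuple


def canonical_stride(shape: Tuple[int, ...]) -> Tuple[int, ...]:
    """Row-major (C-order) strides for `shape`, counted in elements."""
    strides = []
    running = 1
    for size in reversed(shape):
        strides.append(running)
        running *= max(size, 1)
    return tuple(reversed(strides))


def select(shape, strides, dim):
    """Shape and strides of the view obtained by integer-indexing along `dim`."""
    dim %= len(shape)
    return shape[:dim] + shape[dim + 1:], strides[:dim] + strides[dim + 1:]


def per_layer_view_is_canonical(batch: int, num_layers: int, hidden: int) -> bool:
    """Start from a canonical parent buffer and inspect the per-layer view."""
    parent_shape = (batch, num_layers, hidden)
    parent_strides = canonical_stride(parent_shape)
    view_shape, view_strides = select(parent_shape, parent_strides, -2)
    return view_strides == canonical_stride(view_shape)


print(per_layer_view_is_canonical(3, 2, 8))  # False: view strides are (16, 1)
print(per_layer_view_is_canonical(1, 2, 8))  # False: view strides are (16, 1)
print(per_layer_view_is_canonical(3, 1, 8))  # True: view strides are (8, 1)
```

Under these assumptions, canonicalizing the parent once before the loop would only be enough for a single layer; with more than one layer the per-layer view keeps the parent's batch stride.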

vmoens · 2026-05-15T07:48:12Z

                b_hh = zeros if b_hh is None else b_hh

-            hidden_per_step = hidden_in[..., layer, :]
+            hidden_per_step = _canonical_contiguous(hidden_in[..., layer, :])


Ditto: should we do it before the loop?

[ghstack-poisoned]

…lones + thread-local recurrent mode

Three intertwined fixes to the scan / triton recurrent backends.

- Canonical-stride check. A `[1, 4, 5]` tensor with strides `(5, 5, 1)` passes `is_contiguous()` but `torch._higher_order_ops.scan` and the triton kernels read strides directly and reject non-canonical layouts. Add `_canonical_stride` + `_canonical_contiguous` and re-materialize inputs / hidden buffers when strides drift off the C-canonical layout.
- cuDNN flat-storage aliasing. `nn.LSTM` / `nn.GRU` with cuDNN flatten all per-layer parameters into a single storage; the scan HOP tracer walks the FakeTensor graph and rejects the aliased per-layer views as inputs. Clone the weight views before closing the scan body. The per-layer carry now also clones `x_t` and the transpose+flatten output (the only remaining aliasing edge) so the existing `.clone()` on the full `torch.stack(...)` carry can drop.
- Thread-local `recurrent_mode`. `_ContextManager` was a single mutable module-level flag, so spawning a collector worker thread saw the parent's recurrent_mode setting. Wrap in `_RecurrentModeContextManager` using `contextvars.ContextVar` so per-thread state is isolated.

Tests cover: `_canonical_contiguous` no-op vs re-materialize, scan/triton output parity under non-canonical strides, scan-under-torch.compile aliasing pin for both LSTM and GRU, LSTM scan/pad PPO-advantage parity, non-canonical hidden-buffer regression for scan, and a thread-local `set_recurrent_mode` test.

ghstack-source-id: 40fca02
Pull-Request: #3754
Co-authored-by: Cursor <cursoragent@cursor.com>

Update

608ff7e

[ghstack-poisoned]

meta-cla Bot added the CLA Signed label May 15, 2026

github-actions Bot added Modules Integrations/torch_geometric Integrations Performance and removed CLA Signed labels May 15, 2026


vmoens mentioned this pull request May 15, 2026

[Feature] Collector.fake_tensordict() / MultiCollector.fake_tensordict() #3761

Closed

Update

20c799d

[ghstack-poisoned]

This was referenced May 15, 2026

[Feature] timeit.mark_start/mark_end for non-context-manager timing #3762

Merged

[Example] Isaac Lab RNN PPO with compact memory + knowledge-base notes #3763

Merged

[Feature] Collector.fake_tensordict() / MultiCollector.fake_tensordict() #3764

Merged

meta-cla Bot added the CLA Signed label May 15, 2026

This was referenced May 15, 2026

[Refactor] Keep [B, T] dim in value estimators #3767

Merged

[Refactor] Simplify LSTM/GRUModule recurrent-mode shape normalization #3768

Merged

[Example] Add Isaac RNN PPO rollout mode flags #3769

Merged

Update

e2f4b36

[ghstack-poisoned]

This was referenced May 17, 2026

[Test] Enable scan compile RNN tests on Windows #3770

Closed

[BugFix] Fix GAE compact path bias on recurrent value nets at internal truncations #3771

Merged

[Example] Expose compact GAE cat dimension #3775

Merged

vmoens mentioned this pull request May 18, 2026

[Doc] Migrate shifted=True callers to legacy/compact + docstring polish #3776

Merged

Update

244f61c

[ghstack-poisoned]

vmoens merged commit 244f61c into gh/vmoens/270/base May 18, 2026
107 of 113 checks passed

vmoens deleted the gh/vmoens/270/head branch May 18, 2026 21:01

vmoens merged 4 commits into gh/vmoens/270/base from gh/vmoens/270/head
