Is your feature request related to a problem? Please describe.
Hey guys, we have a custom transformer implementation with some parts wrapped in shard_map, but the use of with_sharding_constraint here
rng_state = with_sharding_constraint(rng_state, PartitionSpec(get_all_mesh_axes(), None))
makes it impossible to use fused_attn_fwd inside shard_map, and this is essentially the only line causing the problem. To work around it, I copy-pasted the source code of the following functions:
fused_attn_fwd
_fused_attn_fwd_rule
_fused_attn
fused_attn
while removing that particular line, which makes it possible to use cuDNN attention properly inside shard_map (a minimal sketch of the conflict is below).
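For context, here is a minimal, self-contained sketch of the kind of conflict described above: applying a global sharding constraint to an array inside a shard_map-wrapped function. This is not the TransformerEngine code; the mesh axis name "dp", the shapes, and the inner function are made up for illustration, and the exact failure mode depends on the JAX version.

```python
# Minimal sketch of the conflict: a global sharding constraint applied
# inside a shard_map region. All names and shapes here are illustrative.
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
from jax.experimental.shard_map import shard_map

mesh = Mesh(jax.devices(), axis_names=("dp",))

def attn_like(x, rng_state):
    # Analogue of the line inside fused_attn_fwd: constrain rng_state to be
    # sharded over the mesh axes. Inside shard_map, x and rng_state are
    # per-device shards rather than global arrays, so this global constraint
    # conflicts with the manually sharded region (exact error depends on the
    # JAX version).
    rng_state = jax.lax.with_sharding_constraint(
        rng_state, NamedSharding(mesh, P("dp", None))
    )
    return x + rng_state.sum()

f = jax.jit(
    shard_map(
        attn_like,
        mesh=mesh,
        in_specs=(P("dp"), P("dp", None)),
        out_specs=P("dp"),
    )
)

x = jnp.zeros((8 * jax.device_count(), 16))
rng_state = jnp.zeros((2 * jax.device_count(), 4))
# f(x, rng_state)  # fails / is ill-defined because of the constraint above
```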
Describe the solution you'd like
I wonder if we could move this sharding constraint somewhere else, outside of the fused attention forward?
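As an illustration of what "moving it outside" could look like, here is a hypothetical sketch (not TransformerEngine's actual API; the names inner_fused_attn and model_step, the mesh axis "dp", and all shapes are made up) where the caller constrains rng_state once at the regular jit trace level, outside shard_map, and the attention forward itself no longer calls with_sharding_constraint:

```python
# Hypothetical sketch: the rng_state sharding constraint lives at the call
# site, outside the shard_map region, instead of inside the fused-attention
# forward. Names, shapes, and the inner function body are illustrative only.
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
from jax.experimental.shard_map import shard_map

mesh = Mesh(jax.devices(), axis_names=("dp",))

def inner_fused_attn(x, rng_state):
    # Stands in for a fused attention forward without the internal
    # with_sharding_constraint, so it is safe to call under shard_map.
    return x + rng_state.sum()

@jax.jit
def model_step(x, rng_state):
    # The constraint is applied here, on the global array, where
    # with_sharding_constraint is well defined.
    rng_state = jax.lax.with_sharding_constraint(
        rng_state, NamedSharding(mesh, P("dp", None))
    )
    attn = shard_map(
        inner_fused_attn,
        mesh=mesh,
        in_specs=(P("dp"), P("dp", None)),
        out_specs=P("dp"),
    )
    return attn(x, rng_state)

x = jnp.zeros((8 * jax.device_count(), 16))
rng_state = jnp.zeros((2 * jax.device_count(), 4))
out = model_step(x, rng_state)
```

This would keep the sharding hint for the non-shard_map path while leaving the forward itself free of global constraints, so no copy-pasting of the fused attention functions would be needed.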