test mla and nsa performance on modal by 120L021326 · Pull Request #11 · deciding/txl

120L021326 · 2025-11-18T11:21:14Z

No description provided.

120L021326 · 2025-11-18T11:27:01Z

mla_decoding log:

-------------------------------
Running on TestParam(b=132, s_q=2, s_k=4096, is_varlen=False, is_causal=False, is_fp8=False, topk=None, test_performance=True, is_all_indices_invalid=False, have_zero_seqlen_k=False, block_size=64, h_q=128, h_kv=1, d=576, dv=512, seed=0)...
Correctness check passed!
ref_out result sample: tensor([-0.0012,  0.0035, -0.0003, -0.0001,  0.0035,  0.0009, -0.0003,  0.0009],
       device='cuda:0', dtype=torch.bfloat16)
flash_mla_out result sample: tensor([-0.0012,  0.0035, -0.0003, -0.0001,  0.0035,  0.0009, -0.0003,  0.0009],
       device='cuda:0')
txl_mla_out result sample: tensor([-0.0012,  0.0035, -0.0003, -0.0001,  0.0035,  0.0009, -0.0003,  0.0009],
       device='cuda:0')
===============================
Running performance test...
FLASH MLA: 0.485 ms, 621 TFLOPS, 1436 GB/s
TXL MLA: 0.498 ms, 605 TFLOPS, 1398 GB/s
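
For reference, the TFLOPS and GB/s figures above can be reproduced from the test parameters and the measured time. The sketch below is an assumption about how the benchmark counts work, not code taken from this PR: it counts 2 FLOPs per multiply-add over both the QK (`d`) and PV (`dv`) products, and counts one read of the KV cache and Q plus one write of the output at 2 bytes per element (bf16). The function name `mla_decode_throughput` is hypothetical.

```python
def mla_decode_throughput(b, s_q, s_k, h_q, h_kv, d, dv, ms, bytes_per_elem=2):
    """Return (TFLOPS, GB/s) for one dense MLA decoding run.

    Assumed cost model (not taken from the PR):
      FLOPs = 2 * s_q * (b * s_k) * h_q * (d + dv)
      bytes = (KV cache + Q + output) * bytes_per_elem
    """
    total_k = b * s_k  # total number of KV tokens across the batch
    flops = 2 * s_q * total_k * h_q * (d + dv)
    elems = total_k * h_kv * d + b * s_q * h_q * (d + dv)
    seconds = ms * 1e-3
    return flops / seconds / 1e12, elems * bytes_per_elem / seconds / 1e9


# Parameters from the TestParam line in the log above.
params = dict(b=132, s_q=2, s_k=4096, h_q=128, h_kv=1, d=576, dv=512)
flash_tflops, flash_gbs = mla_decode_throughput(ms=0.485, **params)
txl_tflops, txl_gbs = mla_decode_throughput(ms=0.498, **params)
print(f"FLASH MLA: {flash_tflops:.0f} TFLOPS, {flash_gbs:.0f} GB/s")
print(f"TXL MLA: {txl_tflops:.0f} TFLOPS, {txl_gbs:.0f} GB/s")
```

Under this cost model the rounded values match the logged 621/1436 and 605/1398 figures, which puts the TXL decoding kernel within about 3% of FlashMLA for this configuration.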

mla_prefill log:

================
Running on TestParam(b=1, s_q=64, s_kv=128, topk=128, h_q=128, h_kv=1, d_qk=576, d_v=512, seed=0, check_correctness=True, benchmark=True)
FlashMLA Prefill:    17 us, 135.404 TFlops
TXL MLA Prefill:    122 us, 18.744 TFlops
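
As a sanity check on the prefill numbers, the logged TFlops can be converted back into times. This is a sketch under an assumed cost model for sparse (top-k) attention, 2 FLOPs per multiply-add over `topk` selected keys per query; the helper names are hypothetical and not part of this PR.

```python
def sparse_prefill_flops(s_q, h_q, topk, d_qk, d_v):
    """Assumed FLOP count: each query attends to topk keys (QK and PV products)."""
    return 2 * s_q * h_q * topk * (d_qk + d_v)


def implied_time_us(flops, tflops):
    """Time in microseconds implied by a FLOP count and a TFLOPS figure."""
    return flops / (tflops * 1e12) * 1e6


# Parameters from the TestParam line in the log above.
flops = sparse_prefill_flops(s_q=64, h_q=128, topk=128, d_qk=576, d_v=512)
flash_us = implied_time_us(flops, 135.404)
txl_us = implied_time_us(flops, 18.744)
print(f"FlashMLA: {flash_us:.1f} us, TXL: {txl_us:.1f} us, "
      f"slowdown: {txl_us / flash_us:.1f}x")
```

The implied times round to the logged 17 us and 122 us, so in this configuration the TXL prefill kernel is roughly 7x slower than FlashMLA.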

120L021326 · 2025-11-18T12:05:06Z

The TXL decoding kernel currently runs only when Batch % NUM_SMS == 0. On Modal the SM count is 132, so I set the test Batch to 132.
The PCIe version of the H100 has 114 SMs.
TODO: support varied Batch and varied kv_seq_len.
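
A minimal sketch of the constraint described above, for picking a test batch on a given GPU. The helper names are hypothetical, and padding the batch up to a multiple of the SM count is only an illustrative workaround, not something this PR implements. On a real device the SM count can be read from `torch.cuda.get_device_properties(0).multi_processor_count`; it is passed in explicitly here so the snippet runs without a GPU.

```python
def is_supported_batch(batch: int, num_sms: int) -> bool:
    """True if the batch satisfies Batch % NUM_SMS == 0."""
    return batch % num_sms == 0


def padded_batch(batch: int, num_sms: int) -> int:
    """Smallest multiple of num_sms that is >= batch (illustrative workaround)."""
    if batch <= 0 or num_sms <= 0:
        raise ValueError("batch and num_sms must be positive")
    return -(-batch // num_sms) * num_sms  # ceiling division, then scale back up


# SM counts quoted in the comment above: 132 on Modal, 114 on the PCIe H100.
for num_sms in (132, 114):
    print(num_sms, is_supported_batch(132, num_sms), padded_batch(132, num_sms))
```

With Batch = 132 the check passes for 132 SMs but fails for 114 SMs, where the next valid batch would be 228.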

120L021326 · 2025-11-18T13:11:58Z

docker/flash_mla/txl_mla_interface.py

+            accL = accL / l_i[:, None]
+            # m_ptrs = M + off_z * (H * N_Q) + off_kvh * (heads_per_kv * N_Q) + offs_m
+            # tl.store(m_ptrs, m_i) 
+


This tl.store blocks the Batch loop, so it is commented out for now.

120L021326 added 2 commits November 18, 2025 19:19

test mla and nsa performance on modal

7ef3cb9

add modal file

5b99796

tl.store cause bug

592971e

120L021326 added 15 commits November 20, 2025 09:53

fix performance test problem

54d7503

Merge remote-tracking branch 'origin/main' into mla/modal_test

be9190c

add draw.py; add mla benchmark

5842775

change color; delete nouse file

26bff23

fix mla tilelang

8f4c007

fix gemm FP8

9147c51

add modal gemm

f653350

fix attn FP8

c430fc2

verified mla performance

aec2289

add flashinfer test

64be2a6

add mla sq2 draw

e2cc15a

update fp8 gemm performance

9de279d

update attn FP8 performance

0732868

update txl attn causal

96cedca

update gemm benchmark

1c0b84e

120L021326 wants to merge 18 commits into main from mla/modal_test
