
Test MLA and NSA performance on Modal #11

Open
120L021326 wants to merge 18 commits into main from mla/modal_test


@120L021326
Collaborator

No description provided.

@120L021326
Collaborator Author

mla_decoding log:

-------------------------------
Running on TestParam(b=132, s_q=2, s_k=4096, is_varlen=False, is_causal=False, is_fp8=False, topk=None, test_performance=True, is_all_indices_invalid=False, have_zero_seqlen_k=False, block_size=64, h_q=128, h_kv=1, d=576, dv=512, seed=0)...
Correctness check passed!
ref_out result sample: tensor([-0.0012,  0.0035, -0.0003, -0.0001,  0.0035,  0.0009, -0.0003,  0.0009],
       device='cuda:0', dtype=torch.bfloat16)
flash_mla_out result sample: tensor([-0.0012,  0.0035, -0.0003, -0.0001,  0.0035,  0.0009, -0.0003,  0.0009],
       device='cuda:0')
txl_mla_out result sample: tensor([-0.0012,  0.0035, -0.0003, -0.0001,  0.0035,  0.0009, -0.0003,  0.0009],
       device='cuda:0')
===============================
Running performance test...
FLASH MLA: 0.485 ms, 621 TFLOPS, 1436 GB/s
TXL MLA: 0.498 ms, 605 TFLOPS, 1398 GB/s

mla_prefill log:

================
Running on TestParam(b=1, s_q=64, s_kv=128, topk=128, h_q=128, h_kv=1, d_qk=576, d_v=512, seed=0, check_correctness=True, benchmark=True)
FlashMLA Prefill:    17 us, 135.404 TFlops
TXL MLA Prefill:    122 us, 18.744 TFlops
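
The TFLOPS figures in the decoding log can be cross-checked from the test parameters. The sketch below assumes the benchmark counts 2 * b * s_q * h_q * s_k * (d + dv) floating-point operations per call (one multiply-add pair for each of QK^T and PV); that formula is an assumption, not taken from the PR, and `attention_tflops` is a hypothetical helper name.

```python
def attention_tflops(b, s_q, s_k, h_q, d, dv, seconds):
    """Estimate attention throughput in TFLOPS under an assumed FLOP model."""
    # Assumption: QK^T costs 2*d FLOPs and PV costs 2*dv FLOPs
    # for every (batch, query, head, key) combination.
    flops = 2 * b * s_q * h_q * s_k * (d + dv)
    return flops / seconds / 1e12


# Parameters and timings copied from the mla_decoding log.
decode = dict(b=132, s_q=2, s_k=4096, h_q=128, d=576, dv=512)
flash = attention_tflops(**decode, seconds=0.485e-3)
txl = attention_tflops(**decode, seconds=0.498e-3)

print(f"FLASH MLA: {flash:.0f} TFLOPS")
print(f"TXL MLA:   {txl:.0f} TFLOPS")
print(f"TXL / FLASH latency ratio: {0.498 / 0.485:.3f}")
```

Under this assumption the computed values round to the 621 and 605 TFLOPS reported in the log, and the latency ratio shows TXL decoding roughly 3% slower than FlashMLA for this configuration.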

@120L021326
Collaborator Author

120L021326 commented Nov 18, 2025

The TXL decoding kernel can currently run only when Batch % NUM_SMS == 0. On Modal the SM count is 132, so I set the test batch size to 132.
The PCIe version of the H100 has 114 SMs.
TODO: support arbitrary Batch and kv_seq_len values.
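
A minimal sketch of the batch constraint described above. Both function names are hypothetical and not part of the PR. Rounding the batch up to the next multiple of the SM count is only one possible interim workaround, shown for illustration; the PR itself does not do this.

```python
def can_run_txl_decoding(batch, num_sms):
    """Return True if the batch size satisfies Batch % NUM_SMS == 0."""
    return batch % num_sms == 0


def padded_batch(batch, num_sms):
    """Round batch up to the next multiple of num_sms (hypothetical workaround)."""
    # Ceiling division without floats: -(-a // b) == ceil(a / b) for positive ints.
    return -(-batch // num_sms) * num_sms


# SM counts taken from the comment: 132 on Modal, 114 on the PCIe H100.
print(can_run_txl_decoding(132, 132))
print(can_run_txl_decoding(132, 114))
print(padded_batch(132, 114))
```

With a batch of 132, the check passes for 132 SMs and fails for 114 SMs, where the batch would have to grow to 228 to satisfy the constraint.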

accL = accL / l_i[:, None]
# m_ptrs = M + off_z * (H * N_Q) + off_kvh * (heads_per_kv * N_Q) + offs_m
# tl.store(m_ptrs, m_i)

Collaborator Author


This tl.store blocks the Batch loop.
