Add CPU kernel skills #614
Hi @jiqing-feng, thanks for your interest in contributing! This project requires that pull request authors are vouched, and you are not in the list of vouched users. This PR will be closed automatically. See https://github.com/huggingface/kernels/blob/main/CONTRIBUTING.md for more details.

Hi @sywangyi @YangKai0616. Please take a quick look. Thanks!

Is this PR ready to be reviewed?
Yes, please. |
- `cuda-kernels` (default)
- `rocm-kernels`
- `xpu-kernels`
- `cpu-kernels`
It would be nice to add a note on where CPU kernels are actually helpful.
Done. Please review the new changes and rerun the CI. Thanks!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
> [!TIP]
> **When are CPU kernels actually helpful?** Two main cases:
> - **Better performance on Intel Xeon**: custom AVX2/AVX512 kernels (and AMX via brgemm for quantized GEMM) outperform generic PyTorch ops for element-wise and quantized workloads, especially in CPU-only or latency-sensitive serving.
> - **Enabling functionality that otherwise can't run**: some kernels are a hard requirement, e.g. `megablocks` MoE on CPU, where without the kernel you simply cannot run MXFP4.
Nice! Can you provide some example kernels that you have built for CPU?
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Failing tests are unrelated. Thanks for your contributions.
Summary
Adds a `cpu-kernels` skill for `kernel-builder` that guides writing, optimizing, and benchmarking C++ CPU kernels (AVX2/AVX512) for the Hugging Face kernels ecosystem.
What's included
performance exploration with trial tracking and backtracking.
`torch.utils.benchmark`), `perf stat` profiling, and trial management.
`build.toml` multi-target compilation, SIMD patterns, quantized GEMM / brgemm, threading, memory, and correctness constraints.
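
The description mentions trial tracking and backtracking, but the thread does not show how the skill implements them. The following is only a rough sketch of what such a log could look like, assuming each trial is recorded with a median runtime and a correctness flag. The names `Trial`, `TrialLog`, and `should_backtrack`, the variant labels, and the timing values are all hypothetical placeholders, not measurements or code from this PR.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Trial:
    name: str                     # label for the kernel variant (hypothetical)
    median_s: float               # median runtime in seconds, e.g. from a benchmark run
    correct: bool                 # whether the output matched the reference
    parent: Optional[str] = None  # variant this trial was derived from


@dataclass
class TrialLog:
    trials: list = field(default_factory=list)

    def record(self, name, median_s, correct, parent=None):
        """Append a trial to the log and return it."""
        trial = Trial(name, median_s, correct, parent)
        self.trials.append(trial)
        return trial

    def best(self):
        """Fastest trial that passed the correctness check, or None."""
        valid = [t for t in self.trials if t.correct]
        return min(valid, key=lambda t: t.median_s) if valid else None

    def should_backtrack(self, last):
        """True if `last` is incorrect or slower than the best valid trial."""
        best = self.best()
        if not last.correct or best is None:
            return True
        return last.median_s > best.median_s


# Placeholder timings chosen for illustration only.
log = TrialLog()
base = log.record("scalar-baseline", 1.0e-3, True)
t1 = log.record("avx2", 4.0e-4, True, parent=base.name)
t2 = log.record("avx2-unroll8", 5.5e-4, True, parent=t1.name)   # slower than its parent
t3 = log.record("avx512-variant", 2.0e-4, False, parent=t1.name)  # fast but wrong output

print(log.best().name)
```

In this sketch an incorrect trial never becomes the best, however fast it is, and a regression such as `avx2-unroll8` signals that the next variant should be derived from `log.best()` rather than from the latest attempt.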