Vortex: Programmable Sparse Attention for Agents as Algorithm Designers

Vortex turns sparse-attention algorithm design into something AI agents can do. Sparse attention is increasingly essential for serving LLMs as generation lengths grow — but deploying and evaluating new sparse-attention algorithms at scale has been highly engineering-intensive, slowing both human researchers and AI agents as they explore the design space.

Vortex couples a Python-embedded frontend over a page-centric tensor abstraction — concise enough to express a broad range of sparse-attention algorithms — with an efficient backend tightly integrated into modern LLM serving stacks (SGLang). A new algorithm goes from idea to deployed-and-benchmarked in minutes, turning its theoretical efficiency into real-world throughput without touching core model code.

This makes Vortex a platform for autonomous algorithm discovery: AI agents generate and refine diverse sparse-attention algorithms with Vortex — the best reaching up to 3.46× higher throughput than full attention while preserving accuracy. Vortex also extends sparse attention to emerging architectures and very large models that are otherwise hard to experiment with (up to 4.7× on the MLA-based GLM-4.7-Flash and 1.37× on the 229B-parameter MiniMax-M2.7), and doubles as a research instrument for understanding where the routing signal lives in sparse attention.

(a) A workflow to study sparse attention algorithms using Vortex. (b) Agent-generated sparse attention (Qwen3-1.7B, AIME, NVIDIA H200): each point is one algorithm generated or optimized by AI agents with Vortex — the best reaches up to 3.46× the throughput of full attention while preserving accuracy.

✨ Key Features

Easy Programming
Program sparse attention with a PyTorch-like frontend, without handling batching, caching, or paged attention yourself.
High Performance
Built to work with FlashInfer, CUDA Graph, and Radix Attention for efficient LLM inference.
Agent Native
Designed for autonomous algorithm discovery: AI agents generate, benchmark, and refine sparse attention end-to-end, with a Claude Code workspace and an OpenHands demo built in.

🚀 Installation

git clone --recursive https://github.com/Infini-AI-Lab/vortex_torch.git
cd vortex_torch

# Install the SGLang dependency (vendored under third_party/)
cd third_party/sglang/v0.5.9/sglang
pip install -e "python"
cd ../../../../

# Install Vortex from the repo root
pip install -e .

🤖 Innovate Sparse Attention with OpenHands

Vortex is designed not only for hand-crafted sparsity patterns but also for AI-generated sparse attention.

Our demo shows how to use the state-of-the-art agent OpenHands (https://openhands.dev/) to generate sparse attention algorithms.

OpenHands generates a sparse attention algorithm (up to 2.7× speedup in this example).

export LLM_API_KEY=YOUR_API_KEY
python AI/openhands_gen.py

The OpenHands installation and usage guide is available at https://docs.openhands.dev/sdk.

Note: Some operators are not yet fused or fully optimized, which can increase memory usage and slow generation during inference. If you hit a CUDA out-of-memory error, lower mem_fraction_static.

🧠 Iterate & Innovate with Claude Code

Long-horizon autonomous optimization on AIME24 (Claude Opus 4.7, Qwen3-1.7B, 23 iterations, 92 submissions): (a) mean@16 per iteration, (b) throughput per iteration, (c) the accuracy–throughput frontier of all submissions, colored by iteration order.

Vortex ships a Claude Code workspace (.claude/) that turns the framework into an autonomous algorithm scientist: Claude writes sparse-attention submissions, compiles them, benchmarks them on AIME24, and pushes the accuracy/throughput Pareto frontier outward — one batch at a time. Start a session from the repo root and drive it with slash commands:

Command	What it does
`/new-submission <name>`	Scaffold a new submission pair (`.py` + `.json`).
`/preflight <name>`	Cheap CPU-only config check before spending GPU time.
`/innovate <N> [theme]`	Innovate — draft `N` genuinely novel algorithms in one shot. All must compile; no benchmark loop. Great for brainstorming ideas the literature doesn't cover.
`/iterate [--max-iterations <N>]`	Iterate — the long-horizon loop: design 4 orthogonal variants → pre-flight → RULER quality gate → benchmark on AIME24 → analyse → repeat. Autonomously maps the Pareto frontier.
`/batch-benchmark <n1> <n2> <n3> <n4>`	Launch a 4-variant batch on the currently-free GPUs.
`/review <name>`	Audit a submission against the contract without editing it.

Innovate (explore) and iterate (exploit) are complementary: /innovate generates fresh, compile-checked ideas with no GPU cost, while /iterate benchmarks four variants per batch and folds the results back into algorithm_scientist/memory.md so later sessions resume from the running best. The full contract, knobs, and benchmark protocol live under AI/ (start with AI/AGENTS.md) and papers/guide.md.

# from a Claude Code session opened at the repo root
/innovate 4 channel-sparsity      # draft 4 novel ideas to explore
/iterate --max-iterations 3       # run 3 autonomous benchmark batches

🧩 Quick Example: Custom Sparse Attention

Scaling to a 229B model with tensor parallelism — MiniMax-M2.7 (229B) on AIME26 with 32K-token generation on four NVIDIA B200 GPUs (TP=4): (a) mean@16, (b) pass@4, (c) pass@8 versus end-to-end throughput. Block top-k and Quest sweep the number of attended blocks; the star marks the full-attention operating point.

A working setup is two files:

The flow module (this section) — a .py file that defines your sparse-attention algorithm as a vFlow subclass and @registers it under a name. It contains only vortex ops; it never imports sglang.
The launch script (next section) — imports sglang + vortex_torch and starts the engine pointing at the flow by its registered name.

1. Define the flow — `custom_sparse_attention.py`

A vFlow declares its cache layout in create_cache, refreshes per-page state in forward_cache, and scores/selects pages every decode step in forward_indexer. Save the snippet below anywhere on disk (e.g. custom_sparse_attention.py) — you'll point the engine at it by path + registered name.

from typing import Dict
import torch

from vortex_torch.flow import vFlow, register
from vortex_torch.indexer import GeMM, Mean, topK
from vortex_torch.cache import Mean as CMean
from vortex_torch.abs import ContextBase


@register("custom_sparse_attention")
class CustomSparseAttention(vFlow):

    def __init__(self):
        super().__init__()
        # Indexer-side ops (run every decode step)
        self.mean = Mean(dim=1)        # average over the query heads
        self.gemm = GeMM()             # GeMM(x, y) = y @ xᵀ
        self.output_func = topK()      # must end in topK / approxTopK

        # Cache-side ops (run once per finished page)
        self.reduction = CMean(dim=1)  # one centroid (mean key) per page

    def forward_indexer(
        self,
        q: torch.Tensor,                 # viewed as [B, H_q, D]
        o: torch.Tensor,
        cache: Dict[str, torch.Tensor],  # viewed as [S, r, c] per create_cache()
        ctx: ContextBase,
    ):
        # No native torch ops here — every tensor flows through vortex ops.
        q_mean = self.mean(q, ctx=ctx)                          # [B, 1, D]
        score = self.gemm(q_mean, cache["centroids"], ctx=ctx)  # [S, 1, 1]
        self.output_func(score, o, ctx=ctx)                     # selected pages -> o

    def forward_cache(
        self,
        cache: Dict[str, torch.Tensor],  # viewed as [B, r, c] per create_cache()
        loc: torch.Tensor,
        ctx: ContextBase,
    ):
        # triggered only when a page is finished
        self.reduction(cache["k"], cache["centroids"], loc=loc, ctx=ctx)

    def create_cache(self, block_size: int, head_dim: int):
        # "k" and "v" are provided automatically — do not declare them
        return {
            "centroids": (1, head_dim),
        }
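
To see what this flow computes, here is a plain-Python reference sketch of the same idea for a single sequence: average the query heads, keep one mean-key centroid per finished page, score pages by dot product, and keep the top-k. It deliberately avoids vortex ops, so it ignores batching, multiple KV heads, reserved pages, and the real tensor layouts; the helper names are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vectors(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_pages(query_heads: List[Vector], keys: List[Vector],
                 page_size: int, topk: int) -> List[int]:
    """Score each finished page by <mean query, page centroid>, keep the top-k.

    query_heads: per-head query vectors for one sequence.
    keys:        one key vector per cached token.
    Returns selected page indices (ascending order, purely for readability).
    """
    q_mean = mean_vectors(query_heads)              # role of Mean(dim=1) on q
    num_pages = len(keys) // page_size              # only fully filled pages
    centroids = [
        mean_vectors(keys[p * page_size:(p + 1) * page_size])  # role of CMean
        for p in range(num_pages)
    ]
    scores = [dot(q_mean, c) for c in centroids]    # role of GeMM
    ranked = sorted(range(num_pages), key=lambda p: scores[p], reverse=True)
    return sorted(ranked[:topk])                    # role of topK


# Toy data: 4 pages of 2 tokens each, 2-dimensional keys.
keys = [
    [1.0, 0.0], [1.0, 0.0],     # page 0: aligned with the query
    [0.0, 1.0], [0.0, 1.0],     # page 1: nearly orthogonal
    [-1.0, 0.0], [-1.0, 0.0],   # page 2: opposite direction
    [0.5, 0.5], [0.5, 0.5],     # page 3: partially aligned
]
query_heads = [[1.0, 0.0], [1.0, 0.2]]   # mean query = [1.0, 0.1]
selected = select_pages(query_heads, keys, page_size=2, topk=2)
print(selected)
```

In the real flow the centroid is written once per finished page in forward_cache, so each decode step only pays for the small query-versus-centroid product rather than touching every key.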

🏃 Launch it with SGLang

The launch script is a separate file from the flow. It imports sglang and vortex_torch, then starts the engine. Importing vortex_torch is what wires vortex into sglang's decode loop (it installs the ServerArgs ↔ VortexConfig adapter), so the import is required even though you don't call it directly.

import sglang as sgl
import vortex_torch  # noqa: F401 — import for side effect: installs the VortexConfig adapter
from vortex_torch.engine.sgl.config import VortexConfig

llm = sgl.Engine(
    # --- standard sglang engine args ---
    model_path="Qwen/Qwen3-0.6B",
    page_size=16,                     # KV page size (pages are vortex's unit of sparsity)
    attention_backend="flashinfer",   # Mandatory
    disable_overlap_schedule=True,    # Mandatory
    disable_cuda_graph=False,
    mem_fraction_static=0.85,         # turn down if you hit CUDA OOM

    # --- all vortex knobs live in one object ---
    # Passing `vortex=` turns sparsity ON (no separate enable flag needed);
    # omit it and you get plain dense attention.
    vortex=VortexConfig(
        module_path="path/to/custom_sparse_attention.py",
        module_name="custom_sparse_attention",  # the @register name of your vFlow
        topk_val=30,                  # keep the 30 highest-scoring pages per query
        layers_skip=[0],              # layer 0 runs full/dense attention
        block_reserved_bos=1,         # always keep the first page (attention sink)
        block_reserved_eos=1,         # always keep the last (most recent) page
        max_seq_lens=8192,
    ),
)

What is `VortexConfig`?

VortexConfig is a single dataclass (vortex_torch/engine/sgl/config.py) that holds every vortex sparse-attention hyper-parameter in one place, instead of ~18 loose vortex_* arguments scattered across sglang's ServerArgs. Its presence on the engine is also the on/off switch: pass a VortexConfig and sparsity is enabled; leave it out and the model runs ordinary dense attention.

Every field, with what it controls and an example value:

Field	Explanation	Example
`module_path`	Path to the `.py` file holding your flow. `None` → vortex searches `vortex_torch.flow.algorithms`.	`"submissions/custom.py"`
`module_name`	The `@register(...)` name of the `vFlow` to load. Must match exactly.	`"custom_sparse_attention"`
`topk_val`	Static page budget — the fixed minimum number of pages each sequence keeps, regardless of length. The core accuracy↔throughput knob.	`30`
`topk_ratio`	Dynamic page budget — a fraction of the sequence's pages; the engine keeps `max(static floor, topk_ratio × num_pages)`. `0.0` disables it (use `topk_val` only).	`0.0625`
`max_topk_val`	Upper bound on the selected-page count, used to size/pick the top-k kernel variant. `None` → derived from `max_seq_lens`.	`256`
`layers_skip`	Layer indices that bypass sparse attention and run dense (e.g. early layers that need global context). `None` → all layers sparse.	`[0, 4, 8, 12]`
`block_reserved_bos`	Pages at the start of the sequence that are always selected (attention sink). Int ≥ 1.	`1`
`block_reserved_eos`	Pages at the end (most-recent tokens) that are always selected. Int ≥ 1.	`1`
`max_seq_lens`	Maximum sequence length to plan buffers for. `-1` → use the model default.	`8192`
`block_size`	Vortex page size (the unit of sparsity). Positive power of 2; smaller = finer granularity, larger = less cache-summary overhead. Defaults to sglang's `page_size`.	`16`
`workload_chunk_size`	Planner granularity — how many blocks are grouped into one indexer workload. Positive power of 2; a throughput-tuning knob.	`32`
`dtype`	dtype for intermediate indexer tensors. `"bfloat16"` is the tested default; `"float16"`/`"float32"`/`"fp8_e4m3"`/`"fp8_e5m2"` are accepted.	`"bfloat16"`
`compilation_cache_dir`	Directory for the JIT-compiled kernel cache. `None` → next to the compiler module.	`"/tmp/vortex_cache"`
`schedule_policy`	A CUDA C++ snippet that computes each sequence's page budget (see below). `None` → the default budget formula.	`None`
`attention_backend`	Sparse-attention kernel family: `"flashinfer"` (default) or `"trtllm"`.	`"flashinfer"`
`impl_backend`	Indexer op implementation backend: `"triton"` (default) or `"cuda"`.	`"triton"`
`use_tensor_core`	Enable tensor-core (bf16 `tl.dot`) codegen in the triton kernel. Only valid with `impl_backend="triton"`.	`False`
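
Several rows above carry constraints (powers of two, minimum values, options that are valid only in combination). The sketch below restates a few of them as executable checks. `ToyVortexConfig` is a hypothetical stand-in, not the real dataclass, and its default values are just the example column from the table, not the library's defaults; the real VortexConfig may validate differently or not at all.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyVortexConfig:
    """Hypothetical mirror of a few VortexConfig fields (not the real class)."""
    topk_val: int = 30
    block_reserved_bos: int = 1
    block_reserved_eos: int = 1
    block_size: int = 16
    workload_chunk_size: int = 32
    attention_backend: str = "flashinfer"
    impl_backend: str = "triton"
    use_tensor_core: bool = False


def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


def check(cfg: ToyVortexConfig) -> List[str]:
    """Return constraint violations taken from the field table; empty means OK."""
    problems = []
    if cfg.block_reserved_bos < 1 or cfg.block_reserved_eos < 1:
        problems.append("block_reserved_bos and block_reserved_eos must be >= 1")
    if not is_power_of_two(cfg.block_size):
        problems.append("block_size must be a positive power of 2")
    if not is_power_of_two(cfg.workload_chunk_size):
        problems.append("workload_chunk_size must be a positive power of 2")
    if cfg.attention_backend not in ("flashinfer", "trtllm"):
        problems.append('attention_backend must be "flashinfer" or "trtllm"')
    if cfg.impl_backend not in ("triton", "cuda"):
        problems.append('impl_backend must be "triton" or "cuda"')
    if cfg.use_tensor_core and cfg.impl_backend != "triton":
        problems.append('use_tensor_core requires impl_backend="triton"')
    return problems


ok = check(ToyVortexConfig())
bad = check(ToyVortexConfig(block_size=24, impl_backend="cuda", use_tensor_core=True))
print(ok)
print(bad)
```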

Programmable budget — the `schedule_policy`

schedule_policy is the most interesting knob: instead of a fixed formula, the per-sequence page budget is computed by a CUDA C++ snippet you provide. Vortex injects it as the body of a __device__ function, JIT-compiles it into the decode planner (cached by content hash), and runs it for every sequence on every backend (flashinfer and trtllm/MLA) — not just trtllm. When you leave it None, vortex uses this default body, which is exactly the standard budget formula:

// default schedule_policy — returns the number of pages to attend to.
const int static_kv_budget  = topk_val + block_reserved_bos + block_reserved_eos;
const int dynamic_kv_budget = int(cached_block_len * topk_ratio);
return max(static_kv_budget, dynamic_kv_budget);   // topk_val dominates short seqs, ratio long ones

The snippet must return an int (the page budget). The variables in scope are the live per-sequence values:

Variable	Meaning
`cached_block_len`	Number of cached KV pages currently in this sequence (its length, in pages).
`topk_val`	The configured static budget.
`topk_ratio`	The configured dynamic ratio.
`block_reserved_bos` / `block_reserved_eos`	Reserved sink / recent pages.
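
For intuition, the default body can be mirrored in plain Python and evaluated at a few sequence lengths. This is only an emulation of the formula shown above: the function name is hypothetical, the device code may use 32-bit floats where Python uses doubles, and the reserved-page values of 1 are example values rather than confirmed defaults.

```python
def default_page_budget(cached_block_len: int, topk_val: int, topk_ratio: float,
                        block_reserved_bos: int = 1,
                        block_reserved_eos: int = 1) -> int:
    """Python mirror of the default schedule_policy body."""
    static_kv_budget = topk_val + block_reserved_bos + block_reserved_eos
    dynamic_kv_budget = int(cached_block_len * topk_ratio)  # truncates like C's int()
    return max(static_kv_budget, dynamic_kv_budget)


# topk_val=30 with one sink page and one recent page gives a static floor of 32.
# With topk_ratio=0.0625 (1/16) the dynamic term overtakes it past 512 pages.
budgets = {
    pages: default_page_budget(pages, topk_val=30, topk_ratio=0.0625)
    for pages in (64, 512, 1024, 4096)
}
print(budgets)
```

Note that the formula itself does not clamp the budget to the sequence length, so a 10-page sequence still yields 32; how the engine treats that case is outside this sketch.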

Because it's real device code, you can express budget policies the two scalar knobs can't — e.g. a length-adaptive budget that grows slowly with context and then caps:

vortex=VortexConfig(
    module_path="path/to/custom_sparse_attention.py",
    module_name="custom_sparse_attention",
    topk_val=32,
    schedule_policy=r"""
        // base budget + 1 extra page per 64 cached pages, capped at 256
        const int base  = topk_val + block_reserved_bos + block_reserved_eos;
        const int extra = cached_block_len / 64;
        return min(base + extra, 256);
    """,
)

Other natural uses: a hard ceiling on long contexts, a step function that jumps the budget past a length threshold, or a sqrt(cached_block_len)-style sublinear schedule. The planner is JIT-compiled once per distinct snippet, so there's no per-step overhead.
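
Before committing a policy to CUDA, it can help to prototype the budget curve on the host. The sketch below emulates the length-adaptive snippet above plus two of the other shapes mentioned (a step function and a square-root schedule) in plain Python. All function names, the threshold, and the boosted value are hypothetical, and the reserved-page values of 1 are assumptions since the example config does not set them.

```python
import math


def base_budget(topk_val: int = 32, bos: int = 1, eos: int = 1) -> int:
    return topk_val + bos + eos


def linear_capped(cached_block_len: int, cap: int = 256) -> int:
    """Mirror of the snippet above: +1 page per 64 cached pages, capped."""
    extra = cached_block_len // 64       # C integer division for non-negative ints
    return min(base_budget() + extra, cap)


def step_up(cached_block_len: int, threshold: int = 1024, boosted: int = 128) -> int:
    """Step function: jump to a larger budget past a length threshold."""
    return boosted if cached_block_len >= threshold else base_budget()


def sqrt_schedule(cached_block_len: int) -> int:
    """Sublinear schedule: base budget plus floor(sqrt(length in pages))."""
    return base_budget() + math.isqrt(cached_block_len)


for pages in (0, 640, 4096, 14208, 100000):
    print(pages, linear_capped(pages), step_up(pages), sqrt_schedule(pages))
```

With these assumed values the linear policy starts at 34 pages and reaches its 256-page cap at 14208 cached pages, which is the kind of check that is cheaper to do here than after a JIT compile.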

Prefer the explicit VortexConfig(...) object above. The legacy flat form — sgl.Engine(enable_vortex_sparsity=True, vortex_topk_val=30, vortex_module_name=..., ...) — still works (the adapter folds those vortex_* kwargs into a VortexConfig for you), but the object is clearer and self-documenting.

🧬 MLA Models (DeepSeek-V3 / GLM-4.7 / Kimi-style)

Sparse attention on an MLA model — GLM-4.7-Flash on AIME26 with 32K-token generation on a single NVIDIA B200: (a) mean@16, (b) pass@4, (c) pass@8 versus end-to-end throughput. Three MLA flows are expressed in vFlow (rope-aware block-sparse, rope-unaware block-sparse, and Quest), sweeping attended blocks with block sizes 16/32/64.

Models with Multi-head Latent Attention (DeepSeek-V2/V3, GLM-4.7-Flash, Kimi, …) compress the KV cache into a single shared low-rank latent instead of per-head K/V. Vortex supports them with a parallel base class, vFlowMLA. The shape of a flow is the same — create_cache / forward_cache / forward_indexer — but with MLA conventions:

The cache exposes one auto-provided field, cache["latent"] (the fused [ kv_c | k_pe ], inner dim kv_lora_rank + qk_rope_head_dim) — there is no "k" / "v".
create_cache(block_size, kv_lora_rank, qk_rope_head_dim) declares only your aux tensors (e.g. centroids).
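forward_indexer receives the fused absorbed query q = [ q_nope_out | q_pe ]. Because both the query and the latent are concatenations, a single dot product ⟨q, latent⟩ already equals the full decode logit ⟨q_nope, kv_c⟩ + ⟨q_pe, k_pe⟩, RoPE included. Since a centroid is the mean of a page's latents, ⟨q, centroid⟩ is by linearity the mean of that page's full logits.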
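
The fused-dot identity in the last item, and what it implies for a page centroid (a mean of latents), can be checked numerically. The sketch below uses toy dimensions and random vectors in plain Python; the sizes and helper names are illustrative assumptions, not values from any real model.

```python
import random

random.seed(0)

KV_LORA_RANK = 8        # toy sizes; real models use much larger dimensions
QK_ROPE_HEAD_DIM = 4
PAGE_TOKENS = 16


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rand_vec(n):
    return [random.uniform(-1.0, 1.0) for _ in range(n)]


q_nope, q_pe = rand_vec(KV_LORA_RANK), rand_vec(QK_ROPE_HEAD_DIM)
q = q_nope + q_pe                      # fused query [ q_nope_out | q_pe ]

# One page of tokens, each stored as the fused latent [ kv_c | k_pe ].
page = [(rand_vec(KV_LORA_RANK), rand_vec(QK_ROPE_HEAD_DIM))
        for _ in range(PAGE_TOKENS)]
latents = [kv_c + k_pe for kv_c, k_pe in page]

# 1) The fused dot product equals the sum of the two partial logits.
fused = [dot(q, lat) for lat in latents]
split = [dot(q_nope, kv_c) + dot(q_pe, k_pe) for kv_c, k_pe in page]
max_gap = max(abs(f - s) for f, s in zip(fused, split))

# 2) Dot with the page centroid equals the mean of the per-token logits.
centroid = [sum(col) / PAGE_TOKENS for col in zip(*latents)]
centroid_score = dot(q, centroid)
mean_logit = sum(fused) / PAGE_TOKENS

print(max_gap, abs(centroid_score - mean_logit))
```

Both gaps are at floating-point rounding level, which is why a single full-width centroid per page is enough for a RoPE-aware page score.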

from typing import Dict
import torch

from vortex_torch.flow import vFlowMLA, register
from vortex_torch.indexer import GeMM, Mean, topK
from vortex_torch.cache import Mean as CMean
from vortex_torch.abs import ContextBase


@register("rope_aware_block_sparse_mla")
class RopeAwareBlockSparseMLA(vFlowMLA):

    def __init__(self):
        super().__init__()
        self.mean = Mean(dim=1)        # average the fused query over its H heads
        self.gemm = GeMM()             # GeMM(x, y) = y @ xᵀ → per-page score
        self.output_func = topK()      # terminal: write selected page ids to o
        self.reduction = CMean(dim=1)  # centroid = mean of the fused latent per page

    def forward_indexer(
        self,
        q: torch.Tensor,                 # [B, H, latent_dim]  ([q_nope_out | q_pe])
        o: torch.Tensor,
        cache: Dict[str, torch.Tensor],
        ctx: ContextBase,
    ):
        q_mean = self.mean(q, ctx=ctx)                          # [B, 1, latent_dim]
        score = self.gemm(q_mean, cache["centroids"], ctx=ctx)  # [S, 1, 1] — FULL logit
        self.output_func(score, o, ctx=ctx)

    def forward_cache(
        self,
        cache: Dict[str, torch.Tensor],  # cache["latent"] auto-provided: [B, 1, latent_dim]
        loc: torch.Tensor,
        ctx: ContextBase,
    ):
        self.reduction(cache["latent"], cache["centroids"], loc=loc, ctx=ctx)

    def create_cache(self, block_size: int, kv_lora_rank: int, qk_rope_head_dim: int):
        # "latent" is auto-provided — declare only the aux centroid (full width).
        return {
            "centroids": (1, kv_lora_rank + qk_rope_head_dim),
        }

Launching is the same VortexConfig flow, with the MLA decode backend selected on the engine (attention_backend="trtllm_mla") and the tensor-core indexer enabled:

import sglang as sgl
import vortex_torch  # noqa: F401
from vortex_torch.engine.sgl.config import VortexConfig

llm = sgl.Engine(
    model_path="zai-org/GLM-4.7-Flash",   # any MLA model (DeepSeek-V3, Kimi, …)
    trust_remote_code=True,
    page_size=32,
    attention_backend="trtllm_mla",       # vortex CUDA MLA decode kernel
    kv_cache_dtype="auto",
    mem_fraction_static=0.9,
    vortex=VortexConfig(
        module_name="rope_aware_block_sparse_mla",
        attention_backend="trtllm",       # 2D block-table indexer
        impl_backend="triton",
        use_tensor_core=True,             # tensor-core indexer GeMM
        block_size=32,
        topk_val=61,
        block_reserved_bos=1,
        block_reserved_eos=2,
        max_seq_lens=8192,
    ),
)

A runnable, single-GPU MLA demo (RULER scoring, dense-vs-sparse) lives in examples/run_ruler_mla.py.

🌐 Server Mode (OpenAI-compatible endpoint)

To serve vortex sparse attention over HTTP instead of driving the engine in-process, use examples/server_launch.sh. It boots an sglang server with an OpenAI-compatible API on 127.0.0.1:30000:

# ./server_launch.sh <MODEL_NAME> <TP_SIZE>
examples/server_launch.sh Qwen/Qwen3-4B 1

Two details make server mode work:

import vortex_torch must run first. The script doesn't call python -m sglang.launch_server directly — that builds ServerArgs in the parent before vortex is imported, so the adapter that folds the config wouldn't be installed yet. Instead it imports vortex_torch, then calls sglang's run_server, so the ServerArgs ↔ VortexConfig adapter is in place before the args are pickled to the scheduler worker.
Knobs are passed as JSON via --vortex-config. The per-knob --vortex-* CLI flags no longer exist; the script writes the VortexConfig fields (prefix stripped) to a temp JSON file and feeds it through --vortex-config '<json>'. Providing a non-null config implicitly enables sparsity. Edit the JSON block in the script to retune.
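
As a rough picture of what "prefix stripped" means, the sketch below folds legacy-style vortex_* keyword arguments into a JSON object keyed by VortexConfig field names. It is an illustration only: the helper is hypothetical, the real adapter and examples/server_launch.sh may differ in detail, and only the two vortex_* names mentioned earlier in this README are used.

```python
import json


def fold_legacy_kwargs(kwargs: dict) -> dict:
    """Keep keys that start with 'vortex_' and strip that prefix; drop the rest."""
    prefix = "vortex_"
    return {k[len(prefix):]: v for k, v in kwargs.items() if k.startswith(prefix)}


legacy = {
    "model_path": "Qwen/Qwen3-4B",        # plain engine argument, not a vortex knob
    "enable_vortex_sparsity": True,       # implied by passing a non-null config
    "vortex_module_name": "custom_sparse_attention",
    "vortex_topk_val": 30,
}
config_json = json.dumps(fold_legacy_kwargs(legacy), sort_keys=True)
print(config_json)
```

The resulting string has the same shape as the JSON block the launch script edits: field names from the VortexConfig table, without the vortex_ prefix.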

Once up, query it like any OpenAI endpoint:

curl http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-4B", "messages": [{"role": "user", "content": "Hello!"}]}'
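
The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server from the command above is listening on 127.0.0.1:30000; the helper names are hypothetical, and the response parsing relies on the standard OpenAI chat-completions shape.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:30000"


def build_chat_request(model: str, prompt: str,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request for /v1/chat/completions without sending it."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the first choice's text (needs a running server)."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Qwen/Qwen3-4B", "Hello!")
print(request.full_url)
# With the server running: print(chat("Qwen/Qwen3-4B", "Hello!"))
```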

📚 Citation

If you find Vortex useful in your research, please cite:

@misc{chen2026vortexefficientprogrammablesparse,
      title={Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents}, 
      author={Zhuoming Chen and Xinrui Zhong and Qilong Feng and Ranajoy Sadhukhan and Yang Zhou and Michael Qizhe Shieh and Zhihao Jia and Beidi Chen},
      year={2026},
      eprint={2606.06453},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.06453}, 
}
