Add quant+sparse attention for vLLM serving #1832

Draft
kaix-nv wants to merge 13 commits into main from kaix/sparse_attn_quant

@kaix-nv kaix-nv commented Jun 25, 2026

Contributor

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

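| model | aa-lcr |
| --- | --- |
| baseline | 64.69±0.53 |
| nvfp4 attn | 64.33±0.78 |

| kv_len | prefill | s=1 | s=2 | s=4 | s=8 | auto | best | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2,048 | 0.084 | 0.115 | 0.064 | 0.043 | 0.034 | 0.054 | 0.034 | 2.47× |
| 8,192 | 0.289 | 0.424 | 0.221 | 0.132 | 0.087 | 0.171 | 0.087 | 3.31× |
| 32,768 | 1.109 | 1.659 | 0.849 | 0.488 | 0.295 | 0.641 | 0.295 | 3.76× |

| kv_len | prefill | s=1 | s=2 | s=4 | s=8 | auto | best | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2,048 | 0.101 | 0.136 | 0.075 | 0.067 | 0.058 | 0.063 | 0.058 | 1.75× |
| 8,192 | 0.155 | 0.502 | 0.260 | 0.139 | 0.080 | 0.118 | 0.080 | 1.94× |
| 32,768 | 0.567 | 1.976 | 1.003 | 0.519 | 0.277 | 0.427 | 0.277 | 2.05× |
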
  • s = num_kv_splits: the number of GPU programs that cooperate to compute one (request, head) output. The kernel splits that request's KV sequence into s contiguous chunks; each program computes a partial softmax (running max, denominator, and weighted‑V accumulator) over its chunk, and a small combine kernel merges the s partials.
    • s=1 (no split): one program performs the whole KV reduction per head, like an ordinary flash decode.
    • s=8: 8 programs split the KV reduction, so batch=1 × 32 heads = 32 base programs becomes 256 programs in flight, which gives much better SM occupancy. That is why a larger s is faster at batch=1.
  • auto: the timing when num_kv_splits=None is passed, i.e. the kernel's built‑in _auto_num_kv_splits heuristic chooses s from the SM count and batch×heads. I timed it as its own column to see how the default pick compares with the explicit sweep.
  • best: the minimum latency across all the split settings I timed for that row (s=1/2/4/8 and auto), i.e. the fastest decode config. The speedup column is prefill ÷ best.
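
The split-and-combine scheme described above can be illustrated with a small pure-Python sketch for a single (request, head) pair. This is not the Triton kernel from this PR: the function names `partial_attention`, `combine`, and `split_kv_decode` are hypothetical, the 1/sqrt(head_dim) scaling is assumed, and the loop over chunks stands in for the s GPU programs that would run in parallel.

```python
import math
import random


def partial_attention(q, keys, values):
    """One 'program': softmax statistics over a single contiguous KV chunk.

    Returns (chunk max, softmax denominator, unnormalized weighted-V sum).
    """
    scale = 1.0 / math.sqrt(len(q))  # assumed standard attention scaling
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    head_dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(head_dim)]
    return m, denom, acc


def combine(partials):
    """The 'combine kernel': rescale each partial to a shared max, then normalize."""
    m_global = max(m for m, _, _ in partials)
    denom = 0.0
    acc = [0.0] * len(partials[0][2])
    for m, chunk_denom, chunk_acc in partials:
        rescale = math.exp(m - m_global)
        denom += rescale * chunk_denom
        acc = [a + rescale * c for a, c in zip(acc, chunk_acc)]
    return [a / denom for a in acc]


def split_kv_decode(q, keys, values, num_kv_splits):
    """Decode attention for one query vector, with the KV sequence cut into chunks."""
    n = len(keys)
    chunk = math.ceil(n / num_kv_splits)
    # If n is not a multiple of the chunk size, the last chunk is shorter and
    # fewer than num_kv_splits chunks may be produced.
    partials = [
        partial_attention(q, keys[i:i + chunk], values[i:i + chunk])
        for i in range(0, n, chunk)
    ]
    return combine(partials)


random.seed(0)
D, N = 8, 37
q = [random.gauss(0, 1) for _ in range(D)]
keys = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
values = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]

reference = split_kv_decode(q, keys, values, 1)  # s=1: no split
for s in (2, 4, 8):
    out = split_kv_decode(q, keys, values, s)
    diff = max(abs(a - b) for a, b in zip(out, reference))
    print(f"s={s}: max abs diff vs s=1 = {diff:.2e}")
```

Every split count yields the same output up to floating-point rounding, so s only changes how much parallelism is exposed. With batch=1 and 32 heads, s=8 gives 1 × 32 × 8 = 256 partial programs, matching the count quoted above.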

Kernel vs. PyTorch native SDPA (A6000, fp16, batch=2, head_dim=128):

| Config (B, Hq, Hkv, D) | KV Length | Max Abs Error | Max Rel Error | Cosine Similarity |
| --- | --- | --- | --- | --- |
| 2, 32, 8, 128 (GQA 4:1) | 1,024 | 1.2e-4 | 1.7e-2 | 1.000000 |
| 2, 32, 8, 128 (GQA 4:1) | 8,192 | 3.1e-5 | 8.2e-3 | 1.000000 |
| 2, 32, 8, 128 (GQA 4:1) | 32,768 | 1.5e-5 | 4.7e-3 | 1.000000 |
| 2, 16, 16, 128 (MHA) | 1,024 | 1.2e-4 | 1.7e-2 | 1.000000 |
| 2, 16, 16, 128 (MHA) | 8,192 | 3.1e-5 | 7.2e-3 | 1.000000 |
| 2, 16, 16, 128 (MHA) | 32,768 | 1.5e-5 | 4.4e-3 | 1.000000 |
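
For reference, the three metrics in the table can be computed along the following lines. This is a sketch only: the PR does not state the exact definitions used, so the element-wise formulas, the `eps` guard in the relative error, and the helper names are all assumptions. Inputs are treated as flattened lists of floats.

```python
import math


def max_abs_error(out, ref):
    """Largest element-wise absolute difference."""
    return max(abs(o - r) for o, r in zip(out, ref))


def max_rel_error(out, ref, eps=1e-6):
    """Largest element-wise relative difference; eps (assumed) avoids division by zero."""
    return max(abs(o - r) / (abs(r) + eps) for o, r in zip(out, ref))


def cosine_similarity(out, ref):
    """Cosine of the angle between the two flattened outputs."""
    dot = sum(o * r for o, r in zip(out, ref))
    norm_out = math.sqrt(sum(o * o for o in out))
    norm_ref = math.sqrt(sum(r * r for r in ref))
    return dot / (norm_out * norm_ref)


# Toy values, not data from this PR: one element of the reference is close to zero.
ref = [1.0, -2.0, 0.5, 0.001]
out = [1.0001, -2.0, 0.5, 0.00102]

print("max abs error:", max_abs_error(out, ref))
print("max rel error:", max_rel_error(out, ref))
print("cosine similarity:", cosine_similarity(out, ref))
```

In the toy data the near-zero reference element dominates the relative error even though its absolute error is tiny, which is one common reason a relative error can sit well above the absolute error while the cosine similarity stays at 1.0.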

Before your PR is "Ready for review"

Make sure you read and follow the Contributor guidelines and that your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A
  • Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

copy-pr-bot Bot commented Jun 25, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai Bot commented Jun 25, 2026

Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 70c4cced-467d-4f78-918f-1446f7043372

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands.

kaix-nv changed the title from "Kaix/sparse attn quant" to "Add quant+sparse attention for vLLM serving" on Jun 25, 2026

github-actions Bot commented Jun 25, 2026

Contributor
PR Preview Action v1.8.1

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1832/

Built to branch gh-pages at 2026-06-29 05:57 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

codecov Bot commented Jun 25, 2026

Codecov Report

❌ Patch coverage is 1.48699% with 265 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.04%. Comparing base (f335459) to head (e255eb1).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| `...torch/kernels/common/attention/decode_attention.py` | 0.00% | 179 Missing ⚠️ |
| `.../torch/sparsity/attention_sparsity/plugins/vllm.py` | 0.00% | 51 Missing ⚠️ |
| `...delopt/torch/kernels/common/attention/triton_fa.py` | 16.66% | 20 Missing ⚠️ |
| `...lopt/torch/kernels/quantization/attention/v_qdq.py` | 0.00% | 10 Missing ⚠️ |
| `modelopt/torch/quantization/plugins/vllm.py` | 0.00% | 4 Missing ⚠️ |
| `...odelopt/torch/kernels/common/attention/__init__.py` | 0.00% | 1 Missing ⚠️ |

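
As a quick cross-check of the report, the patch-coverage percentage can be reproduced from the per-file missing-line counts. The sketch below assumes patch coverage is defined as covered / (covered + missing) over the changed lines; the covered-line count is inferred from the reported numbers rather than read from the report.

```python
# Missing-line counts per file, copied from the Codecov table.
missing_per_file = [179, 51, 20, 10, 4, 1]
total_missing = sum(missing_per_file)

# Reported patch coverage, as a fraction.
reported = 1.48699 / 100

# Assumed definition: reported = covered / (covered + missing).
# Solving for covered gives covered = reported * missing / (1 - reported).
covered = round(reported * total_missing / (1 - reported))
total_patch_lines = covered + total_missing
recomputed = 100 * covered / total_patch_lines

print(f"missing lines: {total_missing}")
print(f"inferred covered lines: {covered} of {total_patch_lines}")
print(f"recomputed patch coverage: {recomputed:.5f}%")
```

Under that assumption the figures are self-consistent: the six counts sum to the 265 missing lines quoted in the summary, and the only file with non-zero coverage accounts for the few covered lines.
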
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1832      +/-   ##
==========================================
- Coverage   77.40%   77.04%   -0.36%     
==========================================
  Files         515      517       +2     
  Lines       57118    57373     +255     
==========================================
- Hits        44214    44205       -9     
- Misses      12904    13168     +264     
| Flag | Coverage Δ |
| --- | --- |
| unit | 54.67% <1.48%> (-0.25%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

kaix-nv force-pushed the kaix/sparse_attn_quant branch 2 times, most recently from d48b1df to 6020692, on June 28, 2026 05:00
kaix-nv added 10 commits June 27, 2026 22:42
Signed-off-by: Kai Xu <kaix@nvidia.com>
… (V)

Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>
… and in-kernel-V gate

Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>
kaix-nv force-pushed the kaix/sparse_attn_quant branch from 6020692 to 5942c96 on June 28, 2026 05:44
kaix-nv added 2 commits June 28, 2026 19:40
Signed-off-by: Kai Xu <kaix@nvidia.com>
…M2 quantizers

Signed-off-by: Kai Xu <kaix@nvidia.com>
kaix-nv force-pushed the kaix/sparse_attn_quant branch from d107a21 to 1910e45 on June 29, 2026 02:41