Add quant+sparse attention for vLLM serving by kaix-nv · Pull Request #1832 · NVIDIA/Model-Optimizer

kaix-nv · 2026-06-25T22:28:10Z

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

| model | aa-lcr |
| --- | --- |
| baseline | 64.69±0.53 |
| nvfp4 attn | 64.33±0.78 |

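| kv_len | prefill | s=1 | s=2 | s=4 | s=8 | auto | best | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2,048 | 0.084 | 0.115 | 0.064 | 0.043 | 0.034 | 0.054 | 0.034 | 2.47× |
| 8,192 | 0.289 | 0.424 | 0.221 | 0.132 | 0.087 | 0.171 | 0.087 | 3.31× |
| 32,768 | 1.109 | 1.659 | 0.849 | 0.488 | 0.295 | 0.641 | 0.295 | 3.76× |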

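| kv_len | prefill | s=1 | s=2 | s=4 | s=8 | auto | best | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2,048 | 0.101 | 0.136 | 0.075 | 0.067 | 0.058 | 0.063 | 0.058 | 1.75× |
| 8,192 | 0.155 | 0.502 | 0.260 | 0.139 | 0.080 | 0.118 | 0.080 | 1.94× |
| 32,768 | 0.567 | 1.976 | 1.003 | 0.519 | 0.277 | 0.427 | 0.277 | 2.05× |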

s = num_kv_splits: the number of GPU programs that cooperate to compute one (request, head) output. The kernel splits that request's KV sequence into s contiguous chunks; each program computes a partial softmax (running max, denominator, and weighted V) over its chunk, and a small combine kernel merges the s partials.
- s=1 (no split): one program does the whole KV reduction per head, like an ordinary flash decode.
- s=8: 8 programs split the KV reduction, so with only batch=1 × 32 heads = 32 base programs there are 256 programs in flight, which gives much better SM occupancy. That is why a larger s is faster at batch=1.
- auto: the timing when you pass num_kv_splits=None, i.e. the kernel's built-in _auto_num_kv_splits heuristic chooses s from the SM count and batch×heads. I timed it as its own column to see how the default pick compares with the explicit sweep.
- best: the minimum latency across all the split settings I timed (s=1/2/4/8 and auto) for that row, i.e. the fastest decode config. The speedup column is prefill ÷ best.
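
The split-and-combine scheme described above can be sketched in plain Python. This is only an illustration of the math for a single (request, head) pair, not the PR's Triton kernel: the function names `partial_attention`, `combine_partials`, and `split_kv_decode` are hypothetical, and real kernels work on paged GPU tensors rather than Python lists.

```python
import math

def partial_attention(q, keys, values):
    """One 'program': unnormalized softmax-weighted sum over one KV chunk.

    Returns (running max, denominator, weighted-V accumulator).
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)                               # running max for stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)                          # softmax denominator
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, denom, acc

def combine_partials(partials):
    """Combine step: rescale each partial to a common max, then normalize."""
    m_all = max(m for m, _, _ in partials)
    dim = len(partials[0][2])
    total_denom = 0.0
    out = [0.0] * dim
    for m, denom, acc in partials:
        rescale = math.exp(m - m_all)
        total_denom += rescale * denom
        for d in range(dim):
            out[d] += rescale * acc[d]
    return [o / total_denom for o in out]

def split_kv_decode(q, keys, values, num_kv_splits):
    """Split the KV sequence into contiguous chunks and merge the partials."""
    n = len(keys)
    chunk = -(-n // num_kv_splits)                # ceiling division
    partials = [
        partial_attention(q, keys[i:i + chunk], values[i:i + chunk])
        for i in range(0, n, chunk)
    ]
    return combine_partials(partials)
```

Because the combine step rescales every partial to the same maximum before normalizing, the result should match the unsplit (s=1) output up to floating-point rounding, whichever s is chosen.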

Kernel vs PyTorch native SDPA — A6000, fp16, batch=2, head_dim=128:

| Config (B, Hq, Hkv, D) | KV Length | Max Abs Error | Max Rel Error | Cosine Similarity |
| --- | --- | --- | --- | --- |
| 2, 32, 8, 128 (GQA 4:1) | 1,024 | 1.2e-4 | 1.7e-2 | 1.000000 |
| 2, 32, 8, 128 (GQA 4:1) | 8,192 | 3.1e-5 | 8.2e-3 | 1.000000 |
| 2, 32, 8, 128 (GQA 4:1) | 32,768 | 1.5e-5 | 4.7e-3 | 1.000000 |
| 2, 16, 16, 128 (MHA) | 1,024 | 1.2e-4 | 1.7e-2 | 1.000000 |
| 2, 16, 16, 128 (MHA) | 8,192 | 3.1e-5 | 7.2e-3 | 1.000000 |
| 2, 16, 16, 128 (MHA) | 32,768 | 1.5e-5 | 4.4e-3 | 1.000000 |
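
For readers who want to reproduce this kind of comparison, the three metrics in the table can be computed roughly as follows. The PR does not state its exact definitions, so the relative-error formula (with a small `eps` guard against division by zero) and the name `compare_outputs` are assumptions; the actual test presumably operates on torch tensors rather than flat Python lists.

```python
import math

def compare_outputs(ref, out, eps=1e-12):
    """Return (max abs error, max rel error, cosine similarity).

    `ref` and `out` are equal-length flat sequences of floats, e.g. the
    flattened reference and kernel outputs.
    """
    abs_errs = [abs(r - o) for r, o in zip(ref, out)]
    max_abs = max(abs_errs)
    # Assumed definition: element-wise error relative to the reference value.
    max_rel = max(e / (abs(r) + eps) for e, r in zip(abs_errs, ref))
    dot = sum(r * o for r, o in zip(ref, out))
    norm_ref = math.sqrt(sum(r * r for r in ref))
    norm_out = math.sqrt(sum(o * o for o in out))
    cosine = dot / (norm_ref * norm_out)
    return max_abs, max_rel, cosine
```

Under this definition the maximum relative error is driven by elements whose reference value is close to zero, which is one possible reason it can sit well above the maximum absolute error while cosine similarity still rounds to 1.000000.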

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ / ❌ / N/A
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
Did you write any new necessary tests?: ✅ / ❌ / N/A
Did you update Changelog?: ✅ / ❌ / N/A
Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

copy-pr-bot · 2026-06-25T22:28:14Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

coderabbitai · 2026-06-25T22:28:19Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 70c4cced-467d-4f78-918f-1446f7043372

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

github-actions · 2026-06-25T22:31:53Z

PR Preview Action v1.8.1
🚀 View preview at https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1832/
Built to branch `gh-pages` at 2026-06-29 05:57 UTC. Preview will be ready when the GitHub Pages deployment is complete.

codecov · 2026-06-25T22:37:21Z

Codecov Report

❌ Patch coverage is 1.48699% with 265 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.04%. Comparing base (f335459) to head (e255eb1).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...torch/kernels/common/attention/decode_attention.py | 0.00% | 179 Missing ⚠️ |
| .../torch/sparsity/attention_sparsity/plugins/vllm.py | 0.00% | 51 Missing ⚠️ |
| ...delopt/torch/kernels/common/attention/triton_fa.py | 16.66% | 20 Missing ⚠️ |
| ...lopt/torch/kernels/quantization/attention/v_qdq.py | 0.00% | 10 Missing ⚠️ |
| modelopt/torch/quantization/plugins/vllm.py | 0.00% | 4 Missing ⚠️ |
| ...odelopt/torch/kernels/common/attention/__init__.py | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1832      +/-   ##
==========================================
- Coverage   77.40%   77.04%   -0.36%     
==========================================
  Files         515      517       +2     
  Lines       57118    57373     +255     
==========================================
- Hits        44214    44205       -9     
- Misses      12904    13168     +264

| Flag | Coverage Δ | |
| --- | --- | --- |
| unit | `54.67% <1.48%> (-0.25%)` | ⬇️ |

kaix-nv changed the title ~~Kaix/sparse attn quant~~ Add quant+sparse attention for vLLM serving Jun 25, 2026

kaix-nv force-pushed the kaix/sparse_attn_quant branch 2 times, most recently from d48b1df to 6020692 Compare June 28, 2026 05:00

kaix-nv added 10 commits June 27, 2026 22:42

Add Triton paged decode attention kernel with skip-softmax

8502c9f

Signed-off-by: Kai Xu <kaix@nvidia.com>

Route decode-only skip-softmax through the paged decode kernel

02a2cf6

Signed-off-by: Kai Xu <kaix@nvidia.com>

Cross-check the decode kernel against PyTorch SDPA in the numerical test

11fe779

Signed-off-by: Kai Xu <kaix@nvidia.com>

Add softmax-P quant-dequant (FP8/NVFP4) to the paged decode kernel

f159e07

Signed-off-by: Kai Xu <kaix@nvidia.com>

Drive prefill/decode P quant from the exported p_bmm_quantizer

f73a956

Signed-off-by: Kai Xu <kaix@nvidia.com>

Add in-kernel NVFP4/FP8 quant-dequant for the attention value operand…

e0bf937

… (V) Signed-off-by: Kai Xu <kaix@nvidia.com>

Add decode V quantize-on-write

edba1ce

Signed-off-by: Kai Xu <kaix@nvidia.com>

Reload p_bmm_quantizer into vLLM + add the _QuantVLLMAttention p slot…

2f075d5

… and in-kernel-V gate Signed-off-by: Kai Xu <kaix@nvidia.com>

Split V quant helper into v_qdq and de-prefix the shared P/V qdq names

39dba95

Signed-off-by: Kai Xu <kaix@nvidia.com>

Add unified quant+sparse vLLM serve worker

4bdeff3

Signed-off-by: Kai Xu <kaix@nvidia.com>

kaix-nv force-pushed the kaix/sparse_attn_quant branch from 6020692 to 5942c96 Compare June 28, 2026 05:44

kaix-nv added 2 commits June 28, 2026 19:40

Add combined quant + skip-softmax decode test

af15906

Signed-off-by: Kai Xu <kaix@nvidia.com>

Gate in-kernel V-quant flag on mapped V format and reject unmapped BM…

1910e45

…M2 quantizers Signed-off-by: Kai Xu <kaix@nvidia.com>

kaix-nv force-pushed the kaix/sparse_attn_quant branch from d107a21 to 1910e45 Compare June 29, 2026 02:41

Refuse cascade attention when ModelOpt attention quant is active

e255eb1

Signed-off-by: Kai Xu <kaix@nvidia.com>

kaix-nv wants to merge 13 commits into main from kaix/sparse_attn_quant
