Skip to content

Allow public tuning of cub::DeviceRadixSort#9491

Merged
bernhardmgruber merged 3 commits into
NVIDIA:mainfrom
bernhardmgruber:radix_tune_public
Jun 28, 2026
Merged

Allow public tuning of cub::DeviceRadixSort#9491
bernhardmgruber merged 3 commits into
NVIDIA:mainfrom
bernhardmgruber:radix_tune_public

Conversation

@bernhardmgruber

@bernhardmgruber bernhardmgruber commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Fixes: #8576

@copy-pr-bot

copy-pr-bot Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL Jun 16, 2026
@bernhardmgruber bernhardmgruber marked this pull request as ready for review June 17, 2026 20:57
@bernhardmgruber bernhardmgruber requested review from a team as code owners June 17, 2026 20:57
@cccl-authenticator-app cccl-authenticator-app Bot moved this from In Progress to In Review in CCCL Jun 17, 2026
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5ee34bd0-1f38-40f8-98cd-c27a3621e54d

📥 Commits

Reviewing files that changed from the base of the PR and between 12cdc11 and 8630718.

📒 Files selected for processing (2)
  • cub/benchmarks/bench/radix_sort/policy_selector.h
  • cub/test/catch2_test_device_segmented_sort_custom_policy_hub.cu
🚧 Files skipped from review as they are similar to previous changes (1)
  • cub/benchmarks/bench/radix_sort/policy_selector.h

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Overview

This PR exposes the radix sort tuning API publicly, allowing users to customize the sorting algorithm behavior via cub::RadixSortPolicy and related policy classes. The change addresses issue #8576 and establishes a foundation for external policy selection while deprecating internal-only agent policy types.

Key Changes

Public API Exposure

The core change migrates radix sort policies from internal detail::radix_sort::* types to public RadixSort*Policy types:

  • New public policy types: RadixSortAlgorithm (enum), RadixSortHistogramPolicy, RadixSortExclusiveSumPolicy, RadixSortOnesweepPolicy, RadixSortDownsweepPolicy, RadixSortUpsweepPolicy, and RadixSortPolicy (composite).
  • Public policy structure: RadixSortPolicy now serves as the canonical policy container, composed of algorithm selection, histogram, exclusive-sum, onesweep, scan, downsweep (standard and alternate), upsweep (standard and alternate), and single-tile policies.
  • Streaming support: Added operator<< overloads for all new policy types to support debug output.

Internal Refactoring

Agent policy implementations were reorganized under the detail namespace:

  • AgentRadixSortOnesweepPolicydetail::agent_radix_sort_onesweep_policy (implementation), with deprecated public alias
  • AgentRadixSortHistogramPolicydetail::agent_radix_sort_histogram_policy (implementation), with deprecated public alias
  • AgentRadixSortExclusiveSumPolicydetail::agent_radix_sort_exclusive_sum_policy (implementation), with deprecated public alias
  • AgentRadixSortDownsweepPolicydetail::agent_radix_sort_downsweep_policy (implementation), with deprecated public alias
  • AgentRadixSortUpsweepPolicydetail::agent_radix_sort_upsweep_policy (implementation), with deprecated public alias

All original agent policy names are preserved as deprecated type aliases with CCCL_DEPRECATED_BECAUSE annotations, directing users to the public tuning API.

API Deprecation

  • DispatchRadixSort is now marked with a deprecation attribute directing users to DeviceRadixSort.

DeviceRadixSort Documentation

Added a new "Tuning" subsection to cub::DeviceRadixSort documentation explaining how to pass custom policies via an environment/policy selector.

Policy Selector Updates

Updated policy_selector implementations to return the new public RadixSortPolicy type instead of internal detail::radix_sort::radix_sort_policy.

Test Additions

  • RadixSortPolicy test: Added C2H_TEST validating that RadixSortPolicy is semiregular and an aggregate, with equality/inequality checks.
  • Tuning API test: Added test demonstrating cub::DeviceRadixSort::SortKeys usage with cuda::execution::tune(...) mechanism and custom policy selection.
  • Deprecation handling: Updated existing segmented sort tests to include deprecation suppressions for agent policy usage.

Scope

  • Files modified: 19 files
  • Total lines changed: ~570 additions, ~560 deletions
  • Primary areas: Policy definitions, dispatch implementations, kernel configurations, tuning hubs, and test files

Backward Compatibility

All public API changes preserve backward compatibility through deprecated type aliases. Existing code using agent policy types will continue to compile with deprecation warnings, guiding migration to the new public tuning API.

Walkthrough

The PR promotes internal cub::detail::radix_sort::* policy types to a public cub::RadixSortPolicy family (RadixSortAlgorithm, RadixSortHistogramPolicy, RadixSortExclusiveSumPolicy, RadixSortOnesweepPolicy, RadixSortDownsweepPolicy, RadixSortUpsweepPolicy, RadixSortPolicy). Agent policy structs are moved into a detail namespace with deprecated aliases for old names. All dispatch, kernel, tuning hub, segmented sort, benchmark, and test code is updated to use the new types.

Changes

RadixSort public policy API promotion

Layer / File(s) Summary
New exported RadixSort policy structs and enum
cub/cub/device/dispatch/tuning/tuning_radix_sort.cuh
Defines RadixSortAlgorithm, six component policy structs, and the composite RadixSortPolicy aggregate with equality operators and operator<<. Updates factory functions (make_reg_scaled_radix_sort_*) and legacy conversion helpers (convert_downsweep_policy, convert_policy) to return the new exported types.
Agent policy types moved into detail namespace
cub/cub/agent/agent_radix_sort_downsweep.cuh, cub/cub/agent/agent_radix_sort_upsweep.cuh, cub/cub/agent/agent_radix_sort_histogram.cuh, cub/cub/agent/agent_radix_sort_onesweep.cuh
Each agent policy struct is renamed to detail::agent_radix_sort_*_policy and the original AgentRadixSort*Policy names become CCCL_DEPRECATED_BECAUSE using-aliases forwarding to the detail types.
policy_hub and policy selectors return RadixSortPolicy
cub/cub/device/dispatch/tuning/tuning_radix_sort.cuh
All per-architecture policy_hub entries (Policy500–Policy1000) and small-key onesweep paths switch internal aliases to detail::agent_radix_sort_*_policy. The radix_sort_policy_selector concept and all operator()(cuda::compute_capability) overloads return RadixSortPolicy.
DispatchRadixSort deprecated; internal helpers updated
cub/cub/device/dispatch/dispatch_radix_sort.cuh
DispatchRadixSort is marked deprecated toward DeviceRadixSort. __invoke_single_tile, PassConfig::InitPassConfig, __invoke_onesweep, __invoke_passes, and the GCC onesweep algorithm check are updated to RadixSort*Policy types.
Kernel compile-time policy wiring
cub/cub/device/dispatch/kernels/kernel_radix_sort.cuh, cub/cub/device/dispatch/kernels/kernel_segmented_radix_sort.cuh, cub/cub/device/dispatch/kernels/kernel_segmented_sort.cuh
Upsweep, downsweep, single-tile, histogram, onesweep, exclusive-sum, segmented radix sort, and segmented sort kernels derive compile-time configuration from RadixSort*Policy and instantiate detail::agent_radix_sort_*_policy.
Segmented sort and segmented radix sort tuning
cub/cub/device/dispatch/tuning/tuning_segmented_radix_sort.cuh, cub/cub/device/dispatch/tuning/tuning_segmented_sort.cuh
segmented_radix_sort_policy member types switch from radix_sort_downsweep_policy to RadixSortDownsweepPolicy; seven policy_hub entries in tuning_segmented_sort.cuh replace AgentRadixSortDownsweepPolicy with detail::agent_radix_sort_downsweep_policy.
DeviceRadixSort docs and TuningEnvT lookup
cub/cub/device/device_radix_sort.cuh
Adds a Tuning subsection to DeviceRadixSort docs and updates policy_selector_t derivation from TuningEnvT to query RadixSortPolicy.
Benchmark policy selector updated
cub/benchmarks/bench/radix_sort/policy_selector.h
Benchmark policy_selector::operator() return type changes to ::cub::RadixSortPolicy; all sub-policy construction sites and the max_temp_storage_size static assert use the new public types.
Tests: updated and new RadixSortPolicy / tuning API tests
cub/test/catch2_test_device_radix_sort_env.cu, cub/test/catch2_test_device_radix_sort_env_api.cu, cub/test/catch2_test_device_segmented_sort_custom_policy_hub.cu
Updates existing tiny_onesweep_policy_selector to cub::RadixSortPolicy. Adds a test verifying RadixSortPolicy is semiregular and an aggregate. Adds a C++20 test exercising DeviceRadixSort::SortKeys via cuda::execution::tune with a custom policy selector. Suppresses deprecation warnings for legacy agent policies.

Assessment against linked issues

Objective Addressed Explanation
Expose a public, user-facing DeviceRadixSort tuning policy API (RadixSortPolicy and related types) [#8576]
Deprecate the old internal detail::radix_sort::* and AgentRadixSort*Policy types in favor of the new public surface [#8576]
Wire the new public policy type through DeviceRadixSort dispatch and all internal kernels/tuning hubs [#8576]
Provide tests and examples demonstrating use of the new tuning API [#8576]

Possibly related PRs

  • NVIDIA/cccl#9489: Directly related — refactors the same RadixSortPolicy/RadixSortAlgorithm::onesweep representation, num_private_partitions field layout, and max_temp_storage_size onesweep dispatch logic that this PR consumes.

Suggested reviewers

  • gevtushenko
  • gonidelis
  • NaderAlAwar
  • davebayer

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cub/cub/device/dispatch/dispatch_radix_sort.cuh (1)

839-845: ⚠️ Potential issue | 🟠 Major

important: Line 841 passes alt_downsweep_kernel as the second argument to __invoke_passes, but that position expects alt_upsweep_kernel (parameter type UpsweepKernelT). This causes a type contract violation that breaks template instantiation. Pass alt_upsweep_kernel instead. Line 1117-1122 shows the correct argument order.

cub/benchmarks/bench/radix_sort/policy_selector.h (1)

81-90: ⚠️ Potential issue | 🔴 Critical

important: Line 90 uses onesweep.block_threads, but RadixSortOnesweepPolicy is constructed with threads_per_block as the first field (line 18). Change onesweep.block_threads to onesweep.threads_per_block.

🧹 Nitpick comments (2)
cub/cub/device/device_radix_sort.cuh (1)

192-193: ⚡ Quick win

suggestion: Line 192 introduces direct use of RadixSortPolicy, but this header does not include its defining header directly. Add a direct include for cub/device/dispatch/tuning/tuning_radix_sort.cuh to avoid transitive-include fragility.

As per coding guidelines, "Files must include all headers related to the symbols that they are using and relying on transitive header inclusion is not allowed."

Source: Coding guidelines

cub/cub/device/dispatch/tuning/tuning_radix_sort.cuh (1)

40-51: ⚡ Quick win

suggestion: Add CCCL host API annotations to the new stream overloads.

These new hosted overloads are plain functions in a CUDA header; annotate them with the project’s host API macro to match CCCL header conventions.

As per coding guidelines, “Functions must be marked with one of: _CCCL_HOST_API, _CCCL_DEVICE_API, _CCCL_HOST_DEVICE_API, _CCCL_TILE_API, or _CCCL_API.”

Also applies to: 79-85, 108-112, 148-156, 187-194, 220-225, 259-266

Source: Coding guidelines


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a97e1853-7d32-4e48-b473-78f0569926ae

📥 Commits

Reviewing files that changed from the base of the PR and between 0c0a722 and 12cdc11.

📒 Files selected for processing (15)
  • cub/benchmarks/bench/radix_sort/policy_selector.h
  • cub/cub/agent/agent_radix_sort_downsweep.cuh
  • cub/cub/agent/agent_radix_sort_histogram.cuh
  • cub/cub/agent/agent_radix_sort_onesweep.cuh
  • cub/cub/agent/agent_radix_sort_upsweep.cuh
  • cub/cub/device/device_radix_sort.cuh
  • cub/cub/device/dispatch/dispatch_radix_sort.cuh
  • cub/cub/device/dispatch/kernels/kernel_radix_sort.cuh
  • cub/cub/device/dispatch/kernels/kernel_segmented_radix_sort.cuh
  • cub/cub/device/dispatch/kernels/kernel_segmented_sort.cuh
  • cub/cub/device/dispatch/tuning/tuning_radix_sort.cuh
  • cub/cub/device/dispatch/tuning/tuning_segmented_radix_sort.cuh
  • cub/cub/device/dispatch/tuning/tuning_segmented_sort.cuh
  • cub/test/catch2_test_device_radix_sort_env.cu
  • cub/test/catch2_test_device_radix_sort_env_api.cu

Comment thread cub/benchmarks/bench/radix_sort/policy_selector.h
@github-actions

Copy link
Copy Markdown
Contributor

🥳 CI Workflow Results

🟩 Finished in 3h 01m: Pass: 100%/287 | Total: 7d 07h | Max: 2h 12m | Hits: 36%/616062

See results here.

Comment on lines 62 to +63
int num_private_partitions;
int radix_bits;
int radix_bits; //!< Number of bits per radix digit

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, why do we have the num_ prefix in one case, but not in the other? Seems inconsistent to me

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make a note and thnk about it!

@bernhardmgruber bernhardmgruber merged commit 7e585c9 into NVIDIA:main Jun 28, 2026
310 checks passed
@github-project-automation github-project-automation Bot moved this from In Review to Done in CCCL Jun 28, 2026
@bernhardmgruber bernhardmgruber deleted the radix_tune_public branch June 28, 2026 14:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Productize the cub::DeviceRadixSort tuning API

2 participants