cub::DeviceTransform jumps through hoops to handle unaligned inputs, especially in the memcpy_async (LDGSTS) and ublkcp kernels. If the user gave us a compile-time guarantee that all buffers are aligned, those kernels could be simplified.
Let's investigate the performance impact of such a guarantee.
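To make the idea concrete, here is a minimal host-side C++17 sketch of how a compile-time alignment guarantee could remove the unaligned-head handling from a chunked copy. It is only an analogy, not the actual CUB kernel code: the tag types `unknown_alignment` and `aligned_to<N>`, the function `copy_bytes`, and the 16-byte chunk size are all hypothetical names and values chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical tag types (assumptions for this sketch, not part of CUB's API).
struct unknown_alignment {};

template <std::size_t N>
struct aligned_to
{
  static constexpr std::size_t value = N;
};

// Copies `bytes` bytes from `src` to `dst`, moving 16-byte chunks where possible.
// The Guarantee tag decides at compile time whether the head-peeling code exists.
template <typename Guarantee>
void copy_bytes(char* dst, const char* src, std::size_t bytes, Guarantee)
{
  constexpr std::size_t chunk = 16; // assumed chunk size for illustration
  std::size_t i = 0;

  if constexpr (std::is_same_v<Guarantee, unknown_alignment>)
  {
    // No guarantee: copy byte-wise until `src + i` reaches a chunk boundary.
    const auto addr        = reinterpret_cast<std::uintptr_t>(src);
    const std::size_t head = (chunk - addr % chunk) % chunk;
    for (; i < head && i < bytes; ++i)
    {
      dst[i] = src[i];
    }
  }
  else
  {
    // Guarantee given: the head loop is compiled out entirely.
    static_assert(Guarantee::value % chunk == 0, "guarantee must cover the chunk size");
    assert(reinterpret_cast<std::uintptr_t>(src) % chunk == 0);
  }

  // Bulk part: whole chunks, starting at a chunk-aligned source address.
  for (; i + chunk <= bytes; i += chunk)
  {
    std::memcpy(dst + i, src + i, chunk);
  }

  // Tail: remaining bytes when the size is not a multiple of the chunk.
  for (; i < bytes; ++i)
  {
    dst[i] = src[i];
  }
}
```

A caller would opt in with something like `copy_bytes(dst, src, n, aligned_to<16>{})` and fall back to `unknown_alignment{}` otherwise. Whether removing this kind of branch yields a measurable speedup in the real kernels is exactly what the investigation needs to show.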