The operations exposed by WarpScan require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
The operations exposed by BlockMergeSort require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
DeviceSegmentedSort provides device-wide, parallel operations for computing a batched sort across multiple, non-overlapping sequences of data items residing within device-accessible memory.
The operations exposed by WarpExchange require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
The WarpStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA warp to a linear segment of memory
The WarpLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA warp
The operations exposed by WarpLoad require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
The operations exposed by BlockDiscontinuity require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More...
The cub::BlockAdjacentDifference class provides collective methods for computing the differences of adjacent elements partitioned across a CUDA thread block.
The cub::BlockDiscontinuity class provides collective methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block.
The operations exposed by BlockExchange require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More...
The operations exposed by BlockHistogram require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More...
cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block. More...
The operations exposed by BlockMergeSort require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More...
template<typename KeyT , typename CompareOp , int ITEMS_PER_THREAD>
__device__ __forceinline__ void SerialMerge (KeyT *keys_shared, int keys1_beg, int keys2_beg, int keys1_count, int keys2_count, KeyT(&output)[ITEMS_PER_THREAD], int(&indices)[ITEMS_PER_THREAD], CompareOp compare_op)
The operations exposed by BlockRadixSort require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More...
The operations exposed by BlockReduce require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More...
The BlockRunLengthDecode class supports decoding a run-length encoded array of items. That is, given the two arrays run_value[N] and run_lengths[N], run_value[i] is repeated run_lengths[i] many times in the output array. Due to the nature of the run-length decoding algorithm ("decompression"), the output size of the run-length decoded array is runtime-dependent and potentially without any upper bound. To address this, BlockRunLengthDecode allows retrieving a "window" from the run-length decoded array. The window's offset can be specified and BLOCK_THREADS * DECODED_ITEMS_PER_THREAD (i.e., referred to as window_size) decoded items from the specified window will be returned. More...
The operations exposed by BlockScan require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More...
The operations exposed by BlockShuffle require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More...
cub::BlockStoreAlgorithm enumerates alternative algorithms for cub::BlockStore to write a blocked arrangement of items across a CUDA thread block to a linear segment of memory. More...
template<typename InputT,
         int ITEMS_PER_THREAD,
         WarpLoadAlgorithm ALGORITHM = WARP_LOAD_DIRECT,
         int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS,
         int PTX_ARCH = CUB_PTX_ARCH>
class WarpLoad< InputT, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH >

The WarpLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA warp.

Template Parameters
InputT
The data type to read into (which must be convertible from the input iterator's value type).
ITEMS_PER_THREAD
The number of consecutive items partitioned onto each thread.
LOGICAL_WARP_THREADS
[optional] The number of threads per "logical" warp (may be less than the number of hardware warp threads). Default is the warp size of the targeted CUDA compute-capability (e.g., 32 threads for SM86). Must be a power of two.
PTX_ARCH
[optional] The PTX compute capability for which to specialize this collective, formatted as per the CUDA_ARCH macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of CUDA_ARCH during the current compiler pass)

Overview

The WarpLoad class provides a single data movement abstraction that can be specialized to implement different cub::WarpLoadAlgorithm strategies. This facilitates different performance policies for different architectures, data types, granularity sizes, etc.
WarpLoad can be optionally specialized by different data movement strategies:
cub::WARP_LOAD_VECTORIZE. A blocked arrangement of data is read directly from memory using CUDA's built-in vectorized loads as a coalescing optimization. More...
The code snippet below illustrates the loading of a linear segment of 64 integers into a "blocked" arrangement across 16 threads where each thread owns 4 consecutive items. The load is specialized for WARP_LOAD_TRANSPOSE, meaning memory references are efficiently coalesced using a warp-striped access pattern (after which items are locally reordered among threads).

#include <cub/cub.cuh>   // or equivalently <cub/warp/warp_load.cuh>

__global__ void ExampleKernel(int *d_data, ...)
{
    constexpr int warp_threads = 16;
    constexpr int block_threads = 256;
    constexpr int items_per_thread = 4;

    // Specialize WarpLoad for a warp of 16 threads owning 4 integer items each
    using WarpLoadT = cub::WarpLoad<int, items_per_thread, cub::WARP_LOAD_TRANSPOSE, warp_threads>;

    constexpr int warps_in_block = block_threads / warp_threads;
    constexpr int tile_size = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;

    // Allocate shared memory for WarpLoad (one instance per logical warp)
    __shared__ typename WarpLoadT::TempStorage temp_storage[warps_in_block];

    // Load this warp's tile of consecutive items into a blocked arrangement
    int thread_data[items_per_thread];
    WarpLoadT(temp_storage[warp_id]).Load(d_data + warp_id * tile_size, thread_data);
}

Suppose the input d_data is 0, 1, 2, 3, 4, 5, .... The set of thread_data across the first logical warp will be: { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }.
The operations exposed by WarpLoad require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More...
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet
The snippet is the same as the one shown in the Overview above: each logical warp calls WarpLoadT(temp_storage[warp_id]).Load(d_data + warp_id * tile_size, thread_data).
Suppose the input d_data is 0, 1, 2, 3, 4, 5, .... The set of thread_data across the first logical warp will be: { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }.

Parameters
[in] block_itr: The thread block's base input iterator for loading from
[out] items: Data to load
template<typename InputT , int ITEMS_PER_THREAD, WarpLoadAlgorithm ALGORITHM = WARP_LOAD_DIRECT, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
Load a linear segment of items from memory, guarded by range.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective's temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet
#include <cub/cub.cuh>   // or equivalently <cub/warp/warp_load.cuh>

__global__ void ExampleKernel(int *d_data, int valid_items, ...)
{
    constexpr int warp_threads = 16;
    constexpr int block_threads = 256;
    constexpr int items_per_thread = 4;

    // Specialize WarpLoad for a warp of 16 threads owning 4 integer items each
    using WarpLoadT = cub::WarpLoad<int, items_per_thread, cub::WARP_LOAD_TRANSPOSE, warp_threads>;

    constexpr int warps_in_block = block_threads / warp_threads;
    constexpr int tile_size = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;
    __shared__ typename WarpLoadT::TempStorage temp_storage[warps_in_block];

    // Guarded load: items past valid_items are assigned the out-of-bounds default -1
    int thread_data[items_per_thread];
    WarpLoadT(temp_storage[warp_id]).Load(d_data + warp_id * tile_size, thread_data, valid_items, -1);
}

Suppose the input d_data is 0, 1, 2, 3, 4, 5, ..., valid_items is 5, and the out-of-bounds default is -1. The set of thread_data across the first logical warp will be: { [0,1,2,3], [4,-1,-1,-1], ..., [-1,-1,-1,-1] }, with only the first two threads being unmasked to load portions of valid data (and the remaining items assigned -1).

Parameters
[in] block_itr: The thread block's base input iterator for loading from
[out] items: Data to load
[in] valid_items: Number of valid items to load
[in] oob_default: Default value to assign out-of-bound items
Sorts items partitioned across a CUDA thread block using a merge sorting method. More...

Constructor & Destructor Documentation

template<typename KeyT , int ITEMS_PER_THREAD, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, typename ValueT = NullType, int PTX_ARCH = CUB_PTX_ARCH>
Collective constructor using the specified memory allocation as temporary storage. More...

Data movement
template<typename OutputIteratorT >
Store items into a linear segment of memory, guarded by range. More...

Constructor & Destructor Documentation

template<typename T , int ITEMS_PER_THREAD, WarpStoreAlgorithm ALGORITHM = WARP_STORE_DIRECT, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
Collective constructor using a private static allocation of shared memory as temporary storage.

template<typename T , int ITEMS_PER_THREAD, WarpStoreAlgorithm ALGORITHM = WARP_STORE_DIRECT, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
Collective constructor using the specified memory allocation as temporary storage. More...
Constructor & Destructor Documentation

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, bool WARP_TIME_SLICING = false, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
BLOCK_DIM_Y
[optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z
[optional] The thread block length in threads along the Z dimension (default: 1)
PTX_ARCH
[optional] The PTX compute capability for which to specialize this collective, formatted as per the CUDA_ARCH macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of CUDA_ARCH during the current compiler pass)
The T type must be implicitly castable to an integer type.
BlockHistogram expects each integral input[i] value to satisfy 0 <= input[i] < BINS. Values outside of this range result in undefined behavior.
BlockHistogram can be optionally specialized to use different algorithms:
cub::BLOCK_HISTO_SORT. Sorting followed by differentiation. More...
cub::BLOCK_HISTO_ATOMIC. Use atomic addition to update byte counts directly. More...
The following example under the examples/block folder illustrates usage of dynamically shared memory with BlockReduce and how to re-purpose the same memory region: example_block_reduce_dyn_smem.cu
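A minimal sketch of the compositing pattern described above, assuming a 128-thread block where each thread contributes 4 unsigned-char samples to a 256-bin shared-memory histogram; the kernel name, d_samples, and the sample-loading step are illustrative placeholders.

#include <cub/cub.cuh>

__global__ void HistogramKernel(unsigned char *d_samples, unsigned int *d_histogram)
{
    // Specialize BlockHistogram for a 1D block of 128 threads, 4 samples per thread, 256 bins
    using BlockHistogramT = cub::BlockHistogram<unsigned char, 128, 4, 256>;
    __shared__ typename BlockHistogramT::TempStorage temp_storage;
    __shared__ unsigned int smem_histogram[256];

    // Obtain this thread's samples (each value must satisfy 0 <= value < 256)
    unsigned char thread_samples[4];
    // ... load thread_samples from d_samples ...

    // Initialize and composite the block-wide histogram in shared memory
    BlockHistogramT(temp_storage).Histogram(thread_samples, smem_histogram);

    // ... write smem_histogram back to d_histogram as needed ...
}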
Collective constructor using the specified memory allocation as temporary storage. More...
Constructor & Destructor Documentation

template<typename T , int BLOCK_DIM_X, int ITEMS_PER_THREAD, int BINS, BlockHistogramAlgorithm ALGORITHM = BLOCK_HISTO_SORT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
WARP_TIME_SLICING
[optional] Whether or not only one warp's worth of shared memory should be allocated and time-sliced among block-warps during any load-related data transpositions (versus each warp having its own storage). (default: false)
BLOCK_DIM_Y
[optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z
[optional] The thread block length in threads along the Z dimension (default: 1)
Overview

The BlockLoad class provides a single data movement abstraction that can be specialized to implement different cub::BlockLoadAlgorithm strategies. This facilitates different performance policies for different architectures, data types, granularity sizes, etc.
BlockLoad can be optionally specialized by different data movement strategies (a usage sketch follows this list):
cub::BLOCK_LOAD_DIRECT. A blocked arrangement of data is read directly from memory. More...
cub::BLOCK_LOAD_STRIPED. A striped arrangement of data is read directly from memory. More...
cub::BLOCK_LOAD_VECTORIZE. A blocked arrangement of data is read directly from memory using CUDA's built-in vectorized loads as a coalescing optimization. More...
cub::BLOCK_LOAD_TRANSPOSE. A striped arrangement of data is read directly from memory and is then locally transposed into a blocked arrangement. More...
cub::BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED. A warp-striped arrangement of data is read directly from memory and is then locally transposed into a blocked arrangement one warp at a time. More...
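A minimal usage sketch, assuming a 128-thread block loading 4 consecutive integers per thread with the BLOCK_LOAD_TRANSPOSE strategy; the kernel name and d_data are illustrative.

#include <cub/cub.cuh>

__global__ void LoadKernel(int *d_data, ...)
{
    // Specialize BlockLoad for 128 threads owning 4 integers each, using the transpose strategy
    using BlockLoadT = cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_TRANSPOSE>;
    __shared__ typename BlockLoadT::TempStorage temp_storage;

    // Load this block's tile of 512 consecutive items into a blocked arrangement
    int thread_data[4];
    BlockLoadT(temp_storage).Load(d_data + blockIdx.x * 512, thread_data);
}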
Collective constructor using the specified memory allocation as temporary storage. More...
Constructor & Destructor Documentation

template<typename InputT , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockLoadAlgorithm ALGORITHM = BLOCK_LOAD_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
ValueT
[optional] ValueT type (default: cub::NullType, which indicates a keys-only sort)
RADIX_BITS
[optional] The number of radix bits per digit place (default: 4 bits)
MEMOIZE_OUTER_SCAN
[optional] Whether or not to buffer outer raking scan partials to incur fewer shared memory reads at the expense of higher register pressure (default: true for architectures SM35 and newer, false otherwise).
INNER_SCAN_ALGORITHM
[optional] The cub::BlockScanAlgorithm algorithm to use (default: cub::BLOCK_SCAN_WARP_SCANS)
BLOCK_DIM_Y
[optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z
[optional] The thread block length in threads along the Z dimension (default: 1)
PTX_ARCH
[optional] The PTX compute capability for which to specialize this collective, formatted as per the CUDA_ARCH macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of CUDA_ARCH during the current compiler pass)
A reduction (or fold) uses a binary combining operator to compute a single aggregate from a list of input elements.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
BlockReduce can be optionally specialized by algorithm to accommodate different latency/throughput workload profiles:
cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY. An efficient "raking" reduction algorithm that only supports commutative reduction operators. More...
cub::BLOCK_REDUCE_RAKING. An efficient "raking" reduction algorithm that supports commutative and non-commutative reduction operators. More...
cub::BLOCK_REDUCE_WARP_REDUCTIONS. A quick "tiled warp-reductions" reduction algorithm that supports commutative and non-commutative reduction operators. More...
Every thread has a valid input (i.e., full vs. partial-tiles)
See cub::BlockReduceAlgorithm for performance details regarding algorithmic alternatives
Every thread in the block uses the BlockReduce class by first specializing the BlockReduce type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.
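A minimal sketch of that specialize / instantiate / invoke pattern, assuming a 128-thread block summing 4 integers per thread; the kernel name, d_in, and d_block_sums are illustrative.

#include <cub/cub.cuh>

__global__ void SumKernel(int *d_in, int *d_block_sums)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduceT = cub::BlockReduce<int, 128>;
    __shared__ typename BlockReduceT::TempStorage temp_storage;

    // Each thread obtains 4 consecutive input items
    int thread_data[4];
    cub::LoadDirectBlocked(threadIdx.x, d_in + blockIdx.x * 128 * 4, thread_data);

    // Compute the block-wide sum (only thread 0 holds the valid aggregate)
    int block_sum = BlockReduceT(temp_storage).Sum(thread_data);
    if (threadIdx.x == 0)
        d_block_sums[blockIdx.x] = block_sum;
}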
Collective constructor using the specified memory allocation as temporary storage. More...
Constructor & Destructor Documentation

template<typename T , int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
Constructor specialised for user-provided temporary storage, initializing using the runs' lengths. The algorithm's temporary storage may not be repurposed between the constructor call and subsequent RunLengthDecode calls. More...
Constructor specialised for user-provided temporary storage, initializing using the runs' offsets. The algorithm's temporary storage may not be repurposed between the constructor call and subsequent RunLengthDecode calls. More...
Run-length decodes the runs previously passed via a call to Init(...) and returns the run-length decoded items in a blocked arrangement to decoded_items. If the number of run-length decoded items exceeds the run-length decode buffer (i.e., DECODED_ITEMS_PER_THREAD * BLOCK_THREADS), only the items that fit within the buffer are returned. Subsequent calls to RunLengthDecode adjusting from_decoded_offset can be used to retrieve the remaining run-length decoded items. Calling __syncthreads() between any two calls to RunLengthDecode is not required. More...

Constructor & Destructor Documentation

template<typename ItemT , int BLOCK_DIM_X, int RUNS_PER_THREAD, int DECODED_ITEMS_PER_THREAD, typename DecodedOffsetT = uint32_t, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1>
Constructor specialised for user-provided temporary storage, initializing using the runs' lengths.

template<typename ItemT , int BLOCK_DIM_X, int RUNS_PER_THREAD, int DECODED_ITEMS_PER_THREAD, typename DecodedOffsetT = uint32_t, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1>
Constructor specialised for user-provided temporary storage, initializing using the runs' offsets.

template<typename ItemT , int BLOCK_DIM_X, int RUNS_PER_THREAD, int DECODED_ITEMS_PER_THREAD, typename DecodedOffsetT = uint32_t, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1>
Constructor specialised for static temporary storage, initializing using the runs' lengths.

template<typename ItemT , int BLOCK_DIM_X, int RUNS_PER_THREAD, int DECODED_ITEMS_PER_THREAD, typename DecodedOffsetT = uint32_t, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1>
BLOCK_DIM_Y
[optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z
[optional] The thread block length in threads along the Z dimension (default: 1)
PTX_ARCH
[optional] The PTX compute capability for which to specialize this collective, formatted as per the CUDA_ARCH macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of CUDA_ARCH during the current compiler pass)
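A rough sketch of the windowed decoding loop described above; it assumes the user-provided-storage constructor and a RunLengthDecode overload that takes only the decoded-items array and a window offset (both described in this section). The kernel name and the loading of the runs are placeholders.

#include <cstdint>
#include <cub/cub.cuh>

__global__ void RunLengthDecodeKernel(uint64_t *d_run_values, uint32_t *d_run_lengths)
{
    constexpr int BLOCK_DIM_X = 128;
    constexpr int RUNS_PER_THREAD = 2;
    constexpr int DECODED_ITEMS_PER_THREAD = 4;
    using BlockRunLengthDecodeT =
        cub::BlockRunLengthDecode<uint64_t, BLOCK_DIM_X, RUNS_PER_THREAD, DECODED_ITEMS_PER_THREAD>;
    __shared__ typename BlockRunLengthDecodeT::TempStorage temp_storage;

    // Each thread contributes RUNS_PER_THREAD (run value, run length) pairs
    uint64_t run_values[RUNS_PER_THREAD];
    uint32_t run_lengths[RUNS_PER_THREAD];
    // ... load run_values and run_lengths from d_run_values / d_run_lengths ...

    // Initialize the decoder; total_decoded_size receives the block-wide decoded length
    uint32_t total_decoded_size = 0;
    BlockRunLengthDecodeT block_rld(temp_storage, run_values, run_lengths, total_decoded_size);

    // Retrieve one window of BLOCK_DIM_X * DECODED_ITEMS_PER_THREAD decoded items per iteration
    uint32_t window_offset = 0;
    while (window_offset < total_decoded_size)
    {
        uint64_t decoded_items[DECODED_ITEMS_PER_THREAD];
        block_rld.RunLengthDecode(decoded_items, window_offset);
        // ... consume decoded_items; entries past total_decoded_size are undefined ...
        window_offset += BLOCK_DIM_X * DECODED_ITEMS_PER_THREAD;
    }
}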
Given a list of input elements and a binary reduction operator, a prefix scan produces an output list where each element is computed to be the reduction of the elements occurring earlier in the input list. Prefix sum connotes a prefix scan with the addition operator. The term inclusive indicates that the i-th output reduction incorporates the i-th input. The term exclusive indicates the i-th input is not incorporated into the i-th output reduction.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
BlockScan can be optionally specialized by algorithm to accommodate different workload profiles:
cub::BLOCK_SCAN_RAKING_MEMOIZE. Similar to cub::BLOCK_SCAN_RAKING, but having higher throughput at the expense of additional register pressure for intermediate storage. More...
Every thread in the block uses the BlockScan class by first specializing the BlockScan type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.
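A minimal sketch of that pattern computing a block-wide exclusive prefix sum over one int per thread in a 128-thread block; the kernel name and d_data are illustrative.

#include <cub/cub.cuh>

__global__ void PrefixSumKernel(int *d_data)
{
    // Specialize BlockScan for a 1D block of 128 threads of type int
    using BlockScanT = cub::BlockScan<int, 128>;
    __shared__ typename BlockScanT::TempStorage temp_storage;

    // Each thread contributes one item
    int thread_data = d_data[blockIdx.x * 128 + threadIdx.x];

    // Compute the block-wide exclusive prefix sum in place
    BlockScanT(temp_storage).ExclusiveSum(thread_data, thread_data);

    d_data[blockIdx.x * 128 + threadIdx.x] = thread_data;
}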
Collective constructor using the specified memory allocation as temporary storage. More...
Constructor & Destructor Documentation

template<typename T , int BLOCK_DIM_X, BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
BLOCK_DIM_Y
[optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z
[optional] The thread block length in threads along the Z dimension (default: 1)
PTX_ARCH
[optional] The PTX compute capability for which to specialize this collective, formatted as per the CUDA_ARCH macro (e.g., 350 for sm_35). Useful for determining the collective's storage requirements for a given device from the host. (Default: the value of CUDA_ARCH during the current compiler pass)
Overview

The BlockStore class provides a single data movement abstraction that can be specialized to implement different cub::BlockStoreAlgorithm strategies. This facilitates different performance policies for different architectures, data types, granularity sizes, etc.
BlockStore can be optionally specialized by different data movement strategies (a usage sketch follows this list):
cub::BLOCK_STORE_DIRECT. A blocked arrangement of data is written directly to memory. More...
cub::BLOCK_STORE_STRIPED. A striped arrangement of data is written directly to memory. More...
cub::BLOCK_STORE_VECTORIZE. A blocked arrangement of data is written directly to memory using CUDA's built-in vectorized stores as a coalescing optimization. More...
cub::BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED. A blocked arrangement is locally transposed into a warp-striped arrangement which is then written to memory. To reduce the shared memory requirement, only one warp's worth of shared memory is provisioned and is subsequently time-sliced among warps. More...
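A minimal usage sketch, assuming a 128-thread block writing 4 integers per thread with the BLOCK_STORE_WARP_TRANSPOSE strategy; the kernel name and d_data are illustrative.

#include <cub/cub.cuh>

__global__ void StoreKernel(int *d_data, ...)
{
    // Specialize BlockStore for 128 threads owning 4 integers each, using the warp-transpose strategy
    using BlockStoreT = cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_WARP_TRANSPOSE>;
    __shared__ typename BlockStoreT::TempStorage temp_storage;

    // A blocked arrangement of items previously computed by each thread
    int thread_data[4];
    // ... compute thread_data ...

    // Write this block's tile of 512 items back to memory
    BlockStoreT(temp_storage).Store(d_data + blockIdx.x * 512, thread_data);
}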
Collective constructor using the specified memory allocation as temporary storage. More...
Constructor & Destructor Documentation

template<typename T , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockStoreAlgorithm ALGORITHM = BLOCK_STORE_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
This operation sets the fill-size and resets the drain counter, preparing the GridQueue for draining in the next kernel instance. To be called by the host or by a kernel prior to that which will be draining. More...
This operation resets the drain so that it may advance to meet the existing fill-size. To be called by the host or by a kernel prior to that which will be draining. More...
DeviceAdjacentDifference provides device-wide, parallel operations for computing the differences of adjacent elements residing within device-accessible memory. More...
cub::DeviceHistogram provides device-wide parallel operations for constructing histogram(s) from a sequence of samples data residing within device-accessible memory.
DeviceMergeSort provides device-wide, parallel operations for computing a merge sort across a sequence of data items residing within device-accessible memory. More...
cub::DeviceRadixSort provides device-wide, parallel operations for computing a radix sort across a sequence of data items residing within device-accessible memory.
cub::DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within device-accessible memory.
cub::DeviceRunLengthEncode provides device-wide, parallel operations for computing a run-length encoding across a sequence of data items residing within device-accessible memory.
cub::DeviceScan provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within device-accessible memory.
cub::DeviceSegmentedRadixSort provides device-wide, parallel operations for computing a batched radix sort across multiple, non-overlapping sequences of data items residing within device-accessible memory.
cub::DeviceSegmentedReduce provides device-wide, parallel operations for computing a batched reduction across multiple sequences of data items residing within device-accessible memory.
cub::DeviceSelect provides device-wide, parallel operations for compacting selected items from sequences of data items residing within device-accessible memory.
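All of these Device* entry points share the same two-phase calling convention: a first call with a null temporary-storage pointer only reports the required allocation size, and a second call performs the work. A minimal host-side sketch using cub::DeviceReduce::Sum; d_in, d_out, and num_items are assumed to be a valid device input array, a single-element device output, and an element count, and error checking is omitted.

#include <cub/cub.cuh>

void SumOnDevice(int *d_in, int *d_out, int num_items)
{
    // First call: query the amount of temporary storage required
    void *d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    // Allocate the temporary storage
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call: run the sum-reduction
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    cudaFree(d_temp_storage);
}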
diff --git a/docs/Doxyfile b/docs/Doxyfile
index 75160b2bb2..9285dcb1a3 100644
--- a/docs/Doxyfile
+++ b/docs/Doxyfile
@@ -375,7 +375,7 @@ LOOKUP_CACHE_SIZE = 0
# Private class members and static file members will be hidden unless
# the EXTRACT_PRIVATE and EXTRACT_STATIC tags are set to YES
-EXTRACT_ALL = NO
+EXTRACT_ALL = YES
# If the EXTRACT_PRIVATE tag is set to YES all private members of a class
# will be included in the documentation.
GridEvenShare is a descriptor utility for distributing input among CUDA thread blocks in an "even-share" fashion. Each thread block gets roughly the same number of input tiles. More...
cub::GridEvenShare is a descriptor utility for distributing input among CUDA thread blocks in an "even-share" fashion. Each thread block gets roughly the same number of fixed-size work units (grains).
cub::GridMappingStrategy enumerates alternative strategies for mapping constant-sized tiles of device-wide data onto a grid of CUDA thread blocks. More...

Detailed Description
cub::GridMappingStrategy enumerates alternative strategies for mapping constant-sized tiles of device-wide data onto a grid of CUDA thread blocks.

Enumeration Type Documentation
Enumerator
GRID_MAPPING_RAKE
A "raking" access pattern in which each thread block is assigned a consecutive sequence of input tiles.

Thread utility for writing memory using cub::CacheStoreModifier cache modifiers. Can be used to store any data type. More...
Enumeration Type Documentation
Function Documentation
Load a linear segment of items into a blocked arrangement across the thread block.
Assumes a blocked arrangement of (block-threads*items-per-thread) items across the thread block, where thread_i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
The input offset (block_ptr + block_offset) must be quad-item aligned
The following conditions will prevent vectorization and loading will fall back to cub::BLOCK_LOAD_DIRECT:
ITEMS_PER_THREAD is odd
The data type T is not a built-in primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
Function Documentation
Store a blocked arrangement of items across a thread block into a linear segment of items.
Assumes a blocked arrangement of (block-threads*items-per-thread) items across the thread block, where thread_i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
The output offset (block_ptr + block_offset) must be quad-item aligned, which is the default starting offset returned by cudaMalloc()
The following conditions will prevent vectorization and storing will fall back to cub::BLOCK_STORE_DIRECT:
ITEMS_PER_THREAD is odd
The data type T is not a built-in primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
Function Documentation
Thread utility for reading memory using cub::CacheLoadModifier cache modifiers. Can be used to load any data type.
Example
#include <cub/cub.cuh>   // or equivalently <cub/thread/thread_load.cuh>

// 32-bit load using cache-global modifier:
int *d_in;
int val = cub::ThreadLoad<cub::LOAD_CG>(d_in + threadIdx.x);

Function Documentation
Thread utility for writing memory using cub::CacheStoreModifier cache modifiers. Can be used to store any data type.
Example
#include <cub/cub.cuh>   // or equivalently <cub/thread/thread_store.cuh>

// 32-bit store using cache-global modifier:
int *d_out;
cub::ThreadStore<cub::STORE_CG>(d_out + threadIdx.x, (int) 2);
CUB_RUNTIME_FUNCTION cudaError_t cub::SmVersion (int &sm_version, int device=CurrentDevice())
Retrieves the SM version of device (major * 100 + minor * 10) More...

Computes maximum SM occupancy in thread blocks for executing the given kernel function pointer kernel_ptr on the current device with block_threads per thread block. More...

If CUB_STDERR is defined and error is not cudaSuccess, the corresponding error message is printed to stderr (or stdout in device code) along with the supplied source context.
Returns
The CUDA error.
CUB_RUNTIME_FUNCTION int cub::CurrentDevice ()   [inline]
Returns the current device or -1 if an error occurred.

CUB_RUNTIME_FUNCTION int cub::DeviceCountUncached ()   [inline]
Returns the number of CUDA devices available or -1 if an error occurred.
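A small host-side sketch combining these helpers with cub::SmVersion and cub::MaxSmOccupancy (documented earlier in this section); ExampleKernel is a placeholder kernel and error handling is omitted.

#include <cstdio>
#include <cub/cub.cuh>

__global__ void ExampleKernel() {}

int main()
{
    // Query the device environment
    int num_devices = cub::DeviceCountUncached();
    int device = cub::CurrentDevice();

    int sm_version = 0;
    cub::SmVersion(sm_version, device);   // e.g., 860 for SM 8.6

    // Ask how many 256-thread blocks of ExampleKernel can be resident per SM
    int max_sm_occupancy = 0;
    cub::MaxSmOccupancy(max_sm_occupancy, ExampleKernel, 256);

    printf("%d device(s); current device %d, SM %d, occupancy %d blocks/SM\n",
           num_devices, device, sm_version, max_sm_occupancy);
    return 0;
}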
__device__ __forceinline__ unsigned int cub::SHR_ADD (unsigned int x, unsigned int shift, unsigned int addend)
Shift-right then add. Returns (x >> shift) + addend. More...

__device__ __forceinline__ unsigned int cub::SHL_ADD (unsigned int x, unsigned int shift, unsigned int addend)
Shift-left then add. Returns (x << shift) + addend. More...

template<typename UnsignedBits >
__device__ __forceinline__ unsigned int cub::BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits)
Bitfield-extract. Extracts num_bits from source starting at bit-offset bit_start. The input source may be an 8b, 16b, 32b, or 64b unsigned integer type. More...

__device__ __forceinline__ void cub::BFI (unsigned int &ret, unsigned int x, unsigned int y, unsigned int bit_start, unsigned int num_bits)
Bitfield insert. Inserts the num_bits least significant bits of y into x at bit-offset bit_start. More...

__device__ __forceinline__ unsigned int cub::IADD3 (unsigned int x, unsigned int y, unsigned int z)

Returns the warp ID of the calling thread. Warp ID is guaranteed to be unique among warps, but may not correspond to a zero-based ranking within the thread block. More...

template<int LOGICAL_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
Returns the warp lane mask of all lanes greater than or equal to the calling thread. More...
Function Documentation
__device__ __forceinline__ unsigned int cub::SHR_ADD (unsigned int x, unsigned int shift, unsigned int addend)
Shift-right then add. Returns (x >> shift) + addend.

__device__ __forceinline__ unsigned int cub::SHL_ADD (unsigned int x, unsigned int shift, unsigned int addend)
Shift-left then add. Returns (x << shift) + addend.

template<typename UnsignedBits >
__device__ __forceinline__ unsigned int cub::BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits)
Bitfield-extract. Extracts num_bits from source starting at bit-offset bit_start. The input source may be an 8b, 16b, 32b, or 64b unsigned integer type.

__device__ __forceinline__ void cub::BFI (unsigned int &ret, unsigned int x, unsigned int y, unsigned int bit_start, unsigned int num_bits)
Bitfield insert. Inserts the num_bits least significant bits of y into x at bit-offset bit_start.

__device__ __forceinline__ unsigned int cub::IADD3 (unsigned int x, unsigned int y, unsigned int z)

Abort execution and generate an interrupt to the host CPU.
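A short device-side sketch of these PTX wrappers; the kernel name, d_out, and the chosen constants are purely illustrative.

#include <cub/cub.cuh>

__global__ void BitUtilKernel(unsigned int *d_out)
{
    unsigned int x = 0xF0u;

    // (x >> 4) + 1 == 0x10
    unsigned int a = cub::SHR_ADD(x, 4, 1);

    // (x << 4) + 1 == 0xF01
    unsigned int b = cub::SHL_ADD(x, 4, 1);

    // Extract the 4 bits of x starting at bit 4 == 0xF
    unsigned int c = cub::BFE(x, 4, 4);

    // Insert the 4 least-significant bits of 0x5 into x at bit 0 == 0xF5
    unsigned int d;
    cub::BFI(d, x, 0x5u, 0, 4);

    d_out[threadIdx.x] = a + b + c + d;
}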
__device__ __forceinline__ int cub::RowMajorTid (int block_dim_x, int block_dim_y, int block_dim_z)
Returns the row-major linear thread identifier for a multidimensional thread block.

__device__ __forceinline__ unsigned int cub::LaneId ()
Returns the warp lane ID of the calling thread.

__device__ __forceinline__ unsigned int cub::WarpId ()
Returns the warp ID of the calling thread. Warp ID is guaranteed to be unique among warps, but may not correspond to a zero-based ranking within the thread block.
Function Documentation
__device__ __forceinline__ unsigned int cub::LaneMaskLt ()
Returns the warp lane mask of all lanes less than the calling thread.

__device__ __forceinline__ unsigned int cub::LaneMaskLe ()
Returns the warp lane mask of all lanes less than or equal to the calling thread.

__device__ __forceinline__ unsigned int cub::LaneMaskGt ()
Returns the warp lane mask of all lanes greater than the calling thread.

__device__ __forceinline__ unsigned int cub::LaneMaskGe ()
Returns the warp lane mask of all lanes greater than or equal to the calling thread.
The operations exposed by WarpReduce require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
The operations exposed by WarpScan require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
The operations exposed by WarpExchange require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
The operations exposed by WarpLoad require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse
The WarpLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA warp
The WarpStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA warp to a linear segment of memory
RAII helper which saves the current device and switches to the specified device on construction and switches to the saved device on destruction. More...
The BlockDiscontinuity class provides collective methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block.
The BlockHistogram class provides collective methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block.
The BlockLoad class provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA thread block.
The BlockRunLengthDecode class supports decoding a run-length encoded array of items. That is, given the two arrays run_value[N] and run_lengths[N], run_value[i] is repeated run_lengths[i] many times in the output array. Due to the nature of the run-length decoding algorithm ("decompression"), the output size of the run-length decoded array is runtime-dependent and potentially without any upper bound. To address this, BlockRunLengthDecode allows retrieving a "window" from the run-length decoded array. The window's offset can be specified and BLOCK_THREADS * DECODED_ITEMS_PER_THREAD (i.e., referred to as window_size) decoded items from the specified window will be returned. More...
The BlockStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA thread block to a linear segment of memory.
DeviceAdjacentDifference provides device-wide, parallel operations for computing the differences of adjacent elements residing within device-accessible memory. More...
DeviceHistogram provides device-wide parallel operations for constructing histogram(s) from a sequence of samples data residing within device-accessible memory.
DeviceMergeSort provides device-wide, parallel operations for computing a merge sort across a sequence of data items residing within device-accessible memory. More...
DeviceRadixSort provides device-wide, parallel operations for computing a radix sort across a sequence of data items residing within device-accessible memory.
DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within device-accessible memory.
DeviceRunLengthEncode provides device-wide, parallel operations for demarcating "runs" of same-valued items within a sequence residing within device-accessible memory.
DeviceScan provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within device-accessible memory.
DeviceSegmentedRadixSort provides device-wide, parallel operations for computing a batched radix sort across multiple, non-overlapping sequences of data items residing within device-accessible memory.
DeviceSegmentedReduce provides device-wide, parallel operations for computing a reduction across multiple sequences of data items residing within device-accessible memory.
DeviceSelect provides device-wide, parallel operations for compacting selected items from sequences of data items residing within device-accessible memory.
GridEvenShare is a descriptor utility for distributing input among CUDA thread blocks in an "even-share" fashion. Each thread block gets roughly the same number of input tiles. More...
cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block. More...
cub::BlockStoreAlgorithm enumerates alternative algorithms for cub::BlockStore to write a blocked arrangement of items across a CUDA thread block to a linear segment of memory. More...
cub::GridMappingStrategy enumerates alternative strategies for mapping constant-sized tiles of device-wide data onto a grid of CUDA thread blocks. More...
MaxSmOccupancy (int &max_sm_occupancy, KernelPtr kernel_ptr, int block_threads, int dynamic_smem_bytes=0)
Computes maximum SM occupancy in thread blocks for executing the given kernel function pointer kernel_ptr on the current device with block_threads per thread block. More...

__device__ __forceinline__ unsigned int SHR_ADD (unsigned int x, unsigned int shift, unsigned int addend)
Shift-right then add. Returns (x >> shift) + addend. More...

__device__ __forceinline__ unsigned int SHL_ADD (unsigned int x, unsigned int shift, unsigned int addend)
Shift-left then add. Returns (x << shift) + addend. More...

template<typename UnsignedBits >
__device__ __forceinline__ unsigned int BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits)
Bitfield-extract. Extracts num_bits from source starting at bit-offset bit_start. The input source may be an 8b, 16b, 32b, or 64b unsigned integer type. More...

__device__ __forceinline__ void BFI (unsigned int &ret, unsigned int x, unsigned int y, unsigned int bit_start, unsigned int num_bits)
Bitfield insert. Inserts the num_bits least significant bits of y into x at bit-offset bit_start. More...

__device__ __forceinline__ unsigned int IADD3 (unsigned int x, unsigned int y, unsigned int z)
Returns the warp ID of the calling thread. Warp ID is guaranteed to be unique among warps, but may not correspond to a zero-based ranking within the thread block. More...

template<int LOGICAL_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
Returns the warp lane mask of all lanes greater than or equal to the calling thread. More...

template<int LOGICAL_WARP_THREADS, typename T >
__device__ __forceinline__ T ShuffleUp (T input, int src_offset, int first_thread, unsigned int member_mask)
Shuffle-up for any data type. Each warp-lane_i obtains the value input contributed by warp-lane_(i - src_offset). For thread lanes i < src_offset, the thread's own input is returned to the thread.

ShuffleDown (T input, int src_offset, int last_thread, unsigned int member_mask)
Shuffle-down for any data type. Each warp-lane_i obtains the value input contributed by warp-lane_(i + src_offset). For thread lanes i >= WARP_THREADS, the thread's own input is returned to the thread.

ShuffleIndex (T input, int src_lane, unsigned int member_mask)
Shuffle-broadcast for any data type. Each warp-lane_i obtains the value input contributed by warp-lane_src_lane. For src_lane < 0 or src_lane >= WARP_THREADS, the thread's own input is returned to the thread.
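A short sketch of ShuffleUp sharing each lane's value with the next higher lane of a full 32-thread warp (LOGICAL_WARP_THREADS = 32, member_mask = 0xFFFFFFFF); the kernel name and d_out are illustrative.

#include <cub/cub.cuh>

__global__ void ShuffleKernel(int *d_out)
{
    int thread_data = static_cast<int>(cub::LaneId());

    // Each lane i (for i >= 1) obtains the value contributed by lane i - 1;
    // lane 0 keeps its own value
    int prev = cub::ShuffleUp<32>(thread_data, 1, 0, 0xFFFFFFFFu);

    d_out[threadIdx.x] = prev;
}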
LoadDirectBlocked (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default)
Load a linear segment of items into a blocked arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More...

LoadDirectStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default)
Load a linear segment of items into a striped arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More...

template<int BLOCK_THREADS, typename T , int ITEMS_PER_THREAD, typename OutputIteratorT >
LoadDirectWarpStriped (int linear_tid, InputIteratorT block_itr, InputT(&items)[ITEMS_PER_THREAD], int valid_items, DefaultT oob_default)
Load a linear segment of items into a warp-striped arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements. More...
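A short sketch of the guarded LoadDirectBlocked variant, loading four ints per thread and assigning -1 to out-of-bounds items; the kernel name and d_in are illustrative.

#include <cub/cub.cuh>

__global__ void GuardedLoadKernel(int *d_in, int valid_items)
{
    int items[4];

    // Blocked arrangement: thread i receives elements 4*i .. 4*i+3; items past valid_items get -1
    cub::LoadDirectBlocked(threadIdx.x, d_in, items, valid_items, -1);

    // ... use items ...
}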
BlockHistogramAlgorithm enumerates alternative algorithms for the parallel construction of block-wide histograms.

Enumerator
BLOCK_HISTO_SORT
Overview
Sorting followed by differentiation. Execution is comprised of two phases:
Sort the data using efficient radix sort
Look for "runs" of same-valued keys by detecting discontinuities; the run-lengths are histogram bin counts.
Performance Considerations
Delivers consistent throughput regardless of sample bin distribution.

BLOCK_HISTO_ATOMIC
Overview
Use atomic addition to update byte counts directly.
Performance Considerations
Performance is strongly tied to the hardware implementation of atomic addition, and may be significantly degraded for non uniformly-random input distributions where many concurrent updates are likely to be made to the same bin counter.
cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block.
The utilization of memory transactions (coalescing) doesn't depend on the number of items per thread.

BLOCK_LOAD_VECTORIZE
Overview
A blocked arrangement of data is read from memory using CUDA's built-in vectorized loads as a coalescing optimization. For example, ld.global.v4.s32 instructions will be generated when T = int and ITEMS_PER_THREAD % 4 == 0.
Performance Considerations
The utilization of memory transactions (coalescing) remains high until the access stride between threads (i.e., the number of items per thread) exceeds the maximum vector load width (typically 4 items or 64B, whichever is lower).
The following conditions will prevent vectorization and loading will fall back to cub::BLOCK_LOAD_DIRECT:
ITEMS_PER_THREAD is odd
The InputIteratorT is not a simple pointer type
The block input offset is not quadword-aligned
The data type T is not a built-in primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
Provisions more shared storage, but incurs smaller latencies than the BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED alternative.

BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED
Overview
Like BLOCK_LOAD_WARP_TRANSPOSE, a warp-striped arrangement of data is read directly from memory and then is locally transposed into a blocked arrangement. To reduce the shared memory requirement, only one warp's worth of shared memory is provisioned and is subsequently time-sliced among warps.
Usage Considerations
BLOCK_THREADS must be a multiple of WARP_THREADS
Performance Considerations
The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.
Provisions less shared memory temporary storage, but incurs larger latencies than the BLOCK_LOAD_WARP_TRANSPOSE alternative.
BlockReduceAlgorithm enumerates alternative algorithms for parallel reduction across a CUDA thread block.
+
+
Enumerator
BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY
+
Overview
An efficient "raking" reduction algorithm that only supports commutative reduction operators (true for most operations, e.g., addition).
+
Execution is comprised of three phases:
+
Upsweep sequential reduction in registers (if threads contribute more than one input each). Threads in warps other than the first warp place their partial reductions into shared memory.
+
Upsweep sequential reduction in shared memory. Threads within the first warp continue to accumulate by raking across segments of shared partial reductions
+
A warp-synchronous Kogge-Stone style reduction within the raking warp.
+
+
+
+
+
+
BLOCK_REDUCE_RAKING data flow for a hypothetical 16-thread thread block and 4-thread raking warp.
+
Performance Considerations
+
This variant performs less communication than BLOCK_REDUCE_RAKING_NON_COMMUTATIVE and is preferable when the reduction operator is commutative. This variant applies fewer reduction operators than BLOCK_REDUCE_WARP_REDUCTIONS, and can provide higher overall throughput across the GPU when suitably occupied. However, turn-around latency may be higher than to BLOCK_REDUCE_WARP_REDUCTIONS and thus less-desirable when the GPU is under-occupied.
BLOCK_REDUCE_RAKING

Overview
An efficient "raking" reduction algorithm that supports commutative (e.g., addition) and non-commutative (e.g., string concatenation) reduction operators. Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread_i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

Execution is comprised of three phases:

Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.

Upsweep sequential reduction in shared memory. Threads within a single warp rake across segments of shared partial reductions.

A warp-synchronous Kogge-Stone style reduction within the raking warp.

BLOCK_REDUCE_RAKING data flow for a hypothetical 16-thread thread block and 4-thread raking warp.

Performance Considerations

This variant performs more communication than BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY and is only preferable when the reduction operator is non-commutative. This variant applies fewer reduction operators than BLOCK_REDUCE_WARP_REDUCTIONS, and can provide higher overall throughput across the GPU when suitably occupied. However, turn-around latency may be higher than that of BLOCK_REDUCE_WARP_REDUCTIONS, making it less desirable when the GPU is under-occupied.
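A sketch of when the general raking variant is needed: the functor below is a hypothetical operator that is associative but not commutative, so the commutative-only variant would not be safe (block size and names are assumptions):

    #include <cub/block/block_reduce.cuh>

    // Illustrative non-commutative (but associative) reduction operator
    struct FirstNonZero
    {
        __device__ int operator()(int lhs, int rhs) const
        {
            return (lhs != 0) ? lhs : rhs;
        }
    };

    __global__ void FirstNonZeroKernel(const int *d_in, int *d_out)
    {
        using BlockReduceT = cub::BlockReduce<int, 128, cub::BLOCK_REDUCE_RAKING>;
        __shared__ typename BlockReduceT::TempStorage temp_storage;

        int thread_data = d_in[blockIdx.x * 128 + threadIdx.x];
        int aggregate = BlockReduceT(temp_storage).Reduce(thread_data, FirstNonZero());

        if (threadIdx.x == 0)
            d_out[blockIdx.x] = aggregate;   // valid in thread0 only
    }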
BLOCK_REDUCE_WARP_REDUCTIONS

Overview
A quick "tiled warp-reductions" reduction algorithm that supports commutative (e.g., addition) and non-commutative (e.g., string concatenation) reduction operators.

Execution is comprised of three phases:

Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.

Compute a shallow, but inefficient warp-synchronous Kogge-Stone style reduction within each warp.

A propagation phase where the warp reduction outputs in each warp are updated with the aggregate from each preceding warp.

BLOCK_REDUCE_WARP_REDUCTIONS data flow for a hypothetical 16-thread thread block and 4-thread raking warp.

Performance Considerations

This variant applies more reduction operators than BLOCK_REDUCE_RAKING or BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY, which may result in lower overall throughput across the GPU. However, turn-around latency may be lower, making it useful when the GPU is under-occupied.
BlockScanAlgorithm enumerates alternative algorithms for cub::BlockScan to compute a parallel prefix scan across a CUDA thread block.

Enumerator

BLOCK_SCAN_RAKING

Overview
An efficient "raking reduce-then-scan" prefix scan algorithm. Execution is comprised of five phases:

Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.

Upsweep sequential reduction in shared memory. Threads within a single warp rake across segments of shared partial reductions.

A warp-synchronous Kogge-Stone style exclusive scan within the raking warp.

Downsweep sequential exclusive scan in shared memory. Threads within a single warp rake across segments of shared partial reductions, seeded with the warp-scan output.

Downsweep sequential scan in registers (if threads contribute more than one input), seeded with the raking scan output.

BLOCK_SCAN_RAKING data flow for a hypothetical 16-thread thread block and 4-thread raking warp.

Performance Considerations

Although this variant may suffer longer turnaround latencies when the GPU is under-occupied, it can often provide higher overall throughput across the GPU when suitably occupied.
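A sketch of a throughput-oriented exclusive prefix sum using this variant (128 threads, four ints per thread, and the tile layout are assumptions):

    #include <cub/block/block_scan.cuh>

    __global__ void ExclusiveSumKernel(int (*d_tiles)[512])
    {
        using BlockScanT = cub::BlockScan<int, 128, cub::BLOCK_SCAN_RAKING>;
        __shared__ typename BlockScanT::TempStorage temp_storage;

        // Each thread owns four consecutive items of its tile (blocked arrangement)
        int items[4];
        for (int i = 0; i < 4; ++i)
            items[i] = d_tiles[blockIdx.x][threadIdx.x * 4 + i];

        // In-place exclusive prefix sum across the 512-item tile
        BlockScanT(temp_storage).ExclusiveSum(items, items);

        for (int i = 0; i < 4; ++i)
            d_tiles[blockIdx.x][threadIdx.x * 4 + i] = items[i];
    }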
BLOCK_SCAN_RAKING_MEMOIZE

Overview
Similar to cub::BLOCK_SCAN_RAKING, but with fewer shared memory reads at the expense of higher register pressure. Raking threads preserve their "upsweep" segment of values in registers while performing warp-synchronous scan, allowing the "downsweep" not to re-read them from shared memory.
BLOCK_SCAN_WARP_SCANS

Overview
A quick "tiled warpscans" prefix scan algorithm. Execution is comprised of four phases:

Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.

Compute a shallow, but inefficient warp-synchronous Kogge-Stone style scan within each warp.

A propagation phase where the warp scan outputs in each warp are updated with the aggregate from each preceding warp.

Downsweep sequential scan in registers (if threads contribute more than one input), seeded with the warp-scan output.

BLOCK_SCAN_WARP_SCANS data flow for a hypothetical 16-thread thread block and 4-thread raking warp.

Performance Considerations

Although this variant may suffer lower overall throughput across the GPU due to its heavy reliance on inefficient warpscans, it can often provide lower turnaround latencies when the GPU is under-occupied.
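A sketch of the latency-oriented choice, which also returns the block-wide aggregate (the 64-thread block and single item per thread are assumptions):

    #include <cub/block/block_scan.cuh>

    __global__ void InclusiveSumKernel(const int *d_in, int *d_out, int *d_aggregates)
    {
        using BlockScanT = cub::BlockScan<int, 64, cub::BLOCK_SCAN_WARP_SCANS>;
        __shared__ typename BlockScanT::TempStorage temp_storage;

        int item = d_in[blockIdx.x * 64 + threadIdx.x];
        int block_aggregate;

        // Inclusive prefix sum; every thread also receives the tile's total
        BlockScanT(temp_storage).InclusiveSum(item, item, block_aggregate);

        d_out[blockIdx.x * 64 + threadIdx.x] = item;
        if (threadIdx.x == 0)
            d_aggregates[blockIdx.x] = block_aggregate;
    }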
cub::BlockStoreAlgorithm enumerates alternative algorithms for cub::BlockStore to write a blocked arrangement of items across a CUDA thread block to a linear segment of memory.
The utilization of memory transactions (coalescing) remains high regardless of items written per thread.

BLOCK_STORE_VECTORIZE

Overview

A blocked arrangement of data is written directly to memory using CUDA's built-in vectorized stores as a coalescing optimization. For example, st.global.v4.s32 instructions will be generated when T = int and ITEMS_PER_THREAD % 4 == 0.

Performance Considerations

The utilization of memory transactions (coalescing) remains high until the access stride between threads (i.e., the number of items per thread) exceeds the maximum vector store width (typically 4 items or 64B, whichever is lower).

The following conditions will prevent vectorization and cause writing to fall back to cub::BLOCK_STORE_DIRECT:

ITEMS_PER_THREAD is odd
The OutputIteratorT is not a simple pointer type
The block output offset is not quadword-aligned
The data type T is not a built-in primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
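A sketch mirroring the vectorized-load case (block size, item count, and names are again assumptions):

    #include <cub/block/block_store.cuh>

    __global__ void StoreVectorizedKernel(int *d_out)
    {
        using BlockStoreT = cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_VECTORIZE>;
        __shared__ typename BlockStoreT::TempStorage temp_storage;

        int items[4];
        for (int i = 0; i < 4; ++i)
            items[i] = threadIdx.x * 4 + i;   // placeholder values

        // ITEMS_PER_THREAD is a multiple of 4 and d_out is a quadword-aligned
        // pointer to a built-in type, so st.global.v4.s32 can be emitted.
        BlockStoreT(temp_storage).Store(d_out + blockIdx.x * (128 * 4), items);
    }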
A blocked arrangement is locally transposed and then efficiently written to memory as a warp-striped arrangement. To reduce the shared memory requirement, only one warp's worth of shared memory is provisioned and is subsequently time-sliced among warps.

Usage Considerations

BLOCK_THREADS must be a multiple of WARP_THREADS

Performance Considerations

The utilization of memory transactions (coalescing) remains high regardless of items written per thread.

Provisions less shared memory temporary storage, but incurs larger latencies than the BLOCK_STORE_WARP_TRANSPOSE alternative.
Function Documentation

template<int LABEL_BITS>
__device__ unsigned int cub::MatchAny (unsigned int label) [inline]

Compute a 32b mask of threads having the same least-significant LABEL_BITS of label as the calling thread.
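A sketch of one way the returned mask might be used (LABEL_BITS = 4, the label source, and the leader convention are assumptions): lanes whose labels agree in their low 4 bits find one another, and the lowest such lane can act as the group's leader.

    #include <cub/util_ptx.cuh>

    __global__ void MatchExample(const unsigned int *d_labels, unsigned int *d_peer_masks)
    {
        unsigned int label = d_labels[blockIdx.x * blockDim.x + threadIdx.x];

        // 32-bit mask of warp lanes whose low 4 label bits match this lane's
        unsigned int peers = cub::MatchAny<4>(label);

        // Lowest matching lane is this group's leader (peers is never zero,
        // since the calling lane always matches itself)
        bool is_leader = (cub::LaneId() == __ffs(peers) - 1);

        d_peer_masks[blockIdx.x * blockDim.x + threadIdx.x] = peers;
        (void)is_leader;   // ... the leader could aggregate on behalf of its group ...
    }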
template<typename T >
__device__ __forceinline__ void cub::Swap (T &lhs, T &rhs)
template<typename KeyT , typename ValueT , typename CompareOp , int ITEMS_PER_THREAD>

The sorting method is stable. Further details can be found in: A. Nico Habermann. Parallel neighbor sort (or the glory of the induction principle). Technical Report AD-759 248, Carnegie Mellon University, 1972.

Template Parameters

    KeyT                Key type
    ValueT              Value type. If cub::NullType is used as ValueT, only keys are sorted.
    CompareOp           Functor type having member bool operator()(KeyT lhs, KeyT rhs)
    ITEMS_PER_THREAD    The number of items per thread

Parameters

    [in,out]  keys        Keys to sort
    [in,out]  items       Values to sort
    [in]      compare_op  Comparison function object which returns true if the first argument is ordered before the second
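For illustration, a comparison functor that satisfies the CompareOp requirement (the struct name is hypothetical); because the sort is stable, keys that compare equal keep their original relative order.

    // Strict ordering: returns true when lhs is ordered before rhs
    struct CustomLess
    {
        template <typename KeyT>
        __device__ bool operator()(const KeyT &lhs, const KeyT &rhs) const
        {
            return lhs < rhs;
        }
    };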
template<
    typename InputT,
    int ITEMS_PER_THREAD,
    WarpLoadAlgorithm ALGORITHM = WARP_LOAD_DIRECT,
    int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS,
    int PTX_ARCH = CUB_PTX_ARCH>
struct WarpLoad< InputT, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH >::TempStorage

The operations exposed by WarpLoad require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse.
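A sketch of allocating this storage with __shared__ and union'ing it with another collective's storage so both phases reuse the same shared memory bytes (the single-warp launch, item count, and names are assumptions):

    #include <cub/warp/warp_load.cuh>
    #include <cub/warp/warp_reduce.cuh>

    // Assumes one 32-thread warp per block, purely for illustration
    __global__ void WarpLoadExample(const int *d_in, int *d_out)
    {
        using WarpLoadT   = cub::WarpLoad<int, 4>;   // defaults: WARP_LOAD_DIRECT, 32 lanes
        using WarpReduceT = cub::WarpReduce<int>;

        // Union'd allocations: the load and reduce phases reuse the same bytes
        __shared__ union
        {
            typename WarpLoadT::TempStorage   load;
            typename WarpReduceT::TempStorage reduce;
        } temp_storage;

        int items[4];
        WarpLoadT(temp_storage.load).Load(d_in + blockIdx.x * (32 * 4), items);

        int sum = WarpReduceT(temp_storage.reduce)
                      .Sum(items[0] + items[1] + items[2] + items[3]);

        if (threadIdx.x == 0)
            d_out[blockIdx.x] = sum;   // warp aggregate, valid in lane0
    }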
Temporary storage for partially-full block guard. More...

Member Data Documentation

template<typename T , int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockStoreAlgorithm ALGORITHM = BLOCK_STORE_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int PTX_ARCH = CUB_PTX_ARCH>
Frees a live allocation of device memory on the current device, returning it to the allocator.
Once freed, the allocation becomes available immediately for reuse within the active_stream with which it was associated during allocation, and it becomes available for reuse within other streams when all prior work submitted to active_stream has completed.
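This behavior matches the caching allocator's stream-ordered reuse; the sketch below assumes the member being described is cub::CachingDeviceAllocator::DeviceFree (sizes and names are illustrative):

    #include <cub/util_allocator.cuh>

    void AllocatorSketch(cudaStream_t stream)
    {
        cub::CachingDeviceAllocator allocator;

        void *d_buf = nullptr;
        allocator.DeviceAllocate(&d_buf, 1024 * 1024, stream);   // tied to `stream`

        // ... launch work that uses d_buf on `stream` ...

        // Returns the block to the allocator: immediately reusable on `stream`,
        // and reusable on other streams once prior work on `stream` completes.
        allocator.DeviceFree(d_buf);
    }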
    [out]  d_first_part_out       Pointer to the output sequence of data items selected by select_first_part_op
    [out]  d_second_part_out      Pointer to the output sequence of data items selected by select_second_part_op
    [out]  d_unselected_out       Pointer to the output sequence of unselected data items
    [out]  d_num_selected_out     Pointer to the output array with two elements, where the total number of items selected by select_first_part_op is stored as d_num_selected_out[0] and the total number of items selected by select_second_part_op is stored as d_num_selected_out[1], respectively
    [in]   num_items              Total number of items to select from
    [in]   select_first_part_op   Unary selection operator to select d_first_part_out
    [in]   select_second_part_op  Unary selection operator to select d_second_part_out
    [in]   stream                 [optional] CUDA stream to launch kernels within. Default is stream 0.
    [in]   debug_synchronous      [optional] Whether or not to synchronize the stream after every kernel launch to check for errors. May cause significant slowdown. Default is false.
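These parameters appear to belong to a three-way partition overload such as cub::DevicePartition::If; that attribution is an assumption, as is everything in the host-side sketch below (the selection functors, sizes, and pointer names), which follows the usual two-pass temporary-storage pattern:

    #include <cub/device/device_partition.cuh>

    struct LessThanZero  { __device__ bool operator()(int x) const { return x < 0; } };
    struct GreaterThan10 { __device__ bool operator()(int x) const { return x > 10; } };

    cudaError_t PartitionSketch(int *d_in, int *d_first_part_out, int *d_second_part_out,
                                int *d_unselected_out, int *d_num_selected_out, int num_items)
    {
        void  *d_temp_storage     = nullptr;
        size_t temp_storage_bytes = 0;

        // First call: query the temporary storage requirement
        cub::DevicePartition::If(d_temp_storage, temp_storage_bytes, d_in,
                                 d_first_part_out, d_second_part_out, d_unselected_out,
                                 d_num_selected_out, num_items,
                                 LessThanZero(), GreaterThan10());
        cudaMalloc(&d_temp_storage, temp_storage_bytes);

        // Second call: run the three-way partition
        cudaError_t error =
            cub::DevicePartition::If(d_temp_storage, temp_storage_bytes, d_in,
                                     d_first_part_out, d_second_part_out, d_unselected_out,
                                     d_num_selected_out, num_items,
                                     LessThanZero(), GreaterThan10());
        cudaFree(d_temp_storage);
        return error;
    }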
Initializes ranges for the specified thread block index. Specialized for a "raking" access pattern in which each thread block is assigned a consecutive sequence of input tiles. More...

Block-initialization, specialized for a "strip mining" access pattern in which the input tiles assigned to each thread block are separated by a stride equal to the extent of the grid. More...
__device__ __forceinline__ unsigned int cub::SHR_ADD (unsigned int x, unsigned int shift, unsigned int addend)
    Shift-right then add. Returns (x >> shift) + addend. More...

__device__ __forceinline__ unsigned int cub::SHL_ADD (unsigned int x, unsigned int shift, unsigned int addend)
    Shift-left then add. Returns (x << shift) + addend. More...

template<typename UnsignedBits >
__device__ __forceinline__ unsigned int cub::BFE (UnsignedBits source, unsigned int bit_start, unsigned int num_bits)
    Bitfield-extract. Extracts num_bits from source starting at bit-offset bit_start. The input source may be an 8b, 16b, 32b, or 64b unsigned integer type. More...

__device__ __forceinline__ void cub::BFI (unsigned int &ret, unsigned int x, unsigned int y, unsigned int bit_start, unsigned int num_bits)
    Bitfield insert. Inserts the num_bits least significant bits of y into x at bit-offset bit_start. More...

__device__ __forceinline__ unsigned int cub::IADD3 (unsigned int x, unsigned int y, unsigned int z)
    Three-operand add. Returns x + y + z. More...
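A small device-side sketch exercising the utilities above (the operand values and output layout are arbitrary):

    #include <cub/util_ptx.cuh>

    __global__ void BitUtilExample(unsigned int *d_out)   // d_out has at least 5 slots
    {
        unsigned int x = 0xABCD1234u;

        unsigned int a = cub::SHR_ADD(x, 4, 7);   // (x >> 4) + 7
        unsigned int b = cub::SHL_ADD(x, 3, 1);   // (x << 3) + 1
        unsigned int f = cub::BFE(x, 8, 8);       // bits [8..15] of x, i.e. 0x12

        unsigned int merged;
        cub::BFI(merged, x, 0x3u, 4, 2);          // low 2 bits of 0x3 written into x at bit 4

        unsigned int s = cub::IADD3(1u, 2u, 3u);  // 6

        d_out[0] = a; d_out[1] = b; d_out[2] = f; d_out[3] = merged; d_out[4] = s;
    }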
Returns the warp ID of the calling thread. Warp ID is guaranteed to be unique among warps, but may not correspond to a zero-based ranking within the thread block. More...
template<int LOGICAL_WARP_THREADS, int PTX_ARCH = CUB_PTX_ARCH>
The operations exposed by WarpLoad require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More...
The operations exposed by WarpReduce require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More...
The operations exposed by WarpScan require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union'd with other storage allocation types to facilitate memory reuse. More...