diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
index 8863a753fc7..c71a1af4808 100644
--- a/.github/ISSUE_TEMPLATE/bug_report.md
+++ b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -1,6 +1,6 @@
---
name: Bug report
-about: Create a report to help us improve
+about: Create a report to help us improve cuGraph
title: "[BUG]"
labels: "? - Needs Triage, bug"
assignees: ''

@@ -10,29 +10,19 @@ assignees: ''
**Describe the bug**
A clear and concise description of what the bug is.

-**To Reproduce**
-Steps to reproduce the behavior:
-1. Go to '...'
-2. Click on '....'
-3. Scroll down to '....'
-4. See error
+**Steps/Code to reproduce bug**
+Follow this guide http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports to craft a minimal bug report. This helps us reproduce the issue you're having and resolve it more quickly.

**Expected behavior**
A clear and concise description of what you expected to happen.

-**Screenshots**
-If applicable, add screenshots to help explain your problem.
+**Environment overview (please complete the following information)**
+ - Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
+ - Method of cuGraph install: [conda, Docker, or from source]
+ - If method of install is [Docker], provide `docker pull` & `docker run` commands used

-**Desktop (please complete the following information):**
- - OS: [e.g. iOS]
- - Browser [e.g. chrome, safari]
- - Version [e.g. 22]
-
-**Smartphone (please complete the following information):**
- - Device: [e.g. iPhone6]
- - OS: [e.g. iOS8.1]
- - Browser [e.g. stock browser, safari]
- - Version [e.g. 22]
+**Environment details**
+Please run and paste the output of the `cugraph/print_env.sh` script here, to gather any other relevant environment details.

**Additional context**
Add any other context about the problem here.
diff --git a/.github/ISSUE_TEMPLATE/documentation_request.md b/.github/ISSUE_TEMPLATE/documentation_request.md
new file mode 100644
index 00000000000..595a87e191e
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/documentation_request.md
@@ -0,0 +1,35 @@
+---
+name: Documentation request
+about: Report incorrect or needed documentation
+title: "[DOC]"
+labels: "? - Needs Triage, doc"
+assignees: ''
+
+---
+
+## Report incorrect documentation
+
+**Location of incorrect documentation**
+Provide links and line numbers if applicable.
+
+**Describe the problems or issues found in the documentation**
+A clear and concise description of what you found to be incorrect.
+
+**Steps taken to verify documentation is incorrect**
+List any steps you have taken:
+
+**Suggested fix for documentation**
+Detail proposed changes to fix the documentation if you have any.
+
+---
+
+## Report needed documentation
+
+**Report needed documentation**
+A clear and concise description of what documentation you believe is needed and why.
+
+**Describe the documentation you'd like**
+A clear and concise description of what you want to happen.
+
+**Steps taken to search for needed documentation**
+List any steps you have taken:
\ No newline at end of file
diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md
index c3b1a9ac71d..e5e02a4cb2d 100644
--- a/.github/ISSUE_TEMPLATE/feature_request.md
+++ b/.github/ISSUE_TEMPLATE/feature_request.md
@@ -8,7 +8,7 @@ assignees: ''

---

**Is your feature request related to a problem? Please describe.**
-A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
+A clear and concise description of what the problem is. Ex. I wish I could use cuGraph to do [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

@@ -16,7 +16,5 @@ A clear and concise description of what you want to happen.
**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

-**Task List**
-A clear list of task should be called out

**Additional context**
-Add any other context or screenshots about the feature request here.
+Add any other context, code examples, or references to existing implementations for the feature request here.
\ No newline at end of file
diff --git a/.github/ISSUE_TEMPLATE/question.md b/.github/ISSUE_TEMPLATE/question.md
index cc2d5cb79ad..a9b590525aa 100644
--- a/.github/ISSUE_TEMPLATE/question.md
+++ b/.github/ISSUE_TEMPLATE/question.md
@@ -1,6 +1,6 @@
---
name: Question
-about: Ask a Question
+about: Ask a Question about cuGraph
title: "[QST]"
labels: "? - Needs Triage, question"
assignees: ''
diff --git a/.gitignore b/.gitignore
index 517ceab566b..30bcd5a845d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -76,4 +76,7 @@ cpp/doxygen/html

# Raft symlink
python/cugraph/raft
-python/_external_repositories/
\ No newline at end of file
+python/_external_repositories/
+
+# created by Dask tests
+python/dask-worker-space
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5118b9c9059..5036d07e005 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,9 +1,102 @@
+# cuGraph 0.15.0 (26 Aug 2020)
+
+## New Features
+- PR #940 Add MG Batch BC
+- PR #937 Add wrapper for gunrock HITS algorithm
+- PR #939 Updated Notebooks to include new features and benchmarks
+- PR #944 MG pagerank (dask)
+- PR #947 MG pagerank (CUDA)
+- PR #826 Bipartite Graph python API
+- PR #963 Renumbering refactor, add multi GPU support
+- PR #964 MG BFS (CUDA)
+- PR #990 MG Consolidation
+- PR #993 Add persistent Handle for Comms
+- PR #979 Add hypergraph implementation to convert DataFrames into Graphs
+- PR #1010 MG BFS (dask)
+- PR #1018 MG personalized pagerank
+- PR #1047 Updated select tests to use new dataset list that includes asymmetric directed graph
+- PR #1090 Add experimental Leiden function
+- PR #1077 Updated/added copyright notices, added copyright CI check from cuml
+- PR #1100 Add support for new build process (Project Flash)
+- PR #1093 New benchmarking notebook
+
+## Improvements
+- PR #898 Add Edge Betweenness Centrality, and endpoints to BC
+- PR #913 Eliminate `rmm.device_array` usage
+- PR #903 Add short commit hash to conda package
+- PR #920 modify bfs test, update graph number_of_edges, update storage of transposedAdjList in Graph
+- PR #933 Update mg_degree to use raft, add python tests
+- PR #930 rename test_utils.h to utilities/test_utils.hpp and remove thrust dependency
+- PR #934 Update conda dev environment.yml dependencies to 0.15
+- PR #942 Removed references to deprecated RMM headers.
+- PR #941 Regression python/cudf fix
+- PR #945 Simplified benchmark --no-rmm-reinit option, updated default options
+- PR #946 Install meta packages for dependencies
+- PR #952 Updated get_test_data.sh to also (optionally) download and install datasets for benchmark runs
+- PR #953 fix setting RAFT_DIR from the RAFT_PATH env var
+- PR #954 Update cuGraph error handling to use RAFT
+- PR #968 Add build script for CI benchmark integration
+- PR #959 Add support for uint32_t and int64_t types for BFS (cpp side)
+- PR #962 Update dask pagerank
+- PR #975 Upgrade GitHub template
+- PR #976 Fix error in Graph.edges(), update cuDF rename() calls
+- PR #977 Update force_atlas2 to call on_train_end after iterating
+- PR #980 Replace nvgraph Spectral Clustering (SC) functionality with RAFT SC
+- PR #987 Move graph out of experimental namespace
+- PR #984 Removing codecov until we figure out how to interpret failures that block CI
+- PR #985 Add raft handle to BFS, BC and edge BC
+- PR #991 Update conda upload versions for new supported CUDA/Python
+- PR #988 Add clang and clang tools to the conda env
+- PR #997 Update setup.cfg to run pytests under cugraph tests directory only
+- PR #1007 Add tolerance support to MG Pagerank and fix
+- PR #1009 Update benchmarks script to include requirements used
+- PR #1014 Fix benchmarks script variable name
+- PR #1021 Update cuGraph to use RAFT CUDA utilities
+- PR #1019 Remove deprecated CUDA library calls
+- PR #1024 Updated conda environment YML files
+- PR #1026 update chunksize for mnmg, remove files and unused code
+- PR #1028 Update benchmarks script to use ASV_LABEL
+- PR #1030 MG directory org and documentation
+- PR #1020 Updated Louvain to honor max_level, ECG now calls Louvain for 1 level, then full run.
+- PR #1031 MG notebook
+- PR #1034 Expose resolution (gamma) parameter in Louvain
+- PR #1037 Centralize test main function and replace usage of deprecated `cnmem_memory_resource`
+- PR #1041 Use S3 bucket directly for benchmark plugin
+- PR #1056 Fix MG BFS performance
+- PR #1062 Compute max_vertex_id in mnmg local data computation
+- PR #1068 Remove unused thirdparty code
+- PR #1105 Update `master` references to `main`
+
+## Bug Fixes
+- PR #936 Update Force Atlas 2 doc and wrapper
+- PR #938 Quote conda installs to avoid bash interpretation
+- PR #966 Fix build error (debug mode)
+- PR #983 Fix offset calculation in COO to CSR
+- PR #989 Fix issue with incorrect docker image being used in local build script
+- PR #992 Fix unrenumber of predecessor
+- PR #1008 Fix for cudf updates disabling iteration of Series/Columns/Index
+- PR #1012 Fix Local build script README
+- PR #1017 Fix more mg bugs
+- PR #1022 Fix support for using a cudf.DataFrame with a MG graph
+- PR #1025 Explicitly skip raft test folder for pytest 6.0.0
+- PR #1027 Fix documentation
+- PR #1033 Fix repartition error in big datasets, updated coroutine, fixed warnings
+- PR #1036 Fixed benchmarks for new renumbering API, updated comments, added quick test-only benchmark run to CI
+- PR #1040 Fix spectral clustering renumbering issue
+- PR #1057 Updated raft dependency to pull fixes on cusparse selection in CUDA 11
+- PR #1066 Update cugunrock to not build for unsupported CUDA architectures
+- PR #1069 Fixed CUDA 11 Pagerank crash by replacing CUB's SpMV with raft's.
+- PR #1083 Fix NBs to run in nightly test run, update renumbering text, cleanup
+- PR #1087 Updated benchmarks README to better describe how to get plugin, added rapids-pytest-benchmark plugin to conda dev environments
+- PR #1101 Removed unnecessary device-to-host copy which caused a performance regression
+- PR #1106 Added new release.ipynb to notebook test skip list
+
# cuGraph 0.14.0 (03 Jun 2020)

## New Features
- PR #756 Add Force Atlas 2 layout
- PR #822 Added new functions in python graph class, similar to networkx
-- PR #840 OPG degree
+- PR #840 MG degree
- PR #875 UVM notebook
- PR #881 Raft integration infrastructure

@@ -24,7 +117,7 @@
- PR #807 Updating the Python docs
- PR #817 Add native Betweenness Centrality with sources subset
- PR #818 Initial version of new "benchmarks" folder
-- PR #820 OPG infra and all-gather smoke test
+- PR #820 MG infra and all-gather smoke test
- PR #823 Remove gdf column from nvgraph
- PR #829 Updated README and CONTRIBUTIOIN docs
- PR #831 Updated Notebook - Added K-Truss, ECG, and Betweenness Centrality

@@ -41,6 +134,7 @@
- PR #874 Update setup.py to use custom clean command
- PR #876 Add BFS C++ tests
- PR #878 Updated build script
+- PR #887 Updates test to common datasets
- PR #879 Add docs build script to repository
- PR #880 Remove remaining gdf_column references
- PR #882 Add Force Atlas 2 to benchmarks

@@ -49,6 +143,7 @@
- PR #897 Remove RMM ALLOC calls
- PR #899 Update include paths to remove deleted cudf headers
- PR #906 Update Louvain notebook
+- PR #948 Move doc customization scripts to Jenkins

## Bug Fixes
- PR #927 Update scikit learn dependency

@@ -65,11 +160,13 @@
- PR #860 Fix all Notebooks
- PR #870 Fix Louvain
- PR #889 Added missing conftest.py file to benchmarks dir
-- PR #896 opg dask infrastructure fixes
+- PR #896 mg dask infrastructure fixes
- PR #907 Fix bfs directed missing vertices
- PR #911 Env and changelog update
- PR #923 Updated pagerank with @afender 's temp fix for double-free crash
- PR #928 Fix scikit learn test install to work with libgcc-ng 7.3
+- PR #935 Merge
+- PR #956 Use new gpuCI image in local build script

# cuGraph 0.13.0 (31 Mar 2020)
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 54c931bdae5..ddd4fd0f9f4 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -13,8 +13,8 @@ __Style Formatting Tools:__
* `flake8` version 3.5.0+

-
-## 1) File an Issue for the RAPIDS cuGraph team to work
+
+## 1) File an Issue for the RAPIDS cuGraph team to work
To file an issue, go to the RAPIDS cuGraph [issue](https://github.com/rapidsai/cugraph/issues/new/choose) page an select the appropriate issue type. Once an issue is filed the RAPIDS cuGraph team will evaluate and triage the issue. If you believe the issue needs priority attention, please include that in the issue to notify the team.

***Bug Report***

@@ -36,8 +36,8 @@ There are several ways to ask questions, including [Stack Overflow]( https://sta
- describing your question

-
-## 2) Propose a New Feature and Implement It
+
+## 2) Propose a New Feature and Implement It

We love when people want to get involved, and if you have a suggestion for a new feature or enhancement and want to be the one doing the development work, we fully encourage that.

@@ -46,8 +46,8 @@ We love when people want to get involved, and if you have a suggestion for a new
- Once we agree that the plan looks good, go ahead and implement it
- Follow the [code contributions](#code-contributions) guide below.

-
-## 3) You want to implement a feature or bug-fix for an outstanding issue
+
+## 3) You want to implement a feature or bug-fix for an outstanding issue
- Find an open Issue, and post that you would like to work that issues
- Once we agree that the plan looks good, go ahead and implement it
- Follow the [code contributions](#code-contributions) guide below.

@@ -55,8 +55,8 @@ We love when people want to get involved, and if you have a suggestion for a new
If you need more context on a particular issue, please ask.

----
-
-# So you want to contribute code
+
+# So you want to contribute code

**TL;DR General Development Process**
1. Read the documentation on [building from source](SOURCEBUILD.md) to learn how to setup, and validate, the development environment

@@ -74,11 +74,14 @@ If you need more context on a particular issue, please ask.

Remember, if you are unsure about anything, don't hesitate to comment on issues and ask for clarifications!

+**The _FIXME_ comment**
+
+Use the _FIXME_ comment to capture technical debt. It should not be used to flag bugs since those need to be cleaned up before code is submitted.
+We are implementing a script to count and track the number of FIXMEs in the code. Usage of TODO or any other tag will not be accepted.

-## Fork a private copy of cuGraph
-
+## Fork a private copy of cuGraph
The RAPIDS cuGraph repo cannot directly be modified. Contributions must come in the form of a *Pull Request* from a forked version of cugraph. GitHub as a nice write up ion the process: https://help.github.com/en/github/getting-started-with-github/fork-a-repo

1. Fork the cugraph repo to your GitHub account

@@ -92,7 +95,8 @@ Read the section on [building cuGraph from source](SOURCEBUILD.md) to validate t
```git remote add upstream https://github.com/rapidsai/cugraph.git```

3. Checkout the latest branch
-cuGraph only allows contribution to the current branch and not main or a future branch. PLease check the [cuGraph](https://github.com/rapidsai/cugraph) page for the name of the current branch.
+cuGraph only allows contribution to the current branch and not main or a future branch. Please check the [cuGraph](https://github.com/rapidsai/cugraph) page for the name of the current branch.
+
```git checkout branch-x.x```

4. Code .....
diff --git a/Dockerfile b/Dockerfile
index 53169427136..de0b1e8c10b 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,4 +1,4 @@
-# built from https://github.com/rapidsai/cudf/blob/master/Dockerfile
+# built from https://github.com/rapidsai/cudf/blob/main/Dockerfile
FROM cudf

ADD src /cugraph/src
diff --git a/PRTAGS.md b/PRTAGS.md
index 91c47e035a4..8ec23ea30ac 100644
--- a/PRTAGS.md
+++ b/PRTAGS.md
@@ -8,5 +8,5 @@ PR = Pull Request
| WIP | _Work In Progress_ - Within the RAPIDS cuGraph team, we try to open a PR when development starts. This allows other to review code as it is being developed and provide feedback before too much code needs to be refactored. It also allows process to be tracked |
| skip-ci | _Do Not Run CI_ - This flag prevents CI from being run. It is good practice to include this with the **WIP** tag since code is typically not at a point where it will pass CI. |
| skip ci | same as above |
-| API-REVIEW | This tag request a code review just of the API portion of the code - This is benificial to ensure that all required arguments are captured. Doing this early can save from having to refactor later. |
-| REVIEW | The code is ready for a full code review. Only code that has passed a code review is merged into the baseline |
\ No newline at end of file
+| API-REVIEW | This tag requests a code review just of the API portion of the code - This is beneficial to ensure that all required arguments are captured. Doing this early can save from having to refactor later. |
+| REVIEW | The code is ready for a full code review. Only code that has passed a code review is merged into the baseline |
diff --git a/README.md b/README.md
index f745ea1a0e3..45405d902bf 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@

[![Build Status](https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cugraph/job/branches/job/cugraph-branch-pipeline/badge/icon)](https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cugraph/job/branches/job/cugraph-branch-pipeline/)

-The [RAPIDS](https://rapids.ai) cuGraph library is a collection of GPU accelerated graph algorithms that process data found in [GPU DataFrames](https://github.com/rapidsai/cudf). The vision of cuGraph is _to make graph analysis ubiquitous to the point that users just think in terms of analysis and not technologies or frameworks_. To realize that vision, cuGraph operators, at the Python layer, on GPU DataFrames, allowing for seamless passing of data between ETL tasks in [cuDF](https://github.com/rapidsai/cudf) and machine learning tasks in [cuML](https://github.com/rapidsai/cuml). Data scientist familiar with Python will quickly pick up how cuGraph integrates with the Pandas-like API of cuDF. Likewise, user familiar with NetworkX will quickly reconnize the NetworkX-like API provided in cuGraph, with the goal being to allow existing code to be ported with minimal effort into RAPIDS. For users familiar with C++/CUDA and graph structures, a C++ API is also provided. However, there is less type and structure checking at the C++ layer.
+The [RAPIDS](https://rapids.ai) cuGraph library is a collection of GPU accelerated graph algorithms that process data found in [GPU DataFrames](https://github.com/rapidsai/cudf). The vision of cuGraph is _to make graph analysis ubiquitous to the point that users just think in terms of analysis and not technologies or frameworks_. To realize that vision, cuGraph operates, at the Python layer, on GPU DataFrames, thereby allowing for seamless passing of data between ETL tasks in [cuDF](https://github.com/rapidsai/cudf) and machine learning tasks in [cuML](https://github.com/rapidsai/cuml). Data scientists familiar with Python will quickly pick up how cuGraph integrates with the Pandas-like API of cuDF. Likewise, users familiar with NetworkX will quickly recognize the NetworkX-like API provided in cuGraph, with the goal to allow existing code to be ported with minimal effort into RAPIDS. For users familiar with C++/CUDA and graph structures, a C++ API is also provided. However, there is less type and structure checking at the C++ layer.

For more project details, see [rapids.ai](https://rapids.ai/).

@@ -10,59 +10,62 @@ The [RAPIDS](https://rapids.ai) cuGraph library is a collection of GPU accelerat

-```markdown
+```python
import cugraph
+import cudf

# read data into a cuDF DataFrame using read_csv
-gdf = cudf.read_csv("graph_data.csv", names=["src", "dst"], dtype=["int32", "int32"] )
+gdf = cudf.read_csv("graph_data.csv", names=["src", "dst"], dtype=["int32", "int32"])

# We now have data as edge pairs
-# create a Graph using the source (src) and destination (dst) vertex pairs the GDF
+# create a Graph using the source (src) and destination (dst) vertex pairs
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

# Let's now get the PageRank score of each vertex by calling cugraph.pagerank
-gdf_page = cugraph.pagerank(G)
+df_page = cugraph.pagerank(G)

# Let's look at the PageRank Score (only do this on small graphs)
-for i in range(len(gdf_page)):
-    print("vertex " + str(gdf_page['vertex'][i]) +
-          " PageRank is " + str(gdf_page['pagerank'][i]))
+for i in range(len(df_page)):
+    print("vertex " + str(df_page['vertex'].iloc[i]) +
+          " PageRank is " + str(df_page['pagerank'].iloc[i]))
```

## Supported Algorithms

-| Category | Algorithm | Sacle | Notes
+| Category | Algorithm | Scale | Notes
| ------------ | -------------------------------------- | ------------ | ------------------- |
| Centrality | | | |
| | Katz | Single-GPU | |
| | Betweenness Centrality | Single-GPU | |
+| | Edge Betweenness Centrality | Single-GPU | |
| Community | | | |
+| | Leiden | Single-GPU | |
| | Louvain | Single-GPU | |
| | Ensemble Clustering for Graphs | Single-GPU | |
| | Spectral-Clustering - Balanced Cut | Single-GPU | |
-| | Spectral-Clustering | Single-GPU | |
+| | Spectral-Clustering - Modularity | Single-GPU | |
| | Subgraph Extraction | Single-GPU | |
| | Triangle Counting | Single-GPU | |
+| | K-Truss | Single-GPU | |
| Components | | | |
| | Weakly Connected Components | Single-GPU | |
| | Strongly Connected Components | Single-GPU | |
| Core | | | |
| | K-Core | Single-GPU | |
| | Core Number | Single-GPU | |
-| | K-Truss | Single-GPU | |
| Layout | | | |
| | Force Atlas 2 | Single-GPU | |
| Link Analysis| | | |
-| | Pagerank | Single-GPU | Multi-GPU on DGX avaible |
-| | Personal Pagerank | Single-GPU | |
+| | Pagerank | Multiple-GPU | limited to 2 billion vertices |
+| | Personal Pagerank | Multiple-GPU | limited to 2 billion vertices |
+| | HITS | Single-GPU | leverages Gunrock |
| Link Prediction | | | |
-| | Jacard Similarity | Single-GPU | |
-| | Weighted Jacard Similarity | Single-GPU | |
+| | Jaccard Similarity | Single-GPU | |
+| | Weighted Jaccard Similarity | Single-GPU | |
| | Overlap Similarity | Single-GPU | |
| Traversal | | | |
-| | Breadth First Search (BFS) | Single-GPU | |
+| | Breadth First Search (BFS) | Multiple-GPU | limited to 2 billion vertices |
| | Single Source Shortest Path (SSSP) | Single-GPU | |
| Structure | | | |
| | Renumbering | Single-GPU | Also for multiple columns |

@@ -78,26 +81,25 @@ for i in range(len(gdf_page)):

## cuGraph Notice

The current version of cuGraph has some limitations:

-- Vertex IDs need to be 32-bit integers.
+- Vertex IDs need to be 32-bit integers (that restriction is going away in 0.16)
- Vertex IDs are expected to be contiguous integers starting from 0.
-- If the starting index is not zero, cuGraph will add disconnected vertices to fill in the missing range. (Auto-) Renumbering fixes this issue

-cuGraph provides the renumber function to mitigate this problem. Input vertex IDs for the renumber function can be any type, can be non-contiguous, and can start from an arbitrary number. The renumber function maps the provided input vertex IDs to 32-bit contiguous integers starting from 0. cuGraph still requires the renumbered vertex IDs to be representable in 32-bit integers. These limitations are being addressed and will be fixed soon.
+cuGraph provides the renumber function to mitigate this problem, which is by default automatically called when data is added to a graph. Input vertex IDs for the renumber function can be any type, can be non-contiguous, can be multiple columns, and can start from an arbitrary number. The renumber function maps the provided input vertex IDs to 32-bit contiguous integers starting from 0. cuGraph still requires the renumbered vertex IDs to be representable in 32-bit integers. These limitations are being addressed and will be fixed soon.

-cuGraph provides an auto-renumbering feature, enabled by default, during Graph creating. Renumbered vertices are automaticaly un-renumbered.
+Additionally, when using the auto-renumbering feature, vertices are automatically un-renumbered in results (a short sketch of this flow follows this README diff).

-cuGraph is constantly being updatred and improved. Please see the [Transition Guide](TRANSITIONGUIDE.md) if errors are encountered with newer versions
+cuGraph is constantly being updated and improved. Please see the [Transition Guide](TRANSITIONGUIDE.md) if errors are encountered with newer versions

## Graph Sizes and GPU Memory Size

-The amount of memory required is dependent on the graph structure and the analytics being executed. As a simple rule of thumb, the amount of GPU memory should be about twice the size of the data size. That gives overhead for the CSV reader and other transform functions. There are ways around the rule but using smaller data chunks.
-
-
-| Size | Recomended GPU Memory |
-|-------------------|-----------------------|
-| 500 million edges | 32GB |
-| 250 million edges | 16 GB |
+The amount of memory required is dependent on the graph structure and the analytics being executed. As a simple rule of thumb, the amount of GPU memory should be about twice the size of the data size. That gives overhead for the CSV reader and other transform functions. There are ways around the rule by using smaller data chunks.

+| Size | Recommended GPU Memory |
+|-------------------|------------------------|
+| 500 million edges | 32 GB |
+| 250 million edges | 16 GB |

+Managed memory for oversubscription can also be used to exceed the above memory limitations. See the recent blog on _Tackling Large Graphs with RAPIDS cuGraph and CUDA Unified Memory on GPUs_: https://medium.com/rapids-ai/tackling-large-graphs-with-rapids-cugraph-and-unified-virtual-memory-b5b69a065d4

## Getting cuGraph

@@ -108,35 +110,33 @@ There are 3 ways to get cuGraph :
3. [Build from Source](#source)

-
-## Quick Start
+
+## Quick Start

Please see the [Demo Docker Repository](https://hub.docker.com/r/rapidsai/rapidsai/), choosing a tag based on the NVIDIA CUDA version you’re running. This provides a ready to run Docker container with example notebooks and data, showcasing how you can utilize all of the RAPIDS libraries: cuDF, cuML, and cuGraph.

-
-### Conda
+### Conda

It is easy to install cuGraph using conda. You can get a minimal conda installation with [Miniconda](https://conda.io/miniconda.html) or get the full installation with [Anaconda](https://www.anaconda.com/download).
Install and update cuGraph using the conda command:

```bash
-# CUDA 10.0
-conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cugraph cudatoolkit=10.0
-
# CUDA 10.1
conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cugraph cudatoolkit=10.1

# CUDA 10.2
conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cugraph cudatoolkit=10.2
+
+# CUDA 11.0
+conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cugraph cudatoolkit=11.0
```

-Note: This conda installation only applies to Linux and Python versions 3.6/3.7.
+Note: This conda installation only applies to Linux and Python versions 3.7/3.8.

-
-### Build from Source and Contributing
+### Build from Source and Contributing

Please see our [guide for building cuGraph from source](SOURCEBUILD.md)

@@ -153,7 +153,7 @@

Python API documentation can be generated from [docs](docs) directory.

## Open GPU Data Science

-The RAPIDS suite of open source software libraries aim to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposing that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
+The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, while exposing that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
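To make the renumbering notes in the README diff above concrete, here is a minimal sketch of the auto-renumbering flow. It only uses calls that appear elsewhere in this diff (`from_cudf_edgelist` with `renumber=True`, and `cugraph.pagerank`); the sample vertex IDs are illustrative.

```python
import cudf
import cugraph

# An edge list whose vertex IDs are non-contiguous and do not start at 0
gdf = cudf.DataFrame({"src": [1001, 1003, 1005],
                      "dst": [1003, 1005, 1001]})

# renumber=True (the default) maps these IDs to contiguous 32-bit
# integers starting at 0 before the graph is built
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source="src", destination="dst", renumber=True)

# Results are automatically un-renumbered, so the 'vertex' column of the
# PageRank result contains the original IDs (1001, 1003, 1005)
df_page = cugraph.pagerank(G)
print(df_page)
```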

diff --git a/SOURCEBUILD.md b/SOURCEBUILD.md
index 23beee55f07..29aa20ad522 100644
--- a/SOURCEBUILD.md
+++ b/SOURCEBUILD.md
@@ -12,7 +12,7 @@ __Compiler__:
* `cmake` version 3.12+

__CUDA:__
-* CUDA 10.0+
+* CUDA 10.1+
* NVIDIA driver 396.44+
* Pascal architecture or better

@@ -47,8 +47,7 @@ __Create the conda development environment__

```bash
# create the conda environment (assuming in base `cugraph` directory)
-# for CUDA 10
-conda env create --name cugraph_dev --file conda/environments/cugraph_dev_cuda10.0.yml
+
# for CUDA 10.1
conda env create --name cugraph_dev --file conda/environments/cugraph_dev_cuda10.1.yml

@@ -56,6 +55,9 @@ conda env create --name cugraph_dev --file conda/environments/cugraph_dev_cuda10
# for CUDA 10.2
conda env create --name cugraph_dev --file conda/environments/cugraph_dev_cuda10.2.yml

+# for CUDA 11
+conda env create --name cugraph_dev --file conda/environments/cugraph_dev_cuda11.0.yml
+
# activate the environment
conda activate cugraph_dev

@@ -68,15 +70,15 @@ conda deactivate

```bash
-# for CUDA 10
-conda env update --name cugraph_dev --file conda/environments/cugraph_dev_cuda10.0.yml
-
# for CUDA 10.1
conda env update --name cugraph_dev --file conda/environments/cugraph_dev_cuda10.1.yml

# for CUDA 10.2
conda env update --name cugraph_dev --file conda/environments/cugraph_dev_cuda10.2.yml

+# for CUDA 11
+conda env update --name cugraph_dev --file conda/environments/cugraph_dev_cuda11.0.yml
+
conda activate cugraph_dev
```

@@ -200,7 +202,7 @@ Run either the C++ or the Python tests with datasets

make test
```

-Note: This conda installation only applies to Linux and Python versions 3.6/3.7.
+Note: This conda installation only applies to Linux and Python versions 3.7/3.8.

### Building and Testing on a gpuCI image locally

@@ -226,8 +228,8 @@ Next the env_vars.sh file needs to be edited

vi ./etc/conda/activate.d/env_vars.sh

#!/bin/bash
-export PATH=/usr/local/cuda-10.0/bin:$PATH # or cuda-10.2 if using CUDA 10.2
-export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH # or cuda-10.2 if using CUDA 10.2
+export PATH=/usr/local/cuda-10.1/bin:$PATH # or cuda-10.2 if using CUDA 10.2
+export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH # or cuda-10.2 if using CUDA 10.2
```

```
diff --git a/benchmarks/README.md b/benchmarks/README.md
index 7aa581d14bb..0190b2870de 100644
--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -15,13 +15,15 @@ directory under the root of the `cuGraph` source tree.
* cugraph built and installed (or `cugraph` sources and built C++ extensions available on `PYTHONPATH`)

-* rapids-pytest-benchmark pytest plugin (`conda install -c rlratzel
+* rapids-pytest-benchmark pytest plugin (`conda install -c rapidsai
  rapids-pytest-benchmark`)
-  * NOTE: the `rlratzel` channel is temporary! This plugin will eventually be
-    moved to a more standard channel
-* specific datasets installed in /datasets (see benchmark sources in
-  this dir for details)
+* The benchmark datasets downloaded and installed in /datasets. Run the
+script below from the /datasets directory:
+```
+cd /datasets
+./get_test_data.sh --benchmark
+```

## Usage (Python)
### Python

@@ -33,6 +35,7 @@ directory under the root of the `cuGraph` source tree.
## Examples
### Python
+_**NOTE: these commands must be run from the `/benchmarks` directory.**_
* Run all the benchmarks and print their names on a separate line (`-v`), and generate a report to stdout
```
(rapids) user@machine:/cugraph/benchmarks> pytest -v
diff --git a/benchmarks/bench_algos.py b/benchmarks/bench_algos.py
index 91dc8fbb0fa..9be636ca480 100644
--- a/benchmarks/bench_algos.py
+++ b/benchmarks/bench_algos.py
@@ -1,3 +1,16 @@
+# Copyright (c) 2020, NVIDIA CORPORATION.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
import pytest

import pytest_benchmark

@@ -17,6 +30,7 @@ def setFixtureParamNames(*args, **kwargs):
        pass

import cugraph
+from cugraph.structure.number_map import NumberMap
from cugraph.tests import utils

import rmm

@@ -47,12 +61,26 @@ def createGraph(csvFileName, graphType=None):
                       renumber=True)


+# Record the current RMM settings so reinitialize() will be called only when a
+# change is needed (RMM defaults both values to False). This allows the
+# --no-rmm-reinit option to prevent reinitialize() from being called at all
+# (see conftest.py for details).
+RMM_SETTINGS = {"managed_mem": False,
+                "pool_alloc": False}
+
+
def reinitRMM(managed_mem, pool_alloc):
-    rmm.reinitialize(
-        managed_memory=managed_mem,
-        pool_allocator=pool_alloc,
-        initial_pool_size=2 << 27
-    )
+
+    if (managed_mem != RMM_SETTINGS["managed_mem"]) or \
+       (pool_alloc != RMM_SETTINGS["pool_alloc"]):
+
+        rmm.reinitialize(
+            managed_memory=managed_mem,
+            pool_allocator=pool_alloc,
+            initial_pool_size=2 << 27
+        )
+        RMM_SETTINGS.update(managed_mem=managed_mem,
+                            pool_alloc=pool_alloc)


###############################################################################
@@ -78,8 +106,7 @@ def edgelistCreated(request):
    setFixtureParamNames(request, ["dataset", "managed_mem", "pool_allocator"])
    csvFileName = request.param[0]
-    if len(request.param) > 1:
-        reinitRMM(request.param[1], request.param[2])
+    reinitRMM(request.param[1], request.param[2])
    return utils.read_csv_file(csvFileName)

@@ -92,8 +119,7 @@ def graphWithAdjListComputed(request):
    """
    setFixtureParamNames(request, ["dataset", "managed_mem", "pool_allocator"])
    csvFileName = request.param[0]
-    if len(request.param) > 1:
-        reinitRMM(request.param[1], request.param[2])
+    reinitRMM(request.param[1], request.param[2])

    G = createGraph(csvFileName, cugraph.structure.graph.Graph)
    G.view_adj_list()

@@ -109,8 +135,7 @@ def anyGraphWithAdjListComputed(request):
    """
    setFixtureParamNames(request, ["dataset", "managed_mem", "pool_allocator"])
    csvFileName = request.param[0]
-    if len(request.param) > 1:
-        reinitRMM(request.param[1], request.param[2])
+    reinitRMM(request.param[1], request.param[2])

    G = createGraph(csvFileName)
    G.view_adj_list()

@@ -126,8 +151,7 @@ def anyGraphWithTransposedAdjListComputed(request):
    """
    setFixtureParamNames(request, ["dataset", "managed_mem", "pool_allocator"])
    csvFileName = request.param[0]
-    if len(request.param) > 1:
-        reinitRMM(request.param[1], request.param[2])
+    reinitRMM(request.param[1], request.param[2])

    G = createGraph(csvFileName)
    G.view_transposed_adj_list()

@@ -164,9 +188,7 @@ def bench_create_digraph(gpubenchmark, edgelistCreated):

@pytest.mark.ETL
def bench_renumber(gpubenchmark, edgelistCreated):
-    gpubenchmark(cugraph.renumber,
-                 edgelistCreated["0"],  # src
-                 edgelistCreated["1"])  # dst
+    gpubenchmark(NumberMap.renumber, edgelistCreated, "0", "1")


def bench_pagerank(gpubenchmark, anyGraphWithTransposedAdjListComputed):

@@ -233,3 +255,9 @@ def bench_graph_degrees(gpubenchmark, anyGraphWithAdjListComputed):
def bench_betweenness_centrality(gpubenchmark, anyGraphWithAdjListComputed):
    gpubenchmark(cugraph.betweenness_centrality,
                 anyGraphWithAdjListComputed, k=10, seed=123)
+
+
+def bench_edge_betweenness_centrality(gpubenchmark,
+                                      anyGraphWithAdjListComputed):
+    gpubenchmark(cugraph.edge_betweenness_centrality,
+                 anyGraphWithAdjListComputed, k=10, seed=123)
diff --git a/benchmarks/conftest.py b/benchmarks/conftest.py
index ea5be7212dc..8ab0c5a57b4 100644
--- a/benchmarks/conftest.py
+++ b/benchmarks/conftest.py
@@ -1,8 +1,4 @@
# pytest customizations specific to these benchmarks
-import sys
-from os import path
-import importlib
-

def pytest_addoption(parser):
    parser.addoption("--no-rmm-reinit", action="store_true", default=False,

@@ -11,21 +7,19 @@ def pytest_addoption(parser):


def pytest_sessionstart(session):
-    # if the --no-rmm-reinit option is given, import the benchmark's "params"
-    # module and change the FIXTURE_PARAMS accordingly.
+    # if the --no-rmm-reinit option is given, set (or add to) the CLI "mark
+    # expression" (-m) the markers for no managedmem and no poolallocator. This
+    # will cause the RMM reinit() function to not be called.
    if session.config.getoption("no_rmm_reinit"):
-        paramsPyFile = path.join(path.dirname(path.abspath(__file__)),
-                                 "params.py")
+        newMarkexpr = "managedmem_off and poolallocator_off"
+        currentMarkexpr = session.config.getoption("markexpr")

-        # A simple "import" statement will not find the modules here (unless if
-        # this package is on the import path) since pytest evaluates this from
-        # a different location.
-        spec = importlib.util.spec_from_file_location("params", paramsPyFile)
-        module = importlib.util.module_from_spec(spec)
-        spec.loader.exec_module(module)
+        if ("managedmem" in currentMarkexpr) or \
+           ("poolallocator" in currentMarkexpr):
+            raise RuntimeError("managedmem and poolallocator markers cannot "
+                               "be used with --no-rmm-reinit")

-        module.FIXTURE_PARAMS = module.NO_RMMREINIT_FIXTURE_PARAMS
+        if currentMarkexpr:
+            newMarkexpr = f"({currentMarkexpr}) and ({newMarkexpr})"

-        # If "benchmarks.params" is registered in sys.modules, all future
-        # imports of the module will simply refer to this one.
-        sys.modules["benchmarks.params"] = module
+        session.config.option.markexpr = newMarkexpr
diff --git a/benchmarks/params.py b/benchmarks/params.py
index cab0210ba23..2d1d3ea4acc 100644
--- a/benchmarks/params.py
+++ b/benchmarks/params.py
@@ -1,3 +1,15 @@
+# Copyright (c) 2020, NVIDIA CORPORATION.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
from itertools import product

import pytest

@@ -58,8 +70,10 @@ def genFixtureParamsProduct(*args):
# FIXME: write and use mechanism described here for specifying datasets:
#        https://docs.rapids.ai/maintainers/datasets
-# FIXME: rlr: soc-twitter-2010.csv crashes with OOM error on my HP-Z8!
+# FIXME: rlr: soc-twitter-2010.csv crashes with OOM error on my RTX-8000
UNDIRECTED_DATASETS = [
+    pytest.param("../datasets/karate.csv",
+                 marks=[pytest.mark.tiny, pytest.mark.undirected]),
    pytest.param("../datasets/csv/undirected/hollywood.csv",
                 marks=[pytest.mark.small, pytest.mark.undirected]),
    pytest.param("../datasets/csv/undirected/europe_osm.csv",

@@ -88,16 +102,7 @@ def genFixtureParamsProduct(*args):
                 marks=[pytest.mark.poolallocator_off]),
]

-ALL_FIXTURE_PARAMS = genFixtureParamsProduct(
-    (DIRECTED_DATASETS + UNDIRECTED_DATASETS, "ds"),
-    (MANAGED_MEMORY, "mm"),
-    (POOL_ALLOCATOR, "pa"))
-
-NO_RMMREINIT_FIXTURE_PARAMS = genFixtureParamsProduct(
-    (DIRECTED_DATASETS +
-     UNDIRECTED_DATASETS, "ds"))
-
-# conftest.py will switch this to NO_RMMREINIT_FIXTURE_PARAMS
-# if the --no-rmm-reinit option is passed.
-# See conftest.py for details
-FIXTURE_PARAMS = ALL_FIXTURE_PARAMS
+FIXTURE_PARAMS = genFixtureParamsProduct(
+    (DIRECTED_DATASETS + UNDIRECTED_DATASETS, "ds"),
+    (MANAGED_MEMORY, "mm"),
+    (POOL_ALLOCATOR, "pa"))
diff --git a/benchmarks/pytest.ini b/benchmarks/pytest.ini
index fb4e43965d6..06a67a06040 100644
--- a/benchmarks/pytest.ini
+++ b/benchmarks/pytest.ini
@@ -1,9 +1,9 @@
[pytest]
addopts =
-    -x
    --benchmark-warmup=on
    --benchmark-warmup-iterations=1
    --benchmark-min-rounds=3
+    --benchmark-columns="min, max, mean, stddev, outliers, gpu_mem, rounds"

markers =
    managedmem_on: RMM managed memory enabled

@@ -12,6 +12,7 @@ markers =
    poolallocator_off: RMM pool allocator disabled
    ETL: benchmarks for ETL steps
    small: small datasets
+    tiny: tiny datasets
    directed: directed datasets
    undirected: undirected datasets
diff --git a/build.sh b/build.sh
index 94c37cf20bb..e0557344384 100755
--- a/build.sh
+++ b/build.sh
@@ -34,7 +34,7 @@ HELP="$0 [ ...] [ ...]
 default action (no args) is to build and install 'libcugraph' then 'cugraph' targets
"

-LIBCUGRAPH_BUILD_DIR=${REPODIR}/cpp/build
+LIBCUGRAPH_BUILD_DIR=${LIBCUGRAPH_BUILD_DIR:=${REPODIR}/cpp/build}
CUGRAPH_BUILD_DIR=${REPODIR}/python/build
BUILD_DIRS="${LIBCUGRAPH_BUILD_DIR} ${CUGRAPH_BUILD_DIR}"

@@ -116,7 +116,7 @@ if (( ${NUMARGS} == 0 )) || hasArg cugraph; then
    cd ${REPODIR}/python
    if [[ ${INSTALL_TARGET} != "" ]]; then
-        python setup.py build_ext --inplace
+        python setup.py build_ext --inplace --library-dir=${LIBCUGRAPH_BUILD_DIR}
        python setup.py install
    else
        python setup.py build_ext --inplace --library-dir=${LIBCUGRAPH_BUILD_DIR}
diff --git a/ci/benchmark/build.sh b/ci/benchmark/build.sh
new file mode 100644
index 00000000000..49a6362a904
--- /dev/null
+++ b/ci/benchmark/build.sh
@@ -0,0 +1,169 @@
+#!/usr/bin/env bash
+# Copyright (c) 2018-2020, NVIDIA CORPORATION.
+##########################################
+# cuGraph Benchmark test script for CI   #
+##########################################
+
+set -e
+set -o pipefail
+NUMARGS=$#
+ARGS=$*
+
+function logger {
+  echo -e "\n>>>> $@\n"
+}
+
+function hasArg {
+    (( ${NUMARGS} != 0 )) && (echo " ${ARGS} " | grep -q " $1 ")
+}
+
+function cleanup {
+  logger "Removing datasets and temp files..."
+  rm -rf $WORKSPACE/datasets/test
+  rm -rf $WORKSPACE/datasets/benchmark
+  rm -f testoutput.txt
+}
+
+# Set cleanup trap for Jenkins
+if [ ! -z "$JENKINS_HOME" ] ; then
+  logger "Jenkins environment detected, setting cleanup trap..."
+  trap cleanup EXIT
+fi
+
+# Set path, build parallel level, and CUDA version
+cd $WORKSPACE
+export PATH=/conda/bin:/usr/local/cuda/bin:$PATH
+export PARALLEL_LEVEL=4
+export CUDA_REL=${CUDA_VERSION%.*}
+export HOME=$WORKSPACE
+export GIT_DESCRIBE_TAG=`git describe --tags`
+export MINOR_VERSION=`echo $GIT_DESCRIBE_TAG | grep -o -E '([0-9]+\.[0-9]+)'`
+
+# Set Benchmark Vars
+export DATASETS_DIR=${WORKSPACE}/datasets
+export BENCHMARKS_DIR=${WORKSPACE}/benchmarks
+
+##########################################
+# Environment Setup                      #
+##########################################
+
+# TODO: Delete build section when artifacts are available
+
+logger "Check environment..."
+env
+
+logger "Check GPU usage..."
+nvidia-smi
+
+logger "Activate conda env..."
+source activate rapids
+
+# Enter dependencies to be shown in ASV tooltips.
+CUGRAPH_DEPS=(cudf rmm)
+LIBCUGRAPH_DEPS=(cudf rmm)
+
+logger "conda install required packages"
+conda install -c nvidia -c rapidsai -c rapidsai-nightly -c conda-forge -c defaults \
+      "cudf=${MINOR_VERSION}" \
+      "rmm=${MINOR_VERSION}" \
+      "cudatoolkit=$CUDA_REL" \
+      "dask-cudf=${MINOR_VERSION}" \
+      "dask-cuda=${MINOR_VERSION}" \
+      "ucx-py=${MINOR_VERSION}" \
+      "rapids-build-env=${MINOR_VERSION}" \
+      rapids-pytest-benchmark
+
+# Install the master version of dask and distributed
+logger "pip install git+https://github.com/dask/distributed.git --upgrade --no-deps"
+pip install "git+https://github.com/dask/distributed.git" --upgrade --no-deps
+
+logger "pip install git+https://github.com/dask/dask.git --upgrade --no-deps"
+pip install "git+https://github.com/dask/dask.git" --upgrade --no-deps
+
+logger "Check versions..."
+python --version
+$CC --version
+$CXX --version
+conda list
+
+##########################################
+# Build cuGraph                          #
+##########################################
+
+logger "Build libcugraph..."
+$WORKSPACE/build.sh clean libcugraph cugraph
+
+##########################################
+# Run Benchmarks                         #
+##########################################
+
+logger "Downloading Datasets for Benchmarks..."
+cd $DATASETS_DIR
+bash ./get_test_data.sh --benchmark
+ERRORCODE=$((ERRORCODE | $?))
+# Exit if dataset download failed
+if (( ${ERRORCODE} != 0 )); then
+    exit ${ERRORCODE}
+fi
+
+# Concatenate dependency arrays, convert to JSON array,
+# and remove duplicates.
+X=("${CUGRAPH_DEPS[@]}" "${LIBCUGRAPH_DEPS[@]}")
+DEPS=$(printf '%s\n' "${X[@]}" | jq -R . | jq -s 'unique')
+
+# Build object with k/v pairs of "dependency:version"
+DEP_VER_DICT=$(jq -n '{}')
+for DEP in $(echo "${DEPS}" | jq -r '.[]'); do
+    VER=$(conda list | grep "^${DEP}" | awk '{print $2"-"$3}')
+    DEP_VER_DICT=$(echo "${DEP_VER_DICT}" | jq -c --arg DEP "${DEP}" --arg VER "${VER}" '. + { ($DEP): $VER }')
+done
+
+# Pass in an array of dependencies to get a dict of "dependency:version"
+function getReqs() {
+    local DEPS_ARR=("$@")
+    local REQS="{}"
+    for DEP in "${DEPS_ARR[@]}"; do
+        VER=$(echo "${DEP_VER_DICT}" | jq -r --arg DEP "${DEP}" '.[$DEP]')
+        REQS=$(echo "${REQS}" | jq -c --arg DEP "${DEP}" --arg VER "${VER}" '. + { ($DEP): $VER }')
+    done
+
+    echo "${REQS}"
+}
+
+REQS=$(getReqs "${CUGRAPH_DEPS[@]}")
+
+BENCHMARK_META=$(jq -n \
+  --arg NODE "${ASV_LABEL}" \
+  --arg BRANCH "branch-${MINOR_VERSION}" \
+  --argjson REQS "${REQS}" '
+  {
+    "machineName": $NODE,
+    "commitBranch": $BRANCH,
+    "requirements": $REQS
+  }
+')
+
+echo "Benchmark meta:"
+echo "${BENCHMARK_META}" | jq "."
+
+logger "Running Benchmarks..."
+cd $BENCHMARKS_DIR
+set +e
+time pytest -v -m "small and managedmem_on and poolallocator_on" \
+    --benchmark-gpu-device=0 \
+    --benchmark-gpu-max-rounds=3 \
+    --benchmark-asv-output-dir="${S3_ASV_DIR}" \
+    --benchmark-asv-metadata="${BENCHMARK_META}"
+
+EXITCODE=$?
+
+# The reqs below can be passed as requirements for
+# C++ benchmarks in the future.
+# REQS=$(getReqs "${LIBCUGRAPH_DEPS[@]}")
+
+set -e
+JOBEXITCODE=0
diff --git a/ci/checks/changelog.sh b/ci/checks/changelog.sh
index 6cd869d1171..73921f6bf19 100755
--- a/ci/checks/changelog.sh
+++ b/ci/checks/changelog.sh
@@ -1,20 +1,20 @@
#!/bin/bash
-# Copyright (c) 2018, NVIDIA CORPORATION.
+# Copyright (c) 2018-2020, NVIDIA CORPORATION.
############################
# cuGraph CHANGELOG Tester #
############################

-# Checkout master for comparison
-git checkout --quiet master
+# Checkout main for comparison
+git checkout --force --quiet main

# Switch back to tip of PR branch
-git checkout --quiet current-pr-branch
+git checkout --force --quiet current-pr-branch

# Ignore errors during searching
set +e

# Get list of modified files between matster and PR branch
-CHANGELOG=`git diff --name-only master...current-pr-branch | grep CHANGELOG.md`
+CHANGELOG=`git diff --name-only main...current-pr-branch | grep CHANGELOG.md`

# Check if CHANGELOG has PR ID
PRNUM=`cat CHANGELOG.md | grep "$PR_ID"`
RETVAL=0
diff --git a/ci/checks/copyright.py b/ci/checks/copyright.py
new file mode 100644
index 00000000000..cb7f6d1d360
--- /dev/null
+++ b/ci/checks/copyright.py
@@ -0,0 +1,189 @@
+# Copyright (c) 2019-2020, NVIDIA CORPORATION.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import datetime
+import re
+import argparse
+import io
+import os
+import git_helpers
+
+FilesToCheck = [
+    re.compile(r"[.](cmake|cpp|cu|cuh|h|hpp|sh|pxd|py|pyx)$"),
+    re.compile(r"CMakeLists[.]txt$"),
+    re.compile(r"CMakeLists_standalone[.]txt$"),
+    re.compile(r"setup[.]cfg$"),
+    re.compile(r"[.]flake8[.]cython$"),
+    re.compile(r"meta[.]yaml$")
+]
+
+# this will break starting at year 10000, which is probably OK :)
+CheckSimple = re.compile(r"Copyright \(c\) (\d{4}), NVIDIA CORPORATION")
+CheckDouble = re.compile(
+    r"Copyright \(c\) (\d{4})-(\d{4}), NVIDIA CORPORATION")
+
+
+def checkThisFile(f):
+    # This check covers things like symlinks which point to files that DNE
+    if not(os.path.exists(f)):
+        return False
+    if git_helpers and git_helpers.isFileEmpty(f):
+        return False
+    for checker in FilesToCheck:
+        if checker.search(f):
+            return True
+    return False
+
+
+def getCopyrightYears(line):
+    res = CheckSimple.search(line)
+    if res:
+        return (int(res.group(1)), int(res.group(1)))
+    res = CheckDouble.search(line)
+    if res:
+        return (int(res.group(1)), int(res.group(2)))
+    return (None, None)
+
+
+def replaceCurrentYear(line, start, end):
+    # first turn a simple regex into double (if applicable). then update years
+    res = CheckSimple.sub(r"Copyright (c) \1-\1, NVIDIA CORPORATION", line)
+    res = CheckDouble.sub(
+        r"Copyright (c) {:04d}-{:04d}, NVIDIA CORPORATION".format(start, end),
+        res)
+    return res
+
+
+def checkCopyright(f, update_current_year):
+    """
+    Checks for copyright headers and their years
+    """
+    errs = []
+    thisYear = datetime.datetime.now().year
+    lineNum = 0
+    crFound = False
+    yearMatched = False
+    with io.open(f, "r", encoding="utf-8") as fp:
+        lines = fp.readlines()
+    for line in lines:
+        lineNum += 1
+        start, end = getCopyrightYears(line)
+        if start is None:
+            continue
+        crFound = True
+        if start > end:
+            e = [f, lineNum, "First year after second year in the copyright "
+                 "header (manual fix required)", None]
+            errs.append(e)
+        if thisYear < start or thisYear > end:
+            e = [f, lineNum, "Current year not included in the "
+                 "copyright header", None]
+            if thisYear < start:
+                e[-1] = replaceCurrentYear(line, thisYear, end)
+            if thisYear > end:
+                e[-1] = replaceCurrentYear(line, start, thisYear)
+            errs.append(e)
+        else:
+            yearMatched = True
+    fp.close()
+    # copyright header itself not found
+    if not crFound:
+        e = [f, 0, "Copyright header missing or formatted incorrectly "
+             "(manual fix required)", None]
+        errs.append(e)
+    # even if the year matches a copyright header, make the check pass
+    if yearMatched:
+        errs = []
+
+    if update_current_year:
+        errs_update = [x for x in errs if x[-1] is not None]
+        if len(errs_update) > 0:
+            print("File: {}. Changing line(s) {}".format(
+                f, ', '.join(str(x[1]) for x in errs if x[-1] is not None)))
+            for _, lineNum, __, replacement in errs_update:
+                lines[lineNum - 1] = replacement
+            with io.open(f, "w", encoding="utf-8") as out_file:
+                for new_line in lines:
+                    out_file.write(new_line)
+        errs = [x for x in errs if x[-1] is None]
+
+    return errs
+
+
+def getAllFilesUnderDir(root, pathFilter=None):
+    retList = []
+    for (dirpath, dirnames, filenames) in os.walk(root):
+        for fn in filenames:
+            filePath = os.path.join(dirpath, fn)
+            if pathFilter(filePath):
+                retList.append(filePath)
+    return retList
+
+
+def checkCopyright_main():
+    """
+    Checks for copyright headers in all the modified files. In case of local
+    repo, this script will just look for uncommitted files and in case of CI
+    it compares between branches "$PR_TARGET_BRANCH" and "current-pr-branch"
+    """
+    retVal = 0
+
+    argparser = argparse.ArgumentParser(
+        description="Checks for a consistent copyright header")
+    argparser.add_argument("--update-current-year", dest='update_current_year',
+                           action="store_true", required=False, help="If set, "
+                           "update the current year if a header is already "
+                           "present and well formatted.")
+    argparser.add_argument("--git-modified-only", dest='git_modified_only',
+                           action="store_true", required=False, help="If set, "
+                           "only files seen as modified by git will be "
+                           "processed.")
+
+    (args, dirs) = argparser.parse_known_args()
+    if args.git_modified_only:
+        files = git_helpers.modifiedFiles(pathFilter=checkThisFile)
+    else:
+        files = []
+        for d in [os.path.abspath(d) for d in dirs]:
+            if not(os.path.isdir(d)):
+                raise ValueError(f"{d} is not a directory.")
+            files += getAllFilesUnderDir(d, pathFilter=checkThisFile)
+
+    errors = []
+    for f in files:
+        errors += checkCopyright(f, args.update_current_year)
+
+    if len(errors) > 0:
+        print("Copyright headers incomplete in some of the files!")
+        for e in errors:
+            print("  %s:%d Issue: %s" % (e[0], e[1], e[2]))
+        print("")
+        n_fixable = sum(1 for e in errors if e[-1] is not None)
+        path_parts = os.path.abspath(__file__).split(os.sep)
+        file_from_repo = os.sep.join(path_parts[path_parts.index("ci"):])
+        if n_fixable > 0:
+            print("You can run {} --update-current-year to fix {} of these "
+                  "errors.\n".format(file_from_repo, n_fixable))
+        retVal = 1
+    else:
+        print("Copyright check passed")
+
+    return retVal
+
+
+if __name__ == "__main__":
+    import sys
+    sys.exit(checkCopyright_main())
diff --git a/ci/checks/style.sh b/ci/checks/style.sh
index fa933e41410..696f566a96a 100755
--- a/ci/checks/style.sh
+++ b/ci/checks/style.sh
@@ -1,11 +1,17 @@
#!/bin/bash
-# Copyright (c) 2018, NVIDIA CORPORATION.
+# Copyright (c) 2018-2020, NVIDIA CORPORATION.
########################
# cuGraph Style Tester #
########################

-# Ignore errors and set path
-set +e
+# Assume this script is run from the root of the cugraph repo
+
+# Make failing commands visible when used in a pipeline and allow the script to
+# continue on errors, but use ERRORCODE to still allow any failing command to be
+# captured for returning a final status code. This allows all style checks to
+# take place to provide a more comprehensive list of style violations.
+set -o pipefail
+ERRORCODE=0
PATH=/conda/bin:$PATH

# Activate common conda env
source activate gdf

@@ -13,11 +19,12 @@ source activate gdf
# Run flake8 and get results/return code
FLAKE=`flake8 --config=python/.flake8 python`
-FLAKE_RETVAL=$?
+ERRORCODE=$((ERRORCODE | $?))

# Run clang-format and check for a consistent code format
CLANG_FORMAT=`python cpp/scripts/run-clang-format.py 2>&1`
CLANG_FORMAT_RETVAL=$?
+ERRORCODE=$((ERRORCODE | ${CLANG_FORMAT_RETVAL}))

# Output results if failure otherwise show pass
if [ "$FLAKE" != "" ]; then

@@ -36,8 +43,19 @@
else
    echo -e "\n\n>>>> PASSED: clang format check\n\n"
fi

-RETVALS=($FLAKE_RETVAL $CLANG_FORMAT_RETVAL)
-IFS=$'\n'
-RETVAL=`echo "${RETVALS[*]}" | sort -nr | head -n1`
+# Check for copyright headers in the files modified currently
+#COPYRIGHT=`env PYTHONPATH=ci/utils python ci/checks/copyright.py cpp python benchmarks ci 2>&1`
+COPYRIGHT=`env PYTHONPATH=ci/utils python ci/checks/copyright.py --git-modified-only 2>&1`
+CR_RETVAL=$?
+ERRORCODE=$((ERRORCODE | ${CR_RETVAL})) + +# Output results if failure otherwise show pass +if [ "$CR_RETVAL" != "0" ]; then + echo -e "\n\n>>>> FAILED: copyright check; begin output\n\n" + echo -e "$COPYRIGHT" + echo -e "\n\n>>>> FAILED: copyright check; end output\n\n" +else + echo -e "\n\n>>>> PASSED: copyright check\n\n" +fi -exit $RETVAL +exit ${ERRORCODE} diff --git a/ci/cpu/build.sh b/ci/cpu/build.sh index dfbbbffc73b..2cdb77bbbc2 100755 --- a/ci/cpu/build.sh +++ b/ci/cpu/build.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2018, NVIDIA CORPORATION. +# Copyright (c) 2018-2020, NVIDIA CORPORATION. ######################################### # cuGraph CPU conda build script for CI # ######################################### @@ -20,10 +20,6 @@ export HOME=$WORKSPACE # Switch to project root; also root of repo checkout cd $WORKSPACE -# Get latest tag and number of commits since tag -export GIT_DESCRIBE_TAG=`git describe --abbrev=0 --tags` -export GIT_DESCRIBE_NUMBER=`git rev-list ${GIT_DESCRIBE_TAG}..HEAD --count` - # If nightly build, append current YYMMDD to version if [[ "$BUILD_MODE" = "branch" && "$SOURCE_BRANCH" = branch-* ]] ; then export VERSION_SUFFIX=`date +%y%m%d` diff --git a/ci/cpu/cugraph/build_cugraph.sh b/ci/cpu/cugraph/build_cugraph.sh index 874488ff020..70f5baee230 100755 --- a/ci/cpu/cugraph/build_cugraph.sh +++ b/ci/cpu/cugraph/build_cugraph.sh @@ -1,9 +1,25 @@ #!/usr/bin/env bash +# Copyright (c) 2018-2020, NVIDIA CORPORATION. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + set -e if [ "$BUILD_CUGRAPH" == "1" ]; then echo "Building cugraph" CUDA_REL=${CUDA_VERSION%.*} - - conda build conda/recipes/cugraph --python=$PYTHON + if [[ -z "$PROJECT_FLASH" || "$PROJECT_FLASH" == "0" ]]; then + conda build conda/recipes/cugraph --python=$PYTHON + else + conda build conda/recipes/cugraph -c ci/artifacts/cugraph/cpu/conda-bld/ --dirty --no-remove-work-dir --python=$PYTHON + fi fi diff --git a/ci/cpu/cugraph/upload-anaconda.sh b/ci/cpu/cugraph/upload-anaconda.sh index e729972cf43..9601905d6c4 100755 --- a/ci/cpu/cugraph/upload-anaconda.sh +++ b/ci/cpu/cugraph/upload-anaconda.sh @@ -1,13 +1,22 @@ #!/bin/bash +# Copyright (c) 2018-2020, NVIDIA CORPORATION. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at # -# Adopted from https://github.com/tmcdonell/travis-scripts/blob/dfaac280ac2082cd6bcaba3217428347899f2975/update-accelerate-buildbot.sh +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
set -e -if [ "$UPLOAD_CUGRAPH" == "1" ]; then +if [[ "$BUILD_CUGRAPH" == "1" && "$UPLOAD_CUGRAPH" == "1" ]]; then export UPLOADFILE=`conda build conda/recipes/cugraph -c rapidsai -c nvidia -c numba -c conda-forge -c defaults --python=$PYTHON --output` - SOURCE_BRANCH=master # Have to label all CUDA versions due to the compatibility to work with any CUDA if [ "$LABEL_MAIN" == "1" ]; then @@ -22,8 +31,7 @@ if [ "$UPLOAD_CUGRAPH" == "1" ]; then test -e ${UPLOADFILE} - # Restrict uploads to master branch - if [ ${GIT_BRANCH} != ${SOURCE_BRANCH} ]; then + if [ ${BUILD_MODE} != "branch" ]; then echo "Skipping upload" return 0 fi diff --git a/ci/cpu/libcugraph/build_libcugraph.sh b/ci/cpu/libcugraph/build_libcugraph.sh index b728c130d0e..e5ff77d7db9 100755 --- a/ci/cpu/libcugraph/build_libcugraph.sh +++ b/ci/cpu/libcugraph/build_libcugraph.sh @@ -1,9 +1,25 @@ #!/usr/bin/env bash +# Copyright (c) 2018-2020, NVIDIA CORPORATION. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + set -e if [ "$BUILD_LIBCUGRAPH" == '1' ]; then echo "Building libcugraph" CUDA_REL=${CUDA_VERSION%.*} - - conda build conda/recipes/libcugraph + if [[ -z "$PROJECT_FLASH" || "$PROJECT_FLASH" == "0" ]]; then + conda build conda/recipes/libcugraph + else + conda build --dirty --no-remove-work-dir conda/recipes/libcugraph + fi fi diff --git a/ci/cpu/libcugraph/upload-anaconda.sh b/ci/cpu/libcugraph/upload-anaconda.sh index 11316dc5b1f..8cd71070778 100755 --- a/ci/cpu/libcugraph/upload-anaconda.sh +++ b/ci/cpu/libcugraph/upload-anaconda.sh @@ -1,23 +1,31 @@ #!/bin/bash +# Copyright (c) 2018-2020, NVIDIA CORPORATION. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at # -# Adopted from https://github.com/tmcdonell/travis-scripts/blob/dfaac280ac2082cd6bcaba3217428347899f2975/update-accelerate-buildbot.sh +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. set -e -if [ "$UPLOAD_LIBCUGRAPH" == "1" ]; then +if [[ "$BUILD_LIBCUGRAPH" == "1" && "$UPLOAD_LIBCUGRAPH" == "1" ]]; then CUDA_REL=${CUDA_VERSION%.*} export UPLOADFILE=`conda build conda/recipes/libcugraph --output` - SOURCE_BRANCH=master LABEL_OPTION="--label main" echo "LABEL_OPTION=${LABEL_OPTION}" test -e ${UPLOADFILE} - # Restrict uploads to master branch - if [ ${GIT_BRANCH} != ${SOURCE_BRANCH} ]; then + if [ ${BUILD_MODE} != "branch" ]; then echo "Skipping upload" return 0 fi diff --git a/ci/cpu/prebuild.sh b/ci/cpu/prebuild.sh index 2abc137662c..ee471329b35 100644 --- a/ci/cpu/prebuild.sh +++ b/ci/cpu/prebuild.sh @@ -1,15 +1,30 @@ #!/usr/bin/env bash +# Copyright (c) 2018-2020, NVIDIA CORPORATION. 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 
-export BUILD_CUGRAPH=1
-export BUILD_LIBCUGRAPH=1
+if [[ -z "$PROJECT_FLASH" || "$PROJECT_FLASH" == "0" ]]; then
+  # If Project Flash is not activated, always build both
+  export BUILD_CUGRAPH=1
+  export BUILD_LIBCUGRAPH=1
+fi
 
-if [[ "$CUDA" == "10.0" ]]; then
+if [[ "$CUDA" == "10.1" ]]; then
   export UPLOAD_CUGRAPH=1
 else
   export UPLOAD_CUGRAPH=0
 fi
 
-if [[ "$PYTHON" == "3.6" ]]; then
+if [[ "$PYTHON" == "3.7" ]]; then
   export UPLOAD_LIBCUGRAPH=1
 else
   export UPLOAD_LIBCUGRAPH=0
diff --git a/ci/docs/build.sh b/ci/docs/build.sh
index 1bf8b6b569a..71ad79419a0 100644
--- a/ci/docs/build.sh
+++ b/ci/docs/build.sh
@@ -61,15 +61,3 @@ done
 mv $PROJECT_WORKSPACE/cpp/doxygen/html/* $DOCS_WORKSPACE/api/libcugraph/$BRANCH_VERSION
 mv $PROJECT_WORKSPACE/docs/build/html/* $DOCS_WORKSPACE/api/cugraph/$BRANCH_VERSION
 
-# Customize HTML documentation
-./update_symlinks.sh $NIGHTLY_VERSION
-./customization/lib_map.sh
-
-
-for PROJECT in ${PROJECTS[@]}; do
-  echo ""
-  echo "Customizing: $PROJECT"
-  ./customization/customize_docs_in_folder.sh api/$PROJECT/ $NIGHTLY_VERSION
-  git add $DOCS_WORKSPACE/api/$PROJECT/*
-done
-
diff --git a/ci/getGTestTimes.sh b/ci/getGTestTimes.sh
index b2c3c7718e0..8a3752d76e2 100755
--- a/ci/getGTestTimes.sh
+++ b/ci/getGTestTimes.sh
@@ -1,4 +1,16 @@
 #!/bin/bash
+# Copyright (c) 2019-2020, NVIDIA CORPORATION.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 
 # This script will print the gtest results sorted by runtime. This will print
 # the results two ways: first by printing all tests sorted by runtime, then by
diff --git a/ci/gpu/build.sh b/ci/gpu/build.sh
index 78c020375d9..3cef2e56877 100755
--- a/ci/gpu/build.sh
+++ b/ci/gpu/build.sh
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# Copyright (c) 2018, NVIDIA CORPORATION.
+# Copyright (c) 2018-2020, NVIDIA CORPORATION.
########################################## # cuGraph GPU build & testscript for CI # ########################################## @@ -57,21 +57,19 @@ source activate gdf logger "conda install required packages" conda install -c nvidia -c rapidsai -c rapidsai-nightly -c conda-forge -c defaults \ - cudf=${MINOR_VERSION} \ - rmm=${MINOR_VERSION} \ - networkx>=2.3 \ - python-louvain \ - cudatoolkit=$CUDA_REL \ - dask>=2.12.0 \ - distributed>=2.12.0 \ - dask-cudf=${MINOR_VERSION} \ - dask-cuda=${MINOR_VERSION} \ - scikit-learn=0.23.0 \ - nccl>=2.5 \ - ucx-py=${MINOR_VERSION} \ - libcypher-parser \ - ipython=7.3* \ - jupyterlab + "cudf=${MINOR_VERSION}" \ + "rmm=${MINOR_VERSION}" \ + "cudatoolkit=$CUDA_REL" \ + "dask-cudf=${MINOR_VERSION}" \ + "dask-cuda=${MINOR_VERSION}" \ + "ucx-py=${MINOR_VERSION}" \ + "rapids-build-env=$MINOR_VERSION.*" \ + "rapids-notebook-env=$MINOR_VERSION.*" \ + rapids-pytest-benchmark + +# https://docs.rapids.ai/maintainers/depmgmt/ +# conda remove --force rapids-build-env rapids-notebook-env +# conda install "your-pkg=1.0.0" # Install the master version of dask and distributed logger "pip install git+https://github.com/dask/distributed.git --upgrade --no-deps" @@ -91,8 +89,10 @@ conda list # BUILD - Build libcugraph and cuGraph from source ################################################################################ -logger "Build libcugraph..." -$WORKSPACE/build.sh clean libcugraph cugraph +if [[ -z "$PROJECT_FLASH" || "$PROJECT_FLASH" == "0" ]]; then + logger "Build libcugraph..." + $WORKSPACE/build.sh clean libcugraph cugraph +fi ################################################################################ # TEST - Run GoogleTest and py.tests for libcugraph and cuGraph diff --git a/ci/gpu/test-notebooks.sh b/ci/gpu/test-notebooks.sh index 491458df5ce..247eb328d2e 100755 --- a/ci/gpu/test-notebooks.sh +++ b/ci/gpu/test-notebooks.sh @@ -1,4 +1,16 @@ #!/bin/bash +# Copyright (c) 2019-2020, NVIDIA CORPORATION. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. #RAPIDS_DIR=/rapids NOTEBOOKS_DIR=${WORKSPACE}/notebooks @@ -11,7 +23,7 @@ TOPLEVEL_NB_FOLDERS=$(find . -name *.ipynb |cut -d'/' -f2|sort -u) # Add notebooks that should be skipped here # (space-separated list of filenames without paths) -SKIPNBS="uvm.ipynb" +SKIPNBS="uvm.ipynb bfs_benchmark.ipynb louvain_benchmark.ipynb pagerank_benchmark.ipynb sssp_benchmark.ipynb release.ipynb" ## Check env env diff --git a/ci/local/README.md b/ci/local/README.md index c20a073e833..28bbe3590ea 100644 --- a/ci/local/README.md +++ b/ci/local/README.md @@ -25,7 +25,7 @@ where: Example Usage: `bash build.sh -r ~/rapids/cugraph -i gpuci/rapidsai-base:cuda10.1-ubuntu16.04-gcc5-py3.6` -For a full list of available gpuCI docker images, visit our [DockerHub](https://hub.docker.com/r/gpuci/rapidsai-base/tags) page. +For a full list of available gpuCI docker images, visit our [DockerHub](https://hub.docker.com/r/gpuci/rapidsai/tags) page. 
Style Check: ```bash @@ -51,6 +51,7 @@ The docker image will generate build artifacts in a folder on your machine locat The script will build your repository and run all tests. If any tests fail, it dumps the user into the docker container itself to allow you to debug from within the container. If all the tests pass as expected the container exits and is automatically removed. Remember to exit the container if tests fail and you do not wish to debug within the container itself. +If you would like to rerun the tests after changing some code in the container, run `bash ci/gpu/build.sh`. ### Container File Structure diff --git a/ci/local/build.sh b/ci/local/build.sh index c6f7f1a51e2..51b9380a311 100755 --- a/ci/local/build.sh +++ b/ci/local/build.sh @@ -1,6 +1,21 @@ #!/bin/bash - -DOCKER_IMAGE="gpuci/rapidsai-base:cuda10.0-ubuntu16.04-gcc5-py3.6" +# Copyright (c) 2018-2020, NVIDIA CORPORATION. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +GIT_DESCRIBE_TAG=`git describe --tags` +MINOR_VERSION=`echo $GIT_DESCRIBE_TAG | grep -o -E '([0-9]+\.[0-9]+)'` + +DOCKER_IMAGE="gpuci/rapidsai:${MINOR_VERSION}-cuda10.1-devel-ubuntu16.04-py3.7" REPO_PATH=${PWD} RAPIDS_DIR_IN_CONTAINER="/rapids" CPP_BUILD_DIR="cpp/build" @@ -139,4 +154,4 @@ docker run --rm -it ${GPU_OPTS} \ -v "$PASSWD_FILE":/etc/passwd:ro \ -v "$GROUP_FILE":/etc/group:ro \ --cap-add=SYS_PTRACE \ - "${DOCKER_IMAGE}" bash -c "${COMMAND}" \ No newline at end of file + "${DOCKER_IMAGE}" bash -c "${COMMAND}" diff --git a/ci/release/update-version.sh b/ci/release/update-version.sh index b9faa5cbf1f..d853c3693c6 100755 --- a/ci/release/update-version.sh +++ b/ci/release/update-version.sh @@ -1,7 +1,16 @@ #!/bin/bash -######################## -# RMM Version Updater # -######################## +# Copyright (c) 2018-2020, NVIDIA CORPORATION. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
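The version updater below derives the current version components from the newest `vMAJOR.MINOR.PATCH` git tag and computes the next tags before rewriting files. The same derivation in Python, as a sketch (the tag literal stands in for the `git tag | sort --version-sort | tail -n 1` lookup used by the script):

```python
import re

# Sketch of the tag parsing done by ci/release/update-version.sh; assumes the
# most recent tag looks like "v0.15.0" (illustrative value).
tag = "v0.15.0"
major, minor, patch = (int(x) for x in
                       re.match(r"v(\d+)\.(\d+)\.(\d+)", tag).groups())

current_short_tag = f"{major}.{minor}"    # e.g. "0.15", used in sed patterns
next_short_tag = f"{major}.{minor + 1}"   # e.g. "0.16"
next_full_tag = f"{major}.{minor + 1}.0"  # e.g. "0.16.0"
print(current_short_tag, next_short_tag, next_full_tag)
```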
## Usage # bash update-version.sh @@ -17,6 +26,7 @@ CURRENT_TAG=`git tag | grep -xE 'v[0-9\.]+' | sort --version-sort | tail -n 1 | CURRENT_MAJOR=`echo $CURRENT_TAG | awk '{split($0, a, "."); print a[1]}'` CURRENT_MINOR=`echo $CURRENT_TAG | awk '{split($0, a, "."); print a[2]}'` CURRENT_PATCH=`echo $CURRENT_TAG | awk '{split($0, a, "."); print a[3]}'` +CURRENT_SHORT_TAG=${CURRENT_MAJOR}.${CURRENT_MINOR} NEXT_MAJOR=$((CURRENT_MAJOR + 1)) NEXT_MINOR=$((CURRENT_MINOR + 1)) NEXT_PATCH=$((CURRENT_PATCH + 1)) @@ -51,3 +61,11 @@ sed_runner 's/'"CUGRAPH VERSION .* LANGUAGES C CXX CUDA)"'/'"CUGRAPH VERSION ${N # RTD update sed_runner 's/version = .*/version = '"'${NEXT_SHORT_TAG}'"'/g' docs/source/conf.py sed_runner 's/release = .*/release = '"'${NEXT_FULL_TAG}'"'/g' docs/source/conf.py + +for FILE in conda/environments/*.yml; do + sed_runner "s/cudf=${CURRENT_SHORT_TAG}/cudf=${NEXT_SHORT_TAG}/g" ${FILE}; + sed_runner "s/rmm=${CURRENT_SHORT_TAG}/rmm=${NEXT_SHORT_TAG}/g" ${FILE}; + sed_runner "s/dask-cuda=${CURRENT_SHORT_TAG}/dask-cuda=${NEXT_SHORT_TAG}/g" ${FILE}; + sed_runner "s/dask-cudf=${CURRENT_SHORT_TAG}/dask-cudf=${NEXT_SHORT_TAG}/g" ${FILE}; + sed_runner "s/ucx-py=${CURRENT_SHORT_TAG}/ucx-py=${NEXT_SHORT_TAG}/g" ${FILE}; +done diff --git a/ci/test.sh b/ci/test.sh index 37ec2fcc956..fde9bbb3d8d 100755 --- a/ci/test.sh +++ b/ci/test.sh @@ -1,4 +1,16 @@ #!/bin/bash +# Copyright (c) 2019-2020, NVIDIA CORPORATION. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. # note: do not use set -e in order to allow all gtest invocations to take place, # and instead keep track of exit status and exit with an overall exit status @@ -45,7 +57,12 @@ else fi fi -cd ${CUGRAPH_ROOT}/cpp/build +if [[ -z "$PROJECT_FLASH" || "$PROJECT_FLASH" == "0" ]]; then + cd ${CUGRAPH_ROOT}/cpp/build +else + export LD_LIBRARY_PATH="$WORKSPACE/ci/artifacts/cugraph/cpu/conda_work/cpp/build:$LD_LIBRARY_PATH" + cd $WORKSPACE/ci/artifacts/cugraph/cpu/conda_work/cpp/build +fi for gt in gtests/*; do test_name=$(basename $gt) @@ -54,9 +71,22 @@ for gt in gtests/*; do ERRORCODE=$((ERRORCODE | $?)) done -echo "Python py.test for cuGraph..." +if [[ "$PROJECT_FLASH" == "1" ]]; then + echo "Installing libcugraph..." + conda install -c $WORKSPACE/ci/artifacts/cugraph/cpu/conda-bld/ libcugraph + export LIBCUGRAPH_BUILD_DIR="$WORKSPACE/ci/artifacts/cugraph/cpu/conda_work/cpp/build" + echo "Build cugraph..." + $WORKSPACE/build.sh cugraph +fi + +echo "Python pytest for cuGraph..." cd ${CUGRAPH_ROOT}/python -py.test --cache-clear --junitxml=${CUGRAPH_ROOT}/junit-cugraph.xml -v --cov-config=.coveragerc --cov=cugraph --cov-report=xml:${WORKSPACE}/python/cugraph/cugraph-coverage.xml --cov-report term +pytest --cache-clear --junitxml=${CUGRAPH_ROOT}/junit-cugraph.xml -v --cov-config=.coveragerc --cov=cugraph --cov-report=xml:${WORKSPACE}/python/cugraph/cugraph-coverage.xml --cov-report term --ignore=cugraph/raft +ERRORCODE=$((ERRORCODE | $?)) + +echo "Python benchmarks for cuGraph (running as tests)..." 
+cd ${CUGRAPH_ROOT}/benchmarks
+pytest -v -m "managedmem_on and poolallocator_on and tiny" --benchmark-disable
 ERRORCODE=$((ERRORCODE | $?))
 
 exit ${ERRORCODE}
diff --git a/ci/utils/git_helpers.py b/ci/utils/git_helpers.py
new file mode 100644
index 00000000000..83ad73fe283
--- /dev/null
+++ b/ci/utils/git_helpers.py
@@ -0,0 +1,137 @@
+# Copyright (c) 2019-2020, NVIDIA CORPORATION.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import subprocess
+import os
+import re
+
+
+def isFileEmpty(f):
+    return os.stat(f).st_size == 0
+
+
+def __git(*opts):
+    """Runs a git command and returns its output"""
+    cmd = "git " + " ".join(list(opts))
+    ret = subprocess.check_output(cmd, shell=True)
+    return ret.decode("UTF-8")
+
+
+def __gitdiff(*opts):
+    """Runs a git diff command with no pager set"""
+    return __git("--no-pager", "diff", *opts)
+
+
+def branch():
+    """Returns the name of the current branch"""
+    name = __git("rev-parse", "--abbrev-ref", "HEAD")
+    name = name.rstrip()
+    return name
+
+
+def uncommittedFiles():
+    """
+    Returns a list of all changed files that are not yet committed. This
+    includes both staged and unstaged modifications; untracked files are
+    ignored.
+    """
+    files = __git("status", "-u", "-s")
+    ret = []
+    for f in files.splitlines():
+        f = f.strip(" ")
+        f = re.sub(r"\s+", " ", f)
+        tmp = f.split(" ", 1)
+        # only consider staged files or uncommitted files
+        # in other words, ignore untracked files
+        if tmp[0] == "M" or tmp[0] == "A":
+            ret.append(tmp[1])
+    return ret
+
+
+def changedFilesBetween(b1, b2):
+    """Returns a list of files changed between branches b1 and b2"""
+    current = branch()
+    __git("checkout", "--quiet", b1)
+    __git("checkout", "--quiet", b2)
+    files = __gitdiff("--name-only", "--ignore-submodules", "%s...%s" %
+                      (b1, b2))
+    __git("checkout", "--quiet", current)
+    return files.splitlines()
+
+
+def changesInFileBetween(file, b1, b2, pathFilter=None):
+    """Filters the changed lines to a file between the branches b1 and b2"""
+    current = branch()
+    __git("checkout", "--quiet", b1)
+    __git("checkout", "--quiet", b2)
+    diffs = __gitdiff("--ignore-submodules", "-w", "--minimal", "-U0",
+                      "%s...%s" % (b1, b2), "--", file)
+    __git("checkout", "--quiet", current)
+    lines = []
+    for line in diffs.splitlines():
+        if pathFilter is None or pathFilter(line):
+            lines.append(line)
+    return lines
+
+
+def modifiedFiles(pathFilter=None):
+    """
+    If inside a CI env (i.e. currentBranch=current-pr-branch and the env-var
+    PR_TARGET_BRANCH is defined), then lists out all files modified between
+    these 2 branches. Else, lists out all the uncommitted files in the current
+    branch.
+
+    Such a utility function is helpful when running checker scripts as part of
+    cmake as well as the CI process. This way, during development, only the
+    files touched (but not yet committed) by devs are checked, while during
+    the CI process ALL files modified by the dev, as submitted in the PR, are
+    checked, all using the same script.
+ """ + if "PR_TARGET_BRANCH" in os.environ and branch() == "current-pr-branch": + allFiles = changedFilesBetween(os.environ["PR_TARGET_BRANCH"], + branch()) + else: + allFiles = uncommittedFiles() + files = [] + for f in allFiles: + if pathFilter is None or pathFilter(f): + files.append(f) + return files + + +def listAllFilesInDir(folder): + """Utility function to list all files/subdirs in the input folder""" + allFiles = [] + for root, dirs, files in os.walk(folder): + for name in files: + allFiles.append(os.path.join(root, name)) + return allFiles + + +def listFilesToCheck(filesDirs, pathFilter=None): + """ + Utility function to filter the input list of files/dirs based on the input + pathFilter method and returns all the files that need to be checked + """ + allFiles = [] + for f in filesDirs: + if os.path.isfile(f): + if pathFilter is None or pathFilter(f): + allFiles.append(f) + elif os.path.isdir(f): + files = listAllFilesInDir(f) + for f_ in files: + if pathFilter is None or pathFilter(f_): + allFiles.append(f_) + return allFiles diff --git a/ci/utils/nbtest.sh b/ci/utils/nbtest.sh index f7b9774c6fd..8c86baeaa09 100755 --- a/ci/utils/nbtest.sh +++ b/ci/utils/nbtest.sh @@ -1,4 +1,16 @@ #!/bin/bash +# Copyright (c) 2019-2020, NVIDIA CORPORATION. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. MAGIC_OVERRIDE_CODE=" def my_run_line_magic(*args, **kwargs): diff --git a/ci/utils/nbtestlog2junitxml.py b/ci/utils/nbtestlog2junitxml.py index 15b362e4b70..e9712253b0e 100644 --- a/ci/utils/nbtestlog2junitxml.py +++ b/ci/utils/nbtestlog2junitxml.py @@ -1,3 +1,16 @@ +# Copyright (c) 2019-2020, NVIDIA CORPORATION. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ # Generate a junit-xml file from parsing a nbtest log import re diff --git a/codecov.yml b/codecov.yml new file mode 100644 index 00000000000..c0a3a2fba2b --- /dev/null +++ b/codecov.yml @@ -0,0 +1,5 @@ +#Configuration File for CodeCov +coverage: + status: + project: off + patch: off diff --git a/conda/environments/cugraph_dev_cuda10.1.yml b/conda/environments/cugraph_dev_cuda10.1.yml index 40e4da01244..eb987f326c8 100644 --- a/conda/environments/cugraph_dev_cuda10.1.yml +++ b/conda/environments/cugraph_dev_cuda10.1.yml @@ -5,21 +5,22 @@ channels: - rapidsai-nightly - conda-forge dependencies: -- cudf=0.14.* -- nvstrings=0.14.* -- rmm=0.14.* +- cudf=0.15.* +- rmm=0.15.* - dask>=2.12.0 - distributed>=2.12.0 -- dask-cuda=0.14* -- dask-cudf=0.14* +- dask-cuda=0.15* +- dask-cudf=0.15* - nccl>=2.5 -- ucx-py=0.14* +- ucx-py=0.15* - scipy - networkx - python-louvain - cudatoolkit=10.1 +- clang=8.0.1 +- clang-tools=8.0.1 - cmake>=3.12 -- python>=3.6,<3.8 +- python>=3.6,<3.9 - notebook>=0.5.0 - boost - cython>=0.29,<0.30 @@ -35,3 +36,4 @@ dependencies: - recommonmark - pip - libcypher-parser +- rapids-pytest-benchmark diff --git a/conda/environments/cugraph_dev_cuda10.2.yml b/conda/environments/cugraph_dev_cuda10.2.yml index 6625d6c711c..028e0fce1a4 100644 --- a/conda/environments/cugraph_dev_cuda10.2.yml +++ b/conda/environments/cugraph_dev_cuda10.2.yml @@ -5,21 +5,22 @@ channels: - rapidsai-nightly - conda-forge dependencies: -- cudf=0.14.* -- nvstrings=0.14.* -- rmm=0.14.* +- cudf=0.15.* +- rmm=0.15.* - dask>=2.12.0 - distributed>=2.12.0 -- dask-cuda=0.14* -- dask-cudf=0.14* +- dask-cuda=0.15* +- dask-cudf=0.15* - nccl>=2.5 -- ucx-py=0.14* +- ucx-py=0.15* - scipy - networkx - python-louvain - cudatoolkit=10.2 +- clang=8.0.1 +- clang-tools=8.0.1 - cmake>=3.12 -- python>=3.6,<3.8 +- python>=3.6,<3.9 - notebook>=0.5.0 - boost - cython>=0.29,<0.30 @@ -35,3 +36,4 @@ dependencies: - recommonmark - pip - libcypher-parser +- rapids-pytest-benchmark diff --git a/conda/environments/cugraph_dev_cuda10.0.yml b/conda/environments/cugraph_dev_cuda11.0.yml similarity index 70% rename from conda/environments/cugraph_dev_cuda10.0.yml rename to conda/environments/cugraph_dev_cuda11.0.yml index 83e98d90437..bc3b84badf2 100644 --- a/conda/environments/cugraph_dev_cuda10.0.yml +++ b/conda/environments/cugraph_dev_cuda11.0.yml @@ -5,21 +5,22 @@ channels: - rapidsai-nightly - conda-forge dependencies: -- cudf=0.14.* -- nvstrings=0.14.* -- rmm=0.14.* +- cudf=0.15.* +- rmm=0.15.* - dask>=2.12.0 - distributed>=2.12.0 -- dask-cuda=0.14* -- dask-cudf=0.14* +- dask-cuda=0.15* +- dask-cudf=0.15* - nccl>=2.5 -- ucx-py=0.14* +- ucx-py=0.15* - scipy - networkx - python-louvain -- cudatoolkit=10.0 +- cudatoolkit=11.0 +- clang=8.0.1 +- clang-tools=8.0.1 - cmake>=3.12 -- python>=3.6,<3.8 +- python>=3.6,<3.9 - notebook>=0.5.0 - boost - cython>=0.29,<0.30 @@ -35,3 +36,4 @@ dependencies: - recommonmark - pip - libcypher-parser +- rapids-pytest-benchmark diff --git a/conda/recipes/cugraph/meta.yaml b/conda/recipes/cugraph/meta.yaml index 4be2ef4014d..1a32fd2a4b1 100644 --- a/conda/recipes/cugraph/meta.yaml +++ b/conda/recipes/cugraph/meta.yaml @@ -4,7 +4,6 @@ # conda build -c nvidia -c rapidsai -c conda-forge -c defaults . {% set version = environ.get('GIT_DESCRIBE_TAG', '0.0.0.dev').lstrip('v') + environ.get('VERSION_SUFFIX', '') %} {% set minor_version = version.split('.')[0] + '.' 
+ version.split('.')[1] %} -{% set git_revision_count=environ.get('GIT_DESCRIBE_NUMBER', 0) %} {% set py_version=environ.get('CONDA_PY', 36) %} package: name: cugraph @@ -14,8 +13,8 @@ source: path: ../../.. build: - number: {{ git_revision_count }} - string: py{{ py_version }}_{{ git_revision_count }} + number: {{ GIT_DESCRIBE_NUMBER }} + string: py{{ py_version }}_{{ GIT_DESCRIBE_HASH }}_{{ GIT_DESCRIBE_NUMBER }} script_env: - CC - CXX diff --git a/conda/recipes/libcugraph/meta.yaml b/conda/recipes/libcugraph/meta.yaml index 2d0f81dd27a..22731102110 100644 --- a/conda/recipes/libcugraph/meta.yaml +++ b/conda/recipes/libcugraph/meta.yaml @@ -4,18 +4,17 @@ # conda build -c nvidia -c rapidsai -c conda-forge -c defaults . {% set version = environ.get('GIT_DESCRIBE_TAG', '0.0.0.dev').lstrip('v') + environ.get('VERSION_SUFFIX', '') %} {% set minor_version = version.split('.')[0] + '.' + version.split('.')[1] %} -{% set git_revision_count=environ.get('GIT_DESCRIBE_NUMBER', 0) %} {% set cuda_version='.'.join(environ.get('CUDA', '9.2').split('.')[:2]) %} package: name: libcugraph version: {{ version }} source: - path: ../../.. + git_url: ../../.. build: - number: {{ git_revision_count }} - string: cuda{{ cuda_version }}_{{ git_revision_count }} + number: {{ GIT_DESCRIBE_NUMBER }} + string: cuda{{ cuda_version }}_{{ GIT_DESCRIBE_HASH }}_{{ GIT_DESCRIBE_NUMBER }} script_env: - CC - CXX diff --git a/conda_build.sh b/conda_build.sh index 14e3fae1e1f..4643e302f5c 100755 --- a/conda_build.sh +++ b/conda_build.sh @@ -8,7 +8,7 @@ conda build -c nvidia -c rapidsai -c rapidsai-nightly/label/cuda${CUDA_REL} -c c if [ "$UPLOAD_PACKAGE" == '1' ]; then export UPLOADFILE=`conda build -c nvidia -c rapidsai -c conda-forge -c defaults --python=${PYTHON} conda/recipes/cugraph --output` - SOURCE_BRANCH=master + SOURCE_BRANCH=main test -e ${UPLOADFILE} diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index d948b27a939..70d7edf99a3 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -1,6 +1,5 @@ #============================================================================= -# Copyright 2018 BlazingDB, Inc. -# Copyright 2018 Percy Camilo Triveño Aucahuasi +# Copyright (c) 2018-2020, NVIDIA CORPORATION. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
@@ -17,7 +16,7 @@ cmake_minimum_required(VERSION 3.12 FATAL_ERROR) -project(CUGRAPH VERSION 0.14.0 LANGUAGES C CXX CUDA) +project(CUGRAPH VERSION 0.15.0 LANGUAGES C CXX CUDA) ################################################################################################### # - build type ------------------------------------------------------------------------------------ @@ -104,13 +103,6 @@ set(CMAKE_EXE_LINKER_FLAGS "-Wl,--disable-new-dtags") option(BUILD_TESTS "Configure CMake to build tests" ON) -option(BUILD_MPI "Build with MPI" OFF) -if (BUILD_MPI) - find_package(MPI REQUIRED) - set (CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${MPI_C_COMPILE_FLAGS}") - set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${MPI_CXX_COMPILE_FLAGS}") - set (CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} ${MPI_CXX_LINK_FLAGS}") -endif(BUILD_MPI) ################################################################################################### # - cmake modules --------------------------------------------------------------------------------- @@ -194,24 +186,52 @@ if (RMM_INCLUDE AND RMM_LIBRARY) endif (RMM_INCLUDE AND RMM_LIBRARY) ################################################################################################### -# - External Projects ----------------------------------------------------------------------------- - -# https://cmake.org/cmake/help/v3.0/module/ExternalProject.html -include(ExternalProject) +# - Fetch Content ----------------------------------------------------------------------------- +include(FetchContent) # - CUB -set(CUB_DIR ${CMAKE_CURRENT_BINARY_DIR}/cub CACHE STRING "Path to cub repo") -set(CUB_INCLUDE_DIR ${CUB_DIR}/src/cub CACHE STRING "Path to cub includes") +message("Fetching CUB") -ExternalProject_Add(cub - GIT_REPOSITORY https://github.com/NVlabs/cub.git - GIT_TAG v1.8.0 - PREFIX ${CUB_DIR} - CONFIGURE_COMMAND "" - BUILD_COMMAND "" - INSTALL_COMMAND "" +FetchContent_Declare( + cub + GIT_REPOSITORY https://github.com/thrust/cub.git + GIT_TAG 1.9.10 + GIT_SHALLOW true ) +FetchContent_GetProperties(cub) +if(NOT cub_POPULATED) + FetchContent_Populate(cub) + # We are not using the cub CMake targets, so no need to call `add_subdirectory()`. +endif() +set(CUB_INCLUDE_DIR "${cub_SOURCE_DIR}") + +# - THRUST +message("Fetching Thrust") + +FetchContent_Declare( + thrust + GIT_REPOSITORY https://github.com/thrust/thrust.git + GIT_TAG 1.9.10 + GIT_SHALLOW true +) + +FetchContent_GetProperties(thrust) +if(NOT thrust_POPULATED) + FetchContent_Populate(thrust) + # We are not using the thrust CMake targets, so no need to call `add_subdirectory()`. 
+endif() +set(THRUST_INCLUDE_DIR "${thrust_SOURCE_DIR}") + + + + +################################################################################################### +# - External Projects ----------------------------------------------------------------------------- + +# https://cmake.org/cmake/help/v3.0/module/ExternalProject.html +include(ExternalProject) + # - CUHORNET set(CUHORNET_DIR ${CMAKE_CURRENT_BINARY_DIR}/cuhornet CACHE STRING "Path to cuhornet repo") set(CUHORNET_INCLUDE_DIR ${CUHORNET_DIR}/src/cuhornet CACHE STRING "Path to cuhornet includes") @@ -219,7 +239,7 @@ set(CUHORNET_INCLUDE_DIR ${CUHORNET_DIR}/src/cuhornet CACHE STRING "Path to cuho ExternalProject_Add(cuhornet GIT_REPOSITORY https://github.com/rapidsai/cuhornet.git - GIT_TAG master + GIT_TAG main PREFIX ${CUHORNET_DIR} CONFIGURE_COMMAND "" BUILD_COMMAND "" @@ -232,12 +252,18 @@ set(CUGUNROCK_DIR ${CMAKE_CURRENT_BINARY_DIR}/cugunrock CACHE STRING ExternalProject_Add(cugunrock GIT_REPOSITORY https://github.com/rapidsai/cugunrock.git - GIT_TAG fea_full_bc # provide a branch, a tag, or even a commit hash + GIT_TAG main PREFIX ${CUGUNROCK_DIR} CMAKE_ARGS -DCMAKE_INSTALL_PREFIX= -DGPU_ARCHS="" -DGUNROCK_BUILD_SHARED_LIBS=OFF -DGUNROCK_BUILD_TESTS=OFF + -DCUDA_AUTODETECT_GENCODE=FALSE + -DGUNROCK_GENCODE_SM60=TRUE + -DGUNROCK_GENCODE_SM61=TRUE + -DGUNROCK_GENCODE_SM70=TRUE + -DGUNROCK_GENCODE_SM72=TRUE + -DGUNROCK_GENCODE_SM75=TRUE BUILD_BYPRODUCTS ${CUGUNROCK_DIR}/lib/libgunrock.a ) @@ -263,7 +289,7 @@ endif(NOT NCCL_PATH) if(DEFINED ENV{RAFT_PATH}) message(STATUS "RAFT_PATH environment variable detected.") message(STATUS "RAFT_DIR set to $ENV{RAFT_PATH}") - set(RAFT_DIR ENV{RAFT_PATH}) + set(RAFT_DIR "$ENV{RAFT_PATH}") ExternalProject_Add(raft DOWNLOAD_COMMAND "" @@ -278,14 +304,14 @@ else(DEFINED ENV{RAFT_PATH}) ExternalProject_Add(raft GIT_REPOSITORY https://github.com/rapidsai/raft.git - GIT_TAG e003de27fc4e4a096337f184dddbd7942a68bb5c + GIT_TAG 099e2b874b05555a78bed1666fa2d22f784e56a7 PREFIX ${RAFT_DIR} CONFIGURE_COMMAND "" BUILD_COMMAND "" INSTALL_COMMAND "") # Redefining RAFT_DIR so it coincides with the one inferred by env variable. 
- set(RAFT_DIR ${RAFT_DIR}/src/raft/ CACHE STRING "Path to RAFT repo") + set(RAFT_DIR "${RAFT_DIR}/src/raft/") endif(DEFINED ENV{RAFT_PATH}) @@ -301,13 +327,14 @@ link_directories( "${CMAKE_CUDA_IMPLICIT_LINK_DIRECTORIES}") add_library(cugraph SHARED - src/comms/mpi/comms_mpi.cpp src/db/db_object.cu src/db/db_parser_integration_test.cu src/db/db_operators.cu - src/utilities/cusparse_helper.cu + src/utilities/spmv_1D.cu src/structure/graph.cu src/link_analysis/pagerank.cu + src/link_analysis/pagerank_1D.cu + src/link_analysis/gunrock_hits.cpp src/traversal/bfs.cu src/traversal/sssp.cu src/link_prediction/jaccard.cu @@ -318,25 +345,17 @@ add_library(cugraph SHARED src/community/spectral_clustering.cu src/community/louvain.cpp src/community/louvain_kernels.cu + src/community/leiden.cpp + src/community/leiden_kernels.cu src/community/ktruss.cu src/community/ECG.cu src/community/triangles_counting.cu src/community/extract_subgraph_by_vertex.cu src/cores/core_number.cu src/traversal/two_hop_neighbors.cu - src/utilities/cusparse_helper.cu src/components/connectivity.cu src/centrality/katz_centrality.cu src/centrality/betweenness_centrality.cu - src/nvgraph/kmeans.cu - src/nvgraph/lanczos.cu - src/nvgraph/spectral_matrix.cu - src/nvgraph/modularity_maximization.cu - src/nvgraph/nvgraph_cusparse.cpp - src/nvgraph/nvgraph_cublas.cpp - src/nvgraph/nvgraph_lapack.cu - src/nvgraph/nvgraph_vector_kernels.cu - src/nvgraph/partition.cu ) # @@ -346,20 +365,17 @@ add_library(cugraph SHARED add_dependencies(cugraph cugunrock) add_dependencies(cugraph raft) -if (BUILD_MPI) - add_compile_definitions(ENABLE_OPG=1) -endif (BUILD_MPI) - ################################################################################################### # - include paths --------------------------------------------------------------------------------- target_include_directories(cugraph PRIVATE + "${CUB_INCLUDE_DIR}" + "${THRUST_INCLUDE_DIR}" "${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES}" "${LIBCYPHERPARSER_INCLUDE}" "${Boost_INCLUDE_DIRS}" "${RMM_INCLUDE}" "${CMAKE_CURRENT_SOURCE_DIR}/../thirdparty" - "${CUB_INCLUDE_DIR}" "${CUHORNET_INCLUDE_DIR}/hornet/include" "${CUHORNET_INCLUDE_DIR}/hornetsnest/include" "${CUHORNET_INCLUDE_DIR}/xlib/include" @@ -367,7 +383,6 @@ target_include_directories(cugraph "${CMAKE_CURRENT_SOURCE_DIR}/src" "${CUGUNROCK_DIR}/include" "${NCCL_INCLUDE_DIRS}" - "${MPI_CXX_INCLUDE_PATH}" "${RAFT_DIR}/cpp/include" PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/include" @@ -377,7 +392,7 @@ target_include_directories(cugraph # - link libraries -------------------------------------------------------------------------------- target_link_libraries(cugraph PRIVATE - ${RMM_LIBRARY} gunrock ${NVSTRINGS_LIBRARY} cublas cusparse curand cusolver cudart cuda ${LIBCYPHERPARSER_LIBRARY} ${MPI_CXX_LIBRARIES} ${NCCL_LIBRARIES}) + ${RMM_LIBRARY} gunrock cublas cusparse curand cusolver cudart cuda ${LIBCYPHERPARSER_LIBRARY} ${MPI_CXX_LIBRARIES} ${NCCL_LIBRARIES}) if(OpenMP_CXX_FOUND) target_link_libraries(cugraph PRIVATE diff --git a/cpp/cmake/Modules/FindNCCL.cmake b/cpp/cmake/Modules/FindNCCL.cmake index 16ca4458a7f..0f673707444 100644 --- a/cpp/cmake/Modules/FindNCCL.cmake +++ b/cpp/cmake/Modules/FindNCCL.cmake @@ -1,4 +1,4 @@ -# Copyright (c) 2019, NVIDIA CORPORATION. +# Copyright (c) 2019-2020, NVIDIA CORPORATION. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
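The conda recipe changes above replace the commit-count build number with conda-build's GIT_DESCRIBE values, so the package build string now embeds the git hash. Roughly, the string is assembled like this (a sketch; the literals are illustrative, conda-build supplies the real values):

```python
# Sketch of the conda package build string after the meta.yaml changes above.
py_version = "37"                 # from CONDA_PY
git_describe_hash = "gabc1234"    # from GIT_DESCRIBE_HASH (hypothetical)
git_describe_number = "42"        # commits since the last tag

build_string = "py{}_{}_{}".format(py_version, git_describe_hash,
                                   git_describe_number)
print(build_string)  # py37_gabc1234_42
```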
diff --git a/cpp/include/algorithms.hpp b/cpp/include/algorithms.hpp
index ece827475ee..5241043fe88 100644
--- a/cpp/include/algorithms.hpp
+++ b/cpp/include/algorithms.hpp
@@ -17,6 +17,7 @@
 
 #include
 #include
+#include
 
 namespace cugraph {
 
@@ -28,6 +29,7 @@ namespace cugraph {
 * when the tolerance decreases and/or alpha increases toward the limiting value of 1.
 * The user is free to use default values or to provide inputs for the initial guess,
 * tolerance and maximum number of iterations.
+ *
 * @throws cugraph::logic_error with a custom message when an error occurs.
@@ -38,7 +40,9 @@ namespace cugraph {
 32-bit)
 * @tparam WT Type of edge weights. Supported value : float or double.
 *
- * @param[in] graph cuGRAPH graph descriptor, should contain the connectivity
+ * @param[in] handle Library handle (RAFT). If a communicator is set in the handle,
+ the multi GPU version will be selected.
+ * @param[in] graph cuGraph graph descriptor, should contain the connectivity
 information as a transposed adjacency list (CSC). Edge weights are not used for this algorithm.
 * @param[in] alpha The damping factor alpha represents the probability to follow an outgoing edge, standard value is 0.85. Thus, 1.0-alpha is the probability to “teleport” to a
@@ -48,36 +52,38 @@ namespace cugraph {
 * @param[in] pagerank Array of size V. Should contain the initial guess if has_guess=true. In this case the initial guess cannot be the vector of 0s. Memory is provided and owned by the caller.
- * @param[in] personalization_subset_size (optional) The number of vertices for to personalize.
- Initialized to 0 by default.
- * @param[in] personalization_subset (optional) Array of size personalization_subset_size containing
- vertices for running personalized pagerank. Initialized to nullptr by default. Memory is provided
- and owned by the caller.
- * @param[in] personalization_values (optional) Array of size personalization_subset_size containing
- values associated with personalization_subset vertices. Initialized to nullptr by default. Memory
- is provided and owned by the caller.
- * @param[in] tolerance Set the tolerance the approximation, this parameter should be a
- small magnitude value.
+ * @param[in] personalization_subset_size (optional) Supported on single-GPU, on the roadmap for
+ Multi-GPU. The number of vertices to personalize. Initialized to 0 by default.
+ * @param[in] personalization_subset (optional) Supported on single-GPU, on the roadmap for
+ Multi-GPU. Array of size personalization_subset_size containing vertices for running personalized
+ pagerank. Initialized to nullptr by default. Memory is provided and owned by the caller.
+ * @param[in] personalization_values (optional) Supported on single-GPU, on the roadmap for
+ Multi-GPU. Array of size personalization_subset_size containing values associated with
+ personalization_subset vertices. Initialized to nullptr by default. Memory is provided and owned by
+ the caller.
+ * @param[in] tolerance Supported on single-GPU. Set the tolerance of the approximation,
+ this parameter should be a small magnitude value.
 * The lower the tolerance the better the approximation. If this
- value is 0.0f, cuGRAPH will use the default value which is 1.0E-5.
+ value is 0.0f, cuGraph will use the default value which is 1.0E-5.
 * Setting too small a tolerance can lead to non-convergence due to numerical roundoff. Usually values between 0.01 and 0.00001 are acceptable.
 * @param[in] max_iter (optional) The maximum number of iterations before an answer is returned.
 This can be used to limit the execution time and do an early exit before the solver reaches the convergence tolerance.
- * If this value is lower or equal to 0 cuGRAPH will use the
+ * If this value is lower than or equal to 0, cuGraph will use the
 default value, which is 500.
- * @param[in] has_guess (optional) This parameter is used to notify cuGRAPH if it
- should use a user-provided initial guess. False means the user does not have a guess, in this case
- cuGRAPH will use a uniform vector set to 1/V.
- * If the value is True, cuGRAPH will read the pagerank parameter
+ * @param[in] has_guess (optional) Supported on single-GPU. This parameter is used to
+ notify cuGraph if it should use a user-provided initial guess. False means the user does not have a
+ guess; in this case cuGraph will use a uniform vector set to 1/V.
+ * If the value is True, cuGraph will read the pagerank parameter
 and use this as an initial guess.
 * @param[out] *pagerank The PageRank : pagerank[i] is the PageRank of vertex i. Memory remains provided and owned by the caller.
 *
 */
 template <typename VT, typename ET, typename WT>
-void pagerank(experimental::GraphCSCView<VT, ET, WT> const &graph,
+void pagerank(raft::handle_t const &handle,
+              GraphCSCView<VT, ET, WT> const &graph,
               WT *pagerank,
               VT personalization_subset_size = 0,
               VT *personalization_subset = nullptr,
@@ -106,7 +112,7 @@ void pagerank(experimental::GraphCSCView<VT, ET, WT> const &graph,
 * caller
 */
 template <typename VT, typename ET, typename WT>
-void jaccard(experimental::GraphCSRView<VT, ET, WT> const &graph, WT const *weights, WT *result);
+void jaccard(GraphCSRView<VT, ET, WT> const &graph, WT const *weights, WT *result);
 
 /**
 * @brief Compute jaccard similarity coefficient for selected vertex pairs
 *
@@ -130,7 +136,7 @@ void jaccard(experimental::GraphCSRView<VT, ET, WT> const &graph, WT const *weig
 * caller
 */
 template <typename VT, typename ET, typename WT>
-void jaccard_list(experimental::GraphCSRView<VT, ET, WT> const &graph,
+void jaccard_list(GraphCSRView<VT, ET, WT> const &graph,
                   WT const *weights,
                   ET num_pairs,
                   VT const *first,
@@ -156,7 +162,7 @@ void jaccard_list(experimental::GraphCSRView<VT, ET, WT> const &graph,
 * caller
 */
 template <typename VT, typename ET, typename WT>
-void overlap(experimental::GraphCSRView<VT, ET, WT> const &graph, WT const *weights, WT *result);
+void overlap(GraphCSRView<VT, ET, WT> const &graph, WT const *weights, WT *result);
 
 /**
 * @brief Compute overlap coefficient for select pairs of vertices
 *
@@ -180,7 +186,7 @@ void overlap(experimental::GraphCSRView<VT, ET, WT> const &graph, WT const *weig
 * caller
 */
 template <typename VT, typename ET, typename WT>
-void overlap_list(experimental::GraphCSRView<VT, ET, WT> const &graph,
+void overlap_list(GraphCSRView<VT, ET, WT> const &graph,
                   WT const *weights,
                   ET num_pairs,
                   VT const *first,
@@ -203,7 +209,7 @@ void overlap_list(experimental::GraphCSRView<VT, ET, WT> const &graph,
 * @tparam WT Type of edge weights. Supported values : float or
 * double.
 *
- * @param[in] graph cuGRAPH graph descriptor, should contain the
+ * @param[in] graph cuGraph graph descriptor, should contain the
 * connectivity information as a COO. Graph is considered undirected. Edge weights are used for this
 * algorithm and set to 1 by default.
 * @param[out] pos Device array (2, n) containing x-axis and y-axis
@@ -241,7 +247,7 @@ void overlap_list(experimental::GraphCSRView<VT, ET, WT> const &graph,
 *
 */
 template <typename VT, typename ET, typename WT>
-void force_atlas2(experimental::GraphCOOView<VT, ET, WT> &graph,
+void force_atlas2(GraphCOOView<VT, ET, WT> &graph,
                   float *pos,
                   const int max_iter = 500,
                   float *x_start = nullptr,
@@ -267,39 +273,87 @@ void force_atlas2(experimental::GraphCOOView<VT, ET, WT> &graph,
 *
 * The current implementation does not support a weighted graph.
 *
- * @throws cugraph::logic_error with a custom message when an error
- * occurs.
+ * @throws cugraph::logic_error if `result == nullptr` or
+ * `number_of_sources < 0` or `number_of_sources !=0 and sources == nullptr`.
+ * @tparam vertex_t Type of vertex identifiers. Supported value : int
+ * (signed, 32-bit)
+ * @tparam edge_t Type of edge identifiers. Supported value : int
+ * (signed, 32-bit)
+ * @tparam weight_t Type of edge weights. Supported values : float or
+ * double.
+ * @tparam result_t Type of computed result. Supported values : float or
+ * double
+ * @param[in] handle Library handle (RAFT). If a communicator is set in the
+ * handle, the multi GPU version will be selected.
+ * @param[in] graph cuGRAPH graph descriptor, should contain the
+ * connectivity information as a CSR
+ * @param[out] result Device array of centrality scores
+ * @param[in] normalized If true, return normalized scores, if false return
+ * unnormalized scores.
+ * @param[in] endpoints If true, include endpoints of paths in score, if false
+ * do not
+ * @param[in] weight If specified, device array of weights for each edge
+ * @param[in] k If specified, number of vertex samples defined in the
+ * vertices array.
+ * @param[in] vertices If specified, host array of vertex ids to estimate
+ * betweenness; these vertices will serve as sources for the traversal
+ * algorithm to obtain shortest path counters.
+ * @param[in] total_number_of_source_used If specified, use this number to normalize results
+ * when using subsampling; it allows accumulation of results across multiple calls.
 *
- * @tparam VT Type of vertex identifiers. Supported value : int (signed,
- * 32-bit)
- * @tparam ET Type of edge identifiers. Supported value : int (signed,
- * 32-bit)
- * @tparam WT Type of edge weights. Supported values : float or double.
- * @tparam result_t Type of computed result. Supported values : float or double
- * (double only supported in default implementation)
+ */
+template <typename vertex_t, typename edge_t, typename weight_t, typename result_t>
+void betweenness_centrality(const raft::handle_t &handle,
+                            GraphCSRView<vertex_t, edge_t, weight_t> const &graph,
+                            result_t *result,
+                            bool normalized = true,
+                            bool endpoints = false,
+                            weight_t const *weight = nullptr,
+                            vertex_t k = 0,
+                            vertex_t const *vertices = nullptr);
+
+/**
+ * @brief Compute edge betweenness centrality for a graph
 *
- * @param[in] graph cuGRAPH graph descriptor, should contain the connectivity
- * information as a CSR
- * @param[out] result Device array of centrality scores
- * @param[in] normalized If true, return normalized scores, if false return unnormalized
- * scores.
- * @param[in] endpoints If true, include endpoints of paths in score, if false do not
- * @param[in] weight If specified, device array of weights for each edge
- * @param[in] k If specified, number of vertex samples defined in the vertices
- * array.
- * @param[in] vertices If specified, host array of vertex ids to estimate betweenness
- * centrality, these vertices will serve as sources for the traversal algorihtm to obtain
- * shortest path counters.
+ * Betweenness centrality of an edge is the sum of the fraction of all-pairs shortest paths that
+ * pass through this edge. The weight parameter is currently not supported
+ *
+ * @throws cugraph::logic_error if `result == nullptr` or
+ * `number_of_sources < 0` or `number_of_sources !=0 and sources == nullptr` or `endpoints ==
+ * true`.
+ * @tparam vertex_t Type of vertex identifiers. Supported value : int
+ * (signed, 32-bit)
+ * @tparam edge_t Type of edge identifiers. Supported value : int
+ * (signed, 32-bit)
+ * @tparam weight_t Type of edge weights. Supported values : float or
+ * double.
+ * @tparam result_t Type of computed result. Supported values : float or
+ * double
+ * @param[in] handle Library handle (RAFT). If a communicator is set in the
+ * handle, the multi GPU version will be selected.
+ * @param[in] graph cuGraph graph descriptor, should contain the
+ * connectivity information as a CSR
+ * @param[out] result Device array of centrality scores
+ * @param[in] normalized If true, return normalized scores, if false return
+ * unnormalized scores.
+ * @param[in] weight If specified, device array of weights for each edge
+ * @param[in] k If specified, number of vertex samples defined in the
+ * vertices array.
+ * @param[in] vertices If specified, host array of vertex ids to estimate
+ * betweenness; these vertices will serve as sources for the traversal
+ * algorithm to obtain shortest path counters.
+ * @param[in] total_number_of_source_used If specified, use this number to normalize results
+ * when using subsampling; it allows accumulation of results across multiple calls.
 *
 */
-template <typename VT, typename ET, typename WT, typename result_t>
-void betweenness_centrality(experimental::GraphCSRView<VT, ET, WT> const &graph,
-                            result_t *result,
-                            bool normalized = true,
-                            bool endpoints = false,
-                            WT const *weight = nullptr,
-                            VT k = 0,
-                            VT const *vertices = nullptr);
+template <typename vertex_t, typename edge_t, typename weight_t, typename result_t>
+void edge_betweenness_centrality(const raft::handle_t &handle,
+                                 GraphCSRView<vertex_t, edge_t, weight_t> const &graph,
+                                 result_t *result,
+                                 bool normalized = true,
+                                 weight_t const *weight = nullptr,
+                                 vertex_t k = 0,
+                                 vertex_t const *vertices = nullptr);
 
 enum class cugraph_cc_t {
   CUGRAPH_WEAK = 0,  ///> Weakly Connected Components
@@ -330,14 +384,14 @@ enum class cugraph_cc_t {
 * @tparam ET Type of edge identifiers. Supported value : int (signed, 32-bit)
 * @tparam WT Type of edge weights. Supported values : float or double.
 *
- * @param[in] graph cuGRAPH graph descriptor, should contain the connectivity
+ * @param[in] graph cuGraph graph descriptor, should contain the connectivity
 * information as a CSR
 * @param[in] connectivity_type STRONG or WEAK
 * @param[out] labels Device array of component labels (labels[i] indicates the label
 * associated with vertex id i.
 */
 template <typename VT, typename ET, typename WT>
-void connected_components(experimental::GraphCSRView<VT, ET, WT> const &graph,
+void connected_components(GraphCSRView<VT, ET, WT> const &graph,
                           cugraph_cc_t connectivity_type,
                           VT *labels);
@@ -358,7 +412,7 @@ void connected_components(experimental::GraphCSRView<VT, ET, WT> const &graph,
 * 32-bit)
 * @tparam WT Type of edge weights. Supported values : float or double.
 *
- * @param[in] graph cuGRAPH graph descriptor, should contain the connectivity
+ * @param[in] graph cuGraph graph descriptor, should contain the connectivity
 * information as a COO
 * @param[in] k The order of the truss
 * @param[in] mr Memory resource used to allocate the returned graph
 *
 */
 template <typename VT, typename ET, typename WT>
-std::unique_ptr<experimental::GraphCOO<VT, ET, WT>> k_truss_subgraph(
-  experimental::GraphCOOView<VT, ET, WT> const &graph,
+std::unique_ptr<GraphCOO<VT, ET, WT>> k_truss_subgraph(
+  GraphCOOView<VT, ET, WT> const &graph,
   int k,
   rmm::mr::device_memory_resource *mr = rmm::mr::get_default_resource());
@@ -384,7 +438,7 @@ std::unique_ptr<GraphCOO<VT, ET, WT>> k_truss_subgraph(
 * @tparam WT Type of edge weights. Supported values : float or double.
 * @tparam result_t Type of computed result. Supported values : float
 *
- * @param[in] graph cuGRAPH graph descriptor, should contain the connectivity
+ * @param[in] graph cuGraph graph descriptor, should contain the connectivity
 * information as a CSR
 * @param[out] result Device array of centrality scores
 * @param[in] alpha Attenuation factor with a default value of 0.1. Alpha is set to
@@ -404,7 +458,7 @@ std::unique_ptr<GraphCOO<VT, ET, WT>> k_truss_subgraph(
 * @param[in] normalized If True normalize the resulting katz centrality values
 */
 template <typename VT, typename ET, typename WT, typename result_t>
-void katz_centrality(experimental::GraphCSRView<VT, ET, WT> const &graph,
+void katz_centrality(GraphCSRView<VT, ET, WT> const &graph,
                      result_t *result,
                      double alpha,
                      int max_iter,
@@ -415,14 +469,14 @@
 /**
 * @brief Compute the Core Number for the nodes of the graph G
 *
- * @param[in] graph cuGRAPH graph descriptor with a valid edgeList or adjList
+ * @param[in] graph cuGraph graph descriptor with a valid edgeList or adjList
 * @param[out] core_number Populated by the core number of every vertex in the graph
 *
 * @throws cugraph::logic_error when an error occurs.
 */
/* ----------------------------------------------------------------------------*/
 template <typename VT, typename ET, typename WT>
-void core_number(experimental::GraphCSRView<VT, ET, WT> const &graph, VT *core_number);
+void core_number(GraphCSRView<VT, ET, WT> const &graph, VT *core_number);
 
 /**
 * @brief Compute K Core of the graph G
 *
@@ -435,7 +489,7 @@ void core_number(experimental::GraphCSRView<VT, ET, WT> const &graph, VT *core_n
 * 32-bit)
 * @tparam WT Type of edge weights. Supported values : float or double.
 *
- * @param[in] graph cuGRAPH graph in coordinate format
+ * @param[in] graph cuGraph graph in coordinate format
 * @param[in] k Order of the core. This value must not be negative.
 * @param[in] vertex_id User specified vertex identifiers for which core number values
 * are supplied
 * @param[in] core_number User supplied core number values corresponding to vertex_id
 * @param[in] num_vertex_ids Number of elements in vertex_id/core_number arrays
 * @param[in] mr Memory resource used to allocate the returned graph
 *
 * @param[out] out_graph Unique pointer to K Core subgraph in COO format
 */
 template <typename VT, typename ET, typename WT>
-std::unique_ptr<GraphCOO<VT, ET, WT>> k_core(
-  experimental::GraphCOOView<VT, ET, WT> const &graph,
+std::unique_ptr<GraphCOO<VT, ET, WT>> k_core(
+  GraphCOOView<VT, ET, WT> const &graph,
   int k,
   VT const *vertex_id,
   VT const *core_number,
@@ -472,8 +526,7 @@ std::unique_ptr<GraphCOO<VT, ET, WT>> k_core(
 * @return Graph in COO format
 */
 template <typename VT, typename ET, typename WT>
-std::unique_ptr<GraphCOO<VT, ET, WT>> get_two_hop_neighbors(
-  experimental::GraphCSRView<VT, ET, WT> const &graph);
+std::unique_ptr<GraphCOO<VT, ET, WT>> get_two_hop_neighbors(GraphCSRView<VT, ET, WT> const &graph);
 
 /**
 * @Synopsis Performs a single source shortest path traversal of a graph starting from a vertex.
 *
@@ -486,7 +539,7 @@ std::unique_ptr<GraphCOO<VT, ET, WT>> get_two_hop_neighbors(
 * 32-bit)
 * @tparam WT Type of edge weights. Supported values : float or double.
 *
- * @param[in] graph cuGRAPH graph descriptor, should contain the connectivity
+ * @param[in] graph cuGraph graph descriptor, should contain the connectivity
 * information as a CSR
 *
 * @param[out] distances If set to a valid pointer, array of size V populated by distance
@@ -500,7 +553,7 @@ std::unique_ptr<GraphCOO<VT, ET, WT>> get_two_hop_neighbors(
 *
 */
 template <typename VT, typename ET, typename WT>
-void sssp(experimental::GraphCSRView<VT, ET, WT> const &graph,
+void sssp(GraphCSRView<VT, ET, WT> const &graph,
           WT *distances,
           VT *predecessors,
           const VT source_vertex);
@@ -519,7 +572,9 @@ void sssp(experimental::GraphCSRView<VT, ET, WT> const &graph,
 * 32-bit)
 * @tparam WT Type of edge weights. Supported values : int (signed, 32-bit)
 *
- * @param[in] graph cuGRAPH graph descriptor, should contain the connectivity
+ * @param[in] handle Library handle (RAFT). If a communicator is set in the handle,
+ the multi GPU version will be selected.
+ * @param[in] graph cuGraph graph descriptor, should contain the connectivity
 * information as a CSR
 *
 * @param[out] distances If set to a valid pointer, this is populated by distance of
@@ -535,41 +590,96 @@ void sssp(GraphCSRView<VT, ET, WT> const &graph,
 *
 * @param[in] directed Treat the input graph as directed
 *
- * @throws cugraph::logic_error when an error occurs.
+ * @param[in] mg_batch If set to true, use the SG BFS path when comms are initialized.
+ *
 */
 template <typename VT, typename ET, typename WT>
-void bfs(experimental::GraphCSRView<VT, ET, WT> const &graph,
+void bfs(raft::handle_t const &handle,
+         GraphCSRView<VT, ET, WT> const &graph,
          VT *distances,
         VT *predecessors,
         double *sp_counters,
         const VT start_vertex,
-        bool directed = true);
+        bool directed = true,
+        bool mg_batch = false);
 
 /**
 * @brief Louvain implementation
 *
- * Compute a clustering of the graph by minimizing modularity
+ * Compute a clustering of the graph by maximizing modularity
+ *
+ * Computed using the Louvain method described in:
+ *
+ * VD Blondel, J-L Guillaume, R Lambiotte and E Lefebvre: Fast unfolding of
+ * community hierarchies in large networks, J Stat Mech P10008 (2008),
+ * http://arxiv.org/abs/0803.0476
 *
 * @throws cugraph::logic_error when an error occurs.
 *
- * @tparam VT Type of vertex identifiers.
+ * @tparam vertex_t Type of vertex identifiers.
 * Supported value : int (signed, 32-bit)
- * @tparam ET Type of edge identifiers.
+ * @tparam edge_t Type of edge identifiers.
 * Supported value : int (signed, 32-bit)
- * @tparam WT Type of edge weights. Supported values : float or double.
+ * @tparam weight_t Type of edge weights. Supported values : float or double.
 *
 * @param[in] graph input graph object (CSR)
 * @param[out] final_modularity modularity of the returned clustering
 * @param[out] num_level number of levels of the returned clustering
 * @param[out] clustering Pointer to device array where the clustering should be stored
 * @param[in] max_iter (optional) maximum number of iterations to run (default 100)
+ * @param[in] resolution (optional) The value of the resolution parameter to use.
+ * Called gamma in the modularity formula, this changes the size
+ * of the communities. Higher resolutions lead to more, smaller
+ * communities; lower resolutions lead to fewer, larger
+ * communities. (default 1)
+ *
 */
-template <typename VT, typename ET, typename WT>
-void louvain(experimental::GraphCSRView<VT, ET, WT> const &graph,
-             WT *final_modularity,
+template <typename vertex_t, typename edge_t, typename weight_t>
+void louvain(GraphCSRView<vertex_t, edge_t, weight_t> const &graph,
+             weight_t *final_modularity,
              int *num_level,
-             VT *louvain_parts,
-             int max_iter = 100);
+             vertex_t *louvain_parts,
+             int max_iter = 100,
+             weight_t resolution = weight_t{1});
+
+/**
+ * @brief Leiden implementation
+ *
+ * Compute a clustering of the graph by maximizing modularity using the Leiden improvements
+ * to the Louvain method.
+ *
+ * Computed using the Leiden method described in:
+ *
+ * Traag, V. A., Waltman, L., & van Eck, N. J. (2019). From Louvain to Leiden:
+ * guaranteeing well-connected communities. Scientific reports, 9(1), 5233.
+ * doi: 10.1038/s41598-019-41695-z
+ *
+ * @throws cugraph::logic_error when an error occurs.
+ *
+ * @tparam vertex_t Type of vertex identifiers.
+ * Supported value : int (signed, 32-bit)
+ * @tparam edge_t Type of edge identifiers.
+ * Supported value : int (signed, 32-bit)
+ * @tparam weight_t Type of edge weights. Supported values : float or double.
+ * + * @param[in] graph input graph object (CSR) + * @param[out] final_modularity modularity of the returned clustering + * @param[out] num_level number of levels of the returned clustering + * @param[out] clustering Pointer to device array where the clustering should be stored + * @param[in] max_iter (optional) maximum number of iterations to run (default 100) + * @param[in] resolution (optional) The value of the resolution parameter to use. + * Called gamma in the modularity formula, this changes the size + * of the communities. Higher resolutions lead to more, smaller + * communities; lower resolutions lead to fewer, larger + * communities. (default 1) + */ +template +void leiden(GraphCSRView const &graph, + weight_t &final_modularity, + int &num_level, + vertex_t *leiden_parts, + int max_iter = 100, + weight_t resolution = weight_t{1}); /** * @brief Computes the ecg clustering of the given graph. @@ -596,12 +706,9 @@ void louvain(experimental::GraphCSRView const &graph, * written */ template -void ecg(experimental::GraphCSRView const &graph_csr, - WT min_weight, - VT ensemble_size, - VT *ecg_parts); +void ecg(GraphCSRView const &graph_csr, WT min_weight, VT ensemble_size, VT *ecg_parts); -namespace nvgraph { +namespace triangle { /** * @brief Count the number of triangles in the graph @@ -619,8 +726,10 @@ namespace nvgraph { * @return The number of triangles */ template -uint64_t triangle_count(experimental::GraphCSRView const &graph); +uint64_t triangle_count(GraphCSRView const &graph); +} // namespace triangle +namespace subgraph { /** * @brief Extract subgraph by vertices * * @@ -642,8 +751,9 @@ uint64_t triangle_count(experimental::GraphCSRView const &graph); * @param[out] result a graph in COO format containing the edges in the subgraph */ template -std::unique_ptr> extract_subgraph_vertex( - experimental::GraphCOOView const &graph, VT const *vertices, VT num_vertices); +std::unique_ptr> extract_subgraph_vertex(GraphCOOView const &graph, + VT const *vertices, + VT num_vertices); /** * @brief Wrapper function for Nvgraph balanced cut clustering @@ -663,11 +773,14 @@ std::unique_ptr> extract_subgraph_vertex( * @param[in] evs_max_iter The maximum number of iterations of the eigenvalue solver * @param[in] kmean_tolerance The tolerance to use for the kmeans solver * @param[in] kmean_max_iter The maximum number of iteration of the k-means solver - * @param[out] clustering Pointer to device memory where the resulting clustering will be - * stored + * @param[out] clustering Pointer to device memory where the resulting clustering will + * be stored */ +} // namespace subgraph + +namespace ext_raft { template -void balancedCutClustering(experimental::GraphCSRView const &graph, +void balancedCutClustering(GraphCSRView const &graph, VT num_clusters, VT num_eigen_vects, WT evs_tolerance, @@ -694,11 +807,11 @@ void balancedCutClustering(experimental::GraphCSRView const &graph, * @param[in] evs_max_iter The maximum number of iterations of the eigenvalue solver * @param[in] kmean_tolerance The tolerance to use for the kmeans solver * @param[in] kmean_max_iter The maximum number of iteration of the k-means solver - * @param[out] clustering Pointer to device memory where the resulting clustering will be - * stored + * @param[out] clustering Pointer to device memory where the resulting clustering will + * be stored */ template -void spectralModularityMaximization(experimental::GraphCSRView const &graph, +void spectralModularityMaximization(GraphCSRView const &graph, VT n_clusters, VT n_eig_vects, WT 
evs_tolerance, @@ -724,7 +837,7 @@ void spectralModularityMaximization(experimental::GraphCSRView const * @param[out] score Pointer to a float in which the result will be written */ template -void analyzeClustering_modularity(experimental::GraphCSRView const &graph, +void analyzeClustering_modularity(GraphCSRView const &graph, int n_clusters, VT const *clustering, WT *score); @@ -746,7 +859,7 @@ void analyzeClustering_modularity(experimental::GraphCSRView const & * @param[out] score Pointer to a float in which the result will be written */ template -void analyzeClustering_edge_cut(experimental::GraphCSRView const &graph, +void analyzeClustering_edge_cut(GraphCSRView const &graph, int n_clusters, VT const *clustering, WT *score); @@ -768,10 +881,50 @@ void analyzeClustering_edge_cut(experimental::GraphCSRView const &gr * @param[out] score Pointer to a float in which the result will be written */ template -void analyzeClustering_ratio_cut(experimental::GraphCSRView const &graph, +void analyzeClustering_ratio_cut(GraphCSRView const &graph, int n_clusters, VT const *clustering, WT *score); -} // namespace nvgraph +} // namespace ext_raft + +namespace gunrock { +/** + * @brief Compute the HITS vertex values for a graph + * + * cuGraph uses the gunrock implementation of HITS + * + * @throws cugraph::logic_error on an error + * + * @tparam VT Type of vertex identifiers. + * Supported value : int (signed, 32-bit) + * @tparam ET Type of edge identifiers. + * Supported value : int (signed, 32-bit) + * @tparam WT Type of edge weights. + * Supported value : float + * + * @param[in] graph input graph object (CSR). Edge weights are not used + * for this algorithm. + * @param[in] max_iter Maximum number of iterations to run + * @param[in] tolerance Currently ignored. gunrock implementation runs + * the specified number of iterations and stops + * @param[in] starting_value Currently ignored. gunrock does not support it. + * @param[in] normalized Currently ignored, gunrock computes this as true + * @param[out] *hubs Device memory pointing to the node value based + * on outgoing links + * @param[out] *authorities Device memory pointing to the node value based + * on incoming links + * + */ +template +void hits(GraphCSRView const &graph, + int max_iter, + WT tolerance, + WT const *starting_value, + bool normalized, + WT *hubs, + WT *authorities); + +} // namespace gunrock + } // namespace cugraph diff --git a/cpp/include/comms_mpi.hpp b/cpp/include/comms_mpi.hpp deleted file mode 100644 index 7a17bdfea4c..00000000000 --- a/cpp/include/comms_mpi.hpp +++ /dev/null @@ -1,74 +0,0 @@ -/* - * Copyright (c) 2020, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
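// Illustrative call sketch for the gunrock-backed HITS wrapper introduced above
// (not part of the patch). Per the parameter notes, tolerance, starting_value,
// and normalized are currently ignored, so only max_iter controls the run;
// `csr_view`, `hubs`, and `authorities` are hypothetical device-side names.
//
//   cugraph::gunrock::hits(csr_view, /*max_iter=*/50, /*tolerance=*/0.0f,
//                          /*starting_value=*/nullptr, /*normalized=*/true,
//                          hubs, authorities);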
- */ - -#pragma once -#if ENABLE_OPG -#include -#include -#endif -#include -namespace cugraph { -namespace experimental { - -enum class ReduceOp { SUM, MAX, MIN }; - -// basic info about the snmg env setup -class Comm { - private: - int _p{0}; - int _rank{0}; - bool _finalize_mpi{false}; - bool _finalize_nccl{false}; - - int _device_id{0}; - int _device_count{0}; - - int _sm_count_per_device{0}; - int _max_grid_dim_1D{0}; - int _max_block_dim_1D{0}; - int _l2_cache_size{0}; - int _shared_memory_size_per_sm{0}; - -#if ENABLE_OPG - MPI_Comm _mpi_comm{}; - ncclComm_t _nccl_comm{}; -#endif - - public: - Comm(){}; - Comm(int p); -#if ENABLE_OPG - Comm(ncclComm_t comm, int size, int rank); -#endif - ~Comm(); - int get_rank() const { return _rank; } - int get_p() const { return _p; } - int get_dev() const { return _device_id; } - int get_dev_count() const { return _device_count; } - int get_sm_count() const { return _sm_count_per_device; } - bool is_master() const { return (_rank == 0) ? true : false; } - - void barrier(); - - template - void allgather(size_t size, value_t *sendbuff, value_t *recvbuff) const; - - template - void allreduce(size_t size, value_t *sendbuff, value_t *recvbuff, ReduceOp reduce_op) const; -}; - -} // namespace experimental -} // namespace cugraph diff --git a/cpp/include/functions.hpp b/cpp/include/functions.hpp index db737a4f5a4..1e88acb54b7 100644 --- a/cpp/include/functions.hpp +++ b/cpp/include/functions.hpp @@ -15,70 +15,13 @@ */ #pragma once +#include #include #include namespace cugraph { -/** - * @brief Convert COO to CSR, unweighted - * - * Takes a list of edges in COOrdinate format and generates a CSR format. - * Note, if you want CSC format simply pass the src and dst arrays - * in the opposite order. - * - * @throws cugraph::logic_error when an error occurs. - * - * @tparam vertex_t type of vertex index - * @tparam edge_t type of edge index - * - * @param[in] num_edges Number of edges - * @param[in] src Device array containing original source vertices - * @param[in] dst Device array containing original dest vertices - * @param[out] offsets Device array containing the CSR offsets - * @param[out] indices Device array containing the CSR indices - * - * @return Number of unique vertices in the src and dst arrays - * - */ -template -vertex_t coo2csr( - edge_t num_edges, vertex_t const *src, vertex_t const *dst, edge_t **offsets, vertex_t **indices); - -/** - * @brief Convert COO to CSR, weighted - * - * Takes a list of edges in COOrdinate format and generates a CSR format. - * Note, if you want CSC format simply pass the src and dst arrays - * in the opposite order. - * - * @throws cugraph::logic_error when an error occurs. 
- * - * @tparam vertex_t type of vertex index - * @tparam edge_t type of edge index - * @tparam weight_t type of the edge weight - * - * @param[in] num_edges Number of edges - * @param[in] src Device array containing original source vertices - * @param[in] dst Device array containing original dest vertices - * @param[in] weights Device array containing original edge weights - * @param[out] offsets Device array containing the CSR offsets - * @param[out] indices Device array containing the CSR indices - * @param[out] csr_weights Device array containing the CSR edge weights - * - * @return Number of unique vertices in the src and dst arrays - * - */ -template -vertex_t coo2csr_weighted(edge_t num_edges, - vertex_t const *src, - vertex_t const *dst, - weight_t const *weights, - edge_t **offsets, - vertex_t **indices, - weight_t **csr_weights); - /** * @brief Convert COO to CSR * * @@ -90,15 +33,15 @@ vertex_t coo2csr_weighted(edge_t num_edges, * @tparam ET type of edge index * @tparam WT type of the edge weight * - * @param[in] graph cuGRAPH graph in coordinate format + * @param[in] graph cuGraph graph in coordinate format * @param[in] mr Memory resource used to allocate the returned graph * * @return Unique pointer to generate Compressed Sparse Row graph * */ template -std::unique_ptr> coo_to_csr( - experimental::GraphCOOView const &graph, +std::unique_ptr> coo_to_csr( + GraphCOOView const &graph, rmm::mr::device_memory_resource *mr = rmm::mr::get_default_resource()); /** @@ -135,4 +78,24 @@ std::unique_ptr renumber_vertices( ET *map_size, rmm::mr::device_memory_resource *mr = rmm::mr::get_default_resource()); +/** + * @brief Broadcast using the handle communicator + * + * Use the handle's communicator to perform the broadcast. + * + * @throws cugraph::logic_error when an error occurs. + * + * @tparam value_t Type of the data to broadcast + * + * @param[out] value Pointer to the data + * @param[in] count Number of elements to broadcast + * + */ + +// FIXME: It would be better to expose it in RAFT +template +void comms_bcast(const raft::handle_t &handle, value_t *value, size_t count) +{ + handle.get_comms().bcast(value, count, 0, handle.get_stream()); +} } // namespace cugraph diff --git a/cpp/include/graph.hpp b/cpp/include/graph.hpp index d7b1a2838ac..9d42b4acdd7 100644 --- a/cpp/include/graph.hpp +++ b/cpp/include/graph.hpp @@ -14,16 +14,15 @@ * limitations under the License. */ #pragma once -#include +#include +#include +#include #include #include +#include #include -#include -#include - namespace cugraph { -namespace experimental { enum class PropType { PROP_UNDEF, PROP_FALSE, PROP_TRUE }; @@ -47,107 +46,133 @@ enum class DegreeDirection { /** * @brief Base class graphs, all but vertices and edges * - * @tparam VT Type of vertex id - * @tparam ET Type of edge id - * @tparam WT Type of weight + * @tparam vertex_t Type of vertex id + * @tparam edge_t Type of edge id + * @tparam weight_t Type of weight */ -template +template class GraphViewBase { public: - WT *edge_data; ///< edge weight - Comm comm; + raft::handle_t *handle; + weight_t *edge_data; ///< edge weight GraphProperties prop; - VT number_of_vertices; - ET number_of_edges; + vertex_t number_of_vertices; + edge_t number_of_edges; + + vertex_t *local_vertices; + edge_t *local_edges; + vertex_t *local_offsets; /** * @brief Fill the identifiers array with the vertex identifiers. 
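// A minimal sketch of the comms_bcast helper added to functions.hpp above
// (illustrative, not part of the patch): rank 0's copy of a device buffer is
// propagated to all ranks, since the helper hardcodes root 0. Assumes the
// handle's communicator has been initialized and `d_value` is a hypothetical
// device pointer valid on every rank.
//
//   cugraph::comms_bcast(handle, d_value, /*count=*/1);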
* - * @param[out] identifier Pointer to device memory to store the vertex + * @param[out] identifiers Pointer to device memory to store the vertex * identifiers */ - void get_vertex_identifiers(VT *identifiers) const; - void set_communicator(Comm &comm_) { comm = comm_; } + void get_vertex_identifiers(vertex_t *identifiers) const; + + void set_local_data(vertex_t *vertices, edge_t *edges, vertex_t *offsets) + { + local_vertices = vertices; + local_edges = edges; + local_offsets = offsets; + } - GraphViewBase(WT *edge_data_, VT number_of_vertices_, ET number_of_edges_) - : edge_data(edge_data_), - comm(), + void set_handle(raft::handle_t *handle_in) { handle = handle_in; } + + GraphViewBase(weight_t *edge_data, vertex_t number_of_vertices, edge_t number_of_edges) + : handle(nullptr), + edge_data(edge_data), prop(), - number_of_vertices(number_of_vertices_), - number_of_edges(number_of_edges_) + number_of_vertices(number_of_vertices), + number_of_edges(number_of_edges), + local_vertices(nullptr), + local_edges(nullptr), + local_offsets(nullptr) { } + bool has_data(void) const { return edge_data != nullptr; } }; /** * @brief A graph stored in COO (COOrdinate) format. * - * @tparam VT Type of vertex id - * @tparam ET Type of edge id - * @tparam WT Type of weight + * @tparam vertex_t Type of vertex id + * @tparam edge_t Type of edge id + * @tparam weight_t Type of weight */ -template -class GraphCOOView : public GraphViewBase { +template +class GraphCOOView : public GraphViewBase { public: - VT *src_indices{nullptr}; ///< rowInd - VT *dst_indices{nullptr}; ///< colInd + vertex_t *src_indices{nullptr}; ///< rowInd + vertex_t *dst_indices{nullptr}; ///< colInd /** * @brief Computes degree(in, out, in+out) of all the nodes of a Graph * * @throws cugraph::logic_error when an error occurs. * - * @param[out] degree Device array of size V (V is number of vertices) initialized + * @param[out] degree Device array of size V (V is number of + * vertices) initialized * to zeros. Will contain the computed degree of every vertex. * @param[in] direction IN_PLUS_OUT, IN or OUT */ - void degree(ET *degree, DegreeDirection direction) const; + void degree(edge_t *degree, DegreeDirection direction) const; /** * @brief Default constructor */ - GraphCOOView() : GraphViewBase(nullptr, 0, 0) {} + GraphCOOView() : GraphViewBase(nullptr, 0, 0) {} /** * @brief Wrap existing arrays representing an edge list in a Graph. * - * GraphCOOView does not own the memory used to represent this graph. This + * GraphCOOView does not own the memory used to represent this + * graph. This * function does not allocate memory. * - * @param source_indices This array of size E (number of edges) contains the index of the + * @param source_indices This array of size E (number of edges) + * contains the index of the * source for each edge. Indices must be in the range [0, V-1]. - * @param destination_indices This array of size E (number of edges) contains the index of the + * @param destination_indices This array of size E (number of edges) + * contains the index of the * destination for each edge. Indices must be in the range [0, V-1]. - * @param edge_data This array size E (number of edges) contains the weight for each - * edge. This array can be null in which case the graph is considered unweighted. + * @param edge_data This array size E (number of edges) contains + * the weight for each + * edge. This array can be null in which case the graph is considered + * unweighted. 
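// Illustrative sketch (not part of the patch): wrapping caller-owned device
// arrays in the non-owning view described above, then converting to an owning
// CSR graph with the coo_to_csr factory from functions.hpp. `src`, `dst`,
// `weights`, `num_verts`, and `num_edges` are hypothetical names.
//
//   cugraph::GraphCOOView<int, int, float> coo(src, dst, weights, num_verts, num_edges);
//   auto csr = cugraph::coo_to_csr(coo);  // std::unique_ptr<GraphCSR<int, int, float>>
//   auto csr_view = csr->view();          // non-owning view for algorithm calls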
* @param number_of_vertices The number of vertices in the graph * @param number_of_edges The number of edges in the graph */ - GraphCOOView( - VT *src_indices_, VT *dst_indices_, WT *edge_data_, VT number_of_vertices_, ET number_of_edges_) - : GraphViewBase(edge_data_, number_of_vertices_, number_of_edges_), - src_indices(src_indices_), - dst_indices(dst_indices_) + GraphCOOView(vertex_t *src_indices, + vertex_t *dst_indices, + weight_t *edge_data, + vertex_t number_of_vertices, + edge_t number_of_edges) + : GraphViewBase(edge_data, number_of_vertices, number_of_edges), + src_indices(src_indices), + dst_indices(dst_indices) { } }; /** - * @brief Base class for graph stored in CSR (Compressed Sparse Row) format or CSC (Compressed + * @brief Base class for graph stored in CSR (Compressed Sparse Row) + * format or CSC (Compressed * Sparse Column) format * - * @tparam VT Type of vertex id - * @tparam ET Type of edge id - * @tparam WT Type of weight + * @tparam vertex_t Type of vertex id + * @tparam edge_t Type of edge id + * @tparam weight_t Type of weight */ -template -class GraphCompressedSparseBaseView : public GraphViewBase { +template +class GraphCompressedSparseBaseView : public GraphViewBase { public: - ET *offsets{nullptr}; ///< CSR offsets - VT *indices{nullptr}; ///< CSR indices + edge_t *offsets{nullptr}; ///< CSR offsets + vertex_t *indices{nullptr}; ///< CSR indices /** * @brief Fill the identifiers in the array with the source vertex @@ -156,42 +181,53 @@ class GraphCompressedSparseBaseView : public GraphViewBase { * @param[out] src_indices Pointer to device memory to store the * source vertex identifiers */ - void get_source_indices(VT *src_indices) const; + void get_source_indices(vertex_t *src_indices) const; /** * @brief Computes degree(in, out, in+out) of all the nodes of a Graph * * @throws cugraph::logic_error when an error occurs. * - * @param[out] degree Device array of size V (V is number of vertices) initialized + * @param[out] degree Device array of size V (V is number of + * vertices) initialized * to zeros. Will contain the computed degree of every vertex. - * @param[in] x Integer value indicating type of degree calculation + * @param[in] direction Integer value indicating type of degree + * calculation * 0 : in+out degree * 1 : in-degree * 2 : out-degree */ - void degree(ET *degree, DegreeDirection direction) const; + void degree(edge_t *degree, DegreeDirection direction) const; /** * @brief Wrap existing arrays representing adjacency lists in a Graph. - * GraphCSRView does not own the memory used to represent this graph. This + * GraphCSRView does not own the memory used to represent this + * graph. This * function does not allocate memory. * - * @param offsets This array of size V+1 (V is number of vertices) contains the - * offset of adjacency lists of every vertex. Offsets must be in the range [0, E] (number of + * @param offsets This array of size V+1 (V is number of + * vertices) contains the + * offset of adjacency lists of every vertex. Offsets must be in the range [0, + * E] (number of * edges). - * @param indices This array of size E contains the index of the destination for + * @param indices This array of size E contains the index of + * the destination for * each edge. Indices must be in the range [0, V-1]. - * @param edge_data This array of size E (number of edges) contains the weight for - * each edge. This array can be null in which case the graph is considered unweighted. 
+ * @param edge_data This array of size E (number of edges) + * contains the weight for + * each edge. This array can be null in which case the graph is considered + * unweighted. * @param number_of_vertices The number of vertices in the graph * @param number_of_edges The number of edges in the graph */ - GraphCompressedSparseBaseView( - ET *offsets_, VT *indices_, WT *edge_data_, VT number_of_vertices_, ET number_of_edges_) - : GraphViewBase(edge_data_, number_of_vertices_, number_of_edges_), - offsets{offsets_}, - indices{indices_} + GraphCompressedSparseBaseView(edge_t *offsets, + vertex_t *indices, + weight_t *edge_data, + vertex_t number_of_vertices, + edge_t number_of_edges) + : GraphViewBase(edge_data, number_of_vertices, number_of_edges), + offsets{offsets}, + indices{indices} { } }; @@ -199,37 +235,49 @@ class GraphCompressedSparseBaseView : public GraphViewBase { /** * @brief A graph stored in CSR (Compressed Sparse Row) format. * - * @tparam VT Type of vertex id - * @tparam ET Type of edge id - * @tparam WT Type of weight + * @tparam vertex_t Type of vertex id + * @tparam edge_t Type of edge id + * @tparam weight_t Type of weight */ -template -class GraphCSRView : public GraphCompressedSparseBaseView { +template +class GraphCSRView : public GraphCompressedSparseBaseView { public: /** * @brief Default constructor */ - GraphCSRView() : GraphCompressedSparseBaseView(nullptr, nullptr, nullptr, 0, 0) {} + GraphCSRView() + : GraphCompressedSparseBaseView(nullptr, nullptr, nullptr, 0, 0) + { + } /** * @brief Wrap existing arrays representing adjacency lists in a Graph. - * GraphCSRView does not own the memory used to represent this graph. This + * GraphCSRView does not own the memory used to represent this + * graph. This * function does not allocate memory. * - * @param offsets This array of size V+1 (V is number of vertices) contains the - * offset of adjacency lists of every vertex. Offsets must be in the range [0, E] (number of + * @param offsets This array of size V+1 (V is number of + * vertices) contains the + * offset of adjacency lists of every vertex. Offsets must be in the range [0, + * E] (number of * edges). - * @param indices This array of size E contains the index of the destination for + * @param indices This array of size E contains the index of + * the destination for * each edge. Indices must be in the range [0, V-1]. - * @param edge_data This array of size E (number of edges) contains the weight for - * each edge. This array can be null in which case the graph is considered unweighted. + * @param edge_data This array of size E (number of edges) + * contains the weight for + * each edge. This array can be null in which case the graph is considered + * unweighted. * @param number_of_vertices The number of vertices in the graph * @param number_of_edges The number of edges in the graph */ - GraphCSRView( - ET *offsets_, VT *indices_, WT *edge_data_, VT number_of_vertices_, ET number_of_edges_) - : GraphCompressedSparseBaseView( - offsets_, indices_, edge_data_, number_of_vertices_, number_of_edges_) + GraphCSRView(edge_t *offsets, + vertex_t *indices, + weight_t *edge_data, + vertex_t number_of_vertices, + edge_t number_of_edges) + : GraphCompressedSparseBaseView( + offsets, indices, edge_data, number_of_vertices, number_of_edges) { } }; @@ -237,57 +285,75 @@ class GraphCSRView : public GraphCompressedSparseBaseView { /** * @brief A graph stored in CSC (Compressed Sparse Column) format. 
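// Sketch of the degree query shared by the COO and compressed-sparse views
// (illustrative only, assuming the IN_PLUS_OUT enumerator spelled as in the
// documentation above). `degrees` is a hypothetical device array of one edge
// counter per vertex which, per that documentation, starts zero-initialized.
//
//   csr_view.degree(degrees, cugraph::DegreeDirection::IN_PLUS_OUT);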
* - * @tparam VT Type of vertex id - * @tparam ET Type of edge id - * @tparam WT Type of weight + * @tparam vertex_t Type of vertex id + * @tparam edge_t Type of edge id + * @tparam weight_t Type of weight */ -template -class GraphCSCView : public GraphCompressedSparseBaseView { +template +class GraphCSCView : public GraphCompressedSparseBaseView { public: /** * @brief Default constructor */ - GraphCSCView() : GraphCompressedSparseBaseView(nullptr, nullptr, nullptr, 0, 0) {} + GraphCSCView() + : GraphCompressedSparseBaseView(nullptr, nullptr, nullptr, 0, 0) + { + } /** - * @brief Wrap existing arrays representing transposed adjacency lists in a Graph. - * GraphCSCView does not own the memory used to represent this graph. This + * @brief Wrap existing arrays representing transposed adjacency lists in + * a Graph. + * GraphCSCView does not own the memory used to represent this + * graph. This * function does not allocate memory. * - * @param offsets This array of size V+1 (V is number of vertices) contains the - * offset of adjacency lists of every vertex. Offsets must be in the range [0, E] (number of + * @param offsets This array of size V+1 (V is number of + * vertices) contains the + * offset of adjacency lists of every vertex. Offsets must be in the range [0, + * E] (number of * edges). - * @param indices This array of size E contains the index of the destination for + * @param indices This array of size E contains the index of + * the destination for * each edge. Indices must be in the range [0, V-1]. - * @param edge_data This array of size E (number of edges) contains the weight for - * each edge. This array can be null in which case the graph is considered unweighted. + * @param edge_data This array of size E (number of edges) + * contains the weight for + * each edge. This array can be null in which case the graph is considered + * unweighted. * @param number_of_vertices The number of vertices in the graph * @param number_of_edges The number of edges in the graph */ - GraphCSCView( - ET *offsets_, VT *indices_, WT *edge_data_, VT number_of_vertices_, ET number_of_edges_) - : GraphCompressedSparseBaseView( - offsets_, indices_, edge_data_, number_of_vertices_, number_of_edges_) + GraphCSCView(edge_t *offsets, + vertex_t *indices, + weight_t *edge_data, + vertex_t number_of_vertices, + edge_t number_of_edges) + : GraphCompressedSparseBaseView( + offsets, indices, edge_data, number_of_vertices, number_of_edges) { } }; /** - * @brief TODO : Change this Take ownership of the provided graph arrays in COO format + * @brief TODO : Change this Take ownership of the provided graph arrays in + * COO format * - * @param source_indices This array of size E (number of edges) contains the index of the + * @param source_indices This array of size E (number of edges) contains + * the index of the * source for each edge. Indices must be in the range [0, V-1]. - * @param destination_indices This array of size E (number of edges) contains the index of the + * @param destination_indices This array of size E (number of edges) contains + * the index of the * destination for each edge. Indices must be in the range [0, V-1]. - * @param edge_data This array size E (number of edges) contains the weight for each - * edge. This array can be null in which case the graph is considered unweighted. + * @param edge_data This array size E (number of edges) contains + * the weight for each + * edge. This array can be null in which case the graph is considered + * unweighted. 
 * @param number_of_vertices The number of vertices in the graph * @param number_of_edges The number of edges in the graph */ -template +template struct GraphCOOContents { - VT number_of_vertices; - ET number_of_edges; + vertex_t number_of_vertices; + edge_t number_of_edges; std::unique_ptr src_indices; std::unique_ptr dst_indices; std::unique_ptr edge_data; }; @@ -298,278 +364,291 @@ struct GraphCOOContents { * * This class will src_indices and dst_indicies (until moved) * - * @tparam VT Type of vertex id - * @tparam ET Type of edge id - * @tparam WT Type of weight + * @tparam vertex_t Type of vertex id + * @tparam edge_t Type of edge id + * @tparam weight_t Type of weight */ -template +template class GraphCOO { - VT number_of_vertices_; - ET number_of_edges_; - rmm::device_buffer src_indices_{}; ///< rowInd - rmm::device_buffer dst_indices_{}; ///< colInd - rmm::device_buffer edge_data_{}; ///< CSR data + vertex_t number_of_vertices_p; + edge_t number_of_edges_p; + rmm::device_buffer src_indices_p{}; ///< rowInd + rmm::device_buffer dst_indices_p{}; ///< colInd + rmm::device_buffer edge_data_p{}; ///< CSR data public: /** * @brief Take ownership of the provided graph arrays in COO format * - * @param source_indices This array of size E (number of edges) contains the index of the - * source for each edge. Indices must be in the range [0, V-1]. - * @param destination_indices This array of size E (number of edges) contains the index of the - * destination for each edge. Indices must be in the range [0, V-1]. - * @param edge_data This array size E (number of edges) contains the weight for each - * edge. This array can be null in which case the graph is considered unweighted. * @param number_of_vertices The number of vertices in the graph * @param number_of_edges The number of edges in the graph + * @param has_data Whether or not the class has data, default = false + * @param stream Specify the cudaStream, default = null + * @param mr Specify the memory resource */ - GraphCOO(VT number_of_vertices, - ET number_of_edges, + GraphCOO(vertex_t number_of_vertices, + edge_t number_of_edges, bool has_data = false, cudaStream_t stream = nullptr, rmm::mr::device_memory_resource *mr = rmm::mr::get_default_resource()) - : number_of_vertices_(number_of_vertices), - number_of_edges_(number_of_edges), - src_indices_(sizeof(VT) * number_of_edges, stream, mr), - dst_indices_(sizeof(VT) * number_of_edges, stream, mr), - edge_data_((has_data ? sizeof(WT) * number_of_edges : 0), stream, mr) + : number_of_vertices_p(number_of_vertices), + number_of_edges_p(number_of_edges), + src_indices_p(sizeof(vertex_t) * number_of_edges, stream, mr), + dst_indices_p(sizeof(vertex_t) * number_of_edges, stream, mr), + edge_data_p((has_data ? 
sizeof(weight_t) * number_of_edges : 0), stream, mr) { } - GraphCOO(GraphCOOView const &graph, + GraphCOO(GraphCOOView const &graph, cudaStream_t stream = nullptr, rmm::mr::device_memory_resource *mr = rmm::mr::get_default_resource()) - : number_of_vertices_(graph.number_of_vertices), - number_of_edges_(graph.number_of_edges), - src_indices_(graph.src_indices, graph.number_of_edges * sizeof(VT), stream, mr), - dst_indices_(graph.dst_indices, graph.number_of_edges * sizeof(VT), stream, mr) + : number_of_vertices_p(graph.number_of_vertices), + number_of_edges_p(graph.number_of_edges), + src_indices_p(graph.src_indices, graph.number_of_edges * sizeof(vertex_t), stream, mr), + dst_indices_p(graph.dst_indices, graph.number_of_edges * sizeof(vertex_t), stream, mr) { if (graph.has_data()) { - edge_data_ = - rmm::device_buffer{graph.edge_data, graph.number_of_edges * sizeof(WT), stream, mr}; + edge_data_p = + rmm::device_buffer{graph.edge_data, graph.number_of_edges * sizeof(weight_t), stream, mr}; } } - VT number_of_vertices(void) { return number_of_vertices_; } - ET number_of_edges(void) { return number_of_edges_; } - VT *src_indices(void) { return static_cast(src_indices_.data()); } - VT *dst_indices(void) { return static_cast(dst_indices_.data()); } - WT *edge_data(void) { return static_cast(edge_data_.data()); } + vertex_t number_of_vertices(void) { return number_of_vertices_p; } + edge_t number_of_edges(void) { return number_of_edges_p; } + vertex_t *src_indices(void) { return static_cast(src_indices_p.data()); } + vertex_t *dst_indices(void) { return static_cast(dst_indices_p.data()); } + weight_t *edge_data(void) { return static_cast(edge_data_p.data()); } - GraphCOOContents release() noexcept + GraphCOOContents release() noexcept { - VT number_of_vertices = number_of_vertices_; - ET number_of_edges = number_of_edges_; - number_of_vertices_ = 0; - number_of_edges_ = 0; - return GraphCOOContents{ + vertex_t number_of_vertices = number_of_vertices_p; + edge_t number_of_edges = number_of_edges_p; + number_of_vertices_p = 0; + number_of_edges_p = 0; + return GraphCOOContents{ number_of_vertices, number_of_edges, - std::make_unique(std::move(src_indices_)), - std::make_unique(std::move(dst_indices_)), - std::make_unique(std::move(edge_data_))}; + std::make_unique(std::move(src_indices_p)), + std::make_unique(std::move(dst_indices_p)), + std::make_unique(std::move(edge_data_p))}; } - GraphCOOView view(void) noexcept + GraphCOOView view(void) noexcept { - return GraphCOOView( - src_indices(), dst_indices(), edge_data(), number_of_vertices_, number_of_edges_); + return GraphCOOView( + src_indices(), dst_indices(), edge_data(), number_of_vertices_p, number_of_edges_p); } - bool has_data(void) { return nullptr != edge_data_.data(); } + bool has_data(void) { return nullptr != edge_data_p.data(); } }; -template +template struct GraphSparseContents { - VT number_of_vertices; - ET number_of_edges; + vertex_t number_of_vertices; + edge_t number_of_edges; std::unique_ptr offsets; std::unique_ptr indices; std::unique_ptr edge_data; }; /** - * @brief Base class for constructted graphs stored in CSR (Compressed Sparse Row) format or + * @brief Base class for constructed graphs stored in CSR (Compressed + * Sparse Row) format or * CSC (Compressed Sparse Column) format * - * @tparam VT Type of vertex id - * @tparam ET Type of edge id - * @tparam WT Type of weight + * @tparam vertex_t Type of vertex id + * @tparam edge_t Type of edge id + * @tparam weight_t Type of weight */ -template +template class 
GraphCompressedSparseBase { - VT number_of_vertices_{0}; - ET number_of_edges_{0}; - rmm::device_buffer offsets_{}; ///< CSR offsets - rmm::device_buffer indices_{}; ///< CSR indices - rmm::device_buffer edge_data_{}; ///< CSR data + vertex_t number_of_vertices_p{0}; + edge_t number_of_edges_p{0}; + rmm::device_buffer offsets_p{}; ///< CSR offsets + rmm::device_buffer indices_p{}; ///< CSR indices + rmm::device_buffer edge_data_p{}; ///< CSR data - bool has_data_{false}; + bool has_data_p{false}; public: /** * @brief Take ownership of the provided graph arrays in CSR/CSC format * - * @param offsets This array of size V+1 (V is number of vertices) contains the - * offset of adjacency lists of every vertex. Offsets must be in the range [0, E] (number of - * edges). - * @param indices This array of size E contains the index of the destination for - * each edge. Indices must be in the range [0, V-1]. - * @param edge_data This array of size E (number of edges) contains the weight for - * each edge. This array can be null in which case the graph is considered unweighted. * @param number_of_vertices The number of vertices in the graph * @param number_of_edges The number of edges in the graph + * @param has_data Whether or not the class has data, default = false + * @param stream Specify the cudaStream, default = null + * @param mr Specify the memory resource */ - GraphCompressedSparseBase(VT number_of_vertices, - ET number_of_edges, + GraphCompressedSparseBase(vertex_t number_of_vertices, + edge_t number_of_edges, bool has_data, cudaStream_t stream, rmm::mr::device_memory_resource *mr) - : number_of_vertices_(number_of_vertices), - number_of_edges_(number_of_edges), - offsets_(sizeof(ET) * (number_of_vertices + 1), stream, mr), - indices_(sizeof(VT) * number_of_edges, stream, mr), - edge_data_((has_data ? sizeof(WT) * number_of_edges : 0), stream, mr) + : number_of_vertices_p(number_of_vertices), + number_of_edges_p(number_of_edges), + offsets_p(sizeof(edge_t) * (number_of_vertices + 1), stream, mr), + indices_p(sizeof(vertex_t) * number_of_edges, stream, mr), + edge_data_p((has_data ? 
sizeof(weight_t) * number_of_edges : 0), stream, mr) { } - GraphCompressedSparseBase(GraphSparseContents &&contents) - : number_of_vertices_(contents.number_of_vertices), - number_of_edges_(contents.number_of_edges), - offsets_(std::move(*contents.offsets.release())), - indices_(std::move(*contents.indices.release())), - edge_data_(std::move(*contents.edge_data.release())) + GraphCompressedSparseBase(GraphSparseContents &&contents) + : number_of_vertices_p(contents.number_of_vertices), + number_of_edges_p(contents.number_of_edges), + offsets_p(std::move(*contents.offsets.release())), + indices_p(std::move(*contents.indices.release())), + edge_data_p(std::move(*contents.edge_data.release())) { } - VT number_of_vertices(void) { return number_of_vertices_; } - ET number_of_edges(void) { return number_of_edges_; } - ET *offsets(void) { return static_cast(offsets_.data()); } - VT *indices(void) { return static_cast(indices_.data()); } - WT *edge_data(void) { return static_cast(edge_data_.data()); } + vertex_t number_of_vertices(void) { return number_of_vertices_p; } + edge_t number_of_edges(void) { return number_of_edges_p; } + edge_t *offsets(void) { return static_cast(offsets_p.data()); } + vertex_t *indices(void) { return static_cast(indices_p.data()); } + weight_t *edge_data(void) { return static_cast(edge_data_p.data()); } - GraphSparseContents release() noexcept + GraphSparseContents release() noexcept { - VT number_of_vertices = number_of_vertices_; - ET number_of_edges = number_of_edges_; - number_of_vertices_ = 0; - number_of_edges_ = 0; - return GraphSparseContents{ + vertex_t number_of_vertices = number_of_vertices_p; + edge_t number_of_edges = number_of_edges_p; + number_of_vertices_p = 0; + number_of_edges_p = 0; + return GraphSparseContents{ number_of_vertices, number_of_edges, - std::make_unique(std::move(offsets_)), - std::make_unique(std::move(indices_)), - std::make_unique(std::move(edge_data_))}; + std::make_unique(std::move(offsets_p)), + std::make_unique(std::move(indices_p)), + std::make_unique(std::move(edge_data_p))}; } - bool has_data(void) { return nullptr != edge_data_.data(); } + bool has_data(void) { return nullptr != edge_data_p.data(); } }; /** - * @brief A constructed graph stored in CSR (Compressed Sparse Row) format. + * @brief A constructed graph stored in CSR (Compressed Sparse Row) + * format. * - * @tparam VT Type of vertex id - * @tparam ET Type of edge id - * @tparam WT Type of weight + * @tparam vertex_t Type of vertex id + * @tparam edge_t Type of edge id + * @tparam weight_t Type of weight */ -template -class GraphCSR : public GraphCompressedSparseBase { +template +class GraphCSR : public GraphCompressedSparseBase { public: /** * @brief Default constructor */ - GraphCSR() : GraphCompressedSparseBase() {} + GraphCSR() : GraphCompressedSparseBase() {} /** * @brief Take ownership of the provided graph arrays in CSR format * - * @param offsets This array of size V+1 (V is number of vertices) contains the - * offset of adjacency lists of every vertex. Offsets must be in the range [0, E] (number of - * edges). - * @param indices This array of size E contains the index of the destination for - * each edge. Indices must be in the range [0, V-1]. - * @param edge_data This array of size E (number of edges) contains the weight for - * each edge. This array can be null in which case the graph is considered unweighted. 
 * @param number_of_vertices The number of vertices in the graph * @param number_of_edges The number of edges in the graph + * @param has_data Whether or not the class has data, default = false + * @param stream Specify the cudaStream, default = null + * @param mr Specify the memory resource */ - GraphCSR(VT number_of_vertices_, - ET number_of_edges_, + GraphCSR(vertex_t number_of_vertices_, + edge_t number_of_edges_, bool has_data_ = false, cudaStream_t stream = nullptr, rmm::mr::device_memory_resource *mr = rmm::mr::get_default_resource()) - : GraphCompressedSparseBase( + : GraphCompressedSparseBase( number_of_vertices_, number_of_edges_, has_data_, stream, mr) { } - GraphCSR(GraphSparseContents &&contents) - : GraphCompressedSparseBase(std::move(contents)) + GraphCSR(GraphSparseContents &&contents) + : GraphCompressedSparseBase(std::move(contents)) { } - GraphCSRView view(void) noexcept + GraphCSRView view(void) noexcept { - return GraphCSRView(GraphCompressedSparseBase::offsets(), - GraphCompressedSparseBase::indices(), - GraphCompressedSparseBase::edge_data(), - GraphCompressedSparseBase::number_of_vertices(), - GraphCompressedSparseBase::number_of_edges()); + return GraphCSRView( + GraphCompressedSparseBase::offsets(), + GraphCompressedSparseBase::indices(), + GraphCompressedSparseBase::edge_data(), + GraphCompressedSparseBase::number_of_vertices(), + GraphCompressedSparseBase::number_of_edges()); } }; /** - * @brief A constructed graph stored in CSC (Compressed Sparse Column) format. + * @brief A constructed graph stored in CSC (Compressed Sparse Column) + * format. * - * @tparam VT Type of vertex id - * @tparam ET Type of edge id - * @tparam WT Type of weight + * @tparam vertex_t Type of vertex id + * @tparam edge_t Type of edge id + * @tparam weight_t Type of weight */ -template -class GraphCSC : public GraphCompressedSparseBase { +template +class GraphCSC : public GraphCompressedSparseBase { public: /** * @brief Default constructor */ - GraphCSC() : GraphCompressedSparseBase() {} + GraphCSC() : GraphCompressedSparseBase() {} /** * @brief Take ownership of the provided graph arrays in CSR format * - * @param offsets This array of size V+1 (V is number of vertices) contains the - * offset of adjacency lists of every vertex. Offsets must be in the range [0, E] (number of + * @param offsets This array of size V+1 (V is number of + * vertices) contains the + * offset of adjacency lists of every vertex. Offsets must be in the range [0, + * E] (number of * edges). - * @param indices This array of size E contains the index of the destination for + * @param indices This array of size E contains the index of + * the destination for * each edge. Indices must be in the range [0, V-1]. - * @param edge_data This array of size E (number of edges) contains the weight for - * each edge. This array can be null in which case the graph is considered unweighted. 
 * @param number_of_vertices The number of vertices in the graph * @param number_of_edges The number of edges in the graph + * @param has_data Whether or not the class has data, default = false + * @param stream Specify the cudaStream, default = null + * @param mr Specify the memory resource */ - GraphCSC(VT number_of_vertices_, - ET number_of_edges_, - bool has_data_ = false, + GraphCSC(vertex_t number_of_vertices_in, + edge_t number_of_edges_in, + bool has_data_in = false, cudaStream_t stream = nullptr, rmm::mr::device_memory_resource *mr = rmm::mr::get_default_resource()) - : GraphCompressedSparseBase( - number_of_vertices_, number_of_edges_, has_data_, stream, mr) + : GraphCompressedSparseBase( + number_of_vertices_in, number_of_edges_in, has_data_in, stream, mr) { } - GraphCSC(GraphSparseContents &&contents) - : GraphCompressedSparseBase(contents) + GraphCSC(GraphSparseContents &&contents) + : GraphCompressedSparseBase(contents) { } - GraphCSCView view(void) noexcept + GraphCSCView view(void) noexcept { - return GraphCSCView(GraphCompressedSparseBase::offsets(), - GraphCompressedSparseBase::indices(), - GraphCompressedSparseBase::edge_data(), - GraphCompressedSparseBase::number_of_vertices(), - GraphCompressedSparseBase::number_of_edges()); + return GraphCSCView( + GraphCompressedSparseBase::offsets(), + GraphCompressedSparseBase::indices(), + GraphCompressedSparseBase::edge_data(), + GraphCompressedSparseBase::number_of_vertices(), + GraphCompressedSparseBase::number_of_edges()); } }; -} // namespace experimental +template +struct invalid_idx; + +template +struct invalid_idx< + T, + typename std::enable_if_t::value && std::is_signed::value>> + : std::integral_constant { +}; + +template +struct invalid_idx< + T, + typename std::enable_if_t::value && std::is_unsigned::value>> + : std::integral_constant::max()> { +}; + +template +struct invalid_vertex_id : invalid_idx { +}; + +template +struct invalid_edge_id : invalid_idx { +}; } // namespace cugraph diff --git a/cpp/include/utilities/error.hpp b/cpp/include/utilities/error.hpp new file mode 100644 index 00000000000..e44e2c910ea --- /dev/null +++ b/cpp/include/utilities/error.hpp @@ -0,0 +1,65 @@ +/* + * Copyright (c) 2019-2020, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +#pragma once + +#include + +namespace cugraph { + +/** + * @brief Exception thrown when a logical precondition is violated. + * + * This exception should not be thrown directly and is instead thrown by the + * CUGRAPH_EXPECTS and CUGRAPH_FAIL macros. 
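// Usage sketch for the two macros defined below (illustrative, not part of the
// patch); the format string plus optional arguments follow raft's SET_ERROR_MSG
// conventions, and the invalid_vertex_id sentinel comes from the traits just
// added at the end of graph.hpp above.
//
//   CUGRAPH_EXPECTS(offsets != nullptr, "Invalid input argument: offsets is null");
//   CUGRAPH_EXPECTS(v != cugraph::invalid_vertex_id<int>::value, "Invalid vertex id");
//   if (unsupported_case) { CUGRAPH_FAIL("Unsupported graph configuration"); }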
+ * + */ +struct logic_error : public raft::exception { + explicit logic_error(char const* const message) : raft::exception(message) {} + explicit logic_error(std::string const& message) : raft::exception(message) {} +}; + +} // namespace cugraph + +/** + * @brief Macro for checking (pre-)conditions that throws an exception when a condition is false + * + * @param[in] cond Expression that evaluates to true or false + * @param[in] fmt String literal description of the reason that cond is expected to be true with + * optional format tags + * @throw cugraph::logic_error if the condition evaluates to false. + */ +#define CUGRAPH_EXPECTS(cond, fmt, ...) \ + do { \ + if (!(cond)) { \ + std::string msg{}; \ + SET_ERROR_MSG(msg, "cuGraph failure at ", fmt, ##__VA_ARGS__); \ + throw cugraph::logic_error(msg); \ + } \ + } while (0) + +/** + * @brief Indicates that an erroneous code path has been taken. + * + * @param[in] fmt String literal description of the reason that this code path is erroneous with + * optional format tags + * @throw always throws cugraph::logic_error + */ +#define CUGRAPH_FAIL(fmt, ...) \ + do { \ + std::string msg{}; \ + SET_ERROR_MSG(msg, "cuGraph failure at ", fmt, ##__VA_ARGS__); \ + throw cugraph::logic_error(msg); \ + } while (0) diff --git a/cpp/src/centrality/betweenness_centrality.cu b/cpp/src/centrality/betweenness_centrality.cu index 5948c6f9ec9..8ff62f7ddb6 100644 --- a/cpp/src/centrality/betweenness_centrality.cu +++ b/cpp/src/centrality/betweenness_centrality.cu @@ -18,148 +18,207 @@ #include +#include + #include #include +#include +#include -#include - +#include #include "betweenness_centrality.cuh" +#include "betweenness_centrality_kernels.cuh" namespace cugraph { namespace detail { +namespace { +template +void betweenness_centrality_impl(raft::handle_t const &handle, + GraphCSRView const &graph, + result_t *result, + bool normalize, + bool endpoints, + weight_t const *weight, + vertex_t number_of_sources, + vertex_t const *sources, + vertex_t total_number_of_sources) +{ + // Current Implementation relies on BFS + // FIXME: For SSSP version + // Brandes Algorithm expects non negative weights for the accumulation + bool is_edge_betweenness = false; + verify_betweenness_centrality_input( + result, is_edge_betweenness, normalize, endpoints, weight, number_of_sources, sources); + cugraph::detail::BC bc(handle, graph); + bc.configure( + result, is_edge_betweenness, normalize, endpoints, weight, sources, number_of_sources); + bc.compute(); + bc.rescale_by_total_sources_used(total_number_of_sources); +} -template -void BC::setup() +template +void edge_betweenness_centrality_impl(raft::handle_t const &handle, + GraphCSRView const &graph, + result_t *result, + bool normalize, + weight_t const *weight, + vertex_t number_of_sources, + vertex_t const *sources, + vertex_t total_number_of_sources) { - // --- Set up parameters from graph adjList --- - number_of_vertices = graph.number_of_vertices; - number_of_edges = graph.number_of_edges; - offsets_ptr = graph.offsets; - indices_ptr = graph.indices; + // Current Implementation relies on BFS + // FIXME: For SSSP version + // Brandes Algorithm expects non negative weights for the accumulation + bool is_edge_betweenness = true; + bool endpoints = false; + verify_betweenness_centrality_input( + result, is_edge_betweenness, normalize, endpoints, weight, number_of_sources, sources); + cugraph::detail::BC bc(handle, graph); + bc.configure( + result, is_edge_betweenness, normalize, endpoints, weight, sources, number_of_sources); + 
bc.compute(); + // NOTE: As of 07/2020 NetworkX does not apply rescaling based on number + // of sources + // bc.rescale_by_total_sources_used(total_number_of_sources); } +template +vertex_t get_total_number_of_sources(raft::handle_t const &handle, vertex_t local_number_of_sources) +{ + vertex_t total_number_of_sources_used = local_number_of_sources; + if (handle.comms_initialized()) { + rmm::device_scalar d_number_of_sources(local_number_of_sources, handle.get_stream()); + handle.get_comms().allreduce(d_number_of_sources.data(), + d_number_of_sources.data(), + 1, + raft::comms::op_t::SUM, + handle.get_stream()); + total_number_of_sources_used = d_number_of_sources.value(handle.get_stream()); + // CUDA_TRY( + // cudaMemcpy(&total_number_of_sources_used, data, sizeof(vertex_t), cudaMemcpyDeviceToHost)); + } + return total_number_of_sources_used; +} +} // namespace -template -void BC::configure(result_t *_betweenness, - bool _normalized, - bool _endpoints, - WT const *_weights, - VT const *_sources, - VT _number_of_sources) +template +void verify_betweenness_centrality_input(result_t *result, + bool is_edge_betweenness, + bool normalize, + bool endpoints, + weight_t const *weights, + vertex_t const number_of_sources, + vertex_t const *sources) +{ + static_assert(std::is_same::value, "vertex_t should be int"); + static_assert(std::is_same::value, "edge_t should be int"); + static_assert(std::is_same::value || std::is_same::value, + "weight_t should be float or double"); + static_assert(std::is_same::value || std::is_same::value, + "result_t should be float or double"); + + CUGRAPH_EXPECTS(result != nullptr, "Invalid API parameter: betweenness pointer is NULL"); + CUGRAPH_EXPECTS(number_of_sources >= 0, "Number of sources must be non-negative."); + if (number_of_sources != 0) { + CUGRAPH_EXPECTS(sources != nullptr, + "Sources cannot be NULL if number_of_sources is different from 0."); + } + if (is_edge_betweenness) { + CUGRAPH_EXPECTS(!endpoints, "Endpoints is not supported for edge betweenness centrality."); + } +} + +template +void BC::setup() +{ + number_of_vertices_ = graph_.number_of_vertices; + number_of_edges_ = graph_.number_of_edges; + offsets_ptr_ = graph_.offsets; + indices_ptr_ = graph_.indices; +} + +template +void BC::configure(result_t *betweenness, + bool is_edge_betweenness, + bool normalized, + bool endpoints, + weight_t const *weights, + vertex_t const *sources, + vertex_t number_of_sources) { // --- Bind betweenness output vector to internal --- - betweenness = _betweenness; - normalized = _normalized; - endpoints = _endpoints; - sources = _sources; - number_of_sources = _number_of_sources; - edge_weights_ptr = _weights; + betweenness_ = betweenness; + normalized_ = normalized; + endpoints_ = endpoints; + sources_ = sources; + number_of_sources_ = number_of_sources; + edge_weights_ptr_ = weights; + is_edge_betweenness_ = is_edge_betweenness; // --- Working data allocation --- - distances_vec.resize(number_of_vertices); - predecessors_vec.resize(number_of_vertices); - sp_counters_vec.resize(number_of_vertices); - deltas_vec.resize(number_of_vertices); - - distances = distances_vec.data().get(); - predecessors = predecessors_vec.data().get(); - sp_counters = sp_counters_vec.data().get(); - deltas = deltas_vec.data().get(); + initialize_work_vectors(); + initialize_pointers_to_vectors(); // --- Get Device Information --- - CUDA_TRY(cudaGetDevice(&device_id)); - CUDA_TRY(cudaDeviceGetAttribute(&max_grid_dim_1D, cudaDevAttrMaxGridDimX, device_id)); - 
CUDA_TRY(cudaDeviceGetAttribute(&max_block_dim_1D, cudaDevAttrMaxBlockDimX, device_id)); + initialize_device_information(); // --- Confirm that configuration went through --- - configured = true; + configured_ = true; } -// Dependecy Accumulation: McLaughlin and Bader, 2018 -// NOTE: Accumulation kernel might not scale well, as each thread is handling -// all the edges for each node, an approach similar to the traversal -// bucket (i.e. BFS / SSSP) system might enable speed up -// NOTE: Shortest Path counter can increase extremely fast, thus double are used -// however, the user might want to get the result back in float -// we delay casting the result until dependecy accumulation -template -__global__ void accumulation_kernel(result_t *betweenness, - VT number_vertices, - VT const *indices, - ET const *offsets, - VT *distances, - double *sp_counters, - double *deltas, - VT source, - VT depth) +template +void BC::initialize_work_vectors() { - for (int tid = blockIdx.x * blockDim.x + threadIdx.x; tid < number_vertices; - tid += gridDim.x * blockDim.x) { - VT w = tid; - double dsw = 0; - double sw = sp_counters[w]; - if (distances[w] == depth) { // Process nodes at this depth - ET edge_start = offsets[w]; - ET edge_end = offsets[w + 1]; - ET edge_count = edge_end - edge_start; - for (ET edge_idx = 0; edge_idx < edge_count; ++edge_idx) { // Visit neighbors - VT v = indices[edge_start + edge_idx]; - if (distances[v] == distances[w] + 1) { - double factor = (static_cast(1) + deltas[v]) / sp_counters[v]; - dsw += sw * factor; - } - } - deltas[w] = dsw; - } - } + distances_vec_.resize(number_of_vertices_); + predecessors_vec_.resize(number_of_vertices_); + sp_counters_vec_.resize(number_of_vertices_); + deltas_vec_.resize(number_of_vertices_); } -template -void BC::accumulate(result_t *betweenness, - VT *distances, - double *sp_counters, - double *deltas, - VT source, - VT max_depth) +template +void BC::initialize_pointers_to_vectors() { - dim3 grid, block; - block.x = max_block_dim_1D; - grid.x = min(max_grid_dim_1D, (number_of_edges / block.x + 1)); - // Step 1) Dependencies (deltas) are initialized to 0 before starting - thrust::fill(rmm::exec_policy(stream)->on(stream), - deltas, - deltas + number_of_vertices, - static_cast(0)); - // Step 2) Process each node, -1 is used to notify unreached nodes in the sssp - for (VT depth = max_depth; depth > 0; --depth) { - accumulation_kernel<<>>(betweenness, - number_of_vertices, - graph.indices, - graph.offsets, - distances, - sp_counters, - deltas, - source, - depth); - } + distances_ = distances_vec_.data().get(); + predecessors_ = predecessors_vec_.data().get(); + sp_counters_ = sp_counters_vec_.data().get(); + deltas_ = deltas_vec_.data().get(); +} - thrust::transform(rmm::exec_policy(stream)->on(stream), - deltas, - deltas + number_of_vertices, - betweenness, - betweenness, - thrust::plus()); +template +void BC::initialize_device_information() +{ + max_grid_dim_1D_ = handle_.get_device_properties().maxGridSize[0]; + max_block_dim_1D_ = handle_.get_device_properties().maxThreadsDim[0]; } -// We do not verifiy the graph structure as the new graph structure -// enforces CSR Format +template +void BC::compute() +{ + CUGRAPH_EXPECTS(configured_, "BC must be configured before computation"); + if (sources_) { + for (vertex_t source_idx = 0; source_idx < number_of_sources_; ++source_idx) { + vertex_t source_vertex = sources_[source_idx]; + compute_single_source(source_vertex); + } + } else { + for (vertex_t source_vertex = 0; source_vertex < 
number_of_vertices_; ++source_vertex) { + compute_single_source(source_vertex); + } + } + rescale(); +} -// FIXME: Having a system that relies on an class might make it harder to -// dispatch later -template -void BC::compute_single_source(VT source_vertex) +template +void BC::compute_single_source(vertex_t source_vertex) { // Step 1) Singe-source shortest-path problem - cugraph::bfs(graph, distances, predecessors, sp_counters, source_vertex, graph.prop.directed); + cugraph::bfs(handle_, + graph_, + distances_, + predecessors_, + sp_counters_, + source_vertex, + graph_.prop.directed, + true); // FIXME: Remove that with a BC specific class to gather // information during traversal @@ -168,166 +227,335 @@ void BC::compute_single_source(VT source_vertex) // the traversal, this value is avalaible within the bfs implementation and // there could be a way to access it directly and avoid both replace and the // max - thrust::replace(rmm::exec_policy(stream)->on(stream), - distances, - distances + number_of_vertices, - std::numeric_limits::max(), - static_cast(-1)); - auto current_max_depth = thrust::max_element( - rmm::exec_policy(stream)->on(stream), distances, distances + number_of_vertices); - VT max_depth = 0; - cudaMemcpy(&max_depth, current_max_depth, sizeof(VT), cudaMemcpyDeviceToHost); + thrust::replace(rmm::exec_policy(handle_.get_stream())->on(handle_.get_stream()), + distances_, + distances_ + number_of_vertices_, + std::numeric_limits::max(), + static_cast(-1)); + auto current_max_depth = + thrust::max_element(rmm::exec_policy(handle_.get_stream())->on(handle_.get_stream()), + distances_, + distances_ + number_of_vertices_); + vertex_t max_depth = 0; + CUDA_TRY(cudaMemcpy(&max_depth, current_max_depth, sizeof(vertex_t), cudaMemcpyDeviceToHost)); // Step 2) Dependency accumulation - accumulate(betweenness, distances, sp_counters, deltas, source_vertex, max_depth); + accumulate(source_vertex, max_depth); +} + +template +void BC::accumulate(vertex_t source_vertex, + vertex_t max_depth) +{ + dim3 grid_configuration, block_configuration; + block_configuration.x = max_block_dim_1D_; + grid_configuration.x = min(max_grid_dim_1D_, (number_of_edges_ / block_configuration.x + 1)); + + initialize_dependencies(); + + if (is_edge_betweenness_) { + accumulate_edges(max_depth, grid_configuration, block_configuration); + } else if (endpoints_) { + accumulate_vertices_with_endpoints( + source_vertex, max_depth, grid_configuration, block_configuration); + } else { + accumulate_vertices(max_depth, grid_configuration, block_configuration); + } } -template -void BC::compute() +template +void BC::initialize_dependencies() { - CUGRAPH_EXPECTS(configured, "BC must be configured before computation"); - // If sources is defined we only process vertices contained in it - thrust::fill(rmm::exec_policy(stream)->on(stream), - betweenness, - betweenness + number_of_vertices, + thrust::fill(rmm::exec_policy(handle_.get_stream())->on(handle_.get_stream()), + deltas_, + deltas_ + number_of_vertices_, static_cast(0)); - cudaStreamSynchronize(stream); - if (sources) { - for (VT source_idx = 0; source_idx < number_of_sources; ++source_idx) { - VT source_vertex = sources[source_idx]; - compute_single_source(source_vertex); - } - } else { // Otherwise process every vertices - // NOTE: Maybe we could still use number of sources and set it to number_of_vertices? 
- // It woudl imply having a host vector of size |V| - // But no need for the if/ else statement - for (VT source_vertex = 0; source_vertex < number_of_vertices; ++source_vertex) { - compute_single_source(source_vertex); - } +} +template +void BC::accumulate_edges(vertex_t max_depth, + dim3 grid_configuration, + dim3 block_configuration) +{ + for (vertex_t depth = max_depth; depth >= 0; --depth) { + edges_accumulation_kernel + <<>>(betweenness_, + number_of_vertices_, + graph_.indices, + graph_.offsets, + distances_, + sp_counters_, + deltas_, + depth); } - rescale(); } -template -void BC::rescale() +template +void BC::accumulate_vertices_with_endpoints( + vertex_t source_vertex, vertex_t max_depth, dim3 grid_configuration, dim3 block_configuration) { - thrust::device_vector normalizer(number_of_vertices); - bool modified = false; - result_t rescale_factor = static_cast(1); - result_t casted_number_of_vertices = static_cast(number_of_vertices); - result_t casted_number_of_sources = static_cast(number_of_sources); - if (normalized) { - if (number_of_vertices > 2) { - rescale_factor /= ((casted_number_of_vertices - 1) * (casted_number_of_vertices - 2)); - modified = true; + for (vertex_t depth = max_depth; depth > 0; --depth) { + endpoints_accumulation_kernel + <<>>(betweenness_, + number_of_vertices_, + graph_.indices, + graph_.offsets, + distances_, + sp_counters_, + deltas_, + depth); + } + add_reached_endpoints_to_source_betweenness(source_vertex); + add_vertices_dependencies_to_betweenness(); +} + +// Distances should contain -1 for unreached nodes, + +// FIXME: There might be a cleaner way to add a value to a single +// score in the betweenness vector +template +void BC::add_reached_endpoints_to_source_betweenness( + vertex_t source_vertex) +{ + vertex_t number_of_unvisited_vertices = + thrust::count(rmm::exec_policy(handle_.get_stream())->on(handle_.get_stream()), + distances_, + distances_ + number_of_vertices_, + -1); + vertex_t number_of_visited_vertices_except_source = + number_of_vertices_ - number_of_unvisited_vertices - 1; + rmm::device_vector buffer(1); + buffer[0] = number_of_visited_vertices_except_source; + thrust::transform(rmm::exec_policy(handle_.get_stream())->on(handle_.get_stream()), + buffer.begin(), + buffer.end(), + betweenness_ + source_vertex, + betweenness_ + source_vertex, + thrust::plus()); +} + +template +void BC::add_vertices_dependencies_to_betweenness() +{ + thrust::transform(rmm::exec_policy(handle_.get_stream())->on(handle_.get_stream()), + deltas_, + deltas_ + number_of_vertices_, + betweenness_, + betweenness_, + thrust::plus()); +} + +template +void BC::accumulate_vertices(vertex_t max_depth, + dim3 grid_configuration, + dim3 block_configuration) +{ + for (vertex_t depth = max_depth; depth > 0; --depth) { + accumulation_kernel + <<>>(betweenness_, + number_of_vertices_, + graph_.indices, + graph_.offsets, + distances_, + sp_counters_, + deltas_, + depth); + } + add_vertices_dependencies_to_betweenness(); +} + +template +void BC::rescale() +{ + bool modified = false; + result_t rescale_factor = static_cast(1); + if (normalized_) { + if (is_edge_betweenness_) { + std::tie(rescale_factor, modified) = + rescale_edges_betweenness_centrality(rescale_factor, modified); + } else { + std::tie(rescale_factor, modified) = + rescale_vertices_betweenness_centrality(rescale_factor, modified); } } else { - if (!graph.prop.directed) { + if (!graph_.prop.directed) { rescale_factor /= static_cast(2); modified = true; } } - if (modified) { - if (number_of_sources > 0) { 
- rescale_factor *= (casted_number_of_vertices / casted_number_of_sources); + apply_rescale_factor_to_betweenness(rescale_factor); +} + +template +std::tuple +BC::rescale_edges_betweenness_centrality( + result_t rescale_factor, bool modified) +{ + result_t casted_number_of_vertices_ = static_cast(number_of_vertices_); + if (number_of_vertices_ > 1) { + rescale_factor /= ((casted_number_of_vertices_) * (casted_number_of_vertices_ - 1)); + modified = true; + } + return std::make_tuple(rescale_factor, modified); +} + +template +std::tuple +BC::rescale_vertices_betweenness_centrality( + result_t rescale_factor, bool modified) +{ + result_t casted_number_of_vertices = static_cast(number_of_vertices_); + if (number_of_vertices_ > 2) { + if (endpoints_) { + rescale_factor /= (casted_number_of_vertices * (casted_number_of_vertices - 1)); + } else { + rescale_factor /= ((casted_number_of_vertices - 1) * (casted_number_of_vertices - 2)); } + modified = true; } - thrust::fill(normalizer.begin(), normalizer.end(), rescale_factor); - thrust::transform(rmm::exec_policy(stream)->on(stream), - betweenness, - betweenness + number_of_vertices, - normalizer.begin(), - betweenness, + return std::make_tuple(rescale_factor, modified); +} + +template +void BC::apply_rescale_factor_to_betweenness( + result_t rescale_factor) +{ + size_t result_size = number_of_vertices_; + if (is_edge_betweenness_) result_size = number_of_edges_; + thrust::transform(rmm::exec_policy(handle_.get_stream())->on(handle_.get_stream()), + betweenness_, + betweenness_ + result_size, + thrust::make_constant_iterator(rescale_factor), + betweenness_, thrust::multiplies()); } -template -void verify_input(result_t *result, - bool normalize, - bool endpoints, - WT const *weights, - VT const number_of_sources, - VT const *sources) +template +void BC::rescale_by_total_sources_used( + vertex_t total_number_of_sources_used) { - CUGRAPH_EXPECTS(result != nullptr, "Invalid API parameter: output betwenness is nullptr"); - if (typeid(VT) != typeid(int)) { - CUGRAPH_FAIL("Unsupported vertex id data type, please use int"); - } - if (typeid(ET) != typeid(int)) { CUGRAPH_FAIL("Unsupported edge id data type, please use int"); } - if (typeid(WT) != typeid(float) && typeid(WT) != typeid(double)) { - CUGRAPH_FAIL("Unsupported weight data type, please use float or double"); - } - if (typeid(result_t) != typeid(float) && typeid(result_t) != typeid(double)) { - CUGRAPH_FAIL("Unsupported result data type, please use float or double"); - } - if (number_of_sources < 0) { - CUGRAPH_FAIL("Number of sources must be positive or equal to 0."); - } else if (number_of_sources != 0) { - CUGRAPH_EXPECTS(sources != nullptr, - "sources cannot be null if number_of_source is different from 0."); + result_t rescale_factor = static_cast(1); + result_t casted_total_number_of_sources_used = + static_cast(total_number_of_sources_used); + result_t casted_number_of_vertices = static_cast(number_of_vertices_); + + if (normalized_) { + if (number_of_vertices_ > 2 && total_number_of_sources_used > 0) { + rescale_factor *= (casted_number_of_vertices / casted_total_number_of_sources_used); + } + } else if (!graph_.prop.directed) { + if (number_of_vertices_ > 2 && total_number_of_sources_used > 0) { + rescale_factor *= (casted_number_of_vertices / casted_total_number_of_sources_used); + } } - if (endpoints) { CUGRAPH_FAIL("Endpoints option is currently not supported."); } + apply_rescale_factor_to_betweenness(rescale_factor); } -/** - * 
---------------------------------------------------------------------------* - * @brief Native betweenness centrality - * - * @file betweenness_centrality.cu - * --------------------------------------------------------------------------*/ -template -void betweenness_centrality(experimental::GraphCSRView const &graph, +} // namespace detail + +template +void betweenness_centrality(raft::handle_t const &handle, + GraphCSRView const &graph, result_t *result, bool normalize, bool endpoints, - WT const *weight, - VT const number_of_sources, - VT const *sources) + weight_t const *weight, + vertex_t k, + vertex_t const *vertices) { - // Current Implementation relies on BFS - // FIXME: For SSSP version - // Brandes Algorithm expects non negative weights for the accumulation - verify_input( - result, normalize, endpoints, weight, number_of_sources, sources); - cugraph::detail::BC bc(graph); - bc.configure(result, normalize, endpoints, weight, sources, number_of_sources); - bc.compute(); + vertex_t total_number_of_sources_used = detail::get_total_number_of_sources(handle, k); + if (handle.comms_initialized()) { + rmm::device_vector betweenness(graph.number_of_vertices, 0); + detail::betweenness_centrality_impl(handle, + graph, + betweenness.data().get(), + normalize, + endpoints, + weight, + k, + vertices, + total_number_of_sources_used); + handle.get_comms().reduce(betweenness.data().get(), + result, + betweenness.size(), + raft::comms::op_t::SUM, + 0, + handle.get_stream()); + } else { + detail::betweenness_centrality_impl(handle, + graph, + result, + normalize, + endpoints, + weight, + k, + vertices, + total_number_of_sources_used); + } } -} // namespace detail -/** - * @param[out] result array(number_of_vertices) - * @param[in] normalize bool True -> Apply normalization - * @param[in] endpoints (NIY) bool Include endpoints - * @param[in] weights (NIY) array(number_of_edges) Weights to use - * @param[in] k Number of sources - * @param[in] vertices array(k) Sources for traversal - */ -template -void betweenness_centrality(experimental::GraphCSRView const &graph, - result_t *result, - bool normalize, - bool endpoints, - WT const *weight, - VT k, - VT const *vertices) +template void betweenness_centrality(const raft::handle_t &, + GraphCSRView const &, + float *, + bool, + bool, + float const *, + int, + int const *); +template void betweenness_centrality( + const raft::handle_t &, + GraphCSRView const &, + double *, + bool, + bool, + double const *, + int, + int const *); + +template +void edge_betweenness_centrality(raft::handle_t const &handle, + GraphCSRView const &graph, + result_t *result, + bool normalize, + weight_t const *weight, + vertex_t k, + vertex_t const *vertices) { - detail::betweenness_centrality(graph, result, normalize, endpoints, weight, k, vertices); + vertex_t total_number_of_sources_used = detail::get_total_number_of_sources(handle, k); + if (handle.comms_initialized()) { + rmm::device_vector betweenness(graph.number_of_edges, 0); + detail::edge_betweenness_centrality_impl(handle, + graph, + betweenness.data().get(), + normalize, + weight, + k, + vertices, + total_number_of_sources_used); + handle.get_comms().reduce(betweenness.data().get(), + result, + betweenness.size(), + raft::comms::op_t::SUM, + 0, + handle.get_stream()); + } else { + detail::edge_betweenness_centrality_impl( + handle, graph, result, normalize, weight, k, vertices, total_number_of_sources_used); + } } -template void betweenness_centrality( - experimental::GraphCSRView const &, +template void 
edge_betweenness_centrality( + const raft::handle_t &, + GraphCSRView const &, float *, bool, - bool, float const *, int, int const *); -template void betweenness_centrality( - experimental::GraphCSRView const &, + +template void edge_betweenness_centrality( + raft::handle_t const &handle, + GraphCSRView const &, double *, bool, - bool, double const *, int, int const *); - } // namespace cugraph diff --git a/cpp/src/centrality/betweenness_centrality.cuh b/cpp/src/centrality/betweenness_centrality.cuh index d4f448618e2..418ac06faa4 100644 --- a/cpp/src/centrality/betweenness_centrality.cuh +++ b/cpp/src/centrality/betweenness_centrality.cuh @@ -15,79 +15,134 @@ */ // Author: Xavier Cadet xcadet@nvidia.com + #pragma once #include namespace cugraph { namespace detail { -template +template +void betweenness_centrality(raft::handle_t const &handle, + GraphCSRView const &graph, + result_t *result, + bool normalize, + bool endpoints, + weight_t const *weight, + vertex_t const number_of_sources, + vertex_t const *sources); + +template +void edge_betweenness_centrality(GraphCSRView const &graph, + result_t *result, + bool normalize, + weight_t const *weight, + vertex_t const number_of_sources, + vertex_t const *sources); + +template +void verify_betweenness_centrality_input(result_t *result, + bool is_edge_betweenness, + bool normalize, + bool endpoints, + weight_t const *weights, + vertex_t const number_of_sources, + vertex_t const *sources); + +template class BC { + public: + virtual ~BC(void) {} + BC(raft::handle_t const &handle, + GraphCSRView const &graph, + cudaStream_t stream = 0) + : handle_(handle), graph_(graph) + { + setup(); + } + void configure(result_t *betweenness, + bool is_edge_betweenness, + bool normalize, + bool endpoints, + weight_t const *weight, + vertex_t const *sources, + vertex_t const number_of_sources); + + void configure_edge(result_t *betweenness, + bool normalize, + weight_t const *weight, + vertex_t const *sources, + vertex_t const number_of_sources); + void compute(); + void rescale_by_total_sources_used(vertex_t total_number_of_sources_used); + private: + // --- RAFT handle --- + raft::handle_t const &handle_; // --- Information concerning the graph --- - const experimental::GraphCSRView &graph; + const GraphCSRView &graph_; // --- These information are extracted on setup --- - VT number_of_vertices; // Number of vertices in the graph - VT number_of_edges; // Number of edges in the graph - ET const *offsets_ptr; // Pointer to the offsets - VT const *indices_ptr; // Pointers to the indices + vertex_t number_of_vertices_; // Number of vertices in the graph + vertex_t number_of_edges_; // Number of edges in the graph + edge_t const *offsets_ptr_; // Pointer to the offsets + vertex_t const *indices_ptr_; // Pointers to the indices // --- Information from configuration --- - bool configured = false; // Flag to ensure configuration was called - bool normalized = false; // If True normalize the betweenness + bool configured_ = false; // Flag to ensure configuration was called + bool normalized_ = false; // If True normalize the betweenness + bool is_edge_betweenness_ = false; // If True compute edge_betweeness + // FIXME: For weighted version - WT const *edge_weights_ptr = nullptr; // Pointer to the weights - bool endpoints = false; // If True normalize the betweenness - VT const *sources = nullptr; // Subset of vertices to gather information from - VT number_of_sources; // Number of vertices in sources + weight_t const *edge_weights_ptr_ = nullptr; // Pointer to the 
weights + bool endpoints_ = false; // If True normalize the betweenness + vertex_t const *sources_ = nullptr; // Subset of vertices to gather information from + vertex_t number_of_sources_; // Number of vertices in sources // --- Output ---- // betweenness is set/read by users - using Vectors - result_t *betweenness = nullptr; + result_t *betweenness_ = nullptr; // --- Data required to perform computation ---- - rmm::device_vector distances_vec; - rmm::device_vector predecessors_vec; - rmm::device_vector sp_counters_vec; - rmm::device_vector deltas_vec; - - VT *distances = nullptr; // array(|V|) stores the distances gathered by the latest SSSP - VT *predecessors = nullptr; // array(|V|) stores the predecessors of the latest SSSP - double *sp_counters = - nullptr; // array(|V|) stores the shortest path counter for the latest SSSP - double *deltas = nullptr; // array(|V|) stores the dependencies for the latest SSSP - - // FIXME: This should be replaced using RAFT handle - int device_id = 0; - int max_grid_dim_1D = 0; - int max_block_dim_1D = 0; - cudaStream_t stream; - - // ----------------------------------------------------------------------- - void setup(); // Saves information related to the graph itself - - void accumulate(result_t *betweenness, - VT *distances, - double *sp_counters, - double *deltas, - VT source, - VT max_depth); - void compute_single_source(VT source_vertex); - void rescale(); + rmm::device_vector distances_vec_; + rmm::device_vector predecessors_vec_; + rmm::device_vector sp_counters_vec_; + rmm::device_vector deltas_vec_; - public: - virtual ~BC(void) {} - BC(experimental::GraphCSRView const &_graph, cudaStream_t _stream = 0) - : graph(_graph), stream(_stream) - { - setup(); - } - void configure(result_t *betweenness, - bool normalize, - bool endpoints, - WT const *weigth, - VT const *sources, - VT const number_of_sources); - void compute(); + vertex_t *distances_ = + nullptr; // array(|V|) stores the distances gathered by the latest SSSP + vertex_t *predecessors_ = + nullptr; // array(|V|) stores the predecessors of the latest SSSP + double *sp_counters_ = + nullptr; // array(|V|) stores the shortest path counter for the latest SSSP + double *deltas_ = nullptr; // array(|V|) stores the dependencies for the latest SSSP + + int max_grid_dim_1D_ = 0; + int max_block_dim_1D_ = 0; + + void setup(); + + void initialize_work_vectors(); + void initialize_pointers_to_vectors(); + void initialize_device_information(); + + void compute_single_source(vertex_t source_vertex); + + void accumulate(vertex_t source_vertex, vertex_t max_depth); + void initialize_dependencies(); + void accumulate_edges(vertex_t max_depth, dim3 grid_configuration, dim3 block_configuration); + void accumulate_vertices_with_endpoints(vertex_t source_vertex, + vertex_t max_depth, + dim3 grid_configuration, + dim3 block_configuration); + void accumulate_vertices(vertex_t max_depth, dim3 grid_configuration, dim3 block_configuration); + void add_reached_endpoints_to_source_betweenness(vertex_t source_vertex); + void add_vertices_dependencies_to_betweenness(); + + void rescale(); + std::tuple rescale_vertices_betweenness_centrality(result_t rescale_factor, + bool modified); + std::tuple rescale_edges_betweenness_centrality(result_t rescale_factor, + bool modified); + void apply_rescale_factor_to_betweenness(result_t scaling_factor); }; } // namespace detail } // namespace cugraph diff --git a/cpp/src/centrality/betweenness_centrality_kernels.cuh b/cpp/src/centrality/betweenness_centrality_kernels.cuh new 
file mode 100644 index 00000000000..3cb5add8ad6
--- /dev/null
+++ b/cpp/src/centrality/betweenness_centrality_kernels.cuh
@@ -0,0 +1,120 @@
+/*
+ * Copyright (c) 2020, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+namespace cugraph {
+namespace detail {
+// Dependency Accumulation: based on McLaughlin and Bader, 2018
+// FIXME: Accumulation kernel might not scale well, as each thread is handling
+//        all the edges for each node, an approach similar to the traversal
+//        bucket (i.e. BFS / SSSP) system might enable speed up.
+//        Should look into forAllEdge type primitive for different
+//        load balancing
+template <typename vertex_t, typename edge_t, typename result_t>
+__global__ void edges_accumulation_kernel(result_t *betweenness,
+                                          vertex_t number_vertices,
+                                          vertex_t const *indices,
+                                          edge_t const *offsets,
+                                          vertex_t *distances,
+                                          double *sp_counters,
+                                          double *deltas,
+                                          vertex_t depth)
+{
+  for (int thread_idx = blockIdx.x * blockDim.x + threadIdx.x; thread_idx < number_vertices;
+       thread_idx += gridDim.x * blockDim.x) {
+    vertex_t vertex     = thread_idx;
+    double vertex_delta = 0;
+    double vertex_sigma = sp_counters[vertex];
+    if (distances[vertex] == depth) {
+      edge_t first_edge_idx = offsets[vertex];
+      edge_t last_edge_idx  = offsets[vertex + 1];
+      for (edge_t edge_idx = first_edge_idx; edge_idx < last_edge_idx; ++edge_idx) {
+        vertex_t successor = indices[edge_idx];
+        if (distances[successor] == distances[vertex] + 1) {
+          double factor = (static_cast<double>(1) + deltas[successor]) / sp_counters[successor];
+          double coefficient = vertex_sigma * factor;
+
+          vertex_delta += coefficient;
+          betweenness[edge_idx] += coefficient;
+        }
+      }
+      deltas[vertex] = vertex_delta;
+    }
+  }
+}
+
+template <typename vertex_t, typename edge_t, typename result_t>
+__global__ void endpoints_accumulation_kernel(result_t *betweenness,
+                                              vertex_t number_vertices,
+                                              vertex_t const *indices,
+                                              edge_t const *offsets,
+                                              vertex_t *distances,
+                                              double *sp_counters,
+                                              double *deltas,
+                                              vertex_t depth)
+{
+  for (int thread_idx = blockIdx.x * blockDim.x + threadIdx.x; thread_idx < number_vertices;
+       thread_idx += gridDim.x * blockDim.x) {
+    vertex_t vertex     = thread_idx;
+    double vertex_delta = 0;
+    double vertex_sigma = sp_counters[vertex];
+    if (distances[vertex] == depth) {
+      edge_t first_edge_idx = offsets[vertex];
+      edge_t last_edge_idx  = offsets[vertex + 1];
+      for (edge_t edge_idx = first_edge_idx; edge_idx < last_edge_idx; ++edge_idx) {
+        vertex_t successor = indices[edge_idx];
+        if (distances[successor] == distances[vertex] + 1) {
+          double factor = (static_cast<double>(1) + deltas[successor]) / sp_counters[successor];
+          vertex_delta += vertex_sigma * factor;
+        }
+      }
+      betweenness[vertex] += 1;
+      deltas[vertex] = vertex_delta;
+    }
+  }
+}
+template <typename vertex_t, typename edge_t, typename result_t>
+__global__ void accumulation_kernel(result_t *betweenness,
+                                    vertex_t number_vertices,
+                                    vertex_t const *indices,
+                                    edge_t const *offsets,
+                                    vertex_t *distances,
+                                    double *sp_counters,
+                                    double *deltas,
+                                    vertex_t depth)
+{
+  for (int thread_idx = blockIdx.x * blockDim.x + threadIdx.x; thread_idx < number_vertices;
+       thread_idx += gridDim.x *
blockDim.x) { + vertex_t vertex = thread_idx; + double vertex_delta = 0; + double vertex_sigma = sp_counters[vertex]; + if (distances[vertex] == depth) { + edge_t first_edge_idx = offsets[vertex]; + edge_t last_edge_idx = offsets[vertex + 1]; + for (edge_t edge_idx = first_edge_idx; edge_idx < last_edge_idx; ++edge_idx) { + vertex_t successor = indices[edge_idx]; + if (distances[successor] == distances[vertex] + 1) { + double factor = (static_cast(1) + deltas[successor]) / sp_counters[successor]; + vertex_delta += vertex_sigma * factor; + } + } + deltas[vertex] = vertex_delta; + } + } +} +} // namespace detail +} // namespace cugraph \ No newline at end of file diff --git a/cpp/src/centrality/katz_centrality.cu b/cpp/src/centrality/katz_centrality.cu index 2e24a3110c1..0119a388680 100644 --- a/cpp/src/centrality/katz_centrality.cu +++ b/cpp/src/centrality/katz_centrality.cu @@ -24,12 +24,12 @@ #include #include #include -#include "utilities/error_utils.h" +#include "utilities/error.hpp" namespace cugraph { template -void katz_centrality(experimental::GraphCSRView const &graph, +void katz_centrality(GraphCSRView const &graph, result_t *result, double alpha, int max_iter, @@ -52,6 +52,6 @@ void katz_centrality(experimental::GraphCSRView const &graph, } template void katz_centrality( - experimental::GraphCSRView const &, double *, double, int, double, bool, bool); + GraphCSRView const &, double *, double, int, double, bool, bool); } // namespace cugraph diff --git a/cpp/src/comms/mpi/comms_mpi.cpp b/cpp/src/comms/mpi/comms_mpi.cpp deleted file mode 100644 index f473c0a1939..00000000000 --- a/cpp/src/comms/mpi/comms_mpi.cpp +++ /dev/null @@ -1,279 +0,0 @@ -/* - * Copyright (c) 2020, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#include -#include -#include -#include "utilities/error_utils.h" - -namespace cugraph { -namespace experimental { -#if ENABLE_OPG - -/**---------------------------------------------------------------------------* - * @brief Exception thrown when a NCCL error is encountered. - * - *---------------------------------------------------------------------------**/ -struct nccl_error : public std::runtime_error { - nccl_error(std::string const &message) : std::runtime_error(message) {} -}; - -inline void throw_nccl_error(ncclResult_t error, const char *file, unsigned int line) -{ - throw nccl_error(std::string{"NCCL error encountered at: " + std::string{file} + ":" + - std::to_string(line) + ": " + ncclGetErrorString(error)}); -} - -#define NCCL_TRY(call) \ - { \ - ncclResult_t nccl_status = (call); \ - if (nccl_status != ncclSuccess) { throw_nccl_error(nccl_status, __FILE__, __LINE__); } \ - } -// MPI errors are expected to be fatal before reaching this. 
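The hand-rolled MPI/NCCL Comm class being deleted here is superseded by raft's communicator, which the new centrality code reaches through raft::handle_t. The multi-GPU pattern used by the new betweenness_centrality() earlier in this diff is: each rank computes partial scores from its share of sources, then the partials are summed to rank 0. A minimal sketch of that call shape, assuming only the comms_initialized() and get_comms().reduce() calls that appear above:

// Minimal sketch: sum per-rank partial scores to rank 0 through the raft
// communicator, mirroring the reduce call in the new betweenness_centrality().
#include <raft/handle.hpp>

void reduce_partial_scores(raft::handle_t const &handle,
                           float const *d_partial,  // per-rank partial scores (device memory)
                           float *d_result,         // summed result, valid on rank 0
                           size_t count)
{
  if (handle.comms_initialized()) {
    handle.get_comms().reduce(
      d_partial, d_result, count, raft::comms::op_t::SUM, 0, handle.get_stream());
  }
}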
-// FIXME : improve when adding raft comms -#define MPI_TRY(cmd) \ - { \ - int e = cmd; \ - if (e != MPI_SUCCESS) { CUGRAPH_FAIL("Failed: MPI error"); } \ - } - -template -constexpr MPI_Datatype get_mpi_type() -{ - if (std::is_integral::value) { - if (std::is_signed::value) { - if (sizeof(value_t) == 1) { - return MPI_INT8_T; - } else if (sizeof(value_t) == 2) { - return MPI_INT16_T; - } else if (sizeof(value_t) == 4) { - return MPI_INT32_T; - } else if (sizeof(value_t) == 8) { - return MPI_INT64_T; - } else { - CUGRAPH_FAIL("unsupported type"); - } - } else { - if (sizeof(value_t) == 1) { - return MPI_UINT8_T; - } else if (sizeof(value_t) == 2) { - return MPI_UINT16_T; - } else if (sizeof(value_t) == 4) { - return MPI_UINT32_T; - } else if (sizeof(value_t) == 8) { - return MPI_UINT64_T; - } else { - CUGRAPH_FAIL("unsupported type"); - } - } - } else if (std::is_same::value) { - return MPI_FLOAT; - } else if (std::is_same::value) { - return MPI_DOUBLE; - } else { - CUGRAPH_FAIL("unsupported type"); - } -} - -template -constexpr ncclDataType_t get_nccl_type() -{ - if (std::is_integral::value) { - if (std::is_signed::value) { - if (sizeof(value_t) == 1) { - return ncclInt8; - } else if (sizeof(value_t) == 4) { - return ncclInt32; - } else if (sizeof(value_t) == 8) { - return ncclInt64; - } else { - CUGRAPH_FAIL("unsupported type"); - } - } else { - if (sizeof(value_t) == 1) { - return ncclUint8; - } else if (sizeof(value_t) == 4) { - return ncclUint32; - } else if (sizeof(value_t) == 8) { - return ncclUint64; - } else { - CUGRAPH_FAIL("unsupported type"); - } - } - } else if (std::is_same::value) { - return ncclFloat32; - } else if (std::is_same::value) { - return ncclFloat64; - } else { - CUGRAPH_FAIL("unsupported type"); - } -} - -constexpr MPI_Op get_mpi_reduce_op(ReduceOp reduce_op) -{ - if (reduce_op == ReduceOp::SUM) { - return MPI_SUM; - } else if (reduce_op == ReduceOp::MAX) { - return MPI_MAX; - } else if (reduce_op == ReduceOp::MIN) { - return MPI_MIN; - } else { - CUGRAPH_FAIL("unsupported type"); - } -} - -constexpr ncclRedOp_t get_nccl_reduce_op(ReduceOp reduce_op) -{ - if (reduce_op == ReduceOp::SUM) { - return ncclSum; - } else if (reduce_op == ReduceOp::MAX) { - return ncclMax; - } else if (reduce_op == ReduceOp::MIN) { - return ncclMin; - } else { - CUGRAPH_FAIL("unsupported type"); - } -} -#endif - -Comm::Comm(int p) : _p{p} -{ -#if ENABLE_OPG - // MPI - int flag{}, mpi_world_size; - - MPI_TRY(MPI_Initialized(&flag)); - - if (flag == false) { - int provided{}; - MPI_TRY(MPI_Init_thread(nullptr, nullptr, MPI_THREAD_MULTIPLE, &provided)); - if (provided != MPI_THREAD_MULTIPLE) { MPI_TRY(MPI_ERR_OTHER); } - _finalize_mpi = true; - } - - MPI_TRY(MPI_Comm_rank(MPI_COMM_WORLD, &_rank)); - MPI_TRY(MPI_Comm_size(MPI_COMM_WORLD, &mpi_world_size)); - CUGRAPH_EXPECTS((_p == mpi_world_size), - "Invalid input arguments: p should match the number of MPI processes."); - - _mpi_comm = MPI_COMM_WORLD; - - // CUDA - - CUDA_TRY(cudaGetDeviceCount(&_device_count)); - _device_id = _rank % _device_count; // FIXME : assumes each node has the same number of GPUs - CUDA_TRY(cudaSetDevice(_device_id)); - - CUDA_TRY( - cudaDeviceGetAttribute(&_sm_count_per_device, cudaDevAttrMultiProcessorCount, _device_id)); - CUDA_TRY(cudaDeviceGetAttribute(&_max_grid_dim_1D, cudaDevAttrMaxGridDimX, _device_id)); - CUDA_TRY(cudaDeviceGetAttribute(&_max_block_dim_1D, cudaDevAttrMaxBlockDimX, _device_id)); - CUDA_TRY(cudaDeviceGetAttribute(&_l2_cache_size, cudaDevAttrL2CacheSize, _device_id)); - 
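The device attributes cached by this constructor also move to the handle: BC::initialize_device_information() above reads maxGridSize[0] and maxThreadsDim[0] from handle_.get_device_properties(), and accumulate() derives its 1-D launch configuration from them. A condensed sketch of that sizing logic (free-function form for illustration only):

// Sketch: 1-D launch configuration as used by BC::accumulate(), taking device
// limits from the raft handle instead of repeated cudaDeviceGetAttribute() calls.
#include <algorithm>
#include <cuda_runtime.h>
#include <raft/handle.hpp>

void make_1d_launch(raft::handle_t const &handle, int number_of_edges, dim3 &grid, dim3 &block)
{
  cudaDeviceProp const &prop = handle.get_device_properties();
  block.x = prop.maxThreadsDim[0];
  grid.x  = std::min(prop.maxGridSize[0], number_of_edges / static_cast<int>(block.x) + 1);
}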
CUDA_TRY(cudaDeviceGetAttribute( - &_shared_memory_size_per_sm, cudaDevAttrMaxSharedMemoryPerMultiprocessor, _device_id)); - - // NCCL - - ncclUniqueId nccl_unique_id_p{}; - if (get_rank() == 0) { NCCL_TRY(ncclGetUniqueId(&nccl_unique_id_p)); } - MPI_TRY(MPI_Bcast(&nccl_unique_id_p, sizeof(ncclUniqueId), MPI_BYTE, 0, _mpi_comm)); - NCCL_TRY(ncclCommInitRank(&_nccl_comm, get_p(), nccl_unique_id_p, get_rank())); - _finalize_nccl = true; -#endif -} - -#if ENABLE_OPG -Comm::Comm(ncclComm_t comm, int size, int rank) : _nccl_comm(comm), _p(size), _rank(rank) -{ - // CUDA - CUDA_TRY(cudaGetDeviceCount(&_device_count)); - _device_id = _rank % _device_count; // FIXME : assumes each node has the same number of GPUs - CUDA_TRY(cudaSetDevice(_device_id)); // FIXME : check if this is needed or if - // python takes care of this - - CUDA_TRY( - cudaDeviceGetAttribute(&_sm_count_per_device, cudaDevAttrMultiProcessorCount, _device_id)); - CUDA_TRY(cudaDeviceGetAttribute(&_max_grid_dim_1D, cudaDevAttrMaxGridDimX, _device_id)); - CUDA_TRY(cudaDeviceGetAttribute(&_max_block_dim_1D, cudaDevAttrMaxBlockDimX, _device_id)); - CUDA_TRY(cudaDeviceGetAttribute(&_l2_cache_size, cudaDevAttrL2CacheSize, _device_id)); - CUDA_TRY(cudaDeviceGetAttribute( - &_shared_memory_size_per_sm, cudaDevAttrMaxSharedMemoryPerMultiprocessor, _device_id)); -} -#endif - -Comm::~Comm() -{ -#if ENABLE_OPG - // NCCL - if (_finalize_nccl) ncclCommDestroy(_nccl_comm); - - if (_finalize_mpi) { MPI_Finalize(); } -#endif -} - -void Comm::barrier() -{ -#if ENABLE_OPG - MPI_Barrier(MPI_COMM_WORLD); -#endif -} - -template -void Comm::allgather(size_t size, value_t *sendbuff, value_t *recvbuff) const -{ -#if ENABLE_OPG - NCCL_TRY(ncclAllGather((const void *)sendbuff, - (void *)recvbuff, - size, - get_nccl_type(), - _nccl_comm, - cudaStreamDefault)); -#endif -} - -template -void Comm::allreduce(size_t size, value_t *sendbuff, value_t *recvbuff, ReduceOp reduce_op) const -{ -#if ENABLE_OPG - NCCL_TRY(ncclAllReduce((const void *)sendbuff, - (void *)recvbuff, - size, - get_nccl_type(), - get_nccl_reduce_op(reduce_op), - _nccl_comm, - cudaStreamDefault)); -#endif -} - -// explicit -template void Comm::allgather(size_t size, int *sendbuff, int *recvbuff) const; -template void Comm::allgather(size_t size, float *sendbuff, float *recvbuff) const; -template void Comm::allgather(size_t size, double *sendbuff, double *recvbuff) const; -template void Comm::allreduce(size_t size, - int *sendbuff, - int *recvbuff, - ReduceOp reduce_op) const; -template void Comm::allreduce(size_t size, - float *sendbuff, - float *recvbuff, - ReduceOp reduce_op) const; -template void Comm::allreduce(size_t size, - double *sendbuff, - double *recvbuff, - ReduceOp reduce_op) const; - -} // namespace experimental -} // namespace cugraph diff --git a/cpp/src/community/ECG.cu b/cpp/src/community/ECG.cu index b746966627c..47a80fa48d6 100644 --- a/cpp/src/community/ECG.cu +++ b/cpp/src/community/ECG.cu @@ -16,12 +16,11 @@ #include -#include #include #include -#include #include #include +#include #include "utilities/graph_utils.cuh" namespace { @@ -108,43 +107,43 @@ void get_permutation_vector(T size, T seed, T *permutation, cudaStream_t stream) namespace cugraph { -template -void ecg(experimental::GraphCSRView const &graph, - WT min_weight, - VT ensemble_size, - VT *ecg_parts) +template +void ecg(GraphCSRView const &graph, + weight_t min_weight, + vertex_t ensemble_size, + vertex_t *ecg_parts) { CUGRAPH_EXPECTS(graph.edge_data != nullptr, "API error, louvain expects a weighted graph"); 
CUGRAPH_EXPECTS(ecg_parts != nullptr, "Invalid API parameter: ecg_parts is NULL"); cudaStream_t stream{0}; - rmm::device_vector ecg_weights_v(graph.edge_data, graph.edge_data + graph.number_of_edges); + rmm::device_vector ecg_weights_v(graph.edge_data, + graph.edge_data + graph.number_of_edges); - VT size{graph.number_of_vertices}; - VT seed{0}; - // VT seed{1}; // Note... this seed won't work for the unit tests... retest after fixing Louvain. + vertex_t size{graph.number_of_vertices}; + vertex_t seed{1}; - auto permuted_graph = std::make_unique>( + auto permuted_graph = std::make_unique>( size, graph.number_of_edges, graph.has_data()); // Iterate over each member of the ensemble - for (VT i = 0; i < ensemble_size; i++) { + for (vertex_t i = 0; i < ensemble_size; i++) { // Take random permutation of the graph - rmm::device_vector permutation_v(size); - VT *d_permutation = permutation_v.data().get(); + rmm::device_vector permutation_v(size); + vertex_t *d_permutation = permutation_v.data().get(); get_permutation_vector(size, seed, d_permutation, stream); seed += size; - detail::permute_graph(graph, d_permutation, permuted_graph->view()); + detail::permute_graph(graph, d_permutation, permuted_graph->view()); - // Run Louvain clustering on the random permutation - rmm::device_vector parts_v(size); - VT *d_parts = parts_v.data().get(); + // Run one level of Louvain clustering on the random permutation + rmm::device_vector parts_v(size); + vertex_t *d_parts = parts_v.data().get(); - WT final_modularity; - VT num_level; + weight_t final_modularity; + vertex_t num_level; cugraph::louvain(permuted_graph->view(), &final_modularity, &num_level, d_parts, 1); @@ -152,7 +151,7 @@ void ecg(experimental::GraphCSRView const &graph, // Keep a sum for each edge of the total number of times its endpoints are in the same partition dim3 grid, block; block.x = 512; - grid.x = min(VT{CUDA_MAX_BLOCKS}, (graph.number_of_edges / 512 + 1)); + grid.x = min(vertex_t{CUDA_MAX_BLOCKS}, (graph.number_of_edges / 512 + 1)); match_check_kernel<<>>(graph.number_of_edges, graph.number_of_vertices, graph.offsets, @@ -163,7 +162,7 @@ void ecg(experimental::GraphCSRView const &graph, } // Set weights = min_weight + (1 - min-weight)*sum/ensemble_size - update_functor uf(min_weight, ensemble_size); + update_functor uf(min_weight, ensemble_size); thrust::transform(rmm::exec_policy(stream)->on(stream), ecg_weights_v.data().get(), ecg_weights_v.data().get() + graph.number_of_edges, @@ -171,27 +170,26 @@ void ecg(experimental::GraphCSRView const &graph, uf); // Run Louvain on the original graph using the computed weights - experimental::GraphCSRView louvain_graph; + // (pass max_level = 100 for a "full run") + GraphCSRView louvain_graph; louvain_graph.indices = graph.indices; louvain_graph.offsets = graph.offsets; louvain_graph.edge_data = ecg_weights_v.data().get(); louvain_graph.number_of_vertices = graph.number_of_vertices; louvain_graph.number_of_edges = graph.number_of_edges; - WT final_modularity; - VT num_level; + weight_t final_modularity; + vertex_t num_level; cugraph::louvain(louvain_graph, &final_modularity, &num_level, ecg_parts, 100); } // Explicit template instantiations. 
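To recap the loop above before the instantiations: each ensemble member permutes the graph, runs a single level of Louvain on the permutation, and match_check_kernel bumps a per-edge tally whenever an edge's endpoints land in the same partition; the thrust::transform then rescales every edge to w' = min_weight + (1 - min_weight) * tally / ensemble_size before the final full Louvain run. The real update_functor is defined in this file's anonymous namespace and is not shown in the diff; a hypothetical sketch consistent with that documented formula:

// Hypothetical sketch of ECG's reweighting functor; the actual definition
// lives in ECG.cu's anonymous namespace and may differ in detail.
template <typename weight_t, typename count_t>
struct update_functor {
  weight_t min_weight;
  count_t ensemble_size;

  update_functor(weight_t mw, count_t es) : min_weight(mw), ensemble_size(es) {}

  __host__ __device__ weight_t operator()(count_t tally) const
  {
    // w' = min_weight + (1 - min_weight) * tally / ensemble_size
    return min_weight + (weight_t{1} - min_weight) * static_cast<weight_t>(tally) / ensemble_size;
  }
};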
-template void ecg( - experimental::GraphCSRView const &graph, - float min_weight, - int32_t ensemble_size, - int32_t *ecg_parts); -template void ecg( - experimental::GraphCSRView const &graph, - double min_weight, - int32_t ensemble_size, - int32_t *ecg_parts); +template void ecg(GraphCSRView const &graph, + float min_weight, + int32_t ensemble_size, + int32_t *ecg_parts); +template void ecg(GraphCSRView const &graph, + double min_weight, + int32_t ensemble_size, + int32_t *ecg_parts); } // namespace cugraph diff --git a/cpp/src/community/extract_subgraph_by_vertex.cu b/cpp/src/community/extract_subgraph_by_vertex.cu index 919f89545a0..c39b7f8ad0a 100644 --- a/cpp/src/community/extract_subgraph_by_vertex.cu +++ b/cpp/src/community/extract_subgraph_by_vertex.cu @@ -16,18 +16,16 @@ #include #include - -#include +#include #include -#include +#include namespace { template -std::unique_ptr> -extract_subgraph_by_vertices( - cugraph::experimental::GraphCOOView const &graph, +std::unique_ptr> extract_subgraph_by_vertices( + cugraph::GraphCOOView const &graph, vertex_t const *vertices, vertex_t num_vertices, cudaStream_t stream) @@ -49,7 +47,7 @@ extract_subgraph_by_vertices( if ((v >= 0) && (v < graph_num_verts)) { d_vertex_used[v] = idx; } else { - cugraph::atomicAdd(d_error_count, int64_t{1}); + atomicAdd(d_error_count, int64_t{1}); } }); @@ -72,7 +70,7 @@ extract_subgraph_by_vertices( }); if (count > 0) { - auto result = std::make_unique>( + auto result = std::make_unique>( num_vertices, count, has_weight); vertex_t *d_new_src = result->src_indices(); @@ -99,7 +97,7 @@ extract_subgraph_by_vertices( // require 2*|E| temporary memory. If this becomes important perhaps // we make 2 implementations and pick one based on the number of // vertices in the subgraph set. - auto pos = cugraph::atomicAdd(d_error_count, 1); + auto pos = atomicAdd(d_error_count, int64_t{1}); d_new_src[pos] = d_vertex_used[s]; d_new_dst[pos] = d_vertex_used[d]; if (has_weight) d_new_weight[pos] = graph_weight[e]; @@ -108,18 +106,18 @@ extract_subgraph_by_vertices( return result; } else { - return std::make_unique>( - 0, 0, has_weight); + return std::make_unique>(0, 0, has_weight); } } } // namespace namespace cugraph { -namespace nvgraph { +namespace subgraph { template -std::unique_ptr> extract_subgraph_vertex( - experimental::GraphCOOView const &graph, VT const *vertices, VT num_vertices) +std::unique_ptr> extract_subgraph_vertex(GraphCOOView const &graph, + VT const *vertices, + VT num_vertices) { CUGRAPH_EXPECTS(vertices != nullptr, "API error, vertices must be non null"); @@ -132,12 +130,14 @@ std::unique_ptr> extract_subgraph_vertex( } } -template std::unique_ptr> -extract_subgraph_vertex( - experimental::GraphCOOView const &, int32_t const *, int32_t); -template std::unique_ptr> -extract_subgraph_vertex( - experimental::GraphCOOView const &, int32_t const *, int32_t); +template std::unique_ptr> +extract_subgraph_vertex(GraphCOOView const &, + int32_t const *, + int32_t); +template std::unique_ptr> +extract_subgraph_vertex(GraphCOOView const &, + int32_t const *, + int32_t); -} // namespace nvgraph +} // namespace subgraph } // namespace cugraph diff --git a/cpp/src/community/ktruss.cu b/cpp/src/community/ktruss.cu index ea6d1091fab..11a8ed6fbae 100644 --- a/cpp/src/community/ktruss.cu +++ b/cpp/src/community/ktruss.cu @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. 
* * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -21,8 +21,7 @@ * @file ktruss.cu * --------------------------------------------------------------------------*/ -#include -#include +#include #include #include @@ -36,8 +35,9 @@ namespace cugraph { namespace detail { template -std::unique_ptr> ktruss_subgraph_impl( - experimental::GraphCOOView const &graph, int k, rmm::mr::device_memory_resource *mr) +std::unique_ptr> ktruss_subgraph_impl(GraphCOOView const &graph, + int k, + rmm::mr::device_memory_resource *mr) { using HornetGraph = hornet::gpu::Hornet; using UpdatePtr = hornet::BatchUpdatePtr; @@ -68,7 +68,7 @@ std::unique_ptr> ktruss_subgraph_impl( kt.runForK(k); CUGRAPH_EXPECTS(cudaPeekAtLastError() == cudaSuccess, "KTruss : Failed to run"); - auto out_graph = std::make_unique>( + auto out_graph = std::make_unique>( graph.number_of_vertices, kt.getGraphEdgeCount(), graph.has_data(), stream, mr); kt.copyGraph(out_graph->src_indices(), out_graph->dst_indices()); @@ -79,8 +79,8 @@ std::unique_ptr> ktruss_subgraph_impl( return out_graph; } template -std::unique_ptr> weighted_ktruss_subgraph_impl( - experimental::GraphCOOView const &graph, int k, rmm::mr::device_memory_resource *mr) +std::unique_ptr> weighted_ktruss_subgraph_impl( + GraphCOOView const &graph, int k, rmm::mr::device_memory_resource *mr) { using HornetGraph = hornet::gpu::Hornet>; using UpdatePtr = hornet::BatchUpdatePtr, hornet::DeviceType::DEVICE>; @@ -111,7 +111,7 @@ std::unique_ptr> weighted_ktruss_subgraph_imp kt.runForK(k); CUGRAPH_EXPECTS(cudaPeekAtLastError() == cudaSuccess, "KTruss : Failed to run"); - auto out_graph = std::make_unique>( + auto out_graph = std::make_unique>( graph.number_of_vertices, kt.getGraphEdgeCount(), graph.has_data(), stream, mr); kt.copyGraph(out_graph->src_indices(), out_graph->dst_indices(), out_graph->edge_data()); @@ -125,8 +125,9 @@ std::unique_ptr> weighted_ktruss_subgraph_imp } // namespace detail template -std::unique_ptr> k_truss_subgraph( - experimental::GraphCOOView const &graph, int k, rmm::mr::device_memory_resource *mr) +std::unique_ptr> k_truss_subgraph(GraphCOOView const &graph, + int k, + rmm::mr::device_memory_resource *mr) { CUGRAPH_EXPECTS(graph.src_indices != nullptr, "Graph source indices cannot be a nullptr"); CUGRAPH_EXPECTS(graph.dst_indices != nullptr, "Graph destination indices cannot be a nullptr"); @@ -138,14 +139,10 @@ std::unique_ptr> k_truss_subgraph( } } -template std::unique_ptr> -k_truss_subgraph(experimental::GraphCOOView const &, - int, - rmm::mr::device_memory_resource *); +template std::unique_ptr> k_truss_subgraph( + GraphCOOView const &, int, rmm::mr::device_memory_resource *); -template std::unique_ptr> -k_truss_subgraph(experimental::GraphCOOView const &, - int, - rmm::mr::device_memory_resource *); +template std::unique_ptr> k_truss_subgraph( + GraphCOOView const &, int, rmm::mr::device_memory_resource *); } // namespace cugraph diff --git a/cpp/src/community/leiden.cpp b/cpp/src/community/leiden.cpp new file mode 100644 index 00000000000..9e7a49db1f1 --- /dev/null +++ b/cpp/src/community/leiden.cpp @@ -0,0 +1,50 @@ +/* + * Copyright (c) 2020, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include +#include + +#include + +#include + +#include + +#include "utilities/error.hpp" + +namespace cugraph { + +template +void leiden(GraphCSRView const &graph, + weight_t &final_modularity, + int &num_level, + vertex_t *leiden_parts, + int max_level, + weight_t resolution) +{ + CUGRAPH_EXPECTS(graph.edge_data != nullptr, "API error, leiden expects a weighted graph"); + CUGRAPH_EXPECTS(leiden_parts != nullptr, "API error, leiden_parts is null"); + + detail::leiden( + graph, final_modularity, num_level, leiden_parts, max_level, resolution); +} + +template void leiden( + GraphCSRView const &, float &, int &, int32_t *, int, float); +template void leiden( + GraphCSRView const &, double &, int &, int32_t *, int, double); + +} // namespace cugraph diff --git a/cpp/src/community/leiden_kernels.cu b/cpp/src/community/leiden_kernels.cu new file mode 100644 index 00000000000..5eb4219d1ac --- /dev/null +++ b/cpp/src/community/leiden_kernels.cu @@ -0,0 +1,299 @@ +/* + * Copyright (c) 2020, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +#include + +#include + +#include +#include + +//#define TIMING + +#ifdef TIMING +#include +#endif + +#include + +namespace cugraph { +namespace detail { + +template +weight_t update_clustering_by_delta_modularity_constrained( + weight_t total_edge_weight, + weight_t resolution, + GraphCSRView const &graph, + rmm::device_vector const &src_indices, + rmm::device_vector const &vertex_weights, + rmm::device_vector &cluster_weights, + rmm::device_vector &cluster, + rmm::device_vector &constraint, + cudaStream_t stream) +{ + rmm::device_vector next_cluster(cluster); + rmm::device_vector delta_Q(graph.number_of_edges); + rmm::device_vector cluster_hash(graph.number_of_edges); + rmm::device_vector old_cluster_sum(graph.number_of_vertices); + + weight_t *d_delta_Q = delta_Q.data().get(); + vertex_t *d_constraint = constraint.data().get(); + vertex_t const *d_src_indices = src_indices.data().get(); + vertex_t const *d_dst_indices = graph.indices; + + weight_t new_Q = modularity(total_edge_weight, resolution, graph, cluster.data().get(), stream); + + weight_t cur_Q = new_Q - 1; + + // To avoid the potential of having two vertices swap clusters + // we will only allow vertices to move up (true) or down (false) + // during each iteration of the loop + bool up_down = true; + + while (new_Q > (cur_Q + 0.0001)) { + cur_Q = new_Q; + + compute_delta_modularity(total_edge_weight, + resolution, + graph, + src_indices, + vertex_weights, + cluster_weights, + cluster, + cluster_hash, + delta_Q, + old_cluster_sum, + stream); + + // Filter out positive delta_Q values for nodes not in the same constraint group + thrust::for_each( + rmm::exec_policy(stream)->on(stream), + thrust::make_counting_iterator(0), + thrust::make_counting_iterator(graph.number_of_edges), + [d_src_indices, d_dst_indices, d_constraint, d_delta_Q] __device__(vertex_t i) { + vertex_t start_cluster = d_constraint[d_src_indices[i]]; + vertex_t end_cluster = d_constraint[d_dst_indices[i]]; + if (start_cluster != end_cluster) d_delta_Q[i] = weight_t{0.0}; + }); + + assign_nodes(graph, + delta_Q, + cluster_hash, + src_indices, + next_cluster, + vertex_weights, + cluster_weights, + up_down, + stream); + + up_down = !up_down; + + new_Q = modularity(total_edge_weight, resolution, graph, next_cluster.data().get(), stream); + + if (new_Q > cur_Q) { + thrust::copy(rmm::exec_policy(stream)->on(stream), + next_cluster.begin(), + next_cluster.end(), + cluster.begin()); + } + } + + return cur_Q; +} + +template float update_clustering_by_delta_modularity_constrained( + float, + float, + GraphCSRView const &, + rmm::device_vector const &, + rmm::device_vector const &, + rmm::device_vector &, + rmm::device_vector &, + rmm::device_vector &, + cudaStream_t); + +template double update_clustering_by_delta_modularity_constrained( + double, + double, + GraphCSRView const &, + rmm::device_vector const &, + rmm::device_vector const &, + rmm::device_vector &, + rmm::device_vector &, + rmm::device_vector &, + cudaStream_t); + +template +void leiden(GraphCSRView const &graph, + weight_t &final_modularity, + int &num_level, + vertex_t *cluster_vec, + int max_level, + weight_t resolution, + cudaStream_t stream) +{ +#ifdef TIMING + HighResTimer hr_timer; +#endif + + num_level = 0; + + // + // Vectors to create a copy of the graph + // + rmm::device_vector offsets_v(graph.offsets, graph.offsets + graph.number_of_vertices + 1); + rmm::device_vector indices_v(graph.indices, graph.indices + graph.number_of_edges); + rmm::device_vector weights_v(graph.edge_data, 
graph.edge_data + graph.number_of_edges); + rmm::device_vector src_indices_v(graph.number_of_edges); + + // + // Weights and clustering across iterations of algorithm + // + rmm::device_vector vertex_weights_v(graph.number_of_vertices); + rmm::device_vector cluster_weights_v(graph.number_of_vertices); + rmm::device_vector cluster_v(graph.number_of_vertices); + + // + // Temporaries used within kernels. Each iteration uses less + // of this memory + // + rmm::device_vector tmp_arr_v(graph.number_of_vertices); + rmm::device_vector cluster_inverse_v(graph.number_of_vertices); + + weight_t total_edge_weight = + thrust::reduce(rmm::exec_policy(stream)->on(stream), weights_v.begin(), weights_v.end()); + weight_t best_modularity = -1; + + // + // Initialize every cluster to reference each vertex to itself + // + thrust::sequence(rmm::exec_policy(stream)->on(stream), cluster_v.begin(), cluster_v.end()); + thrust::copy( + rmm::exec_policy(stream)->on(stream), cluster_v.begin(), cluster_v.end(), cluster_vec); + + // + // Our copy of the graph. Each iteration of the outer loop will + // shrink this copy of the graph. + // + GraphCSRView current_graph(offsets_v.data().get(), + indices_v.data().get(), + weights_v.data().get(), + graph.number_of_vertices, + graph.number_of_edges); + + current_graph.get_source_indices(src_indices_v.data().get()); + + while (num_level < max_level) { + // + // Sum the weights of all edges departing a vertex. This is + // loop invariant, so we'll compute it here. + // + // Cluster weights are equivalent to vertex weights with this initial + // graph + // +#ifdef TIMING + hr_timer.start("init"); +#endif + + cugraph::detail::compute_vertex_sums(current_graph, vertex_weights_v, stream); + thrust::copy(rmm::exec_policy(stream)->on(stream), + vertex_weights_v.begin(), + vertex_weights_v.end(), + cluster_weights_v.begin()); + +#ifdef TIMING + hr_timer.stop(); + + hr_timer.start("update_clustering"); +#endif + + weight_t new_Q = update_clustering_by_delta_modularity(total_edge_weight, + resolution, + current_graph, + src_indices_v, + vertex_weights_v, + cluster_weights_v, + cluster_v, + stream); + + // After finding the initial unconstrained partition we use that partitioning as the constraint + // for the second round. 
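This is the Leiden refinement step: the copy of cluster_v into constraint that follows freezes the partition found by the unconstrained pass, and the lambda in update_clustering_by_delta_modularity_constrained() zeroes delta_Q for any edge whose endpoints lie in different constraint communities, so vertices may only move within their coarse community. For callers, the new public entry point declared in leiden.cpp is used like this (graph construction and the exact cugraph include paths are assumed; arguments match the float instantiation):

// Usage sketch for the new cugraph::leiden() API; building `graph` and the
// cugraph graph/algorithms includes are assumed here.
#include <rmm/thrust_rmm_allocator.h>

void run_leiden(cugraph::GraphCSRView<int32_t, int32_t, float> const &graph)
{
  rmm::device_vector<int32_t> parts(graph.number_of_vertices);
  float final_modularity{0};
  int num_level{0};

  cugraph::leiden(graph,
                  final_modularity,
                  num_level,
                  parts.data().get(),
                  /*max_level=*/100,
                  /*resolution=*/1.0f);
}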
+ rmm::device_vector constraint(graph.number_of_vertices); + thrust::copy( + rmm::exec_policy(stream)->on(stream), cluster_v.begin(), cluster_v.end(), constraint.begin()); + new_Q = update_clustering_by_delta_modularity_constrained(total_edge_weight, + resolution, + current_graph, + src_indices_v, + vertex_weights_v, + cluster_weights_v, + cluster_v, + constraint, + stream); + +#ifdef TIMING + hr_timer.stop(); +#endif + + if (new_Q <= best_modularity) { break; } + + best_modularity = new_Q; + +#ifdef TIMING + hr_timer.start("shrinking graph"); +#endif + + // renumber the clusters to the range 0..(num_clusters-1) + vertex_t num_clusters = renumber_clusters( + graph.number_of_vertices, cluster_v, tmp_arr_v, cluster_inverse_v, cluster_vec, stream); + cluster_weights_v.resize(num_clusters); + + // shrink our graph to represent the graph of supervertices + generate_superverticies_graph(current_graph, src_indices_v, num_clusters, cluster_v, stream); + + // assign each new vertex to its own cluster + thrust::sequence(rmm::exec_policy(stream)->on(stream), cluster_v.begin(), cluster_v.end()); + +#ifdef TIMING + hr_timer.stop(); +#endif + + num_level++; + } + +#ifdef TIMING + hr_timer.display(std::cout); +#endif + + final_modularity = best_modularity; +} + +template void leiden(GraphCSRView const &, + float &, + int &, + int32_t *, + int, + float, + cudaStream_t); +template void leiden(GraphCSRView const &, + double &, + int &, + int32_t *, + int, + double, + cudaStream_t); + +} // namespace detail +} // namespace cugraph diff --git a/cpp/src/community/leiden_kernels.hpp b/cpp/src/community/leiden_kernels.hpp new file mode 100644 index 00000000000..cbe93c04f52 --- /dev/null +++ b/cpp/src/community/leiden_kernels.hpp @@ -0,0 +1,35 @@ +/* + * Copyright (c) 2020, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +#pragma once + +#include + +#include + +namespace cugraph { +namespace detail { + +template +void leiden(GraphCSRView const& graph, + weight_t& final_modularity, + int& num_level, + vertex_t* cluster_vec, + int max_level, + weight_t resolution, + cudaStream_t stream = 0); + +} // namespace detail +} // namespace cugraph diff --git a/cpp/src/community/louvain.cpp b/cpp/src/community/louvain.cpp index 94ed67a0fcc..0e3f6ac51fd 100644 --- a/cpp/src/community/louvain.cpp +++ b/cpp/src/community/louvain.cpp @@ -23,28 +23,30 @@ #include -#include "utilities/error_utils.h" +#include "utilities/error.hpp" namespace cugraph { -template -void louvain(experimental::GraphCSRView const &graph, - WT *final_modularity, +template +void louvain(GraphCSRView const &graph, + weight_t *final_modularity, int *num_level, - VT *louvain_parts, - int max_iter) + vertex_t *louvain_parts, + int max_level, + weight_t resolution) { CUGRAPH_EXPECTS(graph.edge_data != nullptr, "API error, louvain expects a weighted graph"); CUGRAPH_EXPECTS(final_modularity != nullptr, "API error, final_modularity is null"); CUGRAPH_EXPECTS(num_level != nullptr, "API error, num_level is null"); CUGRAPH_EXPECTS(louvain_parts != nullptr, "API error, louvain_parts is null"); - detail::louvain(graph, final_modularity, num_level, louvain_parts, max_iter); + detail::louvain( + graph, final_modularity, num_level, louvain_parts, max_level, resolution); } template void louvain( - experimental::GraphCSRView const &, float *, int *, int32_t *, int); + GraphCSRView const &, float *, int *, int32_t *, int, float); template void louvain( - experimental::GraphCSRView const &, double *, int *, int32_t *, int); + GraphCSRView const &, double *, int *, int32_t *, int, double); } // namespace cugraph diff --git a/cpp/src/community/louvain_kernels.cu b/cpp/src/community/louvain_kernels.cu index 757cf2fcde2..c93e2d82fdf 100644 --- a/cpp/src/community/louvain_kernels.cu +++ b/cpp/src/community/louvain_kernels.cu @@ -17,10 +17,10 @@ #include -#include -#include #include +//#define TIMING + #ifdef TIMING #include #endif @@ -30,8 +30,12 @@ namespace cugraph { namespace detail { +namespace { // anonym. +constexpr int BLOCK_SIZE_1D = 64; +} + template -__global__ // __launch_bounds__(CUDA_MAX_KERNEL_THREADS) +__global__ // void compute_vertex_sums(vertex_t n_vertex, edge_t const *offsets, @@ -50,8 +54,9 @@ __global__ // __launch_bounds__(CUDA_MAX_KERNEL_THREADS) } template -weight_t modularity(weight_t m2, - experimental::GraphCSRView const &graph, +weight_t modularity(weight_t total_edge_weight, + weight_t resolution, + GraphCSRView const &graph, vertex_t const *d_cluster, cudaStream_t stream) { @@ -66,6 +71,10 @@ weight_t modularity(weight_t m2, weight_t *d_inc = inc.data().get(); weight_t *d_deg = deg.data().get(); + // FIXME: Already have weighted degree computed in main loop, + // could pass that in rather than computing d_deg... 
which + // would save an atomicAdd (synchronization) + // thrust::for_each( rmm::exec_policy(stream)->on(stream), thrust::make_counting_iterator(0), @@ -78,11 +87,10 @@ weight_t modularity(weight_t m2, for (edge_t loc = d_offsets[v]; loc < d_offsets[v + 1]; ++loc) { vertex_t neighbor = d_indices[loc]; degree += d_weights[loc]; - if (d_cluster[neighbor] == community) { increase += d_weights[loc] / 2; } + if (d_cluster[neighbor] == community) { increase += d_weights[loc]; } } if (degree > weight_t{0.0}) atomicAdd(d_deg + community, degree); - if (increase > weight_t{0.0}) atomicAdd(d_inc + community, increase); }); @@ -90,29 +98,28 @@ weight_t modularity(weight_t m2, rmm::exec_policy(stream)->on(stream), thrust::make_counting_iterator(0), thrust::make_counting_iterator(graph.number_of_vertices), - [d_deg, d_inc, m2] __device__(vertex_t community) { -#ifdef DEBUG - printf(" d_inc[%d] = %g, d_deg = %g, return = %g\n", - community, - d_inc[community], - d_deg[community], - ((2 * d_inc[community] / m2) - pow(d_deg[community] / m2, 2))); -#endif - - return (2 * d_inc[community] / m2) - pow(d_deg[community] / m2, 2); + [d_deg, d_inc, total_edge_weight, resolution] __device__(vertex_t community) { + return ((d_inc[community] / total_edge_weight) - resolution * + (d_deg[community] * d_deg[community]) / + (total_edge_weight * total_edge_weight)); }, weight_t{0.0}, thrust::plus()); return Q; } +template float modularity( + float, float, GraphCSRView const &, int32_t const *, cudaStream_t); + +template double modularity( + double, double, GraphCSRView const &, int32_t const *, cudaStream_t); + template -void generate_superverticies_graph( - cugraph::experimental::GraphCSRView ¤t_graph, - rmm::device_vector &src_indices_v, - vertex_t new_number_of_vertices, - rmm::device_vector &cluster_v, - cudaStream_t stream) +void generate_superverticies_graph(cugraph::GraphCSRView ¤t_graph, + rmm::device_vector &src_indices_v, + vertex_t new_number_of_vertices, + rmm::device_vector &cluster_v, + cudaStream_t stream) { rmm::device_vector new_src_v(current_graph.number_of_edges); rmm::device_vector new_dst_v(current_graph.number_of_edges); @@ -174,13 +181,25 @@ void generate_superverticies_graph( new_number_of_vertices, current_graph.number_of_edges, stream); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); src_indices_v.resize(current_graph.number_of_edges); } +template void generate_superverticies_graph(GraphCSRView &, + rmm::device_vector &, + int32_t, + rmm::device_vector &, + cudaStream_t); + +template void generate_superverticies_graph(GraphCSRView &, + rmm::device_vector &, + int32_t, + rmm::device_vector &, + cudaStream_t); + template -void compute_vertex_sums(experimental::GraphCSRView const &graph, +void compute_vertex_sums(GraphCSRView const &graph, rmm::device_vector &sums, cudaStream_t stream) { @@ -192,6 +211,14 @@ void compute_vertex_sums(experimental::GraphCSRView graph.number_of_vertices, graph.offsets, graph.edge_data, sums.data().get()); } +template void compute_vertex_sums(GraphCSRView const &, + rmm::device_vector &, + cudaStream_t); + +template void compute_vertex_sums(GraphCSRView const &, + rmm::device_vector &, + cudaStream_t); + template vertex_t renumber_clusters(vertex_t graph_num_vertices, rmm::device_vector &cluster, @@ -204,9 +231,11 @@ vertex_t renumber_clusters(vertex_t graph_num_vertices, // Now we're going to renumber the clusters from 0 to (k-1), where k is the number of // clusters in this level of the dendogram. 
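Written out, the resolution-parameterized modularity that the updated kernel above computes is, with m = total_edge_weight, inc_c the edge weight incident inside community c, and deg_c its total weighted degree,

Q = \sum_{c} \left[ \frac{\mathrm{inc}_c}{m} - \gamma \left( \frac{\mathrm{deg}_c}{m} \right)^{2} \right]

where \gamma is the new resolution argument. At \gamma = 1 this matches the previous computation: dropping the / 2 on increase doubles inc_c, which cancels the removed factor of 2 in the old 2 * d_inc / m2 term.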
// - thrust::copy(cluster.begin(), cluster.end(), temp_array.begin()); - thrust::sort(temp_array.begin(), temp_array.end()); - auto tmp_end = thrust::unique(temp_array.begin(), temp_array.end()); + thrust::copy( + rmm::exec_policy(stream)->on(stream), cluster.begin(), cluster.end(), temp_array.begin()); + thrust::sort(rmm::exec_policy(stream)->on(stream), temp_array.begin(), temp_array.end()); + auto tmp_end = + thrust::unique(rmm::exec_policy(stream)->on(stream), temp_array.begin(), temp_array.end()); vertex_t old_num_clusters = cluster.size(); vertex_t new_num_clusters = thrust::distance(temp_array.begin(), tmp_end); @@ -244,10 +273,243 @@ vertex_t renumber_clusters(vertex_t graph_num_vertices, return new_num_clusters; } +template int32_t renumber_clusters(int32_t, + rmm::device_vector &, + rmm::device_vector &, + rmm::device_vector &, + int32_t *, + cudaStream_t); + +template +void compute_delta_modularity(weight_t total_edge_weight, + weight_t resolution, + GraphCSRView const &graph, + rmm::device_vector const &src_indices_v, + rmm::device_vector const &vertex_weights_v, + rmm::device_vector const &cluster_weights_v, + rmm::device_vector const &cluster_v, + rmm::device_vector &cluster_hash_v, + rmm::device_vector &delta_Q_v, + rmm::device_vector &tmp_size_V_v, + cudaStream_t stream) +{ + vertex_t const *d_src_indices = src_indices_v.data().get(); + vertex_t const *d_dst_indices = graph.indices; + edge_t const *d_offsets = graph.offsets; + weight_t const *d_weights = graph.edge_data; + vertex_t const *d_cluster = cluster_v.data().get(); + weight_t const *d_vertex_weights = vertex_weights_v.data().get(); + weight_t const *d_cluster_weights = cluster_weights_v.data().get(); + + vertex_t *d_cluster_hash = cluster_hash_v.data().get(); + weight_t *d_delta_Q = delta_Q_v.data().get(); + weight_t *d_old_cluster_sum = tmp_size_V_v.data().get(); + weight_t *d_new_cluster_sum = d_delta_Q; + + thrust::fill(cluster_hash_v.begin(), cluster_hash_v.end(), vertex_t{-1}); + thrust::fill(delta_Q_v.begin(), delta_Q_v.end(), weight_t{0.0}); + thrust::fill(tmp_size_V_v.begin(), tmp_size_V_v.end(), weight_t{0.0}); + + // + // For each source vertex, we're going to build a hash + // table to the destination cluster ids. We can use + // the offsets ranges to define the bounds of the hash + // table. 
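// ----------------------------------------------------------------------------
// Illustration (not part of the patch): the per-source hash table described in
// the comment above, shown single-threaded on the host. Source vertex v owns
// the slot range [offsets[v], offsets[v+1]) of cluster_hash; the table always
// has room because v has at most deg(v) distinct neighbor clusters. On the
// device the empty-slot claim (-1 -> cluster) is an atomicCAS, and a thread
// that loses the race simply re-tests the slot on the next loop iteration.
#include <cstdint>

// Returns the slot claimed for new_cluster inside v's range; slots must be
// initialized to -1 and n_edges (= degree of v) must be > 0.
inline int64_t probe_insert(int32_t* cluster_hash,
                            int64_t hash_base,  // offsets[v]
                            int64_t n_edges,    // offsets[v + 1] - offsets[v]
                            int32_t new_cluster)
{
  int64_t h      = new_cluster % n_edges;
  int64_t offset = hash_base + h;
  while (cluster_hash[offset] != new_cluster) {
    if (cluster_hash[offset] == -1) {
      cluster_hash[offset] = new_cluster;  // device version: atomicCAS(...)
    } else {
      h      = (h + 1) % n_edges;  // linear probing
      offset = hash_base + h;
    }
  }
  return offset;  // caller accumulates the edge weight at this slot
}
// ----------------------------------------------------------------------------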
+ // + thrust::for_each(rmm::exec_policy(stream)->on(stream), + thrust::make_counting_iterator(0), + thrust::make_counting_iterator(graph.number_of_edges), + [d_src_indices, + d_dst_indices, + d_cluster, + d_offsets, + d_cluster_hash, + d_new_cluster_sum, + d_weights, + d_old_cluster_sum] __device__(edge_t loc) { + vertex_t src = d_src_indices[loc]; + vertex_t dst = d_dst_indices[loc]; + + if (src != dst) { + vertex_t old_cluster = d_cluster[src]; + vertex_t new_cluster = d_cluster[dst]; + edge_t hash_base = d_offsets[src]; + edge_t n_edges = d_offsets[src + 1] - hash_base; + + int h = (new_cluster % n_edges); + edge_t offset = hash_base + h; + while (d_cluster_hash[offset] != new_cluster) { + if (d_cluster_hash[offset] == -1) { + atomicCAS(d_cluster_hash + offset, -1, new_cluster); + } else { + h = (h + 1) % n_edges; + offset = hash_base + h; + } + } + + atomicAdd(d_new_cluster_sum + offset, d_weights[loc]); + + if (old_cluster == new_cluster) + atomicAdd(d_old_cluster_sum + src, d_weights[loc]); + } + }); + + thrust::for_each(rmm::exec_policy(stream)->on(stream), + thrust::make_counting_iterator(0), + thrust::make_counting_iterator(graph.number_of_edges), + [total_edge_weight, + resolution, + d_cluster_hash, + d_src_indices, + d_cluster, + d_vertex_weights, + d_delta_Q, + d_new_cluster_sum, + d_old_cluster_sum, + d_cluster_weights] __device__(edge_t loc) { + vertex_t new_cluster = d_cluster_hash[loc]; + if (new_cluster >= 0) { + vertex_t src = d_src_indices[loc]; + vertex_t old_cluster = d_cluster[src]; + weight_t k_k = d_vertex_weights[src]; + weight_t a_old = d_cluster_weights[old_cluster]; + weight_t a_new = d_cluster_weights[new_cluster]; + + // NOTE: d_delta_Q and d_new_cluster_sum are aliases + // for same device array to save memory + d_delta_Q[loc] = + 2 * + (((d_new_cluster_sum[loc] - d_old_cluster_sum[src]) / total_edge_weight) - + resolution * (a_new * k_k - a_old * k_k + k_k * k_k) / + (total_edge_weight * total_edge_weight)); + } else { + d_delta_Q[loc] = weight_t{0.0}; + } + }); +} + +template void compute_delta_modularity(float, + float, + GraphCSRView const &, + rmm::device_vector const &, + rmm::device_vector const &, + rmm::device_vector const &, + rmm::device_vector const &, + rmm::device_vector &, + rmm::device_vector &, + rmm::device_vector &, + cudaStream_t); + +template void compute_delta_modularity(double, + double, + GraphCSRView const &, + rmm::device_vector const &, + rmm::device_vector const &, + rmm::device_vector const &, + rmm::device_vector const &, + rmm::device_vector &, + rmm::device_vector &, + rmm::device_vector &, + cudaStream_t); + +template +void assign_nodes(GraphCSRView const &graph, + rmm::device_vector &delta_Q, + rmm::device_vector &cluster_hash, + rmm::device_vector const &src_indices, + rmm::device_vector &next_cluster, + rmm::device_vector const &vertex_weights, + rmm::device_vector &cluster_weights, + bool up_down, + cudaStream_t stream) +{ + rmm::device_vector temp_vertices(graph.number_of_vertices); + rmm::device_vector temp_cluster(graph.number_of_vertices, vertex_t{-1}); + rmm::device_vector temp_delta_Q(graph.number_of_vertices, weight_t{0.0}); + + weight_t *d_delta_Q = delta_Q.data().get(); + vertex_t *d_next_cluster = next_cluster.data().get(); + vertex_t *d_cluster_hash = cluster_hash.data().get(); + weight_t const *d_vertex_weights = vertex_weights.data().get(); + weight_t *d_cluster_weights = cluster_weights.data().get(); + + auto cluster_reduce_iterator = + thrust::make_zip_iterator(thrust::make_tuple(d_cluster_hash, 
d_delta_Q)); + + auto output_edge_iterator2 = thrust::make_zip_iterator( + thrust::make_tuple(temp_cluster.data().get(), temp_delta_Q.data().get())); + + auto cluster_reduce_end = + thrust::reduce_by_key(rmm::exec_policy(stream)->on(stream), + src_indices.begin(), + src_indices.end(), + cluster_reduce_iterator, + temp_vertices.data().get(), + output_edge_iterator2, + thrust::equal_to(), + [] __device__(auto pair1, auto pair2) { + if (thrust::get<1>(pair1) > thrust::get<1>(pair2)) + return pair1; + else + return pair2; + }); + + vertex_t final_size = thrust::distance(temp_vertices.data().get(), cluster_reduce_end.first); + + vertex_t *d_temp_vertices = temp_vertices.data().get(); + vertex_t *d_temp_clusters = temp_cluster.data().get(); + weight_t *d_temp_delta_Q = temp_delta_Q.data().get(); + + thrust::for_each(rmm::exec_policy(stream)->on(stream), + thrust::make_counting_iterator(0), + thrust::make_counting_iterator(final_size), + [d_temp_delta_Q, + up_down, + d_next_cluster, + d_temp_vertices, + d_vertex_weights, + d_temp_clusters, + d_cluster_weights] __device__(vertex_t id) { + if ((d_temp_clusters[id] >= 0) && (d_temp_delta_Q[id] > weight_t{0.0})) { + vertex_t new_cluster = d_temp_clusters[id]; + vertex_t old_cluster = d_next_cluster[d_temp_vertices[id]]; + + if ((new_cluster > old_cluster) == up_down) { + weight_t src_weight = d_vertex_weights[d_temp_vertices[id]]; + d_next_cluster[d_temp_vertices[id]] = d_temp_clusters[id]; + + atomicAdd(d_cluster_weights + new_cluster, src_weight); + atomicAdd(d_cluster_weights + old_cluster, -src_weight); + } + } + }); +} + +template void assign_nodes(GraphCSRView const &, + rmm::device_vector &, + rmm::device_vector &, + rmm::device_vector const &, + rmm::device_vector &, + rmm::device_vector const &, + rmm::device_vector &, + bool, + cudaStream_t); + +template void assign_nodes(GraphCSRView const &, + rmm::device_vector &, + rmm::device_vector &, + rmm::device_vector const &, + rmm::device_vector &, + rmm::device_vector const &, + rmm::device_vector &, + bool, + cudaStream_t); + template weight_t update_clustering_by_delta_modularity( - weight_t m2, - experimental::GraphCSRView const &graph, + weight_t total_edge_weight, + weight_t resolution, + GraphCSRView const &graph, rmm::device_vector const &src_indices, rmm::device_vector const &vertex_weights, rmm::device_vector &cluster_weights, @@ -255,24 +517,18 @@ weight_t update_clustering_by_delta_modularity( cudaStream_t stream) { rmm::device_vector next_cluster(cluster); - rmm::device_vector old_cluster_sum(graph.number_of_vertices); rmm::device_vector delta_Q(graph.number_of_edges); rmm::device_vector cluster_hash(graph.number_of_edges); - rmm::device_vector cluster_hash_sum(graph.number_of_edges, weight_t{0.0}); + rmm::device_vector old_cluster_sum(graph.number_of_vertices); vertex_t *d_cluster_hash = cluster_hash.data().get(); - weight_t *d_cluster_hash_sum = cluster_hash_sum.data().get(); vertex_t *d_cluster = cluster.data().get(); - vertex_t const *d_src_indices = src_indices.data().get(); - vertex_t *d_dst_indices = graph.indices; - edge_t *d_offsets = graph.offsets; - weight_t *d_weights = graph.edge_data; weight_t const *d_vertex_weights = vertex_weights.data().get(); weight_t *d_cluster_weights = cluster_weights.data().get(); weight_t *d_delta_Q = delta_Q.data().get(); - weight_t *d_old_cluster_sum = old_cluster_sum.data().get(); - weight_t new_Q = modularity(m2, graph, cluster.data().get(), stream); + weight_t new_Q = modularity( + total_edge_weight, resolution, graph, 
cluster.data().get(), stream); weight_t cur_Q = new_Q - 1; @@ -284,171 +540,70 @@ weight_t update_clustering_by_delta_modularity( while (new_Q > (cur_Q + 0.0001)) { cur_Q = new_Q; - thrust::fill(cluster_hash.begin(), cluster_hash.end(), vertex_t{-1}); - thrust::fill(cluster_hash_sum.begin(), cluster_hash_sum.end(), weight_t{0.0}); - thrust::fill(old_cluster_sum.begin(), old_cluster_sum.end(), weight_t{0.0}); - - // - // For each source vertex, we're going to build a hash - // table to the destination cluster ids. We can use - // the offsets ranges to define the bounds of the hash - // table. - // - thrust::for_each(rmm::exec_policy(stream)->on(stream), - thrust::make_counting_iterator(0), - thrust::make_counting_iterator(graph.number_of_edges), - [d_src_indices, - d_dst_indices, - d_cluster, - d_offsets, - d_cluster_hash, - d_cluster_hash_sum, - d_weights, - d_old_cluster_sum] __device__(edge_t loc) { - vertex_t src = d_src_indices[loc]; - vertex_t dst = d_dst_indices[loc]; - - if (src != dst) { - vertex_t old_cluster = d_cluster[src]; - vertex_t new_cluster = d_cluster[dst]; - edge_t hash_base = d_offsets[src]; - edge_t n_edges = d_offsets[src + 1] - hash_base; - - int h = (new_cluster % n_edges); - edge_t offset = hash_base + h; - while (d_cluster_hash[offset] != new_cluster) { - if (d_cluster_hash[offset] == -1) { - atomicCAS(d_cluster_hash + offset, -1, new_cluster); - } else { - h = (h + 1) % n_edges; - offset = hash_base + h; - } - } - - atomicAdd(d_cluster_hash_sum + offset, d_weights[loc]); - - if (old_cluster == new_cluster) - atomicAdd(d_old_cluster_sum + src, d_weights[loc]); - } - }); - - thrust::for_each(rmm::exec_policy(stream)->on(stream), - thrust::make_counting_iterator(0), - thrust::make_counting_iterator(graph.number_of_edges), - [m2, - d_cluster_hash, - d_src_indices, - d_cluster, - d_vertex_weights, - d_delta_Q, - d_cluster_hash_sum, - d_old_cluster_sum, - d_cluster_weights] __device__(edge_t loc) { - vertex_t new_cluster = d_cluster_hash[loc]; - if (new_cluster >= 0) { - vertex_t src = d_src_indices[loc]; - vertex_t old_cluster = d_cluster[src]; - weight_t degc_totw = d_vertex_weights[src] / m2; - - d_delta_Q[loc] = - d_cluster_hash_sum[loc] - degc_totw * d_cluster_weights[new_cluster] - - (d_old_cluster_sum[src] - - (degc_totw * (d_cluster_weights[old_cluster] - d_vertex_weights[src]))); - -#ifdef DEBUG - printf("src = %d, new cluster = %d, d_delta_Q[%d] = %g\n", - src, - new_cluster, - loc, - d_delta_Q[loc]); -#endif - } else { - d_delta_Q[loc] = weight_t{0.0}; - } - }); - - auto cluster_reduce_iterator = - thrust::make_zip_iterator(thrust::make_tuple(d_cluster_hash, d_delta_Q)); - - rmm::device_vector temp_vertices(graph.number_of_vertices); - rmm::device_vector temp_cluster(graph.number_of_vertices, vertex_t{-1}); - rmm::device_vector temp_delta_Q(graph.number_of_vertices, weight_t{0.0}); - - auto output_edge_iterator2 = thrust::make_zip_iterator( - thrust::make_tuple(temp_cluster.data().get(), temp_delta_Q.data().get())); - - auto cluster_reduce_end = - thrust::reduce_by_key(rmm::exec_policy(stream)->on(stream), - d_src_indices, - d_src_indices + graph.number_of_edges, - cluster_reduce_iterator, - temp_vertices.data().get(), - output_edge_iterator2, - thrust::equal_to(), - [] __device__(auto pair1, auto pair2) { - if (thrust::get<1>(pair1) > thrust::get<1>(pair2)) - return pair1; - else - return pair2; - }); - - vertex_t final_size = thrust::distance(temp_vertices.data().get(), cluster_reduce_end.first); - - vertex_t *d_temp_vertices = temp_vertices.data().get(); 
- vertex_t *d_temp_clusters = temp_cluster.data().get(); - vertex_t *d_next_cluster = next_cluster.data().get(); - weight_t *d_temp_delta_Q = temp_delta_Q.data().get(); - - thrust::for_each(rmm::exec_policy(stream)->on(stream), - thrust::make_counting_iterator(0), - thrust::make_counting_iterator(final_size), - [d_temp_delta_Q, - up_down, - d_next_cluster, - d_temp_vertices, - d_vertex_weights, - d_temp_clusters, - d_cluster_weights] __device__(vertex_t id) { - if ((d_temp_clusters[id] >= 0) && (d_temp_delta_Q[id] > weight_t{0.0})) { - vertex_t new_cluster = d_temp_clusters[id]; - vertex_t old_cluster = d_next_cluster[d_temp_vertices[id]]; - - if ((new_cluster > old_cluster) == up_down) { -#ifdef DEBUG - printf( - "%s moving vertex %d from cluster %d to cluster %d - deltaQ = %g\n", - (up_down ? "up" : "down"), - d_temp_vertices[id], - d_next_cluster[d_temp_vertices[id]], - d_temp_clusters[id], - d_temp_delta_Q[id]); -#endif - - weight_t src_weight = d_vertex_weights[d_temp_vertices[id]]; - d_next_cluster[d_temp_vertices[id]] = d_temp_clusters[id]; - - atomicAdd(d_cluster_weights + new_cluster, src_weight); - atomicAdd(d_cluster_weights + old_cluster, -src_weight); - } - } - }); + compute_delta_modularity(total_edge_weight, + resolution, + graph, + src_indices, + vertex_weights, + cluster_weights, + cluster, + cluster_hash, + delta_Q, + old_cluster_sum, + stream); + + assign_nodes(graph, + delta_Q, + cluster_hash, + src_indices, + next_cluster, + vertex_weights, + cluster_weights, + up_down, + stream); up_down = !up_down; - new_Q = modularity(m2, graph, next_cluster.data().get(), stream); + new_Q = modularity( + total_edge_weight, resolution, graph, next_cluster.data().get(), stream); - if (new_Q > cur_Q) { thrust::copy(next_cluster.begin(), next_cluster.end(), cluster.begin()); } + if (new_Q > cur_Q) { + thrust::copy(rmm::exec_policy(stream)->on(stream), + next_cluster.begin(), + next_cluster.end(), + cluster.begin()); + } } return cur_Q; } +template float update_clustering_by_delta_modularity(float, + float, + GraphCSRView const &, + rmm::device_vector const &, + rmm::device_vector const &, + rmm::device_vector &, + rmm::device_vector &, + cudaStream_t); + +template double update_clustering_by_delta_modularity( + double, + double, + GraphCSRView const &, + rmm::device_vector const &, + rmm::device_vector const &, + rmm::device_vector &, + rmm::device_vector &, + cudaStream_t); + template -void louvain(experimental::GraphCSRView const &graph, +void louvain(GraphCSRView const &graph, weight_t *final_modularity, int *num_level, vertex_t *cluster_vec, - int max_iter, + int max_level, + weight_t resolution, cudaStream_t stream) { #ifdef TIMING @@ -479,7 +634,7 @@ void louvain(experimental::GraphCSRView const &graph rmm::device_vector tmp_arr_v(graph.number_of_vertices); rmm::device_vector cluster_inverse_v(graph.number_of_vertices); - weight_t m2 = + weight_t total_edge_weight = thrust::reduce(rmm::exec_policy(stream)->on(stream), weights_v.begin(), weights_v.end()); weight_t best_modularity = -1; @@ -487,22 +642,22 @@ void louvain(experimental::GraphCSRView const &graph // Initialize every cluster to reference each vertex to itself // thrust::sequence(rmm::exec_policy(stream)->on(stream), cluster_v.begin(), cluster_v.end()); - thrust::copy(cluster_v.begin(), cluster_v.end(), cluster_vec); + thrust::copy( + rmm::exec_policy(stream)->on(stream), cluster_v.begin(), cluster_v.end(), cluster_vec); // // Our copy of the graph. Each iteration of the outer loop will // shrink this copy of the graph. 
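// ----------------------------------------------------------------------------
// Illustration (not part of the patch): the shape of the greedy swap loop that
// update_clustering_by_delta_modularity() above now delegates to the two
// helpers. The std::function callbacks are stand-ins for
// compute_delta_modularity()/assign_nodes() and modularity(), not cuGraph API;
// call as greedy_swap_loop<float>(...) since weight_t is not deduced here.
#include <functional>

template <typename weight_t>
weight_t greedy_swap_loop(std::function<void(bool up_down)> move_vertices,
                          std::function<weight_t()> score,
                          weight_t min_gain = weight_t{0.0001})
{
  bool up_down   = false;
  weight_t new_Q = score();
  weight_t cur_Q = new_Q - 1;

  while (new_Q > cur_Q + min_gain) {  // stop once the gain falls below min_gain
    cur_Q = new_Q;
    move_vertices(up_down);  // one delta-Q pass plus vertex reassignment
    up_down = !up_down;      // alternating direction keeps pairs of vertices
                             // from endlessly swapping into each other's cluster
    new_Q = score();
  }
  return cur_Q;
}
// ----------------------------------------------------------------------------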
// - cugraph::experimental::GraphCSRView current_graph( - offsets_v.data().get(), - indices_v.data().get(), - weights_v.data().get(), - graph.number_of_vertices, - graph.number_of_edges); + GraphCSRView current_graph(offsets_v.data().get(), + indices_v.data().get(), + weights_v.data().get(), + graph.number_of_vertices, + graph.number_of_edges); current_graph.get_source_indices(src_indices_v.data().get()); - while (true) { + while (*num_level < max_level) { // // Sum the weights of all edges departing a vertex. This is // loop invariant, so we'll compute it here. @@ -515,7 +670,10 @@ void louvain(experimental::GraphCSRView const &graph #endif cugraph::detail::compute_vertex_sums(current_graph, vertex_weights_v, stream); - thrust::copy(vertex_weights_v.begin(), vertex_weights_v.end(), cluster_weights_v.begin()); + thrust::copy(rmm::exec_policy(stream)->on(stream), + vertex_weights_v.begin(), + vertex_weights_v.end(), + cluster_weights_v.begin()); #ifdef TIMING hr_timer.stop(); @@ -523,8 +681,14 @@ void louvain(experimental::GraphCSRView const &graph hr_timer.start("update_clustering"); #endif - weight_t new_Q = update_clustering_by_delta_modularity( - m2, current_graph, src_indices_v, vertex_weights_v, cluster_weights_v, cluster_v, stream); + weight_t new_Q = update_clustering_by_delta_modularity(total_edge_weight, + resolution, + current_graph, + src_indices_v, + vertex_weights_v, + cluster_weights_v, + cluster_v, + stream); #ifdef TIMING hr_timer.stop(); @@ -552,6 +716,8 @@ void louvain(experimental::GraphCSRView const &graph #ifdef TIMING hr_timer.stop(); #endif + + (*num_level)++; } #ifdef TIMING @@ -561,17 +727,19 @@ void louvain(experimental::GraphCSRView const &graph *final_modularity = best_modularity; } -template void louvain(experimental::GraphCSRView const &, +template void louvain(GraphCSRView const &, float *, int *, int32_t *, int, + float, cudaStream_t); -template void louvain(experimental::GraphCSRView const &, +template void louvain(GraphCSRView const &, double *, int *, int32_t *, int, + double, cudaStream_t); } // namespace detail diff --git a/cpp/src/community/louvain_kernels.hpp b/cpp/src/community/louvain_kernels.hpp index dd400f97f9e..eabd562315a 100644 --- a/cpp/src/community/louvain_kernels.hpp +++ b/cpp/src/community/louvain_kernels.hpp @@ -15,17 +15,82 @@ */ #pragma once +#include + #include namespace cugraph { namespace detail { template -void louvain(experimental::GraphCSRView const &graph, +weight_t modularity(weight_t total_edge_weight, + weight_t resolution, + GraphCSRView const &graph, + vertex_t const *d_cluster, + cudaStream_t stream = 0); + +template +void generate_superverticies_graph(cugraph::GraphCSRView ¤t_graph, + rmm::device_vector &src_indices_v, + vertex_t new_number_of_vertices, + rmm::device_vector &cluster_v, + cudaStream_t stream); + +template +void compute_vertex_sums(GraphCSRView const &graph, + rmm::device_vector &sums, + cudaStream_t stream); + +template +vertex_t renumber_clusters(vertex_t graph_num_vertices, + rmm::device_vector &cluster, + rmm::device_vector &temp_array, + rmm::device_vector &cluster_inverse, + vertex_t *cluster_vec, + cudaStream_t stream); + +template +void compute_delta_modularity(weight_t total_edge_weight, + weight_t resolution, + GraphCSRView const &graph, + rmm::device_vector const &src_indices_v, + rmm::device_vector const &vertex_weights_v, + rmm::device_vector const &cluster_weights_v, + rmm::device_vector const &cluster_v, + rmm::device_vector &cluster_hash_v, + rmm::device_vector &delta_Q_v, + 
rmm::device_vector &tmp_size_V_v, + cudaStream_t stream = 0); + +template +void assign_nodes(GraphCSRView const &graph, + rmm::device_vector &delta_Q, + rmm::device_vector &cluster_hash, + rmm::device_vector const &src_indices, + rmm::device_vector &next_cluster, + rmm::device_vector const &vertex_weights, + rmm::device_vector &cluster_weights, + bool up_down, + cudaStream_t stream); + +template +weight_t update_clustering_by_delta_modularity( + weight_t total_edge_weight, + weight_t resolution, + GraphCSRView const &graph, + rmm::device_vector const &src_indices, + rmm::device_vector const &vertex_weights, + rmm::device_vector &cluster_weights, + rmm::device_vector &cluster, + cudaStream_t stream); + +template +void louvain(GraphCSRView const &graph, weight_t *final_modularity, int *num_level, vertex_t *cluster_vec, - int max_iter, + int max_level, + weight_t resolution, cudaStream_t stream = 0); } // namespace detail diff --git a/cpp/src/community/spectral_clustering.cu b/cpp/src/community/spectral_clustering.cu index 908ef61a7a4..f32739ddf29 100644 --- a/cpp/src/community/spectral_clustering.cu +++ b/cpp/src/community/spectral_clustering.cu @@ -15,35 +15,31 @@ */ /** ---------------------------------------------------------------------------* - * @brief Wrapper functions for Nvgraph + * @brief Wrapper functions for Spectral Clustering * - * @file nvgraph_wrapper.cpp + * @file spectral_clustering.cu * ---------------------------------------------------------------------------**/ #include -#include -#include #include #include -#include #include -#include -#include -#include -#include -#include +#include +#include -#include +#include +#include namespace cugraph { -namespace nvgraph { + +namespace ext_raft { namespace detail { template -void balancedCutClustering_impl(experimental::GraphCSRView const &graph, +void balancedCutClustering_impl(GraphCSRView const &graph, vertex_t n_clusters, vertex_t n_eig_vects, weight_t evs_tolerance, @@ -54,23 +50,28 @@ void balancedCutClustering_impl(experimental::GraphCSRView= weight_t{0.0}, - "API error, evs_tolerance must be between 0.0 and 1.0"); - CUGRAPH_EXPECTS(evs_tolerance < weight_t{1.0}, - "API error, evs_tolerance must be between 0.0 and 1.0"); - CUGRAPH_EXPECTS(kmean_tolerance >= weight_t{0.0}, - "API error, kmean_tolerance must be between 0.0 and 1.0"); - CUGRAPH_EXPECTS(kmean_tolerance < weight_t{1.0}, - "API error, kmean_tolerance must be between 0.0 and 1.0"); - CUGRAPH_EXPECTS(n_clusters > 1, "API error, must specify more than 1 cluster"); - CUGRAPH_EXPECTS(n_clusters < graph.number_of_vertices, - "API error, number of clusters must be smaller than number of vertices"); - CUGRAPH_EXPECTS(n_eig_vects <= n_clusters, - "API error, cannot specify more eigenvectors than clusters"); - CUGRAPH_EXPECTS(clustering != nullptr, "API error, must specify valid clustering"); - CUGRAPH_EXPECTS(eig_vals != nullptr, "API error, must specify valid eigenvalues"); - CUGRAPH_EXPECTS(eig_vects != nullptr, "API error, must specify valid eigenvectors"); + RAFT_EXPECTS(graph.edge_data != nullptr, "API error, graph must have weights"); + RAFT_EXPECTS(evs_tolerance >= weight_t{0.0}, + "API error, evs_tolerance must be between 0.0 and 1.0"); + RAFT_EXPECTS(evs_tolerance < weight_t{1.0}, + "API error, evs_tolerance must be between 0.0 and 1.0"); + RAFT_EXPECTS(kmean_tolerance >= weight_t{0.0}, + "API error, kmean_tolerance must be between 0.0 and 1.0"); + RAFT_EXPECTS(kmean_tolerance < weight_t{1.0}, + "API error, kmean_tolerance must be between 0.0 and 1.0"); + 
RAFT_EXPECTS(n_clusters > 1, "API error, must specify more than 1 cluster"); + RAFT_EXPECTS(n_clusters < graph.number_of_vertices, + "API error, number of clusters must be smaller than number of vertices"); + RAFT_EXPECTS(n_eig_vects <= n_clusters, + "API error, cannot specify more eigenvectors than clusters"); + RAFT_EXPECTS(clustering != nullptr, "API error, must specify valid clustering"); + RAFT_EXPECTS(eig_vals != nullptr, "API error, must specify valid eigenvalues"); + RAFT_EXPECTS(eig_vects != nullptr, "API error, must specify valid eigenvectors"); + + raft::handle_t handle; + auto stream = handle.get_stream(); + auto exec = rmm::exec_policy(stream); + auto t_exe_p = exec->on(stream); int evs_max_it{4000}; int kmean_max_it{200}; @@ -87,57 +88,66 @@ void balancedCutClustering_impl(experimental::GraphCSRView(graph, - n_clusters, - n_eig_vects, - evs_max_it, - restartIter_lanczos, - evs_tol, - kmean_max_it, - kmean_tol, - clustering, - eig_vals, - eig_vects); + unsigned long long seed{1234567}; + bool reorthog{false}; + + using index_type = vertex_t; + using value_type = weight_t; + + raft::matrix::sparse_matrix_t const r_csr_m{handle, graph}; + + raft::eigen_solver_config_t eig_cfg{ + n_eig_vects, evs_max_it, restartIter_lanczos, evs_tol, reorthog, seed}; + raft::lanczos_solver_t eig_solver{eig_cfg}; + + raft::cluster_solver_config_t clust_cfg{ + n_clusters, kmean_max_it, kmean_tol, seed}; + raft::kmeans_solver_t cluster_solver{clust_cfg}; + + raft::spectral::partition( + handle, t_exe_p, r_csr_m, eig_solver, cluster_solver, clustering, eig_vals, eig_vects); } template -void spectralModularityMaximization_impl( - experimental::GraphCSRView const &graph, - vertex_t n_clusters, - vertex_t n_eig_vects, - weight_t evs_tolerance, - int evs_max_iter, - weight_t kmean_tolerance, - int kmean_max_iter, - vertex_t *clustering, - weight_t *eig_vals, - weight_t *eig_vects) +void spectralModularityMaximization_impl(GraphCSRView const &graph, + vertex_t n_clusters, + vertex_t n_eig_vects, + weight_t evs_tolerance, + int evs_max_iter, + weight_t kmean_tolerance, + int kmean_max_iter, + vertex_t *clustering, + weight_t *eig_vals, + weight_t *eig_vects) { - CUGRAPH_EXPECTS(graph.edge_data != nullptr, "API error, graph must have weights"); - CUGRAPH_EXPECTS(evs_tolerance >= weight_t{0.0}, - "API error, evs_tolerance must be between 0.0 and 1.0"); - CUGRAPH_EXPECTS(evs_tolerance < weight_t{1.0}, - "API error, evs_tolerance must be between 0.0 and 1.0"); - CUGRAPH_EXPECTS(kmean_tolerance >= weight_t{0.0}, - "API error, kmean_tolerance must be between 0.0 and 1.0"); - CUGRAPH_EXPECTS(kmean_tolerance < weight_t{1.0}, - "API error, kmean_tolerance must be between 0.0 and 1.0"); - CUGRAPH_EXPECTS(n_clusters > 1, "API error, must specify more than 1 cluster"); - CUGRAPH_EXPECTS(n_clusters < graph.number_of_vertices, - "API error, number of clusters must be smaller than number of vertices"); - CUGRAPH_EXPECTS(n_eig_vects <= n_clusters, - "API error, cannot specify more eigenvectors than clusters"); - CUGRAPH_EXPECTS(clustering != nullptr, "API error, must specify valid clustering"); - CUGRAPH_EXPECTS(eig_vals != nullptr, "API error, must specify valid eigenvalues"); - CUGRAPH_EXPECTS(eig_vects != nullptr, "API error, must specify valid eigenvectors"); + RAFT_EXPECTS(graph.edge_data != nullptr, "API error, graph must have weights"); + RAFT_EXPECTS(evs_tolerance >= weight_t{0.0}, + "API error, evs_tolerance must be between 0.0 and 1.0"); + RAFT_EXPECTS(evs_tolerance < weight_t{1.0}, + "API error, evs_tolerance 
must be between 0.0 and 1.0"); + RAFT_EXPECTS(kmean_tolerance >= weight_t{0.0}, + "API error, kmean_tolerance must be between 0.0 and 1.0"); + RAFT_EXPECTS(kmean_tolerance < weight_t{1.0}, + "API error, kmean_tolerance must be between 0.0 and 1.0"); + RAFT_EXPECTS(n_clusters > 1, "API error, must specify more than 1 cluster"); + RAFT_EXPECTS(n_clusters < graph.number_of_vertices, + "API error, number of clusters must be smaller than number of vertices"); + RAFT_EXPECTS(n_eig_vects <= n_clusters, + "API error, cannot specify more eigenvectors than clusters"); + RAFT_EXPECTS(clustering != nullptr, "API error, must specify valid clustering"); + RAFT_EXPECTS(eig_vals != nullptr, "API error, must specify valid eigenvalues"); + RAFT_EXPECTS(eig_vects != nullptr, "API error, must specify valid eigenvectors"); + + raft::handle_t handle; + auto stream = handle.get_stream(); + auto exec = rmm::exec_policy(stream); + auto t_exe_p = exec->on(stream); int evs_max_it{4000}; int kmean_max_it{200}; weight_t evs_tol{1.0E-3}; weight_t kmean_tol{1.0E-2}; - int iters_lanczos, iters_kmeans; - if (evs_max_iter > 0) evs_max_it = evs_max_iter; if (evs_tolerance > weight_t{0.0}) evs_tol = evs_tolerance; @@ -147,56 +157,90 @@ void spectralModularityMaximization_impl( if (kmean_tolerance > weight_t{0.0}) kmean_tol = kmean_tolerance; int restartIter_lanczos = 15 + n_eig_vects; - ::nvgraph::modularity_maximization(graph, - n_clusters, - n_eig_vects, - evs_max_it, - restartIter_lanczos, - evs_tol, - kmean_max_it, - kmean_tol, - clustering, - eig_vals, - eig_vects, - iters_lanczos, - iters_kmeans); + + unsigned long long seed{123456}; + bool reorthog{false}; + + using index_type = vertex_t; + using value_type = weight_t; + + raft::matrix::sparse_matrix_t const r_csr_m{handle, graph}; + + raft::eigen_solver_config_t eig_cfg{ + n_eig_vects, evs_max_it, restartIter_lanczos, evs_tol, reorthog, seed}; + raft::lanczos_solver_t eig_solver{eig_cfg}; + + raft::cluster_solver_config_t clust_cfg{ + n_clusters, kmean_max_it, kmean_tol, seed}; + raft::kmeans_solver_t cluster_solver{clust_cfg}; + + // not returned... + // auto result = + raft::spectral::modularity_maximization( + handle, t_exe_p, r_csr_m, eig_solver, cluster_solver, clustering, eig_vals, eig_vects); + + // not returned... 
+ // int iters_lanczos, iters_kmeans; + // iters_lanczos = std::get<0>(result); + // iters_kmeans = std::get<2>(result); } template -void analyzeModularityClustering_impl( - experimental::GraphCSRView const &graph, - int n_clusters, - vertex_t const *clustering, - weight_t *modularity) +void analyzeModularityClustering_impl(GraphCSRView const &graph, + int n_clusters, + vertex_t const *clustering, + weight_t *modularity) { + raft::handle_t handle; + auto stream = handle.get_stream(); + auto exec = rmm::exec_policy(stream); + auto t_exe_p = exec->on(stream); + + using index_type = vertex_t; + using value_type = weight_t; + + raft::matrix::sparse_matrix_t const r_csr_m{handle, graph}; + weight_t mod; - ::nvgraph::analyzeModularity(graph, n_clusters, clustering, mod); + raft::spectral::analyzeModularity(handle, t_exe_p, r_csr_m, n_clusters, clustering, mod); *modularity = mod; } template -void analyzeBalancedCut_impl(experimental::GraphCSRView const &graph, +void analyzeBalancedCut_impl(GraphCSRView const &graph, vertex_t n_clusters, vertex_t const *clustering, weight_t *edgeCut, weight_t *ratioCut) { - CUGRAPH_EXPECTS(n_clusters <= graph.number_of_vertices, - "API error: number of clusters must be <= number of vertices"); - CUGRAPH_EXPECTS(n_clusters > 0, "API error: number of clusters must be > 0)"); + raft::handle_t handle; + auto stream = handle.get_stream(); + auto exec = rmm::exec_policy(stream); + auto t_exe_p = exec->on(stream); + + RAFT_EXPECTS(n_clusters <= graph.number_of_vertices, + "API error: number of clusters must be <= number of vertices"); + RAFT_EXPECTS(n_clusters > 0, "API error: number of clusters must be > 0)"); + + weight_t edge_cut; + weight_t cost{0}; + + using index_type = vertex_t; + using value_type = weight_t; - weight_t edge_cut, ratio_cut; + raft::matrix::sparse_matrix_t const r_csr_m{handle, graph}; - ::nvgraph::analyzePartition(graph, n_clusters, clustering, edge_cut, ratio_cut); + raft::spectral::analyzePartition( + handle, t_exe_p, r_csr_m, n_clusters, clustering, edge_cut, cost); *edgeCut = edge_cut; - *ratioCut = ratio_cut; + *ratioCut = cost; } } // namespace detail template -void balancedCutClustering(experimental::GraphCSRView const &graph, +void balancedCutClustering(GraphCSRView const &graph, VT num_clusters, VT num_eigen_vects, WT evs_tolerance, @@ -221,7 +265,7 @@ void balancedCutClustering(experimental::GraphCSRView const &graph, } template -void spectralModularityMaximization(experimental::GraphCSRView const &graph, +void spectralModularityMaximization(GraphCSRView const &graph, VT n_clusters, VT n_eigen_vects, WT evs_tolerance, @@ -246,7 +290,7 @@ void spectralModularityMaximization(experimental::GraphCSRView const } template -void analyzeClustering_modularity(experimental::GraphCSRView const &graph, +void analyzeClustering_modularity(GraphCSRView const &graph, int n_clusters, VT const *clustering, WT *score) @@ -255,7 +299,7 @@ void analyzeClustering_modularity(experimental::GraphCSRView const & } template -void analyzeClustering_edge_cut(experimental::GraphCSRView const &graph, +void analyzeClustering_edge_cut(GraphCSRView const &graph, int n_clusters, VT const *clustering, WT *score) @@ -265,7 +309,7 @@ void analyzeClustering_edge_cut(experimental::GraphCSRView const &gr } template -void analyzeClustering_ratio_cut(experimental::GraphCSRView const &graph, +void analyzeClustering_ratio_cut(GraphCSRView const &graph, int n_clusters, VT const *clustering, WT *score) @@ -275,25 +319,37 @@ void analyzeClustering_ratio_cut(experimental::GraphCSRView 
const &g } template void balancedCutClustering( - experimental::GraphCSRView const &, int, int, float, int, float, int, int *); + GraphCSRView const &, int, int, float, int, float, int, int *); template void balancedCutClustering( - experimental::GraphCSRView const &, int, int, double, int, double, int, int *); + GraphCSRView const &, int, int, double, int, double, int, int *); template void spectralModularityMaximization( - experimental::GraphCSRView const &, int, int, float, int, float, int, int *); + GraphCSRView const &, int, int, float, int, float, int, int *); template void spectralModularityMaximization( - experimental::GraphCSRView const &, int, int, double, int, double, int, int *); -template void analyzeClustering_modularity( - experimental::GraphCSRView const &, int, int const *, float *); -template void analyzeClustering_modularity( - experimental::GraphCSRView const &, int, int const *, double *); -template void analyzeClustering_edge_cut( - experimental::GraphCSRView const &, int, int const *, float *); -template void analyzeClustering_edge_cut( - experimental::GraphCSRView const &, int, int const *, double *); -template void analyzeClustering_ratio_cut( - experimental::GraphCSRView const &, int, int const *, float *); -template void analyzeClustering_ratio_cut( - experimental::GraphCSRView const &, int, int const *, double *); - -} // namespace nvgraph + GraphCSRView const &, int, int, double, int, double, int, int *); +template void analyzeClustering_modularity(GraphCSRView const &, + int, + int const *, + float *); +template void analyzeClustering_modularity(GraphCSRView const &, + int, + int const *, + double *); +template void analyzeClustering_edge_cut(GraphCSRView const &, + int, + int const *, + float *); +template void analyzeClustering_edge_cut(GraphCSRView const &, + int, + int const *, + double *); +template void analyzeClustering_ratio_cut(GraphCSRView const &, + int, + int const *, + float *); +template void analyzeClustering_ratio_cut(GraphCSRView const &, + int, + int const *, + double *); + +} // namespace ext_raft } // namespace cugraph diff --git a/cpp/src/community/triangles_counting.cu b/cpp/src/community/triangles_counting.cu index 27b19e2e2a8..f6670365652 100644 --- a/cpp/src/community/triangles_counting.cu +++ b/cpp/src/community/triangles_counting.cu @@ -16,17 +16,18 @@ #include +#include #include #include -#include -#include +#include #include #include #include +#include #include "cub/cub.cuh" #define TH_CENT_K_LOCLEN (34) @@ -49,7 +50,10 @@ #define DEG_THR1 (3.5) #define DEG_THR2 (38.0) -namespace nvgraph { +namespace cugraph { +namespace triangle { + +namespace { // anonym. 
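// ----------------------------------------------------------------------------
// Illustration (not part of the patch): the hunks below replace nvgraph's
// utils::shfl/shfl_down/shfl_up wrappers with the CUDA 9+ *_sync shuffle
// intrinsics. A minimal warp-wide sum in that style, assuming all 32 lanes
// participate (which is what raft::warp_full_mask() encodes in the patch):
__device__ inline int warp_sum(int v)
{
  const unsigned full_mask = 0xffffffffu;  // equivalent of raft::warp_full_mask()
  for (int i = 16; i > 0; i >>= 1) {
    v += __shfl_down_sync(full_mask, v, i);  // lane k adds lane k + i
  }
  return v;  // lane 0 now holds the sum over the warp
}
// ----------------------------------------------------------------------------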
template struct type_utils; @@ -95,13 +99,13 @@ static inline void cubSum(InputIteratorT d_in, cub::DeviceReduce::Sum( nullptr, temp_storage_bytes, d_in, d_out, num_items, stream, debug_synchronous); - cudaCheckError(); + CHECK_CUDA(stream); rmm::device_buffer d_temp_storage(temp_storage_bytes, stream); cub::DeviceReduce::Sum( d_temp_storage.data(), temp_storage_bytes, d_in, d_out, num_items, stream, debug_synchronous); - cudaCheckError(); + CHECK_CUDA(stream); return; } @@ -129,7 +133,7 @@ static inline void cubIf(InputIteratorT d_in, select_op, stream, debug_synchronous); - cudaCheckError(); + CHECK_CUDA(stream); rmm::device_buffer d_temp_storage(temp_storage_bytes, stream); @@ -142,7 +146,7 @@ static inline void cubIf(InputIteratorT d_in, select_op, stream, debug_synchronous); - cudaCheckError(); + CHECK_CUDA(stream); return; } @@ -169,7 +173,7 @@ __device__ __forceinline__ T block_sum(T v) const int wid = threadIdx.x / 32 + ((BDIM_Y > 1) ? threadIdx.y * (BDIM_X / 32) : 0); #pragma unroll - for (int i = WSIZE / 2; i; i >>= 1) { v += utils::shfl_down(v, i); } + for (int i = WSIZE / 2; i; i >>= 1) { v += __shfl_down_sync(raft::warp_full_mask(), v, i); } if (lid == 0) sh[wid] = v; __syncthreads(); @@ -177,7 +181,9 @@ __device__ __forceinline__ T block_sum(T v) v = (lid < (BDIM_X * BDIM_Y / WSIZE)) ? sh[lid] : 0; #pragma unroll - for (int i = (BDIM_X * BDIM_Y / WSIZE) / 2; i; i >>= 1) { v += utils::shfl_down(v, i); } + for (int i = (BDIM_X * BDIM_Y / WSIZE) / 2; i; i >>= 1) { + v += __shfl_down_sync(raft::warp_full_mask(), v, i); + } } return v; } @@ -282,7 +288,7 @@ void tricnt_b2b(T nblock, // still best overall (with no psum) tricnt_b2b_k<<>>( m->nrows, m->rows_d, m->roff_d, m->cols_d, ocnt_d, bmapL0_d, bmldL0, bmapL1_d, bmldL1); - cudaCheckError(); + CHECK_CUDA(stream); return; } @@ -294,7 +300,7 @@ __device__ __forceinline__ T block_sum_sh(T v, T *sh) const int wid = threadIdx.x / 32 + ((BDIM_Y > 1) ? threadIdx.y * (BDIM_X / 32) : 0); #pragma unroll - for (int i = WSIZE / 2; i; i >>= 1) { v += utils::shfl_down(v, i); } + for (int i = WSIZE / 2; i; i >>= 1) { v += __shfl_down_sync(raft::warp_full_mask(), v, i); } if (lid == 0) sh[wid] = v; __syncthreads(); @@ -302,7 +308,9 @@ __device__ __forceinline__ T block_sum_sh(T v, T *sh) v = (lid < (BDIM_X * BDIM_Y / WSIZE)) ? 
sh[lid] : 0; #pragma unroll - for (int i = (BDIM_X * BDIM_Y / WSIZE) / 2; i; i >>= 1) { v += utils::shfl_down(v, i); } + for (int i = (BDIM_X * BDIM_Y / WSIZE) / 2; i; i >>= 1) { + v += __shfl_down_sync(raft::warp_full_mask(), v, i); + } } return v; } @@ -386,7 +394,7 @@ void tricnt_bsh(T nblock, spmat_t *m, uint64_t *ocnt_d, size_t bmld, cudaStre { tricnt_bsh_k<<>>( m->nrows, m->rows_d, m->roff_d, m->cols_d, ocnt_d, bmld); - cudaCheckError(); + CHECK_CUDA(stream); return; } @@ -438,8 +446,8 @@ __global__ void tricnt_wrp_ps_k(const ROW_T ner, for (int i = 1; i < RLEN_THR1; i++) { if (i == nloc) break; - const OFF_T csoff = utils::shfl(soff, i); - const OFF_T ceoff = utils::shfl(eoff, i); + const OFF_T csoff = __shfl_sync(raft::warp_full_mask(), soff, i); + const OFF_T ceoff = __shfl_sync(raft::warp_full_mask(), eoff, i); if (ceoff - csoff < RLEN_THR2) { if (threadIdx.x == i) mysm = i; @@ -483,11 +491,11 @@ __global__ void tricnt_wrp_ps_k(const ROW_T ner, #pragma unroll for (int j = 1; j < 32; j <<= 1) { - lensum += (threadIdx.x >= j) * (utils::shfl_up(lensum, j)); + lensum += (threadIdx.x >= j) * (__shfl_up_sync(raft::warp_full_mask(), lensum, j)); } shs[threadIdx.y][threadIdx.x] = lensum - len; - lensum = utils::shfl(lensum, 31); + lensum = __shfl_sync(raft::warp_full_mask(), lensum, 31); int k = WSIZE - 1; for (int j = lensum - 1; j >= 0; j -= WSIZE) { @@ -534,7 +542,7 @@ void tricnt_wrp( dim3 block(32, THREADS / 32); tricnt_wrp_ps_k<32, THREADS / 32, WP_LEN_TH1, WP_LEN_TH2> <<>>(m->nrows, m->rows_d, m->roff_d, m->cols_d, ocnt_d, bmap_d, bmld); - cudaCheckError(); + CHECK_CUDA(stream); return; } @@ -622,7 +630,7 @@ void tricnt_thr(T nblock, spmat_t *m, uint64_t *ocnt_d, cudaStream_t stream) tricnt_thr_k <<>>(m->nrows, m->rows_d, m->roff_d, m->cols_d, ocnt_d); - cudaCheckError(); + CHECK_CUDA(stream); return; } @@ -648,7 +656,7 @@ void create_nondangling_vector( cubIf(it, p_nonempty, out_num.data().get(), n, temp_func, stream); cudaMemcpy(n_nonempty, out_num.data().get(), sizeof(*n_nonempty), cudaMemcpyDeviceToHost); - cudaCheckError(); + CHECK_CUDA(stream); } template @@ -657,7 +665,7 @@ uint64_t reduce(uint64_t *v_d, T n, cudaStream_t stream) rmm::device_vector tmp(1); cubSum(v_d, tmp.data().get(), n, stream); - cudaCheckError(); + CHECK_CUDA(stream); return tmp[0]; } @@ -700,27 +708,20 @@ TrianglesCount::TrianglesCount(IndexType num_vertices, IndexType const *row_offsets, IndexType const *col_indices, cudaStream_t stream) + : m_mat{num_vertices, num_edges, num_vertices, row_offsets, nullptr, col_indices}, + m_stream{stream}, + m_done{true} { - m_stream = stream; - m_done = true; - int device_id; cudaGetDevice(&device_id); cudaDeviceGetAttribute(&m_shared_mem_per_block, cudaDevAttrMaxSharedMemoryPerBlock, device_id); - cudaCheckError(); + CHECK_CUDA(m_stream); cudaDeviceGetAttribute(&m_multi_processor_count, cudaDevAttrMultiProcessorCount, device_id); - cudaCheckError(); + CHECK_CUDA(m_stream); cudaDeviceGetAttribute( &m_max_threads_per_multi_processor, cudaDevAttrMaxThreadsPerMultiProcessor, device_id); - cudaCheckError(); - - // fill spmat struct; - m_mat.nnz = num_edges; - m_mat.N = num_vertices; - m_mat.nrows = num_vertices; - m_mat.roff_d = row_offsets; - m_mat.cols_d = col_indices; + CHECK_CUDA(m_stream); m_seq.resize(m_mat.N, IndexType{0}); create_nondangling_vector(m_mat.roff_d, m_seq.data().get(), &(m_mat.nrows), m_mat.N, m_stream); @@ -730,9 +731,11 @@ TrianglesCount::TrianglesCount(IndexType num_vertices, template void TrianglesCount::tcount_bsh() { - if 
(m_shared_mem_per_block * 8 < (size_t)m_mat.nrows) {
-    FatalError("Number of vertices too high to use this kernel!", NVGRAPH_ERR_BAD_PARAMETERS);
-  }
+  CUGRAPH_EXPECTS(not(m_shared_mem_per_block * 8 < m_mat.nrows),
+                  "Number of vertices too high for TrianglesCount.");
+  /// if (m_shared_mem_per_block * 8 < (size_t)m_mat.nrows) {
+  ///   FatalError("Number of vertices too high to use this kernel!", NVGRAPH_ERR_BAD_PARAMETERS);
+  ///}
 
   size_t bmld = bitmap_roundup(m_mat.N);
   int nblock = m_mat.nrows;
@@ -754,7 +757,7 @@ void TrianglesCount::tcount_b2b()
 
   size_t free_bytes, total_bytes;
   cudaMemGetInfo(&free_bytes, &total_bytes);
-  cudaCheckError();
+  CHECK_CUDA(m_stream);
 
   int nblock = (free_bytes * 95 / 100) / (sizeof(uint32_t) * bmldL1);  //@TODO: what?
   nblock = MIN(nblock, m_mat.nrows);
@@ -788,7 +791,7 @@ void TrianglesCount::tcount_wrp()
   // number of blocks limited by birmap size
   size_t free_bytes, total_bytes;
   cudaMemGetInfo(&free_bytes, &total_bytes);
-  cudaCheckError();
+  CHECK_CUDA(m_stream);
 
   int nblock = (free_bytes * 95 / 100) / (sizeof(uint32_t) * bmld * (THREADS / 32));
   nblock = MIN(nblock, DIV_UP(m_mat.nrows, (THREADS / 32)));
@@ -831,15 +834,12 @@ void TrianglesCount::count()
   }
 }
 
-}  // namespace nvgraph
-
-namespace cugraph {
-namespace nvgraph {
+}  // namespace
 
 template
-uint64_t triangle_count(experimental::GraphCSRView const &graph)
+uint64_t triangle_count(GraphCSRView const &graph)
 {
-  ::nvgraph::TrianglesCount counter(
+  TrianglesCount counter(
     graph.number_of_vertices, graph.number_of_edges, graph.offsets, graph.indices);
 
   counter.count();
@@ -847,7 +847,7 @@ uint64_t triangle_count(experimental::GraphCSRView const &graph)
 }
 
 template uint64_t triangle_count(
-  experimental::GraphCSRView const &);
+  GraphCSRView const &);
 
-}  // namespace nvgraph
+}  // namespace triangle
 }  // namespace cugraph
diff --git a/cpp/src/components/connectivity.cu b/cpp/src/components/connectivity.cu
index 5dcbfcfadc2..2cc1da017a9 100644
--- a/cpp/src/components/connectivity.cu
+++ b/cpp/src/components/connectivity.cu
@@ -1,3 +1,19 @@
+/*
+ * Copyright (c) 2019-2020, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */ + #include "scc_matrix.cuh" #include "weak_cc.cuh" @@ -8,7 +24,7 @@ #include #include #include -#include "utilities/error_utils.h" +#include "utilities/error.hpp" #include "utilities/graph_utils.cuh" #include "topology/topology.cuh" @@ -41,7 +57,7 @@ namespace detail { */ template std::enable_if_t::value> connected_components_impl( - experimental::GraphCSRView const &graph, + GraphCSRView const &graph, cugraph_cc_t connectivity_type, VT *labels, cudaStream_t stream) @@ -68,7 +84,7 @@ std::enable_if_t::value> connected_components_impl( } // namespace detail template -void connected_components(experimental::GraphCSRView const &graph, +void connected_components(GraphCSRView const &graph, cugraph_cc_t connectivity_type, VT *labels) { @@ -80,8 +96,8 @@ void connected_components(experimental::GraphCSRView const &graph, } template void connected_components( - experimental::GraphCSRView const &, cugraph_cc_t, int32_t *); + GraphCSRView const &, cugraph_cc_t, int32_t *); template void connected_components( - experimental::GraphCSRView const &, cugraph_cc_t, int64_t *); + GraphCSRView const &, cugraph_cc_t, int64_t *); } // namespace cugraph diff --git a/cpp/src/components/scc_matrix.cuh b/cpp/src/components/scc_matrix.cuh index ce15e8d3c98..801f1fe0fad 100644 --- a/cpp/src/components/scc_matrix.cuh +++ b/cpp/src/components/scc_matrix.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. diff --git a/cpp/src/components/utils.h b/cpp/src/components/utils.h index dfc56434357..c9ebb6ac4d1 100644 --- a/cpp/src/components/utils.h +++ b/cpp/src/components/utils.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -25,7 +25,9 @@ #include #include -#include +#include + +#include namespace MLCommon { @@ -77,35 +79,6 @@ class Exception : public std::exception { } }; -/** macro to throw a runtime error */ -#define THROW(fmt, ...) \ - do { \ - std::string msg; \ - char errMsg[2048]; \ - std::sprintf(errMsg, "Exception occured! file=%s line=%d: ", __FILE__, __LINE__); \ - msg += errMsg; \ - std::sprintf(errMsg, fmt, ##__VA_ARGS__); \ - msg += errMsg; \ - throw MLCommon::Exception(msg); \ - } while (0) - -/** macro to check for a conditional and assert on failure */ -#define ASSERT(check, fmt, ...) \ - do { \ - if (!(check)) THROW(fmt, ##__VA_ARGS__); \ - } while (0) - -/** check for cuda runtime API errors and assert accordingly */ -#define CUDA_CHECK(call) \ - do { \ - cudaError_t status = call; \ - ASSERT( \ - status == cudaSuccess, "FAIL: call='%s'. 
Reason:%s\n", #call, cudaGetErrorString(status)); \ - } while (0) - -///@todo: add a similar CUDA_CHECK_NO_THROW -/// (Ref: https://github.com/rapidsai/cuml/issues/229) - /** * @brief Generic copy method for all kinds of transfers * @tparam Type data type @@ -117,7 +90,7 @@ class Exception : public std::exception { template void copy(Type* dst, const Type* src, size_t len, cudaStream_t stream) { - CUDA_CHECK(cudaMemcpyAsync(dst, src, len * sizeof(Type), cudaMemcpyDefault, stream)); + CUDA_TRY(cudaMemcpyAsync(dst, src, len * sizeof(Type), cudaMemcpyDefault, stream)); } /** @@ -143,7 +116,7 @@ void updateHost(Type* hPtr, const Type* dPtr, size_t len, cudaStream_t stream) template void copyAsync(Type* dPtr1, const Type* dPtr2, size_t len, cudaStream_t stream) { - CUDA_CHECK(cudaMemcpyAsync(dPtr1, dPtr2, len * sizeof(Type), cudaMemcpyDeviceToDevice, stream)); + CUDA_TRY(cudaMemcpyAsync(dPtr1, dPtr2, len * sizeof(Type), cudaMemcpyDeviceToDevice, stream)); } /** @} */ @@ -214,8 +187,7 @@ void myPrintDevVector(const char* variableName, OutStream& out) { std::vector hostMem(componentsCount); - CUDA_CHECK( - cudaMemcpy(hostMem.data(), devMem, componentsCount * sizeof(T), cudaMemcpyDeviceToHost)); + CUDA_TRY(cudaMemcpy(hostMem.data(), devMem, componentsCount * sizeof(T), cudaMemcpyDeviceToHost)); myPrintHostVector(variableName, hostMem.data(), componentsCount, out); } diff --git a/cpp/src/components/weak_cc.cuh b/cpp/src/components/weak_cc.cuh index 291831d2c37..d644a988117 100644 --- a/cpp/src/components/weak_cc.cuh +++ b/cpp/src/components/weak_cc.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
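// ----------------------------------------------------------------------------
// Illustration (not part of the patch): a host-side sketch of the min-label
// propagation that weak_cc_label_device (next hunk) performs with atomicMin.
// Labels start as vertex ids; every sweep pushes the smaller label across each
// edge, and the loop ends when a sweep changes nothing -- the role played on
// the device by the m flag and the fa/xa frontier arrays.
#include <algorithm>
#include <vector>

template <typename vertex_t, typename edge_t>
void weak_cc_host(std::vector<edge_t> const& offsets,
                  std::vector<vertex_t> const& indices,
                  std::vector<vertex_t>& labels)  // sized to number of vertices
{
  vertex_t n = static_cast<vertex_t>(labels.size());
  for (vertex_t v = 0; v < n; ++v) labels[v] = v;  // each vertex starts alone

  bool changed = true;
  while (changed) {
    changed = false;
    for (vertex_t v = 0; v < n; ++v) {
      for (edge_t e = offsets[v]; e < offsets[v + 1]; ++e) {
        vertex_t u  = indices[e];
        vertex_t lo = std::min(labels[v], labels[u]);
        if (labels[v] != lo || labels[u] != lo) {
          labels[v] = labels[u] = lo;  // device version: atomicMin + flags
          changed = true;
        }
      }
    }
  }
}
// ----------------------------------------------------------------------------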
@@ -25,8 +25,10 @@ #include #include +#include +#include + #include -#include "utilities/cuda_utils.cuh" #include "utils.h" namespace MLCommon { @@ -94,7 +96,7 @@ __global__ void weak_cc_label_device(vertex_t *labels, vertex_t j_ind = indices[j]; cj = labels[j_ind]; if (ci < cj) { - cugraph::atomicMin(labels + j_ind, ci); + atomicMin(labels + j_ind, ci); xa[j_ind] = true; m[0] = true; } else if (ci > cj) { @@ -104,7 +106,7 @@ __global__ void weak_cc_label_device(vertex_t *labels, } if (ci_mod) { - cugraph::atomicMin(labels + startVertexId + tid, ci); + atomicMin(labels + startVertexId + tid, ci); xa[startVertexId + tid] = true; m[0] = true; } @@ -163,22 +165,22 @@ void weak_cc_label_batched(vertex_t *labels, weak_cc_init_label_kernel <<>>(labels, startVertexId, batchSize, MAX_LABEL, filter_op); - CUDA_CHECK(cudaPeekAtLastError()); + CUDA_TRY(cudaPeekAtLastError()); int n_iters = 0; do { - CUDA_CHECK(cudaMemsetAsync(state.m, false, sizeof(bool), stream)); + CUDA_TRY(cudaMemsetAsync(state.m, false, sizeof(bool), stream)); weak_cc_label_device<<>>( labels, offsets, indices, nnz, state.fa, state.xa, state.m, startVertexId, batchSize); - CUDA_CHECK(cudaPeekAtLastError()); - CUDA_CHECK(cudaStreamSynchronize(stream)); + CUDA_TRY(cudaPeekAtLastError()); + CUDA_TRY(cudaStreamSynchronize(stream)); thrust::swap(state.fa, state.xa); //** Updating m * MLCommon::updateHost(&host_m, state.m, 1, stream); - CUDA_CHECK(cudaStreamSynchronize(stream)); + CUDA_TRY(cudaStreamSynchronize(stream)); n_iters++; } while (host_m); @@ -233,7 +235,7 @@ void weak_cc_batched(vertex_t *labels, if (startVertexId == 0) { weak_cc_init_all_kernel <<>>(labels, state.fa, state.xa, N, MAX_LABEL); - CUDA_CHECK(cudaPeekAtLastError()); + CUDA_TRY(cudaPeekAtLastError()); } weak_cc_label_batched( diff --git a/cpp/src/converters/COOtoCSR.cu b/cpp/src/converters/COOtoCSR.cu index 96143e6ba24..f52be206015 100644 --- a/cpp/src/converters/COOtoCSR.cu +++ b/cpp/src/converters/COOtoCSR.cu @@ -19,13 +19,29 @@ namespace cugraph { -template std::unique_ptr> -coo_to_csr( - experimental::GraphCOOView const &graph, - rmm::mr::device_memory_resource *); -template std::unique_ptr> -coo_to_csr( - experimental::GraphCOOView const &graph, - rmm::mr::device_memory_resource *); +// Explicit instantiation for uint32_t + float +template std::unique_ptr> coo_to_csr( + GraphCOOView const &graph, rmm::mr::device_memory_resource *); + +// Explicit instantiation for uint32_t + double +template std::unique_ptr> +coo_to_csr(GraphCOOView const &graph, + rmm::mr::device_memory_resource *); + +// Explicit instantiation for int + float +template std::unique_ptr> coo_to_csr( + GraphCOOView const &graph, rmm::mr::device_memory_resource *); + +// Explicit instantiation for int + double +template std::unique_ptr> coo_to_csr( + GraphCOOView const &graph, rmm::mr::device_memory_resource *); + +// Explicit instantiation for int64_t + float +template std::unique_ptr> coo_to_csr( + GraphCOOView const &graph, rmm::mr::device_memory_resource *); + +// Explicit instantiation for int64_t + double +template std::unique_ptr> coo_to_csr( + GraphCOOView const &graph, rmm::mr::device_memory_resource *); } // namespace cugraph diff --git a/cpp/src/converters/COOtoCSR.cuh b/cpp/src/converters/COOtoCSR.cuh index 5ba884f4a74..f636e387aa1 100644 --- a/cpp/src/converters/COOtoCSR.cuh +++ b/cpp/src/converters/COOtoCSR.cuh @@ -31,7 +31,7 @@ #include #include -#include +#include #include #include @@ -60,7 +60,7 @@ namespace detail { * @param[out] result Total number of vertices */ template 
-VT sort(experimental::GraphCOOView& graph, cudaStream_t stream) +VT sort(GraphCOOView& graph, cudaStream_t stream) { VT max_src_id; VT max_dst_id; @@ -111,8 +111,10 @@ void fill_offset( VT id = source[index]; if (id != source[index - 1]) { offsets[id] = index; } }); - ET zero = 0; - CUDA_TRY(cudaMemcpy(offsets, &zero, sizeof(ET), cudaMemcpyDefault)); + thrust::device_ptr src = thrust::device_pointer_cast(source); + thrust::device_ptr off = thrust::device_pointer_cast(offsets); + off[src[0]] = ET{0}; + auto iter = thrust::make_reverse_iterator(offsets + number_of_vertices + 1); thrust::inclusive_scan(rmm::exec_policy(stream)->on(stream), iter, @@ -141,13 +143,10 @@ rmm::device_buffer create_offset(VT* source, } // namespace detail template -std::unique_ptr> coo_to_csr( - experimental::GraphCOOView const& graph, rmm::mr::device_memory_resource* mr) +std::unique_ptr> coo_to_csr(GraphCOOView const& graph, + rmm::mr::device_memory_resource* mr) { cudaStream_t stream{nullptr}; - using experimental::GraphCOO; - using experimental::GraphCOOView; - using experimental::GraphSparseContents; GraphCOO temp_graph(graph, stream, mr); GraphCOOView temp_graph_view = temp_graph.view(); @@ -162,12 +161,11 @@ std::unique_ptr> coo_to_csr( std::move(coo_contents.dst_indices), std::move(coo_contents.edge_data)}; - return std::make_unique>(std::move(csr_contents)); + return std::make_unique>(std::move(csr_contents)); } template -void coo_to_csr_inplace(experimental::GraphCOOView& graph, - experimental::GraphCSRView& result) +void coo_to_csr_inplace(GraphCOOView& graph, GraphCSRView& result) { cudaStream_t stream{nullptr}; diff --git a/cpp/src/converters/permute_graph.cuh b/cpp/src/converters/permute_graph.cuh index edf97ddc212..b5b2de83e9b 100644 --- a/cpp/src/converters/permute_graph.cuh +++ b/cpp/src/converters/permute_graph.cuh @@ -14,8 +14,8 @@ * limitations under the License. */ #include -#include #include +#include #include "converters/COOtoCSR.cuh" #include "utilities/graph_utils.cuh" @@ -42,9 +42,9 @@ struct permutation_functor { * @return The permuted graph. */ template -void permute_graph(experimental::GraphCSRView const &graph, +void permute_graph(GraphCSRView const &graph, vertex_t const *permutation, - experimental::GraphCSRView result, + GraphCSRView result, cudaStream_t stream = 0) { // Create a COO out of the CSR @@ -76,7 +76,7 @@ void permute_graph(experimental::GraphCSRView const d_dst, pf); - cugraph::experimental::GraphCOOView graph_coo; + GraphCOOView graph_coo; graph_coo.number_of_vertices = graph.number_of_vertices; graph_coo.number_of_edges = graph.number_of_edges; diff --git a/cpp/src/converters/renumber.cuh b/cpp/src/converters/renumber.cuh index 02ce10a1f20..263d7199c10 100644 --- a/cpp/src/converters/renumber.cuh +++ b/cpp/src/converters/renumber.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -27,10 +27,11 @@ #include #include +#include #include +#include #include "sort/bitonic.cuh" -#include "utilities/error_utils.h" #include "utilities/graph_utils.cuh" namespace cugraph { diff --git a/cpp/src/cores/core_number.cu b/cpp/src/cores/core_number.cu index f3770147db8..40b1b7bf943 100644 --- a/cpp/src/cores/core_number.cu +++ b/cpp/src/cores/core_number.cu @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. 
* * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -14,19 +14,18 @@ * limitations under the License. */ -#include #include -#include #include #include #include +#include //#include namespace cugraph { namespace detail { template -void core_number(experimental::GraphCSRView const &graph, int *core_number) +void core_number(GraphCSRView const &graph, int *core_number) { using HornetGraph = hornet::gpu::HornetStatic; using HornetInit = hornet::HornetInit; @@ -53,8 +52,8 @@ struct FilterEdges { }; template -void extract_edges(experimental::GraphCOOView const &i_graph, - experimental::GraphCOOView &o_graph, +void extract_edges(GraphCOOView const &i_graph, + GraphCOOView &o_graph, VT *d_core, int k) { @@ -97,8 +96,8 @@ void extract_edges(experimental::GraphCOOView const &i_graph, // i.e. All edges (s,d,w) in in_graph are copied over to out_graph // if core_num[s] and core_num[d] are greater than or equal to k. template -std::unique_ptr> extract_subgraph( - experimental::GraphCOOView const &in_graph, +std::unique_ptr> extract_subgraph( + GraphCOOView const &in_graph, int const *vid, int const *core_num, int k, @@ -120,7 +119,7 @@ std::unique_ptr> extract_subgraph( auto edge = thrust::make_zip_iterator(thrust::make_tuple(in_graph.src_indices, in_graph.dst_indices)); - auto out_graph = std::make_unique>( + auto out_graph = std::make_unique>( in_graph.number_of_vertices, thrust::count_if(rmm::exec_policy(stream)->on(stream), edge, @@ -130,7 +129,7 @@ std::unique_ptr> extract_subgraph( stream, mr); - experimental::GraphCOOView out_graph_view = out_graph->view(); + GraphCOOView out_graph_view = out_graph->view(); extract_edges(in_graph, out_graph_view, d_sorted_core_num, k); return out_graph; @@ -139,19 +138,18 @@ std::unique_ptr> extract_subgraph( } // namespace detail template -void core_number(experimental::GraphCSRView const &graph, VT *core_number) +void core_number(GraphCSRView const &graph, VT *core_number) { return detail::core_number(graph, core_number); } template -std::unique_ptr> k_core( - experimental::GraphCOOView const &in_graph, - int k, - VT const *vertex_id, - VT const *core_number, - VT num_vertex_ids, - rmm::mr::device_memory_resource *mr) +std::unique_ptr> k_core(GraphCOOView const &in_graph, + int k, + VT const *vertex_id, + VT const *core_number, + VT num_vertex_ids, + rmm::mr::device_memory_resource *mr) { CUGRAPH_EXPECTS(vertex_id != nullptr, "Invalid API parameter: vertex_id is NULL"); CUGRAPH_EXPECTS(core_number != nullptr, "Invalid API parameter: core_number is NULL"); @@ -160,21 +158,21 @@ std::unique_ptr> k_core( return detail::extract_subgraph(in_graph, vertex_id, core_number, k, num_vertex_ids, mr); } -template void core_number( - experimental::GraphCSRView const &, int32_t *core_number); -template std::unique_ptr> -k_core(experimental::GraphCOOView const &, - int, - int32_t const *, - int32_t const *, - int32_t, - rmm::mr::device_memory_resource *); -template std::unique_ptr> -k_core(experimental::GraphCOOView const &, - int, - int32_t const *, - int32_t const *, - int32_t, - rmm::mr::device_memory_resource *); +template void core_number(GraphCSRView const &, + int32_t *core_number); +template std::unique_ptr> k_core( + GraphCOOView const &, + int, + int32_t const *, + int32_t const *, + int32_t, + rmm::mr::device_memory_resource *); +template std::unique_ptr> k_core( + GraphCOOView const &, + int, + int32_t const *, + int32_t const *, + int32_t, + rmm::mr::device_memory_resource *); 
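// ----------------------------------------------------------------------------
// Illustration (not part of the patch): host form of the k-core edge filter
// that detail::extract_edges/extract_subgraph above apply with thrust --
// an edge (s, d) survives only when both endpoints have core number >= k.
#include <cstddef>
#include <vector>

template <typename vertex_t>
void filter_k_core_edges(std::vector<vertex_t> const& src,
                         std::vector<vertex_t> const& dst,
                         std::vector<vertex_t> const& core_num,  // per vertex
                         int k,
                         std::vector<vertex_t>& out_src,
                         std::vector<vertex_t>& out_dst)
{
  for (std::size_t i = 0; i < src.size(); ++i) {
    if (core_num[src[i]] >= k && core_num[dst[i]] >= k) {
      out_src.push_back(src[i]);  // thrust::copy_if keeps the same edges,
      out_dst.push_back(dst[i]);  // after a count_if sizes the output graph
    }
  }
}
// ----------------------------------------------------------------------------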
} // namespace cugraph diff --git a/cpp/src/db/db_object.cu b/cpp/src/db/db_object.cu index 391df5e6dbd..31c149f3503 100644 --- a/cpp/src/db/db_object.cu +++ b/cpp/src/db/db_object.cu @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -14,12 +14,16 @@ * limitations under the License. */ -#include +#include + +#include + +#include #include + #include -#include #include -#include + #include namespace cugraph { diff --git a/cpp/src/db/db_object.cuh b/cpp/src/db/db_object.cuh index fe007a69020..a9b1f461f85 100644 --- a/cpp/src/db/db_object.cuh +++ b/cpp/src/db/db_object.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. diff --git a/cpp/src/db/db_operators.cu b/cpp/src/db/db_operators.cu index c6d7163a47f..d67f7ef9140 100644 --- a/cpp/src/db/db_operators.cu +++ b/cpp/src/db/db_operators.cu @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -14,11 +14,14 @@ * limitations under the License. */ -#include -#include -#include #include +#include + +#include + +#include + namespace cugraph { namespace db { template diff --git a/cpp/src/db/db_operators.cuh b/cpp/src/db/db_operators.cuh index f960a465099..6a2e8322069 100644 --- a/cpp/src/db/db_operators.cuh +++ b/cpp/src/db/db_operators.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. diff --git a/cpp/src/db/db_parser_integration_test.cu b/cpp/src/db/db_parser_integration_test.cu index e1539910bc5..aa395bf8a4c 100644 --- a/cpp/src/db/db_parser_integration_test.cu +++ b/cpp/src/db/db_parser_integration_test.cu @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. diff --git a/cpp/src/db/db_parser_integration_test.cuh b/cpp/src/db/db_parser_integration_test.cuh index 517c79dd5f4..63da8805164 100644 --- a/cpp/src/db/db_parser_integration_test.cuh +++ b/cpp/src/db/db_parser_integration_test.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
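// Note on the layout changes that follow: the hunks below consistently swap
// CUDA_CHECK_LAST() for CHECK_CUDA(stream), with FIXMEs marking kernels that
// do not yet actually launch on that stream. A minimal sketch of what such a
// stream-aware check can look like -- an assumed shape for illustration, not
// the macro actually shipped in the error-handling headers:
//
//   #define CHECK_CUDA(stream)                          \
//     do {                                              \
//       cudaError_t const status = cudaGetLastError();  \
//       if (status != cudaSuccess) {                    \
//         CUGRAPH_FAIL(cudaGetErrorString(status));     \
//       }                                               \
//     } while (0)
//
// Threading the stream through every call site now keeps the code ready for a
// stream-synchronizing (debug) variant without another round of signature churn.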
diff --git a/cpp/src/layout/barnes_hut.hpp b/cpp/src/layout/barnes_hut.hpp index dab98642c91..f8c200648e1 100644 --- a/cpp/src/layout/barnes_hut.hpp +++ b/cpp/src/layout/barnes_hut.hpp @@ -17,7 +17,7 @@ #pragma once #include -#include +#include #include #include @@ -33,7 +33,7 @@ namespace cugraph { namespace detail { template -void barnes_hut(experimental::GraphCOOView &graph, +void barnes_hut(GraphCOOView &graph, float *pos, const int max_iter = 1000, float *x_start = nullptr, @@ -74,8 +74,11 @@ void barnes_hut(experimental::GraphCOOView &graph, int *bottomd = d_bottomd.data().get(); float *radiusd = d_radiusd.data().get(); + cudaStream_t stream = {nullptr}; + + // FIXME: this should work on "stream" InitializationKernel<<<1, 1>>>(limiter, maxdepthd, radiusd); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); const int FOUR_NNODES = 4 * nnodes; const int FOUR_N = 4 * n; @@ -147,11 +150,11 @@ void barnes_hut(experimental::GraphCOOView &graph, traction = d_traction.data().get(); // Sort COO for coalesced memory access. - cudaStream_t stream = {nullptr}; sort(graph, stream); - CUDA_CHECK_LAST(); - graph.degree(massl, cugraph::experimental::DegreeDirection::OUT); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); + // FIXME: this should work on "stream" + graph.degree(massl, cugraph::DegreeDirection::OUT); + CHECK_CUDA(stream); const vertex_t *row = graph.src_indices; const vertex_t *col = graph.dst_indices; @@ -194,9 +197,11 @@ void barnes_hut(experimental::GraphCOOView &graph, fill(n, swinging, 0.f); fill(n, traction, 0.f); + // FIXME: this should work on "stream" ResetKernel<<<1, 1>>>(radiusd_squared, bottomd, NNODES, radiusd); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); + // FIXME: this should work on "stream" // Compute bounding box arround all bodies BoundingBoxKernel<<>>(startl, childl, @@ -212,28 +217,34 @@ void barnes_hut(experimental::GraphCOOView &graph, n, limiter, radiusd); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); + // FIXME: this should work on "stream" ClearKernel1<<>>(childl, FOUR_NNODES, FOUR_N); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); + // FIXME: this should work on "stream" // Build quadtree TreeBuildingKernel<<>>( childl, nodes_pos, nodes_pos + nnodes + 1, NNODES, n, maxdepthd, bottomd, radiusd); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); + // FIXME: this should work on "stream" ClearKernel2<<>>(startl, massl, NNODES, bottomd); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); + // FIXME: this should work on "stream" // Summarizes mass and position for each cell, bottom up approach SummarizationKernel<<>>( countl, childl, massl, nodes_pos, nodes_pos + nnodes + 1, NNODES, n, bottomd); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); + // FIXME: this should work on "stream" // Group closed bodies together, used to speed up Repulsion kernel SortKernel<<>>(sortl, countl, startl, childl, NNODES, n, bottomd); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); + // FIXME: this should work on "stream" // Force computation O(n . 
log(n)) RepulsionKernel<<>>(scaling_ratio, theta, @@ -251,7 +262,7 @@ void barnes_hut(experimental::GraphCOOView &graph, n, radiusd_squared, maxdepthd); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); apply_gravity(nodes_pos, nodes_pos + nnodes + 1, @@ -324,7 +335,7 @@ void barnes_hut(experimental::GraphCOOView &graph, copy(n, nodes_pos, pos); copy(n, nodes_pos + nnodes + 1, pos + n); - if (callback) callback->on_epoch_end(nodes_pos); + if (callback) callback->on_train_end(nodes_pos); } } // namespace detail diff --git a/cpp/src/layout/exact_fa2.hpp b/cpp/src/layout/exact_fa2.hpp index d138b5dd57c..e9f73e04cd5 100644 --- a/cpp/src/layout/exact_fa2.hpp +++ b/cpp/src/layout/exact_fa2.hpp @@ -17,7 +17,7 @@ #pragma once #include -#include +#include #include #include @@ -32,7 +32,7 @@ namespace cugraph { namespace detail { template -void exact_fa2(experimental::GraphCOOView &graph, +void exact_fa2(GraphCOOView &graph, float *pos, const int max_iter = 500, float *x_start = nullptr, @@ -84,9 +84,10 @@ void exact_fa2(experimental::GraphCOOView &graph, // Sort COO for coalesced memory access. cudaStream_t stream = {nullptr}; sort(graph, stream); - CUDA_CHECK_LAST(); - graph.degree(d_mass, cugraph::experimental::DegreeDirection::OUT); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); + // FIXME: this function should work on "stream" + graph.degree(d_mass, cugraph::DegreeDirection::OUT); + CHECK_CUDA(stream); const vertex_t *row = graph.src_indices; const vertex_t *col = graph.dst_indices; diff --git a/cpp/src/layout/exact_repulsion.hpp b/cpp/src/layout/exact_repulsion.hpp index 1a7db88f782..713ac654326 100644 --- a/cpp/src/layout/exact_repulsion.hpp +++ b/cpp/src/layout/exact_repulsion.hpp @@ -62,9 +62,10 @@ void apply_repulsion(const float *restrict x_pos, dim3 nblocks(min((n + nthreads.x - 1) / nthreads.x, CUDA_MAX_BLOCKS_2D), min((n + nthreads.y - 1) / nthreads.y, CUDA_MAX_BLOCKS_2D)); + // FIXME: apply repulsion should take stream as an input argument repulsion_kernel <<>>(x_pos, y_pos, repel_x, repel_y, mass, scaling_ratio, n); - CUDA_CHECK_LAST(); + CHECK_CUDA(nullptr); } } // namespace detail diff --git a/cpp/src/layout/fa2_kernels.hpp b/cpp/src/layout/fa2_kernels.hpp index 7ecbb961000..06e73c3dda4 100644 --- a/cpp/src/layout/fa2_kernels.hpp +++ b/cpp/src/layout/fa2_kernels.hpp @@ -23,20 +23,19 @@ namespace cugraph { namespace detail { template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) - attraction_kernel(const vertex_t *restrict row, - const vertex_t *restrict col, - const weight_t *restrict v, - const edge_t e, - const float *restrict x_pos, - const float *restrict y_pos, - float *restrict attract_x, - float *restrict attract_y, - const int *restrict mass, - bool outbound_attraction_distribution, - bool lin_log_mode, - const float edge_weight_influence, - const float coef) +__global__ void attraction_kernel(const vertex_t *restrict row, + const vertex_t *restrict col, + const weight_t *restrict v, + const edge_t e, + const float *restrict x_pos, + const float *restrict y_pos, + float *restrict attract_x, + float *restrict attract_y, + const int *restrict mass, + bool outbound_attraction_distribution, + bool lin_log_mode, + const float edge_weight_influence, + const float coef) { vertex_t i, src, dst; weight_t weight = 1; @@ -112,18 +111,17 @@ void apply_attraction(const vertex_t *restrict row, edge_weight_influence, coef); - CUDA_CHECK_LAST(); + CHECK_CUDA(nullptr); } template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) - linear_gravity_kernel(const float *restrict 
x_pos, - const float *restrict y_pos, - float *restrict attract_x, - float *restrict attract_y, - const int *restrict mass, - const float gravity, - const vertex_t n) +__global__ void linear_gravity_kernel(const float *restrict x_pos, + const float *restrict y_pos, + float *restrict attract_x, + float *restrict attract_y, + const int *restrict mass, + const float gravity, + const vertex_t n) { // For every node. for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += gridDim.x * blockDim.x) { @@ -137,15 +135,14 @@ __global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) } template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) - strong_gravity_kernel(const float *restrict x_pos, - const float *restrict y_pos, - float *restrict attract_x, - float *restrict attract_y, - const int *restrict mass, - const float gravity, - const float scaling_ratio, - const vertex_t n) +__global__ void strong_gravity_kernel(const float *restrict x_pos, + const float *restrict y_pos, + float *restrict attract_x, + float *restrict attract_y, + const int *restrict mass, + const float gravity, + const float scaling_ratio, + const vertex_t n) { // For every node. for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += gridDim.x * blockDim.x) { @@ -183,21 +180,20 @@ void apply_gravity(const float *restrict x_pos, else linear_gravity_kernel <<>>(x_pos, y_pos, attract_x, attract_y, mass, gravity, n); - CUDA_CHECK_LAST(); + CHECK_CUDA(nullptr); } template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) - local_speed_kernel(const float *restrict repel_x, - const float *restrict repel_y, - const float *restrict attract_x, - const float *restrict attract_y, - const float *restrict old_dx, - const float *restrict old_dy, - const int *restrict mass, - float *restrict swinging, - float *restrict traction, - const vertex_t n) +__global__ void local_speed_kernel(const float *restrict repel_x, + const float *restrict repel_y, + const float *restrict attract_x, + const float *restrict attract_y, + const float *restrict old_dx, + const float *restrict old_dy, + const int *restrict mass, + float *restrict swinging, + float *restrict traction, + const vertex_t n) { // For every node. for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += gridDim.x * blockDim.x) { @@ -232,7 +228,7 @@ void compute_local_speed(const float *restrict repel_x, local_speed_kernel<<>>( repel_x, repel_y, attract_x, attract_y, old_dx, old_dy, mass, swinging, traction, n); - CUDA_CHECK_LAST(); + CHECK_CUDA(nullptr); } template @@ -272,18 +268,17 @@ void adapt_speed(const float jitter_tolerance, } template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) - update_positions_kernel(float *restrict x_pos, - float *restrict y_pos, - const float *restrict repel_x, - const float *restrict repel_y, - const float *restrict attract_x, - const float *restrict attract_y, - float *restrict old_dx, - float *restrict old_dy, - const float *restrict swinging, - const float speed, - const vertex_t n) +__global__ void update_positions_kernel(float *restrict x_pos, + float *restrict y_pos, + const float *restrict repel_x, + const float *restrict repel_y, + const float *restrict attract_x, + const float *restrict attract_y, + float *restrict old_dx, + float *restrict old_dy, + const float *restrict swinging, + const float speed, + const vertex_t n) { // For every node. 
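// (Reviewer note: this is the standard CUDA grid-stride idiom used by all of
// the kernels in this file -- each thread starts at its global index and
// advances by the total number of launched threads, so any grid size covers
// all n vertices exactly once.)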
for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += gridDim.x * blockDim.x) { @@ -321,7 +316,7 @@ void apply_forces(float *restrict x_pos, update_positions_kernel<<>>( x_pos, y_pos, repel_x, repel_y, attract_x, attract_y, old_dx, old_dy, swinging, speed, n); - CUDA_CHECK_LAST(); + CHECK_CUDA(nullptr); } } // namespace detail diff --git a/cpp/src/layout/force_atlas2.cu b/cpp/src/layout/force_atlas2.cu index 59a5c58aa73..15ac8120ce5 100644 --- a/cpp/src/layout/force_atlas2.cu +++ b/cpp/src/layout/force_atlas2.cu @@ -20,7 +20,7 @@ namespace cugraph { template -void force_atlas2(experimental::GraphCOOView &graph, +void force_atlas2(GraphCOOView &graph, float *pos, const int max_iter, float *x_start, @@ -77,7 +77,7 @@ void force_atlas2(experimental::GraphCOOView &graph, } } -template void force_atlas2(experimental::GraphCOOView &graph, +template void force_atlas2(GraphCOOView &graph, float *pos, const int max_iter, float *x_start, @@ -95,7 +95,7 @@ template void force_atlas2(experimental::GraphCOOView(experimental::GraphCOOView &graph, +template void force_atlas2(GraphCOOView &graph, float *pos, const int max_iter, float *x_start, diff --git a/cpp/src/layout/utils.hpp b/cpp/src/layout/utils.hpp index e26f93e8f71..7d639660831 100644 --- a/cpp/src/layout/utils.hpp +++ b/cpp/src/layout/utils.hpp @@ -16,6 +16,8 @@ #pragma once +#include + #include namespace cugraph { diff --git a/cpp/src/link_analysis/gunrock_hits.cpp b/cpp/src/link_analysis/gunrock_hits.cpp new file mode 100644 index 00000000000..84c6036ad70 --- /dev/null +++ b/cpp/src/link_analysis/gunrock_hits.cpp @@ -0,0 +1,102 @@ +/* + * Copyright (c) 2020, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +/** + * ---------------------------------------------------------------------------* + * @brief wrapper calling gunrock's HITS analytic + * --------------------------------------------------------------------------*/ + +#include +#include + +#include + +#include + +namespace cugraph { + +namespace gunrock { + +template +void hits(cugraph::GraphCSRView const &graph, + int max_iter, + weight_t tolerance, + weight_t const *starting_value, + bool normalized, + weight_t *hubs, + weight_t *authorities) +{ + CUGRAPH_EXPECTS(hubs != nullptr, "Invalid API parameter: hubs array should be of size V"); + CUGRAPH_EXPECTS(authorities != nullptr, + "Invalid API parameter: authorities array should be of size V"); + + // + // NOTE: gunrock doesn't support tolerance parameter + // gunrock doesn't support passing a starting value + // gunrock doesn't support the normalized parameter + // + // FIXME: gunrock uses a 2-norm, while networkx uses a 1-norm. + // They will add a parameter to allow us to specify + // which norm to use. 
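// Reviewer sketch of the norm gap flagged above (illustration only): a result
// normalized to unit 2-norm can be rescaled to networkx's unit 1-norm on the
// host after the fact, assuming <numeric> and <cmath> are available:
//
//   double l1 = std::accumulate(local_hubs.begin(), local_hubs.end(), 0.0,
//                               [](double acc, double v) { return acc + std::abs(v); });
//   if (l1 > 0)
//     for (auto &v : local_hubs) v /= l1;  // hubs now comparable to networkx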
+ // + std::vector local_offsets(graph.number_of_vertices + 1); + std::vector local_indices(graph.number_of_edges); + std::vector local_hubs(graph.number_of_vertices); + std::vector local_authorities(graph.number_of_vertices); + + // Ideally: + // + //::hits(graph.number_of_vertices, graph.number_of_edges, graph.offsets, graph.indices, + // max_iter, hubs, authorities, DEVICE); + // + // For now, the following: + + CUDA_TRY(cudaMemcpy(local_offsets.data(), + graph.offsets, + (graph.number_of_vertices + 1) * sizeof(edge_t), + cudaMemcpyDeviceToHost)); + CUDA_TRY(cudaMemcpy(local_indices.data(), + graph.indices, + graph.number_of_edges * sizeof(vertex_t), + cudaMemcpyDeviceToHost)); + + ::hits(graph.number_of_vertices, + graph.number_of_edges, + local_offsets.data(), + local_indices.data(), + max_iter, + local_hubs.data(), + local_authorities.data()); + + CUDA_TRY(cudaMemcpy( + hubs, local_hubs.data(), graph.number_of_vertices * sizeof(weight_t), cudaMemcpyHostToDevice)); + CUDA_TRY(cudaMemcpy(authorities, + local_authorities.data(), + graph.number_of_vertices * sizeof(weight_t), + cudaMemcpyHostToDevice)); +} + +template void hits(cugraph::GraphCSRView const &, + int, + float, + float const *, + bool, + float *, + float *); + +} // namespace gunrock + +} // namespace cugraph diff --git a/cpp/src/link_analysis/pagerank.cu b/cpp/src/link_analysis/pagerank.cu index b989c46cb07..e5da24e328d 100644 --- a/cpp/src/link_analysis/pagerank.cu +++ b/cpp/src/link_analysis/pagerank.cu @@ -22,13 +22,16 @@ #include #include "cub/cub.cuh" -#include +#include #include -#include +#include #include +#include "pagerank_1D.cuh" #include "utilities/graph_utils.cuh" +#include + namespace cugraph { namespace detail { @@ -37,7 +40,8 @@ namespace detail { #endif template -bool pagerankIteration(IndexType n, +bool pagerankIteration(raft::handle_t const &handle, + IndexType n, IndexType e, IndexType const *cscPtr, IndexType const *cscInd, @@ -55,6 +59,14 @@ bool pagerankIteration(IndexType n, ValueType *residual) { ValueType dot_res; +//#if defined(CUDART_VERSION) and CUDART_VERSION >= 11000 +#if 1 + { + raft::matrix::sparse_matrix_t const r_csr_m{ + handle, cscPtr, cscInd, cscVal, n, e}; + r_csr_m.mv(1.0, tmp, 0.0, pr); + } +#else CUDA_TRY(cub::DeviceSpmv::CsrMV(cub_d_temp_storage, cub_temp_storage_bytes, cscVal, @@ -65,7 +77,7 @@ bool pagerankIteration(IndexType n, n, n, e)); - +#endif scal(n, alpha, pr); dot_res = dot(n, a, tmp); axpy(n, dot_res, b, pr); @@ -92,7 +104,8 @@ bool pagerankIteration(IndexType n, } template -int pagerankSolver(IndexType n, +int pagerankSolver(raft::handle_t const &handle, + IndexType n, IndexType e, IndexType const *cscPtr, IndexType const *cscInd, @@ -142,7 +155,8 @@ int pagerankSolver(IndexType n, rmm::device_vector tmp(n); tmp_d = pr.data().get(); #endif - CUDA_CHECK_LAST(); + // FIXME: this should take a passed CUDA stream instead of default nullptr + CHECK_CUDA(nullptr); if (!has_guess) { fill(n, pagerank_vector, randomProbability); @@ -165,6 +179,14 @@ int pagerankSolver(IndexType n, } update_dangling_nodes(n, a, alpha); +//#if defined(CUDART_VERSION) and CUDART_VERSION >= 11000 +#if 1 + { + raft::matrix::sparse_matrix_t const r_csr_m{ + handle, cscPtr, cscInd, cscVal, n, e}; + r_csr_m.mv(1.0, tmp_d, 0.0, pagerank_vector); + } +#else CUDA_TRY(cub::DeviceSpmv::CsrMV(cub_d_temp_storage, cub_temp_storage_bytes, cscVal, @@ -175,6 +197,7 @@ int pagerankSolver(IndexType n, n, n, e)); +#endif // Allocate temporary storage rmm::device_buffer cub_temp_storage(cub_temp_storage_bytes);
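// (Reviewer note: in dense-math terms, each call into pagerankIteration below
// computes
//    pr <- alpha * (A * tmp) + (a . tmp) * b
// where A is the substochastic transition matrix held in CSC form, `a` flags
// dangling nodes, and `b` is the uniform -- or personalization -- vector. The
// SpMV is the raft mv() call, scal() applies the damping factor alpha, and
// axpy() folds the dangling-node mass back in.)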
cub_d_temp_storage = cub_temp_storage.data(); @@ -191,7 +214,8 @@ int pagerankSolver(IndexType n, while (!converged && i < max_it) { i++; - converged = pagerankIteration(n, + converged = pagerankIteration(handle, + n, e, cscPtr, cscInd, @@ -225,7 +249,8 @@ int pagerankSolver(IndexType n, // template int pagerankSolver ( int n, int e, int *cscPtr, int *cscInd,half *cscVal, // half alpha, half *a, bool has_guess, float tolerance, int max_iter, half * &pagerank_vector, half // * &residual); -template int pagerankSolver(int n, +template int pagerankSolver(raft::handle_t const &handle, + int n, int e, int const *cscPtr, int const *cscInd, @@ -241,7 +266,8 @@ template int pagerankSolver(int n, int max_iter, float *&pagerank_vector, float *&residual); -template int pagerankSolver(int n, +template int pagerankSolver(raft::handle_t const &handle, + int n, int e, const int *cscPtr, int const *cscInd, @@ -259,14 +285,15 @@ template int pagerankSolver(int n, double *&residual); template -void pagerank_impl(experimental::GraphCSCView const &graph, +void pagerank_impl(raft::handle_t const &handle, + GraphCSCView const &graph, WT *pagerank, VT personalization_subset_size = 0, VT *personalization_subset = nullptr, WT *personalization_values = nullptr, double alpha = 0.85, - double tolerance = 1e-4, - int64_t max_iter = 200, + double tolerance = 1e-5, + int64_t max_iter = 100, bool has_guess = false) { bool has_personalization = false; @@ -310,7 +337,8 @@ void pagerank_impl(experimental::GraphCSCView const &graph, if (has_guess) { copy(m, (WT *)pagerank, d_pr); } - status = pagerankSolver(m, + status = pagerankSolver(handle, + m, nnz, graph.offsets, graph.indices, @@ -330,7 +358,7 @@ void pagerank_impl(experimental::GraphCSCView const &graph, switch (status) { case 0: break; case -1: CUGRAPH_FAIL("Error : bad parameters in Pagerank"); - case 1: CUGRAPH_FAIL("Warning : Pagerank did not reached the desired tolerance"); + case 1: break; // Warning: Pagerank did not reach the desired tolerance default: CUGRAPH_FAIL("Pagerank exec failed"); } @@ -339,7 +367,8 @@ void pagerank_impl(experimental::GraphCSCView const &graph, } // namespace detail template -void pagerank(experimental::GraphCSCView const &graph, +void pagerank(raft::handle_t const &handle, + GraphCSCView const &graph, WT *pagerank, VT personalization_subset_size, VT *personalization_subset, @@ -350,20 +379,37 @@ void pagerank(experimental::GraphCSCView const &graph, bool has_guess) { CUGRAPH_EXPECTS(pagerank != nullptr, "Invalid API parameter: Pagerank array should be of size V"); - - return detail::pagerank_impl(graph, - pagerank, - personalization_subset_size, - personalization_subset, - personalization_values, - alpha, - tolerance, - max_iter, - has_guess); + // Multi-GPU + if (handle.comms_initialized()) { + CUGRAPH_EXPECTS(has_guess == false, + "Invalid API parameter: Multi-GPU Pagerank does not support an initial " + "guess, please use the single GPU version for this feature"); + CUGRAPH_EXPECTS(max_iter > 0, "The number of iterations must be positive"); + cugraph::mg::pagerank(handle, + graph, + pagerank, + personalization_subset_size, + personalization_subset, + personalization_values, + alpha, + max_iter, + tolerance); + } else // Single GPU + return detail::pagerank_impl(handle, + graph, + pagerank, + personalization_subset_size, + personalization_subset, + personalization_values, + alpha, + tolerance, + max_iter, + has_guess); } // explicit instantiation -template void pagerank(experimental::GraphCSCView const &graph, +template void pagerank(raft::handle_t
const &handle, + GraphCSCView const &graph, float *pagerank, int personalization_subset_size, int *personalization_subset, @@ -372,7 +418,8 @@ template void pagerank(experimental::GraphCSCView(experimental::GraphCSCView const &graph, +template void pagerank(raft::handle_t const &handle, + GraphCSCView const &graph, double *pagerank, int personalization_subset_size, int *personalization_subset, diff --git a/cpp/src/link_analysis/pagerank_1D.cu b/cpp/src/link_analysis/pagerank_1D.cu new file mode 100644 index 00000000000..27780626480 --- /dev/null +++ b/cpp/src/link_analysis/pagerank_1D.cu @@ -0,0 +1,186 @@ +/* + * Copyright (c) 2020, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +// Author: Alex Fender afender@nvidia.com + +#include +#include +#include "pagerank_1D.cuh" +#include "utilities/graph_utils.cuh" + +namespace cugraph { +namespace mg { + +template +__global__ void transition_kernel(const size_t e, const VT *ind, const VT *degree, WT *val) +{ + for (auto i = threadIdx.x + blockIdx.x * blockDim.x; i < e; i += gridDim.x * blockDim.x) + val[i] = 1.0 / degree[ind[i]]; // Degree contains IN degree. So all degree[ind[i]] were + // incremented by definition (no div by 0). +} + +template +Pagerank::Pagerank(const raft::handle_t &handle_, GraphCSCView const &G) + : comm(handle_.get_comms()), + bookmark(G.number_of_vertices), + prev_pr(G.number_of_vertices), + val(G.local_edges[comm.get_rank()]), + handle(handle_), + has_personalization(false) +{ + v_glob = G.number_of_vertices; + v_loc = G.local_vertices[comm.get_rank()]; + e_loc = G.local_edges[comm.get_rank()]; + part_off = G.local_offsets; + local_vertices = G.local_vertices; + off = G.offsets; + ind = G.indices; + blocks = handle_.get_device_properties().maxGridSize[0]; + threads = handle_.get_device_properties().maxThreadsPerBlock; + sm_count = handle_.get_device_properties().multiProcessorCount; + + is_setup = false; +} + +template +Pagerank::~Pagerank() +{ +} + +template +void Pagerank::transition_vals(const VT *degree) +{ + if (e_loc > 0) { + int threads = std::min(e_loc, this->threads); + int blocks = std::min(32 * sm_count, this->blocks); + transition_kernel<<>>(e_loc, ind, degree, val.data().get()); + CHECK_CUDA(nullptr); + } +} + +template +void Pagerank::flag_leafs(const VT *degree) +{ + if (v_glob > 0) { + int threads = std::min(v_glob, this->threads); + int blocks = std::min(32 * sm_count, this->blocks); + cugraph::detail::flag_leafs_kernel + <<>>(v_glob, degree, bookmark.data().get()); + CHECK_CUDA(nullptr); + } +} + +// Artificially create the google matrix by setting val and bookmark +template +void Pagerank::setup(WT _alpha, + VT *degree, + VT personalization_subset_size, + VT *personalization_subset, + WT *personalization_values) +{ + if (!is_setup) { + alpha = _alpha; + WT zero = 0.0; + WT one = 1.0; + // Update dangling node vector + cugraph::detail::fill(v_glob, bookmark.data().get(), zero); + flag_leafs(degree); + cugraph::detail::update_dangling_nodes(v_glob, 
bookmark.data().get(), alpha); + + // Transition matrix + transition_vals(degree); + + // personalize + if (personalization_subset_size != 0) { + CUGRAPH_EXPECTS(personalization_subset != nullptr, + "Invalid API parameter: personalization_subset array should be of size " + "personalization_subset_size"); + CUGRAPH_EXPECTS(personalization_values != nullptr, + "Invalid API parameter: personalization_values array should be of size " + "personalization_subset_size"); + CUGRAPH_EXPECTS(personalization_subset_size <= v_glob, + "Personalization size should not exceed V"); + + WT sum = cugraph::detail::nrm1(personalization_subset_size, personalization_values); + if (sum != zero) { + has_personalization = true; + personalization_vector.resize(v_glob); + cugraph::detail::fill(v_glob, personalization_vector.data().get(), zero); + cugraph::detail::scal(v_glob, one / sum, personalization_values); + cugraph::detail::scatter(personalization_subset_size, + personalization_values, + personalization_vector.data().get(), + personalization_subset); + } + } + is_setup = true; + } else + CUGRAPH_FAIL("MG PageRank : Setup can be called only once"); +} + +// run the power iteration on the google matrix +template +int Pagerank::solve(int max_iter, float tolerance, WT *pagerank) +{ + if (is_setup) { + WT dot_res; + WT one = 1.0; + WT *pr = pagerank; + cugraph::detail::fill(v_glob, pagerank, one / v_glob); + cugraph::detail::fill(v_glob, prev_pr.data().get(), one / v_glob); + // This CUDA sync was added to fix #426 + // This should not be required in theory + // This is not needed on one GPU at this time + cudaDeviceSynchronize(); + dot_res = cugraph::detail::dot(v_glob, bookmark.data().get(), pr); + MGcsrmv spmv_solver( + handle, local_vertices, part_off, off, ind, val.data().get(), pagerank); + + WT residual; + int i; + for (i = 0; i < max_iter; ++i) { + spmv_solver.run(pagerank); + cugraph::detail::scal(v_glob, alpha, pr); + + // personalization + if (has_personalization) + cugraph::detail::axpy(v_glob, dot_res, personalization_vector.data().get(), pr); + else + cugraph::detail::addv(v_glob, dot_res * (one / v_glob), pr); + + dot_res = cugraph::detail::dot(v_glob, bookmark.data().get(), pr); + cugraph::detail::scal(v_glob, one / cugraph::detail::nrm2(v_glob, pr), pr); + + // convergence check + cugraph::detail::axpy(v_glob, (WT)-1.0, pr, prev_pr.data().get()); + residual = cugraph::detail::nrm2(v_glob, prev_pr.data().get()); + if (residual < tolerance) + break; + else + cugraph::detail::copy(v_glob, pr, prev_pr.data().get()); + } + cugraph::detail::scal(v_glob, one / cugraph::detail::nrm1(v_glob, pr), pr); + return i; + } else { + CUGRAPH_FAIL("MG PageRank : Solve was called before setup"); + } +} + +template class Pagerank; +template class Pagerank; + +} // namespace mg +} // namespace cugraph diff --git a/cpp/src/link_analysis/pagerank_1D.cuh b/cpp/src/link_analysis/pagerank_1D.cuh new file mode 100644 index 00000000000..feb410daa9a --- /dev/null +++ b/cpp/src/link_analysis/pagerank_1D.cuh @@ -0,0 +1,125 @@ +/* + * Copyright (c) 2020, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +// Author: Alex Fender afender@nvidia.com + +#pragma once + +#include +#include +#include + +#include "utilities/error.hpp" +#include "utilities/spmv_1D.cuh" + +namespace cugraph { +namespace mg { + +template +class Pagerank { + private: + VT v_glob{}; // global number of vertices + VT v_loc{}; // local number of vertices + ET e_loc{}; // local number of edges + WT alpha{}; // damping factor + bool has_personalization; + // CUDA + const raft::comms::comms_t &comm; // info about the mg comm setup + cudaStream_t stream; + int blocks; + int threads; + int sm_count; + + // Vertex offsets for each partition. + VT *part_off; + VT *local_vertices; + + // Google matrix + ET *off; + VT *ind; + + rmm::device_vector val; // values of the substochastic matrix + rmm::device_vector bookmark; // constant vector with dangling node info + rmm::device_vector prev_pr; // record the last pagerank for convergence check + rmm::device_vector personalization_vector; // personalization vector after reconstruction + + bool is_setup; + raft::handle_t const &handle; // raft handle propagation for SpMV, etc. + + public: + Pagerank(const raft::handle_t &handle, const GraphCSCView &G); + ~Pagerank(); + + void transition_vals(const VT *degree); + + void flag_leafs(const VT *degree); + + // Artificially create the google matrix by setting val and bookmark + void setup(WT _alpha, + VT *degree, + VT personalization_subset_size, + VT *personalization_subset, + WT *personalization_values); + + // run the power iteration on the google matrix, return the number of iterations + int solve(int max_iter, float tolerance, WT *pagerank); +}; + +template +int pagerank(raft::handle_t const &handle, + const GraphCSCView &G, + WT *pagerank_result, + VT personalization_subset_size, + VT *personalization_subset, + WT *personalization_values, + const double damping_factor = 0.85, + const int64_t n_iter = 100, + const double tolerance = 1e-5) +{ + // null pointers check + CUGRAPH_EXPECTS(G.offsets != nullptr, "Invalid API parameter - offsets is null"); + CUGRAPH_EXPECTS(G.indices != nullptr, "Invalid API parameter - indices is null"); + CUGRAPH_EXPECTS(pagerank_result != nullptr, + "Invalid API parameter - pagerank output memory must be allocated"); + + // parameter values + CUGRAPH_EXPECTS(damping_factor > 0.0, + "Invalid API parameter - invalid damping factor value (alpha<0)"); + CUGRAPH_EXPECTS(damping_factor < 1.0, + "Invalid API parameter - invalid damping factor value (alpha>1)"); + CUGRAPH_EXPECTS(n_iter > 0, "Invalid API parameter - n_iter must be > 0"); + + rmm::device_vector degree(G.number_of_vertices); + + // in-degree of CSC (equivalent to out-degree of original edge list) + G.degree(degree.data().get(), DegreeDirection::IN); + + // Allocate and initialize Pagerank class + Pagerank pr_solver(handle, G); + + // Set all constants info + pr_solver.setup(damping_factor, + degree.data().get(), + personalization_subset_size, + personalization_subset, + personalization_values); + + // Run pagerank + return pr_solver.solve(n_iter, tolerance, pagerank_result); +} + +} // namespace mg +} // namespace cugraph diff --git a/cpp/src/link_prediction/jaccard.cu b/cpp/src/link_prediction/jaccard.cu index 8462466f9e9..70952974b39 100644 --- a/cpp/src/link_prediction/jaccard.cu +++ b/cpp/src/link_prediction/jaccard.cu @@ -20,7 +20,7 @@ * ---------------------------------------------------------------------------**/ #include -#include
+#include #include "graph.hpp" #include "utilities/graph_utils.cuh" @@ -29,7 +29,7 @@ namespace detail { // Volume of neighboors (*weight_s) template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) jaccard_row_sum( +__global__ void jaccard_row_sum( vertex_t n, edge_t const *csrPtr, vertex_t const *csrInd, weight_t const *v, weight_t *work) { vertex_t row; @@ -53,13 +53,13 @@ __global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) jaccard_row_sum( // Volume of intersections (*weight_i) and cumulated volume of neighboors (*weight_s) template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) jaccard_is(vertex_t n, - edge_t const *csrPtr, - vertex_t const *csrInd, - weight_t const *v, - weight_t *work, - weight_t *weight_i, - weight_t *weight_s) +__global__ void jaccard_is(vertex_t n, + edge_t const *csrPtr, + vertex_t const *csrInd, + weight_t const *v, + weight_t *work, + weight_t *weight_i, + weight_t *weight_s) { edge_t i, j, Ni, Nj; vertex_t row, col; @@ -117,16 +117,15 @@ __global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) jaccard_is(vertex_t n // Volume of intersections (*weight_i) and cumulated volume of neighboors (*weight_s) // Using list of node pairs template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) - jaccard_is_pairs(edge_t num_pairs, - edge_t const *csrPtr, - vertex_t const *csrInd, - vertex_t const *first_pair, - vertex_t const *second_pair, - weight_t const *v, - weight_t *work, - weight_t *weight_i, - weight_t *weight_s) +__global__ void jaccard_is_pairs(edge_t num_pairs, + edge_t const *csrPtr, + vertex_t const *csrInd, + vertex_t const *first_pair, + vertex_t const *second_pair, + weight_t const *v, + weight_t *work, + weight_t *weight_i, + weight_t *weight_s) { edge_t i, idx, Ni, Nj, match; vertex_t row, col, ref, cur, ref_col, cur_col; @@ -182,8 +181,10 @@ __global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) // Jaccard weights (*weight) template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) - jaccard_jw(edge_t e, weight_t const *weight_i, weight_t const *weight_s, weight_t *weight_j) +__global__ void jaccard_jw(edge_t e, + weight_t const *weight_i, + weight_t const *weight_s, + weight_t *weight_j) { edge_t j; weight_t Wi, Ws, Wu; @@ -312,7 +313,7 @@ int jaccard_pairs(vertex_t n, } // namespace detail template -void jaccard(experimental::GraphCSRView const &graph, WT const *weights, WT *result) +void jaccard(GraphCSRView const &graph, WT const *weights, WT *result) { CUGRAPH_EXPECTS(result != nullptr, "Invalid API parameter: result pointer is NULL"); @@ -344,7 +345,7 @@ void jaccard(experimental::GraphCSRView const &graph, WT const *weig } template -void jaccard_list(experimental::GraphCSRView const &graph, +void jaccard_list(GraphCSRView const &graph, WT const *weights, ET num_pairs, VT const *first, @@ -386,41 +387,41 @@ void jaccard_list(experimental::GraphCSRView const &graph, } } -template void jaccard( - experimental::GraphCSRView const &, float const *, float *); -template void jaccard( - experimental::GraphCSRView const &, double const *, double *); -template void jaccard( - experimental::GraphCSRView const &, float const *, float *); -template void jaccard( - experimental::GraphCSRView const &, double const *, double *); -template void jaccard_list( - experimental::GraphCSRView const &, - float const *, - int32_t, - int32_t const *, - int32_t const *, - float *); -template void jaccard_list( - experimental::GraphCSRView const &, - double const *, - int32_t, - int32_t const *, - int32_t const *, 
- double *); -template void jaccard_list( - experimental::GraphCSRView const &, - float const *, - int64_t, - int64_t const *, - int64_t const *, - float *); -template void jaccard_list( - experimental::GraphCSRView const &, - double const *, - int64_t, - int64_t const *, - int64_t const *, - double *); +template void jaccard(GraphCSRView const &, + float const *, + float *); +template void jaccard(GraphCSRView const &, + double const *, + double *); +template void jaccard(GraphCSRView const &, + float const *, + float *); +template void jaccard(GraphCSRView const &, + double const *, + double *); +template void jaccard_list(GraphCSRView const &, + float const *, + int32_t, + int32_t const *, + int32_t const *, + float *); +template void jaccard_list(GraphCSRView const &, + double const *, + int32_t, + int32_t const *, + int32_t const *, + double *); +template void jaccard_list(GraphCSRView const &, + float const *, + int64_t, + int64_t const *, + int64_t const *, + float *); +template void jaccard_list(GraphCSRView const &, + double const *, + int64_t, + int64_t const *, + int64_t const *, + double *); } // namespace cugraph diff --git a/cpp/src/link_prediction/overlap.cu b/cpp/src/link_prediction/overlap.cu index ed945c378bd..e3f80b50d9a 100644 --- a/cpp/src/link_prediction/overlap.cu +++ b/cpp/src/link_prediction/overlap.cu @@ -20,7 +20,7 @@ * ---------------------------------------------------------------------------**/ #include -#include +#include #include "graph.hpp" #include "utilities/graph_utils.cuh" @@ -30,7 +30,7 @@ namespace detail { // Volume of neighboors (*weight_s) // TODO: Identical kernel to jaccard_row_sum!! template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) overlap_row_sum( +__global__ void overlap_row_sum( vertex_t n, edge_t const *csrPtr, vertex_t const *csrInd, weight_t const *v, weight_t *work) { vertex_t row; @@ -55,13 +55,13 @@ __global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) overlap_row_sum( // Volume of intersections (*weight_i) and cumulated volume of neighboors (*weight_s) // TODO: Identical kernel to jaccard_row_sum!! 
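// (Reviewer note, for orientation across jaccard.cu and this file: writing
// w(S) for the summed vertex weights of a set S, the two measures are
//    Jaccard(u, v) = w(N(u) ∩ N(v)) / w(N(u) ∪ N(v))
//    Overlap(u, v) = w(N(u) ∩ N(v)) / min(w(N(u)), w(N(v)))
// Both need the same row sums and intersection volumes -- hence the identical
// kernels flagged above -- and they differ only in the final weighting kernel.)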
template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) overlap_is(vertex_t n, - edge_t const *csrPtr, - vertex_t const *csrInd, - weight_t const *v, - weight_t *work, - weight_t *weight_i, - weight_t *weight_s) +__global__ void overlap_is(vertex_t n, + edge_t const *csrPtr, + vertex_t const *csrInd, + weight_t const *v, + weight_t *work, + weight_t *weight_i, + weight_t *weight_s) { edge_t i, j, Ni, Nj; vertex_t row, col; @@ -120,16 +120,15 @@ __global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) overlap_is(vertex_t n // Using list of node pairs // NOTE: NOT the same as jaccard template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) - overlap_is_pairs(edge_t num_pairs, - edge_t const *csrPtr, - vertex_t const *csrInd, - vertex_t const *first_pair, - vertex_t const *second_pair, - weight_t const *v, - weight_t *work, - weight_t *weight_i, - weight_t *weight_s) +__global__ void overlap_is_pairs(edge_t num_pairs, + edge_t const *csrPtr, + vertex_t const *csrInd, + vertex_t const *first_pair, + vertex_t const *second_pair, + weight_t const *v, + weight_t *work, + weight_t *weight_i, + weight_t *weight_s) { edge_t i, idx, Ni, Nj, match; vertex_t row, col, ref, cur, ref_col, cur_col; @@ -185,12 +184,12 @@ __global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) // Overlap weights (*weight) template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) overlap_jw(edge_t e, - edge_t const *csrPtr, - vertex_t const *csrInd, - weight_t *weight_i, - weight_t *weight_s, - weight_t *weight_j) +__global__ void overlap_jw(edge_t e, + edge_t const *csrPtr, + vertex_t const *csrInd, + weight_t *weight_i, + weight_t *weight_s, + weight_t *weight_j) { edge_t j; weight_t Wi, Wu; @@ -315,7 +314,7 @@ int overlap_pairs(vertex_t n, } // namespace detail template -void overlap(experimental::GraphCSRView const &graph, WT const *weights, WT *result) +void overlap(GraphCSRView const &graph, WT const *weights, WT *result) { CUGRAPH_EXPECTS(result != nullptr, "Invalid API parameter: result pointer is NULL"); @@ -347,7 +346,7 @@ void overlap(experimental::GraphCSRView const &graph, WT const *weig } template -void overlap_list(experimental::GraphCSRView const &graph, +void overlap_list(GraphCSRView const &graph, WT const *weights, ET num_pairs, VT const *first, @@ -389,41 +388,41 @@ void overlap_list(experimental::GraphCSRView const &graph, } } -template void overlap( - experimental::GraphCSRView const &, float const *, float *); -template void overlap( - experimental::GraphCSRView const &, double const *, double *); -template void overlap( - experimental::GraphCSRView const &, float const *, float *); -template void overlap( - experimental::GraphCSRView const &, double const *, double *); -template void overlap_list( - experimental::GraphCSRView const &, - float const *, - int32_t, - int32_t const *, - int32_t const *, - float *); -template void overlap_list( - experimental::GraphCSRView const &, - double const *, - int32_t, - int32_t const *, - int32_t const *, - double *); -template void overlap_list( - experimental::GraphCSRView const &, - float const *, - int64_t, - int64_t const *, - int64_t const *, - float *); -template void overlap_list( - experimental::GraphCSRView const &, - double const *, - int64_t, - int64_t const *, - int64_t const *, - double *); +template void overlap(GraphCSRView const &, + float const *, + float *); +template void overlap(GraphCSRView const &, + double const *, + double *); +template void overlap(GraphCSRView const &, + float const *, + float *); 
+template void overlap(GraphCSRView const &, + double const *, + double *); +template void overlap_list(GraphCSRView const &, + float const *, + int32_t, + int32_t const *, + int32_t const *, + float *); +template void overlap_list(GraphCSRView const &, + double const *, + int32_t, + int32_t const *, + int32_t const *, + double *); +template void overlap_list(GraphCSRView const &, + float const *, + int64_t, + int64_t const *, + int64_t const *, + float *); +template void overlap_list(GraphCSRView const &, + double const *, + int64_t, + int64_t const *, + int64_t const *, + double *); } // namespace cugraph diff --git a/cpp/src/nvgraph/include/async_event.cuh b/cpp/src/nvgraph/include/async_event.cuh deleted file mode 100644 index e7bf04fa33f..00000000000 --- a/cpp/src/nvgraph/include/async_event.cuh +++ /dev/null @@ -1,41 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#pragma once - -class AsyncEvent { - public: - AsyncEvent() : async_event(NULL) {} - AsyncEvent(int size) : async_event(NULL) { cudaEventCreate(&async_event); } - ~AsyncEvent() - { - if (async_event != NULL) cudaEventDestroy(async_event); - } - - void create() { cudaEventCreate(&async_event); } - void record(cudaStream_t s = 0) - { - if (async_event == NULL) { - cudaEventCreate(&async_event); // check if we haven't created the event yet - } - - cudaEventRecord(async_event, s); - } - void sync() { cudaEventSynchronize(async_event); } - - private: - cudaEvent_t async_event; -}; diff --git a/cpp/src/nvgraph/include/atomics.hxx b/cpp/src/nvgraph/include/atomics.hxx deleted file mode 100644 index 4cd02764ed7..00000000000 --- a/cpp/src/nvgraph/include/atomics.hxx +++ /dev/null @@ -1,145 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -namespace nvgraph { -//This file contains the atomic operations for floats and doubles from cusparse/src/cusparse_atomics.h - -static __inline__ __device__ double atomicFPAdd(double *addr, double val) -{ -// atomicAdd for double starts with sm_60 -#if __CUDA_ARCH__ >= 600 - return atomicAdd( addr, val ); -#else - unsigned long long old = __double_as_longlong( addr[0] ), assumed; - - do - { - assumed = old; - old = atomicCAS( (unsigned long long *) addr, assumed, __double_as_longlong( val + __longlong_as_double( assumed ) ) ); - } - while ( assumed != old ); - - return old; -#endif -} - -// atomicAdd for float starts with sm_20 -static __inline__ __device__ float atomicFPAdd(float *addr, float val) -{ - return atomicAdd( addr, val ); -} - -static __inline__ __device__ double atomicFPMin(double *addr, double val) -{ - double old, assumed; - old=*addr; - do{ - assumed = old; - old = __longlong_as_double(atomicCAS((unsigned long long int *)addr, __double_as_longlong(assumed), - __double_as_longlong(min(val,assumed)))); - } while (__double_as_longlong(assumed) != __double_as_longlong(old)); - return old; -} - -/* atomic addition: based on Nvidia Research atomic's tricks from cusparse */ -static __inline__ __device__ float atomicFPMin(float *addr, float val) -{ - float old, assumed; - old=*addr; - do{ - assumed = old; - old = int_as_float(atomicCAS((int *)addr, float_as_int(assumed),float_as_int(min(val,assumed)))); - } while (float_as_int(assumed) != float_as_int(old)); - - return old; -} - -static __inline__ __device__ double atomicFPMax(double *addr, double val) -{ - double old, assumed; - old=*addr; - do{ - assumed = old; - old = __longlong_as_double(atomicCAS((unsigned long long int *)addr, __double_as_longlong(assumed), - __double_as_longlong(max(val,assumed)))); - } while (__double_as_longlong(assumed) != __double_as_longlong(old)); - return old; -} - -/* atomic addition: based on Nvidia Research atomic's tricks from cusparse */ -static __inline__ __device__ float atomicFPMax(float *addr, float val) -{ - float old, assumed; - old=*addr; - do{ - assumed = old; - old = int_as_float(atomicCAS((int *)addr, float_as_int(assumed),float_as_int(max(val,assumed)))); - } while (float_as_int(assumed) != float_as_int(old)); - - return old; -} - -static __inline__ __device__ double atomicFPOr(double *addr, double val) -{ - double old, assumed; - old=*addr; - do{ - assumed = old; - old = __longlong_as_double(atomicCAS((unsigned long long int *)addr, __double_as_longlong(assumed), - __double_as_longlong((bool)val | (bool)assumed))); - } while (__double_as_longlong(assumed) != __double_as_longlong(old)); - return old; -} - -/* atomic addition: based on Nvidia Research atomic's tricks from cusparse */ -static __inline__ __device__ float atomicFPOr(float *addr, float val) -{ - float old, assumed; - old=*addr; - do{ - assumed = old; - old = int_as_float(atomicCAS((int *)addr, float_as_int(assumed),float_as_int((bool)val | (bool)assumed))); - } while (float_as_int(assumed) != float_as_int(old)); - - return old; -} - -static __inline__ __device__ double atomicFPLog(double *addr, double val) -{ - double old, assumed; - old=*addr; - do{ - assumed = old; - old = __longlong_as_double(atomicCAS((unsigned long long int *)addr, __double_as_longlong(assumed), - __double_as_longlong(-log(exp(-val)+exp(-assumed))))); - } while (__double_as_longlong(assumed) != __double_as_longlong(old)); - return old; -} - -/* atomic addition: based on Nvidia Research atomic's tricks from cusparse */ -static __inline__ __device__ 
float atomicFPLog(float *addr, float val) -{ - float old, assumed; - old=*addr; - do{ - assumed = old; - old = int_as_float(atomicCAS((int *)addr, float_as_int(assumed),float_as_int(-logf(expf(-val)+expf(-assumed))))); - } while (float_as_int(assumed) != float_as_int(old)); - - return old; -} - -} //end anmespace nvgraph - diff --git a/cpp/src/nvgraph/include/debug_macros.h b/cpp/src/nvgraph/include/debug_macros.h deleted file mode 100644 index 5ee114c0084..00000000000 --- a/cpp/src/nvgraph/include/debug_macros.h +++ /dev/null @@ -1,42 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -#pragma once - -#include "nvgraph_error.hxx" - -#define CHECK_STATUS(...) \ - do { \ - if (__VA_ARGS__) { FatalError(#__VA_ARGS__, NVGRAPH_ERR_UNKNOWN); } \ - } while (0) - -#define CHECK_NVGRAPH(...) \ - do { \ - NVGRAPH_ERROR e = __VA_ARGS__; \ - if (e != NVGRAPH_OK) { FatalError(#__VA_ARGS__, e) } \ - } while (0) - -#ifdef DEBUG -#define COUT() (std::cout) -#define CERR() (std::cerr) -#define WARNING(message) \ - do { \ - std::stringstream ss; \ - ss << "Warning (" << __FILE__ << ":" << __LINE__ << "): " << message; \ - CERR() << ss.str() << std::endl; \ - } while (0) -#else // DEBUG -#define WARNING(message) -#endif diff --git a/cpp/src/nvgraph/include/graph_utils.cuh b/cpp/src/nvgraph/include/graph_utils.cuh deleted file mode 100644 index 106cd875ed1..00000000000 --- a/cpp/src/nvgraph/include/graph_utils.cuh +++ /dev/null @@ -1,339 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -// Helper functions based on Thrust - -#pragma once - -#include -#include -//#include -//#include - -#include -#include -#include -#include -#include -#include - -#include -#include - -#define USE_CG 1 -#define DEBUG 1 - -namespace nvlouvain { - -#define CUDA_MAX_BLOCKS 65535 -#define CUDA_MAX_KERNEL_THREADS 256 // kernel will launch at most 256 threads per block -#define DEFAULT_MASK 0xffffffff -#define US - -//#define DEBUG 1 - -// error check -#undef cudaCheckError -#ifdef DEBUG -#define WHERE " at: " << __FILE__ << ':' << __LINE__ -#define cudaCheckError() \ - { \ - cudaError_t e = cudaGetLastError(); \ - if (e != cudaSuccess) { \ - std::cerr << "Cuda failure: " << cudaGetErrorString(e) << WHERE << std::endl; \ - } \ - } -#else -#define cudaCheckError() -#define WHERE "" -#endif - -// This is a gap filler, and should be replaced with a RAPIDS-wise error handling mechanism. 
-#undef rmmCheckError -#ifdef DEBUG -#define WHERE " at: " << __FILE__ << ':' << __LINE__ -#define rmmCheckError(e) \ - { \ - if (e != RMM_SUCCESS) { std::cerr << "RMM failure: " << WHERE << std::endl; } \ - } -#else -#define rmmCheckError(e) -#define WHERE "" -#endif - -template -static __device__ __forceinline__ T -shfl_up(T r, int offset, int bound = 32, int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#if USE_CG - return __shfl_up_sync(mask, r, offset, bound); -#else - return __shfl_up(r, offset, bound); -#endif -#else - return 0.0f; -#endif -} - -template -static __device__ __forceinline__ T shfl(T r, int lane, int bound = 32, int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#if USE_CG - return __shfl_sync(mask, r, lane, bound); -#else - return __shfl(r, lane, bound); -#endif -#else - return 0.0f; -#endif -} - -template -__inline__ __device__ T parallel_prefix_sum(int n, int *ind, T *w) -{ - int i, j, mn; - T v, last; - T sum = 0.0; - bool valid; - - // Parallel prefix sum (using __shfl) - mn = (((n + blockDim.x - 1) / blockDim.x) * blockDim.x); // n in multiple of blockDim.x - for (i = threadIdx.x; i < mn; i += blockDim.x) { - // All threads (especially the last one) must always participate - // in the shfl instruction, otherwise their sum will be undefined. - // So, the loop stopping condition is based on multiple of n in loop increments, - // so that all threads enter into the loop and inside we make sure we do not - // read out of bounds memory checking for the actual size n. - - // check if the thread is valid - valid = i < n; - - // Notice that the last thread is used to propagate the prefix sum. - // For all the threads, in the first iteration the last is 0, in the following - // iterations it is the value at the last thread of the previous iterations. - - // get the value of the last thread - last = shfl(sum, blockDim.x - 1, blockDim.x); - - // if you are valid read the value from memory, otherwise set your value to 0 - sum = (valid) ? 
w[ind[i]] : 0.0; - - // do prefix sum (of size warpSize=blockDim.x =< 32) - for (j = 1; j < blockDim.x; j *= 2) { - v = shfl_up(sum, j, blockDim.x); - if (threadIdx.x >= j) sum += v; - } - // shift by last - sum += last; - // notice that no __threadfence or __syncthreads are needed in this implementation - } - // get the value of the last thread (to all threads) - last = shfl(sum, blockDim.x - 1, blockDim.x); - - return last; -} - -// dot -template -T dot(size_t n, T *x, T *y) -{ - T result = thrust::inner_product(thrust::device_pointer_cast(x), - thrust::device_pointer_cast(x + n), - thrust::device_pointer_cast(y), - 0.0f); - cudaCheckError(); - return result; -} - -// axpy -template -struct axpy_functor : public thrust::binary_function { - const T a; - axpy_functor(T _a) : a(_a) {} - __host__ __device__ T operator()(const T &x, const T &y) const { return a * x + y; } -}; - -template -void axpy(size_t n, T a, T *x, T *y) -{ - thrust::transform(thrust::device_pointer_cast(x), - thrust::device_pointer_cast(x + n), - thrust::device_pointer_cast(y), - thrust::device_pointer_cast(y), - axpy_functor(a)); - cudaCheckError(); -} - -// norm -template -struct square { - __host__ __device__ T operator()(const T &x) const { return x * x; } -}; - -template -T nrm2(size_t n, T *x) -{ - T init = 0; - T result = std::sqrt(thrust::transform_reduce(thrust::device_pointer_cast(x), - thrust::device_pointer_cast(x + n), - square(), - init, - thrust::plus())); - cudaCheckError(); - return result; -} - -template -T nrm1(size_t n, T *x) -{ - T result = thrust::reduce(thrust::device_pointer_cast(x), thrust::device_pointer_cast(x + n)); - cudaCheckError(); - return result; -} - -template -void scal(size_t n, T val, T *x) -{ - thrust::transform(thrust::device_pointer_cast(x), - thrust::device_pointer_cast(x + n), - thrust::make_constant_iterator(val), - thrust::device_pointer_cast(x), - thrust::multiplies()); - cudaCheckError(); -} - -template -void fill(size_t n, T *x, T value) -{ - thrust::fill(thrust::device_pointer_cast(x), thrust::device_pointer_cast(x + n), value); - cudaCheckError(); -} - -template -void printv(size_t n, T *vec, int offset) -{ - thrust::device_ptr dev_ptr(vec); - std::cout.precision(15); - std::cout << "sample size = " << n << ", offset = " << offset << std::endl; - thrust::copy(dev_ptr + offset, dev_ptr + offset + n, std::ostream_iterator(std::cout, " ")); - cudaCheckError(); - std::cout << std::endl; -} - -template -void copy(size_t n, T *x, T *res) -{ - thrust::device_ptr dev_ptr(x); - thrust::device_ptr res_ptr(res); - thrust::copy_n(dev_ptr, n, res_ptr); - cudaCheckError(); -} - -template -struct is_zero { - __host__ __device__ bool operator()(const T x) { return x == 0; } -}; - -template -struct dangling_functor : public thrust::unary_function { - const T val; - dangling_functor(T _val) : val(_val) {} - __host__ __device__ T operator()(const T &x) const { return val + x; } -}; - -template -void update_dangling_nodes(size_t n, T *dangling_nodes, T damping_factor) -{ - thrust::transform_if(thrust::device_pointer_cast(dangling_nodes), - thrust::device_pointer_cast(dangling_nodes + n), - thrust::device_pointer_cast(dangling_nodes), - dangling_functor(1.0 - damping_factor), - is_zero()); - cudaCheckError(); -} - -// google matrix kernels -template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) - degree_coo(const IndexType n, const IndexType e, const IndexType *ind, IndexType *degree) -{ - for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < e; i += gridDim.x * blockDim.x) - 
-    atomicAdd(&degree[ind[i]], 1.0);
-}
-template <typename IndexType, typename ValueType>
-__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) equi_prob(
-  const IndexType n, const IndexType e, const IndexType *ind, ValueType *val, IndexType *degree)
-{
-  for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < e; i += gridDim.x * blockDim.x)
-    val[i] = 1.0 / degree[ind[i]];
-}
-
-template <typename IndexType, typename ValueType>
-__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS)
-  flag_leafs(const IndexType n, IndexType *degree, ValueType *bookmark)
-{
-  for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += gridDim.x * blockDim.x)
-    if (degree[i] == 0) bookmark[i] = 1.0;
-}
-// notice that in the transposed matrix/csc a dangling node is a node without incoming edges
-template <typename IndexType, typename ValueType>
-void google_matrix(const IndexType n,
-                   const IndexType e,
-                   const IndexType *cooColInd,
-                   ValueType *cooVal,
-                   ValueType *bookmark)
-{
-  rmm::device_vector<IndexType> degree(n, 0);
-  dim3 nthreads, nblocks;
-  nthreads.x = min(e, CUDA_MAX_KERNEL_THREADS);
-  nthreads.y = 1;
-  nthreads.z = 1;
-  nblocks.x = min((e + nthreads.x - 1) / nthreads.x, CUDA_MAX_BLOCKS);
-  nblocks.y = 1;
-  nblocks.z = 1;
-  degree_coo<IndexType, ValueType>
-    <<<nblocks, nthreads>>>(n, e, cooColInd, thrust::raw_pointer_cast(degree.data()));
-  equi_prob<IndexType, ValueType>
-    <<<nblocks, nthreads>>>(n, e, cooColInd, cooVal, thrust::raw_pointer_cast(degree.data()));
-  ValueType val = 0.0;
-  fill(n, bookmark, val);
-  nthreads.x = min(n, CUDA_MAX_KERNEL_THREADS);
-  nblocks.x = min((n + nthreads.x - 1) / nthreads.x, CUDA_MAX_BLOCKS);
-  flag_leafs<IndexType, ValueType>
-    <<<nblocks, nthreads>>>(n, thrust::raw_pointer_cast(degree.data()), bookmark);
-  // printv(n, thrust::raw_pointer_cast(degree.data()) , 0);
-  // printv(n, bookmark , 0);
-  // printv(e, cooVal , 0);
-}
-
-template <typename IndexType>
-__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS)
-  update_clustering_kernel(const IndexType n, IndexType *clustering, IndexType *aggregates_d)
-{
-  for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += gridDim.x * blockDim.x)
-    clustering[i] = aggregates_d[clustering[i]];
-}
-
-template <typename IndexType>
-void update_clustering(const IndexType n, IndexType *clustering, IndexType *aggregates_d)
-{
-  int nthreads = min(n, CUDA_MAX_KERNEL_THREADS);
-  int nblocks  = min((n + nthreads - 1) / nthreads, CUDA_MAX_BLOCKS);
-  update_clustering_kernel<<<nblocks, nthreads>>>(n, clustering, aggregates_d);
-}
-
-} // namespace nvlouvain
diff --git a/cpp/src/nvgraph/include/kmeans.hxx b/cpp/src/nvgraph/include/kmeans.hxx
deleted file mode 100644
index 386b084706a..00000000000
--- a/cpp/src/nvgraph/include/kmeans.hxx
+++ /dev/null
@@ -1,99 +0,0 @@
-/*
- * Copyright (c) 2019, NVIDIA CORPORATION.
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-#pragma once
-
-#include "nvgraph_error.hxx"
-
-namespace nvgraph {
-
- /// Find clusters with k-means algorithm
- /** Initial centroids are chosen with k-means++ algorithm. Empty
-  * clusters are reinitialized by choosing new centroids with
-  * k-means++ algorithm.
-  *
-  * CNMEM must be initialized before calling this function.
-  *
-  * @param cublasHandle_t cuBLAS handle.
-  * @param n Number of observation vectors.
-  * @param d Dimension of observation vectors.
-  * @param k Number of clusters.
-  * @param tol Tolerance for convergence. k-means stops when the
-  * change in residual divided by n is less than tol.
-  * @param maxiter Maximum number of k-means iterations.
-  * @param obs (Input, device memory, d*n entries) Observation
-  * matrix. Matrix is stored column-major and each column is an
-  * observation vector. Matrix dimensions are d x n.
-  * @param codes (Output, device memory, n entries) Cluster
-  * assignments.
-  * @param residual On exit, residual sum of squares (sum of squares
-  * of distances between observation vectors and centroids).
-  * @param iters On exit, number of k-means iterations.
-  * @return NVGRAPH error flag.
-  */
- template <typename IndexType_, typename ValueType_>
- NVGRAPH_ERROR kmeans(IndexType_ n, IndexType_ d, IndexType_ k,
-                      ValueType_ tol, IndexType_ maxiter,
-                      const ValueType_ * __restrict__ obs,
-                      IndexType_ * __restrict__ codes,
-                      ValueType_ & residual,
-                      IndexType_ & iters);
-
- /// Find clusters with k-means algorithm
- /** Initial centroids are chosen with k-means++ algorithm. Empty
-  * clusters are reinitialized by choosing new centroids with
-  * k-means++ algorithm.
-  *
-  * @param n Number of observation vectors.
-  * @param d Dimension of observation vectors.
-  * @param k Number of clusters.
-  * @param tol Tolerance for convergence. k-means stops when the
-  * change in residual divided by n is less than tol.
-  * @param maxiter Maximum number of k-means iterations.
-  * @param obs (Input, device memory, d*n entries) Observation
-  * matrix. Matrix is stored column-major and each column is an
-  * observation vector. Matrix dimensions are d x n.
-  * @param codes (Output, device memory, n entries) Cluster
-  * assignments.
-  * @param clusterSizes (Output, device memory, k entries) Number of
-  * points in each cluster.
-  * @param centroids (Output, device memory, d*k entries) Centroid
-  * matrix. Matrix is stored column-major and each column is a
-  * centroid. Matrix dimensions are d x k.
-  * @param work (Output, device memory, n*max(k,d) entries)
-  * Workspace.
-  * @param work_int (Output, device memory, 2*d*n entries)
-  * Workspace.
-  * @param residual_host (Output, host memory, 1 entry) Residual sum
-  * of squares (sum of squares of distances between observation
-  * vectors and centroids).
-  * @param iters_host (Output, host memory, 1 entry) Number of
-  * k-means iterations.
-  * @return NVGRAPH error flag.
-  */
- template <typename IndexType_, typename ValueType_>
- NVGRAPH_ERROR kmeans(IndexType_ n, IndexType_ d, IndexType_ k,
-                      ValueType_ tol, IndexType_ maxiter,
-                      const ValueType_ * __restrict__ obs,
-                      IndexType_ * __restrict__ codes,
-                      IndexType_ * __restrict__ clusterSizes,
-                      ValueType_ * __restrict__ centroids,
-                      ValueType_ * __restrict__ work,
-                      IndexType_ * __restrict__ work_int,
-                      ValueType_ * residual_host,
-                      IndexType_ * iters_host);
-
-}
-
diff --git a/cpp/src/nvgraph/include/lanczos.hxx b/cpp/src/nvgraph/include/lanczos.hxx
deleted file mode 100644
index 58be76a0a45..00000000000
--- a/cpp/src/nvgraph/include/lanczos.hxx
+++ /dev/null
@@ -1,118 +0,0 @@
-/*
- * Copyright (c) 2019, NVIDIA CORPORATION.
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-#pragma once
-
-#include "nvgraph_error.hxx"
-#include "spectral_matrix.hxx"
-
-namespace nvgraph {
-
- /// Compute smallest eigenvectors of symmetric matrix
- /** Computes eigenvalues and eigenvectors that are least
-  * positive. If matrix is positive definite or positive
-  * semidefinite, the computed eigenvalues are smallest in
-  * magnitude.
-  *
-  * The largest eigenvalue is estimated by performing several
-  * Lanczos iterations. An implicitly restarted Lanczos method is
-  * then applied to A+s*I, where s is the negative of the largest
-  * eigenvalue.
-  *
-  * CNMEM must be initialized before calling this function.
-  *
-  * @param A Pointer to matrix object.
-  * @param nEigVecs Number of eigenvectors to compute.
-  * @param maxIter Maximum number of Lanczos steps. Does not include
-  * Lanczos steps used to estimate largest eigenvalue.
-  * @param restartIter Maximum size of Lanczos system before
-  * performing an implicit restart. Should be at least 4.
-  * @param tol Convergence tolerance. Lanczos iteration will
-  * terminate when the residual norm is less than tol*theta, where
-  * theta is an estimate for the smallest unwanted eigenvalue
-  * (i.e. the (nEigVecs+1)th smallest eigenvalue).
-  * @param reorthogonalize Whether to reorthogonalize Lanczos
-  * vectors.
-  * @param iter On exit, pointer to total number of Lanczos
-  * iterations performed. Does not include Lanczos steps used to
-  * estimate largest eigenvalue.
-  * @param eigVals_dev (Output, device memory, nEigVecs entries)
-  * Smallest eigenvalues of matrix.
-  * @param eigVecs_dev (Output, device memory, n*nEigVecs entries)
-  * Eigenvectors corresponding to smallest eigenvalues of
-  * matrix. Vectors are stored as columns of a column-major matrix
-  * with dimensions n x nEigVecs.
-  * @return NVGRAPH error flag.
-  */
- template <typename IndexType_, typename ValueType_>
- NVGRAPH_ERROR computeSmallestEigenvectors(const Matrix<IndexType_, ValueType_> & A,
-                                           IndexType_ nEigVecs,
-                                           IndexType_ maxIter,
-                                           IndexType_ restartIter,
-                                           ValueType_ tol,
-                                           bool reorthogonalize,
-                                           IndexType_ & iter,
-                                           ValueType_ * __restrict__ eigVals_dev,
-                                           ValueType_ * __restrict__ eigVecs_dev);
-
- /// Compute largest eigenvectors of symmetric matrix
- /** Computes eigenvalues and eigenvectors that are largest
-  * positive. If matrix is positive definite or positive
-  * semidefinite, the computed eigenvalues are largest in
-  * magnitude.
-  *
-  * The largest eigenvalue is estimated by performing several
-  * Lanczos iterations. An implicitly restarted Lanczos method is
-  * then applied to A+s*I, where s is the negative of the largest
-  * eigenvalue.
-  *
-  * CNMEM must be initialized before calling this function.
-  *
-  * @param A Matrix.
-  * @param nEigVecs Number of eigenvectors to compute.
-  * @param maxIter Maximum number of Lanczos steps. Does not include
-  * Lanczos steps used to estimate largest eigenvalue.
-  * @param restartIter Maximum size of Lanczos system before
-  * performing an implicit restart. Should be at least 4.
-  * @param tol Convergence tolerance. Lanczos iteration will
-  * terminate when the residual norm is less than tol*theta, where
-  * theta is an estimate for the largest unwanted eigenvalue
-  * (i.e. the (nEigVecs+1)th largest eigenvalue).
-  * @param reorthogonalize Whether to reorthogonalize Lanczos
-  * vectors.
-  * @param iter On exit, pointer to total number of Lanczos
-  * iterations performed. Does not include Lanczos steps used to
-  * estimate largest eigenvalue.
-  * @param eigVals_dev (Output, device memory, nEigVecs entries)
-  * Largest eigenvalues of matrix.
-  * @param eigVecs_dev (Output, device memory, n*nEigVecs entries)
-  * Eigenvectors corresponding to largest eigenvalues of
-  * matrix. Vectors are stored as columns of a column-major matrix
-  * with dimensions n x nEigVecs.
-  * @return NVGRAPH error flag.
-  */
- template <typename IndexType_, typename ValueType_>
- NVGRAPH_ERROR computeLargestEigenvectors(const Matrix<IndexType_, ValueType_> & A,
-                                          IndexType_ nEigVecs,
-                                          IndexType_ maxIter,
-                                          IndexType_ restartIter,
-                                          ValueType_ tol,
-                                          bool reorthogonalize,
-                                          IndexType_ & iter,
-                                          ValueType_ * __restrict__ eigVals_dev,
-                                          ValueType_ * __restrict__ eigVecs_dev);
-
-}
-
diff --git a/cpp/src/nvgraph/include/modularity_maximization.hxx b/cpp/src/nvgraph/include/modularity_maximization.hxx
deleted file mode 100644
index 34720f88341..00000000000
--- a/cpp/src/nvgraph/include/modularity_maximization.hxx
+++ /dev/null
@@ -1,76 +0,0 @@
-/*
- * Copyright (c) 2019-2020, NVIDIA CORPORATION.
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-#pragma once
-
-#include <graph.hpp>
-
-#include "nvgraph_error.hxx"
-#include "spectral_matrix.hxx"
-
-
-namespace nvgraph {
- /** Compute partition for a weighted undirected graph. This
-  * partition attempts to minimize the cost function:
-  * Cost = \sum_i (Edges cut by ith partition)/(Vertices in ith partition)
-  *
-  * @param G Weighted graph in CSR format
-  * @param nClusters Number of partitions.
-  * @param nEigVecs Number of eigenvectors to compute.
-  * @param maxIter_lanczos Maximum number of Lanczos iterations.
-  * @param restartIter_lanczos Maximum size of Lanczos system before
-  * implicit restart.
-  * @param tol_lanczos Convergence tolerance for Lanczos method.
-  * @param maxIter_kmeans Maximum number of k-means iterations.
-  * @param tol_kmeans Convergence tolerance for k-means algorithm.
-  * @param parts (Output, device memory, n entries) Cluster
-  * assignments.
-  * @param iters_lanczos On exit, number of Lanczos iterations
-  * performed.
-  * @param iters_kmeans On exit, number of k-means iterations
-  * performed.
-  * @return NVGRAPH error flag.
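  * @note Illustrative call sketch (editorial, not from the original header);
  * graph, clusters, eigVals and eigVecs are assumed to be valid device-side
  * buffers sized as documented above:
  * @code
  *   int iters_lanczos, iters_kmeans;
  *   NVGRAPH_ERROR e = nvgraph::modularity_maximization(
  *     graph, nClusters, nEigVecs, maxIter_lanczos, restartIter_lanczos,
  *     tol_lanczos, maxIter_kmeans, tol_kmeans, clusters,
  *     eigVals, eigVecs, iters_lanczos, iters_kmeans);
  * @endcode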
-  */
- template <typename vertex_t, typename edge_t, typename weight_t>
- NVGRAPH_ERROR modularity_maximization(
-   cugraph::experimental::GraphCSRView<vertex_t, edge_t, weight_t> const &graph,
-   vertex_t nClusters,
-   vertex_t nEigVecs,
-   int maxIter_lanczos,
-   int restartIter_lanczos,
-   weight_t tol_lanczos,
-   int maxIter_kmeans,
-   weight_t tol_kmeans,
-   vertex_t * __restrict__ clusters,
-   weight_t *eigVals,
-   weight_t *eigVecs,
-   int & iters_lanczos,
-   int & iters_kmeans);
-
-
- /// Compute modularity
- /** This function determines the modularity based on a graph and cluster assignments
-  * @param G Weighted graph in CSR format
-  * @param nClusters Number of clusters.
-  * @param parts (Input, device memory, n entries) Cluster assignments.
-  * @param modularity On exit, modularity
-  */
- template <typename vertex_t, typename edge_t, typename weight_t>
- NVGRAPH_ERROR analyzeModularity(
-   cugraph::experimental::GraphCSRView<vertex_t, edge_t, weight_t> const &graph,
-   vertex_t nClusters,
-   const vertex_t * __restrict__ parts,
-   weight_t & modularity);
-
-}
-
diff --git a/cpp/src/nvgraph/include/nvgraph_cublas.hxx b/cpp/src/nvgraph/include/nvgraph_cublas.hxx
deleted file mode 100644
index bddbbf18ae1..00000000000
--- a/cpp/src/nvgraph/include/nvgraph_cublas.hxx
+++ /dev/null
@@ -1,120 +0,0 @@
-/*
- * Copyright (c) 2019, NVIDIA CORPORATION.
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#pragma once
-
-#include <cublas_v2.h>
-#include <cuda_runtime.h>
-#include "debug_macros.h"
-
-namespace nvgraph
-{
-class Cublas;
-
-class Cublas
-{
-private:
-  static cublasHandle_t m_handle;
-  // Private ctor to prevent instantiation.
-  Cublas();
-  ~Cublas();
-public:
-
-  // Get the handle.
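  // (Editorial note) The accessor below lazily creates one process-wide cuBLAS
  // handle on first use and caches it in the static member m_handle, so every
  // wrapper in this class shares a single handle instead of paying
  // cublasCreate/cublasDestroy on each call; destroy_handle() is the matching
  // explicit teardown. Usage sketch (illustrative):
  //   cublasHandle_t h = Cublas::get_handle();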
- static cublasHandle_t get_handle() - { - if (m_handle == 0) - CHECK_CUBLAS(cublasCreate(&m_handle)); - return m_handle; - } - - static void destroy_handle() - { - if (m_handle != 0) - CHECK_CUBLAS(cublasDestroy(m_handle)); - m_handle = 0; - } - - static void set_pointer_mode_device(); - static void set_pointer_mode_host(); - static void setStream(cudaStream_t stream) - { - cublasHandle_t handle = Cublas::get_handle(); - CHECK_CUBLAS(cublasSetStream(handle, stream)); - } - - template - static void axpy(int n, T alpha, - const T* x, int incx, - T* y, int incy); - - template - static void copy(int n, const T* x, int incx, - T* y, int incy); - - template - static void dot(int n, const T* x, int incx, - const T* y, int incy, - T* result); - - template - static void gemv(bool transposed, int m, int n, - const T* alpha, const T* A, int lda, - const T* x, int incx, - const T* beta, T* y, int incy); - - template - static void gemv_ext(bool transposed, const int m, const int n, - const T* alpha, const T* A, const int lda, - const T* x, const int incx, - const T* beta, T* y, const int incy, const int offsetx, const int offsety, const int offseta); - - template - static void trsv_v2( cublasFillMode_t uplo, cublasOperation_t trans, cublasDiagType_t diag, int n, - const T *A, int lda, T *x, int incx, int offseta); - - template - static void ger(int m, int n, const T* alpha, - const T* x, int incx, - const T* y, int incy, - T* A, int lda); - - template - static T nrm2(int n, const T* x, int incx); - template - static void nrm2(int n, const T* x, int incx, T* result); - - template - static void scal(int n, T alpha, T* x, int incx); - template - static void scal(int n, T* alpha, T* x, int incx); - - template - static void gemm(bool transa, bool transb, int m, int n, int k, - const T * alpha, const T * A, int lda, - const T * B, int ldb, - const T * beta, T * C, int ldc); - - template - static void geam(bool transa, bool transb, int m, int n, - const T * alpha, const T * A, int lda, - const T * beta, const T * B, int ldb, - T * C, int ldc); - -}; - -} // end namespace nvgraph - diff --git a/cpp/src/nvgraph/include/nvgraph_cusparse.hxx b/cpp/src/nvgraph/include/nvgraph_cusparse.hxx deleted file mode 100644 index a1c86bd1bc8..00000000000 --- a/cpp/src/nvgraph/include/nvgraph_cusparse.hxx +++ /dev/null @@ -1,164 +0,0 @@ -/* - * Copyright (c) 2019-2020, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#pragma once - -#include -#include -#include "nvgraph_vector.hxx" - -#include -#include "debug_macros.h" - -namespace nvgraph -{ -class Cusparse -{ -private: - // global CUSPARSE handle for nvgraph - static cusparseHandle_t m_handle; // Constructor. - Cusparse(); - // Destructor. - ~Cusparse(); - -public: - - // Get the handle. 
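 // (Editorial note) Same lazy-singleton pattern as the Cublas class above. The
 // unsynchronized check-then-create below is not thread-safe; a C++11-style
 // alternative (illustrative only, not the original design) would rely on the
 // once-only initialization of a function-local static:
 //   static cusparseHandle_t &handle() {
 //     static cusparseHandle_t h = [] {
 //       cusparseHandle_t t; CHECK_CUSPARSE(cusparseCreate(&t)); return t;
 //     }();
 //     return h;
 //   }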
- static cusparseHandle_t get_handle() - { - if (m_handle == 0) - CHECK_CUSPARSE(cusparseCreate(&m_handle)); - return m_handle; - } - // Destroy handle - static void destroy_handle() - { - if (m_handle != 0) - CHECK_CUSPARSE( cusparseDestroy(m_handle) ); - m_handle = 0; - } - static void setStream(cudaStream_t stream) - { - cusparseHandle_t handle = Cusparse::get_handle(); - CHECK_CUSPARSE(cusparseSetStream(handle, stream)); - } - // Set pointer mode - static void set_pointer_mode_device(); - static void set_pointer_mode_host(); - - // operate on all rows and columns y= alpha*A.x + beta*y - template - static void csrmv( const bool transposed, - const bool sym, - const int m, const int n, const int nnz, - const ValueType_* alpha, - const ValueType_* csrVal, - const IndexType_ *csrRowPtr, - const IndexType_ *csrColInd, - const ValueType_* x, - const ValueType_* beta, - ValueType_* y); - - // future possible features - /* - template - static void csrmv_with_mask( const typename TConfig::MatPrec alphaConst, - Matrix &A, - Vector &x, - const typename TConfig::MatPrec betaConst, - Vector &y ); - - template - static void csrmv_with_mask_restriction( const typename TConfig::MatPrec alphaConst, - Matrix &A, - Vector &x, - const typename TConfig::MatPrec betaConst, - Vector &y, - Matrix &P); - - // E is a vector that represents a diagonal matrix - // operate on all rows and columns - // y= alpha*E.x + beta*y - template - static void csrmv( const typename TConfig::MatPrec alphaConst, - Matrix &A, - const typename Matrix::MVector &E, - Vector &x, - const typename TConfig::MatPrec betaConst, - Vector &y, - ViewType view = OWNED ); - - // operate only on columns specified by columnColorSelector, see enum ColumnColorSelector above - // operate only on rows of specified color, given by A.offsets_rows_per_color, A.sorted_rows_by_color - // y= alpha*A.x + beta*y - template - static void csrmv( ColumnColorSelector columnColorSelector, - const int color, - const typename TConfig::MatPrec alphaConst, - Matrix &A, - Vector &x, - const typename TConfig::MatPrec betaConst, - Vector &y, - ViewType view = OWNED ); - - // E is a vector that represents a diagonal matrix - // operate only on rows of specified color, given by A.offsets_rows_per_color, A.sorted_rows_by_color - // y= alpha*E.x + beta*y - template - static void csrmv( const int color, - typename TConfig::MatPrec alphaConst, - Matrix &A, - const typename Matrix::MVector &E, - Vector &x, - typename TConfig::MatPrec betaConst, - Vector &y, - ViewType view=OWNED ); - - template - static void csrmm(typename TConfig::MatPrec alpha, - Matrix &A, - Vector &V, - typename TConfig::VecPrec beta, - Vector &Res); - -*/ - - template - static void csrmm(const bool transposed, - const bool sym, - const int m, - const int n, - const int k, - const int nnz, - const ValueType_* alpha, - const ValueType_* csrVal, - const IndexType_* csrRowPtr, - const IndexType_* csrColInd, - const ValueType_* x, - const int ldx, - const ValueType_* beta, - ValueType_* y, - const int ldy); - - //template - static void csr2coo( const int n, - const int nnz, - const int *csrRowPtr, - int *cooRowInd); -}; - -} // end namespace nvgraph - diff --git a/cpp/src/nvgraph/include/nvgraph_error.hxx b/cpp/src/nvgraph/include/nvgraph_error.hxx deleted file mode 100644 index cf7dff5b009..00000000000 --- a/cpp/src/nvgraph/include/nvgraph_error.hxx +++ /dev/null @@ -1,176 +0,0 @@ -/* - * Copyright (c) 2019-2020, NVIDIA CORPORATION. 
- * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#pragma once - -#include -#include -#include -#include - -#include "stacktrace.h" - -namespace nvgraph { - -#if defined(DEBUG) || defined(VERBOSE_DIAG) -#define STACKTRACE "\nStack trace:\n" + std::string(e.trace()) -#define WHERE " at: " << __FILE__ << ':' << __LINE__ -#else -#define STACKTRACE "" -#define WHERE "" -#endif - - -enum NVGRAPH_ERROR { -/********************************************************* - * Flags for status reporting - *********************************************************/ - NVGRAPH_OK=0, - NVGRAPH_ERR_BAD_PARAMETERS=1, - NVGRAPH_ERR_UNKNOWN=2, - NVGRAPH_ERR_CUDA_FAILURE=3, - NVGRAPH_ERR_THRUST_FAILURE=4, - NVGRAPH_ERR_IO=5, - NVGRAPH_ERR_NOT_IMPLEMENTED=6, - NVGRAPH_ERR_NO_MEMORY=7, - NVGRAPH_ERR_NOT_CONVERGED=8 -}; - -// define our own bad_alloc so we can set its .what() -class nvgraph_exception: public std::exception -{ - public: - inline nvgraph_exception(const std::string &w, const std::string &where, const std::string &trace, NVGRAPH_ERROR reason) : m_trace(trace), m_what(w), m_reason(reason), m_where(where) - { - } - - inline virtual ~nvgraph_exception(void) throw () {}; - - inline virtual const char *what(void) const throw() - { - return m_what.c_str(); - } - inline virtual const char *where(void) const throw() - { - return m_where.c_str(); - } - inline virtual const char *trace(void) const throw() - { - return m_trace.c_str(); - } - inline virtual NVGRAPH_ERROR reason(void) const throw() - { - return m_reason; - } - - - private: - std::string m_trace; - std::string m_what; - NVGRAPH_ERROR m_reason; - std::string m_where; -}; // end bad_alloc - - -int NVGRAPH_GetErrorString( NVGRAPH_ERROR error, char* buffer, int buf_len); - -/******************************************************** - * Prints the error message, the stack trace, and exits - * ******************************************************/ -#define FatalError(s, reason) { \ - std::stringstream _where; \ - _where << WHERE ; \ - std::stringstream _trace; \ - printStackTrace(_trace); \ - throw nvgraph_exception(std::string(s) + "\n", _where.str(), _trace.str(), reason); \ -} - -#undef cudaCheckError -#if defined(DEBUG) || defined(VERBOSE_DIAG) -#define cudaCheckError() { \ - cudaError_t e=cudaGetLastError(); \ - if(e!=cudaSuccess) { \ - std::stringstream _error; \ - _error << "Cuda failure: '" << cudaGetErrorString(e) << "'"; \ - FatalError(_error.str(), NVGRAPH_ERR_CUDA_FAILURE); \ - } \ -} -#else // NO DEBUG -#define cudaCheckError() \ - { \ - cudaError_t __e = cudaGetLastError(); \ - if (__e != cudaSuccess) { \ - FatalError("", NVGRAPH_ERR_CUDA_FAILURE); \ - } \ - } -#endif - -#define CHECK_CUDA(call) \ - { \ - cudaError_t _e = (call); \ - if (_e != cudaSuccess) \ - { \ - std::stringstream _error; \ - _error << "CUDA Runtime failure: '#" << _e << "'"; \ - FatalError(_error.str(), NVGRAPH_ERR_CUDA_FAILURE); \ - } \ - } - -#define CHECK_CURAND(call) \ - { \ - curandStatus_t _e = (call); \ - if (_e != CURAND_STATUS_SUCCESS) \ - { \ - 
std::stringstream _error; \ - _error << "CURAND failure: '#" << _e << "'"; \ - FatalError(_error.str(), NVGRAPH_ERR_CUDA_FAILURE); \ - } \ - } - -#define CHECK_CUBLAS(call) \ - { \ - cublasStatus_t _e = (call); \ - if (_e != CUBLAS_STATUS_SUCCESS) \ - { \ - std::stringstream _error; \ - _error << "CUBLAS failure: '#" << _e << "'"; \ - FatalError(_error.str(), NVGRAPH_ERR_CUDA_FAILURE); \ - } \ - } - -#define CHECK_CUSPARSE(call) \ - { \ - cusparseStatus_t _e = (call); \ - if (_e != CUSPARSE_STATUS_SUCCESS) \ - { \ - std::stringstream _error; \ - _error << "CURAND failure: '#" << _e << "'"; \ - FatalError(_error.str(), NVGRAPH_ERR_CUDA_FAILURE); \ - } \ - } - -#define CHECK_CUSOLVER(call) \ - { \ - cusolverStatus_t _e = (call); \ - if (_e != CUSOLVER_STATUS_SUCCESS) \ - { \ - std::stringstream _error; \ - _error << "CURAND failure: '#" << _e << "'"; \ - FatalError(_error.str(), NVGRAPH_ERR_CUDA_FAILURE); \ - } \ - } -} // namespace nvgraph - diff --git a/cpp/src/nvgraph/include/nvgraph_lapack.hxx b/cpp/src/nvgraph/include/nvgraph_lapack.hxx deleted file mode 100644 index a667a3717a2..00000000000 --- a/cpp/src/nvgraph/include/nvgraph_lapack.hxx +++ /dev/null @@ -1,55 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#pragma once -#include "nvgraph_error.hxx" -namespace nvgraph -{ -template class Lapack; - -template -class Lapack -{ -private: - Lapack(); - ~Lapack(); -public: - static void check_lapack_enabled(); - - static void gemm(bool transa, bool transb, int m, int n, int k, T alpha, const T * A, int lda, const T * B, int ldb, T beta, T * C, int ldc); - - // special QR for lanczos - static void sterf(int n, T * d, T * e); - static void steqr(char compz, int n, T * d, T * e, T * z, int ldz, T * work); - - // QR - // computes the QR factorization of a general matrix - static void geqrf (int m, int n, T *a, int lda, T *tau, T *work, int *lwork); - // Generates the real orthogonal matrix Q of the QR factorization formed by geqrf. 
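 // (Editorial note) Typical LAPACK-style QR flow with these wrappers, as an
 // illustrative sketch (workspace-size queries omitted; buffer names
 // hypothetical):
 //   Lapack<T>::geqrf(m, n, A, lda, tau, work, &lwork); // A now holds R in its
 //                                                      // upper triangle and Q
 //                                                      // as Householder reflectors
 //   Lapack<T>::ormqr(false, true, m, n, k, A, lda, tau, C, ldc, work, &lwork);
 //   // applies Q^T to C from the left without ever forming Q explicitly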
- //static void orgqr( int m, int n, int k, T* a, int lda, const T* tau, T* work, int* lwork );
- // multiply C by implicit Q
- static void ormqr (bool right_side, bool transq, int m, int n, int k, T *a, int lda, T *tau, T *c, int ldc, T *work, int *lwork);
- //static void unmqr (bool right_side, bool transq, int m, int n, int k, T *a, int lda, T *tau, T *c, int ldc, T *work, int *lwork);
- //static void qrf (int n, T *H, T *Q, T *R);
-
- //static void hseqr (T* Q, T* R, T* eigenvalues, T* eigenvectors, int dim, int ldh, int ldq);
- static void geev(T* A, T* eigenvalues, int dim, int lda);
- static void geev(T* A, T* eigenvalues, T* eigenvectors, int dim, int lda, int ldvr);
- static void geev(T* A, T* eigenvalues_r, T* eigenvalues_i, T* eigenvectors_r, T* eigenvectors_i, int dim, int lda, int ldvr);
-
-};
-} // end namespace nvgraph
-
diff --git a/cpp/src/nvgraph/include/nvgraph_vector.hxx b/cpp/src/nvgraph/include/nvgraph_vector.hxx
deleted file mode 100644
index 228c83686dc..00000000000
--- a/cpp/src/nvgraph/include/nvgraph_vector.hxx
+++ /dev/null
@@ -1,87 +0,0 @@
-/*
- * Copyright (c) 2019, NVIDIA CORPORATION.
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#pragma once
-
-#include "nvgraph_error.hxx"
-#include "nvgraph_vector_kernels.hxx"
-
-#include <rmm/thrust_rmm_allocator.h>
-
-#include "debug_macros.h"
-
-namespace nvgraph
-{
-
-/*! A Vector contains a device vector of size |E| and type T
- */
-template <typename ValueType_>
-class Vector {
-public:
-  typedef ValueType_ ValueType;
-
-protected:
-  rmm::device_vector<ValueType> values;
-
-public:
-  /*! Construct an empty \p Vector.
-   */
-  Vector(void) {}
-  ~Vector(void) {}
-  /*! Construct a \p Vector of size vertices.
- * - * \param vertices The size of the Vector - */ - Vector(size_t vertices, cudaStream_t stream = 0) - : values(vertices) {} - - size_t get_size() const { return values.size(); } - size_t bytes() const { return values.size()*sizeof(ValueType);} - ValueType const *raw() const { return values.data().get(); } - ValueType *raw() { return values.data().get(); } - - void allocate(size_t n, cudaStream_t stream = 0) - { - values.resize(n); - } - - void fill(ValueType val, cudaStream_t stream = 0) - { - fill_raw_vec(this->raw(), this->get_size(), val, stream); - } - - void copy(Vector &vec1, cudaStream_t stream = 0) - { - if (this->get_size() == 0 && vec1.get_size()>0) { - allocate(vec1.get_size(), stream); - copy_vec(vec1.raw(), this->get_size(), this->raw(), stream); - } else if (this->get_size() == vec1.get_size()) - copy_vec(vec1.raw(), this->get_size(), this->raw(), stream); - else if (this->get_size() > vec1.get_size()) { - copy_vec(vec1.raw(), vec1.get_size(), this->raw(), stream); - } else { - FatalError("Cannot copy a vector into a smaller one", NVGRAPH_ERR_BAD_PARAMETERS); - } - } - - ValueType nrm1(cudaStream_t stream = 0) { - ValueType res = 0; - nrm1_raw_vec(this->raw(), this->get_size(), &res, stream); - return res; - } -}; // class Vector -} // end namespace nvgraph - diff --git a/cpp/src/nvgraph/include/nvgraph_vector_kernels.hxx b/cpp/src/nvgraph/include/nvgraph_vector_kernels.hxx deleted file mode 100644 index 9a0e640044a..00000000000 --- a/cpp/src/nvgraph/include/nvgraph_vector_kernels.hxx +++ /dev/null @@ -1,42 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#pragma once -namespace nvgraph -{ - template - void nrm1_raw_vec (ValueType_* vec, size_t n, ValueType_* res, cudaStream_t stream = 0); - - template - void fill_raw_vec (ValueType_* vec, size_t n, ValueType_ value, cudaStream_t stream = 0); - - template - void dump_raw_vec (ValueType_* vec, size_t n, int offset, cudaStream_t stream = 0); - - template - void dmv (size_t num_vertices, ValueType_ alpha, ValueType_* D, ValueType_* x, ValueType_ beta, ValueType_* y, cudaStream_t stream = 0); - - template - void copy_vec(ValueType_ *vec1, size_t n, ValueType_ *res, cudaStream_t stream = 0); - - template - void flag_zeros_raw_vec(size_t num_vertices, ValueType_* vec, int* flag, cudaStream_t stream = 0 ); - - template - void set_connectivity( size_t n, IndexType_ root, ValueType_ self_loop_val, ValueType_ unreachable_val, ValueType_* res, cudaStream_t stream = 0); - -} // end namespace nvgraph - diff --git a/cpp/src/nvgraph/include/partition.hxx b/cpp/src/nvgraph/include/partition.hxx deleted file mode 100644 index 10673d1eee3..00000000000 --- a/cpp/src/nvgraph/include/partition.hxx +++ /dev/null @@ -1,93 +0,0 @@ -/* - * Copyright (c) 2019-2020, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. 
- * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#pragma once - -#include - -#include "nvgraph_error.hxx" -#include "spectral_matrix.hxx" - - -namespace nvgraph { - #define SPECTRAL_USE_COLORING true - - #define SPECTRAL_USE_LOBPCG true - #define SPECTRAL_USE_PRECONDITIONING true - #define SPECTRAL_USE_SCALING_OF_EIGVECS false - - #define SPECTRAL_USE_MAGMA false - #define SPECTRAL_USE_THROTTLE true - #define SPECTRAL_USE_NORMALIZED_LAPLACIAN true - #define SPECTRAL_USE_R_ORTHOGONALIZATION false - - /// Spectral graph partition - /** Compute partition for a weighted undirected graph. This - * partition attempts to minimize the cost function: - * Cost = \sum_i (Edges cut by ith partition)/(Vertices in ith partition) - * - * @param G Weighted graph in CSR format - * @param nParts Number of partitions. - * @param nEigVecs Number of eigenvectors to compute. - * @param maxIter_lanczos Maximum number of Lanczos iterations. - * @param restartIter_lanczos Maximum size of Lanczos system before - * implicit restart. - * @param tol_lanczos Convergence tolerance for Lanczos method. - * @param maxIter_kmeans Maximum number of k-means iterations. - * @param tol_kmeans Convergence tolerance for k-means algorithm. - * @param parts (Output, device memory, n entries) Partition - * assignments. - * @param iters_lanczos On exit, number of Lanczos iterations - * performed. - * @param iters_kmeans On exit, number of k-means iterations - * performed. - * @return NVGRAPH error flag. - */ - template - NVGRAPH_ERROR partition(cugraph::experimental::GraphCSRView const &graph, - vertex_t nParts, - vertex_t nEigVecs, - int maxIter_lanczos, - int restartIter_lanczos, - weight_t tol_lanczos, - int maxIter_kmeans, - weight_t tol_kmeans, - vertex_t * __restrict__ parts, - weight_t *eigVals, - weight_t *eig_vects); - - /// Compute cost function for partition - /** This function determines the edges cut by a partition and a cost - * function: - * Cost = \sum_i (Edges cut by ith partition)/(Vertices in ith partition) - * Graph is assumed to be weighted and undirected. - * - * @param G Weighted graph in CSR format - * @param nParts Number of partitions. - * @param parts (Input, device memory, n entries) Partition - * assignments. - * @param edgeCut On exit, weight of edges cut by partition. - * @param cost On exit, partition cost function. - * @return NVGRAPH error flag. - */ - template - NVGRAPH_ERROR analyzePartition(cugraph::experimental::GraphCSRView const &graph, - vertex_t nParts, - const vertex_t * __restrict__ parts, - weight_t & edgeCut, weight_t & cost); - -} - diff --git a/cpp/src/nvgraph/include/sm_utils.h b/cpp/src/nvgraph/include/sm_utils.h deleted file mode 100644 index 001bffe136e..00000000000 --- a/cpp/src/nvgraph/include/sm_utils.h +++ /dev/null @@ -1,326 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. 
- * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#pragma once - -#ifdef _MSC_VER -#include -#else -#include -#endif - -#define DEFAULT_MASK 0xffffffff - -#define USE_CG 1 -//(__CUDACC_VER__ >= 80500) - -namespace nvgraph { -namespace utils { -static __device__ __forceinline__ int lane_id() -{ - int id; - asm("mov.u32 %0, %%laneid;" : "=r"(id)); - return id; -} - -static __device__ __forceinline__ int lane_mask_lt() -{ - int mask; - asm("mov.u32 %0, %%lanemask_lt;" : "=r"(mask)); - return mask; -} - -static __device__ __forceinline__ int lane_mask_le() -{ - int mask; - asm("mov.u32 %0, %%lanemask_le;" : "=r"(mask)); - return mask; -} - -static __device__ __forceinline__ int warp_id() { return threadIdx.x >> 5; } - -static __device__ __forceinline__ unsigned int ballot(int p, int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#if USE_CG - return __ballot_sync(mask, p); -#else - return __ballot(p); -#endif -#else - return 0; -#endif -} - -static __device__ __forceinline__ int shfl(int r, int lane, int bound = 32, int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#if USE_CG - return __shfl_sync(mask, r, lane, bound); -#else - return __shfl(r, lane, bound); -#endif -#else - return 0; -#endif -} - -static __device__ __forceinline__ float shfl(float r, - int lane, - int bound = 32, - int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#if USE_CG - return __shfl_sync(mask, r, lane, bound); -#else - return __shfl(r, lane, bound); -#endif -#else - return 0.0f; -#endif -} - -/// Warp shuffle down function -/** Warp shuffle functions on 64-bit floating point values are not - * natively implemented as of Compute Capability 5.0. This - * implementation has been copied from - * (http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler). - * Once this is natively implemented, this function can be replaced - * by __shfl_down. 
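 * (Editorial note) The implementation below reinterprets the 64-bit value as
 * an int2, shuffles the two 32-bit halves independently with 32-bit shuffles,
 * and reassembles the result in the receiving lane:
 *   int2 a = *reinterpret_cast<int2 *>(&r);
 *   a.x = __shfl_sync(mask, a.x, lane, bound);
 *   a.y = __shfl_sync(mask, a.y, lane, bound);
 *   return *reinterpret_cast<double *>(&a);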
- *
- */
-static __device__ __forceinline__ double shfl(double r,
                                              int lane,
                                              int bound = 32,
                                              int mask = DEFAULT_MASK)
-{
-#if __CUDA_ARCH__ >= 300
-#ifdef USE_CG
-  int2 a = *reinterpret_cast<int2 *>(&r);
-  a.x    = __shfl_sync(mask, a.x, lane, bound);
-  a.y    = __shfl_sync(mask, a.y, lane, bound);
-  return *reinterpret_cast<double *>(&a);
-#else
-  int2 a = *reinterpret_cast<int2 *>(&r);
-  a.x    = __shfl(a.x, lane, bound);
-  a.y    = __shfl(a.y, lane, bound);
-  return *reinterpret_cast<double *>(&a);
-#endif
-#else
-  return 0.0;
-#endif
-}
-
-static __device__ __forceinline__ long long shfl(long long r,
                                                  int lane,
                                                  int bound = 32,
                                                  int mask = DEFAULT_MASK)
-{
-#if __CUDA_ARCH__ >= 300
-#ifdef USE_CG
-  int2 a = *reinterpret_cast<int2 *>(&r);
-  a.x    = __shfl_sync(mask, a.x, lane, bound);
-  a.y    = __shfl_sync(mask, a.y, lane, bound);
-  return *reinterpret_cast<long long *>(&a);
-#else
-  int2 a = *reinterpret_cast<int2 *>(&r);
-  a.x    = __shfl(a.x, lane, bound);
-  a.y    = __shfl(a.y, lane, bound);
-  return *reinterpret_cast<long long *>(&a);
-#endif
-#else
-  return 0.0;
-#endif
-}
-
-static __device__ __forceinline__ int shfl_down(int r,
                                                int offset,
                                                int bound = 32,
                                                int mask = DEFAULT_MASK)
-{
-#if __CUDA_ARCH__ >= 300
-#ifdef USE_CG
-  return __shfl_down_sync(mask, r, offset, bound);
-#else
-  return __shfl_down(r, offset, bound);
-#endif
-#else
-  return 0.0f;
-#endif
-}
-
-static __device__ __forceinline__ float shfl_down(float r,
                                                   int offset,
                                                   int bound = 32,
                                                   int mask = DEFAULT_MASK)
-{
-#if __CUDA_ARCH__ >= 300
-#ifdef USE_CG
-  return __shfl_down_sync(mask, r, offset, bound);
-#else
-  return __shfl_down(r, offset, bound);
-#endif
-#else
-  return 0.0f;
-#endif
-}
-
-static __device__ __forceinline__ double shfl_down(double r,
                                                    int offset,
                                                    int bound = 32,
                                                    int mask = DEFAULT_MASK)
-{
-#if __CUDA_ARCH__ >= 300
-#ifdef USE_CG
-  int2 a = *reinterpret_cast<int2 *>(&r);
-  a.x    = __shfl_down_sync(mask, a.x, offset, bound);
-  a.y    = __shfl_down_sync(mask, a.y, offset, bound);
-  return *reinterpret_cast<double *>(&a);
-#else
-  int2 a = *reinterpret_cast<int2 *>(&r);
-  a.x    = __shfl_down(a.x, offset, bound);
-  a.y    = __shfl_down(a.y, offset, bound);
-  return *reinterpret_cast<double *>(&a);
-#endif
-#else
-  return 0.0;
-#endif
-}
-
-static __device__ __forceinline__ long long shfl_down(long long r,
                                                       int offset,
                                                       int bound = 32,
                                                       int mask = DEFAULT_MASK)
-{
-#if __CUDA_ARCH__ >= 300
-#ifdef USE_CG
-  int2 a = *reinterpret_cast<int2 *>(&r);
-  a.x    = __shfl_down_sync(mask, a.x, offset, bound);
-  a.y    = __shfl_down_sync(mask, a.y, offset, bound);
-  return *reinterpret_cast<long long *>(&a);
-#else
-  int2 a = *reinterpret_cast<int2 *>(&r);
-  a.x    = __shfl_down(a.x, offset, bound);
-  a.y    = __shfl_down(a.y, offset, bound);
-  return *reinterpret_cast<long long *>(&a);
-#endif
-#else
-  return 0.0;
-#endif
-}
-
-// specifically for triangles counting
-static __device__ __forceinline__ uint64_t shfl_down(uint64_t r,
                                                      int offset,
                                                      int bound = 32,
                                                      int mask = DEFAULT_MASK)
-{
-#if __CUDA_ARCH__ >= 300
-#ifdef USE_CG
-  int2 a = *reinterpret_cast<int2 *>(&r);
-  a.x    = __shfl_down_sync(mask, a.x, offset, bound);
-  a.y    = __shfl_down_sync(mask, a.y, offset, bound);
-  return *reinterpret_cast<uint64_t *>(&a);
-#else
-  int2 a = *reinterpret_cast<int2 *>(&r);
-  a.x    = __shfl_down(a.x, offset, bound);
-  a.y    = __shfl_down(a.y, offset, bound);
-  return *reinterpret_cast<uint64_t *>(&a);
-#endif
-#else
-  return 0.0;
-#endif
-}
-
-static __device__ __forceinline__ int shfl_up(int r,
                                              int offset,
                                              int bound = 32,
                                              int mask = DEFAULT_MASK)
-{
-#if __CUDA_ARCH__ >= 300
-#ifdef USE_CG
-  return __shfl_up_sync(mask, r, offset, bound);
-#else
-  return __shfl_up(r, offset, bound);
-#endif
-#else
-  return 0.0f;
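// (Editorial note) On architectures without warp shuffle support
// (__CUDA_ARCH__ < 300) all of these helpers compile to a stub that simply
// returns 0 instead of shuffling; callers are expected never to rely on these
// fallback paths on supported GPUs.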
-#endif -} - -static __device__ __forceinline__ float shfl_up(float r, - int offset, - int bound = 32, - int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#ifdef USE_CG - return __shfl_up_sync(mask, r, offset, bound); -#else - return __shfl_up(r, offset, bound); -#endif -#else - return 0.0f; -#endif -} - -static __device__ __forceinline__ double shfl_up(double r, - int offset, - int bound = 32, - int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#ifdef USE_CG - int2 a = *reinterpret_cast(&r); - a.x = __shfl_up_sync(mask, a.x, offset, bound); - a.y = __shfl_up_sync(mask, a.y, offset, bound); - return *reinterpret_cast(&a); -#else - int2 a = *reinterpret_cast(&r); - a.x = __shfl_up(a.x, offset, bound); - a.y = __shfl_up(a.y, offset, bound); - return *reinterpret_cast(&a); -#endif -#else - return 0.0; -#endif -} - -static __device__ __forceinline__ long long shfl_up(long long r, - int offset, - int bound = 32, - int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#ifdef USE_CG - int2 a = *reinterpret_cast(&r); - a.x = __shfl_up_sync(mask, a.x, offset, bound); - a.y = __shfl_up_sync(mask, a.y, offset, bound); - return *reinterpret_cast(&a); -#else - int2 a = *reinterpret_cast(&r); - a.x = __shfl_up(a.x, offset, bound); - a.y = __shfl_up(a.y, offset, bound); - return *reinterpret_cast(&a); -#endif -#else - return 0.0; -#endif -} -} // namespace utils - -} // namespace nvgraph diff --git a/cpp/src/nvgraph/include/spectral_matrix.hxx b/cpp/src/nvgraph/include/spectral_matrix.hxx deleted file mode 100644 index d3f6e0411da..00000000000 --- a/cpp/src/nvgraph/include/spectral_matrix.hxx +++ /dev/null @@ -1,785 +0,0 @@ -/* - * Copyright (c) 2019-2020, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -#pragma once - -#include -#include -#include -#include -#include - -#include "nvgraph_vector.hxx" - -namespace nvgraph { - - /// Abstract matrix class - /** Derived classes must implement matrix-vector products. - */ - template - class Matrix { - public: - /// Number of rows - const IndexType_ m; - /// Number of columns - const IndexType_ n; - /// CUDA stream - cudaStream_t s; - - /// Constructor - /** @param _m Number of rows. - * @param _n Number of columns. - */ - Matrix(IndexType_ _m, IndexType_ _n) : m(_m), n(_n), s(0){} - - /// Destructor - virtual ~Matrix() {} - - - /// Get and Set CUDA stream - virtual void setCUDAStream(cudaStream_t _s) = 0; - virtual void getCUDAStream(cudaStream_t *_s) = 0; - - /// Matrix-vector product - /** y is overwritten with alpha*A*x+beta*y. - * - * @param alpha Scalar. - * @param x (Input, device memory, n entries) Vector. - * @param beta Scalar. - * @param y (Input/output, device memory, m entries) Output - * vector. 
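 * @note (Editorial) The concrete subclasses that follow (DenseMatrix,
 * CsrMatrix, LaplacianMatrix, ModularityMatrix) each implement this product
 * for their own storage, so solver code can be written against the abstract
 * Matrix interface alone. Illustrative sketch (names hypothetical):
 *   void power_step(const Matrix<int, float> &A, const float *x, float *y) {
 *     A.mv(1.0f, x, 0.0f, y);  // y = A*x
 *   }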
- */ - virtual void mv(ValueType_ alpha, - const ValueType_ * __restrict__ x, - ValueType_ beta, - ValueType_ * __restrict__ y) const = 0; - - virtual void mm(IndexType_ k, ValueType_ alpha, const ValueType_ * __restrict__ x, ValueType_ beta, ValueType_ * __restrict__ y) const = 0; - /// Color and Reorder - virtual void color(IndexType_ *c, IndexType_ *p) const = 0; - virtual void reorder(IndexType_ *p) const = 0; - - /// Incomplete Cholesky (setup, factor and solve) - virtual void prec_setup(Matrix * _M) = 0; - virtual void prec_solve(IndexType_ k, ValueType_ alpha, ValueType_ * __restrict__ fx, ValueType_ * __restrict__ t) const = 0; - - //Get the sum of all edges - virtual ValueType_ getEdgeSum() const = 0; - }; - - /// Dense matrix class - template - class DenseMatrix : public Matrix { - - private: - /// Whether to transpose matrix - const bool trans; - /// Matrix entries, stored column-major in device memory - const ValueType_ * A; - /// Leading dimension of matrix entry array - const IndexType_ lda; - - public: - /// Constructor - DenseMatrix(bool _trans, - IndexType_ _m, IndexType_ _n, - const ValueType_ * _A, IndexType_ _lda); - - /// Destructor - virtual ~DenseMatrix(); - - /// Get and Set CUDA stream - virtual void setCUDAStream(cudaStream_t _s); - virtual void getCUDAStream(cudaStream_t *_s); - - /// Matrix-vector product - virtual void mv(ValueType_ alpha, const ValueType_ * __restrict__ x, - ValueType_ beta, ValueType_ * __restrict__ y) const; - /// Matrix-set of k vectors product - virtual void mm(IndexType_ k, ValueType_ alpha, const ValueType_ * __restrict__ x, ValueType_ beta, ValueType_ * __restrict__ y) const; - - /// Color and Reorder - virtual void color(IndexType_ *c, IndexType_ *p) const; - virtual void reorder(IndexType_ *p) const; - - /// Incomplete Cholesky (setup, factor and solve) - virtual void prec_setup(Matrix * _M); - virtual void prec_solve(IndexType_ k, ValueType_ alpha, ValueType_ * __restrict__ fx, ValueType_ * __restrict__ t) const; - - //Get the sum of all edges - virtual ValueType_ getEdgeSum() const; - }; - - /// Sparse matrix class in CSR format - template - class CsrMatrix : public Matrix { - - private: - /// Whether to transpose matrix - const bool trans; - /// Whether matrix is stored in symmetric format - const bool sym; - /// Number of non-zero entries - const IndexType_ nnz; - /// Matrix properties - const cusparseMatDescr_t descrA; - /// Matrix entry values (device memory) - /*const*/ ValueType_ * csrValA; - /// Pointer to first entry in each row (device memory) - const IndexType_ * csrRowPtrA; - /// Column index of each matrix entry (device memory) - const IndexType_ * csrColIndA; - /// Analysis info (pointer to opaque CUSPARSE struct) - cusparseSolveAnalysisInfo_t info_l; - cusparseSolveAnalysisInfo_t info_u; - /// factored flag (originally set to false, then reset to true after factorization), - /// notice we only want to factor once - bool factored; - - public: - /// Constructor - CsrMatrix(bool _trans, bool _sym, - IndexType_ _m, IndexType_ _n, IndexType_ _nnz, - const cusparseMatDescr_t _descrA, - /*const*/ ValueType_ * _csrValA, - const IndexType_ * _csrRowPtrA, - const IndexType_ * _csrColIndA); - - /// Destructor - virtual ~CsrMatrix(); - - /// Get and Set CUDA stream - virtual void setCUDAStream(cudaStream_t _s); - virtual void getCUDAStream(cudaStream_t *_s); - - - /// Matrix-vector product - virtual void mv(ValueType_ alpha, const ValueType_ * __restrict__ x, - ValueType_ beta, ValueType_ * __restrict__ y) const; - /// Matrix-set 
of k vectors product - virtual void mm(IndexType_ k, ValueType_ alpha, const ValueType_ * __restrict__ x, ValueType_ beta, ValueType_ * __restrict__ y) const; - - /// Color and Reorder - virtual void color(IndexType_ *c, IndexType_ *p) const; - virtual void reorder(IndexType_ *p) const; - - /// Incomplete Cholesky (setup, factor and solve) - virtual void prec_setup(Matrix * _M); - virtual void prec_solve(IndexType_ k, ValueType_ alpha, ValueType_ * __restrict__ fx, ValueType_ * __restrict__ t) const; - - //Get the sum of all edges - virtual ValueType_ getEdgeSum() const; - }; - - /// Graph Laplacian matrix - template - class LaplacianMatrix - : public Matrix { - - private: - /// Adjacency matrix - /*const*/ Matrix * A; - /// Degree of each vertex - Vector D; - /// Preconditioning matrix - Matrix * M; - - public: - /// Constructor - LaplacianMatrix(/*const*/ Matrix & _A); - - /// Destructor - virtual ~LaplacianMatrix(); - - /// Get and Set CUDA stream - virtual void setCUDAStream(cudaStream_t _s); - virtual void getCUDAStream(cudaStream_t *_s); - - /// Matrix-vector product - virtual void mv(ValueType_ alpha, const ValueType_ * __restrict__ x, - ValueType_ beta, ValueType_ * __restrict__ y) const; - /// Matrix-set of k vectors product - virtual void mm(IndexType_ k, ValueType_ alpha, const ValueType_ * __restrict__ x, ValueType_ beta, ValueType_ * __restrict__ y) const; - - /// Scale a set of k vectors by a diagonal - virtual void dm(IndexType_ k, ValueType_ alpha, const ValueType_ * __restrict__ x, ValueType_ beta, ValueType_ * __restrict__ y) const; - - /// Color and Reorder - virtual void color(IndexType_ *c, IndexType_ *p) const; - virtual void reorder(IndexType_ *p) const; - - /// Solve preconditioned system M x = f for a set of k vectors - virtual void prec_setup(Matrix * _M); - virtual void prec_solve(IndexType_ k, ValueType_ alpha, ValueType_ * __restrict__ fx, ValueType_ * __restrict__ t) const; - - //Get the sum of all edges - virtual ValueType_ getEdgeSum() const; - }; - - /// Modularity matrix - template - class ModularityMatrix - : public Matrix { - - private: - /// Adjacency matrix - /*const*/ Matrix * A; - /// Degree of each vertex - Vector D; - IndexType_ nnz; - ValueType_ edge_sum; - - /// Preconditioning matrix - Matrix * M; - - public: - /// Constructor - ModularityMatrix(/*const*/ Matrix & _A, IndexType_ _nnz); - - /// Destructor - virtual ~ModularityMatrix(); - - /// Get and Set CUDA stream - virtual void setCUDAStream(cudaStream_t _s); - virtual void getCUDAStream(cudaStream_t *_s); - - /// Matrix-vector product - virtual void mv(ValueType_ alpha, const ValueType_ * __restrict__ x, - ValueType_ beta, ValueType_ * __restrict__ y) const; - /// Matrix-set of k vectors product - virtual void mm(IndexType_ k, ValueType_ alpha, const ValueType_ * __restrict__ x, ValueType_ beta, ValueType_ * __restrict__ y) const; - - /// Scale a set of k vectors by a diagonal - virtual void dm(IndexType_ k, ValueType_ alpha, const ValueType_ * __restrict__ x, ValueType_ beta, ValueType_ * __restrict__ y) const; - - /// Color and Reorder - virtual void color(IndexType_ *c, IndexType_ *p) const; - virtual void reorder(IndexType_ *p) const; - - /// Solve preconditioned system M x = f for a set of k vectors - virtual void prec_setup(Matrix * _M); - virtual void prec_solve(IndexType_ k, ValueType_ alpha, ValueType_ * __restrict__ fx, ValueType_ * __restrict__ t) const; - - //Get the sum of all edges - virtual ValueType_ getEdgeSum() const; - }; - -// cublasIxamax -inline -cublasStatus_t 
cublasIxamax(cublasHandle_t handle, int n, - const float *x, int incx, int *result) { - return cublasIsamax(handle, n, x, incx, result); -} -inline -cublasStatus_t cublasIxamax(cublasHandle_t handle, int n, - const double *x, int incx, int *result) { - return cublasIdamax(handle, n, x, incx, result); -} - -// cublasIxamin -inline -cublasStatus_t cublasIxamin(cublasHandle_t handle, int n, - const float *x, int incx, int *result) { - return cublasIsamin(handle, n, x, incx, result); -} -inline -cublasStatus_t cublasIxamin(cublasHandle_t handle, int n, - const double *x, int incx, int *result) { - return cublasIdamin(handle, n, x, incx, result); -} - -// cublasXasum -inline -cublasStatus_t cublasXasum(cublasHandle_t handle, int n, - const float *x, int incx, - float *result) { - return cublasSasum(handle, n, x, incx, result); -} -inline -cublasStatus_t cublasXasum(cublasHandle_t handle, int n, - const double *x, int incx, - double *result) { - return cublasDasum(handle, n, x, incx, result); -} - -// cublasXaxpy -inline -cublasStatus_t cublasXaxpy(cublasHandle_t handle, int n, - const float * alpha, - const float * x, int incx, - float * y, int incy) { - return cublasSaxpy(handle, n, alpha, x, incx, y, incy); -} -inline -cublasStatus_t cublasXaxpy(cublasHandle_t handle, int n, - const double *alpha, - const double *x, int incx, - double *y, int incy) { - return cublasDaxpy(handle, n, alpha, x, incx, y, incy); -} - -// cublasXcopy -inline -cublasStatus_t cublasXcopy(cublasHandle_t handle, int n, - const float *x, int incx, - float *y, int incy) { - return cublasScopy(handle, n, x, incx, y, incy); -} -inline -cublasStatus_t cublasXcopy(cublasHandle_t handle, int n, - const double *x, int incx, - double *y, int incy) { - return cublasDcopy(handle, n, x, incx, y, incy); -} - -// cublasXdot -inline -cublasStatus_t cublasXdot(cublasHandle_t handle, int n, - const float *x, int incx, - const float *y, int incy, - float *result) { - return cublasSdot(handle, n, x, incx, y, incy, result); -} -inline -cublasStatus_t cublasXdot(cublasHandle_t handle, int n, - const double *x, int incx, - const double *y, int incy, - double *result) { - return cublasDdot(handle, n, x, incx, y, incy, result); -} - -// cublasXnrm2 -inline -cublasStatus_t cublasXnrm2(cublasHandle_t handle, int n, - const float *x, int incx, - float *result) { - return cublasSnrm2(handle, n, x, incx, result); -} -inline -cublasStatus_t cublasXnrm2(cublasHandle_t handle, int n, - const double *x, int incx, - double *result) { - return cublasDnrm2(handle, n, x, incx, result); -} - -// cublasXscal -inline -cublasStatus_t cublasXscal(cublasHandle_t handle, int n, - const float *alpha, - float *x, int incx) { - return cublasSscal(handle, n, alpha, x, incx); -} -inline -cublasStatus_t cublasXscal(cublasHandle_t handle, int n, - const double *alpha, - double *x, int incx) { - return cublasDscal(handle, n, alpha, x, incx); -} - -// cublasXgemv -inline -cublasStatus_t cublasXgemv(cublasHandle_t handle, - cublasOperation_t trans, - int m, int n, - const float *alpha, - const float *A, int lda, - const float *x, int incx, - const float *beta, - float *y, int incy) { - return cublasSgemv(handle, trans, m, n, alpha, A, lda, x, incx, - beta, y, incy); -} -inline -cublasStatus_t cublasXgemv(cublasHandle_t handle, - cublasOperation_t trans, - int m, int n, - const double *alpha, - const double *A, int lda, - const double *x, int incx, - const double *beta, - double *y, int incy) { - return cublasDgemv(handle, trans, m, n, alpha, A, lda, x, incx, - beta, y, 
incy); -} - -// cublasXger -inline -cublasStatus_t cublasXger(cublasHandle_t handle, int m, int n, - const float *alpha, - const float *x, int incx, - const float *y, int incy, - float *A, int lda) { - return cublasSger(handle, m, n, alpha, x, incx, y, incy, A, lda); -} -inline -cublasStatus_t cublasXger(cublasHandle_t handle, int m, int n, - const double *alpha, - const double *x, int incx, - const double *y, int incy, - double *A, int lda) { - return cublasDger(handle, m, n, alpha, x, incx, y, incy, A, lda); -} - -// cublasXgemm -inline -cublasStatus_t cublasXgemm(cublasHandle_t handle, - cublasOperation_t transa, - cublasOperation_t transb, - int m, int n, int k, - const float *alpha, - const float *A, int lda, - const float *B, int ldb, - const float *beta, - float *C, int ldc) { - return cublasSgemm(handle, transa, transb, m, n, k, - alpha, A, lda, B, ldb, beta, C, ldc); -} -inline -cublasStatus_t cublasXgemm(cublasHandle_t handle, - cublasOperation_t transa, - cublasOperation_t transb, - int m, int n, int k, - const double *alpha, - const double *A, int lda, - const double *B, int ldb, - const double *beta, - double *C, int ldc) { - return cublasDgemm(handle, transa, transb, m, n, k, - alpha, A, lda, B, ldb, beta, C, ldc); -} - -// cublasXgeam -inline -cublasStatus_t cublasXgeam(cublasHandle_t handle, - cublasOperation_t transa, - cublasOperation_t transb, - int m, int n, - const float *alpha, - const float *A, int lda, - const float *beta, - const float *B, int ldb, - float *C, int ldc) { - return cublasSgeam(handle, transa, transb, m, n, - alpha, A, lda, beta, B, ldb, C, ldc); -} -inline -cublasStatus_t cublasXgeam(cublasHandle_t handle, - cublasOperation_t transa, - cublasOperation_t transb, - int m, int n, - const double *alpha, - const double *A, int lda, - const double *beta, - const double *B, int ldb, - double *C, int ldc) { - return cublasDgeam(handle, transa, transb, m, n, - alpha, A, lda, beta, B, ldb, C, ldc); -} - -// cublasXtrsm -inline cublasStatus_t cublasXtrsm(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, cublasDiagType_t diag, int m, int n, const float *alpha, const float *A, int lda, float *B, int ldb) { - return cublasStrsm(handle, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb); -} -inline cublasStatus_t cublasXtrsm(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, cublasDiagType_t diag, int m, int n, const double *alpha, const double *A, int lda, double *B, int ldb) { - return cublasDtrsm(handle, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb); -} - -// curandGeneratorNormalX -inline -curandStatus_t -curandGenerateNormalX(curandGenerator_t generator, - float * outputPtr, size_t n, - float mean, float stddev) { - return curandGenerateNormal(generator, outputPtr, n, mean, stddev); -} -inline -curandStatus_t -curandGenerateNormalX(curandGenerator_t generator, - double * outputPtr, size_t n, - double mean, double stddev) { - return curandGenerateNormalDouble(generator, outputPtr, - n, mean, stddev); -} - -// cusolverXpotrf_bufferSize -inline cusolverStatus_t cusolverXpotrf_bufferSize(cusolverDnHandle_t handle, int n, float *A, int lda, int *Lwork){ - return cusolverDnSpotrf_bufferSize(handle,CUBLAS_FILL_MODE_LOWER,n,A,lda,Lwork); -} -inline cusolverStatus_t cusolverXpotrf_bufferSize(cusolverDnHandle_t handle, int n, double *A, int lda, int *Lwork){ - return cusolverDnDpotrf_bufferSize(handle,CUBLAS_FILL_MODE_LOWER,n,A,lda,Lwork); -} - -// cusolverXpotrf -inline 
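// (Editorial note) The potrf wrappers below continue this header's
// overload-on-precision pattern: one C++ name dispatches to the S (float) or
// D (double) CUSOLVER entry point, so templated callers pick the right
// precision at compile time. Illustrative sketch (buffers hypothetical):
//   int lwork;
//   cusolverXpotrf_bufferSize(handle, n, A, lda, &lwork);    // query workspace
//   cusolverXpotrf(handle, n, A, lda, work, lwork, devInfo);  // Cholesky factor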
cusolverStatus_t cusolverXpotrf(cusolverDnHandle_t handle, int n, float *A, int lda, float *Workspace, int Lwork, int *devInfo){ - return cusolverDnSpotrf(handle,CUBLAS_FILL_MODE_LOWER,n,A,lda,Workspace,Lwork,devInfo); -} -inline cusolverStatus_t cusolverXpotrf(cusolverDnHandle_t handle, int n, double *A, int lda, double *Workspace, int Lwork, int *devInfo){ - return cusolverDnDpotrf(handle,CUBLAS_FILL_MODE_LOWER,n,A,lda,Workspace,Lwork,devInfo); -} - -// cusolverXgesvd_bufferSize -inline cusolverStatus_t cusolverXgesvd_bufferSize(cusolverDnHandle_t handle, int m, int n, float *A, int lda, float *U, int ldu, float *VT, int ldvt, int *Lwork){ - //ideally - //char jobu = 'O'; - //char jobvt= 'N'; - //only supported - //char jobu = 'A'; - //char jobvt= 'A'; - return cusolverDnSgesvd_bufferSize(handle,m,n,Lwork); -} - -inline cusolverStatus_t cusolverXgesvd_bufferSize(cusolverDnHandle_t handle, int m, int n, double *A, int lda, double *U, int ldu, double *VT, int ldvt, int *Lwork){ - //ideally - //char jobu = 'O'; - //char jobvt= 'N'; - //only supported - //char jobu = 'A'; - //char jobvt= 'A'; - return cusolverDnDgesvd_bufferSize(handle,m,n,Lwork); -} - -// cusolverXgesvd -inline cusolverStatus_t cusolverXgesvd(cusolverDnHandle_t handle, int m, int n, float *A, int lda, float *S, float *U, int ldu, float *VT, int ldvt, float *Work, int Lwork, float *rwork, int *devInfo){ - //ideally - //char jobu = 'O'; - //char jobvt= 'N'; - //only supported - char jobu = 'A'; - char jobvt= 'A'; - - return cusolverDnSgesvd(handle,jobu,jobvt,m,n,A,lda,S,U,ldu,VT,ldvt,Work,Lwork,rwork,devInfo); -} - -inline cusolverStatus_t cusolverXgesvd(cusolverDnHandle_t handle, int m, int n, double *A, int lda, double *S, double *U, int ldu, double *VT, int ldvt, double *Work, int Lwork, double *rwork, int *devInfo){ - //ideally - //char jobu = 'O'; - //char jobvt= 'N'; - //only supported - char jobu = 'A'; - char jobvt= 'A'; - return cusolverDnDgesvd(handle,jobu,jobvt,m,n,A,lda,S,U,ldu,VT,ldvt,Work,Lwork,rwork,devInfo); -} - -// cusolverXgesvd_cond -inline cusolverStatus_t cusolverXgesvd_cond(cusolverDnHandle_t handle, int m, int n, float *A, int lda, float *S, float *U, int ldu, float *VT, int ldvt, float *Work, int Lwork, float *rwork, int *devInfo){ - //ideally - //char jobu = 'N'; - //char jobvt= 'N'; - //only supported - char jobu = 'A'; - char jobvt= 'A'; - return cusolverDnSgesvd(handle,jobu,jobvt,m,n,A,lda,S,U,ldu,VT,ldvt,Work,Lwork,rwork,devInfo); -} - -inline cusolverStatus_t cusolverXgesvd_cond(cusolverDnHandle_t handle, int m, int n, double *A, int lda, double *S, double *U, int ldu, double *VT, int ldvt, double *Work, int Lwork, double *rwork, int *devInfo){ - //ideally - //char jobu = 'N'; - //char jobvt= 'N'; - //only supported - char jobu = 'A'; - char jobvt= 'A'; - return cusolverDnDgesvd(handle,jobu,jobvt,m,n,A,lda,S,U,ldu,VT,ldvt,Work,Lwork,rwork,devInfo); -} - -// cusparseXcsrmv -inline -cusparseStatus_t cusparseXcsrmv(cusparseHandle_t handle, - cusparseOperation_t transA, - int m, int n, int nnz, - const float * alpha, - const cusparseMatDescr_t descrA, - const float * csrValA, - const int * csrRowPtrA, - const int * csrColIndA, - const float * x, - const float * beta, - float *y) { - return cusparseScsrmv_mp(handle, transA, m, n, nnz, - alpha, descrA, csrValA, csrRowPtrA, csrColIndA, - x, beta, y); -} -inline -cusparseStatus_t cusparseXcsrmv(cusparseHandle_t handle, - cusparseOperation_t transA, - int m, int n, int nnz, - const double * alpha, - const cusparseMatDescr_t descrA, - const double * 
csrValA,
-                                const int * csrRowPtrA,
-                                const int * csrColIndA,
-                                const double * x,
-                                const double * beta,
-                                double *y) {
-  return cusparseDcsrmv_mp(handle, transA, m, n, nnz,
-                           alpha, descrA, csrValA, csrRowPtrA, csrColIndA,
-                           x, beta, y);
-}
-
-// cusparseXcsrmm
-inline
-cusparseStatus_t cusparseXcsrmm(cusparseHandle_t handle,
-                                cusparseOperation_t transA,
-                                int m, int n, int k, int nnz,
-                                const float *alpha,
-                                const cusparseMatDescr_t descrA,
-                                const float *csrValA,
-                                const int *csrRowPtrA,
-                                const int *csrColIndA,
-                                const float *B, int ldb,
-                                const float *beta,
-                                float *C, int ldc) {
-  return cusparseScsrmm(handle, transA, m, n, k, nnz,
-                        alpha, descrA, csrValA,
-                        csrRowPtrA, csrColIndA,
-                        B, ldb, beta, C, ldc);
-}
-inline
-cusparseStatus_t cusparseXcsrmm(cusparseHandle_t handle,
-                                cusparseOperation_t transA,
-                                int m, int n, int k, int nnz,
-                                const double *alpha,
-                                const cusparseMatDescr_t descrA,
-                                const double *csrValA,
-                                const int *csrRowPtrA,
-                                const int *csrColIndA,
-                                const double *B, int ldb,
-                                const double *beta,
-                                double *C, int ldc) {
-  return cusparseDcsrmm(handle, transA, m, n, k, nnz,
-                        alpha, descrA, csrValA,
-                        csrRowPtrA, csrColIndA,
-                        B, ldb, beta, C, ldc);
-}
-
-// cusparseXcsrgeam
-inline
-cusparseStatus_t cusparseXcsrgeam(cusparseHandle_t handle,
-                                  int m, int n,
-                                  const float *alpha,
-                                  const cusparseMatDescr_t descrA,
-                                  int nnzA, const float *csrValA,
-                                  const int *csrRowPtrA,
-                                  const int *csrColIndA,
-                                  const float *beta,
-                                  const cusparseMatDescr_t descrB,
-                                  int nnzB, const float *csrValB,
-                                  const int *csrRowPtrB,
-                                  const int *csrColIndB,
-                                  const cusparseMatDescr_t descrC,
-                                  float *csrValC,
-                                  int *csrRowPtrC, int *csrColIndC) {
-  return cusparseScsrgeam(handle,m,n,
-                          alpha,descrA,nnzA,csrValA,csrRowPtrA,csrColIndA,
-                          beta,descrB,nnzB,csrValB,csrRowPtrB,csrColIndB,
-                          descrC,csrValC,csrRowPtrC,csrColIndC);
-}
-inline
-cusparseStatus_t cusparseXcsrgeam(cusparseHandle_t handle,
-                                  int m, int n,
-                                  const double *alpha,
-                                  const cusparseMatDescr_t descrA,
-                                  int nnzA, const double *csrValA,
-                                  const int *csrRowPtrA,
-                                  const int *csrColIndA,
-                                  const double *beta,
-                                  const cusparseMatDescr_t descrB,
-                                  int nnzB, const double *csrValB,
-                                  const int *csrRowPtrB,
-                                  const int *csrColIndB,
-                                  const cusparseMatDescr_t descrC,
-                                  double *csrValC,
-                                  int *csrRowPtrC, int *csrColIndC) {
-  return cusparseDcsrgeam(handle,m,n,
-                          alpha,descrA,nnzA,csrValA,csrRowPtrA,csrColIndA,
-                          beta,descrB,nnzB,csrValB,csrRowPtrB,csrColIndB,
-                          descrC,csrValC,csrRowPtrC,csrColIndC);
-}
-
-//ILU0, incomplete-LU with 0 threshold (CUSPARSE)
-inline cusparseStatus_t cusparseXcsrilu0(cusparseHandle_t handle,
-                                         cusparseOperation_t trans,
-                                         int m,
-                                         const cusparseMatDescr_t descrA,
-                                         float *csrValM,
-                                         const int *csrRowPtrA,
-                                         const int *csrColIndA,
-                                         cusparseSolveAnalysisInfo_t info){
-  return cusparseScsrilu0(handle,trans,m,descrA,csrValM,csrRowPtrA,csrColIndA,info);
-}
-
-inline cusparseStatus_t cusparseXcsrilu0(cusparseHandle_t handle,
-                                         cusparseOperation_t trans,
-                                         int m,
-                                         const cusparseMatDescr_t descrA,
-                                         double *csrValM,
-                                         const int *csrRowPtrA,
-                                         const int *csrColIndA,
-                                         cusparseSolveAnalysisInfo_t info){
-  return cusparseDcsrilu0(handle,trans,m,descrA,csrValM,csrRowPtrA,csrColIndA,info);
-}
-
-//IC0, incomplete-Cholesky with 0 threshold (CUSPARSE)
-inline cusparseStatus_t cusparseXcsric0(cusparseHandle_t handle,
-                                        cusparseOperation_t trans,
-                                        int m,
-                                        const cusparseMatDescr_t descrA,
-                                        float *csrValM,
-                                        const int *csrRowPtrA,
-                                        const int *csrColIndA,
-                                        cusparseSolveAnalysisInfo_t info){
-  return
cusparseScsric0(handle,trans,m,descrA,csrValM,csrRowPtrA,csrColIndA,info); -} -inline cusparseStatus_t cusparseXcsric0(cusparseHandle_t handle, - cusparseOperation_t trans, - int m, - const cusparseMatDescr_t descrA, - double *csrValM, - const int *csrRowPtrA, - const int *csrColIndA, - cusparseSolveAnalysisInfo_t info){ - return cusparseDcsric0(handle,trans,m,descrA,csrValM,csrRowPtrA,csrColIndA,info); -} - -//sparse triangular solve (CUSPARSE) -//analysis phase -inline cusparseStatus_t cusparseXcsrsm_analysis (cusparseHandle_t handle, cusparseOperation_t transa, int m, int nnz, const cusparseMatDescr_t descra, - const float *a, const int *ia, const int *ja, cusparseSolveAnalysisInfo_t info){ - return cusparseScsrsm_analysis(handle,transa,m,nnz,descra,a,ia,ja,info); -} -inline cusparseStatus_t cusparseXcsrsm_analysis (cusparseHandle_t handle, cusparseOperation_t transa, int m, int nnz, const cusparseMatDescr_t descra, - const double *a, const int *ia, const int *ja, cusparseSolveAnalysisInfo_t info){ - return cusparseDcsrsm_analysis(handle,transa,m,nnz,descra,a,ia,ja,info); -} -//solve phase -inline cusparseStatus_t cusparseXcsrsm_solve (cusparseHandle_t handle, cusparseOperation_t transa, int m, int k, float alpha, const cusparseMatDescr_t descra, - const float *a, const int *ia, const int *ja, cusparseSolveAnalysisInfo_t info, const float *x, int ldx, float *y, int ldy){ - return cusparseScsrsm_solve(handle,transa,m,k,&alpha,descra,a,ia,ja,info,x,ldx,y,ldy); -} -inline cusparseStatus_t cusparseXcsrsm_solve (cusparseHandle_t handle, cusparseOperation_t transa, int m, int k, double alpha, const cusparseMatDescr_t descra, - const double *a, const int *ia, const int *ja, cusparseSolveAnalysisInfo_t info, const double *x, int ldx, double *y, int ldy){ - return cusparseDcsrsm_solve(handle,transa,m,k,&alpha,descra,a,ia,ja,info,x,ldx,y,ldy); -} - - -inline cusparseStatus_t cusparseXcsrcolor(cusparseHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const float *csrValA, const int *csrRowPtrA, const int *csrColIndA, const float *fractionToColor, int *ncolors, int *coloring, int *reordering,cusparseColorInfo_t info) { - return cusparseScsrcolor(handle,m,nnz,descrA,csrValA,csrRowPtrA,csrColIndA,fractionToColor,ncolors,coloring,reordering,info); -} -inline cusparseStatus_t cusparseXcsrcolor(cusparseHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const double *csrValA, const int *csrRowPtrA, const int *csrColIndA, const double *fractionToColor, int *ncolors, int *coloring, int *reordering,cusparseColorInfo_t info) { - return cusparseDcsrcolor(handle,m,nnz,descrA,csrValA,csrRowPtrA,csrColIndA,fractionToColor,ncolors,coloring,reordering,info); -} - - -} - diff --git a/cpp/src/nvgraph/include/stacktrace.h b/cpp/src/nvgraph/include/stacktrace.h deleted file mode 100644 index b00824547e6..00000000000 --- a/cpp/src/nvgraph/include/stacktrace.h +++ /dev/null @@ -1,119 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
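Everything in the cuda_wrappers header deleted above is one idiom repeated: a precision-neutral name (cublasX*, cusparseX*, cusolverX*, curand*X) defined as one inline overload per element type, so templated solver code can call a single name and let C++ overload resolution pick the S/D library routine. Below is a minimal, self-contained editorial sketch of that idiom; the fake_saxpy/fake_daxpy backends are hypothetical stand-ins, not real cuBLAS entry points.

#include <cstdio>

// Stand-in "library" routines, one per precision, mimicking the
// cublasSaxpy/cublasDaxpy pair that the deleted header wraps.
static int fake_saxpy(int n, float a, const float* x, float* y) {
  for (int i = 0; i < n; ++i) y[i] += a * x[i];
  return 0;  // status code, as the real library would return
}
static int fake_daxpy(int n, double a, const double* x, double* y) {
  for (int i = 0; i < n; ++i) y[i] += a * x[i];
  return 0;
}

// The "X" shim: same name, one inline overload per precision.
inline int Xaxpy(int n, float a, const float* x, float* y) { return fake_saxpy(n, a, x, y); }
inline int Xaxpy(int n, double a, const double* x, double* y) { return fake_daxpy(n, a, x, y); }

// Templated caller never names the precision explicitly.
template <typename T>
int scale_add(int n, T a, const T* x, T* y) { return Xaxpy(n, a, x, y); }

int main() {
  float xf[3] = {1, 2, 3}, yf[3] = {0, 0, 0};
  double xd[3] = {1, 2, 3}, yd[3] = {0, 0, 0};
  scale_add(3, 2.0f, xf, yf);  // resolves to the float overload
  scale_add(3, 2.0, xd, yd);   // resolves to the double overload
  std::printf("%f %f\n", yf[0], yd[2]);  // 2.000000 6.000000
  return 0;
}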
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-// adapted from https://idlebox.net/2008/0901-stacktrace-demangled/ and licensed under WTFPL v2.0
-#pragma once
-
-#if defined(_WIN32) || defined(__ANDROID__) || defined(ANDROID) || defined(__QNX__) || \
-  defined(__QNXNTO__)
-#else
-#include
-#include
-#include
-#include
-#include
-#endif
-
-#include
-#include
-#include
-#include
-#include
-#include
-
-namespace nvgraph {
-
-/** Print a demangled stack backtrace of the caller function to FILE* out. */
-static inline void printStackTrace(std::ostream &eout = std::cerr, unsigned int max_frames = 63)
-{
-#if defined(_WIN32) || defined(__ANDROID__) || defined(ANDROID) || defined(__QNX__) || \
-  defined(__QNXNTO__)
-  // TODO add code for windows stack trace and android stack trace
-#else
-  std::stringstream out;
-
-  // storage array for stack trace address data
-  void *addrlist[max_frames + 1];
-
-  // retrieve current stack addresses
-  int addrlen = backtrace(addrlist, sizeof(addrlist) / sizeof(void *));
-  if (addrlen == 0) {
-    out << " <empty, possibly corrupt>\n";
-    return;
-  }
-
-  // resolve addresses into strings containing "filename(function+address)",
-  // this array must be free()-ed
-  std::unique_ptr<char *, decltype(&::free)> symbollist(backtrace_symbols(addrlist, addrlen),
-                                                        &::free);
-  // char** symbollist = backtrace_symbols(addrlist, addrlen);
-
-  // allocate string which will be filled with the demangled function name
-  size_t funcnamesize = 256;
-  std::vector<char> funcname_v(funcnamesize);
-  char *funcname = funcname_v.data();
-
-  // iterate over the returned symbol lines. skip the first, it is the
-  // address of this function.
-  for (int i = 1; i < addrlen; i++) {
-    char *begin_name = 0, *begin_offset = 0, *end_offset = 0;
-
-    // find parentheses and +address offset surrounding the mangled name:
-    // ./module(function+0x15c) [0x8048a6d]
-    for (char *p = symbollist.get()[i]; *p; ++p) {
-      if (*p == '(')
-        begin_name = p;
-      else if (*p == '+')
-        begin_offset = p;
-      else if (*p == ')' && begin_offset) {
-        end_offset = p;
-        break;
-      }
-    }
-
-    if (begin_name && begin_offset && end_offset && begin_name < begin_offset) {
-      *begin_name++ = '\0';
-      *begin_offset++ = '\0';
-      *end_offset = '\0';
-
-      // mangled name is now in [begin_name, begin_offset) and caller
-      // offset in [begin_offset, end_offset). now apply
-      // __cxa_demangle():
-
-      int status;
-      char *ret = abi::__cxa_demangle(begin_name, funcname, &funcnamesize, &status);
-      if (status == 0) {
-        funcname = ret;  // use possibly realloc()-ed string
-        out << " " << symbollist.get()[i] << " : " << funcname << "+" << begin_offset << "\n";
-      } else {
-        // demangling failed. Output function name as a C function with
-        // no arguments.
-        out << " " << symbollist.get()[i] << " : " << begin_name << "()+" << begin_offset << "\n";
-      }
-    } else {
-      // couldn't parse the line? print the whole line.
-      out << " " << symbollist.get()[i] << "\n";
-    }
-  }
-  eout << out.str();
-  // error_output(out.str().c_str(),out.str().size());
-  // free(symbollist);
-  // printf("PID of failing process: %d\n",getpid());
-  // while(1);
-#endif
-}
-
-} // end namespace nvgraph
diff --git a/cpp/src/nvgraph/include/util.cuh b/cpp/src/nvgraph/include/util.cuh
deleted file mode 100644
index ac6b3a898ba..00000000000
--- a/cpp/src/nvgraph/include/util.cuh
+++ /dev/null
@@ -1,162 +0,0 @@
-/*
- * Copyright (c) 2019, NVIDIA CORPORATION.
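A note on the stack-trace helper just removed: its core is abi::__cxa_demangle applied to each frame string returned by backtrace_symbols. The fragment below is an editorial, standalone illustration of only the demangling step (GCC/Itanium ABI; the exact mangled string produced by typeid is implementation-defined), not code from the deleted file.

#include <cxxabi.h>
#include <cstdio>
#include <cstdlib>
#include <typeinfo>
#include <vector>

int main() {
  // With GCC, typeid(...).name() yields an Itanium-ABI mangled name,
  // e.g. "St6vectorIiSaIiEE" for std::vector<int>.
  const char* mangled = typeid(std::vector<int>).name();
  int status = 0;
  // __cxa_demangle allocates the result with malloc() when the output
  // buffer argument is null; the caller must free() it.
  char* demangled = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  if (status == 0)
    std::printf("%s -> %s\n", mangled, demangled);
  else
    std::printf("demangling failed (status %d), keep raw name: %s\n", status, mangled);
  std::free(demangled);
  return 0;
}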
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-#pragma once
-#include
-#include
-#include
-#include
-#include
-#include
-
-namespace nvlouvain {
-
-#define BLOCK_SIZE_1D 64
-#define BLOCK_SIZE_2D 16
-#define CUDA_MAX_KERNEL_THREADS 256
-#define CUDA_MAX_BLOCKS_1D 65535
-#define CUDA_MAX_BLOCKS_2D 256
-#define LOCAL_MEM_MAX 512
-#define GRID_MAX_SIZE 65535
-#define WARP_SIZE 32
-
-#define CUDA_CALL(call)                                                               \
-  {                                                                                   \
-    cudaError_t cudaStatus = call;                                                    \
-    if (cudaSuccess != cudaStatus)                                                    \
-      fprintf(stderr,                                                                 \
-              "ERROR: CUDA call \"%s\" in line %d of file %s failed with %s (%d).\n", \
-              #call,                                                                  \
-              __LINE__,                                                               \
-              __FILE__,                                                               \
-              cudaGetErrorString(cudaStatus),                                         \
-              cudaStatus);                                                            \
-  }
-
-#define THRUST_SAFE_CALL(call)                                 \
-  {                                                            \
-    try {                                                      \
-      call;                                                    \
-    } catch (std::bad_alloc & e) {                             \
-      fprintf(stderr, "ERROR: THRUST call \"%s\".\n", #call);  \
-      exit(-1);                                                \
-    }                                                          \
-  }
-
-#define COLOR_GRN "\033[0;32m"
-#define COLOR_MGT "\033[0;35m"
-#define COLOR_WHT "\033[0;0m"
-
-inline std::string time_now()
-{
-  struct timespec ts;
-  timespec_get(&ts, TIME_UTC);
-  char buff[100];
-  strftime(buff, sizeof buff, "%T", gmtime(&ts.tv_sec));
-  std::string s = buff;
-  s += "." + std::to_string(ts.tv_nsec).substr(0, 6);
-
-  return s;
-}
-
-typedef enum {
-  NVLOUVAIN_OK = 0,
-  NVLOUVAIN_ERR_BAD_PARAMETERS = 1,
-} NVLOUVAIN_STATUS;
-
-using nvlouvainStatus_t = NVLOUVAIN_STATUS;
-
-const char* nvlouvainStatusGetString(nvlouvainStatus_t status)
-{
-  static std::string s;
-  switch (status) {
-    case 0: s = "NVLOUVAIN_OK"; break;
-    case 1: s = "NVLOUVAIN_ERR_BAD_PARAMETERS"; break;
-    default: break;
-  }
-  return s.c_str();
-}
-
-template <typename VecType>
-void display_vec(VecType vec, std::ostream& ouf = std::cout)
-{
-  auto it = vec.begin();
-  ouf << vec.front();
-  for (it = vec.begin() + 1; it != vec.end(); ++it) { ouf << ", " << (*it); }
-  ouf << "\n";
-}
-
-template <typename VecType>
-void display_intvec_size(VecType vec, unsigned size)
-{
-  printf("%d", (int)vec[0]);
-  for (unsigned i = 1; i < size; ++i) { printf(", %d", (int)vec[i]); }
-  printf("\n");
-}
-
-template <typename VecType>
-void display_vec_size(VecType vec, unsigned size)
-{
-  for (unsigned i = 0; i < size; ++i) { printf("%f ", vec[i]); }
-  printf("\n");
-}
-
-template <typename VecIter>
-__host__ __device__ void display_vec(VecIter vec, int size)
-{
-  for (unsigned i = 0; i < size; ++i) { printf("%f ", (*(vec + i))); }
-  printf("\n");
-}
-
-template <typename VecType>
-__host__ __device__ void display_vec_with_idx(VecType vec, int size, int offset = 0)
-{
-  for (unsigned i = 0; i < size; ++i) { printf("idx:%d %f\n", i + offset, (*(vec + i))); }
-  printf("\n");
-}
-
-template <typename VecType>
-void display_cluster(std::vector<VecType>& vec, std::ostream& ouf = std::cout)
-{
-  for (const auto& it : vec) {
-    for (unsigned idx = 0; idx < it.size(); ++idx) { ouf << idx << " " << it[idx] << std::endl; }
-  }
-}
-
-template <typename VecType>
-int folded_print_float(VecType s)
-{
-  return printf("%f\n", s);
-}
-
-template <typename VecType1, typename... VecType2>
-int folded_print_float(VecType1 s, VecType2...
vec) -{ - return printf("%f ", s) + folded_print_float(vec...); -} - -template -int folded_print_int(VecType s) -{ - return printf("%d\n", (int)s); -} - -template -int folded_print_int(VecType1 s, VecType2... vec) -{ - return printf("%d ", (int)s) + folded_print_int(vec...); -} - -} // namespace nvlouvain diff --git a/cpp/src/nvgraph/kmeans.cu b/cpp/src/nvgraph/kmeans.cu deleted file mode 100644 index 691df3e5ced..00000000000 --- a/cpp/src/nvgraph/kmeans.cu +++ /dev/null @@ -1,935 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -//#ifdef NVGRAPH_PARTITION -//#ifdef DEBUG - -#include "include/kmeans.hxx" - -#include -#include -#include - -#include -#include -#include -#include -#include -#include -#include -#include - -#include "include/atomics.hxx" -#include "include/debug_macros.h" -#include "include/nvgraph_cublas.hxx" -#include "include/nvgraph_vector.hxx" -#include "include/sm_utils.h" - -using namespace nvgraph; - -// ========================================================= -// Useful macros -// ========================================================= - -#define BLOCK_SIZE 1024 -#define WARP_SIZE 32 -#define BSIZE_DIV_WSIZE (BLOCK_SIZE / WARP_SIZE) - -// Get index of matrix entry -#define IDX(i, j, lda) ((i) + (j) * (lda)) - -namespace { - -// ========================================================= -// CUDA kernels -// ========================================================= - -/// Compute distances between observation vectors and centroids -/** Block dimensions should be (warpSize, 1, - * blockSize/warpSize). Ideally, the grid is large enough so there - * are d threads in the x-direction, k threads in the y-direction, - * and n threads in the z-direction. - * - * @param n Number of observation vectors. - * @param d Dimension of observation vectors. - * @param k Number of clusters. - * @param obs (Input, d*n entries) Observation matrix. Matrix is - * stored column-major and each column is an observation - * vector. Matrix dimensions are d x n. - * @param centroids (Input, d*k entries) Centroid matrix. Matrix is - * stored column-major and each column is a centroid. Matrix - * dimensions are d x k. - * @param dists (Output, n*k entries) Distance matrix. Matrix is - * stored column-major and the (i,j)-entry is the square of the - * Euclidean distance between the ith observation vector and jth - * centroid. Matrix dimensions are n x k. Entries must be - * initialized to zero. 
- */ -template -static __global__ void computeDistances(IndexType_ n, - IndexType_ d, - IndexType_ k, - const ValueType_* __restrict__ obs, - const ValueType_* __restrict__ centroids, - ValueType_* __restrict__ dists) -{ - // Loop index - IndexType_ i; - - // Block indices - IndexType_ bidx; - // Global indices - IndexType_ gidx, gidy, gidz; - - // Private memory - ValueType_ centroid_private, dist_private; - - // Global x-index indicates index of vector entry - bidx = blockIdx.x; - while (bidx * blockDim.x < d) { - gidx = threadIdx.x + bidx * blockDim.x; - - // Global y-index indicates centroid - gidy = threadIdx.y + blockIdx.y * blockDim.y; - while (gidy < k) { - // Load centroid coordinate from global memory - centroid_private = (gidx < d) ? centroids[IDX(gidx, gidy, d)] : 0; - - // Global z-index indicates observation vector - gidz = threadIdx.z + blockIdx.z * blockDim.z; - while (gidz < n) { - // Load observation vector coordinate from global memory - dist_private = (gidx < d) ? obs[IDX(gidx, gidz, d)] : 0; - - // Compute contribution of current entry to distance - dist_private = centroid_private - dist_private; - dist_private = dist_private * dist_private; - - // Perform reduction on warp - for (i = WARP_SIZE / 2; i > 0; i /= 2) - dist_private += utils::shfl_down(dist_private, i, 2 * i); - - // Write result to global memory - if (threadIdx.x == 0) atomicFPAdd(dists + IDX(gidz, gidy, n), dist_private); - - // Move to another observation vector - gidz += blockDim.z * gridDim.z; - } - - // Move to another centroid - gidy += blockDim.y * gridDim.y; - } - - // Move to another vector entry - bidx += gridDim.x; - } -} - -/// Find closest centroid to observation vectors -/** Block and grid dimensions should be 1-dimensional. Ideally the - * grid is large enough so there are n threads. - * - * @param n Number of observation vectors. - * @param k Number of clusters. - * @param centroids (Input, d*k entries) Centroid matrix. Matrix is - * stored column-major and each column is a centroid. Matrix - * dimensions are d x k. - * @param dists (Input/output, n*k entries) Distance matrix. Matrix - * is stored column-major and the (i,j)-entry is the square of - * the Euclidean distance between the ith observation vector and - * jth centroid. Matrix dimensions are n x k. On exit, the first - * n entries give the square of the Euclidean distance between - * observation vectors and closest centroids. - * @param codes (Output, n entries) Cluster assignments. - * @param clusterSizes (Output, k entries) Number of points in each - * cluster. Entries must be initialized to zero. - */ -template -static __global__ void minDistances(IndexType_ n, - IndexType_ k, - ValueType_* __restrict__ dists, - IndexType_* __restrict__ codes, - IndexType_* __restrict__ clusterSizes) -{ - // Loop index - IndexType_ i, j; - - // Current matrix entry - ValueType_ dist_curr; - - // Smallest entry in row - ValueType_ dist_min; - IndexType_ code_min; - - // Each row in observation matrix is processed by a thread - i = threadIdx.x + blockIdx.x * blockDim.x; - while (i < n) { - // Find minimum entry in row - code_min = 0; - dist_min = dists[IDX(i, 0, n)]; - for (j = 1; j < k; ++j) { - dist_curr = dists[IDX(i, j, n)]; - code_min = (dist_curr < dist_min) ? j : code_min; - dist_min = (dist_curr < dist_min) ? 
dist_curr : dist_min; - } - - // Transfer result to global memory - dists[i] = dist_min; - codes[i] = code_min; - - // Increment cluster sizes - atomicAdd(clusterSizes + code_min, 1); - - // Move to another row - i += blockDim.x * gridDim.x; - } -} - -/// Check if newly computed distances are smaller than old distances -/** Block and grid dimensions should be 1-dimensional. Ideally the - * grid is large enough so there are n threads. - * - * @param n Number of observation vectors. - * @param dists_old (Input/output, n entries) Distances between - * observation vectors and closest centroids. On exit, entries - * are replaced by entries in 'dists_new' if the corresponding - * observation vectors are closest to the new centroid. - * @param dists_new (Input, n entries) Distance between observation - * vectors and new centroid. - * @param codes_old (Input/output, n entries) Cluster - * assignments. On exit, entries are replaced with 'code_new' if - * the corresponding observation vectors are closest to the new - * centroid. - * @param code_new Index associated with new centroid. - */ -template -static __global__ void minDistances2(IndexType_ n, - ValueType_* __restrict__ dists_old, - const ValueType_* __restrict__ dists_new, - IndexType_* __restrict__ codes_old, - IndexType_ code_new) -{ - // Loop index - IndexType_ i; - - // Distances - ValueType_ dist_old_private; - ValueType_ dist_new_private; - - // Each row is processed by a thread - i = threadIdx.x + blockIdx.x * blockDim.x; - while (i < n) { - // Get old and new distances - dist_old_private = dists_old[i]; - dist_new_private = dists_new[i]; - - // Update if new distance is smaller than old distance - if (dist_new_private < dist_old_private) { - dists_old[i] = dist_new_private; - codes_old[i] = code_new; - } - - // Move to another row - i += blockDim.x * gridDim.x; - } -} - -/// Compute size of k-means clusters -/** Block and grid dimensions should be 1-dimensional. Ideally the - * grid is large enough so there are n threads. - * - * @param n Number of observation vectors. - * @param k Number of clusters. - * @param codes (Input, n entries) Cluster assignments. - * @param clusterSizes (Output, k entries) Number of points in each - * cluster. Entries must be initialized to zero. - */ -template -static __global__ void computeClusterSizes(IndexType_ n, - IndexType_ k, - const IndexType_* __restrict__ codes, - IndexType_* __restrict__ clusterSizes) -{ - IndexType_ i = threadIdx.x + blockIdx.x * blockDim.x; - while (i < n) { - atomicAdd(clusterSizes + codes[i], 1); - i += blockDim.x * gridDim.x; - } -} - -/// Divide rows of centroid matrix by cluster sizes -/** Divides the ith column of the sum matrix by the size of the ith - * cluster. If the sum matrix has been initialized so that the ith - * row is the sum of all observation vectors in the ith cluster, - * this kernel produces cluster centroids. The grid and block - * dimensions should be 2-dimensional. Ideally the grid is large - * enough so there are d threads in the x-direction and k threads - * in the y-direction. - * - * @param d Dimension of observation vectors. - * @param k Number of clusters. - * @param clusterSizes (Input, k entries) Number of points in each - * cluster. - * @param centroids (Input/output, d*k entries) Sum matrix. Matrix - * is stored column-major and matrix dimensions are d x k. The - * ith column is the sum of all observation vectors in the ith - * cluster. On exit, the matrix is the centroid matrix (each - * column is the mean position of a cluster). 
- */ -template -static __global__ void divideCentroids(IndexType_ d, - IndexType_ k, - const IndexType_* __restrict__ clusterSizes, - ValueType_* __restrict__ centroids) -{ - // Global indices - IndexType_ gidx, gidy; - - // Current cluster size - IndexType_ clusterSize_private; - - // Observation vector is determined by global y-index - gidy = threadIdx.y + blockIdx.y * blockDim.y; - while (gidy < k) { - // Get cluster size from global memory - clusterSize_private = clusterSizes[gidy]; - - // Add vector entries to centroid matrix - // Vector entris are determined by global x-index - gidx = threadIdx.x + blockIdx.x * blockDim.x; - while (gidx < d) { - centroids[IDX(gidx, gidy, d)] /= clusterSize_private; - gidx += blockDim.x * gridDim.x; - } - - // Move to another centroid - gidy += blockDim.y * gridDim.y; - } -} - -// ========================================================= -// Helper functions -// ========================================================= - -/// Randomly choose new centroids -/** Centroid is randomly chosen with k-means++ algorithm. - * - * @param n Number of observation vectors. - * @param d Dimension of observation vectors. - * @param k Number of clusters. - * @param rand Random number drawn uniformly from [0,1). - * @param obs (Input, device memory, d*n entries) Observation - * matrix. Matrix is stored column-major and each column is an - * observation vector. Matrix dimensions are n x d. - * @param dists (Input, device memory, 2*n entries) Workspace. The - * first n entries should be the distance between observation - * vectors and the closest centroid. - * @param centroid (Output, device memory, d entries) Centroid - * coordinates. - * @return Zero if successful. Otherwise non-zero. - */ -template -static int chooseNewCentroid(IndexType_ n, - IndexType_ d, - IndexType_ k, - ValueType_ rand, - const ValueType_* __restrict__ obs, - ValueType_* __restrict__ dists, - ValueType_* __restrict__ centroid) -{ - using namespace thrust; - - // Cumulative sum of distances - ValueType_* distsCumSum = dists + n; - // Residual sum of squares - ValueType_ distsSum; - // Observation vector that is chosen as new centroid - IndexType_ obsIndex; - - // Compute cumulative sum of distances - inclusive_scan( - device_pointer_cast(dists), device_pointer_cast(dists + n), device_pointer_cast(distsCumSum)); - cudaCheckError(); - CHECK_CUDA( - cudaMemcpy(&distsSum, distsCumSum + n - 1, sizeof(ValueType_), cudaMemcpyDeviceToHost)); - - // Randomly choose observation vector - // Probabilities are proportional to square of distance to closest - // centroid (see k-means++ algorithm) - obsIndex = - (lower_bound( - device_pointer_cast(distsCumSum), device_pointer_cast(distsCumSum + n), distsSum * rand) - - device_pointer_cast(distsCumSum)); - cudaCheckError(); - obsIndex = max(obsIndex, 0); - obsIndex = min(obsIndex, n - 1); - - // Record new centroid position - CHECK_CUDA(cudaMemcpyAsync( - centroid, obs + IDX(0, obsIndex, d), d * sizeof(ValueType_), cudaMemcpyDeviceToDevice)); - - return 0; -} - -/// Choose initial cluster centroids for k-means algorithm -/** Centroids are randomly chosen with k-means++ algorithm - * - * @param n Number of observation vectors. - * @param d Dimension of observation vectors. - * @param k Number of clusters. - * @param obs (Input, device memory, d*n entries) Observation - * matrix. Matrix is stored column-major and each column is an - * observation vector. Matrix dimensions are d x n. - * @param centroids (Output, device memory, d*k entries) Centroid - * matrix. 
Matrix is stored column-major and each column is a - * centroid. Matrix dimensions are d x k. - * @param codes (Output, device memory, n entries) Cluster - * assignments. - * @param clusterSizes (Output, device memory, k entries) Number of - * points in each cluster. - * @param dists (Output, device memory, 2*n entries) Workspace. On - * exit, the first n entries give the square of the Euclidean - * distance between observation vectors and the closest centroid. - * @return Zero if successful. Otherwise non-zero. - */ -template -static int initializeCentroids(IndexType_ n, - IndexType_ d, - IndexType_ k, - const ValueType_* __restrict__ obs, - ValueType_* __restrict__ centroids, - IndexType_* __restrict__ codes, - IndexType_* __restrict__ clusterSizes, - ValueType_* __restrict__ dists) -{ - // ------------------------------------------------------- - // Variable declarations - // ------------------------------------------------------- - - // Loop index - IndexType_ i; - - // CUDA grid dimensions - dim3 blockDim_warp, gridDim_warp, gridDim_block; - - // Random number generator - thrust::default_random_engine rng(123456); - thrust::uniform_real_distribution uniformDist(0, 1); - - // ------------------------------------------------------- - // Implementation - // ------------------------------------------------------- - - // Initialize grid dimensions - blockDim_warp.x = WARP_SIZE; - blockDim_warp.y = 1; - blockDim_warp.z = BSIZE_DIV_WSIZE; - gridDim_warp.x = min((d + WARP_SIZE - 1) / WARP_SIZE, 65535); - gridDim_warp.y = 1; - gridDim_warp.z = min((n + BSIZE_DIV_WSIZE - 1) / BSIZE_DIV_WSIZE, 65535); - gridDim_block.x = min((n + BLOCK_SIZE - 1) / BLOCK_SIZE, 65535); - gridDim_block.y = 1; - gridDim_block.z = 1; - - // Assign observation vectors to code 0 - CHECK_CUDA(cudaMemsetAsync(codes, 0, n * sizeof(IndexType_))); - - // Choose first centroid - thrust::fill(thrust::device_pointer_cast(dists), thrust::device_pointer_cast(dists + n), 1); - cudaCheckError(); - if (chooseNewCentroid(n, d, k, uniformDist(rng), obs, dists, centroids)) - WARNING("error in k-means++ (could not pick centroid)"); - - // Compute distances from first centroid - CHECK_CUDA(cudaMemsetAsync(dists, 0, n * sizeof(ValueType_))); - computeDistances<<>>(n, d, 1, obs, centroids, dists); - cudaCheckError() - - // Choose remaining centroids - for (i = 1; i < k; ++i) - { - // Choose ith centroid - if (chooseNewCentroid(n, d, k, uniformDist(rng), obs, dists, centroids + IDX(0, i, d))) - WARNING("error in k-means++ (could not pick centroid)"); - - // Compute distances from ith centroid - CHECK_CUDA(cudaMemsetAsync(dists + n, 0, n * sizeof(ValueType_))); - computeDistances<<>>( - n, d, 1, obs, centroids + IDX(0, i, d), dists + n); - cudaCheckError(); - - // Recompute minimum distances - minDistances2<<>>(n, dists, dists + n, codes, i); - cudaCheckError(); - } - - // Compute cluster sizes - CHECK_CUDA(cudaMemsetAsync(clusterSizes, 0, k * sizeof(IndexType_))); - computeClusterSizes<<>>(n, k, codes, clusterSizes); - cudaCheckError(); - - return 0; -} - -/// Find cluster centroids closest to observation vectors -/** Distance is measured with Euclidean norm. - * - * @param n Number of observation vectors. - * @param d Dimension of observation vectors. - * @param k Number of clusters. - * @param obs (Input, device memory, d*n entries) Observation - * matrix. Matrix is stored column-major and each column is an - * observation vector. Matrix dimensions are d x n. - * @param centroids (Input, device memory, d*k entries) Centroid - * matrix. 
Matrix is stored column-major and each column is a - * centroid. Matrix dimensions are d x k. - * @param dists (Output, device memory, n*k entries) Workspace. On - * exit, the first n entries give the square of the Euclidean - * distance between observation vectors and the closest centroid. - * @param codes (Output, device memory, n entries) Cluster - * assignments. - * @param clusterSizes (Output, device memory, k entries) Number of - * points in each cluster. - * @param residual_host (Output, host memory, 1 entry) Residual sum - * of squares of assignment. - * @return Zero if successful. Otherwise non-zero. - */ -template -static int assignCentroids(IndexType_ n, - IndexType_ d, - IndexType_ k, - const ValueType_* __restrict__ obs, - const ValueType_* __restrict__ centroids, - ValueType_* __restrict__ dists, - IndexType_* __restrict__ codes, - IndexType_* __restrict__ clusterSizes, - ValueType_* residual_host) -{ - // CUDA grid dimensions - dim3 blockDim, gridDim; - - // Compute distance between centroids and observation vectors - CHECK_CUDA(cudaMemsetAsync(dists, 0, n * k * sizeof(ValueType_))); - blockDim.x = WARP_SIZE; - blockDim.y = 1; - blockDim.z = BLOCK_SIZE / WARP_SIZE; - gridDim.x = min((d + WARP_SIZE - 1) / WARP_SIZE, 65535); - gridDim.y = min(k, 65535); - gridDim.z = min((n + BSIZE_DIV_WSIZE - 1) / BSIZE_DIV_WSIZE, 65535); - computeDistances<<>>(n, d, k, obs, centroids, dists); - cudaCheckError(); - - // Find centroid closest to each observation vector - CHECK_CUDA(cudaMemsetAsync(clusterSizes, 0, k * sizeof(IndexType_))); - blockDim.x = BLOCK_SIZE; - blockDim.y = 1; - blockDim.z = 1; - gridDim.x = min((n + BLOCK_SIZE - 1) / BLOCK_SIZE, 65535); - gridDim.y = 1; - gridDim.z = 1; - minDistances<<>>(n, k, dists, codes, clusterSizes); - cudaCheckError(); - - // Compute residual sum of squares - *residual_host = - thrust::reduce(thrust::device_pointer_cast(dists), thrust::device_pointer_cast(dists + n)); - - return 0; -} - -/// Update cluster centroids for k-means algorithm -/** All clusters are assumed to be non-empty. - * - * @param n Number of observation vectors. - * @param d Dimension of observation vectors. - * @param k Number of clusters. - * @param obs (Input, device memory, d*n entries) Observation - * matrix. Matrix is stored column-major and each column is an - * observation vector. Matrix dimensions are d x n. - * @param codes (Input, device memory, n entries) Cluster - * assignments. - * @param clusterSizes (Input, device memory, k entries) Number of - * points in each cluster. - * @param centroids (Output, device memory, d*k entries) Centroid - * matrix. Matrix is stored column-major and each column is a - * centroid. Matrix dimensions are d x k. - * @param work (Output, device memory, n*d entries) Workspace. - * @param work_int (Output, device memory, 2*d*n entries) - * Workspace. - * @return Zero if successful. Otherwise non-zero. 
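Since the device implementation below reaches for a transpose plus stable_sort_by_key/reduce_by_key pipeline, a plain host reference for the same contract may help; this is an editorial sketch under the layout assumptions documented above, not code from the deleted file.

#include <algorithm>
#include <cstdio>
#include <vector>

// Host reference for updateCentroids: accumulate the observation vectors of
// each cluster into its centroid column, then divide by the cluster size.
// Layouts are column-major; clusters are assumed non-empty, as above.
void update_centroids_ref(int n, int d, int k,
                          const std::vector<float>& obs,    // d*n observations
                          const std::vector<int>& codes,    // n cluster labels
                          const std::vector<int>& sizes,    // k cluster sizes
                          std::vector<float>& centroids) {  // d*k output
  std::fill(centroids.begin(), centroids.end(), 0.0f);
  for (int i = 0; i < n; ++i)
    for (int e = 0; e < d; ++e)
      centroids[e + codes[i] * d] += obs[e + i * d];
  for (int j = 0; j < k; ++j)
    for (int e = 0; e < d; ++e)
      centroids[e + j * d] /= static_cast<float>(sizes[j]);  // mean position
}

int main() {
  // Three 1-D observations in two clusters: {1, 3} -> mean 2, {10} -> mean 10.
  std::vector<float> obs = {1, 3, 10}, centroids(2, 0.0f);
  std::vector<int> codes = {0, 0, 1}, sizes = {2, 1};
  update_centroids_ref(3, 1, 2, obs, codes, sizes, centroids);
  std::printf("%f %f\n", centroids[0], centroids[1]);  // 2.000000 10.000000
  return 0;
}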
- */ -template -static int updateCentroids(IndexType_ n, - IndexType_ d, - IndexType_ k, - const ValueType_* __restrict__ obs, - const IndexType_* __restrict__ codes, - const IndexType_* __restrict__ clusterSizes, - ValueType_* __restrict__ centroids, - ValueType_* __restrict__ work, - IndexType_* __restrict__ work_int) -{ - using namespace thrust; - - // ------------------------------------------------------- - // Variable declarations - // ------------------------------------------------------- - - // Useful constants - const ValueType_ one = 1; - const ValueType_ zero = 0; - - // CUDA grid dimensions - dim3 blockDim, gridDim; - - // Device memory - device_ptr obs_copy(work); - device_ptr codes_copy(work_int); - device_ptr rows(work_int + d * n); - - // Take transpose of observation matrix - Cublas::geam( - true, false, n, d, &one, obs, d, &zero, (ValueType_*)NULL, n, raw_pointer_cast(obs_copy), n); - - // Cluster assigned to each observation matrix entry - sequence(rows, rows + d * n); - cudaCheckError(); - transform(rows, rows + d * n, make_constant_iterator(n), rows, modulus()); - cudaCheckError(); - gather(rows, rows + d * n, device_pointer_cast(codes), codes_copy); - cudaCheckError(); - - // Row associated with each observation matrix entry - sequence(rows, rows + d * n); - cudaCheckError(); - transform(rows, rows + d * n, make_constant_iterator(n), rows, divides()); - cudaCheckError(); - - // Sort and reduce to add observation vectors in same cluster - stable_sort_by_key(codes_copy, codes_copy + d * n, make_zip_iterator(make_tuple(obs_copy, rows))); - cudaCheckError(); - reduce_by_key(rows, - rows + d * n, - obs_copy, - codes_copy, // Output to codes_copy is ignored - device_pointer_cast(centroids)); - cudaCheckError(); - - // Divide sums by cluster size to get centroid matrix - blockDim.x = WARP_SIZE; - blockDim.y = BLOCK_SIZE / WARP_SIZE; - blockDim.z = 1; - gridDim.x = min((d + WARP_SIZE - 1) / WARP_SIZE, 65535); - gridDim.y = min((k + BSIZE_DIV_WSIZE - 1) / BSIZE_DIV_WSIZE, 65535); - gridDim.z = 1; - divideCentroids<<>>(d, k, clusterSizes, centroids); - cudaCheckError(); - - return 0; -} - -} // namespace - -namespace nvgraph { - -// ========================================================= -// k-means algorithm -// ========================================================= - -/// Find clusters with k-means algorithm -/** Initial centroids are chosen with k-means++ algorithm. Empty - * clusters are reinitialized by choosing new centroids with - * k-means++ algorithm. - * - * @param n Number of observation vectors. - * @param d Dimension of observation vectors. - * @param k Number of clusters. - * @param tol Tolerance for convergence. k-means stops when the - * change in residual divided by n is less than tol. - * @param maxiter Maximum number of k-means iterations. - * @param obs (Input, device memory, d*n entries) Observation - * matrix. Matrix is stored column-major and each column is an - * observation vector. Matrix dimensions are d x n. - * @param codes (Output, device memory, n entries) Cluster - * assignments. - * @param clusterSizes (Output, device memory, k entries) Number of - * points in each cluster. - * @param centroids (Output, device memory, d*k entries) Centroid - * matrix. Matrix is stored column-major and each column is a - * centroid. Matrix dimensions are d x k. - * @param work (Output, device memory, n*max(k,d) entries) - * Workspace. - * @param work_int (Output, device memory, 2*d*n entries) - * Workspace. 
- * @param residual_host (Output, host memory, 1 entry) Residual sum - * of squares (sum of squares of distances between observation - * vectors and centroids). - * @param iters_host (Output, host memory, 1 entry) Number of - * k-means iterations. - * @return NVGRAPH error flag. - */ -template -NVGRAPH_ERROR kmeans(IndexType_ n, - IndexType_ d, - IndexType_ k, - ValueType_ tol, - IndexType_ maxiter, - const ValueType_* __restrict__ obs, - IndexType_* __restrict__ codes, - IndexType_* __restrict__ clusterSizes, - ValueType_* __restrict__ centroids, - ValueType_* __restrict__ work, - IndexType_* __restrict__ work_int, - ValueType_* residual_host, - IndexType_* iters_host) -{ - // ------------------------------------------------------- - // Variable declarations - // ------------------------------------------------------- - - // Current iteration - IndexType_ iter; - - // Residual sum of squares at previous iteration - ValueType_ residualPrev = 0; - - // Random number generator - thrust::default_random_engine rng(123456); - thrust::uniform_real_distribution uniformDist(0, 1); - - // ------------------------------------------------------- - // Initialization - // ------------------------------------------------------- - - // Check that parameters are valid - if (n < 1) { - WARNING("invalid parameter (n<1)"); - return NVGRAPH_ERR_BAD_PARAMETERS; - } - if (d < 1) { - WARNING("invalid parameter (d<1)"); - return NVGRAPH_ERR_BAD_PARAMETERS; - } - if (k < 1) { - WARNING("invalid parameter (k<1)"); - return NVGRAPH_ERR_BAD_PARAMETERS; - } - if (tol < 0) { - WARNING("invalid parameter (tol<0)"); - return NVGRAPH_ERR_BAD_PARAMETERS; - } - if (maxiter < 0) { - WARNING("invalid parameter (maxiter<0)"); - return NVGRAPH_ERR_BAD_PARAMETERS; - } - - // Trivial cases - if (k == 1) { - CHECK_CUDA(cudaMemsetAsync(codes, 0, n * sizeof(IndexType_))); - CHECK_CUDA(cudaMemcpyAsync(clusterSizes, &n, sizeof(IndexType_), cudaMemcpyHostToDevice)); - if (updateCentroids(n, d, k, obs, codes, clusterSizes, centroids, work, work_int)) - WARNING("could not compute k-means centroids"); - dim3 blockDim, gridDim; - blockDim.x = WARP_SIZE; - blockDim.y = 1; - blockDim.z = BLOCK_SIZE / WARP_SIZE; - gridDim.x = min((d + WARP_SIZE - 1) / WARP_SIZE, 65535); - gridDim.y = 1; - gridDim.z = min((n + BLOCK_SIZE / WARP_SIZE - 1) / (BLOCK_SIZE / WARP_SIZE), 65535); - CHECK_CUDA(cudaMemsetAsync(work, 0, n * k * sizeof(ValueType_))); - computeDistances<<>>(n, d, 1, obs, centroids, work); - cudaCheckError(); - *residual_host = - thrust::reduce(thrust::device_pointer_cast(work), thrust::device_pointer_cast(work + n)); - cudaCheckError(); - return NVGRAPH_OK; - } - if (n <= k) { - thrust::sequence(thrust::device_pointer_cast(codes), thrust::device_pointer_cast(codes + n)); - cudaCheckError(); - thrust::fill_n(thrust::device_pointer_cast(clusterSizes), n, 1); - cudaCheckError(); - - if (n < k) CHECK_CUDA(cudaMemsetAsync(clusterSizes + n, 0, (k - n) * sizeof(IndexType_))); - CHECK_CUDA( - cudaMemcpyAsync(centroids, obs, d * n * sizeof(ValueType_), cudaMemcpyDeviceToDevice)); - *residual_host = 0; - return NVGRAPH_OK; - } - - // Initialize cuBLAS - Cublas::set_pointer_mode_host(); - - // ------------------------------------------------------- - // k-means++ algorithm - // ------------------------------------------------------- - - // Choose initial cluster centroids - if (initializeCentroids(n, d, k, obs, centroids, codes, clusterSizes, work)) - WARNING("could not initialize k-means centroids"); - - // Apply k-means iteration until convergence 
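// Editorial summary of the loop that follows (these comment lines are not
// part of the deleted file): each iteration recomputes centroids as cluster
// means (updateCentroids), reassigns every observation to its nearest
// centroid and accumulates the residual sum of squares (assignCentroids),
// redraws any empty cluster with a fresh k-means++ sample
// (chooseNewCentroid), and stops early once the per-observation residual
// change, fabs(residualPrev - *residual_host) / n, drops below tol.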
- for (iter = 0; iter < maxiter; ++iter) { - // Update cluster centroids - if (updateCentroids(n, d, k, obs, codes, clusterSizes, centroids, work, work_int)) - WARNING("could not update k-means centroids"); - - // Determine centroid closest to each observation - residualPrev = *residual_host; - if (assignCentroids(n, d, k, obs, centroids, work, codes, clusterSizes, residual_host)) - WARNING("could not assign observation vectors to k-means clusters"); - - // Reinitialize empty clusters with new centroids - IndexType_ emptyCentroid = (thrust::find(thrust::device_pointer_cast(clusterSizes), - thrust::device_pointer_cast(clusterSizes + k), - 0) - - thrust::device_pointer_cast(clusterSizes)); - - // FIXME: emptyCentroid never reaches k (infinite loop) under certain - // conditions, such as if obs is corrupt (as seen as a result of a - // DataFrame column of NULL edge vals used to create the Graph) - while (emptyCentroid < k) { - if (chooseNewCentroid( - n, d, k, uniformDist(rng), obs, work, centroids + IDX(0, emptyCentroid, d))) - WARNING("could not replace empty centroid"); - if (assignCentroids(n, d, k, obs, centroids, work, codes, clusterSizes, residual_host)) - WARNING("could not assign observation vectors to k-means clusters"); - emptyCentroid = (thrust::find(thrust::device_pointer_cast(clusterSizes), - thrust::device_pointer_cast(clusterSizes + k), - 0) - - thrust::device_pointer_cast(clusterSizes)); - cudaCheckError(); - } - - // Check for convergence - if (fabs(residualPrev - (*residual_host)) / n < tol) { - ++iter; - break; - } - } - - // Warning if k-means has failed to converge - if (fabs(residualPrev - (*residual_host)) / n >= tol) WARNING("k-means failed to converge"); - - *iters_host = iter; - return NVGRAPH_OK; -} - -/// Find clusters with k-means algorithm -/** Initial centroids are chosen with k-means++ algorithm. Empty - * clusters are reinitialized by choosing new centroids with - * k-means++ algorithm. - * - * CNMEM must be initialized before calling this function. - * - * @param n Number of observation vectors. - * @param d Dimension of observation vectors. - * @param k Number of clusters. - * @param tol Tolerance for convergence. k-means stops when the - * change in residual divided by n is less than tol. - * @param maxiter Maximum number of k-means iterations. - * @param obs (Input, device memory, d*n entries) Observation - * matrix. Matrix is stored column-major and each column is an - * observation vector. Matrix dimensions are d x n. - * @param codes (Output, device memory, n entries) Cluster - * assignments. - * @param residual On exit, residual sum of squares (sum of squares - * of distances between observation vectors and centroids). - * @param On exit, number of k-means iterations. 
- * @return NVGRAPH error flag - */ -template -NVGRAPH_ERROR kmeans(IndexType_ n, - IndexType_ d, - IndexType_ k, - ValueType_ tol, - IndexType_ maxiter, - const ValueType_* __restrict__ obs, - IndexType_* __restrict__ codes, - ValueType_& residual, - IndexType_& iters) -{ - // Check that parameters are valid - if (n < 1) { - WARNING("invalid parameter (n<1)"); - return NVGRAPH_ERR_BAD_PARAMETERS; - } - if (d < 1) { - WARNING("invalid parameter (d<1)"); - return NVGRAPH_ERR_BAD_PARAMETERS; - } - if (k < 1) { - WARNING("invalid parameter (k<1)"); - return NVGRAPH_ERR_BAD_PARAMETERS; - } - if (tol < 0) { - WARNING("invalid parameter (tol<0)"); - return NVGRAPH_ERR_BAD_PARAMETERS; - } - if (maxiter < 0) { - WARNING("invalid parameter (maxiter<0)"); - return NVGRAPH_ERR_BAD_PARAMETERS; - } - - // Allocate memory - // TODO: handle non-zero CUDA streams - cudaStream_t stream = 0; - Vector clusterSizes(k, stream); - Vector centroids(d * k, stream); - Vector work(n * max(k, d), stream); - Vector work_int(2 * d * n, stream); - - // Perform k-means - return kmeans(n, - d, - k, - tol, - maxiter, - obs, - codes, - clusterSizes.raw(), - centroids.raw(), - work.raw(), - work_int.raw(), - &residual, - &iters); -} - -// ========================================================= -// Explicit instantiations -// ========================================================= - -template NVGRAPH_ERROR kmeans(int n, - int d, - int k, - float tol, - int maxiter, - const float* __restrict__ obs, - int* __restrict__ codes, - float& residual, - int& iters); -template NVGRAPH_ERROR kmeans(int n, - int d, - int k, - double tol, - int maxiter, - const double* __restrict__ obs, - int* __restrict__ codes, - double& residual, - int& iters); -} // namespace nvgraph -//#endif //NVGRAPH_PARTITION -//#endif //debug diff --git a/cpp/src/nvgraph/lanczos.cu b/cpp/src/nvgraph/lanczos.cu deleted file mode 100644 index ad49be1c059..00000000000 --- a/cpp/src/nvgraph/lanczos.cu +++ /dev/null @@ -1,1487 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -//#ifdef NVGRAPH_PARTITION - -#define _USE_MATH_DEFINES -#include -#include "include/lanczos.hxx" - -#include -#include -#include - -#include -#include - -#include "include/debug_macros.h" -#include "include/nvgraph_cublas.hxx" -#include "include/nvgraph_error.hxx" -#include "include/nvgraph_lapack.hxx" -#include "include/nvgraph_vector.hxx" -#include "include/nvgraph_vector_kernels.hxx" -// ========================================================= -// Useful macros -// ========================================================= - -// Get index of matrix entry -#define IDX(i, j, lda) ((i) + (j) * (lda)) - -namespace nvgraph { - -namespace { - -// ========================================================= -// Helper functions -// ========================================================= - -/// Perform Lanczos iteration -/** Lanczos iteration is performed on a shifted matrix A+shift*I. - * - * @param A Matrix. 
- * @param iter Pointer to current Lanczos iteration. On exit, the - * variable is set equal to the final Lanczos iteration. - * @param maxIter Maximum Lanczos iteration. This function will - * perform a maximum of maxIter-*iter iterations. - * @param shift Matrix shift. - * @param tol Convergence tolerance. Lanczos iteration will - * terminate when the residual norm (i.e. entry in beta_host) is - * less than tol. - * @param reorthogonalize Whether to reorthogonalize Lanczos - * vectors. - * @param alpha_host (Output, host memory, maxIter entries) - * Diagonal entries of Lanczos system. - * @param beta_host (Output, host memory, maxIter entries) - * Off-diagonal entries of Lanczos system. - * @param lanczosVecs_dev (Input/output, device memory, - * n*(maxIter+1) entries) Lanczos vectors. Vectors are stored as - * columns of a column-major matrix with dimensions - * n x (maxIter+1). - * @param work_dev (Output, device memory, maxIter entries) - * Workspace. Not needed if full reorthogonalization is disabled. - * @return Zero if successful. Otherwise non-zero. - */ -template -static int performLanczosIteration(const Matrix *A, - IndexType_ *iter, - IndexType_ maxIter, - ValueType_ shift, - ValueType_ tol, - bool reorthogonalize, - ValueType_ *__restrict__ alpha_host, - ValueType_ *__restrict__ beta_host, - ValueType_ *__restrict__ lanczosVecs_dev, - ValueType_ *__restrict__ work_dev) -{ - // ------------------------------------------------------- - // Variable declaration - // ------------------------------------------------------- - - // Useful variables - const ValueType_ one = 1; - const ValueType_ negOne = -1; - const ValueType_ zero = 0; - - IndexType_ n = A->n; - - // ------------------------------------------------------- - // Compute second Lanczos vector - // ------------------------------------------------------- - if (*iter <= 0) { - *iter = 1; - - // Apply matrix - if (shift != 0) - CHECK_CUDA(cudaMemcpyAsync( - lanczosVecs_dev + n, lanczosVecs_dev, n * sizeof(ValueType_), cudaMemcpyDeviceToDevice)); - A->mv(1, lanczosVecs_dev, shift, lanczosVecs_dev + n); - - // Orthogonalize Lanczos vector - Cublas::dot(n, lanczosVecs_dev, 1, lanczosVecs_dev + IDX(0, 1, n), 1, alpha_host); - Cublas::axpy(n, -alpha_host[0], lanczosVecs_dev, 1, lanczosVecs_dev + IDX(0, 1, n), 1); - beta_host[0] = Cublas::nrm2(n, lanczosVecs_dev + IDX(0, 1, n), 1); - - // Check if Lanczos has converged - if (beta_host[0] <= tol) return 0; - - // Normalize Lanczos vector - Cublas::scal(n, 1 / beta_host[0], lanczosVecs_dev + IDX(0, 1, n), 1); - } - - // ------------------------------------------------------- - // Compute remaining Lanczos vectors - // ------------------------------------------------------- - - while (*iter < maxIter) { - ++(*iter); - - // Apply matrix - if (shift != 0) - CHECK_CUDA(cudaMemcpyAsync(lanczosVecs_dev + (*iter) * n, - lanczosVecs_dev + (*iter - 1) * n, - n * sizeof(ValueType_), - cudaMemcpyDeviceToDevice)); - A->mv(1, lanczosVecs_dev + IDX(0, *iter - 1, n), shift, lanczosVecs_dev + IDX(0, *iter, n)); - - // Full reorthogonalization - // "Twice is enough" algorithm per Kahan and Parlett - if (reorthogonalize) { - Cublas::gemv(true, - n, - *iter, - &one, - lanczosVecs_dev, - n, - lanczosVecs_dev + IDX(0, *iter, n), - 1, - &zero, - work_dev, - 1); - Cublas::gemv(false, - n, - *iter, - &negOne, - lanczosVecs_dev, - n, - work_dev, - 1, - &one, - lanczosVecs_dev + IDX(0, *iter, n), - 1); - CHECK_CUDA(cudaMemcpyAsync(alpha_host + (*iter - 1), - work_dev + (*iter - 1), - sizeof(ValueType_), - 
cudaMemcpyDeviceToHost)); - Cublas::gemv(true, - n, - *iter, - &one, - lanczosVecs_dev, - n, - lanczosVecs_dev + IDX(0, *iter, n), - 1, - &zero, - work_dev, - 1); - Cublas::gemv(false, - n, - *iter, - &negOne, - lanczosVecs_dev, - n, - work_dev, - 1, - &one, - lanczosVecs_dev + IDX(0, *iter, n), - 1); - } - - // Orthogonalization with 3-term recurrence relation - else { - Cublas::dot(n, - lanczosVecs_dev + IDX(0, *iter - 1, n), - 1, - lanczosVecs_dev + IDX(0, *iter, n), - 1, - alpha_host + (*iter - 1)); - Cublas::axpy(n, - -alpha_host[*iter - 1], - lanczosVecs_dev + IDX(0, *iter - 1, n), - 1, - lanczosVecs_dev + IDX(0, *iter, n), - 1); - Cublas::axpy(n, - -beta_host[*iter - 2], - lanczosVecs_dev + IDX(0, *iter - 2, n), - 1, - lanczosVecs_dev + IDX(0, *iter, n), - 1); - } - - // Compute residual - beta_host[*iter - 1] = Cublas::nrm2(n, lanczosVecs_dev + IDX(0, *iter, n), 1); - - // Check if Lanczos has converged - if (beta_host[*iter - 1] <= tol) break; - // Normalize Lanczos vector - Cublas::scal(n, 1 / beta_host[*iter - 1], lanczosVecs_dev + IDX(0, *iter, n), 1); - } - - CHECK_CUDA(cudaDeviceSynchronize()); - - return 0; -} - -/// Find Householder transform for 3-dimensional system -/** Given an input vector v=[x,y,z]', this function finds a - * Householder transform P such that P*v is a multiple of - * e_1=[1,0,0]'. The input vector v is overwritten with the - * Householder vector such that P=I-2*v*v'. - * - * @param v (Input/output, host memory, 3 entries) Input - * 3-dimensional vector. On exit, the vector is set to the - * Householder vector. - * @param Pv (Output, host memory, 1 entry) First entry of P*v - * (here v is the input vector). Either equal to ||v||_2 or - * -||v||_2. - * @param P (Output, host memory, 9 entries) Householder transform - * matrix. Matrix dimensions are 3 x 3. - */ -template -static void findHouseholder3(ValueType_ *v, ValueType_ *Pv, ValueType_ *P) -{ - // Compute norm of vector - *Pv = std::sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]); - - // Choose whether to reflect to e_1 or -e_1 - // This choice avoids catastrophic cancellation - if (v[0] >= 0) *Pv = -(*Pv); - v[0] -= *Pv; - - // Normalize Householder vector - ValueType_ normHouseholder = std::sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]); - if (normHouseholder != 0) { - v[0] /= normHouseholder; - v[1] /= normHouseholder; - v[2] /= normHouseholder; - } else { - v[0] = 0; - v[1] = 0; - v[2] = 0; - } - - // Construct Householder matrix - IndexType_ i, j; - for (j = 0; j < 3; ++j) - for (i = 0; i < 3; ++i) P[IDX(i, j, 3)] = -2 * v[i] * v[j]; - for (i = 0; i < 3; ++i) P[IDX(i, i, 3)] += 1; -} - -/// Apply 3-dimensional Householder transform to 4 x 4 matrix -/** The Householder transform is pre-applied to the top three rows - * of the matrix and post-applied to the left three columns. The - * 4 x 4 matrix is intended to contain the bulge that is produced - * in the Francis QR algorithm. - * - * @param v (Input, host memory, 3 entries) Householder vector. - * @param A (Input/output, host memory, 16 entries) 4 x 4 matrix. 
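The two Householder helpers above lean on standard identities: for a unit vector v, P = I - 2*v*v' is symmetric and is its own inverse, and findHouseholder3 picks v so that P maps its input onto a multiple of e_1. The short editorial program below checks the involution property numerically; it is a sketch with a hard-coded unit v, not code from the deleted file.

#include <cmath>
#include <cstdio>

int main() {
  // Unit Householder vector (any unit v works for the identity check).
  double v[3] = {1.0 / std::sqrt(3), 1.0 / std::sqrt(3), 1.0 / std::sqrt(3)};
  double P[3][3];
  // Build the reflector P = I - 2*v*v'.
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 3; ++j)
      P[i][j] = (i == j ? 1.0 : 0.0) - 2.0 * v[i] * v[j];

  // P*P should be the identity (reflections are involutions).
  for (int i = 0; i < 3; ++i) {
    for (int j = 0; j < 3; ++j) {
      double s = 0;
      for (int e = 0; e < 3; ++e) s += P[i][e] * P[e][j];
      std::printf("%6.3f ", s);
    }
    std::printf("\n");  // prints the 3x3 identity, up to rounding
  }
  return 0;
}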
- */ -template -static void applyHouseholder3(const ValueType_ *v, ValueType_ *A) -{ - // Loop indices - IndexType_ i, j; - // Dot product between Householder vector and matrix row/column - ValueType_ vDotA; - - // Pre-apply Householder transform - for (j = 0; j < 4; ++j) { - vDotA = 0; - for (i = 0; i < 3; ++i) vDotA += v[i] * A[IDX(i, j, 4)]; - for (i = 0; i < 3; ++i) A[IDX(i, j, 4)] -= 2 * v[i] * vDotA; - } - - // Post-apply Householder transform - for (i = 0; i < 4; ++i) { - vDotA = 0; - for (j = 0; j < 3; ++j) vDotA += A[IDX(i, j, 4)] * v[j]; - for (j = 0; j < 3; ++j) A[IDX(i, j, 4)] -= 2 * vDotA * v[j]; - } -} - -/// Perform one step of Francis QR algorithm -/** Equivalent to two steps of the classical QR algorithm on a - * tridiagonal matrix. - * - * @param n Matrix dimension. - * @param shift1 QR algorithm shift. - * @param shift2 QR algorithm shift. - * @param alpha (Input/output, host memory, n entries) Diagonal - * entries of tridiagonal matrix. - * @param beta (Input/output, host memory, n-1 entries) - * Off-diagonal entries of tridiagonal matrix. - * @param V (Input/output, host memory, n*n entries) Orthonormal - * transforms from previous steps of QR algorithm. Matrix - * dimensions are n x n. On exit, the orthonormal transform from - * this Francis QR step is post-applied to the matrix. - * @param work (Output, host memory, 3*n entries) Workspace. - * @return Zero if successful. Otherwise non-zero. - */ -template -static int francisQRIteration(IndexType_ n, - ValueType_ shift1, - ValueType_ shift2, - ValueType_ *alpha, - ValueType_ *beta, - ValueType_ *V, - ValueType_ *work) -{ - // ------------------------------------------------------- - // Variable declaration - // ------------------------------------------------------- - - // Temporary storage of 4x4 bulge and Householder vector - ValueType_ bulge[16]; - - // Householder vector - ValueType_ householder[3]; - // Householder matrix - ValueType_ householderMatrix[3 * 3]; - - // Shifts are roots of the polynomial p(x)=x^2+b*x+c - ValueType_ b = -shift1 - shift2; - ValueType_ c = shift1 * shift2; - - // Loop indices - IndexType_ i, j, pos; - // Temporary variable - ValueType_ temp; - - // ------------------------------------------------------- - // Implementation - // ------------------------------------------------------- - - // Compute initial Householder transform - householder[0] = alpha[0] * alpha[0] + beta[0] * beta[0] + b * alpha[0] + c; - householder[1] = beta[0] * (alpha[0] + alpha[1] + b); - householder[2] = beta[0] * beta[1]; - findHouseholder3(householder, &temp, householderMatrix); - - // Apply initial Householder transform to create bulge - memset(bulge, 0, 16 * sizeof(ValueType_)); - for (i = 0; i < 4; ++i) bulge[IDX(i, i, 4)] = alpha[i]; - for (i = 0; i < 3; ++i) { - bulge[IDX(i + 1, i, 4)] = beta[i]; - bulge[IDX(i, i + 1, 4)] = beta[i]; - } - applyHouseholder3(householder, bulge); - Lapack::gemm(false, false, n, 3, 3, 1, V, n, householderMatrix, 3, 0, work, n); - memcpy(V, work, 3 * n * sizeof(ValueType_)); - - // Chase bulge to bottom-right of matrix with Householder transforms - for (pos = 0; pos < n - 4; ++pos) { - // Move to next position - alpha[pos] = bulge[IDX(0, 0, 4)]; - householder[0] = bulge[IDX(1, 0, 4)]; - householder[1] = bulge[IDX(2, 0, 4)]; - householder[2] = bulge[IDX(3, 0, 4)]; - for (j = 0; j < 3; ++j) - for (i = 0; i < 3; ++i) bulge[IDX(i, j, 4)] = bulge[IDX(i + 1, j + 1, 4)]; - bulge[IDX(3, 0, 4)] = 0; - bulge[IDX(3, 1, 4)] = 0; - bulge[IDX(3, 2, 4)] = beta[pos + 3]; - bulge[IDX(0, 3, 4)] 
= 0; - bulge[IDX(1, 3, 4)] = 0; - bulge[IDX(2, 3, 4)] = beta[pos + 3]; - bulge[IDX(3, 3, 4)] = alpha[pos + 4]; - - // Apply Householder transform - findHouseholder3(householder, beta + pos, householderMatrix); - applyHouseholder3(householder, bulge); - Lapack::gemm( - false, false, n, 3, 3, 1, V + IDX(0, pos + 1, n), n, householderMatrix, 3, 0, work, n); - memcpy(V + IDX(0, pos + 1, n), work, 3 * n * sizeof(ValueType_)); - } - - // Apply penultimate Householder transform - // Values in the last row and column are zero - alpha[n - 4] = bulge[IDX(0, 0, 4)]; - householder[0] = bulge[IDX(1, 0, 4)]; - householder[1] = bulge[IDX(2, 0, 4)]; - householder[2] = bulge[IDX(3, 0, 4)]; - for (j = 0; j < 3; ++j) - for (i = 0; i < 3; ++i) bulge[IDX(i, j, 4)] = bulge[IDX(i + 1, j + 1, 4)]; - bulge[IDX(3, 0, 4)] = 0; - bulge[IDX(3, 1, 4)] = 0; - bulge[IDX(3, 2, 4)] = 0; - bulge[IDX(0, 3, 4)] = 0; - bulge[IDX(1, 3, 4)] = 0; - bulge[IDX(2, 3, 4)] = 0; - bulge[IDX(3, 3, 4)] = 0; - findHouseholder3(householder, beta + n - 4, householderMatrix); - applyHouseholder3(householder, bulge); - Lapack::gemm( - false, false, n, 3, 3, 1, V + IDX(0, n - 3, n), n, householderMatrix, 3, 0, work, n); - memcpy(V + IDX(0, n - 3, n), work, 3 * n * sizeof(ValueType_)); - - // Apply final Householder transform - // Values in the last two rows and columns are zero - alpha[n - 3] = bulge[IDX(0, 0, 4)]; - householder[0] = bulge[IDX(1, 0, 4)]; - householder[1] = bulge[IDX(2, 0, 4)]; - householder[2] = 0; - for (j = 0; j < 3; ++j) - for (i = 0; i < 3; ++i) bulge[IDX(i, j, 4)] = bulge[IDX(i + 1, j + 1, 4)]; - findHouseholder3(householder, beta + n - 3, householderMatrix); - applyHouseholder3(householder, bulge); - Lapack::gemm( - false, false, n, 2, 2, 1, V + IDX(0, n - 2, n), n, householderMatrix, 3, 0, work, n); - memcpy(V + IDX(0, n - 2, n), work, 2 * n * sizeof(ValueType_)); - - // Bulge has been eliminated - alpha[n - 2] = bulge[IDX(0, 0, 4)]; - alpha[n - 1] = bulge[IDX(1, 1, 4)]; - beta[n - 2] = bulge[IDX(1, 0, 4)]; - - return 0; -} - -/// Perform implicit restart of Lanczos algorithm -/** Shifts are Chebyshev nodes of unwanted region of matrix spectrum. - * - * @param n Matrix dimension. - * @param iter Current Lanczos iteration. - * @param iter_new Lanczos iteration after restart. - * @param shiftUpper Pointer to upper bound for unwanted - * region. Value is ignored if less than *shiftLower. If a - * stronger upper bound has been found, the value is updated on - * exit. - * @param shiftLower Pointer to lower bound for unwanted - * region. Value is ignored if greater than *shiftUpper. If a - * stronger lower bound has been found, the value is updated on - * exit. - * @param alpha_host (Input/output, host memory, iter entries) - * Diagonal entries of Lanczos system. - * @param beta_host (Input/output, host memory, iter entries) - * Off-diagonal entries of Lanczos system. - * @param V_host (Output, host memory, iter*iter entries) - * Orthonormal transform used to obtain restarted system. Matrix - * dimensions are iter x iter. - * @param work_host (Output, host memory, 4*iter entries) - * Workspace. - * @param lanczosVecs_dev (Input/output, device memory, n*(iter+1) - * entries) Lanczos vectors. Vectors are stored as columns of a - * column-major matrix with dimensions n x (iter+1). - * @param work_dev (Output, device memory, (n+iter)*iter entries) - * Workspace. 
- */
-template <typename IndexType_, typename ValueType_>
-static int lanczosRestart(IndexType_ n,
-                          IndexType_ iter,
-                          IndexType_ iter_new,
-                          ValueType_ *shiftUpper,
-                          ValueType_ *shiftLower,
-                          ValueType_ *__restrict__ alpha_host,
-                          ValueType_ *__restrict__ beta_host,
-                          ValueType_ *__restrict__ V_host,
-                          ValueType_ *__restrict__ work_host,
-                          ValueType_ *__restrict__ lanczosVecs_dev,
-                          ValueType_ *__restrict__ work_dev,
-                          bool smallest_eig)
-{
-  // -------------------------------------------------------
-  // Variable declaration
-  // -------------------------------------------------------
-
-  // Useful constants
-  const ValueType_ zero = 0;
-  const ValueType_ one  = 1;
-
-  // Loop index
-  IndexType_ i;
-
-  // Number of implicit restart steps
-  // Assumed to be even since each call to Francis algorithm is
-  // equivalent to two calls of QR algorithm
-  IndexType_ restartSteps = iter - iter_new;
-
-  // Ritz values from Lanczos method
-  ValueType_ *ritzVals_host = work_host + 3 * iter;
-  // Shifts for implicit restart
-  ValueType_ *shifts_host;
-
-  // Orthonormal matrix for similarity transform
-  ValueType_ *V_dev = work_dev + n * iter;
-
-  // -------------------------------------------------------
-  // Implementation
-  // -------------------------------------------------------
-
-  // Compute Ritz values
-  memcpy(ritzVals_host, alpha_host, iter * sizeof(ValueType_));
-  memcpy(work_host, beta_host, (iter - 1) * sizeof(ValueType_));
-  Lapack<ValueType_>::sterf(iter, ritzVals_host, work_host);
-
-  // Debug: Print largest eigenvalues
-  // for (int i = iter-iter_new; i < iter; ++i)
-  //   std::cout <<*(ritzVals_host+i)<< " ";
-  // std::cout <<std::endl;
-
-  // Initialize similarity transform with identity matrix
-  memset(V_host, 0, iter * iter * sizeof(ValueType_));
-  for (i = 0; i < iter; ++i) V_host[IDX(i, i, iter)] = 1;
-
-  // Determine interval to suppress eigenvalues
-  if (smallest_eig) {
-    if (*shiftLower > *shiftUpper) {
-      *shiftUpper = ritzVals_host[iter - 1];
-      *shiftLower = ritzVals_host[iter_new];
-    } else {
-      *shiftUpper = max(*shiftUpper, ritzVals_host[iter - 1]);
-      *shiftLower = min(*shiftLower, ritzVals_host[iter_new]);
-    }
-  } else {
-    if (*shiftLower > *shiftUpper) {
-      *shiftUpper = ritzVals_host[iter - iter_new - 1];
-      *shiftLower = ritzVals_host[0];
-    } else {
-      *shiftUpper = max(*shiftUpper, ritzVals_host[iter - iter_new - 1]);
-      *shiftLower = min(*shiftLower, ritzVals_host[0]);
-    }
-  }
-
-  // Calculate Chebyshev nodes as shifts
-  shifts_host = ritzVals_host;
-  for (i = 0; i < restartSteps; ++i) {
-    shifts_host[i] = cos((i + 0.5) * static_cast<ValueType_>(M_PI) / restartSteps);
-    shifts_host[i] *= 0.5 * ((*shiftUpper) - (*shiftLower));
-    shifts_host[i] += 0.5 * ((*shiftUpper) + (*shiftLower));
-  }
-
-  // Apply Francis QR algorithm to implicitly restart Lanczos
-  for (i = 0; i < restartSteps; i += 2)
-    if (francisQRIteration(
-          iter, shifts_host[i], shifts_host[i + 1], alpha_host, beta_host, V_host, work_host))
-      WARNING("error in implicitly shifted QR algorithm");
-
-  // Obtain new residual
-  CHECK_CUDA(
-    cudaMemcpyAsync(V_dev, V_host, iter * iter * sizeof(ValueType_), cudaMemcpyHostToDevice));
-
-  beta_host[iter - 1] = beta_host[iter - 1] * V_host[IDX(iter - 1, iter_new - 1, iter)];
-  Cublas::gemv(false,
-               n,
-               iter,
-               beta_host + iter_new - 1,
-               lanczosVecs_dev,
-               n,
-               V_dev + IDX(0, iter_new, iter),
-               1,
-               beta_host + iter - 1,
-               lanczosVecs_dev + IDX(0, iter, n),
-               1);
-
-  // Obtain new Lanczos vectors
-  Cublas::gemm(
-    false, false, n, iter_new, iter, &one, lanczosVecs_dev, n, V_dev, iter, &zero, work_dev, n);
-
-  CHECK_CUDA(cudaMemcpyAsync(
-    lanczosVecs_dev, work_dev, n * iter_new * sizeof(ValueType_), cudaMemcpyDeviceToDevice));
-
-  // Normalize residual to obtain new Lanczos vector
-  CHECK_CUDA(cudaMemcpyAsync(lanczosVecs_dev + IDX(0, iter_new, n),
-                             lanczosVecs_dev + IDX(0, iter, n),
-                             n * sizeof(ValueType_),
-                             cudaMemcpyDeviceToDevice));
-  beta_host[iter_new - 1] = Cublas::nrm2(n, lanczosVecs_dev + IDX(0, iter_new, n), 1);
-  Cublas::scal(n, 1 / beta_host[iter_new - 1], lanczosVecs_dev + IDX(0, iter_new, n), 1);
-
-  return 0;
-}
-
-}  // namespace
-
-// =========================================================
-// Eigensolver
-// =========================================================
-
-/// Compute smallest eigenvectors of symmetric matrix
-/** Computes eigenvalues and eigenvectors that are least
- * positive. If matrix is positive definite or positive
- * semidefinite, the computed eigenvalues are smallest in
- * magnitude.
- *
- * The largest eigenvalue is estimated by performing several
- * Lanczos iterations. An implicitly restarted Lanczos method is
- * then applied to A+s*I, where s is the negative of the largest
- * eigenvalue.
- *
- * @param A Matrix.
- * @param nEigVecs Number of eigenvectors to compute.
- * @param maxIter Maximum number of Lanczos steps. Does not include
- * Lanczos steps used to estimate largest eigenvalue.
- * @param restartIter Maximum size of Lanczos system before
- * performing an implicit restart. Should be at least 4.
- * @param tol Convergence tolerance. Lanczos iteration will
- * terminate when the residual norm is less than tol*theta, where
- * theta is an estimate for the smallest unwanted eigenvalue
- * (i.e. the (nEigVecs+1)th smallest eigenvalue).
- * @param reorthogonalize Whether to reorthogonalize Lanczos
- * vectors.
- * @param effIter On exit, pointer to final size of Lanczos system.
- * @param totalIter On exit, pointer to total number of Lanczos
- * iterations performed. Does not include Lanczos steps used to
- * estimate largest eigenvalue.
- * @param shift On exit, pointer to matrix shift (estimate for
- * largest eigenvalue).
- * @param alpha_host (Output, host memory, restartIter entries)
- * Diagonal entries of Lanczos system.
- * @param beta_host (Output, host memory, restartIter entries)
- * Off-diagonal entries of Lanczos system.
- * @param lanczosVecs_dev (Output, device memory, n*(restartIter+1)
- * entries) Lanczos vectors. Vectors are stored as columns of a
- * column-major matrix with dimensions n x (restartIter+1).
- * @param work_dev (Output, device memory,
- * (n+restartIter)*restartIter entries) Workspace.
- * @param eigVals_dev (Output, device memory, nEigVecs entries)
- * Smallest eigenvalues of matrix.
- * @param eigVecs_dev (Output, device memory, n*nEigVecs entries)
- * Eigenvectors corresponding to smallest eigenvalues of
- * matrix. Vectors are stored as columns of a column-major matrix
- * with dimensions n x nEigVecs.
- * @return NVGRAPH error flag.
- */
-template <typename IndexType_, typename ValueType_>
-NVGRAPH_ERROR computeSmallestEigenvectors(const Matrix<IndexType_, ValueType_> *A,
-                                          IndexType_ nEigVecs,
-                                          IndexType_ maxIter,
-                                          IndexType_ restartIter,
-                                          ValueType_ tol,
-                                          bool reorthogonalize,
-                                          IndexType_ *effIter,
-                                          IndexType_ *totalIter,
-                                          ValueType_ *shift,
-                                          ValueType_ *__restrict__ alpha_host,
-                                          ValueType_ *__restrict__ beta_host,
-                                          ValueType_ *__restrict__ lanczosVecs_dev,
-                                          ValueType_ *__restrict__ work_dev,
-                                          ValueType_ *__restrict__ eigVals_dev,
-                                          ValueType_ *__restrict__ eigVecs_dev)
-{
-  // -------------------------------------------------------
-  // Variable declaration
-  // -------------------------------------------------------
-
-  // Useful constants
-  const ValueType_ one  = 1;
-  const ValueType_ zero = 0;
-
-  // Matrix dimension
-  IndexType_ n = A->n;
-
-  // Shift for implicit restart
-  ValueType_ shiftUpper;
-  ValueType_ shiftLower;
-
-  // Lanczos iteration counters
-  IndexType_ maxIter_curr = restartIter;  // Maximum size of Lanczos system
-
-  // Status flags
-  int status;
-
-  // Loop index
-  IndexType_ i;
-
-  // Host memory
-  ValueType_ *Z_host;     // Eigenvectors in Lanczos basis
-  ValueType_ *work_host;  // Workspace
-
-  // -------------------------------------------------------
-  // Check that LAPACK is enabled
-  // -------------------------------------------------------
-  // Lapack<ValueType_>::check_lapack_enabled();
-
-  // -------------------------------------------------------
-  // Check that parameters are valid
-  // -------------------------------------------------------
-  if (A->m != A->n) {
-    WARNING("invalid parameter (matrix is not square)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (nEigVecs < 1) {
-    WARNING("invalid parameter (nEigVecs<1)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (restartIter < 1) {
-    WARNING("invalid parameter (restartIter<4)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (tol < 0) {
-    WARNING("invalid parameter (tol<0)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (nEigVecs > n) {
-    WARNING("invalid parameters (nEigVecs>n)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (maxIter < nEigVecs) {
-    WARNING("invalid parameters (maxIter<nEigVecs)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-
-  // -------------------------------------------------------
-  // Variable initialization
-  // -------------------------------------------------------
-
-  // Total number of Lanczos iterations
-  *totalIter = 0;
-
-  // Allocate host memory
-  std::vector<ValueType_> Z_host_v(restartIter * restartIter);
-  std::vector<ValueType_> work_host_v(4 * restartIter);
-
-  Z_host    = Z_host_v.data();
-  work_host = work_host_v.data();
-
-  // Initialize cuBLAS
-  Cublas::set_pointer_mode_host();
-
-  // -------------------------------------------------------
-  // Compute largest eigenvalue to determine shift
-  // -------------------------------------------------------
-
-  // Random number generator
-  curandGenerator_t randGen;
-  // Initialize random number generator
-  CHECK_CURAND(curandCreateGenerator(&randGen, CURAND_RNG_PSEUDO_PHILOX4_32_10));
-
-  // FIXME: This is hard coded, which is good for unit testing...
-  //        but should really be a parameter so it could be
-  //        "random" for real runs and "fixed" for tests
-  CHECK_CURAND(curandSetPseudoRandomGeneratorSeed(randGen, 1234567 /*time(NULL)*/));
-  // CHECK_CURAND(curandSetPseudoRandomGeneratorSeed(randGen, time(NULL)));
-  // Initialize initial Lanczos vector
-  CHECK_CURAND(curandGenerateNormalX(randGen, lanczosVecs_dev, n + n % 2, zero, one));
-  ValueType_ normQ1 = Cublas::nrm2(n, lanczosVecs_dev, 1);
-  Cublas::scal(n, 1 / normQ1, lanczosVecs_dev, 1);
-
-  // Estimate number of Lanczos iterations
-  //   See bounds in Kuczynski and Wozniakowski (1992).
-  // const ValueType_ relError = 0.25;  // Relative error
-  // const ValueType_ failProb = 1e-4;  // Probability of failure
-  // maxIter_curr = log(n/pow(failProb,2))/(4*std::sqrt(relError)) + 1;
-  // maxIter_curr = min(maxIter_curr, restartIter);
-
-  // Obtain tridiagonal matrix with Lanczos
-  *effIter = 0;
-  *shift   = 0;
-  status   = performLanczosIteration<IndexType_, ValueType_>(A,
-                                                             effIter,
-                                                             maxIter_curr,
-                                                             *shift,
-                                                             0.0,
-                                                             reorthogonalize,
-                                                             alpha_host,
-                                                             beta_host,
-                                                             lanczosVecs_dev,
-                                                             work_dev);
-  if (status) WARNING("error in Lanczos iteration");
-
-  // Determine largest eigenvalue
-
-  Lapack<ValueType_>::sterf(*effIter, alpha_host, beta_host);
-  *shift = -alpha_host[*effIter - 1];
-  // std::cout << *shift <<std::endl;
-
-  // -------------------------------------------------------
-  // Compute eigenvectors of shifted matrix
-  // -------------------------------------------------------
-
-  // Obtain tridiagonal matrix with Lanczos
-  *effIter = 0;
-  // maxIter_curr = min(maxIter, restartIter);
-  status = performLanczosIteration<IndexType_, ValueType_>(A,
-                                                           effIter,
-                                                           maxIter_curr,
-                                                           *shift,
-                                                           0,
-                                                           reorthogonalize,
-                                                           alpha_host,
-                                                           beta_host,
-                                                           lanczosVecs_dev,
-                                                           work_dev);
-  if (status) WARNING("error in Lanczos iteration");
-  *totalIter += *effIter;
-
-  // Apply Lanczos method until convergence
-  shiftLower = 1;
-  shiftUpper = -1;
-  while (*totalIter < maxIter && beta_host[*effIter - 1] > tol * shiftLower) {
-    // Determine number of restart steps
-    //   Number of steps must be even due to Francis algorithm
-    IndexType_ iter_new = nEigVecs + 1;
-    if (restartIter - (maxIter - *totalIter) > nEigVecs + 1)
-      iter_new = restartIter - (maxIter - *totalIter);
-    if ((restartIter - iter_new) % 2) iter_new -= 1;
-    if (iter_new == *effIter) break;
-
-    // Implicit restart of Lanczos method
-    status = lanczosRestart<IndexType_, ValueType_>(n,
-                                                    *effIter,
-                                                    iter_new,
-                                                    &shiftUpper,
-                                                    &shiftLower,
-                                                    alpha_host,
-                                                    beta_host,
-                                                    Z_host,
-                                                    work_host,
-                                                    lanczosVecs_dev,
-                                                    work_dev,
-                                                    true);
-    if (status) WARNING("error in Lanczos implicit restart");
-    *effIter = iter_new;
-
-    // Check for convergence
-    if (beta_host[*effIter - 1] <= tol * fabs(shiftLower)) break;
-
-    // Proceed with Lanczos method
-    // maxIter_curr = min(restartIter, maxIter-*totalIter+*effIter);
-    status = performLanczosIteration<IndexType_, ValueType_>(A,
-                                                             effIter,
-                                                             maxIter_curr,
-                                                             *shift,
-                                                             tol * fabs(shiftLower),
-                                                             reorthogonalize,
-                                                             alpha_host,
-                                                             beta_host,
-                                                             lanczosVecs_dev,
-                                                             work_dev);
-    if (status) WARNING("error in Lanczos iteration");
-    *totalIter += *effIter - iter_new;
-  }
-
-  // Warning if Lanczos has failed to converge
-  if (beta_host[*effIter - 1] > tol * fabs(shiftLower)) {
-    WARNING("implicitly restarted Lanczos failed to converge");
-  }
-
-  // Solve tridiagonal system
-  memcpy(work_host + 2 * (*effIter), alpha_host, (*effIter) * sizeof(ValueType_));
-  memcpy(work_host + 3 * (*effIter), beta_host, (*effIter - 1) * sizeof(ValueType_));
-  Lapack<ValueType_>::steqr('I',
-                            *effIter,
-                            work_host + 2 * (*effIter),
-                            work_host + 3 * (*effIter),
-                            Z_host,
-                            *effIter,
-                            work_host);
-
-  // Obtain desired eigenvalues by applying shift
-  for (i = 0; i < *effIter; ++i) work_host[i + 2 * (*effIter)] -= *shift;
-  for (i = *effIter; i < nEigVecs; ++i) work_host[i + 2 * (*effIter)] = 0;
-
-  // Copy results to device memory
-  CHECK_CUDA(cudaMemcpy(eigVals_dev,
-                        work_host + 2 * (*effIter),
-                        nEigVecs * sizeof(ValueType_),
-                        cudaMemcpyHostToDevice));
-  // for (int i = 0; i < nEigVecs; ++i)
-  // {
-  //   std::cout <<*(work_host+(2*(*effIter)+i))<< std::endl;
-  // }
-  CHECK_CUDA(cudaMemcpy(
-    work_dev, Z_host, (*effIter) * nEigVecs * sizeof(ValueType_), cudaMemcpyHostToDevice));
-
-  // Convert eigenvectors from Lanczos basis to standard basis
-  Cublas::gemm(false,
-               false,
-               n,
-               nEigVecs,
-               *effIter,
-               &one,
-               lanczosVecs_dev,
-               n,
-               work_dev,
-               *effIter,
-               &zero,
-               eigVecs_dev,
-               n);
-
-  // Clean up and exit
-  CHECK_CURAND(curandDestroyGenerator(randGen));
-  return NVGRAPH_OK;
-}
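// The implicit restart above suppresses the unwanted part of the spectrum by
// running Francis QR steps whose shifts are Chebyshev nodes of the interval
// [shiftLower, shiftUpper]. A minimal host-only sketch of that shift
// computation (the helper name `chebyshev_shifts` is illustrative and not an
// nvgraph API):
#include <cmath>
#include <iostream>
#include <vector>

std::vector<double> chebyshev_shifts(double lower, double upper, int steps)
{
  const double pi = std::acos(-1.0);
  std::vector<double> shifts(steps);
  for (int i = 0; i < steps; ++i) {
    // Chebyshev node on [-1, 1], then mapped affinely onto [lower, upper]
    const double node = std::cos((i + 0.5) * pi / steps);
    shifts[i]         = 0.5 * (upper - lower) * node + 0.5 * (upper + lower);
  }
  return shifts;
}

int main()
{
  // Shifts cluster toward the endpoints of the unwanted interval [0.1, 2.0]
  for (double s : chebyshev_shifts(0.1, 2.0, 4)) std::cout << s << "\n";
  return 0;
}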
-
-/// Compute smallest eigenvectors of symmetric matrix
-/** Computes eigenvalues and eigenvectors that are least
- * positive. If matrix is positive definite or positive
- * semidefinite, the computed eigenvalues are smallest in
- * magnitude.
- *
- * The largest eigenvalue is estimated by performing several
- * Lanczos iterations. An implicitly restarted Lanczos method is
- * then applied to A+s*I, where s is the negative of the largest
- * eigenvalue.
- *
- * CNMEM must be initialized before calling this function.
- *
- * @param A Matrix.
- * @param nEigVecs Number of eigenvectors to compute.
- * @param maxIter Maximum number of Lanczos steps. Does not include
- * Lanczos steps used to estimate largest eigenvalue.
- * @param restartIter Maximum size of Lanczos system before
- * performing an implicit restart. Should be at least 4.
- * @param tol Convergence tolerance. Lanczos iteration will
- * terminate when the residual norm is less than tol*theta, where
- * theta is an estimate for the smallest unwanted eigenvalue
- * (i.e. the (nEigVecs+1)th smallest eigenvalue).
- * @param reorthogonalize Whether to reorthogonalize Lanczos
- * vectors.
- * @param iter On exit, pointer to total number of Lanczos
- * iterations performed. Does not include Lanczos steps used to
- * estimate largest eigenvalue.
- * @param eigVals_dev (Output, device memory, nEigVecs entries)
- * Smallest eigenvalues of matrix.
- * @param eigVecs_dev (Output, device memory, n*nEigVecs entries)
- * Eigenvectors corresponding to smallest eigenvalues of
- * matrix. Vectors are stored as columns of a column-major matrix
- * with dimensions n x nEigVecs.
- * @return NVGRAPH error flag.
- */
-template <typename IndexType_, typename ValueType_>
-NVGRAPH_ERROR computeSmallestEigenvectors(const Matrix<IndexType_, ValueType_> &A,
-                                          IndexType_ nEigVecs,
-                                          IndexType_ maxIter,
-                                          IndexType_ restartIter,
-                                          ValueType_ tol,
-                                          bool reorthogonalize,
-                                          IndexType_ &iter,
-                                          ValueType_ *__restrict__ eigVals_dev,
-                                          ValueType_ *__restrict__ eigVecs_dev)
-{
-  // CUDA stream
-  //   TODO: handle non-zero streams
-  cudaStream_t stream = 0;
-
-  // Matrix dimension
-  IndexType_ n = A.n;
-
-  // Check that parameters are valid
-  if (A.m != A.n) {
-    WARNING("invalid parameter (matrix is not square)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (nEigVecs < 1) {
-    WARNING("invalid parameter (nEigVecs<1)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (restartIter < 1) {
-    WARNING("invalid parameter (restartIter<4)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (tol < 0) {
-    WARNING("invalid parameter (tol<0)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (nEigVecs > n) {
-    WARNING("invalid parameters (nEigVecs>n)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (maxIter < nEigVecs) {
-    WARNING("invalid parameters (maxIter<nEigVecs)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-
-  // Allocate memory
-  std::vector<ValueType_> alpha_host_v(restartIter);
-  std::vector<ValueType_> beta_host_v(restartIter);
-
-  ValueType_ *alpha_host = alpha_host_v.data();
-  ValueType_ *beta_host  = beta_host_v.data();
-
-  Vector<ValueType_> lanczosVecs_dev(n * (restartIter + 1), stream);
-  Vector<ValueType_> work_dev((n + restartIter) * restartIter, stream);
-
-  // Perform Lanczos method
-  IndexType_ effIter;
-  ValueType_ shift;
-  NVGRAPH_ERROR status = computeSmallestEigenvectors(&A,
-                                                     nEigVecs,
-                                                     maxIter,
-                                                     restartIter,
-                                                     tol,
-                                                     reorthogonalize,
-                                                     &effIter,
-                                                     &iter,
-                                                     &shift,
-                                                     alpha_host,
-                                                     beta_host,
-                                                     lanczosVecs_dev.raw(),
-                                                     work_dev.raw(),
-                                                     eigVals_dev,
-                                                     eigVecs_dev);
-
-  // Clean up and return
-  return status;
-}
-
-// =========================================================
-// Eigensolver
-// =========================================================
-
-/// Compute largest eigenvectors of symmetric matrix
-/** Computes the most positive eigenvalues and the corresponding
- * eigenvectors. If matrix is positive definite or positive
- * semidefinite, the computed eigenvalues are largest in
- * magnitude.
- *
- * The largest eigenvalue is estimated by performing several
- * Lanczos iterations. An implicitly restarted Lanczos method is
- * then applied.
- *
- * @param A Matrix.
- * @param nEigVecs Number of eigenvectors to compute.
- * @param maxIter Maximum number of Lanczos steps.
- * @param restartIter Maximum size of Lanczos system before
- * performing an implicit restart. Should be at least 4.
- * @param tol Convergence tolerance. Lanczos iteration will
- * terminate when the residual norm is less than tol*theta, where
- * theta is an estimate for the largest unwanted eigenvalue
- * (i.e. the (nEigVecs+1)th largest eigenvalue).
- * @param reorthogonalize Whether to reorthogonalize Lanczos
- * vectors.
- * @param effIter On exit, pointer to final size of Lanczos system.
- * @param totalIter On exit, pointer to total number of Lanczos
- * iterations performed.
- * @param alpha_host (Output, host memory, restartIter entries)
- * Diagonal entries of Lanczos system.
- * @param beta_host (Output, host memory, restartIter entries)
- * Off-diagonal entries of Lanczos system.
- * @param lanczosVecs_dev (Output, device memory, n*(restartIter+1)
- * entries) Lanczos vectors. Vectors are stored as columns of a
- * column-major matrix with dimensions n x (restartIter+1).
- * @param work_dev (Output, device memory,
- * (n+restartIter)*restartIter entries) Workspace.
- * @param eigVals_dev (Output, device memory, nEigVecs entries)
- * Largest eigenvalues of matrix.
- * @param eigVecs_dev (Output, device memory, n*nEigVecs entries)
- * Eigenvectors corresponding to largest eigenvalues of
- * matrix. Vectors are stored as columns of a column-major matrix
- * with dimensions n x nEigVecs.
- * @return NVGRAPH error flag.
- */
-template <typename IndexType_, typename ValueType_>
-NVGRAPH_ERROR computeLargestEigenvectors(const Matrix<IndexType_, ValueType_> *A,
-                                         IndexType_ nEigVecs,
-                                         IndexType_ maxIter,
-                                         IndexType_ restartIter,
-                                         ValueType_ tol,
-                                         bool reorthogonalize,
-                                         IndexType_ *effIter,
-                                         IndexType_ *totalIter,
-                                         ValueType_ *__restrict__ alpha_host,
-                                         ValueType_ *__restrict__ beta_host,
-                                         ValueType_ *__restrict__ lanczosVecs_dev,
-                                         ValueType_ *__restrict__ work_dev,
-                                         ValueType_ *__restrict__ eigVals_dev,
-                                         ValueType_ *__restrict__ eigVecs_dev)
-{
-  // -------------------------------------------------------
-  // Variable declaration
-  // -------------------------------------------------------
-
-  // Useful constants
-  const ValueType_ one  = 1;
-  const ValueType_ zero = 0;
-
-  // Matrix dimension
-  IndexType_ n = A->n;
-
-  // Lanczos iteration counters
-  IndexType_ maxIter_curr = restartIter;  // Maximum size of Lanczos system
-
-  // Status flags
-  int status;
-
-  // Loop index
-  IndexType_ i;
-
-  // Host memory
-  ValueType_ *Z_host;     // Eigenvectors in Lanczos basis
-  ValueType_ *work_host;  // Workspace
-
-  // -------------------------------------------------------
-  // Check that LAPACK is enabled
-  // -------------------------------------------------------
-  // Lapack<ValueType_>::check_lapack_enabled();
-
-  // -------------------------------------------------------
-  // Check that parameters are valid
-  // -------------------------------------------------------
-  if (A->m != A->n) {
-    WARNING("invalid parameter (matrix is not square)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (nEigVecs < 1) {
-    WARNING("invalid parameter (nEigVecs<1)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (restartIter < 1) {
-    WARNING("invalid parameter (restartIter<4)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (tol < 0) {
-    WARNING("invalid parameter (tol<0)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (nEigVecs > n) {
-    WARNING("invalid parameters (nEigVecs>n)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (maxIter < nEigVecs) {
-    WARNING("invalid parameters (maxIter<nEigVecs)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-
-  // -------------------------------------------------------
-  // Variable initialization
-  // -------------------------------------------------------
-
-  // Total number of Lanczos iterations
-  *totalIter = 0;
-
-  // Allocate host memory
-  std::vector<ValueType_> Z_host_v(restartIter * restartIter);
-  std::vector<ValueType_> work_host_v(4 * restartIter);
-
-  Z_host    = Z_host_v.data();
-  work_host = work_host_v.data();
-
-  // Initialize cuBLAS
-  Cublas::set_pointer_mode_host();
-
-  // -------------------------------------------------------
-  // Compute largest eigenvalue
-  // -------------------------------------------------------
-
-  // Random number generator
-  curandGenerator_t randGen;
-  // Initialize random number generator
-  CHECK_CURAND(curandCreateGenerator(&randGen, CURAND_RNG_PSEUDO_PHILOX4_32_10));
-  CHECK_CURAND(curandSetPseudoRandomGeneratorSeed(randGen, 123456));
-  // Initialize initial Lanczos vector
-  CHECK_CURAND(curandGenerateNormalX(randGen, lanczosVecs_dev, n + n % 2, zero, one));
-  ValueType_ normQ1 = Cublas::nrm2(n, lanczosVecs_dev, 1);
-  Cublas::scal(n, 1 / normQ1, lanczosVecs_dev, 1);
-
-  // Estimate number of Lanczos iterations
-  //   See bounds in Kuczynski and Wozniakowski (1992).
- // const ValueType_ relError = 0.25; // Relative error - // const ValueType_ failProb = 1e-4; // Probability of failure - // maxIter_curr = log(n/pow(failProb,2))/(4*std::sqrt(relError)) + 1; - // maxIter_curr = min(maxIter_curr, restartIter); - - // Obtain tridiagonal matrix with Lanczos - *effIter = 0; - ValueType_ shift_val = 0.0; - ValueType_ *shift = &shift_val; - // maxIter_curr = min(maxIter, restartIter); - status = performLanczosIteration(A, - effIter, - maxIter_curr, - *shift, - 0, - reorthogonalize, - alpha_host, - beta_host, - lanczosVecs_dev, - work_dev); - if (status) WARNING("error in Lanczos iteration"); - *totalIter += *effIter; - - // Apply Lanczos method until convergence - ValueType_ shiftLower = 1; - ValueType_ shiftUpper = -1; - while (*totalIter < maxIter && beta_host[*effIter - 1] > tol * shiftLower) { - // Determine number of restart steps - // Number of steps must be even due to Francis algorithm - IndexType_ iter_new = nEigVecs + 1; - if (restartIter - (maxIter - *totalIter) > nEigVecs + 1) - iter_new = restartIter - (maxIter - *totalIter); - if ((restartIter - iter_new) % 2) iter_new -= 1; - if (iter_new == *effIter) break; - - // Implicit restart of Lanczos method - status = lanczosRestart(n, - *effIter, - iter_new, - &shiftUpper, - &shiftLower, - alpha_host, - beta_host, - Z_host, - work_host, - lanczosVecs_dev, - work_dev, - false); - if (status) WARNING("error in Lanczos implicit restart"); - *effIter = iter_new; - - // Check for convergence - if (beta_host[*effIter - 1] <= tol * fabs(shiftLower)) break; - - // Proceed with Lanczos method - // maxIter_curr = min(restartIter, maxIter-*totalIter+*effIter); - status = performLanczosIteration(A, - effIter, - maxIter_curr, - *shift, - tol * fabs(shiftLower), - reorthogonalize, - alpha_host, - beta_host, - lanczosVecs_dev, - work_dev); - if (status) WARNING("error in Lanczos iteration"); - *totalIter += *effIter - iter_new; - } - - // Warning if Lanczos has failed to converge - if (beta_host[*effIter - 1] > tol * fabs(shiftLower)) { - WARNING("implicitly restarted Lanczos failed to converge"); - } - for (int i = 0; i < restartIter; ++i) { - for (int j = 0; j < restartIter; ++j) Z_host[i * restartIter + j] = 0; - } - // Solve tridiagonal system - memcpy(work_host + 2 * (*effIter), alpha_host, (*effIter) * sizeof(ValueType_)); - memcpy(work_host + 3 * (*effIter), beta_host, (*effIter - 1) * sizeof(ValueType_)); - Lapack::steqr('I', - *effIter, - work_host + 2 * (*effIter), - work_host + 3 * (*effIter), - Z_host, - *effIter, - work_host); - - // note: We need to pick the top nEigVecs eigenvalues - // but effItter can be larger than nEigVecs - // hence we add an offset for that case, because we want to access top nEigVecs eigenpairs in the - // matrix of size effIter. 
remember the array is sorted, so it is not needed for smallest
-  // eigenvalues case because the first ones are the smallest ones
-
-  IndexType_ top_eigenparis_idx_offset = *effIter - nEigVecs;
-
-  // Debug : print nEigVecs largest eigenvalues
-  // for (int i = top_eigenparis_idx_offset; i < *effIter; ++i)
-  //   std::cout <<*(work_host+(2*(*effIter)+i))<< " ";
-  // std::cout <<std::endl;
-
-  // Obtain desired eigenvalues by applying shift
-  for (i = 0; i < *effIter; ++i) work_host[i + 2 * (*effIter)] -= *shift;
-
-  // Copy results to device memory
-  //   skip smallest eigenvalues if needed
-  CHECK_CUDA(cudaMemcpy(eigVals_dev,
-                        work_host + 2 * (*effIter) + top_eigenparis_idx_offset,
-                        nEigVecs * sizeof(ValueType_),
-                        cudaMemcpyHostToDevice));
-
-  //   skip smallest eigenvectors if needed
-  CHECK_CUDA(cudaMemcpy(work_dev,
-                        Z_host + (top_eigenparis_idx_offset * (*effIter)),
-                        (*effIter) * nEigVecs * sizeof(ValueType_),
-                        cudaMemcpyHostToDevice));
-
-  // Convert eigenvectors from Lanczos basis to standard basis
-  Cublas::gemm(false,
-               false,
-               n,
-               nEigVecs,
-               *effIter,
-               &one,
-               lanczosVecs_dev,
-               n,
-               work_dev,
-               *effIter,
-               &zero,
-               eigVecs_dev,
-               n);
-
-  // Clean up and exit
-  CHECK_CURAND(curandDestroyGenerator(randGen));
-  return NVGRAPH_OK;
-}
-
-/// Compute largest eigenvectors of symmetric matrix
-/** Computes the most positive eigenvalues and the corresponding
- * eigenvectors. If matrix is positive definite or positive
- * semidefinite, the computed eigenvalues are largest in
- * magnitude.
- *
- * The largest eigenvalue is estimated by performing several
- * Lanczos iterations. An implicitly restarted Lanczos method is
- * then applied.
- *
- * CNMEM must be initialized before calling this function.
- *
- * @param A Matrix.
- * @param nEigVecs Number of eigenvectors to compute.
- * @param maxIter Maximum number of Lanczos steps.
- * @param restartIter Maximum size of Lanczos system before
- * performing an implicit restart. Should be at least 4.
- * @param tol Convergence tolerance. Lanczos iteration will
- * terminate when the residual norm is less than tol*theta, where
- * theta is an estimate for the largest unwanted eigenvalue
- * (i.e. the (nEigVecs+1)th largest eigenvalue).
- * @param reorthogonalize Whether to reorthogonalize Lanczos
- * vectors.
- * @param iter On exit, pointer to total number of Lanczos
- * iterations performed.
- * @param eigVals_dev (Output, device memory, nEigVecs entries)
- * Largest eigenvalues of matrix.
- * @param eigVecs_dev (Output, device memory, n*nEigVecs entries)
- * Eigenvectors corresponding to largest eigenvalues of
- * matrix. Vectors are stored as columns of a column-major matrix
- * with dimensions n x nEigVecs.
- * @return NVGRAPH error flag.
- */
-template <typename IndexType_, typename ValueType_>
-NVGRAPH_ERROR computeLargestEigenvectors(const Matrix<IndexType_, ValueType_> &A,
-                                         IndexType_ nEigVecs,
-                                         IndexType_ maxIter,
-                                         IndexType_ restartIter,
-                                         ValueType_ tol,
-                                         bool reorthogonalize,
-                                         IndexType_ &iter,
-                                         ValueType_ *__restrict__ eigVals_dev,
-                                         ValueType_ *__restrict__ eigVecs_dev)
-{
-  // CUDA stream
-  //   TODO: handle non-zero streams
-  cudaStream_t stream = 0;
-
-  // Matrix dimension
-  IndexType_ n = A.n;
-
-  // Check that parameters are valid
-  if (A.m != A.n) {
-    WARNING("invalid parameter (matrix is not square)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (nEigVecs < 1) {
-    WARNING("invalid parameter (nEigVecs<1)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (restartIter < 1) {
-    WARNING("invalid parameter (restartIter<4)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (tol < 0) {
-    WARNING("invalid parameter (tol<0)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (nEigVecs > n) {
-    WARNING("invalid parameters (nEigVecs>n)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-  if (maxIter < nEigVecs) {
-    WARNING("invalid parameters (maxIter<nEigVecs)");
-    return NVGRAPH_ERR_BAD_PARAMETERS;
-  }
-
-  // Allocate memory
-  std::vector<ValueType_> alpha_host_v(restartIter);
-  std::vector<ValueType_> beta_host_v(restartIter);
-
-  ValueType_ *alpha_host = alpha_host_v.data();
-  ValueType_ *beta_host  = beta_host_v.data();
-
-  Vector<ValueType_> lanczosVecs_dev(n * (restartIter + 1), stream);
-  Vector<ValueType_> work_dev((n + restartIter) * restartIter, stream);
-
-  // Perform Lanczos method
-  IndexType_ effIter;
-  NVGRAPH_ERROR status = computeLargestEigenvectors(&A,
-                                                    nEigVecs,
-                                                    maxIter,
-                                                    restartIter,
-                                                    tol,
-                                                    reorthogonalize,
-                                                    &effIter,
-                                                    &iter,
-                                                    alpha_host,
-                                                    beta_host,
-                                                    lanczosVecs_dev.raw(),
-                                                    work_dev.raw(),
-                                                    eigVals_dev,
-                                                    eigVecs_dev);
-
-  // Clean up and return
-  return status;
-}
-
-// =========================================================
-// Explicit instantiation
-// =========================================================
-
-template NVGRAPH_ERROR computeSmallestEigenvectors<int, float>(const Matrix<int, float> &A,
-                                                               int nEigVecs,
-                                                               int maxIter,
-                                                               int restartIter,
-                                                               float tol,
-                                                               bool reorthogonalize,
-                                                               int &iter,
-                                                               float *__restrict__ eigVals_dev,
-                                                               float *__restrict__ eigVecs_dev);
-template NVGRAPH_ERROR computeSmallestEigenvectors<int, double>(const Matrix<int, double> &A,
-                                                                int nEigVecs,
-                                                                int maxIter,
-                                                                int restartIter,
-                                                                double tol,
-                                                                bool reorthogonalize,
-                                                                int &iter,
-                                                                double *__restrict__ eigVals_dev,
-                                                                double *__restrict__ eigVecs_dev);
-
-template NVGRAPH_ERROR computeLargestEigenvectors<int, float>(const Matrix<int, float> &A,
-                                                              int nEigVecs,
-                                                              int maxIter,
-                                                              int restartIter,
-                                                              float tol,
-                                                              bool reorthogonalize,
-                                                              int &iter,
-                                                              float *__restrict__ eigVals_dev,
-                                                              float *__restrict__ eigVecs_dev);
-template NVGRAPH_ERROR computeLargestEigenvectors<int, double>(const Matrix<int, double> &A,
-                                                               int nEigVecs,
-                                                               int maxIter,
-                                                               int restartIter,
-                                                               double tol,
-                                                               bool reorthogonalize,
-                                                               int &iter,
-                                                               double *__restrict__ eigVals_dev,
-                                                               double *__restrict__ eigVecs_dev);
-
-}  // namespace nvgraph
diff --git a/cpp/src/nvgraph/modularity_maximization.cu b/cpp/src/nvgraph/modularity_maximization.cu
deleted file mode 100644
index bd90f3093aa..00000000000
--- a/cpp/src/nvgraph/modularity_maximization.cu
+++ /dev/null
@@ -1,436 +0,0 @@
-/*
- * Copyright (c) 2019-2020, NVIDIA CORPORATION.
- * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -//#ifdef NVGRAPH_PARTITION - -#include "include/modularity_maximization.hxx" - -#include -#include - -#include -#include -#include -#include -#include - -#include "include/debug_macros.h" -#include "include/kmeans.hxx" -#include "include/lanczos.hxx" -#include "include/nvgraph_cublas.hxx" -#include "include/nvgraph_error.hxx" -#include "include/nvgraph_vector.hxx" -#include "include/sm_utils.h" -#include "include/spectral_matrix.hxx" - -//#define COLLECT_TIME_STATISTICS 1 -//#undef COLLECT_TIME_STATISTICS - -#ifdef COLLECT_TIME_STATISTICS -#include -#include -#include -#include -#include "cuda_profiler_api.h" -#endif - -#ifdef COLLECT_TIME_STATISTICS -static double timer(void) -{ - struct timeval tv; - cudaDeviceSynchronize(); - gettimeofday(&tv, NULL); - return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0; -} -#endif - -namespace nvgraph { - -// ========================================================= -// Useful macros -// ========================================================= - -// Get index of matrix entry -#define IDX(i, j, lda) ((i) + (j) * (lda)) - -template -static __global__ void scale_obs_kernel(IndexType_ m, IndexType_ n, ValueType_ *obs) -{ - IndexType_ i, j, k, index, mm; - ValueType_ alpha, v, last; - bool valid; - // ASSUMPTION: kernel is launched with either 2, 4, 8, 16 or 32 threads in x-dimension - - // compute alpha - mm = (((m + blockDim.x - 1) / blockDim.x) * blockDim.x); // m in multiple of blockDim.x - alpha = 0.0; - // printf("[%d,%d,%d,%d] n=%d, li=%d, mn=%d \n",threadIdx.x,threadIdx.y,blockIdx.x,blockIdx.y, n, - // li, mn); - for (j = threadIdx.y + blockIdx.y * blockDim.y; j < n; j += blockDim.y * gridDim.y) { - for (i = threadIdx.x; i < mm; i += blockDim.x) { - // check if the thread is valid - valid = i < m; - - // get the value of the last thread - last = utils::shfl(alpha, blockDim.x - 1, blockDim.x); - - // if you are valid read the value from memory, otherwise set your value to 0 - alpha = (valid) ? 
obs[i + j * m] : 0.0; - alpha = alpha * alpha; - - // do prefix sum (of size warpSize=blockDim.x =< 32) - for (k = 1; k < blockDim.x; k *= 2) { - v = utils::shfl_up(alpha, k, blockDim.x); - if (threadIdx.x >= k) alpha += v; - } - // shift by last - alpha += last; - } - } - - // scale by alpha - alpha = utils::shfl(alpha, blockDim.x - 1, blockDim.x); - alpha = std::sqrt(alpha); - for (j = threadIdx.y + blockIdx.y * blockDim.y; j < n; j += blockDim.y * gridDim.y) { - for (i = threadIdx.x; i < m; i += blockDim.x) { // blockDim.x=32 - index = i + j * m; - obs[index] = obs[index] / alpha; - } - } -} - -template -IndexType_ next_pow2(IndexType_ n) -{ - IndexType_ v; - // Reference: - // http://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2Float - v = n - 1; - v |= v >> 1; - v |= v >> 2; - v |= v >> 4; - v |= v >> 8; - v |= v >> 16; - return v + 1; -} - -template -cudaError_t scale_obs(IndexType_ m, IndexType_ n, ValueType_ *obs) -{ - IndexType_ p2m; - dim3 nthreads, nblocks; - - // find next power of 2 - p2m = next_pow2(m); - // setup launch configuration - nthreads.x = max(2, min(p2m, 32)); - nthreads.y = 256 / nthreads.x; - nthreads.z = 1; - nblocks.x = 1; - nblocks.y = (n + nthreads.y - 1) / nthreads.y; - nblocks.z = 1; - // printf("m=%d(%d),n=%d,obs=%p, - // nthreads=(%d,%d,%d),nblocks=(%d,%d,%d)\n",m,p2m,n,obs,nthreads.x,nthreads.y,nthreads.z,nblocks.x,nblocks.y,nblocks.z); - - // launch scaling kernel (scale each column of obs by its norm) - scale_obs_kernel<<>>(m, n, obs); - cudaCheckError(); - - return cudaSuccess; -} - -// ========================================================= -// Spectral modularity_maximization -// ========================================================= - -/** Compute partition for a weighted undirected graph. This - * partition attempts to minimize the cost function: - * Cost = \sum_i (Edges cut by ith partition)/(Vertices in ith partition) - * - * @param G Weighted graph in CSR format - * @param nClusters Number of partitions. - * @param nEigVecs Number of eigenvectors to compute. - * @param maxIter_lanczos Maximum number of Lanczos iterations. - * @param restartIter_lanczos Maximum size of Lanczos system before - * implicit restart. - * @param tol_lanczos Convergence tolerance for Lanczos method. - * @param maxIter_kmeans Maximum number of k-means iterations. - * @param tol_kmeans Convergence tolerance for k-means algorithm. - * @param parts (Output, device memory, n entries) Cluster - * assignments. - * @param iters_lanczos On exit, number of Lanczos iterations - * performed. - * @param iters_kmeans On exit, number of k-means iterations - * performed. - * @return NVGRAPH error flag. 
- */ -template -NVGRAPH_ERROR modularity_maximization( - cugraph::experimental::GraphCSRView const &graph, - vertex_t nClusters, - vertex_t nEigVecs, - int maxIter_lanczos, - int restartIter_lanczos, - weight_t tol_lanczos, - int maxIter_kmeans, - weight_t tol_kmeans, - vertex_t *__restrict__ clusters, - weight_t *eigVals, - weight_t *eigVecs, - int &iters_lanczos, - int &iters_kmeans) -{ - cudaStream_t stream = 0; - const weight_t zero{0.0}; - const weight_t one{1.0}; - - edge_t i; - edge_t n = graph.number_of_vertices; - - // k-means residual - weight_t residual_kmeans; - - // Compute eigenvectors of Modularity Matrix - // Initialize Modularity Matrix - CsrMatrix A(false, - false, - graph.number_of_vertices, - graph.number_of_vertices, - graph.number_of_edges, - 0, - graph.edge_data, - graph.offsets, - graph.indices); - ModularityMatrix B(A, graph.number_of_edges); - - // Compute smallest eigenvalues and eigenvectors - CHECK_NVGRAPH(computeLargestEigenvectors(B, - nEigVecs, - maxIter_lanczos, - restartIter_lanczos, - tol_lanczos, - false, - iters_lanczos, - eigVals, - eigVecs)); - - // eigVals.dump(0, nEigVecs); - // eigVecs.dump(0, nEigVecs); - // eigVecs.dump(n, nEigVecs); - // eigVecs.dump(2*n, nEigVecs); - // Whiten eigenvector matrix - for (i = 0; i < nEigVecs; ++i) { - weight_t mean, std; - mean = thrust::reduce(thrust::device_pointer_cast(eigVecs + IDX(0, i, n)), - thrust::device_pointer_cast(eigVecs + IDX(0, i + 1, n))); - cudaCheckError(); - mean /= n; - thrust::transform(thrust::device_pointer_cast(eigVecs + IDX(0, i, n)), - thrust::device_pointer_cast(eigVecs + IDX(0, i + 1, n)), - thrust::make_constant_iterator(mean), - thrust::device_pointer_cast(eigVecs + IDX(0, i, n)), - thrust::minus()); - cudaCheckError(); - std = Cublas::nrm2(n, eigVecs + IDX(0, i, n), 1) / std::sqrt(static_cast(n)); - thrust::transform(thrust::device_pointer_cast(eigVecs + IDX(0, i, n)), - thrust::device_pointer_cast(eigVecs + IDX(0, i + 1, n)), - thrust::make_constant_iterator(std), - thrust::device_pointer_cast(eigVecs + IDX(0, i, n)), - thrust::divides()); - cudaCheckError(); - } - - // Transpose eigenvector matrix - // TODO: in-place transpose - { - Vector work(nEigVecs * n, stream); - Cublas::set_pointer_mode_host(); - Cublas::geam(true, - false, - nEigVecs, - n, - &one, - eigVecs, - n, - &zero, - (weight_t *)NULL, - nEigVecs, - work.raw(), - nEigVecs); - CHECK_CUDA(cudaMemcpyAsync( - eigVecs, work.raw(), nEigVecs * n * sizeof(weight_t), cudaMemcpyDeviceToDevice)); - } - - // WARNING: notice that at this point the matrix has already been transposed, so we are scaling - // columns - scale_obs(nEigVecs, n, eigVecs); - cudaCheckError(); - - // eigVecs.dump(0, nEigVecs*n); - // Find partition with k-means clustering - CHECK_NVGRAPH(kmeans(n, - nEigVecs, - nClusters, - tol_kmeans, - maxIter_kmeans, - eigVecs, - clusters, - residual_kmeans, - iters_kmeans)); - - return NVGRAPH_OK; -} -//=================================================== -// Analysis of graph partition -// ========================================================= - -namespace { -/// Functor to generate indicator vectors -/** For use in Thrust transform - */ -template -struct equal_to_i_op { - const IndexType_ i; - - public: - equal_to_i_op(IndexType_ _i) : i(_i) {} - template - __host__ __device__ void operator()(Tuple_ t) - { - thrust::get<1>(t) = (thrust::get<0>(t) == i) ? 
(ValueType_)1.0 : (ValueType_)0.0;
-  }
-};
-}  // namespace
-
-/// Compute modularity
-/** This function determines the modularity based on a graph and cluster assignments
- * @param G Weighted graph in CSR format
- * @param nClusters Number of clusters.
- * @param parts (Input, device memory, n entries) Cluster assignments.
- * @param modularity On exit, modularity
- */
-template <typename vertex_t, typename edge_t, typename weight_t>
-NVGRAPH_ERROR analyzeModularity(
-  cugraph::experimental::GraphCSRView<vertex_t, edge_t, weight_t> const &graph,
-  vertex_t nClusters,
-  const vertex_t *__restrict__ parts,
-  weight_t &modularity)
-{
-  cudaStream_t stream = 0;
-  edge_t i;
-  edge_t n = graph.number_of_vertices;
-  weight_t partModularity, partSize;
-
-  // Device memory
-  Vector<weight_t> part_i(n, stream);
-  Vector<weight_t> Bx(n, stream);
-
-  // Initialize cuBLAS
-  Cublas::set_pointer_mode_host();
-
-  // Initialize Modularity
-  CsrMatrix<vertex_t, weight_t> A(false,
-                                  false,
-                                  graph.number_of_vertices,
-                                  graph.number_of_vertices,
-                                  graph.number_of_edges,
-                                  0,
-                                  graph.edge_data,
-                                  graph.offsets,
-                                  graph.indices);
-  ModularityMatrix<vertex_t, weight_t> B(A, graph.number_of_edges);
-
-  // Initialize output
-  modularity = 0;
-
-  // Iterate through partitions
-  for (i = 0; i < nClusters; ++i) {
-    // Construct indicator vector for ith partition
-    thrust::for_each(
-      thrust::make_zip_iterator(thrust::make_tuple(thrust::device_pointer_cast(parts),
-                                                   thrust::device_pointer_cast(part_i.raw()))),
-      thrust::make_zip_iterator(thrust::make_tuple(thrust::device_pointer_cast(parts + n),
-                                                   thrust::device_pointer_cast(part_i.raw() + n))),
-      equal_to_i_op<vertex_t, weight_t>(i));
-    cudaCheckError();
-
-    // Compute size of ith partition
-    Cublas::dot(n, part_i.raw(), 1, part_i.raw(), 1, &partSize);
-    partSize = round(partSize);
-    if (partSize < 0.5) {
-      WARNING("empty partition");
-      continue;
-    }
-
-    // Compute modularity
-    B.mv(1, part_i.raw(), 0, Bx.raw());
-    Cublas::dot(n, Bx.raw(), 1, part_i.raw(), 1, &partModularity);
-
-    // Record results
-    modularity += partModularity;
-    // std::cout<< "partModularity " <<partModularity<< std::endl;
-  }
-
-  // divide by 2m
-  modularity = modularity / B.getEdgeSum();
-
-  // Clean up and return
-  return NVGRAPH_OK;
-}
-
-// =========================================================
-// Explicit instantiation
-// =========================================================
-template NVGRAPH_ERROR modularity_maximization<int, int, float>(
-  cugraph::experimental::GraphCSRView<int, int, float> const &graph,
-  int nClusters,
-  int nEigVecs,
-  int maxIter_lanczos,
-  int restartIter_lanczos,
-  float tol_lanczos,
-  int maxIter_kmeans,
-  float tol_kmeans,
-  int *__restrict__ parts,
-  float *eigVals,
-  float *eigVecs,
-  int &iters_lanczos,
-  int &iters_kmeans);
-template NVGRAPH_ERROR modularity_maximization<int, int, double>(
-  cugraph::experimental::GraphCSRView<int, int, double> const &graph,
-  int nClusters,
-  int nEigVecs,
-  int maxIter_lanczos,
-  int restartIter_lanczos,
-  double tol_lanczos,
-  int maxIter_kmeans,
-  double tol_kmeans,
-  int *__restrict__ parts,
-  double *eigVals,
-  double *eigVecs,
-  int &iters_lanczos,
-  int &iters_kmeans);
-template NVGRAPH_ERROR analyzeModularity<int, int, float>(
-  cugraph::experimental::GraphCSRView<int, int, float> const &graph,
-  int nClusters,
-  const int *__restrict__ parts,
-  float &modularity);
-template NVGRAPH_ERROR analyzeModularity<int, int, double>(
-  cugraph::experimental::GraphCSRView<int, int, double> const &graph,
-  int nClusters,
-  const int *__restrict__ parts,
-  double &modularity);
-
-}  // namespace nvgraph
-//#endif //NVGRAPH_PARTITION
diff --git a/cpp/src/nvgraph/nvgraph_cublas.cpp b/cpp/src/nvgraph/nvgraph_cublas.cpp
deleted file mode 100644
index ceb3ad25d6b..00000000000
--- a/cpp/src/nvgraph/nvgraph_cublas.cpp
+++ /dev/null
@@ -1,569 +0,0 @@
-/*
- * Copyright (c) 2019, NVIDIA CORPORATION.
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#include "include/nvgraph_cublas.hxx" - -namespace nvgraph { - -cublasHandle_t Cublas::m_handle = 0; - -namespace { -cublasStatus_t cublas_axpy( - cublasHandle_t handle, int n, const float* alpha, const float* x, int incx, float* y, int incy) -{ - return cublasSaxpy(handle, n, alpha, x, incx, y, incy); -} - -cublasStatus_t cublas_axpy( - cublasHandle_t handle, int n, const double* alpha, const double* x, int incx, double* y, int incy) -{ - return cublasDaxpy(handle, n, alpha, x, incx, y, incy); -} - -cublasStatus_t cublas_copy( - cublasHandle_t handle, int n, const float* x, int incx, float* y, int incy) -{ - return cublasScopy(handle, n, x, incx, y, incy); -} - -cublasStatus_t cublas_copy( - cublasHandle_t handle, int n, const double* x, int incx, double* y, int incy) -{ - return cublasDcopy(handle, n, x, incx, y, incy); -} - -cublasStatus_t cublas_dot( - cublasHandle_t handle, int n, const float* x, int incx, const float* y, int incy, float* result) -{ - return cublasSdot(handle, n, x, incx, y, incy, result); -} - -cublasStatus_t cublas_dot(cublasHandle_t handle, - int n, - const double* x, - int incx, - const double* y, - int incy, - double* result) -{ - return cublasDdot(handle, n, x, incx, y, incy, result); -} - -cublasStatus_t cublas_trsv_v2(cublasHandle_t handle, - cublasFillMode_t uplo, - cublasOperation_t trans, - cublasDiagType_t diag, - int n, - const float* A, - int lda, - float* x, - int incx) -{ - return cublasStrsv(handle, uplo, trans, diag, n, A, lda, x, incx); -} -cublasStatus_t cublas_trsv_v2(cublasHandle_t handle, - cublasFillMode_t uplo, - cublasOperation_t trans, - cublasDiagType_t diag, - int n, - const double* A, - int lda, - double* x, - int incx) -{ - return cublasDtrsv(handle, uplo, trans, diag, n, A, lda, x, incx); -} - -cublasStatus_t cublas_gemm(cublasHandle_t handle, - cublasOperation_t transa, - cublasOperation_t transb, - int m, - int n, - int k, - const float* alpha, - const float* A, - int lda, - const float* B, - int ldb, - const float* beta, - float* C, - int ldc) -{ - return cublasSgemm(handle, transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc); -} - -cublasStatus_t cublas_gemm(cublasHandle_t handle, - cublasOperation_t transa, - cublasOperation_t transb, - int m, - int n, - int k, - const double* alpha, - const double* A, - int lda, - const double* B, - int ldb, - const double* beta, - double* C, - int ldc) -{ - return cublasDgemm(handle, transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc); -} - -cublasStatus_t cublas_gemv(cublasHandle_t handle, - cublasOperation_t trans, - int m, - int n, - const float* alpha, - const float* A, - int lda, - const float* x, - int incx, - const float* beta, - float* y, - int incy) -{ - return cublasSgemv(handle, trans, m, n, alpha, A, lda, x, incx, beta, y, incy); -} - -cublasStatus_t cublas_gemv(cublasHandle_t handle, - cublasOperation_t trans, - int m, - int n, - const double* alpha, - const double* A, - int lda, - const double* x, - int incx, - const double* beta, - double* y, - int incy) -{ - return cublasDgemv(handle, trans, m, n, alpha, A, lda, x, incx, beta, y, incy); 
-} - -cublasStatus_t cublas_ger(cublasHandle_t handle, - int m, - int n, - const float* alpha, - const float* x, - int incx, - const float* y, - int incy, - float* A, - int lda) -{ - return cublasSger(handle, m, n, alpha, x, incx, y, incy, A, lda); -} - -cublasStatus_t cublas_ger(cublasHandle_t handle, - int m, - int n, - const double* alpha, - const double* x, - int incx, - const double* y, - int incy, - double* A, - int lda) -{ - return cublasDger(handle, m, n, alpha, x, incx, y, incy, A, lda); -} - -cublasStatus_t cublas_nrm2(cublasHandle_t handle, int n, const float* x, int incx, float* result) -{ - return cublasSnrm2(handle, n, x, incx, result); -} - -cublasStatus_t cublas_nrm2(cublasHandle_t handle, int n, const double* x, int incx, double* result) -{ - return cublasDnrm2(handle, n, x, incx, result); -} - -cublasStatus_t cublas_scal(cublasHandle_t handle, int n, const float* alpha, float* x, int incx) -{ - return cublasSscal(handle, n, alpha, x, incx); -} - -cublasStatus_t cublas_scal(cublasHandle_t handle, int n, const double* alpha, double* x, int incx) -{ - return cublasDscal(handle, n, alpha, x, incx); -} - -cublasStatus_t cublas_geam(cublasHandle_t handle, - cublasOperation_t transa, - cublasOperation_t transb, - int m, - int n, - const float* alpha, - const float* A, - int lda, - const float* beta, - const float* B, - int ldb, - float* C, - int ldc) -{ - return cublasSgeam(handle, transa, transb, m, n, alpha, A, lda, beta, B, ldb, C, ldc); -} - -cublasStatus_t cublas_geam(cublasHandle_t handle, - cublasOperation_t transa, - cublasOperation_t transb, - int m, - int n, - const double* alpha, - const double* A, - int lda, - const double* beta, - const double* B, - int ldb, - double* C, - int ldc) -{ - return cublasDgeam(handle, transa, transb, m, n, alpha, A, lda, beta, B, ldb, C, ldc); -} - -} // anonymous namespace. 
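// The anonymous-namespace overloads above exist so that a single templated
// wrapper can select the correct S/D cuBLAS entry point through ordinary
// overload resolution. A minimal self-contained sketch of the same dispatch
// pattern (the names `scale_impl`/`scale` are illustrative and not part of
// nvgraph):
#include <iostream>

namespace {
float scale_impl(float x) { return 2.0f * x; }   // stands in for a cublasS* call
double scale_impl(double x) { return 2.0 * x; }  // stands in for a cublasD* call

// One templated front end; the compiler picks the float or double overload.
template <typename T>
T scale(T x)
{
  return scale_impl(x);
}
}  // namespace

int main()
{
  std::cout << scale(1.5f) << " " << scale(2.5) << std::endl;  // prints "3 5"
  return 0;
}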
- -void Cublas::set_pointer_mode_device() -{ - cublasHandle_t handle = Cublas::get_handle(); - cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE); -} - -void Cublas::set_pointer_mode_host() -{ - cublasHandle_t handle = Cublas::get_handle(); - cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST); -} - -template -void Cublas::axpy(int n, T alpha, const T* x, int incx, T* y, int incy) -{ - cublasHandle_t handle = Cublas::get_handle(); - CHECK_CUBLAS(cublas_axpy(handle, n, &alpha, x, incx, y, incy)); -} - -template -void Cublas::copy(int n, const T* x, int incx, T* y, int incy) -{ - cublasHandle_t handle = Cublas::get_handle(); - CHECK_CUBLAS(cublas_copy(handle, n, x, incx, y, incy)); -} - -template -void Cublas::dot(int n, const T* x, int incx, const T* y, int incy, T* result) -{ - cublasHandle_t handle = Cublas::get_handle(); - CHECK_CUBLAS(cublas_dot(handle, n, x, incx, y, incy, result)); -} - -template -T Cublas::nrm2(int n, const T* x, int incx) -{ - Cublas::get_handle(); - T result; - Cublas::nrm2(n, x, incx, &result); - return result; -} - -template -void Cublas::nrm2(int n, const T* x, int incx, T* result) -{ - cublasHandle_t handle = Cublas::get_handle(); - CHECK_CUBLAS(cublas_nrm2(handle, n, x, incx, result)); -} - -template -void Cublas::scal(int n, T alpha, T* x, int incx) -{ - Cublas::scal(n, &alpha, x, incx); -} - -template -void Cublas::scal(int n, T* alpha, T* x, int incx) -{ - cublasHandle_t handle = Cublas::get_handle(); - CHECK_CUBLAS(cublas_scal(handle, n, alpha, x, incx)); -} - -template -void Cublas::gemv(bool transposed, - int m, - int n, - const T* alpha, - const T* A, - int lda, - const T* x, - int incx, - const T* beta, - T* y, - int incy) -{ - cublasHandle_t handle = Cublas::get_handle(); - cublasOperation_t trans = transposed ? CUBLAS_OP_T : CUBLAS_OP_N; - CHECK_CUBLAS(cublas_gemv(handle, trans, m, n, alpha, A, lda, x, incx, beta, y, incy)); -} - -template -void Cublas::gemv_ext(bool transposed, - const int m, - const int n, - const T* alpha, - const T* A, - const int lda, - const T* x, - const int incx, - const T* beta, - T* y, - const int incy, - const int offsetx, - const int offsety, - const int offseta) -{ - cublasHandle_t handle = Cublas::get_handle(); - cublasOperation_t trans = transposed ? CUBLAS_OP_T : CUBLAS_OP_N; - CHECK_CUBLAS(cublas_gemv( - handle, trans, m, n, alpha, A + offseta, lda, x + offsetx, incx, beta, y + offsety, incy)); -} - -template -void Cublas::trsv_v2(cublasFillMode_t uplo, - cublasOperation_t trans, - cublasDiagType_t diag, - int n, - const T* A, - int lda, - T* x, - int incx, - int offseta) -{ - cublasHandle_t handle = Cublas::get_handle(); - - CHECK_CUBLAS(cublas_trsv_v2(handle, uplo, trans, diag, n, A + offseta, lda, x, incx)); -} - -template -void Cublas::ger( - int m, int n, const T* alpha, const T* x, int incx, const T* y, int incy, T* A, int lda) -{ - cublasHandle_t handle = Cublas::get_handle(); - CHECK_CUBLAS(cublas_ger(handle, m, n, alpha, x, incx, y, incy, A, lda)); -} - -template -void Cublas::gemm(bool transa, - bool transb, - int m, - int n, - int k, - const T* alpha, - const T* A, - int lda, - const T* B, - int ldb, - const T* beta, - T* C, - int ldc) -{ - cublasHandle_t handle = Cublas::get_handle(); - cublasOperation_t cublasTransA = transa ? CUBLAS_OP_T : CUBLAS_OP_N; - cublasOperation_t cublasTransB = transb ? 
CUBLAS_OP_T : CUBLAS_OP_N; - CHECK_CUBLAS( - cublas_gemm(handle, cublasTransA, cublasTransB, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc)); -} - -template -void Cublas::geam(bool transa, - bool transb, - int m, - int n, - const T* alpha, - const T* A, - int lda, - const T* beta, - const T* B, - int ldb, - T* C, - int ldc) -{ - cublasHandle_t handle = Cublas::get_handle(); - cublasOperation_t cublasTransA = transa ? CUBLAS_OP_T : CUBLAS_OP_N; - cublasOperation_t cublasTransB = transb ? CUBLAS_OP_T : CUBLAS_OP_N; - CHECK_CUBLAS( - cublas_geam(handle, cublasTransA, cublasTransB, m, n, alpha, A, lda, beta, B, ldb, C, ldc)); -} - -template void Cublas::axpy(int n, float alpha, const float* x, int incx, float* y, int incy); -template void Cublas::axpy(int n, double alpha, const double* x, int incx, double* y, int incy); - -template void Cublas::copy(int n, const float* x, int incx, float* y, int incy); -template void Cublas::copy(int n, const double* x, int incx, double* y, int incy); - -template void Cublas::dot(int n, const float* x, int incx, const float* y, int incy, float* result); -template void Cublas::dot( - int n, const double* x, int incx, const double* y, int incy, double* result); - -template void Cublas::gemv(bool transposed, - int m, - int n, - const float* alpha, - const float* A, - int lda, - const float* x, - int incx, - const float* beta, - float* y, - int incy); -template void Cublas::gemv(bool transposed, - int m, - int n, - const double* alpha, - const double* A, - int lda, - const double* x, - int incx, - const double* beta, - double* y, - int incy); - -template void Cublas::ger(int m, - int n, - const float* alpha, - const float* x, - int incx, - const float* y, - int incy, - float* A, - int lda); -template void Cublas::ger(int m, - int n, - const double* alpha, - const double* x, - int incx, - const double* y, - int incy, - double* A, - int lda); - -template void Cublas::gemv_ext(bool transposed, - const int m, - const int n, - const float* alpha, - const float* A, - const int lda, - const float* x, - const int incx, - const float* beta, - float* y, - const int incy, - const int offsetx, - const int offsety, - const int offseta); -template void Cublas::gemv_ext(bool transposed, - const int m, - const int n, - const double* alpha, - const double* A, - const int lda, - const double* x, - const int incx, - const double* beta, - double* y, - const int incy, - const int offsetx, - const int offsety, - const int offseta); - -template void Cublas::trsv_v2(cublasFillMode_t uplo, - cublasOperation_t trans, - cublasDiagType_t diag, - int n, - const float* A, - int lda, - float* x, - int incx, - int offseta); -template void Cublas::trsv_v2(cublasFillMode_t uplo, - cublasOperation_t trans, - cublasDiagType_t diag, - int n, - const double* A, - int lda, - double* x, - int incx, - int offseta); - -template double Cublas::nrm2(int n, const double* x, int incx); -template float Cublas::nrm2(int n, const float* x, int incx); - -template void Cublas::scal(int n, float alpha, float* x, int incx); -template void Cublas::scal(int n, double alpha, double* x, int incx); - -template void Cublas::gemm(bool transa, - bool transb, - int m, - int n, - int k, - const float* alpha, - const float* A, - int lda, - const float* B, - int ldb, - const float* beta, - float* C, - int ldc); -template void Cublas::gemm(bool transa, - bool transb, - int m, - int n, - int k, - const double* alpha, - const double* A, - int lda, - const double* B, - int ldb, - const double* beta, - double* C, - int ldc); - 
-template void Cublas::geam(bool transa, - bool transb, - int m, - int n, - const float* alpha, - const float* A, - int lda, - const float* beta, - const float* B, - int ldb, - float* C, - int ldc); -template void Cublas::geam(bool transa, - bool transb, - int m, - int n, - const double* alpha, - const double* A, - int lda, - const double* beta, - const double* B, - int ldb, - double* C, - int ldc); - -} // end namespace nvgraph diff --git a/cpp/src/nvgraph/nvgraph_cusparse.cpp b/cpp/src/nvgraph/nvgraph_cusparse.cpp deleted file mode 100644 index 51a06968455..00000000000 --- a/cpp/src/nvgraph/nvgraph_cusparse.cpp +++ /dev/null @@ -1,263 +0,0 @@ -/* - * Copyright (c) 2019-2020, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -#include "include/nvgraph_cusparse.hxx" - -namespace nvgraph { -cusparseHandle_t Cusparse::m_handle = 0; - -namespace { -cusparseStatus_t cusparse_csrmv(cusparseHandle_t handle, - cusparseOperation_t trans, - int m, - int n, - int nnz, - const float* alpha, - const cusparseMatDescr_t descr, - const float* csrVal, - const int* csrRowPtr, - const int* csrColInd, - const float* x, - const float* beta, - float* y) -{ - return cusparseScsrmv( - handle, trans, m, n, nnz, alpha, descr, csrVal, csrRowPtr, csrColInd, x, beta, y); -} - -cusparseStatus_t cusparse_csrmv(cusparseHandle_t handle, - cusparseOperation_t trans, - int m, - int n, - int nnz, - const double* alpha, - const cusparseMatDescr_t descr, - const double* csrVal, - const int* csrRowPtr, - const int* csrColInd, - const double* x, - const double* beta, - double* y) -{ - return cusparseDcsrmv( - handle, trans, m, n, nnz, alpha, descr, csrVal, csrRowPtr, csrColInd, x, beta, y); -} - -cusparseStatus_t cusparse_csrmm(cusparseHandle_t handle, - cusparseOperation_t trans, - int m, - int n, - int k, - int nnz, - const float* alpha, - const cusparseMatDescr_t descr, - const float* csrVal, - const int* csrRowPtr, - const int* csrColInd, - const float* x, - const int ldx, - const float* beta, - float* y, - const int ldy) -{ - return cusparseScsrmm( - handle, trans, m, n, k, nnz, alpha, descr, csrVal, csrRowPtr, csrColInd, x, ldx, beta, y, ldy); -} - -cusparseStatus_t cusparse_csrmm(cusparseHandle_t handle, - cusparseOperation_t trans, - int m, - int n, - int k, - int nnz, - const double* alpha, - const cusparseMatDescr_t descr, - const double* csrVal, - const int* csrRowPtr, - const int* csrColInd, - const double* x, - const int ldx, - const double* beta, - double* y, - const int ldy) -{ - return cusparseDcsrmm( - handle, trans, m, n, k, nnz, alpha, descr, csrVal, csrRowPtr, csrColInd, x, ldx, beta, y, ldy); -} - -} // end anonymous namespace. 
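(Aside: the csrmv/csrmm overloads in the anonymous namespace above target the legacy cuSPARSE API; cusparseScsrmv and cusparseDcsrmv were deprecated and then removed in CUDA 11 in favor of the generic SpMV API. A hedged sketch of what a generic-API equivalent might look like is below — `spmv_example` is a hypothetical helper, only the float case is shown, error checking is elided, and the legacy CUSPARSE_MATRIX_TYPE_SYMMETRIC handling has no direct generic-API counterpart; `CUSPARSE_SPMV_ALG_DEFAULT` assumes a recent toolkit, older ones spell it `CUSPARSE_MV_ALG_DEFAULT`.)

#include <cuda_runtime.h>
#include <cusparse.h>

cusparseStatus_t spmv_example(cusparseHandle_t handle, int m, int n, int nnz,
                              const float* alpha, float* csrVal, int* csrRowPtr,
                              int* csrColInd, float* x, const float* beta, float* y)
{
  cusparseSpMatDescr_t matA;
  cusparseDnVecDescr_t vecX, vecY;

  // Describe the CSR matrix and the dense input/output vectors.
  cusparseCreateCsr(&matA, m, n, nnz, csrRowPtr, csrColInd, csrVal,
                    CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                    CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
  cusparseCreateDnVec(&vecX, n, x, CUDA_R_32F);
  cusparseCreateDnVec(&vecY, m, y, CUDA_R_32F);

  // Unlike the legacy API, the generic API needs a caller-allocated workspace.
  size_t bufferSize = 0;
  void* dBuffer = nullptr;
  cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, alpha, matA,
                          vecX, beta, vecY, CUDA_R_32F,
                          CUSPARSE_SPMV_ALG_DEFAULT, &bufferSize);
  cudaMalloc(&dBuffer, bufferSize);

  // y = alpha * A * x + beta * y
  cusparseStatus_t status =
    cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, alpha, matA, vecX,
                 beta, vecY, CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT, dBuffer);

  cudaFree(dBuffer);
  cusparseDestroyDnVec(vecX);
  cusparseDestroyDnVec(vecY);
  cusparseDestroySpMat(matA);
  return status;
}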
- -// Set pointer mode -void Cusparse::set_pointer_mode_device() -{ - cusparseHandle_t handle = Cusparse::get_handle(); - cusparseSetPointerMode(handle, CUSPARSE_POINTER_MODE_DEVICE); -} -void Cusparse::set_pointer_mode_host() -{ - cusparseHandle_t handle = Cusparse::get_handle(); - cusparseSetPointerMode(handle, CUSPARSE_POINTER_MODE_HOST); -} - -template -void Cusparse::csrmv(const bool transposed, - const bool sym, - const int m, - const int n, - const int nnz, - const ValueType_* alpha, - const ValueType_* csrVal, - const IndexType_* csrRowPtr, - const IndexType_* csrColInd, - const ValueType_* x, - const ValueType_* beta, - ValueType_* y) -{ - cusparseHandle_t handle = Cusparse::get_handle(); - cusparseOperation_t trans = - transposed ? CUSPARSE_OPERATION_TRANSPOSE : CUSPARSE_OPERATION_NON_TRANSPOSE; - cusparseMatDescr_t descr = 0; - CHECK_CUSPARSE(cusparseCreateMatDescr(&descr)); // we should move that somewhere else - if (sym) { - CHECK_CUSPARSE(cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_SYMMETRIC)); - } else { - CHECK_CUSPARSE(cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL)); - } - CHECK_CUSPARSE(cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO)); - CHECK_CUSPARSE(cusparse_csrmv( - handle, trans, m, n, nnz, alpha, descr, csrVal, csrRowPtr, csrColInd, x, beta, y)); - CHECK_CUSPARSE(cusparseDestroyMatDescr(descr)); // we should move that somewhere else -} - -template void Cusparse::csrmv(const bool transposed, - const bool sym, - const int m, - const int n, - const int nnz, - const double* alpha, - const double* csrVal, - const int* csrRowPtr, - const int* csrColInd, - const double* x, - const double* beta, - double* y); -template void Cusparse::csrmv(const bool transposed, - const bool sym, - const int m, - const int n, - const int nnz, - const float* alpha, - const float* csrVal, - const int* csrRowPtr, - const int* csrColInd, - const float* x, - const float* beta, - float* y); -/* -template void Cusparse::csrmv( const bool transposed, - const bool sym, - const double* alpha, - const ValuedCsrGraph& G, - const Vector& x, - const double* beta, - Vector& y - ); - - -template void Cusparse::csrmv( const bool transposed, - const bool sym, - const float* alpha, - const ValuedCsrGraph& G, - const Vector& x, - const float* beta, - Vector& y - ); -*/ - -template -void Cusparse::csrmm(const bool transposed, - const bool sym, - const int m, - const int n, - const int k, - const int nnz, - const ValueType_* alpha, - const ValueType_* csrVal, - const IndexType_* csrRowPtr, - const IndexType_* csrColInd, - const ValueType_* x, - const int ldx, - const ValueType_* beta, - ValueType_* y, - const int ldy) -{ - cusparseHandle_t handle = Cusparse::get_handle(); - cusparseOperation_t trans = - transposed ? 
CUSPARSE_OPERATION_TRANSPOSE : CUSPARSE_OPERATION_NON_TRANSPOSE; - cusparseMatDescr_t descr = 0; - CHECK_CUSPARSE(cusparseCreateMatDescr(&descr)); // we should move that somewhere else - if (sym) { - CHECK_CUSPARSE(cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_SYMMETRIC)); - } else { - CHECK_CUSPARSE(cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL)); - } - CHECK_CUSPARSE(cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO)); - CHECK_CUSPARSE(cusparse_csrmm( - handle, trans, m, n, k, nnz, alpha, descr, csrVal, csrRowPtr, csrColInd, x, ldx, beta, y, ldy)); - CHECK_CUSPARSE(cusparseDestroyMatDescr(descr)); // we should move that somewhere else -} - -template void Cusparse::csrmm(const bool transposed, - const bool sym, - const int m, - const int n, - const int k, - const int nnz, - const double* alpha, - const double* csrVal, - const int* csrRowPtr, - const int* csrColInd, - const double* x, - const int ldx, - const double* beta, - double* y, - const int ldy); - -template void Cusparse::csrmm(const bool transposed, - const bool sym, - const int m, - const int n, - const int k, - const int nnz, - const float* alpha, - const float* csrVal, - const int* csrRowPtr, - const int* csrColInd, - const float* x, - const int ldx, - const float* beta, - float* y, - const int ldy); - -// template -void Cusparse::csr2coo(const int n, const int nnz, const int* csrRowPtr, int* cooRowInd) -{ - cusparseHandle_t handle = Cusparse::get_handle(); - cusparseIndexBase_t idxBase = CUSPARSE_INDEX_BASE_ZERO; - CHECK_CUSPARSE(cusparseXcsr2coo(handle, csrRowPtr, nnz, n, cooRowInd, idxBase)); -} - -} // end namespace nvgraph diff --git a/cpp/src/nvgraph/nvgraph_lapack.cu b/cpp/src/nvgraph/nvgraph_lapack.cu deleted file mode 100644 index a3f1786a1cd..00000000000 --- a/cpp/src/nvgraph/nvgraph_lapack.cu +++ /dev/null @@ -1,792 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -#include "include/nvgraph_lapack.hxx" - -//#include -//#include - -//#define NVGRAPH_USE_LAPACK 1 - -namespace nvgraph { - -#define lapackCheckError(status) \ - { \ - if (status < 0) { \ - std::stringstream ss; \ - ss << "Lapack error: argument number " << -status << " had an illegal value."; \ - FatalError(ss.str(), NVGRAPH_ERR_UNKNOWN); \ - } else if (status > 0) \ - FatalError("Lapack error: internal error.", NVGRAPH_ERR_UNKNOWN); \ - } - -template -void Lapack::check_lapack_enabled() -{ -#ifndef NVGRAPH_USE_LAPACK - FatalError("Error: LAPACK not enabled.", NVGRAPH_ERR_UNKNOWN); -#endif -} - -typedef enum { - CUSOLVER_STATUS_SUCCESS = 0, - CUSOLVER_STATUS_NOT_INITIALIZED = 1, - CUSOLVER_STATUS_ALLOC_FAILED = 2, - CUSOLVER_STATUS_INVALID_VALUE = 3, - CUSOLVER_STATUS_ARCH_MISMATCH = 4, - CUSOLVER_STATUS_MAPPING_ERROR = 5, - CUSOLVER_STATUS_EXECUTION_FAILED = 6, - CUSOLVER_STATUS_INTERNAL_ERROR = 7, - CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED = 8, - CUSOLVER_STATUS_NOT_SUPPORTED = 9, - CUSOLVER_STATUS_ZERO_PIVOT = 10, - CUSOLVER_STATUS_INVALID_LICENSE = 11 -} cusolverStatus_t; - -typedef enum { CUBLAS_OP_N = 0, CUBLAS_OP_T = 1, CUBLAS_OP_C = 2 } cublasOperation_t; - -namespace { -// XGEMM -// extern "C" -// void sgemm_(const char *transa, const char *transb, -// const int *m, const int *n, const int *k, -// const float *alpha, const float *a, const int *lda, -// const float *b, const int *ldb, -// const float *beta, float *c, const int *ldc); -// extern "C" -// void dgemm_(const char *transa, const char *transb, -// const int *m, const int *n, const int *k, -// const double *alpha, const double *a, const int *lda, -// const double *b, const int *ldb, -// const double *beta, double *c, const int *ldc); - -extern "C" cusolverStatus_t cusolverDnSgemmHost(cublasOperation_t transa, - cublasOperation_t transb, - int m, - int n, - int k, - const float *alpha, - const float *A, - int lda, - const float *B, - int ldb, - const float *beta, - float *C, - int ldc); - -void lapack_gemm(const char transa, - const char transb, - int m, - int n, - int k, - float alpha, - const float *a, - int lda, - const float *b, - int ldb, - float beta, - float *c, - int ldc) -{ - cublasOperation_t cublas_transa = (transa == 'N') ? CUBLAS_OP_N : CUBLAS_OP_T; - cublasOperation_t cublas_transb = (transb == 'N') ? CUBLAS_OP_N : CUBLAS_OP_T; - cusolverDnSgemmHost( - cublas_transa, cublas_transb, m, n, k, &alpha, (float *)a, lda, (float *)b, ldb, &beta, c, ldc); -} - -extern "C" cusolverStatus_t cusolverDnDgemmHost(cublasOperation_t transa, - cublasOperation_t transb, - int m, - int n, - int k, - const double *alpha, - const double *A, - int lda, - const double *B, - int ldb, - const double *beta, - double *C, - int ldc); - -void lapack_gemm(const signed char transa, - const signed char transb, - int m, - int n, - int k, - double alpha, - const double *a, - int lda, - const double *b, - int ldb, - double beta, - double *c, - int ldc) -{ - cublasOperation_t cublas_transa = (transa == 'N') ? CUBLAS_OP_N : CUBLAS_OP_T; - cublasOperation_t cublas_transb = (transb == 'N') ? 
CUBLAS_OP_N : CUBLAS_OP_T; - cusolverDnDgemmHost(cublas_transa, - cublas_transb, - m, - n, - k, - &alpha, - (double *)a, - lda, - (double *)b, - ldb, - &beta, - c, - ldc); -} - -// XSTERF -// extern "C" -// void ssterf_(const int *n, float *d, float *e, int *info); -// -// extern "C" -// void dsterf_(const int *n, double *d, double *e, int *info); -// - -extern "C" cusolverStatus_t cusolverDnSsterfHost(int n, float *d, float *e, int *info); - -void lapack_sterf(int n, float *d, float *e, int *info) { cusolverDnSsterfHost(n, d, e, info); } - -extern "C" cusolverStatus_t cusolverDnDsterfHost(int n, double *d, double *e, int *info); - -void lapack_sterf(int n, double *d, double *e, int *info) { cusolverDnDsterfHost(n, d, e, info); } - -// XSTEQR -// extern "C" -// void ssteqr_(const char *compz, const int *n, float *d, float *e, -// float *z, const int *ldz, float *work, int * info); -// extern "C" -// void dsteqr_(const char *compz, const int *n, double *d, double *e, -// double *z, const int *ldz, double *work, int *info); - -extern "C" cusolverStatus_t cusolverDnSsteqrHost( - const signed char *compz, int n, float *d, float *e, float *z, int ldz, float *work, int *info); - -void lapack_steqr( - const signed char compz, int n, float *d, float *e, float *z, int ldz, float *work, int *info) -{ - cusolverDnSsteqrHost(&compz, n, d, e, z, ldz, work, info); -} - -extern "C" cusolverStatus_t cusolverDnDsteqrHost(const signed char *compz, - int n, - double *d, - double *e, - double *z, - int ldz, - double *work, - int *info); - -void lapack_steqr( - const signed char compz, int n, double *d, double *e, double *z, int ldz, double *work, int *info) -{ - cusolverDnDsteqrHost(&compz, n, d, e, z, ldz, work, info); -} - -#ifdef NVGRAPH_USE_LAPACK - -extern "C" void sgeqrf_( - int *m, int *n, float *a, int *lda, float *tau, float *work, int *lwork, int *info); -extern "C" void dgeqrf_( - int *m, int *n, double *a, int *lda, double *tau, double *work, int *lwork, int *info); -// extern "C" -// void cgeqrf_(int *m, int *n, std::complex *a, int *lda, std::complex *tau, -// std::complex *work, int *lwork, int *info); extern "C" void zgeqrf_(int *m, int *n, -// std::complex *a, int *lda, std::complex *tau, std::complex *work, int -// *lwork, int *info); - -void lapack_geqrf(int m, int n, float *a, int lda, float *tau, float *work, int *lwork, int *info) -{ - sgeqrf_(&m, &n, a, &lda, tau, work, lwork, info); -} -void lapack_geqrf( - int m, int n, double *a, int lda, double *tau, double *work, int *lwork, int *info) -{ - dgeqrf_(&m, &n, a, &lda, tau, work, lwork, info); -} -// void lapack_geqrf(int m, int n, std::complex *a, int lda, std::complex *tau, -// std::complex *work, int *lwork, int *info) -//{ -// cgeqrf_(&m, &n, a, &lda, tau, work, lwork, info); -//} -// void lapack_geqrf(int m, int n, std::complex *a, int lda, std::complex *tau, -// std::complex *work, int *lwork, int *info) -//{ -// zgeqrf_(&m, &n, a, &lda, tau, work, lwork, info); -//} - -extern "C" void sormqr_(char *side, - char *trans, - int *m, - int *n, - int *k, - float *a, - int *lda, - const float *tau, - float *c, - int *ldc, - float *work, - int *lwork, - int *info); -extern "C" void dormqr_(char *side, - char *trans, - int *m, - int *n, - int *k, - double *a, - int *lda, - const double *tau, - double *c, - int *ldc, - double *work, - int *lwork, - int *info); -// extern "C" -// void cunmqr_ (char* side, char* trans, int *m, int *n, int *k, std::complex *a, int *lda, -// const std::complex *tau, std::complex* c, int *ldc, std::complex 
*work, int -// *lwork, int *info); extern "C" void zunmqr_(char* side, char* trans, int *m, int *n, int *k, -// std::complex *a, int *lda, const std::complex *tau, std::complex* c, int -// *ldc, std::complex *work, int *lwork, int *info); - -void lapack_ormqr(char side, - char trans, - int m, - int n, - int k, - float *a, - int lda, - float *tau, - float *c, - int ldc, - float *work, - int *lwork, - int *info) -{ - sormqr_(&side, &trans, &m, &n, &k, a, &lda, tau, c, &ldc, work, lwork, info); -} -void lapack_ormqr(char side, - char trans, - int m, - int n, - int k, - double *a, - int lda, - double *tau, - double *c, - int ldc, - double *work, - int *lwork, - int *info) -{ - dormqr_(&side, &trans, &m, &n, &k, a, &lda, tau, c, &ldc, work, lwork, info); -} -// void lapack_unmqr(char side, char trans, int m, int n, int k, std::complex *a, int lda, -// std::complex *tau, std::complex* c, int ldc, std::complex *work, int *lwork, -// int *info) -//{ -// cunmqr_(&side, &trans, &m, &n, &k, a, &lda, tau, c, &ldc, work, lwork, info); -//} -// void lapack_unmqr(char side, char trans, int m, int n, int k, std::complex *a, int lda, -// std::complex *tau, std::complex* c, int ldc, std::complex *work, int -// *lwork, int *info) -//{ -// zunmqr_(&side, &trans, &m, &n, &k, a, &lda, tau, c, &ldc, work, lwork, info); -//} - -// extern "C" -// void sorgqr_ ( int* m, int* n, int* k, float* a, int* lda, const float* tau, float* work, int* -// lwork, int *info ); extern "C" void dorgqr_ ( int* m, int* n, int* k, double* a, int* lda, const -// double* tau, double* work, int* lwork, int *info ); -// -// void lapack_orgqr( int m, int n, int k, float* a, int lda, const float* tau, float* work, int -// *lwork, int *info) -// { -// sorgqr_(&m, &n, &k, a, &lda, tau, work, lwork, info); -// } -// void lapack_orgqr( int m, int n, int k, double* a, int lda, const double* tau, double* work, int* -// lwork, int *info ) -// { -// dorgqr_(&m, &n, &k, a, &lda, tau, work, lwork, info); -// } - -// int lapack_hseqr_dispatch(char *jobvl, char *jobvr, int* n, int*ilo, int*ihi, -// double *h, int* ldh, double *wr, double *wi, double *z, -// int*ldz, double *work, int *lwork, int *info) -//{ -// return dhseqr_(jobvl, jobvr, n, ilo, ihi, h, ldh, wr, wi, z, ldz, work, lwork, info); -//} -// -// int lapack_hseqr_dispatch(char *jobvl, char *jobvr, int* n, int*ilo, int*ihi, -// float *h, int* ldh, float *wr, float *wi, float *z, -// int*ldz, float *work, int *lwork, int *info) -//{ -// return shseqr_(jobvl, jobvr, n, ilo, ihi, h, ldh, wr, wi, z, ldz, work, lwork, info); -//} - -// XGEEV -extern "C" int dgeev_(char *jobvl, - char *jobvr, - int *n, - double *a, - int *lda, - double *wr, - double *wi, - double *vl, - int *ldvl, - double *vr, - int *ldvr, - double *work, - int *lwork, - int *info); - -extern "C" int sgeev_(char *jobvl, - char *jobvr, - int *n, - float *a, - int *lda, - float *wr, - float *wi, - float *vl, - int *ldvl, - float *vr, - int *ldvr, - float *work, - int *lwork, - int *info); - -// extern "C" -// int dhseqr_(char *jobvl, char *jobvr, int* n, int*ilo, int*ihi, -// double *h, int* ldh, double *wr, double *wi, double *z, -// int*ldz, double *work, int *lwork, int *info); -// extern "C" -// int shseqr_(char *jobvl, char *jobvr, int* n, int*ilo, int*ihi, -// float *h, int* ldh, float *wr, float *wi, float *z, -// int*ldz, float *work, int *lwork, int *info); -// -int lapack_geev_dispatch(char *jobvl, - char *jobvr, - int *n, - double *a, - int *lda, - double *wr, - double *wi, - double *vl, - int *ldvl, - double *vr, - 
int *ldvr, - double *work, - int *lwork, - int *info) -{ - return dgeev_(jobvl, jobvr, n, a, lda, wr, wi, vl, ldvl, vr, ldvr, work, lwork, info); -} - -int lapack_geev_dispatch(char *jobvl, - char *jobvr, - int *n, - float *a, - int *lda, - float *wr, - float *wi, - float *vl, - int *ldvl, - float *vr, - int *ldvr, - float *work, - int *lwork, - int *info) -{ - return sgeev_(jobvl, jobvr, n, a, lda, wr, wi, vl, ldvl, vr, ldvr, work, lwork, info); -} - -// real eigenvalues -template -void lapack_geev(T *A, T *eigenvalues, int dim, int lda) -{ - char job = 'N'; - std::vector WI(dim); - int ldv = 1; - T *vl = 0; - int work_size = 6 * dim; - std::vector work(work_size); - int info; - lapack_geev_dispatch(&job, - &job, - &dim, - A, - &lda, - eigenvalues, - WI.data(), - vl, - &ldv, - vl, - &ldv, - work.data(), - &work_size, - &info); - lapackCheckError(info); -} -// real eigenpairs -template -void lapack_geev(T *A, T *eigenvalues, T *eigenvectors, int dim, int lda, int ldvr) -{ - char jobvl = 'N'; - char jobvr = 'V'; - std::vector WI(dim); - int work_size = 6 * dim; - T *vl = 0; - int ldvl = 1; - std::vector work(work_size); - int info; - lapack_geev_dispatch(&jobvl, - &jobvr, - &dim, - A, - &lda, - eigenvalues, - WI.data(), - vl, - &ldvl, - eigenvectors, - &ldvr, - work.data(), - &work_size, - &info); - lapackCheckError(info); -} -// complex eigenpairs -template -void lapack_geev(T *A, - T *eigenvalues_r, - T *eigenvalues_i, - T *eigenvectors_r, - T *eigenvectors_i, - int dim, - int lda, - int ldvr) -{ - char jobvl = 'N'; - char jobvr = 'V'; - int work_size = 8 * dim; - int ldvl = 1; - std::vector work(work_size); - int info; - lapack_geev_dispatch(&jobvl, - &jobvr, - &dim, - A, - &lda, - eigenvalues_r, - eigenvalues_i, - 0, - &ldvl, - eigenvectors_r, - &ldvr, - work.data(), - &work_size, - &info); - lapackCheckError(info); -} - -// template -// void lapack_hseqr(T* Q, T* H, T* eigenvalues, int dim, int ldh, int ldq) -//{ -// char job = 'S'; // S compute eigenvalues and the Schur form T. On entry, the upper Hessenberg -// matrix H. -// // On exit H contains the upper quasi-triangular matrix T from the Schur -// decomposition -// char jobvr = 'V'; //Take Q on entry, and the product Q*Z is returned. -// //ILO and IHI are normally set by a previous call to DGEBAL, Otherwise ILO and IHI should be -// set to 1 and N int ilo = 1; int ihi = dim; T* WI = new T[dim]; int ldv = 1; T* vl = 0; int -// work_size = 11 * dim; //LWORK as large as 11*N may be required for optimal performance. It is -// CPU memory and the matrix is assumed to be small T* work = new T[work_size]; int info; -// lapack_hseqr_dispatch(&job, &jobvr, &dim, &ilo, &ihi, H, &ldh, eigenvalues, WI, Q, &ldq, work, -// &work_size, &info); lapackCheckError(info); delete [] WI; delete [] work; -//} - -#endif - -} // end anonymous namespace - -template -void Lapack::gemm(bool transa, - bool transb, - int m, - int n, - int k, - T alpha, - const T *A, - int lda, - const T *B, - int ldb, - T beta, - T *C, - int ldc) -{ - // check_lapack_enabled(); - //#ifdef NVGRAPH_USE_LAPACK - const char transA_char = transa ? 'T' : 'N'; - const char transB_char = transb ? 
'T' : 'N'; - lapack_gemm(transA_char, transB_char, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc); - //#endif -} - -template -void Lapack::sterf(int n, T *d, T *e) -{ - // check_lapack_enabled(); - //#ifdef NVGRAPH_USE_LAPACK - int info; - lapack_sterf(n, d, e, &info); - lapackCheckError(info); - //#endif -} - -template -void Lapack::steqr(char compz, int n, T *d, T *e, T *z, int ldz, T *work) -{ - // check_lapack_enabled(); - //#ifdef NVGRAPH_USE_LAPACK - int info; - lapack_steqr(compz, n, d, e, z, ldz, work, &info); - lapackCheckError(info); - //#endif -} - -template -void Lapack::geqrf(int m, int n, T *a, int lda, T *tau, T *work, int *lwork) -{ - check_lapack_enabled(); -#ifdef NVGRAPH_USE_LAPACK - int info; - lapack_geqrf(m, n, a, lda, tau, work, lwork, &info); - lapackCheckError(info); -#endif -} -template -void Lapack::ormqr(bool right_side, - bool transq, - int m, - int n, - int k, - T *a, - int lda, - T *tau, - T *c, - int ldc, - T *work, - int *lwork) -{ - check_lapack_enabled(); -#ifdef NVGRAPH_USE_LAPACK - char side = right_side ? 'R' : 'L'; - char trans = transq ? 'T' : 'N'; - int info; - lapack_ormqr(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, &info); - lapackCheckError(info); -#endif -} - -// template -// void Lapack< T >::unmqr(bool right_side, bool transq, int m, int n, int k, T *a, int lda, T *tau, -// T *c, int ldc, T *work, int *lwork) -//{ -// check_lapack_enabled(); -// #ifdef NVGRAPH_USE_LAPACK -// char side = right_side ? 'R' : 'L'; -// char trans = transq ? 'T' : 'N'; -// int info; -// lapack_unmqr(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, &info); -// lapackCheckError(info); -// #endif -//} - -// template -// void Lapack< T >::orgqr( int m, int n, int k, T* a, int lda, const T* tau, T* work, int* lwork) -//{ -// check_lapack_enabled(); -// #ifdef NVGRAPH_USE_LAPACK -// int info; -// lapack_orgqr(m, n, k, a, lda, tau, work, lwork, &info); -// lapackCheckError(info); -// #endif -//} -// template -// void Lapack< T >::qrf(int n, int k, T *H, T *C, T *Q, T *R) -//{ -// check_lapack_enabled(); -// #ifdef NVGRAPH_USE_LAPACK -// // int m = n, k = n, lda=n, lwork=2*n, info; -// // lapack_geqrf(m, n, H, lda, C, work, lwork, &info); -// // lapackCheckError(info); -// // lapack_ormqr(m, n, k, H, lda, tau, c, ldc, work, lwork, &info); -// // lapackCheckError(info); -// #endif -//} - -// real eigenvalues -template -void Lapack::geev(T *A, T *eigenvalues, int dim, int lda) -{ - check_lapack_enabled(); -#ifdef NVGRAPH_USE_LAPACK - lapack_geev(A, eigenvalues, dim, lda); -#endif -} -// real eigenpairs -template -void Lapack::geev(T *A, T *eigenvalues, T *eigenvectors, int dim, int lda, int ldvr) -{ - check_lapack_enabled(); -#ifdef NVGRAPH_USE_LAPACK - lapack_geev(A, eigenvalues, eigenvectors, dim, lda, ldvr); -#endif -} -// complex eigenpairs -template -void Lapack::geev(T *A, - T *eigenvalues_r, - T *eigenvalues_i, - T *eigenvectors_r, - T *eigenvectors_i, - int dim, - int lda, - int ldvr) -{ - check_lapack_enabled(); -#ifdef NVGRAPH_USE_LAPACK - lapack_geev(A, eigenvalues_r, eigenvalues_i, eigenvectors_r, eigenvectors_i, dim, lda, ldvr); -#endif -} - -// template -// void Lapack< T >::hseqr(T* Q, T* H, T* eigenvalues,T* eigenvectors, int dim, int ldh, int ldq) -//{ -// check_lapack_enabled(); -//#ifdef NVGRAPH_USE_LAPACK -// lapack_hseqr(Q, H, eigenvalues, dim, ldh, ldq); -//#endif -//} - -// Explicit instantiation -template void Lapack::check_lapack_enabled(); -template void Lapack::gemm(bool transa, - bool transb, - int m, - int n, - int k, - float 
alpha, - const float *A, - int lda, - const float *B, - int ldb, - float beta, - float *C, - int ldc); -template void Lapack::sterf(int n, float *d, float *e); -template void Lapack::geev( - float *A, float *eigenvalues, float *eigenvectors, int dim, int lda, int ldvr); -template void Lapack::geev(float *A, - float *eigenvalues_r, - float *eigenvalues_i, - float *eigenvectors_r, - float *eigenvectors_i, - int dim, - int lda, - int ldvr); -// template void Lapack::hseqr(float* Q, float* H, float* eigenvalues, float* eigenvectors, -// int dim, int ldh, int ldq); -template void Lapack::steqr( - char compz, int n, float *d, float *e, float *z, int ldz, float *work); -template void Lapack::geqrf( - int m, int n, float *a, int lda, float *tau, float *work, int *lwork); -template void Lapack::ormqr(bool right_side, - bool transq, - int m, - int n, - int k, - float *a, - int lda, - float *tau, - float *c, - int ldc, - float *work, - int *lwork); -// template void Lapack::orgqr(int m, int n, int k, float* a, int lda, const float* tau, -// float* work, int* lwork); - -template void Lapack::check_lapack_enabled(); -template void Lapack::gemm(bool transa, - bool transb, - int m, - int n, - int k, - double alpha, - const double *A, - int lda, - const double *B, - int ldb, - double beta, - double *C, - int ldc); -template void Lapack::sterf(int n, double *d, double *e); -template void Lapack::geev( - double *A, double *eigenvalues, double *eigenvectors, int dim, int lda, int ldvr); -template void Lapack::geev(double *A, - double *eigenvalues_r, - double *eigenvalues_i, - double *eigenvectors_r, - double *eigenvectors_i, - int dim, - int lda, - int ldvr); -// template void Lapack::hseqr(double* Q, double* H, double* eigenvalues, double* -// eigenvectors, int dim, int ldh, int ldq); -template void Lapack::steqr( - char compz, int n, double *d, double *e, double *z, int ldz, double *work); -template void Lapack::geqrf( - int m, int n, double *a, int lda, double *tau, double *work, int *lwork); -template void Lapack::ormqr(bool right_side, - bool transq, - int m, - int n, - int k, - double *a, - int lda, - double *tau, - double *c, - int ldc, - double *work, - int *lwork); -// template void Lapack::orgqr(int m, int n, int k, double* a, int lda, const double* tau, -// double* work, int* lwork); - -// template void Lapack >::geqrf(int m, int n, std::complex *a, int lda, -// std::complex *tau, std::complex *work, int *lwork); template void -// Lapack >::geqrf(int m, int n, std::complex *a, int lda, -// std::complex *tau, std::complex *work, int *lwork); template void -// Lapack >::unmqr(bool right_side, bool transq, int m, int n, int k, -// std::complex *a, int lda, std::complex *tau, std::complex *c, int ldc, -// std::complex *work, int *lwork); template void Lapack >::unmqr(bool -// right_side, bool transq, int m, int n, int k, std::complex *a, int lda, -// std::complex *tau, std::complex *c, int ldc, std::complex *work, int -// *lwork); - -} // end namespace nvgraph diff --git a/cpp/src/nvgraph/nvgraph_vector_kernels.cu b/cpp/src/nvgraph/nvgraph_vector_kernels.cu deleted file mode 100644 index a2d8234f9e6..00000000000 --- a/cpp/src/nvgraph/nvgraph_vector_kernels.cu +++ /dev/null @@ -1,200 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. 
- * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -#include -#include -#include -#include "include/nvgraph_error.hxx" -#include "include/nvgraph_vector_kernels.hxx" - -#include "include/debug_macros.h" - -namespace nvgraph { - -void check_size(size_t sz) -{ - if (sz > INT_MAX) FatalError("Vector larger than INT_MAX", NVGRAPH_ERR_BAD_PARAMETERS); -} -template -void nrm1_raw_vec(ValueType_* vec, size_t n, ValueType_* res, cudaStream_t stream) -{ - thrust::device_ptr dev_ptr(vec); - *res = thrust::reduce(dev_ptr, dev_ptr + n); - cudaCheckError(); -} - -template -void fill_raw_vec(ValueType_* vec, size_t n, ValueType_ value, cudaStream_t stream) -{ - thrust::device_ptr dev_ptr(vec); - thrust::fill(dev_ptr, dev_ptr + n, value); - cudaCheckError(); -} - -template -void dump_raw_vec(ValueType_* vec, size_t n, int offset, cudaStream_t stream) -{ -#ifdef DEBUG - thrust::device_ptr dev_ptr(vec); - COUT().precision(15); - COUT() << "sample size = " << n << ", offset = " << offset << std::endl; - thrust::copy( - dev_ptr + offset, dev_ptr + offset + n, std::ostream_iterator(COUT(), " ")); - cudaCheckError(); - COUT() << std::endl; -#endif -} - -template -__global__ void flag_zeroes_kernel(int num_vertices, ValueType_* vec, int* flags) -{ - int tidx = blockDim.x * blockIdx.x + threadIdx.x; - for (int r = tidx; r < num_vertices; r += blockDim.x * gridDim.x) { - if (vec[r] != 0.0) - flags[r] = 1; // NOTE 2 : alpha*0 + (1-alpha)*1 = (1-alpha) - else - flags[r] = 0; - } -} -template -__global__ void dmv0_kernel(const ValueType_* __restrict__ D, - const ValueType_* __restrict__ x, - ValueType_* __restrict__ y, - int n) -{ - // y=D*x - int tidx = blockIdx.x * blockDim.x + threadIdx.x; - for (int i = tidx; i < n; i += blockDim.x * gridDim.x) y[i] = D[i] * x[i]; -} -template -__global__ void dmv1_kernel(const ValueType_* __restrict__ D, - const ValueType_* __restrict__ x, - ValueType_* __restrict__ y, - int n) -{ - // y+=D*x - int tidx = blockIdx.x * blockDim.x + threadIdx.x; - for (int i = tidx; i < n; i += blockDim.x * gridDim.x) y[i] += D[i] * x[i]; -} -template -void copy_vec(ValueType_* vec1, size_t n, ValueType_* res, cudaStream_t stream) -{ - thrust::device_ptr dev_ptr(vec1); - thrust::device_ptr res_ptr(res); -#ifdef DEBUG - // COUT() << "copy "<< n << " elements" << std::endl; -#endif - thrust::copy_n(dev_ptr, n, res_ptr); - cudaCheckError(); - // dump_raw_vec (res, n, 0); -} - -template -void flag_zeros_raw_vec(size_t num_vertices, ValueType_* vec, int* flags, cudaStream_t stream) -{ - int items_per_thread = 4; - int num_threads = 128; - int max_grid_size = 4096; - check_size(num_vertices); - int n = static_cast(num_vertices); - int num_blocks = std::min(max_grid_size, (n / (items_per_thread * num_threads)) + 1); - flag_zeroes_kernel<<>>(num_vertices, vec, flags); - cudaCheckError(); -} - -template -void dmv(size_t num_vertices, - ValueType_ alpha, - ValueType_* D, - ValueType_* x, - ValueType_ beta, - ValueType_* y, - cudaStream_t stream) -{ - int items_per_thread = 4; - int num_threads = 128; - int max_grid_size = 4096; - check_size(num_vertices); - int n = static_cast(num_vertices); - int num_blocks = std::min(max_grid_size, 
(n / (items_per_thread * num_threads)) + 1); - if (alpha == 1.0 && beta == 0.0) - dmv0_kernel<<>>(D, x, y, n); - else if (alpha == 1.0 && beta == 1.0) - dmv1_kernel<<>>(D, x, y, n); - else - FatalError("Not implemented case of y = D*x", NVGRAPH_ERR_BAD_PARAMETERS); - - cudaCheckError(); -} - -template -void set_connectivity(size_t n, - IndexType_ root, - ValueType_ self_loop_val, - ValueType_ unreachable_val, - ValueType_* res, - cudaStream_t stream) -{ - fill_raw_vec(res, n, unreachable_val); - cudaMemcpy(&res[root], &self_loop_val, sizeof(self_loop_val), cudaMemcpyHostToDevice); - cudaCheckError(); -} - -template void nrm1_raw_vec(float* vec, size_t n, float* res, cudaStream_t stream); -template void nrm1_raw_vec(double* vec, size_t n, double* res, cudaStream_t stream); - -template void dmv( - size_t num_vertices, float alpha, float* D, float* x, float beta, float* y, cudaStream_t stream); -template void dmv(size_t num_vertices, - double alpha, - double* D, - double* x, - double beta, - double* y, - cudaStream_t stream); - -template void set_connectivity( - size_t n, int root, float self_loop_val, float unreachable_val, float* res, cudaStream_t stream); -template void set_connectivity(size_t n, - int root, - double self_loop_val, - double unreachable_val, - double* res, - cudaStream_t stream); - -template void flag_zeros_raw_vec(size_t num_vertices, - float* vec, - int* flags, - cudaStream_t stream); -template void flag_zeros_raw_vec(size_t num_vertices, - double* vec, - int* flags, - cudaStream_t stream); - -template void fill_raw_vec(float* vec, size_t n, float value, cudaStream_t stream); -template void fill_raw_vec(double* vec, size_t n, double value, cudaStream_t stream); -template void fill_raw_vec(int* vec, size_t n, int value, cudaStream_t stream); -template void fill_raw_vec(char* vec, size_t n, char value, cudaStream_t stream); - -template void copy_vec(float* vec1, size_t n, float* res, cudaStream_t stream); -template void copy_vec(double* vec1, size_t n, double* res, cudaStream_t stream); -template void copy_vec(int* vec1, size_t n, int* res, cudaStream_t stream); -template void copy_vec(char* vec1, size_t n, char* res, cudaStream_t stream); - -template void dump_raw_vec(float* vec, size_t n, int off, cudaStream_t stream); -template void dump_raw_vec(double* vec, size_t n, int off, cudaStream_t stream); -template void dump_raw_vec(int* vec, size_t n, int off, cudaStream_t stream); -template void dump_raw_vec(char* vec, size_t n, int off, cudaStream_t stream); -} // end namespace nvgraph diff --git a/cpp/src/nvgraph/partition.cu b/cpp/src/nvgraph/partition.cu deleted file mode 100644 index e4b9f507908..00000000000 --- a/cpp/src/nvgraph/partition.cu +++ /dev/null @@ -1,424 +0,0 @@ -/* - * Copyright (c) 2019-2020, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -#include "include/partition.hxx" - -#include -#include - -#include -#include -#include -#include -#include - -#include -#include -#include -#include -#include -#include -#include -#include - -namespace nvgraph { - -// ========================================================= -// Useful macros -// ========================================================= - -// Get index of matrix entry -#define IDX(i, j, lda) ((i) + (j) * (lda)) - -template -static __global__ void scale_obs_kernel(IndexType_ m, IndexType_ n, ValueType_ *obs) -{ - IndexType_ i, j, k, index, mm; - ValueType_ alpha, v, last; - bool valid; - // ASSUMPTION: kernel is launched with either 2, 4, 8, 16 or 32 threads in x-dimension - - // compute alpha - mm = (((m + blockDim.x - 1) / blockDim.x) * blockDim.x); // m in multiple of blockDim.x - alpha = 0.0; - // printf("[%d,%d,%d,%d] n=%d, li=%d, mn=%d \n",threadIdx.x,threadIdx.y,blockIdx.x,blockIdx.y, n, - // li, mn); - for (j = threadIdx.y + blockIdx.y * blockDim.y; j < n; j += blockDim.y * gridDim.y) { - for (i = threadIdx.x; i < mm; i += blockDim.x) { - // check if the thread is valid - valid = i < m; - - // get the value of the last thread - last = utils::shfl(alpha, blockDim.x - 1, blockDim.x); - - // if you are valid read the value from memory, otherwise set your value to 0 - alpha = (valid) ? obs[i + j * m] : 0.0; - alpha = alpha * alpha; - - // do prefix sum (of size warpSize=blockDim.x =< 32) - for (k = 1; k < blockDim.x; k *= 2) { - v = utils::shfl_up(alpha, k, blockDim.x); - if (threadIdx.x >= k) alpha += v; - } - // shift by last - alpha += last; - } - } - - // scale by alpha - alpha = utils::shfl(alpha, blockDim.x - 1, blockDim.x); - alpha = std::sqrt(alpha); - for (j = threadIdx.y + blockIdx.y * blockDim.y; j < n; j += blockDim.y * gridDim.y) { - for (i = threadIdx.x; i < m; i += blockDim.x) { // blockDim.x=32 - index = i + j * m; - obs[index] = obs[index] / alpha; - } - } -} - -template -IndexType_ next_pow2(IndexType_ n) -{ - IndexType_ v; - // Reference: - // http://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2Float - v = n - 1; - v |= v >> 1; - v |= v >> 2; - v |= v >> 4; - v |= v >> 8; - v |= v >> 16; - return v + 1; -} - -template -cudaError_t scale_obs(IndexType_ m, IndexType_ n, ValueType_ *obs) -{ - IndexType_ p2m; - dim3 nthreads, nblocks; - - // find next power of 2 - p2m = next_pow2(m); - // setup launch configuration - nthreads.x = max(2, min(p2m, 32)); - nthreads.y = 256 / nthreads.x; - nthreads.z = 1; - nblocks.x = 1; - nblocks.y = (n + nthreads.y - 1) / nthreads.y; - nblocks.z = 1; - // printf("m=%d(%d),n=%d,obs=%p, - // nthreads=(%d,%d,%d),nblocks=(%d,%d,%d)\n",m,p2m,n,obs,nthreads.x,nthreads.y,nthreads.z,nblocks.x,nblocks.y,nblocks.z); - - // launch scaling kernel (scale each column of obs by its norm) - scale_obs_kernel<<>>(m, n, obs); - cudaCheckError(); - - return cudaSuccess; -} - -// ========================================================= -// Spectral partitioner -// ========================================================= - -/// Compute spectral graph partition -/** Compute partition for a weighted undirected graph. This - * partition attempts to minimize the cost function: - * Cost = \sum_i (Edges cut by ith partition)/(Vertices in ith partition) - * - * @param G Weighted graph in CSR format - * @param nParts Number of partitions. - * @param nEigVecs Number of eigenvectors to compute. - * @param maxIter_lanczos Maximum number of Lanczos iterations. 
- * @param restartIter_lanczos Maximum size of Lanczos system before - * implicit restart. - * @param tol_lanczos Convergence tolerance for Lanczos method. - * @param maxIter_kmeans Maximum number of k-means iterations. - * @param tol_kmeans Convergence tolerance for k-means algorithm. - * @param parts (Output, device memory, n entries) Partition - * assignments. - * @param iters_lanczos On exit, number of Lanczos iterations - * performed. - * @param iters_kmeans On exit, number of k-means iterations - * performed. - * @return NVGRAPH error flag. - */ -template -NVGRAPH_ERROR partition( - cugraph::experimental::GraphCSRView const &graph, - vertex_t nParts, - vertex_t nEigVecs, - int maxIter_lanczos, - int restartIter_lanczos, - weight_t tol_lanczos, - int maxIter_kmeans, - weight_t tol_kmeans, - vertex_t *__restrict__ parts, - weight_t *eigVals, - weight_t *eigVecs) -{ - cudaStream_t stream = 0; - - const weight_t zero{0.0}; - const weight_t one{1.0}; - - int iters_lanczos; - int iters_kmeans; - - edge_t i; - edge_t n = graph.number_of_vertices; - - // k-means residual - weight_t residual_kmeans; - - // ------------------------------------------------------- - // Spectral partitioner - // ------------------------------------------------------- - - // Compute eigenvectors of Laplacian - - // Initialize Laplacian - CsrMatrix A(false, - false, - graph.number_of_vertices, - graph.number_of_vertices, - graph.number_of_edges, - 0, - graph.edge_data, - graph.offsets, - graph.indices); - LaplacianMatrix L(A); - - // Compute smallest eigenvalues and eigenvectors - CHECK_NVGRAPH(computeSmallestEigenvectors(L, - nEigVecs, - maxIter_lanczos, - restartIter_lanczos, - tol_lanczos, - false, - iters_lanczos, - eigVals, - eigVecs)); - - // Whiten eigenvector matrix - for (i = 0; i < nEigVecs; ++i) { - weight_t mean, std; - - mean = thrust::reduce(thrust::device_pointer_cast(eigVecs + IDX(0, i, n)), - thrust::device_pointer_cast(eigVecs + IDX(0, i + 1, n))); - cudaCheckError(); - mean /= n; - thrust::transform(thrust::device_pointer_cast(eigVecs + IDX(0, i, n)), - thrust::device_pointer_cast(eigVecs + IDX(0, i + 1, n)), - thrust::make_constant_iterator(mean), - thrust::device_pointer_cast(eigVecs + IDX(0, i, n)), - thrust::minus()); - cudaCheckError(); - std = Cublas::nrm2(n, eigVecs + IDX(0, i, n), 1) / std::sqrt(static_cast(n)); - thrust::transform(thrust::device_pointer_cast(eigVecs + IDX(0, i, n)), - thrust::device_pointer_cast(eigVecs + IDX(0, i + 1, n)), - thrust::make_constant_iterator(std), - thrust::device_pointer_cast(eigVecs + IDX(0, i, n)), - thrust::divides()); - cudaCheckError(); - } - - // Transpose eigenvector matrix - // TODO: in-place transpose - { - Vector work(nEigVecs * n, stream); - Cublas::set_pointer_mode_host(); - Cublas::geam(true, - false, - nEigVecs, - n, - &one, - eigVecs, - n, - &zero, - (weight_t *)NULL, - nEigVecs, - work.raw(), - nEigVecs); - CHECK_CUDA(cudaMemcpyAsync( - eigVecs, work.raw(), nEigVecs * n * sizeof(weight_t), cudaMemcpyDeviceToDevice)); - } - - // Clean up - - // eigVecs.dump(0, nEigVecs*n); - // Find partition with k-means clustering - CHECK_NVGRAPH(kmeans(n, - nEigVecs, - nParts, - tol_kmeans, - maxIter_kmeans, - eigVecs, - parts, - residual_kmeans, - iters_kmeans)); - - return NVGRAPH_OK; -} - -// ========================================================= -// Analysis of graph partition -// ========================================================= - -namespace { -/// Functor to generate indicator vectors -/** For use in Thrust transform - */ -template 
-struct equal_to_i_op { - const IndexType_ i; - - public: - equal_to_i_op(IndexType_ _i) : i(_i) {} - template - __host__ __device__ void operator()(Tuple_ t) - { - thrust::get<1>(t) = (thrust::get<0>(t) == i) ? (ValueType_)1.0 : (ValueType_)0.0; - } -}; -} // namespace - -/// Compute cost function for partition -/** This function determines the edges cut by a partition and a cost - * function: - * Cost = \sum_i (Edges cut by ith partition)/(Vertices in ith partition) - * Graph is assumed to be weighted and undirected. - * - * @param G Weighted graph in CSR format - * @param nParts Number of partitions. - * @param parts (Input, device memory, n entries) Partition - * assignments. - * @param edgeCut On exit, weight of edges cut by partition. - * @param cost On exit, partition cost function. - * @return NVGRAPH error flag. - */ -template -NVGRAPH_ERROR analyzePartition( - cugraph::experimental::GraphCSRView const &graph, - vertex_t nParts, - const vertex_t *__restrict__ parts, - weight_t &edgeCut, - weight_t &cost) -{ - cudaStream_t stream = 0; - - edge_t i; - edge_t n = graph.number_of_vertices; - - weight_t partEdgesCut, partSize; - - // Device memory - Vector part_i(n, stream); - Vector Lx(n, stream); - - // Initialize cuBLAS - Cublas::set_pointer_mode_host(); - - // Initialize Laplacian - CsrMatrix A(false, - false, - graph.number_of_vertices, - graph.number_of_vertices, - graph.number_of_edges, - 0, - graph.edge_data, - graph.offsets, - graph.indices); - LaplacianMatrix L(A); - - // Initialize output - cost = 0; - edgeCut = 0; - - // Iterate through partitions - for (i = 0; i < nParts; ++i) { - // Construct indicator vector for ith partition - thrust::for_each( - thrust::make_zip_iterator(thrust::make_tuple(thrust::device_pointer_cast(parts), - thrust::device_pointer_cast(part_i.raw()))), - thrust::make_zip_iterator(thrust::make_tuple(thrust::device_pointer_cast(parts + n), - thrust::device_pointer_cast(part_i.raw() + n))), - equal_to_i_op(i)); - cudaCheckError(); - - // Compute size of ith partition - Cublas::dot(n, part_i.raw(), 1, part_i.raw(), 1, &partSize); - partSize = round(partSize); - if (partSize < 0.5) { - WARNING("empty partition"); - continue; - } - - // Compute number of edges cut by ith partition - L.mv(1, part_i.raw(), 0, Lx.raw()); - Cublas::dot(n, Lx.raw(), 1, part_i.raw(), 1, &partEdgesCut); - - // Record results - cost += partEdgesCut / partSize; - edgeCut += partEdgesCut / 2; - } - - // Clean up and return - return NVGRAPH_OK; -} - -// ========================================================= -// Explicit instantiation -// ========================================================= -template NVGRAPH_ERROR partition( - cugraph::experimental::GraphCSRView const &graph, - int nParts, - int nEigVecs, - int maxIter_lanczos, - int restartIter_lanczos, - float tol_lanczos, - int maxIter_kmeans, - float tol_kmeans, - int *__restrict__ parts, - float *eigVals, - float *eigVecs); - -template NVGRAPH_ERROR partition( - cugraph::experimental::GraphCSRView const &graph, - int nParts, - int nEigVecs, - int maxIter_lanczos, - int restartIter_lanczos, - double tol_lanczos, - int maxIter_kmeans, - double tol_kmeans, - int *__restrict__ parts, - double *eigVals, - double *eigVecs); - -template NVGRAPH_ERROR analyzePartition( - cugraph::experimental::GraphCSRView const &graph, - int nParts, - const int *__restrict__ parts, - float &edgeCut, - float &cost); -template NVGRAPH_ERROR analyzePartition( - cugraph::experimental::GraphCSRView const &graph, - int nParts, - const int 
*__restrict__ parts, - double &edgeCut, - double &cost); - -} // namespace nvgraph diff --git a/cpp/src/nvgraph/spectral_matrix.cu b/cpp/src/nvgraph/spectral_matrix.cu deleted file mode 100644 index 66c2160741e..00000000000 --- a/cpp/src/nvgraph/spectral_matrix.cu +++ /dev/null @@ -1,765 +0,0 @@ -/* - * Copyright (c) 2019-2020, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -//#ifdef NVGRAPH_PARTITION -//#ifdef DEBUG - -#include "include/spectral_matrix.hxx" - -#include -#include - -#include "include/debug_macros.h" -#include "include/nvgraph_cublas.hxx" -#include "include/nvgraph_cusparse.hxx" -#include "include/nvgraph_error.hxx" -#include "include/nvgraph_vector.hxx" - -// ========================================================= -// Useful macros -// ========================================================= - -// CUDA block size -#define BLOCK_SIZE 1024 - -// Get index of matrix entry -#define IDX(i, j, lda) ((i) + (j) * (lda)) - -namespace nvgraph { - -// ============================================= -// CUDA kernels -// ============================================= - -namespace { - -/// Apply diagonal matrix to vector -template -static __global__ void diagmv(IndexType_ n, - ValueType_ alpha, - const ValueType_ *__restrict__ D, - const ValueType_ *__restrict__ x, - ValueType_ *__restrict__ y) -{ - IndexType_ i = threadIdx.x + blockIdx.x * blockDim.x; - while (i < n) { - y[i] += alpha * D[i] * x[i]; - i += blockDim.x * gridDim.x; - } -} - -/// Apply diagonal matrix to a set of dense vectors (tall matrix) -template -static __global__ void diagmm(IndexType_ n, - IndexType_ k, - ValueType_ alpha, - const ValueType_ *__restrict__ D, - const ValueType_ *__restrict__ x, - ValueType_ beta, - ValueType_ *__restrict__ y) -{ - IndexType_ i, j, index; - - for (j = threadIdx.y + blockIdx.y * blockDim.y; j < k; j += blockDim.y * gridDim.y) { - for (i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += blockDim.x * gridDim.x) { - index = i + j * n; - if (beta_is_zero) { - y[index] = alpha * D[i] * x[index]; - } else { - y[index] = alpha * D[i] * x[index] + beta * y[index]; - } - } - } -} -} // namespace - -// ============================================= -// Dense matrix class -// ============================================= - -/// Constructor for dense matrix class -/** @param _trans Whether to transpose matrix. - * @param _m Number of rows. - * @param _n Number of columns. - * @param _A (Input, device memory, _m*_n entries) Matrix - * entries, stored column-major. - * @param _lda Leading dimension of _A. 
- */ -template -DenseMatrix::DenseMatrix( - bool _trans, IndexType_ _m, IndexType_ _n, const ValueType_ *_A, IndexType_ _lda) - : Matrix(_m, _n), trans(_trans), A(_A), lda(_lda) -{ - Cublas::set_pointer_mode_host(); - if (_lda < _m) FatalError("invalid dense matrix parameter (lda<m)", NVGRAPH_ERR_BAD_PARAMETERS); -} - -/// Destructor for dense matrix class -template -DenseMatrix::~DenseMatrix() -{ -} - -/// Get and Set CUDA stream -template -void DenseMatrix::setCUDAStream(cudaStream_t _s) -{ - this->s = _s; - // printf("DenseMatrix setCUDAStream stream=%p\n",this->s); - Cublas::setStream(_s); -} -template -void DenseMatrix::getCUDAStream(cudaStream_t *_s) -{ - *_s = this->s; - // CHECK_CUBLAS(cublasGetStream(cublasHandle, _s)); -} - -/// Matrix-vector product for dense matrix class -/** y is overwritten with alpha*A*x+beta*y. - * - * @param alpha Scalar. - * @param x (Input, device memory, n entries) Vector. - * @param beta Scalar. - * @param y (Input/output, device memory, m entries) Output vector. - */ -template -void DenseMatrix::mv(ValueType_ alpha, - const ValueType_ *__restrict__ x, - ValueType_ beta, - ValueType_ *__restrict__ y) const -{ - Cublas::gemv(this->trans, this->m, this->n, &alpha, this->A, this->lda, x, 1, &beta, y, 1); -} - -template -void DenseMatrix::mm(IndexType_ k, - ValueType_ alpha, - const ValueType_ *__restrict__ x, - ValueType_ beta, - ValueType_ *__restrict__ y) const -{ - Cublas::gemm( - this->trans, false, this->m, k, this->n, &alpha, A, lda, x, this->m, &beta, y, this->n); -} - -/// Color and Reorder -template -void DenseMatrix::color(IndexType_ *c, IndexType_ *p) const -{ -} - -template -void DenseMatrix::reorder(IndexType_ *p) const -{ -} - -/// Incomplete Cholesky (setup, factor and solve) -template -void DenseMatrix::prec_setup(Matrix *_M) -{ - printf("ERROR: DenseMatrix prec_setup dispatched\n"); - // exit(1); -} - -template -void DenseMatrix::prec_solve(IndexType_ k, - ValueType_ alpha, - ValueType_ *__restrict__ fx, - ValueType_ *__restrict__ t) const -{ - printf("ERROR: DenseMatrix prec_solve dispatched\n"); - // exit(1); -} - -template -ValueType_ DenseMatrix::getEdgeSum() const -{ - return 0.0; -} - -// ============================================= -// CSR matrix class -// ============================================= - -/// Constructor for CSR matrix class -/** @param _transA Whether to transpose matrix. - * @param _m Number of rows. - * @param _n Number of columns. - * @param _nnz Number of non-zero entries. - * @param _descrA Matrix properties. - * @param _csrValA (Input, device memory, _nnz entries) Matrix - * entry values. - * @param _csrRowPtrA (Input, device memory, _m+1 entries) Pointer - * to first entry in each row. - * @param _csrColIndA (Input, device memory, _nnz entries) Column - * index of each matrix entry. 
- */ -template -CsrMatrix::CsrMatrix(bool _trans, - bool _sym, - IndexType_ _m, - IndexType_ _n, - IndexType_ _nnz, - const cusparseMatDescr_t _descrA, - /*const*/ ValueType_ *_csrValA, - const IndexType_ *_csrRowPtrA, - const IndexType_ *_csrColIndA) - : Matrix(_m, _n), - trans(_trans), - sym(_sym), - nnz(_nnz), - descrA(_descrA), - csrValA(_csrValA), - csrRowPtrA(_csrRowPtrA), - csrColIndA(_csrColIndA) -{ - if (nnz < 0) FatalError("invalid CSR matrix parameter (nnz<0)", NVGRAPH_ERR_BAD_PARAMETERS); - Cusparse::set_pointer_mode_host(); -} - -/// Destructor for CSR matrix class -template -CsrMatrix::~CsrMatrix() -{ -} - -/// Get and Set CUDA stream -template -void CsrMatrix::setCUDAStream(cudaStream_t _s) -{ - this->s = _s; - // printf("CsrMatrix setCUDAStream stream=%p\n",this->s); - Cusparse::setStream(_s); -} -template -void CsrMatrix::getCUDAStream(cudaStream_t *_s) -{ - *_s = this->s; - // CHECK_CUSPARSE(cusparseGetStream(Cusparse::get_handle(), _s)); -} -template -void CsrMatrix::mm(IndexType_ k, - ValueType_ alpha, - const ValueType_ *__restrict__ x, - ValueType_ beta, - ValueType_ *__restrict__ y) const -{ - // CHECK_CUSPARSE(cusparseXcsrmm(Cusparse::get_handle(), transA, this->m, k, this->n, nnz, &alpha, - // descrA, csrValA, csrRowPtrA, csrColIndA, x, this->n, &beta, y, this->m)); - Cusparse::csrmm(this->trans, - this->sym, - this->m, - k, - this->n, - this->nnz, - &alpha, - this->csrValA, - this->csrRowPtrA, - this->csrColIndA, - x, - this->n, - &beta, - y, - this->m); -} - -/// Color and Reorder -template -void CsrMatrix::color(IndexType_ *c, IndexType_ *p) const -{ -} - -template -void CsrMatrix::reorder(IndexType_ *p) const -{ -} - -/// Incomplete Cholesky (setup, factor and solve) -template -void CsrMatrix::prec_setup(Matrix *_M) -{ - // printf("CsrMatrix prec_setup dispatched\n"); - if (!factored) { - // analyse lower triangular factor - CHECK_CUSPARSE(cusparseCreateSolveAnalysisInfo(&info_l)); - CHECK_CUSPARSE(cusparseSetMatFillMode(descrA, CUSPARSE_FILL_MODE_LOWER)); - CHECK_CUSPARSE(cusparseSetMatDiagType(descrA, CUSPARSE_DIAG_TYPE_UNIT)); - CHECK_CUSPARSE(cusparseXcsrsm_analysis(Cusparse::get_handle(), - CUSPARSE_OPERATION_NON_TRANSPOSE, - this->m, - nnz, - descrA, - csrValA, - csrRowPtrA, - csrColIndA, - info_l)); - // analyse upper triangular factor - CHECK_CUSPARSE(cusparseCreateSolveAnalysisInfo(&info_u)); - CHECK_CUSPARSE(cusparseSetMatFillMode(descrA, CUSPARSE_FILL_MODE_UPPER)); - CHECK_CUSPARSE(cusparseSetMatDiagType(descrA, CUSPARSE_DIAG_TYPE_NON_UNIT)); - CHECK_CUSPARSE(cusparseXcsrsm_analysis(Cusparse::get_handle(), - CUSPARSE_OPERATION_NON_TRANSPOSE, - this->m, - nnz, - descrA, - csrValA, - csrRowPtrA, - csrColIndA, - info_u)); - // perform csrilu0 (should be slightly faster than csric0) - CHECK_CUSPARSE(cusparseXcsrilu0(Cusparse::get_handle(), - CUSPARSE_OPERATION_NON_TRANSPOSE, - this->m, - descrA, - csrValA, - csrRowPtrA, - csrColIndA, - info_l)); - // set factored flag to true - factored = true; - } -} - -template -void CsrMatrix::prec_solve(IndexType_ k, - ValueType_ alpha, - ValueType_ *__restrict__ fx, - ValueType_ *__restrict__ t) const -{ - // printf("CsrMatrix prec_solve dispatched (stream %p)\n",this->s); - - // preconditioning Mx=f (where M = L*U, therefore x=U\(L\f)) - // solve lower triangular factor - CHECK_CUSPARSE(cusparseSetMatFillMode(descrA, CUSPARSE_FILL_MODE_LOWER)); - CHECK_CUSPARSE(cusparseSetMatDiagType(descrA, CUSPARSE_DIAG_TYPE_UNIT)); - CHECK_CUSPARSE(cusparseXcsrsm_solve(Cusparse::get_handle(), - CUSPARSE_OPERATION_NON_TRANSPOSE, - 
this->m, - k, - alpha, - descrA, - csrValA, - csrRowPtrA, - csrColIndA, - info_l, - fx, - this->m, - t, - this->m)); - // solve upper triangular factor - CHECK_CUSPARSE(cusparseSetMatFillMode(descrA, CUSPARSE_FILL_MODE_UPPER)); - CHECK_CUSPARSE(cusparseSetMatDiagType(descrA, CUSPARSE_DIAG_TYPE_NON_UNIT)); - CHECK_CUSPARSE(cusparseXcsrsm_solve(Cusparse::get_handle(), - CUSPARSE_OPERATION_NON_TRANSPOSE, - this->m, - k, - alpha, - descrA, - csrValA, - csrRowPtrA, - csrColIndA, - info_u, - t, - this->m, - fx, - this->m)); -} - -/// Matrix-vector product for CSR matrix class -/** y is overwritten with alpha*A*x+beta*y. - * - * @param alpha Scalar. - * @param x (Input, device memory, n entries) Vector. - * @param beta Scalar. - * @param y (Input/output, device memory, m entries) Output vector. - */ -template -void CsrMatrix::mv(ValueType_ alpha, - const ValueType_ *__restrict__ x, - ValueType_ beta, - ValueType_ *__restrict__ y) const -{ - // TODO: consider using merge-path csrmv - Cusparse::csrmv(this->trans, - this->sym, - this->m, - this->n, - this->nnz, - &alpha, - this->csrValA, - this->csrRowPtrA, - this->csrColIndA, - x, - &beta, - y); -} - -template -ValueType_ CsrMatrix::getEdgeSum() const -{ - return 0.0; -} - -// ============================================= -// Laplacian matrix class -// ============================================= - -/// Constructor for Laplacian matrix class -/** @param A Adjacency matrix - */ -template -LaplacianMatrix::LaplacianMatrix( - /*const*/ Matrix &_A) - : Matrix(_A.m, _A.n), A(&_A) -{ - // Check that adjacency matrix is square - if (_A.m != _A.n) - FatalError("cannot construct Laplacian matrix from non-square adjacency matrix", - NVGRAPH_ERR_BAD_PARAMETERS); - // set CUDA stream - this->s = NULL; - // Construct degree matrix - D.allocate(_A.m, this->s); - Vector ones(this->n, this->s); - ones.fill(1.0); - _A.mv(1, ones.raw(), 0, D.raw()); - - // Set preconditioning matrix pointer to NULL - M = NULL; -} - -/// Destructor for Laplacian matrix class -template -LaplacianMatrix::~LaplacianMatrix() -{ -} - -/// Get and Set CUDA stream -template -void LaplacianMatrix::setCUDAStream(cudaStream_t _s) -{ - this->s = _s; - // printf("LaplacianMatrix setCUDAStream stream=%p\n",this->s); - A->setCUDAStream(_s); - if (M != NULL) { M->setCUDAStream(_s); } -} -template -void LaplacianMatrix::getCUDAStream(cudaStream_t *_s) -{ - *_s = this->s; - // A->getCUDAStream(_s); -} - -/// Matrix-vector product for Laplacian matrix class -/** y is overwritten with alpha*A*x+beta*y. - * - * @param alpha Scalar. - * @param x (Input, device memory, n entries) Vector. - * @param beta Scalar. - * @param y (Input/output, device memory, m entries) Output vector. 
- */ -template -void LaplacianMatrix::mv(ValueType_ alpha, - const ValueType_ *__restrict__ x, - ValueType_ beta, - ValueType_ *__restrict__ y) const -{ - // Scale result vector - if (beta == 0) - CHECK_CUDA(cudaMemset(y, 0, (this->n) * sizeof(ValueType_))) - else if (beta != 1) - thrust::transform(thrust::device_pointer_cast(y), - thrust::device_pointer_cast(y + this->n), - thrust::make_constant_iterator(beta), - thrust::device_pointer_cast(y), - thrust::multiplies()); - - // Apply diagonal matrix - dim3 gridDim, blockDim; - gridDim.x = min(((this->n) + BLOCK_SIZE - 1) / BLOCK_SIZE, 65535); - gridDim.y = 1; - gridDim.z = 1; - blockDim.x = BLOCK_SIZE; - blockDim.y = 1; - blockDim.z = 1; - diagmv<<s>>>(this->n, alpha, D.raw(), x, y); - cudaCheckError(); - - // Apply adjacency matrix - A->mv(-alpha, x, 1, y); -} -/// Matrix-vector product for Laplacian matrix class -/** y is overwritten with alpha*A*x+beta*y. - * - * @param alpha Scalar. - * @param x (Input, device memory, n*k entries) nxk dense matrix. - * @param beta Scalar. - * @param y (Input/output, device memory, m*k entries) Output mxk dense matrix. - */ -template -void LaplacianMatrix::mm(IndexType_ k, - ValueType_ alpha, - const ValueType_ *__restrict__ x, - ValueType_ beta, - ValueType_ *__restrict__ y) const -{ - // Apply diagonal matrix - ValueType_ one = (ValueType_)1.0; - this->dm(k, alpha, x, beta, y); - - // Apply adjacency matrix - A->mm(k, -alpha, x, one, y); -} - -template -void LaplacianMatrix::dm(IndexType_ k, - ValueType_ alpha, - const ValueType_ *__restrict__ x, - ValueType_ beta, - ValueType_ *__restrict__ y) const -{ - IndexType_ t = k * (this->n); - dim3 gridDim, blockDim; - - // setup launch parameters - gridDim.x = min(((this->n) + BLOCK_SIZE - 1) / BLOCK_SIZE, 65535); - gridDim.y = min(k, 65535); - gridDim.z = 1; - blockDim.x = BLOCK_SIZE; - blockDim.y = 1; - blockDim.z = 1; - - // Apply diagonal matrix - if (beta == 0.0) { - // set vectors to 0 (WARNING: notice that you need to set, not scale, because of NaNs corner - // case) - CHECK_CUDA(cudaMemset(y, 0, t * sizeof(ValueType_))); - diagmm - <<s>>>(this->n, k, alpha, D.raw(), x, beta, y); - } else { - diagmm - <<s>>>(this->n, k, alpha, D.raw(), x, beta, y); - } - cudaCheckError(); -} - -/// Color and Reorder -template -void LaplacianMatrix::color(IndexType_ *c, IndexType_ *p) const -{ -} - -template -void LaplacianMatrix::reorder(IndexType_ *p) const -{ -} - -/// Solve preconditioned system M x = f for a set of k vectors -template -void LaplacianMatrix::prec_setup(Matrix *_M) -{ - // save the pointer to preconditioner M - M = _M; - if (M != NULL) { - // setup the preconditioning matrix M - M->prec_setup(NULL); - } -} - -template -void LaplacianMatrix::prec_solve(IndexType_ k, - ValueType_ alpha, - ValueType_ *__restrict__ fx, - ValueType_ *__restrict__ t) const -{ - if (M != NULL) { - // preconditioning - M->prec_solve(k, alpha, fx, t); - } -} - -template -ValueType_ LaplacianMatrix::getEdgeSum() const -{ - return 0.0; -} -// ============================================= -// Modularity matrix class -// ============================================= - -/// Constructor for Modularity matrix class -/** @param A Adjacency matrix - */ -template -ModularityMatrix::ModularityMatrix( - /*const*/ Matrix &_A, IndexType_ _nnz) - : Matrix(_A.m, _A.n), A(&_A), nnz(_nnz) -{ - // Check that adjacency matrix is square - if (_A.m != _A.n) - FatalError("cannot construct Modularity matrix from non-square adjacency matrix", - NVGRAPH_ERR_BAD_PARAMETERS); - - // set CUDA stream - 
this->s = NULL; - // Construct degree matrix - D.allocate(_A.m, this->s); - Vector ones(this->n, this->s); - ones.fill(1.0); - _A.mv(1, ones.raw(), 0, D.raw()); - // D.dump(0,this->n); - edge_sum = D.nrm1(); - - // Set preconditioning matrix pointer to NULL - M = NULL; -} - -/// Destructor for Modularity matrix class -template -ModularityMatrix::~ModularityMatrix() -{ -} - -/// Get and Set CUDA stream -template -void ModularityMatrix::setCUDAStream(cudaStream_t _s) -{ - this->s = _s; - // printf("ModularityMatrix setCUDAStream stream=%p\n",this->s); - A->setCUDAStream(_s); - if (M != NULL) { M->setCUDAStream(_s); } -} - -template -void ModularityMatrix::getCUDAStream(cudaStream_t *_s) -{ - *_s = this->s; - // A->getCUDAStream(_s); -} - -/// Matrix-vector product for Modularity matrix class -/** y is overwritten with alpha*A*x+beta*y. - * - * @param alpha Scalar. - * @param x (Input, device memory, n entries) Vector. - * @param beta Scalar. - * @param y (Input/output, device memory, m entries) Output vector. - */ -template -void ModularityMatrix::mv(ValueType_ alpha, - const ValueType_ *__restrict__ x, - ValueType_ beta, - ValueType_ *__restrict__ y) const -{ - // Scale result vector - if (alpha != 1 || beta != 0) - FatalError("This isn't implemented for Modularity Matrix currently", - NVGRAPH_ERR_NOT_IMPLEMENTED); - - // CHECK_CUBLAS(cublasXdot(handle, this->n, const double *x, int incx, const double *y, int incy, - // double *result)); - // y = A*x - A->mv(alpha, x, 0, y); - ValueType_ dot_res; - // gamma = d'*x - Cublas::dot(this->n, D.raw(), 1, x, 1, &dot_res); - // y = y -(gamma/edge_sum)*d - Cublas::axpy(this->n, -(dot_res / this->edge_sum), D.raw(), 1, y, 1); -} -/// Matrix-vector product for Modularity matrix class -/** y is overwritten with alpha*A*x+beta*y. - * - * @param alpha Scalar. - * @param x (Input, device memory, n*k entries) nxk dense matrix. - * @param beta Scalar. - * @param y (Input/output, device memory, m*k entries) Output mxk dense matrix. 
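
// ModularityMatrix::mv() above applies the modularity operator
// B = A - d * d^T / sum(d) matrix-free (and only for alpha == 1, beta == 0):
// first y = A*x, then gamma = d . x via Cublas::dot, then the rank-one
// correction y -= (gamma / edge_sum) * d via Cublas::axpy, where
// edge_sum = nrm1(d). The two BLAS-1 steps on the host (illustrative sketch):
template <typename ValueType>
void modularity_correction(int n, const ValueType* d, const ValueType* x,
                           ValueType edge_sum, ValueType* y)
{
  ValueType gamma = 0;                        // gamma = d' * x   (Cublas::dot)
  for (int i = 0; i < n; ++i) gamma += d[i] * x[i];
  for (int i = 0; i < n; ++i)                 // y -= (gamma / edge_sum) * d   (Cublas::axpy)
    y[i] -= (gamma / edge_sum) * d[i];
}
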
- */
-template <typename IndexType_, typename ValueType_>
-void ModularityMatrix<IndexType_, ValueType_>::mm(IndexType_ k,
-                                                  ValueType_ alpha,
-                                                  const ValueType_ *__restrict__ x,
-                                                  ValueType_ beta,
-                                                  ValueType_ *__restrict__ y) const
-{
-  FatalError("This isn't implemented for Modularity Matrix currently", NVGRAPH_ERR_NOT_IMPLEMENTED);
-}
-
-template <typename IndexType_, typename ValueType_>
-void ModularityMatrix<IndexType_, ValueType_>::dm(IndexType_ k,
-                                                  ValueType_ alpha,
-                                                  const ValueType_ *__restrict__ x,
-                                                  ValueType_ beta,
-                                                  ValueType_ *__restrict__ y) const
-{
-  FatalError("This isn't implemented for Modularity Matrix currently", NVGRAPH_ERR_NOT_IMPLEMENTED);
-}
-
-/// Color and Reorder
-template <typename IndexType_, typename ValueType_>
-void ModularityMatrix<IndexType_, ValueType_>::color(IndexType_ *c, IndexType_ *p) const
-{
-  FatalError("This isn't implemented for Modularity Matrix currently", NVGRAPH_ERR_NOT_IMPLEMENTED);
-}
-
-template <typename IndexType_, typename ValueType_>
-void ModularityMatrix<IndexType_, ValueType_>::reorder(IndexType_ *p) const
-{
-  FatalError("This isn't implemented for Modularity Matrix currently", NVGRAPH_ERR_NOT_IMPLEMENTED);
-}
-
-/// Solve preconditioned system M x = f for a set of k vectors
-template <typename IndexType_, typename ValueType_>
-void ModularityMatrix<IndexType_, ValueType_>::prec_setup(Matrix<IndexType_, ValueType_> *_M)
-{
-  // save the pointer to preconditioner M
-  M = _M;
-  if (M != NULL) {
-    // setup the preconditioning matrix M
-    M->prec_setup(NULL);
-  }
-}
-
-template <typename IndexType_, typename ValueType_>
-void ModularityMatrix<IndexType_, ValueType_>::prec_solve(IndexType_ k,
-                                                          ValueType_ alpha,
-                                                          ValueType_ *__restrict__ fx,
-                                                          ValueType_ *__restrict__ t) const
-{
-  if (M != NULL) {
-    FatalError("This isn't implemented for Modularity Matrix currently",
-               NVGRAPH_ERR_NOT_IMPLEMENTED);
-  }
-}
-
-template <typename IndexType_, typename ValueType_>
-ValueType_ ModularityMatrix<IndexType_, ValueType_>::getEdgeSum() const
-{
-  return edge_sum;
-}
-// Explicit instantiation
-template class Matrix<int, float>;
-template class Matrix<int, double>;
-template class DenseMatrix<int, float>;
-template class DenseMatrix<int, double>;
-template class CsrMatrix<int, float>;
-template class CsrMatrix<int, double>;
-template class LaplacianMatrix<int, float>;
-template class LaplacianMatrix<int, double>;
-template class ModularityMatrix<int, float>;
-template class ModularityMatrix<int, double>;
-
-}  // namespace nvgraph
-//#endif
diff --git a/cpp/src/sort/bitonic.cuh b/cpp/src/sort/bitonic.cuh
index 38249aa3973..e2922a58d39 100644
--- a/cpp/src/sort/bitonic.cuh
+++ b/cpp/src/sort/bitonic.cuh
@@ -1,7 +1,7 @@
 // -*-c++-*-
 /*
- * Copyright (c) 2019, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2020, NVIDIA CORPORATION.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
@@ -34,7 +34,7 @@ #include #include -#include +#include namespace cugraph { namespace sort { diff --git a/cpp/src/structure/graph.cu b/cpp/src/structure/graph.cu index 059651e80d2..63ef725c3b7 100644 --- a/cpp/src/structure/graph.cu +++ b/cpp/src/structure/graph.cu @@ -15,10 +15,11 @@ */ #include -#include "utilities/cuda_utils.cuh" -#include "utilities/error_utils.h" +#include "utilities/error.hpp" #include "utilities/graph_utils.cuh" +#include + namespace { template @@ -36,25 +37,26 @@ void degree_from_offsets(vertex_t number_of_vertices, } template -void degree_from_vertex_ids(const cugraph::experimental::Comm &comm, +void degree_from_vertex_ids(const raft::handle_t *handle, vertex_t number_of_vertices, edge_t number_of_edges, vertex_t const *indices, edge_t *degree, cudaStream_t stream) { - thrust::for_each( - rmm::exec_policy(stream)->on(stream), - thrust::make_counting_iterator(0), - thrust::make_counting_iterator(number_of_edges), - [indices, degree] __device__(edge_t e) { cugraph::atomicAdd(degree + indices[e], 1); }); - comm.allreduce(number_of_vertices, degree, degree, cugraph::experimental::ReduceOp::SUM); + thrust::for_each(rmm::exec_policy(stream)->on(stream), + thrust::make_counting_iterator(0), + thrust::make_counting_iterator(number_of_edges), + [indices, degree] __device__(edge_t e) { atomicAdd(degree + indices[e], 1); }); + if ((handle != nullptr) && (handle->comms_initialized())) { + auto &comm = handle->get_comms(); + comm.allreduce(degree, degree, number_of_vertices, raft::comms::op_t::SUM, stream); + } } } // namespace namespace cugraph { -namespace experimental { template void GraphViewBase::get_vertex_identifiers(VT *identifiers) const @@ -82,10 +84,14 @@ void GraphCOOView::degree(ET *degree, DegreeDirection direction) con cudaStream_t stream{nullptr}; if (direction != DegreeDirection::IN) { - if (GraphViewBase::comm.get_p()) // FIXME retrieve global source - // indexing for the allreduce work - CUGRAPH_FAIL("OPG degree not implemented for OUT degree"); - degree_from_vertex_ids(GraphViewBase::comm, + if ((GraphViewBase::handle != nullptr) && + (GraphViewBase::handle + ->comms_initialized())) // FIXME retrieve global source + // indexing for the allreduce work + { + CUGRAPH_FAIL("MG degree not implemented for OUT degree"); + } + degree_from_vertex_ids(GraphViewBase::handle, GraphViewBase::number_of_vertices, GraphViewBase::number_of_edges, src_indices, @@ -94,7 +100,7 @@ void GraphCOOView::degree(ET *degree, DegreeDirection direction) con } if (direction != DegreeDirection::OUT) { - degree_from_vertex_ids(GraphViewBase::comm, + degree_from_vertex_ids(GraphViewBase::handle, GraphViewBase::number_of_vertices, GraphViewBase::number_of_edges, dst_indices, @@ -115,15 +121,17 @@ void GraphCompressedSparseBaseView::degree(ET *degree, DegreeDirecti cudaStream_t stream{nullptr}; if (direction != DegreeDirection::IN) { - if (GraphViewBase::comm.get_p()) - CUGRAPH_FAIL("OPG degree not implemented for OUT degree"); // FIXME retrieve global - // source indexing for - // the allreduce to work + if ((GraphViewBase::handle != nullptr) && + (GraphViewBase::handle->comms_initialized())) { + CUGRAPH_FAIL("MG degree not implemented for OUT degree"); // FIXME retrieve global + // source indexing for + // the allreduce to work + } degree_from_offsets(GraphViewBase::number_of_vertices, offsets, degree, stream); } if (direction != DegreeDirection::OUT) { - degree_from_vertex_ids(GraphViewBase::comm, + degree_from_vertex_ids(GraphViewBase::handle, GraphViewBase::number_of_vertices, 
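
// The rewritten degree_from_vertex_ids above follows a count-then-combine
// pattern: each rank accumulates degrees for its own edges with device
// atomics, then a SUM allreduce folds the per-rank partial counts into the
// global degree array on every rank. Condensed sketch of the same flow
// (assumes extended device lambdas and an initialized handle):
void mg_degree_sketch(raft::handle_t const& handle, int32_t n_vertices,
                      int32_t n_edges, int32_t const* indices, int32_t* degree)
{
  cudaStream_t stream = handle.get_stream();
  thrust::for_each(rmm::exec_policy(stream)->on(stream),
                   thrust::make_counting_iterator<int32_t>(0),
                   thrust::make_counting_iterator<int32_t>(n_edges),
                   [indices, degree] __device__(int32_t e) {
                     atomicAdd(degree + indices[e], 1);  // rank-local partials
                   });
  if (handle.comms_initialized()) {
    handle.get_comms().allreduce(degree, degree, n_vertices,
                                 raft::comms::op_t::SUM, stream);  // global sum
  }
}
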
GraphViewBase::number_of_edges, indices, @@ -139,5 +147,4 @@ template class GraphCOOView; template class GraphCOOView; template class GraphCompressedSparseBaseView; template class GraphCompressedSparseBaseView; -} // namespace experimental } // namespace cugraph diff --git a/cpp/src/topology/topology.cuh b/cpp/src/topology/topology.cuh index 15fbf588c23..82b0e72c705 100644 --- a/cpp/src/topology/topology.cuh +++ b/cpp/src/topology/topology.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. diff --git a/cpp/src/traversal/bfs.cu b/cpp/src/traversal/bfs.cu index cbe741424ea..dfb7a32499d 100644 --- a/cpp/src/traversal/bfs.cu +++ b/cpp/src/traversal/bfs.cu @@ -16,8 +16,10 @@ #include "graph.hpp" -#include +#include #include "bfs_kernels.cuh" +#include "mg/bfs.cuh" +#include "mg/common_utils.cuh" #include "traversal_common.cuh" #include "utilities/graph_utils.cuh" @@ -265,7 +267,6 @@ void BFS::traverse(IndexType source_vertex) bool can_use_bottom_up = (!sp_counters && !directed && distances); while (nf > 0) { - // Each vertices can appear only once in the frontierer array - we know it will fit new_frontier = frontier + nf; IndexType old_nf = nf; resetDevicePointers(); @@ -356,7 +357,7 @@ void BFS::traverse(IndexType source_vertex) mu -= mf; cudaMemcpyAsync(&nf, d_new_frontier_cnt, sizeof(IndexType), cudaMemcpyDeviceToHost, stream); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); // We need nf cudaStreamSynchronize(stream); @@ -413,7 +414,7 @@ void BFS::traverse(IndexType source_vertex) sizeof(IndexType), cudaMemcpyDeviceToHost, stream); - CUDA_CHECK_LAST() + CHECK_CUDA(stream); // We need last_left_unvisited_size cudaStreamSynchronize(stream); bfs_kernels::bottom_up_large(left_unvisited_queue, @@ -431,7 +432,7 @@ void BFS::traverse(IndexType source_vertex) deterministic); } cudaMemcpyAsync(&nf, d_new_frontier_cnt, sizeof(IndexType), cudaMemcpyDeviceToHost, stream); - CUDA_CHECK_LAST() + CHECK_CUDA(stream); // We will need nf cudaStreamSynchronize(stream); @@ -461,50 +462,111 @@ void BFS::clean() // the vectors have a destructor that takes care of cleaning } +// Explicit Instantiation +template class BFS; template class BFS; +template class BFS; + } // namespace detail // NOTE: SP counter increase extremely fast on large graph // It can easily reach 1e40~1e70 on GAP-road.mtx template -void bfs(experimental::GraphCSRView const &graph, +void bfs(raft::handle_t const &handle, + GraphCSRView const &graph, VT *distances, VT *predecessors, double *sp_counters, const VT start_vertex, - bool directed) + bool directed, + bool mg_batch) { - CUGRAPH_EXPECTS(typeid(VT) == typeid(int), "Unsupported vertex id data type, please use int"); - CUGRAPH_EXPECTS(typeid(ET) == typeid(int), "Unsupported edge id data type, please use int"); - CUGRAPH_EXPECTS((typeid(WT) == typeid(float)) || (typeid(WT) == typeid(double)), - "Unsupported weight data type, please use float or double"); - - VT number_of_vertices = graph.number_of_vertices; - ET number_of_edges = graph.number_of_edges; - - const VT *indices_ptr = graph.indices; - const ET *offsets_ptr = graph.offsets; - - int alpha = 15; - int beta = 18; - // FIXME: Use VT and ET in the BFS detail - cugraph::detail::BFS bfs( - number_of_vertices, number_of_edges, offsets_ptr, indices_ptr, directed, alpha, beta); - bfs.configure(distances, predecessors, sp_counters, nullptr); - 
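
// alpha and beta in the block above are the switching heuristics of
// direction-optimizing BFS (Beamer et al.): run top-down while the frontier is
// cheap to expand, switch to bottom-up once the frontier's outgoing edge count
// mf exceeds the unexplored edge count mu / alpha, and return to top-down when
// the frontier shrinks below n / beta vertices. In outline (a sketch, not the
// exact control flow of BFS::traverse):
bool use_bottom_up(long long mf, long long mu, long long nf, long long n,
                   bool frontier_growing, int alpha = 15, int beta = 18)
{
  if (frontier_growing) return mf > mu / alpha;  // edge-count test while growing
  return nf > n / beta;                          // stay bottom-up while the frontier is large
}
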
bfs.traverse(start_vertex); + static_assert(std::is_integral::value && sizeof(VT) >= sizeof(int32_t), + "Unsupported vertex id data type. Use integral types of size >= sizeof(int32_t)"); + static_assert(std::is_same::value, + "VT and ET should be the same time for the current BFS implementation"); + static_assert(std::is_floating_point::value, + "Unsupported edge weight type. Use floating point types"); // actually, this is + // unnecessary for BFS + if (handle.comms_initialized() && !mg_batch) { + CUGRAPH_EXPECTS(sp_counters == nullptr, + "BFS Traversal shortest path is not supported in MG path"); + mg::bfs(handle, graph, distances, predecessors, start_vertex); + } else { + VT number_of_vertices = graph.number_of_vertices; + ET number_of_edges = graph.number_of_edges; + + const VT *indices_ptr = graph.indices; + const ET *offsets_ptr = graph.offsets; + + int alpha = 15; + int beta = 18; + // FIXME: Use VT and ET in the BFS detail + cugraph::detail::BFS bfs( + number_of_vertices, number_of_edges, offsets_ptr, indices_ptr, directed, alpha, beta); + bfs.configure(distances, predecessors, sp_counters, nullptr); + bfs.traverse(start_vertex); + } } -template void bfs(experimental::GraphCSRView const &graph, - int *distances, - int *predecessors, - double *sp_counters, - const int source_vertex, - bool directed); -template void bfs(experimental::GraphCSRView const &graph, - int *distances, - int *predecessors, - double *sp_counters, - const int source_vertex, - bool directed); +// Explicit Instantiation +template void bfs(raft::handle_t const &handle, + GraphCSRView const &graph, + uint32_t *distances, + uint32_t *predecessors, + double *sp_counters, + const uint32_t source_vertex, + bool directed, + bool mg_batch); + +// Explicit Instantiation +template void bfs(raft::handle_t const &handle, + GraphCSRView const &graph, + uint32_t *distances, + uint32_t *predecessors, + double *sp_counters, + const uint32_t source_vertex, + bool directed, + bool mg_batch); + +// Explicit Instantiation +template void bfs(raft::handle_t const &handle, + GraphCSRView const &graph, + int32_t *distances, + int32_t *predecessors, + double *sp_counters, + const int32_t source_vertex, + bool directed, + bool mg_batch); + +// Explicit Instantiation +template void bfs(raft::handle_t const &handle, + GraphCSRView const &graph, + int32_t *distances, + int32_t *predecessors, + double *sp_counters, + const int32_t source_vertex, + bool directed, + bool mg_batch); + +// Explicit Instantiation +template void bfs(raft::handle_t const &handle, + GraphCSRView const &graph, + int64_t *distances, + int64_t *predecessors, + double *sp_counters, + const int64_t source_vertex, + bool directed, + bool mg_batch); + +// Explicit Instantiation +template void bfs(raft::handle_t const &handle, + GraphCSRView const &graph, + int64_t *distances, + int64_t *predecessors, + double *sp_counters, + const int64_t source_vertex, + bool directed, + bool mg_batch); } // namespace cugraph diff --git a/cpp/src/traversal/bfs_kernels.cuh b/cpp/src/traversal/bfs_kernels.cuh index ceac8e5a1fa..bf2ec2fc6ee 100644 --- a/cpp/src/traversal/bfs_kernels.cuh +++ b/cpp/src/traversal/bfs_kernels.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2018-2020 NVIDIA CORPORATION. + * Copyright (c) 2018-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
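
// A single-GPU caller of the new bfs() entry point above passes a
// default-constructed raft::handle_t and leaves mg_batch false; the MG branch
// is taken only when the handle's comms are initialized. Hypothetical usage
// sketch (make_graph stands in for whatever builds the CSR view; it is not a
// real helper in this codebase):
raft::handle_t handle;
cugraph::GraphCSRView<int32_t, int32_t, float> graph = make_graph();
rmm::device_vector<int32_t> distances(graph.number_of_vertices);
rmm::device_vector<int32_t> predecessors(graph.number_of_vertices);
cugraph::bfs<int32_t, int32_t, float>(handle, graph,
    distances.data().get(), predecessors.data().get(),
    /*sp_counters=*/nullptr, /*start_vertex=*/0,
    /*directed=*/true, /*mg_batch=*/false);
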
@@ -15,8 +15,10 @@ */ #include -#include +#include #include + +#include "graph.hpp" #include "traversal_common.cuh" namespace cugraph { @@ -92,7 +94,7 @@ __global__ void fill_unvisited_queue_kernel(int *visited_bmap, // saving the common offset if (threadIdx.x == (FILL_UNVISITED_QUEUE_DIMX - 1)) { IndexType total = unvisited_thread_offset + n_unvisited_in_int; - unvisited_common_block_offset = atomicAdd(unvisited_cnt, total); + unvisited_common_block_offset = traversal::atomicAdd(unvisited_cnt, total); } // syncthreads for two reasons : @@ -161,11 +163,12 @@ void fill_unvisited_queue(int *visited_bmap, dim3 grid, block; block.x = FILL_UNVISITED_QUEUE_DIMX; - grid.x = min((IndexType)MAXBLOCKS, (visited_bmap_nints + block.x - 1) / block.x); + grid.x = std::min(static_cast(MAXBLOCKS), + (static_cast(visited_bmap_nints) + block.x - 1) / block.x); fill_unvisited_queue_kernel<<>>( visited_bmap, visited_bmap_nints, n, unvisited, unvisited_cnt); - CUDA_CHECK_LAST(); + CHECK_CUDA(m_stream); } // @@ -206,7 +209,7 @@ __global__ void count_unvisited_edges_kernel(const IndexType *potentially_unvisi BlockReduce(reduce_temp_storage).Sum(thread_unvisited_edges_count); // block_unvisited_edges_count is only defined is th.x == 0 - if (threadIdx.x == 0) atomicAdd(mu, block_unvisited_edges_count); + if (threadIdx.x == 0) traversal::atomicAdd(mu, block_unvisited_edges_count); } // Wrapper @@ -220,11 +223,12 @@ void count_unvisited_edges(const IndexType *potentially_unvisited, { dim3 grid, block; block.x = COUNT_UNVISITED_EDGES_DIMX; - grid.x = min((IndexType)MAXBLOCKS, (potentially_unvisited_size + block.x - 1) / block.x); + grid.x = std::min(static_cast(MAXBLOCKS), + (static_cast(potentially_unvisited_size) + block.x - 1) / block.x); count_unvisited_edges_kernel<<>>( potentially_unvisited, potentially_unvisited_size, visited_bmap, node_degree, mu); - CUDA_CHECK_LAST(); + CHECK_CUDA(m_stream); } // @@ -285,6 +289,11 @@ __global__ void main_bottomup_kernel(const IndexType *unvisited, const int warpid = threadIdx.x / WARP_SIZE; const int laneid = threadIdx.x % WARP_SIZE; + // When this kernel is converted to support different VT and ET, this + // will likely split into invalid_vid and invalid_eid + // This is equivalent to ~IndexType(0) (i.e., all bits set to 1) + constexpr IndexType invalid_idx = cugraph::invalid_idx::value; + // we will call __syncthreads inside the loop // we need to keep complete block active for (IndexType block_off = blockIdx.x * blockDim.x; block_off < unvisited_size; @@ -299,8 +308,9 @@ __global__ void main_bottomup_kernel(const IndexType *unvisited, // by different in in visited_bmap) IndexType visited_bmap_index[1]; // this is an array of size 1 because CUB // needs one - visited_bmap_index[0] = -1; - IndexType unvisited_vertex = -1; + + visited_bmap_index[0] = invalid_idx; + IndexType unvisited_vertex = invalid_idx; // local_visited_bmap gives info on the visited bit of unvisited_vertex // @@ -329,7 +339,9 @@ __global__ void main_bottomup_kernel(const IndexType *unvisited, IndexType degree = edge_end - edge_begin; - for (IndexType edge = edge_begin; edge < min(edge_end, edge_begin + MAIN_BOTTOMUP_MAX_EDGES); + for (IndexType edge = edge_begin; + edge < min(static_cast(edge_end), + static_cast(edge_begin) + MAIN_BOTTOMUP_MAX_EDGES); ++edge) { if (edge_mask && !edge_mask[edge]) continue; @@ -353,7 +365,7 @@ __global__ void main_bottomup_kernel(const IndexType *unvisited, // If we haven't found a parent and there's more edge to check if (!found && degree > MAIN_BOTTOMUP_MAX_EDGES) { - 
left_unvisited_off = atomicAdd(left_unvisited_cnt, (IndexType)1); + left_unvisited_off = traversal::atomicAdd(left_unvisited_cnt, static_cast(1)); more_to_visit = 1; } } @@ -393,7 +405,7 @@ __global__ void main_bottomup_kernel(const IndexType *unvisited, // broadcasting local_visited_bmap_warp_head __syncthreads(); - int head_ballot = cugraph::detail::utils::ballot(is_head); + int head_ballot = __ballot_sync(raft::warp_full_mask(), is_head); // As long as idx < unvisited_size, we know there's at least one head per // warp @@ -438,9 +450,8 @@ __global__ void main_bottomup_kernel(const IndexType *unvisited, // the destination thread of the __shfl is active int laneid_max = - min((IndexType)(WARP_SIZE - 1), (unvisited_size - (block_off + 32 * warpid))); - IndexType last_v = - cugraph::detail::utils::shfl(unvisited_vertex, laneid_max, WARP_SIZE, __activemask()); + min(static_cast(WARP_SIZE - 1), (unvisited_size - (block_off + 32 * warpid))); + IndexType last_v = __shfl_sync(__activemask(), unvisited_vertex, laneid_max, WARP_SIZE); if (is_last_head_in_warp) { int ilast_v = last_v % INT_SIZE + 1; @@ -462,7 +473,7 @@ __global__ void main_bottomup_kernel(const IndexType *unvisited, BlockScan(scan_temp_storage).ExclusiveSum(found, thread_frontier_offset); IndexType inclusive_sum = thread_frontier_offset + found; if (threadIdx.x == (MAIN_BOTTOMUP_DIMX - 1) && inclusive_sum) { - frontier_common_block_offset = atomicAdd(new_frontier_cnt, inclusive_sum); + frontier_common_block_offset = traversal::atomicAdd(new_frontier_cnt, inclusive_sum); } // 1) Broadcasting frontier_common_block_offset @@ -495,7 +506,8 @@ void bottom_up_main(IndexType *unvisited, dim3 grid, block; block.x = MAIN_BOTTOMUP_DIMX; - grid.x = min((IndexType)MAXBLOCKS, ((unvisited_size + block.x)) / block.x); + grid.x = std::min(static_cast(MAXBLOCKS), + (static_cast(unvisited_size) + block.x) / block.x); main_bottomup_kernel<<>>(unvisited, unvisited_size, @@ -510,7 +522,7 @@ void bottom_up_main(IndexType *unvisited, distances, predecessors, edge_mask); - CUDA_CHECK_LAST(); + CHECK_CUDA(m_stream); } // @@ -535,6 +547,11 @@ __global__ void bottom_up_large_degree_kernel(IndexType *left_unvisited, int logical_warp_id = threadIdx.x / BOTTOM_UP_LOGICAL_WARP_SIZE; int logical_warps_per_block = blockDim.x / BOTTOM_UP_LOGICAL_WARP_SIZE; + // When this kernel is converted to support different VT and ET, this + // will likely split into invalid_vid and invalid_eid + // This is equivalent to ~IndexType(0) (i.e., all bits set to 1) + constexpr IndexType invalid_idx = cugraph::invalid_idx::value; + // Inactive threads are not a pb for __ballot (known behaviour) for (IndexType idx = logical_warps_per_block * blockIdx.x + logical_warp_id; idx < left_unvisited_size; @@ -555,7 +572,7 @@ __global__ void bottom_up_large_degree_kernel(IndexType *left_unvisited, // is know with inactive threads for (IndexType i_edge = first_i_edge + logical_lane_id; i_edge < end_i_edge; i_edge += BOTTOM_UP_LOGICAL_WARP_SIZE) { - IndexType valid_parent = -1; + IndexType valid_parent = invalid_idx; if (!edge_mask || edge_mask[i_edge]) { IndexType u = col_ind[i_edge]; @@ -564,7 +581,8 @@ __global__ void bottom_up_large_degree_kernel(IndexType *left_unvisited, if (lvl_u == (lvl - 1)) { valid_parent = u; } } - unsigned int warp_valid_p_ballot = cugraph::detail::utils::ballot((valid_parent != -1)); + unsigned int warp_valid_p_ballot = + __ballot_sync(raft::warp_full_mask(), valid_parent != invalid_idx); int logical_warp_id_in_warp = (threadIdx.x % WARP_SIZE) / 
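
// The edits above replace the pre-Volta warp helpers with the explicit-mask
// CUDA 9+ intrinsics (__ballot_sync / __shfl_sync); raft::warp_full_mask() is
// just the all-lanes mask 0xffffffff. The two idioms in isolation (sketch;
// assumes one 32-thread warp per block):
__global__ void warp_sync_demo(const int* flags, int* first_voter)
{
  int gid   = blockIdx.x * blockDim.x + threadIdx.x;
  bool pred = flags[gid] != 0;
  // bit i of ballot is set iff lane i's predicate was true
  unsigned ballot = __ballot_sync(0xffffffffu, pred);
  // lowest voting lane (-1 if nobody voted), broadcast from lane 0
  int first = __shfl_sync(0xffffffffu, __ffs(ballot) - 1, 0);
  if (threadIdx.x == 0) first_voter[blockIdx.x] = first;
}
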
BOTTOM_UP_LOGICAL_WARP_SIZE; unsigned int mask = (1 << BOTTOM_UP_LOGICAL_WARP_SIZE) - 1; @@ -576,7 +594,7 @@ __global__ void bottom_up_large_degree_kernel(IndexType *left_unvisited, if (chosen_thread == logical_lane_id) { // Using only one valid parent (reduce bw) - IndexType off = atomicAdd(new_frontier_cnt, (IndexType)1); + IndexType off = traversal::atomicAdd(new_frontier_cnt, static_cast(1)); int m = 1 << (v % INT_SIZE); atomicOr(&visited[v / INT_SIZE], m); distances[v] = lvl; @@ -608,8 +626,10 @@ void bottom_up_large(IndexType *left_unvisited, { dim3 grid, block; block.x = LARGE_BOTTOMUP_DIMX; - grid.x = min((IndexType)MAXBLOCKS, - ((left_unvisited_size + block.x - 1) * BOTTOM_UP_LOGICAL_WARP_SIZE) / block.x); + grid.x = std::min( + static_cast(MAXBLOCKS), + ((static_cast(left_unvisited_size) + block.x - 1) * BOTTOM_UP_LOGICAL_WARP_SIZE) / + block.x); bottom_up_large_degree_kernel<<>>(left_unvisited, left_unvisited_size, @@ -622,7 +642,7 @@ void bottom_up_large(IndexType *left_unvisited, distances, predecessors, edge_mask); - CUDA_CHECK_LAST(); + CHECK_CUDA(m_stream); } // @@ -704,18 +724,27 @@ __global__ void topdown_expand_kernel( __shared__ IndexType block_n_frontier_candidates; IndexType block_offset = (blockDim.x * blockIdx.x) * max_items_per_thread; + + // When this kernel is converted to support different VT and ET, this + // will likely split into invalid_vid and invalid_eid + // This is equivalent to ~IndexType(0) (i.e., all bits set to 1) + constexpr IndexType invalid_idx = cugraph::invalid_idx::value; + IndexType n_items_per_thread_left = - (totaldegree - block_offset + TOP_DOWN_EXPAND_DIMX - 1) / TOP_DOWN_EXPAND_DIMX; + (totaldegree > block_offset) + ? (totaldegree - block_offset + TOP_DOWN_EXPAND_DIMX - 1) / TOP_DOWN_EXPAND_DIMX + : 0; n_items_per_thread_left = min(max_items_per_thread, n_items_per_thread_left); for (; (n_items_per_thread_left > 0) && (block_offset < totaldegree); block_offset += MAX_ITEMS_PER_THREAD_PER_OFFSETS_LOAD * blockDim.x, - n_items_per_thread_left -= MAX_ITEMS_PER_THREAD_PER_OFFSETS_LOAD) { + n_items_per_thread_left -= min( + n_items_per_thread_left, static_cast(MAX_ITEMS_PER_THREAD_PER_OFFSETS_LOAD))) { // In this loop, we will process batch_set_size batches IndexType nitems_per_thread = - min(n_items_per_thread_left, (IndexType)MAX_ITEMS_PER_THREAD_PER_OFFSETS_LOAD); + min(n_items_per_thread_left, static_cast(MAX_ITEMS_PER_THREAD_PER_OFFSETS_LOAD)); // Loading buckets offset (see compute_bucket_offsets_kernel) @@ -803,8 +832,9 @@ __global__ void topdown_expand_kernel( // We process TOP_DOWN_BATCH_SIZE edge in parallel (instruction // parallism) Reduces latency - IndexType current_max_edge_index = - min(block_offset + (left + nitems_per_thread_for_this_load) * blockDim.x, totaldegree); + IndexType current_max_edge_index = min( + static_cast(block_offset) + (left + nitems_per_thread_for_this_load) * blockDim.x, + static_cast(totaldegree)); // We will need vec_u (source of the edge) until the end if we need to // save the predecessors For others informations, we will reuse pointers @@ -834,8 +864,8 @@ __global__ void topdown_expand_kernel( vec_u[iv] = frontier[k]; // origin of this edge vec_frontier_degrees_exclusive_sum_index[iv] = frontier_degrees_exclusive_sum[k]; } else { - vec_u[iv] = -1; - vec_frontier_degrees_exclusive_sum_index[iv] = -1; + vec_u[iv] = invalid_idx; + vec_frontier_degrees_exclusive_sum_index[iv] = invalid_idx; } } @@ -844,7 +874,7 @@ __global__ void topdown_expand_kernel( for (int iv = 0; iv < TOP_DOWN_BATCH_SIZE; ++iv) { 
IndexType u = vec_u[iv]; // row_ptr for this vertex origin u - vec_row_ptr_u[iv] = (u != -1) ? row_ptr[u] : -1; + vec_row_ptr_u[iv] = (u != invalid_idx) ? row_ptr[u] : invalid_idx; } // We won't need row_ptr after that, reusing pointer @@ -856,12 +886,18 @@ __global__ void topdown_expand_kernel( IndexType gid = block_offset + thread_item_index * blockDim.x + threadIdx.x; IndexType row_ptr_u = vec_row_ptr_u[iv]; - IndexType edge = row_ptr_u + gid - vec_frontier_degrees_exclusive_sum_index[iv]; - - if (edge_mask && !edge_mask[edge]) row_ptr_u = -1; // disabling edge - - // Destination of this edge - vec_dest_v[iv] = (row_ptr_u != -1) ? col_ind[edge] : -1; + // Need this check so that we don't use invalid values of edge to index + if (row_ptr_u != invalid_idx) { + IndexType edge = row_ptr_u + gid - vec_frontier_degrees_exclusive_sum_index[iv]; + + if (edge_mask && !edge_mask[edge]) { + // Disabling edge + row_ptr_u = invalid_idx; + } else { + // Destination of this edge + vec_dest_v[iv] = col_ind[edge]; + } + } } // We don't need vec_frontier_degrees_exclusive_sum_index anymore @@ -874,7 +910,7 @@ __global__ void topdown_expand_kernel( for (int iv = 0; iv < TOP_DOWN_BATCH_SIZE; ++iv) { IndexType v = vec_dest_v[iv]; vec_v_visited_bmap[iv] = - (v != -1) ? previous_bmap[v / INT_SIZE] : (~0); // will look visited + (v != invalid_idx) ? previous_bmap[v / INT_SIZE] : (~int(0)); // will look visited } // From now on we will consider v as a frontier candidate @@ -889,7 +925,7 @@ __global__ void topdown_expand_kernel( int is_visited = vec_v_visited_bmap[iv] & m; - if (is_visited) vec_frontier_candidate[iv] = -1; + if (is_visited) vec_frontier_candidate[iv] = invalid_idx; } // Each source should update the destination shortest path counter @@ -898,7 +934,7 @@ __global__ void topdown_expand_kernel( #pragma unroll for (int iv = 0; iv < TOP_DOWN_BATCH_SIZE; ++iv) { IndexType dst = vec_frontier_candidate[iv]; - if (dst != -1) { + if (dst != invalid_idx) { IndexType src = vec_u[iv]; atomicAdd(&sp_counters[dst], sp_counters[src]); } @@ -912,7 +948,7 @@ __global__ void topdown_expand_kernel( #pragma unroll for (int iv = 0; iv < TOP_DOWN_BATCH_SIZE; ++iv) { IndexType v = vec_frontier_candidate[iv]; - vec_is_isolated_bmap[iv] = (v != -1) ? isolated_bmap[v / INT_SIZE] : -1; + vec_is_isolated_bmap[iv] = (v != invalid_idx) ? isolated_bmap[v / INT_SIZE] : ~int(0); } #pragma unroll @@ -928,7 +964,7 @@ __global__ void topdown_expand_kernel( // visited, and save distance and predecessor here. 
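
// The -1 sentinels are replaced throughout by cugraph::invalid_idx<T>::value,
// a value with every bit set, so the comparisons also work for unsigned vertex
// types (static_cast<T>(-1) == ~T(0)). A standalone equivalent (sketch; the
// real trait lives in the cugraph headers):
#include <cstdint>
#include <type_traits>
template <typename T>
struct invalid_idx_sketch {
  static_assert(std::is_integral<T>::value, "index type must be integral");
  static constexpr T value = static_cast<T>(-1);  // all bits set
};
static_assert(invalid_idx_sketch<uint32_t>::value == 0xffffffffu, "unsigned: max value");
static_assert(invalid_idx_sketch<int32_t>::value == int32_t{-1}, "signed: -1");
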
Not need to // check return value of atomicOr - if (is_isolated && v != -1) { + if (is_isolated && v != invalid_idx) { int m = 1 << (v % INT_SIZE); atomicOr(&bmap[v / INT_SIZE], m); if (distances) distances[v] = lvl; @@ -936,7 +972,7 @@ __global__ void topdown_expand_kernel( if (predecessors) predecessors[v] = vec_u[iv]; // This is no longer a candidate, neutralize it - vec_frontier_candidate[iv] = -1; + vec_frontier_candidate[iv] = invalid_idx; } } } @@ -947,7 +983,7 @@ __global__ void topdown_expand_kernel( #pragma unroll for (int iv = 0; iv < TOP_DOWN_BATCH_SIZE; ++iv) { IndexType v = vec_frontier_candidate[iv]; - if (v != -1) ++thread_n_frontier_candidates; + if (v != invalid_idx) ++thread_n_frontier_candidates; } // We need to have all nfrontier_candidates to be ready before doing the @@ -965,7 +1001,7 @@ __global__ void topdown_expand_kernel( // May have bank conflicts IndexType frontier_candidate = vec_frontier_candidate[iv]; - if (frontier_candidate != -1) { + if (frontier_candidate != invalid_idx) { shared_local_new_frontier_candidates[thread_frontier_candidate_offset] = frontier_candidate; shared_local_new_frontier_predecessors[thread_frontier_candidate_offset] = vec_u[iv]; @@ -990,7 +1026,7 @@ __global__ void topdown_expand_kernel( #pragma unroll for (int iv = 0; iv < TOP_DOWN_BATCH_SIZE; ++iv) { const int idx_shared = iv * blockDim.x + threadIdx.x; - vec_frontier_accepted_vertex[iv] = -1; + vec_frontier_accepted_vertex[iv] = invalid_idx; if (idx_shared < block_n_frontier_candidates) { IndexType v = shared_local_new_frontier_candidates[idx_shared]; // popping @@ -1024,7 +1060,7 @@ __global__ void topdown_expand_kernel( // for this thread, thread_new_frontier_offset + has_successor // (exclusive sum) if (inclusive_sum) - frontier_common_block_offset = atomicAdd(new_frontier_cnt, inclusive_sum); + frontier_common_block_offset = traversal::atomicAdd(new_frontier_cnt, inclusive_sum); } // Broadcasting frontier_common_block_offset @@ -1036,7 +1072,7 @@ __global__ void topdown_expand_kernel( if (idx_shared < block_n_frontier_candidates) { IndexType new_frontier_vertex = vec_frontier_accepted_vertex[iv]; - if (new_frontier_vertex != -1) { + if (new_frontier_vertex != invalid_idx) { IndexType off = frontier_common_block_offset + thread_new_frontier_offset++; new_frontier[off] = new_frontier_vertex; } @@ -1084,12 +1120,14 @@ void frontier_expand(const IndexType *row_ptr, dim3 block; block.x = TOP_DOWN_EXPAND_DIMX; - IndexType max_items_per_thread = (totaldegree + MAXBLOCKS * block.x - 1) / (MAXBLOCKS * block.x); + IndexType max_items_per_thread = + (static_cast(totaldegree) + MAXBLOCKS * block.x - 1) / (MAXBLOCKS * block.x); dim3 grid; - grid.x = - min((totaldegree + max_items_per_thread * block.x - 1) / (max_items_per_thread * block.x), - (IndexType)MAXBLOCKS); + grid.x = std::min((static_cast(totaldegree) + max_items_per_thread * block.x - 1) / + (max_items_per_thread * block.x), + static_cast(MAXBLOCKS)); + // Shortest Path counting (Betweenness Centrality) // We need to keep track of the previously visited bmap @@ -1117,123 +1155,7 @@ void frontier_expand(const IndexType *row_ptr, edge_mask, isolated_bmap, directed); - CUDA_CHECK_LAST(); -} - -template -__global__ void flag_isolated_vertices_kernel(IndexType n, - int *isolated_bmap, - const IndexType *row_ptr, - IndexType *degrees, - IndexType *nisolated) -{ - typedef cub::BlockLoad - BlockLoad; - typedef cub::BlockStore - BlockStore; - typedef cub::BlockReduce BlockReduce; - typedef cub::WarpReduce WarpReduce; - - __shared__ 
typename BlockLoad::TempStorage load_temp_storage; - __shared__ typename BlockStore::TempStorage store_temp_storage; - __shared__ typename BlockReduce::TempStorage block_reduce_temp_storage; - - __shared__ typename WarpReduce::TempStorage - warp_reduce_temp_storage[FLAG_ISOLATED_VERTICES_DIMX / FLAG_ISOLATED_VERTICES_THREADS_PER_INT]; - - __shared__ IndexType row_ptr_tail[FLAG_ISOLATED_VERTICES_DIMX]; - - for (IndexType block_off = FLAG_ISOLATED_VERTICES_VERTICES_PER_THREAD * (blockDim.x * blockIdx.x); - block_off < n; - block_off += FLAG_ISOLATED_VERTICES_VERTICES_PER_THREAD * (blockDim.x * gridDim.x)) { - IndexType thread_off = block_off + FLAG_ISOLATED_VERTICES_VERTICES_PER_THREAD * threadIdx.x; - IndexType last_node_thread = thread_off + FLAG_ISOLATED_VERTICES_VERTICES_PER_THREAD - 1; - - IndexType thread_row_ptr[FLAG_ISOLATED_VERTICES_VERTICES_PER_THREAD]; - IndexType block_valid_items = n - block_off + 1; //+1, we need row_ptr[last_node+1] - - BlockLoad(load_temp_storage).Load(row_ptr + block_off, thread_row_ptr, block_valid_items, -1); - - // To compute 4 degrees, we need 5 values of row_ptr - // Saving the "5th" value in shared memory for previous thread to use - if (threadIdx.x > 0) { row_ptr_tail[threadIdx.x - 1] = thread_row_ptr[0]; } - - // If this is the last thread, it needs to load its row ptr tail value - if (threadIdx.x == (FLAG_ISOLATED_VERTICES_DIMX - 1) && last_node_thread < n) { - row_ptr_tail[threadIdx.x] = row_ptr[last_node_thread + 1]; - } - __syncthreads(); // we may reuse temp_storage - - int local_isolated_bmap = 0; - - IndexType imax = (n - thread_off); - - IndexType local_degree[FLAG_ISOLATED_VERTICES_VERTICES_PER_THREAD]; - -#pragma unroll - for (int i = 0; i < (FLAG_ISOLATED_VERTICES_VERTICES_PER_THREAD - 1); ++i) { - IndexType degree = local_degree[i] = thread_row_ptr[i + 1] - thread_row_ptr[i]; - - if (i < imax) local_isolated_bmap |= ((degree == 0) << i); - } - - if (last_node_thread < n) { - IndexType degree = local_degree[FLAG_ISOLATED_VERTICES_VERTICES_PER_THREAD - 1] = - row_ptr_tail[threadIdx.x] - thread_row_ptr[FLAG_ISOLATED_VERTICES_VERTICES_PER_THREAD - 1]; - - local_isolated_bmap |= ((degree == 0) << (FLAG_ISOLATED_VERTICES_VERTICES_PER_THREAD - 1)); - } - - local_isolated_bmap <<= (thread_off % INT_SIZE); - - IndexType local_nisolated = __popc(local_isolated_bmap); - - // We need local_nisolated and local_isolated_bmap to be ready for next - // steps - __syncthreads(); - - IndexType total_nisolated = BlockReduce(block_reduce_temp_storage).Sum(local_nisolated); - - if (threadIdx.x == 0 && total_nisolated) { atomicAdd(nisolated, total_nisolated); } - - int logicalwarpid = threadIdx.x / FLAG_ISOLATED_VERTICES_THREADS_PER_INT; - - // Building int for bmap - int int_aggregate_isolated_bmap = WarpReduce(warp_reduce_temp_storage[logicalwarpid]) - .Reduce(local_isolated_bmap, traversal::BitwiseOr()); - - int is_head_of_visited_int = ((threadIdx.x % (FLAG_ISOLATED_VERTICES_THREADS_PER_INT)) == 0); - if (is_head_of_visited_int) { - isolated_bmap[thread_off / INT_SIZE] = int_aggregate_isolated_bmap; - } - - BlockStore(store_temp_storage).Store(degrees + block_off, local_degree, block_valid_items); - } -} - -template -void flag_isolated_vertices(IndexType n, - int *isolated_bmap, - const IndexType *row_ptr, - IndexType *degrees, - IndexType *nisolated, - cudaStream_t m_stream) -{ - dim3 grid, block; - block.x = FLAG_ISOLATED_VERTICES_DIMX; - - grid.x = min((IndexType)MAXBLOCKS, - (n / FLAG_ISOLATED_VERTICES_VERTICES_PER_THREAD + 1 + block.x - 1) / block.x); 
- - flag_isolated_vertices_kernel<<>>( - n, isolated_bmap, row_ptr, degrees, nisolated); - CUDA_CHECK_LAST(); + CHECK_CUDA(m_stream); } } // namespace bfs_kernels diff --git a/cpp/src/traversal/mg/bfs.cuh b/cpp/src/traversal/mg/bfs.cuh new file mode 100644 index 00000000000..b053a6ff75a --- /dev/null +++ b/cpp/src/traversal/mg/bfs.cuh @@ -0,0 +1,170 @@ +/* + * Copyright (c) 2020, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include +#include "../traversal_common.cuh" +#include "common_utils.cuh" +#include "frontier_expand.cuh" + +namespace cugraph { + +namespace mg { + +namespace detail { + +template +void bfs_traverse(raft::handle_t const &handle, + cugraph::GraphCSRView const &graph, + const vertex_t start_vertex, + rmm::device_vector &visited_bmap, + rmm::device_vector &output_frontier_bmap, + operator_t &bfs_op) +{ + // Frontiers required for BFS + rmm::device_vector input_frontier(graph.number_of_vertices); + rmm::device_vector output_frontier(graph.number_of_vertices); + + // Bitmaps required for BFS + size_t word_count = detail::number_of_words(graph.number_of_vertices); + rmm::device_vector isolated_bmap(word_count, 0); + rmm::device_vector unique_bmap(word_count, 0); + rmm::device_vector temp_buffer_len(handle.get_comms().get_size()); + + // Reusing buffers to create isolated bitmap + { + rmm::device_vector &local_isolated_ids = input_frontier; + rmm::device_vector &global_isolated_ids = output_frontier; + detail::create_isolated_bitmap( + handle, graph, local_isolated_ids, global_isolated_ids, temp_buffer_len, isolated_bmap); + } + + if (is_vertex_isolated(isolated_bmap, start_vertex)) { return; } + + // Frontier Expand for calls to bfs functors + detail::FrontierExpand fexp(handle, graph); + + cudaStream_t stream = handle.get_stream(); + + // Initialize input frontier + input_frontier[0] = start_vertex; + vertex_t input_frontier_len = 1; + + do { + // Mark all input frontier vertices as visited + detail::add_to_bitmap(handle, visited_bmap, input_frontier, input_frontier_len); + + bfs_op.increment_level(); + + // Remove duplicates,isolated and out of partition vertices + // from input_frontier and store it to output_frontier + input_frontier_len = detail::preprocess_input_frontier(handle, + graph, + unique_bmap, + isolated_bmap, + input_frontier, + input_frontier_len, + output_frontier); + // Swap input and output frontier + input_frontier.swap(output_frontier); + + // Clear output frontier bitmap + thrust::fill(rmm::exec_policy(stream)->on(stream), + output_frontier_bmap.begin(), + output_frontier_bmap.end(), + static_cast(0)); + + // Generate output frontier bitmap from input frontier + vertex_t output_frontier_len = + fexp(bfs_op, input_frontier, input_frontier_len, output_frontier); + + // Collect output_frontier from all ranks to input_frontier + // If not empty then we proceed to next iteration. 
+ // Note that its an error to remove duplicates and non local + // start vertices here since it is possible that doing so will + // result in input_frontier_len to be 0. That would cause some + // ranks to go ahead with the iteration and some to terminate. + // This would further cause a nccl communication error since + // not every rank participates in broadcast/allgather in + // subsequent calls + input_frontier_len = detail::collect_vectors( + handle, temp_buffer_len, output_frontier, output_frontier_len, input_frontier); + + } while (input_frontier_len != 0); +} + +} // namespace detail + +template +void bfs(raft::handle_t const &handle, + cugraph::GraphCSRView const &graph, + vertex_t *distances, + vertex_t *predecessors, + const vertex_t start_vertex) +{ + CUGRAPH_EXPECTS(handle.comms_initialized(), + "cugraph::mg::bfs() expected to work only in multi gpu case."); + + // Distances and predecessors are of the size global_number_of_vertices + vertex_t global_number_of_vertices = detail::get_global_vertex_count(handle, graph); + + size_t word_count = detail::number_of_words(global_number_of_vertices); + rmm::device_vector visited_bmap(word_count, 0); + rmm::device_vector output_frontier_bmap(word_count, 0); + + cudaStream_t stream = handle.get_stream(); + + // Set all predecessors to be invalid vertex ids + thrust::fill(rmm::exec_policy(stream)->on(stream), + predecessors, + predecessors + global_number_of_vertices, + cugraph::invalid_idx::value); + + if (distances == nullptr) { + detail::BFSStepNoDist bfs_op( + output_frontier_bmap.data().get(), visited_bmap.data().get(), predecessors); + + detail::bfs_traverse(handle, graph, start_vertex, visited_bmap, output_frontier_bmap, bfs_op); + + } else { + // Update distances to max distances everywhere except start_vertex + // where it is set to 0 + detail::fill_max_dist(handle, graph, start_vertex, global_number_of_vertices, distances); + + detail::BFSStep bfs_op( + output_frontier_bmap.data().get(), visited_bmap.data().get(), predecessors, distances); + + detail::bfs_traverse(handle, graph, start_vertex, visited_bmap, output_frontier_bmap, bfs_op); + + // In place reduce to collect distances + if (handle.comms_initialized()) { + handle.get_comms().allreduce( + distances, distances, global_number_of_vertices, raft::comms::op_t::MIN, stream); + } + } + + // In place reduce to collect predecessors + if (handle.comms_initialized()) { + auto op = raft::comms::op_t::MIN; + if (std::is_signed::value) { op = raft::comms::op_t::MAX; } + handle.get_comms().allreduce(predecessors, predecessors, global_number_of_vertices, op, stream); + } +} + +} // namespace mg + +} // namespace cugraph diff --git a/cpp/src/traversal/mg/common_utils.cuh b/cpp/src/traversal/mg/common_utils.cuh new file mode 100644 index 00000000000..6199730c28f --- /dev/null +++ b/cpp/src/traversal/mg/common_utils.cuh @@ -0,0 +1,495 @@ +/* + * Copyright (c) 2020, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
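
// mg::bfs above merges rank-local results element-wise: distances with a MIN
// allreduce (unvisited entries hold the maximum distance), and predecessors
// with MIN when invalid_idx (~0) is the largest representable value (unsigned
// vertex_t) but MAX when it is -1 and therefore the smallest (signed
// vertex_t). The selection rule in isolation (sketch):
#include <type_traits>
template <typename vertex_t>
raft::comms::op_t predecessor_reduce_op()
{
  return std::is_signed<vertex_t>::value ? raft::comms::op_t::MAX
                                         : raft::comms::op_t::MIN;
}
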
+ */ + +#pragma once + +#include +#include +#include +#include "../traversal_common.cuh" + +namespace cugraph { + +namespace mg { + +namespace detail { + +template +constexpr int BitsPWrd = sizeof(degree_t) * 8; + +template +constexpr int NumberBins = sizeof(degree_t) * 8 + 1; + +template +constexpr inline return_t number_of_words(return_t number_of_bits) +{ + return raft::div_rounding_up_safe(number_of_bits, static_cast(BitsPWrd)); +} + +template +struct isDegreeZero { + edge_t const *offset_; + isDegreeZero(edge_t const *offset) : offset_(offset) {} + + __device__ bool operator()(const edge_t &id) const { return (offset_[id + 1] == offset_[id]); } +}; + +struct set_nth_bit { + uint32_t *bmap_; + set_nth_bit(uint32_t *bmap) : bmap_(bmap) {} + + template + __device__ void operator()(const return_t &id) + { + atomicOr(bmap_ + (id / BitsPWrd), (uint32_t{1} << (id % BitsPWrd))); + } +}; + +template +bool is_vertex_isolated(rmm::device_vector &bmap, vertex_t id) +{ + uint32_t word = bmap[id / BitsPWrd]; + uint32_t active_bit = static_cast(1) << (id % BitsPWrd); + // If idth bit of bmap is set to 1 then return true + return ((active_bit & word) != 0); +} + +template +struct BFSStepNoDist { + uint32_t *output_frontier_; + uint32_t *visited_; + vertex_t *predecessors_; + + BFSStepNoDist(uint32_t *output_frontier, uint32_t *visited, vertex_t *predecessors) + : output_frontier_(output_frontier), visited_(visited), predecessors_(predecessors) + { + } + + __device__ bool operator()(vertex_t src, vertex_t dst) + { + uint32_t active_bit = static_cast(1) << (dst % BitsPWrd); + uint32_t prev_word = atomicOr(output_frontier_ + (dst / BitsPWrd), active_bit); + bool dst_not_visited_earlier = !(active_bit & visited_[dst / BitsPWrd]); + bool dst_not_visited_current = !(prev_word & active_bit); + // If this thread activates the frontier bitmap for a destination + // then the source is the predecessor of that destination + if (dst_not_visited_earlier && dst_not_visited_current) { + predecessors_[dst] = src; + return true; + } else { + return false; + } + } + + // No-op + void increment_level(void) {} +}; + +template +struct BFSStep { + uint32_t *output_frontier_; + uint32_t *visited_; + vertex_t *predecessors_; + vertex_t *distances_; + vertex_t level_; + + BFSStep(uint32_t *output_frontier, uint32_t *visited, vertex_t *predecessors, vertex_t *distances) + : output_frontier_(output_frontier), + visited_(visited), + predecessors_(predecessors), + distances_(distances), + level_(0) + { + } + + __device__ bool operator()(vertex_t src, vertex_t dst) + { + uint32_t active_bit = static_cast(1) << (dst % BitsPWrd); + uint32_t prev_word = atomicOr(output_frontier_ + (dst / BitsPWrd), active_bit); + bool dst_not_visited_earlier = !(active_bit & visited_[dst / BitsPWrd]); + bool dst_not_visited_current = !(prev_word & active_bit); + // If this thread activates the frontier bitmap for a destination + // then the source is the predecessor of that destination + if (dst_not_visited_earlier && dst_not_visited_current) { + distances_[dst] = level_; + predecessors_[dst] = src; + return true; + } else { + return false; + } + } + + void increment_level(void) { ++level_; } +}; + +template +vertex_t populate_isolated_vertices(raft::handle_t const &handle, + cugraph::GraphCSRView const &graph, + rmm::device_vector &isolated_vertex_ids) +{ + bool is_mg = (handle.comms_initialized() && (graph.local_vertices != nullptr) && + (graph.local_offsets != nullptr)); + cudaStream_t stream = handle.get_stream(); + + edge_t vertex_begin_, 
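
// Every bitmap helper in this file uses the same addressing: bit (id % 32) of
// word (id / 32), with BitsPWrd<uint32_t> == 32. The set-and-test idiom that
// set_nth_bit and the BFSStep functors rely on, in isolation (sketch):
__device__ inline bool set_bit_once(uint32_t* bmap, uint32_t id)
{
  uint32_t bit  = uint32_t{1} << (id % 32);
  uint32_t prev = atomicOr(bmap + (id / 32), bit);  // atomicOr returns the old word
  return (prev & bit) == 0;  // true iff this thread was the first to set the bit
}
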
vertex_end_; + if (is_mg) { + vertex_begin_ = graph.local_offsets[handle.get_comms().get_rank()]; + vertex_end_ = graph.local_offsets[handle.get_comms().get_rank()] + + graph.local_vertices[handle.get_comms().get_rank()]; + } else { + vertex_begin_ = 0; + vertex_end_ = graph.number_of_vertices; + } + auto count = thrust::copy_if(rmm::exec_policy(stream)->on(stream), + thrust::make_counting_iterator(vertex_begin_), + thrust::make_counting_iterator(vertex_end_), + thrust::make_counting_iterator(0), + isolated_vertex_ids.begin(), + isDegreeZero(graph.offsets)) - + isolated_vertex_ids.begin(); + return static_cast(count); +} + +template +return_t collect_vectors(raft::handle_t const &handle, + rmm::device_vector &buffer_len, + rmm::device_vector &local, + return_t local_count, + rmm::device_vector &global) +{ + CHECK_CUDA(handle.get_stream()); + buffer_len.resize(handle.get_comms().get_size()); + auto my_rank = handle.get_comms().get_rank(); + buffer_len[my_rank] = static_cast(local_count); + handle.get_comms().allgather( + buffer_len.data().get() + my_rank, buffer_len.data().get(), 1, handle.get_stream()); + CHECK_CUDA(handle.get_stream()); + // buffer_len now contains the lengths of all local buffers + // for all ranks + + thrust::host_vector h_buffer_len = buffer_len; + // h_buffer_offsets has to be int because raft allgatherv expects + // int array for displacement vector. This should be changed in + // raft so that the displacement is templated + thrust::host_vector h_buffer_offsets(h_buffer_len.size()); + + thrust::exclusive_scan( + thrust::host, h_buffer_len.begin(), h_buffer_len.end(), h_buffer_offsets.begin()); + return_t global_buffer_len = h_buffer_len.back() + h_buffer_offsets.back(); + + handle.get_comms().allgatherv(local.data().get(), + global.data().get(), + h_buffer_len.data(), + h_buffer_offsets.data(), + handle.get_stream()); + CHECK_CUDA(handle.get_stream()); + return global_buffer_len; +} + +template +void add_to_bitmap(raft::handle_t const &handle, + rmm::device_vector &bmap, + rmm::device_vector &id, + return_t count) +{ + cudaStream_t stream = handle.get_stream(); + thrust::for_each(rmm::exec_policy(stream)->on(stream), + id.begin(), + id.begin() + count, + set_nth_bit(bmap.data().get())); + CHECK_CUDA(stream); +} + +// For all vertex ids i which are isolated (out degree is 0), set +// ith bit of isolated_bmap to 1 +template +void create_isolated_bitmap(raft::handle_t const &handle, + cugraph::GraphCSRView const &graph, + rmm::device_vector &local_isolated_ids, + rmm::device_vector &global_isolated_ids, + rmm::device_vector &temp_buffer_len, + rmm::device_vector &isolated_bmap) +{ + size_t word_count = detail::number_of_words(graph.number_of_vertices); + local_isolated_ids.resize(graph.number_of_vertices); + global_isolated_ids.resize(graph.number_of_vertices); + temp_buffer_len.resize(handle.get_comms().get_size()); + isolated_bmap.resize(word_count); + + vertex_t local_isolated_count = populate_isolated_vertices(handle, graph, local_isolated_ids); + vertex_t global_isolated_count = collect_vectors( + handle, temp_buffer_len, local_isolated_ids, local_isolated_count, global_isolated_ids); + add_to_bitmap(handle, isolated_bmap, global_isolated_ids, global_isolated_count); +} + +template +return_t remove_duplicates(raft::handle_t const &handle, + rmm::device_vector &data, + return_t data_len) +{ + cudaStream_t stream = handle.get_stream(); + thrust::sort(rmm::exec_policy(stream)->on(stream), data.begin(), data.begin() + data_len); + auto unique_count = + 
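
// collect_vectors above assembles a variable-length allgather the standard
// way: allgather one length per rank, exclusive-scan the lengths into
// displacements, then allgatherv the payloads. The host-side index math with
// the standard library (sketch; the file itself uses thrust::exclusive_scan):
#include <numeric>
#include <vector>
std::vector<int> displacements(const std::vector<int>& lengths)
{
  std::vector<int> offs(lengths.size());
  std::exclusive_scan(lengths.begin(), lengths.end(), offs.begin(), 0);
  return offs;  // total received = offs.back() + lengths.back()
}
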
thrust::unique(rmm::exec_policy(stream)->on(stream), data.begin(), data.begin() + data_len) - + data.begin(); + return static_cast(unique_count); +} + +// Use the fact that any value in id array can only be in +// the range [id_begin, id_end) to create a unique set of +// ids. bmap is expected to be of the length +// id_end/BitsPWrd and is set to 0 initially +template +__global__ void remove_duplicates_kernel(uint32_t *bmap, + return_t *in_id, + return_t id_begin, + return_t id_end, + return_t count, + return_t *out_id, + return_t *out_count) +{ + return_t tid = blockIdx.x * blockDim.x + threadIdx.x; + return_t id; + if (tid < count) { + id = in_id[tid]; + } else { + // Invalid vertex id to avoid partial thread block execution + id = id_end; + } + + int acceptable_vertex = 0; + // If id is not in the acceptable range then set it to + // an invalid vertex id + if ((id >= id_begin) && (id < id_end)) { + uint32_t active_bit = static_cast(1) << (id % BitsPWrd); + uint32_t prev_word = atomicOr(bmap + (id / BitsPWrd), active_bit); + // If bit was set by this thread then the id is unique + if (!(prev_word & active_bit)) { acceptable_vertex = 1; } + } + + __shared__ return_t block_offset; + typedef cub::BlockScan BlockScan; + __shared__ typename BlockScan::TempStorage temp_storage; + int thread_write_offset; + int block_acceptable_vertex_count; + BlockScan(temp_storage) + .ExclusiveSum(acceptable_vertex, thread_write_offset, block_acceptable_vertex_count); + + // If the block is not going to write unique ids then return + if (block_acceptable_vertex_count == 0) { return; } + + if (threadIdx.x == 0) { + block_offset = cugraph::detail::traversal::atomicAdd( + out_count, static_cast(block_acceptable_vertex_count)); + } + __syncthreads(); + + if (acceptable_vertex) { out_id[block_offset + thread_write_offset] = id; } +} + +template +__global__ void remove_duplicates_kernel(uint32_t *bmap, + uint32_t *isolated_bmap, + return_t *in_id, + return_t id_begin, + return_t id_end, + return_t count, + return_t *out_id, + return_t *out_count) +{ + return_t tid = blockIdx.x * blockDim.x + threadIdx.x; + return_t id; + if (tid < count) { + id = in_id[tid]; + } else { + // Invalid vertex id to avoid partial thread block execution + id = id_end; + } + + int acceptable_vertex = 0; + // If id is not in the acceptable range then set it to + // an invalid vertex id + if ((id >= id_begin) && (id < id_end)) { + uint32_t active_bit = static_cast(1) << (id % BitsPWrd); + uint32_t prev_word = atomicOr(bmap + (id / BitsPWrd), active_bit); + // If bit was set by this thread then the id is unique + if (!(prev_word & active_bit)) { + // If id is isolated (out-degree == 0) then mark it as unacceptable + bool is_dst_isolated = active_bit & isolated_bmap[id / BitsPWrd]; + acceptable_vertex = !is_dst_isolated; + } + } + + __shared__ return_t block_offset; + typedef cub::BlockScan BlockScan; + __shared__ typename BlockScan::TempStorage temp_storage; + int thread_write_offset; + int block_acceptable_vertex_count; + BlockScan(temp_storage) + .ExclusiveSum(acceptable_vertex, thread_write_offset, block_acceptable_vertex_count); + + // If the block is not going to write unique ids then return + if (block_acceptable_vertex_count == 0) { return; } + + if (threadIdx.x == 0) { + block_offset = cugraph::detail::traversal::atomicAdd( + out_count, static_cast(block_acceptable_vertex_count)); + } + __syncthreads(); + + if (acceptable_vertex) { out_id[block_offset + thread_write_offset] = id; } +} + +template +return_t 
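
// Both remove_duplicates_kernel overloads compact the accepted ids with the
// standard block-scan pattern: exclusive-sum the per-thread 0/1 flags, reserve
// a contiguous output range for the whole block with one atomicAdd, then
// scatter each accepted id to base + its scan offset. The pattern in isolation
// (sketch; assumes 256 threads per block):
#include <cub/cub.cuh>
__global__ void compact_flags(const int* flags, const int* ids, int n,
                              int* out, int* out_count)
{
  typedef cub::BlockScan<int, 256> BlockScan;
  __shared__ typename BlockScan::TempStorage temp;
  __shared__ int base;
  int tid  = blockIdx.x * 256 + threadIdx.x;
  int flag = (tid < n) ? flags[tid] : 0;
  int offset, total;
  BlockScan(temp).ExclusiveSum(flag, offset, total);         // offset within block
  if (threadIdx.x == 0) base = atomicAdd(out_count, total);  // reserve output range
  __syncthreads();
  if (flag) out[base + offset] = ids[tid];
}
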
remove_duplicates(raft::handle_t const &handle, + rmm::device_vector &bmap, + rmm::device_vector &data, + return_t data_len, + return_t data_begin, + return_t data_end, + rmm::device_vector &out_data) +{ + cudaStream_t stream = handle.get_stream(); + + rmm::device_vector unique_count(1, 0); + + thrust::fill( + rmm::exec_policy(stream)->on(stream), bmap.begin(), bmap.end(), static_cast(0)); + constexpr return_t threads = 256; + return_t blocks = raft::div_rounding_up_safe(data_len, threads); + remove_duplicates_kernel<<>>(bmap.data().get(), + data.data().get(), + data_begin, + data_end, + data_len, + out_data.data().get(), + unique_count.data().get()); + CHECK_CUDA(stream); + return static_cast(unique_count[0]); +} + +template +vertex_t preprocess_input_frontier(raft::handle_t const &handle, + cugraph::GraphCSRView const &graph, + rmm::device_vector &bmap, + rmm::device_vector &isolated_bmap, + rmm::device_vector &input_frontier, + vertex_t input_frontier_len, + rmm::device_vector &output_frontier) +{ + cudaStream_t stream = handle.get_stream(); + + vertex_t vertex_begin = graph.local_offsets[handle.get_comms().get_rank()]; + vertex_t vertex_end = graph.local_offsets[handle.get_comms().get_rank()] + + graph.local_vertices[handle.get_comms().get_rank()]; + rmm::device_vector unique_count(1, 0); + + thrust::fill( + rmm::exec_policy(stream)->on(stream), bmap.begin(), bmap.end(), static_cast(0)); + constexpr vertex_t threads = 256; + vertex_t blocks = raft::div_rounding_up_safe(input_frontier_len, threads); + remove_duplicates_kernel<<>>(bmap.data().get(), + isolated_bmap.data().get(), + input_frontier.data().get(), + vertex_begin, + vertex_end, + input_frontier_len, + output_frontier.data().get(), + unique_count.data().get()); + CHECK_CUDA(stream); + return static_cast(unique_count[0]); +} + +template +vertex_t preprocess_input_frontier(raft::handle_t const &handle, + cugraph::GraphCSRView const &graph, + rmm::device_vector &bmap, + rmm::device_vector &input_frontier, + vertex_t input_frontier_len, + rmm::device_vector &output_frontier) +{ + cudaStream_t stream = handle.get_stream(); + + vertex_t vertex_begin = graph.local_offsets[handle.get_comms().get_rank()]; + vertex_t vertex_end = graph.local_offsets[handle.get_comms().get_rank()] + + graph.local_vertices[handle.get_comms().get_rank()]; + rmm::device_vector unique_count(1, 0); + + thrust::fill( + rmm::exec_policy(stream)->on(stream), bmap.begin(), bmap.end(), static_cast(0)); + constexpr vertex_t threads = 256; + vertex_t blocks = raft::div_rounding_up_safe(input_frontier_len, threads); + remove_duplicates_kernel<<>>(bmap.data().get(), + input_frontier.data().get(), + vertex_begin, + vertex_end, + input_frontier_len, + output_frontier.data().get(), + unique_count.data().get()); + CHECK_CUDA(stream); + return static_cast(unique_count[0]); +} + +template +__global__ void fill_kernel(vertex_t *distances, vertex_t count, vertex_t start_vertex) +{ + vertex_t tid = blockIdx.x * blockDim.x + threadIdx.x; + if (tid >= count) { return; } + if (tid == start_vertex) { + distances[tid] = vertex_t{0}; + } else { + distances[tid] = cugraph::detail::traversal::vec_t::max; + } +} + +template +void fill_max_dist(raft::handle_t const &handle, + cugraph::GraphCSRView const &graph, + vertex_t start_vertex, + vertex_t global_number_of_vertices, + vertex_t *distances) +{ + if (distances == nullptr) { return; } + vertex_t array_size = global_number_of_vertices; + constexpr vertex_t threads = 256; + vertex_t blocks = raft::div_rounding_up_safe(array_size, 
threads); + fill_kernel<<>>(distances, array_size, start_vertex); +} + +template +vertex_t get_global_vertex_count(raft::handle_t const &handle, + cugraph::GraphCSRView const &graph) +{ + rmm::device_vector id(1); + id[0] = *thrust::max_element(rmm::exec_policy(handle.get_stream())->on(handle.get_stream()), + graph.indices, + graph.indices + graph.number_of_edges); + handle.get_comms().allreduce( + id.data().get(), id.data().get(), 1, raft::comms::op_t::MAX, handle.get_stream()); + vertex_t max_vertex_id = id[0]; + + if ((graph.number_of_vertices - 1) > max_vertex_id) { + max_vertex_id = graph.number_of_vertices - 1; + } + + return max_vertex_id + 1; +} + +} // namespace detail + +} // namespace mg + +} // namespace cugraph diff --git a/cpp/src/traversal/mg/frontier_expand.cuh b/cpp/src/traversal/mg/frontier_expand.cuh new file mode 100644 index 00000000000..2733c319087 --- /dev/null +++ b/cpp/src/traversal/mg/frontier_expand.cuh @@ -0,0 +1,133 @@ +/* + * Copyright (c) 2020, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include +#include "frontier_expand_kernels.cuh" +#include "vertex_binning.cuh" + +namespace cugraph { + +namespace mg { + +namespace detail { + +template +class FrontierExpand { + raft::handle_t const &handle_; + cugraph::GraphCSRView const &graph_; + VertexBinner dist_; + rmm::device_vector reorganized_vertices_; + edge_t vertex_begin_; + edge_t vertex_end_; + rmm::device_vector output_vertex_count_; + + public: + FrontierExpand(raft::handle_t const &handle, + cugraph::GraphCSRView const &graph) + : handle_(handle), graph_(graph) + { + bool is_mg = (handle.comms_initialized() && (graph.local_vertices != nullptr) && + (graph.local_offsets != nullptr)); + if (is_mg) { + reorganized_vertices_.resize(graph.local_vertices[handle_.get_comms().get_rank()]); + vertex_begin_ = graph.local_offsets[handle_.get_comms().get_rank()]; + vertex_end_ = graph.local_offsets[handle_.get_comms().get_rank()] + + graph.local_vertices[handle_.get_comms().get_rank()]; + } else { + reorganized_vertices_.resize(graph.number_of_vertices); + vertex_begin_ = 0; + vertex_end_ = graph.number_of_vertices; + } + output_vertex_count_.resize(1); + } + + // Return the size of the output_frontier + template + vertex_t operator()(operator_t op, + rmm::device_vector &input_frontier, + vertex_t input_frontier_len, + rmm::device_vector &output_frontier) + { + if (input_frontier_len == 0) { return static_cast(0); } + cudaStream_t stream = handle_.get_stream(); + output_vertex_count_[0] = 0; + dist_.setup(graph_.offsets, nullptr, vertex_begin_, vertex_end_); + auto distribution = + dist_.run(input_frontier, input_frontier_len, reorganized_vertices_, stream); + + DegreeBucket large_bucket = distribution.degreeRange(16); + // TODO : Use other streams from handle_ + large_vertex_lb(graph_, + large_bucket, + op, + vertex_begin_, + output_frontier.data().get(), + output_vertex_count_.data().get(), + stream); + + DegreeBucket medium_bucket = 
distribution.degreeRange(12, 16); + medium_vertex_lb(graph_, + medium_bucket, + op, + vertex_begin_, + output_frontier.data().get(), + output_vertex_count_.data().get(), + stream); + + DegreeBucket small_bucket_0 = distribution.degreeRange(10, 12); + DegreeBucket small_bucket_1 = distribution.degreeRange(8, 10); + DegreeBucket small_bucket_2 = distribution.degreeRange(6, 8); + DegreeBucket small_bucket_3 = distribution.degreeRange(0, 6); + + small_vertex_lb(graph_, + small_bucket_0, + op, + vertex_begin_, + output_frontier.data().get(), + output_vertex_count_.data().get(), + stream); + small_vertex_lb(graph_, + small_bucket_1, + op, + vertex_begin_, + output_frontier.data().get(), + output_vertex_count_.data().get(), + stream); + small_vertex_lb(graph_, + small_bucket_2, + op, + vertex_begin_, + output_frontier.data().get(), + output_vertex_count_.data().get(), + stream); + small_vertex_lb(graph_, + small_bucket_3, + op, + vertex_begin_, + output_frontier.data().get(), + output_vertex_count_.data().get(), + stream); + return output_vertex_count_[0]; + } +}; + +} // namespace detail + +} // namespace mg + +} // namespace cugraph diff --git a/cpp/src/traversal/mg/frontier_expand_kernels.cuh b/cpp/src/traversal/mg/frontier_expand_kernels.cuh new file mode 100644 index 00000000000..625ec0d956f --- /dev/null +++ b/cpp/src/traversal/mg/frontier_expand_kernels.cuh @@ -0,0 +1,300 @@ +/* + * Copyright (c) 2020, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#pragma once + +#include +#include "vertex_binning.cuh" + +namespace cugraph { + +namespace mg { + +namespace detail { + +template +__device__ void write_to_frontier(vertex_t const *thread_frontier, + int thread_frontier_count, + vertex_t *block_frontier, + int *block_frontier_count, + vertex_t *output_frontier, + edge_t *block_write_offset, + edge_t *output_frontier_count) +{ + // Set frontier count for block to 0 + if (threadIdx.x == 0) { *block_frontier_count = 0; } + __syncthreads(); + + // Find out where to write the thread frontier to shared memory + int thread_write_offset = atomicAdd(block_frontier_count, thread_frontier_count); + for (int i = 0; i < thread_frontier_count; ++i) { + block_frontier[i + thread_write_offset] = thread_frontier[i]; + } + __syncthreads(); + + // If the total number of frontiers for this block is 0 then return + if (*block_frontier_count == 0) { return; } + + // Find out where to write the block frontier to global memory + if (threadIdx.x == 0) { + *block_write_offset = cugraph::detail::traversal::atomicAdd( + output_frontier_count, static_cast(*block_frontier_count)); + } + __syncthreads(); + + // Write block frontier to global memory + for (int i = threadIdx.x; i < (*block_frontier_count); i += blockDim.x) { + output_frontier[(*block_write_offset) + i] = block_frontier[i]; + } +} + +template +__global__ void block_per_vertex(edge_t const *offsets, + vertex_t const *indices, + vertex_t const *input_frontier, + vertex_t input_frontier_count, + vertex_t vertex_begin, + vertex_t *output_frontier, + edge_t *output_frontier_count, + operator_t op) +{ + if (blockIdx.x >= input_frontier_count) { return; } + + __shared__ edge_t block_write_offset; + __shared__ vertex_t block_frontier[BlockSize * EdgesPerThread]; + __shared__ int block_frontier_count; + vertex_t thread_frontier[EdgesPerThread]; + + vertex_t source = input_frontier[blockIdx.x]; + edge_t beg_edge_offset = offsets[source]; + edge_t end_edge_offset = offsets[source + 1]; + + edge_t edge_offset = threadIdx.x + beg_edge_offset; + int num_iter = (end_edge_offset - beg_edge_offset + BlockSize - 1) / BlockSize; + + int thread_frontier_count = 0; + for (int i = 0; i < num_iter; ++i) { + if (edge_offset < end_edge_offset) { + vertex_t destination = indices[edge_offset]; + // If operator returns true then add to local frontier + if (op(source + vertex_begin, destination)) { + thread_frontier[thread_frontier_count++] = destination; + } + } + bool is_last_iter = (i == (num_iter - 1)); + bool is_nth_iter = (i % EdgesPerThread == 0); + // Write to frontier every EdgesPerThread iterations + // Or if it is the last iteration of the for loop + if (is_nth_iter || is_last_iter) { + write_to_frontier(thread_frontier, + thread_frontier_count, + block_frontier, + &block_frontier_count, + output_frontier, + &block_write_offset, + output_frontier_count); + thread_frontier_count = 0; + } + edge_offset += blockDim.x; + } +} + +template +__global__ void kernel_per_vertex(edge_t const *offsets, + vertex_t const *indices, + vertex_t const *input_frontier, + vertex_t input_frontier_count, + vertex_t vertex_begin, + vertex_t *output_frontier, + edge_t *output_frontier_count, + operator_t op) +{ + vertex_t current_vertex_index = 0; + __shared__ edge_t block_write_offset; + __shared__ vertex_t block_frontier[BlockSize * EdgesPerThread]; + __shared__ int block_frontier_count; + + edge_t stride = blockDim.x * gridDim.x; + vertex_t thread_frontier[EdgesPerThread]; + + while (current_vertex_index < input_frontier_count) { + 
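      // All blocks of the grid walk the input frontier together here: for each
      // source vertex every block starts at its own offset into the adjacency
      // list and the whole grid strides across it, so a single very high
      // degree vertex is spread over the entire GPU rather than being
      // serialized on one block.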
vertex_t source = input_frontier[current_vertex_index]; + edge_t beg_block_offset = offsets[source] + (blockIdx.x * blockDim.x); + edge_t end_block_offset = offsets[source + 1]; + int i = 0; + int thread_frontier_count = 0; + for (edge_t block_offset = beg_block_offset; block_offset < end_block_offset; + block_offset += stride) { + if (block_offset + threadIdx.x < end_block_offset) { + vertex_t destination = indices[block_offset + threadIdx.x]; + if (op(source + vertex_begin, destination)) { + thread_frontier[thread_frontier_count++] = destination; + } + } + bool is_last_iter = (block_offset + blockDim.x >= end_block_offset); + bool is_nth_iter = (i % EdgesPerThread == 0); + if (is_nth_iter || is_last_iter) { + write_to_frontier(thread_frontier, + thread_frontier_count, + block_frontier, + &block_frontier_count, + output_frontier, + &block_write_offset, + output_frontier_count); + thread_frontier_count = 0; + } + ++i; + } + ++current_vertex_index; + } +} + +template +void large_vertex_lb(cugraph::GraphCSRView const &graph, + DegreeBucket &bucket, + operator_t op, + vertex_t vertex_begin, + vertex_t *output_vertex_ids, + edge_t *output_vertex_ids_offset, + cudaStream_t stream) +{ + if (bucket.numberOfVertices != 0) { + const int block_size = 1024; + int block_count = (1 << (bucket.ceilLogDegreeStart - 8)); + kernel_per_vertex + <<>>(graph.offsets, + graph.indices, + bucket.vertexIds, + bucket.numberOfVertices, + vertex_begin, + output_vertex_ids, + output_vertex_ids_offset, + op); + CHECK_CUDA(stream); + } +} + +template +void medium_vertex_lb(cugraph::GraphCSRView const &graph, + DegreeBucket &bucket, + operator_t op, + vertex_t vertex_begin, + vertex_t *output_vertex_ids, + edge_t *output_vertex_ids_offset, + cudaStream_t stream) +{ + // Vertices with degrees 2^12 <= d < 2^16 are handled by this kernel + // Block size of 1024 is chosen to reduce wasted threads for a vertex + const int block_size = 1024; + int block_count = bucket.numberOfVertices; + if (block_count != 0) { + block_per_vertex + <<>>(graph.offsets, + graph.indices, + bucket.vertexIds, + bucket.numberOfVertices, + vertex_begin, + output_vertex_ids, + output_vertex_ids_offset, + op); + CHECK_CUDA(stream); + } +} + +template +void small_vertex_lb(cugraph::GraphCSRView const &graph, + DegreeBucket &bucket, + operator_t op, + vertex_t vertex_begin, + vertex_t *output_vertex_ids, + edge_t *output_vertex_ids_offset, + cudaStream_t stream) +{ + int block_count = bucket.numberOfVertices; + if (block_count == 0) { return; } + // For vertices with degree <= 32 block size of 32 is chosen + // For all vertices with degree d such that 2^x <= d < 2^x+1 + // the block size is chosen to be 2^x. 
This is done so that + // vertices with degrees 1.5*2^x are also handled in a load + // balanced way + int block_size = 512; + if (bucket.ceilLogDegreeEnd < 6) { + block_size = 32; + block_per_vertex<32, 8><<>>(graph.offsets, + graph.indices, + bucket.vertexIds, + bucket.numberOfVertices, + vertex_begin, + output_vertex_ids, + output_vertex_ids_offset, + op); + } else if (bucket.ceilLogDegreeEnd < 8) { + block_size = 64; + block_per_vertex<64, 8><<>>(graph.offsets, + graph.indices, + bucket.vertexIds, + bucket.numberOfVertices, + vertex_begin, + output_vertex_ids, + output_vertex_ids_offset, + op); + } else if (bucket.ceilLogDegreeEnd < 10) { + block_size = 128; + block_per_vertex<128, 8><<>>(graph.offsets, + graph.indices, + bucket.vertexIds, + bucket.numberOfVertices, + vertex_begin, + output_vertex_ids, + output_vertex_ids_offset, + op); + } else if (bucket.ceilLogDegreeEnd < 12) { + block_size = 512; + block_per_vertex<512, 4><<>>(graph.offsets, + graph.indices, + bucket.vertexIds, + bucket.numberOfVertices, + vertex_begin, + output_vertex_ids, + output_vertex_ids_offset, + op); + } else { + block_size = 512; + block_per_vertex<512, 4><<>>(graph.offsets, + graph.indices, + bucket.vertexIds, + bucket.numberOfVertices, + vertex_begin, + output_vertex_ids, + output_vertex_ids_offset, + op); + } + CHECK_CUDA(stream); +} + +} // namespace detail + +} // namespace mg + +} // namespace cugraph diff --git a/cpp/src/traversal/mg/vertex_binning.cuh b/cpp/src/traversal/mg/vertex_binning.cuh new file mode 100644 index 00000000000..3d8c963c466 --- /dev/null +++ b/cpp/src/traversal/mg/vertex_binning.cuh @@ -0,0 +1,135 @@ +/* + * Copyright (c) 2020, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+
+#pragma once
+
+#include "common_utils.cuh"
+#include "vertex_binning_kernels.cuh"
+
+namespace cugraph {
+
+namespace mg {
+
+namespace detail {
+
+template <typename vertex_t, typename edge_t>
+struct DegreeBucket {
+  vertex_t* vertexIds;
+  vertex_t numberOfVertices;
+  edge_t ceilLogDegreeStart;
+  edge_t ceilLogDegreeEnd;
+};
+
+template <typename vertex_t, typename edge_t>
+class LogDistribution {
+  vertex_t* vertex_id_begin_;
+  thrust::host_vector<edge_t> bin_offsets_;
+
+ public:
+  LogDistribution(rmm::device_vector<vertex_t>& vertex_id, rmm::device_vector<edge_t>& bin_offsets)
+    : vertex_id_begin_(vertex_id.data().get()), bin_offsets_(bin_offsets)
+  {
+  }
+
+  DegreeBucket<vertex_t, edge_t> degreeRange(
+    edge_t ceilLogDegreeStart, edge_t ceilLogDegreeEnd = std::numeric_limits<edge_t>::max())
+  {
+    ceilLogDegreeStart = std::max(ceilLogDegreeStart, edge_t{0});
+    if (ceilLogDegreeEnd > static_cast<edge_t>(bin_offsets_.size()) - 2) {
+      ceilLogDegreeEnd = bin_offsets_.size() - 2;
+    }
+    return DegreeBucket<vertex_t, edge_t>{
+      vertex_id_begin_ + bin_offsets_[ceilLogDegreeStart + 1],
+      bin_offsets_[ceilLogDegreeEnd + 1] - bin_offsets_[ceilLogDegreeStart + 1],
+      ceilLogDegreeStart,
+      ceilLogDegreeEnd};
+  }
+};
+
+template <typename vertex_t, typename edge_t>
+class VertexBinner {
+  edge_t* offsets_;
+  uint32_t* active_bitmap_;
+  vertex_t vertex_begin_;
+  vertex_t vertex_end_;
+
+  rmm::device_vector<edge_t> tempBins_;
+  rmm::device_vector<edge_t> bin_offsets_;
+
+ public:
+  VertexBinner(void) : tempBins_(NumberBins<edge_t>), bin_offsets_(NumberBins<edge_t>) {}
+
+  void setup(edge_t* offsets, uint32_t* active_bitmap, vertex_t vertex_begin, vertex_t vertex_end)
+  {
+    offsets_       = offsets;
+    active_bitmap_ = active_bitmap;
+    vertex_begin_  = vertex_begin;
+    vertex_end_    = vertex_end;
+  }
+
+  LogDistribution<vertex_t, edge_t> run(rmm::device_vector<vertex_t>& reorganized_vertices,
+                                        cudaStream_t stream);
+
+  LogDistribution<vertex_t, edge_t> run(rmm::device_vector<vertex_t>& input_vertices,
+                                        vertex_t input_vertices_len,
+                                        rmm::device_vector<vertex_t>& reorganized_vertices,
+                                        cudaStream_t stream);
+};
+
+template <typename vertex_t, typename edge_t>
+LogDistribution<vertex_t, edge_t> VertexBinner<vertex_t, edge_t>::run(
+  rmm::device_vector<vertex_t>& reorganized_vertices, cudaStream_t stream)
+{
+  thrust::fill(
+    rmm::exec_policy(stream)->on(stream), bin_offsets_.begin(), bin_offsets_.end(), edge_t{0});
+  thrust::fill(rmm::exec_policy(stream)->on(stream), tempBins_.begin(), tempBins_.end(), edge_t{0});
+  bin_vertices(reorganized_vertices,
+               bin_offsets_,
+               tempBins_,
+               active_bitmap_,
+               offsets_,
+               vertex_begin_,
+               vertex_end_,
+               stream);
+
+  return LogDistribution<vertex_t, edge_t>(reorganized_vertices, bin_offsets_);
+}
+
+template <typename vertex_t, typename edge_t>
+LogDistribution<vertex_t, edge_t> VertexBinner<vertex_t, edge_t>::run(
+  rmm::device_vector<vertex_t>& input_vertices,
+  vertex_t input_vertices_len,
+  rmm::device_vector<vertex_t>& reorganized_vertices,
+  cudaStream_t stream)
+{
+  bin_vertices(input_vertices,
+               input_vertices_len,
+               reorganized_vertices,
+               bin_offsets_,
+               tempBins_,
+               offsets_,
+               vertex_begin_,
+               vertex_end_,
+               stream);
+
+  return LogDistribution<vertex_t, edge_t>(reorganized_vertices, bin_offsets_);
+}
+
+} // namespace detail
+
+} // namespace mg
+
+} // namespace cugraph
diff --git a/cpp/src/traversal/mg/vertex_binning_kernels.cuh b/cpp/src/traversal/mg/vertex_binning_kernels.cuh
new file mode 100644
index 00000000000..dbb339fea05
--- /dev/null
+++ b/cpp/src/traversal/mg/vertex_binning_kernels.cuh
@@ -0,0 +1,191 @@
+/*
+ * Copyright (c) 2020, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include +#include "../traversal_common.cuh" + +namespace cugraph { + +namespace mg { + +namespace detail { + +template +__device__ inline typename std::enable_if<(sizeof(degree_t) == 4), int>::type ceilLog2_p1( + degree_t val) +{ + return BitsPWrd - __clz(val) + (__popc(val) > 1); +} + +template +__device__ inline typename std::enable_if<(sizeof(degree_t) == 8), int>::type ceilLog2_p1( + degree_t val) +{ + return BitsPWrd - __clzll(val) + (__popcll(val) > 1); +} + +template +__global__ void simple_fill(return_t *bin0, return_t *bin1, return_t count) +{ + for (return_t i = 0; i < count; i++) { + bin0[i] = 0; + bin1[i] = 0; + } +} + +template +__global__ void exclusive_scan(return_t *data, return_t *out) +{ + constexpr int BinCount = NumberBins; + return_t lData[BinCount]; + thrust::exclusive_scan(thrust::seq, data, data + BinCount, lData); + for (int i = 0; i < BinCount; ++i) { + out[i] = lData[i]; + data[i] = lData[i]; + } +} + +//////////////////////////////////////////////////////////////////////////////// +// Queue enabled kernels +//////////////////////////////////////////////////////////////////////////////// + +// Given the CSR offsets of vertices and the related active bit map +// count the number of vertices that belong to a particular bin where +// vertex with degree d such that 2^x < d <= 2^x+1 belong to bin (x+1) +// Vertices with degree 0 are counted in bin 0 +// In this function, any id in vertex_ids array is only acceptable as long +// as its value is between vertex_begin and vertex_end +template +__global__ void count_bin_sizes(edge_t *bins, + edge_t const *offsets, + vertex_t const *vertex_ids, + edge_t const vertex_id_count, + vertex_t vertex_begin, + vertex_t vertex_end) +{ + using cugraph::detail::traversal::atomicAdd; + constexpr int BinCount = NumberBins; + __shared__ edge_t lBin[BinCount]; + for (int i = threadIdx.x; i < BinCount; i += blockDim.x) { lBin[i] = 0; } + __syncthreads(); + + for (vertex_t i = threadIdx.x + (blockIdx.x * blockDim.x); i < vertex_id_count; + i += gridDim.x * blockDim.x) { + auto source = vertex_ids[i]; + if ((source >= vertex_begin) && (source < vertex_end)) { + // Take care of OPG partitioning + // source logical vertex resides from offsets[source - vertex_begin] + // to offsets[source - vertex_begin + 1] + source -= vertex_begin; + auto degree = offsets[source + 1] - offsets[source]; + atomicAdd(lBin + ceilLog2_p1(degree), edge_t{1}); + } + } + __syncthreads(); + + for (int i = threadIdx.x; i < BinCount; i += blockDim.x) { atomicAdd(bins + i, lBin[i]); } +} + +// Bin vertices to the appropriate bins by taking into account +// the starting offsets calculated by count_bin_sizes +template +__global__ void create_vertex_bins(vertex_t *out_vertex_ids, + edge_t *bin_offsets, + edge_t const *offsets, + vertex_t *in_vertex_ids, + edge_t const vertex_id_count, + vertex_t vertex_begin, + vertex_t vertex_end) +{ + using cugraph::detail::traversal::atomicAdd; + constexpr int BinCount = NumberBins; + __shared__ edge_t lBin[BinCount]; + __shared__ int lPos[BinCount]; + if (threadIdx.x < BinCount) { + 
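    // Two-phase binning: the block first histograms its vertices into the
    // shared-memory bins (lBin), remembering each thread's slot within its
    // bin, then one atomicAdd per bin on the global bin_offsets reserves a
    // contiguous output range, and each vertex is finally written to its slot.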
lBin[threadIdx.x] = 0; + lPos[threadIdx.x] = 0; + } + __syncthreads(); + + vertex_t vertex_index = (threadIdx.x + blockIdx.x * blockDim.x); + bool is_valid_vertex = (vertex_index < vertex_id_count); + vertex_t source; + + if (is_valid_vertex) { + source = in_vertex_ids[vertex_index]; + is_valid_vertex = ((source >= vertex_begin) && (source < vertex_end)); + source -= vertex_begin; + } + + int threadBin; + edge_t threadPos; + if (is_valid_vertex) { + threadBin = ceilLog2_p1(offsets[source + 1] - offsets[source]); + threadPos = atomicAdd(lBin + threadBin, edge_t{1}); + } + __syncthreads(); + + if (threadIdx.x < BinCount) { + lPos[threadIdx.x] = atomicAdd(bin_offsets + threadIdx.x, lBin[threadIdx.x]); + } + __syncthreads(); + + if (is_valid_vertex) { out_vertex_ids[lPos[threadBin] + threadPos] = source; } +} + +template +void bin_vertices(rmm::device_vector &input_vertex_ids, + vertex_t input_vertex_ids_len, + rmm::device_vector &reorganized_vertex_ids, + rmm::device_vector &bin_count_offsets, + rmm::device_vector &bin_count, + edge_t *offsets, + vertex_t vertex_begin, + vertex_t vertex_end, + cudaStream_t stream) +{ + simple_fill<<<1, 1, 0, stream>>>( + bin_count_offsets.data().get(), bin_count.data().get(), static_cast(bin_count.size())); + + const uint32_t BLOCK_SIZE = 512; + uint32_t blocks = ((input_vertex_ids_len) + BLOCK_SIZE - 1) / BLOCK_SIZE; + count_bin_sizes + <<>>(bin_count.data().get(), + offsets, + input_vertex_ids.data().get(), + static_cast(input_vertex_ids_len), + vertex_begin, + vertex_end); + + exclusive_scan<<<1, 1, 0, stream>>>(bin_count.data().get(), bin_count_offsets.data().get()); + + create_vertex_bins + <<>>(reorganized_vertex_ids.data().get(), + bin_count.data().get(), + offsets, + input_vertex_ids.data().get(), + static_cast(input_vertex_ids_len), + vertex_begin, + vertex_end); +} + +} // namespace detail + +} // namespace mg + +} // namespace cugraph diff --git a/cpp/src/traversal/sssp.cu b/cpp/src/traversal/sssp.cu index f47583fdc9a..4018c9d9878 100644 --- a/cpp/src/traversal/sssp.cu +++ b/cpp/src/traversal/sssp.cu @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -16,8 +16,8 @@ // Author: Prasun Gera pgera@nvidia.com -#include #include +#include #include "graph.hpp" @@ -211,10 +211,8 @@ void SSSP::traverse(IndexType source_vertex) cudaMemcpyAsync( distances, next_distances, n * sizeof(DistType), cudaMemcpyDeviceToDevice, stream); - CUDA_CHECK_LAST(); - // We need nf for the loop - cudaStreamSynchronize(stream); + CUDA_TRY(cudaStreamSynchronize(stream)); // Swap frontiers // IndexType *tmp = frontier; @@ -244,7 +242,7 @@ void SSSP::clean() * @file sssp.cu * --------------------------------------------------------------------------*/ template -void sssp(experimental::GraphCSRView const &graph, +void sssp(GraphCSRView const &graph, WT *distances, VT *predecessors, const VT source_vertex) @@ -283,7 +281,7 @@ void sssp(experimental::GraphCSRView const &graph, } else { // SSSP is not defined for graphs with negative weight cycles // Warn user about any negative edges - if (graph.prop.has_negative_edges == experimental::PropType::PROP_TRUE) + if (graph.prop.has_negative_edges == PropType::PROP_TRUE) std::cerr << "WARN: The graph has negative weight edges. 
SSSP will not " "converge if the graph has negative weight cycles\n"; edge_weights_ptr = graph.edge_data; @@ -295,11 +293,11 @@ void sssp(experimental::GraphCSRView const &graph, } // explicit instantiation -template void sssp(experimental::GraphCSRView const &graph, +template void sssp(GraphCSRView const &graph, float *distances, int *predecessors, const int source_vertex); -template void sssp(experimental::GraphCSRView const &graph, +template void sssp(GraphCSRView const &graph, double *distances, int *predecessors, const int source_vertex); diff --git a/cpp/src/traversal/sssp.cuh b/cpp/src/traversal/sssp.cuh index 16dcecf33de..fac66e3d47e 100644 --- a/cpp/src/traversal/sssp.cuh +++ b/cpp/src/traversal/sssp.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019 NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. diff --git a/cpp/src/traversal/sssp_kernels.cuh b/cpp/src/traversal/sssp_kernels.cuh index d778372af41..d96540b22b9 100644 --- a/cpp/src/traversal/sssp_kernels.cuh +++ b/cpp/src/traversal/sssp_kernels.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019 NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -18,10 +18,9 @@ #include -#include #include #include "traversal_common.cuh" -#include "utilities/error_utils.h" +#include "utilities/error.hpp" namespace cugraph { namespace detail { namespace sssp_kernels { @@ -548,7 +547,7 @@ void frontier_expand(const IndexType* row_ptr, predecessors, edge_mask); - CUDA_CHECK_LAST(); + CHECK_CUDA(m_stream); } } // namespace sssp_kernels } // namespace detail diff --git a/cpp/src/traversal/traversal_common.cuh b/cpp/src/traversal/traversal_common.cuh index ca36d7edb79..2802fb94be8 100644 --- a/cpp/src/traversal/traversal_common.cuh +++ b/cpp/src/traversal/traversal_common.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -17,7 +17,7 @@ #pragma once #include -#include "utilities/error_utils.h" +#include "utilities/error.hpp" #define MAXBLOCKS 65535 #define WARP_SIZE 32 @@ -107,6 +107,20 @@ struct vec_t { static const int max = std::numeric_limits::max(); }; +template <> +struct vec_t { + typedef long4 vec4; + typedef long2 vec2; + static const long max = std::numeric_limits::max(); +}; + +template <> +struct vec_t { + typedef uint4 vec4; + typedef uint2 vec2; + static const unsigned max = std::numeric_limits::max(); +}; + template <> struct vec_t { typedef longlong4 vec4; @@ -184,7 +198,7 @@ void fill_vec(ValueType* vec, SizeType n, ValueType val, cudaStream_t stream) grid.x = (n + block.x - 1) / block.x; fill_vec_kernel<<>>(vec, n, val); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); } template @@ -204,6 +218,24 @@ binsearch_maxle(const IndexType* vec, const IndexType val, IndexType low, IndexT } } +// FIXME: The atomicAdd wrappers should be moved to RAFT + +template +__device__ static __forceinline__ T atomicAdd(T* addr, T val) +{ + return ::atomicAdd(addr, val); +} + +template <> +__device__ __forceinline__ int64_t atomicAdd(int64_t* addr, int64_t val) +{ + static_assert(sizeof(int64_t) == sizeof(unsigned long long), + "sizeof(int64_t) != sizeof(unsigned long long). Can't use atomicAdd"); + + return ::atomicAdd(reinterpret_cast(addr), + static_cast(val)); +} + __device__ static __forceinline__ float atomicMin(float* addr, float val) { int* addr_as_int = (int*)addr; @@ -286,7 +318,7 @@ __global__ void flag_isolated_vertices_kernel(IndexType n, int local_isolated_bmap = 0; - IndexType imax = (n - thread_off); + IndexType imax = (n > thread_off) ? (n - thread_off) : 0; IndexType local_degree[FLAG_ISOLATED_VERTICES_VERTICES_PER_THREAD]; @@ -314,7 +346,7 @@ __global__ void flag_isolated_vertices_kernel(IndexType n, IndexType total_nisolated = BlockReduce(block_reduce_temp_storage).Sum(local_nisolated); - if (threadIdx.x == 0 && total_nisolated) { atomicAdd(nisolated, total_nisolated); } + if (threadIdx.x == 0 && total_nisolated) { traversal::atomicAdd(nisolated, total_nisolated); } int logicalwarpid = threadIdx.x / FLAG_ISOLATED_VERTICES_THREADS_PER_INT; @@ -347,7 +379,7 @@ void flag_isolated_vertices(IndexType n, flag_isolated_vertices_kernel<<>>( n, isolated_bmap, row_ptr, degrees, nisolated); - CUDA_CHECK_LAST(); + CHECK_CUDA(m_stream); } template @@ -374,7 +406,7 @@ void set_frontier_degree(IndexType* frontier_degree, block.x = 256; grid.x = min((n + block.x - 1) / block.x, (IndexType)MAXBLOCKS); set_frontier_degree_kernel<<>>(frontier_degree, frontier, degree, n); - CUDA_CHECK_LAST(); + CHECK_CUDA(m_stream); } template @@ -439,7 +471,7 @@ void compute_bucket_offsets(IndexType* cumul, compute_bucket_offsets_kernel<<>>( cumul, bucket_offsets, frontier_size, total_degree); - CUDA_CHECK_LAST(); + CHECK_CUDA(m_stream); } } // namespace traversal } // namespace detail diff --git a/cpp/src/traversal/two_hop_neighbors.cu b/cpp/src/traversal/two_hop_neighbors.cu index dc46d56910c..fb984dae0ad 100644 --- a/cpp/src/traversal/two_hop_neighbors.cu +++ b/cpp/src/traversal/two_hop_neighbors.cu @@ -20,9 +20,9 @@ * ---------------------------------------------------------------------------**/ #include -#include #include #include +#include #include "two_hop_neighbors.cuh" #include @@ -32,8 +32,7 @@ namespace cugraph { template -std::unique_ptr> get_two_hop_neighbors( - experimental::GraphCSRView const &graph) +std::unique_ptr> get_two_hop_neighbors(GraphCSRView const &graph) { cudaStream_t stream{nullptr}; @@ -109,8 +108,7 
@@ std::unique_ptr> get_two_hop_neighbo // Get things ready to return ET outputSize = tuple_end - tuple_start; - auto result = std::make_unique>( - graph.number_of_vertices, outputSize, false); + auto result = std::make_unique>(graph.number_of_vertices, outputSize, false); cudaMemcpy(result->src_indices(), d_first_pair, sizeof(VT) * outputSize, cudaMemcpyDefault); cudaMemcpy(result->dst_indices(), d_second_pair, sizeof(VT) * outputSize, cudaMemcpyDefault); @@ -118,10 +116,10 @@ std::unique_ptr> get_two_hop_neighbo return result; } -template std::unique_ptr> get_two_hop_neighbors( - experimental::GraphCSRView const &); +template std::unique_ptr> get_two_hop_neighbors( + GraphCSRView const &); -template std::unique_ptr> get_two_hop_neighbors( - experimental::GraphCSRView const &); +template std::unique_ptr> get_two_hop_neighbors( + GraphCSRView const &); } // namespace cugraph diff --git a/cpp/src/traversal/two_hop_neighbors.cuh b/cpp/src/traversal/two_hop_neighbors.cuh index fd29b3e5140..87d3b36b861 100644 --- a/cpp/src/traversal/two_hop_neighbors.cuh +++ b/cpp/src/traversal/two_hop_neighbors.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. diff --git a/cpp/src/utilities/cuda_utils.cuh b/cpp/src/utilities/cuda_utils.cuh deleted file mode 100644 index dfb407aa35d..00000000000 --- a/cpp/src/utilities/cuda_utils.cuh +++ /dev/null @@ -1,88 +0,0 @@ -/* - * Copyright (c) 2020, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -#pragma once - -#include - -namespace cugraph { -// -// This should go into RAFT... 
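// Aside: the removed helpers below emulate 64-bit atomics with a
// compare-and-swap loop, the usual fallback when no native instruction
// exists. A minimal sketch of the pattern (cas_add64 is an illustrative
// name, not part of this codebase):
//
//   __device__ int64_t cas_add64(int64_t *addr, int64_t val)
//   {
//     auto *a = reinterpret_cast<unsigned long long *>(addr);
//     unsigned long long old = *a, expected;
//     do {
//       expected = old;
//       // two's-complement addition is sign agnostic, so unsigned math is safe
//       old = atomicCAS(a, expected, expected + static_cast<unsigned long long>(val));
//     } while (expected != old);
//     return static_cast<int64_t>(old);
//   }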
-// -__device__ static __forceinline__ int64_t atomicMin(int64_t *addr, int64_t val) -{ - unsigned long long *addr_as_ull{reinterpret_cast(addr)}; - unsigned long long *val_addr_as_ull{reinterpret_cast(&val)}; - unsigned long long old = *addr_as_ull; - unsigned long long val_as_ull = *val_addr_as_ull; - int64_t *p_old{reinterpret_cast(&old)}; - unsigned long long expected; - - do { - expected = old; - old = ::atomicCAS(addr_as_ull, expected, thrust::min(val_as_ull, expected)); - } while (expected != old); - return *p_old; -} - -__device__ static __forceinline__ int32_t atomicMin(int32_t *addr, int32_t val) -{ - return ::atomicMin(addr, val); -} - -__device__ static __forceinline__ int64_t atomicAdd(int64_t *addr, int64_t val) -{ - unsigned long long *addr_as_ull{reinterpret_cast(addr)}; - unsigned long long *val_addr_as_ull{reinterpret_cast(&val)}; - unsigned long long old = *addr_as_ull; - unsigned long long val_as_ull = *val_addr_as_ull; - int64_t *p_old{reinterpret_cast(&old)}; - unsigned long long expected; - - do { - expected = old; - old = ::atomicCAS(addr_as_ull, expected, (expected + val_as_ull)); - } while (expected != old); - return *p_old; -} - -__device__ static __forceinline__ int32_t atomicAdd(int32_t *addr, int32_t val) -{ - return ::atomicAdd(addr, val); -} - -__device__ static __forceinline__ int32_t atomicAdd(int32_t volatile *addr, int32_t val) -{ - return ::atomicAdd(const_cast(addr), val); -} - -__device__ static __forceinline__ double atomicAdd(double volatile *addr, double val) -{ - return ::atomicAdd(const_cast(addr), val); -} - -__device__ static __forceinline__ float atomicAdd(float volatile *addr, float val) -{ - return ::atomicAdd(const_cast(addr), val); -} - -__device__ static __forceinline__ int32_t atomicCAS(int32_t volatile *addr, - int32_t expected, - int32_t val) -{ - return ::atomicCAS(const_cast(addr), expected, val); -} - -} // namespace cugraph diff --git a/cpp/src/utilities/cusparse_helper.cu b/cpp/src/utilities/cusparse_helper.cu deleted file mode 100644 index 43d19f74547..00000000000 --- a/cpp/src/utilities/cusparse_helper.cu +++ /dev/null @@ -1,119 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -#include -#include -#include "cusparse_helper.h" - -namespace cugraph { -namespace detail { -cusparseHandle_t Cusparse::m_handle = 0; - -template -CusparseCsrMV::CusparseCsrMV() -{ - if (sizeof(ValueType) == 4) - cuda_type = CUDA_R_32F; - else - cuda_type = CUDA_R_64F; - CHECK_CUSPARSE(cusparseCreateMatDescr(&descrA)); - CHECK_CUSPARSE(cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO)); - CHECK_CUSPARSE(cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL)); - // alg = CUSPARSE_ALG_MERGE_PATH; - alg = CUSPARSE_ALG_NAIVE; - stream = nullptr; -} - -template -CusparseCsrMV::~CusparseCsrMV() -{ -} - -template -void CusparseCsrMV::setup(int m, - int n, - int nnz, - const ValueType* alpha, - const ValueType* csrValA, - const int* csrRowPtrA, - const int* csrColIndA, - const ValueType* x, - const ValueType* beta, - ValueType* y) -{ - CHECK_CUSPARSE(cusparseCsrmvEx_bufferSize(Cusparse::get_handle(), - alg, - CUSPARSE_OPERATION_NON_TRANSPOSE, - m, - n, - nnz, - alpha, - cuda_type, - descrA, - csrValA, - cuda_type, - csrRowPtrA, - csrColIndA, - x, - cuda_type, - beta, - cuda_type, - y, - cuda_type, - cuda_type, - &spmv_temp_storage_bytes)); - spmv_temp_storage.resize(spmv_temp_storage_bytes, stream); - spmv_d_temp_storage = spmv_temp_storage.data(); -} -template -void CusparseCsrMV::run(int m, - int n, - int nnz, - const ValueType* alpha, - const ValueType* csrValA, - const int* csrRowPtrA, - const int* csrColIndA, - const ValueType* x, - const ValueType* beta, - ValueType* y) -{ - CHECK_CUSPARSE(cusparseCsrmvEx(Cusparse::get_handle(), - alg, - CUSPARSE_OPERATION_NON_TRANSPOSE, - m, - n, - nnz, - alpha, - cuda_type, - descrA, - csrValA, - cuda_type, - csrRowPtrA, - csrColIndA, - x, - cuda_type, - beta, - cuda_type, - y, - cuda_type, - cuda_type, - spmv_d_temp_storage)); -} - -template class CusparseCsrMV; -template class CusparseCsrMV; - -} // namespace detail -} // namespace cugraph diff --git a/cpp/src/utilities/cusparse_helper.h b/cpp/src/utilities/cusparse_helper.h deleted file mode 100644 index d206c824bb6..00000000000 --- a/cpp/src/utilities/cusparse_helper.h +++ /dev/null @@ -1,92 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -#pragma once -#include -#include -#include -#include "utilities/graph_utils.cuh" - -namespace cugraph { -namespace detail { - -#define CHECK_CUSPARSE(call) \ - { \ - cusparseStatus_t _e = (call); \ - if (_e != CUSPARSE_STATUS_SUCCESS) { CUGRAPH_FAIL("CUSPARSE ERROR"); } \ - } - -class Cusparse { - private: - // global CUSPARSE handle for nvgraph - static cusparseHandle_t m_handle; // Constructor. - Cusparse(); - // Destructor. - ~Cusparse(); - - public: - // Get the handle. 
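// get_handle() lazily creates a single process-wide cuSPARSE handle on first
// use; destroy_handle() below destroys it and resets the singleton so a later
// call can recreate it.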
- static cusparseHandle_t get_handle() - { - if (m_handle == 0) CHECK_CUSPARSE(cusparseCreate(&m_handle)); - return m_handle; - } - // Destroy handle - static void destroy_handle() - { - if (m_handle != 0) CHECK_CUSPARSE(cusparseDestroy(m_handle)); - m_handle = 0; - } -}; - -template -class CusparseCsrMV { - private: - cusparseMatDescr_t descrA; - cudaDataType cuda_type; - cusparseAlgMode_t alg; - rmm::device_buffer spmv_temp_storage; - void* spmv_d_temp_storage; - size_t spmv_temp_storage_bytes; - cudaStream_t stream; - - public: - CusparseCsrMV(); - - ~CusparseCsrMV(); - void setup(int m, - int n, - int nnz, - const ValueType* alpha, - const ValueType* csrValA, - const int* csrRowPtrA, - const int* csrColIndA, - const ValueType* x, - const ValueType* beta, - ValueType* y); - void run(int m, - int n, - int nnz, - const ValueType* alpha, - const ValueType* csrValA, - const int* csrRowPtrA, - const int* csrColIndA, - const ValueType* x, - const ValueType* beta, - ValueType* y); -}; - -} // namespace detail -} // namespace cugraph diff --git a/cpp/src/utilities/error_utils.h b/cpp/src/utilities/error_utils.h deleted file mode 100644 index 25179dd201b..00000000000 --- a/cpp/src/utilities/error_utils.h +++ /dev/null @@ -1,190 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -#ifndef ERRORUTILS_HPP -#define ERRORUTILS_HPP - -#include -#include -#include -#include - -#include - -namespace cugraph { -/**---------------------------------------------------------------------------* - * @brief Exception thrown when logical precondition is violated. - * - * This exception should not be thrown directly and is instead thrown by the - * CUGRAPH_EXPECTS macro. - * - *---------------------------------------------------------------------------**/ -struct logic_error : public std::logic_error { - logic_error(char const* const message) : std::logic_error(message) {} - - logic_error(std::string const& message) : std::logic_error(message) {} - - // TODO Add an error code member? This would be useful for translating an - // exception to an error code in a pure-C API -}; -/**---------------------------------------------------------------------------* - * @brief Exception thrown when a CUDA error is encountered. - * - *---------------------------------------------------------------------------**/ -struct cuda_error : public std::runtime_error { - cuda_error(std::string const& message) : std::runtime_error(message) {} -}; -} // namespace cugraph - -#define STRINGIFY_DETAIL(x) #x -#define CUGRAPH_STRINGIFY(x) STRINGIFY_DETAIL(x) - -/**---------------------------------------------------------------------------* - * @brief Macro for checking (pre-)conditions that throws an exception when - * a condition is violated. 
- * - * Example usage: - * - * @code - * CUGRAPH_EXPECTS(lhs->dtype == rhs->dtype, "Column type mismatch"); - * @endcode - * - * @param[in] cond Expression that evaluates to true or false - * @param[in] reason String literal description of the reason that cond is - * expected to be true - * @throw cugraph::logic_error if the condition evaluates to false. - *---------------------------------------------------------------------------**/ -#define CUGRAPH_EXPECTS(cond, reason) \ - (!!(cond)) ? static_cast(0) \ - : throw cugraph::logic_error("CUGRAPH failure at: " __FILE__ \ - ":" CUGRAPH_STRINGIFY(__LINE__) ": " reason) - -/**---------------------------------------------------------------------------* - * @brief Try evaluation an expression with a gdf_error type, - * and throw an appropriate exception if it fails. - *---------------------------------------------------------------------------**/ -#define CUGRAPH_TRY(_gdf_error_expression) \ - do { \ - auto _evaluated = _gdf_error_expression; \ - if (_evaluated == GDF_SUCCESS) { break; } \ - throw cugraph::logic_error( \ - ("CUGRAPH error " + std::string(gdf_error_get_name(_evaluated)) + \ - " at " __FILE__ \ - ":" CUGRAPH_STRINGIFY(__LINE__) " evaluating " CUGRAPH_STRINGIFY(#_gdf_error_expression)) \ - .c_str()); \ - } while (0) - -/**---------------------------------------------------------------------------* - * @brief Indicates that an erroneous code path has been taken. - * - * In host code, throws a `cugraph::logic_error`. - * - * - * Example usage: - * ``` - * CUGRAPH_FAIL("Non-arithmetic operation is not supported"); - * ``` - * - * @param[in] reason String literal description of the reason - *---------------------------------------------------------------------------**/ -#define CUGRAPH_FAIL(reason) \ - throw cugraph::logic_error("cuGraph failure at: " __FILE__ \ - ":" CUGRAPH_STRINGIFY(__LINE__) ": " reason) - -namespace cugraph { -namespace detail { - -inline void throw_cuda_error(cudaError_t error, const char* file, unsigned int line) -{ - throw cugraph::cuda_error(std::string{"CUDA error encountered at: " + std::string{file} + ":" + - std::to_string(line) + ": " + std::to_string(error) + " " + - cudaGetErrorName(error) + " " + cudaGetErrorString(error)}); -} - -inline void check_stream(cudaStream_t stream, const char* file, unsigned int line) -{ - cudaError_t error{cudaSuccess}; - error = cudaStreamSynchronize(stream); - if (cudaSuccess != error) { throw_cuda_error(error, file, line); } - - error = cudaGetLastError(); - if (cudaSuccess != error) { throw_cuda_error(error, file, line); } -} -} // namespace detail -} // namespace cugraph - -/**---------------------------------------------------------------------------* - * @brief Error checking macro for CUDA runtime API functions. - * - * Invokes a CUDA runtime API function call, if the call does not return - * cudaSuccess, throws an exception detailing the CUDA error that occurred. - * - * This macro supersedes GDF_REQUIRE and should be preferred in all instances. - * GDF_REQUIRE should be considered deprecated. 
- * - *---------------------------------------------------------------------------**/ -#ifndef CUDA_TRY -#define CUDA_TRY(call) \ - do { \ - cudaError_t const status = (call); \ - if (cudaSuccess != status) { cugraph::detail::throw_cuda_error(status, __FILE__, __LINE__); } \ - } while (0); -#endif -#endif - -#define CUDA_CHECK_LAST() \ - { \ - cudaError_t const status = cudaGetLastError(); \ - if (status != cudaSuccess) { cugraph::detail::throw_cuda_error(status, __FILE__, __LINE__); } \ - } - -/**---------------------------------------------------------------------------* - * @brief Debug macro to synchronize a stream and check for CUDA errors - * - * In a non-release build, this macro will synchronize the specified stream, and - * check for any CUDA errors returned from cudaGetLastError. If an error is - * reported, an exception is thrown detailing the CUDA error that occurred. - * - * The intent of this macro is to provide a mechanism for synchronous and - * deterministic execution for debugging asynchronous CUDA execution. It should - * be used after any asynchronous CUDA call, e.g., cudaMemcpyAsync, or an - * asynchronous kernel launch. - * - * Similar to assert(), it is only present in non-Release builds. - * - *---------------------------------------------------------------------------**/ -#ifndef NDEBUG -#define CHECK_STREAM(stream) cugraph::detail::check_stream((stream), __FILE__, __LINE__) -#else -#define CHECK_STREAM(stream) static_cast(0) -#endif - -/**---------------------------------------------------------------------------* - * @brief Macro for checking graph object that throws an exception when - * a condition is violated. - * - * Example usage: - * - * @code - * CHECK_GRAPH(graph); - * @endcode - * - * @param[in] the Graph class - * @throw cugraph::logic_error if the condition evaluates to false. - *---------------------------------------------------------------------------**/ -#define CHECK_GRAPH(graph) \ - CUGRAPH_EXPECTS(graph != nullptr, "Invalid API parameter: graph is NULL"); \ - CUGRAPH_EXPECTS(graph->adjList != nullptr || graph->edgeList != nullptr, \ - "Invalid API parameter: graph is empty"); diff --git a/cpp/src/utilities/graph_utils.cuh b/cpp/src/utilities/graph_utils.cuh index efad365aa96..6b7e8558e86 100644 --- a/cpp/src/utilities/graph_utils.cuh +++ b/cpp/src/utilities/graph_utils.cuh @@ -1,5 +1,5 @@ /* - * Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2018-2020, NVIDIA CORPORATION. All rights reserved. 
* * NVIDIA CORPORATION and its licensors retain all intellectual property * and proprietary rights in and to this software, related documentation @@ -13,58 +13,28 @@ // Author: Alex Fender afender@nvidia.com #pragma once +#include + +#include +#include +#include + #include #include -//#include -//#include #include #include #include #include #include -#include -#include - namespace cugraph { namespace detail { -#define USE_CG 1 //#define DEBUG 1 #define CUDA_MAX_BLOCKS 65535 -#define CUDA_MAX_KERNEL_THREADS 256 // kernefgdfl will launch at most 256 threads per block -#define DEFAULT_MASK 0xffffffff +#define CUDA_MAX_KERNEL_THREADS 256 // kernel will launch at most 256 threads per block #define US -template -static __device__ __forceinline__ T -shfl_up(T r, int offset, int bound = 32, int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#if USE_CG - return __shfl_up_sync(mask, r, offset, bound); -#else - return __shfl_up(r, offset, bound); -#endif -#else - return 0.0f; -#endif -} - -template -static __device__ __forceinline__ T shfl(T r, int lane, int bound = 32, int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#if USE_CG - return __shfl_sync(mask, r, lane, bound); -#else - return __shfl(r, lane, bound); -#endif -#else - return 0.0f; -#endif -} - template __inline__ __device__ value_t parallel_prefix_sum(count_t n, index_t const *ind, value_t const *w) { @@ -90,14 +60,14 @@ __inline__ __device__ value_t parallel_prefix_sum(count_t n, index_t const *ind, // iterations it is the value at the last thread of the previous iterations. // get the value of the last thread - last = shfl(sum, blockDim.x - 1, blockDim.x); + last = __shfl_sync(raft::warp_full_mask(), sum, blockDim.x - 1, blockDim.x); // if you are valid read the value from memory, otherwise set your value to 0 sum = (valid) ? 
w[ind[i]] : 0.0; // do prefix sum (of size warpSize=blockDim.x =< 32) for (j = 1; j < blockDim.x; j *= 2) { - v = shfl_up(sum, j, blockDim.x); + v = __shfl_up_sync(raft::warp_full_mask(), sum, j, blockDim.x); if (threadIdx.x >= j) sum += v; } // shift by last @@ -105,7 +75,7 @@ __inline__ __device__ value_t parallel_prefix_sum(count_t n, index_t const *ind, // notice that no __threadfence or __syncthreads are needed in this implementation } // get the value of the last thread (to all threads) - last = shfl(sum, blockDim.x - 1, blockDim.x); + last = __shfl_sync(raft::warp_full_mask(), sum, blockDim.x - 1, blockDim.x); return last; } @@ -120,7 +90,7 @@ T dot(size_t n, T *x, T *y) thrust::device_pointer_cast(x + n), thrust::device_pointer_cast(y), 0.0f); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); return result; } @@ -142,7 +112,7 @@ void axpy(size_t n, T a, T *x, T *y) thrust::device_pointer_cast(y), thrust::device_pointer_cast(y), axpy_functor(a)); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); } // norm @@ -162,7 +132,7 @@ T nrm2(size_t n, T *x) square(), init, thrust::plus())); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); return result; } @@ -173,7 +143,7 @@ T nrm1(size_t n, T *x) T result = thrust::reduce(rmm::exec_policy(stream)->on(stream), thrust::device_pointer_cast(x), thrust::device_pointer_cast(x + n)); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); return result; } @@ -187,7 +157,7 @@ void scal(size_t n, T val, T *x) thrust::make_constant_iterator(val), thrust::device_pointer_cast(x), thrust::multiplies()); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); } template @@ -200,7 +170,7 @@ void addv(size_t n, T val, T *x) thrust::make_constant_iterator(val), thrust::device_pointer_cast(x), thrust::plus()); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); } template @@ -211,7 +181,7 @@ void fill(size_t n, T *x, T value) thrust::device_pointer_cast(x), thrust::device_pointer_cast(x + n), value); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); } template @@ -223,7 +193,7 @@ void scatter(size_t n, T *src, T *dst, M *map) thrust::device_pointer_cast(src + n), thrust::device_pointer_cast(map), thrust::device_pointer_cast(dst)); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); } template @@ -237,7 +207,7 @@ void printv(size_t n, T *vec, int offset) dev_ptr + offset + n, std::ostream_iterator( std::cout, " ")); // Assume no RMM dependency; TODO: check / test (potential BUG !!!!!) 
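// Aside: the shuffle-based scan used by parallel_prefix_sum above, shown in
// isolation. A hypothetical warp-wide inclusive sum (warp_inclusive_scan is
// not a name from this patch); each lane adds the value j lanes below it,
// doubling j, so after log2(32) steps lane i holds the sum of lanes 0..i.
__device__ float warp_inclusive_scan(float x)
{
  for (int j = 1; j < 32; j *= 2) {
    float v = __shfl_up_sync(raft::warp_full_mask(), x, j);  // value from lane (lane_id - j)
    if ((threadIdx.x % 32) >= j) { x += v; }
  }
  return x;
}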
- CUDA_CHECK_LAST(); + CHECK_CUDA(nullptr); std::cout << std::endl; } @@ -248,7 +218,7 @@ void copy(size_t n, T *x, T *res) thrust::device_ptr res_ptr(res); cudaStream_t stream{nullptr}; thrust::copy_n(rmm::exec_policy(stream)->on(stream), dev_ptr, n, res_ptr); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); } template @@ -273,36 +243,39 @@ void update_dangling_nodes(size_t n, T *dangling_nodes, T damping_factor) thrust::device_pointer_cast(dangling_nodes), dangling_functor(1.0 - damping_factor), is_zero()); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); } // google matrix kernels template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) - degree_coo(const IndexType n, const IndexType e, const IndexType *ind, ValueType *degree) +__global__ void degree_coo(const IndexType n, + const IndexType e, + const IndexType *ind, + ValueType *degree) { for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < e; i += gridDim.x * blockDim.x) atomicAdd(°ree[ind[i]], (ValueType)1.0); } template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) - flag_leafs_kernel(const size_t n, const IndexType *degree, ValueType *bookmark) +__global__ void flag_leafs_kernel(const size_t n, const IndexType *degree, ValueType *bookmark) { for (auto i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += gridDim.x * blockDim.x) if (degree[i] == 0) bookmark[i] = 1.0; } template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) - degree_offsets(const IndexType n, const IndexType e, const IndexType *ind, ValueType *degree) +__global__ void degree_offsets(const IndexType n, + const IndexType e, + const IndexType *ind, + ValueType *degree) { for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += gridDim.x * blockDim.x) degree[i] += ind[i + 1] - ind[i]; } template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) type_convert(FromType *array, int n) +__global__ void type_convert(FromType *array, int n) { for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += gridDim.x * blockDim.x) { ToType val = array[i]; @@ -312,12 +285,12 @@ __global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) type_convert(FromType } template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) equi_prob3(const IndexType n, - const IndexType e, - const IndexType *csrPtr, - const IndexType *csrInd, - ValueType *val, - IndexType *degree) +__global__ void equi_prob3(const IndexType n, + const IndexType e, + const IndexType *csrPtr, + const IndexType *csrInd, + ValueType *val, + IndexType *degree) { int j, row, col; for (row = threadIdx.z + blockIdx.z * blockDim.z; row < n; row += gridDim.z * blockDim.z) { @@ -331,12 +304,12 @@ __global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) equi_prob3(const Inde } template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) equi_prob2(const IndexType n, - const IndexType e, - const IndexType *csrPtr, - const IndexType *csrInd, - ValueType *val, - IndexType *degree) +__global__ void equi_prob2(const IndexType n, + const IndexType e, + const IndexType *csrPtr, + const IndexType *csrInd, + ValueType *val, + IndexType *degree) { int row = blockIdx.x * blockDim.x + threadIdx.x; if (row < n) { @@ -371,7 +344,7 @@ void HT_matrix_csc_coo(const IndexType n, nblocks.z = 1; degree_coo <<>>(n, e, csrInd, degree.data().get()); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); int y = 4; nthreads.x = 32 / y; @@ -382,11 +355,11 @@ void HT_matrix_csc_coo(const IndexType n, nblocks.z = min((n + nthreads.z - 1) / nthreads.z, CUDA_MAX_BLOCKS); // 1; equi_prob3 <<>>(n, e, 
csrPtr, csrInd, val, degree.data().get()); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); ValueType a = 0.0; fill(n, bookmark, a); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); nthreads.x = min(n, CUDA_MAX_KERNEL_THREADS); nthreads.y = 1; @@ -396,12 +369,14 @@ void HT_matrix_csc_coo(const IndexType n, nblocks.z = 1; flag_leafs_kernel <<>>(n, degree.data().get(), bookmark); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); } template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) - permute_vals_kernel(const IndexType e, IndexType *perm, ValueType *in, ValueType *out) +__global__ void permute_vals_kernel(const IndexType e, + IndexType *perm, + ValueType *in, + ValueType *out) { for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < e; i += gridDim.x * blockDim.x) out[i] = in[perm[i]]; @@ -486,8 +461,7 @@ void remove_duplicate( } template -__global__ void __launch_bounds__(CUDA_MAX_KERNEL_THREADS) - offsets_to_indices_kernel(const IndexType *offsets, IndexType v, IndexType *indices) +__global__ void offsets_to_indices_kernel(const IndexType *offsets, IndexType v, IndexType *indices) { int tid, ctaStart; tid = threadIdx.x; @@ -511,7 +485,7 @@ void offsets_to_indices(const IndexType *offsets, IndexType v, IndexType *indice IndexType nthreads = min(v, (IndexType)CUDA_MAX_KERNEL_THREADS); IndexType nblocks = min((v + nthreads - 1) / nthreads, (IndexType)CUDA_MAX_BLOCKS); offsets_to_indices_kernel<<>>(offsets, v, indices); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); } template @@ -519,7 +493,7 @@ void sequence(IndexType n, IndexType *vec, IndexType init = 0) { thrust::sequence( thrust::device, thrust::device_pointer_cast(vec), thrust::device_pointer_cast(vec + n), init); - CUDA_CHECK_LAST(); + CHECK_CUDA(nullptr); } template @@ -532,7 +506,7 @@ bool has_negative_val(DistType *arr, size_t n) thrust::device_pointer_cast(arr), thrust::device_pointer_cast(arr + n)); - CUDA_CHECK_LAST(); + CHECK_CUDA(stream); return (result < 0); } diff --git a/cpp/src/utilities/heap.cuh b/cpp/src/utilities/heap.cuh index e290337c22d..0747a658324 100644 --- a/cpp/src/utilities/heap.cuh +++ b/cpp/src/utilities/heap.cuh @@ -1,7 +1,7 @@ // -*-c++-*- /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. diff --git a/cpp/src/utilities/sm_utils.h b/cpp/src/utilities/sm_utils.h deleted file mode 100644 index 57e149e7f99..00000000000 --- a/cpp/src/utilities/sm_utils.h +++ /dev/null @@ -1,326 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -#pragma once - -#ifdef _MSC_VER -#include -#else -#include -#endif - -#define DEFAULT_MASK 0xffffffff - -#define USE_CG 1 -//(__CUDACC_VER__ >= 80500) - -namespace cugraph { -namespace detail { -namespace utils { -static __device__ __forceinline__ int lane_id() -{ - int id; - asm("mov.u32 %0, %%laneid;" : "=r"(id)); - return id; -} - -static __device__ __forceinline__ int lane_mask_lt() -{ - int mask; - asm("mov.u32 %0, %%lanemask_lt;" : "=r"(mask)); - return mask; -} - -static __device__ __forceinline__ int lane_mask_le() -{ - int mask; - asm("mov.u32 %0, %%lanemask_le;" : "=r"(mask)); - return mask; -} - -static __device__ __forceinline__ int warp_id() { return threadIdx.x >> 5; } - -static __device__ __forceinline__ unsigned int ballot(int p, int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#if USE_CG - return __ballot_sync(mask, p); -#else - return __ballot(p); -#endif -#else - return 0; -#endif -} - -static __device__ __forceinline__ int shfl(int r, int lane, int bound = 32, int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#if USE_CG - return __shfl_sync(mask, r, lane, bound); -#else - return __shfl(r, lane, bound); -#endif -#else - return 0; -#endif -} - -static __device__ __forceinline__ float shfl(float r, - int lane, - int bound = 32, - int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#if USE_CG - return __shfl_sync(mask, r, lane, bound); -#else - return __shfl(r, lane, bound); -#endif -#else - return 0.0f; -#endif -} - -/// Warp shuffle down function -/** Warp shuffle functions on 64-bit floating point values are not - * natively implemented as of Compute Capability 5.0. This - * implementation has been copied from - * (http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler). - * Once this is natively implemented, this function can be replaced - * by __shfl_down. 
- * - */ -static __device__ __forceinline__ double shfl(double r, - int lane, - int bound = 32, - int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#ifdef USE_CG - int2 a = *reinterpret_cast(&r); - a.x = __shfl_sync(mask, a.x, lane, bound); - a.y = __shfl_sync(mask, a.y, lane, bound); - return *reinterpret_cast(&a); -#else - int2 a = *reinterpret_cast(&r); - a.x = __shfl(a.x, lane, bound); - a.y = __shfl(a.y, lane, bound); - return *reinterpret_cast(&a); -#endif -#else - return 0.0; -#endif -} - -static __device__ __forceinline__ long long shfl(long long r, - int lane, - int bound = 32, - int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#ifdef USE_CG - int2 a = *reinterpret_cast(&r); - a.x = __shfl_sync(mask, a.x, lane, bound); - a.y = __shfl_sync(mask, a.y, lane, bound); - return *reinterpret_cast(&a); -#else - int2 a = *reinterpret_cast(&r); - a.x = __shfl(a.x, lane, bound); - a.y = __shfl(a.y, lane, bound); - return *reinterpret_cast(&a); -#endif -#else - return 0.0; -#endif -} - -static __device__ __forceinline__ int shfl_down(int r, - int offset, - int bound = 32, - int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#ifdef USE_CG - return __shfl_down_sync(mask, r, offset, bound); -#else - return __shfl_down(r, offset, bound); -#endif -#else - return 0.0f; -#endif -} - -static __device__ __forceinline__ float shfl_down(float r, - int offset, - int bound = 32, - int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#ifdef USE_CG - return __shfl_down_sync(mask, r, offset, bound); -#else - return __shfl_down(r, offset, bound); -#endif -#else - return 0.0f; -#endif -} - -static __device__ __forceinline__ double shfl_down(double r, - int offset, - int bound = 32, - int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#ifdef USE_CG - int2 a = *reinterpret_cast(&r); - a.x = __shfl_down_sync(mask, a.x, offset, bound); - a.y = __shfl_down_sync(mask, a.y, offset, bound); - return *reinterpret_cast(&a); -#else - int2 a = *reinterpret_cast(&r); - a.x = __shfl_down(a.x, offset, bound); - a.y = __shfl_down(a.y, offset, bound); - return *reinterpret_cast(&a); -#endif -#else - return 0.0; -#endif -} - -static __device__ __forceinline__ long long shfl_down(long long r, - int offset, - int bound = 32, - int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#ifdef USE_CG - int2 a = *reinterpret_cast(&r); - a.x = __shfl_down_sync(mask, a.x, offset, bound); - a.y = __shfl_down_sync(mask, a.y, offset, bound); - return *reinterpret_cast(&a); -#else - int2 a = *reinterpret_cast(&r); - a.x = __shfl_down(a.x, offset, bound); - a.y = __shfl_down(a.y, offset, bound); - return *reinterpret_cast(&a); -#endif -#else - return 0.0; -#endif -} - -// specifically for triangles counting -static __device__ __forceinline__ uint64_t shfl_down(uint64_t r, - int offset, - int bound = 32, - int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#ifdef USE_CG - int2 a = *reinterpret_cast(&r); - a.x = __shfl_down_sync(mask, a.x, offset, bound); - a.y = __shfl_down_sync(mask, a.y, offset, bound); - return *reinterpret_cast(&a); -#else - int2 a = *reinterpret_cast(&r); - a.x = __shfl_down(mask, a.x, offset, bound); - a.y = __shfl_down(mask, a.y, offset, bound); - return *reinterpret_cast(&a); -#endif -#else - return 0.0; -#endif -} - -static __device__ __forceinline__ int shfl_up(int r, - int offset, - int bound = 32, - int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#ifdef USE_CG - return __shfl_up_sync(mask, r, offset, bound); -#else - return __shfl_up(r, offset, bound); -#endif -#else - return 0.0f; 
-#endif -} - -static __device__ __forceinline__ float shfl_up(float r, - int offset, - int bound = 32, - int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#ifdef USE_CG - return __shfl_up_sync(mask, r, offset, bound); -#else - return __shfl_up(r, offset, bound); -#endif -#else - return 0.0f; -#endif -} - -static __device__ __forceinline__ double shfl_up(double r, - int offset, - int bound = 32, - int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#ifdef USE_CG - int2 a = *reinterpret_cast(&r); - a.x = __shfl_up_sync(mask, a.x, offset, bound); - a.y = __shfl_up_sync(mask, a.y, offset, bound); - return *reinterpret_cast(&a); -#else - int2 a = *reinterpret_cast(&r); - a.x = __shfl_up(a.x, offset, bound); - a.y = __shfl_up(a.y, offset, bound); - return *reinterpret_cast(&a); -#endif -#else - return 0.0; -#endif -} - -static __device__ __forceinline__ long long shfl_up(long long r, - int offset, - int bound = 32, - int mask = DEFAULT_MASK) -{ -#if __CUDA_ARCH__ >= 300 -#ifdef USE_CG - int2 a = *reinterpret_cast(&r); - a.x = __shfl_up_sync(mask, a.x, offset, bound); - a.y = __shfl_up_sync(mask, a.y, offset, bound); - return *reinterpret_cast(&a); -#else - int2 a = *reinterpret_cast(&r); - a.x = __shfl_up(a.x, offset, bound); - a.y = __shfl_up(a.y, offset, bound); - return *reinterpret_cast(&a); -#endif -#else - return 0.0; -#endif -} -} // namespace utils -} // namespace detail -} // namespace cugraph diff --git a/cpp/src/utilities/spmv_1D.cu b/cpp/src/utilities/spmv_1D.cu new file mode 100644 index 00000000000..4aec86919c9 --- /dev/null +++ b/cpp/src/utilities/spmv_1D.cu @@ -0,0 +1,86 @@ +/* + * Copyright (c) 2020, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+#include
+#include "spmv_1D.cuh"
+
+namespace cugraph {
+namespace mg {
+template <typename vertex_t, typename edge_t, typename weight_t>
+MGcsrmv<vertex_t, edge_t, weight_t>::MGcsrmv(raft::handle_t const &handle,
+                                             vertex_t *local_vertices,
+                                             vertex_t *part_off,
+                                             edge_t *off,
+                                             vertex_t *ind,
+                                             weight_t *val,
+                                             weight_t *x)
+  : handle_(handle),
+    local_vertices_(local_vertices),
+    part_off_(part_off),
+    off_(off),
+    ind_(ind),
+    val_(val)
+{
+  i_      = handle_.get_comms().get_rank();
+  p_      = handle_.get_comms().get_size();
+  v_glob_ = part_off_[p_ - 1] + local_vertices_[p_ - 1];
+  v_loc_  = local_vertices_[i_];
+  vertex_t tmp;
+  CUDA_TRY(cudaMemcpy(&tmp, &off_[v_loc_], sizeof(vertex_t), cudaMemcpyDeviceToHost));
+  e_loc_ = tmp;
+  y_loc_.resize(v_loc_);
+}
+
+template <typename vertex_t, typename edge_t, typename weight_t>
+MGcsrmv<vertex_t, edge_t, weight_t>::~MGcsrmv()
+{
+}
+
+template <typename vertex_t, typename edge_t, typename weight_t>
+void MGcsrmv<vertex_t, edge_t, weight_t>::run(weight_t *x)
+{
+  using namespace raft::matrix;
+
+  weight_t h_one  = 1.0;
+  weight_t h_zero = 0.0;
+
+  sparse_matrix_t<vertex_t, weight_t> mat{handle_,                         // raft handle
+                                          off_,                            // CSR row_offsets
+                                          ind_,                            // CSR col_indices
+                                          val_,                            // CSR values
+                                          static_cast<vertex_t>(v_loc_),   // n_rows
+                                          static_cast<vertex_t>(v_glob_),  // n_cols
+                                          static_cast<edge_t>(e_loc_)};    // nnz
+
+  mat.mv(h_one,                             // alpha
+         x,                                 // x
+         h_zero,                            // beta
+         y_loc_.data().get(),               // y
+         sparse_mv_alg_t::SPARSE_MV_ALG2);  // SpMV algorithm
+
+  auto stream = handle_.get_stream();
+
+  auto const &comm{handle_.get_comms()};  // local
+
+  std::vector<size_t> recvbuf(comm.get_size());
+  std::copy(local_vertices_, local_vertices_ + comm.get_size(), recvbuf.begin());
+  comm.allgatherv(y_loc_.data().get(), x, recvbuf.data(), part_off_, stream);
+}
+
+template class MGcsrmv<int32_t, int32_t, double>;
+template class MGcsrmv<int32_t, int32_t, float>;
+
+} // namespace mg
+} // namespace cugraph
diff --git a/cpp/src/utilities/spmv_1D.cuh b/cpp/src/utilities/spmv_1D.cuh
new file mode 100644
index 00000000000..81466595c19
--- /dev/null
+++ b/cpp/src/utilities/spmv_1D.cuh
@@ -0,0 +1,60 @@
+/*
+ * Copyright (c) 2020, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+#include
+#include
+#include "utilities/error.hpp"
+
+namespace cugraph {
+namespace mg {
+
+template <typename vertex_t, typename edge_t, typename weight_t>
+class MGcsrmv {
+ private:
+  size_t v_glob_;
+  size_t v_loc_;
+  size_t e_loc_;
+
+  raft::handle_t const& handle_;  // raft handle propagation for SpMV, etc.
+
+  vertex_t* part_off_;
+  vertex_t* local_vertices_;
+  int i_;
+  int p_;
+  edge_t* off_;
+  vertex_t* ind_;
+  weight_t* val_;
+  rmm::device_vector<weight_t> y_loc_;
+  std::vector<vertex_t> v_locs_h_;
+  std::vector<vertex_t> displs_h_;
+
+ public:
+  MGcsrmv(raft::handle_t const& r_handle,
+          vertex_t* local_vertices,
+          vertex_t* part_off,
+          edge_t* row_off,
+          vertex_t* col_ind,
+          weight_t* vals,
+          weight_t* x);
+
+  ~MGcsrmv();
+
+  void run(weight_t* x);
+};
+
+} // namespace mg
+} // namespace cugraph
diff --git a/cpp/tests/CMakeLists.txt b/cpp/tests/CMakeLists.txt
index 0b8bec887fb..e0f945639ca 100644
--- a/cpp/tests/CMakeLists.txt
+++ b/cpp/tests/CMakeLists.txt
@@ -1,6 +1,6 @@
 #=============================================================================
 #
-# Copyright (c) 2019, NVIDIA CORPORATION.
+# Copyright (c) 2019-2020, NVIDIA CORPORATION.
# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -29,26 +29,24 @@ function(ConfigureTest CMAKE_TEST_NAME CMAKE_TEST_SRC CMAKE_EXTRA_LIBS) target_include_directories(${CMAKE_TEST_NAME} PRIVATE + "${CUB_INCLUDE_DIR}" + "${THRUST_INCLUDE_DIR}" "${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES}" "${GTEST_INCLUDE_DIR}" "${RMM_INCLUDE}" "${CUDF_INCLUDE}" "${CUDF_INCLUDE}/libcudf/libcudacxx" - "${CUB_INCLUDE_DIR}" + "${NCCL_INCLUDE_DIRS}" "${CMAKE_SOURCE_DIR}/../thirdparty/mmio" "${CMAKE_SOURCE_DIR}/include" "${CMAKE_SOURCE_DIR}/src" "${CMAKE_CURRENT_SOURCE_DIR}" + "${RAFT_DIR}/cpp/include" ) target_link_libraries(${CMAKE_TEST_NAME} PRIVATE - gtest gmock_main gmock cugraph ${CUDF_LIBRARY} ${RMM_LIBRARY} ${CMAKE_EXTRA_LIBS} cudart cuda) - if (BUILD_MPI) - include_directories(include ${MPI_CXX_INCLUDE_PATH} ${NCCL_INCLUDE_DIRS}) - target_link_libraries(${CMAKE_TEST_NAME} PRIVATE ${MPI_C_LIBRARIES} ${NCCL_LIBRARIES} ) - target_compile_options(${CMAKE_TEST_NAME} PUBLIC ${MPI_C_COMPILE_FLAGS}) - endif(BUILD_MPI) + gtest gmock_main gmock cugraph ${CUDF_LIBRARY} ${RMM_LIBRARY} ${CMAKE_EXTRA_LIBS} ${NCCL_LIBRARIES} cudart cuda cublas cusparse cusolver curand) if(OpenMP_CXX_FOUND) target_link_libraries(${CMAKE_TEST_NAME} PRIVATE @@ -138,12 +136,18 @@ set(BETWEENNESS_TEST_SRC ConfigureTest(BETWEENNESS_TEST "${BETWEENNESS_TEST_SRC}" "") +set(EDGE_BETWEENNESS_TEST_SRC + "${CMAKE_SOURCE_DIR}/../thirdparty/mmio/mmio.c" + "${CMAKE_CURRENT_SOURCE_DIR}/centrality/edge_betweenness_centrality_test.cu") + + ConfigureTest(EDGE_BETWEENNESS_TEST "${EDGE_BETWEENNESS_TEST_SRC}" "") + ################################################################################################### # - pagerank tests -------------------------------------------------------------------------------- set(PAGERANK_TEST_SRC "${CMAKE_SOURCE_DIR}/../thirdparty/mmio/mmio.c" - "${CMAKE_CURRENT_SOURCE_DIR}/pagerank/pagerank_test.cu") + "${CMAKE_CURRENT_SOURCE_DIR}/pagerank/pagerank_test.cpp") ConfigureTest(PAGERANK_TEST "${PAGERANK_TEST_SRC}" "") @@ -172,6 +176,15 @@ set(LOUVAIN_TEST_SRC ConfigureTest(LOUVAIN_TEST "${LOUVAIN_TEST_SRC}" "") +################################################################################################### +# - LEIDEN tests --------------------------------------------------------------------------------- + +set(LEIDEN_TEST_SRC + "${CMAKE_SOURCE_DIR}/../thirdparty/mmio/mmio.c" + "${CMAKE_CURRENT_SOURCE_DIR}/community/leiden_test.cpp") + +ConfigureTest(LEIDEN_TEST "${LEIDEN_TEST_SRC}" "") + ################################################################################################### # - ECG tests --------------------------------------------------------------------------------- @@ -203,7 +216,7 @@ set(RENUMBERING_TEST_SRC "${CMAKE_SOURCE_DIR}/../thirdparty/mmio/mmio.c" "${CMAKE_CURRENT_SOURCE_DIR}/renumber/renumber_test.cu") -ConfigureTest(RENUMBERING_TEST "${RENUMBERING_TEST_SRC}" "${NVSTRINGS_LIBRARY}") +ConfigureTest(RENUMBERING_TEST "${RENUMBERING_TEST_SRC}" "") ################################################################################################### #-FORCE ATLAS 2 tests ------------------------------------------------------------------------------ @@ -221,7 +234,7 @@ set(CONNECT_TEST_SRC "${CMAKE_SOURCE_DIR}/../thirdparty/mmio/mmio.c" "${CMAKE_CURRENT_SOURCE_DIR}/components/con_comp_test.cu") - ConfigureTest(CONNECT_TEST "${CONNECT_TEST_SRC}" "") +ConfigureTest(CONNECT_TEST "${CONNECT_TEST_SRC}" "") 
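// A note on the new MGcsrmv added in spmv_1D.cu above, as a hedged sketch
// rather than anything this patch defines: the 1-D decomposition gives each
// of the p ranks a contiguous block of rows, every rank multiplies its block
// against the full x, and allgatherv then reassembles the complete result
// vector on all ranks. The partition arithmetic implied by the constructor
// (part_off_[r] is the first global vertex owned by rank r) can be
// illustrated with a hypothetical helper:
inline int example_owner_of_vertex(int v, const int *part_off, int p)
{
  // The owner is the last rank whose starting offset does not exceed v.
  int r = 0;
  while (r + 1 < p && part_off[r + 1] <= v) { ++r; }
  return r;
}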
################################################################################################### #-STRONGLY CONNECTED COMPONENTS tests --------------------------------------------------------------------- diff --git a/cpp/tests/centrality/betweenness_centrality_test.cu b/cpp/tests/centrality/betweenness_centrality_test.cu index 153e0bc876c..d680574e10b 100644 --- a/cpp/tests/centrality/betweenness_centrality_test.cu +++ b/cpp/tests/centrality/betweenness_centrality_test.cu @@ -14,23 +14,24 @@ * limitations under the License. */ -#include "gmock/gmock.h" -#include "gtest/gtest.h" - -#include -#include -#include "test_utils.h" +#include +#include +#include #include #include -#include -#include +#include +#include -#include +#include + +#include -#include -#include "traversal/bfs_ref.h" +#include +#include +#include +#include #ifndef TEST_EPSILON #define TEST_EPSILON 0.0001 @@ -47,73 +48,122 @@ // ============================================================================ // C++ Reference Implementation // ============================================================================ -template +template void ref_accumulation(result_t *result, - VT const number_of_vertices, - std::stack &S, - std::vector> &pred, + vertex_t const number_of_vertices, + std::stack &S, + std::vector> &pred, std::vector &sigmas, std::vector &deltas, - VT source) + vertex_t source) +{ + for (vertex_t v = 0; v < number_of_vertices; ++v) { deltas[v] = 0; } + while (!S.empty()) { + vertex_t w = S.top(); + S.pop(); + for (vertex_t v : pred[w]) { deltas[v] += (sigmas[v] / sigmas[w]) * (1.0 + deltas[w]); } + if (w != source) { result[w] += deltas[w]; } + } +} + +template +void ref_endpoints_accumulation(result_t *result, + vertex_t const number_of_vertices, + std::stack &S, + std::vector> &pred, + std::vector &sigmas, + std::vector &deltas, + vertex_t source) +{ + result[source] += S.size() - 1; + for (vertex_t v = 0; v < number_of_vertices; ++v) { deltas[v] = 0; } + while (!S.empty()) { + vertex_t w = S.top(); + S.pop(); + for (vertex_t v : pred[w]) { deltas[v] += (sigmas[v] / sigmas[w]) * (1.0 + deltas[w]); } + if (w != source) { result[w] += deltas[w] + 1; } + } +} + +template +void ref_edge_accumulation(result_t *result, + vertex_t const number_of_vertices, + std::stack &S, + std::vector> &pred, + std::vector &sigmas, + std::vector &deltas, + vertex_t source) { - for (VT v = 0; v < number_of_vertices; ++v) { deltas[v] = 0; } + for (vertex_t v = 0; v < number_of_vertices; ++v) { deltas[v] = 0; } while (!S.empty()) { - VT w = S.top(); + vertex_t w = S.top(); S.pop(); - for (VT v : pred[w]) { deltas[v] += (sigmas[v] / sigmas[w]) * (1.0 + deltas[w]); } + for (vertex_t v : pred[w]) { deltas[v] += (sigmas[v] / sigmas[w]) * (1.0 + deltas[w]); } if (w != source) { result[w] += deltas[w]; } } } // Algorithm 1: Shortest-path vertex betweenness, (Brandes, 2001) -template -void reference_betweenness_centrality_impl(VT *indices, - ET *offsets, - VT const number_of_vertices, +template +void reference_betweenness_centrality_impl(vertex_t *indices, + edge_t *offsets, + vertex_t const number_of_vertices, result_t *result, - VT const *sources, - VT const number_of_sources) + bool endpoints, + vertex_t const *sources, + vertex_t const number_of_sources) { - std::queue Q; - std::stack S; - // NOTE: dist is of type VT not WT - std::vector dist(number_of_vertices); - std::vector> pred(number_of_vertices); + std::queue Q; + std::stack S; + // NOTE: dist is of type vertex_t not weight_t + std::vector dist(number_of_vertices); + 
std::vector> pred(number_of_vertices); std::vector sigmas(number_of_vertices); std::vector deltas(number_of_vertices); - std::vector neighbors; + std::vector neighbors; if (sources) { - for (VT source_idx = 0; source_idx < number_of_sources; ++source_idx) { - VT s = sources[source_idx]; + for (vertex_t source_idx = 0; source_idx < number_of_sources; ++source_idx) { + vertex_t s = sources[source_idx]; // Step 1: Single-source shortest-paths problem // a. Initialization - ref_bfs(indices, offsets, number_of_vertices, Q, S, dist, pred, sigmas, s); + ref_bfs(indices, offsets, number_of_vertices, Q, S, dist, pred, sigmas, s); // Step 2: Accumulation // Back propagation of dependencies - ref_accumulation( - result, number_of_vertices, S, pred, sigmas, deltas, s); + if (endpoints) { + ref_endpoints_accumulation( + result, number_of_vertices, S, pred, sigmas, deltas, s); + } else { + ref_accumulation( + result, number_of_vertices, S, pred, sigmas, deltas, s); + } } } else { - for (VT s = 0; s < number_of_vertices; ++s) { + for (vertex_t s = 0; s < number_of_vertices; ++s) { // Step 1: Single-source shortest-paths problem // a. Initialization - ref_bfs(indices, offsets, number_of_vertices, Q, S, dist, pred, sigmas, s); + ref_bfs(indices, offsets, number_of_vertices, Q, S, dist, pred, sigmas, s); // Step 2: Accumulation // Back propagation of dependencies - ref_accumulation( - result, number_of_vertices, S, pred, sigmas, deltas, s); + if (endpoints) { + ref_endpoints_accumulation( + result, number_of_vertices, S, pred, sigmas, deltas, s); + } else { + ref_accumulation( + result, number_of_vertices, S, pred, sigmas, deltas, s); + } } } } -template +template void reference_rescale(result_t *result, - bool normalize, bool directed, - VT const number_of_vertices, - VT const number_of_sources) + bool normalize, + bool endpoints, + vertex_t const number_of_vertices, + vertex_t const number_of_sources) { bool modified = false; result_t rescale_factor = static_cast(1); @@ -121,7 +171,11 @@ void reference_rescale(result_t *result, result_t casted_number_of_vertices = static_cast(number_of_vertices); if (normalize) { if (number_of_vertices > 2) { - rescale_factor /= ((casted_number_of_vertices - 1) * (casted_number_of_vertices - 2)); + if (endpoints) { + rescale_factor /= (casted_number_of_vertices * (casted_number_of_vertices - 1)); + } else { + rescale_factor /= ((casted_number_of_vertices - 1) * (casted_number_of_vertices - 2)); + } modified = true; } } else { @@ -138,47 +192,55 @@ void reference_rescale(result_t *result, for (auto idx = 0; idx < number_of_vertices; ++idx) { result[idx] *= rescale_factor; } } -template -void reference_betweenness_centrality(cugraph::experimental::GraphCSRView const &graph, - result_t *result, - bool normalize, - bool endpoints, // This is not yet implemented - VT const number_of_sources, - VT const *sources) +template +void reference_betweenness_centrality( + cugraph::GraphCSRView const &graph, + result_t *result, + bool normalize, + bool endpoints, // This is not yet implemented + vertex_t const number_of_sources, + vertex_t const *sources) { - VT number_of_vertices = graph.number_of_vertices; - ET number_of_edges = graph.number_of_edges; - thrust::host_vector h_indices(number_of_edges); - thrust::host_vector h_offsets(number_of_vertices + 1); + vertex_t number_of_vertices = graph.number_of_vertices; + edge_t number_of_edges = graph.number_of_edges; + thrust::host_vector h_indices(number_of_edges); + thrust::host_vector h_offsets(number_of_vertices + 1); - 
thrust::device_ptr d_indices((VT *)&graph.indices[0]); - thrust::device_ptr d_offsets((ET *)&graph.offsets[0]); + thrust::device_ptr d_indices((vertex_t *)&graph.indices[0]); + thrust::device_ptr d_offsets((edge_t *)&graph.offsets[0]); thrust::copy(d_indices, d_indices + number_of_edges, h_indices.begin()); thrust::copy(d_offsets, d_offsets + (number_of_vertices + 1), h_offsets.begin()); cudaDeviceSynchronize(); - reference_betweenness_centrality_impl( - &h_indices[0], &h_offsets[0], number_of_vertices, result, sources, number_of_sources); - reference_rescale( - result, normalize, graph.prop.directed, number_of_vertices, number_of_sources); + reference_betweenness_centrality_impl(&h_indices[0], + &h_offsets[0], + number_of_vertices, + result, + endpoints, + sources, + number_of_sources); + reference_rescale( + result, graph.prop.directed, normalize, endpoints, number_of_vertices, number_of_sources); } -// Explicit declaration +// Explicit instantiation +/* FIXME!!! template void reference_betweenness_centrality( - cugraph::experimental::GraphCSRView const &, + cugraph::GraphCSRView const &, float *, bool, bool, const int, int const *); template void reference_betweenness_centrality( - cugraph::experimental::GraphCSRView const &, + cugraph::GraphCSRView const &, double *, bool, bool, const int, int const *); +*/ // ============================================================================= // Utility functions @@ -198,7 +260,6 @@ bool compare_close(const T &a, const T &b, const precision_t epsilon, precision_ // Defines Betweenness Centrality UseCase // SSSP's test suite code uses type of Graph parameter that could be used // (MTX / RMAT) -// FIXME: Use VT for number_of_sources? typedef struct BC_Usecase_t { std::string config_; // Path to graph file std::string file_path_; // Complete path to graph using dataset_root_dir @@ -208,7 +269,7 @@ typedef struct BC_Usecase_t { { // assume relative paths are relative to RAPIDS_DATASET_ROOT_DIR // FIXME: Use platform independent stuff from c++14/17 on compiler update - const std::string &rapidsDatasetRootDir = get_rapids_dataset_root_dir(); + const std::string &rapidsDatasetRootDir = cugraph::test::get_rapids_dataset_root_dir(); if ((config_ != "") && (config_[0] != '/')) { file_path_ = rapidsDatasetRootDir + "/" + config_; } else { @@ -218,6 +279,8 @@ typedef struct BC_Usecase_t { } BC_Usecase; class Tests_BC : public ::testing::TestWithParam { + raft::handle_t handle; + public: Tests_BC() {} static void SetupTestCase() {} @@ -225,16 +288,15 @@ class Tests_BC : public ::testing::TestWithParam { virtual void SetUp() {} virtual void TearDown() {} - // FIXME: Should normalize be part of the configuration instead? 
-  // VT            vertex identifier data type
-  // ET            edge identifier data type
-  // WT            edge weight data type
+  // vertex_t      vertex identifier data type
+  // edge_t        edge identifier data type
+  // weight_t      edge weight data type
   // result_t      result data type
   // normalize     should the result be normalized
-  // endpoints     should the endpoints be included (Not Implemented Yet)
-  template <typename VT, typename ET, typename WT, typename result_t, bool normalize, bool endpoints>
@@ -242,11 +304,12 @@ class Tests_BC : public ::testing::TestWithParam<BC_Usecase> {
   {
     // Step 1: Construction of the graph based on configuration
    bool is_directed = false;
-    auto csr = generate_graph_csr_from_mm<VT, ET, WT>(is_directed, configuration.file_path_);
+    auto csr = cugraph::test::generate_graph_csr_from_mm<vertex_t, edge_t, weight_t>(
+      is_directed, configuration.file_path_);
     cudaDeviceSynchronize();
-    cugraph::experimental::GraphCSRView<VT, ET, WT> G = csr->view();
-    G.prop.directed = is_directed;
-    CUDA_CHECK_LAST();
+    cugraph::GraphCSRView<vertex_t, edge_t, weight_t> G = csr->view();
+    G.prop.directed = is_directed;
+    CUDA_TRY(cudaGetLastError());
     std::vector<result_t> result(G.number_of_vertices, 0);
     std::vector<result_t> expected(G.number_of_vertices, 0);
@@ -257,44 +320,27 @@ class Tests_BC : public ::testing::TestWithParam<BC_Usecase> {
                 configuration.number_of_sources_ <= G.number_of_vertices)
      << "Number of sources should be >= 0 and"
      << " at most the number of vertices in the graph";
-    std::vector<VT> sources(configuration.number_of_sources_);
+    std::vector<vertex_t> sources(configuration.number_of_sources_);
     thrust::sequence(thrust::host, sources.begin(), sources.end(), 0);
-    VT *sources_ptr = nullptr;
+    vertex_t *sources_ptr = nullptr;
     if (configuration.number_of_sources_ > 0) { sources_ptr = sources.data(); }
-    reference_betweenness_centrality(G,
-                                     expected.data(),
-                                     normalize,
-                                     endpoints,
-                                     // FIXME: weights
-                                     configuration.number_of_sources_,
-                                     sources_ptr);
+    reference_betweenness_centrality(
+      G, expected.data(), normalize, endpoints, configuration.number_of_sources_, sources_ptr);
     sources_ptr = nullptr;
     if (configuration.number_of_sources_ > 0) { sources_ptr = sources.data(); }
-    thrust::device_vector<result_t> d_result(G.number_of_vertices);
-    // FIXME: Remove this once endpoints in handled
-    if (endpoints) {
-      ASSERT_THROW(cugraph::betweenness_centrality(G,
-                                                   d_result.data().get(),
-                                                   normalize,
-                                                   endpoints,
-                                                   static_cast<WT *>(nullptr),
-                                                   configuration.number_of_sources_,
-                                                   sources_ptr),
-                   cugraph::logic_error);
-      return;
-    } else {
-      cugraph::betweenness_centrality(G,
-                                      d_result.data().get(),
-                                      normalize,
-                                      endpoints,
-                                      static_cast<WT *>(nullptr),
-                                      configuration.number_of_sources_,
-                                      sources_ptr);
-    }
+    rmm::device_vector<result_t> d_result(G.number_of_vertices);
+    cugraph::betweenness_centrality(handle,
+                                    G,
+                                    d_result.data().get(),
+                                    normalize,
+                                    endpoints,
+                                    static_cast<weight_t *>(nullptr),
+                                    configuration.number_of_sources_,
+                                    sources_ptr);
     cudaDeviceSynchronize();
     CUDA_TRY(cudaMemcpy(result.data(),
                         d_result.data().get(),
@@ -312,7 +358,6 @@ class Tests_BC : public ::testing::TestWithParam<BC_Usecase> {
 // Tests
 // ============================================================================
 // Verify Un-Normalized results
-// Endpoint parameter is currently not usefull, is for later use
 TEST_P(Tests_BC, CheckFP32_NO_NORMALIZE_NO_ENDPOINTS)
 {
   run_current_test<int, int, float, float, false, false>(GetParam());
 }
@@ -323,7 +368,6 @@ TEST_P(Tests_BC, CheckFP64_NO_NORMALIZE_NO_ENDPOINTS)
 {
   run_current_test<int, int, double, double, false, false>(GetParam());
 }
-// FIXME: Currently endpoints throws and exception as it is not supported
 TEST_P(Tests_BC, CheckFP32_NO_NORMALIZE_ENDPOINTS)
 {
   run_current_test<int, int, float, float, false, true>(GetParam());
 }
@@ -335,17 +379,16 @@ TEST_P(Tests_BC, CheckFP64_NO_NORMALIZE_ENDPOINTS)
 {
   run_current_test<int, int, double, double, false, true>(GetParam());
 }
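// A sketch of the normalization the NORMALIZE tests below rely on (mirrors
// reference_rescale above; the helper name is illustrative, not from the
// patch):
template <typename result_t>
result_t example_bc_normalization_factor(int n, bool endpoints)
{
  // Scores are divided by the count of ordered source/target pairs that can
  // contribute: (n - 1) * (n - 2) without endpoints, n * (n - 1) with them.
  // For the karate graph (n = 34) that is 1/1056 and 1/1122 respectively.
  return endpoints ? result_t{1} / (result_t(n) * result_t(n - 1))
                   : result_t{1} / (result_t(n - 1) * result_t(n - 2));
}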
 // Verify Normalized results
-TEST_P(Tests_BC, CheckFP32_NORMALIZE_NO_ENPOINTS)
+TEST_P(Tests_BC, CheckFP32_NORMALIZE_NO_ENDPOINTS)
 {
   run_current_test<int, int, float, float, true, false>(GetParam());
 }
-TEST_P(Tests_BC, CheckFP64_NORMALIZE_NO_ENPOINTS)
+TEST_P(Tests_BC, CheckFP64_NORMALIZE_NO_ENDPOINTS)
 {
   run_current_test<int, int, double, double, true, false>(GetParam());
 }
-// FIXME: Currently endpoints throws and exception as it is not supported
 TEST_P(Tests_BC, CheckFP32_NORMALIZE_ENDPOINTS)
 {
   run_current_test<int, int, float, float, true, true>(GetParam());
 }
@@ -356,19 +399,12 @@ TEST_P(Tests_BC, CheckFP64_NORMALIZE_ENDPOINTS)
 {
   run_current_test<int, int, double, double, true, true>(GetParam());
 }
-// FIXME: There is an InvalidValue on a Memcopy only on tests/datasets/dblp.mtx
 INSTANTIATE_TEST_CASE_P(simple_test,
                         Tests_BC,
                         ::testing::Values(BC_Usecase("test/datasets/karate.mtx", 0),
+                                          BC_Usecase("test/datasets/netscience.mtx", 0),
                                           BC_Usecase("test/datasets/netscience.mtx", 4),
                                           BC_Usecase("test/datasets/wiki2003.mtx", 4),
                                           BC_Usecase("test/datasets/wiki-Talk.mtx", 4)));
-int main(int argc, char **argv)
-{
-  testing::InitGoogleTest(&argc, argv);
-  auto resource = std::make_unique<rmm::mr::cuda_memory_resource>();
-  rmm::mr::set_default_resource(resource.get());
-  int rc = RUN_ALL_TESTS();
-  return rc;
-}
+CUGRAPH_TEST_PROGRAM_MAIN()
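// A worked example for the edge variant tested in the new file below
// (numbers are illustrative, not taken from the patch): on an undirected path
// a - b - c stored as a symmetric CSR, each directed entry of both edges
// accumulates a raw dependency of 2.0, and the un-normalized rescale halves
// that for undirected graphs; summing an edge's two directed entries then
// recovers the usual undirected score of 2.0, since the pairs {a,b} and
// {a,c} both route over edge ab.
constexpr double kRawPerCsrEntry      = 2.0;
constexpr double kRescaledPerCsrEntry = kRawPerCsrEntry / 2.0;
static_assert(kRescaledPerCsrEntry == 1.0, "sanity check of the arithmetic");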
diff --git a/cpp/tests/centrality/edge_betweenness_centrality_test.cu b/cpp/tests/centrality/edge_betweenness_centrality_test.cu
new file mode 100644
index 00000000000..b6cce8684e8
--- /dev/null
+++ b/cpp/tests/centrality/edge_betweenness_centrality_test.cu
@@ -0,0 +1,323 @@
+/*
+ * Copyright (c) 2020, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include
+#include
+#include
+
+#include
+#include
+
+#include
+
+#include
+
+#include
+#include
+
+#include
+#include
+#include
+#include
+
+#ifndef TEST_EPSILON
+#define TEST_EPSILON 0.0001
+#endif
+
+// NOTE: Defines under which values the difference should be discarded when
+// considering values are close to zero
+// i.e: Do we consider that the difference between 1.3e-9 and 8.e-12 is
+// significant
+#ifndef TEST_ZERO_THRESHOLD
+#define TEST_ZERO_THRESHOLD 1e-10
+#endif
+
+// ============================================================================
+// C++ Reference Implementation
+// ============================================================================
+
+template <typename vertex_t, typename edge_t>
+edge_t get_edge_index_from_source_and_destination(vertex_t source_vertex,
+                                                  vertex_t destination_vertex,
+                                                  vertex_t const *indices,
+                                                  edge_t const *offsets)
+{
+  edge_t index          = -1;
+  edge_t first_edge_idx = offsets[source_vertex];
+  edge_t last_edge_idx  = offsets[source_vertex + 1];
+  auto index_it = std::find(indices + first_edge_idx, indices + last_edge_idx, destination_vertex);
+  if (index_it != (indices + last_edge_idx)) { index = std::distance(indices, index_it); }
+  return index;
+}
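// A usage sketch for the lookup above (tiny hand-built CSR, not from the
// patch): edge (u, v) lives in indices[offsets[u] .. offsets[u + 1]), so the
// helper is a linear scan of u's adjacency list that yields -1 when the edge
// is absent.
inline void example_edge_lookup()
{
  int offsets[] = {0, 2, 3, 4};  // 3 vertices, 4 directed edges
  int indices[] = {1, 2, 2, 0};  // 0->1, 0->2, 1->2, 2->0
  int eid = get_edge_index_from_source_and_destination<int, int>(0, 2, indices, offsets);
  // eid == 1: edge (0, 2) is the second entry of vertex 0's adjacency list,
  // while a query such as (1, 0) would return -1.
  (void)eid;
}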
+
+template <typename vertex_t, typename edge_t, typename result_t>
+void ref_accumulation(result_t *result,
+                      vertex_t const *indices,
+                      edge_t const *offsets,
+                      vertex_t const number_of_vertices,
+                      std::stack<vertex_t> &S,
+                      std::vector<std::vector<vertex_t>> &pred,
+                      std::vector<double> &sigmas,
+                      std::vector<double> &deltas,
+                      vertex_t source)
+{
+  for (vertex_t v = 0; v < number_of_vertices; ++v) { deltas[v] = 0; }
+  while (!S.empty()) {
+    vertex_t w = S.top();
+    S.pop();
+    for (vertex_t v : pred[w]) {
+      edge_t edge_idx =
+        get_edge_index_from_source_and_destination<vertex_t, edge_t>(v, w, indices, offsets);
+      double coefficient = (sigmas[v] / sigmas[w]) * (1.0 + deltas[w]);
+
+      deltas[v] += coefficient;
+      result[edge_idx] += coefficient;
+    }
+  }
+}
+
+// Algorithm 1 of Brandes (2001), adapted here to accumulate shortest-path
+// betweenness on edges rather than vertices
+template <typename vertex_t, typename edge_t, typename result_t>
+void reference_edge_betweenness_centrality_impl(vertex_t *indices,
+                                                edge_t *offsets,
+                                                vertex_t const number_of_vertices,
+                                                result_t *result,
+                                                vertex_t const *sources,
+                                                vertex_t const number_of_sources)
+{
+  std::queue<vertex_t> Q;
+  std::stack<vertex_t> S;
+  // NOTE: dist is of type vertex_t not weight_t
+  std::vector<vertex_t> dist(number_of_vertices);
+  std::vector<std::vector<vertex_t>> pred(number_of_vertices);
+  std::vector<double> sigmas(number_of_vertices);
+  std::vector<double> deltas(number_of_vertices);
+
+  std::vector<vertex_t> neighbors;
+
+  if (sources) {
+    for (vertex_t source_idx = 0; source_idx < number_of_sources; ++source_idx) {
+      vertex_t s = sources[source_idx];
+      // Step 1: Single-source shortest-paths problem
+      // a. Initialization
+      ref_bfs<vertex_t, edge_t>(indices, offsets, number_of_vertices, Q, S, dist, pred, sigmas, s);
+      // Step 2: Accumulation
+      // Back propagation of dependencies
+      ref_accumulation<vertex_t, edge_t, result_t>(
+        result, indices, offsets, number_of_vertices, S, pred, sigmas, deltas, s);
+    }
+  } else {
+    for (vertex_t s = 0; s < number_of_vertices; ++s) {
+      // Step 1: Single-source shortest-paths problem
+      // a. Initialization
+      ref_bfs<vertex_t, edge_t>(indices, offsets, number_of_vertices, Q, S, dist, pred, sigmas, s);
+      // Step 2: Accumulation
+      // Back propagation of dependencies
+      ref_accumulation<vertex_t, edge_t, result_t>(
+        result, indices, offsets, number_of_vertices, S, pred, sigmas, deltas, s);
+    }
+  }
+}
+
+template <typename vertex_t, typename edge_t, typename result_t>
+void reference_rescale(result_t *result,
+                       bool directed,
+                       bool normalize,
+                       vertex_t const number_of_vertices,
+                       edge_t const number_of_edges)
+{
+  result_t rescale_factor            = static_cast<result_t>(1);
+  result_t casted_number_of_vertices = static_cast<result_t>(number_of_vertices);
+  if (normalize) {
+    if (number_of_vertices > 1) {
+      rescale_factor /= ((casted_number_of_vertices) * (casted_number_of_vertices - 1));
+    }
+  } else {
+    if (!directed) { rescale_factor /= static_cast<result_t>(2); }
+  }
+  for (auto idx = 0; idx < number_of_edges; ++idx) { result[idx] *= rescale_factor; }
+}
+
+template <typename vertex_t, typename edge_t, typename weight_t, typename result_t>
+void reference_edge_betweenness_centrality(
+  cugraph::GraphCSRView<vertex_t, edge_t, weight_t> const &graph,
+  result_t *result,
+  bool normalize,
+  vertex_t const number_of_sources,
+  vertex_t const *sources)
+{
+  vertex_t number_of_vertices = graph.number_of_vertices;
+  edge_t number_of_edges      = graph.number_of_edges;
+  thrust::host_vector<vertex_t> h_indices(number_of_edges);
+  thrust::host_vector<edge_t> h_offsets(number_of_vertices + 1);
+
+  thrust::device_ptr<vertex_t> d_indices((vertex_t *)&graph.indices[0]);
+  thrust::device_ptr<edge_t> d_offsets((edge_t *)&graph.offsets[0]);
+
+  thrust::copy(d_indices, d_indices + number_of_edges, h_indices.begin());
+  thrust::copy(d_offsets, d_offsets + (number_of_vertices + 1), h_offsets.begin());
+
+  cudaDeviceSynchronize();
+
+  reference_edge_betweenness_centrality_impl<vertex_t, edge_t, result_t>(
+    &h_indices[0], &h_offsets[0], number_of_vertices, result, sources, number_of_sources);
+  reference_rescale<vertex_t, edge_t, result_t>(
+    result, graph.prop.directed, normalize, number_of_vertices, number_of_edges);
+}
+
+// =============================================================================
+// Utility functions
+// =============================================================================
+// Compare while allowing relative error of epsilon
+// zero_threshold indicates when we should drop comparison for small numbers
+template <typename T, typename precision_t>
+bool compare_close(const T &a, const T &b, const precision_t epsilon, precision_t zero_threshold)
+{
+  return ((zero_threshold > a && zero_threshold > b)) ||
+         (a >= b * (1.0 - epsilon)) && (a <= b * (1.0 + epsilon));
+}
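// A throwaway illustration of compare_close's semantics (values invented):
// magnitudes both under zero_threshold pass outright, everything else must
// agree within a relative factor of (1 +/- epsilon).
inline void example_compare_close()
{
  bool near_one  = compare_close(1.00004, 1.00009, 1e-4, 1e-10);  // true
  bool near_zero = compare_close(3e-12, 9e-11, 1e-4, 1e-10);      // true
  bool far_off   = compare_close(1.0, 1.1, 1e-4, 1e-10);          // false
  (void)near_one; (void)near_zero; (void)far_off;
}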
+
+// =============================================================================
+// Test Suite
+// =============================================================================
+// Defines Betweenness Centrality UseCase
+// SSSP's test suite code uses type of Graph parameter that could be used
+// (MTX / RMAT)
+typedef struct EdgeBC_Usecase_t {
+  std::string config_;     // Path to graph file
+  std::string file_path_;  // Complete path to graph using dataset_root_dir
+  int number_of_sources_;  // Starting point from the traversal
+  EdgeBC_Usecase_t(const std::string &config, int number_of_sources)
+    : config_(config), number_of_sources_(number_of_sources)
+  {
+    // assume relative paths are relative to RAPIDS_DATASET_ROOT_DIR
+    // FIXME: Use platform independent stuff from c++14/17 on compiler update
+    const std::string &rapidsDatasetRootDir = cugraph::test::get_rapids_dataset_root_dir();
+    if ((config_ != "") && (config_[0] != '/')) {
+      file_path_ = rapidsDatasetRootDir + "/" + config_;
+    } else {
+      file_path_ = config_;
+    }
+  };
+} EdgeBC_Usecase;
+
+class Tests_EdgeBC : public ::testing::TestWithParam<EdgeBC_Usecase> {
+  raft::handle_t handle;
+
+ public:
+  Tests_EdgeBC() {}
+  static void SetupTestCase() {}
+  static void TearDownTestCase() {}
+
+  virtual void SetUp() {}
+  virtual void TearDown() {}
+  // FIXME: Should normalize be part of the configuration instead?
+  // vertex_t      vertex identifier data type
+  // edge_t        edge identifier data type
+  // weight_t      edge weight data type
+  // result_t      result data type
+  // normalize     should the result be normalized
+  template <typename vertex_t, typename edge_t, typename weight_t, typename result_t, bool normalize>
+  void run_current_test(const EdgeBC_Usecase &configuration)
+  {
+    // Step 1: Construction of the graph based on configuration
+    bool is_directed = false;
+    auto csr = cugraph::test::generate_graph_csr_from_mm<vertex_t, edge_t, weight_t>(
+      is_directed, configuration.file_path_);
+    cudaDeviceSynchronize();
+    cugraph::GraphCSRView<vertex_t, edge_t, weight_t> G = csr->view();
+    G.prop.directed = is_directed;
+    CUDA_TRY(cudaGetLastError());
+    std::vector<result_t> result(G.number_of_edges, 0);
+    std::vector<result_t> expected(G.number_of_edges, 0);
+
+    // Step 2: Generation of sources based on configuration
+    //         if number_of_sources_ is 0 then sources must be nullptr
+    //         Otherwise we only use the first k values
+    ASSERT_TRUE(configuration.number_of_sources_ >= 0 &&
+                configuration.number_of_sources_ <= G.number_of_vertices)
+      << "Number of sources should be >= 0 and"
+      << " at most the number of vertices in the graph";
+    std::vector<vertex_t> sources(configuration.number_of_sources_);
+    thrust::sequence(thrust::host, sources.begin(), sources.end(), 0);
+
+    vertex_t *sources_ptr = nullptr;
+    if (configuration.number_of_sources_ > 0) { sources_ptr = sources.data(); }
+
+    reference_edge_betweenness_centrality(
+      G, expected.data(), normalize, configuration.number_of_sources_, sources_ptr);
+
+    sources_ptr = nullptr;
+    if (configuration.number_of_sources_ > 0) { sources_ptr = sources.data(); }
+
+    rmm::device_vector<result_t> d_result(G.number_of_edges);
+    cugraph::edge_betweenness_centrality(handle,
+                                         G,
+                                         d_result.data().get(),
+                                         normalize,
+                                         static_cast<weight_t *>(nullptr),
+                                         configuration.number_of_sources_,
+                                         sources_ptr);
+    CUDA_TRY(cudaMemcpy(result.data(),
+                        d_result.data().get(),
+                        sizeof(result_t) * G.number_of_edges,
+                        cudaMemcpyDeviceToHost));
+    for (int i = 0; i < G.number_of_edges; ++i)
+      EXPECT_TRUE(compare_close(result[i], expected[i], TEST_EPSILON, TEST_ZERO_THRESHOLD))
+        << "[MISMATCH] edge id = " << i << ", cugraph = " << result[i]
+        << " expected = " << expected[i];
+  }
+};
+
+// ============================================================================
+// Tests
+// ============================================================================
+// Verify Un-Normalized results
+TEST_P(Tests_EdgeBC, CheckFP32_NO_NORMALIZE)
+{
+  run_current_test<int, int, float, float, false>(GetParam());
+}
+
+TEST_P(Tests_EdgeBC, CheckFP64_NO_NORMALIZE)
+{
+  run_current_test<int, int, double, double, false>(GetParam());
+}
+
+// Verify Normalized results
+TEST_P(Tests_EdgeBC, CheckFP32_NORMALIZE)
+{
+  run_current_test<int, int, float, float, true>(GetParam());
+}
+
+TEST_P(Tests_EdgeBC, CheckFP64_NORMALIZE)
+{
+  run_current_test<int, int, double, double, true>(GetParam());
+}
+
+INSTANTIATE_TEST_CASE_P(simple_test,
+                        Tests_EdgeBC,
+                        ::testing::Values(EdgeBC_Usecase("test/datasets/karate.mtx", 0),
+                                          EdgeBC_Usecase("test/datasets/netscience.mtx", 0),
+                                          EdgeBC_Usecase("test/datasets/netscience.mtx", 4),
+                                          EdgeBC_Usecase("test/datasets/wiki2003.mtx", 4),
+                                          EdgeBC_Usecase("test/datasets/wiki-Talk.mtx", 4)));
+
+CUGRAPH_TEST_PROGRAM_MAIN()
diff --git a/cpp/tests/centrality/katz_centrality_test.cu b/cpp/tests/centrality/katz_centrality_test.cu
index 69c543714ca..97f499fc920 100644
--- a/cpp/tests/centrality/katz_centrality_test.cu
+++ b/cpp/tests/centrality/katz_centrality_test.cu
@@ -1,15 +1,34 @@
-#include
-#include
+/*
+ * Copyright (c)
2019-2020, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include +#include +#include + #include -#include + +#include #include -#include -#include "cuda_profiler_api.h" -#include "gmock/gmock-generated-matchers.h" -#include "gmock/gmock.h" -#include "gtest/gtest.h" -#include "high_res_clock.h" -#include "test_utils.h" + +#include +#include + +#include + +#include std::vector getGoldenTopKIds(std::ifstream& fs_result, int k = 10) { @@ -37,13 +56,13 @@ std::vector getTopKIds(double* p_katz, int count, int k = 10) } template -int getMaxDegree(cugraph::experimental::GraphCSRView const& g) +int getMaxDegree(cugraph::GraphCSRView const& g) { cudaStream_t stream{nullptr}; rmm::device_vector degree_vector(g.number_of_vertices); ET* p_degree = degree_vector.data().get(); - g.degree(p_degree, cugraph::experimental::DegreeDirection::OUT); + g.degree(p_degree, cugraph::DegreeDirection::OUT); ET max_out_degree = thrust::reduce(rmm::exec_policy(stream)->on(stream), p_degree, p_degree + g.number_of_vertices, @@ -58,7 +77,7 @@ typedef struct Katz_Usecase_t { Katz_Usecase_t(const std::string& a, const std::string& b) { // assume relative paths are relative to RAPIDS_DATASET_ROOT_DIR - const std::string& rapidsDatasetRootDir = get_rapids_dataset_root_dir(); + const std::string& rapidsDatasetRootDir = cugraph::test::get_rapids_dataset_root_dir(); if ((a != "") && (a[0] != '/')) { matrix_file = rapidsDatasetRootDir + "/" + a; } else { @@ -97,7 +116,7 @@ class Tests_Katz : public ::testing::TestWithParam { int m, k; int nnz; MM_typecode mc; - ASSERT_EQ(mm_properties(fpin, 1, &mc, &m, &k, &nnz), 0) + ASSERT_EQ(cugraph::test::mm_properties(fpin, 1, &mc, &m, &k, &nnz), 0) << "could not read Matrix Market file properties" << "\n"; ASSERT_TRUE(mm_is_matrix(mc)); @@ -111,16 +130,16 @@ class Tests_Katz : public ::testing::TestWithParam { std::vector katz_centrality(m); // Read - ASSERT_EQ((mm_to_coo(fpin, 1, nnz, &cooRowInd[0], &cooColInd[0], &cooVal[0], NULL)), + ASSERT_EQ((cugraph::test::mm_to_coo( + fpin, 1, nnz, &cooRowInd[0], &cooColInd[0], &cooVal[0], NULL)), 0) << "could not read matrix data" << "\n"; ASSERT_EQ(fclose(fpin), 0); - cugraph::experimental::GraphCOOView cooview( - &cooColInd[0], &cooRowInd[0], nullptr, m, nnz); - auto csr = cugraph::coo_to_csr(cooview); - cugraph::experimental::GraphCSRView G = csr->view(); + cugraph::GraphCOOView cooview(&cooColInd[0], &cooRowInd[0], nullptr, m, nnz); + auto csr = cugraph::coo_to_csr(cooview); + cugraph::GraphCSRView G = csr->view(); rmm::device_vector katz_vector(m); double* d_katz = thrust::raw_pointer_cast(katz_vector.data()); @@ -137,7 +156,6 @@ class Tests_Katz : public ::testing::TestWithParam { } }; -// --gtest_filter=*simple_test* INSTANTIATE_TEST_CASE_P( simple_test, Tests_Katz, @@ -148,11 +166,4 @@ INSTANTIATE_TEST_CASE_P( TEST_P(Tests_Katz, Check) { run_current_test(GetParam()); } -int main(int argc, char** argv) -{ - testing::InitGoogleTest(&argc, argv); - auto resource = std::make_unique(); - 
rmm::mr::set_default_resource(resource.get()); - int rc = RUN_ALL_TESTS(); - return rc; -} +CUGRAPH_TEST_PROGRAM_MAIN() diff --git a/cpp/tests/community/balanced_edge_test.cpp b/cpp/tests/community/balanced_edge_test.cpp index 69e34f49e84..81cee945821 100644 --- a/cpp/tests/community/balanced_edge_test.cpp +++ b/cpp/tests/community/balanced_edge_test.cpp @@ -8,14 +8,12 @@ * license agreement from NVIDIA CORPORATION is strictly prohibited. * */ -#include +#include #include #include -#include - TEST(balanced_edge, success) { std::vector off_h = {0, 16, 25, 35, 41, 44, 48, 52, 56, 61, 63, 66, @@ -50,7 +48,7 @@ TEST(balanced_edge, success) rmm::device_vector weights_v(w_h); rmm::device_vector result_v(cluster_id); - cugraph::experimental::GraphCSRView G( + cugraph::GraphCSRView G( offsets_v.data().get(), indices_v.data().get(), weights_v.data().get(), num_verts, num_edges); int num_clusters{8}; @@ -61,25 +59,18 @@ TEST(balanced_edge, success) int kmean_max_iter{100}; float score; - cugraph::nvgraph::balancedCutClustering(G, - num_clusters, - num_eigenvectors, - evs_tolerance, - evs_max_iter, - kmean_tolerance, - kmean_max_iter, - result_v.data().get()); - cugraph::nvgraph::analyzeClustering_edge_cut(G, num_clusters, result_v.data().get(), &score); + cugraph::ext_raft::balancedCutClustering(G, + num_clusters, + num_eigenvectors, + evs_tolerance, + evs_max_iter, + kmean_tolerance, + kmean_max_iter, + result_v.data().get()); + cugraph::ext_raft::analyzeClustering_edge_cut(G, num_clusters, result_v.data().get(), &score); std::cout << "score = " << score << std::endl; ASSERT_LT(score, float{55.0}); } -int main(int argc, char** argv) -{ - testing::InitGoogleTest(&argc, argv); - auto resource = std::make_unique(); - rmm::mr::set_default_resource(resource.get()); - int rc = RUN_ALL_TESTS(); - return rc; -} +CUGRAPH_TEST_PROGRAM_MAIN() diff --git a/cpp/tests/community/ecg_test.cu b/cpp/tests/community/ecg_test.cu index b21c2f1d67f..6246a42021d 100644 --- a/cpp/tests/community/ecg_test.cu +++ b/cpp/tests/community/ecg_test.cu @@ -8,17 +8,16 @@ * license agreement from NVIDIA CORPORATION is strictly prohibited. * */ -#include +#include #include #include -#include #include -#include TEST(ecg, success) { + // FIXME: verify that this is the karate dataset std::vector off_h = {0, 16, 25, 35, 41, 44, 48, 52, 56, 61, 63, 66, 67, 69, 74, 76, 78, 80, 82, 84, 87, 89, 91, 93, 98, 101, 104, 106, 110, 113, 117, 121, 127, 139, 156}; @@ -43,7 +42,7 @@ TEST(ecg, success) rmm::device_vector weights_v(w_h); rmm::device_vector result_v(cluster_id); - cugraph::experimental::GraphCSRView graph_csr( + cugraph::GraphCSRView graph_csr( offsets_v.data().get(), indices_v.data().get(), weights_v.data().get(), num_verts, num_edges); cugraph::ecg(graph_csr, .05, 16, result_v.data().get()); @@ -61,14 +60,14 @@ TEST(ecg, success) float modularity{0.0}; - cugraph::nvgraph::analyzeClustering_modularity( + cugraph::ext_raft::analyzeClustering_modularity( graph_csr, max + 1, result_v.data().get(), &modularity); + // 0.399 is 5% below the reference value returned in + // /python/utils/ECG_Golden.ipynb on the same dataset ASSERT_GT(modularity, 0.399); } -// This test currently fails... 
leaving it in since once louvain is fixed -// it should pass TEST(ecg, dolphin) { std::vector off_h = {0, 6, 14, 18, 21, 22, 26, 32, 37, 43, 50, 55, 56, @@ -104,7 +103,7 @@ TEST(ecg, dolphin) rmm::device_vector weights_v(w_h); rmm::device_vector result_v(cluster_id); - cugraph::experimental::GraphCSRView graph_csr( + cugraph::GraphCSRView graph_csr( offsets_v.data().get(), indices_v.data().get(), weights_v.data().get(), num_verts, num_edges); cugraph::ecg(graph_csr, .05, 16, result_v.data().get()); @@ -122,7 +121,7 @@ TEST(ecg, dolphin) float modularity{0.0}; - cugraph::nvgraph::analyzeClustering_modularity( + cugraph::ext_raft::analyzeClustering_modularity( graph_csr, max + 1, result_v.data().get(), &modularity); float random_modularity{0.95 * 0.4962422251701355}; @@ -130,11 +129,4 @@ TEST(ecg, dolphin) ASSERT_GT(modularity, random_modularity); } -int main(int argc, char** argv) -{ - testing::InitGoogleTest(&argc, argv); - auto resource = std::make_unique(); - rmm::mr::set_default_resource(resource.get()); - int rc = RUN_ALL_TESTS(); - return rc; -} +CUGRAPH_TEST_PROGRAM_MAIN() diff --git a/cpp/tests/community/leiden_test.cpp b/cpp/tests/community/leiden_test.cpp new file mode 100644 index 00000000000..1e8ba85249d --- /dev/null +++ b/cpp/tests/community/leiden_test.cpp @@ -0,0 +1,73 @@ +/* + * Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. + * + * NVIDIA CORPORATION and its licensors retain all intellectual property + * and proprietary rights in and to this software, related documentation + * and any modifications thereto. Any use, reproduction, disclosure or + * distribution of this software and related documentation without an express + * license agreement from NVIDIA CORPORATION is strictly prohibited. + * + */ +#include + +#include +#include + +#include + +#include + +#include + +TEST(leiden_karate, success) +{ + std::vector off_h = {0, 16, 25, 35, 41, 44, 48, 52, 56, 61, 63, 66, + 67, 69, 74, 76, 78, 80, 82, 84, 87, 89, 91, 93, + 98, 101, 104, 106, 110, 113, 117, 121, 127, 139, 156}; + std::vector ind_h = { + 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 17, 19, 21, 31, 0, 2, 3, 7, 13, 17, 19, + 21, 30, 0, 1, 3, 7, 8, 9, 13, 27, 28, 32, 0, 1, 2, 7, 12, 13, 0, 6, 10, 0, 6, + 10, 16, 0, 4, 5, 16, 0, 1, 2, 3, 0, 2, 30, 32, 33, 2, 33, 0, 4, 5, 0, 0, 3, + 0, 1, 2, 3, 33, 32, 33, 32, 33, 5, 6, 0, 1, 32, 33, 0, 1, 33, 32, 33, 0, 1, 32, + 33, 25, 27, 29, 32, 33, 25, 27, 31, 23, 24, 31, 29, 33, 2, 23, 24, 33, 2, 31, 33, 23, 26, + 32, 33, 1, 8, 32, 33, 0, 24, 25, 28, 32, 33, 2, 8, 14, 15, 18, 20, 22, 23, 29, 30, 31, + 33, 8, 9, 13, 14, 15, 18, 19, 20, 22, 23, 26, 27, 28, 29, 30, 31, 32}; + std::vector w_h = { + 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, + 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, + 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, + 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, + 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, + 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, + 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, + 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, + 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0}; + + int num_verts = off_h.size() - 1; + int num_edges = ind_h.size(); + + std::vector 
cluster_id(num_verts, -1); + + rmm::device_vector offsets_v(off_h); + rmm::device_vector indices_v(ind_h); + rmm::device_vector weights_v(w_h); + rmm::device_vector result_v(cluster_id); + + cugraph::GraphCSRView G( + offsets_v.data().get(), indices_v.data().get(), weights_v.data().get(), num_verts, num_edges); + + float modularity{0.0}; + int num_level = 40; + + cugraph::leiden(G, modularity, num_level, result_v.data().get()); + + cudaMemcpy((void*)&(cluster_id[0]), + result_v.data().get(), + sizeof(int) * num_verts, + cudaMemcpyDeviceToHost); + + int min = *min_element(cluster_id.begin(), cluster_id.end()); + + ASSERT_GE(min, 0); + ASSERT_GE(modularity, 0.41116042 * 0.99); +} diff --git a/cpp/tests/community/louvain_test.cpp b/cpp/tests/community/louvain_test.cpp index 7784deec7d6..391af641b73 100644 --- a/cpp/tests/community/louvain_test.cpp +++ b/cpp/tests/community/louvain_test.cpp @@ -8,17 +8,17 @@ * license agreement from NVIDIA CORPORATION is strictly prohibited. * */ -#include +#include #include #include +#include + #include #include -#include - TEST(louvain, success) { std::vector off_h = {0, 16, 25, 35, 41, 44, 48, 52, 56, 61, 63, 66, @@ -53,7 +53,7 @@ TEST(louvain, success) rmm::device_vector weights_v(w_h); rmm::device_vector result_v(cluster_id); - cugraph::experimental::GraphCSRView G( + cugraph::GraphCSRView G( offsets_v.data().get(), indices_v.data().get(), weights_v.data().get(), num_verts, num_edges); float modularity{0.0}; @@ -72,11 +72,140 @@ TEST(louvain, success) ASSERT_GE(modularity, 0.402777 * 0.95); } -int main(int argc, char** argv) +TEST(louvain_modularity, simple) { - testing::InitGoogleTest(&argc, argv); - auto resource = std::make_unique(); - rmm::mr::set_default_resource(resource.get()); - int rc = RUN_ALL_TESTS(); - return rc; + std::vector off_h = {0, 1, 4, 7, 10, 11, 12}; + std::vector src_ind_h = {0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 5}; + std::vector ind_h = {1, 0, 2, 3, 1, 3, 4, 1, 2, 5, 2, 3}; + std::vector w_h = {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0}; + std::vector v_weights_h = {1.0, 3.0, 3.0, 3.0, 1.0, 1.0}; + + // + // Initial cluster, everything on its own + // + std::vector cluster_h = {0, 1, 2, 3, 4, 5}; + std::vector cluster_weights_h = {1.0, 3.0, 3.0, 3.0, 1.0, 1.0}; + + std::vector cluster_hash_h = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + std::vector delta_Q_h = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0}; + std::vector tmp_size_V_h = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0}; + + int num_verts = off_h.size() - 1; + int num_edges = ind_h.size(); + + float q{0.0}; + + rmm::device_vector offsets_v(off_h); + rmm::device_vector src_indices_v(src_ind_h); + rmm::device_vector indices_v(ind_h); + rmm::device_vector weights_v(w_h); + rmm::device_vector vertex_weights_v(v_weights_h); + rmm::device_vector cluster_v(cluster_h); + rmm::device_vector cluster_weights_v(cluster_weights_h); + rmm::device_vector cluster_hash_v(cluster_hash_h); + rmm::device_vector delta_Q_v(delta_Q_h); + rmm::device_vector tmp_size_V_v(tmp_size_V_h); + + cudaStream_t stream{0}; + + // + // Create graph + // + cugraph::GraphCSRView G( + offsets_v.data().get(), indices_v.data().get(), weights_v.data().get(), num_verts, num_edges); + + q = cugraph::detail::modularity(float{12}, float{1}, G, cluster_v.data().get()); + + ASSERT_FLOAT_EQ(q, float{-30.0 / 144.0}); + + cugraph::detail::compute_delta_modularity(float{12}, + float{1}, + G, + src_indices_v, + vertex_weights_v, + cluster_weights_v, + cluster_v, + cluster_hash_v, + delta_Q_v, + tmp_size_V_v); + + 
CUDA_TRY(cudaMemcpy(cluster_hash_h.data(), + cluster_hash_v.data().get(), + sizeof(int) * num_edges, + cudaMemcpyDeviceToHost)); + CUDA_TRY(cudaMemcpy( + delta_Q_h.data(), delta_Q_v.data().get(), sizeof(float) * num_edges, cudaMemcpyDeviceToHost)); + + ASSERT_EQ(cluster_hash_h[0], 1); + ASSERT_EQ(cluster_hash_h[10], 2); + ASSERT_EQ(cluster_hash_h[11], 3); + ASSERT_FLOAT_EQ(delta_Q_h[0], float{1.0 / 8.0}); + ASSERT_FLOAT_EQ(delta_Q_h[10], float{1.0 / 8.0}); + ASSERT_FLOAT_EQ(delta_Q_h[11], float{1.0 / 8.0}); + + // + // Move vertex 0 into cluster 1 + // + cluster_h[0] = 1; + cluster_weights_h[0] = 0.0; + cluster_weights_h[1] = 4.0; + + CUDA_TRY(cudaMemcpy( + cluster_v.data().get(), cluster_h.data(), sizeof(int) * num_verts, cudaMemcpyHostToDevice)); + CUDA_TRY(cudaMemcpy(cluster_weights_v.data().get(), + cluster_weights_h.data(), + sizeof(float) * num_verts, + cudaMemcpyHostToDevice)); + + q = cugraph::detail::modularity(float{12}, float{1}, G, cluster_v.data().get()); + + ASSERT_FLOAT_EQ(q, float{-12.0 / 144.0}); + + cugraph::detail::compute_delta_modularity(float{12}, + float{1}, + G, + src_indices_v, + vertex_weights_v, + cluster_weights_v, + cluster_v, + cluster_hash_v, + delta_Q_v, + tmp_size_V_v); + + CUDA_TRY(cudaMemcpy(cluster_hash_h.data(), + cluster_hash_v.data().get(), + sizeof(int) * num_edges, + cudaMemcpyDeviceToHost)); + CUDA_TRY(cudaMemcpy( + delta_Q_h.data(), delta_Q_v.data().get(), sizeof(float) * num_edges, cudaMemcpyDeviceToHost)); + + ASSERT_EQ(cluster_hash_h[10], 2); + ASSERT_EQ(cluster_hash_h[11], 3); + ASSERT_FLOAT_EQ(delta_Q_h[10], float{1.0 / 8.0}); + ASSERT_FLOAT_EQ(delta_Q_h[11], float{1.0 / 8.0}); + + // + // Move vertex 1 into cluster 2. Not the optimal, in fact it will reduce + // modularity (so Louvain would never do this), but let's see if it reduces + // by the expected amount (-12/144). + // + ASSERT_EQ(cluster_hash_h[3], 2); + ASSERT_FLOAT_EQ(delta_Q_h[3], float{-12.0 / 144.0}); + + cluster_h[1] = 2; + cluster_weights_h[1] = 1.0; + cluster_weights_h[2] = 6.0; + + CUDA_TRY(cudaMemcpy( + cluster_v.data().get(), cluster_h.data(), sizeof(int) * num_verts, cudaMemcpyHostToDevice)); + CUDA_TRY(cudaMemcpy(cluster_weights_v.data().get(), + cluster_weights_h.data(), + sizeof(float) * num_verts, + cudaMemcpyHostToDevice)); + + q = cugraph::detail::modularity(float{12}, float{1}, G, cluster_v.data().get()); + + ASSERT_FLOAT_EQ(q, float{-24.0 / 144.0}); } + +CUGRAPH_TEST_PROGRAM_MAIN() diff --git a/cpp/tests/community/triangle_test.cu b/cpp/tests/community/triangle_test.cu index 6440284f099..1c5c99261d2 100644 --- a/cpp/tests/community/triangle_test.cu +++ b/cpp/tests/community/triangle_test.cu @@ -8,14 +8,12 @@ * license agreement from NVIDIA CORPORATION is strictly prohibited. 
* */ -#include +#include #include #include -#include #include -#include TEST(triangle, dolphin) { @@ -51,16 +49,13 @@ TEST(triangle, dolphin) rmm::device_vector indices_v(ind_h); rmm::device_vector weights_v(w_h); - cugraph::experimental::GraphCSRView graph_csr( + cugraph::GraphCSRView graph_csr( offsets_v.data().get(), indices_v.data().get(), weights_v.data().get(), num_verts, num_edges); uint64_t count{0}; - // ASSERT_NO_THROW((count = cugraph::nvgraph::triangle_count(graph_csr))); - try { - count = cugraph::nvgraph::triangle_count(graph_csr); + count = cugraph::triangle::triangle_count(graph_csr); } catch (std::exception& e) { std::cout << "Exception: " << e.what() << std::endl; } @@ -68,11 +63,4 @@ TEST(triangle, dolphin) ASSERT_EQ(count, expected); } -int main(int argc, char** argv) -{ - testing::InitGoogleTest(&argc, argv); - auto resource = std::make_unique(); - rmm::mr::set_default_resource(resource.get()); - int rc = RUN_ALL_TESTS(); - return rc; -} +CUGRAPH_TEST_PROGRAM_MAIN() diff --git a/cpp/tests/components/con_comp_test.cu b/cpp/tests/components/con_comp_test.cu index f2a6cba35c3..15d60867753 100644 --- a/cpp/tests/components/con_comp_test.cu +++ b/cpp/tests/components/con_comp_test.cu @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. * * NVIDIA CORPORATION and its licensors retain all intellectual property * and proprietary rights in and to this software, related documentation @@ -12,17 +12,18 @@ // connected components tests // Author: Andrei Schaffer aschaffer@nvidia.com -#include "cuda_profiler_api.h" -#include "gtest/gtest.h" -#include "high_res_clock.h" +#include +#include +#include + +#include -#include #include #include #include + +#include #include -#include -#include "test_utils.h" // do the perf measurements // enabled by command line parameter s'--perf' @@ -34,7 +35,7 @@ struct Usecase { explicit Usecase(const std::string& a) { // assume relative paths are relative to RAPIDS_DATASET_ROOT_DIR - const std::string& rapidsDatasetRootDir = get_rapids_dataset_root_dir(); + const std::string& rapidsDatasetRootDir = cugraph::test::get_rapids_dataset_root_dir(); if ((a != "") && (a[0] != '/')) { matrix_file = rapidsDatasetRootDir + "/" + a; } else { @@ -71,9 +72,10 @@ struct Tests_Weakly_CC : ::testing::TestWithParam { const ::testing::TestInfo* const test_info = ::testing::UnitTest::GetInstance()->current_test_info(); std::stringstream ss; - std::string test_id = - std::string(test_info->test_case_name()) + std::string(".") + std::string(test_info->name()) + - std::string("_") + getFileName(param.get_matrix_file()) + std::string("_") + ss.str().c_str(); + std::string test_id = std::string(test_info->test_case_name()) + std::string(".") + + std::string(test_info->name()) + std::string("_") + + cugraph::test::getFileName(param.get_matrix_file()) + std::string("_") + + ss.str().c_str(); int m, k, nnz; // MM_typecode mc; @@ -84,7 +86,7 @@ struct Tests_Weakly_CC : ::testing::TestWithParam { FILE* fpin = fopen(param.get_matrix_file().c_str(), "r"); ASSERT_NE(fpin, nullptr) << "fopen (" << param.get_matrix_file() << ") failure."; - ASSERT_EQ(mm_properties(fpin, 1, &mc, &m, &k, &nnz), 0) + ASSERT_EQ(cugraph::test::mm_properties(fpin, 1, &mc, &m, &k, &nnz), 0) << "could not read Matrix Market file properties" << "\n"; ASSERT_TRUE(mm_is_matrix(mc)); @@ -104,16 +106,16 @@ struct Tests_Weakly_CC : ::testing::TestWithParam { // Read: COO Format // - ASSERT_EQ((mm_to_coo(fpin, 1, 
nnz, &cooRowInd[0], &cooColInd[0], nullptr, nullptr)), + ASSERT_EQ((cugraph::test::mm_to_coo( + fpin, 1, nnz, &cooRowInd[0], &cooColInd[0], nullptr, nullptr)), 0) << "could not read matrix data" << "\n"; ASSERT_EQ(fclose(fpin), 0); - cugraph::experimental::GraphCOOView G_coo( - &cooRowInd[0], &cooColInd[0], nullptr, m, nnz); - auto G_unique = cugraph::coo_to_csr(G_coo); - cugraph::experimental::GraphCSRView G = G_unique->view(); + cugraph::GraphCOOView G_coo(&cooRowInd[0], &cooColInd[0], nullptr, m, nnz); + auto G_unique = cugraph::coo_to_csr(G_coo); + cugraph::GraphCSRView G = G_unique->view(); rmm::device_vector d_labels(m); @@ -146,11 +148,4 @@ INSTANTIATE_TEST_CASE_P(simple_test, Usecase("test/datasets/coPapersCiteseer.mtx"), Usecase("test/datasets/hollywood.mtx"))); -int main(int argc, char** argv) -{ - testing::InitGoogleTest(&argc, argv); - auto resource = std::make_unique(); - rmm::mr::set_default_resource(resource.get()); - int rc = RUN_ALL_TESTS(); - return rc; -} +CUGRAPH_TEST_PROGRAM_MAIN() diff --git a/cpp/tests/components/scc_test.cu b/cpp/tests/components/scc_test.cu index e8d15790f68..9d5b55f34c6 100644 --- a/cpp/tests/components/scc_test.cu +++ b/cpp/tests/components/scc_test.cu @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. * * NVIDIA CORPORATION and its licensors retain all intellectual property * and proprietary rights in and to this software, related documentation @@ -12,24 +12,23 @@ // strongly connected components tests // Author: Andrei Schaffer aschaffer@nvidia.com -#include "cuda_profiler_api.h" -#include "gtest/gtest.h" -#include "high_res_clock.h" +#include +#include +#include + +#include +#include +#include +#include +#include + +#include #include #include #include #include -#include "test_utils.h" - -#include -#include -#include - -#include -#include "components/scc_matrix.cuh" -#include "topology/topology.cuh" // do the perf measurements // enabled by command line parameter s'--perf' @@ -37,14 +36,14 @@ static int PERF = 0; template -using DVector = thrust::device_vector; +using DVector = rmm::device_vector; namespace { // un-nammed struct Usecase { explicit Usecase(const std::string& a) { // assume relative paths are relative to RAPIDS_DATASET_ROOT_DIR - const std::string& rapidsDatasetRootDir = get_rapids_dataset_root_dir(); + const std::string& rapidsDatasetRootDir = cugraph::test::get_rapids_dataset_root_dir(); if ((a != "") && (a[0] != '/')) { matrix_file = rapidsDatasetRootDir + "/" + a; } else { @@ -120,9 +119,10 @@ struct Tests_Strongly_CC : ::testing::TestWithParam { const ::testing::TestInfo* const test_info = ::testing::UnitTest::GetInstance()->current_test_info(); std::stringstream ss; - std::string test_id = - std::string(test_info->test_case_name()) + std::string(".") + std::string(test_info->name()) + - std::string("_") + getFileName(param.get_matrix_file()) + std::string("_") + ss.str().c_str(); + std::string test_id = std::string(test_info->test_case_name()) + std::string(".") + + std::string(test_info->name()) + std::string("_") + + cugraph::test::getFileName(param.get_matrix_file()) + std::string("_") + + ss.str().c_str(); using ByteT = unsigned char; using IndexT = int; @@ -136,7 +136,7 @@ struct Tests_Strongly_CC : ::testing::TestWithParam { FILE* fpin = fopen(param.get_matrix_file().c_str(), "r"); ASSERT_NE(fpin, nullptr) << "fopen (" << param.get_matrix_file().c_str() << ") failure."; - ASSERT_EQ(mm_properties(fpin, 1, &mc, &m, &k, 
&nnz), 0) + ASSERT_EQ(cugraph::test::mm_properties(fpin, 1, &mc, &m, &k, &nnz), 0) << "could not read Matrix Market file properties" << "\n"; ASSERT_TRUE(mm_is_matrix(mc)); @@ -159,16 +159,16 @@ struct Tests_Strongly_CC : ::testing::TestWithParam { // Read: COO Format // - ASSERT_EQ( - (mm_to_coo(fpin, 1, nnz, &cooRowInd[0], &cooColInd[0], nullptr, nullptr)), 0) + ASSERT_EQ((cugraph::test::mm_to_coo( + fpin, 1, nnz, &cooRowInd[0], &cooColInd[0], nullptr, nullptr)), + 0) << "could not read matrix data" << "\n"; ASSERT_EQ(fclose(fpin), 0); - cugraph::experimental::GraphCOOView G_coo( - &cooRowInd[0], &cooColInd[0], nullptr, m, nnz); - auto G_unique = cugraph::coo_to_csr(G_coo); - cugraph::experimental::GraphCSRView G = G_unique->view(); + cugraph::GraphCOOView G_coo(&cooRowInd[0], &cooColInd[0], nullptr, m, nnz); + auto G_unique = cugraph::coo_to_csr(G_coo); + cugraph::GraphCSRView G = G_unique->view(); rmm::device_vector d_labels(m); @@ -208,11 +208,4 @@ INSTANTIATE_TEST_CASE_P( Usecase("test/datasets/cage6.mtx") // DG "small" enough to meet SCC GPU memory requirements )); -int main(int argc, char** argv) -{ - testing::InitGoogleTest(&argc, argv); - auto resource = std::make_unique(); - rmm::mr::set_default_resource(resource.get()); - int rc = RUN_ALL_TESTS(); - return rc; -} +CUGRAPH_TEST_PROGRAM_MAIN() diff --git a/cpp/tests/db/find_matches_test.cu b/cpp/tests/db/find_matches_test.cu index 3b44b682d34..c1373bb8bf2 100644 --- a/cpp/tests/db/find_matches_test.cu +++ b/cpp/tests/db/find_matches_test.cu @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -14,14 +14,14 @@ * limitations under the License. 
*/ -#include -#include "db/db_operators.cuh" -#include "gtest/gtest.h" -#include "high_res_clock.h" -#include "rmm/device_buffer.hpp" -#include "test_utils.h" -#include "utilities/error_utils.h" -#include "utilities/graph_utils.cuh" +#include +#include +#include +#include + +#include + +#include class Test_FindMatches : public ::testing::Test { public: @@ -229,11 +229,4 @@ TEST_F(Test_FindMatches, fifthTest) ASSERT_EQ(resultB[1], 3); } -int main(int argc, char** argv) -{ - testing::InitGoogleTest(&argc, argv); - auto resource = std::make_unique(); - rmm::mr::set_default_resource(resource.get()); - int rc = RUN_ALL_TESTS(); - return rc; -} +CUGRAPH_TEST_PROGRAM_MAIN() diff --git a/cpp/tests/layout/force_atlas2_test.cu b/cpp/tests/layout/force_atlas2_test.cu index a18f5525bb6..d564765d0df 100644 --- a/cpp/tests/layout/force_atlas2_test.cu +++ b/cpp/tests/layout/force_atlas2_test.cu @@ -12,17 +12,21 @@ // Force_Atlas2 tests // Author: Hugo Linsenmaier hlinsenmaier@nvidia.com -#include +#include +#include +#include + +#include #include -#include #include + +#include +#include + +#include + +#include #include -#include -#include "cuda_profiler_api.h" -#include "gtest/gtest.h" -#include "high_res_clock.h" -#include "test_utils.h" -#include "trust_worthiness.h" // do the perf measurements // enabled by command line parameter s'--perf' @@ -38,7 +42,7 @@ typedef struct Force_Atlas2_Usecase_t { Force_Atlas2_Usecase_t(const std::string& a, const float b) { // assume relative paths are relative to RAPIDS_DATASET_ROOT_DIR - const std::string& rapidsDatasetRootDir = get_rapids_dataset_root_dir(); + const std::string& rapidsDatasetRootDir = cugraph::test::get_rapids_dataset_root_dir(); if ((a != "") && (a[0] != '/')) { matrix_file = rapidsDatasetRootDir + "/" + a; } else { @@ -83,7 +87,8 @@ class Tests_Force_Atlas2 : public ::testing::TestWithParam std::stringstream ss; std::string test_id = std::string(test_info->test_case_name()) + std::string(".") + std::string(test_info->name()) + std::string("_") + - getFileName(param.matrix_file) + std::string("_") + ss.str().c_str(); + cugraph::test::getFileName(param.matrix_file) + std::string("_") + + ss.str().c_str(); int m, k, nnz; MM_typecode mc; @@ -92,7 +97,7 @@ class Tests_Force_Atlas2 : public ::testing::TestWithParam FILE* fpin = fopen(param.matrix_file.c_str(), "r"); ASSERT_NE(fpin, nullptr) << "fopen (" << param.matrix_file << ") failure."; - ASSERT_EQ(mm_properties(fpin, 1, &mc, &m, &k, &nnz), 0) + ASSERT_EQ(cugraph::test::mm_properties(fpin, 1, &mc, &m, &k, &nnz), 0) << "could not read Matrix Market file properties" << "\n"; ASSERT_TRUE(mm_is_matrix(mc)); @@ -111,7 +116,9 @@ class Tests_Force_Atlas2 : public ::testing::TestWithParam float* d_force_atlas2 = force_atlas2_vector.data().get(); // Read - ASSERT_EQ((mm_to_coo(fpin, 1, nnz, &cooRowInd[0], &cooColInd[0], &cooVal[0], NULL)), 0) + ASSERT_EQ((cugraph::test::mm_to_coo( + fpin, 1, nnz, &cooRowInd[0], &cooColInd[0], &cooVal[0], NULL)), + 0) << "could not read matrix data" << "\n"; ASSERT_EQ(fclose(fpin), 0); @@ -132,10 +139,11 @@ class Tests_Force_Atlas2 : public ::testing::TestWithParam int* dests = dests_v.data().get(); T* weights = weights_v.data().get(); + // FIXME: RAFT error handling mechanism should be used instead CUDA_TRY(cudaMemcpy(srcs, &cooRowInd[0], sizeof(int) * nnz, cudaMemcpyDefault)); CUDA_TRY(cudaMemcpy(dests, &cooColInd[0], sizeof(int) * nnz, cudaMemcpyDefault)); CUDA_TRY(cudaMemcpy(weights, &cooVal[0], sizeof(T) * nnz, cudaMemcpyDefault)); - cugraph::experimental::GraphCOOView 
G(srcs, dests, weights, m, nnz); + cugraph::GraphCOOView G(srcs, dests, weights, m, nnz); const int max_iter = 500; float* x_start = nullptr; @@ -199,8 +207,7 @@ class Tests_Force_Atlas2 : public ::testing::TestWithParam // Copy pos to host std::vector h_pos(m * 2); - CUDA_RT_CALL( - cudaMemcpy(&h_pos[0], d_force_atlas2, sizeof(float) * m * 2, cudaMemcpyDeviceToHost)); + CUDA_TRY(cudaMemcpy(&h_pos[0], d_force_atlas2, sizeof(float) * m * 2, cudaMemcpyDeviceToHost)); // Transpose the data std::vector> C_contiguous_embedding(m, std::vector(2)); @@ -230,11 +237,4 @@ INSTANTIATE_TEST_CASE_P(simple_test, Force_Atlas2_Usecase("test/datasets/netscience.mtx", 0.80))); -int main(int argc, char** argv) -{ - testing::InitGoogleTest(&argc, argv); - auto resource = std::make_unique(); - rmm::mr::set_default_resource(resource.get()); - int rc = RUN_ALL_TESTS(); - return rc; -} +CUGRAPH_TEST_PROGRAM_MAIN() diff --git a/cpp/tests/layout/knn.h b/cpp/tests/layout/knn.h index d42318288fc..07d07528769 100644 --- a/cpp/tests/layout/knn.h +++ b/cpp/tests/layout/knn.h @@ -20,6 +20,7 @@ #include #include #include +#include struct point { point() {} diff --git a/cpp/tests/layout/trust_worthiness.h b/cpp/tests/layout/trust_worthiness.h index 5d3f4436950..40c9782a76e 100644 --- a/cpp/tests/layout/trust_worthiness.h +++ b/cpp/tests/layout/trust_worthiness.h @@ -16,6 +16,10 @@ #include "knn.h" +#include +#include +#include + double euclidian_dist(const std::vector& x, const std::vector& y) { double total = 0; diff --git a/cpp/tests/nccl/degree_test.cu b/cpp/tests/nccl/degree_test.cu deleted file mode 100644 index 9bba66efe1e..00000000000 --- a/cpp/tests/nccl/degree_test.cu +++ /dev/null @@ -1,130 +0,0 @@ -/* - * Copyright (c) 2020, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -#include -#include -#include -#include -#include -#include "gtest/gtest.h" -#include "test_utils.h" - -// ref Degree on the host -template -void ref_degree_h(std::vector &ind_h, std::vector °ree) -{ - for (size_t i = 0; i < degree.size(); i++) degree[i] = 0; - for (size_t i = 0; i < ind_h.size(); i++) degree[ind_h[i]] += 1; -} - -// global to local offsets by shifting all offsets by the first offset value -template -void shift_by_front(std::vector &v) -{ - auto start = v.front(); - for (auto i = size_t{0}; i < v.size(); ++i) v[i] -= start; -} - -// 1D partitioning such as each GPU has about the same number of edges -template -void opg_edge_partioning( - int r, int p, std::vector &ind_h, std::vector &part_offset, size_t &e_loc) -{ - // set first and last partition offsets - part_offset[0] = 0; - part_offset[p] = ind_h.size(); - // part_offset[p] = *(std::max_element(ind_h.begin(), ind_h.end())); - auto loc_nnz = ind_h.size() / p; - for (int i = 1; i < p; i++) { - // get the first vertex ID of each partition - auto start_nnz = i * loc_nnz; - auto start_v = 0; - for (auto j = size_t{0}; j < ind_h.size(); ++j) { - if (j >= start_nnz) { - start_v = j; - break; - } - } - part_offset[i] = start_v; - } - e_loc = part_offset[r + 1] - part_offset[r]; -} -TEST(degree, success) -{ - int v = 6; - - // host - std::vector src_h = {0, 0, 2, 2, 2, 3, 3, 4, 4, 5, 5}, - dest_h = {1, 2, 0, 1, 4, 4, 5, 3, 5, 3, 1}; - std::vector degree_h(v, 0.0), degree_ref(v, 0.0); - - // MG - int p; - MPICHECK(MPI_Comm_size(MPI_COMM_WORLD, &p)); - cugraph::experimental::Comm comm(p); - std::vector part_offset(p + 1); - auto i = comm.get_rank(); - size_t e_loc; - - opg_edge_partioning(i, p, src_h, part_offset, e_loc); -#ifdef OPG_VERBOSE - sleep(i); - for (auto j = part_offset.begin(); j != part_offset.end(); ++j) std::cout << *j << ' '; - std::cout << std::endl; - std::cout << "eloc: " << e_loc << std::endl; -#endif - std::vector src_loc_h(src_h.begin() + part_offset[i], - src_h.begin() + part_offset[i] + e_loc), - dest_loc_h(dest_h.begin() + part_offset[i], dest_h.begin() + part_offset[i] + e_loc); - shift_by_front(src_loc_h); - - // print mg info - printf("# Rank %2d - Pid %6d - device %2d\n", comm.get_rank(), getpid(), comm.get_dev()); - - // local device - thrust::device_vector src_d(src_loc_h.begin(), src_loc_h.end()); - thrust::device_vector dest_d(dest_loc_h.begin(), dest_loc_h.end()); - thrust::device_vector degree_d(v); - - // load local chunck to cugraph - cugraph::experimental::GraphCOO G(thrust::raw_pointer_cast(src_d.data()), - thrust::raw_pointer_cast(dest_d.data()), - nullptr, - degree_h.size(), - e_loc); - G.set_communicator(comm); - - // OUT degree - G.degree(thrust::raw_pointer_cast(degree_d.data()), cugraph::experimental::DegreeDirection::IN); - thrust::copy(degree_d.begin(), degree_d.end(), degree_h.begin()); - ref_degree_h(dest_h, degree_ref); - // sleep(i); - for (size_t j = 0; j < degree_h.size(); ++j) EXPECT_EQ(degree_ref[j], degree_h[j]); - std::cout << "Rank " << i << " done checking." 
<< std::endl; -} - -int main(int argc, char **argv) -{ - testing::InitGoogleTest(&argc, argv); - MPI_Init(&argc, &argv); - { - auto resource = std::make_unique(); - rmm::mr::set_default_resource(resource.get()); - int rc = RUN_ALL_TESTS(); - } - MPI_Finalize(); - return rc; -} diff --git a/cpp/tests/nccl/nccl_test.cu b/cpp/tests/nccl/nccl_test.cu deleted file mode 100644 index 6c8bb2043eb..00000000000 --- a/cpp/tests/nccl/nccl_test.cu +++ /dev/null @@ -1,76 +0,0 @@ -#include -#include -#include -#include -#include -#include "gtest/gtest.h" -#include "test_utils.h" - -TEST(allgather, success) -{ - int p = 1, r = 0, dev = 0, dev_count = 0; - MPICHECK(MPI_Comm_size(MPI_COMM_WORLD, &p)); - MPICHECK(MPI_Comm_rank(MPI_COMM_WORLD, &r)); - CUDA_RT_CALL(cudaGetDeviceCount(&dev_count)); - - // shortcut for device ID here - // may need something smarter later - dev = r % dev_count; - // cudaSetDevice must happen before ncclCommInitRank - CUDA_RT_CALL(cudaSetDevice(dev)); - - // print info - printf("# Rank %2d - Pid %6d - device %2d\n", r, getpid(), dev); - - // NCCL init - ncclUniqueId id; - ncclComm_t comm; - if (r == 0) NCCLCHECK(ncclGetUniqueId(&id)); - MPICHECK(MPI_Bcast((void *)&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD)); - NCCLCHECK(ncclCommInitRank(&comm, p, id, r)); - MPICHECK(MPI_Barrier(MPI_COMM_WORLD)); - - // allocate device buffers - int size = 3; - float *sendbuff, *recvbuff; - CUDA_RT_CALL(cudaMalloc(&sendbuff, size * sizeof(float))); - CUDA_RT_CALL(cudaMalloc(&recvbuff, size * p * sizeof(float))); - - // init values - thrust::fill( - thrust::device_pointer_cast(sendbuff), thrust::device_pointer_cast(sendbuff + size), (float)r); - thrust::fill( - thrust::device_pointer_cast(recvbuff), thrust::device_pointer_cast(recvbuff + size * p), -1.0f); - - // ncclAllGather - NCCLCHECK(ncclAllGather( - (const void *)sendbuff, (void *)recvbuff, size, ncclFloat, comm, cudaStreamDefault)); - - // expect each rankid printed size times in ascending order - if (r == 0) { - thrust::device_ptr dev_ptr(recvbuff); - std::cout.precision(15); - thrust::copy(dev_ptr, dev_ptr + size * p, std::ostream_iterator(std::cout, " ")); - std::cout << std::endl; - } - - // free device buffers - CUDA_RT_CALL(cudaFree(sendbuff)); - CUDA_RT_CALL(cudaFree(recvbuff)); - - // finalizing NCCL - NCCLCHECK(ncclCommDestroy(comm)); -} - -int main(int argc, char **argv) -{ - testing::InitGoogleTest(&argc, argv); - MPI_Init(&argc, &argv); - { - auto resource = std::make_unique(); - rmm::mr::set_default_resource(resource.get()); - int rc = RUN_ALL_TESTS(); - } - MPI_Finalize(); - return rc; -} diff --git a/cpp/tests/pagerank/pagerank_test.cu b/cpp/tests/pagerank/pagerank_test.cpp similarity index 74% rename from cpp/tests/pagerank/pagerank_test.cu rename to cpp/tests/pagerank/pagerank_test.cpp index 977650c6c90..48705f7f324 100644 --- a/cpp/tests/pagerank/pagerank_test.cu +++ b/cpp/tests/pagerank/pagerank_test.cpp @@ -1,5 +1,5 @@ /* - * Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2018-2020, NVIDIA CORPORATION. All rights reserved. 
* * NVIDIA CORPORATION and its licensors retain all intellectual property * and proprietary rights in and to this software, related documentation @@ -12,15 +12,21 @@ // Pagerank solver tests // Author: Alex Fender afender@nvidia.com -#include +#include +#include +#include + #include -#include #include -#include -#include "cuda_profiler_api.h" -#include "gtest/gtest.h" -#include "high_res_clock.h" -#include "test_utils.h" + +#include +#include + +#include + +#include + +#include // do the perf measurements // enabled by command line parameter s'--perf' @@ -36,7 +42,7 @@ typedef struct Pagerank_Usecase_t { Pagerank_Usecase_t(const std::string& a, const std::string& b) { // assume relative paths are relative to RAPIDS_DATASET_ROOT_DIR - const std::string& rapidsDatasetRootDir = get_rapids_dataset_root_dir(); + const std::string& rapidsDatasetRootDir = cugraph::test::get_rapids_dataset_root_dir(); if ((a != "") && (a[0] != '/')) { matrix_file = rapidsDatasetRootDir + "/" + a; } else { @@ -81,7 +87,8 @@ class Tests_Pagerank : public ::testing::TestWithParam { std::stringstream ss; std::string test_id = std::string(test_info->test_case_name()) + std::string(".") + std::string(test_info->name()) + std::string("_") + - getFileName(param.matrix_file) + std::string("_") + ss.str().c_str(); + cugraph::test::getFileName(param.matrix_file) + std::string("_") + + ss.str().c_str(); int m, k, nnz; MM_typecode mc; @@ -101,7 +108,7 @@ class Tests_Pagerank : public ::testing::TestWithParam { FILE* fpin = fopen(param.matrix_file.c_str(), "r"); ASSERT_NE(fpin, nullptr) << "fopen (" << param.matrix_file << ") failure."; - ASSERT_EQ(mm_properties(fpin, 1, &mc, &m, &k, &nnz), 0) + ASSERT_EQ(cugraph::test::mm_properties(fpin, 1, &mc, &m, &k, &nnz), 0) << "could not read Matrix Market file properties" << "\n"; ASSERT_TRUE(mm_is_matrix(mc)); @@ -114,37 +121,39 @@ class Tests_Pagerank : public ::testing::TestWithParam { std::vector cooVal(nnz), pagerank(m); // device alloc - rmm::device_vector pagerank_vector(m); - T* d_pagerank = thrust::raw_pointer_cast(pagerank_vector.data()); + rmm::device_uvector pagerank_vector(static_cast(m), nullptr); + T* d_pagerank = pagerank_vector.data(); // Read - ASSERT_EQ((mm_to_coo(fpin, 1, nnz, &cooRowInd[0], &cooColInd[0], &cooVal[0], NULL)), 0) + ASSERT_EQ((cugraph::test::mm_to_coo( + fpin, 1, nnz, &cooRowInd[0], &cooColInd[0], &cooVal[0], NULL)), + 0) << "could not read matrix data" << "\n"; ASSERT_EQ(fclose(fpin), 0); // Pagerank runs on CSC, so feed COOtoCSR the row/col backwards. 
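The comment above is the heart of the hunk that follows: PageRank pulls rank along incoming edges, so the test hands coo_to_csr the column indices where the row indices would normally go, and the compressed result serves as the CSC of the original graph. A hedged sketch of that pattern as a standalone helper (the helper name is invented, cugraph graph headers are assumed to be included, and template arguments are spelled out here even though this diff's own lines lost theirs to formatting):

    #include <vector>

    // Build the CSC of a graph by compressing its transposed COO: pass
    // destinations where sources would normally go, so the resulting offsets
    // index incoming rather than outgoing edges.
    template <typename VT, typename ET, typename WT>
    auto csc_via_transposed_coo(std::vector<VT>& src,
                                std::vector<VT>& dst,
                                std::vector<WT>& wgt,
                                VT num_verts,
                                ET num_edges)
    {
      cugraph::GraphCOOView<VT, ET, WT> coo(
        dst.data(), src.data(), wgt.data(), num_verts, num_edges);  // note the swap
      return cugraph::coo_to_csr(coo);  // unique_ptr owning the CSC-like structure
    }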
- cugraph::experimental::GraphCOOView G_coo( - &cooColInd[0], &cooRowInd[0], &cooVal[0], m, nnz); + raft::handle_t handle; + cugraph::GraphCOOView G_coo(&cooColInd[0], &cooRowInd[0], &cooVal[0], m, nnz); auto G_unique = cugraph::coo_to_csr(G_coo); - cugraph::experimental::GraphCSCView G(G_unique->view().offsets, - G_unique->view().indices, - G_unique->view().edge_data, - G_unique->view().number_of_vertices, - G_unique->view().number_of_edges); + cugraph::GraphCSCView G(G_unique->view().offsets, + G_unique->view().indices, + G_unique->view().edge_data, + G_unique->view().number_of_vertices, + G_unique->view().number_of_edges); cudaDeviceSynchronize(); if (PERF) { hr_clock.start(); for (int i = 0; i < PERF_MULTIPLIER; ++i) { - cugraph::pagerank(G, d_pagerank); + cugraph::pagerank(handle, G, d_pagerank); cudaDeviceSynchronize(); } hr_clock.stop(&time_tmp); pagerank_time.push_back(time_tmp); } else { cudaProfilerStart(); - cugraph::pagerank(G, d_pagerank); + cugraph::pagerank(handle, G, d_pagerank); cudaProfilerStop(); cudaDeviceSynchronize(); } @@ -153,14 +162,13 @@ class Tests_Pagerank : public ::testing::TestWithParam { if (param.result_file.length() > 0) { std::vector calculated_res(m); - CUDA_RT_CALL( - cudaMemcpy(&calculated_res[0], d_pagerank, sizeof(T) * m, cudaMemcpyDeviceToHost)); + CUDA_TRY(cudaMemcpy(&calculated_res[0], d_pagerank, sizeof(T) * m, cudaMemcpyDeviceToHost)); std::sort(calculated_res.begin(), calculated_res.end()); fpin = fopen(param.result_file.c_str(), "rb"); ASSERT_TRUE(fpin != NULL) << " Cannot read file with reference data: " << param.result_file << std::endl; std::vector expected_res(m); - ASSERT_EQ(read_binary_vector(fpin, m, expected_res), 0); + ASSERT_EQ(cugraph::test::read_binary_vector(fpin, m, expected_res), 0); fclose(fpin); T err; int n_err = 0; @@ -195,11 +203,4 @@ INSTANTIATE_TEST_CASE_P( Pagerank_Usecase("test/datasets/webbase-1M.mtx", "test/ref/pagerank/webbase-1M.pagerank_val_0.85.bin"))); -int main(int argc, char** argv) -{ - testing::InitGoogleTest(&argc, argv); - auto resource = std::make_unique(); - rmm::mr::set_default_resource(resource.get()); - int rc = RUN_ALL_TESTS(); - return rc; -} +CUGRAPH_TEST_PROGRAM_MAIN() diff --git a/cpp/tests/renumber/renumber_test.cu b/cpp/tests/renumber/renumber_test.cu index 1601eff284f..608adc59ccb 100644 --- a/cpp/tests/renumber/renumber_test.cu +++ b/cpp/tests/renumber/renumber_test.cu @@ -1,7 +1,7 @@ // -*-c++-*- /* - * Copyright (c) 2019, NVIDIA CORPORATION. + * Copyright (c) 2019-2020, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -16,20 +16,19 @@ * limitations under the License. 
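One more substantive change in the PageRank test above is worth calling out: rmm::device_vector (thrust-backed, implicitly zero-initializing) gives way to rmm::device_uvector, which is stream-ordered and deliberately leaves its contents uninitialized. A minimal sketch, assuming the rmm 0.15 API:

    #include <rmm/device_uvector.hpp>

    #include <cuda_runtime.h>

    #include <cstddef>

    void scratch_buffer_example(std::size_t n, cudaStream_t stream)
    {
      // Allocated asynchronously on `stream`; contents are undefined until written.
      rmm::device_uvector<float> v(n, stream);
      // Initialize explicitly only when the algorithm actually needs it:
      cudaMemsetAsync(v.data(), 0, n * sizeof(float), stream);
    }

The payoff is skipping a kernel launch per allocation for buffers the algorithm overwrites anyway, which matters in tests that allocate in loops.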
*/ -#include "gmock/gmock.h" -#include "gtest/gtest.h" +//#include "gmock/gmock.h" -#include "cuda_profiler_api.h" +#include -#include -#include -#include -#include "converters/renumber.cuh" +#include -#include +#include +#include #include +#include + struct RenumberingTest : public ::testing::Test { }; @@ -577,11 +576,4 @@ TEST_F(RenumberingTest, Random500MVertexSet) std::cout << " hash size = " << hash_size << std::endl; } -int main(int argc, char **argv) -{ - testing::InitGoogleTest(&argc, argv); - auto resource = std::make_unique(); - rmm::mr::set_default_resource(resource.get()); - int rc = RUN_ALL_TESTS(); - return rc; -} +CUGRAPH_TEST_PROGRAM_MAIN() diff --git a/cpp/tests/test_utils.h b/cpp/tests/test_utils.h deleted file mode 100644 index ca8555c5cc7..00000000000 --- a/cpp/tests/test_utils.h +++ /dev/null @@ -1,691 +0,0 @@ -/* - * Copyright (c) 2019-2020, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -#pragma once - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -extern "C" { -#include "mmio.h" -} -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include - -#include "utilities/error_utils.h" - -#include "converters/COOtoCSR.cuh" - -#ifndef CUDA_RT_CALL -#define CUDA_RT_CALL(call) \ - { \ - cudaError_t cudaStatus = call; \ - if (cudaSuccess != cudaStatus) { \ - fprintf(stderr, \ - "ERROR: CUDA RT call \"%s\" in line %d of file %s failed with %s (%d).\n", \ - #call, \ - __LINE__, \ - __FILE__, \ - cudaGetErrorString(cudaStatus), \ - cudaStatus); \ - } \ - } -#endif - -#define NCCLCHECK(cmd) \ - { \ - ncclResult_t nccl_status = cmd; \ - if (nccl_status != ncclSuccess) { \ - printf("NCCL failure %s:%d '%s'\n", __FILE__, __LINE__, ncclGetErrorString(nccl_status)); \ - FAIL(); \ - } \ - } - -#define MPICHECK(cmd) \ - { \ - int e = cmd; \ - if (e != MPI_SUCCESS) { \ - printf("Failed: MPI error %s:%d '%d'\n", __FILE__, __LINE__, e); \ - FAIL(); \ - } \ - } - -std::string getFileName(const std::string& s) -{ - char sep = '/'; - -#ifdef _WIN32 - sep = '\\'; -#endif - - size_t i = s.rfind(sep, s.length()); - if (i != std::string::npos) { return (s.substr(i + 1, s.length() - i)); } - return (""); -} - -template -void verbose_diff(std::vector& v1, std::vector& v2) -{ - for (unsigned int i = 0; i < v1.size(); ++i) { - if (v1[i] != v2[i]) { - std::cout << "[" << i << "] : " << v1[i] << " vs. " << v2[i] << std::endl; - } - } -} - -template -int eq(std::vector& v1, std::vector& v2) -{ - if (v1 == v2) - return 0; - else { - verbose_diff(v1, v2); - return 1; - } -} - -template -void printv(size_t n, T* vec, int offset) -{ - thrust::device_ptr dev_ptr(vec); - std::cout.precision(15); - std::cout << "sample size = " << n << ", offset = " << offset << std::endl; - thrust::copy( - dev_ptr + offset, - dev_ptr + offset + n, - std::ostream_iterator( - std::cout, " ")); // Assume no RMM dependency; FIXME: check / test (potential BUG !!!!!) 
- std::cout << std::endl; -} - -template -void random_vals(std::vector& v) -{ - srand(42); - for (auto i = size_t{0}; i < v.size(); i++) v[i] = static_cast(std::rand() % 10); -} - -template -void ref_csr2csc(int m, - int n, - int nnz, - const T_ELEM* csrVals, - const int* csrRowptr, - const int* csrColInd, - T_ELEM* cscVals, - int* cscRowind, - int* cscColptr, - int base = 0) -{ - int i, j, row, col, index; - int* counters; - T_ELEM val; - - /* early return */ - if ((m <= 0) || (n <= 0) || (nnz <= 0)) { return; } - - /* build compressed column pointers */ - memset(cscColptr, 0, (n + 1) * sizeof(cscColptr[0])); - cscColptr[0] = base; - for (i = 0; i < nnz; i++) { cscColptr[1 + csrColInd[i] - base]++; } - for (i = 0; i < n; i++) { cscColptr[i + 1] += cscColptr[i]; } - - /* expand row indecis and copy them and values into csc arrays according to permutation */ - counters = (int*)malloc(n * sizeof(counters[0])); - memset(counters, 0, n * sizeof(counters[0])); - for (i = 0; i < m; i++) { - for (j = csrRowptr[i]; j < csrRowptr[i + 1]; j++) { - row = i + base; - col = csrColInd[j - base]; - - index = cscColptr[col - base] - base + counters[col - base]; - counters[col - base]++; - - cscRowind[index] = row; - - if (csrVals != NULL || cscVals != NULL) { - val = csrVals[j - base]; - cscVals[index] = val; - } - } - } - free(counters); -} - -template -int transition_matrix_cpu(int n, int e, int* csrRowPtrA, int* csrColIndA, T* weight, T* is_leaf) -// omp_set_num_threads(4); -//#pragma omp parallel -{ - int j, row, row_size; - //#pragma omp for - for (row = 0; row < n; row++) { - row_size = csrRowPtrA[row + 1] - csrRowPtrA[row]; - if (row_size == 0) - is_leaf[row] = 1.0; - else { - is_leaf[row] = 0.0; - for (j = csrRowPtrA[row]; j < csrRowPtrA[row + 1]; j++) weight[j] = 1.0 / row_size; - } - } - return 0; -} -template -void printCsrMatI(int m, - int n, - int nnz, - std::vector& csrRowPtr, - std::vector& csrColInd, - std::vector& csrVal) -{ - std::vector v(n); - std::stringstream ss; - ss.str(std::string()); - ss << std::fixed; - ss << std::setprecision(2); - for (int i = 0; i < m; i++) { - std::fill(v.begin(), v.end(), 0); - for (int j = csrRowPtr[i]; j < csrRowPtr[i + 1]; j++) v[csrColInd[j]] = csrVal[j]; - - std::copy(v.begin(), v.end(), std::ostream_iterator(ss, " ")); - ss << "\n"; - } - ss << "\n"; - std::cout << ss.str(); -} - -/// Read matrix properties from Matrix Market file -/** Matrix Market file is assumed to be a sparse matrix in coordinate - * format. - * - * @param f File stream for Matrix Market file. - * @param tg Boolean indicating whether to convert matrix to general - * format (from symmetric, Hermitian, or skew symmetric format). - * @param t (Output) MM_typecode with matrix properties. - * @param m (Output) Number of matrix rows. - * @param n (Output) Number of matrix columns. - * @param nnz (Output) Number of non-zero matrix entries. - * @return Zero if properties were read successfully. Otherwise - * non-zero. 
- */ -template -int mm_properties(FILE* f, int tg, MM_typecode* t, IndexType_* m, IndexType_* n, IndexType_* nnz) -{ - // Read matrix properties from file - int mint, nint, nnzint; - if (fseek(f, 0, SEEK_SET)) { - fprintf(stderr, "Error: could not set position in file\n"); - return -1; - } - if (mm_read_banner(f, t)) { - fprintf(stderr, "Error: could not read Matrix Market file banner\n"); - return -1; - } - if (!mm_is_matrix(*t) || !mm_is_coordinate(*t)) { - fprintf(stderr, "Error: file does not contain matrix in coordinate format\n"); - return -1; - } - if (mm_read_mtx_crd_size(f, &mint, &nint, &nnzint)) { - fprintf(stderr, "Error: could not read matrix dimensions\n"); - return -1; - } - if (!mm_is_pattern(*t) && !mm_is_real(*t) && !mm_is_integer(*t) && !mm_is_complex(*t)) { - fprintf(stderr, "Error: matrix entries are not valid type\n"); - return -1; - } - *m = mint; - *n = nint; - *nnz = nnzint; - - // Find total number of non-zero entries - if (tg && !mm_is_general(*t)) { - // Non-diagonal entries should be counted twice - IndexType_ nnzOld = *nnz; - *nnz *= 2; - - // Diagonal entries should not be double-counted - int i; - int st; - for (i = 0; i < nnzOld; ++i) { - // Read matrix entry - IndexType_ row, col; - double rval, ival; - if (mm_is_pattern(*t)) - st = fscanf(f, "%d %d\n", &row, &col); - else if (mm_is_real(*t) || mm_is_integer(*t)) - st = fscanf(f, "%d %d %lg\n", &row, &col, &rval); - else // Complex matrix - st = fscanf(f, "%d %d %lg %lg\n", &row, &col, &rval, &ival); - if (ferror(f) || (st == EOF)) { - fprintf(stderr, "Error: error %d reading Matrix Market file (entry %d)\n", st, i + 1); - return -1; - } - - // Check if entry is diagonal - if (row == col) --(*nnz); - } - } - - return 0; -} - -/// Read Matrix Market file and convert to COO format matrix -/** Matrix Market file is assumed to be a sparse matrix in coordinate - * format. - * - * @param f File stream for Matrix Market file. - * @param tg Boolean indicating whether to convert matrix to general - * format (from symmetric, Hermitian, or skew symmetric format). - * @param nnz Number of non-zero matrix entries. - * @param cooRowInd (Output) Row indices for COO matrix. Should have - * at least nnz entries. - * @param cooColInd (Output) Column indices for COO matrix. Should - * have at least nnz entries. - * @param cooRVal (Output) Real component of COO matrix - * entries. Should have at least nnz entries. Ignored if null - * pointer. - * @param cooIVal (Output) Imaginary component of COO matrix - * entries. Should have at least nnz entries. Ignored if null - * pointer. - * @return Zero if matrix was read successfully. Otherwise non-zero. 
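Both readers documented in this deleted header survive the refactor under the cugraph::test namespace, as the call sites elsewhere in this diff show (the bare #include lines in this patch lost their targets to formatting; the new home is assumed to be a test utilities header under cpp/tests/utilities). A hedged sketch of the pair in use, with the MM_typecode type coming from the bundled mmio header:

    #include <cstdio>
    #include <vector>

    // Illustrative helper, not part of the tests: load an .mtx file into COO arrays.
    void load_mtx_as_coo(char const* path)
    {
      FILE* f = std::fopen(path, "r");
      MM_typecode mc;
      int m{}, k{}, nnz{};
      cugraph::test::mm_properties<int>(f, 1, &mc, &m, &k, &nnz);

      std::vector<int> row(nnz), col(nnz);
      std::vector<float> val(nnz);
      cugraph::test::mm_to_coo<int, float>(f, 1, nnz, row.data(), col.data(), val.data(), nullptr);
      std::fclose(f);
    }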
- */ -template -int mm_to_coo(FILE* f, - int tg, - IndexType_ nnz, - IndexType_* cooRowInd, - IndexType_* cooColInd, - ValueType_* cooRVal, - ValueType_* cooIVal) -{ - // Read matrix properties from file - MM_typecode t; - int m, n, nnzOld; - if (fseek(f, 0, SEEK_SET)) { - fprintf(stderr, "Error: could not set position in file\n"); - return -1; - } - if (mm_read_banner(f, &t)) { - fprintf(stderr, "Error: could not read Matrix Market file banner\n"); - return -1; - } - if (!mm_is_matrix(t) || !mm_is_coordinate(t)) { - fprintf(stderr, "Error: file does not contain matrix in coordinate format\n"); - return -1; - } - if (mm_read_mtx_crd_size(f, &m, &n, &nnzOld)) { - fprintf(stderr, "Error: could not read matrix dimensions\n"); - return -1; - } - if (!mm_is_pattern(t) && !mm_is_real(t) && !mm_is_integer(t) && !mm_is_complex(t)) { - fprintf(stderr, "Error: matrix entries are not valid type\n"); - return -1; - } - - // Add each matrix entry in file to COO format matrix - IndexType_ i; // Entry index in Matrix Market file - IndexType_ j = 0; // Entry index in COO format matrix - for (i = 0; i < nnzOld; ++i) { - // Read entry from file - int row, col; - double rval, ival; - int st; - if (mm_is_pattern(t)) { - st = fscanf(f, "%d %d\n", &row, &col); - rval = 1.0; - ival = 0.0; - } else if (mm_is_real(t) || mm_is_integer(t)) { - st = fscanf(f, "%d %d %lg\n", &row, &col, &rval); - ival = 0.0; - } else // Complex matrix - st = fscanf(f, "%d %d %lg %lg\n", &row, &col, &rval, &ival); - if (ferror(f) || (st == EOF)) { - fprintf(stderr, "Error: error %d reading Matrix Market file (entry %d)\n", st, i + 1); - return -1; - } - - // Switch to 0-based indexing - --row; - --col; - - // Record entry - cooRowInd[j] = row; - cooColInd[j] = col; - if (cooRVal != NULL) cooRVal[j] = rval; - if (cooIVal != NULL) cooIVal[j] = ival; - ++j; - - // Add symmetric complement of non-diagonal entries - if (tg && !mm_is_general(t) && (row != col)) { - // Modify entry value if matrix is skew symmetric or Hermitian - if (mm_is_skew(t)) { - rval = -rval; - ival = -ival; - } else if (mm_is_hermitian(t)) { - ival = -ival; - } - - // Record entry - cooRowInd[j] = col; - cooColInd[j] = row; - if (cooRVal != NULL) cooRVal[j] = rval; - if (cooIVal != NULL) cooIVal[j] = ival; - ++j; - } - } - return 0; -} - -/// Compare two tuples based on the element indexed by i -class lesser_tuple { - const int i; - - public: - lesser_tuple(int _i) : i(_i) {} - template - __host__ __device__ bool operator()(const Tuple1 t1, const Tuple2 t2) - { - switch (i) { - case 0: - return (thrust::get<0>(t1) == thrust::get<0>(t2) ? thrust::get<1>(t1) < thrust::get<1>(t2) - : thrust::get<0>(t1) < thrust::get<0>(t2)); - case 1: - return (thrust::get<1>(t1) == thrust::get<1>(t2) ? thrust::get<0>(t1) < thrust::get<0>(t2) - : thrust::get<1>(t1) < thrust::get<1>(t2)); - default: - return (thrust::get<0>(t1) == thrust::get<0>(t2) ? thrust::get<1>(t1) < thrust::get<1>(t2) - : thrust::get<0>(t1) < thrust::get<0>(t2)); - } - } -}; - -/// Sort entries in COO format matrix -/** Sort is stable. - * - * @param nnz Number of non-zero matrix entries. - * @param sort_by_row Boolean indicating whether matrix entries - * will be sorted by row index or by column index. - * @param cooRowInd Row indices for COO matrix. - * @param cooColInd Column indices for COO matrix. - * @param cooRVal Real component for COO matrix entries. Ignored if - * null pointer. - * @param cooIVal Imaginary component COO matrix entries. Ignored if - * null pointer. 
- */ -template -void coo_sort(IndexType_ nnz, - int sort_by_row, - IndexType_* cooRowInd, - IndexType_* cooColInd, - ValueType_* cooRVal, - ValueType_* cooIVal) -{ - // Determine whether to sort by row or by column - int i; - if (sort_by_row == 0) - i = 1; - else - i = 0; - - // Apply stable sort - using namespace thrust; - if ((cooRVal == NULL) && (cooIVal == NULL)) - stable_sort(make_zip_iterator(make_tuple(cooRowInd, cooColInd)), - make_zip_iterator(make_tuple(cooRowInd + nnz, cooColInd + nnz)), - lesser_tuple(i)); - else if ((cooRVal == NULL) && (cooIVal != NULL)) - stable_sort(make_zip_iterator(make_tuple(cooRowInd, cooColInd, cooIVal)), - make_zip_iterator(make_tuple(cooRowInd + nnz, cooColInd + nnz, cooIVal + nnz)), - lesser_tuple(i)); - else if ((cooRVal != NULL) && (cooIVal == NULL)) - stable_sort(make_zip_iterator(make_tuple(cooRowInd, cooColInd, cooRVal)), - make_zip_iterator(make_tuple(cooRowInd + nnz, cooColInd + nnz, cooRVal + nnz)), - lesser_tuple(i)); - else - stable_sort( - make_zip_iterator(make_tuple(cooRowInd, cooColInd, cooRVal, cooIVal)), - make_zip_iterator(make_tuple(cooRowInd + nnz, cooColInd + nnz, cooRVal + nnz, cooIVal + nnz)), - lesser_tuple(i)); -} - -template -void coo2csr(std::vector& cooRowInd, // in: I[] (overwrite) - const std::vector& cooColInd, // in: J[] - std::vector& csrRowPtr, // out - std::vector& csrColInd) // out -{ - std::vector> items; - for (auto i = size_t{0}; i < cooRowInd.size(); ++i) - items.push_back(std::make_pair(cooRowInd[i], cooColInd[i])); - // sort pairs - std::sort(items.begin(), - items.end(), - [](const std::pair& left, const std::pair& right) { - return left.first < right.first; - }); - for (auto i = size_t{0}; i < cooRowInd.size(); ++i) { - cooRowInd[i] = items[i].first; // save the sorted rows to compress them later - csrColInd[i] = items[i].second; // save the col idx, not sure if they are sorted for each row - } - // Count number of elements per row - for (auto i = size_t{0}; i < cooRowInd.size(); ++i) ++(csrRowPtr[cooRowInd[i] + 1]); - - // Compute cumulative sum to obtain row offsets/pointers - for (auto i = size_t{0}; i < csrRowPtr.size() - 1; ++i) csrRowPtr[i + 1] += csrRowPtr[i]; -} - -/// Compress sorted list of indices -/** For use in converting COO format matrix to CSR or CSC format. - * - * @param n Maximum index. - * @param nnz Number of non-zero matrix entries. - * @param sortedIndices Sorted list of indices (COO format). - * @param compressedIndices (Output) Compressed list of indices (CSR - * or CSC format). Should have at least n+1 entries. - */ -template -void coo_compress(IndexType_ m, - IndexType_ n, - IndexType_ nnz, - const IndexType_* __restrict__ sortedIndices, - IndexType_* __restrict__ compressedIndices) -{ - IndexType_ i; - - // Initialize everything to zero - memset(compressedIndices, 0, (m + 1) * sizeof(IndexType_)); - - // Count number of elements per row - for (i = 0; i < nnz; ++i) ++(compressedIndices[sortedIndices[i] + 1]); - - // Compute cumulative sum to obtain row offsets/pointers - for (i = 0; i < m; ++i) compressedIndices[i + 1] += compressedIndices[i]; -} - -/// Convert COO format matrix to CSR format -/** On output, matrix entries in COO format matrix will be sorted - * (primarily by row index, secondarily by column index). - * - * @param m Number of matrix rows. - * @param n Number of matrix columns. - * @param nnz Number of non-zero matrix entries. - * @param cooRowInd Row indices for COO matrix. - * @param cooColInd Column indices for COO matrix. 
- * @param cooRVal Real component of COO matrix entries. Ignored if - * null pointer. - * @param cooIVal Imaginary component of COO matrix entries. Ignored - * if null pointer. - * @param csrRowPtr Row pointers for CSR matrix. Should have at least - * n+1 entries. - * @param csrColInd Column indices for CSR matrix (identical to - * output of cooColInd). Should have at least nnz entries. Ignored if - * null pointer. - * @param csrRVal Real component of CSR matrix entries (identical to - * output of cooRVal). Should have at least nnz entries. Ignored if - * null pointer. - * @param csrIVal Imaginary component of CSR matrix entries - * (identical to output of cooIVal). Should have at least nnz - * entries. Ignored if null pointer. - * @return Zero if matrix was converted successfully. Otherwise - * non-zero. - */ -template -int coo_to_csr(IndexType_ m, - IndexType_ n, - IndexType_ nnz, - IndexType_* __restrict__ cooRowInd, - IndexType_* __restrict__ cooColInd, - ValueType_* __restrict__ cooRVal, - ValueType_* __restrict__ cooIVal, - IndexType_* __restrict__ csrRowPtr, - IndexType_* __restrict__ csrColInd, - ValueType_* __restrict__ csrRVal, - ValueType_* __restrict__ csrIVal) -{ - // Convert COO to CSR matrix - coo_sort(nnz, 0, cooRowInd, cooColInd, cooRVal, cooIVal); - coo_sort(nnz, 1, cooRowInd, cooColInd, cooRVal, cooIVal); - // coo_sort2(m, nnz, cooRowInd, cooColInd); - coo_compress(m, n, nnz, cooRowInd, csrRowPtr); - - // Copy arrays - if (csrColInd != NULL) memcpy(csrColInd, cooColInd, nnz * sizeof(IndexType_)); - if ((cooRVal != NULL) && (csrRVal != NULL)) memcpy(csrRVal, cooRVal, nnz * sizeof(ValueType_)); - if ((cooIVal != NULL) && (csrIVal != NULL)) memcpy(csrIVal, cooIVal, nnz * sizeof(ValueType_)); - - return 0; -} - -int read_binary_vector(FILE* fpin, int n, std::vector& val) -{ - size_t is_read1; - - double* t_storage = new double[n]; - is_read1 = fread(t_storage, sizeof(double), n, fpin); - for (int i = 0; i < n; i++) { - if (t_storage[i] == DBL_MAX) - val[i] = FLT_MAX; - else if (t_storage[i] == -DBL_MAX) - val[i] = -FLT_MAX; - else - val[i] = static_cast(t_storage[i]); - } - delete[] t_storage; - - if (is_read1 != (size_t)n) { - printf("%s", "I/O fail\n"); - return 1; - } - return 0; -} - -int read_binary_vector(FILE* fpin, int n, std::vector& val) -{ - size_t is_read1; - - is_read1 = fread(&val[0], sizeof(double), n, fpin); - - if (is_read1 != (size_t)n) { - printf("%s", "I/O fail\n"); - return 1; - } - return 0; -} - -// FIXME: A similar function could be useful for CSC format -// There are functions above that operate coo -> csr and coo->csc -/** - * @tparam - */ -template -std::unique_ptr> generate_graph_csr_from_mm( - bool& directed, std::string mm_file) -{ - VT number_of_vertices; - ET number_of_edges; - - FILE* fpin = fopen(mm_file.c_str(), "r"); - EXPECT_NE(fpin, nullptr); - - VT number_of_columns = 0; - MM_typecode mm_typecode{0}; - EXPECT_EQ(mm_properties( - fpin, 1, &mm_typecode, &number_of_vertices, &number_of_columns, &number_of_edges), - 0); - EXPECT_TRUE(mm_is_matrix(mm_typecode)); - EXPECT_TRUE(mm_is_coordinate(mm_typecode)); - EXPECT_FALSE(mm_is_complex(mm_typecode)); - EXPECT_FALSE(mm_is_skew(mm_typecode)); - - directed = !mm_is_symmetric(mm_typecode); - - // Allocate memory on host - std::vector coo_row_ind(number_of_edges); - std::vector coo_col_ind(number_of_edges); - std::vector coo_val(number_of_edges); - - // Read - EXPECT_EQ((mm_to_coo( - fpin, 1, number_of_edges, &coo_row_ind[0], &coo_col_ind[0], &coo_val[0], NULL)), - 0); - EXPECT_EQ(fclose(fpin), 
0); - - cugraph::experimental::GraphCOOView cooview( - &coo_col_ind[0], &coo_row_ind[0], &coo_val[0], number_of_vertices, number_of_edges); - - return cugraph::coo_to_csr(cooview); -} - -//////////////////////////////////////////////////////////////////////////////// -// FIXME: move this code to rapids-core -//////////////////////////////////////////////////////////////////////////////// - -// Define RAPIDS_DATASET_ROOT_DIR using a preprocessor variable to -// allow for a build to override the default. This is useful for -// having different builds for specific default dataset locations. -#ifndef RAPIDS_DATASET_ROOT_DIR -#define RAPIDS_DATASET_ROOT_DIR "/datasets" -#endif - -static const std::string& get_rapids_dataset_root_dir() -{ - static std::string rdrd(""); - // Env var always overrides the value of RAPIDS_DATASET_ROOT_DIR - if (rdrd == "") { - const char* envVar = std::getenv("RAPIDS_DATASET_ROOT_DIR"); - rdrd = (envVar != NULL) ? envVar : RAPIDS_DATASET_ROOT_DIR; - } - return rdrd; -} diff --git a/cpp/tests/test_utils.hpp b/cpp/tests/test_utils.hpp deleted file mode 100644 index f711705699a..00000000000 --- a/cpp/tests/test_utils.hpp +++ /dev/null @@ -1,47 +0,0 @@ -/* - * Copyright (c) 2019, NVIDIA CORPORATION. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -#pragma once - -#include -#include -#include - -#include - -#include - -namespace detail { - -template -rmm::device_buffer make_elements(InputIterator begin, InputIterator end) -{ - static_assert(cudf::is_fixed_width(), "Unexpected non-fixed width type."); - std::vector elements(begin, end); - return rmm::device_buffer{elements.data(), elements.size() * sizeof(Element)}; -} - -template -std::unique_ptr create_column(iterator_t begin, iterator_t end) -{ - cudf::size_type size = thrust::distance(begin, end); - - return std::unique_ptr( - new cudf::column{cudf::data_type{cudf::experimental::type_to_id()}, - size, - detail::make_elements(begin, end)}); -} - -} // namespace detail diff --git a/cpp/tests/traversal/bfs_ref.h b/cpp/tests/traversal/bfs_ref.h index c13342fa4f5..a32b2f99787 100644 --- a/cpp/tests/traversal/bfs_ref.h +++ b/cpp/tests/traversal/bfs_ref.h @@ -15,6 +15,7 @@ */ #pragma once +#include #include #include #include @@ -69,4 +70,4 @@ void ref_bfs(VT *indices, } } } -} \ No newline at end of file +} diff --git a/cpp/tests/traversal/bfs_test.cu b/cpp/tests/traversal/bfs_test.cu index 46ba2af2e6a..d90da4367a0 100644 --- a/cpp/tests/traversal/bfs_test.cu +++ b/cpp/tests/traversal/bfs_test.cu @@ -14,20 +14,20 @@ * limitations under the License. 
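A few hunks back, this patch deletes get_rapids_dataset_root_dir from test_utils.h; it reappears in cugraph::test with the same lookup order, where the RAPIDS_DATASET_ROOT_DIR environment variable always overrides the compile-time default of "/datasets". Restated as a standalone sketch of the pattern:

    #include <cstdlib>
    #include <string>

    std::string dataset_root_dir()
    {
      // Environment variable wins over the baked-in default.
      char const* env = std::getenv("RAPIDS_DATASET_ROOT_DIR");
      return (env != nullptr) ? std::string(env) : std::string("/datasets");
    }

In practice this means a run like RAPIDS_DATASET_ROOT_DIR=/my/data ./PAGERANK_TEST points every Usecase at a local dataset checkout without rebuilding.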
*/ -#include -#include -#include +#include "bfs_ref.h" + +#include +#include #include #include -#include -#include "gtest/gtest.h" -#include "test_utils.h" +#include -#include -#include "bfs_ref.h" +#include +#include +#include // NOTE: This could be common to other files but we might not want the same precision // depending on the algorithm @@ -61,7 +61,7 @@ typedef struct BFS_Usecase_t { int source_; // Starting point from the traversal BFS_Usecase_t(const std::string &config, int source) : config_(config), source_(source) { - const std::string &rapidsDatasetRootDir = get_rapids_dataset_root_dir(); + const std::string &rapidsDatasetRootDir = cugraph::test::get_rapids_dataset_root_dir(); if ((config_ != "") && (config_[0] != '/')) { file_path_ = rapidsDatasetRootDir + "/" + config_; } else { @@ -71,6 +71,8 @@ typedef struct BFS_Usecase_t { } BFS_Usecase; class Tests_BFS : public ::testing::TestWithParam { + raft::handle_t handle; + public: Tests_BFS() {} static void SetupTestCase() {} @@ -90,13 +92,13 @@ class Tests_BFS : public ::testing::TestWithParam { VT number_of_vertices; ET number_of_edges; bool directed = false; - auto csr = generate_graph_csr_from_mm(directed, configuration.file_path_); + auto csr = + cugraph::test::generate_graph_csr_from_mm(directed, configuration.file_path_); cudaDeviceSynchronize(); - cugraph::experimental::GraphCSRView G = csr->view(); - G.prop.directed = directed; - CUDA_CHECK_LAST(); + cugraph::GraphCSRView G = csr->view(); + G.prop.directed = directed; - ASSERT_TRUE(configuration.source_ >= 0 && configuration.source_ <= G.number_of_vertices) + ASSERT_TRUE(configuration.source_ >= 0 && (VT)configuration.source_ < G.number_of_vertices) << "Starting sources should be >= 0 and" << " less than the number of vertices in the graph"; @@ -138,10 +140,13 @@ class Tests_BFS : public ::testing::TestWithParam { std::vector cugraph_pred(number_of_vertices); std::vector cugraph_sigmas(number_of_vertices); - cugraph::bfs(G, + // Don't pass valid sp_sp_counter ptr unless needed because it disables + // the bottom up flow + cugraph::bfs(handle, + G, d_cugraph_dist.data().get(), d_cugraph_pred.data().get(), - d_cugraph_sigmas.data().get(), + (return_sp_counter) ? 
d_cugraph_sigmas.data().get() : nullptr, source, G.prop.directed); CUDA_TRY(cudaMemcpy(cugraph_dist.data(), @@ -152,10 +157,13 @@ class Tests_BFS : public ::testing::TestWithParam { d_cugraph_pred.data().get(), sizeof(VT) * d_cugraph_pred.size(), cudaMemcpyDeviceToHost)); - CUDA_TRY(cudaMemcpy(cugraph_sigmas.data(), - d_cugraph_sigmas.data().get(), - sizeof(double) * d_cugraph_sigmas.size(), - cudaMemcpyDeviceToHost)); + + if (return_sp_counter) { + CUDA_TRY(cudaMemcpy(cugraph_sigmas.data(), + d_cugraph_sigmas.data().get(), + sizeof(double) * d_cugraph_sigmas.size(), + cudaMemcpyDeviceToHost)); + } for (VT i = 0; i < number_of_vertices; ++i) { // Check distances: should be an exact match as we use signed int 32-bit @@ -166,7 +174,8 @@ class Tests_BFS : public ::testing::TestWithParam { // that the predecessor obtained with the GPU implementation is one of the // predecessors obtained during the C++ BFS traversal VT pred = cugraph_pred[i]; // It could be equal to -1 if the node is never reached - if (pred == -1) { + constexpr VT invalid_vid = cugraph::invalid_vertex_id::value; + if (pred == invalid_vid) { EXPECT_TRUE(ref_bfs_pred[i].empty()) << "[MISMATCH][PREDECESSOR] vaid = " << i << " cugraph had not predecessor," << "while c++ ref found at least one."; @@ -179,10 +188,6 @@ class Tests_BFS : public ::testing::TestWithParam { << "[MISMATCH][PREDECESSOR] vaid = " << i << " cugraph = " << cugraph_sigmas[i] << " , c++ ref did not consider it as a predecessor."; } - EXPECT_TRUE( - compare_close(cugraph_sigmas[i], ref_bfs_sigmas[i], TEST_EPSILON, TEST_ZERO_THRESHOLD)) - << "[MISMATCH] vaid = " << i << ", cugraph = " << cugraph_sigmas[i] - << " c++ ref = " << ref_bfs_sigmas[i]; if (return_sp_counter) { EXPECT_TRUE( @@ -197,16 +202,27 @@ class Tests_BFS : public ::testing::TestWithParam { // ============================================================================ // Tests // ============================================================================ -TEST_P(Tests_BFS, CheckFP32_NO_SP_COUNTER) { run_current_test(GetParam()); } -TEST_P(Tests_BFS, CheckFP64_NO_SP_COUNTER) +// We don't need to test WT for both float and double since it's anyway ignored in BFS +TEST_P(Tests_BFS, CheckUint32_NO_SP_COUNTER) { - run_current_test(GetParam()); + run_current_test(GetParam()); +} +TEST_P(Tests_BFS, CheckInt_NO_SP_COUNTER) { run_current_test(GetParam()); } +TEST_P(Tests_BFS, CheckInt64_NO_SP_COUNTER) +{ + run_current_test(GetParam()); } -TEST_P(Tests_BFS, CheckFP32_SP_COUNTER) { run_current_test(GetParam()); } - -TEST_P(Tests_BFS, CheckFP64_SP_COUNTER) { run_current_test(GetParam()); } +TEST_P(Tests_BFS, CheckUint32_SP_COUNTER) +{ + run_current_test(GetParam()); +} +TEST_P(Tests_BFS, CheckInt_SP_COUNTER) { run_current_test(GetParam()); } +TEST_P(Tests_BFS, CheckInt64_SP_COUNTER) +{ + run_current_test(GetParam()); +} INSTANTIATE_TEST_CASE_P(simple_test, Tests_BFS, @@ -217,11 +233,4 @@ INSTANTIATE_TEST_CASE_P(simple_test, BFS_Usecase("test/datasets/wiki2003.mtx", 1000), BFS_Usecase("test/datasets/wiki-Talk.mtx", 1000))); -int main(int argc, char **argv) -{ - testing::InitGoogleTest(&argc, argv); - auto resource = std::make_unique(); - rmm::mr::set_default_resource(resource.get()); - int rc = RUN_ALL_TESTS(); - return rc; -} +CUGRAPH_TEST_PROGRAM_MAIN() diff --git a/cpp/tests/traversal/sssp_test.cu b/cpp/tests/traversal/sssp_test.cu index 0c27674f94a..ea56d1d79cb 100644 --- a/cpp/tests/traversal/sssp_test.cu +++ b/cpp/tests/traversal/sssp_test.cu @@ -9,21 +9,20 @@ * */ -#include -#include +#include 
+#include +#include + +#include +#include +#include + #include + #include #include #include #include -#include "high_res_clock.h" -#include "test_utils.h" - -#include - -#include -#include "algorithms.hpp" -#include "graph.hpp" typedef enum graph_type { RMAT, MTX } GraphType; @@ -128,7 +127,7 @@ typedef struct SSSP_Usecase_t { // assume relative paths are relative to RAPIDS_DATASET_ROOT_DIR // FIXME: Use platform independent stuff from c++14/17 on compiler update if (type_ == MTX) { - const std::string& rapidsDatasetRootDir = get_rapids_dataset_root_dir(); + const std::string& rapidsDatasetRootDir = cugraph::test::get_rapids_dataset_root_dir(); if ((config_ != "") && (config_[0] != '/')) { file_path_ = rapidsDatasetRootDir + "/" + config_; } else { @@ -203,7 +202,7 @@ class Tests_SSSP : public ::testing::TestWithParam { ASSERT_NE(fpin, static_cast(nullptr)) << "fopen (" << param.file_path_ << ") failure."; // mm_properties has only one template param which should be fixed there - ASSERT_EQ(mm_properties(fpin, 1, &mc, &m, &k, &nnz), 0) + ASSERT_EQ(cugraph::test::mm_properties(fpin, 1, &mc, &m, &k, &nnz), 0) << "could not read Matrix Market file properties" << "\n"; ASSERT_TRUE(mm_is_matrix(mc)); @@ -218,24 +217,24 @@ class Tests_SSSP : public ::testing::TestWithParam { // Read weights if given if (!mm_is_pattern(mc)) { cooVal.resize(nnz); - ASSERT_EQ((mm_to_coo(fpin, - 1, - nnz, - &cooRowInd[0], - &cooColInd[0], - &cooVal[0], - static_cast(nullptr))), + ASSERT_EQ((cugraph::test::mm_to_coo(fpin, + 1, + nnz, + &cooRowInd[0], + &cooColInd[0], + &cooVal[0], + static_cast(nullptr))), 0) << "could not read matrix data" << "\n"; } else { - ASSERT_EQ((mm_to_coo(fpin, - 1, - nnz, - &cooRowInd[0], - &cooColInd[0], - static_cast(nullptr), - static_cast(nullptr))), + ASSERT_EQ((cugraph::test::mm_to_coo(fpin, + 1, + nnz, + &cooRowInd[0], + &cooColInd[0], + static_cast(nullptr), + static_cast(nullptr))), 0) << "could not read matrix data" << "\n"; @@ -256,14 +255,14 @@ class Tests_SSSP : public ::testing::TestWithParam { ASSERT_TRUE(0); } - cugraph::experimental::GraphCOOView G_coo( + cugraph::GraphCOOView G_coo( &cooRowInd[0], &cooColInd[0], (DoRandomWeights ? &cooVal[0] : nullptr), num_vertices, num_edges); - auto G_unique = cugraph::coo_to_csr(G_coo); - cugraph::experimental::GraphCSRView G = G_unique->view(); + auto G_unique = cugraph::coo_to_csr(G_coo); + cugraph::GraphCSRView G = G_unique->view(); cudaDeviceSynchronize(); std::vector dist_vec; @@ -432,11 +431,4 @@ INSTANTIATE_TEST_CASE_P(simple_test, SSSP_Usecase(MTX, "test/datasets/wiki2003.mtx", 100000), SSSP_Usecase(MTX, "test/datasets/karate.mtx", 1))); -int main(int argc, char** argv) -{ - testing::InitGoogleTest(&argc, argv); - auto resource = std::make_unique(); - rmm::mr::set_default_resource(resource.get()); - int rc = RUN_ALL_TESTS(); - return rc; -} +CUGRAPH_TEST_PROGRAM_MAIN() diff --git a/cpp/tests/utilities/base_fixture.hpp b/cpp/tests/utilities/base_fixture.hpp new file mode 100644 index 00000000000..535b4b9c79e --- /dev/null +++ b/cpp/tests/utilities/base_fixture.hpp @@ -0,0 +1,142 @@ +/* + * Copyright (c) 2020, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+#include <utilities/cxxopts.hpp>
+#include <utilities/error.hpp>
+
+#include <gtest/gtest.h>
+
+#include <rmm/mr/device/binning_memory_resource.hpp>
+#include <rmm/mr/device/cuda_memory_resource.hpp>
+#include <rmm/mr/device/device_memory_resource.hpp>
+#include <rmm/mr/device/managed_memory_resource.hpp>
+#include <rmm/mr/device/owning_wrapper.hpp>
+#include <rmm/mr/device/per_device_resource.hpp>
+#include <rmm/mr/device/pool_memory_resource.hpp>
+
+namespace cugraph {
+namespace test {
+
+/**
+ * @brief Base test fixture class from which all cuGraph tests should inherit.
+ *
+ * Example:
+ * ```
+ * class MyTestFixture : public cugraph::test::BaseFixture {};
+ * ```
+ **/
+class BaseFixture : public ::testing::Test {
+  rmm::mr::device_memory_resource *_mr{rmm::mr::get_current_device_resource()};
+
+ public:
+  /**
+   * @brief Returns pointer to `device_memory_resource` that should be used for
+   * all tests inheriting from this fixture
+   **/
+  rmm::mr::device_memory_resource *mr() { return _mr; }
+};
+
+/// MR factory functions
+inline auto make_cuda() { return std::make_shared<rmm::mr::cuda_memory_resource>(); }
+
+inline auto make_managed() { return std::make_shared<rmm::mr::managed_memory_resource>(); }
+
+inline auto make_pool()
+{
+  return rmm::mr::make_owning_wrapper<rmm::mr::pool_memory_resource>(make_cuda());
+}
+
+inline auto make_binning()
+{
+  auto pool = make_pool();
+  // Add a fixed_size_memory_resource for bins of size 256, 512, 1024, 2048 and 4096KiB
+  // Larger allocations will use the pool resource
+  auto mr = rmm::mr::make_owning_wrapper<rmm::mr::binning_memory_resource>(pool, 18, 22);
+  return mr;
+}
+
+/**
+ * @brief Creates a memory resource for the unit test environment
+ * given the name of the allocation mode.
+ *
+ * The returned resource instance must be kept alive for the duration of
+ * the tests. Attaching the resource to a TestEnvironment causes
+ * issues since the environment objects are not destroyed until
+ * after the runtime is shutdown.
+ *
+ * @throw cugraph::logic_error if the `allocation_mode` is unsupported.
+ *
+ * @param allocation_mode String identifies which resource type.
+ * Accepted types are "pool", "cuda", "binning", and "managed" only.
+ * @return Memory resource instance
+ */
+inline std::shared_ptr<rmm::mr::device_memory_resource> create_memory_resource(
+  std::string const &allocation_mode)
+{
+  if (allocation_mode == "binning") return make_binning();
+  if (allocation_mode == "cuda") return make_cuda();
+  if (allocation_mode == "pool") return make_pool();
+  if (allocation_mode == "managed") return make_managed();
+  CUGRAPH_FAIL("Invalid RMM allocation mode");
+}
+
+}  // namespace test
+}  // namespace cugraph
+
+/**
+ * @brief Parses the cuGraph test command line options.
+ *
+ * Currently only supports the 'rmm_mode' string parameter, which sets the RMM
+ * allocation mode. The default value of the parameter is 'pool'.
+ *
+ * @return Parsing results in the form of cxxopts::ParseResult
+ */
+inline auto parse_test_options(int argc, char **argv)
+{
+  try {
+    cxxopts::Options options(argv[0], " - cuGraph tests command line options");
+    options.allow_unrecognised_options().add_options()(
+      "rmm_mode", "RMM allocation mode", cxxopts::value<std::string>()->default_value("pool"));
+
+    return options.parse(argc, argv);
+  } catch (const cxxopts::OptionException &e) {
+    CUGRAPH_FAIL("Error parsing command line options");
+  }
+}
+
+/**
+ * @brief Macro that defines main function for gtest programs that use rmm
+ *
+ * Should be included in every test program that uses rmm allocators since
+ * it maintains the lifespan of the rmm default memory resource.
+ * This `main` function is a wrapper around the google test generated `main`,
+ * maintaining the original functionality. In addition, this custom `main`
+ * function parses the command line to customize test behavior, like the
+ * allocation mode used for creating the default memory resource.
+ *
+ */
+#define CUGRAPH_TEST_PROGRAM_MAIN()                                   \
+  int main(int argc, char **argv)                                     \
+  {                                                                   \
+    ::testing::InitGoogleTest(&argc, argv);                           \
+    auto const cmd_opts = parse_test_options(argc, argv);             \
+    auto const rmm_mode = cmd_opts["rmm_mode"].as<std::string>();     \
+    auto resource = cugraph::test::create_memory_resource(rmm_mode);  \
+    rmm::mr::set_current_device_resource(resource.get());             \
+    return RUN_ALL_TESTS();                                           \
+  }
diff --git a/cpp/tests/utilities/cxxopts.hpp b/cpp/tests/utilities/cxxopts.hpp
new file mode 100644
index 00000000000..9a0b6e500d6
--- /dev/null
+++ b/cpp/tests/utilities/cxxopts.hpp
@@ -0,0 +1,1497 @@
+/*
+Copyright (c) 2014, 2015, 2016, 2017 Jarryd Beck
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
+*/
+
+#ifndef CXXOPTS_HPP_INCLUDED
+#define CXXOPTS_HPP_INCLUDED
+
+#include <cctype>
+#include <cstring>
+#include <exception>
+#include <iostream>
+#include <limits>
+#include <map>
+#include <memory>
+#include <regex>
+#include <sstream>
+#include <string>
+#include <unordered_map>
+#include <unordered_set>
+#include <vector>
+
+#ifndef CXXOPTS_VECTOR_DELIMITER
+#define CXXOPTS_VECTOR_DELIMITER ','
+#endif
+
+#define CXXOPTS__VERSION_MAJOR 2
+#define CXXOPTS__VERSION_MINOR 2
+#define CXXOPTS__VERSION_PATCH 0
+
+namespace cxxopts {
+static constexpr struct {
+  uint8_t major, minor, patch;
+} version = {CXXOPTS__VERSION_MAJOR, CXXOPTS__VERSION_MINOR, CXXOPTS__VERSION_PATCH};
+}  // namespace cxxopts
+
+// when we ask cxxopts to use Unicode, help strings are processed using ICU,
+// which results in the correct lengths being computed for strings when they
+// are formatted for the help output
+// it is necessary to make sure that <unicode/unistr.h> can be found by the
+// compiler, and that icu-uc is linked in to the binary.
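For reference, a test translation unit built on the fixture and macro above would look roughly like the sketch below. This is a minimal illustration, not part of the diff: the include path, fixture name, and test name are invented; only `cugraph::test::BaseFixture` and `CUGRAPH_TEST_PROGRAM_MAIN()` come from the header added here.

    // Illustrative sketch only -- names and paths are hypothetical.
    #include <utilities/base_fixture.hpp>  // assumed include path

    class ExampleFixture : public cugraph::test::BaseFixture {
    };

    TEST_F(ExampleFixture, UsesConfiguredResource)
    {
      // mr() returns the resource installed by the generated main()
      EXPECT_NE(this->mr(), nullptr);
    }

    // Expands to a main() that installs the RMM resource selected by
    // --rmm_mode ("pool" by default) before running the tests, e.g.:
    //   ./example_test --rmm_mode=managed
    CUGRAPH_TEST_PROGRAM_MAIN()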
+ +#ifdef CXXOPTS_USE_UNICODE +#include + +namespace cxxopts { +typedef icu::UnicodeString String; + +inline String toLocalString(std::string s) { return icu::UnicodeString::fromUTF8(std::move(s)); } + +class UnicodeStringIterator : public std::iterator { + public: + UnicodeStringIterator(const icu::UnicodeString* string, int32_t pos) : s(string), i(pos) {} + + value_type operator*() const { return s->char32At(i); } + + bool operator==(const UnicodeStringIterator& rhs) const { return s == rhs.s && i == rhs.i; } + + bool operator!=(const UnicodeStringIterator& rhs) const { return !(*this == rhs); } + + UnicodeStringIterator& operator++() + { + ++i; + return *this; + } + + UnicodeStringIterator operator+(int32_t v) { return UnicodeStringIterator(s, i + v); } + + private: + const icu::UnicodeString* s; + int32_t i; +}; + +inline String& stringAppend(String& s, String a) { return s.append(std::move(a)); } + +inline String& stringAppend(String& s, int n, UChar32 c) +{ + for (int i = 0; i != n; ++i) { s.append(c); } + + return s; +} + +template +String& stringAppend(String& s, Iterator begin, Iterator end) +{ + while (begin != end) { + s.append(*begin); + ++begin; + } + + return s; +} + +inline size_t stringLength(const String& s) { return s.length(); } + +inline std::string toUTF8String(const String& s) +{ + std::string result; + s.toUTF8String(result); + + return result; +} + +inline bool empty(const String& s) { return s.isEmpty(); } +} // namespace cxxopts + +namespace std { +inline cxxopts::UnicodeStringIterator begin(const icu::UnicodeString& s) +{ + return cxxopts::UnicodeStringIterator(&s, 0); +} + +inline cxxopts::UnicodeStringIterator end(const icu::UnicodeString& s) +{ + return cxxopts::UnicodeStringIterator(&s, s.length()); +} +} // namespace std + +// ifdef CXXOPTS_USE_UNICODE +#else + +namespace cxxopts { +typedef std::string String; + +template +T toLocalString(T&& t) +{ + return std::forward(t); +} + +inline size_t stringLength(const String& s) { return s.length(); } + +inline String& stringAppend(String& s, String a) { return s.append(std::move(a)); } + +inline String& stringAppend(String& s, size_t n, char c) { return s.append(n, c); } + +template +String& stringAppend(String& s, Iterator begin, Iterator end) +{ + return s.append(begin, end); +} + +template +std::string toUTF8String(T&& t) +{ + return std::forward(t); +} + +inline bool empty(const std::string& s) { return s.empty(); } +} // namespace cxxopts + +// ifdef CXXOPTS_USE_UNICODE +#endif + +namespace cxxopts { +namespace { +#ifdef _WIN32 +const std::string LQUOTE("\'"); +const std::string RQUOTE("\'"); +#else +const std::string LQUOTE("‘"); +const std::string RQUOTE("’"); +#endif +} // namespace + +class Value : public std::enable_shared_from_this { + public: + virtual ~Value() = default; + + virtual std::shared_ptr clone() const = 0; + + virtual void parse(const std::string& text) const = 0; + + virtual void parse() const = 0; + + virtual bool has_default() const = 0; + + virtual bool is_container() const = 0; + + virtual bool has_implicit() const = 0; + + virtual std::string get_default_value() const = 0; + + virtual std::string get_implicit_value() const = 0; + + virtual std::shared_ptr default_value(const std::string& value) = 0; + + virtual std::shared_ptr implicit_value(const std::string& value) = 0; + + virtual std::shared_ptr no_implicit_value() = 0; + + virtual bool is_boolean() const = 0; +}; + +class OptionException : public std::exception { + public: + OptionException(const std::string& message) : 
m_message(message) {} + + virtual const char* what() const noexcept { return m_message.c_str(); } + + private: + std::string m_message; +}; + +class OptionSpecException : public OptionException { + public: + OptionSpecException(const std::string& message) : OptionException(message) {} +}; + +class OptionParseException : public OptionException { + public: + OptionParseException(const std::string& message) : OptionException(message) {} +}; + +class option_exists_error : public OptionSpecException { + public: + option_exists_error(const std::string& option) + : OptionSpecException("Option " + LQUOTE + option + RQUOTE + " already exists") + { + } +}; + +class invalid_option_format_error : public OptionSpecException { + public: + invalid_option_format_error(const std::string& format) + : OptionSpecException("Invalid option format " + LQUOTE + format + RQUOTE) + { + } +}; + +class option_syntax_exception : public OptionParseException { + public: + option_syntax_exception(const std::string& text) + : OptionParseException("Argument " + LQUOTE + text + RQUOTE + + " starts with a - but has incorrect syntax") + { + } +}; + +class option_not_exists_exception : public OptionParseException { + public: + option_not_exists_exception(const std::string& option) + : OptionParseException("Option " + LQUOTE + option + RQUOTE + " does not exist") + { + } +}; + +class missing_argument_exception : public OptionParseException { + public: + missing_argument_exception(const std::string& option) + : OptionParseException("Option " + LQUOTE + option + RQUOTE + " is missing an argument") + { + } +}; + +class option_requires_argument_exception : public OptionParseException { + public: + option_requires_argument_exception(const std::string& option) + : OptionParseException("Option " + LQUOTE + option + RQUOTE + " requires an argument") + { + } +}; + +class option_not_has_argument_exception : public OptionParseException { + public: + option_not_has_argument_exception(const std::string& option, const std::string& arg) + : OptionParseException("Option " + LQUOTE + option + RQUOTE + + " does not take an argument, but argument " + LQUOTE + arg + RQUOTE + + " given") + { + } +}; + +class option_not_present_exception : public OptionParseException { + public: + option_not_present_exception(const std::string& option) + : OptionParseException("Option " + LQUOTE + option + RQUOTE + " not present") + { + } +}; + +class argument_incorrect_type : public OptionParseException { + public: + argument_incorrect_type(const std::string& arg) + : OptionParseException("Argument " + LQUOTE + arg + RQUOTE + " failed to parse") + { + } +}; + +class option_required_exception : public OptionParseException { + public: + option_required_exception(const std::string& option) + : OptionParseException("Option " + LQUOTE + option + RQUOTE + " is required but not present") + { + } +}; + +template +void throw_or_mimic(const std::string& text) +{ + static_assert(std::is_base_of::value, + "throw_or_mimic only works on std::exception and " + "deriving classes"); + +#ifndef CXXOPTS_NO_EXCEPTIONS + // If CXXOPTS_NO_EXCEPTIONS is not defined, just throw + throw T{text}; +#else + // Otherwise manually instantiate the exception, print what() to stderr, + // and abort + T exception{text}; + std::cerr << exception.what() << std::endl; + std::cerr << "Aborting (exceptions disabled)..." 
<< std::endl;
+  std::abort();
+#endif
+}
+
+namespace values {
+namespace {
+std::basic_regex<char> integer_pattern("(-)?(0x)?([0-9a-zA-Z]+)|((0x)?0)");
+std::basic_regex<char> truthy_pattern("(t|T)(rue)?|1");
+std::basic_regex<char> falsy_pattern("(f|F)(alse)?|0");
+}  // namespace
+
+namespace detail {
+template <typename T, bool B>
+struct SignedCheck;
+
+template <typename T>
+struct SignedCheck<T, true> {
+  template <typename U>
+  void operator()(bool negative, U u, const std::string& text)
+  {
+    if (negative) {
+      if (u > static_cast<U>((std::numeric_limits<T>::min)())) {
+        throw_or_mimic<argument_incorrect_type>(text);
+      }
+    } else {
+      if (u > static_cast<U>((std::numeric_limits<T>::max)())) {
+        throw_or_mimic<argument_incorrect_type>(text);
+      }
+    }
+  }
+};
+
+template <typename T>
+struct SignedCheck<T, false> {
+  template <typename U>
+  void operator()(bool, U, const std::string&)
+  {
+  }
+};
+
+template <typename T, typename U>
+void check_signed_range(bool negative, U value, const std::string& text)
+{
+  SignedCheck<T, std::numeric_limits<T>::is_signed>()(negative, value, text);
+}
+}  // namespace detail
+
+template <typename R, typename T>
+R checked_negate(T&& t, const std::string&, std::true_type)
+{
+  // if we got to here, then `t` is a positive number that fits into
+  // `R`. So to avoid MSVC C4146, we first cast it to `R`.
+  // See https://github.com/jarro2783/cxxopts/issues/62 for more details.
+  return static_cast<R>(-static_cast<R>(t - 1) - 1);
+}
+
+template <typename R, typename T>
+T checked_negate(T&& t, const std::string& text, std::false_type)
+{
+  throw_or_mimic<argument_incorrect_type>(text);
+  return t;
+}
+
+template <typename T>
+void integer_parser(const std::string& text, T& value)
+{
+  std::smatch match;
+  std::regex_match(text, match, integer_pattern);
+
+  if (match.length() == 0) { throw_or_mimic<argument_incorrect_type>(text); }
+
+  if (match.length(4) > 0) {
+    value = 0;
+    return;
+  }
+
+  using US = typename std::make_unsigned<T>::type;
+
+  constexpr bool is_signed = std::numeric_limits<T>::is_signed;
+  const bool negative      = match.length(1) > 0;
+  const uint8_t base       = match.length(2) > 0 ? 16 : 10;
+
+  auto value_match = match[3];
+
+  US result = 0;
+
+  for (auto iter = value_match.first; iter != value_match.second; ++iter) {
+    US digit = 0;
+
+    if (*iter >= '0' && *iter <= '9') {
+      digit = static_cast<US>(*iter - '0');
+    } else if (base == 16 && *iter >= 'a' && *iter <= 'f') {
+      digit = static_cast<US>(*iter - 'a' + 10);
+    } else if (base == 16 && *iter >= 'A' && *iter <= 'F') {
+      digit = static_cast<US>(*iter - 'A' + 10);
+    } else {
+      throw_or_mimic<argument_incorrect_type>(text);
+    }
+
+    const US next = static_cast<US>(result * base + digit);
+    if (result > next) { throw_or_mimic<argument_incorrect_type>(text); }
+
+    result = next;
+  }
+
+  detail::check_signed_range<T>(negative, result, text);
+
+  if (negative) {
+    value = checked_negate<T>(result, text, std::integral_constant<bool, is_signed>());
+  } else {
+    value = static_cast<T>(result);
+  }
+}
+
+template <typename T>
+void stringstream_parser(const std::string& text, T& value)
+{
+  std::stringstream in(text);
+  in >> value;
+  if (!in) { throw_or_mimic<argument_incorrect_type>(text); }
+}
+
+inline void parse_value(const std::string& text, uint8_t& value) { integer_parser(text, value); }
+
+inline void parse_value(const std::string& text, int8_t& value) { integer_parser(text, value); }
+
+inline void parse_value(const std::string& text, uint16_t& value) { integer_parser(text, value); }
+
+inline void parse_value(const std::string& text, int16_t& value) { integer_parser(text, value); }
+
+inline void parse_value(const std::string& text, uint32_t& value) { integer_parser(text, value); }
+
+inline void parse_value(const std::string& text, int32_t& value) { integer_parser(text, value); }
+
+inline void parse_value(const std::string& text, uint64_t& value) { integer_parser(text, value); }
+
+inline void parse_value(const std::string& text, int64_t& value) { integer_parser(text, value); }
+
+inline void parse_value(const std::string& text, bool& value)
+{
+  std::smatch result;
+  std::regex_match(text, result, truthy_pattern);
+
+  if (!result.empty()) {
+    value = true;
+    return;
+  }
+
+  std::regex_match(text, result, falsy_pattern);
+  if (!result.empty()) {
+    value = false;
+    return;
+  }
+
+  throw_or_mimic<argument_incorrect_type>(text);
+}
+
+inline void parse_value(const std::string& text, std::string& value) { value = text; }
+
+// The fallback parser. It uses the stringstream parser to parse all types
+// that have not been overloaded explicitly. It has to be placed in the
+// source code before all other more specialized templates.
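The fallback just described is what makes cxxopts extensible: any type without a dedicated `parse_value` overload is routed through `stringstream_parser`, so a user-defined type becomes a valid option type simply by providing `operator>>`. A minimal sketch, with an invented `Celsius` type:

    // Illustrative sketch only -- Celsius is a hypothetical user type.
    #include <istream>

    struct Celsius {
      double degrees;
    };

    std::istream& operator>>(std::istream& in, Celsius& c)
    {
      return in >> c.degrees;  // what stringstream_parser ultimately invokes
    }

    // With this in place, cxxopts::value<Celsius>() can back an option:
    // "--temp=21.5" parses through the generic template that follows, and a
    // malformed value raises argument_incorrect_type via throw_or_mimic.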
+template +void parse_value(const std::string& text, T& value) +{ + stringstream_parser(text, value); +} + +template +void parse_value(const std::string& text, std::vector& value) +{ + std::stringstream in(text); + std::string token; + while (in.eof() == false && std::getline(in, token, CXXOPTS_VECTOR_DELIMITER)) { + T v; + parse_value(token, v); + value.emplace_back(std::move(v)); + } +} + +inline void parse_value(const std::string& text, char& c) +{ + if (text.length() != 1) { throw_or_mimic(text); } + + c = text[0]; +} + +template +struct type_is_container { + static constexpr bool value = false; +}; + +template +struct type_is_container> { + static constexpr bool value = true; +}; + +template +class abstract_value : public Value { + using Self = abstract_value; + + public: + abstract_value() : m_result(std::make_shared()), m_store(m_result.get()) {} + + abstract_value(T* t) : m_store(t) {} + + virtual ~abstract_value() = default; + + abstract_value(const abstract_value& rhs) + { + if (rhs.m_result) { + m_result = std::make_shared(); + m_store = m_result.get(); + } else { + m_store = rhs.m_store; + } + + m_default = rhs.m_default; + m_implicit = rhs.m_implicit; + m_default_value = rhs.m_default_value; + m_implicit_value = rhs.m_implicit_value; + } + + void parse(const std::string& text) const { parse_value(text, *m_store); } + + bool is_container() const { return type_is_container::value; } + + void parse() const { parse_value(m_default_value, *m_store); } + + bool has_default() const { return m_default; } + + bool has_implicit() const { return m_implicit; } + + std::shared_ptr default_value(const std::string& value) + { + m_default = true; + m_default_value = value; + return shared_from_this(); + } + + std::shared_ptr implicit_value(const std::string& value) + { + m_implicit = true; + m_implicit_value = value; + return shared_from_this(); + } + + std::shared_ptr no_implicit_value() + { + m_implicit = false; + return shared_from_this(); + } + + std::string get_default_value() const { return m_default_value; } + + std::string get_implicit_value() const { return m_implicit_value; } + + bool is_boolean() const { return std::is_same::value; } + + const T& get() const + { + if (m_store == nullptr) { + return *m_result; + } else { + return *m_store; + } + } + + protected: + std::shared_ptr m_result; + T* m_store; + + bool m_default = false; + bool m_implicit = false; + + std::string m_default_value; + std::string m_implicit_value; +}; + +template +class standard_value : public abstract_value { + public: + using abstract_value::abstract_value; + + std::shared_ptr clone() const { return std::make_shared>(*this); } +}; + +template <> +class standard_value : public abstract_value { + public: + ~standard_value() = default; + + standard_value() { set_default_and_implicit(); } + + standard_value(bool* b) : abstract_value(b) { set_default_and_implicit(); } + + std::shared_ptr clone() const { return std::make_shared>(*this); } + + private: + void set_default_and_implicit() + { + m_default = true; + m_default_value = "false"; + m_implicit = true; + m_implicit_value = "true"; + } +}; +} // namespace values + +template +std::shared_ptr value() +{ + return std::make_shared>(); +} + +template +std::shared_ptr value(T& t) +{ + return std::make_shared>(&t); +} + +class OptionAdder; + +class OptionDetails { + public: + OptionDetails(const std::string& short_, + const std::string& long_, + const String& desc, + std::shared_ptr val) + : m_short(short_), m_long(long_), m_desc(desc), m_value(val), m_count(0) + { 
+ } + + OptionDetails(const OptionDetails& rhs) : m_desc(rhs.m_desc), m_count(rhs.m_count) + { + m_value = rhs.m_value->clone(); + } + + OptionDetails(OptionDetails&& rhs) = default; + + const String& description() const { return m_desc; } + + const Value& value() const { return *m_value; } + + std::shared_ptr make_storage() const { return m_value->clone(); } + + const std::string& short_name() const { return m_short; } + + const std::string& long_name() const { return m_long; } + + private: + std::string m_short; + std::string m_long; + String m_desc; + std::shared_ptr m_value; + int m_count; +}; + +struct HelpOptionDetails { + std::string s; + std::string l; + String desc; + bool has_default; + std::string default_value; + bool has_implicit; + std::string implicit_value; + std::string arg_help; + bool is_container; + bool is_boolean; +}; + +struct HelpGroupDetails { + std::string name; + std::string description; + std::vector options; +}; + +class OptionValue { + public: + void parse(std::shared_ptr details, const std::string& text) + { + ensure_value(details); + ++m_count; + m_value->parse(text); + } + + void parse_default(std::shared_ptr details) + { + ensure_value(details); + m_default = true; + m_value->parse(); + } + + size_t count() const noexcept { return m_count; } + + // TODO: maybe default options should count towards the number of arguments + bool has_default() const noexcept { return m_default; } + + template + const T& as() const + { + if (m_value == nullptr) { throw_or_mimic("No value"); } + +#ifdef CXXOPTS_NO_RTTI + return static_cast&>(*m_value).get(); +#else + return dynamic_cast&>(*m_value).get(); +#endif + } + + private: + void ensure_value(std::shared_ptr details) + { + if (m_value == nullptr) { m_value = details->make_storage(); } + } + + std::shared_ptr m_value; + size_t m_count = 0; + bool m_default = false; +}; + +class KeyValue { + public: + KeyValue(std::string key_, std::string value_) + : m_key(std::move(key_)), m_value(std::move(value_)) + { + } + + const std::string& key() const { return m_key; } + + const std::string& value() const { return m_value; } + + template + T as() const + { + T result; + values::parse_value(m_value, result); + return result; + } + + private: + std::string m_key; + std::string m_value; +}; + +class ParseResult { + public: + ParseResult( + const std::shared_ptr>>, + std::vector, + bool allow_unrecognised, + int&, + char**&); + + size_t count(const std::string& o) const + { + auto iter = m_options->find(o); + if (iter == m_options->end()) { return 0; } + + auto riter = m_results.find(iter->second); + + return riter->second.count(); + } + + const OptionValue& operator[](const std::string& option) const + { + auto iter = m_options->find(option); + + if (iter == m_options->end()) { throw_or_mimic(option); } + + auto riter = m_results.find(iter->second); + + return riter->second; + } + + const std::vector& arguments() const { return m_sequential; } + + private: + void parse(int& argc, char**& argv); + + void add_to_option(const std::string& option, const std::string& arg); + + bool consume_positional(std::string a); + + void parse_option(std::shared_ptr value, + const std::string& name, + const std::string& arg = ""); + + void parse_default(std::shared_ptr details); + + void checked_parse_arg(int argc, + char* argv[], + int& current, + std::shared_ptr value, + const std::string& name); + + const std::shared_ptr>> m_options; + std::vector m_positional; + std::vector::iterator m_next_positional; + std::unordered_set m_positional_set; + 
std::unordered_map, OptionValue> m_results; + + bool m_allow_unrecognised; + + std::vector m_sequential; +}; + +struct Option { + Option(const std::string& opts, + const std::string& desc, + const std::shared_ptr& value = ::cxxopts::value(), + const std::string& arg_help = "") + : opts_(opts), desc_(desc), value_(value), arg_help_(arg_help) + { + } + + std::string opts_; + std::string desc_; + std::shared_ptr value_; + std::string arg_help_; +}; + +class Options { + typedef std::unordered_map> OptionMap; + + public: + Options(std::string program, std::string help_string = "") + : m_program(std::move(program)), + m_help_string(toLocalString(std::move(help_string))), + m_custom_help("[OPTION...]"), + m_positional_help("positional parameters"), + m_show_positional(false), + m_allow_unrecognised(false), + m_options(std::make_shared()), + m_next_positional(m_positional.end()) + { + } + + Options& positional_help(std::string help_text) + { + m_positional_help = std::move(help_text); + return *this; + } + + Options& custom_help(std::string help_text) + { + m_custom_help = std::move(help_text); + return *this; + } + + Options& show_positional_help() + { + m_show_positional = true; + return *this; + } + + Options& allow_unrecognised_options() + { + m_allow_unrecognised = true; + return *this; + } + + ParseResult parse(int& argc, char**& argv); + + OptionAdder add_options(std::string group = ""); + + void add_options(const std::string& group, std::initializer_list
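The `Options`, `OptionAdder`, and `ParseResult` pieces above are exactly what `parse_test_options` in the new base_fixture.hpp drives. For orientation, a self-contained sketch of that API surface (the program name and the `verbose` option are invented for illustration; `rmm_mode` mirrors the test-fixture usage):

    // Illustrative sketch only -- not part of the diff.
    #include <iostream>
    #include <string>
    // plus the cxxopts.hpp header vendored by this change

    int main(int argc, char** argv)
    {
      cxxopts::Options options("demo", " - cxxopts usage sketch");
      options.allow_unrecognised_options().add_options()(
        "rmm_mode", "RMM allocation mode",
        cxxopts::value<std::string>()->default_value("pool"))(
        "v,verbose", "enable verbose output", cxxopts::value<bool>());

      auto result = options.parse(argc, argv);

      // count() and operator[] come from ParseResult above; the default
      // value is applied even when the option is absent on the command line
      if (result.count("verbose")) { std::cout << "verbose on\n"; }
      std::cout << "rmm_mode = " << result["rmm_mode"].as<std::string>() << "\n";
      return 0;
    }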