-
Notifications
You must be signed in to change notification settings - Fork 311
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add cugraph+UCX build instructions (#3088)
This PR adds the instructions to build cugraph+UCX for benchmarking . The current conda build does not support IB as well has challenges using NVLINK . This PR adds the dockefile as well as build instructions and test to streamline benchmarking effort. The expected speedup on sampling is around `8x` by using the UCX souce build and a `20x` better bandwidth with compute. See [README](https://github.com/rapidsai/cugraph/blob/d513a86017fac6ba4742019eae47f7ffe98186e0/benchmarks/shared/build_cugraph_ucx/README.MD) for instructions . ## Benchmarks ### `dask.compute` bandwidth **With the standard conda build :** ```python3 python3 test_client_bandwidth.py Getting 1,000,000 rows of size 0.01 took = 168.32900047302246 ms Bandwidth = 0.0594 gb/s Getting 2,000,000 rows of size 0.02 took = 260.2977991104126 ms Bandwidth = 0.0768 gb/s Getting 4,000,000 rows of size 0.04 took = 288.5307312011719 ms Bandwidth = 0.1386 gb/s Getting 25,000,000 rows of size 0.28 took = 1051.6365766525269 ms Bandwidth = 0.2663 gb/s ----------------------------------------Completed Test---------------------------------------- ``` **With this **PRs** Build** : ```python3 python3 test_client_bandwidth.py Getting 1,000,000 rows of size 0.01 took = 14.061594009399414 ms Bandwidth = 0.7112 gb/s Getting 2,000,000 rows of size 0.02 took = 19.601154327392578 ms Bandwidth = 1.0203 gb/s Getting 4,000,000 rows of size 0.04 took = 15.92099666595459 ms Bandwidth = 2.5124 gb/s Getting 25,000,000 rows of size 0.28 took = 52.902913093566895 ms Bandwidth = 5.2927 gb/s ----------------------------------------Completed Test---------------------------------------- ``` ### `cugraph sampling` benchmark With the standard conda build : ```python3 python3 test_cugraph_sampling.py Sampling 1,000 took = 78.96044254302979 ms Sampling 10,000 took = 189.63768482208252 ms Sampling 100,000 took = 1084.9100351333618 ms ----------------------------------------Completed Test---------------------------------------- ``` **With this **PRs** Build** : ```python3 python3 test_cugraph_sampling.py Sampling 1,000 took = 73.62375259399414 ms Sampling 10,000 took = 87.1060848236084 ms Sampling 100,000 took = 134.7532033920288 ms ----------------------------------------Completed Test---------------------------------------- ``` Authors: - Vibhu Jawa (https://github.com/VibhuJawa) - Alex Barghi (https://github.com/alexbarghi-nv) Approvers: - Brad Rees (https://github.com/BradReesWork) - Alex Barghi (https://github.com/alexbarghi-nv) - Rick Ratzel (https://github.com/rlratzel) URL: #3088
- Loading branch information
Showing
5 changed files
with
336 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
# Cugraph with UCX | ||
## Docker Build Instructions | ||
docker build -f cugraph_ucx.dockerfile . -t cugraph_ucx | ||
|
||
## Docker Run Instructions | ||
docker run --privileged -it --gpus=all --net=host cugraph_ucx /bin/bash | ||
|
||
#### Client Bandwidth Test | ||
python3 test_client_bandwidth.py | ||
|
||
```bash | ||
(base) root@exp02:/home# python3 test_client_bandwidth.py | ||
2022-12-19 13:31:30,867 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize | ||
2022-12-19 13:31:30,867 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize | ||
2022-12-19 13:31:30,891 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize | ||
2022-12-19 13:31:30,891 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize | ||
2022-12-19 13:31:30,918 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize | ||
2022-12-19 13:31:30,918 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize | ||
Getting 1,000,000 rows of size 0.01 took = 20.995140075683594 ms | ||
Bandwidth = 0.4763 gb/s | ||
Getting 2,000,000 rows of size 0.02 took = 18.51017475128174 ms | ||
Bandwidth = 1.0805 gb/s | ||
Getting 4,000,000 rows of size 0.04 took = 25.659966468811035 ms | ||
Bandwidth = 1.5588 gb/s | ||
Getting 25,000,000 rows of size 0.28 took = 53.80809307098389 ms | ||
Bandwidth = 5.2037 gb/s | ||
----------------------------------------Completed Test---------------------------------------- | ||
``` | ||
|
||
#### Sampling Test | ||
python3 test_cugraph_sampling.py | ||
```bash | ||
test_client_bandwidth.py test_cugraph_sampling.py | ||
(base) root@exp02:/home# python3 test_cugraph_sampling.py | ||
[1671456769.722931] [exp02:93 :0] parser.c:1989 UCX WARN unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?) | ||
[1671456769.722931] [exp02:93 :0] parser.c:1989 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning) | ||
2022-12-19 13:32:56,228 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize | ||
2022-12-19 13:32:56,228 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize | ||
2022-12-19 13:32:57,485 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize | ||
2022-12-19 13:32:57,485 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize | ||
2022-12-19 13:32:57,655 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize | ||
2022-12-19 13:32:57,655 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize | ||
2022-12-19 13:32:57,878 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize | ||
2022-12-19 13:32:57,879 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize | ||
2022-12-19 13:32:57,957 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize | ||
2022-12-19 13:32:57,957 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize | ||
2022-12-19 13:32:57,963 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize | ||
2022-12-19 13:32:57,963 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize | ||
2022-12-19 13:32:57,973 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize | ||
2022-12-19 13:32:57,973 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize | ||
2022-12-19 13:32:58,011 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize | ||
2022-12-19 13:32:58,012 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize | ||
Sampling 1,000 took = 69.15879249572754 ms | ||
Sampling 10,000 took = 89.63620662689209 ms | ||
Sampling 100,000 took = 135.9888792037964 ms | ||
----------------------------------------Completed Test---------------------------------------- | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
#!/bin/bash | ||
# Copyright (c) 2023, NVIDIA CORPORATION. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
set -ex | ||
|
||
UCX_VERSION_TAG=${1:-"v1.14.x"} | ||
CUDA_HOME=${2:-"/usr/local/cuda"} | ||
# Send any remaining arguments to configure | ||
CONFIGURE_ARGS=${@:2} | ||
git clone https://github.com/openucx/ucx.git | ||
cd ucx | ||
git checkout ${UCX_VERSION_TAG} | ||
./autogen.sh | ||
mkdir build-linux && cd build-linux | ||
../contrib/configure-release --prefix=${CONDA_PREFIX} --with-sysroot --enable-cma \ | ||
--enable-mt --enable-numa --with-gnu-ld --with-rdmacm --with-verbs \ | ||
--with-cuda=${CUDA_HOME} \ | ||
${CONFIGURE_ARGS} | ||
make -j install |
72 changes: 72 additions & 0 deletions
72
benchmarks/shared/build_cugraph_ucx/cugraph_ucx.dockerfile
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
# Copyright (c) 2022-2023, NVIDIA CORPORATION. | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
# rdma build instructions from here: | ||
# https://github.com/rapidsai/dask-cuda-benchmarks/blob/main/runscripts/draco/docker/UCXPy-rdma-core.dockerfile | ||
ARG CUDA_VER=11.2 | ||
ARG LINUX_VER=ubuntu20.04 | ||
FROM gpuci/miniforge-cuda:$CUDA_VER-devel-$LINUX_VER | ||
|
||
RUN apt-get update -y \ | ||
&& apt-get --fix-missing upgrade -y \ | ||
&& DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends tzdata \ | ||
&& apt-get install -y \ | ||
automake \ | ||
dh-make \ | ||
git \ | ||
libcap2 \ | ||
libnuma-dev \ | ||
libtool \ | ||
make \ | ||
pkg-config \ | ||
udev \ | ||
curl \ | ||
librdmacm-dev \ | ||
rdma-core \ | ||
&& apt-get autoremove -y \ | ||
&& apt-get clean | ||
|
||
ARG PYTHON_VER=3.9 | ||
ARG RAPIDS_VER=23.02 | ||
ARG PYTORCH_VER=1.12.0 | ||
ARG CUDATOOLKIT_VER=11.3 | ||
|
||
RUN conda config --set ssl_verify false | ||
RUN conda install -c gpuci gpuci-tools | ||
RUN gpuci_conda_retry install -c conda-forge mamba | ||
|
||
|
||
RUN gpuci_mamba_retry install -y -c pytorch -c rapidsai-nightly -c rapidsai -c conda-forge -c nvidia \ | ||
cudatoolkit=$CUDATOOLKIT_VER \ | ||
cugraph=$RAPIDS_VER \ | ||
pytorch=$PYTORCH_VER \ | ||
python=$PYTHON_VER \ | ||
setuptools \ | ||
tqdm | ||
|
||
|
||
# Build ucx from source with IB support | ||
# on 1.14.x | ||
RUN conda remove --force -y ucx ucx-proc | ||
|
||
ADD build-ucx.sh /root/build-ucx.sh | ||
RUN chmod 744 /root/build-ucx.sh & bash /root/build-ucx.sh | ||
|
||
|
||
ADD test_client_bandwidth.py /root/test_client_bandwidth.py | ||
RUN chmod 777 /root/test_client_bandwidth.py | ||
ADD test_cugraph_sampling.py /root/test_cugraph_sampling.py | ||
RUN chmod 777 /root/test_cugraph_sampling.py | ||
|
||
ENV PATH /opt/conda/env/bin:$PATH | ||
WORKDIR /root/ |
80 changes: 80 additions & 0 deletions
80
benchmarks/shared/build_cugraph_ucx/test_client_bandwidth.py
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
# Copyright (c) 2022-2023, NVIDIA CORPORATION. | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
from dask_cuda import LocalCUDACluster | ||
from dask.distributed import Client, wait | ||
import cupy as cp | ||
import numpy as np | ||
import cudf | ||
import dask_cudf | ||
import rmm | ||
from time import perf_counter_ns | ||
|
||
def benchmark_func(func, n_times=10): | ||
def wrap_func(*args, **kwargs): | ||
time_ls = [] | ||
# ignore 1st run | ||
# and return other runs | ||
for _ in range(0,n_times+1): | ||
t1 = perf_counter_ns() | ||
result = func(*args, **kwargs) | ||
t2 = perf_counter_ns() | ||
time_ls.append(t2-t1) | ||
return result, time_ls[1:] | ||
return wrap_func | ||
|
||
def create_dataframe(client): | ||
n_rows = 25_000_000 | ||
df = cudf.DataFrame({'src':cp.arange(0,n_rows,dtype=cp.int32), 'dst':cp.arange(0,n_rows, dtype=cp.int32), 'eids':cp.ones(n_rows, cp.int32)}) | ||
ddf = dask_cudf.from_cudf(df,npartitions= len(client.scheduler_info()['workers'])).persist() | ||
client.rebalance(ddf) | ||
del df | ||
_ = wait(ddf) | ||
return ddf | ||
|
||
@benchmark_func | ||
def get_n_rows(ddf, n): | ||
if n==-1: | ||
df = ddf.compute() | ||
else: | ||
df = ddf.head(n) | ||
return df | ||
|
||
def run_bandwidth_test(ddf, n): | ||
df, time_ls = get_n_rows(ddf, n) | ||
time_ar = np.asarray(time_ls) | ||
time_mean = time_ar.mean() | ||
size_bytes = df.memory_usage().sum() | ||
size_gb = round(size_bytes/(pow(1024,3)), 2) | ||
print(f"Getting {len(df):,} rows of size {size_gb} took = {time_mean*1e-6} ms") | ||
time_mean_s = time_mean*1e-9 | ||
print(f"Bandwidth = {round(size_gb/time_mean_s, 4)} gb/s") | ||
return | ||
|
||
|
||
|
||
if __name__ == "__main__": | ||
cluster = LocalCUDACluster(protocol='ucx',rmm_pool_size='15GB', CUDA_VISIBLE_DEVICES='1,2,3') | ||
client = Client(cluster) | ||
rmm.reinitialize(pool_allocator=True) | ||
|
||
ddf = create_dataframe(client) | ||
run_bandwidth_test(ddf, 1_000_000) | ||
run_bandwidth_test(ddf, 2_000_000) | ||
run_bandwidth_test(ddf, 4_000_000) | ||
run_bandwidth_test(ddf, -1) | ||
|
||
print("--"*20+"Completed Test"+"--"*20, flush=True) | ||
client.shutdown() | ||
cluster.close() | ||
|
108 changes: 108 additions & 0 deletions
108
benchmarks/shared/build_cugraph_ucx/test_cugraph_sampling.py
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,108 @@ | ||
# Copyright (c) 2022-2023, NVIDIA CORPORATION. | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
from dask_cuda import LocalCUDACluster | ||
from dask.distributed import Client | ||
import rmm | ||
import numpy as np | ||
from time import perf_counter_ns | ||
from cugraph.dask.comms import comms as Comms | ||
from cugraph.dask import uniform_neighbor_sample as uniform_neighbor_sample_mg | ||
from cugraph import MultiGraph | ||
from cugraph.generators import rmat | ||
|
||
_seed = 42 | ||
|
||
def benchmark_func(func, n_times=10): | ||
def wrap_func(*args, **kwargs): | ||
time_ls = [] | ||
# ignore 1st run | ||
# and return other runs | ||
for _ in range(0,n_times+1): | ||
t1 = perf_counter_ns() | ||
result = func(*args, **kwargs) | ||
t2 = perf_counter_ns() | ||
time_ls.append(t2-t1) | ||
return result, time_ls[1:] | ||
return wrap_func | ||
|
||
def create_mg_graph(graph_data): | ||
""" | ||
Create a graph instance based on the data to be loaded/generated. | ||
""" | ||
G = MultiGraph(directed=True) | ||
# Assume strings are names of datasets in the datasets package | ||
scale = graph_data["scale"] | ||
num_edges = (2**scale) * graph_data["edgefactor"] | ||
seed = _seed | ||
edgelist_df = rmat( | ||
scale, | ||
num_edges, | ||
0.57, # from Graph500 | ||
0.19, # from Graph500 | ||
0.19, # from Graph500 | ||
seed, | ||
clip_and_flip=False, | ||
scramble_vertex_ids=False, # FIXME: need to understand relevance of this | ||
create_using=None, # None == return edgelist | ||
mg=True, | ||
) | ||
edgelist_df["weight"] = np.float32(1) | ||
|
||
G.from_dask_cudf_edgelist( | ||
edgelist_df, | ||
source="src", | ||
destination="dst", | ||
edge_attr="weight", | ||
legacy_renum_only=True, | ||
) | ||
return G | ||
|
||
@benchmark_func | ||
def sample_graph(G, start_list): | ||
output_ddf = uniform_neighbor_sample_mg(G,start_list=start_list, fanout_vals=[10,25]) | ||
df = output_ddf.compute() | ||
return df | ||
|
||
def run_sampling_test(ddf, start_list): | ||
df, time_ls = sample_graph(ddf, start_list) | ||
time_ar = np.asarray(time_ls) | ||
time_mean = time_ar.mean() | ||
print(f"Sampling {len(start_list):,} took = {time_mean*1e-6} ms") | ||
return | ||
|
||
|
||
|
||
if __name__ == "__main__": | ||
cluster = LocalCUDACluster(protocol='ucx',rmm_pool_size='15GB', CUDA_VISIBLE_DEVICES='1,2,3,4,5,6,7,8') | ||
client = Client(cluster) | ||
Comms.initialize(p2p=True) | ||
|
||
rmm.reinitialize(pool_allocator=True) | ||
|
||
graph_data = {"scale": 26, | ||
"edgefactor": 8 , | ||
} | ||
|
||
g = create_mg_graph(graph_data) | ||
|
||
for num_start_verts in [1_000, 10_000, 100_000]: | ||
start_list = g.input_df["src"].head(num_start_verts) | ||
assert len(start_list)==num_start_verts | ||
run_sampling_test(g, start_list) | ||
|
||
print("--"*20+"Completed Test"+"--"*20, flush=True) | ||
|
||
Comms.destroy() | ||
client.shutdown() | ||
cluster.close() |