Skip to content

Commit

Permalink
Add cugraph+UCX build instructions (#3088)
Browse files Browse the repository at this point in the history
This PR adds the instructions to build cugraph+UCX for benchmarking . 

The current conda build does not support IB as well has challenges using NVLINK .  This PR adds the dockefile as well as build instructions and test to streamline benchmarking effort. 

The expected speedup on sampling is around `8x` by using the UCX  souce build  and a `20x` better bandwidth with compute.   

See [README](https://github.com/rapidsai/cugraph/blob/d513a86017fac6ba4742019eae47f7ffe98186e0/benchmarks/shared/build_cugraph_ucx/README.MD) for instructions . 

## Benchmarks

### `dask.compute` bandwidth
**With the standard conda build :** 
```python3
python3 test_client_bandwidth.py 

Getting 1,000,000 rows  of size 0.01 took = 168.32900047302246 ms
Bandwidth = 0.0594 gb/s
Getting 2,000,000 rows  of size 0.02 took = 260.2977991104126 ms
Bandwidth = 0.0768 gb/s
Getting 4,000,000 rows  of size 0.04 took = 288.5307312011719 ms
Bandwidth = 0.1386 gb/s
Getting 25,000,000 rows  of size 0.28 took = 1051.6365766525269 ms
Bandwidth = 0.2663 gb/s
----------------------------------------Completed Test----------------------------------------
```
**With this **PRs** Build** :
```python3
python3 test_client_bandwidth.py

Getting 1,000,000 rows  of size 0.01 took = 14.061594009399414 ms
Bandwidth = 0.7112 gb/s
Getting 2,000,000 rows  of size 0.02 took = 19.601154327392578 ms
Bandwidth = 1.0203 gb/s
Getting 4,000,000 rows  of size 0.04 took = 15.92099666595459 ms
Bandwidth = 2.5124 gb/s
Getting 25,000,000 rows  of size 0.28 took = 52.902913093566895 ms
Bandwidth = 5.2927 gb/s
----------------------------------------Completed Test----------------------------------------
```

### `cugraph sampling` benchmark

With the standard conda build : 
```python3
python3 test_cugraph_sampling.py

Sampling 1,000 took = 78.96044254302979 ms
Sampling 10,000 took = 189.63768482208252 ms
Sampling 100,000 took = 1084.9100351333618 ms
----------------------------------------Completed Test----------------------------------------
```

**With this **PRs** Build** :
```python3
python3 test_cugraph_sampling.py


Sampling 1,000 took = 73.62375259399414 ms
Sampling 10,000 took = 87.1060848236084 ms
Sampling 100,000 took = 134.7532033920288 ms
----------------------------------------Completed Test----------------------------------------
```

Authors:
  - Vibhu Jawa (https://github.com/VibhuJawa)
  - Alex Barghi (https://github.com/alexbarghi-nv)

Approvers:
  - Brad Rees (https://github.com/BradReesWork)
  - Alex Barghi (https://github.com/alexbarghi-nv)
  - Rick Ratzel (https://github.com/rlratzel)

URL: #3088
  • Loading branch information
VibhuJawa authored Jan 4, 2023
1 parent bc2e130 commit 976da08
Show file tree
Hide file tree
Showing 5 changed files with 336 additions and 0 deletions.
57 changes: 57 additions & 0 deletions benchmarks/shared/build_cugraph_ucx/README.MD
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
# Cugraph with UCX
## Docker Build Instructions
docker build -f cugraph_ucx.dockerfile . -t cugraph_ucx

## Docker Run Instructions
docker run --privileged -it --gpus=all --net=host cugraph_ucx /bin/bash

#### Client Bandwidth Test
python3 test_client_bandwidth.py

```bash
(base) root@exp02:/home# python3 test_client_bandwidth.py
2022-12-19 13:31:30,867 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-12-19 13:31:30,867 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-12-19 13:31:30,891 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-12-19 13:31:30,891 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-12-19 13:31:30,918 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-12-19 13:31:30,918 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Getting 1,000,000 rows of size 0.01 took = 20.995140075683594 ms
Bandwidth = 0.4763 gb/s
Getting 2,000,000 rows of size 0.02 took = 18.51017475128174 ms
Bandwidth = 1.0805 gb/s
Getting 4,000,000 rows of size 0.04 took = 25.659966468811035 ms
Bandwidth = 1.5588 gb/s
Getting 25,000,000 rows of size 0.28 took = 53.80809307098389 ms
Bandwidth = 5.2037 gb/s
----------------------------------------Completed Test----------------------------------------
```

#### Sampling Test
python3 test_cugraph_sampling.py
```bash
test_client_bandwidth.py test_cugraph_sampling.py
(base) root@exp02:/home# python3 test_cugraph_sampling.py
[1671456769.722931] [exp02:93 :0] parser.c:1989 UCX WARN unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1671456769.722931] [exp02:93 :0] parser.c:1989 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2022-12-19 13:32:56,228 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-12-19 13:32:56,228 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-12-19 13:32:57,485 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-12-19 13:32:57,485 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-12-19 13:32:57,655 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-12-19 13:32:57,655 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-12-19 13:32:57,878 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-12-19 13:32:57,879 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-12-19 13:32:57,957 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-12-19 13:32:57,957 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-12-19 13:32:57,963 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-12-19 13:32:57,963 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-12-19 13:32:57,973 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-12-19 13:32:57,973 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-12-19 13:32:58,011 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-12-19 13:32:58,012 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Sampling 1,000 took = 69.15879249572754 ms
Sampling 10,000 took = 89.63620662689209 ms
Sampling 100,000 took = 135.9888792037964 ms
----------------------------------------Completed Test----------------------------------------
```
19 changes: 19 additions & 0 deletions benchmarks/shared/build_cugraph_ucx/build-ucx.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
#!/bin/bash
# Copyright (c) 2023, NVIDIA CORPORATION.
# SPDX-License-Identifier: Apache-2.0
set -ex

UCX_VERSION_TAG=${1:-"v1.14.x"}
CUDA_HOME=${2:-"/usr/local/cuda"}
# Send any remaining arguments to configure
CONFIGURE_ARGS=${@:2}
git clone https://github.com/openucx/ucx.git
cd ucx
git checkout ${UCX_VERSION_TAG}
./autogen.sh
mkdir build-linux && cd build-linux
../contrib/configure-release --prefix=${CONDA_PREFIX} --with-sysroot --enable-cma \
--enable-mt --enable-numa --with-gnu-ld --with-rdmacm --with-verbs \
--with-cuda=${CUDA_HOME} \
${CONFIGURE_ARGS}
make -j install
72 changes: 72 additions & 0 deletions benchmarks/shared/build_cugraph_ucx/cugraph_ucx.dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
# Copyright (c) 2022-2023, NVIDIA CORPORATION.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# rdma build instructions from here:
# https://github.com/rapidsai/dask-cuda-benchmarks/blob/main/runscripts/draco/docker/UCXPy-rdma-core.dockerfile
ARG CUDA_VER=11.2
ARG LINUX_VER=ubuntu20.04
FROM gpuci/miniforge-cuda:$CUDA_VER-devel-$LINUX_VER

RUN apt-get update -y \
&& apt-get --fix-missing upgrade -y \
&& DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends tzdata \
&& apt-get install -y \
automake \
dh-make \
git \
libcap2 \
libnuma-dev \
libtool \
make \
pkg-config \
udev \
curl \
librdmacm-dev \
rdma-core \
&& apt-get autoremove -y \
&& apt-get clean

ARG PYTHON_VER=3.9
ARG RAPIDS_VER=23.02
ARG PYTORCH_VER=1.12.0
ARG CUDATOOLKIT_VER=11.3

RUN conda config --set ssl_verify false
RUN conda install -c gpuci gpuci-tools
RUN gpuci_conda_retry install -c conda-forge mamba


RUN gpuci_mamba_retry install -y -c pytorch -c rapidsai-nightly -c rapidsai -c conda-forge -c nvidia \
cudatoolkit=$CUDATOOLKIT_VER \
cugraph=$RAPIDS_VER \
pytorch=$PYTORCH_VER \
python=$PYTHON_VER \
setuptools \
tqdm


# Build ucx from source with IB support
# on 1.14.x
RUN conda remove --force -y ucx ucx-proc

ADD build-ucx.sh /root/build-ucx.sh
RUN chmod 744 /root/build-ucx.sh & bash /root/build-ucx.sh


ADD test_client_bandwidth.py /root/test_client_bandwidth.py
RUN chmod 777 /root/test_client_bandwidth.py
ADD test_cugraph_sampling.py /root/test_cugraph_sampling.py
RUN chmod 777 /root/test_cugraph_sampling.py

ENV PATH /opt/conda/env/bin:$PATH
WORKDIR /root/
80 changes: 80 additions & 0 deletions benchmarks/shared/build_cugraph_ucx/test_client_bandwidth.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
# Copyright (c) 2022-2023, NVIDIA CORPORATION.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
import cupy as cp
import numpy as np
import cudf
import dask_cudf
import rmm
from time import perf_counter_ns

def benchmark_func(func, n_times=10):
def wrap_func(*args, **kwargs):
time_ls = []
# ignore 1st run
# and return other runs
for _ in range(0,n_times+1):
t1 = perf_counter_ns()
result = func(*args, **kwargs)
t2 = perf_counter_ns()
time_ls.append(t2-t1)
return result, time_ls[1:]
return wrap_func

def create_dataframe(client):
n_rows = 25_000_000
df = cudf.DataFrame({'src':cp.arange(0,n_rows,dtype=cp.int32), 'dst':cp.arange(0,n_rows, dtype=cp.int32), 'eids':cp.ones(n_rows, cp.int32)})
ddf = dask_cudf.from_cudf(df,npartitions= len(client.scheduler_info()['workers'])).persist()
client.rebalance(ddf)
del df
_ = wait(ddf)
return ddf

@benchmark_func
def get_n_rows(ddf, n):
if n==-1:
df = ddf.compute()
else:
df = ddf.head(n)
return df

def run_bandwidth_test(ddf, n):
df, time_ls = get_n_rows(ddf, n)
time_ar = np.asarray(time_ls)
time_mean = time_ar.mean()
size_bytes = df.memory_usage().sum()
size_gb = round(size_bytes/(pow(1024,3)), 2)
print(f"Getting {len(df):,} rows of size {size_gb} took = {time_mean*1e-6} ms")
time_mean_s = time_mean*1e-9
print(f"Bandwidth = {round(size_gb/time_mean_s, 4)} gb/s")
return



if __name__ == "__main__":
cluster = LocalCUDACluster(protocol='ucx',rmm_pool_size='15GB', CUDA_VISIBLE_DEVICES='1,2,3')
client = Client(cluster)
rmm.reinitialize(pool_allocator=True)

ddf = create_dataframe(client)
run_bandwidth_test(ddf, 1_000_000)
run_bandwidth_test(ddf, 2_000_000)
run_bandwidth_test(ddf, 4_000_000)
run_bandwidth_test(ddf, -1)

print("--"*20+"Completed Test"+"--"*20, flush=True)
client.shutdown()
cluster.close()

108 changes: 108 additions & 0 deletions benchmarks/shared/build_cugraph_ucx/test_cugraph_sampling.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
# Copyright (c) 2022-2023, NVIDIA CORPORATION.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import rmm
import numpy as np
from time import perf_counter_ns
from cugraph.dask.comms import comms as Comms
from cugraph.dask import uniform_neighbor_sample as uniform_neighbor_sample_mg
from cugraph import MultiGraph
from cugraph.generators import rmat

_seed = 42

def benchmark_func(func, n_times=10):
def wrap_func(*args, **kwargs):
time_ls = []
# ignore 1st run
# and return other runs
for _ in range(0,n_times+1):
t1 = perf_counter_ns()
result = func(*args, **kwargs)
t2 = perf_counter_ns()
time_ls.append(t2-t1)
return result, time_ls[1:]
return wrap_func

def create_mg_graph(graph_data):
"""
Create a graph instance based on the data to be loaded/generated.
"""
G = MultiGraph(directed=True)
# Assume strings are names of datasets in the datasets package
scale = graph_data["scale"]
num_edges = (2**scale) * graph_data["edgefactor"]
seed = _seed
edgelist_df = rmat(
scale,
num_edges,
0.57, # from Graph500
0.19, # from Graph500
0.19, # from Graph500
seed,
clip_and_flip=False,
scramble_vertex_ids=False, # FIXME: need to understand relevance of this
create_using=None, # None == return edgelist
mg=True,
)
edgelist_df["weight"] = np.float32(1)

G.from_dask_cudf_edgelist(
edgelist_df,
source="src",
destination="dst",
edge_attr="weight",
legacy_renum_only=True,
)
return G

@benchmark_func
def sample_graph(G, start_list):
output_ddf = uniform_neighbor_sample_mg(G,start_list=start_list, fanout_vals=[10,25])
df = output_ddf.compute()
return df

def run_sampling_test(ddf, start_list):
df, time_ls = sample_graph(ddf, start_list)
time_ar = np.asarray(time_ls)
time_mean = time_ar.mean()
print(f"Sampling {len(start_list):,} took = {time_mean*1e-6} ms")
return



if __name__ == "__main__":
cluster = LocalCUDACluster(protocol='ucx',rmm_pool_size='15GB', CUDA_VISIBLE_DEVICES='1,2,3,4,5,6,7,8')
client = Client(cluster)
Comms.initialize(p2p=True)

rmm.reinitialize(pool_allocator=True)

graph_data = {"scale": 26,
"edgefactor": 8 ,
}

g = create_mg_graph(graph_data)

for num_start_verts in [1_000, 10_000, 100_000]:
start_list = g.input_df["src"].head(num_start_verts)
assert len(start_list)==num_start_verts
run_sampling_test(g, start_list)

print("--"*20+"Completed Test"+"--"*20, flush=True)

Comms.destroy()
client.shutdown()
cluster.close()

0 comments on commit 976da08

Please sign in to comment.