Add cugraph+UCX build instructions (#3088)

This PR adds the instructions to build cugraph+UCX for benchmarking . The current conda build does not support IB as well has challenges using NVLINK . This PR adds the dockefile as well as build instructions and test to streamline benchmarking effort. The expected speedup on sampling is around `8x` by using the UCX souce build and a `20x` better bandwidth with compute. See [README](https://github.com/rapidsai/cugraph/blob/d513a86017fac6ba4742019eae47f7ffe98186e0/benchmarks/shared/build_cugraph_ucx/README.MD) for instructions . ## Benchmarks ### `dask.compute` bandwidth **With the standard conda build :** ```python3 python3 test_client_bandwidth.py Getting 1,000,000 rows of size 0.01 took = 168.32900047302246 ms Bandwidth = 0.0594 gb/s Getting 2,000,000 rows of size 0.02 took = 260.2977991104126 ms Bandwidth = 0.0768 gb/s Getting 4,000,000 rows of size 0.04 took = 288.5307312011719 ms Bandwidth = 0.1386 gb/s Getting 25,000,000 rows of size 0.28 took = 1051.6365766525269 ms Bandwidth = 0.2663 gb/s ----------------------------------------Completed Test---------------------------------------- ``` **With this **PRs** Build** : ```python3 python3 test_client_bandwidth.py Getting 1,000,000 rows of size 0.01 took = 14.061594009399414 ms Bandwidth = 0.7112 gb/s Getting 2,000,000 rows of size 0.02 took = 19.601154327392578 ms Bandwidth = 1.0203 gb/s Getting 4,000,000 rows of size 0.04 took = 15.92099666595459 ms Bandwidth = 2.5124 gb/s Getting 25,000,000 rows of size 0.28 took = 52.902913093566895 ms Bandwidth = 5.2927 gb/s ----------------------------------------Completed Test---------------------------------------- ``` ### `cugraph sampling` benchmark With the standard conda build : ```python3 python3 test_cugraph_sampling.py Sampling 1,000 took = 78.96044254302979 ms Sampling 10,000 took = 189.63768482208252 ms Sampling 100,000 took = 1084.9100351333618 ms ----------------------------------------Completed Test---------------------------------------- ``` **With this **PRs** Build** : ```python3 python3 test_cugraph_sampling.py Sampling 1,000 took = 73.62375259399414 ms Sampling 10,000 took = 87.1060848236084 ms Sampling 100,000 took = 134.7532033920288 ms ----------------------------------------Completed Test---------------------------------------- ``` Authors: - Vibhu Jawa (https://github.com/VibhuJawa) - Alex Barghi (https://github.com/alexbarghi-nv) Approvers: - Brad Rees (https://github.com/BradReesWork) - Alex Barghi (https://github.com/alexbarghi-nv) - Rick Ratzel (https://github.com/rlratzel) URL: #3088
rapidsai · Jan 4, 2023 · 976da08 · 976da08
1 parent bc2e130
commit 976da08
Show file tree

Hide file tree

Showing 5 changed files with 336 additions and 0 deletions.
diff --git a/benchmarks/shared/build_cugraph_ucx/README.MD b/benchmarks/shared/build_cugraph_ucx/README.MD
@@ -0,0 +1,57 @@
+# Cugraph with UCX
+## Docker Build Instructions
+docker build -f cugraph_ucx.dockerfile . -t cugraph_ucx
+
+## Docker Run Instructions
+docker run --privileged -it --gpus=all --net=host cugraph_ucx /bin/bash
+
+#### Client Bandwidth Test
+python3 test_client_bandwidth.py 
+
+```bash
+(base) root@exp02:/home# python3 test_client_bandwidth.py 
+2022-12-19 13:31:30,867 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
+2022-12-19 13:31:30,867 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
+2022-12-19 13:31:30,891 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
+2022-12-19 13:31:30,891 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
+2022-12-19 13:31:30,918 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
+2022-12-19 13:31:30,918 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
+Getting 1,000,000 rows  of size 0.01 took = 20.995140075683594 ms
+Bandwidth = 0.4763 gb/s
+Getting 2,000,000 rows  of size 0.02 took = 18.51017475128174 ms
+Bandwidth = 1.0805 gb/s
+Getting 4,000,000 rows  of size 0.04 took = 25.659966468811035 ms
+Bandwidth = 1.5588 gb/s
+Getting 25,000,000 rows  of size 0.28 took = 53.80809307098389 ms
+Bandwidth = 5.2037 gb/s
+----------------------------------------Completed Test----------------------------------------
+```
+
+#### Sampling Test
+python3 test_cugraph_sampling.py
+```bash
+test_client_bandwidth.py  test_cugraph_sampling.py  
+(base) root@exp02:/home# python3 test_cugraph_sampling.py 
+[1671456769.722931] [exp02:93   :0]          parser.c:1989 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
+[1671456769.722931] [exp02:93   :0]          parser.c:1989 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
+2022-12-19 13:32:56,228 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
+2022-12-19 13:32:56,228 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
+2022-12-19 13:32:57,485 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
+2022-12-19 13:32:57,485 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
+2022-12-19 13:32:57,655 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
+2022-12-19 13:32:57,655 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
+2022-12-19 13:32:57,878 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
+2022-12-19 13:32:57,879 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
+2022-12-19 13:32:57,957 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
+2022-12-19 13:32:57,957 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
+2022-12-19 13:32:57,963 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
+2022-12-19 13:32:57,963 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
+2022-12-19 13:32:57,973 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
+2022-12-19 13:32:57,973 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
+2022-12-19 13:32:58,011 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
+2022-12-19 13:32:58,012 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
+Sampling 1,000 took = 69.15879249572754 ms
+Sampling 10,000 took = 89.63620662689209 ms
+Sampling 100,000 took = 135.9888792037964 ms
+----------------------------------------Completed Test----------------------------------------
+```
diff --git a/benchmarks/shared/build_cugraph_ucx/build-ucx.sh b/benchmarks/shared/build_cugraph_ucx/build-ucx.sh
@@ -0,0 +1,19 @@
+#!/bin/bash
+# Copyright (c) 2023, NVIDIA CORPORATION.
+# SPDX-License-Identifier: Apache-2.0
+set -ex
+
+UCX_VERSION_TAG=${1:-"v1.14.x"}
+CUDA_HOME=${2:-"/usr/local/cuda"}
+# Send any remaining arguments to configure
+CONFIGURE_ARGS=${@:2}
+git clone https://github.com/openucx/ucx.git
+cd ucx
+git checkout ${UCX_VERSION_TAG}
+./autogen.sh
+mkdir build-linux && cd build-linux
+../contrib/configure-release --prefix=${CONDA_PREFIX} --with-sysroot --enable-cma \
+    --enable-mt --enable-numa --with-gnu-ld --with-rdmacm --with-verbs \
+    --with-cuda=${CUDA_HOME} \
+    ${CONFIGURE_ARGS}
+make -j install
diff --git a/benchmarks/shared/build_cugraph_ucx/cugraph_ucx.dockerfile b/benchmarks/shared/build_cugraph_ucx/cugraph_ucx.dockerfile
@@ -0,0 +1,72 @@
+# Copyright (c) 2022-2023, NVIDIA CORPORATION.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# rdma build instructions from here:
+# https://github.com/rapidsai/dask-cuda-benchmarks/blob/main/runscripts/draco/docker/UCXPy-rdma-core.dockerfile
+ARG CUDA_VER=11.2
+ARG LINUX_VER=ubuntu20.04
+FROM gpuci/miniforge-cuda:$CUDA_VER-devel-$LINUX_VER
+
+RUN apt-get update -y \
+    && apt-get --fix-missing upgrade -y \
+    && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends tzdata \
+    && apt-get install -y \
+        automake \
+        dh-make \
+        git \
+        libcap2 \
+        libnuma-dev \
+        libtool \
+        make \
+        pkg-config \
+        udev \
+        curl \
+        librdmacm-dev \
+        rdma-core \
+    && apt-get autoremove -y \
+    && apt-get clean
+
+ARG PYTHON_VER=3.9
+ARG RAPIDS_VER=23.02
+ARG PYTORCH_VER=1.12.0
+ARG CUDATOOLKIT_VER=11.3
+
+RUN conda config --set ssl_verify false
+RUN conda install -c gpuci gpuci-tools
+RUN gpuci_conda_retry install -c conda-forge mamba
+
+
+RUN gpuci_mamba_retry install -y -c pytorch -c rapidsai-nightly -c rapidsai -c conda-forge -c nvidia \
+    cudatoolkit=$CUDATOOLKIT_VER \
+    cugraph=$RAPIDS_VER \
+    pytorch=$PYTORCH_VER \
+    python=$PYTHON_VER \
+    setuptools \
+    tqdm
+
+
+# Build ucx from source with IB support 
+# on 1.14.x
+RUN conda remove --force -y ucx ucx-proc
+
+ADD build-ucx.sh /root/build-ucx.sh
+RUN chmod 744 /root/build-ucx.sh & bash /root/build-ucx.sh
+
+
+ADD test_client_bandwidth.py  /root/test_client_bandwidth.py
+RUN chmod 777 /root/test_client_bandwidth.py
+ADD test_cugraph_sampling.py  /root/test_cugraph_sampling.py
+RUN chmod 777 /root/test_cugraph_sampling.py
+
+ENV PATH /opt/conda/env/bin:$PATH
+WORKDIR /root/
diff --git a/benchmarks/shared/build_cugraph_ucx/test_client_bandwidth.py b/benchmarks/shared/build_cugraph_ucx/test_client_bandwidth.py
@@ -0,0 +1,80 @@
+# Copyright (c) 2022-2023, NVIDIA CORPORATION.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from dask_cuda import LocalCUDACluster
+from dask.distributed import Client, wait
+import cupy as cp
+import numpy as np
+import cudf
+import dask_cudf
+import rmm
+from time import perf_counter_ns
+
+def benchmark_func(func, n_times=10):
+    def wrap_func(*args, **kwargs):
+        time_ls = []
+        # ignore 1st run
+        # and return other runs
+        for _ in range(0,n_times+1):
+            t1 = perf_counter_ns()
+            result = func(*args, **kwargs)
+            t2 = perf_counter_ns()
+            time_ls.append(t2-t1)
+        return result, time_ls[1:]
+    return wrap_func
+
+def create_dataframe(client):
+    n_rows = 25_000_000
+    df = cudf.DataFrame({'src':cp.arange(0,n_rows,dtype=cp.int32), 'dst':cp.arange(0,n_rows, dtype=cp.int32), 'eids':cp.ones(n_rows, cp.int32)})
+    ddf = dask_cudf.from_cudf(df,npartitions= len(client.scheduler_info()['workers'])).persist() 
+    client.rebalance(ddf)
+    del df
+    _ = wait(ddf)
+    return ddf
+
+@benchmark_func
+def get_n_rows(ddf, n):
+    if n==-1:
+        df = ddf.compute() 
+    else:
+        df = ddf.head(n)
+    return df
+
+def run_bandwidth_test(ddf, n):
+    df, time_ls = get_n_rows(ddf, n)
+    time_ar = np.asarray(time_ls)
+    time_mean = time_ar.mean()
+    size_bytes = df.memory_usage().sum()
+    size_gb = round(size_bytes/(pow(1024,3)), 2)        
+    print(f"Getting {len(df):,} rows  of size {size_gb} took = {time_mean*1e-6} ms")
+    time_mean_s = time_mean*1e-9
+    print(f"Bandwidth = {round(size_gb/time_mean_s, 4)} gb/s")
+    return
+
+
+
+if __name__ == "__main__":
+    cluster = LocalCUDACluster(protocol='ucx',rmm_pool_size='15GB', CUDA_VISIBLE_DEVICES='1,2,3')
+    client = Client(cluster)
+    rmm.reinitialize(pool_allocator=True)
+
+    ddf = create_dataframe(client)
+    run_bandwidth_test(ddf, 1_000_000)
+    run_bandwidth_test(ddf, 2_000_000)
+    run_bandwidth_test(ddf, 4_000_000)
+    run_bandwidth_test(ddf, -1)
+
+    print("--"*20+"Completed Test"+"--"*20, flush=True)
+    client.shutdown()
+    cluster.close()
+
diff --git a/benchmarks/shared/build_cugraph_ucx/test_cugraph_sampling.py b/benchmarks/shared/build_cugraph_ucx/test_cugraph_sampling.py
@@ -0,0 +1,108 @@
+# Copyright (c) 2022-2023, NVIDIA CORPORATION.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from dask_cuda import LocalCUDACluster
+from dask.distributed import Client
+import rmm
+import numpy as np
+from time import perf_counter_ns
+from cugraph.dask.comms import comms as Comms
+from cugraph.dask import uniform_neighbor_sample as uniform_neighbor_sample_mg
+from cugraph import MultiGraph
+from cugraph.generators import rmat
+
+_seed = 42
+
+def benchmark_func(func, n_times=10):
+    def wrap_func(*args, **kwargs):
+        time_ls = []
+        # ignore 1st run
+        # and return other runs
+        for _ in range(0,n_times+1):
+            t1 = perf_counter_ns()
+            result = func(*args, **kwargs)
+            t2 = perf_counter_ns()
+            time_ls.append(t2-t1)
+        return result, time_ls[1:]
+    return wrap_func
+
+def create_mg_graph(graph_data):
+    """
+    Create a graph instance based on the data to be loaded/generated.
+    """
+    G = MultiGraph(directed=True)
+    # Assume strings are names of datasets in the datasets package
+    scale = graph_data["scale"]
+    num_edges = (2**scale) * graph_data["edgefactor"]
+    seed = _seed
+    edgelist_df = rmat(
+        scale,
+        num_edges,
+        0.57,  # from Graph500
+        0.19,  # from Graph500
+        0.19,  # from Graph500
+        seed,
+        clip_and_flip=False,
+        scramble_vertex_ids=False,  # FIXME: need to understand relevance of this
+        create_using=None,  # None == return edgelist
+        mg=True,
+    )
+    edgelist_df["weight"] = np.float32(1)
+
+    G.from_dask_cudf_edgelist(
+        edgelist_df,
+        source="src",
+        destination="dst",
+        edge_attr="weight",
+        legacy_renum_only=True,
+    )
+    return G
+
+@benchmark_func
+def sample_graph(G, start_list):
+    output_ddf = uniform_neighbor_sample_mg(G,start_list=start_list, fanout_vals=[10,25])
+    df = output_ddf.compute()
+    return df
+
+def run_sampling_test(ddf, start_list):
+    df, time_ls = sample_graph(ddf, start_list)
+    time_ar = np.asarray(time_ls)
+    time_mean = time_ar.mean()
+    print(f"Sampling {len(start_list):,} took = {time_mean*1e-6} ms")
+    return
+
+
+
+if __name__ == "__main__":
+    cluster = LocalCUDACluster(protocol='ucx',rmm_pool_size='15GB', CUDA_VISIBLE_DEVICES='1,2,3,4,5,6,7,8')
+    client = Client(cluster)
+    Comms.initialize(p2p=True)
+
+    rmm.reinitialize(pool_allocator=True)
+
+    graph_data = {"scale": 26,
+              "edgefactor": 8 ,
+              }
+
+    g = create_mg_graph(graph_data)
+
+    for num_start_verts in [1_000, 10_000, 100_000]:
+        start_list = g.input_df["src"].head(num_start_verts)
+        assert len(start_list)==num_start_verts
+        run_sampling_test(g, start_list)
+
+    print("--"*20+"Completed Test"+"--"*20, flush=True)
+
+    Comms.destroy()
+    client.shutdown()
+    cluster.close()