[RELEASE] dask-cuda v0.14 #298

Status: Merged (91 commits, Jun 3, 2020)

Commits
- c7abf21 DOC v0.14 Updates (raydouglass, Mar 9, 2020)
- 862f633 Merge pull request #254 from rapidsai/branch-0.13 (GPUtester, Mar 12, 2020)
- 940d56c Merge pull request #258 from rapidsai/branch-0.13 (GPUtester, Mar 19, 2020)
- 580883c Merge pull request #261 from rapidsai/branch-0.13 (GPUtester, Mar 24, 2020)
- a8c5ae4 changelog (trxcllnt, Mar 25, 2020)
- e066972 Merge pull request #262 from trxcllnt/publish/branch-0.14 (kkraus14, Mar 25, 2020)
- bc991b2 Merge pull request #264 from rapidsai/branch-0.13 (GPUtester, Mar 25, 2020)
- afaec84 Merge pull request #266 from rapidsai/branch-0.13 (GPUtester, Mar 26, 2020)
- 75d0da0 Parse memory_limit in LocalCUDACluster and default to "auto" (pentschev, Mar 29, 2020)
- 23b26dc Disable spilling to disk when memory_limit==0 (pentschev, Mar 29, 2020)
- 7c843cd Fix usage of `thread_per_worker` argument (pentschev, Mar 29, 2020)
- ceecbe6 Fix wrong references to memory_limit in LocalCUDACluster (pentschev, Mar 30, 2020)
- 5c49692 Fix usage of wrong variable names in device_host_file (pentschev, Mar 30, 2020)
- 1ba934e Add spilling tests for `memory_limit=0` (pentschev, Mar 30, 2020)
- a8c9c3e Merge pull request #269 from pentschev/fix-zero-memory-limit (quasiben, Mar 30, 2020)
- 1c3da91 Raise serialization errors when spilling (jakirkham, Mar 30, 2020)
- 00ca635 Merge pull request #272 from jakirkham/raise_spilling_serialization_errs (pentschev, Mar 31, 2020)
- a088598 Fix dask-cuda-worker memory_limit (pentschev, Apr 3, 2020)
- 69a3288 Add test for dask-cuda-worker memory_limit (pentschev, Apr 3, 2020)
- 745c110 Merge pull request #279 from pentschev/fix-dask-cuda-worker-memory-limit (quasiben, Apr 6, 2020)
- f385ef9 Add NVTX annotations for spilling (pentschev, Apr 14, 2020)
- 6b6c2de Rename and move nvtx_annotate to utils (pentschev, Apr 14, 2020)
- b5ae4fc Merge pull request #282 from pentschev/add-spilling-nvtx-annotations (pentschev, Apr 14, 2020)
- ba1565b Skip existing on conda uploads (raydouglass, Apr 27, 2020)
- 4ed7ae6 Merge pull request #284 from raydouglass/conda-upload (raydouglass, Apr 27, 2020)
- 8cb1cd9 Add rdmacm support (pentschev, Apr 27, 2020)
- 0ed93a4 local gpuci build script (efajardo-nv, Apr 29, 2020)
- a50b9df update changelog (efajardo-nv, Apr 29, 2020)
- 57760a4 Add new get_host_from_cuda_device utility function (pentschev, May 1, 2020)
- c198763 Add automatic host identification support to dask_cuda_worker (pentschev, May 1, 2020)
- 72c90b8 Remove deprecated DGX class (pentschev, May 1, 2020)
- cc26d60 Update DGX tests to use LocalCUDACluster (pentschev, May 1, 2020)
- 44ebbf6 Remove special dask/distributed condition from test_device_host_file (pentschev, May 1, 2020)
- 55e06ab Remove DGX from __init__.py (pentschev, May 1, 2020)
- 8bc75e9 Add --ucx-net-devices argument to CuPy benchmark (pentschev, May 1, 2020)
- 21ed223 Merge branch 'remove-dgx-class' into add-rdmacm-support (pentschev, May 1, 2020)
- 890f3d1 Merge pull request #286 from pentschev/remove-dgx-class (kkraus14, May 1, 2020)
- 7bb0df4 Merge branch 'branch-0.14' of https://github.com/rapidsai/dask-cuda i… (efajardo-nv, May 1, 2020)
- 8edfb66 Add dask-cuda-worker RDMACM test (pentschev, May 1, 2020)
- 0e4b3ea Add get_host_from_cuda_device test (pentschev, May 1, 2020)
- f59f641 Merge pull request #285 from efajardo-nv/gpuci-local-build (pentschev, May 1, 2020)
- 2a3489b Use get_host_from_cuda_device in LocalCUDACluster (pentschev, May 2, 2020)
- da84c0f Add RDMACM test for LocalCUDACluster (pentschev, May 2, 2020)
- f16ecfd Default "network" type to disabled in get_ucx_net_devices (pentschev, May 2, 2020)
- ea893fc Update get_host_from_cuda_device (pentschev, May 2, 2020)
- 74018bc Fix RDMACM tests (pentschev, May 2, 2020)
- d3f711f Replace use of host by interface to select listener IB device (pentschev, May 3, 2020)
- 149db31 Remove get_host_from_cuda_device (pentschev, May 3, 2020)
- 83f39e1 Apply suggestions from code review (pentschev, May 4, 2020)
- 95c6be5 Apply more suggestions from code review (pentschev, May 4, 2020)
- 658a77d Fix test_dgx imports (pentschev, May 4, 2020)
- 2b0ceff Merge pull request #287 from pentschev/add-rdmacm-support (kkraus14, May 4, 2020)
- c787ba6 initial docs setup (quasiben, May 5, 2020)
- be427c3 remove install details (quasiben, May 5, 2020)
- f968e8d RTD setup (quasiben, May 5, 2020)
- 312ddd7 updates (quasiben, May 5, 2020)
- a2de3e8 Raise ValueError when ucx_net_devices="auto" IB is disabled (pentschev, May 5, 2020)
- 84e250a Tests get_ucx_config raising exception (pentschev, May 5, 2020)
- 9e9cacc Update documentation for ucx_net_devices (pentschev, May 5, 2020)
- f868e55 Update LocalCUDACluster documentation for ucx_net_devices (pentschev, May 5, 2020)
- b18177b Merge pull request #291 from pentschev/enforce-ucx-net-devices-auto-w… (quasiben, May 6, 2020)
- 7d5bf84 use rapids-dev-doc (quasiben, May 6, 2020)
- 8775b1f Merge pull request #290 from quasiben/fea-docs (quasiben, May 6, 2020)
- 8ffc9ab Add multi-node support to CuPy benchmark (pentschev, May 7, 2020)
- 8c33658 Add benchmarks utils functions (pentschev, May 7, 2020)
- d9b1812 Use benchmarks.utils to simplify CuPy benchmark (pentschev, May 7, 2020)
- af1e45f Remove asynchronous from benchmarks.utils.get_cluster_options (pentschev, May 7, 2020)
- 12f39a5 Use benchmarks.utils to simplify cuDF benchmark (pentschev, May 7, 2020)
- 6d83523 Fix black/isort formatting (pentschev, May 7, 2020)
- c959ab0 Sleep to give time for benchmark SSHCluster to spin (pentschev, May 7, 2020)
- 189d93f Add example to benchmark --hosts option. (pentschev, May 7, 2020)
- 503e573 Add docs for UCX (pentschev, May 7, 2020)
- 9d4093b Merge pull request #293 from pentschev/multi-node-benchmarks (quasiben, May 8, 2020)
- a1899d7 Add Specializations for GPU Usage docs page (pentschev, May 8, 2020)
- 0e8732e Improve UCX docs text and formatting (pentschev, May 8, 2020)
- e4db5ae Add cross-reference links to docs pages (pentschev, May 8, 2020)
- 8f2e71f Improve specialization docs page text (pentschev, May 8, 2020)
- 187661e Merge pull request #294 from pentschev/ucx-docs (quasiben, May 8, 2020)
- cbd4cda Add `--runs` argument to CuPy benchmark (pentschev, May 8, 2020)
- 06c36f9 Merge pull request #295 from pentschev/cupy-benchmark-runs-arg (pentschev, May 8, 2020)
- 81faddb Fixing LocalCUDACluster example. Adding README link to docs (randerzander, May 13, 2020)
- 0aecc82 Merge pull request #297 from randerzander/branch-0.14 (quasiben, May 13, 2020)
- 2e09313 Add `nfinal` argument to shuffle_group, required in Dask >= 2.17 (pentschev, May 28, 2020)
- 2f0dda3 Fix isort formatting (pentschev, May 28, 2020)
- ede89f4 Fix black formatting (pentschev, May 28, 2020)
- 743a05b Fix flake8 errors (pentschev, May 28, 2020)
- 05ed330 Fix path to run flake8 on (pentschev, May 28, 2020)
- 8a0eb98 Fix typo in flake8 directory (pentschev, May 28, 2020)
- 1ea901c Merge pull request #299 from pentschev/fix-shuffle-group (raydouglass, May 29, 2020)
- 2f9d5dd Update changelog for 0.14 (pentschev, Jun 2, 2020)
- f6fe329 Merge pull request #303 from pentschev/update-changelog-0.14 (mike-wendt, Jun 2, 2020)
24 changes: 24 additions & 0 deletions CHANGELOG.rst
@@ -1,3 +1,24 @@
+0.14
+----
+- Publish branch-0.14 to conda (#262) `Paul Taylor`_
+- Fix behavior for `memory_limit=0` (#269) `Peter Andreas Entschev`_
+- Raise serialization errors when spilling (#272) `John Kirkham`_
+- Fix dask-cuda-worker memory_limit (#279) `Peter Andreas Entschev`_
+- Add NVTX annotations for spilling (#282) `Peter Andreas Entschev`_
+- Skip existing on conda uploads (#284) `Ray Douglass`_
+- Local gpuCI build script (#285) `Eli Fajardo`_
+- Remove deprecated DGX class (#286) `Peter Andreas Entschev`_
+- Add RDMACM support (#287) `Peter Andreas Entschev`_
+- Read the Docs Setup (#290) `Benjamin Zaitlen`_
+- Raise ValueError when ucx_net_devices="auto" and IB is disabled (#291) `Peter Andreas Entschev`_
+- Multi-node benchmarks (#293) `Peter Andreas Entschev`_
+- Add docs for UCX (#294) `Peter Andreas Entschev`_
+- Add `--runs` argument to CuPy benchmark (#295) `Peter Andreas Entschev`_
+- Fixing LocalCUDACluster example. Adding README links to docs (#297) `Randy Gelhausen`_
+- Add `nfinal` argument to shuffle_group, required in Dask >= 2.17 (#299) `Peter Andreas Entschev`_
+- Initialize parent process' UCX configuration (#301) `Peter Andreas Entschev`_
+- Add Read the Docs link (#302) `John Kirkham`_
+
 0.13
 ----
 - Use RMM's `DeviceBuffer` directly (#235) `John Kirkham`_
@@ -119,3 +140,6 @@
 .. _`Richard (Rick) Zamora`: https://github.com/rjzamora
 .. _`Benjamin Zaitlen`: https://github.com/quasiben
 .. _`Ray Douglass`: https://github.com/raydouglass
+.. _`Paul Taylor`: https://github.com/trxcllnt
+.. _`Eli Fajardo`: https://github.com/efajardo-nv
+.. _`Randy Gelhausen`: https://github.com/randerzander
2 changes: 2 additions & 0 deletions README.md
@@ -18,6 +18,8 @@ cluster = LocalCUDACluster()
 client = Client(cluster)
 ```

+Documentation is available [here](https://dask-cuda.readthedocs.io/).
+
 What this is not
 ----------------

2 changes: 1 addition & 1 deletion ci/checks/style.sh
@@ -20,7 +20,7 @@ BLACK=`black --check .`
 BLACK_RETVAL=$?

 # Run flake8 and get results/return code
-FLAKE=`flake8 python`
+FLAKE=`flake8 dask_cuda`
 RETVAL=$?

 # Output results if failure otherwise show pass
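The style script stores each tool's output with a command substitution and reads `$?` on the very next line. This works because, in bash, an assignment whose value comes from a command substitution takes the exit status of that substitution. A minimal sketch of the pattern, in which `fake_flake8` is a hypothetical stand-in for `flake8 dask_cuda`:

```shell
#!/bin/bash
# Hypothetical stand-in for `flake8 dask_cuda`: prints one finding and fails
fake_flake8() {
  echo "dask_cuda/example.py:1:1: F401 'os' imported but unused"
  return 1
}

# Same pattern as ci/checks/style.sh: capture the output, then read the status
FLAKE=$(fake_flake8)
RETVAL=$?

# Show the captured output only when the check failed
if [ "$RETVAL" != "0" ]; then
  echo "flake8 check failed:"
  echo "$FLAKE"
else
  echo "flake8 check passed"
fi
```

The status has to be read immediately: any command placed between the assignment and `RETVAL=$?` would overwrite `$?`.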
2 changes: 1 addition & 1 deletion ci/cpu/upload-anaconda.sh
@@ -25,4 +25,4 @@ fi

 echo "Upload"
 echo ${UPLOADFILE}
-anaconda -t ${MY_UPLOAD_KEY} upload -u ${CONDA_USERNAME:-rapidsai} ${LABEL_OPTION} --force ${UPLOADFILE}
+anaconda -t ${MY_UPLOAD_KEY} upload -u ${CONDA_USERNAME:-rapidsai} ${LABEL_OPTION} --skip-existing ${UPLOADFILE}
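Per anaconda-client's option descriptions, `--force` forces the upload regardless of errors, while `--skip-existing` skips files that already exist on the channel, so a re-run of the upload step no longer replaces published packages. The sketch below is a dry run: it only assembles and prints the command. The token, label, and file name are hypothetical placeholders, and `CONDA_USERNAME` is unset to show the `${CONDA_USERNAME:-rapidsai}` fallback.

```shell
#!/bin/bash
# Hypothetical placeholder values; in CI these come from the environment
MY_UPLOAD_KEY="dummy-token"
LABEL_OPTION="--label main"
UPLOADFILE="dask-cuda-0.14.0-py37_0.tar.bz2"
unset CONDA_USERNAME   # so ${CONDA_USERNAME:-rapidsai} falls back to "rapidsai"

# Build the command as an array instead of running it (dry run).
# LABEL_OPTION stays unquoted so it splits into two words, as in the script.
UPLOAD_CMD=(anaconda -t "${MY_UPLOAD_KEY}" upload
            -u "${CONDA_USERNAME:-rapidsai}" ${LABEL_OPTION}
            --skip-existing "${UPLOADFILE}")

echo "Would run: ${UPLOAD_CMD[*]}"
```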
57 changes: 57 additions & 0 deletions ci/local/README.md
@@ -0,0 +1,57 @@
## Purpose

This script is designed for developer and contributor use. This tool mimics the actions of gpuCI on your local machine. This allows you to test and even debug your code inside a gpuCI base container before pushing your code as a GitHub commit.
The script can be helpful in locally triaging and debugging RAPIDS continuous integration failures.

## Requirements

```
nvidia-docker
```

## Usage

```
bash build.sh [-h] [-H] [-s] [-r <repo_dir>] [-i <image_name>]
Build and test your local repository using a base gpuCI Docker image
where:
-H Show this help text
-r Path to repository (defaults to working directory)
-i Use Docker image (default is gpuci/rapidsai-base:cuda10.0-ubuntu16.04-gcc5-py3.6)
-s Skip building and testing and start an interactive shell in a container of the Docker image
```

Example Usage:
`bash build.sh -r ~/rapids/dask-cuda -i gpuci/rapidsai-base:cuda10.1-ubuntu16.04-gcc5-py3.6`

For a full list of available gpuCI docker images, visit our [DockerHub](https://hub.docker.com/r/gpuci/rapidsai-base/tags) page.

Style Check:
```bash
$ bash ci/local/build.sh -r ~/rapids/dask-cuda -s
$ source activate gdf  # Activate gpuCI conda environment
$ cd /rapids/dask-cuda
$ flake8 dask_cuda
```

## Information

There are some caveats to be aware of when using this script, especially if you plan on developing from within the container itself.


### Docker Image Build Repository

The docker image will generate build artifacts in a folder on your machine located in the `root` directory of the repository you passed to the script. For the above example, the directory is named `~/rapids/dask-cuda/build_rapidsai-base_cuda10.1-ubuntu16.04-gcc5-py3.6/`. Feel free to remove this directory after the script is finished.

*Note*: The script *will not* override your local build repository. Your local environment stays intact.
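The directory name comes from `BASE_CONTAINER_BUILD_DIR` in `build.sh`: `build_` followed by the image's basename with every `:` replaced by `_`. The sketch below reproduces that derivation for the example image above, using bash pattern substitution (`${var//:/_}`) in place of the script's `sed` call; `/home/user` is a hypothetical home directory.

```shell
#!/bin/bash
# Values from the README example; /home/user is a hypothetical $HOME
REPO_PATH="/home/user/rapids/dask-cuda"
DOCKER_IMAGE="gpuci/rapidsai-base:cuda10.1-ubuntu16.04-gcc5-py3.6"

# basename drops the "gpuci/" prefix; ${var//:/_} replaces every ":" with "_"
IMAGE_NAME=$(basename "${DOCKER_IMAGE}")
BASE_CONTAINER_BUILD_DIR="${REPO_PATH}/build_${IMAGE_NAME//:/_}"

echo "${BASE_CONTAINER_BUILD_DIR}"
# /home/user/rapids/dask-cuda/build_rapidsai-base_cuda10.1-ubuntu16.04-gcc5-py3.6
```

Because the name is derived from the image, builds against different images land in different directories and can coexist.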


### Where The User is Dumped

The script will build your repository and run all tests. If any tests fail, it drops you into the Docker container so you can debug from within it. If all tests pass, the container exits and is automatically removed. Remember to exit the container if tests fail and you do not wish to debug inside it.
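This behaviour comes from the command `build.sh` hands to `docker run`, which has the form `<generated build script> || bash`: the interactive shell starts only when the build script exits non-zero, and `--rm` removes the container once it exits. A sketch of the `||` fallback, with hypothetical `run_tests` and `open_shell` functions standing in for the build script and `bash`:

```shell
#!/bin/bash
# Hypothetical stand-ins for "<build script> || bash"
run_tests()  { return "$1"; }             # exit status chosen by the caller
open_shell() { echo "debug shell opened"; }

# The right-hand side of || runs only if the left-hand side fails
PASS_RESULT=$(run_tests 0 || open_shell)
FAIL_RESULT=$(run_tests 1 || open_shell)

echo "tests pass -> '${PASS_RESULT}'"
echo "tests fail -> '${FAIL_RESULT}'"
```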


### Container File Structure

Your repository will be located in the `/rapids/` folder of the container. This folder is volume mounted from the local machine, so any changes to the code in this repository are replicated onto the local machine. The `cpp/build` and `python/build` directories within your repository are on separate mounts to avoid conflicting with your local build artifacts.
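`build.sh` mounts the repository at `/rapids/<repository directory name>` and overlays the two build directories with their own mounts. A sketch of that path arithmetic, reusing the variable names from `build.sh` with a hypothetical host path:

```shell
#!/bin/bash
REPO_PATH="/home/user/rapids/dask-cuda"   # hypothetical host path
RAPIDS_DIR_IN_CONTAINER="/rapids"
CPP_BUILD_DIR="cpp/build"
PYTHON_BUILD_DIR="python/build"

# The container path keeps only the repository's directory name
REPO_PATH_IN_CONTAINER="${RAPIDS_DIR_IN_CONTAINER}/$(basename "${REPO_PATH}")"
CPP_BUILD_DIR_IN_CONTAINER="${REPO_PATH_IN_CONTAINER}/${CPP_BUILD_DIR}"
PYTHON_BUILD_DIR_IN_CONTAINER="${REPO_PATH_IN_CONTAINER}/${PYTHON_BUILD_DIR}"

printf '%s\n' "${REPO_PATH_IN_CONTAINER}" \
  "${CPP_BUILD_DIR_IN_CONTAINER}" "${PYTHON_BUILD_DIR_IN_CONTAINER}"
# /rapids/dask-cuda
# /rapids/dask-cuda/cpp/build
# /rapids/dask-cuda/python/build
```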
142 changes: 142 additions & 0 deletions ci/local/build.sh
@@ -0,0 +1,142 @@
#!/bin/bash

DOCKER_IMAGE="gpuci/rapidsai-base:cuda10.0-ubuntu16.04-gcc5-py3.6"
REPO_PATH=${PWD}
RAPIDS_DIR_IN_CONTAINER="/rapids"
CPP_BUILD_DIR="cpp/build"
PYTHON_BUILD_DIR="python/build"
CONTAINER_SHELL_ONLY=0

SHORTHELP="$(basename "$0") [-h] [-H] [-s] [-r <repo_dir>] [-i <image_name>]"
LONGHELP="${SHORTHELP}
Build and test your local repository using a base gpuCI Docker image
where:
-H Show this help text
-r Path to repository (defaults to working directory)
-i Use Docker image (default is ${DOCKER_IMAGE})
-s Skip building and testing and start an interactive shell in a container of the Docker image
"

# Limit GPUs available to container based on CUDA_VISIBLE_DEVICES
if [[ -z "${CUDA_VISIBLE_DEVICES}" ]]; then
NVIDIA_VISIBLE_DEVICES="all"
else
NVIDIA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}
fi

while getopts ":hHr:i:s" option; do
case ${option} in
r)
REPO_PATH=${OPTARG}
;;
i)
DOCKER_IMAGE=${OPTARG}
;;
s)
CONTAINER_SHELL_ONLY=1
;;
h)
echo "${SHORTHELP}"
exit 0
;;
H)
echo "${LONGHELP}"
exit 0
;;
*)
echo "ERROR: Invalid flag"
echo "${SHORTHELP}"
exit 1
;;
esac
done

REPO_PATH_IN_CONTAINER="${RAPIDS_DIR_IN_CONTAINER}/$(basename "${REPO_PATH}")"
CPP_BUILD_DIR_IN_CONTAINER="${RAPIDS_DIR_IN_CONTAINER}/$(basename "${REPO_PATH}")/${CPP_BUILD_DIR}"
PYTHON_BUILD_DIR_IN_CONTAINER="${RAPIDS_DIR_IN_CONTAINER}/$(basename "${REPO_PATH}")/${PYTHON_BUILD_DIR}"


# BASE_CONTAINER_BUILD_DIR is named after the image name, allowing for
# multiple image builds to coexist on the local filesystem. This will
# be mapped to the typical BUILD_DIR inside of the container. Builds
# running in the container generate build artifacts just as they would
# in a bare-metal environment, and the host filesystem is able to
# maintain the host build in BUILD_DIR as well.
# FIXME: Fix the shellcheck complaints
# shellcheck disable=SC2001,SC2005,SC2046
BASE_CONTAINER_BUILD_DIR=${REPO_PATH}/build_$(echo $(basename "${DOCKER_IMAGE}")|sed -e 's/:/_/g')
CPP_CONTAINER_BUILD_DIR=${BASE_CONTAINER_BUILD_DIR}/cpp
PYTHON_CONTAINER_BUILD_DIR=${BASE_CONTAINER_BUILD_DIR}/python
# Create build directories. This ensures they have the correct owner: if they
# don't exist, Docker volume mounting creates them as a side effect, owned by
# root (the volume mount points).
mkdir -p "${REPO_PATH}/${CPP_BUILD_DIR}"
mkdir -p "${REPO_PATH}/${PYTHON_BUILD_DIR}"

BUILD_SCRIPT="#!/bin/bash
set -e
WORKSPACE=${REPO_PATH_IN_CONTAINER}
PREBUILD_SCRIPT=${REPO_PATH_IN_CONTAINER}/ci/gpu/prebuild.sh
BUILD_SCRIPT=${REPO_PATH_IN_CONTAINER}/ci/gpu/build.sh
cd \${WORKSPACE}
if [ -f \${PREBUILD_SCRIPT} ]; then
source \${PREBUILD_SCRIPT}
fi
yes | source \${BUILD_SCRIPT}
"

if (( CONTAINER_SHELL_ONLY == 0 )); then
COMMAND="${CPP_BUILD_DIR_IN_CONTAINER}/build.sh || bash"
else
COMMAND="bash"
fi

# Create the build dir for the container to mount, generate the build script inside of it
mkdir -p "${BASE_CONTAINER_BUILD_DIR}"
mkdir -p "${CPP_CONTAINER_BUILD_DIR}"
mkdir -p "${PYTHON_CONTAINER_BUILD_DIR}"
echo "${BUILD_SCRIPT}" > "${CPP_CONTAINER_BUILD_DIR}/build.sh"
chmod ugo+x "${CPP_CONTAINER_BUILD_DIR}/build.sh"

# Mount passwd and group files to docker. This allows docker to resolve username and group
# avoiding these nags:
# * groups: cannot find name for group ID ID
# * I have no name!@id:/$
# For LDAP users, user information is not present in the system /etc/passwd and /etc/group files.
# Hence we generate dummy files for LDAP users, which docker uses to resolve username and group

PASSWD_FILE="/etc/passwd"
GROUP_FILE="/etc/group"

USER_FOUND=$(grep -wc "$(whoami)" < "$PASSWD_FILE")
if [ "$USER_FOUND" == 0 ]; then
echo "Local User not found, LDAP WAR for docker mounts activated. Creating dummy passwd and group"
echo "files to allow docker to resolve username and group"
cp "$PASSWD_FILE" /tmp/passwd
PASSWD_FILE="/tmp/passwd"
cp "$GROUP_FILE" /tmp/group
GROUP_FILE="/tmp/group"
echo "$(whoami):x:$(id -u):$(id -g):$(whoami),,,:$HOME:$SHELL" >> "$PASSWD_FILE"
echo "$(whoami):x:$(id -g):" >> "$GROUP_FILE"
fi

# Run the generated build script in a container
docker pull "${DOCKER_IMAGE}"

DOCKER_MAJOR=$(docker -v|sed 's/[^[0-9]*\([0-9]*\).*/\1/')
GPU_OPTS="--gpus device=${NVIDIA_VISIBLE_DEVICES}"
if [ "$DOCKER_MAJOR" -lt 19 ]
then
GPU_OPTS="--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES='${NVIDIA_VISIBLE_DEVICES}'"
fi

docker run --rm -it ${GPU_OPTS} \
-u "$(id -u)":"$(id -g)" \
-v "${REPO_PATH}":"${REPO_PATH_IN_CONTAINER}" \
-v "${CPP_CONTAINER_BUILD_DIR}":"${CPP_BUILD_DIR_IN_CONTAINER}" \
-v "${PYTHON_CONTAINER_BUILD_DIR}":"${PYTHON_BUILD_DIR_IN_CONTAINER}" \
-v "$PASSWD_FILE":/etc/passwd:ro \
-v "$GROUP_FILE":/etc/group:ro \
--cap-add=SYS_PTRACE \
"${DOCKER_IMAGE}" bash -c "${COMMAND}"
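`build.sh` chooses its GPU flags from the major Docker version: 19 or later uses `--gpus`, older releases fall back to the `nvidia` runtime. A sketch of that selection as a function fed with sample `docker -v` strings; the function name and version strings are hypothetical, while the `sed` expression is the script's own.

```shell
#!/bin/bash
# Hypothetical helper mirroring the GPU_OPTS logic in build.sh
select_gpu_opts() {
  local version_string="$1" devices="$2" major
  # Same sed expression as build.sh: keep the first run of digits
  major=$(echo "${version_string}" | sed 's/[^[0-9]*\([0-9]*\).*/\1/')
  if [ "${major}" -lt 19 ]; then
    printf '%s\n' "--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=${devices}"
  else
    printf '%s\n' "--gpus device=${devices}"
  fi
}

# Same default as build.sh: expose all GPUs unless CUDA_VISIBLE_DEVICES is set
NVIDIA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-all}"

NEW_OPTS=$(select_gpu_opts "Docker version 19.03.8, build afacb8b" "${NVIDIA_VISIBLE_DEVICES}")
OLD_OPTS=$(select_gpu_opts "Docker version 18.09.7, build 2d0083d" "0")
echo "${NEW_OPTS}"
echo "${OLD_OPTS}"
```

One deliberate difference: the sketch leaves out the single quotes that `build.sh` places around `${NVIDIA_VISIBLE_DEVICES}`, because quotes stored inside a variable that is later expanded unquoted are passed through as literal characters rather than removed by the shell.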
6 changes: 6 additions & 0 deletions conda/environments/builddocs_py37.yml
@@ -0,0 +1,6 @@
name: dask_cuda_docs
channels:
- rapidsai-nightly
- conda-forge
dependencies:
- rapids-doc-env
1 change: 0 additions & 1 deletion dask_cuda/__init__.py
@@ -1,5 +1,4 @@
 from ._version import get_versions
-from .dgx import DGX
 from .local_cuda_cluster import LocalCUDACluster

 __version__ = get_versions()["version"]