[RELEASE] dask-cuda v23.02 #1109

GPUtester · 2023-02-06T17:57:29Z

❄️ Code freeze for `branch-23.02` and v23.02 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-23.02 until release (merging of this PR).

What is the purpose of this PR?

Update documentation
Allow testing for the new release
Enable a means to merge branch-23.02 into main for the release

[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]

Enables copying PRs so that GitHub Actions CI can run.

This PR unpins `dask` and `distributed` to `2022.12.0+` for `23.02` development. xref: rapidsai/cudf#12302 Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Lawrence Mitchell (https://github.com/wence-) - Ray Douglass (https://github.com/raydouglass) URL: #1060

Aligns conda channel priority in the installation guide with changes made for the 22.10.01 hotfix. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - https://github.com/jakirkham URL: #1067

Probe `__cuda_array_interface__` in `test_device_host_file_step_by_step`, to get consistent results from `safe_sizeof()`. Fixes #1070 Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1071

Authors: - Matthew Farrellee (https://github.com/mattf) - AJ Schmidt (https://github.com/ajschmidt8) - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1069

This PR adds GitHub Actions workflows to `dask-cuda`. ### Task list Coverage required for this PR: - [x] Python tests - [x] Codecov - [x] Style checks Future work required: - [Deploy sdist/wheels to PyPI](https://github.com/rapidsai/dask-cuda/blob/d6ff68daae638c30e1e2e25f2fb91ecc1ee8f6ea/ci/cpu/build.sh#L98) Authors: - Bradley Dice (https://github.com/bdice) - AJ Schmidt (https://github.com/ajschmidt8) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) URL: #1062

This PR updates the `dask-cuda` CI workflows to build against the CUDA `11.8` / Python `3.10` [branch](https://github.com/rapidsai/shared-action-workflows/tree/cuda-118) of the `shared-action-workflows` repository. Authors: - AJ Schmidt (https://github.com/ajschmidt8) Approvers: - Bradley Dice (https://github.com/bdice) - Ray Douglass (https://github.com/raydouglass) URL: #1072

A cupy array can't be used in a boolean setting (it is neither truthy nor falsy because at heart it's intuitionist) so we need to explicitly check that the owner is None. Authors: - Lawrence Mitchell (https://github.com/wence-) - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Mads R. B. Kristensen (https://github.com/madsbk) URL: #1061

Adding `--ignore-index` and balance the partition distribution between workers. This should make the runs more consist and improve the data creation significantly. Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) - Lawrence Mitchell (https://github.com/wence-) URL: #1074

For better out of memory message, JIT-unspill now check the current RMM resource stack for resources such as `StatisticsResourceAdaptor` and `TrackingResourceAdaptor` that can report the current allocated bytes. Enable by running `dask-cuda-worker` with `--rmm-track-allocations=True` or calling `dask_cuda.LocalCUDACluster` with `rmm_track_allocations=True`. This is very useful for debugging RMM fragmentation. Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1079

In order to reduce peak memory usage, this PR implements _rounds_ in explicit-comms shuffle. The idea is that each worker handles a number of dataframe partitions in each round instead of doing everything at once. The number of partitions handled in each round can be controlled by setting `DASK_EXPLICIT_COMMS_BATCHSIZE` or directly when calling `shuffle()`. By default, each worker handles one partition per round. Set `DASK_EXPLICIT_COMMS_BATCHSIZE=-1`, to handle all partitions in a single round (the previous behavior). Authors: - Mads R. B. Kristensen (https://github.com/madsbk) - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Richard (Rick) Zamora (https://github.com/rjzamora) - Lawrence Mitchell (https://github.com/wence-) URL: #1068

There were two instances recently (below) where some Python test errors caused the `conda-python-tests` job to run/hang for ~4 hours. - #981 (comment) - #1081 (comment) To prevent this from happening again in the future, I've added a reasonable timeout of ~~45 minutes to that particular job~~ 30 minutes to the `pytest` command. The job usually takes ~25 minutes to complete entirely, so 30 minutes just for `pytest` should be plenty. This timeout will help prevent jobs from hanging and thus help preserve our finite GPU capacity for CI (particularly for `arm` nodes). Authors: - AJ Schmidt (https://github.com/ajschmidt8) Approvers: - Jake Awe (https://github.com/AyodeAwe)

After dask/distributed#7429 was merged, some of those tests started hanging and I could confirm there were two threads concurrently attempting to take the UCX spinlock and the GIL, which led to such deadlock. UCX-Py is currently not thread-safe, and indeed can cause problems like this should two or more threads attempt to call communication routines that will required the UCX spinlock. My theory is that the synchronous cluster will indeed cause communication on the main thread (in this case, the `pytest` thread) upon attempting to shutdown the cluster, instead of only within the Distributed communication thread, likely being the reason behind the test hanging. Asynchronous Distributed clusters seem not to cause any communication from the main thread, but only in the communication thread as expected, thus making the tests asynchronous suffice to resolve such issues. In practice, it's unlikely that people will use sync Distributed clusters from the same process (as pytest does), and thus it's improbable to happen in real use-cases. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Mads R. B. Kristensen (https://github.com/madsbk) URL: #1084

@charlesbluca

…port` (#1085) Changed this because IIUC `pkg_resources.working_set` is listing the installed distributions and not necessarily the importable modules; this becomes an issue if the distribution and module names aren't the same (e.g. one would `conda install pillow` and then `import PIL`), which was causing some failures in CI that seem unrelated to the changes here. _Originally posted by @charlesbluca in #981 (comment) Authors: - Charles Blackmon-Luca (https://github.com/charlesbluca) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1085

Because in Python 3.10 `asyncio.get_event_loop()` does not create an event loop anymore, using synchronous `LocalCluster` raises `DeprecationWarning`s in `tornado.ioloop.IOLoop`. Ideally we should update all tests to `async`, the changes here are the minimum necessary to unblock Python 3.10. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Benjamin Zaitlen (https://github.com/quasiben) URL: #1086

Previously we had disabled `cucim` testing for Python `3.10` because the tests depended on `3.10` packages of `cudf`, which weren't previously available. Now that `3.10` packages of `cudf` are available, we can enable `3.10` testing for `cucim`. Authors: - AJ Schmidt (https://github.com/ajschmidt8) - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Jordan Jacobelli (https://github.com/Ethyling) - Peter Andreas Entschev (https://github.com/pentschev) URL: #1080

In dask/dask#9283 we are adding a new top level `dask` CLI command which can be extended by other modules using entry points. A primary motivation here is to improve discoverability by uniting everything under one tool and allowing folks to run `dask --help` and `dask <subcommand> --help` to learn more about the various tools. This PR adds a new `click` group called `cuda` and moves the `dask-cuda-worker` command under that group with the name `worker`. This means the `dask-cuda-worker` becomes `dask cuda worker` in the new CLI tool. I haven't made any changes to the existing `dask-cuda-worker` console script so that will still continue to work, but maybe we should add a deprecation warning to it? I went with this name rather than `dask cuda-worker` because I think it is more readable and also leaves us open to adding more subcommands in the future without cluttering up the top-level `dask` namespace. ```console $ dask --help Usage: dask [OPTIONS] COMMAND [ARGS]... Dask command line interface. Options: --version Show the version and exit. -h, --help Show this message and exit. Commands: cluster Manage dask clusters. cuda GPU subcommands. docs Open Dask documentation (https://docs.dask.org/) in a web browser. info Information about your dask installation. scheduler Launch a distributed scheduler. ssh Launch a distributed cluster over SSH. worker Launch a distributed worker attached to an existing SCHEDULER. ``` ```console $ dask cuda --help Usage: dask cuda [OPTIONS] COMMAND [ARGS]... GPU subcommands. Options: -h, --help Show this message and exit. Commands: worker Launch a distributed worker with GPUs attached to an existing SCHEDULER. ``` ```console $ dask cuda worker --help Usage: dask cuda worker [OPTIONS] [SCHEDULER] [PRELOAD_ARGV]... Launch a distributed worker with GPUs attached to an existing SCHEDULER. See https://docs.rapids.ai/api/dask-cuda/stable/quickstart.html#dask-cuda-worker for info. Options: --host TEXT IP address of serving host; should be visible to the scheduler and other workers. Can be a string (like ``"127.0.0.1"``) or ``None`` to fall back on the address of the interface specified by ``--interface`` or the default interface. --nthreads INTEGER Number of threads to be used for each Dask worker process. [default: 1] ... ``` The CLI PR needs to be merged and released before this can be merged. Fixes #1038 Authors: - Jacob Tomlinson (https://github.com/jacobtomlinson) - Ray Douglass (https://github.com/raydouglass) - Peter Andreas Entschev (https://github.com/pentschev) - Charles Blackmon-Luca (https://github.com/charlesbluca) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) - https://github.com/jakirkham - Peter Andreas Entschev (https://github.com/pentschev) URL: #981

@pentschev

…dask cuda config` (#1088) As @pentschev brought up in #981 (comment), we shouldn't need the `--get-cluster-configuration` option for `dask cuda config` since it only enables/disables printing the cluster configuration. Also added a check to ensure that a scheduler address or scheduler file has been specified, as otherwise IIUC running `dask cuda config` would just end up starting up and querying a local cluster on CPU. EDIT: Modified the scheduler check for `dask cuda worker` as well since it seems like a general improvement Authors: - Charles Blackmon-Luca (https://github.com/charlesbluca) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1088

@ajschmidt8

The PR adds a docs_build process to the PR and Build workflows for this repository. The generated docs are synced to s3 for only the build workflows. cc @ajschmidt8 Authors: - Ajay Thorve (https://github.com/AjayThorve) - AJ Schmidt (https://github.com/ajschmidt8) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) URL: #1089

cuDF's `partition_by_hash()` is faster than calling `compute_map_index()` followed by `scatter_by_map()`. Depend on rapidsai/cudf#12554 Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1090

Implements a `--partition-distribution` argument to `local_cudf_shuffle.py` Authors: - Mads R. B. Kristensen (https://github.com/madsbk) - Charles Blackmon-Luca (https://github.com/charlesbluca) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #1081

Authors: - https://github.com/jakirkham Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1092

Fix test that reads directly from `cudf.Buffer` pointer to new `get_ptr(mode="read")`, in accordance with changes from rapidsai/cudf#12587 . Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Lawrence Mitchell (https://github.com/wence-) URL: #1094

close #1077 Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1091

poetry version 1.5.0 broke installs of isort prior to 5.11.5 (see PyCQA/isort#2077 and PyCQA/isort#2078), so we need to upgrade. Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1098

Using `dask.config.get("explicit_comms-batchsize", 1)` doesn't read `DASK_EXPLICIT_COMMS_BATCHSIZE` correctly. Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1096

In order to improve performance, it is now possible to skip the duplication check in `ProxyManager.proxify()`. We use this in explicit-comms shuffle. Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1101

This PR updates the branch reference used for our shared workflows. I will open similar PRs for `branch-23.04` next week. Authors: - AJ Schmidt (https://github.com/ajschmidt8) Approvers: - Ray Douglass (https://github.com/raydouglass)

This PR pins `dask` and `distributed` to `2023.1.1` for `23.02` release. xref: rapidsai/cudf#12695 Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Mark Sadang (https://github.com/msadang) - Peter Andreas Entschev (https://github.com/pentschev) URL: #1106

codecov-commenter · 2023-02-07T05:35:00Z

Codecov Report

Base: 0.00% // Head: 63.92% // Increases project coverage by +63.92% 🎉

Coverage data is based on head (80d7296) compared to base (8c87288).
Patch coverage: 59.78% of modified lines in pull request are covered.

❗ Current head 80d7296 differs from pull request most recent head e2db7c9. Consider uploading reports for the commit e2db7c9 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##            main    #1109       +/-   ##
==========================================
+ Coverage   0.00%   63.92%   +63.92%     
==========================================
  Files         26       25        -1     
  Lines       3439     3171      -268     
==========================================
+ Hits           0     2027     +2027     
+ Misses      3439     1144     -2295

Impacted Files	Coverage Δ
dask_cuda/benchmarks/common.py	`0.00% <0.00%> (ø)`
dask_cuda/benchmarks/local_cudf_groupby.py	`0.00% <ø> (ø)`
dask_cuda/benchmarks/local_cudf_shuffle.py	`0.00% <0.00%> (ø)`
dask_cuda/benchmarks/utils.py	`0.00% <0.00%> (ø)`
dask_cuda/disk_io.py	`56.19% <ø> (+56.19%)`	⬆️
dask_cuda/initialize.py	`87.23% <ø> (+87.23%)`	⬆️
dask_cuda/is_spillable_object.py	`84.37% <ø> (+84.37%)`	⬆️
dask_cuda/proxy_object.py	`90.52% <ø> (+90.52%)`	⬆️
dask_cuda/cli.py	`86.40% <72.50%> (ø)`
dask_cuda/utils.py	`83.47% <82.35%> (+83.47%)`	⬆️
... and 26 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

raydouglass and others added 30 commits November 10, 2022 12:58

DOC

f4aeab2

Merge pull request #1042 from rapidsai/branch-22.12

2b4213e

[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]

Merge pull request #1043 from rapidsai/branch-22.12

3b06fef

[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]

Merge pull request #1047 from rapidsai/branch-22.12

5b5819d

[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]

Merge pull request #1048 from rapidsai/branch-22.12

466e705

[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]

Merge pull request #1050 from rapidsai/branch-22.12

324ac24

[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]

Merge pull request #1052 from rapidsai/branch-22.12

4c93def

[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]

Merge pull request #1056 from rapidsai/branch-22.12

c319971

[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]

Merge pull request #1057 from rapidsai/branch-22.12

b087770

[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]

Merge pull request #1058 from rapidsai/branch-22.12

60a5527

[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]

Enable copy_prs. [skip gpuci] (#1063)

d6ff68d

Enables copying PRs so that GitHub Actions CI can run.

Reorder channel priority. (#1067)

aedc955

Aligns conda channel priority in the installation guide with changes made for the 22.10.01 hotfix. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - https://github.com/jakirkham URL: #1067

Merge branch-22.12 into branch-23.02

f609af3

madsbk and others added 10 commits January 23, 2023 09:31

Fix whitespace & add URLs in pyproject.toml (#1092)

9ff3996

Authors: - https://github.com/jakirkham Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1092

pre-commit: spell, whitespace, and mypy check (#1091)

43969d7

close #1077 Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1091

update workflow branches [skip ci] (#1105)

7298f1e

This PR updates the branch reference used for our shared workflows. I will open similar PRs for `branch-23.04` next week. Authors: - AJ Schmidt (https://github.com/ajschmidt8) Approvers: - Ray Douglass (https://github.com/raydouglass)

GPUtester requested review from a team as code owners February 6, 2023 17:57

github-actions bot added conda conda issue gpuCI gpuCI issue python python code needed labels Feb 6, 2023

update changelog

e2db7c9

raydouglass merged commit 7664dbd into main Feb 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RELEASE] dask-cuda v23.02 #1109

[RELEASE] dask-cuda v23.02 #1109

GPUtester commented Feb 6, 2023

codecov-commenter commented Feb 7, 2023 •

edited

Loading

[RELEASE] dask-cuda v23.02 #1109

[RELEASE] dask-cuda v23.02 #1109

Conversation

GPUtester commented Feb 6, 2023

❄️ Code freeze for branch-23.02 and v23.02 release

What does this mean?

What is the purpose of this PR?

codecov-commenter commented Feb 7, 2023 • edited Loading

Codecov Report

❄️ Code freeze for `branch-23.02` and v23.02 release

codecov-commenter commented Feb 7, 2023 •

edited

Loading