
[RELEASE] dask-cuda v23.02 #1109

Merged
merged 41 commits into from
Feb 9, 2023
Conversation

GPUtester
Contributor

❄️ Code freeze for branch-23.02 and v23.02 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-23.02 until release (merging of this PR).

What is the purpose of this PR?

  • Update documentation
  • Allow testing for the new release
  • Enable a means to merge branch-23.02 into main for the release

raydouglass and others added 30 commits November 10, 2022 12:58
[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]
[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]
[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]
[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]
[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]
[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]
[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]
[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]
[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]
Enables copying PRs so that GitHub Actions CI can run.
This PR unpins `dask` and `distributed` to `2022.12.0+` for `23.02` development.


xref: rapidsai/cudf#12302

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Ray Douglass (https://github.com/raydouglass)

URL: #1060
Aligns conda channel priority in the installation guide with changes made for the 22.10.01 hotfix.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - https://github.com/jakirkham

URL: #1067
Probe `__cuda_array_interface__` in `test_device_host_file_step_by_step`, to get consistent results from `safe_sizeof()`.

Fixes #1070

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #1071
This PR adds GitHub Actions workflows to `dask-cuda`.

### Task list

Coverage required for this PR:
- [x] Python tests
- [x] Codecov
- [x] Style checks

Future work required:
- [Deploy sdist/wheels to PyPI](https://github.com/rapidsai/dask-cuda/blob/d6ff68daae638c30e1e2e25f2fb91ecc1ee8f6ea/ci/cpu/build.sh#L98)

Authors:
  - Bradley Dice (https://github.com/bdice)
  - AJ Schmidt (https://github.com/ajschmidt8)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: #1062
This PR updates the `dask-cuda` CI workflows to build against the CUDA `11.8` / Python `3.10` [branch](https://github.com/rapidsai/shared-action-workflows/tree/cuda-118) of the `shared-action-workflows` repository.

Authors:
  - AJ Schmidt (https://github.com/ajschmidt8)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Ray Douglass (https://github.com/raydouglass)

URL: #1072
A cupy array can't be used in a boolean context (it is neither truthy nor falsy; at heart it's intuitionistic), so we need to explicitly check whether the owner is `None`.

Authors:
  - Lawrence Mitchell (https://github.com/wence-)
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)

URL: #1061
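The pitfall can be illustrated with NumPy as a stand-in for CuPy (both raise on ambiguous truth-value tests of multi-element arrays); `release_owner` is a hypothetical helper, not dask-cuda's actual code:

```python
import numpy as np

def release_owner(owner):
    # The buggy form was effectively `if owner:`, which raises
    # ValueError for a multi-element array. The fix checks identity:
    if owner is not None:
        return "released"
    return "nothing to release"

arr = np.array([1, 2, 3])

ambiguous = False
try:
    bool(arr)  # what an implicit truthiness test would do
except ValueError:
    ambiguous = True

print(ambiguous)            # True: the truth value is ambiguous
print(release_owner(arr))   # released
print(release_owner(None))  # nothing to release
```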
Adds `--ignore-index` and balances the partition distribution between workers.

This should make runs more consistent and significantly speed up data creation.

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #1074
For a better out-of-memory error message, JIT-unspill now checks the current RMM resource stack for resources such as `StatisticsResourceAdaptor` and `TrackingResourceAdaptor` that can report the currently allocated bytes.

Enable this by running `dask-cuda-worker` with `--rmm-track-allocations=True` or by calling `dask_cuda.LocalCUDACluster` with `rmm_track_allocations=True`.

This is very useful for debugging RMM fragmentation.

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #1079
In order to reduce peak memory usage, this PR implements _rounds_ in explicit-comms shuffle.
The idea is that each worker handles a number of dataframe partitions in each round instead of doing everything at once.
The number of partitions handled in each round can be controlled by setting `DASK_EXPLICIT_COMMS_BATCHSIZE` or directly when calling `shuffle()`. 

By default, each worker handles one partition per round. 
Set `DASK_EXPLICIT_COMMS_BATCHSIZE=-1` to handle all partitions in a single round (the previous behavior).

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #1068
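The round-based scheme can be sketched in plain Python (the function name and the placeholder "work" are hypothetical, not dask-cuda's API):

```python
def shuffle_in_rounds(partitions, batchsize=1):
    """Process `partitions` in rounds of `batchsize`; -1 means all at once."""
    if batchsize == -1:
        batchsize = len(partitions)
    results = []
    for start in range(0, len(partitions), batchsize):
        batch = partitions[start : start + batchsize]
        # In the real shuffle, each round exchanges only this batch of
        # partitions between workers, bounding peak memory usage.
        results.extend(p * 2 for p in batch)  # placeholder "work"
    return results

print(shuffle_in_rounds([1, 2, 3, 4, 5], batchsize=2))  # [2, 4, 6, 8, 10]
print(shuffle_in_rounds([1, 2, 3], batchsize=-1))       # [2, 4, 6]
```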
There were two instances recently (below) where some Python test errors caused the `conda-python-tests` job to run/hang for ~4 hours.

- #981 (comment)
- #1081 (comment)

To prevent this from happening again in the future, I've added a reasonable timeout of ~~45 minutes to that particular job~~ 30 minutes to the `pytest` command.

The job usually takes ~25 minutes to complete entirely, so 30 minutes just for `pytest` should be plenty.

This timeout will help prevent jobs from hanging and thus help preserve our finite GPU capacity for CI (particularly for `arm` nodes).

Authors:
   - AJ Schmidt (https://github.com/ajschmidt8)

Approvers:
   - Jake Awe (https://github.com/AyodeAwe)
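The effect of such a hard time limit can be illustrated with the standard library (short durations stand in for the ~30-minute CI limit; the actual CI mechanism may differ):

```python
import subprocess

# Bound a long-running command by a hard time limit, analogous to
# capping the `pytest` invocation in CI. `sleep 5` stands in for a
# hung test suite; the 1-second limit stands in for 30 minutes.
try:
    subprocess.run(["sleep", "5"], timeout=1)
    timed_out = False
except subprocess.TimeoutExpired:
    timed_out = True

print("timed out:", timed_out)  # timed out: True
```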
After dask/distributed#7429 was merged, some of those tests started hanging, and I could confirm there were two threads concurrently attempting to take the UCX spinlock and the GIL, which led to a deadlock. UCX-Py is currently not thread-safe and can indeed cause problems like this should two or more threads attempt to call communication routines that require the UCX spinlock. My theory is that the synchronous cluster causes communication on the main thread (in this case, the `pytest` thread) when attempting to shut down the cluster, instead of only within the Distributed communication thread, which is likely the reason behind the test hanging.

Asynchronous Distributed clusters seem not to cause any communication from the main thread, only in the communication thread as expected, so making the tests asynchronous suffices to resolve the issue. In practice, it's unlikely that people will use synchronous Distributed clusters from the same process (as pytest does), so this is improbable in real use-cases.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)

URL: #1084
…port` (#1085)

Changed this because IIUC `pkg_resources.working_set` is listing the installed distributions and not necessarily the importable modules; this becomes an issue if the distribution and module names aren't the same (e.g. one would `conda install pillow` and then `import PIL`), which was causing some failures in CI that seem unrelated to the changes here.

_Originally posted by @charlesbluca in #981 (comment)_

Authors:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #1085
Because in Python 3.10 `asyncio.get_event_loop()` does not create an event loop anymore, using synchronous `LocalCluster` raises `DeprecationWarning`s in `tornado.ioloop.IOLoop`. Ideally we should update all tests to `async`, the changes here are the minimum necessary to unblock Python 3.10.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: #1086
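A minimal illustration of the pattern (a stand-in coroutine replaces the actual `LocalCUDACluster(asynchronous=True)` usage):

```python
import asyncio

async def run_cluster():
    # Stand-in for: async with LocalCUDACluster(asynchronous=True) as cluster:
    #                   ...
    await asyncio.sleep(0)
    return "cluster ran"

# asyncio.run() creates and tears down its own event loop, so nothing
# relies on the implicit loop creation that get_event_loop() no longer
# performs on Python 3.10+.
result = asyncio.run(run_cluster())
print(result)  # cluster ran
```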
Previously we had disabled `cucim` testing for Python `3.10` because the tests depended on `3.10` packages of `cudf`, which weren't previously available.

Now that `3.10` packages of `cudf` are available, we can enable `3.10` testing for `cucim`.

Authors:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Jordan Jacobelli (https://github.com/Ethyling)
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #1080
In dask/dask#9283 we are adding a new top level `dask` CLI command which can be extended by other modules using entry points. A primary motivation here is to improve discoverability by uniting everything under one tool and allowing folks to run `dask --help` and `dask <subcommand> --help` to learn more about the various tools.

This PR adds a new `click` group called `cuda` and moves the `dask-cuda-worker` command under that group with the name `worker`. This means the `dask-cuda-worker` becomes `dask cuda worker` in the new CLI tool.

I haven't made any changes to the existing `dask-cuda-worker` console script so that will still continue to work, but maybe we should add a deprecation warning to it?

I went with this name rather than `dask cuda-worker` because I think it is more readable and also leaves us open to adding more subcommands in the future without cluttering up the top-level `dask` namespace.

```console
$ dask --help             
Usage: dask [OPTIONS] COMMAND [ARGS]...

  Dask command line interface.

Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.

Commands:
  cluster    Manage dask clusters.
  cuda       GPU subcommands.
  docs       Open Dask documentation (https://docs.dask.org/) in a web browser.
  info       Information about your dask installation.
  scheduler  Launch a distributed scheduler.
  ssh        Launch a distributed cluster over SSH.
  worker     Launch a distributed worker attached to an existing SCHEDULER.
```

```console
$ dask cuda --help        
Usage: dask cuda [OPTIONS] COMMAND [ARGS]...

  GPU subcommands.

Options:
  -h, --help  Show this message and exit.

Commands:
  worker  Launch a distributed worker with GPUs attached to an existing SCHEDULER.
```

```console
$ dask cuda worker --help  
Usage: dask cuda worker [OPTIONS] [SCHEDULER] [PRELOAD_ARGV]...

  Launch a distributed worker with GPUs attached to an existing SCHEDULER.

  See https://docs.rapids.ai/api/dask-cuda/stable/quickstart.html#dask-cuda-worker for
  info.

Options:
  --host TEXT                     IP address of serving host; should be visible to the
                                  scheduler and other workers. Can be a string (like
                                  ``"127.0.0.1"``) or ``None`` to fall back on the
                                  address of the interface specified by
                                  ``--interface`` or the default interface.
  --nthreads INTEGER              Number of threads to be used for each Dask worker
                                  process.  [default: 1]
...
```

The CLI PR needs to be merged and released before this can be merged.

Fixes #1038

Authors:
  - Jacob Tomlinson (https://github.com/jacobtomlinson)
  - Ray Douglass (https://github.com/raydouglass)
  - Peter Andreas Entschev (https://github.com/pentschev)
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - https://github.com/jakirkham
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #981
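For reference, entry-point discovery of CLI plugins can be sketched with the standard library; the group name `"dask_cli"` and the example object path are assumptions for illustration, not taken from the PR:

```python
from importlib.metadata import entry_points

GROUP = "dask_cli"  # hypothetical entry-point group for dask CLI plugins

eps = entry_points()
try:
    # Python 3.10+ selectable API
    plugins = list(eps.select(group=GROUP))
except AttributeError:
    # Python 3.8/3.9: entry_points() returned a dict of lists
    plugins = list(eps.get(GROUP, []))

# A package like dask-cuda would register something like
#   cuda = "dask_cuda.cli:cuda"
# under this group; each plugin exposes a name and a target to load.
for ep in plugins:
    print(ep.name, "->", ep.value)
```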
…dask cuda config` (#1088)

As @pentschev brought up in #981 (comment), we shouldn't need the `--get-cluster-configuration` option for `dask cuda config` since it only enables/disables printing the cluster configuration.

Also added a check to ensure that a scheduler address or scheduler file has been specified, as otherwise IIUC running `dask cuda config` would just end up starting up and querying a local cluster on CPU.

EDIT:

Modified the scheduler check for `dask cuda worker` as well, since it seems like a general improvement.

Authors:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #1088
This PR adds a docs_build process to the PR and Build workflows for this repository. The generated docs are synced to S3 only for the build workflows.


cc @ajschmidt8

Authors:
  - Ajay Thorve (https://github.com/AjayThorve)
  - AJ Schmidt (https://github.com/ajschmidt8)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: #1089
madsbk and others added 10 commits January 23, 2023 09:31
cuDF's `partition_by_hash()` is faster than calling `compute_map_index()` followed by `scatter_by_map()`.

Depend on rapidsai/cudf#12554

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #1090
Implements a `--partition-distribution` argument to `local_cudf_shuffle.py`

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #1081
Fix a test that read directly from a `cudf.Buffer` pointer to use the new `get_ptr(mode="read")`, in accordance with changes from rapidsai/cudf#12587.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #1094
Closes #1077

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #1091
poetry version 1.5.0 broke installs of isort prior to 5.11.5 (see PyCQA/isort#2077 and PyCQA/isort#2078), so we need to upgrade.

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #1098
Using `dask.config.get("explicit_comms-batchsize", 1)` doesn't read `DASK_EXPLICIT_COMMS_BATCHSIZE` correctly.

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #1096
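A simplified rendition of the key canonicalization involved (based on our reading of `dask.config`; simplified, not the actual implementation) shows why a mixed-separator key can miss:

```python
def canonical_name(key, config):
    # Simplified: the lookup tries the key as given, then ONE
    # alternative with all "_" swapped for "-" (or vice versa).
    # A key mixing both separators matches neither form.
    if key in config:
        return key
    alt = key.replace("_", "-") if "_" in key else key.replace("-", "_")
    if alt in config:
        return alt
    return key

# DASK_EXPLICIT_COMMS_BATCHSIZE materializes as an underscore key:
config = {"explicit_comms_batchsize": 10}

# Mixed separators: neither the key nor its swap matches, so the
# default value would silently be used.
print(canonical_name("explicit_comms-batchsize", config) in config)  # False
# Consistent separators: the swapped form matches the env-derived key.
print(canonical_name("explicit-comms-batchsize", config) in config)  # True
```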
In order to improve performance, it is now possible to skip the duplication check in `ProxyManager.proxify()`.

We use this in explicit-comms shuffle.

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #1101
This PR updates the branch reference used for our shared workflows.

I will open similar PRs for `branch-23.04` next week.

Authors:
   - AJ Schmidt (https://github.com/ajschmidt8)

Approvers:
   - Ray Douglass (https://github.com/raydouglass)
This PR pins `dask` and `distributed` to `2023.1.1` for `23.02` release.

xref: rapidsai/cudf#12695

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Mark Sadang (https://github.com/msadang)
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #1106
@GPUtester GPUtester requested review from a team as code owners February 6, 2023 17:57
@github-actions github-actions bot added conda conda issue gpuCI gpuCI issue python python code needed labels Feb 6, 2023
@codecov-commenter

codecov-commenter commented Feb 7, 2023

Codecov Report

Base: 0.00% // Head: 63.92% // Increases project coverage by +63.92% 🎉

Coverage data is based on head (80d7296) compared to base (8c87288).
Patch coverage: 59.78% of modified lines in pull request are covered.

❗ Current head 80d7296 differs from pull request most recent head e2db7c9. Consider uploading reports for the commit e2db7c9 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##            main    #1109       +/-   ##
==========================================
+ Coverage   0.00%   63.92%   +63.92%     
==========================================
  Files         26       25        -1     
  Lines       3439     3171      -268     
==========================================
+ Hits           0     2027     +2027     
+ Misses      3439     1144     -2295     
Impacted Files Coverage Δ
dask_cuda/benchmarks/common.py 0.00% <0.00%> (ø)
dask_cuda/benchmarks/local_cudf_groupby.py 0.00% <ø> (ø)
dask_cuda/benchmarks/local_cudf_shuffle.py 0.00% <0.00%> (ø)
dask_cuda/benchmarks/utils.py 0.00% <0.00%> (ø)
dask_cuda/disk_io.py 56.19% <ø> (+56.19%) ⬆️
dask_cuda/initialize.py 87.23% <ø> (+87.23%) ⬆️
dask_cuda/is_spillable_object.py 84.37% <ø> (+84.37%) ⬆️
dask_cuda/proxy_object.py 90.52% <ø> (+90.52%) ⬆️
dask_cuda/cli.py 86.40% <72.50%> (ø)
dask_cuda/utils.py 83.47% <82.35%> (+83.47%) ⬆️
... and 26 more


☔ View full report at Codecov.

@raydouglass raydouglass merged commit 7664dbd into main Feb 9, 2023