-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RELEASE] dask-cuda v23.02 #1109
Conversation
[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]
[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]
[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]
[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]
[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]
[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]
[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]
[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]
[gpuCI] Forward-merge branch-22.12 to branch-23.02 [skip gpuci]
Enables copying PRs so that GitHub Actions CI can run.
This PR unpins `dask` and `distributed` to `2022.12.0+` for `23.02` development. xref: rapidsai/cudf#12302 Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Lawrence Mitchell (https://github.com/wence-) - Ray Douglass (https://github.com/raydouglass) URL: #1060
Aligns conda channel priority in the installation guide with changes made for the 22.10.01 hotfix. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - https://github.com/jakirkham URL: #1067
Probe `__cuda_array_interface__` in `test_device_host_file_step_by_step`, to get consistent results from `safe_sizeof()`. Fixes #1070 Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1071
Authors: - Matthew Farrellee (https://github.com/mattf) - AJ Schmidt (https://github.com/ajschmidt8) - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1069
This PR adds GitHub Actions workflows to `dask-cuda`. ### Task list Coverage required for this PR: - [x] Python tests - [x] Codecov - [x] Style checks Future work required: - [Deploy sdist/wheels to PyPI](https://github.com/rapidsai/dask-cuda/blob/d6ff68daae638c30e1e2e25f2fb91ecc1ee8f6ea/ci/cpu/build.sh#L98) Authors: - Bradley Dice (https://github.com/bdice) - AJ Schmidt (https://github.com/ajschmidt8) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) URL: #1062
This PR updates the `dask-cuda` CI workflows to build against the CUDA `11.8` / Python `3.10` [branch](https://github.com/rapidsai/shared-action-workflows/tree/cuda-118) of the `shared-action-workflows` repository. Authors: - AJ Schmidt (https://github.com/ajschmidt8) Approvers: - Bradley Dice (https://github.com/bdice) - Ray Douglass (https://github.com/raydouglass) URL: #1072
A cupy array can't be used in a boolean setting (it is neither truthy nor falsy because at heart it's intuitionist) so we need to explicitly check that the owner is None. Authors: - Lawrence Mitchell (https://github.com/wence-) - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Mads R. B. Kristensen (https://github.com/madsbk) URL: #1061
Adding `--ignore-index` and balance the partition distribution between workers. This should make the runs more consist and improve the data creation significantly. Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) - Lawrence Mitchell (https://github.com/wence-) URL: #1074
For better out of memory message, JIT-unspill now check the current RMM resource stack for resources such as `StatisticsResourceAdaptor` and `TrackingResourceAdaptor` that can report the current allocated bytes. Enable by running `dask-cuda-worker` with `--rmm-track-allocations=True` or calling `dask_cuda.LocalCUDACluster` with `rmm_track_allocations=True`. This is very useful for debugging RMM fragmentation. Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1079
In order to reduce peak memory usage, this PR implements _rounds_ in explicit-comms shuffle. The idea is that each worker handles a number of dataframe partitions in each round instead of doing everything at once. The number of partitions handled in each round can be controlled by setting `DASK_EXPLICIT_COMMS_BATCHSIZE` or directly when calling `shuffle()`. By default, each worker handles one partition per round. Set `DASK_EXPLICIT_COMMS_BATCHSIZE=-1`, to handle all partitions in a single round (the previous behavior). Authors: - Mads R. B. Kristensen (https://github.com/madsbk) - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Richard (Rick) Zamora (https://github.com/rjzamora) - Lawrence Mitchell (https://github.com/wence-) URL: #1068
There were two instances recently (below) where some Python test errors caused the `conda-python-tests` job to run/hang for ~4 hours. - #981 (comment) - #1081 (comment) To prevent this from happening again in the future, I've added a reasonable timeout of ~~45 minutes to that particular job~~ 30 minutes to the `pytest` command. The job usually takes ~25 minutes to complete entirely, so 30 minutes just for `pytest` should be plenty. This timeout will help prevent jobs from hanging and thus help preserve our finite GPU capacity for CI (particularly for `arm` nodes). Authors: - AJ Schmidt (https://github.com/ajschmidt8) Approvers: - Jake Awe (https://github.com/AyodeAwe)
After dask/distributed#7429 was merged, some of those tests started hanging and I could confirm there were two threads concurrently attempting to take the UCX spinlock and the GIL, which led to such deadlock. UCX-Py is currently not thread-safe, and indeed can cause problems like this should two or more threads attempt to call communication routines that will required the UCX spinlock. My theory is that the synchronous cluster will indeed cause communication on the main thread (in this case, the `pytest` thread) upon attempting to shutdown the cluster, instead of only within the Distributed communication thread, likely being the reason behind the test hanging. Asynchronous Distributed clusters seem not to cause any communication from the main thread, but only in the communication thread as expected, thus making the tests asynchronous suffice to resolve such issues. In practice, it's unlikely that people will use sync Distributed clusters from the same process (as pytest does), and thus it's improbable to happen in real use-cases. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Mads R. B. Kristensen (https://github.com/madsbk) URL: #1084
…port` (#1085) Changed this because IIUC `pkg_resources.working_set` is listing the installed distributions and not necessarily the importable modules; this becomes an issue if the distribution and module names aren't the same (e.g. one would `conda install pillow` and then `import PIL`), which was causing some failures in CI that seem unrelated to the changes here. _Originally posted by @charlesbluca in #981 (comment) Authors: - Charles Blackmon-Luca (https://github.com/charlesbluca) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1085
Because in Python 3.10 `asyncio.get_event_loop()` does not create an event loop anymore, using synchronous `LocalCluster` raises `DeprecationWarning`s in `tornado.ioloop.IOLoop`. Ideally we should update all tests to `async`, the changes here are the minimum necessary to unblock Python 3.10. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Benjamin Zaitlen (https://github.com/quasiben) URL: #1086
Previously we had disabled `cucim` testing for Python `3.10` because the tests depended on `3.10` packages of `cudf`, which weren't previously available. Now that `3.10` packages of `cudf` are available, we can enable `3.10` testing for `cucim`. Authors: - AJ Schmidt (https://github.com/ajschmidt8) - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Jordan Jacobelli (https://github.com/Ethyling) - Peter Andreas Entschev (https://github.com/pentschev) URL: #1080
In dask/dask#9283 we are adding a new top level `dask` CLI command which can be extended by other modules using entry points. A primary motivation here is to improve discoverability by uniting everything under one tool and allowing folks to run `dask --help` and `dask <subcommand> --help` to learn more about the various tools. This PR adds a new `click` group called `cuda` and moves the `dask-cuda-worker` command under that group with the name `worker`. This means the `dask-cuda-worker` becomes `dask cuda worker` in the new CLI tool. I haven't made any changes to the existing `dask-cuda-worker` console script so that will still continue to work, but maybe we should add a deprecation warning to it? I went with this name rather than `dask cuda-worker` because I think it is more readable and also leaves us open to adding more subcommands in the future without cluttering up the top-level `dask` namespace. ```console $ dask --help Usage: dask [OPTIONS] COMMAND [ARGS]... Dask command line interface. Options: --version Show the version and exit. -h, --help Show this message and exit. Commands: cluster Manage dask clusters. cuda GPU subcommands. docs Open Dask documentation (https://docs.dask.org/) in a web browser. info Information about your dask installation. scheduler Launch a distributed scheduler. ssh Launch a distributed cluster over SSH. worker Launch a distributed worker attached to an existing SCHEDULER. ``` ```console $ dask cuda --help Usage: dask cuda [OPTIONS] COMMAND [ARGS]... GPU subcommands. Options: -h, --help Show this message and exit. Commands: worker Launch a distributed worker with GPUs attached to an existing SCHEDULER. ``` ```console $ dask cuda worker --help Usage: dask cuda worker [OPTIONS] [SCHEDULER] [PRELOAD_ARGV]... Launch a distributed worker with GPUs attached to an existing SCHEDULER. See https://docs.rapids.ai/api/dask-cuda/stable/quickstart.html#dask-cuda-worker for info. Options: --host TEXT IP address of serving host; should be visible to the scheduler and other workers. Can be a string (like ``"127.0.0.1"``) or ``None`` to fall back on the address of the interface specified by ``--interface`` or the default interface. --nthreads INTEGER Number of threads to be used for each Dask worker process. [default: 1] ... ``` The CLI PR needs to be merged and released before this can be merged. Fixes #1038 Authors: - Jacob Tomlinson (https://github.com/jacobtomlinson) - Ray Douglass (https://github.com/raydouglass) - Peter Andreas Entschev (https://github.com/pentschev) - Charles Blackmon-Luca (https://github.com/charlesbluca) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) - https://github.com/jakirkham - Peter Andreas Entschev (https://github.com/pentschev) URL: #981
…dask cuda config` (#1088) As @pentschev brought up in #981 (comment), we shouldn't need the `--get-cluster-configuration` option for `dask cuda config` since it only enables/disables printing the cluster configuration. Also added a check to ensure that a scheduler address or scheduler file has been specified, as otherwise IIUC running `dask cuda config` would just end up starting up and querying a local cluster on CPU. EDIT: Modified the scheduler check for `dask cuda worker` as well since it seems like a general improvement Authors: - Charles Blackmon-Luca (https://github.com/charlesbluca) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1088
The PR adds a docs_build process to the PR and Build workflows for this repository. The generated docs are synced to s3 for only the build workflows. cc @ajschmidt8 Authors: - Ajay Thorve (https://github.com/AjayThorve) - AJ Schmidt (https://github.com/ajschmidt8) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) URL: #1089
cuDF's `partition_by_hash()` is faster than calling `compute_map_index()` followed by `scatter_by_map()`. Depend on rapidsai/cudf#12554 Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1090
Implements a `--partition-distribution` argument to `local_cudf_shuffle.py` Authors: - Mads R. B. Kristensen (https://github.com/madsbk) - Charles Blackmon-Luca (https://github.com/charlesbluca) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #1081
Authors: - https://github.com/jakirkham Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1092
Fix test that reads directly from `cudf.Buffer` pointer to new `get_ptr(mode="read")`, in accordance with changes from rapidsai/cudf#12587 . Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Lawrence Mitchell (https://github.com/wence-) URL: #1094
close #1077 Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1091
poetry version 1.5.0 broke installs of isort prior to 5.11.5 (see PyCQA/isort#2077 and PyCQA/isort#2078), so we need to upgrade. Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1098
Using `dask.config.get("explicit_comms-batchsize", 1)` doesn't read `DASK_EXPLICIT_COMMS_BATCHSIZE` correctly. Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1096
In order to improve performance, it is now possible to skip the duplication check in `ProxyManager.proxify()`. We use this in explicit-comms shuffle. Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #1101
This PR updates the branch reference used for our shared workflows. I will open similar PRs for `branch-23.04` next week. Authors: - AJ Schmidt (https://github.com/ajschmidt8) Approvers: - Ray Douglass (https://github.com/raydouglass)
This PR pins `dask` and `distributed` to `2023.1.1` for `23.02` release. xref: rapidsai/cudf#12695 Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Mark Sadang (https://github.com/msadang) - Peter Andreas Entschev (https://github.com/pentschev) URL: #1106
Codecov ReportBase: 0.00% // Head: 63.92% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #1109 +/- ##
==========================================
+ Coverage 0.00% 63.92% +63.92%
==========================================
Files 26 25 -1
Lines 3439 3171 -268
==========================================
+ Hits 0 2027 +2027
+ Misses 3439 1144 -2295
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
❄️ Code freeze for
branch-23.02
and v23.02 releaseWhat does this mean?
Only critical/hotfix level issues should be merged into
branch-23.02
until release (merging of this PR).What is the purpose of this PR?
branch-23.02
intomain
for the release