[FEA] Async mode for cudf.series operations #13087
Labels: 0 - Backlog (in queue waiting for assignment); feature request (new feature or request); libcudf (affects libcudf C++/CUDA code); Python (affects Python cuDF API)
Is your feature request related to a problem? Please describe.
We get wide dataframes in situations like machine learning (easily 1-5K columns) and genomics (10K+ columns), and while there is some speedup from cudf (say 2-3X), it'd be easy to get to the 10X+ level with much higher GPU utilization if we could spawn concurrent tasks for each column. Getting this all the way to the dataframe level seems tricky, but async primitives at the column level would get us far.
One Python-native idea is doing this via async/await: while one cudf operation is being scheduled, allocated, and run, we can be scheduling the next, and ideally cudf can run them independently. Async/await settled in 2-3 years ago as a popular native choice in Python + JavaScript, and has since become a lot more popular in pydata, e.g., langchain just rewrote to support async versions of all methods. Ex: https://trends.google.com/trends/explore?date=all&q=async%20await&hl=en . Separately, there's heightened value for pydata dashboarding scenarios like plotly, streamlit, etc., as these ecosystems increasingly build on async IO underneath as well. (Another idea with precedent is a lazy mode similar to Haskell or Dask, discussed below as well.)
Describe the solution you'd like
I'd like to be able to do something like the sketch below:
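A hypothetical sketch only -- cudf has no async API today, and names like sum_async are invented purely to illustrate the request:

```python
# Hypothetical sketch: cudf does not expose async column operations today.
# `sum_async` is an invented name used only to illustrate the request.
import asyncio
import cudf

async def summarize(df: cudf.DataFrame):
    # Launch one task per column; ideally cudf schedules/allocates the
    # next operation while earlier ones are still running on the GPU.
    tasks = [df[col].sum_async() for col in df.columns]
    return await asyncio.gather(*tasks)

# results = asyncio.run(summarize(wide_df))
```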
Describe alternatives you've considered
In theory we can set up threads or multiple dask workers (e.g., a thread-pool version like the sketch below), but (1) both are super awkward, and (2) underneath, cudf will not run the jobs concurrently.
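For concreteness, the thread-based workaround looks roughly like this (real APIs; per the point above, the per-column work still ends up effectively serialized on the GPU underneath):

```python
# Thread-pool workaround: verbose, and the column operations still run
# one after another on the GPU underneath.
from concurrent.futures import ThreadPoolExecutor
import cudf

def summarize_threaded(df: cudf.DataFrame, max_workers: int = 8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {col: pool.submit(df[col].sum) for col in df.columns}
        return {col: fut.result() for col, fut in futures.items()}
```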
Another thought is to create a lazy mode for cudf. This has precedent with Haskell and, in modern pydata land, more so with polars; Dask does this too, and we'd use it if that worked here, but it's awkward. I haven't used polars, but it sounds friendlier in practice -- something like the example below:
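For comparison, a minimal polars lazy-mode example (real polars API; the filename is just illustrative) -- expressions only build a query plan, and nothing executes until collect():

```python
import polars as pl

# Build a lazy query plan; no data is read or computed yet.
lazy = (
    pl.scan_csv("wide_table.csv")                     # illustrative path
      .with_columns((pl.col("x") * 2).alias("x2"))
      .group_by("key")
      .agg(pl.col("x2").sum())
)

result = lazy.collect()  # the whole optimized plan executes here
```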
Underneath, cudf can build this on asyncio, dask, or whatever else makes sense.
Additional context
Slack thread: https://rapids-goai.slack.com/archives/C5E06F4DC/p1680710488795869