Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spilling to host memory #12106

Merged
merged 36 commits into from
Nov 16, 2022
Merged

Spilling to host memory #12106

merged 36 commits into from
Nov 16, 2022

Conversation

madsbk
Copy link
Member

@madsbk madsbk commented Nov 9, 2022

This PR implementing spilling of device to host memory, which is based on #11553.

Spilling can be enabled in two ways (it is disabled by default):

  • setting the environment variable CUDF_SPILL=on, or
  • setting the spill option in cudf by doing cudf.set_option("spill", True).

Additionally, parameters are:

  • CUDF_SPILL_ON_DEMAND=ON / cudf.set_option("spill_on_demand", True), which registers an RMM out-of-memory error handler that spills buffers in order to free up memory.
  • CUDF_SPILL_DEVICE_LIMIT=... / cudf.set_option("spill_device_limit", ...), which sets a device memory limit in bytes.

I have limited the scope of this PR. In a follow-up PR, I will port the statistics, logging, and partial unspill from #11553.

Design

Spilling consists of two components:

  • A new buffer sub-class, SpillableBuffer, that implements moving of its data from host to device memory in-place.

  • A spill manager that tracks all instances of SpillableBuffer and spills them on demand.

A global spill manager is used throughout cudf when spilling is enabled, which makes as_buffer() return SpillableBuffer instead of the default Buffer instances.

Challenges

Accessing Buffer.ptr, we get the device memory pointer of the buffer. This is unproblematic in the case of Buffer but what happens when accessing SpillableBuffer.ptr, which might have spilled its device memory? In this case, SpillableBuffer needs to unspill the memory before returning its device memory pointer. Furthermore, while this device memory pointer is being used (or could be used), SpillableBuffer cannot spill its memory back to host memory because doing so would invalidate the device pointer.

To address this, we mark the SpillableBuffer as unspillable, we say that the buffer has been exposed. This can be either permanent if the device pointer is exposed to external projects or temporary while libcudf accesses the device memory.

The SpillableBuffer.get_ptr() returns the device pointer of the buffer memory just like .ptr but if given an instance of SpillLock, the buffer is only unspillable as long as the instance of SpillLock is alive.

For convenience, one can use the decorator/context with_spill_lock to associate a SpillLock with a lifetime bound to the context automatically.

Overhead

When spilling is disabled, the overhead of this PR comes from the decorator with_spill_lock. However, this is small https://gist.github.com/madsbk/da6520e7583cf5d728a1b5a1b09200f3:

Micro benchmark on my local workstation:
  spilling off:
    raw:                    0.06371338899771217 us
    with-spill-lock:        1.0796624180002254 us
  spilling on:
    raw:                    0.05873749500096892 us
    with-spill-lock:        1.2184517139976379 us

Checklist

  • New or existing tests cover these changes.

  • Avoid changes to libcudf

  • The documentation is up to date with these changes.

@madsbk madsbk added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 9, 2022
@github-actions github-actions bot added Python Affects Python cuDF API. gpuCI labels Nov 9, 2022
@codecov
Copy link

codecov bot commented Nov 9, 2022

Codecov Report

Base: 86.61% // Head: 88.24% // Increases project coverage by +1.63% 🎉

Coverage data is based on head (bef71cd) compared to base (c574ddf).
Patch coverage: 89.54% of modified lines in pull request are covered.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #12106      +/-   ##
================================================
+ Coverage         86.61%   88.24%   +1.63%     
================================================
  Files               135      137       +2     
  Lines             22934    22501     -433     
================================================
- Hits              19864    19857       -7     
+ Misses             3070     2644     -426     
Impacted Files Coverage Δ
python/cudf/cudf/core/df_protocol.py 88.17% <ø> (ø)
python/cudf/cudf/core/buffer/spill_manager.py 80.00% <80.00%> (ø)
python/cudf/cudf/options.py 87.69% <88.46%> (+0.51%) ⬆️
python/cudf/cudf/core/buffer/spillable_buffer.py 92.85% <92.85%> (ø)
python/cudf/cudf/core/buffer/utils.py 96.00% <97.22%> (+2.66%) ⬆️
python/cudf/cudf/core/abc.py 94.44% <100.00%> (+0.15%) ⬆️
python/cudf/cudf/core/buffer/__init__.py 100.00% <100.00%> (ø)
python/cudf/cudf/core/column/column.py 87.96% <100.00%> (+12.32%) ⬆️
python/cudf/cudf/testing/_utils.py 94.11% <100.00%> (+0.06%) ⬆️
python/cudf/cudf/core/dataframe.py 93.64% <0.00%> (+0.04%) ⬆️
... and 12 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

ci/gpu/build.sh Outdated Show resolved Hide resolved
python/cudf/cudf/core/buffer/utils.py Show resolved Hide resolved
python/cudf/cudf/options.py Show resolved Hide resolved
python/cudf/cudf/options.py Show resolved Hide resolved
python/cudf/cudf/core/buffer/spill_manager.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/buffer/spill_manager.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/buffer/spill_manager.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/buffer/spill_manager.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/buffer/spill_manager.py Show resolved Hide resolved
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty happy with the code at this point. The main thing that I haven't reviewed much yet is the tests, which I'll look at next week. I'd like to hear some other reviewer opinions at this point, so let's mark this PR ready for review (or wait until @galipremsagar has a chance to look through, either is fine with me) and get some more feedback.

docs/cudf/source/developer_guide/library_design.md Outdated Show resolved Hide resolved
madsbk and others added 3 commits November 14, 2022 08:53
@madsbk madsbk requested review from a team as code owners November 14, 2022 18:54
@madsbk madsbk requested review from vyasr and isVoid November 14, 2022 18:54
Copy link
Contributor

@galipremsagar galipremsagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome work @madsbk! 🔥

@madsbk
Copy link
Member Author

madsbk commented Nov 15, 2022

This is awesome work @madsbk! fire

Thanks for the @galipremsagar, I have added a spilling pytest mark (as @vyasr suggested) and also added @acquire_spill_lock to most of the functions that calls libcudf.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me! I have a couple of requests for improving the code in the testing module, but I don't think they're blockers and can be addressed in a follow-up. I'm going to go ahead and merge as is. Thanks for the great work here Mads!

Comment on lines +48 to +58
def gen_df(target="gpu") -> cudf.DataFrame:
ret = cudf.DataFrame({"a": [1, 2, 3]})
if target != "gpu":
gen_df.buffer(ret).spill(target=target)
return ret


gen_df.buffer = lambda df: df._data._data["a"].data
gen_df.is_spilled = lambda df: gen_df.buffer(df).is_spilled
gen_df.is_spillable = lambda df: gen_df.buffer(df).spillable
gen_df.buffer_size = gen_df.buffer(gen_df()).size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not assign attributes to a function this way. I would rather either define these as standalone helper functions or define a helper class as a subclass of cudf.DataFrame that implements these as properties.

def test_from_pandas(manager: SpillManager):
pdf1 = pandas.DataFrame({"x": [1, 2, 3]})
df = cudf.from_pandas(pdf1)
assert df._data._data["x"].data.spillable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that you need to write this out again anyway definitely suggests we want to expose some of those lambdas attached to get_df as functions.

@vyasr
Copy link
Contributor

vyasr commented Nov 16, 2022

@gpucibot merge

Copy link
Member

@ajschmidt8 ajschmidt8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving ops-codeowner file changes

@rapids-bot rapids-bot bot merged commit 95a348b into rapidsai:branch-22.12 Nov 16, 2022
@madsbk madsbk deleted the cudf-spill branch November 18, 2022 09:43
rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this pull request Nov 22, 2022
Support of the [new built-in spilling in cuDF](rapidsai/cudf#12106) so that `device_memory_limit` and `memory_limit` ignores cuDF's device buffers.

This is only implemented for `DeviceHostFile`. Since jit-unspill also targets cuDF and libraries such as cupy isn't supported, I don't think it is important to support cuDF's built-in spilling in `ProxifyHostFile`.

For now, `DeviceHostFile` simply ignores cuDF's device buffers and let cuDF handle the spilling. This means that `DeviceHostFile` might estimate the device and host memory usage incorrectly ([or more incorrectly than usually](dask/distributed#4568 (comment))).

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #984
rapids-bot bot pushed a commit that referenced this pull request Nov 23, 2022
Based on @vyasr's [review](#12106 (review)), I am cleaning up the spilling tests. Hopefully making them more readable. 

Notice, I am targeting `branch-23.02` because of code freeze but maybe we should target `branch-22.12` since it implements spilling?

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #12220
rapids-bot bot pushed a commit that referenced this pull request Nov 23, 2022
Now that #12106 is in, wrapping calls to `libcudf` is an ongoing effort.

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #12232
rapids-bot bot pushed a commit that referenced this pull request Jan 18, 2023
)

Now that #12106 is in, wrapping calls to `libcudf` and setting  `as_buffer(..., exposed=False)` is an ongoing effort.

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #12535
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants