
[Core] gpu memory scheduling prototype #41147

Draft
jonathan-anyscale wants to merge 14 commits into master from ray_gpu_memory

Conversation

@jonathan-anyscale (Contributor) commented Nov 15, 2023

A prototype that lets users specify _gpu_memory as an alternative to requesting a fractional GPU (num_gpus) for a remote function. The field _gpu_memory is the GPU memory requested for this task/actor from a single GPU; as in the examples below, the value is passed in bytes (megabytes * 1024 * 1024) and is rounded down to a whole number of megabytes.

import ray

# Assume we have a single GPU with 1000 MiB of total memory.

@ray.remote(_gpu_memory=100 * 1024 * 1024)  # allocate 100 MiB from a single GPU to this task
def task():
    # Contains {'GPU': [assigned_gpu_id]}.
    resources = ray.get_runtime_context().get_resource_ids()
    return resources

@ray.remote(_gpu_memory=1001 * 1024 * 1024)  # can't be scheduled: requested memory > total GPU memory
def expensive_task():
    return ray.get_runtime_context().get_resource_ids()

@ray.remote(num_gpus=1, _gpu_memory=100 * 1024 * 1024)  # invalid: a task can't request both num_gpus and _gpu_memory
def not_allowed_task():
    return ray.get_runtime_context().get_resource_ids()
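
As a quick usage sketch (assuming the cluster actually has a GPU node with enough free memory), the first task above is invoked like any other remote function:

# Schedule the task and inspect the GPU resources it was assigned.
assigned = ray.get(task.remote())
print(assigned)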

Implementation detail: _gpu_memory is converted to num_gpus before scheduling, where num_gpus = _gpu_memory / gpu_total_memory, and we check that 0 <= num_gpus <= 1, i.e. that the request is a valid fractional-GPU request (see the sketch under Implementation Detail below).

Implementation Detail

_gpu_memory is an alternative representation of num_gpus, with num_gpus = _gpu_memory / node_total_gpu_memory, where node_total_gpu_memory is the total memory of a single GPU of the type present on the node.

Thus, during scheduling we convert gpu_memory to the GPU resource based on the GPU type of the candidate node and update the GPU resource value stored in NodeResources. Additionally, since the GPU resource has a precision of $10^{-4}$, we round the converted gpu_memory request up to that precision.
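
For illustration, here is a minimal Python sketch of the conversion and rounding described above. The actual change lives in Ray's C++ scheduler; the helper and constant names below are hypothetical and only mirror the formula in this description:

import math

GPU_PRECISION = 10**4  # the GPU resource has 1e-4 precision

def gpu_memory_to_gpu_resource(gpu_memory: int, node_total_gpu_memory: int) -> float:
    # Fraction of one GPU of this node's GPU type needed to satisfy the request.
    num_gpus = gpu_memory / node_total_gpu_memory
    if not 0 <= num_gpus <= 1:
        raise ValueError("_gpu_memory request does not fit on a single GPU of this node")
    # Round up to the scheduler's 1e-4 granularity so the task never receives
    # less GPU memory than it asked for.
    return math.ceil(num_gpus * GPU_PRECISION) / GPU_PRECISION

# 100 MiB on a 1000 MiB GPU -> 0.1; 100 MiB on a 16 GiB GPU -> 0.0062.
assert gpu_memory_to_gpu_resource(100 * 1024 * 1024, 1000 * 1024 * 1024) == 0.1
assert gpu_memory_to_gpu_resource(100 * 1024 * 1024, 16 * 1024**3) == 0.0062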

Status

The prototype currently works on a multi-node cluster setup, but does not yet work with the autoscaler.

Why are these changes needed?

Related issue number

Closes #37574

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(
