
[Core] gpu memory scheduling prototype #41147

Draft
jonathan-anyscale wants to merge 14 commits into master from ray_gpu_memory

Conversation

@jonathan-anyscale (Contributor) commented Nov 15, 2023

A prototype that lets users specify _gpu_memory as an alternative to requesting a fractional GPU (num_gpus) for a remote function. The field _gpu_memory is the GPU memory requested for this task/actor from a single GPU; as in the examples below, the value is passed in bytes (megabytes * 1024 * 1024) and is rounded down to a whole number of megabytes.

import ray

# Assume we have a single GPU with 1000 MiB of total memory.

@ray.remote(_gpu_memory=100 * 1024 * 1024)  # allocate 100 MiB from a single GPU to this task
def task():
    # Contains {'GPU': [assigned_gpu_id]}.
    resources = ray.get_runtime_context().get_resource_ids()
    return resources

@ray.remote(_gpu_memory=1001 * 1024 * 1024)  # can't be scheduled: requested memory > total GPU memory
def expensive_task():
    return ray.get_runtime_context().get_resource_ids()

@ray.remote(num_gpus=1, _gpu_memory=100 * 1024 * 1024)  # invalid: a task can't request both num_gpus and _gpu_memory
def not_allowed_task():
    return ray.get_runtime_context().get_resource_ids()
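
As a quick usage sketch (assuming the cluster actually has a GPU node with enough free memory), the first task above is invoked like any other remote function:

# Schedule the task and inspect the GPU resources it was assigned.
assigned = ray.get(task.remote())
print(assigned)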

Implementation detail: _gpu_memory is converted to num_gpus before scheduling, where num_gpus = _gpu_memory / gpu_total_memory, and we check that 0 <= num_gpus <= 1, i.e. that the request is a valid fractional-GPU request (see the sketch under Implementation Detail below).

Implementation Detail

_gpu_memory is an alternative representation of num_gpus, with num_gpus = _gpu_memory / node_total_gpu_memory, where node_total_gpu_memory is the total memory of a single GPU of the type present on the node.

Thus, during scheduling we convert gpu_memory to the GPU resource based on the GPU type of the candidate node and update the GPU resource value stored in NodeResources. Additionally, since the GPU resource has a precision of $10^{-4}$, we round the converted gpu_memory request up to that precision.
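
For illustration, here is a minimal Python sketch of the conversion and rounding described above. The actual change lives in Ray's C++ scheduler; the helper and constant names below are hypothetical and only mirror the formula in this description:

import math

GPU_PRECISION = 10**4  # the GPU resource has 1e-4 precision

def gpu_memory_to_gpu_resource(gpu_memory: int, node_total_gpu_memory: int) -> float:
    # Fraction of one GPU of this node's GPU type needed to satisfy the request.
    num_gpus = gpu_memory / node_total_gpu_memory
    if not 0 <= num_gpus <= 1:
        raise ValueError("_gpu_memory request does not fit on a single GPU of this node")
    # Round up to the scheduler's 1e-4 granularity so the task never receives
    # less GPU memory than it asked for.
    return math.ceil(num_gpus * GPU_PRECISION) / GPU_PRECISION

# 100 MiB on a 1000 MiB GPU -> 0.1; 100 MiB on a 16 GiB GPU -> 0.0062.
assert gpu_memory_to_gpu_resource(100 * 1024 * 1024, 1000 * 1024 * 1024) == 0.1
assert gpu_memory_to_gpu_resource(100 * 1024 * 1024, 16 * 1024**3) == 0.0062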

Status

The prototype currently works on a multi-node cluster setup, but does not yet work with the autoscaler.

Why are these changes needed?

Related issue number

Closes #37574

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(
