
Look for commit affecting latency v4 #46529

Conversation

GeneDer
Contributor

@GeneDer GeneDer commented Jul 9, 2024

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

aslonnie and others added 30 commits July 3, 2024 02:02
already supported on 2.31

Signed-off-by: Lonnie Liu <[email protected]>
files are under `ci/` directory anyways

Signed-off-by: Lonnie Liu <[email protected]>
no more partial platform releases

Signed-off-by: Lonnie Liu <[email protected]>
overly complicated for protecting just one file.

Signed-off-by: Lonnie Liu <[email protected]>
just codeowner and github action changes; no need to run tests

Signed-off-by: Lonnie Liu <[email protected]>
other individual owners have moved on to other projects.

Signed-off-by: Lonnie Liu <[email protected]>
the github user does not exist anymore; the user has renamed and moved
on to other projects.

Signed-off-by: Lonnie Liu <[email protected]>
When an ADAG channel is pickled, it currently does not include the `_writer_registered` flag. However, when the channel is deserialized and passed to another node, the channel may be double-registered, causing runtime failures.

Using the repro script of ray-project#46411 as an example:

The first registration (`ensure_registered_as_writer()`) happens when the driver calls `do_allocate_channel()` on the actor, which sets `_writer_registered` to True.
However, when the driver calls `ray.get()` on the channel, its `_writer_registered` is False because the field is not pickled.
The second registration happens when the driver calls `do_exec_tasks()` (-> `_prep_task()` -> `output_writer.start()` -> `_output_channel.ensure_registered_as_writer()`) on the actor; the task's output channel is passed in from the driver (with `_writer_registered == False`).
`ensure_registered_as_writer()` (if the reader is a remote node) eventually calls `ExperimentalRegisterMutableObjectReaderRemote()` (-> `HandleRegisterMutableObject()`) on the remote node, which inserts an entry into a hash map keyed by `writer_object_id`. If there is already an entry with the same key, the check fails, as shown in the following snippet:
```cpp
bool success =
    remote_writer_object_to_local_reader_.insert({writer_object_id, info}).second;
RAY_CHECK(success);
```
This PR fixes the issue by including these states in pickling. A new test test_pp is added to verify the fix.

This PR also introduces test_multi_node_dag and moves several tests from test_accelerated_dag since it got large.
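
As a rough illustration of the fix (not the actual diff), a channel can carry its registration flag through pickling by returning it from `__reduce__`; everything here except the `_writer_registered` name is hypothetical:

```python
import pickle

class Channel:
    """Hypothetical stand-in for an ADAG channel, sketching only the pickling fix."""

    def __init__(self, writer_registered: bool = False):
        # Previously this flag was lost on pickling, allowing double registration.
        self._writer_registered = writer_registered

    def ensure_registered_as_writer(self):
        if self._writer_registered:
            return  # already registered; skip the remote registration RPC
        # ... one-time registration with the remote node would happen here ...
        self._writer_registered = True

    def __reduce__(self):
        # Include the registration state so a deserialized copy does not try to
        # register the same writer_object_id a second time.
        return Channel, (self._writer_registered,)

chan = Channel()
chan.ensure_registered_as_writer()
copy = pickle.loads(pickle.dumps(chan))
assert copy._writer_registered  # the flag survives the pickle round trip
```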
which contains fixes for centos mirror

Signed-off-by: Lonnie Liu <[email protected]>
…0` and doc change (ray-project#46399)

## Why are these changes needed?

Re: ray-project#45943 changed the default `max_ongoing_requests` down to 5, which lowered the overall RPS as well. This PR does two things:
1. Adds a set of performance tests for `max_ongoing_requests = 100`.
2. Adds a related doc change that explicitly calls out setting a higher `max_ongoing_requests` value for lightweight requests (see the sketch below).
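
A minimal sketch of the setting the doc change calls out, assuming a Ray Serve version where `max_ongoing_requests` is a deployment option; the deployment itself is hypothetical:

```python
from ray import serve

# Hypothetical lightweight deployment. max_ongoing_requests=100 raises the
# per-replica in-flight request limit above the new default of 5.
@serve.deployment(max_ongoing_requests=100)
class Echo:
    async def __call__(self, request) -> str:
        return "ok"

app = Echo.bind()
# serve.run(app)  # deploy on a running Ray cluster
```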

## Related issue number

Closes metrics dropping at
https://b534fd88.us1a.app.preset.io/explore/?form_data_key=YWenQf8xEN9DHx7wmFVo3cRXYD-xoqObnK7HEC8J2Ho8FMnPCYM-d_as4EO7CG4b&dashboard_page_id=DCipe40Dxo&slice_id=1380

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Gene Su <[email protected]>
…roject#45459)

To leverage the potential of the Intel Gaudi accelerator, we extend Ray Train's capabilities by adding support for Intel Gaudi (HPU) hardware. This PR includes an example for pre-training Llama-7b on multiple HPUs.
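
A rough sketch of what requesting Gaudi devices for a Ray Train job can look like; the worker count and the "HPU" resource key are assumptions here, not taken from the merged example:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Gaudi-specific model/optimizer setup (e.g. via habana_frameworks.torch)
    # would go here; omitted in this sketch.
    pass

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(
        num_workers=8,                    # assumed worker count
        use_gpu=False,
        resources_per_worker={"HPU": 1},  # assumes HPUs are exposed as an "HPU" resource
    ),
)
# trainer.fit()  # run on a cluster with Gaudi nodes
```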

---------

Signed-off-by: Wu, Gangsheng <[email protected]>
Co-authored-by: Samuel Chan <[email protected]>
so that wheel verification does not need an extra run with manual env
var setups

Signed-off-by: Lonnie Liu <[email protected]>
rather than using `==` or `!=`

Signed-off-by: Lonnie Liu <[email protected]>
…-project#46471)

- use caps for vars
- add `-fsL` for `curl`
- double quote strings
- use `==` instead of `=`

Signed-off-by: Lonnie Liu <[email protected]>
After ray-project#42313 was merged, the contrast of the search button in dark mode is too low. After a discussion in ray-project#43566, we are reverting to the default pydata-sphinx-theme style for the search button.

Signed-off-by: pdmurray <[email protected]>
This appears to be an incorrect comment in the original document.

Signed-off-by: zjregee <[email protected]>
mostly sorting imports

Signed-off-by: Lonnie Liu <[email protected]>
…rArray (ray-project#45352)

## Why are these changes needed?
Currently, the ArrowTensorArray and ArrowVariableShapedTensorArray types
use int32 to encode the list offsets. This means that within a given
PyArrow chunk in a column, the sum of the sizes of all of the tensors in
that chunk must be less than 2^31; otherwise, depending on the overflow
conditions, an error is thrown or the data is truncated. This usually
doesn't manifest itself in Ray Data with the default settings, because
it splits the blocks up to meet the target max block size (though this
can be turned off!). However, it unavoidably shows up when one needs a
large local shuffle buffer to feed into Ray Train.

This PR changes the offsets to be stored in 64-bit integers and updates
the corresponding storage types of the TensorArrays.

As an example:

```python
import ray.train
import ray.train.torch
import ray.data
import numpy as np

def f(batch):
    # Each output block holds 1000 tensors of shape (1000, 550) in float16,
    # large enough to overflow int32 list offsets once buffered together.
    block1 = {"x": [np.ones((1000, 550), dtype=np.float16)] * 1000}
    return block1

dataset = ray.data.from_items([1, 2, 3, 4, 5]).map_batches(f, batch_size=None)

def train():
    data = ray.train.get_dataset_shard("train")
    # A large local shuffle buffer concatenates many tensors into one chunk.
    for batch in data.iter_torch_batches(batch_size=100, local_shuffle_buffer_size=4000):
        pass

trainer = ray.train.torch.TorchTrainer(train_loop_per_worker=train, datasets={"train": dataset})
trainer.fit()
```

fails with
```
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
```
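
The underlying distinction is between Arrow's 32-bit and 64-bit offset list types. A minimal illustration of that difference (not the PR's actual storage code):

```python
import numpy as np
import pyarrow as pa

values = pa.array(np.zeros(10, dtype=np.float32))
# List arrays use int32 offsets, capping total elements per chunk near 2**31.
small = pa.ListArray.from_arrays(pa.array([0, 5, 10], type=pa.int32()), values)
# Large list arrays use int64 offsets, which effectively removes that cap.
large = pa.LargeListArray.from_arrays(pa.array([0, 5, 10], type=pa.int64()), values)
print(small.offsets.type)  # int32
print(large.offsets.type)  # int64
```
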
## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Peter Wang <[email protected]>
Co-authored-by: Peter Wang <[email protected]>
Co-authored-by: Scott Lee <[email protected]>
simonsays1980 and others added 10 commits July 9, 2024 08:57
too flaky, and taking too long.

Signed-off-by: Lonnie Liu <[email protected]>
…f absent. (ray-project#46358)

Previously, `cluster_id` was an argument to (Python)GcsClient::Connect. This argument never changes throughout the lifetime of a GcsClient and should live in GcsClientOptions instead.

Renamed the Python binding `GcsClientOptions.from_gcs_address` to `GcsClientOptions.create`.

Also, today's C++ GcsClient never requests a cluster_id, so if it's absent, it never populates it in regular GCS requests. This PR adds an RPC at Connect() time to fetch the cluster_id if absent. To support that, it also adds a `timeout` argument in case GCS is down.

One caveat: if the GcsClient lives inside GCS itself, that RPC would block the GCS main thread and deadlock, so within GCS the cluster_id *must* be populated beforehand.
…ay-project#46149)

Hopefully this helps give a quick overview of using KubeRay to deploy clusters to Kubernetes.

Signed-off-by: omrishiv <[email protected]>
CUDA checks the compatibility between the host driver and the Docker image's CUDA version. We previously had this check because the host driver was older than the image's CUDA version, and CUDA is unhappy about that. Remove the check now that we have upgraded the CUDA version on the host.

Test:
- CI

Signed-off-by: can <[email protected]>
)

See
https://discuss.ray.io/t/error-in-ray-job-submit-on-local-machine-if-multiple-clusters-are-running-at-the-same-time/14723/7?u=davidxia

## Why are these changes needed?

Will prevent user confusion

## Related issue number


## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: David Xia <[email protected]>
Co-authored-by: Samuel Chan <[email protected]>
…#46363)

Originally, the number of blocks output by `from_pandas` equaled the number of input DataFrames (i.e., each input DataFrame became a block). For consistency with how we treat other inputs, ray-project#44937 changed the behavior so that each output block matches the target block size. This meant that you could pass in many DataFrames as input, but `from_pandas` might output only one block.

The change is problematic because many users do something like `from_pandas(np.array_split(metadata, num_blocks))` to get better performance, and after ray-project#44937, the `array_split` is pointless. So, this PR reverts the change (see the sketch below).
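
For reference, the pattern that the revert keeps useful looks roughly like this; the DataFrame contents and `num_blocks` value are made up for illustration:

```python
import numpy as np
import pandas as pd
import ray.data

# Pre-split the metadata so each piece can become its own block again.
metadata = pd.DataFrame({"id": range(1_000_000)})
num_blocks = 8
ds = ray.data.from_pandas(np.array_split(metadata, num_blocks))
print(ds.materialize().num_blocks())  # expected to match num_blocks after the revert
```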

Signed-off-by: Balaji Veeramani <[email protected]>
@GeneDer
Contributor Author

GeneDer commented Jul 10, 2024

finished testing

@GeneDer GeneDer closed this Jul 10, 2024
@GeneDer GeneDer deleted the look-for-commit-affecting-latency-v4 branch July 10, 2024 22:54