Look for commit affecting latency v4 #46529
Closed
GeneDer wants to merge 40 commits into ray-project:releases/2.32.0 from GeneDer:look-for-commit-affecting-latency-v4
Conversation
already supported on 2.31 Signed-off-by: Lonnie Liu <[email protected]>
files are under `ci/` directory anyways Signed-off-by: Lonnie Liu <[email protected]>
no more partial platform releases Signed-off-by: Lonnie Liu <[email protected]>
overly complicated for protecting just one file.. Signed-off-by: Lonnie Liu <[email protected]>
…-project#46340) Signed-off-by: Ruiyang Wang <[email protected]>
just codeowner and github action changes; no need to run tests Signed-off-by: Lonnie Liu <[email protected]>
other individual owners have moved on to other projects. Signed-off-by: Lonnie Liu <[email protected]>
the github user does not exist anymore; the user has renamed and moved on to other projects. Signed-off-by: Lonnie Liu <[email protected]>
When an ADAG channel is pickled, it currently does not include the `_writer_registered` flag. However, when the channel is deserialized and passed to another node, the channel may be double registered, causing runtime failures. Using the repro script of ray-project#46411 as an example:

- The first registration (`ensure_registered_as_writer()`) happens when the driver calls `do_allocate_channel()` on the actor, and `_writer_registered` is set to True.
- However, when the driver calls `ray.get()` on the channel, its `_writer_registered` is False because the field is not pickled.
- The second registration happens when the driver calls `do_exec_tasks()` (-> `_prep_task()` -> `output_writer.start()` -> `_output_channel.ensure_registered_as_writer()`) on the actor; the task's output channel is passed in from the driver (with `_writer_registered == False`). Since `ensure_registered_as_writer()` (if the reader is a remote node) eventually calls `ExperimentalRegisterMutableObjectReaderRemote()` (-> `HandleRegisterMutableObject()`) on the remote node, it inserts an entry into a hash map keyed by `writer_object_id`. If there is already an entry with the same key, the check fails: `bool success = remote_writer_object_to_local_reader_.insert({writer_object_id, info}).second; RAY_CHECK(success);`

This PR fixes the issue by including these states in pickling. A new test `test_pp` is added to verify the fix. This PR also introduces `test_multi_node_dag` and moves several tests into it from `test_accelerated_dag`, which had grown large.
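As a side note, here is a minimal, self-contained sketch of the general pattern described above: carrying a registration flag through pickling so a deserialized copy does not re-register. The `FakeChannel` class and its in-process `registry` are hypothetical stand-ins for illustration, not the actual ADAG channel or GCS API.

```
import pickle


class FakeChannel:
    """Hypothetical stand-in for an ADAG-style channel (not the real Ray API)."""

    def __init__(self, writer_object_id):
        self.writer_object_id = writer_object_id
        self._writer_registered = False

    def ensure_registered_as_writer(self, registry):
        # If the flag survived serialization, skip re-registering.
        if self._writer_registered:
            return
        # Analogous to the RAY_CHECK on remote_writer_object_to_local_reader_:
        # inserting the same key twice is a hard failure.
        if self.writer_object_id in registry:
            raise RuntimeError("double registration for " + self.writer_object_id)
        registry[self.writer_object_id] = "registered"
        self._writer_registered = True

    def __getstate__(self):
        # Keep _writer_registered in the pickled state; dropping it here
        # reproduces the double-registration bug described above.
        return self.__dict__.copy()

    def __setstate__(self, state):
        self.__dict__.update(state)


registry = {}
channel = FakeChannel("writer-object-1")
channel.ensure_registered_as_writer(registry)   # first registration
restored = pickle.loads(pickle.dumps(channel))  # flag is preserved
restored.ensure_registered_as_writer(registry)  # no second registration
```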
which contains fixes for centos mirror Signed-off-by: Lonnie Liu <[email protected]>
…s` to `synchronous_parallel_sample`. (ray-project#46321)
…0` and doc change (ray-project#46399)

Why are these changes needed? ray-project#45943 changed the default `max_ongoing_requests` down to 5, which lowered the overall RPS as well. This PR does two things:
1. Adds a set of performance tests for `max_ongoing_requests=100`.
2. Adds related doc changes to explicitly call out setting a higher `max_ongoing_requests` for lightweight requests.

Related issue: closes the metrics-dropping issue at https://b534fd88.us1a.app.preset.io/explore/?form_data_key=YWenQf8xEN9DHx7wmFVo3cRXYD-xoqObnK7HEC8J2Ho8FMnPCYM-d_as4EO7CG4b&dashboard_page_id=DCipe40Dxo&slice_id=1380

Signed-off-by: Gene Su <[email protected]>
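As a hedged illustration of the documentation advice above (the `max_ongoing_requests` option name comes from the PR text; the deployment itself is made up for the example):

```
# Sketch: opting a lightweight Ray Serve deployment into a higher
# max_ongoing_requests, since the default was lowered to 5 by ray-project#45943.
from ray import serve


@serve.deployment(max_ongoing_requests=100)
class Echo:
    async def __call__(self, request):
        # Cheap handler: a larger max_ongoing_requests keeps throughput up.
        return "ok"


app = Echo.bind()
# serve.run(app)  # uncomment to deploy on a running Ray cluster
```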
…roject#45459) To leverage the potential of the Intel Gaudi accelerator, we extend Ray Train's capabilities by adding support for Intel Gaudi (HPU) hardware. This PR includes an example for pre-training Llama-7b on multiple HPUs. Signed-off-by: Wu, Gangsheng <[email protected]> Co-authored-by: Samuel Chan <[email protected]>
so that wheel verification does not need an extra run with manual env var setups Signed-off-by: Lonnie Liu <[email protected]>
rather than using `==` or `!=` Signed-off-by: Lonnie Liu <[email protected]>
just use `dict()` Signed-off-by: Lonnie Liu <[email protected]>
…loss function (new API stack). (ray-project#46445)
uses 2.31.0 with py39 Signed-off-by: Lonnie Liu <[email protected]>
…-project#46471)
- use caps for vars
- add `-fsL` for `curl`
- double quote strings
- use `==` instead of `=`

Signed-off-by: Lonnie Liu <[email protected]>
After ray-project#42313 was merged, the contrast of the search button in dark mode is too low. After a discussion in ray-project#43566, we are reverting to the default pydata-sphinx-theme style for the search button. Signed-off-by: pdmurray <[email protected]>
This appears to be an incorrect comment in the original document. Signed-off-by: zjregee <[email protected]>
Signed-off-by: hongchaodeng <[email protected]>
ray-project#46460) Signed-off-by: Jiajun Yao <[email protected]>
sorting mostly imports Signed-off-by: Lonnie Liu <[email protected]>
…rArray (ray-project#45352)

Why are these changes needed? Currently, the ArrowTensorArray and ArrowVariableShapedTensorArray types use int32 to encode the list offsets. This means that within a given PyArrow chunk in a column, the sum of the sizes of all of the tensors in that chunk must be less than 2^31; otherwise, depending on the overflow conditions, an error is thrown or the data is truncated. This usually doesn't manifest itself in Ray Data with the default settings, because it splits the blocks up to meet the target max block size (though this can be turned off!). However, it unavoidably shows up when one needs a large local shuffle buffer to feed into Ray Train. This PR changes the offsets to be stored in 64-bit integers and updates the corresponding storage types of the TensorArrays.

As an example:

```
import ray.train
import ray.train.torch
import ray.data
import numpy as np

def f(batch):
    block1 = {"x": [np.ones((1000, 550), dtype=np.float16)] * 1000}
    return block1

dataset = ray.data.from_items([1, 2, 3, 4, 5]).map_batches(f, batch_size=None)

def train():
    data = ray.train.get_dataset_shard("train")
    for batch in data.iter_torch_batches(batch_size=100, local_shuffle_buffer_size=4000):
        pass

trainer = ray.train.torch.TorchTrainer(train_loop_per_worker=train, datasets={"train": dataset})
trainer.fit()
```

fails with

```
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
```

Signed-off-by: Peter Wang <[email protected]>
Co-authored-by: Peter Wang <[email protected]>
Co-authored-by: Scott Lee <[email protected]>
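For context on why widening the offsets helps, here is a small standalone PyArrow sketch (illustration only, not code from this PR): `pa.list_` columns carry int32 offsets, so the total child length per chunk is capped near 2^31, while `pa.large_list` carries int64 offsets.

```
# Sketch: 32-bit vs 64-bit list offsets in PyArrow (illustration only).
import pyarrow as pa

values = [[1.0] * 4, [2.0] * 2]

small = pa.array(values, type=pa.list_(pa.float64()))       # int32 offsets
large = pa.array(values, type=pa.large_list(pa.float64()))  # int64 offsets

print(small.offsets.type)  # int32: sums of tensor sizes per chunk are capped
print(large.offsets.type)  # int64: effectively removes the 2**31 cap
```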
too flaky, and taking too long. Signed-off-by: Lonnie Liu <[email protected]>
…f absent. (ray-project#46358) Previously `cluster_id` was an argument to (Python)GcsClient::Connect. This argument never changes throughout the lifetime of a GcsClient, so it should live in GcsClientOptions. Renamed the Python binding `GcsClientOptions.from_gcs_address` to `GcsClientOptions.create`. Also, today's C++ GcsClient never requests a cluster_id, so if it is absent it never populates it in regular GCS requests. This PR adds an RPC at Connect() time to fetch the cluster_id if absent; to support that, it also adds a `timeout` argument in case GCS is down. One caveat: if the GcsClient lives inside GCS itself, that RPC would block the GCS main thread and deadlock, so within GCS the cluster_id *must* be populated beforehand.
…ay-project#46149) Hopefully this helps give a quick overview of using KubeRay for deploying clusters to Kubernetes Signed-off-by: omrishiv <[email protected]>
…rowTensorArray" (ray-project#46511) Reverts ray-project#45352 Breaking ray-project#46499, ray-project#46496 and ray-project#46495
CUDA checks the compatibility between the host driver and the docker image's CUDA version. Previously we had this check because the driver was older than the CUDA version and CUDA would be unhappy about it. Remove the check now that we have upgraded the CUDA version on the host. Test: - CI Signed-off-by: can <[email protected]>
) See https://discuss.ray.io/t/error-in-ray-job-submit-on-local-machine-if-multiple-clusters-are-running-at-the-same-time/14723/7?u=davidxia

Why are these changes needed? This will prevent user confusion.

Signed-off-by: David Xia <[email protected]>
Co-authored-by: Samuel Chan <[email protected]>
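Tangentially, a hedged sketch of the workaround implied by the linked thread: pointing the Job SDK at an explicit dashboard address removes the ambiguity when several clusters are reachable (the address and entrypoint here are placeholders):

```
# Sketch: submitting to an explicitly chosen cluster instead of relying on
# auto-detection when multiple local clusters are running.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")  # dashboard of the intended cluster
job_id = client.submit_job(entrypoint="python -c \"print('hello')\"")
print(job_id)
```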
…ay-project#46494) Signed-off-by: Jiajun Yao <[email protected]>
…#46363) Originally, the number of blocks outputted by from_pandas equaled the number of input DataFrames (i.e., each input DataFrame became a block). For consistency with how we treat other inputs, ray-project#44937 changed the behavior so that each output block is the target block size. This meant that you could pass in many DataFrames as input but from_pandas would only output one block. The change is problematic because many users do something like from_pandas(np.array_split(metadata, num_blocks)) to get better performance, and after ray-project#44937, the array_split is pointless. So, this PR reverts the change Signed-off-by: Balaji Veeramani <[email protected]>
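For reference, a minimal sketch of the `np.array_split` pattern mentioned above (the DataFrame contents and block count are arbitrary placeholders):

```
# Sketch: pre-splitting a DataFrame so from_pandas yields multiple blocks,
# the behavior this PR restores (one input DataFrame -> one block).
import numpy as np
import pandas as pd
import ray.data

metadata = pd.DataFrame({"path": [f"file_{i}.parquet" for i in range(1000)]})
num_blocks = 8

ds = ray.data.from_pandas(np.array_split(metadata, num_blocks))
print(ds.num_blocks())  # expected to report 8 blocks after this change
```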
Signed-off-by: hongchaodeng <[email protected]>
finished testing