
Look for commit affecting latency v4 #46529

Conversation

GeneDer
Contributor

@GeneDer GeneDer commented Jul 9, 2024

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

aslonnie and others added 30 commits July 3, 2024 02:02
already supported on 2.31

Signed-off-by: Lonnie Liu <[email protected]>
files are under `ci/` directory anyways

Signed-off-by: Lonnie Liu <[email protected]>
no more partial platform releases

Signed-off-by: Lonnie Liu <[email protected]>
overly complicated for protecting just one file.

Signed-off-by: Lonnie Liu <[email protected]>
just codeowner and github action changes; no need to run tests

Signed-off-by: Lonnie Liu <[email protected]>
other individual owners have moved on to other projects.

Signed-off-by: Lonnie Liu <[email protected]>
the github user does not exist anymore; the user has renamed and moved
on to other projects.

Signed-off-by: Lonnie Liu <[email protected]>
When an ADAG channel is pickled, it currently does not include the `_writer_registered` flag. However, when the channel is deserialized and passed to another node, the channel may be double-registered, causing runtime failures.

Using the repro script of ray-project#46411 as an example:

The first registration (`ensure_registered_as_writer()`) happens when the driver calls `do_allocate_channel()` on the actor, which sets `_writer_registered` to True.
However, when the driver calls `ray.get()` on the channel, its `_writer_registered` is False because the field is not pickled.
The second registration happens when the driver calls `do_exec_tasks()` (-> `_prep_task()` -> `output_writer.start()` -> `_output_channel.ensure_registered_as_writer()`) on the actor; the task's output channel is passed in from the driver (with `_writer_registered == False`).
`ensure_registered_as_writer()` (if the reader is a remote node) eventually calls `ExperimentalRegisterMutableObjectReaderRemote()` (-> `HandleRegisterMutableObject()`) on the remote node, which inserts an entry into a hash map keyed by `writer_object_id`. If there is already an entry with the same key, the check fails, as shown in the following snippet:
```cpp
bool success =
    remote_writer_object_to_local_reader_.insert({writer_object_id, info}).second;
RAY_CHECK(success);
```
This PR fixes the issue by including these states in pickling. A new test test_pp is added to verify the fix.

This PR also introduces test_multi_node_dag and moves several tests from test_accelerated_dag since it got large.
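
As a rough illustration of the fix (not the actual diff), a channel can carry its registration flag through pickling by returning it from `__reduce__`; everything here except the `_writer_registered` name is hypothetical:

```python
import pickle

class Channel:
    """Hypothetical stand-in for an ADAG channel, sketching only the pickling fix."""

    def __init__(self, writer_registered: bool = False):
        # Previously this flag was lost on pickling, allowing double registration.
        self._writer_registered = writer_registered

    def ensure_registered_as_writer(self):
        if self._writer_registered:
            return  # already registered; skip the remote registration RPC
        # ... one-time registration with the remote node would happen here ...
        self._writer_registered = True

    def __reduce__(self):
        # Include the registration state so a deserialized copy does not try to
        # register the same writer_object_id a second time.
        return Channel, (self._writer_registered,)

chan = Channel()
chan.ensure_registered_as_writer()
copy = pickle.loads(pickle.dumps(chan))
assert copy._writer_registered  # the flag survives the pickle round trip
```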
which contains fixes for centos mirror

Signed-off-by: Lonnie Liu <[email protected]>
…0` and doc change (ray-project#46399)

## Why are these changes needed?

Re: ray-project#45943 changed the default `max_ongoing_requests` down to 5, which lowered the overall RPS as well. This PR does two things:
1. Adds a set of performance tests for `max_ongoing_requests = 100`.
2. Adds a related doc change that explicitly calls out setting a higher `max_ongoing_requests` value for lightweight requests (see the sketch below).
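
A minimal sketch of the setting the doc change calls out, assuming a Ray Serve version where `max_ongoing_requests` is a deployment option; the deployment itself is hypothetical:

```python
from ray import serve

# Hypothetical lightweight deployment. max_ongoing_requests=100 raises the
# per-replica in-flight request limit above the new default of 5.
@serve.deployment(max_ongoing_requests=100)
class Echo:
    async def __call__(self, request) -> str:
        return "ok"

app = Echo.bind()
# serve.run(app)  # deploy on a running Ray cluster
```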

## Related issue number

Closes metrics dropping at
https://b534fd88.us1a.app.preset.io/explore/?form_data_key=YWenQf8xEN9DHx7wmFVo3cRXYD-xoqObnK7HEC8J2Ho8FMnPCYM-d_as4EO7CG4b&dashboard_page_id=DCipe40Dxo&slice_id=1380

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Gene Su <[email protected]>
…roject#45459)

To leverage the potential of the Intel Gaudi accelerator, we extend Ray Train's capabilities by adding support for Intel Gaudi (HPU) hardware. This PR includes an example for pre-training Llama-7b on multiple HPUs.
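
A rough sketch of what requesting Gaudi devices for a Ray Train job can look like; the worker count and the "HPU" resource key are assumptions here, not taken from the merged example:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Gaudi-specific model/optimizer setup (e.g. via habana_frameworks.torch)
    # would go here; omitted in this sketch.
    pass

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(
        num_workers=8,                    # assumed worker count
        use_gpu=False,
        resources_per_worker={"HPU": 1},  # assumes HPUs are exposed as an "HPU" resource
    ),
)
# trainer.fit()  # run on a cluster with Gaudi nodes
```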

---------

Signed-off-by: Wu, Gangsheng <[email protected]>
Co-authored-by: Samuel Chan <[email protected]>
so that wheel verification does not need an extra run with manual env
var setups

Signed-off-by: Lonnie Liu <[email protected]>
rather than using `==` or `!=`

Signed-off-by: Lonnie Liu <[email protected]>
…-project#46471)

- use caps for vars
- add `-fsL` for `curl`
- double quote strings
- use `==` instead of `=`

Signed-off-by: Lonnie Liu <[email protected]>
After ray-project#42313 was merged, the contrast of the search button in dark mode is too low. After a discussion in ray-project#43566, we are reverting to the default pydata-sphinx-theme style for the search button.

Signed-off-by: pdmurray <[email protected]>
This appears to be an incorrect comment in the original document.

Signed-off-by: zjregee <[email protected]>
mostly sorting imports

Signed-off-by: Lonnie Liu <[email protected]>
…rArray (ray-project#45352)

## Why are these changes needed?
Currently, the ArrowTensorArray and ArrowVariableShapedTensorArray types
use int32 to encode the list offsets. This means that within a given
PyArrow chunk in a column, the sum of the sizes of all of the tensors in
that chunk must be less than 2^31; otherwise, depending on the overflow
conditions, an error is thrown or the data is truncated. This usually
doesn't manifest itself in Ray Data with the default settings, because
it splits the blocks up to meet the target max block size (though this
can be turned off!). However, it unavoidably shows up when one needs a
large local shuffle buffer to feed into Ray Train.

This PR changes the offsets to be stored in 64-bit integers and updates
the corresponding storage types of the TensorArrays.

As an example:

```python
import ray.train
import ray.train.torch
import ray.data
import numpy as np

def f(batch):
    # Each output block holds 1000 tensors of shape (1000, 550) in float16,
    # large enough to overflow int32 list offsets once buffered together.
    block1 = {"x": [np.ones((1000, 550), dtype=np.float16)] * 1000}
    return block1

dataset = ray.data.from_items([1, 2, 3, 4, 5]).map_batches(f, batch_size=None)

def train():
    data = ray.train.get_dataset_shard("train")
    # A large local shuffle buffer concatenates many tensors into one chunk.
    for batch in data.iter_torch_batches(batch_size=100, local_shuffle_buffer_size=4000):
        pass

trainer = ray.train.torch.TorchTrainer(train_loop_per_worker=train, datasets={"train": dataset})
trainer.fit()
```

fails with
```
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
```
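
The underlying distinction is between Arrow's 32-bit and 64-bit offset list types. A minimal illustration of that difference (not the PR's actual storage code):

```python
import numpy as np
import pyarrow as pa

values = pa.array(np.zeros(10, dtype=np.float32))
# List arrays use int32 offsets, capping total elements per chunk near 2**31.
small = pa.ListArray.from_arrays(pa.array([0, 5, 10], type=pa.int32()), values)
# Large list arrays use int64 offsets, which effectively removes that cap.
large = pa.LargeListArray.from_arrays(pa.array([0, 5, 10], type=pa.int64()), values)
print(small.offsets.type)  # int32
print(large.offsets.type)  # int64
```
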
## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Peter Wang <[email protected]>
Co-authored-by: Peter Wang <[email protected]>
Co-authored-by: Scott Lee <[email protected]>
simonsays1980 and others added 10 commits July 9, 2024 08:57
too flaky, and taking too long.

Signed-off-by: Lonnie Liu <[email protected]>
…f absent. (ray-project#46358)

Previously, `cluster_id` was an argument to (Python)GcsClient::Connect. This argument never changes throughout the lifetime of a GcsClient and should live in GcsClientOptions instead.

Renamed the Python binding `GcsClientOptions.from_gcs_address` to `GcsClientOptions.create`.

Also, today's C++ GcsClient never requests a cluster_id, so if it's absent, it never populates it in regular GCS requests. This PR adds an RPC at Connect() time to fetch the cluster_id if absent. To support that, it also adds a `timeout` argument in case GCS is down.

One caveat: if the GcsClient lives inside GCS itself, that RPC would block the GCS main thread and deadlock, so within GCS the cluster_id *must* be populated beforehand.
…ay-project#46149)

Hopefully this helps give a quick overview of using KubeRay to deploy clusters to Kubernetes.

Signed-off-by: omrishiv <[email protected]>
CUDA checks the compatibility between the host driver and the Docker image's CUDA version. We previously had this check because the host driver was older than the image's CUDA version, and CUDA is unhappy about that. Remove the check now that we have upgraded the CUDA version on the host.

Test:
- CI

Signed-off-by: can <[email protected]>
)

See
https://discuss.ray.io/t/error-in-ray-job-submit-on-local-machine-if-multiple-clusters-are-running-at-the-same-time/14723/7?u=davidxia

## Why are these changes needed?

Will prevent user confusion

## Related issue number


## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: David Xia <[email protected]>
Co-authored-by: Samuel Chan <[email protected]>
…#46363)

Originally, the number of blocks output by `from_pandas` equaled the number of input DataFrames (i.e., each input DataFrame became a block). For consistency with how we treat other inputs, ray-project#44937 changed the behavior so that each output block matches the target block size. This meant that you could pass in many DataFrames as input, but `from_pandas` might output only one block.

The change is problematic because many users do something like `from_pandas(np.array_split(metadata, num_blocks))` to get better performance, and after ray-project#44937, the `array_split` is pointless. So, this PR reverts the change (see the sketch below).
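
For reference, the pattern that the revert keeps useful looks roughly like this; the DataFrame contents and `num_blocks` value are made up for illustration:

```python
import numpy as np
import pandas as pd
import ray.data

# Pre-split the metadata so each piece can become its own block again.
metadata = pd.DataFrame({"id": range(1_000_000)})
num_blocks = 8
ds = ray.data.from_pandas(np.array_split(metadata, num_blocks))
print(ds.materialize().num_blocks())  # expected to match num_blocks after the revert
```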

Signed-off-by: Balaji Veeramani <[email protected]>
@GeneDer
Contributor Author

GeneDer commented Jul 10, 2024

finished testing

@GeneDer GeneDer closed this Jul 10, 2024
@GeneDer GeneDer deleted the look-for-commit-affecting-latency-v4 branch July 10, 2024 22:54