-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improving message for access to unknown objects #28209
Conversation
I also added a test for #28341, but I have not added a fix |
hm I see this does not raise an error either. How can I raise a Python error from C? : ) |
@jjyao could you shepherd this? |
d3ec300
to
d45ee06
Compare
Sorry for the late review. I'll take a look tomorrow! |
question:
This works for my purposes, but it won't build Ray with all the compilation checks (e.g. -Wunused-result) - do you know what I might need for that? |
ok at this point the failing tests seem like unrelated flakes. LMK if I should look into it! (also, I'll sign the commit on the next review cycle) |
I actually think the check failure is fine here: you should never share the object ref out of band across clusters. If you do that then it's a bug in the application code and the check failure basically tells you to fix the application code. IMO, it probably doesn't worth the effort to propagate up a catchable exception (it doesn't make much sense to catch and recover from it, we should just fix the application code). Maybe we should just improve the check failure message? WDYT @stephanie-wang ? |
Hmm, it seems to me that having a Python program just terminate the interpreter is always bad and raising and exception is much preferable. I haven't reviewed the PR in detail but it doesn't seem to introduce that much complexity. So we should probably merge it. While it doesn't make sense to catch the exception and recover from it, raising an exception is always better than just terminating the interpreter. Python is a safe language, so it shouldn't crash. |
Thanks! Right, I don't think the python interpreter should ever just terminate like that, but rather throw a Python exception that will tell the application what is the problem. I've added a couple small tests that you can try in Ray without this change and see how the behavior without the change is at least unintuitive (p.s. pushed signed-off commit) |
Ok, then I think we should check all the public apis that accept obj refs or nested obj refs and make sure they all throw the exception instead of check failure. |
I've created a GetOwnerAddressOrDie method which will preserve the existing behavior for untested paths, and used GetOwnerAddress for paths that we can test. Do you think we could review the current scope of changes and have a follow-up item to review other API paths that may cause CPython death afterwards? : ) |
python/ray/_raylet.pyx
Outdated
c_arg, c_owner_address) | ||
if not op_status.ok(): | ||
raise PlasmaObjectNotAvailable( | ||
'Object %r could not be found in this Ray session.' % arg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add possible reasons why this can happen to the message just like the check failure message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
python/ray/_raylet.pyx
Outdated
return CCoreWorkerProcess.GetCoreWorker().GetOwnerAddress( | ||
c_object_id).SerializeAsString() | ||
CAddress c_owner_address | ||
CCoreWorkerProcess.GetCoreWorker().GetOwnerAddress( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to check status here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added the check there. This should be more reasonable, huh?
python/ray/_raylet.pyx
Outdated
op_status = CCoreWorkerProcess.GetCoreWorker().GetOwnerAddress( | ||
c_arg, c_owner_address) | ||
if not op_status.ok(): | ||
raise PlasmaObjectNotAvailable( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this exception is for a different purpose based on its comment. Should it just be ValueError
given that we pass invalid objects to the functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that makes sense. done.
appropriate_exception_thrown | ||
), "ray.get is hanging on unknown object retrieval" | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a similar test for ray.wait()
as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added one. please make sure to check on the behavior and lmk if it makes sense to you.
@@ -279,13 +279,30 @@ Status CoreWorkerMemoryStore::Get(const std::vector<ObjectID> &object_ids, | |||
/*abort_if_any_object_is_exception=*/true); | |||
} | |||
|
|||
Status CoreWorkerMemoryStore::ObjectHasOwner(const ObjectID &object_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel a bool is enough here based on the function name?
Also I feel we can expose an ObjectHasOwner method in core_worker and do the validation at the beginning of public apis like ray.get()
(normally where we do function inputs validation) so the remaining code can just assume the objects are always valid and we have a single logic deciding whether an object is valid. WDYT?
exception_thrown = False | ||
try: | ||
f.remote(my_ref) # This would cause full CPython death. | ||
except ray.exceptions.RayError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we check more specifically to make sure the exact error is thrown, RayError is too broad. We can even check the exception message.
Btw, you can use with pytest.raises(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. thanks!
@@ -311,6 +328,10 @@ Status CoreWorkerMemoryStore::GetImpl(const std::vector<ObjectID> &object_ids, | |||
count += 1; | |||
} else { | |||
remaining_ids.insert(object_id); | |||
Status status = ObjectHasOwner(object_id, abort_if_any_object_is_exception); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think abort_if_any_object_is_exception
and abort_if_missing
are two different things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I've changed this
sg :) |
@jjyao any pointers for me to look at reproducing any possible failures locally? : ) |
@jjyao ptal? : ) |
i'll try to debug why the tests are not passing as expected... |
sorry for dropping the ball here, I'll take a look at the failures as well. |
Signed-off-by: Pablo E <[email protected]>
Signed-off-by: Pablo E <[email protected]>
Signed-off-by: Pablo E <[email protected]>
Signed-off-by: Pablo E <[email protected]>
rebased! I think this should be ready : ) @jjyao thanks |
reopening to rernu tests....... |
Signed-off-by: Pablo E <[email protected]>
ok looks good finally : ) |
src/ray/core_worker/core_worker.cc
Outdated
for (size_t i = 0; i < ids.size(); i++) { | ||
rpc::Address unused_owner_address; | ||
auto status = GetOwnerAddress(ids[i], &unused_owner_address); | ||
if (!status.ok()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of not ok, can we check ObjectUnknownOwner explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/ray/core_worker/core_worker.cc
Outdated
for (size_t i = 0; i < ids.size(); i++) { | ||
rpc::Address unused_owner_address; | ||
auto status = GetOwnerAddress(ids[i], &unused_owner_address); | ||
if (!status.ok()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Tests look good. I manually triggered mac tests, lets see if they are passing as well. |
CI for the previous commit without the couple changes from comments: https://buildkite.com/ray-project/oss-ci-build-pr/builds/7636#0185179c-5905-4cf6-95bb-d6eea5422628 |
Signed-off-by: Pablo E <[email protected]>
Thanks!! |
While working on ray-project#28209 I hit issue google/googletest#3514, which causes a couple gtest-dependent tests to be unable to build Signed-off-by: Pablo E <[email protected]> Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Pablo E <[email protected]> When accessing unknown objects, Ray will kill the Python interpreter (without erroring out!). I've tried to rephrase the error to be a little easier to figure out and fix, and changed the check so that an error will be thrown rather than ending the whole process.
Signed-off-by: Pablo E <[email protected]> When accessing unknown objects, Ray will kill the Python interpreter (without erroring out!). I've tried to rephrase the error to be a little easier to figure out and fix, and changed the check so that an error will be thrown rather than ending the whole process. Signed-off-by: tmynn <[email protected]>
Signed-off-by: Pablo [email protected]
Why are these changes needed?
Consider this scenario:
In this scenario, Ray will throw this error and kill the Python interpreter (without erroring out!). I've tried to rephrase the error to be a little easier to figure out and fix, and changed the check so that an error will be thrown rather than ending the whole process.
Changes to Ray APIs
The best overview you can get is by checking the changes to
test_ray_shutdown.py
. The behavior changes in the following scenarios:First consider the following set up:
In this case,
my_ref
is anObjRef
that exists from a previous Ray session, and start a new ray session that does not know this object. Without this change, the behavior is:f.remote([my_ref])
f.remote(my_ref)
ray.get(my_ref, timeout=30)
ray.get(my_ref)
ray.wait([my_ref])
ray.wait([my_ref], timeout=30)
ray.wait([ray.put("something"), my_ref])
ray.wait([ray.put("something"), my_ref], num_returns=2)
With these changes, the behavior for each call is now:
f.remote([my_ref])
ValueError
with the expected explanationf.remote(my_ref)
ValueError
with the expected explanationray.get(my_ref, timeout=30)
ValueError
with the expected explanationray.get(my_ref)
ValueError
with the expected explanationray.wait([my_ref])
ValueError
with the expected explanationray.wait([my_ref], timeout=30)
ValueError
with the expected explanationray.wait([ray.put("something"), my_ref])
ray.wait([ray.put("something"), my_ref], num_returns=2)
ValueError
with the expected explanationI would argue that this change is preferable from the previous behavior because:
That's it, I think.
(FYI on above section @stephanie-wang @jjyao)
Related issue number
I have not created an issue to track this.
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.