-
Notifications
You must be signed in to change notification settings - Fork 132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement kill action requests on scheduler #842
base: main
Are you sure you want to change the base?
Implement kill action requests on scheduler #842
Conversation
Implements scheduler api for sending a kill action request to a worker. Allows for action cancellation during scenarios such as client disconnection. towards TraceMachina#338
62c2ba7
to
24b7bd2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens on the worker side if the wrong ActionInfoHashKey
is sent?
What happens if the same ActionInfoHashKey
happens more than once?
Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and 1 discussions need to be resolved
nativelink-scheduler/src/worker.rs
line 165 at r1 (raw file):
pub fn send_kill_action_request( &mut self, unique_qualifier: &ActionInfoHashKey,
Does being called unique_qualifier
provide any value here or could it easily be action_info_hash_key
or action_info_hash
? Same goes for other points.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently it just returns Ok(())
on the scheduler side of the call if the same action info hash key is sent more than once. On the worker side the kill request will go through. The error is handled with a message but I don't see it appearing anywhere, the call just returns ok. I'm guessing you would have to manually call the function on the worker to see the error it raises.
If the wrong ActionInfoHashKey is sent it presumably would kill it if its an actual job, otherwise it will just handle the error on the worker.
The implementation for sending kill requests is a bit imperfect since I can confirm it is registered but I haven't been able to confirm that the process itself is killed since the worker doesn't have a way to send back this information to the scheduler.
With the way the worker api currently works I think this is the best we can do. Since we have tests to ensure kill_all works I think it is fine for now, but in the future it would be nice to have the statuses communicated better.
Reviewable status: 1 of 1 LGTMs obtained, and 1 discussions need to be resolved
nativelink-scheduler/src/worker.rs
line 165 at r1 (raw file):
Previously, adam-singer (Adam Singer) wrote…
Does being called
unique_qualifier
provide any value here or could it easily beaction_info_hash_key
oraction_info_hash
? Same goes for other points.
I think calling it unique_qualifier
is best since that's what the field on action_info
is. Everywhere else the same value is used refers to it as such. I agree that the name/Struct name combo is a bit awkward but that's a refactor for another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 1 LGTMs obtained
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 2 of 1 LGTMs obtained, and 4 discussions need to be resolved
nativelink-scheduler/src/simple_scheduler.rs
line 923 at r1 (raw file):
} async fn kill_running_action_on_worker(
nit: Does not need to be .async
.
nativelink-scheduler/src/simple_scheduler.rs
line 929 at r1 (raw file):
) -> Result<(), Error> { let mut inner = self.get_inner_lock(); let maybe_worker = inner.workers.workers.get_mut(worker_id);
fyi: You can use:
let worker = inner
.workers
.workers
.get_mut(worker_id)
.err_tip(|| format!("WorkerId '{}' does not exist in workers map", worker_id))?;
worker.send_kill_action_request(unique_qualifier)
nativelink-scheduler/src/worker.rs
line 170 at r1 (raw file):
&mut self.tx, update_for_worker::Update::KillActionRequest(KillActionRequest { action_id: hex::encode(unique_qualifier.get_hash()),
nit: Use unique_qualifier.action_name()
using .get_hash()
would make it impossible for the worker to figure out what action it's referencing.
nativelink-scheduler/src/worker_scheduler.rs
line 59 at r1 (raw file):
) -> Result<(), Error>; /// Send a request to kill a running action
nit: period.
nativelink-scheduler/tests/simple_scheduler_test.rs
line 1627 at r1 (raw file):
.await?; // Drop our receiver and look up a new one.
nit: This comment seems wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 2 of 1 LGTMs obtained, and 4 discussions need to be resolved
nativelink-scheduler/src/simple_scheduler.rs
line 929 at r1 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
fyi: You can use:
let worker = inner .workers .workers .get_mut(worker_id) .err_tip(|| format!("WorkerId '{}' does not exist in workers map", worker_id))?; worker.send_kill_action_request(unique_qualifier)
Done.
nativelink-scheduler/src/worker.rs
line 170 at r1 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: Use
unique_qualifier.action_name()
using.get_hash()
would make it impossible for the worker to figure out what action it's referencing.
I think this means there is an issue with the merged worker api. When I change the test for that to use action_name()
instead, it hangs on trying to receive. Doesn't the action_name()
correspond to the action_digest
?
kill_action
currently takes an action_id as a parameter so that would mean nothing is working properly.
I'm also a bit confused on why the test hangs when I pass the name instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 2 of 1 LGTMs obtained, and 2 discussions need to be resolved
nativelink-scheduler/src/worker.rs
line 170 at r1 (raw file):
Previously, zbirenbaum (Zach Birenbaum) wrote…
I think this means there is an issue with the merged worker api. When I change the test for that to use
action_name()
instead, it hangs on trying to receive. Doesn't theaction_name()
correspond to theaction_digest
?
kill_action
currently takes an action_id as a parameter so that would mean nothing is working properly.I'm also a bit confused on why the test hangs when I pass the name instead.
I found that action_name actually includes all of the information but I'm not sure how it helps us since the action_id
in create_and_add action is let action_id = action_info.unique_qualifier.get_hash();
You might be correct here since I've been having a lot of trouble finding a way to actually confirm the action is killed after the command is sent, but I'm not seeing the issue with the action id passed.
nativelink-scheduler/src/worker_scheduler.rs
line 59 at r1 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: period.
Done.
nativelink-scheduler/tests/simple_scheduler_test.rs
line 1627 at r1 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: This comment seems wrong.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: 2 of 1 LGTMs obtained, and 2 discussions need to be resolved
nativelink-scheduler/src/worker.rs
line 170 at r1 (raw file):
Previously, zbirenbaum (Zach Birenbaum) wrote…
I found that action_name actually includes all of the information but I'm not sure how it helps us since the
action_id
in create_and_add action islet action_id = action_info.unique_qualifier.get_hash();
You might be correct here since I've been having a lot of trouble finding a way to actually confirm the action is killed after the command is sent, but I'm not seeing the issue with the action id passed.
regardless, using action_name()
is going to be much more debugable, since it's how we store the data the actions.
Zach Birenbaum seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
Description
Implements scheduler api for sending a kill action request to a worker. Allows for action cancellation during scenarios such as client disconnection.
towards #338
Type of change
How Has This Been Tested?
Adds new test to confirm that the worker receives the kill action request
Checklist
bazel test //...
passes locallygit amend
see some docsThis change isdata:image/s3,"s3://crabby-images/d0bb7/d0bb7f7625ca5bf5c3cf7a2b7a514cf841ab8395" alt="Reviewable"