Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement kill action requests on scheduler #842

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

zbirenbaum
Copy link
Contributor

@zbirenbaum zbirenbaum commented Apr 8, 2024

Description

Implements scheduler api for sending a kill action request to a worker. Allows for action cancellation during scenarios such as client disconnection.

towards #338

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Adds new test to confirm that the worker receives the kill action request

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

Implements scheduler api for sending a kill action request to a worker.
Allows for action cancellation during scenarios such as client
disconnection.

towards TraceMachina#338
@zbirenbaum zbirenbaum force-pushed the scheduler-kill-action-request branch from 62c2ba7 to 24b7bd2 Compare April 8, 2024 21:00
Copy link

Copy link
Contributor

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens on the worker side if the wrong ActionInfoHashKey is sent?

What happens if the same ActionInfoHashKey happens more than once?

:lgtm:

Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and 1 discussions need to be resolved


nativelink-scheduler/src/worker.rs line 165 at r1 (raw file):

    pub fn send_kill_action_request(
        &mut self,
        unique_qualifier: &ActionInfoHashKey,

Does being called unique_qualifier provide any value here or could it easily be action_info_hash_key or action_info_hash ? Same goes for other points.

Copy link
Contributor Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it just returns Ok(()) on the scheduler side of the call if the same action info hash key is sent more than once. On the worker side the kill request will go through. The error is handled with a message but I don't see it appearing anywhere, the call just returns ok. I'm guessing you would have to manually call the function on the worker to see the error it raises.

If the wrong ActionInfoHashKey is sent it presumably would kill it if its an actual job, otherwise it will just handle the error on the worker.

The implementation for sending kill requests is a bit imperfect since I can confirm it is registered but I haven't been able to confirm that the process itself is killed since the worker doesn't have a way to send back this information to the scheduler.

With the way the worker api currently works I think this is the best we can do. Since we have tests to ensure kill_all works I think it is fine for now, but in the future it would be nice to have the statuses communicated better.

Reviewable status: 1 of 1 LGTMs obtained, and 1 discussions need to be resolved


nativelink-scheduler/src/worker.rs line 165 at r1 (raw file):

Previously, adam-singer (Adam Singer) wrote…

Does being called unique_qualifier provide any value here or could it easily be action_info_hash_key or action_info_hash ? Same goes for other points.

I think calling it unique_qualifier is best since that's what the field on action_info is. Everywhere else the same value is used refers to it as such. I agree that the name/Struct name combo is a bit awkward but that's a refactor for another PR.

Copy link
Contributor

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 1 LGTMs obtained

@zbirenbaum zbirenbaum requested a review from allada April 10, 2024 05:48
Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 2 of 1 LGTMs obtained, and 4 discussions need to be resolved


nativelink-scheduler/src/simple_scheduler.rs line 923 at r1 (raw file):

    }

    async fn kill_running_action_on_worker(

nit: Does not need to be .async.


nativelink-scheduler/src/simple_scheduler.rs line 929 at r1 (raw file):

    ) -> Result<(), Error> {
        let mut inner = self.get_inner_lock();
        let maybe_worker = inner.workers.workers.get_mut(worker_id);

fyi: You can use:

let worker = inner
    .workers
    .workers
    .get_mut(worker_id)
    .err_tip(|| format!("WorkerId '{}' does not exist in workers map", worker_id))?;
worker.send_kill_action_request(unique_qualifier)

nativelink-scheduler/src/worker.rs line 170 at r1 (raw file):

            &mut self.tx,
            update_for_worker::Update::KillActionRequest(KillActionRequest {
                action_id: hex::encode(unique_qualifier.get_hash()),

nit: Use unique_qualifier.action_name() using .get_hash() would make it impossible for the worker to figure out what action it's referencing.


nativelink-scheduler/src/worker_scheduler.rs line 59 at r1 (raw file):

    ) -> Result<(), Error>;

    /// Send a request to kill a running action

nit: period.


nativelink-scheduler/tests/simple_scheduler_test.rs line 1627 at r1 (raw file):

        .await?;

        // Drop our receiver and look up a new one.

nit: This comment seems wrong.

Copy link
Contributor Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 1 LGTMs obtained, and 4 discussions need to be resolved


nativelink-scheduler/src/simple_scheduler.rs line 929 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

fyi: You can use:

let worker = inner
    .workers
    .workers
    .get_mut(worker_id)
    .err_tip(|| format!("WorkerId '{}' does not exist in workers map", worker_id))?;
worker.send_kill_action_request(unique_qualifier)

Done.


nativelink-scheduler/src/worker.rs line 170 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Use unique_qualifier.action_name() using .get_hash() would make it impossible for the worker to figure out what action it's referencing.

I think this means there is an issue with the merged worker api. When I change the test for that to use action_name() instead, it hangs on trying to receive. Doesn't the action_name() correspond to the action_digest?

kill_action currently takes an action_id as a parameter so that would mean nothing is working properly.

I'm also a bit confused on why the test hangs when I pass the name instead.

Copy link
Contributor Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 1 LGTMs obtained, and 2 discussions need to be resolved


nativelink-scheduler/src/worker.rs line 170 at r1 (raw file):

Previously, zbirenbaum (Zach Birenbaum) wrote…

I think this means there is an issue with the merged worker api. When I change the test for that to use action_name() instead, it hangs on trying to receive. Doesn't the action_name() correspond to the action_digest?

kill_action currently takes an action_id as a parameter so that would mean nothing is working properly.

I'm also a bit confused on why the test hangs when I pass the name instead.

I found that action_name actually includes all of the information but I'm not sure how it helps us since the action_id in create_and_add action is let action_id = action_info.unique_qualifier.get_hash();

You might be correct here since I've been having a lot of trouble finding a way to actually confirm the action is killed after the command is sent, but I'm not seeing the issue with the action id passed.


nativelink-scheduler/src/worker_scheduler.rs line 59 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: period.

Done.


nativelink-scheduler/tests/simple_scheduler_test.rs line 1627 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: This comment seems wrong.

Done.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 2 of 1 LGTMs obtained, and 2 discussions need to be resolved


nativelink-scheduler/src/worker.rs line 170 at r1 (raw file):

Previously, zbirenbaum (Zach Birenbaum) wrote…

I found that action_name actually includes all of the information but I'm not sure how it helps us since the action_id in create_and_add action is let action_id = action_info.unique_qualifier.get_hash();

You might be correct here since I've been having a lot of trouble finding a way to actually confirm the action is killed after the command is sent, but I'm not seeing the issue with the action id passed.

regardless, using action_name() is going to be much more debugable, since it's how we store the data the actions.

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Zach Birenbaum seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants