Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove no listeners #717

Closed
wants to merge 8 commits into from

Conversation

zbirenbaum
Copy link
Contributor

@zbirenbaum zbirenbaum commented Mar 4, 2024

Fixes #338

There might be a better way to do this if listeners can be checked before trying to send off the result of an action to them, but I don't know of a way to do that so I'm opening this as a draft. Also was wondering what the easiest way to test for this behavior would be, as I couldn't find a test case for actions without listeners.

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

This change is Reviewable

Copy link

vercel bot commented Mar 4, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
nativelink-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Mar 4, 2024 3:47am

@zbirenbaum zbirenbaum marked this pull request as draft March 4, 2024 03:26
Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r5.
Reviewable status: 0 of 1 LGTMs obtained

a discussion (no related file):
We have a couple options here.

  1. We could call notify_channel.send() to tell the worker to terminate the job.
  2. We could cancel only queued actions.

I prefer 1, but it'll be more difficult.


a discussion (no related file):
At this phase, I suggest writing some tests.



nativelink-config/src/schedulers.rs line 124 at r6 (raw file):

    /// Remove action from queue after this much time has elapsed without a listener
    /// amount of time in seconds.

nit: End the sentence above and make this line say:
Units in seconds.


nativelink-scheduler/src/simple_scheduler.rs line 356 at r6 (raw file):

                    self.queued_actions_set.insert(action_info.clone());
                    self.queued_actions.insert(action_info.clone(), awaited_action);
                    self.metrics.time_without_listeners.reset();

Don't store this number on metrics. Metrics is only for statistics. We should not rely on it for any required features. Users can disable all metrics everywhere, so this may cause issues in weird ways.

@zbirenbaum
Copy link
Contributor Author

Reviewed 1 of 1 files at r5.
Reviewable status: 0 of 1 LGTMs obtained

a discussion (no related file): We have a couple options here.

1. We could call `notify_channel.send()` to tell the worker to terminate the job.

2. We could cancel only queued actions.

I prefer 1, but it'll be more difficult.

a discussion (no related file): At this phase, I suggest writing some tests.

nativelink-config/src/schedulers.rs line 124 at r6 (raw file):

    /// Remove action from queue after this much time has elapsed without a listener
    /// amount of time in seconds.

nit: End the sentence above and make this line say: Units in seconds.

nativelink-scheduler/src/simple_scheduler.rs line 356 at r6 (raw file):

                    self.queued_actions_set.insert(action_info.clone());
                    self.queued_actions.insert(action_info.clone(), awaited_action);
                    self.metrics.time_without_listeners.reset();

Don't store this number on metrics. Metrics is only for statistics. We should not rely on it for any required features. Users can disable all metrics everywhere, so this may cause issues in weird ways.

Got it, my instinct was that using notifications would be a better way but need to read more into how to get it working, the original issue only referenced the queue so got a little confused. I think doing it right will pay dividends in the long run so lets go with number 1.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13, ubuntu-22.04, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04

a discussion (no related file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

At this phase, I suggest writing some tests.

Please reword the github description, as it will become the commit message.


Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 4 files at r7.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13, ubuntu-22.04, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04


nativelink-scheduler/src/simple_scheduler.rs line 210 at r8 (raw file):

    workers: Workers,
    active_actions: HashMap<Arc<ActionInfo>, RunningAction>,
    actions_no_listeners: HashMap<Arc<ActionInfoHashKey>, SystemTime>,

nit: Lets track this in field in AwaitedAction with something like last_client_event. This field will need to be updated on every connect of a client and a timer will need to update it periodically if it still has listeners.


nativelink-scheduler/src/worker.rs line 208 at r8 (raw file):

    }

    pub fn drop_action(&mut self, action_info: &Arc<ActionInfo>) {

nit: We don't want to do anything here, we need to worker to send the RPC message back to the scheduler saying it errored/completed and let the normal flow take it.


nativelink-util/src/action_messages.rs line 634 at r8 (raw file):

    /// Worker is executing the action.
    Executing,
    // TODO(zbirenbaum) Add dropped status and rpc handling for actions which were cancelled due to disconnection

nit: This is not the RPC message you are looking for, check worker_api.proto.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13, ubuntu-22.04, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04


nativelink-scheduler/src/simple_scheduler.rs line 61 at r2 (raw file):

/// Default timeout for actions without any listeners
/// If this changes, remember to change the documentation in the config.
const DEFAULT_NO_LISTENERS_TIMEOUT_S: u64 = 10;

nit: Lets make this 60 seconds.

@zbirenbaum
Copy link
Contributor Author

Closing this and continuing in #778 which is much more polished and addresses most of the problems here

@zbirenbaum zbirenbaum closed this Mar 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Actions with no listeners that are old should be removed from queue
2 participants