Lots of "failed to wait for awaited action to change" errors on main branch #1197

Closed
cormacrelf opened this issue Jul 26, 2024 · 1 comment · Fixed by #1198
cormacrelf commented Jul 26, 2024

Version: main branch. This happens quite frequently, but randomly. It's a bit more consistent for actions that take longer; sometimes it fails a few times in a row on the same action (one that takes about 3 minutes, and it fails after about 1 minute). The workaround is to run the build again a few times and hope for the best (or go back to v0.4.0, which I have done).

Failed to wait for awaited action to change 
Error { 
  code: Internal, 
  messages: [\"Failed to wait for awaited action to change RecvError(())\"] 
}: In SimpleSchedulerActionListener::changed getting receiver

This obviously comes from

let action_state = self
    .action_state_result
    .changed()
    .await
    .err_tip(|| "In SimpleSchedulerActionListener::changed getting receiver")?;

let changed_fut = self.awaited_action_rx.changed().map(|r| {
    r.map_err(|e| {
        make_err!(
            Code::Internal,
            "Failed to wait for awaited action to change {e:?}"
        )
    })
});

(And another place in that file)
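
For context, the `RecvError(())` in the log is the kind of error a channel receiver reports once the sending half has been dropped while the receiver is still waiting. The sketch below reproduces that failure mode with the standard library's `std::sync::mpsc` channel as a stand-in. That choice is an assumption made for illustration: the code quoted above does not use `mpsc`, and `wait_for_update` and `demo_dropped_sender` are hypothetical helpers, not part of NativeLink.

```rust
use std::sync::mpsc::{channel, Receiver};

// Hypothetical helper: block until the next update arrives, mapping a
// closed channel to a message shaped like the one in the log above.
fn wait_for_update(rx: &Receiver<u32>) -> Result<u32, String> {
    rx.recv()
        .map_err(|e| format!("Failed to wait for awaited action to change {e:?}"))
}

// Sends one update, then drops the sender while the receiver is still
// interested, as if the action had been cleaned up under a listener.
fn demo_dropped_sender() -> (Result<u32, String>, Result<u32, String>) {
    let (tx, rx) = channel();
    tx.send(1).expect("receiver is alive");
    let first = wait_for_update(&rx);
    drop(tx);
    let second = wait_for_update(&rx);
    (first, second)
}
```

The first wait succeeds because an update was queued; the second fails only because nothing holds the sender any more, which is why the interesting question is who dropped it.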

I don't see any other useful logs that would tell us why the Sender half was dropped. But this comment in the lines just after the error sure seems like it's onto something:

_ = tokio::time::sleep(CLIENT_KEEPALIVE_DURATION) => {
    // If we haven't received any updates for a while, we should
    // let the database know that we are still listening to prevent
    // the action from being dropped.
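
The pattern that comment describes can be sketched roughly as follows: wait for an update, and whenever none arrives within the keepalive interval, tell the other side that the listener is still there. This is a simplified, synchronous sketch using `std::sync::mpsc::Receiver::recv_timeout` instead of `tokio::select!`; `listen_with_keepalive` and the `send_keepalive` callback are hypothetical names, not NativeLink's API.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

// Hypothetical listener loop: collect updates, and every time `keepalive`
// elapses without one, call `send_keepalive` so the other side knows we
// are still listening. Returns the updates seen once the sender is gone.
fn listen_with_keepalive(
    rx: &Receiver<u32>,
    keepalive: Duration,
    mut send_keepalive: impl FnMut(),
) -> Vec<u32> {
    let mut seen = Vec::new();
    loop {
        match rx.recv_timeout(keepalive) {
            Ok(update) => seen.push(update),
            Err(RecvTimeoutError::Timeout) => send_keepalive(),
            Err(RecvTimeoutError::Disconnected) => return seen,
        }
    }
}
```

In this sketch a dropped sender simply ends the loop; in the reported failure it surfaces as the `Internal` error shown earlier.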

Configuration is basically what I described in #963, i.e. basic_cas.json with a couple of tweaks to paths / entrypoints etc. Buck2 is in the driver's seat.

cormacrelf changed the title from 'Lots of "failed to wait for awaited action to change" errors' to 'Lots of "failed to wait for awaited action to change" errors on main branch' on Jul 26, 2024

allada commented Jul 26, 2024

Thanks for the report; we detected this late yesterday as well. I believe we have identified the cause and I'll try to push a fix in a bit.

It has to do with the recent scheduler refactor. When a client disconnected and then reconnected, the connection would not be kept alive as active, so eventually it would get cleaned up, causing downstream listeners to be disconnected and the stream to close.
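
As a rough illustration of that failure mode (not the actual scheduler code), the sketch below tracks a last-keepalive time per action and evicts anything that has gone quiet for longer than a timeout. `ActionRegistry`, `subscribe`, and `evict_stale` are hypothetical names, and time is passed in as a plain `Duration` so the behaviour is deterministic. The key assumption is that a reconnect goes through the same `subscribe` path as the initial request and refreshes the timestamp; if it did not, the action would look abandoned and be evicted while a client was still waiting on it.

```rust
use std::collections::HashMap;
use std::time::Duration;

// Hypothetical scheduler-side bookkeeping, not NativeLink's real types.
// `now` values are durations since an arbitrary start time.
struct ActionRegistry {
    timeout: Duration,
    last_keepalive: HashMap<String, Duration>,
}

impl ActionRegistry {
    fn new(timeout: Duration) -> Self {
        Self { timeout, last_keepalive: HashMap::new() }
    }

    // Called when a client creates an action or reconnects to an existing
    // one. Both cases refresh the timestamp so the action stays alive.
    fn subscribe(&mut self, action: &str, now: Duration) {
        self.last_keepalive.insert(action.to_string(), now);
    }

    // Removes every action whose last keepalive is older than `timeout`
    // and returns the names of the evicted actions in sorted order.
    fn evict_stale(&mut self, now: Duration) -> Vec<String> {
        let timeout = self.timeout;
        let mut evicted = Vec::new();
        self.last_keepalive.retain(|name, seen| {
            let alive = now.saturating_sub(*seen) <= timeout;
            if !alive {
                evicted.push(name.clone());
            }
            alive
        });
        evicted.sort();
        evicted
    }
}
```

With a 60 second timeout, an action subscribed at 0s and re-subscribed at 50s survives a sweep at 100s, and is only evicted once a sweep runs more than 60 seconds after the last refresh.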

allada added a commit to allada/nativelink-fork that referenced this issue Jul 26, 2024
Fixes a bug where if a client creates an action, then the
client disconnects and then reconnects on the same action
it would not keep the action alive and eventually time it
out.

closes: TraceMachina#1197
allada self-assigned this Jul 26, 2024
allada closed this as completed in 0b40639 Jul 26, 2024
zbirenbaum pushed a commit to zbirenbaum/nativelink that referenced this issue Jul 27, 2024
…a#1198)
