Version = main branch. This happens quite frequently, but randomly. It's a bit more consistent for actions that take longer; sometimes it fails a few times in a row on the same action (one that takes about 3 minutes fails after about 1 minute). The workaround is to run the build again a few times and hope for the best (or to go back to v0.4.0, which I have done).
```
Failed to wait for awaited action to change
Error {
    code: Internal,
    messages: ["Failed to wait for awaited action to change RecvError(())"]
}: In SimpleSchedulerActionListener::changed getting receiver
```
```rust
let changed_fut = self.awaited_action_rx.changed().map(|r| {
    r.map_err(|e| {
        make_err!(
            Code::Internal,
            "Failed to wait for awaited action to change {e:?}"
        )
    })
});
```
(And another place in that file)
I don't see any other useful logs that would tell us why the Sender half was dropped. But this comment in the lines just after the error sure seems like it's onto something:
```rust
// If we haven't received any updates for a while, we should
// let the database know that we are still listening to prevent
// the action from being dropped.
```
Configuration is basically what I described in #963, i.e. basic_cas.json with a couple of tweaks to paths / entrypoints etc. Buck2 is in the driver's seat.
On Jul 26, 2024, cormacrelf changed the title from “Lots of "failed to wait for awaited action to change" errors” to “Lots of "failed to wait for awaited action to change" errors on main branch”.
Thanks for the report, we detected this late yesterday as well. I believe we have identified the cause and I'll try to push a fix in a bit.
It has to do with the recent scheduler refactor: when a client disconnected and then reconnected to the same action, the scheduler would not treat the new connection as keeping the action alive, so the action was eventually cleaned up, which disconnected downstream listeners and closed the stream.
Fixes a bug where if a client creates an action, then the
client disconnects and then reconnects on the same action
it would not keep the action alive and eventually time it
out.
closes: TraceMachina#1197
This obviously comes from:

- nativelink/nativelink-scheduler/src/simple_scheduler.rs, lines 83 to 87 (at 5798761)
- nativelink/nativelink-scheduler/src/memory_awaited_action_db.rs, lines 154 to 161 (the snippet quoted above), and another place in that file

I don't see any other useful logs that would tell us why the Sender half was dropped. But the comment in the lines just after the error (memory_awaited_action_db.rs, lines 180 to 183, quoted above) sure seems like it's onto something.

Configuration is basically what I described in #963, i.e. basic_cas.json with a couple of tweaks to paths / entrypoints etc. Buck2 is in the driver's seat.