Lots of "failed to wait for awaited action to change" errors on main branch #1197

cormacrelf · 2024-07-26T00:29:33Z

Version: main branch. This happens quite frequently, but randomly. It's a bit more consistent for actions that take longer; sometimes it fails a few times in a row on the same action (one that takes about 3 minutes fails after about 1 minute). The workaround is to run the build again a few times and hope for the best, or to go back to v0.4.0, which I have done.

Failed to wait for awaited action to change 
Error { 
  code: Internal, 
  messages: [\"Failed to wait for awaited action to change RecvError(())\"] 
}: In SimpleSchedulerActionListener::changed getting receiver

This obviously comes from

nativelink/nativelink-scheduler/src/simple_scheduler.rs

Lines 83 to 87 in 5798761

    
    let action_state = self
        .action_state_result
        .changed()
        .await
        .err_tip(|| "In SimpleSchedulerActionListener::changed getting receiver")?;

nativelink/nativelink-scheduler/src/memory_awaited_action_db.rs

Lines 154 to 161 in 5798761

    
    let changed_fut = self.awaited_action_rx.changed().map(|r| {
        r.map_err(|e| {
            make_err!(
                Code::Internal,
                "Failed to wait for awaited action to change {e:?}"
            )
        })
    });

(And another place in that file)

I don't see any other useful logs that would tell us why the Sender half was dropped. But this comment in the lines just after the error sure seems like it's onto something:

nativelink/nativelink-scheduler/src/memory_awaited_action_db.rs

Lines 180 to 183 in 5798761

    
    _ = tokio::time::sleep(CLIENT_KEEPALIVE_DURATION) => {
        // If we haven't received any updates for a while, we should
        // let the database know that we are still listening to prevent
        // the action from being dropped.
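For context on the `RecvError` in the message above: a channel receiver reports this kind of error once every sender for the channel has been dropped. The sketch below reproduces that behaviour with the standard library's `std::sync::mpsc` channel as a stand-in. It is only an analogy, not the scheduler's code: the snippets quoted above await `changed()` on their own receiver type, and the function names here are made up for illustration.

```rust
use std::sync::mpsc::{channel, RecvError};

/// Hypothetical helper: drop the only sender first, then try to receive.
/// With no sender left and nothing buffered, `recv` returns `Err(RecvError)`
/// instead of blocking forever.
fn recv_after_sender_dropped() -> Result<u32, RecvError> {
    let (tx, rx) = channel::<u32>();
    drop(tx);
    rx.recv()
}

/// Hypothetical helper: while the sender is alive and has sent a value,
/// the same call succeeds.
fn recv_with_live_sender() -> Result<u32, RecvError> {
    let (tx, rx) = channel::<u32>();
    tx.send(7).expect("receiver is still alive");
    let value = rx.recv();
    drop(tx);
    value
}
```

The error carries no detail about why the sender went away, which matches the observation above that the logs do not say what dropped it.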

Configuration is basically what I described in #963, i.e. basic_cas.json with a couple of tweaks to paths / entrypoints etc. Buck2 is in the driver's seat.


allada · 2024-07-26T15:38:27Z

Thanks for the report, we detected this late yesterday as well. I believe we have identified the cause and I'll try to push a fix in a bit.

It has to do with the recent scheduler refactor. When a client disconnected and then reconnected, the scheduler would not keep the connection marked as active, so it would eventually get cleaned up, causing downstream listeners to get disconnected and close the stream.

Fixes a bug where if a client creates an action, then the client disconnects and then reconnects on the same action it would not keep the action alive and eventually time it out. closes: TraceMachina#1197
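To illustrate the failure mode described in this comment, here is a minimal sketch of keepalive-based eviction. It assumes a simple model in which each action stores the time of its last keepalive and is dropped once that time is older than a timeout. `ActionTable`, its methods, and the 60-second timeout used in the usage note are all hypothetical and are not NativeLink's types or values. The point is that a resubscribe on reconnect has to refresh the timestamp; otherwise the action is evicted even though a client is listening again.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a scheduler's table of awaited actions.
struct ActionTable {
    timeout: Duration,
    last_keepalive: HashMap<String, Instant>,
}

impl ActionTable {
    fn new(timeout: Duration) -> Self {
        ActionTable { timeout, last_keepalive: HashMap::new() }
    }

    /// A client starts or resumes listening. Refreshing the timestamp here
    /// means a reconnect counts as activity on the action.
    fn subscribe(&mut self, action_id: &str, now: Instant) {
        self.last_keepalive.insert(action_id.to_string(), now);
    }

    /// Periodic keepalive from a listening client. Returns false if the
    /// action is unknown, for example because it was already evicted.
    fn keepalive(&mut self, action_id: &str, now: Instant) -> bool {
        match self.last_keepalive.get_mut(action_id) {
            Some(ts) => {
                *ts = now;
                true
            }
            None => false,
        }
    }

    /// Remove every action whose last keepalive is older than the timeout
    /// and return the evicted ids in sorted order.
    fn evict_stale(&mut self, now: Instant) -> Vec<String> {
        let timeout = self.timeout;
        let mut evicted: Vec<String> = self
            .last_keepalive
            .iter()
            .filter(|(_, ts)| now.duration_since(**ts) > timeout)
            .map(|(id, _)| id.clone())
            .collect();
        evicted.sort();
        for id in &evicted {
            self.last_keepalive.remove(id);
        }
        evicted
    }
}
```

In this model, with a 60-second timeout, an action subscribed at time 0 and never refreshed is evicted by a sweep at 61 seconds, while one that is resubscribed at 50 seconds survives the same sweep.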


cormacrelf changed the title ~~Lots of "failed to wait for awaited action to change" errors~~ Lots of "failed to wait for awaited action to change" errors on main branch Jul 26, 2024

allada mentioned this issue Jul 26, 2024

Fix case when scheduler drops action on client reconnect #1198

Merged

allada self-assigned this Jul 26, 2024

allada closed this as completed in #1198 Jul 26, 2024

allada closed this as completed in 0b40639 Jul 26, 2024
