Nomad version
latest
Issue
The "waiting for remote previous alloc to terminate" message from Agents is logged at the DEBUG level. You shouldn't need DEBUG logging to investigate why table-stakes operations aren't succeeding. We're running Agents at the DEBUG level because of #23305 and hashicorp/yamux#142, despite the log spam from #22431.
The direct cause of this bug lies in one of the aforementioned issues, but basic troubleshooting, such as "Why isn't this allocation starting?", shouldn't require the DEBUG log level to suss out.
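For context, a minimal sketch of the agent logging settings this narrative assumes (`log_level` and `log_json` are standard Nomad agent configuration options; the file path is hypothetical):

```hcl
# Hypothetical /etc/nomad.d/logging.hcl
# DEBUG is currently the only level at which the alloc-migrator
# "waiting for remote previous alloc to terminate" message appears.
log_level = "DEBUG"
# Emit one JSON object per line, which keeps the logs greppable.
log_json = true
```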
Reproduction steps/Explanation
I wrote a narrative to explain the issue; here it is!
Alloc isn't starting
Why isn't my allocation starting? Let's take a look at `nomad alloc status -verbose UUID`!
```
ID                  = dd920556-96c5-760d-0424-932748248f31
Eval ID             = eed9ca82-d256-0b47-6f71-a517a8d1495c
Name                = REDACTED[79]
Node ID             = e6c6119e-04da-de2f-5637-def7f2feb423
Node Name           = REDACTED
Job ID              = REDACTED
Job Version         = 291435
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 2025-01-06T12:08:06-04:00
Modified            = 2025-01-06T15:10:39-04:00
Deployment ID       = 41af3ed0-2bfa-3763-837f-f7b0a513e886
Deployment Health   = unset
Evaluated Nodes     = 2556
Filtered Nodes      = 2555
Exhausted Nodes     = 0
Allocation Time     = 9.963488ms
Failures            = 0
```
Hmm, nothing there.
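As an aside, the Created and Modified timestamps above already quantify the stall. A quick back-of-the-envelope sketch (GNU `date` assumed; the timestamps are copied from the status output):

```shell
# Timestamps from the alloc status output above.
created='2025-01-06T12:08:06-04:00'
modified='2025-01-06T15:10:39-04:00'
# GNU date parses the ISO 8601 offset; on macOS use gdate from coreutils.
secs=$(( $(date -d "$modified" +%s) - $(date -d "$created" +%s) ))
echo "pending for $(( secs / 60 )) minutes"   # prints: pending for 182 minutes
```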
Let's look at the Nomad Agent logs on node REDACTED (e6c6119e-04da-de2f-5637-def7f2feb423).
```json
{"@level":"debug","@message":"waiting for remote previous alloc to terminate","@module":"client.alloc_migrator","@timestamp":"2025-01-06T19:05:31.771228Z","alloc_id":"dd920556-96c5-760d-0424-932748248f31","previous_alloc":"b15c463e-c675-2293-0da0-8669d51ddd51"}
```
I'm glad we're running at the debug log level, even though there's the constant spam from consul-template (wink wink).
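With JSON logs, the blocking allocation can be pulled out mechanically. A small sketch using only grep/cut (jq would be cleaner, if installed); the sample line is the log entry above:

```shell
# One line of the agent's JSON log (copied from above). In practice you'd
# read the agent's log file or journalctl output instead of a literal.
line='{"@level":"debug","@message":"waiting for remote previous alloc to terminate","@module":"client.alloc_migrator","alloc_id":"dd920556-96c5-760d-0424-932748248f31","previous_alloc":"b15c463e-c675-2293-0da0-8669d51ddd51"}'
# Extract the previous_alloc ID the new allocation is blocked on.
prev=$(printf '%s\n' "$line" | grep -o '"previous_alloc":"[^"]*"' | cut -d'"' -f4)
echo "$prev"   # prints: b15c463e-c675-2293-0da0-8669d51ddd51
```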
The Previous Alloc
Let's take a look at the previous allocation with the same `nomad alloc status` command!
```
ID                   = b15c463e-c675-2293-0da0-8669d51ddd51
Eval ID              = a8d08039-8df3-6a65-2d39-a49683e29538
Name                 = REDACTED[79]
Node ID              = 633f4408-8675-fa8f-253a-e8e4dbdbb0af
Node Name            = REDACTED
Job ID               = REDACTED
Job Version          = 291244
Client Status        = running
Client Description   = Tasks are running
Desired Status       = stop
Desired Description  = alloc is being updated due to job update
Created              = 2024-12-31T11:39:33-04:00
Modified             = 2025-01-06T12:08:06-04:00
Deployment ID        = ba6e1d93-a3f0-e839-a9fc-76206341f68a
Deployment Health    = unset
Replacement Alloc ID = dd920556-96c5-760d-0424-932748248f31
Evaluated Nodes      = 2561
Filtered Nodes       = 2558
Exhausted Nodes      = 0
Allocation Time      = 8.294365ms
Failures             = 0
```
Hmm. The previous_alloc has a reference to the new allocation, which is an improvement. But why isn't the allocation getting stopped on the other agent?
Goroutine stack traces
I admit, I had a hunch here.
This Nomad Agent has been stuck waiting for a Read to succeed (it never will) for 188 minutes! I believe it's due to yamux on the Nomad server dropping the write destined for this node.
The ONLY resolution at this point is restarting the Nomad Agent with the hung Read call.
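For anyone reproducing the hunch: Nomad agents expose a pprof endpoint (gated behind `enable_debug = true` or an ACL token with `agent:read`) that prints goroutine stacks, including how long each has been blocked. A sketch, assuming a local agent on the default port:

```shell
# debug=2 asks pprof for full, human-readable stacks; long-blocked
# goroutines are annotated like "[IO wait, 188 minutes]".
out=$(curl -sf --max-time 5 \
      "http://127.0.0.1:4646/v1/agent/pprof/goroutine?debug=2" \
      | grep 'goroutine .*minutes' \
      || echo "no agent reachable at 127.0.0.1:4646")
echo "$out"
```

A goroutine parked in a Read for minutes to hours is the smoking gun described above.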
Hi @lattwood, yeah this isn't ideal. I'd say the nature of the log message is debug-y, but I hear your pain. I'm gonna make a PR that moves it to info and see what others think about this.