
Debug Log Level is Required to see "waiting for remote previous alloc to terminate" messages #24788

Closed · lattwood opened this issue Jan 6, 2025 · 2 comments · Fixed by #24804

lattwood (Contributor) commented Jan 6, 2025

Nomad version

latest

Issue

The "waiting for remote previous alloc to terminate" message from Agents is at the DEBUG level. You shouldn't need to have the DEBUG level when investigating why table stakes operations aren't succeeding. We're running Agents at the DEBUG level due to #23305 and hashicorp/yamux#142, despite the log spam due to #22431.

The direct cause of this bug lies in one of the aforementioned issues, but basic troubleshooting questions, such as "Why isn't this allocation starting?", shouldn't require the DEBUG log level to suss out.

Reproduction steps/Explanation

I wrote a narrative/story to explain the issue; here it is!

Alloc isn't starting

Why isn't my allocation starting? Let's take a look at nomad alloc status -verbose UUID!

ID                  = dd920556-96c5-760d-0424-932748248f31
Eval ID             = eed9ca82-d256-0b47-6f71-a517a8d1495c
Name                = REDACTED[79]
Node ID             = e6c6119e-04da-de2f-5637-def7f2feb423
Node Name           = REDACTED
Job ID              = REDACTED
Job Version         = 291435
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 2025-01-06T12:08:06-04:00
Modified            = 2025-01-06T15:10:39-04:00
Deployment ID       = 41af3ed0-2bfa-3763-837f-f7b0a513e886
Deployment Health   = unset
Evaluated Nodes     = 2556
Filtered Nodes      = 2555
Exhausted Nodes     = 0
Allocation Time     = 9.963488ms
Failures            = 0

Hmm, nothing there.

Let's look at the Nomad Agent logs on node REDACTED (e6c6119e-04da-de2f-5637-def7f2feb423).

{"@level":"debug","@message":"waiting for remote previous alloc to terminate","@module":"client.alloc_migrator","@timestamp":"2025-01-06T19:05:31.771228Z","alloc_id":"dd920556-96c5-760d-0424-932748248f31","previous_alloc":"b15c463e-c675-2293-0da0-8669d51ddd51"}

I'm glad we're running at the debug log level, even though there's the constant spam from consul-template (wink wink).

The Previous Alloc

Let's take a look at the previous allocation with the same nomad alloc status command!

ID                   = b15c463e-c675-2293-0da0-8669d51ddd51
Eval ID              = a8d08039-8df3-6a65-2d39-a49683e29538
Name                 = REDACTED[79]
Node ID              = 633f4408-8675-fa8f-253a-e8e4dbdbb0af
Node Name            = REDACTED
Job ID               = REDACTED
Job Version          = 291244
Client Status        = running
Client Description   = Tasks are running
Desired Status       = stop
Desired Description  = alloc is being updated due to job update
Created              = 2024-12-31T11:39:33-04:00
Modified             = 2025-01-06T12:08:06-04:00
Deployment ID        = ba6e1d93-a3f0-e839-a9fc-76206341f68a
Deployment Health    = unset
Replacement Alloc ID = dd920556-96c5-760d-0424-932748248f31
Evaluated Nodes      = 2561
Filtered Nodes       = 2558
Exhausted Nodes      = 0
Allocation Time      = 8.294365ms
Failures             = 0

Hmm. The previous_alloc has a reference to the new allocation, which is an improvement. But why isn't the allocation getting stopped on the other agent?

Goroutine stack traces

I admit, I had a hunch here.

curl -s --cacert $NOMAD_CACERT --cert $NOMAD_CLIENT_CERT --key $NOMAD_CLIENT_KEY --header "X-Nomad-Token: $NOMAD_TOKEN" https://$NOMAD_HOST:4646/v1/agent/pprof/goroutine\?debug=2\&node_id=633f4408-8675-fa8f-253a-e8e4dbdbb0af | grep -B2 yamux | grep -A2 -E ' [0-9]{2,} minutes' | grep -B1 Read | grep minutes
goroutine 242 [select, 188 minutes]:

This Nomad Agent has been stuck waiting for a Read to succeed (it never will) for 188 minutes! I believe it's due to yamux on the Nomad Master dropping the write going to this node.

The ONLY resolution at this point is restarting the Nomad Agent with the hung Read call.
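For illustration only (this is not Nomad's or yamux's actual code, and the names below are hypothetical), here is a minimal Go sketch of why a connection Read with no deadline can hang indefinitely when the peer's reply is lost, whereas a read deadline turns the lost reply into a timeout:

package example

import (
	"net"
	"time"
)

// readNoDeadline blocks until data arrives or the connection errors.
// If the remote side's write is silently dropped (as described above),
// this call can sit in the goroutine dump for hours, much like the
// "select, 188 minutes" goroutine shown earlier.
func readNoDeadline(conn net.Conn) ([]byte, error) {
	buf := make([]byte, 4096)
	n, err := conn.Read(buf) // no deadline: may never return
	return buf[:n], err
}

// readWithDeadline bounds the wait, so a lost reply surfaces as a timeout
// error instead of a permanently stuck goroutine.
func readWithDeadline(conn net.Conn, d time.Duration) ([]byte, error) {
	if err := conn.SetReadDeadline(time.Now().Add(d)); err != nil {
		return nil, err
	}
	buf := make([]byte, 4096)
	n, err := conn.Read(buf)
	return buf[:n], err
}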

pkazmierczak (Contributor) commented:

Hi @lattwood, yeah this isn't ideal. I'd say the nature of the log message is debug-y, but I hear your pain. I'm gonna make a PR that moves it to info and see what others think about this.
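For context, a minimal hedged sketch of what such a change could look like, assuming the message is emitted through hashicorp/go-hclog; the package and function names below are illustrative stand-ins, not the actual Nomad source:

package allocwatcher

import hclog "github.com/hashicorp/go-hclog"

// waitForRemotePrevAlloc is a hypothetical stand-in for the code path that
// logs the message discussed in this issue.
func waitForRemotePrevAlloc(logger hclog.Logger, allocID, prevAllocID string) {
	// Before: the message is only visible when the agent runs at debug level.
	// logger.Debug("waiting for remote previous alloc to terminate",
	//     "alloc_id", allocID, "previous_alloc", prevAllocID)

	// After: surfaced at the default info level, so answering "why isn't this
	// allocation starting?" no longer requires debug logging.
	logger.Info("waiting for remote previous alloc to terminate",
		"alloc_id", allocID, "previous_alloc", prevAllocID)
}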

lattwood (Contributor, Author) commented Jan 8, 2025

Yay, thank you!
