
Debug Log Level is Required to see "waiting for remote previous alloc to terminate" messages #24788

Closed · lattwood opened this issue Jan 6, 2025 · 2 comments · Fixed by #24804

lattwood (Contributor) commented Jan 6, 2025

Nomad version

latest

Issue

The "waiting for remote previous alloc to terminate" message from Agents is at the DEBUG level. You shouldn't need to have the DEBUG level when investigating why table stakes operations aren't succeeding. We're running Agents at the DEBUG level due to #23305 and hashicorp/yamux#142, despite the log spam due to #22431.

The direct cause of this bug lies in one of the aforementioned issues, but basic troubleshooting questions, such as "Why isn't this allocation starting?", shouldn't require the DEBUG log level to suss out.

Reproduction steps/Explanation

I wrote a narrative/story to explain the issue; here it is!

Alloc isn't starting

Why isn't my allocation starting? Let's take a look at nomad alloc status -verbose UUID!

ID                  = dd920556-96c5-760d-0424-932748248f31
Eval ID             = eed9ca82-d256-0b47-6f71-a517a8d1495c
Name                = REDACTED[79]
Node ID             = e6c6119e-04da-de2f-5637-def7f2feb423
Node Name           = REDACTED
Job ID              = REDACTED
Job Version         = 291435
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 2025-01-06T12:08:06-04:00
Modified            = 2025-01-06T15:10:39-04:00
Deployment ID       = 41af3ed0-2bfa-3763-837f-f7b0a513e886
Deployment Health   = unset
Evaluated Nodes     = 2556
Filtered Nodes      = 2555
Exhausted Nodes     = 0
Allocation Time     = 9.963488ms
Failures            = 0

Hmm, nothing there.

Let's look at the Nomad Agent logs on node REDACTED (e6c6119e-04da-de2f-5637-def7f2feb423).

{"@level":"debug","@message":"waiting for remote previous alloc to terminate","@module":"client.alloc_migrator","@timestamp":"2025-01-06T19:05:31.771228Z","alloc_id":"dd920556-96c5-760d-0424-932748248f31","previous_alloc":"b15c463e-c675-2293-0da0-8669d51ddd51"}

I'm glad we're running at the debug log level, even though there's the constant spam from consul-template (wink wink).

The Previous Alloc

Let's take a look at the previous allocation with the same nomad alloc status command!

ID                   = b15c463e-c675-2293-0da0-8669d51ddd51
Eval ID              = a8d08039-8df3-6a65-2d39-a49683e29538
Name                 = REDACTED[79]
Node ID              = 633f4408-8675-fa8f-253a-e8e4dbdbb0af
Node Name            = REDACTED
Job ID               = REDACTED
Job Version          = 291244
Client Status        = running
Client Description   = Tasks are running
Desired Status       = stop
Desired Description  = alloc is being updated due to job update
Created              = 2024-12-31T11:39:33-04:00
Modified             = 2025-01-06T12:08:06-04:00
Deployment ID        = ba6e1d93-a3f0-e839-a9fc-76206341f68a
Deployment Health    = unset
Replacement Alloc ID = dd920556-96c5-760d-0424-932748248f31
Evaluated Nodes      = 2561
Filtered Nodes       = 2558
Exhausted Nodes      = 0
Allocation Time      = 8.294365ms
Failures             = 0

Hmm. The previous_alloc has a reference to the new allocation, which is an improvement. But why isn't the allocation getting stopped on the other agent?

Goroutine stack traces

I admit, I had a hunch here.

curl -s --cacert $NOMAD_CACERT --cert $NOMAD_CLIENT_CERT --key $NOMAD_CLIENT_KEY --header "X-Nomad-Token: $NOMAD_TOKEN" https://$NOMAD_HOST:4646/v1/agent/pprof/goroutine\?debug=2\&node_id=633f4408-8675-fa8f-253a-e8e4dbdbb0af | grep -B2 yamux | grep -A2 -E ' [0-9]{2,} minutes' | grep -B1 Read | grep minutes
goroutine 242 [select, 188 minutes]:

This Nomad Agent has been stuck waiting for a Read to succeed (it never will) for 188 minutes! I believe it's due to yamux on the Nomad Master dropping the write going to this node.

The ONLY resolution at this point is restarting the Nomad Agent with the hung Read call.
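For illustration only (this is not Nomad's or yamux's actual code, and the names below are hypothetical), here is a minimal Go sketch of why a connection Read with no deadline can hang indefinitely when the peer's reply is lost, whereas a read deadline turns the lost reply into a timeout:

package example

import (
	"net"
	"time"
)

// readNoDeadline blocks until data arrives or the connection errors.
// If the remote side's write is silently dropped (as described above),
// this call can sit in the goroutine dump for hours, much like the
// "select, 188 minutes" goroutine shown earlier.
func readNoDeadline(conn net.Conn) ([]byte, error) {
	buf := make([]byte, 4096)
	n, err := conn.Read(buf) // no deadline: may never return
	return buf[:n], err
}

// readWithDeadline bounds the wait, so a lost reply surfaces as a timeout
// error instead of a permanently stuck goroutine.
func readWithDeadline(conn net.Conn, d time.Duration) ([]byte, error) {
	if err := conn.SetReadDeadline(time.Now().Add(d)); err != nil {
		return nil, err
	}
	buf := make([]byte, 4096)
	n, err := conn.Read(buf)
	return buf[:n], err
}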

pkazmierczak (Contributor) commented:

Hi @lattwood, yeah this isn't ideal. I'd say the nature of the log message is debug-y, but I hear your pain. I'm gonna make a PR that moves it to info and see what others think about this.
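For context, a minimal hedged sketch of what such a change could look like, assuming the message is emitted through hashicorp/go-hclog; the package and function names below are illustrative stand-ins, not the actual Nomad source:

package allocwatcher

import hclog "github.com/hashicorp/go-hclog"

// waitForRemotePrevAlloc is a hypothetical stand-in for the code path that
// logs the message discussed in this issue.
func waitForRemotePrevAlloc(logger hclog.Logger, allocID, prevAllocID string) {
	// Before: the message is only visible when the agent runs at debug level.
	// logger.Debug("waiting for remote previous alloc to terminate",
	//     "alloc_id", allocID, "previous_alloc", prevAllocID)

	// After: surfaced at the default info level, so answering "why isn't this
	// allocation starting?" no longer requires debug logging.
	logger.Info("waiting for remote previous alloc to terminate",
		"alloc_id", allocID, "previous_alloc", prevAllocID)
}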

lattwood (Contributor, Author) commented Jan 8, 2025

Yay, thank you!
