intermittent error "reserved port collision" when running new job #9506
Comments
I'm seeing the same issue too: https://discuss.hashicorp.com/t/evaluation-maximum-attempts-reached-5/18449 |
I'm also observing this, but my job does not even expose any port, so I'm a bit confused about the error message:

job "damon" {
  datacenters = ["dc1"]

  group "damon" {
    count = 1

    task "damon" {
      driver = "raw_exec"

      artifact {
        source = "https://github.com/rgl/damon/releases/download/v0.1.2/damon.zip"
      }

      template {
        destination = "local/example.ps1"
        data        = <<-EOD
        ...
        EOD
      }

      config {
        command = "local/damon.exe"
        args = [
          "PowerShell.exe",
          "-File",
          "local/example.ps1"
        ]
      }

      resources {
        cpu    = 100
        memory = 350
      }
    }
  }
}

Also, is there a better way to gather these error messages besides monitoring the client logs from the Nomad UI (e.g. http://localhost:4646/ui/clients/c9d4d1c5-51cc-7276-0949-638d0072061e/monitor?level=debug)? It would be pretty nice to see these errors from e.g. http://localhost:4646/ui/jobs/damon/evaluations; for my particular problem, it only shows something like
I'm not sure if it makes any difference, but I'm running Nomad on Windows Server 2019. |
Yes, I agree. This kind of error should be visible from the UI of the failed job. By the way, I have a weird workaround for this bug: you can just restart all Nomad servers, and eventually the job will be re-run successfully. I've done it on my cluster multiple times. However, I'm not sure if I'll have to do this periodically -__-' |
By the way, I have a solid workaround for this. I did these steps:
Well, at least I have this workaround. I hope it will be useful for fellow users as well. |
Hi @habibiefaried, it looks like you're running into another example of #9492. The port reservation is only honored for static ports, not dynamic ports. We intend to fix this (see #8186) and have a community-provided PR, #8478, that doesn't seem like it's going to make it over the finish line, so we're going to have to circle back to that. |
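For context, a minimal sketch of the static/dynamic distinction described above (the group and port names here are illustrative, assuming the Nomad 0.12+ group-level network stanza), not taken from the original thread:

group "web" {
  network {
    # Static port: pins host port 8080; collisions with a client's
    # reserved_ports are checked for this kind of port.
    port "http" {
      static = 8080
    }

    # Dynamic port: Nomad picks a free host port from its dynamic range;
    # per the comment above, the reservation was not being honored here.
    port "metrics" {}
  }
}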
Getting this as well on 0.12, but the workaround doesn't work. The error seems persistent; I cannot run any job on the cluster at the moment, they all result in |
Ok what worked:
|
As we have continued to see reports of #9506, we need to elevate this log line as it is the only way to detect when plans are being *erroneously* rejected. Users who see this log line repeatedly should drain and restart the node named in the log line. This seems to work around the issue. Please post any details on #9506!
We've seen this come up more and are actively investigating it. One workaround we've seen work every time is restarting the client agent (node) specified in the log message. It seems the node's state on one or more servers can get corrupted, and restarting it fixes the corruption. Fixing this is a top priority of the Nomad team! Sorry for the delay, it's a tricky one. |
For what it's worth, I tracked this down to a bad Puppet config: 80,8081 turned into reserved_ports: 808081, which blocked all allocations on the affected nodes. |
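For reference, a rough sketch of how the intended reservation looks in the client agent configuration (the port values come from the comment above; the surrounding stanza layout is assumed, not quoted from that user's config):

client {
  enabled = true

  reserved {
    # Intended reservation: host ports 80 and 8081. A templating mistake
    # that concatenates them into "808081" reserves a nonsensical port
    # value and, as reported above, blocks allocations on the node.
    reserved_ports = "80,8081"
  }
}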
#9506 (comment) makes for an easy repro
Thanks @badalex! While unfortunately that isn't the root cause for everyone affected, it does make for a nice easy reproduction! Outputs:
Note the
Dev Agent
You can also reproduce this with a dev agent by using the following reserved.hcl agent configuration and running any job (such as produced by
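The reserved.hcl referred to above isn't preserved in this thread; as an assumed sketch, a dev-agent client configuration along these lines, reserving a large slice of the default dynamic port range (20000-32000), should make scheduled allocations collide with the reservation:

# reserved.hcl -- illustrative only; not the original file from this comment
client {
  enabled = true

  reserved {
    # Reserve most of the default dynamic port range so that dynamic port
    # picking collides with the reservation.
    reserved_ports = "20000-32000"
  }
}

Starting the agent with something like nomad agent -dev -config=reserved.hcl and then submitting any job should then reproduce the plan rejection.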
Fixes
I intend to add 2 fixes for this:
Note that @rgl's case is tricky in that there's no reason to even check the node's networking information if the job being scheduled doesn't use the network! I think we can cover that as well, but this brings me to my last point:
Limitations
I don't think there's just one bug here. I think we'll have to tackle this class of bugs one at a time, and SetNode/ReservedPorts is a great start. |
Goal is to fix at least one of the causes of a node becoming ineligible to receive work: #9506 (comment)
The shortlink /s/port-plan-failure is logged when a plan for a node is rejected to help users debug and mitigate repeated `plan for node rejected` failures. The current link to #9506 is... less than useful. It is not clear to users what steps they should take to either fix their cluster or contribute to the issue. While .../monitoring-nomad#progess isn't as comprehensive as it could be, it's a much more gentle introduction to the class of bug than the original issue.
Nomad version
Nomad v1.0.0-beta3 (fcb32ef)
Operating system and Environment details
Distributor ID: Debian
Description: Debian GNU/Linux 10 (buster)
Release: 10
Codename: buster
Issue
Sometimes, when I update or run a new job, I get the error
max-plan-attempts failed
in its UI.
Reproduction steps
Just run a job file
Job file (if appropriate)
Taken from: https://www.hashicorp.com/blog/consul-connect-native-tasks-in-hashicorp-nomad-0-12
Nomad Client logs (if appropriate)
Nomad Server logs (if appropriate)
Nomad Server Config
Nomad Client config
Notes
By the way, you can visit this URL: http://161.97.158.38:4646/ui/jobs/cn-demo. It's my testbed server for this Nomad beta version. It's open and unauthenticated, so you can see it live.
Thanks