(1.9.0) missed heartbeats don't mark nodes down #24231
Comments
Btw, this also happens with brand new clients that weren't previously registered.
Hi @dxdc! Sorry to hear you're seeing trouble. #23838 feels like the obvious culprit here, but I'm not sure how off the top of my head. Can you confirm that all your clients are also 1.9? Or at least >1.6.0 (see https://developer.hashicorp.com/nomad/docs/upgrade/upgrade-specific#dropped-support-for-older-clients)?
Thanks @tgross! Yes, clients on versions 1.8.4 and 1.9.0 both have the same issue. Is downgrading the server to 1.8.4 an option (and is it as simple as replacing the binary)? I'd be happy to test that.
Ok @dxdc, I've reproduced this and tracked down the cause. In #23838 we updated the `Node.Update` RPC handler we use for heartbeats to be more strict about requiring node secrets, but the request the leader sends (to itself) to mark a node down was missing the leader ACL it needs to authenticate.
Once the state store has written Raft logs that an older version doesn't understand, you can't roll back (see Upgrading) without replacing the state. You can get away with that for some patch-version updates, but 1.9.0 won't be one of them because we added new Raft keyring entries.
Thx @tgross. I ended up just rebuilding the entire cluster. Shouldn't release 1.9.0 be yanked while a patch is made? This seems like a serious defect.
Agreed, and there's another set of fairly serious defects around the
In #23838 we updated the `Node.Update` RPC handler we use for heartbeats to be more strict about requiring node secrets. But when a node goes down, it's the leader that sends the request to mark the node down via `Node.Update` (to itself), and this request was missing the leader ACL needed to authenticate to itself. Add the leader ACL to the request, and update the RPC handler test for disconnected clients to use ACLs, which would have detected this bug. Also add a note to the `Authenticate` comment about how that authentication path requires the leader ACL.
Fixes: #24231
Ref: https://hashicorp.atlassian.net/browse/NET-11384
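The failure mode described above can be illustrated with a simplified sketch. This is not Nomad's actual code; all names (`Request`, `authorize`, `leaderACL`) are hypothetical stand-ins. The point is that once the shared handler became strict about credentials, the leader's internal "mark node down" request had to carry the leader ACL too:

```go
package main

import "fmt"

// Request is a hypothetical stand-in for an RPC request that carries an
// auth token (a client node secret, or the server's leader ACL).
type Request struct {
	AuthToken string
}

// leaderACL stands in for the token the leader uses to authenticate to itself.
const leaderACL = "leader-acl-token"

// nodeSecrets simulates the set of registered client node secrets.
var nodeSecrets = map[string]bool{"client-node-secret": true}

// authorize mimics the stricter handler: it rejects any request that carries
// neither a known node secret nor the leader ACL.
func authorize(req Request) bool {
	return nodeSecrets[req.AuthToken] || req.AuthToken == leaderACL
}

func main() {
	// Before the fix: the leader's self-RPC carried no token, so the
	// "mark node down" request was rejected and nodes stayed "ready".
	fmt.Println("without leader ACL:", authorize(Request{AuthToken: ""}))

	// After the fix: the leader attaches its ACL to the request.
	fmt.Println("with leader ACL:", authorize(Request{AuthToken: leaderACL}))
}
```

The fix in #24241 is the second case: attaching the leader's credential to its own internal request, so the heartbeat handler can stay strict for everyone else.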
When #24241 merges, this issue will get closed. I've retitled and pinned it to make sure it's obvious to passersby that this is an important 1.9.0 regression. We're having an internal discussion about when we can ship 1.9.1, but expect that'll be soon.
Unpinning now that 1.9.1 has shipped.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
After upgrading to Nomad v1.9.0 from v1.8.4, all client nodes report as "ready", regardless of their actual status.
Nomad version
Nomad v1.9.0
BuildDate 2024-10-10T07:13:43Z
Revision 7ad3685
Operating system and Environment details
Ubuntu 24.04 LTS
Issue
After upgrading servers from 1.8.4 to 1.9.0, all clients are reporting as "ready" even when they are down.
Reproduction steps
1. Upgrade the Nomad servers from v1.8.4 to v1.9.0.
2. Stop (or disconnect) a client node.
3. Observe that the node is still reported as "ready".
Not sure if it's needed, but this is the client ACL policy:
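The policy itself didn't survive the page capture. Purely for illustration (this is not the reporter's actual policy), a minimal Nomad client node policy often looks something like:

```hcl
# Hypothetical example only -- the reporter's real policy was not captured.
# Grants write access to node objects and read access to the agent API,
# which is typical for client nodes.
node {
  policy = "write"
}

agent {
  policy = "read"
}
```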
Expected Result
Client nodes that stop heartbeating are marked as "down".
Actual Result
All client nodes continue to report as "ready", even when they are down.
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)