Joining new nodes to recovered vault after quorum lost causes "storage.raft.snapshot: failed to move snapshot into place (Invalid handle)" in Win 10 #12116
Hi @ferAbleTech, just a hunch, but could you try to reproduce this with a fully-qualified path?
Hi, I tried to do so, both with forward slash / and escaped backslash \, however the same error still happens: [ERROR] storage.raft: failed to send snapshot to: peer="{Nonvoter vault_3 127.0.0.1:8401}" error="sync C:\Work\KeyVault\cluster_with_autounseal - Copia\server_3\vault\raft\snapshots: Handle non valido." (Italian for "Invalid handle.")
Well, that's unfortunate. I should've expected as much: it seems like the rename itself worked, and it's the subsequent fsync that's returning the error. Thanks for the bug report, we'll look into it.
Hi @ferAbleTech! Are you using Cygwin or a similar emulated shell? There have been issues reported with the way these shells interact with Windows. I've seen discussions where using …
Hi @hsimon-hashicorp, I don't think it's related, as I'm using the native shell. Also, even using PowerShell in admin mode to execute the command gives the same results.
Yeah, it's only when restoring a snapshot that we do renames, to try to move the new file atomically into place. And it turns out that's... not possible in Windows? https://github.com/google/renameio says:
and then links to one of the Go authors saying golang/go#22397 (comment):
So that's a bummer. I'm going to have to think about what we might do instead to install snapshots. Note that snapshots sometimes get installed on nodes even without doing a recover operation or an explicit restore. So until we fix this issue, it's probably not a good idea to try to run integrated storage on Windows.
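For readers following along, here is a minimal sketch (an assumed shape, not Vault's actual code) of the atomic-install sequence being described: write the snapshot to a temporary file, rename it into place, then fsync the parent directory so the rename itself is made durable. The final directory sync is the step that fails on Windows with the "Invalid handle" error quoted above.

```go
package snapshot

import (
	"os"
	"path/filepath"
)

// installAtomically is an illustrative sketch, not Vault's implementation.
// It moves a fully written temp file into its final location and then syncs
// the parent directory so the rename survives a crash.
func installAtomically(tmpPath, finalPath string) error {
	// The rename itself appears to succeed on Windows...
	if err := os.Rename(tmpPath, finalPath); err != nil {
		return err
	}

	// ...but syncing the containing directory does not: Sync on a directory
	// handle fails on Windows, which is what shows up above as
	// "sync ...\raft\snapshots: Handle non valido." (Invalid handle).
	dir, err := os.Open(filepath.Dir(finalPath))
	if err != nil {
		return err
	}
	defer dir.Close()
	return dir.Sync()
}
```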
Good news: it sounds like the fact that atomic renames are not 100% safe in all conditions isn't the issue, and the associated caveats aren't relevant to our use case. Moreover, they're not the source of the bug you ran into. The only connection is that something we do while striving to make our renames atomic doesn't work properly on Windows. The Consul team ran into this years ago and fixed it (hashicorp/raft#241, hashicorp/raft#243); we're going to adopt their solution. I'll start work on a fix tomorrow.
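For reference, a rough sketch of the kind of guard the linked hashicorp/raft changes apply: only attempt the parent-directory fsync on platforms that support it, and skip it on Windows. The function name and structure here are illustrative, not the library's actual code.

```go
package snapshot

import (
	"os"
	"runtime"
)

// syncDir flushes a directory so that a preceding rename is durable. The
// guard mirrors the approach in hashicorp/raft#241 / #243: skip the
// directory fsync entirely on Windows, where it isn't supported.
func syncDir(path string) error {
	if runtime.GOOS == "windows" {
		return nil
	}
	dir, err := os.Open(path)
	if err != nil {
		return err
	}
	defer dir.Close()
	return dir.Sync()
}
```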
Great, thank you very much for the update and the fix!
This issue has been fixed in #12377 |
Describe the bug
I followed this tutorial to create a Vault cluster with Raft as the storage backend. After simulating an outage in which all 3 nodes were lost and recovering a single node using peers.json, I tried joining new nodes to the recovered node; however, after the join command (vault operator raft join http:...), the recovered node's console throws these errors periodically:
The joining node's console throws these errors:
This behaviour only happens on Windows; it doesn't happen in the online environment offered in the tutorial.
To Reproduce
Steps to reproduce the behavior:
Follow the steps in the tutorial up to "Retry Join" to create a cluster of 3 nodes (plus a server for auto-unseal using the Transit secrets engine).
Stop all nodes in the cluster.
Recover vault_2 using the peers.json method.
Try joining a new node to vault_2.
As this is a bit tedious to reproduce on Windows, since the automated script offered in the tutorial only works on Linux, I made a semi-automated equivalent using .bat files:
https://drive.google.com/file/d/1GHbNmBG0niRkIYB6Qc4KVHdGjPrqPPIi/view?usp=sharing
Follow the README to simulate the error.
Expected behavior
The new nodes successfully join the cluster.
Environment:
Vault server configuration file(s):
Autounseal vault:
Cluster nodes (with different api/cluster ports):