Raft snapshots disappear during machine crashes #3362
Fixed with 454b3a2.
@preetapan thanks for the fast fix.
Hi @rom-stratoscale, we've done some local testing, but it would be great if you could give this a spin in your local environment and see if it looks fixed there as well. Are you ok with making a build locally off master?
@slackpad we're using 0.8.4 (just upgraded). I can try to cherry-pick it (btw, aren't you planning to backport the fix to 0.8.x?). Also, do you have a containerized build environment for Consul, or should I just build it on my laptop?
@slackpad @preetapan strange. I cherry-picked your commit on top of 0.8.4 and ran the crash tests (IPMI power-off of a server, CentOS 7.1, kernel 3.10). Approximately 40 seconds afterwards the test crashed the node. Then the node starts: ==> Log data will now stream in as it occurs:
Consul did not manage to connect (probably test network issues). After a while we restarted it, and then got the following errors: ==> Log data will now stream in as it occurs:
which looks to me like snapshot corruption. Not sure why, though, since on the first iteration it was loaded successfully. state.pdf Thanks,
Hi @rom-stratoscale, thanks for the test feedback. From the output it looks like the Raft changes worked ok, but you got a few corrupted lines in the Serf snapshots, which are a totally different thing. Those lines are skipped on startup, so the agent can still start up but may have lost some information about the rest of the cluster, though that should heal itself automatically in most situations. We've done some work on improving this (6d172b7#diff-07aceaceda81f1c08ffb4c11f488ba45), but there's probably room for a change similar to what was done with Raft to better avoid the corruption there in the first place.
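To make the "skipped on startup" behavior concrete, here is a minimal, hypothetical Go sketch of replaying a line-oriented snapshot while tolerating a torn line. It is not the actual Serf code; the `replaySnapshot`/`applyEvent` names and the "type: payload" line shape are assumptions for illustration only.

```go
package serfsketch

import (
	"bufio"
	"log"
	"os"
	"strings"
)

// replaySnapshot reads a line-oriented snapshot and skips any line that does
// not parse, instead of failing the whole startup.
func replaySnapshot(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		// A torn write from a crash leaves a line that doesn't match the
		// expected "type: payload" shape; log it and move on.
		parts := strings.SplitN(line, ": ", 2)
		if len(parts) != 2 {
			log.Printf("skipping corrupted snapshot line: %q", line)
			continue
		}
		applyEvent(parts[0], parts[1])
	}
	return scanner.Err()
}

// applyEvent is a placeholder for rebuilding in-memory state from one event.
func applyEvent(kind, payload string) {
	_ = kind
	_ = payload
}
```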
@rom-stratoscale @slackpad I tested the most recent Raft changes by having a background task unmount the file system partition that the Consul data directory was on, rather than crashing the whole machine. Thought that was functionally equivalent (it simulates the file system disappearing before it could fsync). I appreciate the test done above in a real environment to confirm, though. As for the
@preetapan I think I'm missing something - doesn't all snapshotting go into a .tmp directory that is only then renamed, meaning the data is synced to disk at the source (.tmp) and no other process touches it? So why would there be partial lines?
Serf snapshots don't work that way right now - they do a periodic atomic rename using a tmp file (
I can see why it was designed this way: in a large cluster there are a lot of gossip events, and having each one of them trigger a write/fsync of a tmp file would hurt performance. This does predate me, so I could be missing some of the reasoning for why it works this way. We could consider not appending those intermediate lines at all and only doing atomic compacted writes to a snapshot file after we've received enough events, but there are other edge cases to consider there.
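For illustration only, here is a rough Go sketch of the "atomic compacted writes" idea mentioned above, using the write-tmp/fsync/rename pattern. It is not how Serf is actually implemented; the `writeCompacted` name and the line-per-event format are assumptions.

```go
package serfsketch

import (
	"bufio"
	"fmt"
	"os"
)

// writeCompacted writes a full, compacted set of events to <path>.tmp,
// fsyncs it, and atomically renames it over the live snapshot, so a crash
// leaves either the old snapshot or the new one but never a torn file.
func writeCompacted(path string, events []string) error {
	tmp := path + ".tmp"
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	cleanup := func(err error) error {
		f.Close()
		os.Remove(tmp)
		return err
	}

	w := bufio.NewWriter(f)
	for _, e := range events {
		if _, err := fmt.Fprintln(w, e); err != nil {
			return cleanup(err)
		}
	}
	if err := w.Flush(); err != nil {
		return cleanup(err)
	}
	// fsync before rename so the data is durable before it becomes visible.
	if err := f.Sync(); err != nil {
		return cleanup(err)
	}
	if err := f.Close(); err != nil {
		os.Remove(tmp)
		return err
	}
	// Atomic on POSIX filesystems: readers see either the old or new file.
	return os.Rename(tmp, path)
}
```

The trade-off is exactly the one noted above: every compacted write costs an fsync, so it only pays off when enough events are batched rather than doing it per gossip message.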
@preetapan thanks for the detailed answer. My bad, I thought it was about the same snapshot; I missed the fact that there is a separate Serf snapshot (and not only Raft). Btw, it seems I just managed to reproduce the original issue (Raft). Unfortunately I used the wrong test branch, though I did get the snapshot backups. Rerunning. Will update later / tomorrow.
Mmm, ok. So I have a reproduction :(

{"Version":1,"ID":"2-8269-1502052768345","Index":8269,"Term":2,"Peers":"k68xLjcxLjE5Mi41OjgzMDCxMS43MS4xOTIuMTE0OjgzMDCxMS43MS4yMTQuMTE4OjgzMDA=","Configuration":{"Servers":[{"Suffrage":0,"ID":"1.71.192.5:8300","Address":"1.71.192.5:8300"},{"Suffrage":0,"ID":"1.71.192.114:8300","Address":"1.71.192.114:8300"},{"Suffrage":0,"ID":"1.71.214.118:8300","Address":"1.71.214.118:8300"}]},"ConfigurationIndex":1787,"Size":619147,"CRC":"goN31aG/9aU="}

How do we want to proceed?
@rom-stratoscale I'll have to try to reproduce on my end - I can't easily tell how this can still be happening given the latest changes. If you can attach logs from the container that you crashed, that would help. Was the above meta.json file left in a directory that did not end with .tmp?
I have tried to reproduce this in the following way: I have the latest version of Consul with my changes running in a Docker container. A script runs
This might be a very rare edge case. The container going away and coming back is sort of like restarting the machine, but it's not exactly the same thing.
Also learned about implementation differences between ext3/ext4 when it comes to fsync (http://blog.httrack.com/blog/2013/11/15/everything-you-always-wanted-to-know-about-fsync/). What kernel version and file system type are you using?
I don't see this happening with docker kill either (by default it sends the same SIGKILL signal that kill -9 does). Given that the snapshots are so fast, we're talking about a 50 microsecond window in which things could potentially be left in a bad state like you describe above. So far, I haven't been able to replicate it. Please attach whatever logs you do have, even if they are partial, in case there's a hint in there somewhere. It is expected that .tmp will not exist - we either delete it if there is any write error, or do a rename after the fsync of state.bin and metadata.json. Also, you mention your kernel version, but not the filesystem type. Can you tell us what the output of
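As a rough sketch of the sequence described above (write into a .tmp directory, fsync the files, then rename into place; remove the .tmp directory on any write error), here is a minimal Go illustration. The `finalizeSnapshot`/`writeAndSync` names and the exact file names are assumptions, not the actual hashicorp/raft code.

```go
package raftsketch

import (
	"os"
	"path/filepath"
)

// finalizeSnapshot writes the snapshot payload and metadata into <dir>.tmp,
// syncs both files, and only then renames the directory into place. Any
// write error removes the partial .tmp directory so it is never loaded.
func finalizeSnapshot(dir string, state, meta []byte) error {
	tmp := dir + ".tmp"
	if err := os.MkdirAll(tmp, 0o755); err != nil {
		return err
	}
	fail := func(err error) error {
		os.RemoveAll(tmp)
		return err
	}

	if err := writeAndSync(filepath.Join(tmp, "state.bin"), state); err != nil {
		return fail(err)
	}
	if err := writeAndSync(filepath.Join(tmp, "meta.json"), meta); err != nil {
		return fail(err)
	}

	// The rename is the commit point for the snapshot.
	if err := os.Rename(tmp, dir); err != nil {
		return fail(err)
	}

	// On ext3/ext4 the rename itself only survives a crash once the parent
	// directory has been fsynced too.
	parent, err := os.Open(filepath.Dir(dir))
	if err != nil {
		return err
	}
	defer parent.Close()
	return parent.Sync()
}

func writeAndSync(path string, data []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}
```

The final parent-directory fsync matters because on ext3/ext4 a rename is only guaranteed to survive a crash once the containing directory has been synced, which appears to be the kind of gap the root-cause issue linked at the end of this thread deals with.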
@preetapan Consul logs are just missing. Also, I do see fsck in the journal log after the node reboots: Aug 07 00:00:07 localhost systemd-fsck[888]: ROOT: Clearing orphaned inode 1198285 (uid=162, gid=162, mode=040755, size=4096) but I cannot confirm that those are the relevant inodes. Can you try not using the Docker filesystem, but mounting the raft directory to the host, and then do some killing?
@rom-stratoscale had a chance to revisit this today - I've tested this with the raft directory mounted to the host as a volume, which is the default way Consul works in Docker anyway. No luck reproducing. Looks like
[root@stratonode1 ~]# df -T /mnt/data/consul
It's an LVM volume with an ext4 filesystem.
Given this issue has been quiet and we haven't been able to reproduce it as of the last try, I'm going to close this, but do feel free to comment if you continue to see this behavior.
@pearkes it happened one more time. What data do you need to debug these issues?
Root cause - hashicorp/raft#229