ZFS 2.2.4 txg_quiesce crash and filesystem hang. #16251
Comments
That's not a useful message on its own; usually when something breaks like that, there will be another message that's not just "thread blocked for xyz seconds". Once something crashes while holding a lock, nothing ever gets that lock back, so you naturally end up with lots of those blocked-task reports afterwards. So sharing much more of the log from around the time it broke would be useful.
Minio is also throwing errors in dmesg, but they seem to come after the ZFS txg issue, though only marginally. Any ideas how to figure out the root cause? Minio was working fine prior to ZFS 2.2, but the OS was also upgraded. I was thinking of reinstalling Ubuntu 22.04 and building ZFS 2.2.4 for it, but I'm not sure if that will work. Is there a ZFS expert that can be hired? Or donate to OpenZFS?

Prior to the ZFS message:
2024-06-06T06:51:23.071215+00:00 sever node_exporter[14387]: ts=2024-06-06T06:51:23.070Z caller=collector.go:169 level=error msg="collector failed" name=nfsd duration_seconds=0.000270918 err="failed to retrieve nfsd stats: unknown NFSd metric line "wdeleg_getattr""

After the ZFS message:
2024-06-06T06:51:45.536417+00:00 sever kernel: INFO: task minio:104446 blocked for more than 122 seconds.
A number of companies like Klara will sell you support on contract. Without an explicit error message, though, it's not really clear whether that's needed in your case, because that output is still probably just "something is waiting", not "how it broke". Since it says txg_quiesce had been waiting for over 120 seconds when it printed that, you are probably looking for log lines from 1-5 minutes before that point.
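A minimal sketch of pulling that earlier window out of a systemd journal (assuming journald is in use; the timestamps are only examples taken from the report above, and the output file name is arbitrary):

```bash
# Kernel messages from the ~5 minutes leading up to the first
# "blocked for more than 120 seconds" report; adjust the times
# to your own incident.
journalctl -k --utc --no-pager \
    --since "2024-06-06 06:46:00" --until "2024-06-06 06:52:00" \
    > pre-hang-kernel.log
```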
Hello. As @rincebrain mentioned, Klara provides professional services around OpenZFS and is available to investigate bugs of this type for you: Klara OpenZFS Bug Investigation. You can get in touch with us via our website if you'd like us to take a look at your issue and get it resolved.
There isn't anything else of interest in the log; it just goes from normal log messages to ZFS / minio hanging. I have sent my details to Klara to get them to debug the issue further. Are there any other debugging steps that are recommended to see what is locked? I've attached the TXG log; it always has the same pattern when it hangs.
Since you have refused to provide the information requested and insist that you are capable of deciding what's relevant in the logs, I am going to stop responding. Good luck.
Sorry - Here is the full syslog. |
System information
| Type | Version/Name |
| --- | --- |
| Distribution Name | Ubuntu 24.04 |
| Kernel Version | 6.8 |
| Architecture | x64 |
| OpenZFS Version | 2.2.4 and 2.2.2 |
On heavy read / write, the txg_quiesce process crashes and the ZFS pool hangs.
We are running minio on the system, and its default scanner activity triggers the issue.
This started happening after upgrading from ZFS 2.1 / Ubuntu 22.04 to ZFS 2.2.2 / Ubuntu 24.04; the system had been running for 6 months prior without issue.
2024-06-06T06:51:45.535350+00:00 sever kernel: task:txg_quiesce state:D stack:0 pid:13094 tgid:13094 ppid:2 flags:0x00004000
2024-06-06T06:51:45.535352+00:00 sever kernel: Call Trace:
2024-06-06T06:51:45.535354+00:00 sever kernel: <TASK>
2024-06-06T06:51:45.535356+00:00 sever kernel: __schedule+0x27c/0x6b0
2024-06-06T06:51:45.535359+00:00 sever kernel: schedule+0x33/0x110
2024-06-06T06:51:45.535360+00:00 sever kernel: cv_wait_common+0x102/0x140 [spl]
2024-06-06T06:51:45.535362+00:00 sever kernel: ? __pfx_autoremove_wake_function+0x10/0x10
2024-06-06T06:51:45.535363+00:00 sever kernel: __cv_wait+0x15/0x30 [spl]
2024-06-06T06:51:45.535364+00:00 sever kernel: txg_quiesce+0x181/0x1f0 [zfs]
2024-06-06T06:51:45.535366+00:00 sever kernel: txg_quiesce_thread+0xd2/0x120 [zfs]
2024-06-06T06:51:45.536394+00:00 sever kernel: ? __pfx_txg_quiesce_thread+0x10/0x10 [zfs]
2024-06-06T06:51:45.536403+00:00 sever kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
2024-06-06T06:51:45.536407+00:00 sever kernel: thread_generic_wrapper+0x5f/0x70 [spl]
2024-06-06T06:51:45.536409+00:00 sever kernel: kthread+0xf2/0x120
2024-06-06T06:51:45.536410+00:00 sever kernel: ? __pfx_kthread+0x10/0x10
2024-06-06T06:51:45.536412+00:00 sever kernel: ret_from_fork+0x47/0x70
2024-06-06T06:51:45.536413+00:00 sever kernel: ? __pfx_kthread+0x10/0x10
2024-06-06T06:51:45.536414+00:00 sever kernel: ret_from_fork_asm+0x1b/0x30
2024-06-06T06:51:45.536416+00:00 sever kernel: </TASK>
We have tried:
cat /proc/spl/kstat/zfs/DATA/txgs shows a txg stuck in the Q (quiescing) state.
Any ideas for a workaround or code fix?
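For reference, a minimal sketch of the extra state we can capture during the next hang (the pool name DATA matches the kstat path above; the output file names are arbitrary, and the second step assumes the magic SysRq interface is available and that the commands are run as root):

```bash
# Sample the per-txg state table a few times; a txg that stays in the
# Q (quiescing) state across samples is the one that is stuck.
for i in 1 2 3; do
    date
    cat /proc/spl/kstat/zfs/DATA/txgs
    sleep 30
done > txgs-during-hang.txt

# Dump kernel stacks of every task in uninterruptible (D) state to the
# kernel log, then save it for the bug report.
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg > blocked-stacks-during-hang.txt
```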
Attachment: zpool status.txt