ZFS 2.2.4 txg_quiesce crash and filesystem hang. #16251

Jazz9 · 2024-06-06T09:37:12Z

System information

Distribution Name | Ubuntu 24.04
Kernel Version | 6.8
Architecture | x64
OpenZFS Version | 2.2.4 and 2.2.2

Under heavy read/write load, the txg_quiesce process crashes and the ZFS pool hangs.

We are running minio on the system and the default scanner activity causes the issue.

This started happening after upgrading from ZFS 2.1 / Ubuntu 22 to ZFS 2.2.2 / Ubuntu 24. The system had been running for 6 months prior without issue.

2024-06-06T06:51:45.535350+00:00 sever kernel: task:txg_quiesce state:D stack:0 pid:13094 tgid:13094 ppid:2 flags:0x00004000
2024-06-06T06:51:45.535352+00:00 sever kernel: Call Trace:
2024-06-06T06:51:45.535354+00:00 sever kernel:
2024-06-06T06:51:45.535356+00:00 sever kernel: __schedule+0x27c/0x6b0
2024-06-06T06:51:45.535359+00:00 sever kernel: schedule+0x33/0x110
2024-06-06T06:51:45.535360+00:00 sever kernel: cv_wait_common+0x102/0x140 [spl]
2024-06-06T06:51:45.535362+00:00 sever kernel: ? __pfx_autoremove_wake_function+0x10/0x10
2024-06-06T06:51:45.535363+00:00 sever kernel: __cv_wait+0x15/0x30 [spl]
2024-06-06T06:51:45.535364+00:00 sever kernel: txg_quiesce+0x181/0x1f0 [zfs]
2024-06-06T06:51:45.535366+00:00 sever kernel: txg_quiesce_thread+0xd2/0x120 [zfs]
2024-06-06T06:51:45.536394+00:00 sever kernel: ? __pfx_txg_quiesce_thread+0x10/0x10 [zfs]
2024-06-06T06:51:45.536403+00:00 sever kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
2024-06-06T06:51:45.536407+00:00 sever kernel: thread_generic_wrapper+0x5f/0x70 [spl]
2024-06-06T06:51:45.536409+00:00 sever kernel: kthread+0xf2/0x120
2024-06-06T06:51:45.536410+00:00 sever kernel: ? __pfx_kthread+0x10/0x10
2024-06-06T06:51:45.536412+00:00 sever kernel: ret_from_fork+0x47/0x70
2024-06-06T06:51:45.536413+00:00 sever kernel: ? __pfx_kthread+0x10/0x10
2024-06-06T06:51:45.536414+00:00 sever kernel: ret_from_fork_asm+0x1b/0x30
2024-06-06T06:51:45.536416+00:00 sever kernel:

We have tried:

upgrading to 2.2.4 (built from source), with no change in behavior
reducing the txg timeout

cat /proc/spl/kstat/zfs/DATA/txgs shows a txg stuck in the Q state.
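For anyone wanting to watch for this condition rather than eyeball the file, here is a minimal Python sketch that parses the txgs kstat and lists every txg that has not reached the committed state. It assumes the file has a column header row beginning with `txg` and containing a `state` column, and that the state letters are O, Q, W, S and C; both assumptions are from memory of OpenZFS output and should be checked against your version. The function names are hypothetical, and the pool name `DATA` is the one from this report.

```python
from typing import Dict, List

# Assumed meaning of the state letters (verify against your OpenZFS version).
STATES = {
    "O": "open",
    "Q": "quiescing",
    "W": "waiting for sync",
    "S": "syncing",
    "C": "committed",
}


def parse_txgs(text: str) -> List[Dict[str, str]]:
    """Parse txgs kstat text into one dict per txg row, keyed by column name."""
    rows: List[Dict[str, str]] = []
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "txg":
            # Column header row; everything after it is data.
            columns = fields
            continue
        if columns is None or len(fields) != len(columns):
            # kstat preamble line or a malformed row: skip it.
            continue
        rows.append(dict(zip(columns, fields)))
    return rows


def uncommitted(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the rows whose state is anything other than committed."""
    return [row for row in rows if row.get("state") != "C"]


def report(path: str = "/proc/spl/kstat/zfs/DATA/txgs") -> None:
    """Print txg number and state for every uncommitted txg in the file."""
    with open(path) as handle:
        for row in uncommitted(parse_txgs(handle.read())):
            state = row["state"]
            print(row["txg"], state, STATES.get(state, "unknown"))
```

Calling `report()` a few seconds apart should show whether the same txg number stays in Q, which is the pattern described above.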

Any ideas for a workaround or code fix?

zpool status.txt

rincebrain · 2024-06-06T09:42:16Z

That's not a useful message on its own. Usually when something breaks like that, there will be another message that's not just "thread blocked for xyz seconds". Then, because something else broke, you get lots of those, since when something crashes while holding a lock, nothing else is getting that lock back.

So sharing much more of the dmesg output would be necessary to offer much insight into this.

Jazz9 · 2024-06-06T10:45:35Z

Minio is also throwing errors in dmesg, but they seem to come after the ZFS txg issue, if only marginally.

Any ideas how to figure out the root cause? Minio was working fine prior to zfs 2.2 but the OS was also upgraded.

I was thinking of reinstalling Ubuntu 22.04 and building ZFS 2.2.4 for it, but I'm not sure if that will work.

Is there a ZFS expert who can be hired? Or could we donate to OpenZFS?

Before the ZFS message:

2024-06-06T06:51:23.071215+00:00 sever node_exporter[14387]: ts=2024-06-06T06:51:23.070Z caller=collector.go:169 level=error msg="collector failed" name=nfsd duration_seconds=0.000270918 err="failed to retrieve nfsd stats: unknown NFSd metric line "wdeleg_getattr""
2024-06-06T06:51:30.524717+00:00 sever snmpd[14405]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
2024-06-06T06:51:38.097518+00:00 sever node_exporter[14387]: ts=2024-06-06T06:51:38.097Z caller=collector.go:169 level=error msg="collector failed" name=nfsd duration_seconds=0.000189292 err="failed to retrieve nfsd stats: unknown NFSd metric line "wdeleg_getattr""
2024-06-06T06:51:45.535325+00:00 sever kernel: INFO: task txg_quiesce:13094 blocked for more than 122 seconds.
2024-06-06T06:51:45.535346+00:00 sever kernel: Tainted: P O 6.8.0-35-generic #35-Ubuntu
2024-06-06T06:51:45.535348+00:00 sever kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

After the ZFS message:

2024-06-06T06:51:45.536417+00:00 sever kernel: INFO: task minio:104446 blocked for more than 122 seconds.
2024-06-06T06:51:45.536420+00:00 sever kernel: Tainted: P O 6.8.0-35-generic #35-Ubuntu
2024-06-06T06:51:45.536421+00:00 sever kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2024-06-06T06:51:45.536423+00:00 sever kernel: task:minio state:D stack:0 pid:104446 tgid:52253 ppid:52249 flags:0x00000002
2024-06-06T06:51:45.536425+00:00 sever kernel: Call Trace:
2024-06-06T06:51:45.536427+00:00 sever kernel:
2024-06-06T06:51:45.536428+00:00 sever kernel: __schedule+0x27c/0x6b0
2024-06-06T06:51:45.536430+00:00 sever kernel: ? osq_lock+0xe8/0x160
2024-06-06T06:51:45.536431+00:00 sever kernel: schedule+0x33/0x110
2024-06-06T06:51:45.536432+00:00 sever kernel: schedule_preempt_disabled+0x15/0x30
2024-06-06T06:51:45.536434+00:00 sever kernel: rwsem_down_write_slowpath+0x27e/0x550
2024-06-06T06:51:45.536436+00:00 sever kernel: down_write+0x5c/0x80
2024-06-06T06:51:45.536437+00:00 sever kernel: filename_create+0xaf/0x1b0
2024-06-06T06:51:45.536438+00:00 sever kernel: do_mkdirat+0x59/0x180
2024-06-06T06:51:45.536440+00:00 sever kernel: __x64_sys_mkdirat+0x4e/0x80
2024-06-06T06:51:45.536442+00:00 sever kernel: x64_sys_call+0x1c6f/0x25c0
2024-06-06T06:51:45.536443+00:00 sever kernel: do_syscall_64+0x7f/0x180
2024-06-06T06:51:45.536444+00:00 sever kernel: ? putname+0x5b/0x80
2024-06-06T06:51:45.536446+00:00 sever kernel: ? do_sys_openat2+0x9f/0xe0
2024-06-06T06:51:45.536447+00:00 sever kernel: ? __x64_sys_openat+0x55/0xa0
2024-06-06T06:51:45.536449+00:00 sever kernel: ? syscall_exit_to_user_mode+0x86/0x260
2024-06-06T06:51:45.536450+00:00 sever kernel: ? do_syscall_64+0x8c/0x180
2024-06-06T06:51:45.536451+00:00 sever kernel: ? irqentry_exit+0x43/0x50
2024-06-06T06:51:45.536453+00:00 sever kernel: entry_SYSCALL_64_after_hwframe+0x78/0x80
2024-06-06T06:51:45.536454+00:00 sever kernel: RIP: 0033:0x40708e
2024-06-06T06:51:45.536455+00:00 sever kernel: RSP: 002b:000000c0f2b26888 EFLAGS: 00000206 ORIG_RAX: 0000000000000102
2024-06-06T06:51:45.536457+00:00 sever kernel: RAX: ffffffffffffffda RBX: ffffffffffffff9c RCX: 000000000040708e
2024-06-06T06:51:45.536458+00:00 sever kernel: RDX: 00000000000001ff RSI: 000000c0a82cc6c0 RDI: ffffffffffffff9c
2024-06-06T06:51:45.536460+00:00 sever kernel: RBP: 000000c0f2b268c8 R08: 0000000000000000 R09: 0000000000000000
2024-06-06T06:51:45.536462+00:00 sever kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 000000c0a82cc6c0
2024-06-06T06:51:45.536463+00:00 sever kernel: R13: 0000000000000000 R14: 000000c0977b01c0 R15: 00001fffffffffff
2024-06-06T06:51:45.536464+00:00 sever kernel:

rincebrain · 2024-06-06T16:02:20Z

A number of companies like Klara will sell you support on contract. Without an explicit error message though, it's not really clear if that's needed for your case, because that output is still probably just "something is waiting" not "how it broke".

Since it says txg_quiesce was waiting for over 120s when it printed that, you probably are looking for log lines 1-5 minutes in the past from there.
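As a rough illustration of that lookback, here is a Python sketch that finds the first hung-task line in a syslog and returns everything logged in the five minutes before it. It assumes each line starts with an ISO 8601 timestamp, as in the excerpts in this thread; the function names and the default marker string are hypothetical choices, not anything ZFS or the kernel provides.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Optional


def timestamp(line: str) -> Optional[datetime]:
    """Return the leading ISO 8601 timestamp of a syslog line, or None."""
    first = line.split(" ", 1)[0]
    try:
        return datetime.fromisoformat(first)
    except ValueError:
        return None


def lines_before_hang(
    lines: Iterable[str],
    marker: str = "blocked for more than",
    lookback: timedelta = timedelta(minutes=5),
) -> List[str]:
    """Return lines stamped within `lookback` before the first marker line."""
    lines = list(lines)

    # Locate the first hung-task report that carries a parseable timestamp.
    hang_time = None
    for line in lines:
        if marker in line:
            hang_time = timestamp(line)
            if hang_time is not None:
                break
    if hang_time is None:
        return []

    # Keep only timestamped lines inside [hang_time - lookback, hang_time).
    start = hang_time - lookback
    selected = []
    for line in lines:
        stamp = timestamp(line)
        if stamp is not None and start <= stamp < hang_time:
            selected.append(line)
    return selected
```

Running `lines_before_hang(open("syslog.txt"))` on a full log narrows it to the window where the original failure, if it was logged at all, would be expected to appear.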

allanjude · 2024-06-06T16:08:11Z

Hello. As @rincebrain mentioned, Klara provides professional services around OpenZFS and is available to investigate bugs of this type for you: Klara OpenZFS Bug Investigation

You can get in touch with us via our website if you'd like us to take a look at your issue and get it resolved.

Jazz9 · 2024-06-07T03:39:44Z

There isn't anything else of interest in the log; it just goes from normal log messages to ZFS / minio hanging.

I have sent my details to Klara to get them to debug the issue further.

Are there any other debugging steps that are recommended to see what is locked?
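One generic approach, sketched below in Python and not specific to ZFS, is to walk /proc for threads in uninterruptible sleep (state D) and read each one's kernel stack. This assumes the kernel exposes /proc/<pid>/task/<tid>/stack, which normally requires root and stack tracing support; the function names are hypothetical.

```python
import os
from typing import Iterator, Optional, Tuple


def task_state(status_text: str) -> Optional[str]:
    """Return the one-letter state from /proc status text, e.g. 'D'."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            parts = line.split()
            return parts[1] if len(parts) > 1 else None
    return None


def task_name(status_text: str) -> str:
    """Return the task name from /proc status text, or '' if absent."""
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            return line.split(":", 1)[1].strip()
    return ""


def blocked_tasks(proc_root: str = "/proc") -> Iterator[Tuple[str, str, str]]:
    """Yield (tid, name, kernel stack) for every thread in state D."""
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            base = os.path.join(task_dir, tid)
            try:
                with open(os.path.join(base, "status")) as handle:
                    status = handle.read()
            except OSError:
                continue
            if task_state(status) != "D":
                continue
            try:
                with open(os.path.join(base, "stack")) as handle:
                    stack = handle.read()
            except OSError:
                stack = "(stack unavailable; try running as root)"
            yield tid, task_name(status), stack
```

Iterating over `blocked_tasks()` while the pool is hung and printing each tuple gives the stacks of all blocked threads at once, which can be attached alongside the logs.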

I've attached the TXG log - it always has the same pattern when it hangs.

txgs.txt

rincebrain · 2024-06-07T03:43:35Z

Since you have refused to provide the information requested and insist that you are capable of deciding what's relevant in the logs, I am going to stop responding. Good luck.

Jazz9 · 2024-06-07T04:23:38Z

Sorry - Here is the full syslog.

syslog.txt

Jazz9 added the Type: Defect Incorrect behavior (e.g. crash, hang) label Jun 6, 2024
