kernel panic on write #14463
Same issue here. I'm using a Xeon D-2177NT with a c6xx QAT device; write/read operations sometimes lead to an error message from the code at zfs/module/os/linux/zfs/qat_crypt.c, line 453 in 1d3ba0b. The return value may be CPA_STATUS_RETRY, which leads to the panic message.
By adding some logs, I traced the invalid checksum value. I checked the Intel QAT document:
qat-api-reference-cryptographic.pdf
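For context on the traced checksum: Adler-32 (RFC 1950) is defined with an initial value of 1 (A = 1, B = 0), so a running checksum seeded with 0 produces wrong values. A minimal illustration, not the QAT implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal Adler-32 (RFC 1950), for illustration only. `seed` is the
 * running checksum carried between calls; 1 is the correct initial value. */
static uint32_t adler32(uint32_t seed, const uint8_t *buf, size_t len)
{
    uint32_t a = seed & 0xffff;         /* low 16 bits:  sum of bytes */
    uint32_t b = (seed >> 16) & 0xffff; /* high 16 bits: sum of sums  */

    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % 65521;       /* 65521 = largest prime < 2^16 */
        b = (b + a) % 65521;
    }
    return ((b << 16) | a);
}
```

Seeded with 1, `adler32(1, "abc", 3)` yields the standard value 0x024d0127, and computing the same data in two chunks (carrying the seed) matches the one-shot result; seeded with 0, neither holds.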
It has worked for about 30 minutes now. Without this patch, it cannot run for long and panics within 10-60 seconds. I will continue IO stress testing with fio.
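A fio job along these lines can stand in for the stress test; the target directory, block size, and job count are assumptions, not the reporter's exact workload:

```shell
# Sustained large sequential writes to a dataset on the QAT-backed pool.
fio --name=qat-stress --directory=/tank/test --rw=write \
    --bs=128k --size=8g --numjobs=4 --ioengine=psync \
    --time_based --runtime=3600 --group_reporting
```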
CpaDcRqResults has to be initialized with checksum=1 for adler32. Otherwise, when the error CPA_DC_OVERFLOW occurs, the next compress operation will continue on previously part-compressed data and write invalid checksum data. When ZFS decompresses the compressed data, an invalid checksum occurs and leads to openzfs#14463. Signed-off-by: naivekun <[email protected]> Closes: openzfs#14463
It has worked for 3 days under heavy IO load.
CpaDcRqResults has to be initialized with checksum=1 for adler32. Otherwise, when the error CPA_DC_OVERFLOW occurs, the next compress operation will continue on previously part-compressed data and write invalid checksum data. When ZFS decompresses the compressed data, an invalid checksum occurs and leads to openzfs#14463. Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Weigang Li <[email protected]> Reviewed-by: Chengfei Zhu <[email protected]> Signed-off-by: naivekun <[email protected]> Closes openzfs#14632 Closes openzfs#14463
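The fix described above boils down to seeding the results structure before each compression request. A minimal sketch using a stand-in struct (the real CpaDcRqResults comes from the QAT API headers; the actual patch is openzfs#14632):

```c
#include <string.h>

/* Stand-in for the QAT API's CpaDcRqResults, reduced to the fields that
 * matter here; the real definition lives in the QAT headers. */
typedef struct {
    unsigned int checksum;
    unsigned int produced;
    unsigned int consumed;
} dc_results_t;

/* Seed the checksum with 1 (the Adler-32 initial value) before submitting
 * a request. With a zero seed, a request resubmitted after CPA_DC_OVERFLOW
 * continues from a stale checksum over the part-compressed data, and the
 * checksum stored with the block ends up invalid. */
static void dc_results_init(dc_results_t *res)
{
    memset(res, 0, sizeof (*res));
    res->checksum = 1;
}
```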
System information
Describe the problem you're observing
The kernel crashed during a compress-feature test with an Intel QAT 8960.
Describe how to reproduce the problem
Write database backup data to ZFS at high throughput:
- with zfs_qat_compress_disable=1 and zfs_qat_checksum_disable=1, throughput is low and everything is OK;
- with zfs_qat_compress_disable=0 and zfs_qat_checksum_disable=0, throughput increases, and the kernel panicked twice with the same coredump backtrace.
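The two tunables above are regular ZFS module parameters, so they can be flipped at runtime via sysfs; a sketch, assuming the zfs module is loaded and root privileges:

```shell
# Enable QAT offload for both compression and checksums (0 = enabled).
echo 0 > /sys/module/zfs/parameters/zfs_qat_compress_disable
echo 0 > /sys/module/zfs/parameters/zfs_qat_checksum_disable

# To rule QAT out again, disable both (1 = disabled).
echo 1 > /sys/module/zfs/parameters/zfs_qat_compress_disable
echo 1 > /sys/module/zfs/parameters/zfs_qat_checksum_disable
```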
Include any warning/errors/backtraces from the system logs
vmcore-dmesg.txt
[1715883.239105] cpaCySymRemoveSession() - : There are 1 requests pending
[1715883.244156] cpaCySymRemoveSession() - : There are 1 requests pending
[1715884.737480] cpaCySymRemoveSession() - : There are 1 requests pending
[1715884.947721] cpaCySymRemoveSession() - : There are 1 requests pending
[1715885.278642] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[1715885.279475] PGD 3b0f49d067 P4D 3b0f49d067 PUD 3e1f3cd067 PMD 0
[1715885.280207] Oops: 0000 [#1] SMP NOPTI
[1715885.280920] CPU: 23 PID: 19884 Comm: ios-io Kdump: loaded Tainted: P IOE --------- - - 4.18.0-305.3.1.el8.x86_64 #1
[1715885.282371] Hardware name: Dell Inc. PowerEdge R940/0GCTJ1, BIOS 2.16.1 08/17/2022
[1715885.283174] RIP: 0010:arc_release+0x1a/0x6e0 [zfs]
[1715885.283888] Code: ac ab 5e e8 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 4c 8d 67 10 55 53 48 89 fb 48 83 ec 48 <4c> 8b 37 4c 89 e7 e8 cb 45 e5 e8 48 8d 43 30 65 4c 8b 04 25 40 5c
[1715885.285427] RSP: 0018:ffffb0472cd7faa8 EFLAGS: 00010292
[1715885.286259] RAX: dead000000000200 RBX: 0000000000000000 RCX: 0000000000000001
[1715885.286996] RDX: dead000000000100 RSI: ffff9c8a6ea1dc80 RDI: 0000000000000000
[1715885.287643] RBP: ffff9ca375697c00 R08: ffffb0472cd7faa8 R09: ffff9ca375697c00
[1715885.288366] R10: ffff9c82aeaec100 R11: 0000000000000001 R12: 0000000000000010
[1715885.288941] R13: 0000000000000000 R14: ffff9c8a6ea1dc80 R15: ffff9c2c8156e800
[1715885.289595] FS: 00007f47f5351700(0000) GS:ffff9cab7f540000(0000) knlGS:0000000000000000
[1715885.290289] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1715885.291058] CR2: 0000000000000000 CR3: 0000003e63b14006 CR4: 00000000007726e0
[1715885.291699] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1715885.292368] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1715885.293210] PKRU: 55555554
[1715885.293718] Call Trace:
[1715885.294309] dbuf_dirty+0x7dc/0x950 [zfs]
[1715885.295007] dmu_write_uio_dnode+0x6b/0x140 [zfs]
[1715885.295650] dmu_write_uio_dbuf+0x47/0x60 [zfs]
[1715885.296508] zfs_write+0x493/0xc90 [zfs]
[1715885.297193] ? terminate_walk+0xcc/0xe0
[1715885.297850] zpl_iter_write+0x100/0x160 [zfs]
[1715885.298646] new_sync_write+0x112/0x160
[1715885.299223] vfs_write+0xa5/0x1a0
[1715885.299747] ksys_pwrite64+0x61/0xa0
[1715885.300387] do_syscall_64+0x5b/0x1a0
[1715885.301091] entry_SYSCALL_64_after_hwframe+0x65/0xca
[1715885.301842] RIP: 0033:0x7f5177d2d2b7
[1715885.302347] Code: 41 54 49 89 d4 55 48 89 f5 53 89 fb 48 83 ec 18 e8 ae f2 ff ff 4d 89 ea 4c 89 e2 48 89 ee 41 89 c0 89 df b8 12 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 e4 f2 ff ff 48
[1715885.303465] RSP: 002b:00007f47f534a690 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
[1715885.304166] RAX: ffffffffffffffda RBX: 000000000000015c RCX: 00007f5177d2d2b7
[1715885.304844] RDX: 0000000000008000 RSI: 00007f3eb5f1d000 RDI: 000000000000015c
[1715885.305427] RBP: 00007f3eb5f1d000 R08: 0000000000000000 R09: 00007f47f534a4cb
[1715885.305910] R10: 0000000000040000 R11: 0000000000000293 R12: 0000000000008000
[1715885.306486] R13: 0000000000040000 R14: 000000000000015c R15: 000000000000015c
crash> bt
PID: 19884 TASK: ffff9be85c1adc40 CPU: 23 COMMAND: "ios-io"
#0 [ffffb0472cd7f7c8] machine_kexec at ffffffffaa26156e
#1 [ffffb0472cd7f820] __crash_kexec at ffffffffaa38f99d
#2 [ffffb0472cd7f8e8] crash_kexec at ffffffffaa39088d
#3 [ffffb0472cd7f900] oops_end at ffffffffaa22434d
#4 [ffffb0472cd7f920] no_context at ffffffffaa27262f
#5 [ffffb0472cd7f978] __bad_area_nosemaphore at ffffffffaa27298c
#6 [ffffb0472cd7f9c0] do_page_fault at ffffffffaa273267
#7 [ffffb0472cd7f9f0] page_fault at ffffffffaac010fe
[exception RIP: arc_release+26]
RIP: ffffffffc1cf5e2a RSP: ffffb0472cd7faa8 RFLAGS: 00010292
RAX: dead000000000200 RBX: 0000000000000000 RCX: 0000000000000001
RDX: dead000000000100 RSI: ffff9c8a6ea1dc80 RDI: 0000000000000000
RBP: ffff9ca375697c00 R8: ffffb0472cd7faa8 R9: ffff9ca375697c00
R10: ffff9c82aeaec100 R11: 0000000000000001 R12: 0000000000000010
R13: 0000000000000000 R14: ffff9c8a6ea1dc80 R15: ffff9c2c8156e800
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#8 [ffffb0472cd7fb20] dbuf_dirty at ffffffffc1d08c7c [zfs]
#9 [ffffb0472cd7fba8] dmu_write_uio_dnode at ffffffffc1d1157b [zfs]
#10 [ffffb0472cd7fc08] dmu_write_uio_dbuf at ffffffffc1d11697 [zfs]
#11 [ffffb0472cd7fc30] zfs_write at ffffffffc1de9383 [zfs]
#12 [ffffb0472cd7fdc8] zpl_iter_write at ffffffffc1e25720 [zfs]
#13 [ffffb0472cd7fe48] new_sync_write at ffffffffaa516452
#14 [ffffb0472cd7fed0] vfs_write at ffffffffaa519a45
#15 [ffffb0472cd7ff00] ksys_pwrite64 at ffffffffaa519ea1
#16 [ffffb0472cd7ff38] do_syscall_64 at ffffffffaa20420b
#17 [ffffb0472cd7ff50] entry_SYSCALL_64_after_hwframe at ffffffffaac000ad
RIP: 00007f5177d2d2b7 RSP: 00007f47f534a690 RFLAGS: 00000293
RAX: ffffffffffffffda RBX: 000000000000015c RCX: 00007f5177d2d2b7
RDX: 0000000000008000 RSI: 00007f3eb5f1d000 RDI: 000000000000015c
crash> struct dmu_buf_impl_t ffff9c8a6ea1dc80
The current version is not compiled with debugging enabled; a test with a debug-enabled build is on the way.
This looks like an internal cache logic bug, and it's hard for me to fix.
I hope you can take a look at it. Thank you.