-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consistent Kernel Panic: BUG: soft lockup - CPU#8 stuck for 22s! [kswapd1:146] #1579
Comments
I believe I just experienced this: BUG: soft lockup - CPU#9 stuck for 67s! [arc_adapt:5638] Pid: 5638, comm: arc_adapt Tainted: P --------------- 2.6.32-358.11.1.el6.x86_64 #1 Supermicro X9DRi+ In this case, I'm copying from one pool to another pool. The pool that is being copied to has compression set on. Let me know if you need more information. Thanks, Steve |
This is analogous to the following SPL commit: openzfs/spl@b9b3715 Closes openzfs#1579 Signed-off-by: Richard Yao <[email protected]>
#1627 should fix this. |
This is analogous to SPL commit openzfs/spl@b9b3715. While we don't have clear evidence of systems getting caught here indefinately like in the SPL this ensures that it will never happen. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes openzfs#1579
I'm pretty sure this bug isn't fixed as I'm getting the exact same error. |
Using Oracle Unbreakable Linux with the 2.6.39-400.109.1.el6uek.x86_64 kernel I'm having significant stability issues with frequent kernel panics due to what looks to be memory pressure for the ARC cache or possibly a bug with zfs send writing a large abount of data (300GB+).
In my case I have a new 4-proc/16-core Linux server with 256GB of RAM that we intend to use as our PostgreSQL database server. In throughput and load tests the box is stable in general. In stress testing zfs on the box though with a zfs send command I noticed that once ARC would fill up available memory it would leave about 800M free and then fluctuate around this value. Please also note that swap is enabled (4GB) but with a swappiness sysctl value of 0 although I have also had these panics in testing with swap disabled. After ARC has filled memory and the zfs send (via NFS) was still on-going I would start receiving kernel panics. I've tested with the stock spl/zfs RPMS from RedHat, manually compiling spl/zfs 0.6.1 from source from the zfsforlinux site, and compiling from the github trunck versions of these projects. To keep things simple also here is no modprobe.d/zfs.conf or changes in sysctl, just stock spl/zfs. Also, on the manually compiled versions of spl/zfs I have --enabled-debug. In all of my test cases once ARC fills and if the box is under load during a zfs send it panics. I'm unsure if the zfs send process actually completes but the box is pretty unstable for a while after the panics and needs to be rebooted usually.
I have also tested by artificially constraining ARC to about 20% of available RAM, in this case the box does complete it's zfs send but as I bought the box with 256G of RAM specifically to use it to allow for a large ARC this solution isn't really providing the value of the hardware. vm.swappiness=0 is the only change to the stock configuration I have done for the box and there is no zfs.conf or changes to zfs's default settings. Please also note that although uname -r reports a 2.6 kernel, due to Oracle's odd standards for Unbreakable Linux this is actually a 3.2 kernel. Please see here if this is confusing, I know it confused me for a bit: https://oss.oracle.com/ol6/docs/RELEASE-NOTES-UEK2-QU2-en.html
Here is the stack trace:
Stack Trace:
Jul 4 13:07:41 db6 kernel: BUG: soft lockup - CPU#8 stuck for 22s! [kswapd1:146]
Jul 4 13:07:41 db6 kernel: Modules linked in: nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate pcspkr ghes hed i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ixgbe hwmon dca sg ext4 mbcache jbd2 dm_round_robin sr_mod cdrom usb_storage sd_mod crc_t10dif ahci libahci megaraid_sas mpt2sas scsi_transport_sas raid_class dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Jul 4 13:07:41 db6 kernel: CPU 8
Jul 4 13:07:41 db6 kernel: Modules linked in: nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate pcspkr ghes hed i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ixgbe hwmon dca sg ext4 mbcache jbd2 dm_round_robin sr_mod cdrom usb_storage sd_mod crc_t10dif ahci libahci megaraid_sas mpt2sas scsi_transport_sas raid_class dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Jul 4 13:07:41 db6 kernel:
Jul 4 13:07:41 db6 kernel: Pid: 146, comm: kswapd1 Tainted: P 2.6.39-400.109.1.el6uek.x86_64 #1 Oracle Corporation SUN FIRE X4270 M3 /ASSY,MOTHERBOARD,2U
Jul 4 13:07:41 db6 kernel: RIP: 0010:[] [] remove_vm_area+0x86/0xa0
Jul 4 13:07:41 db6 kernel: RSP: 0018:ffff881fbff01a80 EFLAGS: 00000287
Jul 4 13:07:41 db6 kernel: RAX: ffff88026694df40 RBX: ffff881fbff01a50 RCX: ffff880a2494f980
Jul 4 13:07:41 db6 kernel: RDX: ffff880a2494f340 RSI: 0000000000000001 RDI: ffffffff817dcd00
Jul 4 13:07:41 db6 kernel: RBP: ffff881fbff01aa0 R08: ffffea000866dd78 R09: 0000000000000000
Jul 4 13:07:41 db6 kernel: R10: ffff88207ffca5a0 R11: ffffea0008685f30 R12: ffffffff81518e2e
Jul 4 13:07:41 db6 kernel: R13: ffff88407fc15030 R14: 000000004f6be000 R15: ffff88407fc15048
Jul 4 13:07:41 db6 kernel: FS: 0000000000000000(0000) GS:ffff88407fc00000(0000) knlGS:0000000000000000
Jul 4 13:07:41 db6 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jul 4 13:07:41 db6 kernel: CR2: 00000000023481c0 CR3: 0000000001789000 CR4: 00000000000406e0
Jul 4 13:07:41 db6 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 4 13:07:41 db6 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jul 4 13:07:41 db6 kernel: Process kswapd1 (pid: 146, threadinfo ffff881fbff00000, task ffff881fbfefe0c0)
Jul 4 13:07:41 db6 kernel: Stack:
Jul 4 13:07:41 db6 kernel: ffff8802796ca600 ffff88026694df40 ffff881fbff01b70 ffffc9194f780000
Jul 4 13:07:41 db6 kernel: ffff881fbff01ad0 ffffffff8114581e ffff88407fc15040 0000000000040000
Jul 4 13:07:41 db6 kernel: ffffc9194f780000 ffff881fc1060000 ffff881fbff01ae0 ffffffff8114598a
Jul 4 13:07:41 db6 kernel: Call Trace:
Jul 4 13:07:41 db6 kernel: [] __vunmap+0x2e/0x120
Jul 4 13:07:41 db6 kernel: [] vfree+0x2a/0x40
Jul 4 13:07:41 db6 kernel: [] kv_free+0xa8/0x150 [spl]
Jul 4 13:07:41 db6 kernel: [] spl_slab_reclaim+0x1f6/0x390 [spl]
Jul 4 13:07:41 db6 kernel: [] spl_kmem_cache_reap_now+0x164/0x2e0 [spl]
Jul 4 13:07:41 db6 kernel: [] ? arc_adjust+0x11e/0x1e0 [zfs]
Jul 4 13:07:41 db6 kernel: [] arc_kmem_reap_now+0x67/0xc0 [zfs]
Jul 4 13:07:41 db6 kernel: [] __arc_shrinker_func+0xdc/0x1e0 [zfs]
Jul 4 13:07:41 db6 kernel: [] arc_shrinker_func+0xe/0x10 [zfs]
Jul 4 13:07:41 db6 kernel: [] shrink_slab+0x185/0x2e0
Jul 4 13:07:41 db6 kernel: [] balance_pgdat+0x69e/0x910
Jul 4 13:07:41 db6 kernel: [] kswapd+0x1b4/0x250
Jul 4 13:07:41 db6 kernel: [] ? balance_pgdat+0x910/0x910
Jul 4 13:07:41 db6 kernel: [] kthread+0x96/0xa0
Jul 4 13:07:41 db6 kernel: [] kernel_thread_helper+0x4/0x10
Jul 4 13:07:41 db6 kernel: [] ? kthread_worker_fn+0x1a0/0x1a0
Jul 4 13:07:41 db6 kernel: [] ? gs_change+0x13/0x13
Jul 4 13:07:41 db6 kernel: Code: 00 cd 7d 81 48 89 45 e8 e8 e8 a9 3c 00 48 8b 15 11 58 bf 00 48 c7 c1 40 af d3 81 48 8b 45 e8 48 39 d3 74 0c 90 48 89 d1 48 8b 12
Jul 4 13:07:41 db6 kernel: Call Trace:
Jul 4 13:07:41 db6 kernel: [] __vunmap+0x2e/0x120
Jul 4 13:07:41 db6 kernel: [] vfree+0x2a/0x40
Jul 4 13:07:41 db6 kernel: [] kv_free+0xa8/0x150 [spl]
Jul 4 13:07:41 db6 kernel: [] spl_slab_reclaim+0x1f6/0x390 [spl]
Jul 4 13:07:41 db6 kernel: [] spl_kmem_cache_reap_now+0x164/0x2e0 [spl]
Jul 4 13:07:41 db6 kernel: [] ? arc_adjust+0x11e/0x1e0 [zfs]
Jul 4 13:07:41 db6 kernel: [] arc_kmem_reap_now+0x67/0xc0 [zfs]
Jul 4 13:07:41 db6 kernel: [] __arc_shrinker_func+0xdc/0x1e0 [zfs]
Jul 4 13:07:41 db6 kernel: [] arc_shrinker_func+0xe/0x10 [zfs]
Jul 4 13:07:41 db6 kernel: [] shrink_slab+0x185/0x2e0
Jul 4 13:07:41 db6 kernel: [] balance_pgdat+0x69e/0x910
Jul 4 13:07:41 db6 kernel: [] kswapd+0x1b4/0x250
Jul 4 13:07:41 db6 kernel: [] ? balance_pgdat+0x910/0x910
Jul 4 13:07:41 db6 kernel: [] kthread+0x96/0xa0
Jul 4 13:07:41 db6 kernel: [] kernel_thread_helper+0x4/0x10
Jul 4 13:07:41 db6 kernel: [] ? kthread_worker_fn+0x1a0/0x1a0
Jul 4 13:07:41 db6 kernel: [] ? gs_change+0x13/0x13
Jul 4 13:09:33 db6 kernel: BUG: soft lockup - CPU#2 stuck for 22s! [kswapd0:145]
The text was updated successfully, but these errors were encountered: