Consistent Kernel Panic: BUG: soft lockup - CPU#8 stuck for 22s! [kswapd1:146] #1579

Closed
lycovian opened this issue Jul 8, 2013 · 3 comments

lycovian commented Jul 8, 2013

Using Oracle Unbreakable Linux with the 2.6.39-400.109.1.el6uek.x86_64 kernel, I'm having significant stability issues: frequent kernel panics due to what looks to be memory pressure from the ARC cache, or possibly a bug in zfs send when writing a large amount of data (300GB+).

In my case, I have a new 4-processor/16-core Linux server with 256GB of RAM that we intend to use as our PostgreSQL database server. In throughput and load tests the box is generally stable. While stress testing ZFS with a zfs send command, though, I noticed that once the ARC filled available memory it would leave about 800M free and then fluctuate around that value. Note that swap is enabled (4GB) with a swappiness sysctl value of 0, although I have also hit these panics in testing with swap disabled. Once the ARC had filled memory and the zfs send (via NFS) was still ongoing, I would start receiving kernel panics.

I've tested with the stock spl/zfs RPMs from RedHat, with spl/zfs 0.6.1 compiled from source from the zfsonlinux site, and with builds from the GitHub trunk of both projects. To keep things simple there is no modprobe.d/zfs.conf and no sysctl changes, just stock spl/zfs; the manually compiled versions of spl/zfs were built with --enable-debug. In every test case, once the ARC fills and the box is under load during a zfs send, it panics. I'm unsure whether the zfs send actually completes, but the box is fairly unstable for a while after the panics and usually needs to be rebooted.
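
For reference, the shape of the test is roughly the following; the pool, dataset, snapshot, and mount-point names are placeholders, not the real ones, and the arcstats watch is just a way to see the ARC fill:

# snapshot the dataset and stream it to the NFS-mounted destination
zfs snapshot tank/pgdata@stress
zfs send tank/pgdata@stress > /mnt/nfs/pgdata.zstream

# in another shell, watch the ARC grow toward its limit
watch -n 1 'grep -E "^(size|c|c_max) " /proc/spl/kstat/zfs/arcstats'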

I have also tested by artificially constraining the ARC to about 20% of available RAM; in that case the box does complete its zfs send. But since I bought the box with 256GB of RAM specifically to allow for a large ARC, this workaround doesn't really deliver the value of the hardware. vm.swappiness=0 is the only change I have made to the stock configuration of the box, and there is no zfs.conf or change to ZFS's default settings. Also note that although uname -r reports a 2.6 kernel, due to Oracle's odd versioning standards for Unbreakable Linux this is actually a 3.2 kernel. See here if that is confusing; I know it confused me for a bit: https://oss.oracle.com/ol6/docs/RELEASE-NOTES-UEK2-QU2-en.html
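
For anyone wanting to reproduce the constrained-ARC workaround, the usual mechanism is the zfs_arc_max module option plus the swappiness sysctl; the cap below is only an example value of roughly 20% of this box's 256GB:

# /etc/modprobe.d/zfs.conf -- example cap of ~50GB (about 20% of 256GB)
options zfs zfs_arc_max=53687091200

# match the sysctl setting described above
sysctl -w vm.swappiness=0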

Here is the stack trace:
Jul 4 13:07:41 db6 kernel: BUG: soft lockup - CPU#8 stuck for 22s! [kswapd1:146]
Jul 4 13:07:41 db6 kernel: Modules linked in: nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate pcspkr ghes hed i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ixgbe hwmon dca sg ext4 mbcache jbd2 dm_round_robin sr_mod cdrom usb_storage sd_mod crc_t10dif ahci libahci megaraid_sas mpt2sas scsi_transport_sas raid_class dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Jul 4 13:07:41 db6 kernel: CPU 8
Jul 4 13:07:41 db6 kernel: Modules linked in: nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate pcspkr ghes hed i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ixgbe hwmon dca sg ext4 mbcache jbd2 dm_round_robin sr_mod cdrom usb_storage sd_mod crc_t10dif ahci libahci megaraid_sas mpt2sas scsi_transport_sas raid_class dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Jul 4 13:07:41 db6 kernel:
Jul 4 13:07:41 db6 kernel: Pid: 146, comm: kswapd1 Tainted: P 2.6.39-400.109.1.el6uek.x86_64 #1 Oracle Corporation SUN FIRE X4270 M3 /ASSY,MOTHERBOARD,2U
Jul 4 13:07:41 db6 kernel: RIP: 0010:[] [] remove_vm_area+0x86/0xa0
Jul 4 13:07:41 db6 kernel: RSP: 0018:ffff881fbff01a80 EFLAGS: 00000287
Jul 4 13:07:41 db6 kernel: RAX: ffff88026694df40 RBX: ffff881fbff01a50 RCX: ffff880a2494f980
Jul 4 13:07:41 db6 kernel: RDX: ffff880a2494f340 RSI: 0000000000000001 RDI: ffffffff817dcd00
Jul 4 13:07:41 db6 kernel: RBP: ffff881fbff01aa0 R08: ffffea000866dd78 R09: 0000000000000000
Jul 4 13:07:41 db6 kernel: R10: ffff88207ffca5a0 R11: ffffea0008685f30 R12: ffffffff81518e2e
Jul 4 13:07:41 db6 kernel: R13: ffff88407fc15030 R14: 000000004f6be000 R15: ffff88407fc15048
Jul 4 13:07:41 db6 kernel: FS: 0000000000000000(0000) GS:ffff88407fc00000(0000) knlGS:0000000000000000
Jul 4 13:07:41 db6 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jul 4 13:07:41 db6 kernel: CR2: 00000000023481c0 CR3: 0000000001789000 CR4: 00000000000406e0
Jul 4 13:07:41 db6 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 4 13:07:41 db6 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jul 4 13:07:41 db6 kernel: Process kswapd1 (pid: 146, threadinfo ffff881fbff00000, task ffff881fbfefe0c0)
Jul 4 13:07:41 db6 kernel: Stack:
Jul 4 13:07:41 db6 kernel: ffff8802796ca600 ffff88026694df40 ffff881fbff01b70 ffffc9194f780000
Jul 4 13:07:41 db6 kernel: ffff881fbff01ad0 ffffffff8114581e ffff88407fc15040 0000000000040000
Jul 4 13:07:41 db6 kernel: ffffc9194f780000 ffff881fc1060000 ffff881fbff01ae0 ffffffff8114598a
Jul 4 13:07:41 db6 kernel: Call Trace:
Jul 4 13:07:41 db6 kernel: [] __vunmap+0x2e/0x120
Jul 4 13:07:41 db6 kernel: [] vfree+0x2a/0x40
Jul 4 13:07:41 db6 kernel: [] kv_free+0xa8/0x150 [spl]
Jul 4 13:07:41 db6 kernel: [] spl_slab_reclaim+0x1f6/0x390 [spl]
Jul 4 13:07:41 db6 kernel: [] spl_kmem_cache_reap_now+0x164/0x2e0 [spl]
Jul 4 13:07:41 db6 kernel: [] ? arc_adjust+0x11e/0x1e0 [zfs]
Jul 4 13:07:41 db6 kernel: [] arc_kmem_reap_now+0x67/0xc0 [zfs]
Jul 4 13:07:41 db6 kernel: [] __arc_shrinker_func+0xdc/0x1e0 [zfs]
Jul 4 13:07:41 db6 kernel: [] arc_shrinker_func+0xe/0x10 [zfs]
Jul 4 13:07:41 db6 kernel: [] shrink_slab+0x185/0x2e0
Jul 4 13:07:41 db6 kernel: [] balance_pgdat+0x69e/0x910
Jul 4 13:07:41 db6 kernel: [] kswapd+0x1b4/0x250
Jul 4 13:07:41 db6 kernel: [] ? balance_pgdat+0x910/0x910
Jul 4 13:07:41 db6 kernel: [] kthread+0x96/0xa0
Jul 4 13:07:41 db6 kernel: [] kernel_thread_helper+0x4/0x10
Jul 4 13:07:41 db6 kernel: [] ? kthread_worker_fn+0x1a0/0x1a0
Jul 4 13:07:41 db6 kernel: [] ? gs_change+0x13/0x13
Jul 4 13:07:41 db6 kernel: Code: 00 cd 7d 81 48 89 45 e8 e8 e8 a9 3c 00 48 8b 15 11 58 bf 00 48 c7 c1 40 af d3 81 48 8b 45 e8 48 39 d3 74 0c 90 48 89 d1 48 8b 12
Jul 4 13:07:41 db6 kernel: Call Trace:
Jul 4 13:07:41 db6 kernel: [] __vunmap+0x2e/0x120
Jul 4 13:07:41 db6 kernel: [] vfree+0x2a/0x40
Jul 4 13:07:41 db6 kernel: [] kv_free+0xa8/0x150 [spl]
Jul 4 13:07:41 db6 kernel: [] spl_slab_reclaim+0x1f6/0x390 [spl]
Jul 4 13:07:41 db6 kernel: [] spl_kmem_cache_reap_now+0x164/0x2e0 [spl]
Jul 4 13:07:41 db6 kernel: [] ? arc_adjust+0x11e/0x1e0 [zfs]
Jul 4 13:07:41 db6 kernel: [] arc_kmem_reap_now+0x67/0xc0 [zfs]
Jul 4 13:07:41 db6 kernel: [] __arc_shrinker_func+0xdc/0x1e0 [zfs]
Jul 4 13:07:41 db6 kernel: [] arc_shrinker_func+0xe/0x10 [zfs]
Jul 4 13:07:41 db6 kernel: [] shrink_slab+0x185/0x2e0
Jul 4 13:07:41 db6 kernel: [] balance_pgdat+0x69e/0x910
Jul 4 13:07:41 db6 kernel: [] kswapd+0x1b4/0x250
Jul 4 13:07:41 db6 kernel: [] ? balance_pgdat+0x910/0x910
Jul 4 13:07:41 db6 kernel: [] kthread+0x96/0xa0
Jul 4 13:07:41 db6 kernel: [] kernel_thread_helper+0x4/0x10
Jul 4 13:07:41 db6 kernel: [] ? kthread_worker_fn+0x1a0/0x1a0
Jul 4 13:07:41 db6 kernel: [] ? gs_change+0x13/0x13
Jul 4 13:09:33 db6 kernel: BUG: soft lockup - CPU#2 stuck for 22s! [kswapd0:145]

cousins commented Aug 6, 2013

I believe I just experienced this:

BUG: soft lockup - CPU#9 stuck for 67s! [arc_adapt:5638]
Modules linked in: ses dm_round_robin enclosure mpt2sas raid_class zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(]
CPU 9
Modules linked in: ses dm_round_robin enclosure mpt2sas raid_class zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(]

Pid: 5638, comm: arc_adapt Tainted: P --------------- 2.6.32-358.11.1.el6.x86_64 #1 Supermicro X9DRi+
RIP: 0010:[] [] remove_vm_area+0x86/0xa0
RSP: 0018:ffff8810394dbc60 EFLAGS: 00000202
RAX: ffff8802704e57c0 RBX: ffff8810394dbc80 RCX: ffff8801341b9900
RDX: ffff8801d3e94fc0 RSI: 0000000000000001 RDI: ffffffff81ac2490
RBP: ffffffff8100bb8e R08: ffffea00093b99e8 R09: 0000000000000000
R10: ffff880028406100 R11: 0000000000000000 R12: ffff8810394dbc30
R13: 0000000000000297 R14: ffff880000022e80 R15: ffffea00093b99f8
FS: 0000000000000000(0000) GS:ffff8810b8860000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000037e6c13020 CR3: 0000000001a85000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process arc_adapt (pid: 5638, threadinfo ffff8810394da000, task ffff881064b42080)
Stack:
ffff880f298ef400 ffff8802704e57c0 ffffc90c78995800 ffffc90b031fa000
ffff8810394dbcb0 ffffffff81150d5e ffffc90b2b20ec18 ffff88105be40000
ffffc90af8c1a000 0000000000000000 ffff8810394dbcc0 ffffffff81150eca
Call Trace:
[] ? __vunmap+0x2e/0x120
[] ? vfree+0x2a/0x40
[] ? kv_free+0x65/0x70 [spl]
[] ? spl_slab_reclaim+0x2d9/0x3e0 [spl]
[] ? spl_kmem_cache_reap_now+0x144/0x230 [spl]
[] ? dispose_list+0xfc/0x120
[] ? zpl_prune_sbs+0x0/0x60 [zfs]
[] ? __spl_kmem_cache_generic_shrinker+0x4b/0xe0 [spl]
[] ? spl_kmem_reap+0x27/0x30 [spl]
[] ? zpl_prune_sbs+0x58/0x60 [zfs]
[] ? arc_adjust_meta+0x120/0x1e0 [zfs]
[] ? arc_adapt_thread+0x0/0xd0 [zfs]
[] ? arc_adapt_thread+0x0/0xd0 [zfs]
[] ? arc_adapt_thread+0x6a/0xd0 [zfs]
[] ? thread_generic_wrapper+0x68/0x80 [spl]
[] ? thread_generic_wrapper+0x0/0x80 [spl]
[] ? kthread+0x96/0xa0
[] ? child_rip+0xa/0x20
[] ? kthread+0x0/0xa0
[] ? child_rip+0x0/0x20
Code: 90 24 ac 81 48 89 45 e8 e8 a8 f8 3b 00 48 8b 15 89 69 e6 00 48 c7 c1 f8 75 fb 81 48 8b 45 e8 48 39 d3 74 0c
Call Trace:
[] ? __vunmap+0x2e/0x120
[] ? vfree+0x2a/0x40
[] ? kv_free+0x65/0x70 [spl]
[] ? spl_slab_reclaim+0x2d9/0x3e0 [spl]
[] ? spl_kmem_cache_reap_now+0x144/0x230 [spl]
[] ? dispose_list+0xfc/0x120
[] ? zpl_prune_sbs+0x0/0x60 [zfs]
[] ? __spl_kmem_cache_generic_shrinker+0x4b/0xe0 [spl]
[] ? spl_kmem_reap+0x27/0x30 [spl]
[] ? zpl_prune_sbs+0x58/0x60 [zfs]
[] ? arc_adjust_meta+0x120/0x1e0 [zfs]
[] ? arc_adapt_thread+0x0/0xd0 [zfs]
[] ? arc_adapt_thread+0x0/0xd0 [zfs]
[] ? arc_adapt_thread+0x6a/0xd0 [zfs]
[] ? thread_generic_wrapper+0x68/0x80 [spl]
[] ? thread_generic_wrapper+0x0/0x80 [spl]
[] ? kthread+0x96/0xa0
[] ? child_rip+0xa/0x20
[] ? kthread+0x0/0xa0
[] ? child_rip+0x0/0x20

In this case, I'm copying from one pool to another. The destination pool has compression enabled. Let me know if you need more information.
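
The copy is essentially this shape (pool and dataset names below are placeholders, not the real ones):

# destination pool/dataset has compression enabled
zfs set compression=on destpool/backup

# snapshot the source and stream it across
zfs snapshot srcpool/data@xfer
zfs send srcpool/data@xfer | zfs recv destpool/backup/data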

Thanks,

Steve

ryao added a commit to ryao/zfs that referenced this issue Aug 7, 2013
This is analogous to the following SPL commit:

openzfs/spl@b9b3715

Closes openzfs#1579

Signed-off-by: Richard Yao <[email protected]>

ryao commented Aug 7, 2013

#1627 should fix this.

unya pushed a commit to unya/zfs that referenced this issue Dec 13, 2013
This is analogous to SPL commit openzfs/spl@b9b3715. While we
don't have clear evidence of systems getting caught here
indefinitely like in the SPL, this ensures that it will never
happen.

Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#1579

peter64 commented Nov 24, 2014

I'm pretty sure this bug isn't fixed as I'm getting the exact same error.
I filed a new ticket under openzfs/spl#415
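
A quick way to confirm exactly which spl/zfs builds are loaded on a system hitting this, assuming a stock ZFS on Linux install, is:

# report the loaded module versions
cat /sys/module/zfs/version /sys/module/spl/version
dmesg | grep -E "SPL:|ZFS:"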
