txg_sync at 100% CPU #596
Comments
There is a fix/workaround for this issue in the daily ppa, issue #496. You'll want to run that for a little while until the next release candidate is tagged (which should be fairly soon). Currently the daily releases contain numerous bug fixes not in the stable ppa. Please let me know if this does resolve the issue; it should.
I rebuilt my server from scratch and used the daily ppa. Importing the raidz pool from the prior version worked perfectly (this was really nice), so the restore picked up right where it left off. I experienced an unrelated crash after starting the rsync job for the first time that caused every terminal session, including the console, to become unresponsive (hard reboot). I believe it has to do with SLUB, as referenced here: http://nylug.org/pipermail/nylug-talk/2010-July/014371.html
Mar 15 00:20:11 neptune kernel: [ 3152.241264] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
Back on point, the rsync job has been running for the last 24 hours. So far so good. Thanks for all your fine work.
Hello,
I tried both the daily builds and behlendorf's fork, and both led to the same results.
Me too! I just posted a kernel panic yesterday, and I think it is related to this issue. Devices are blocked and unresponsive:
5268 root 0 -20 0 0 0 D 40.8 0.0 7:21.33 txg_sync
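For readers triaging similar reports, here is a minimal sketch of how the `top` line above could be split into named fields to confirm that `txg_sync` is in the uninterruptible `D` state. The column names are an assumption based on the default `top` layout (PID, USER, PR, NI, VIRT, RES, SHR, S, %CPU, %MEM, TIME+, COMMAND); the helper `parse_top_line` is hypothetical and not part of any ZFS tooling.

```python
# Column order assumed from the default `top` layout; adjust if top is customized.
TOP_COLUMNS = [
    "pid", "user", "pr", "ni", "virt", "res", "shr",
    "state", "cpu", "mem", "time", "command",
]


def parse_top_line(line):
    """Split one process row of `top` output into a dict keyed by column name."""
    # Limit the split so a command containing spaces stays in the last field.
    fields = line.split(None, len(TOP_COLUMNS) - 1)
    if len(fields) != len(TOP_COLUMNS):
        raise ValueError("unexpected number of columns: %d" % len(fields))
    return dict(zip(TOP_COLUMNS, fields))


row = parse_top_line("5268 root 0 -20 0 0 0 D 40.8 0.0 7:21.33 txg_sync")
# A state of "D" means uninterruptible sleep, which is why the thread cannot be killed.
print(row["command"], row["state"], row["cpu"])
```

The split limit keeps the command column intact, which matters for processes whose command line contains spaces.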
Same here, txg_sync hangs on zfs create, Ubuntu 12.4 under kvm (on SmartOS, ironically);
@calmh Yours might be a 32-bit issue, use a 64-bit kernel. |
Ah, sweet, of course. Could we fail in a more obvious manner here, maybe not even compile?
@calmh That's probably not a bad idea. We could add a gigantic warning to the 32-bit builds encouraging folks not to use them. This is something we want to get fixed eventually but it requires some substantial work and is very low priority for us. |
@calmh I was also going to say, if your system is still otherwise responsive when this happens I'd be interested in the contents of the |
@calmh This is the critical bit:
dmu_tx_memory_reclaim 4 590985
Two fixes actually just went into master which should resolve this issue on 32-bit systems. If you could grab the latest source from git and build it, you should see this counter drop to zero (or at least very rarely increment).
b68503f Remove vmem_size() consumers
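To watch the counter mentioned above across rebuilds, a small parser for `name type data` rows may help. This is only a sketch: the thread does not say where the counter was read from, so the assumption here is that the text comes from an SPL kstat file such as `/proc/spl/kstat/zfs/dmu_tx`. The helper name `parse_kstat` is hypothetical, and the `dmu_tx_assigned` value in the sample is made up for illustration; only the `dmu_tx_memory_reclaim` row is taken from the comment.

```python
def parse_kstat(text):
    """Parse 'name type data' rows from kstat-style text into {name: int}."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly three fields, and the last two are integers.
        # Header rows (for example "name type data") fail this check and are skipped.
        if len(fields) == 3 and fields[1].isdigit() and fields[2].isdigit():
            counters[fields[0]] = int(fields[2])
    return counters


# Sample text: the first data row is illustrative, the second is from the thread.
sample = """\
name                            type data
dmu_tx_assigned                 4    1200
dmu_tx_memory_reclaim           4    590985
"""

counters = parse_kstat(sample)
print(counters["dmu_tx_memory_reclaim"])
```

On a live system one would pass the contents of the kstat file instead of `sample` and compare the value before and after applying the fixes; a counter that stops growing would match the behaviour described above.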
It might require tuning vmalloc and possibly a few module parameters in addition to using these patches before this works well. Here are some known values that have worked for people: vmalloc=384M |
I am not sure about |
This is a good point. I had blindly copied this from FreeBSD's recommendations. While they worked for people, more attention needs to be paid to what these settings are actually doing. I only posted them because they are known to work.
It might be useful to modify the output of
I honestly have no idea what I am reading in that. I will need to look at the code to figure it out. |
Regarding Also, is the default of 1/2 of the vmalloc region reasonable in practice, or should we add some code to size it smaller? If possible, I'd love for there to be sane defaults for both 32-bit and 64-bit systems.
I'm building a replacement server for a FreeNAS box. It's based on Ubuntu Server 10.04 64-bit with ZFS built from ppa:zfs-native/stable. I'm using rsync to copy data from an NFS share mounted on the Ubuntu box. The data is a video collection comprising 200+ GB of large media files. The rsync job runs for several hours and then hangs; top shows txg_sync at 100% CPU. I can't kill the process, can't unmount the pool, and can't soft reboot. The machine requires a power cycle to reboot.
kern.log:
Mar 11 06:39:23 neptune kernel: [50951.201236] BUG: soft lockup - CPU#1 stuck for 61s! [txg_sync:2089]
Mar 11 06:39:23 neptune kernel: [50951.201253] Modules linked in: appletalk ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp kvm_amd kvm snd_hda_codec_atihdmi fbcon tileblit font bitblit softcursor vga16fb vgastate zfs(P) zcommon(P) zunicode(P) znvpair(P) zavl(P) snd_hda_codec_realtek radeon ttm drm_kms_helper spl zlib_deflate ppdev snd_hda_intel snd_hda_codec snd_hwdep drm i2c_algo_bit parport_pc snd_pcm snd_timer lp snd soundcore snd_page_alloc i2c_piix4 edac_core edac_mce_amd shpchp parport ohci1394 ieee1394 r8169 mii pata_atiixp ahci
Mar 11 06:39:23 neptune kernel: [50951.201253] CPU 1:
Mar 11 06:39:23 neptune kernel: [50951.201253] Modules linked in: appletalk ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp kvm_amd kvm snd_hda_codec_atihdmi fbcon tileblit font bitblit softcursor vga16fb vgastate zfs(P) zcommon(P) zunicode(P) znvpair(P) zavl(P) snd_hda_codec_realtek radeon ttm drm_kms_helper spl zlib_deflate ppdev snd_hda_intel snd_hda_codec snd_hwdep drm i2c_algo_bit parport_pc snd_pcm snd_timer lp snd soundcore snd_page_alloc i2c_piix4 edac_core edac_mce_amd shpchp parport ohci1394 ieee1394 r8169 mii pata_atiixp ahci
Mar 11 06:39:23 neptune kernel: [50951.201253] Pid: 2089, comm: txg_sync Tainted: P 2.6.32-38-server #83-Ubuntu GA-MA785GM-US2H
Mar 11 06:39:23 neptune kernel: [50951.201253] RIP: 0010:[] [] __ticket_spin_lock+0x19/0x20
Mar 11 06:39:23 neptune kernel: [50951.201253] RSP: 0018:ffff880118cb3b60 EFLAGS: 00000202
Mar 11 06:39:23 neptune kernel: [50951.201253] RAX: 0000000000000000 RBX: ffff880118cb3b60 RCX: ffff8801288af8e8
Mar 11 06:39:23 neptune kernel: [50951.201253] RDX: 000000000000001c RSI: 0000000000000070 RDI: ffffc9002d9894c4
Mar 11 06:39:23 neptune kernel: [50951.201253] RBP: ffffffff81013c6e R08: 0000000000000000 R09: 000000000010a004
Mar 11 06:39:23 neptune kernel: [50951.201253] R10: ffffc9002d9891b8 R11: 0000000000000000 R12: ffff880118cb3ae0
Mar 11 06:39:23 neptune kernel: [50951.201253] R13: ffffffff8108c135 R14: ffff880118cb3af0 R15: ffffffff81019e29
Mar 11 06:39:23 neptune kernel: [50951.201253] FS: 00007f73bd089700(0000) GS:ffff880005280000(0000) knlGS:0000000000000000
Mar 11 06:39:23 neptune kernel: [50951.201253] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Mar 11 06:39:23 neptune kernel: [50951.201253] CR2: 00007f73b630125c CR3: 0000000001001000 CR4: 00000000000006e0
Mar 11 06:39:23 neptune kernel: [50951.201253] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 11 06:39:23 neptune kernel: [50951.201253] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar 11 06:39:23 neptune kernel: [50951.201253] Call Trace:
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? _spin_lock+0xe/0x20
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? __mutex_unlock_slowpath+0x25/0x60
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? mutex_unlock+0x1b/0x20
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? zio_add_child+0x100/0x120 [zfs]
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? zio_create+0x3dc/0x4a0 [zfs]
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? zio_free_sync+0x76/0x80 [zfs]
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? spa_free_sync_cb+0x43/0x60 [zfs]
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? spa_free_sync_cb+0x0/0x60 [zfs]
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? bplist_iterate+0x7b/0xb0 [zfs]
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? spa_sync+0x3cc/0x9a0 [zfs]
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? autoremove_wake_function+0x16/0x40
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? __wake_up+0x53/0x70
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? txg_sync_thread+0x225/0x3b0 [zfs]
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? txg_sync_thread+0x0/0x3b0 [zfs]
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? thread_generic_wrapper+0x68/0x80 [spl]
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? thread_generic_wrapper+0x0/0x80 [spl]
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? kthread+0x96/0xa0
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? child_rip+0xa/0x20
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? kthread+0x0/0xa0
Mar 11 06:39:23 neptune kernel: [50951.201253] [] ? child_rip+0x0/0x20
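When a kern.log is long, the soft lockup reports can be pulled out with a regular expression. The following is a minimal sketch that matches the `BUG: soft lockup` line format shown in the log above; the helper name `find_soft_lockups` is hypothetical, and other kernel versions may word the message differently.

```python
import re

# Matches e.g. "BUG: soft lockup - CPU#1 stuck for 61s! [txg_sync:2089]"
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text):
    """Return one dict per soft lockup report found in the given log text."""
    return [
        {
            "cpu": int(m.group("cpu")),
            "seconds": int(m.group("secs")),
            "comm": m.group("comm"),
            "pid": int(m.group("pid")),
        }
        for m in SOFT_LOCKUP.finditer(log_text)
    ]


line = (
    "Mar 11 06:39:23 neptune kernel: [50951.201236] "
    "BUG: soft lockup - CPU#1 stuck for 61s! [txg_sync:2089]"
)
lockups = find_soft_lockups(line)
print(lockups)
```

Filtering the result on `comm == "txg_sync"` would isolate the reports relevant to this issue.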