Hang during rsync #261
Comments
It looks like both processes are waiting on the open transaction, but it's not clear why. With what version are you able to reproduce this? Fixes for several deadlocks have been committed post-rc4. You can give them a try if you pull the latest master.
The version is on the first line of the log; it is the latest in the PPA.
I have disabled the L2ARC and the ZIL log and redone the rsync test. Unfortunately it fails again, but this time it fails due to the out-of-memory killer.
How much memory is in your system? At the moment things get a little dicey with ZFS if you have less than about 2GB.
The server was on ESXi, so I gave it 5GB this time. I repeat the settings:
Result: ZFS eats all memory again, this time after a longer period. It is strange that for the first hours memory consumption seems normal, then it goes crazy and eats all memory. I use the latest version from the PPA.
I get a similar issue on a testing backup server when using rsync. Interestingly, the problem does not occur when the heavy I/O is generated by a non-rsync process (specifically, the OpenAFS backup server). The server in question has 7GB of RAM for a RAIDz pool of 4x 750GB drives. Compression is enabled on the pool but deduplication is disabled. Kernel is v2.6.36 checked out from kernel.org. Please let me know if you need any more information.
Several deadlocks which would result in the symptoms you describe were recently fixed in the master branch. If you're able to consistently reproduce this issue, I'd appreciate it if you could try the latest source from GitHub and let me know if the issue remains. Thanks! a140dc5 - Fix mmap(2)/write(2)/read(2) deadlock
Thanks, I will try when these patches reach the PPA repository. I suppose the deadlocks you have found will help with 2 CPUs.
4 rsync streams between this Linux 233TB pool and one Solaris 233TB ZFS pool grind the system to a halt. The workaround seems to be a taskset command that assigns rsync to 2 physical processors of a 4-physical-processor system. After that, open ssh user interfaces will randomly hang for short periods, then recover; apparently interactive processes get scheduled randomly. AMD Opteron dual core, 4 physical CPUs, 8GB memory; CentOS 6 kernel update 2.6.32-71.29.1.el6.x86_64, SPL and ZFS 0.6.0-rc5. I realize none of this is technical, but rsyncs are in progress and I'm home on a VPN. The rsync hang is an open bug for the slightly past due rc6 and/or 0.6.0 stable release. FYI - we do have reasons for not doing zfs send/receive duplications. The CentOS system can dual boot Solaris, i.e. its original OS.
Rsyncs are running, currently set to run on processor #3 via taskset. Interactive processes do recover after hangs. I had 10 ssh sessions open today with a date/sleep 1 loop running in all shells. Generally, only one shell would hang at a time, for perhaps between 2 and 15 seconds.
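For anyone trying to reproduce the workaround, a minimal sketch of the pinning and the responsiveness check (the CPU number matches the comments above; the paths are placeholders, not the real ones):

# Pin rsync to a single CPU so the rest of the system stays responsive (workaround, not a fix)
taskset -c 3 rsync -av --delete-during /source/fs/ /backup/fs/

# Run in each ssh shell; when the loop stalls, that shell is hung
while true; do date; sleep 1; done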
I am still seeing this issue with the latest versions built from git. Setting the zfs_arc_max option to a relatively small number (1GB on a system with 6GB RAM in this case) helps to slow the failure. When running I/O-intensive loads that aren't an rsync (the AFS backup system that I'm running on the system is writing 250GB to a single large file), the ARC size as given by /proc/spl/kstat/zfs/arcstats quickly goes up to 1073845840 (almost exactly the value given to modprobe) and then holds at that size. When running an rsync, however, that size climbs steadily past the specified maximum (even if the rsync is just verifying that files have not changed) and keeps increasing until the system starts swapping, at which point it begins to lag badly. A workaround we have in place for the moment is to rsync a portion at a time and completely unload ZFS between the rsyncs, which frees the memory.
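For reference, the ARC cap described above is a module parameter; a sketch of how it might be set, assuming a 1GB limit (value in bytes):

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=1073741824

The value only takes effect when the zfs module is next loaded, so it needs a module reload or a reboot (or it can be passed directly on the modprobe command line).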
I agree. Putting a very small zfs_arc_max slows down the hanging of the system.
Not in a position to unload ZFS between rsyncs; other user processes need/use the pool. Just reporting what I, as an "ordinary" interactive user, would see, and reinforcing that the bug does exist and is noted for a fix in rc6. When fixed, it seems that rc6 and 0.6.0 will be stable enough for us to replace the Solaris OS. Looking to get ZFS and FUSE/LTFS (Linear Tape File System) on the same OS platform. We are looking to offload "deep/old" archived projects onto LTFS. Again, we have reasons not to do zfs send/receive to/from tape, just as we have reasons not to do zfs send/receive between standalone systems/pools.
Anyway, I have done a tar reading files and writing to /dev/null and the same problem occurs.
Can you please post a simple test case which can recreate this issue? While this is tagged to be fixed for 0.6.0-rc6, I haven't seriously investigated it yet. A test case would help a lot.
Happy to try and help. I'm a sysadmin, not a programmer. tstudios - Sam Allgood
If you can provide me enough information to recreate the issue locally, that would be ideal.
While not on this system, now dual booting Solaris and Linux, another ... tstudios - Sam Allgood
It's late here (EDT). Tomorrow I'll give you my exact commands. I'm running taskset -c 3 rsync ...; this is allowing recovery. I do see the recent comments about tar reads from zfs pools to /dev/null. tstudios - Sam Allgood
Brian,
Below is a synopsis of yesterday's rsyncs. Hopefully, enough information. tstudios - Sam Allgood
Kernel is stock from the latest yum update:
Rsync version:
Running the date/sleep 1 loop in 10 or so ssh login shells; I can see one or more hang at a time.
The task is to sync a Solaris zfs pool with 4 file systems (besides the pool itself).
The zfs list below comes from the Linux system and is indicative of the amount of data involved.
Hmmnnn... first error in dmesg? Presuming dmesg, it looks like a hang.
Okay, now the zfs list, rsync arguments, rsync summaries, and an indication of deletions:
NAME USED AVAIL REFER MOUNTPOINT
taskset -c 3 rsync -av --delete-during
sent 13798995 bytes received 190547787390 bytes 9751136.57 bytes/sec
[root@tsdpl-bu tsdpl-A-10G-backups]# cat log.0 | grep deleting | wc -l
taskset -c 3 rsync -av --delete-during
receiving incremental file list
sent 276728 bytes received 268401036 bytes 18871.13 bytes/sec
taskset -c 3 rsync -av --delete-during
sent 1662192 bytes received 83044609836 bytes 10524842.79 bytes/sec
[root@tsdpl-bu tsdpl-A-10G-gfx01]# cat log.0 | grep deleting | wc -l
taskset -c 3 rsync -av --delete-during
sent 84485 bytes received 1411890863 bytes 1125528.38 bytes/sec
[root@tsdpl-bu tsdpl-A-10G-gfx_archive]# cat log.0 | grep deleting | wc -l
rsyncing 1.5TB of data in a few large files (20 files total, ranging in size from 17GB to 140GB compressed; 18GB to 232GB uncompressed) is an example of an rsync that runs fine with no memory issues. rsyncing 60GB of data in ~2 million files (mail storage backups) will result in the arc cache ballooning beyond the defined maximum. The call stack when the system starts swapping is:
[134432.390032] Pid: 25143, comm: kworker/u:1 Tainted: P 2.6.36 #7
@Alphalead: It certainly sounds like the metadata limits aren't being enforced properly for the ARC. Have you checked /proc/spl/kstat/zfs/arcstats and /proc/spl/kmem/slab when it's in this state? I'd be interested to see what they report, and where the memory is being used.
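Something along these lines should capture the interesting numbers while the system is in that state (the field names are the usual arcstats entries; treat this as a sketch):

# ARC size and metadata accounting
grep -E '^(size|c_max|arc_meta_used|arc_meta_limit|arc_meta_max) ' /proc/spl/kstat/zfs/arcstats
# SPL slab usage, to see where kernel memory is going
cat /proc/spl/kmem/slab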
I seem to be hitting this problem too. Just default values on everything (I'll implement the suggestions above) except for zpool being created with ashift=12. Here are /proc/spl/kstat/zfs/arcstats and /proc/spl/kmem/slab with the machine effectively useless. rsync was copying from local ext4 to local zfs. Load average was about 18.5 but there was no I/O to speak of. There was 6900M or so of RAM allocated to this VM (a Xen dom0) with htop reporting ~5900 in use (but not by the applications, they can barely use 2GB). Killing rsync didn't help, and when I forced the machine to reboot I saw an rsync process segfault right before the actual system reboot (perhaps the open files got pulled out from under it?)
Interesting. So it looks like the bulk of the memory is being consumed by metadata in the ARC. It also looks like memory is getting fragmented, but it's not horrible yet. What's unclear is why all of those dnodes, inodes, and metadata buffers aren't being released to the system; they should be unless they are still considered to be in use. Have you tried forcing the caches to be dropped with drop_caches? This will cause everything which is unused to be reclaimed, which should be happening anyway. Also, have you seen this issue with any other apps besides rsync? Say a find in the directory, or something else which accesses all of these files.
echo 1 >/proc/sys/vm/drop_caches
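For example, a metadata-only traversal followed by the cache drop should show whether the unused ARC buffers are actually reclaimable (the path is a placeholder):

# Walk every inode/dnode without reading file data
find /tank/backup -type f -exec stat {} + > /dev/null
# Then ask the kernel to drop clean caches and watch arcstats/free memory
echo 1 > /proc/sys/vm/drop_caches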
I had the same problem with a tar that reads from zfs and writes to /dev/null.
OK, so I did:
It ate about 4GB of RAM; I killed it when there was about a gig left. I have, in /etc/modprobe.d/zfs.conf:
The filesystem options are:
didn't seem to do anything.
returns nothing
lets me watch memory usage fall by about 100MB/s until it's back down to where it started and the zpool unmounts. I grabbed the stats as above again; I'll post them if they'd have value (don't want to overly clutter).
will even do it, though to a lesser extent.
I'm not sure if this is the same thing, but here is what I get when I rsync many files to zfs:
Aug 21 20:49:00 anterovipunen kernel: [ 1320.307559] kswapd0 D ffff88007788c2c0 0 21 2 0x00000000
It works fine for a while and then just dies.
Hi - Not sure what's going on here, as I don't know much regarding the programming/debugging of ZFS, but I seem to experience this issue, with or without rsync. I never had to read many files on my system as I use ZFS for backup storage; however, I just had to restore things from the ZFS pool and it crashed after a while... rebooted... crashed after a while... Here is the error I found in dmesg/syslog: http://pastebin.com/jMTCNEFy If there is anything I can do to help, as far as testing goes, don't hesitate to let me know. Thanks. Edit: using git from 2011-08-22 on Debian Squeeze.
I think I started seeing this when I updated to rc5. I was running rc4 with the ashift patch previously, but I also updated from a 2.6.32 to a 2.6.40 kernel, so both things changed. I also changed the disks from 1.5TB to 2TB. The reporter's messages show v0.6.0.16, but mine shows only v0.6.0. As a workaround, I'm making a script that does each step of the backup and then a zpool export/import between steps.
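A rough sketch of that kind of wrapper, assuming a pool named "tank" and placeholder step scripts (both names are made up):

#!/bin/sh
# Run each backup step, then export and re-import the pool so ZFS releases
# its cached memory between steps (a workaround, not a fix).
set -e
for step in /root/backup-steps/*.sh; do
    sh "$step"
    zpool export tank
    zpool import tank
done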
I found another workaround to this issue. Running the following on the system while running long rsyncs keeps the memory usage down to the specified ARC sizes:
while true; do echo 1 >/proc/sys/vm/drop_caches; sleep 60; done
I'm not sure why this works while just running a single echo 1 >/proc/sys/vm/drop_caches doesn't free larger amounts of RAM when they are in use.
Alphalead: I think your trick allowed my rsync session to last longer, but after a while it crashed again, unfortunately. The console was flooded with repeated broadcasts:
Message from syslogd@stor01 at Sep 11 12:18:11 ...
Edit: version used is latest commit (2708f71).
I have not disappeared. I have tried all the versions from the PPA repository. With swap on, the system hangs. With swap off, it seems to keep working.
I add that after several hours the system definitively hangs with this error:
SLUB: Unable to allocate memory on node -1 (gfp=0x20)
This is very likely caused by issue #287, which we are testing a fix for now.
Excellent! Looking forward to both issues falling off the bug list. Sam Allgood
Closing issue; this should be fixed by the following two commits for #287.
I seem to be having this issue. I was doing a moderately large rsync and had the timeouts described above.
Is this relevant, or is there another bug I should be checking? Thanks.
@ioquatix This looks different. If you're seeing this in the master code go ahead and open a new issue. |
Hello,
During a big rsync between two ZFS filesystems my system hangs. I have tried to set primarycache=metadata; it only hangs later.
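(For reference, that property is set per dataset, with something like the following; "tank/backup" stands in for the real dataset name:)
zfs set primarycache=metadata tank/backup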
Here is a debug log:
Jun 2 14:41:27 backup1 kernel: [ 26.798015] ZFS: Loaded module v0.6.0.16, ZFS pool version 28, ZFS filesystem version 5
Jun 2 14:50:28 backup1 kernel: [ 594.852215] hrtimer: interrupt took 79628485 ns
Jun 2 14:52:33 backup1 kernel: [ 719.889678] INFO: task txg_sync:1709 blocked for more than 120 seconds.
Jun 2 14:52:33 backup1 kernel: [ 719.889795] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 2 14:52:33 backup1 kernel: [ 719.889896] txg_sync D 0000000000000000 0 1709 2 0x00000000
Jun 2 14:52:33 backup1 kernel: [ 719.889901] ffff88012788bbd0 0000000000000046 ffff88012788bfd8 ffff88012788a000
Jun 2 14:52:33 backup1 kernel: [ 719.889904] 0000000000013d00 ffff8801368c83b8 ffff88012788bfd8 0000000000013d00
Jun 2 14:52:33 backup1 kernel: [ 719.889908] ffff880135890000 ffff8801368c8000 ffff88012788bbd0 ffffc90015602c68
Jun 2 14:52:33 backup1 kernel: [ 719.889911] Call Trace:
Jun 2 14:52:33 backup1 kernel: [ 719.889955] [] cv_wait_common+0x77/0xd0 [spl]
Jun 2 14:52:33 backup1 kernel: [ 719.889973] [] ? autoremove_wake_function+0x0/0x40
Jun 2 14:52:33 backup1 kernel: [ 719.889980] [] __cv_wait+0x13/0x20 [spl]
Jun 2 14:52:33 backup1 kernel: [ 719.890090] [] zio_wait+0xc3/0x170 [zfs]
Jun 2 14:52:33 backup1 kernel: [ 719.890116] [] dsl_pool_sync+0xd9/0x4a0 [zfs]
Jun 2 14:52:33 backup1 kernel: [ 719.890136] [] spa_sync+0x3ab/0x9b0 [zfs]
Jun 2 14:52:33 backup1 kernel: [ 719.890139] [] ? autoremove_wake_function+0x16/0x40
Jun 2 14:52:33 backup1 kernel: [ 719.890144] [] ? __wake_up+0x53/0x70
Jun 2 14:52:33 backup1 kernel: [ 719.890167] [] txg_sync_thread+0x241/0x3c0 [zfs]
Jun 2 14:52:33 backup1 kernel: [ 719.890189] [] ? txg_sync_thread+0x0/0x3c0 [zfs]
Jun 2 14:52:33 backup1 kernel: [ 719.890195] [] thread_generic_wrapper+0x78/0x90 [spl]
Jun 2 14:52:33 backup1 kernel: [ 719.890200] [] ? thread_generic_wrapper+0x0/0x90 [spl]
Jun 2 14:52:33 backup1 kernel: [ 719.890203] [] kthread+0x96/0xa0
Jun 2 14:52:33 backup1 kernel: [ 719.890210] [] kernel_thread_helper+0x4/0x10
Jun 2 14:52:33 backup1 kernel: [ 719.890213] [] ? kthread+0x0/0xa0
Jun 2 14:52:33 backup1 kernel: [ 719.890216] [] ? kernel_thread_helper+0x0/0x10
Jun 2 14:54:33 backup1 kernel: [ 839.634103] INFO: task txg_sync:1709 blocked for more than 120 seconds.
Jun 2 14:54:33 backup1 kernel: [ 839.634187] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 2 14:54:33 backup1 kernel: [ 839.634286] txg_sync D 0000000000000000 0 1709 2 0x00000000
Jun 2 14:54:33 backup1 kernel: [ 839.634291] ffff88012788bbd0 0000000000000046 ffff88012788bfd8 ffff88012788a000
Jun 2 14:54:33 backup1 kernel: [ 839.634294] 0000000000013d00 ffff8801368c83b8 ffff88012788bfd8 0000000000013d00
Jun 2 14:54:33 backup1 kernel: [ 839.634298] ffff880135890000 ffff8801368c8000 ffff88012788bbd0 ffffc90015602c68
Jun 2 14:54:33 backup1 kernel: [ 839.634301] Call Trace:
Jun 2 14:54:33 backup1 kernel: [ 839.634317] [] cv_wait_common+0x77/0xd0 [spl]
Jun 2 14:54:33 backup1 kernel: [ 839.634323] [] ? autoremove_wake_function+0x0/0x40
Jun 2 14:54:33 backup1 kernel: [ 839.634328] [] __cv_wait+0x13/0x20 [spl]
Jun 2 14:54:33 backup1 kernel: [ 839.634365] [] zio_wait+0xc3/0x170 [zfs]
Jun 2 14:54:33 backup1 kernel: [ 839.634384] [] dsl_pool_sync+0xd9/0x4a0 [zfs]
Jun 2 14:54:33 backup1 kernel: [ 839.634404] [] spa_sync+0x3ab/0x9b0 [zfs]
Jun 2 14:54:33 backup1 kernel: [ 839.634406] [] ? autoremove_wake_function+0x16/0x40
Jun 2 14:54:33 backup1 kernel: [ 839.634410] [] ? __wake_up+0x53/0x70
Jun 2 14:54:33 backup1 kernel: [ 839.634431] [] txg_sync_thread+0x241/0x3c0 [zfs]
Jun 2 14:54:33 backup1 kernel: [ 839.634451] [] ? txg_sync_thread+0x0/0x3c0 [zfs]
Jun 2 14:54:33 backup1 kernel: [ 839.634456] [] thread_generic_wrapper+0x78/0x90 [spl]
Jun 2 14:54:33 backup1 kernel: [ 839.634461] [] ? thread_generic_wrapper+0x0/0x90 [spl]
Jun 2 14:54:33 backup1 kernel: [ 839.634463] [] kthread+0x96/0xa0
Jun 2 14:54:33 backup1 kernel: [ 839.634467] [] kernel_thread_helper+0x4/0x10
Jun 2 14:54:33 backup1 kernel: [ 839.634469] [] ? kthread+0x0/0xa0
Jun 2 14:54:33 backup1 kernel: [ 839.634471] [] ? kernel_thread_helper+0x0/0x10
Jun 2 14:56:33 backup1 kernel: [ 959.380537] INFO: task rsync:3180 blocked for more than 120 seconds.
Jun 2 14:56:33 backup1 kernel: [ 959.380637] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 2 14:56:33 backup1 kernel: [ 959.380765] rsync D 0000000000000000 0 3180 3179 0x00000000
Jun 2 14:56:33 backup1 kernel: [ 959.380771] ffff88012c685b88 0000000000000082 ffff88012c685fd8 ffff88012c684000
Jun 2 14:56:33 backup1 kernel: [ 959.380776] 0000000000013d00 ffff8801263cc858 ffff88012c685fd8 0000000000013d00
Jun 2 14:56:33 backup1 kernel: [ 959.380780] ffff880125d8c4a0 ffff8801263cc4a0 ffff88012c685ba8 ffff88012cf4c330
Jun 2 14:56:33 backup1 kernel: [ 959.380784] Call Trace:
Jun 2 14:56:33 backup1 kernel: [ 959.380804] [] cv_wait_common+0x77/0xd0 [spl]
Jun 2 14:56:33 backup1 kernel: [ 959.380811] [] ? autoremove_wake_function+0x0/0x40
Jun 2 14:56:33 backup1 kernel: [ 959.380818] [] __cv_wait+0x13/0x20 [spl]
Jun 2 14:56:33 backup1 kernel: [ 959.380866] [] txg_wait_open+0x73/0xb0 [zfs]
Jun 2 14:56:33 backup1 kernel: [ 959.380886] [] dmu_tx_wait+0xed/0xf0 [zfs]
Jun 2 14:56:33 backup1 kernel: [ 959.380914] [] zfs_write+0x363/0xbd0 [zfs]
Jun 2 14:56:33 backup1 kernel: [ 959.380929] [] ? do_sync_read+0xd2/0x110
Jun 2 14:56:33 backup1 kernel: [ 959.380956] [] zpl_write_common+0x52/0x80 [zfs]
Jun 2 14:56:33 backup1 kernel: [ 959.380981] [] zpl_write+0x68/0xa0 [zfs]
Jun 2 14:56:33 backup1 kernel: [ 959.380984] [] vfs_write+0xc6/0x180
Jun 2 14:56:33 backup1 kernel: [ 959.380987] [] sys_write+0x51/0x90
Jun 2 14:56:33 backup1 kernel: [ 959.380991] [] system_call_fastpath+0x16/0x1b