More aggressively maintain arc_meta_limit (WIP) #3181
Conversation
Yes, please! I will be trying this.
I wanted to provide some feedback: I run a backup server that isn't vital, and I usually run the latest development version of ZFS on it. Between this and the patches from the last few days, this is the first time my backup server has been able to complete a backup in over a month without crashing. Also, one of the backups is of a mail server using maildir format, so there can be thousands of small files in a directory. I was having problems with the backup taking 9 to 12 hours, and I had been trying to figure out where the problem was. It now takes 35 minutes. That's back to speeds I was seeing under the 0.6.3 release. I am looking forward to seeing how well it runs over the next week.
OK, I spoke too soon. Although things ran fine for the last few days and for the three tests I ran by hand today, when I ran tonight with this patch applied, my system lasted about 15 minutes before my rsyncs died, and arc_adapt is now taking up 100% of one of my CPUs. According to my graphs, which pull data from /proc/spl/kstat/zfs, c_max is 8338591744, but the ARC size suddenly jumped to 10842942216 when the CPU spiked. Also, arc_meta_used went up to 6780346632 while arc_meta_limit is 6253943808. There are no messages in syslog, and I can't strace the arc_adapt process. The system is still running, so I will leave it up for a day or two if you need any more information.
Perhaps this pull request and some tweaks in prefetching could lead to a long-term solution? #1932 and #2840 are most likely related issues, and DeHackEd/zfs@2827b75 could be another part of the solution. According to @dweeezil, when reporting information about issues, posting /proc/spl/kstat/zfs/zfetchstats could help in getting this resolved.
@kernelOfTruth That patch is just another mitigation tactic. Also, a value over 3000 causes kmem allocation issues for me, so it needs to be reduced or made into a vmalloc call.
I wonder what behavior could be observed when disabling the arc_p adapt dampener (zfs_arc_p_dampener_disable) - thus setting it to 0 - and using this patchset. That should make the ARC adapt more dynamically, no? (Not sure I understand arc_p correctly.)
* zfs_arc_meta_prune now takes a number of objects to scan.
* Scan 10,000 objects per batch in the dentry/inode caches when attempting to reclaim in order to honor arc_meta_limit.
* Retry immediately in arc_adjust_meta().
* Fix arc_meta_max accordingly; it should be updated in consume, not return.

Signed-off-by: Brian Behlendorf [email protected]
Issue openzfs#3160

zfs_arc_meta_prune raised to 12500
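The last bullet, moving the arc_meta_max update into the consume path, is plain high-water-mark tracking. A minimal sketch of the idea (names are illustrative stand-ins, not the actual ARC accounting functions):

```c
#include <stdint.h>

/*
 * Illustrative only -- simplified stand-ins, not the actual ARC
 * accounting code.  arc_meta_max is a high-water mark of
 * arc_meta_used, so it has to be bumped where metadata space is
 * consumed, not where it is returned.
 */
static uint64_t meta_used;	/* stands in for arc_meta_used */
static uint64_t meta_max;	/* stands in for arc_meta_max */

static void
meta_space_consume(uint64_t bytes)
{
	meta_used += bytes;
	if (meta_used > meta_max)
		meta_max = meta_used;	/* record the peak here */
}

static void
meta_space_return(uint64_t bytes)
{
	meta_used -= bytes;		/* no high-water update on free */
}
```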
@DeHackEd not even 512 works fine here. I tried out 2048 before - that would show the messages almost constantly. With 512 there's still: … so that setting seems to be maxed out at 256.
With over 3000 I was able to mount on startup when memory was almost entirely free, but … Anyway, we're getting off topic for this pull request. I'm going to try installing this on one of my backup hosts that exceeds its meta limits every night (though not so much that it breaks). We'll see how this goes.
@DeHackEd agreed, I'm currently evaluating what tweaks and settings might help in combination with this pull request. Right now I'm doing a small backup where rsync has to hit lots of files, and SUnreclaim stays close to the set limit of 4 GB (zfs_arc_max=0x100000000) - it seems to hover nicely around that value. On Sunday I can most probably say more (after transferring ~3 TB via rsync).
Let me add a little more explanation about what part of the problem is for metadata-heavy workloads. This is something which has been understood for some time, but the right way to handle it still needs some investigation. It's also something which impacts all the ZFS implementations to various degrees. At a high level, this is roughly how the data structures are arranged in memory. At the very top are the dentries, which contain the path information; there can be multiple of them pointing to a single inode, and each holds a reference on that inode. In turn, that inode holds a reference on a dnode_t, and those dnode_t's are arranged in groups of 32 and stored in a 16k data buffer. All of this, and a bit more, is considered metadata for accounting purposes.
Now part of the problem is that the dnode_dbufs can't be freed until all 32 references on them held by the dnode_t's are dropped. And those dnode_t's can't be dropped until the references held on them by the inodes are dropped. And the inodes can't be dropped until the references held on them by the dentries are dropped. Which means that a single dentry can end up pinning a fairly large amount of memory. What happens today is that when the meta limit is reached we ask the VFS to free dentries off its per-filesystem LRU. The VFS scans N entries on its LRU and then frees some number of dentries and inodes. Then the ARC is traversed looking for buffers which can now be freed. The process gets repeated until the ARC drops under the meta limit. This normally works very well when the fraction of the ARC consumed by metadata is low. However, if metadata is responsible for the vast majority of the ARC there are some issues.
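To make the pinning chain above concrete, here is a schematic sketch (simplified types only, not the real dentry/inode/dnode code) showing why one cached dentry can keep a whole 16k dnode buffer in memory:

```c
#include <stdint.h>

/*
 * Schematic only -- simplified types, not the real dentry/inode/dnode
 * structures.  The point is the chain of holds: dentry -> inode ->
 * dnode_t -> shared 16k dnode buffer.
 */
#define DNODES_PER_BUF	32

struct example_dnode_buf {		/* 16k metadata buffer */
	int	holds;			/* one hold per resident dnode_t */
	char	data[16 * 1024];
};

struct example_dnode {			/* dnode_t */
	struct example_dnode_buf *db;	/* holds a ref on the 16k buffer */
	int	holds;			/* holds from inodes */
};

struct example_inode {			/* inode (znode) */
	struct example_dnode *dn;	/* holds a ref on the dnode_t */
	int	count;			/* holds from dentries */
};

struct example_dentry {			/* path information */
	struct example_inode *ip;	/* holds a ref on the inode */
};

/*
 * The 16k buffer is only evictable once all 32 of its dnode_t's have
 * dropped their holds, which requires the inodes above them (and the
 * dentries above those) to be released first.  A single cached dentry
 * can therefore pin the inode, the dnode_t, and the entire shared 16k
 * buffer the dnode lives in.
 */
```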
Specifically for this patch, if you're still seeing prohibitively large amounts of CPU time being spent in arc_adapt, try increasing … Hopefully, this sheds some light on what's going on here and how you might be able to tune things for your workloads. Any feedback you can provide from your real workloads would be appreciated.
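For gathering that kind of feedback, the counters discussed in this thread can be read straight from /proc/spl/kstat/zfs/arcstats. A small userspace sketch (kept deliberately minimal; the field names are the ones quoted by commenters above):

```c
#include <stdio.h>
#include <string.h>

/*
 * Print a few ARC metadata counters from the kstat file.  The file
 * has "name type data" rows, e.g. "arc_meta_used 4 6780346632".
 */
int
main(void)
{
	FILE *f = fopen("/proc/spl/kstat/zfs/arcstats", "r");
	char line[256], name[64];
	unsigned long long val;

	if (f == NULL) {
		perror("arcstats");
		return (1);
	}
	while (fgets(line, sizeof (line), f) != NULL) {
		if (sscanf(line, "%63s %*u %llu", name, &val) != 2)
			continue;
		if (strcmp(name, "size") == 0 ||
		    strcmp(name, "c_max") == 0 ||
		    strcmp(name, "arc_meta_used") == 0 ||
		    strcmp(name, "arc_meta_limit") == 0 ||
		    strcmp(name, "arc_meta_max") == 0)
			printf("%-16s %llu\n", name, val);
	}
	fclose(f);
	return (0);
}
```

Run periodically (e.g. from cron or a Cacti script), this produces the same numbers people are graphing in this thread.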
Here's the output of /proc/spl/kstat/zfs/zfetchstats:
I tried setting zfs_arc_p_dampener_disable and zfs_arc_meta_prune in modprobe.d, but it didn't work, so I echoed the values into /sys/module/zfs/parameters instead. It looks like they're set. I'm starting the backups again.
I'll deploy this tonight on one server, which so far has been suffering the most.
Performing rsync on a directory of a million files (well, at least 700,000) still makes it overflow the limit. Now arc_adapt is pegging the CPU (oddly it wasn't before) and arc_prune is climbing impressively fast. So for me it's working, but it's an uphill battle or even a lost cause...
Looks like it's been added in 2.6.36, which is newer than the RHEL6 kernel, and I doubt it's been backported. So I'm guessing it's not the best idea to put this into production on a RHEL6 kernel yet, is it?
@behlendorf But in conjunction with my 5-minute dcache flushing from cron, I guess it shouldn't be that big of a problem that the per-sb reclaim isn't there. I'm thinking about lowering the flushing period to 2 minutes.
It's still early in the transfer operation, but this patch really seems to help; sys load according to atop is 50-85%. Since most of you affected by the ARC growing seem to be reading this too, I used a combination of the following changes: zfs_arc_p_dampener_disable set to 0, plus modified values of zfs_arc_max & zfs_arc_meta_limit. arc_evict_iterations is replaced with zfs_arc_evict_batch_limit in #3115, which seems to address - at least partly - the issue with the ARC that @behlendorf mentioned, if I understood it correctly. So that could also be worth tinkering with until #3115 has been merged, or if anyone has tried out #3115 - unfortunately it wouldn't build for me. @behlendorf thanks for this patch! So far the ARC (thus SUnreclaim) stays close to the pre-set value of 3-4 GB, and other_size in /proc/spl/kstat/zfs/arcstats also isn't unreasonably large anymore - so far so good!
@snajpa my only concern is that it might end up spinning, so we'd need to completely disable it in that case. I was going to spend some time looking at the RHEL6 kernel and see if there isn't a reasonable alternative interface we could use there. I should also mention that my test workload for this was creating a directory with 10 million files in it, then doing an …
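A rough userspace sketch of that kind of metadata-heavy reproducer, for anyone who wants to generate similar pressure; the directory name and default file count are assumptions, not @behlendorf's exact test:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create many empty files in a single directory to build up a large
 * dentry/inode/dnode footprint; listing the directory afterwards
 * (e.g. with ls -l or a readdir loop) pulls the metadata back
 * through the ARC.
 */
int
main(int argc, char **argv)
{
	long i, nfiles = (argc > 1) ? atol(argv[1]) : 1000000;
	char path[256];

	(void) mkdir("many", 0755);	/* ignore EEXIST on reruns */

	for (i = 0; i < nfiles; i++) {
		int fd;

		(void) snprintf(path, sizeof (path), "many/file-%ld", i);
		fd = open(path, O_CREAT | O_WRONLY, 0644);
		if (fd < 0) {
			perror(path);
			return (1);
		}
		(void) close(fd);
	}
	(void) printf("created %ld files\n", nfiles);
	return (0);
}
```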
@behlendorf Do you think limiting the number of retries would help? That's trivial to add and I still have some time before the planned outage.
It's still too early to tell with mine, but I am already seeing that the sizes of the ARC and the metadata are exceeding their max values... and as I was typing, the arc_adapt process suddenly shot up to 100% and is pegged. I should have mentioned that I'm running Fedora 20 with the 3.17.8-200.fc20.x86_64 kernel. Also, on a partial tangent, I use Cacti and some scripts to graph the sizes of the ARC data & metadata as well as cache hits. Is there anything useful I can add to my graphing that might help with debugging?
@snajpa yes, and I was thinking about refreshing the patch with that tweak anyway. If you add it I'd suggest making it a run-time tunable with a module option. That would then let you disable it if needed.
@behlendorf vpsfreecz@81e3ef7 does this look ok to you?
Looks reasonable.
I've modified it a bit so that setting zfs_arc_adjust_meta_restarts to 0 disables the restarts; decrementing it outright at the beginning of the function would lead to an overflow of the 'restarts' variable when it's set to 0 :) I'm just a beginner with C still :)
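A minimal model of the bounded-restart logic described here, using the tunable name from this comment; the real arc_adjust_meta() is considerably more involved, so treat this purely as a sketch of the counter handling:

```c
#include <stdint.h>

/*
 * Minimal model of the bounded-restart idea: 0 disables restarts
 * entirely, and the counter is only decremented after the check so
 * an unsigned value of 0 never wraps around.
 */
static unsigned int zfs_arc_adjust_meta_restarts = 4096;

static uint64_t
adjust_meta_model(uint64_t meta_used, uint64_t meta_limit,
    uint64_t (*prune_one_pass)(void))
{
	unsigned int restarts = zfs_arc_adjust_meta_restarts;

	while (meta_used > meta_limit) {
		uint64_t freed = prune_one_pass();

		meta_used = (freed > meta_used) ? 0 : meta_used - freed;
		if (restarts == 0)
			break;	/* stop retrying; let arc_adapt move on */
		restarts--;
	}
	return (meta_used);
}
```

Once the counter runs out, arc_adapt can get back to its other work (including picking up changed parameters such as arc_c_max) instead of spinning inside the metadata adjustment, which is the behavior described later in the thread.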
Well, arc_adapt finally stopped running; it didn't do that last night. However, the ZFS filesystems appear to have frozen.
I'm testing this in lab conditions, recursively creating directories and lots of files in them, while an endless recursive listing of the structure runs.
Thank you @behlendorf, building updated dailies with this and #3161 for good measure ;)
@behlendorf thank you! It's still too early to say, but I have observed that the node where I've just deployed this overgrew the limit by ~120 MB, and after a few tens of seconds arc_meta_used returned to the _limit value. I'll report back tomorrow, after the current round of back-ups ends.
Unfortunately, it looks like it hasn't made much difference. arc_meta_used is now 2G over the limit and growing like crazy. From what I've seen, when arc_meta_used is near arc_meta_limit it balances things fine, but at some point it just blows up and never comes back. I wonder why that is.
Sorry about the sloppy comment, I was a bit rushed. @snajpa thanks for the feedback. I've got some thoughts about that; let me set up a more complicated test case tomorrow and refine this patch.
It seemed OK for about 15 minutes, then arc_adapt went up to 100% again. IO started getting slower and slower, arc_meta_used shot up over arc_meta_limit, and things are slowly grinding to a halt. It looks like the same thing @snajpa reported.
@behlendorf here's arcstats from node10 - arc_meta_used is now 41G over the limit :( Another interesting thing is that when I add mru_size + mru_ghost_size + mfu_size + mfu_ghost_size, I'm at 264 GB, which is total nonsense; arc_size is reported to be at my limit, 64G. Should I create a separate issue for this?
And another related problem: sometimes (esp. when over the limit with _meta, but that might not be related), echoing a new value to zfs_arc_max doesn't do a thing - it doesn't change arc_c_max in arcstats like it should.
In light of the extensive beating this PR is getting from @snajpa (KUDOS!), I will stop torturing my box and put it to some useful use instead of forcing deadlocks. Please buzz me whenever there is a new patch to try.
Actually, for my load the patch might even have worked: http://www.onerussian.com/tmp/zfs_stats_utilization-day.png - the ARC size was growing and then started dropping. I will reinitiate the testing to get closure on that.
@behlendorf so here's my most recent news:
@behlendorf more 'good' news - the node locked up after 6 hrs of uptime. The output of the stack traces in syslog is scrambled, but the most frequently repeating pattern among stacks of processes in D state is as follows (I think I might even have hit a bug with perf; perf has long been off and it's still visible in the traces):
I know that this one is "no good" and thus was closed, but I kept running its new version until now, and I have 100% CPU arc_adapt again/as well. And here are the dynamics according to munin: http://www.onerussian.com/tmp/zfs_stats_utilization-day-20150319.png Update #1: system information: http://www.onerussian.com/tmp//zfs_system_details_20150319/
Reopening this. I didn't mean to close it, even though it clearly still needs some work.
Revert "More aggressively maintain arc_meta_limit (WIP) openzfs#3181" This reverts commit 8135db5.
@behlendorf I just noticed there's an obvious way to see the arc_adapt stack while it's stuck in D state. It's been like this for hours now.
@behlendorf my investigation leads me to the conclusion that something has registered either inotify or dnotify events on some inode we're trying to free, and until that is released the wait_event will hang there, because s_fsnotify_marks will never be zero. Does this look about right?
@behlendorf invalidate_inodes is most likely not meant to be used in any context other than unmounting the filesystem, once there are no users of it anymore. Though I might be wrong.
@behlendorf I will try to deploy this with invalidate_inodes out of the picture (vpsfreecz@cbc4156), along with limiting the number of restarts in arc_adjust_meta() (vpsfreecz@d44b4f2).
@behlendorf I declare victory! Back-ups are now running and arc_meta_used is in bounds. Per @ryao's advice I've temporarily removed the L2ARC devices - he suspects there's a bug in the code; I'm just being cautious.
Btw, I would strongly advocate for merging the limit on arc_adjust_meta restarts if you're going to merge this. Without it, arc_adapt might not get around to running again for a long time, which makes changing ARC parameters like arc_c_max next to impossible. I've found 4096 restarts to be sufficient even for my heavy metadata-hammering workload.
@snajpa Great job & congrats! 👍
Awesome! @snajpa thanks for pushing this one over the finish line! It's great to know this resolves the problems you were seeing, even for your workload. I'll integrate your fixes and the other feedback I've gotten into the pull request and refresh it. Then we can get this merged.
@snajpa @kernelOfTruth @yarikoptic @chrisrd @DeHackEd @angstymeat I've opened pull request #3202 with a refreshed version of this patch stack which incorporates @snajpa's improvements and @chrisrd's feedback. It would be great if you could further test this change, but based on the latest feedback it's looking very good. Unless the buildbots uncover an issue I'd like to merge this fairly soon.