
More aggressively maintain arc_meta_limit (WIP) #3181

Closed
wants to merge 3 commits

Conversation

behlendorf
Contributor

  • zfs_arc_meta_prune now takes a number of objects to scan.
  • Scan 10,000 objects per batch in the dentry/inode caches
    when attempting to reclaim in order to honor arc_meta_limit.
  • Retry immediately in arc_adjust_meta().
  • Fix arc_meta_max accounting; it should be updated in consume,
    not return.

Signed-off-by: Brian Behlendorf [email protected]
Issue #3160

@DeHackEd
Contributor

Yes, please! I will be trying this.

@angstymeat

I wanted to provide some feedback:

I run a backup server that isn't vital, and I usually run the latest development version of ZFS on it.

Between this and the patches from the last few days, this is the first time my backup server has been able to complete a backup in over a month without crashing.

Also, one of the backups is of a mail server using maildir format, so there can be thousands of small files in a directory. I was having problems with the backup taking 9 to 12 hours, and I had been trying to figure out where the problem was. It now takes 35 minutes. That's back to speeds I was seeing under the 0.6.3 release.

I am looking forward to seeing how well it runs over the next week.

@angstymeat

OK, I spoke too soon. While it ran OK for the last few days and for the three tests I ran by hand today, tonight with this patch applied my system ran for about 15 minutes before my rsyncs died, and arc_adapt is now taking up 100% of one of my CPUs.

According to my graphs which pull data from /proc/spl/kstat/zfs, c_max is 8338591744, but the arc size suddenly jumped to 10842942216 when the CPU spiked. Also, arc_meta_used went up to 6780346632 while arc_meta_limit is 6253943808.

There are no messages in syslog, and I can't strace the arc_adapt process. The system is still running, so I will let it stay up for a day or two if you need any more information.
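For anyone wanting to watch the same counters, a minimal sketch that pulls them straight out of the kstat file (the file's columns are name, type, data, as the full dumps later in this thread show):

awk '$1 ~ /^(c_max|size|arc_meta_used|arc_meta_limit)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats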

@kernelOfTruth
Contributor

perhaps this pull request and some tweaks in prefetching could lead to a long-term solution?

#1932 and #2840 are most likely related issues

DeHackEd/zfs@2827b75 could be another part in the solution

according to @dweeezil, posting /proc/spl/kstat/zfs/zfetchstats when reporting information about these issues could help in getting this resolved

@DeHackEd
Contributor

@kernelOfTruth That patch is just another mitigation tactic. Also, a value over 3000 causes kmem allocation issues for me, so it needs to be reduced or made into a vmalloc call.

@kernelOfTruth
Contributor

@DeHackEd I see, yes; I didn't know that the original value (or at least the value before that) was 64.

Therefore referencing #676 and #1101, where the setting was raised.

@kernelOfTruth
Contributor

I wonder what behavior could be observed when disabling the arc_p adapt dampener (zfs_arc_p_dampener_disable, thus setting it to 0) and using this patchset.

This should make the ARC adapt more dynamically, no?

(Not sure if I understand arc_p correctly.)

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this pull request Mar 14, 2015
* zfs_arc_meta_prune now takes a number of objects to scan.

* Scan 10,000 objects per batch in the dentry/inode caches
when attempting to reclaim in order to honor arc_meta_limit.

* Retry immediately in arc_adjust_meta().

* Fix arc_meta_max accounting; it should be updated in consume,
not return.

Signed-off-by: Brian Behlendorf [email protected]
Issue openzfs#3160

--------------------------------------------------------

zfs_arc_meta_prune raised to 12500
@kernelOfTruth
Contributor

@DeHackEd Not even 512 works fine here; I tried 2048 before, and that would show the messages almost constantly.

with 512 there's still:

http://pastebin.com/cWm5Hvn0

so that setting seems to be maxed out at 256

@DeHackEd
Contributor

With over 3000 I was able to mount on startup when memory was almost entirely free, but using zfs create after a day would fail since it can't allocate the object out of contiguous memory.

Anyway, we're getting off topic for this pull request. I'm going to try installing this into one of my backup hosts that exceeds its meta limits every night (though not so much that it breaks). We'll see how this goes.

@kernelOfTruth
Contributor

@DeHackEd agreed, I'm currently evaluating what tweaks and settings might help in combination with this pull-request

right now I'm doing a small backup where rsync has to hit lots of files, and SUnreclaim stays close to the set limit of 4 GB (zfs_arc_max=0x100000000); it seems to hover nicely around that value

on Sunday I can most probably say more (after transferring ~3 TB via rsync)
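For reference, that SUnreclaim counter comes straight from /proc/meminfo (full dumps appear later in this thread) and can be watched with:

grep -E '^(Slab|SReclaimable|SUnreclaim):' /proc/meminfo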

@behlendorf
Contributor Author

Let me add a little more explanation about what part of the problem is for metadata-heavy workloads. This is something which has been understood for some time, but the right way to handle it still needs some investigation. It's also something which impacts all the ZFS implementations to various degrees.

At a high level this is roughly how the data structures are arranged in memory. At the very top are the dentries, which contain the path information; there can be multiple of them pointing to a single inode, and each holds a reference on that inode. In turn, that inode holds a reference on a dnode_t, and those dnode_t's are arranged in groups of 32 and stored in a 16k data buffer. All of this, and a bit more, is considered metadata for accounting purposes.

dentry --+
dentry --+--> inode/znode --> dnode_t --+
dentry --+                              |
                                        |
dentry --+--> inode/znode --> dnode_t --+--> dnode_dbuf (holds 32 dnode_t's)
...           ...             ...       |    
dentry --+--> inode/znode --> dnode_t --+

Now part of the problem is that the dnode_dbufs can't be freed until all 32 references on them held by the dnode_t's are dropped. And those dnode_t's can't be dropped until the references held on them by the inodes are dropped. And the inodes can't be dropped until the references held on them by the dentries are dropped. Which means that a single dentry can end up pinning a fairly large amount of memory.

What happens today is that when the meta limit is reached we ask the VFS to free dentries off its per-filesystem LRU. The VFS scans N entries on its LRU and then frees some number of dentries and inodes. Then the ARC is traversed looking for buffers which can now be freed. The process gets repeated until the ARC drops under the meta limit.

Now this normally works very well when the fraction of the ARC consumed by meta data is low. However, if meta data is responsible for the vast majority of the ARC there are some issues.

  • By default the generic VFS code frees dentries off an LRU which has no knowledge of the underlying ARC buffers. That means the dentries aren't freed in the most efficient/intelligent way to ensure dnode_dbufs are released quickly. It's definitely worth exploring abandoning the generic Linux code here and performing ZFS-aware dentry/inode pruning.
  • The dnode_t's, znode_t's and 16K dbufs are all on slabs to reduce allocation/free costs for the common case. However, this means that until all of the objects on a given slab are freed no memory is actually reclaimed. The illumos slab implementation contains some complicated code to repack dnodes into fewer slabs. However, that comes with its own set of advantages and disadvantages and isn't obviously a clear win in all cases.

Specifically for this patch, if you're still seeing prohibitively large amounts of CPU time being spent in arc_adapt, try increasing zfs_arc_meta_prune to 100,000 or higher. This will cause more dentries/inodes to be scanned and pruned from the VFS prior to evicting ARC buffers. This may help significantly, particularly on large-memory machines, where scanning the entire ARC and not finding anything to be freed is very CPU intensive.
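If you want to try that, the tunable can be changed on a live system like any other ZoL module option (the /sys/module/zfs/parameters path is the standard one; 100,000 is just the value suggested above):

echo 100000 > /sys/module/zfs/parameters/zfs_arc_meta_prune
cat /sys/module/zfs/parameters/zfs_arc_meta_prune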

Hopefully this sheds some light on what's going on here and how you might be able to tune things for your workloads. Any feedback you can provide from your real workloads would be appreciated.

@behlendorf behlendorf added this to the 0.6.4 milestone Mar 14, 2015
@angstymeat

Here's the output of /proc/spl/kstat/zfs/zfetchstats:

hits 4 31918994
misses 4 12720308
colinear_hits 4 45432
colinear_misses 4 12674876
stride_hits 4 30361336
stride_misses 4 622213
reclaim_successes 4 28062
reclaim_failures 4 12646814
streams_resets 4 49559
streams_noresets 4 1557658
bogus_streams 4 0

I tried setting zfs_arc_p_dampener_disable and zfs_arc_meta_prune in modprobe.d, but it didn't work, so I echoed the values into /sys/module/zfs/parameters instead. It looks like they're set.
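A likely reason the modprobe.d route appears not to work is that the options file is only read when the zfs module is loaded (for example, from an initramfs built before the file was edited), whereas echoing into sysfs applies immediately. Roughly, using the prune value suggested earlier and the dampener value mentioned above as placeholders:

# /etc/modprobe.d/zfs.conf -- only applied when the module is (re)loaded;
# rebuild the initramfs if zfs is loaded from there
options zfs zfs_arc_meta_prune=100000 zfs_arc_p_dampener_disable=0

# on a live system, the same settings take effect immediately via sysfs
echo 100000 > /sys/module/zfs/parameters/zfs_arc_meta_prune
echo 0 > /sys/module/zfs/parameters/zfs_arc_p_dampener_disable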

I'm starting the backups again.

@snajpa
Contributor

snajpa commented Mar 14, 2015

I'll deploy this tonight on one server, which so far has been suffering the most.
I'll also be disabling ZFS prefetch entirely, as at times our MFU is completely pushed out by the MRU; I'm guessing this is caused by prefetch. It causes the ARC hit ratio to drop drastically.
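For reference, the usual knob for turning ZFS prefetch off entirely is the zfs_prefetch_disable module option (standard parameter name; assuming that is what is meant here):

echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable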

@DeHackEd
Contributor

Performing rsync on a directory of a million files (well, at least 700,000) still makes it overflow the limit. Now arc_adapt is pegging the CPU (oddly it wasn't before) and arc_prune is climbing impressively fast.

So for me it's working, but it's an uphill battle or even a lost cause...

@behlendorf
Contributor Author

@snajpa @DeHackEd there's a caveat here I forgot to mention and that's why this is still a WIP. The dentry reclaim infrastructure isn't available for older kernels. That would cause exactly the spinning you're describing. Sorry, I don't have the exact kernel version it was added in handy.

@snajpa
Contributor

snajpa commented Mar 14, 2015

Looks like it's been added in 2.6.36, which is newer than the RHEL6 kernel, and I doubt it's been backported. So I'm guessing it's not the best idea to put this into production on a RHEL6 kernel yet, is it?

http://lwn.net/Articles/445759/

@snajpa
Contributor

snajpa commented Mar 14, 2015

@behlendorf But in conjunction with my 5-minute dcache flushing in cron, I guess it shouldn't be that big of a problem that the per-sb reclaim isn't there. I'm thinking about lowering the flushing period to 2 minutes.
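For context, such a periodic dentry/inode flush is usually just a drop_caches call from cron; a sketch of what a 5-minute entry could look like (the value 2 drops dentries and inodes only, not the page cache):

*/5 * * * * root echo 2 > /proc/sys/vm/drop_caches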

@kernelOfTruth
Contributor

It's still early into the transfer operation but this patch really seems to help

sys load, according to atop, is 50-85%

since most of you affected by the ARC growth seem to also be reading this:

I used a combination of the following changes (consolidated into a sketch after the list):
zfs_arc_meta_prune set to 12500 (higher for bigger boxes might be needed)
arc_evict_iterations to roughly twice the default: 180
zfs_arc_grow_retry to 20 (default for ZoL is 5; previously - around 2013 it was at 60; performance ?)
zfs_arc_shrink_shift from 5 to 4 (might be too large for big boxes, latency ?)
kernelOfTruth@086f234

zfs_arc_p_dampener_disable to 0

modified value of zfs_arc_max & zfs_arc_meta_limit
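The runtime-settable subset of the above, consolidated as a sketch (values copied from the list; arc_evict_iterations is left out because it does not appear to be exposed as a module parameter, and the zfs_arc_max / zfs_arc_meta_limit values are site-specific):

echo 12500 > /sys/module/zfs/parameters/zfs_arc_meta_prune
echo 20 > /sys/module/zfs/parameters/zfs_arc_grow_retry
echo 4 > /sys/module/zfs/parameters/zfs_arc_shrink_shift
echo 0 > /sys/module/zfs/parameters/zfs_arc_p_dampener_disable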

arc_evict_iterations is replaced with zfs_arc_evict_batch_limit in #3115, which seems to address, at least partly, the issue with the ARC that @behlendorf mentioned, if I understood it correctly.

So that could also be worth tinkering with until #3115 has been merged, or once someone has tried it out; unfortunately it wouldn't build for me.

@behlendorf thanks for this patch! So far the ARC (and thus SUnreclaim) stays close to the pre-set value of 3-4 GB.

other_size in /proc/spl/kstat/zfs/arcstats also isn't unreasonably large anymore, so far so good!

@behlendorf
Contributor Author

@snajpa my only concern is that it might end up spinning, so we'd need to completely disable it in that case. I was going to spend some time looking at the RHEL6 kernel to see if there wasn't a reasonable alternative interface we could use there.

I should also mention that my test workload for this was creating a directory with 10 million files in it, then doing an ls -l >/dev/null in that directory to force it to stat those 10 million files. I ran the test under Debian Jessie (3.16) with an ARC meta limit of 12G; with the patch it was able to maintain that limit, and without it it wasn't. However, that's a very simple test case, so your mileage may vary, and the patch itself can certainly be improved.
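For anyone wanting to reproduce something similar, a rough sketch of that workload (the dataset path and exact count are illustrative):

cd /tank/testfs && mkdir manyfiles && cd manyfiles
seq 1 10000000 | xargs touch      # create ~10 million empty files
ls -l > /dev/null                 # force a stat of every one of them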

@snajpa
Contributor

snajpa commented Mar 14, 2015

@behlendorf Do you think limiting the number of retries would help? That's trivial to add and I still have some time before the planned outage.

@angstymeat

It's still too early to tell with mine, but I am already seeing that the size of the arc and the metadata are exceeding their max values... and as I was typing, the arc_adapt process suddenly shot up to 100% and is pegged.

I should have mentioned that I'm running Fedora 20 with the 3.17.8-200.fc20.x86_64 kernel.

Also, on a partial tangent, I use Cacti and some scripts to graph the sizes of the arc data & metadata as well as cache hits. Is there anything useful I can add to my graphing that might help with debugging?

@behlendorf
Contributor Author

@snajpa yes, and I was thinking about refreshing the patch with that tweak anyway. If you add it I'd suggest making it a run-time tunable with a module option. That would then let you disable it if needed.

@snajpa
Contributor

snajpa commented Mar 14, 2015

@behlendorf vpsfreecz@81e3ef7 does this look ok to you?

@behlendorf
Contributor Author

Looks reasonable.

@snajpa
Contributor

snajpa commented Mar 14, 2015

I've modified it a bit so that setting zfs_arc_adjust_meta_restarts to 0 disables the restarts; decrementing it outright at the beginning of the function would lead to an overflow of the 'restarts' variable when it is set to 0 :)

vpsfreecz@e0619c0

I'm just a beginner with C still :)

@angstymeat

Well, arc_adapt finally stopped running; it didn't do that last night. However, the zfs filesystems appear to have frozen.

@snajpa
Contributor

snajpa commented Mar 14, 2015

I'm testing this in lab conditions, recursively creating directories and lots of files in them, while an endless recursive listing of the structure runs.
It shows that no matter what I set in zfs_arc_meta_prune and my zfs_arc_adjust_meta_restarts, the metadata cache is very reluctant to shrink, even if the dentry cache has been flushed and contains almost nothing. In fact, it doesn't seem to shrink at all; the ARC even keeps growing until all of the system's memory is gone. All of the RAM is basically in SUnreclaim, and slabtop shows it is all in zio_buf_16384.
Another interesting thing is that slabtop reports 15G in dnode_t caches, but my system only has 8G of RAM.

@yarikoptic

Thank you @behlendorf, building updated dailies with this and #3161 for good measure ;)

@snajpa
Contributor

snajpa commented Mar 18, 2015

@behlendorf thank you!

It's still too early to say, but I have observed that the node where I've just deployed this overgrew the limit by ~120 MB, and after a few tens of seconds arc_meta_used returned to the _limit value.

I'll report back tomorrow, after the current round of back-ups ends.

@snajpa
Contributor

snajpa commented Mar 18, 2015

Unfortunately, it looks like it hasn't made much difference. arc_meta_used is now 2G over the limit and growing like crazy.

From what I've seen, when arc_meta_used is near arc_meta_limit it balances fine, but at some point it just blows up and never comes back. I wonder why that is.

@behlendorf
Contributor Author

Sorry about the sloppy comment, I was a bit rushed. @snajpa thanks for the feedback. I've got some thoughts about that; let me set up a more complicated test case tomorrow and refine this patch.

@behlendorf behlendorf closed this Mar 18, 2015
@angstymeat

It seemed ok for about 15 minutes, then arc_adapt went up to 100% again. IO started getting slower and slower, arc_meta_used shot up over arc_meta_limit, and things are slowly grinding to a halt. It looks like the same thing @snajpa reported.

@snajpa
Contributor

snajpa commented Mar 18, 2015

@behlendorf here's arcstats from node10 - arc_meta_used is now 41G over limit :(

Another interesting thing is that when I add mru_size + mru_ghost_size + mfu_size + mfu_ghost_size, I'm at 264 GB, which is total nonsense; arc_size is reported to be at my limit, 64G. Should I create a separate issue for this?

[[email protected]]
 ~ # cat /proc/spl/kstat/zfs/arcstats 
5 1 0x01 86 4128 8595521487 38910575551295
name                            type data
hits                            4    3847757126
misses                          4    36705974
demand_data_hits                4    3740711490
demand_data_misses              4    6743781
demand_metadata_hits            4    94809720
demand_metadata_misses          4    24588074
prefetch_data_hits              4    1452294
prefetch_data_misses            4    1593872
prefetch_metadata_hits          4    10783622
prefetch_metadata_misses        4    3780247
mru_hits                        4    30878384
mru_ghost_hits                  4    8026876
mfu_hits                        4    3804642839
mfu_ghost_hits                  4    20349678
deleted                         4    1377
recycle_miss                    4    22690207
mutex_miss                      4    21158
evict_skip                      4    1610877364
evict_l2_cached                 4    412708508160
evict_l2_eligible               4    156754701312
evict_l2_ineligible             4    59638843904
hash_elements                   4    8743077
hash_elements_max               4    8743195
hash_collisions                 4    4660372
hash_chains                     4    958637
hash_chain_max                  4    6
p                               4    47131720704
c                               4    68719476736
c_min                           4    4194304
c_max                           4    68719476736
size                            4    68690241456
hdr_size                        4    5998339440
data_size                       4    8109146112
meta_size                       4    18964362240
other_size                      4    35419307344
anon_size                       4    61835264
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    15534180864
mru_evict_data                  4    1489551360
mru_evict_metadata              4    32768
mru_ghost_size                  4    68241238016
mru_ghost_evict_data            4    61855818240
mru_ghost_evict_metadata        4    6385419776
mfu_size                        4    11477492224
mfu_evict_data                  4    6560517120
mfu_evict_metadata              4    2354688
mfu_ghost_size                  4    188256352768
mfu_ghost_evict_data            4    184689412608
mfu_ghost_evict_metadata        4    3566940160
l2_hits                         4    22120111
l2_misses                       4    14585824
l2_feeds                        4    42819
l2_rw_clash                     4    6436
l2_read_bytes                   4    133337917952
l2_write_bytes                  4    210352763904
l2_writes_sent                  4    42666
l2_writes_done                  4    42666
l2_writes_error                 4    0
l2_writes_hdr_miss              4    335
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    327479
l2_cdata_free_on_write          4    23670
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    155021661696
l2_asize                        4    72079325184
l2_hdr_size                     4    199086320
l2_compress_successes           4    7715250
l2_compress_zeros               4    0
l2_compress_failures            4    1712768
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    60581095344
arc_meta_limit                  4    17179869184
arc_meta_max                    4    60581249472

@snajpa
Contributor

snajpa commented Mar 18, 2015

And another related problem, sometimes (esp. when over limit with _meta, but that might not be related), echoing a new value to zfs_arc_max doesn't do a thing - it doesn't change arc_c_max in arcstats like it should.

@yarikoptic

In light of the extensive beating this PR is getting from @snajpa (kudos!), I will stop torturing my box and get some useful usage out of it instead of forcing deadlocks. Please buzz me whenever there is a new patch to try.

@yarikoptic

Actually, for my load the patch might have even worked: http://www.onerussian.com/tmp/zfs_stats_utilization-day.png ARC size was growing and then started dropping. I will reinitiate the testing to get closure on that.

@snajpa
Contributor

snajpa commented Mar 19, 2015

@behlendorf so here's my most recent news:

  • it's 100% not NFS-server related; we've switched to SSH instead of NFS for one server, and arc_meta_used is now 16G over the limit
  • arc_adapt is stuck in D state
  • arc_meta_used seems to grow far more rapidly when something causes arc_size < arc_c
  • setting zfs_arc_max doesn't take effect due to arc_adapt being stuck in arc_adjust_meta (likely; I don't know of any way to confirm this quickly)
  • a nice but IMHO not very telling flame graph can be found here https://vpsfree.cz/data/perf-kernel.svg
  • bash doing 'echo 1 > /proc/sys/vm/drop_caches' is stuck in kernel space
  • arcstats is as follows:
[[email protected]]
 ~ # cat /proc/spl/kstat/zfs/arcstats 
5 1 0x01 86 4128 8274503459 10929745826958
name                            type data
hits                            4    554594855
misses                          4    8658097
demand_data_hits                4    502447025
demand_data_misses              4    2398225
demand_metadata_hits            4    45771827
demand_metadata_misses          4    4448762
prefetch_data_hits              4    2046340
prefetch_data_misses            4    407817
prefetch_metadata_hits          4    4329663
prefetch_metadata_misses        4    1403293
mru_hits                        4    12528896
mru_ghost_hits                  4    1396716
mfu_hits                        4    535689957
mfu_ghost_hits                  4    2635164
deleted                         4    625
recycle_miss                    4    3783254
mutex_miss                      4    2200
evict_skip                      4    302642434
evict_l2_cached                 4    72194112000
evict_l2_eligible               4    23862221824
evict_l2_ineligible             4    18461445120
hash_elements                   4    4744921
hash_elements_max               4    4744921
hash_collisions                 4    766639
hash_chains                     4    306441
hash_chain_max                  4    5
p                               4    21835885056
c                               4    68719476736
c_min                           4    4194304
c_max                           4    68719476736
size                            4    68719396992
hdr_size                        4    2248983672
data_size                       4    34968776192
meta_size                       4    13535997952
other_size                      4    17835935872
anon_size                       4    49951232
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    21784388096
mru_evict_data                  4    15101563392
mru_evict_metadata              4    851968
mru_ghost_size                  4    38225883136
mru_ghost_evict_data            4    33242170368
mru_ghost_evict_metadata        4    4983712768
mfu_size                        4    26670434816
mfu_evict_data                  4    19819554816
mfu_evict_metadata              4    12128768
mfu_ghost_size                  4    32170317312
mfu_ghost_evict_data            4    29894632448
mfu_ghost_evict_metadata        4    2275684864
l2_hits                         4    3053145
l2_misses                       4    5604913
l2_feeds                        4    12303
l2_rw_clash                     4    799
l2_read_bytes                   4    11610332160
l2_write_bytes                  4    63240490496
l2_writes_sent                  4    12283
l2_writes_done                  4    12283
l2_writes_error                 4    0
l2_writes_hdr_miss              4    182
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    44159
l2_cdata_free_on_write          4    8476
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    71769676800
l2_asize                        4    40505685504
l2_hdr_size                     4    129703304
l2_compress_successes           4    2982816
l2_compress_zeros               4    0
l2_compress_failures            4    933290
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    33750620800
arc_meta_limit                  4    17179869184
arc_meta_max                    4    33760781608

@snajpa
Contributor

snajpa commented Mar 19, 2015

@behlendorf another piece of 'good' news: the node locked up after 6 hrs of uptime. The stack trace output in the syslog messages is scrambled, but the most frequently repeating pattern of stacks for processes in D state is as follows (I think I might even have hit a bug with perf; perf has long been off and it is still visible in the traces):

Mar 19 11:50:22 node10 kernel: [38415.870579] php-cgi       D ffff881c2dbc70c0     0 232028 169208 2509 0x00000080
Mar 19 11:50:22 node10 kernel: [38415.870587]  ffff8810fa491d48 0000000000000086 0000000000000000 ffffffff8105b618
Mar 19 11:50:22 node10 kernel: [38415.870597]  ffff8810fa491cf8 ffffffff81070552 000022c27702dccd ffff882100eddd80
Mar 19 11:50:22 node10 kernel: [38415.870605]  0000000000000003 0000000000000000 00000001024220b5 ffff882100eddd80
Mar 19 11:50:22 node10 kernel: [38415.870614] Call Trace:
Mar 19 11:50:22 node10 kernel: [38415.870619]  [<ffffffff8105b618>] ? resched_task+0x68/0x80
Mar 19 11:50:22 node10 kernel: [38415.870624]  [<ffffffff81070552>] ? check_preempt_wakeup+0x1c2/0x260
Mar 19 11:50:22 node10 kernel: [38415.870631]  [<ffffffff815344e5>] schedule_timeout+0x215/0x2e0
Mar 19 11:50:22 node10 kernel: [38415.870637]  [<ffffffff81533fc4>] wait_for_completion+0xe4/0x120
Mar 19 11:50:22 node10 kernel: [38415.870643]  [<ffffffff81065ba0>] ? default_wake_function+0x0/0x20
Mar 19 11:50:22 node10 kernel: [38415.870650]  [<ffffffff811566d0>] ? lru_add_drain_per_cpu+0x0/0x10
Mar 19 11:50:22 node10 kernel: [38415.870656]  [<ffffffff811566d0>] ? lru_add_drain_per_cpu+0x0/0x10
Mar 19 11:50:22 node10 kernel: [38415.870661]  [<ffffffff810a10f7>] flush_work+0x77/0xc0
Mar 19 11:50:22 node10 kernel: [38415.870667]  [<ffffffff810a0890>] ? wq_barrier_func+0x0/0x20
Mar 19 11:50:22 node10 kernel: [38415.870673]  [<ffffffff810a1273>] schedule_on_each_cpu+0x133/0x180
Mar 19 11:50:22 node10 kernel: [38415.870679]  [<ffffffff81154e15>] lru_add_drain_all+0x15/0x20
Mar 19 11:50:22 node10 kernel: [38415.870685]  [<ffffffff811763a5>] __mlock+0x55/0x110
Mar 19 11:50:22 node10 kernel: [38415.870691]  [<ffffffff81176473>] sys_mlock+0x13/0x20
Mar 19 11:50:22 node10 kernel: [38415.870697]  [<ffffffff8100b102>] system_call_fastpath+0x16/0x1b
Mar 19 11:50:22 node10 kernel: [38415.872671] bash          D ffff88179588c5c0     0 236340 236286    0 0x00000084
Mar 19 11:50:22 node10 kernel: [38415.872675]  ffff8810ab121bb8 0000000000000082 0000000000000000 ffffffff8160e800
Mar 19 11:50:22 node10 kernel: [38415.872680]  0000000000000012 0000000000000000 000022ccead57909 ffffffff81067df3
Mar 19 11:50:22 node10 kernel: [38415.872684]  0000000000000000 0000000000000000 000000010242cff0 0000000000000001
Mar 19 11:50:22 node10 kernel: [38415.872688] Call Trace:
Mar 19 11:50:22 node10 kernel: [38415.872692]  [<ffffffff81067df3>] ? perf_event_task_sched_out+0x33/0x70
Mar 19 11:50:22 node10 kernel: [38415.872696]  [<ffffffff815344e5>] schedule_timeout+0x215/0x2e0
Mar 19 11:50:22 node10 kernel: [38415.872700]  [<ffffffff81533fc4>] wait_for_completion+0xe4/0x120
Mar 19 11:50:22 node10 kernel: [38415.872704]  [<ffffffff81065ba0>] ? default_wake_function+0x0/0x20
Mar 19 11:50:22 node10 kernel: [38415.872708]  [<ffffffff810a10f7>] flush_work+0x77/0xc0
Mar 19 11:50:22 node10 kernel: [38415.872711]  [<ffffffff810a0890>] ? wq_barrier_func+0x0/0x20
Mar 19 11:50:22 node10 kernel: [38415.872715]  [<ffffffff810a1314>] flush_delayed_work+0x54/0x70
Mar 19 11:50:22 node10 kernel: [38415.872719]  [<ffffffff81347ac5>] tty_flush_to_ldisc+0x15/0x20
Mar 19 11:50:22 node10 kernel: [38415.872722]  [<ffffffff81343426>] n_tty_read+0x246/0x950
Mar 19 11:50:22 node10 kernel: [38415.872727]  [<ffffffff81065ba0>] ? default_wake_function+0x0/0x20
Mar 19 11:50:22 node10 kernel: [38415.872730]  [<ffffffff81346acb>] ? put_ldisc+0x5b/0xc0
Mar 19 11:50:22 node10 kernel: [38415.872734]  [<ffffffff810131ce>] ? copy_user_generic+0xe/0x20
Mar 19 11:50:22 node10 kernel: [38415.872739]  [<ffffffff8133dc86>] tty_read+0xa6/0xf0
Mar 19 11:50:22 node10 kernel: [38415.872743]  [<ffffffff811b4b95>] vfs_read+0xb5/0x1a0
Mar 19 11:50:22 node10 kernel: [38415.872747]  [<ffffffff811b4cd1>] sys_read+0x51/0x90
Mar 19 11:50:22 node10 kernel: [38415.872751]  [<ffffffff8100b102>] system_call_fastpath+0x16/0x1b
Mar 19 11:50:22 node10 kernel: [38415.893108] cleanup       D ffff8816c673a900     0 251937 207347 3077 0x00000080
Mar 19 11:50:22 node10 kernel: [38415.893115]  ffff881d62333d18 0000000000000082 ffff881d62333ce0 0000000000000000
Mar 19 11:50:22 node10 kernel: [38415.893124]  ffff881d62333d68 0000000001046f3d 000022f0545aaf4b ffff882100ccf7b0
Mar 19 11:50:22 node10 kernel: [38415.893132]  ffff882000000000 ffff882100ccf7a0 0000000102452189 0000000a00000001
Mar 19 11:50:22 node10 kernel: [38415.893141] Call Trace:
Mar 19 11:50:22 node10 kernel: [38415.893146]  [<ffffffff81534f66>] __mutex_lock_slowpath+0x96/0x210
Mar 19 11:50:22 node10 kernel: [38415.893152]  [<ffffffff81534a8b>] mutex_lock+0x2b/0x50
Mar 19 11:50:22 node10 kernel: [38415.893158]  [<ffffffff811c7cba>] do_filp_open+0x2da/0xc50
Mar 19 11:50:22 node10 kernel: [38415.893164]  [<ffffffff81015019>] ? read_tsc+0x9/0x20
Mar 19 11:50:22 node10 kernel: [38415.893170]  [<ffffffff810b4071>] ? ktime_get_ts+0xb1/0xf0
Mar 19 11:50:22 node10 kernel: [38415.893175]  [<ffffffff812a6f4a>] ? strncpy_from_user+0x4a/0x90
Mar 19 11:50:22 node10 kernel: [38415.893182]  [<ffffffff811c2dd5>] ? getname_flags+0x155/0x260
Mar 19 11:50:22 node10 kernel: [38415.893188]  [<ffffffff811d58b2>] ? alloc_fd+0x92/0x160
Mar 19 11:50:22 node10 kernel: [38415.893194]  [<ffffffff811b0da7>] do_sys_open+0x67/0x130
Mar 19 11:50:22 node10 kernel: [38415.893200]  [<ffffffff811b0eb0>] sys_open+0x20/0x30
Mar 19 11:50:22 node10 kernel: [38415.893205]  [<ffffffff8100b102>] system_call_fastpath+0x16/0x1b
Mar 19 11:50:22 node10 kernel: [38415.898047] php-cgi       D ffff882d76f94f80     0 253375 169208 2509 0x00000080
Mar 19 11:50:22 node10 kernel: [38415.898055]  ffff882f4cc59d48 0000000000000082 ffff882f4cc59d10 ffffffff810738b9
Mar 19 11:50:22 node10 kernel: [38415.898064]  ffff882100000003 0000000003914b8f 000022ec63a7c004 ffff88012bf4f7b0
Mar 19 11:50:22 node10 kernel: [38415.898073]  ffff882000000000 ffff88012bf4f7a0 000000010244df83 0000000e810607c6
Mar 19 11:50:22 node10 kernel: [38415.898081] Call Trace:
Mar 19 11:50:22 node10 kernel: [38415.898086]  [<ffffffff810738b9>] ? enqueue_task_fair+0xb9/0x220
Mar 19 11:50:22 node10 kernel: [38415.898092]  [<ffffffff815344e5>] schedule_timeout+0x215/0x2e0
Mar 19 11:50:22 node10 kernel: [38415.898098]  [<ffffffff81533fc4>] wait_for_completion+0xe4/0x120
Mar 19 11:50:22 node10 kernel: [38415.898104]  [<ffffffff81065ba0>] ? default_wake_function+0x0/0x20
Mar 19 11:50:22 node10 kernel: [38415.898111]  [<ffffffff811566d0>] ? lru_add_drain_per_cpu+0x0/0x10
Mar 19 11:50:22 node10 kernel: [38415.898117]  [<ffffffff811566d0>] ? lru_add_drain_per_cpu+0x0/0x10
Mar 19 11:50:22 node10 kernel: [38415.898122]  [<ffffffff810a10f7>] flush_work+0x77/0xc0
Mar 19 11:50:22 node10 kernel: [38415.898128]  [<ffffffff810a0890>] ? wq_barrier_func+0x0/0x20
Mar 19 11:50:22 node10 kernel: [38415.898134]  [<ffffffff810a1273>] schedule_on_each_cpu+0x133/0x180
Mar 19 11:50:22 node10 kernel: [38415.898140]  [<ffffffff81154e15>] lru_add_drain_all+0x15/0x20
Mar 19 11:50:22 node10 kernel: [38415.898146]  [<ffffffff811763a5>] __mlock+0x55/0x110
Mar 19 11:50:22 node10 kernel: [38415.898152]  [<ffffffff81176473>] sys_mlock+0x13/0x20
Mar 19 11:50:22 node10 kernel: [38415.898158]  [<ffffffff8100b102>] system_call_fastpath+0x16/0x1b
Mar 19 11:50:22 node10 kernel: [38415.907411] nyn.sh        S ffff881e20726740     0 264671 264669 1527 0x00000080
Mar 19 11:50:22 node10 kernel: [38415.907418]  ffff881f0a135c98 0000000000000086 0000000000000000 0000000000000000
Mar 19 11:50:22 node10 kernel: [38415.907426]  ffff881e20726740 000000001f25ca00 000022f0266cceb1 ffffea001f25ca00
Mar 19 11:50:22 node10 kernel: [38415.907435]  ffff881f0a135ca8 0000000000000000 0000000102451ec0 ffff881e20726740
Mar 19 11:50:22 node10 kernel: [38415.907443] Call Trace:
Mar 19 11:50:22 node10 kernel: [38415.907447]  [<ffffffff811bfafb>] pipe_wait+0x5b/0x80
Mar 19 11:50:22 node10 kernel: [38415.907453]  [<ffffffff810a80c0>] ? autoremove_wake_function+0x0/0x40
Mar 19 11:50:22 node10 kernel: [38415.907459]  [<ffffffff81534a7e>] ? mutex_lock+0x1e/0x50
Mar 19 11:50:22 node10 kernel: [38415.907464]  [<ffffffff811c0596>] pipe_read+0x3e6/0x4e0
Mar 19 11:50:22 node10 kernel: [38415.907471]  [<ffffffff811b41fa>] do_sync_read+0xfa/0x140
Mar 19 11:50:22 node10 kernel: [38415.907477]  [<ffffffff810a80c0>] ? autoremove_wake_function+0x0/0x40
Mar 19 11:50:22 node10 kernel: [38415.907483]  [<ffffffff8107a465>] ? do_fork_pid+0x145/0x490
Mar 19 11:50:22 node10 kernel: [38415.907490]  [<ffffffff811b4b95>] vfs_read+0xb5/0x1a0
Mar 19 11:50:22 node10 kernel: [38415.907496]  [<ffffffff811b4cd1>] sys_read+0x51/0x90
Mar 19 11:50:22 node10 kernel: [38415.907501]  [<ffffffff810fb63e>] ? __audit_syscall_exit+0x25e/0x290
Mar 19 11:50:22 node10 kernel: [38415.907507]  [<ffffffff8100b102>] system_call_fastpath+0x16/0x1b
Mar 19 11:50:22 node10 kernel: [38415.907511] nyn.sh        S ffff883f9b03b1c0     0 264672 264671 1527 0x00000080
Mar 19 11:50:22 node10 kernel: [38415.907519]  ffff882e2c6c1e88 0000000000000082 ffffffff810607c6 0000000000000029
Mar 19 11:50:22 node10 kernel: [38415.907527]  00000000022b09a0 ffff883d70082e40 ffff882cae9953a0 0000000000000007
Mar 19 11:50:22 node10 kernel: [38415.907536]  ffff882e2c6c1f18 ffffffff8104e05c 0000000000000286 ffff882e2c6c1e50
Mar 19 11:50:22 node10 kernel: [38415.907545] Call Trace:
Mar 19 11:50:22 node10 kernel: [38415.907549]  [<ffffffff810607c6>] ? enqueue_task+0x66/0x80
Mar 19 11:50:22 node10 kernel: [38415.907555]  [<ffffffff8104e05c>] ? __do_page_fault+0x1ec/0x480
Mar 19 11:50:22 node10 kernel: [38415.907561]  [<ffffffff81080604>] do_wait+0x1e4/0x240
Mar 19 11:50:22 node10 kernel: [38415.907566]  [<ffffffff810806e8>] sys_wait4+0x88/0xd0
Mar 19 11:50:22 node10 kernel: [38415.907573]  [<ffffffff8107e990>] ? child_wait_callback+0x0/0x90
Mar 19 11:50:22 node10 kernel: [38415.907579]  [<ffffffff8100b102>] system_call_fastpath+0x16/0x1b
Mar 19 11:50:22 node10 kernel: [38415.907584] cat           R ffff882e2ab781c0     0 264673 264672 1527 0x00000088
Mar 19 11:50:22 node10 kernel: [38415.907592]  00000000000000ad 0000002881532fd0 ffff88393902ab80 ffff883500000000
Mar 19 11:50:22 node10 kernel: [38415.907600]  ffffea00fb527dc0 ffffea0000000000 ffff882c5e427c18 ffffffff8113c467
Mar 19 11:50:22 node10 kernel: [38415.907609]  ffff883c84c3e250 ffff883c84c3e250 ffff882c5e427ca8 ffffffff81172549
Mar 19 11:50:22 node10 kernel: [38415.907618] Call Trace:
Mar 19 11:50:22 node10 kernel: [38415.907623]  [<ffffffff8113c467>] ? unlock_page+0x27/0x30
Mar 19 11:50:22 node10 kernel: [38415.907629]  [<ffffffff81172549>] ? __do_fault+0x549/0x640
Mar 19 11:50:22 node10 kernel: [38415.907642]  [<ffffffffa025511b>] ? tsd_hash_search+0x7b/0xe0 [spl]
Mar 19 11:50:22 node10 kernel: [38415.907654]  [<ffffffffa025511b>] ? tsd_hash_search+0x7b/0xe0 [spl]
Mar 19 11:50:22 node10 kernel: [38415.907660]  [<ffffffff810700e1>] ? update_curr+0xe1/0x1f0
Mar 19 11:50:22 node10 kernel: [38415.907666]  [<ffffffff81130e40>] ? remote_function+0x0/0x60
Mar 19 11:50:22 node10 kernel: [38415.907672]  [<ffffffff81133c5e>] ? __perf_cgroup_move+0xe/0x20
Mar 19 11:50:22 node10 kernel: [38415.907678]  [<ffffffff81130e88>] ? remote_function+0x48/0x60
Mar 19 11:50:22 node10 kernel: [38415.907684]  [<ffffffff81130ee4>] ? task_function_call+0x44/0x50
Mar 19 11:50:22 node10 kernel: [38415.907689]  [<ffffffff81133c50>] ? __perf_cgroup_move+0x0/0x20
Mar 19 11:50:22 node10 kernel: [38415.907696]  [<ffffffff81138360>] ? perf_event_exit_task+0xc0/0x200
Mar 19 11:50:22 node10 kernel: [38415.907702]  [<ffffffff810e64ef>] ? cgroup_exit+0x10f/0x130
Mar 19 11:50:22 node10 kernel: [38415.907707]  [<ffffffff810f187e>] ? exit_ptrace+0x2e/0x90
Mar 19 11:50:22 node10 kernel: [38415.907713]  [<ffffffff81536a6e>] ? _write_lock_irq+0x1e/0x20
Mar 19 11:50:22 node10 kernel: [38415.907719]  [<ffffffff81080b54>] ? do_exit+0x254/0x940
Mar 19 11:50:22 node10 kernel: [38415.907724]  [<ffffffff81081298>] ? do_group_exit+0x58/0xd0
Mar 19 11:50:22 node10 kernel: [38415.907730]  [<ffffffff81081327>] ? sys_exit_group+0x17/0x20
Mar 19 11:50:22 node10 kernel: [38415.907735]  [<ffffffff8100b102>] ? system_call_fastpath+0x16/0x1b
Mar 19 11:50:22 node10 kernel: [38415.907740] grep          R ffff8835dbc48680     0 264674 264672 1527 0x00000088
Mar 19 11:50:22 node10 kernel: [38415.907748]  0000000000000001 ffffea002b6d7da8 ffff882ea7d17c88 ffffffff8114ce12
Mar 19 11:50:22 node10 kernel: [38415.907757]  ffffea00fb527dc0 ffffea0000000000 ffff882ea7d17c18 ffffffff8113c467
Mar 19 11:50:22 node10 kernel: [38415.907766]  ffff881845d2d9e0 ffff881845d2d9e0 ffff88000002b7d8 000000020000001f
Mar 19 11:50:22 node10 kernel: [38415.907775] Call Trace:
Mar 19 11:50:22 node10 kernel: [38415.907780]  [<ffffffff8114ce12>] ? free_pcppages_bulk+0x362/0x430
Mar 19 11:50:22 node10 kernel: [38415.907787]  [<ffffffff8113c467>] ? unlock_page+0x27/0x30
Mar 19 11:50:22 node10 kernel: [38415.907793]  [<ffffffff811fc422>] ? fsnotify_clear_marks_by_inode+0x32/0xf0
Mar 19 11:50:22 node10 kernel: [38415.907799]  [<ffffffff810700e1>] ? update_curr+0xe1/0x1f0
Mar 19 11:50:22 node10 kernel: [38415.907805]  [<ffffffff81130e40>] ? remote_function+0x0/0x60
Mar 19 11:50:22 node10 kernel: [38415.907811]  [<ffffffff81133c5e>] ? __perf_cgroup_move+0xe/0x20
Mar 19 11:50:22 node10 kernel: [38415.907817]  [<ffffffff81130e88>] ? remote_function+0x48/0x60
Mar 19 11:50:22 node10 kernel: [38415.907823]  [<ffffffff81130ee4>] ? task_function_call+0x44/0x50
Mar 19 11:50:22 node10 kernel: [38415.907829]  [<ffffffff81133c50>] ? __perf_cgroup_move+0x0/0x20
Mar 19 11:50:22 node10 kernel: [38415.907835]  [<ffffffff81138360>] ? perf_event_exit_task+0xc0/0x200
Mar 19 11:50:22 node10 kernel: [38415.907841]  [<ffffffff810e64ef>] ? cgroup_exit+0x10f/0x130
Mar 19 11:50:22 node10 kernel: [38415.907847]  [<ffffffff810f187e>] ? exit_ptrace+0x2e/0x90
Mar 19 11:50:22 node10 kernel: [38415.907852]  [<ffffffff81536a6e>] ? _write_lock_irq+0x1e/0x20
Mar 19 11:50:22 node10 kernel: [38415.907858]  [<ffffffff81080b54>] ? do_exit+0x254/0x940
Mar 19 11:50:22 node10 kernel: [38415.907863]  [<ffffffff81081298>] ? do_group_exit+0x58/0xd0
Mar 19 11:50:22 node10 kernel: [38415.907869]  [<ffffffff81081327>] ? sys_exit_group+0x17/0x20
Mar 19 11:50:22 node10 kernel: [38415.907874]  [<ffffffff8100b102>] ? system_call_fastpath+0x16/0x1b
[[email protected]]
 ~ # cat node10-meminfo.txt 
MemTotal:       263970936 kB
MemFree:        58150164 kB
Buffers:          116016 kB
Cached:         31174404 kB
SwapCached:            0 kB
MemCommitted:   437256192 kB
VirtualSwap:           0 kB
Active:         55595660 kB
Inactive:       37364092 kB
Active(anon):   51098976 kB
Inactive(anon): 13751580 kB
Active(file):    4496684 kB
Inactive(file): 23612512 kB
Unevictable:     1406700 kB
Mlocked:         1406700 kB
SwapTotal:      33445756 kB
SwapFree:       33445756 kB
Dirty:               996 kB
Writeback:           404 kB
AnonPages:      63074484 kB
Mapped:          5532844 kB
Shmem:           3155448 kB
Slab:           61191856 kB
SReclaimable:    4194532 kB
SUnreclaim:     56997324 kB
KernelStack:      107864 kB
PageTables:      1155756 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    165431224 kB
Committed_AS:   169866292 kB
VmallocTotal:   34359738367 kB
VmallocUsed:    37879212 kB
VmallocChunk:   34172645044 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        5564 kB
DirectMap2M:     2058240 kB
DirectMap1G:    266338304 kB
[[email protected]]
 ~ # cat node10-arcstats.txt 
5 1 0x01 86 4128 8274503459 38367113689195
name                            type data
hits                            4    2138204537
misses                          4    27164977
demand_data_hits                4    2047297272
demand_data_misses              4    6863161
demand_metadata_hits            4    83024581
demand_metadata_misses          4    16466919
prefetch_data_hits              4    2557921
prefetch_data_misses            4    983135
prefetch_metadata_hits          4    5324763
prefetch_metadata_misses        4    2851762
mru_hits                        4    19647643
mru_ghost_hits                  4    6801111
mfu_hits                        4    2110674212
mfu_ghost_hits                  4    12392451
deleted                         4    625
recycle_miss                    4    14938422
mutex_miss                      4    6866
evict_skip                      4    907722674
evict_l2_cached                 4    256503610880
evict_l2_eligible               4    80892292608
evict_l2_ineligible             4    50873285632
hash_elements                   4    8363055
hash_elements_max               4    8363916
hash_collisions                 4    4682605
hash_chains                     4    883988
hash_chain_max                  4    5
p                               4    17728860160
c                               4    68719476736
c_min                           4    4194304
c_max                           4    68719476736
size                            4    68708401888
hdr_size                        4    5031528720
data_size                       4    27370795520
meta_size                       4    15330998272
other_size                      4    20721793376
anon_size                       4    108725760
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    17613757952
mru_evict_data                  4    9015178752
mru_evict_metadata              4    12001280
mru_ghost_size                  4    68434313728
mru_ghost_evict_data            4    61605666304
mru_ghost_evict_metadata        4    6828647424
mfu_size                        4    24979310080
mfu_evict_data                  4    18251010048
mfu_evict_metadata              4    95495680
mfu_ghost_size                  4    96910597632
mfu_ghost_evict_data            4    94920265216
mfu_ghost_evict_metadata        4    1990332416
l2_hits                         4    15035562
l2_misses                       4    12129376
l2_feeds                        4    41798
l2_rw_clash                     4    2290
l2_read_bytes                   4    74938961408
l2_write_bytes                  4    220023553536
l2_writes_sent                  4    41778
l2_writes_done                  4    41778
l2_writes_error                 4    0
l2_writes_hdr_miss              4    336
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    268375
l2_cdata_free_on_write          4    28375
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    133212192768
l2_asize                        4    71760593920
l2_hdr_size                     4    253286000
l2_compress_successes           4    9454138
l2_compress_zeros               4    0
l2_compress_failures            4    1830626
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    41337606368
arc_meta_limit                  4    17179869184
arc_meta_max                    4    41562729368
[[email protected]]
 ~ # cat node10-slabtop.txt 
 Active / Total Objects (% used)    : 145125368 / 148880039 (97.5%)
 Active / Total Slabs (% used)      : 10067014 / 10068565 (100.0%)
 Active / Total Caches (% used)     : 223 / 392 (56.9%)
 Active / Total Size (% used)       : 58633922.20K / 59086976.95K (99.2%)
 Minimum / Average / Maximum Object : 0.02K / 0.40K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
25251240 25251240 100%    0.09K 631281       40   2525124K arc_buf_t              
19498505 19498490  99%    0.30K 1499885       13   5999540K dmu_buf_impl_t         
13342968 12558689  94%    0.06K 226152       59    904608K size-64                
12809772 12809772 100%    0.84K 1423308        9  11386464K dnode_t                
12744203 11864025  93%    0.22K 749659       17   2998636K dentry                 
11655512 11653998  99%    0.50K 1456939        8   5827756K zio_buf_512            
11298784 10917533  96%    0.03K 100882      112    403528K size-32                
8987226 8081795  89%    0.10K 242898       37    971592K sa_cache               
8367612 8367612 100%    0.30K 697301       12   2789204K arc_buf_hdr_t          
6709830 6456017  96%    0.12K 223661       30    894644K size-128               
6338064 6333980  99%    0.04K  68892       92    275568K l2arc_buf_hdr_t        
2881029 2530351  87%    0.06K  48831       59    195324K range_seg_cache        
1010340 1008476  99%    0.19K  50517       20    202068K size-192               
949975 949613  99%    0.10K  25675       37    102700K buffer_head            
933933 891540  95%    0.05K  12129       77     48516K anon_vma_chain         
933491 933410  99%   16.00K 933491        1  14935856K zio_buf_16384          
914774 875791  95%    0.20K  48146       19    192584K vm_area_struct         
843423 843407  99%    8.00K 843423        1   6747384K size-8192              
493392 493382  99%    1.15K 164464        3    657856K nfs_inode_cache        
446061 427104  95%    0.05K   5793       77     23172K anon_vma               
294804 294804 100%    1.00K  73701        4    294804K zio_data_buf_1024      
240215 237188  98%    0.72K  48043        5    192172K proc_inode_cache       
239939 239836  99%    0.55K  34277        7    137108K radix_tree_node        
200400 200400 100%    1.50K  40080        5    320640K zio_data_buf_1536      
179756 179756 100%    2.00K  89878        2    359512K zio_data_buf_2048      
161760 155585  96%    0.25K  10784       15     43136K filp                   
131546 129737  98%    0.07K   2482       53      9928K Acpi-Operand           
123660 114116  92%    0.38K  12366       10     49464K ip_dst_cache           
 76734  76729  99%    0.14K   2842       27     11368K sysfs_dir_cache        
 55704  55701  99%    1.09K  18568        3     74272K ext4_inode_cache       
 51011  50952  99%    4.00K  51011        1    204044K zio_data_buf_4096      
 41960  39794  94%    1.00K  10490        4     41960K size-1024              
 39887  39887 100%    3.50K  39887        1    159548K zio_data_buf_3584      
 39088  30650  78%    0.50K   4886        8     19544K zio_data_buf_512       
 36590  36312  99%    4.00K  36590        1    146360K size-4096              
 36295  36295 100%    5.00K  36295        1    290360K zio_data_buf_5120      
 32136  31844  99%    0.65K   5356        6     21424K inode_cache            
 27084  20637  76%    0.30K   2257       12      9028K nf_conntrack_ffff88204f57f480
 24530  24494  99%    0.75K   4906        5     19624K sock_inode_cache       
 22620  19592  86%    0.19K   1131       20      4524K cred_jar               
 18976  17233  90%    0.50K   2372        8      9488K size-512               
 16420  15913  96%    0.19K    821       20      3284K eventpoll_epi          
 15564  15212  97%    1.00K   3891        4     15564K size-1024(UBC)         
 15211  14614  96%    0.07K    287       53      1148K eventpoll_pwq          
 14892  14285  95%    0.11K    438       34      1752K task_delay_info        
 14850  14036  94%    0.12K    495       30      1980K pid_2                  
 14778  14575  98%    0.84K   1642        9     13136K shmem_inode_cache      
 13830  13615  98%    2.75K   6915        2     55320K task_struct            
 13266  12915  97%    0.81K   1474        9     11792K task_xstate            
 12916  12794  99%    0.88K   3229        4     12916K UNIX                   
 12246  12130  99%    2.50K   4082        3     32656K zio_data_buf_2560      
 11781   1950  16%    0.05K    153       77       612K zio_link_cache         
 10672  10535  98%    0.04K    116       92       464K Acpi-Namespace         
 10390   9860  94%    2.00K   5195        2     20780K size-2048              
 10320   5664  54%    0.25K    688       15      2752K size-256               
  7917   7731  97%    0.53K   1131        7      4524K idr_layer_cache        
  7686   7343  95%    1.06K   1098        7      8784K signal_cache           
  7608   7351  96%    2.06K   2536        3     20288K sighand_cache          
  6254   5352  85%    0.06K    118       53       472K fs_cache               
  6244   6166  98%    3.00K   3122        2     24976K zio_data_buf_3072      
  6020   5273  87%    0.13K    215       28       860K cfq_io_context         
  5880   5185  88%    0.25K    392       15      1568K cfq_queue              
  5713   5672  99%   10.00K   5713        1     91408K zio_data_buf_10240     
  5709   5260  92%    0.69K    519       11      4152K files_cache            
  5618   4484  79%    0.06K    106       53       424K tcp_bind_bucket        
  5425   5053  93%    1.38K   1085        5      8680K mm_struct              
  5074   2621  51%    0.06K     86       59       344K inet_peer_cache        
  4706   4526  96%    2.00K   2353        2      9412K size-2048(UBC)         
  4518   4349  96%    4.00K   4518        1     18072K zio_buf_4096           
  4416   3781  85%    0.03K     48       92       192K size-32(UBC)           
  4032   4025  99%   12.00K   4032        1     64512K zio_data_buf_12288     
  3988   3961  99%    1.75K   1994        2      7976K TCP                    

@yarikoptic

I know that this one is "no good" and thus was closed, but I kept running its new version, and now I have 100% CPU arc_adapt again as well. Here are the dynamics according to munin: http://www.onerussian.com/tmp/zfs_stats_utilization-day-20150319.png

update #1: system information http://www.onerussian.com/tmp//zfs_system_details_20150319/

@behlendorf
Contributor Author

Reopening this. I didn't mean to close it, even though it clearly still needs some work.

@behlendorf behlendorf mentioned this pull request Mar 19, 2015
kernelOfTruth added a commit to kernelOfTruth/zfs that referenced this pull request Mar 19, 2015
Revert "More aggressively maintain arc_meta_limit (WIP) openzfs#3181"

This reverts commit 8135db5.
@snajpa
Contributor

snajpa commented Mar 19, 2015

@behlendorf I just noticed there's an obvious way to see arc_adapt's stack while it's stuck in D state.

It's been like this for hours now.

[[email protected]]
 ~ # cat /proc/1366/stack
[<ffffffff811fbe35>] fsnotify_unmount+0x1d5/0x270
[<ffffffff811fbf20>] fsnotify_unmount_inodes+0x10/0x20
[<ffffffff811d34f4>] invalidate_inodes_check+0x64/0x3e0
[<ffffffffa0349aed>] zfs_sb_prune+0x9d/0xc0 [zfs]
[<ffffffffa036c2e6>] zpl_prune_sb+0x26/0x30 [zfs]
[<ffffffffa02aea44>] arc_adapt_thread+0x294/0x560 [zfs]
[<ffffffffa024f798>] thread_generic_wrapper+0x68/0x80 [spl]
[<ffffffff810a7cce>] kthread+0x9e/0xc0
[<ffffffff8100c34a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

@snajpa
Contributor

snajpa commented Mar 19, 2015

(gdb) list *(fsnotify_unmount+0x1d5)
0xffffffff811fbe35 is in fsnotify_unmount (fs/notify/inode_mark.c:459).
454 
455         spin_lock(&inode_lock);
456     }
457     spin_unlock(&inode_lock);
458     /* we need to pause until we are sure all iputs are finished */
459     wait_event(sb->s_fsnotify_marks_wq, !atomic_read(&sb->s_fsnotify_marks));
460     spin_lock(&inode_lock);
461 }
462 
463 void fsnotify_unmount_inodes(struct super_block *sb)

@snajpa
Contributor

snajpa commented Mar 19, 2015

@behlendorf my investigation leads me to the conclusion that something has registered either inotify or dnotify events on some inode we're trying to free, and until that is released the wait_event will hang there, because s_fsnotify_marks will never reach zero. Does this look about right?

@snajpa
Contributor

snajpa commented Mar 19, 2015

@behlendorf invalidate_inodes is most likely not meant to be used in any context other than unmounting the filesystem, when there are no users of it anymore. Though I might be wrong.

@snajpa
Contributor

snajpa commented Mar 19, 2015

@behlendorf I will try to deploy this with invalidate_inodes out of the picture (vpsfreecz@cbc4156), along with limiting the number of restarts in arc_adjust_meta() (vpsfreecz@d44b4f2).

@snajpa
Contributor

snajpa commented Mar 20, 2015

@behlendorf I declare victory!

Back-ups are now running and arc_meta_used is in bounds.

Per @ryao's advice I've temporarily removed the L2ARC devices; he suspects there's a bug in the code, and I'm just being cautious.

arc_prune                       4    857
arc_meta_used                   4    17179658896
arc_meta_limit                  4    17179869184
arc_meta_max                    4    17234849080

@snajpa
Contributor

snajpa commented Mar 20, 2015

Btw, I would strongly advocate for merging the limit on arc_adjust_meta restarts, if you're going to merge this. Without it, arc_adapt might keep re-running for a long time, which makes changing ARC parameters like arc_c_max next to impossible. I've found 4096 restarts to be sufficient even for my heavy metadata-hammering workload.
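Assuming the refreshed patch keeps the parameter name mentioned earlier (zfs_arc_adjust_meta_restarts) and exposes it like the other module options, capping the restarts would then be a one-liner (4096 being the value found sufficient above; 0 disables the restarts entirely, per the earlier comment):

echo 4096 > /sys/module/zfs/parameters/zfs_arc_adjust_meta_restarts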

@kernelOfTruth
Contributor

@snajpa Great job & Congrats ! 👍

@behlendorf
Contributor Author

Awesome! @snajpa thanks for pushing this one over the finish line! It's great to know this resolves the problems you were seeing even for your workload. I'll integrate your fixes and the other feedback I've gotten into the pull request and refresh it. Then we can get this merged.

@behlendorf
Contributor Author

@snajpa @kernelOfTruth @yarikoptic @chrisrd @DeHackEd @angstymeat I've opened pull request #3202 with a refreshed version of this patch stack which incorporates @snajpa's improvements and @chrisrd's feedback. It would be great if you could further test this change, but based on the latest feedback it's looking very good. Unless the buildbots uncover an issue I'd like to merge this fairly soon.
