excessive memory usage on huge trees #918
Which version of fd are you using?

`$ fd --version` (installed by `cargo install`)
Ah, I thought that we had set a limit on the channel size in #885, but it looks like that part got removed. So, yes, there is an unbounded queue of files and directories to process, and that could cause high memory usage if the terminal can't keep up. Maybe we should revisit adding a bound to that. I believe there is a performance cost to doing that with std's mpsc channels; I'm not sure about crossbeam.
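For readers following along, here is a minimal sketch (not fd's actual code) of the two std channel flavors under discussion: `channel()` never blocks the sender, so a slow consumer lets the queue grow without bound, while `sync_channel(cap)` blocks the sender once `cap` messages are in flight.

```rust
use std::sync::mpsc::{channel, sync_channel};
use std::thread;
use std::time::Duration;

fn main() {
    // Unbounded: send() never blocks, so a slow consumer lets the queue
    // (and the process's memory) grow as fast as the producer can run.
    let (tx, rx) = channel();
    let producer = thread::spawn(move || {
        for i in 0..1_000 {
            tx.send(i).unwrap(); // always returns immediately
        }
    });
    thread::sleep(Duration::from_millis(100)); // consumer lags behind
    let backlog: Vec<i32> = rx.iter().collect();
    println!("unbounded backlog drained: {} items", backlog.len());
    producer.join().unwrap();

    // Bounded: once the buffer holds 64 messages, send() blocks until
    // the receiver catches up -- the backpressure fd was missing.
    let (tx, rx) = sync_channel(64);
    let producer = thread::spawn(move || {
        for i in 0..1_000 {
            tx.send(i).unwrap(); // blocks while the buffer is full
        }
    });
    for _ in rx.iter() {} // at most 64 items are ever in flight
    producer.join().unwrap();
}
```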
Right. I removed it because you wrote "if I switch back to using channel instead of sync_channel then it is, at least, not any slower". And I believe I was able to confirm this in my own benchmarks.
Yes, let's do this in a dedicated PR with new benchmark results (and memory profiles). A quick fix could be to use a much higher limit than the one suggested in #885.
Hint: in my own directory-traversal code I found that about 4k per thread (for the memory calculation, not one channel per thread) is a somewhat sweet spot. Actually the limit isn't terribly important as long as it doesn't explode; even a 100 MB limit should be OK-ish.
Is this the cause of fd outputting "Killed" and then quitting when I search for all files with a specific extension from the root? (Arch Linux, ~9 TB of ~14 TB used space to search through across 5 mounted partitions.)
@mwaitzman It might be. Do you see anything about the OOM killer in `dmesg`?
Yes, I see this when running a similar command:

[ 4279.148487] Out of memory: Killed process 7330 (fd) total-vm:61310736kB, anon-rss:24569660kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:111272kB oom_score_adj:0
[ 4285.522887] oom_reaper: reaped process 7330 (fd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Okay yeah, I think this is probably due to the lack of backpressure with the unbounded channel. It's unfortunate that `std::sync::mpsc::sync_channel()` (the bounded queue) performs so much worse than `channel()` (unbounded).

The SPSC case (`-j1`) is 10-40% worse with bounded queues, and the performance is *worse* for higher capacities, so tuning the capacity is crucial. With a capacity of `4096 * threads`, bounded queues match unbounded queues at `-j8` but not with fewer threads.

I'd like to figure out #933.
On 2022-06-03 12:37, Tavian Barnes wrote:

> Okay yeah, I think this is probably due to the lack of backpressure
> with the unbounded channel. It's unfortunate that
> `std::sync::mpsc::sync_channel()` (the bounded queue) performs so much
> worse than `channel()` (unbounded).
>
> The SPSC case (`-j1`) is 10-40% worse with bounded queues, and the
> performance is *worse* for higher capacities, so tuning the capacity
> is crucial. With a capacity of `4096 * threads`, bounded queues match
> unbounded queues at `-j8` but not with fewer threads.
Just an observation from other code of mine: walking directories can be very fast when they are in RAM (obviously), but otherwise it is pretty much I/O-bound. And here lies the trick: since you are not CPU-bound but I/O-bound, you can have *a lot* more threads than you have cores, which in turn gives the kernel the opportunity to schedule requests in a much better way. I've tested up to 128 threads on a 12/24-core machine and still seen improvements (in my code, not the fd tree-walking code). This can by far outweigh the performance degradation you get from bounded channels.
> …
> I'd like to figure out #933.
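As an illustration of the observation in the email above (many more threads than cores for I/O-bound walking), here is a hedged sketch; the shared queue, the thread count, and the termination scheme are illustrative assumptions, not fd's actual traversal code.

```rust
use std::collections::VecDeque;
use std::path::PathBuf;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// A deliberately simple parallel walker: a shared queue of directories
// serviced by far more threads than there are cores. Because each worker
// spends most of its time blocked in readdir()/stat(), extra threads keep
// more I/O requests in flight for the kernel to schedule.
fn walk(root: PathBuf, num_threads: usize) {
    let queue = Arc::new(Mutex::new(VecDeque::from([root])));
    // Directories pushed but not yet fully processed, so idle workers
    // know whether more work may still arrive.
    let pending = Arc::new(AtomicUsize::new(1));
    let mut handles = Vec::new();
    for _ in 0..num_threads {
        let queue = Arc::clone(&queue);
        let pending = Arc::clone(&pending);
        handles.push(thread::spawn(move || loop {
            let item = queue.lock().unwrap().pop_front();
            let Some(dir) = item else {
                if pending.load(Ordering::SeqCst) == 0 {
                    return; // nothing queued and nothing in flight: done
                }
                thread::yield_now();
                continue;
            };
            if let Ok(entries) = std::fs::read_dir(&dir) {
                for entry in entries.flatten() {
                    let path = entry.path();
                    // Note: is_dir() follows symlinks; real code would
                    // guard against symlink cycles.
                    if path.is_dir() {
                        pending.fetch_add(1, Ordering::SeqCst);
                        queue.lock().unwrap().push_back(path);
                    } else {
                        println!("{}", path.display());
                    }
                }
            }
            pending.fetch_sub(1, Ordering::SeqCst);
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
}

fn main() {
    // Far more threads than cores: fine for I/O-bound work, since the
    // thread count mostly determines how many syscalls are in flight.
    walk(PathBuf::from("."), 64);
}
```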
Experiencing the exact same issue. I've tried leaving different variations running on my laptop for the night, finding it absolutely frozen every time I wake up.
@Architector4 You might have some luck filtering out some directories. It looks like you are running on /, so excluding /proc, /sys, /tmp, etc. would likely make a difference (if you are on macOS, those paths might be different). If the files you care about are all on the same filesystem, you could try the --one-file-system option.
@tmccombs I don't think it would make any meaningful difference in my case -- I'm looking for all files that end with "png", of which there would not be any in such special directories, I imagine (I hope?). I'd also prefer to go through all filesystems (my SSD mounted on …). In my case I decided to resort to …
Oh, I think you're right. In your case, what is probably happening is that the queues are filling up because running …
I feel like the searching thread's performance would be I/O-bound most of the time (input being filesystem traversal, output going to stdout). In any case, this feels like a necessary feature for the wellbeing of the host machine, because this excessive memory usage seems to apply to any general case where such output is slower than such input. To demonstrate, I have a silly script:
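(The script itself did not survive the page extraction; a minimal Rust stand-in for what the next sentence describes might look like this.)

```rust
use std::io::{self, BufRead, Write};
use std::thread;
use std::time::Duration;

// Stand-in for the lost script: copy stdin to stdout, one line per
// second, to simulate an artificially slow consumer.
fn main() {
    let stdin = io::stdin();
    let mut stdout = io::stdout();
    for line in stdin.lock().lines() {
        let line = line.expect("read error");
        writeln!(stdout, "{line}").expect("write error");
        stdout.flush().expect("flush error");
        thread::sleep(Duration::from_secs(1));
    }
}
```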
This script prints each line it receives on stdin to stdout, but waits 1 second between them, causing artificially slow I/O. So I piped fd's output into it.
This reliably makes fd's memory usage balloon. I argue that some kind of limit is absolutely necessary here, and that it matters more than the performance gained by not having to account for one. Maybe it would be possible to introduce a command-line switch that disables the limit, for those who want the performance benefits and know ballooning won't happen?
Ideally yes, but: #918 (comment)
I agree, we need to do something. I can try switching to crossbeam channels again.
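For reference, a hedged sketch of what that switch might look like (it assumes the `crossbeam-channel` crate as a dependency; whether it actually performs better here is exactly what the benchmarks need to show):

```rust
use crossbeam_channel::bounded;
use std::thread;

fn main() {
    // crossbeam's bounded() plays the same role as std's sync_channel(),
    // and is often faster under contention with multiple senders.
    let (tx, rx) = bounded::<u32>(4096);
    let producer = thread::spawn(move || {
        for i in 0..100_000 {
            tx.send(i).unwrap(); // blocks while 4096 items are in flight
        }
    });
    for _ in rx.iter() {}
    producer.join().unwrap();
}
```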
Oh, sorry, missed that. Would it be possible to slap something like … ?
We originally switched to bounded channels for backpressure to fix sharkdp#918. However, bounded channels have a significant initialization overhead as they pre-allocate a fixed-size buffer for the messages.

This implementation uses a different backpressure strategy: each thread gets a limited-size pool of WorkerResults. When the size limit is hit, the sender thread has to wait for the receiver thread to handle a result from that pool and recycle it.

Inspired by [snmalloc], results are recycled by sending the boxed result over a channel back to the thread that allocated it. By allocating and freeing each WorkerResult from the same thread, allocator contention is reduced dramatically. And since we now pass results by pointer instead of by value, message passing overhead is reduced as well.

Fixes sharkdp#1408.

[snmalloc]: https://github.com/microsoft/snmalloc
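A minimal sketch of the recycling scheme this commit message describes; the names and sizes (`WorkerResult`'s field, the pool limit, the channel layout) are illustrative assumptions, not fd's actual code.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

// Stand-in for fd's WorkerResult; the real type carries a dir entry.
struct WorkerResult {
    path: String,
}

// Each sender thread owns a bounded pool of boxed results. New boxes are
// allocated on the sender thread; the receiver returns exhausted boxes
// over a per-thread channel, so every box is allocated and freed by the
// same thread, reducing allocator contention.
struct ResultPool {
    free: Receiver<Box<WorkerResult>>,  // recycled boxes come back here
    recycle: Sender<Box<WorkerResult>>, // cloned to the receiver side
    allocated: usize,
    limit: usize,
}

impl ResultPool {
    fn new(limit: usize) -> Self {
        let (recycle, free) = channel();
        ResultPool { free, recycle, allocated: 0, limit }
    }

    // Get a box to fill in. Prefer a recycled box; below the limit,
    // allocate fresh; at the limit, block until the receiver recycles
    // one -- that blocking is the backpressure that bounds memory.
    fn alloc(&mut self, path: String) -> Box<WorkerResult> {
        if let Ok(mut b) = self.free.try_recv() {
            b.path = path;
            return b;
        }
        if self.allocated < self.limit {
            self.allocated += 1;
            Box::new(WorkerResult { path })
        } else {
            let mut b = self.free.recv().expect("receiver hung up");
            b.path = path;
            b
        }
    }
}

fn main() {
    let (tx, rx) = channel::<Box<WorkerResult>>();
    let mut pool = ResultPool::new(4); // tiny limit to make blocking visible
    let recycle = pool.recycle.clone();

    let receiver = thread::spawn(move || {
        for result in rx {
            println!("{}", result.path);
            let _ = recycle.send(result); // hand the box back for reuse
        }
    });

    for i in 0..20 {
        tx.send(pool.alloc(format!("file-{i}"))).unwrap();
    }
    drop(tx); // let the receiver's loop end
    receiver.join().unwrap();
}
```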
When traversing a huge directory tree (5 TB of data, ~200 million entries) I noticed that fd will use excessive amounts of memory (I killed it at 35 GB RSS). Some investigation showed that `fd | wc -l` and `fd >/dev/null` do not have this problem, while `fd | less` again shows the same problem.

So my guess is that fd uses some unbounded queue to send the output, which piles up because the terminal emulator is too slow to print 200 million entries in time.