excessive memory usage on huge trees #918
Which version of fd are you using?

`$ fd --version` (installed by `cargo install`)
Ah, I thought that we had set a limit on the channel size in #885, but it looks like that part got removed. So, yes, there is an unbounded queue of files and directories to process, and that could cause high memory usage if the terminal can't keep up. Maybe we should revisit adding a bound to that. I believe there is a performance cost to doing that with std's mpsc channels; I'm not sure about crossbeam.
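For readers following along, here is a minimal sketch (not fd's actual code) of the two std channel flavors under discussion: `channel()` never blocks the sender, so a slow consumer lets the queue grow without bound, while `sync_channel(cap)` blocks the sender once `cap` messages are in flight.

```rust
use std::sync::mpsc::{channel, sync_channel};
use std::thread;
use std::time::Duration;

fn main() {
    // Unbounded: send() never blocks, so a slow consumer lets the queue
    // (and the process's memory) grow as fast as the producer can run.
    let (tx, rx) = channel();
    let producer = thread::spawn(move || {
        for i in 0..1_000 {
            tx.send(i).unwrap(); // always returns immediately
        }
    });
    thread::sleep(Duration::from_millis(100)); // consumer lags behind
    let backlog: Vec<i32> = rx.iter().collect();
    println!("unbounded backlog drained: {} items", backlog.len());
    producer.join().unwrap();

    // Bounded: once the buffer holds 64 messages, send() blocks until
    // the receiver catches up -- the backpressure fd was missing.
    let (tx, rx) = sync_channel(64);
    let producer = thread::spawn(move || {
        for i in 0..1_000 {
            tx.send(i).unwrap(); // blocks while the buffer is full
        }
    });
    for _ in rx.iter() {} // at most 64 items are ever in flight
    producer.join().unwrap();
}
```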
Right. I removed it because you wrote "if I switch back to using channel instead of sync_channel then it is, at least, not any slower". And I believe I was able to confirm this in my own benchmarks.
Yes, let's do this in a dedicated PR with new benchmark results (and memory profiles). A quick fix could be to use a much higher limit than the one suggested in #885.
Hint: in my own directory-traversal code I found that about 4k per thread (for the memory calculation, not one channel per thread) is a somewhat sweet spot. Actually the limit isn't terribly important as long as it doesn't explode; even a 100 MB limit should be OK-ish.
Is this the cause of fd outputting "Killed" and then quitting when I search for all files with a specific extension from the root? (Arch Linux, ~9 TB of ~14 TB used space to search through across 5 mounted partitions.)
@mwaitzman It might be. Do you see anything about the OOM killer in `dmesg`?
Yes, I see this when running a similar command:

[ 4279.148487] Out of memory: Killed process 7330 (fd) total-vm:61310736kB, anon-rss:24569660kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:111272kB oom_score_adj:0
[ 4285.522887] oom_reaper: reaped process 7330 (fd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Okay yeah, I think this is probably due to the lack of backpressure with the unbounded channel. It's unfortunate that `std::sync::mpsc::sync_channel()` (the bounded queue) performs so much worse than `channel()` (unbounded).

The SPSC case (`-j1`) is 10-40% worse with bounded queues, and the performance is *worse* for higher capacities, so tuning the capacity is crucial. With a capacity of `4096 * threads`, bounded queues match unbounded queues at `-j8` but not with fewer threads.

I'd like to figure out #933.
On 2022-06-03 12:37, Tavian Barnes wrote:

> Okay yeah, I think this is probably due to the lack of backpressure
> with the unbounded channel. It's unfortunate that
> `std::sync::mpsc::sync_channel()` (the bounded queue) performs so much
> worse than `channel()` (unbounded).
>
> The SPSC case (`-j1`) is 10-40% worse with bounded queues, and the
> performance is *worse* for higher capacities, so tuning the capacity
> is crucial. With a capacity of `4096 * threads`, bounded queues match
> unbounded queues at `-j8` but not with fewer threads.
Just an observation from other code of mine: walking directories can be very fast when they are in RAM (obviously), but otherwise it is pretty much I/O-bound. And here lies the trick: since you are not CPU-bound but I/O-bound, you can have *a lot* more threads than you have cores, which in turn gives the kernel the opportunity to schedule requests in a much better way. I've tested up to 128 threads on a 12/24-core machine and still seen improvements (in my code, not the fd tree-walking code). This can by far outweigh the performance degradation you get from bounded channels.
> …
> I'd like to figure out #933.
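As an illustration of the observation in the email above (many more threads than cores for I/O-bound walking), here is a hedged sketch; the shared queue, the thread count, and the termination scheme are illustrative assumptions, not fd's actual traversal code.

```rust
use std::collections::VecDeque;
use std::path::PathBuf;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// A deliberately simple parallel walker: a shared queue of directories
// serviced by far more threads than there are cores. Because each worker
// spends most of its time blocked in readdir()/stat(), extra threads keep
// more I/O requests in flight for the kernel to schedule.
fn walk(root: PathBuf, num_threads: usize) {
    let queue = Arc::new(Mutex::new(VecDeque::from([root])));
    // Directories pushed but not yet fully processed, so idle workers
    // know whether more work may still arrive.
    let pending = Arc::new(AtomicUsize::new(1));
    let mut handles = Vec::new();
    for _ in 0..num_threads {
        let queue = Arc::clone(&queue);
        let pending = Arc::clone(&pending);
        handles.push(thread::spawn(move || loop {
            let item = queue.lock().unwrap().pop_front();
            let Some(dir) = item else {
                if pending.load(Ordering::SeqCst) == 0 {
                    return; // nothing queued and nothing in flight: done
                }
                thread::yield_now();
                continue;
            };
            if let Ok(entries) = std::fs::read_dir(&dir) {
                for entry in entries.flatten() {
                    let path = entry.path();
                    // Note: is_dir() follows symlinks; real code would
                    // guard against symlink cycles.
                    if path.is_dir() {
                        pending.fetch_add(1, Ordering::SeqCst);
                        queue.lock().unwrap().push_back(path);
                    } else {
                        println!("{}", path.display());
                    }
                }
            }
            pending.fetch_sub(1, Ordering::SeqCst);
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
}

fn main() {
    // Far more threads than cores: fine for I/O-bound work, since the
    // thread count mostly determines how many syscalls are in flight.
    walk(PathBuf::from("."), 64);
}
```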
Experiencing the exact same issue. I've tried leaving different variations running on my laptop for the night, finding it absolutely frozen every time I wake up.
@Architector4 You might have some luck filtering out some directories. It looks like you are running on /, so excluding /proc, /sys, /tmp, etc. would likely make a difference (if you are on macOS, those paths might be different). If the files you care about are all on the same filesystem, you could try the --one-file-system option.
@tmccombs I don't think it would make any meaningful difference in my case -- I'm looking for all files that end with "png", of which there would not be any in such special directories, I imagine (I hope?). I'd also prefer to go through all filesystems (my SSD mounted on …). In my case I decided to resort to …
Oh, I think you're right. In your case, what is probably happening is that the queues are filling up because running …
I feel like the searching thread's performance would be I/O-bound most of the time (input being filesystem traversal, output going to stdout). In any case, this feels like a necessary feature for the wellbeing of the host machine, because this excessive memory usage seems to apply to any general case where such output is slower than such input. To demonstrate, I have a silly script:
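(The script itself did not survive the page extraction; a minimal Rust stand-in for what the next sentence describes might look like this.)

```rust
use std::io::{self, BufRead, Write};
use std::thread;
use std::time::Duration;

// Stand-in for the lost script: copy stdin to stdout, one line per
// second, to simulate an artificially slow consumer.
fn main() {
    let stdin = io::stdin();
    let mut stdout = io::stdout();
    for line in stdin.lock().lines() {
        let line = line.expect("read error");
        writeln!(stdout, "{line}").expect("write error");
        stdout.flush().expect("flush error");
        thread::sleep(Duration::from_secs(1));
    }
}
```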
This script prints each line it receives on stdin to stdout, but waits 1 second between them, causing artificially slow I/O. So I piped fd's output into it.
This reliably makes fd's memory usage balloon. I argue that some kind of limit is absolutely necessary here, and that it matters more than the performance gained by not having to account for one. Maybe it would be possible to introduce a command-line switch that disables the limit, for those who want the performance benefits and know ballooning won't happen?
Ideally yes, but: #918 (comment)
I agree, we need to do something. I can try switching to crossbeam channels again.
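For reference, a hedged sketch of what that switch might look like (it assumes the `crossbeam-channel` crate as a dependency; whether it actually performs better here is exactly what the benchmarks need to show):

```rust
use crossbeam_channel::bounded;
use std::thread;

fn main() {
    // crossbeam's bounded() plays the same role as std's sync_channel(),
    // and is often faster under contention with multiple senders.
    let (tx, rx) = bounded::<u32>(4096);
    let producer = thread::spawn(move || {
        for i in 0..100_000 {
            tx.send(i).unwrap(); // blocks while 4096 items are in flight
        }
    });
    for _ in rx.iter() {}
    producer.join().unwrap();
}
```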
Oh, sorry, missed that. Would it be possible to slap something like … ?
We originally switched to bounded channels for backpressure to fix sharkdp#918. However, bounded channels have a significant initialization overhead as they pre-allocate a fixed-size buffer for the messages.

This implementation uses a different backpressure strategy: each thread gets a limited-size pool of WorkerResults. When the size limit is hit, the sender thread has to wait for the receiver thread to handle a result from that pool and recycle it.

Inspired by [snmalloc], results are recycled by sending the boxed result over a channel back to the thread that allocated it. By allocating and freeing each WorkerResult from the same thread, allocator contention is reduced dramatically. And since we now pass results by pointer instead of by value, message passing overhead is reduced as well.

Fixes sharkdp#1408.

[snmalloc]: https://github.com/microsoft/snmalloc
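A minimal sketch of the recycling scheme this commit message describes; the names and sizes (`WorkerResult`'s field, the pool limit, the channel layout) are illustrative assumptions, not fd's actual code.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

// Stand-in for fd's WorkerResult; the real type carries a dir entry.
struct WorkerResult {
    path: String,
}

// Each sender thread owns a bounded pool of boxed results. New boxes are
// allocated on the sender thread; the receiver returns exhausted boxes
// over a per-thread channel, so every box is allocated and freed by the
// same thread, reducing allocator contention.
struct ResultPool {
    free: Receiver<Box<WorkerResult>>,  // recycled boxes come back here
    recycle: Sender<Box<WorkerResult>>, // cloned to the receiver side
    allocated: usize,
    limit: usize,
}

impl ResultPool {
    fn new(limit: usize) -> Self {
        let (recycle, free) = channel();
        ResultPool { free, recycle, allocated: 0, limit }
    }

    // Get a box to fill in. Prefer a recycled box; below the limit,
    // allocate fresh; at the limit, block until the receiver recycles
    // one -- that blocking is the backpressure that bounds memory.
    fn alloc(&mut self, path: String) -> Box<WorkerResult> {
        if let Ok(mut b) = self.free.try_recv() {
            b.path = path;
            return b;
        }
        if self.allocated < self.limit {
            self.allocated += 1;
            Box::new(WorkerResult { path })
        } else {
            let mut b = self.free.recv().expect("receiver hung up");
            b.path = path;
            b
        }
    }
}

fn main() {
    let (tx, rx) = channel::<Box<WorkerResult>>();
    let mut pool = ResultPool::new(4); // tiny limit to make blocking visible
    let recycle = pool.recycle.clone();

    let receiver = thread::spawn(move || {
        for result in rx {
            println!("{}", result.path);
            let _ = recycle.send(result); // hand the box back for reuse
        }
    });

    for i in 0..20 {
        tx.send(pool.alloc(format!("file-{i}"))).unwrap();
    }
    drop(tx); // let the receiver's loop end
    receiver.join().unwrap();
}
```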
When traversing a huge directory tree (5 TB of data, ~200 million entries) I noticed that fd will use excessive amounts of memory (I killed it at 35 GB RSS). Some investigation showed that `fd | wc -l` and `fd >/dev/null` do not have this problem, while `fd | less` again shows the same problem.

So my guess is that fd uses some unbounded queue to send the output, which piles up because the terminal emulator is too slow to print 200 million entries in time.