Over time the rg processes use more and more memory #458
Hello; I noticed that rg seems to use more and more memory over time. All the threads started out with less than two gigabytes in the VIRT and RES columns in top, but after thirty minutes of wall-clock time, all the threads are above three gigabytes.

and then several minutes later:

Is this enough to worry about?

Thanks

Comments
I'm sorry, but this isn't enough data for me to give a helpful response. What command are you running? What version of ripgrep are you using? Can you reproduce this behavior on a public repository so that I can try it myself? If not, can you isolate the behavior to a single file? Do other programs, like grep, show similar memory growth?
Sorry. The command is:

I installed ripgrep via

I can't give access to the machine in question, but the dataset is public: the 'main' portion of Ubuntu source packages, unpacked. It's around 470 GB of data in what I think is around 22M files and directories. Grep's memory use at start:
and after only five minutes:
I expect grep to still be running when I return; if it is, I'll grab another snapshot of its memory use at that point. I can also try reproducing this on smaller portions of the search space. Thanks
Can you teach me how to get that data onto a box? You may assume that I can spin up an Ubuntu 16.04 box on AWS. :-)
Sorry, but can you also share the

Here's me thinking off the cuff: since ripgrep runs in parallel, it has to buffer the entire output of searching a single file in memory. These buffers are reused, but they are never "shrunk." So if a search turns up a single file with an enormous amount of matching output, that buffer grows to hold it, and the capacity is kept for the lifetime of the thread.

Another possibility is that there is a bug in the input buffer handling, where ripgrep found a file that didn't trip its binary file detection but otherwise contains a very long line. As with output buffers, input buffers are reused and never shrunk.
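To make the "reused but never shrunk" pattern concrete, here is a minimal Rust sketch. It is not ripgrep's actual code; the Worker type and search_file routine are invented for illustration. The point is only that a per-thread output buffer's capacity ratchets up to the largest per-file output that thread ever sees, and clear() never gives that memory back:

```rust
// Minimal sketch of a per-thread output buffer that is reused across
// files but never shrunk. Hypothetical code, not ripgrep's.
struct Worker {
    // Reused for every file this thread searches; `clear()` keeps capacity.
    out_buf: Vec<u8>,
}

impl Worker {
    fn new() -> Worker {
        Worker { out_buf: Vec::new() }
    }

    // Buffer all "matches" for one file before printing them, as described
    // above. The buffer grows to fit the largest per-file output seen so far.
    fn search_file(&mut self, matches: &[&str]) {
        self.out_buf.clear(); // drops contents, keeps capacity
        for m in matches {
            self.out_buf.extend_from_slice(m.as_bytes());
            self.out_buf.push(b'\n');
        }
        // ...in the real program the buffer would be flushed to stdout here...
    }
}

fn main() {
    let mut w = Worker::new();
    w.search_file(&["only a few matches"]);
    // One file with a huge amount of output inflates the buffer...
    let huge: Vec<&str> = vec!["a fairly long matching line of text"; 1_000_000];
    w.search_file(&huge);
    // ...and every later file reuses that large capacity.
    w.search_file(&["small again"]);
    println!("output buffer capacity: {} bytes", w.out_buf.capacity());
}
```

Multiply that retained capacity by the number of worker threads and the resident set can start to look like the numbers reported above.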
Sadly
So grep went from 14M to 120M. Maybe rg going from 2 GB to 4 GB is no big deal. Just for kicks, here's the
I don't have the easiest mechanism to get just the Ubuntu sources unpacked; worse yet, I did the work six months ago and didn't take notes. This will be insanely wasteful -- it copies EVERYTHING from the archives. That's fine for my uses but needless for this experiment:
(This could probably be reduced to copying only the 'ubuntu/pool/main/' tree, with --exclude "*deb" to fetch only the sources, but I haven't tested those changes. There are local archives in S3, but I doubt they are rsync targets; if you want to try those, the 'debmirror' tool may be a better starting point.)
The idea is to turn absolute filenames like

(I haven't tested this on the full-blown data, but it should be close. Six months ago I used task-spooler to run multiple jobs at once; GNU parallel might do the job more easily.)
The full Ubuntu archive takes roughly 1.1 TB on my machine; the unpacked sources for 'main' here are roughly 470 GB. Thanks
Hmm, yes... I think it would be interesting to know exactly which types of files cause this, since ideally the input buffer stays relatively small. IIRC, the input buffer grows until it fits at least one line in memory. So if you have a 100MB JSON file (for example) that is all on one line, then that could cause the spike. I'd still like to take a look at this, but your result from grep makes me less worried that something is seriously wrong.
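For what it's worth, here is a hedged sketch of that grow-until-a-line-fits behavior. This is not ripgrep's real reader; fill_until_line, the 8 KiB starting size, and the doubling policy are all invented for illustration. It just shows why a single-line 100 MB file forces a 100 MB+ buffer that then sticks around for reuse:

```rust
use std::io::Read;

// Hypothetical line-oriented reader: keep growing the buffer until it
// contains at least one complete line (or EOF). Not ripgrep's code.
fn fill_until_line<R: Read>(rdr: &mut R, buf: &mut Vec<u8>) -> std::io::Result<usize> {
    let mut len = 0;
    loop {
        // Grow (roughly doubling) whenever the buffer is full.
        if buf.len() < len + 8192 {
            let new_len = (buf.len() * 2).max(len + 8192);
            buf.resize(new_len, 0);
        }
        let n = rdr.read(&mut buf[len..])?;
        if n == 0 {
            return Ok(len); // EOF: the whole file was one giant "line"
        }
        len += n;
        // A real implementation would only scan the newly read bytes.
        if buf[..len].contains(&b'\n') {
            return Ok(len); // at least one full line now fits
        }
        // No newline yet, so the buffer keeps growing; with reuse-without-
        // shrinking, that capacity is then retained for every later file.
    }
}

fn main() -> std::io::Result<()> {
    // 100 MB with no newline at all: one enormous line.
    let one_long_line = vec![b'x'; 100 * 1024 * 1024];
    let mut buf = Vec::new();
    let n = fill_until_line(&mut one_long_line.as_slice(), &mut buf)?;
    println!("read {} bytes; input buffer capacity is now {}", n, buf.capacity());
    Ok(())
}
```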
After each file has been processed, would it be possible to look at how big the buffer is before you reuse it and, if it's really large, shrink it down first?
@lespea Well... sure. But step 1 is figuring out the problem. And that solution isn't universally good. If every file in the corpus requires a large buffer, then you end up doing needless allocation thrashing.
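A rough sketch of the shrink-on-reuse idea and its trade-off, for illustration only (the recycle helper and the 8 MiB threshold are made up, not anything ripgrep actually does):

```rust
// Hypothetical shrink-on-reuse heuristic, not ripgrep's behavior.
const SHRINK_THRESHOLD: usize = 8 * 1024 * 1024; // arbitrary 8 MiB cap

fn recycle(buf: &mut Vec<u8>) {
    buf.clear(); // drop the previous file's output, keep the capacity
    if buf.capacity() > SHRINK_THRESHOLD {
        // Hand the excess memory back to the allocator before the next file.
        buf.shrink_to_fit();
        // Downside: if the next file also needs a large buffer, this forces
        // another big allocation, which is the "allocation thrashing" concern.
    }
}

fn main() {
    let mut buf: Vec<u8> = Vec::new();
    // Simulate one file that produced a huge amount of output.
    buf.extend(std::iter::repeat(b'x').take(64 * 1024 * 1024));
    println!("capacity before recycle: {}", buf.capacity());
    recycle(&mut buf);
    println!("capacity after recycle:  {}", buf.capacity());
}
```

A threshold like this leaves ordinary buffers alone and only releases pathological ones, but choosing that threshold is exactly the judgment call being debated here.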
I'm going to close this. I think the fact that grep uses a similar amount of memory (proportional to the number of threads) means that there probably isn't anything seriously wrong here. I do think it would be fun to take a look at the corpus and get a clearer understanding of where the memory is going, but if I'm being realistic, I don't think I'll be downloading 470GB to an AWS server any time soon.