Growing memory when parsing huge file #3623
Comments
I agree that the observed behaviour isn't very satisfying, but it's currently difficult to be sure whether it's spaCy leaking memory, or just Python not giving memory back to the operating system. If you're running in multiprocessing anyway, can you just have shorter running processes?
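One way to get shorter-running processes without restructuring the whole pipeline is to let the pool recycle its workers. Below is a minimal sketch, not the code from this thread: the model name, batch contents, and the `maxtasksperchild` value are placeholders.

```python
import multiprocessing as mp

import spacy

nlp = None  # filled in per worker process


def init_worker():
    # Every worker loads its own pipeline; the model name is only an example.
    global nlp
    nlp = spacy.load("en_core_web_sm")


def parse_batch(texts):
    # Keep the return value small so results don't accumulate in the parent.
    return [len(doc) for doc in nlp.pipe(texts)]


if __name__ == "__main__":
    # In practice these batches would be streamed lazily from the big file.
    batches = [["This is a sentence.", "Here is another one."]] * 1000

    # maxtasksperchild retires each worker after it has handled that many
    # batches and starts a fresh one, so any memory the old worker grew into
    # is handed back to the operating system.
    with mp.Pool(processes=4, initializer=init_worker, maxtasksperchild=100) as pool:
        for _ in pool.imap_unordered(parse_batch, batches):
            pass  # write out or aggregate results here
```

The trade-off is that each recycled worker has to reload the model, so `maxtasksperchild` should be large enough that the reload cost stays negligible.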
Yes, that's true.
I'd love to have that, and it should be possible to have the […]
It is indeed not easy to debug this. I'll try to run this while monitoring the objects' sizes in memory to see what causes the growth. But if I understood you correctly, it might be Python not giving memory back inside of spaCy? You asked whether I could have shorter-running processes, so you mean prematurely ending jobs and restarting workers in the hope that all memory for that worker is released? I'll first try monitoring spaCy's memory usage; maybe reinitializing only spaCy works as well. I have no experience with Cython, so I fear I cannot help with that, but it does look like a very useful feature.
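For the monitoring part, one option is to log each process's resident memory alongside the size of spaCy's string store, so growth can be attributed to interned strings or to something else. A rough sketch, assuming the third-party `psutil` package and `en_core_web_sm` as a stand-in model:

```python
import os

import psutil  # third-party; pip install psutil
import spacy

nlp = spacy.load("en_core_web_sm")  # example model


def log_memory(step):
    # Resident set size of this process, in MiB.
    rss_mib = psutil.Process(os.getpid()).memory_info().rss / (1024 ** 2)
    # Number of distinct strings spaCy has interned so far in this process.
    n_strings = len(nlp.vocab.strings)
    print(f"step={step} rss={rss_mib:.1f} MiB strings={n_strings}")


for i, doc in enumerate(nlp.pipe(["A few example sentences."] * 1000)):
    if i % 200 == 0:
        log_memory(i)
```

If the string count flattens out while RSS keeps climbing, the growth is coming from somewhere other than the string store.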
Most discussion is taking place at #3618. For my use case (parsing one 36GB file with multiple workers) the only working solution was to split up the file into smaller files and manually run the script over these files.
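A hedged sketch of that workaround: split the one large file into fixed-size chunks and run the parsing script once per chunk, so all memory is released when each process exits. The path, chunk size, and output directory here are placeholders.

```python
import itertools
from pathlib import Path


def split_file(path, lines_per_chunk=1_000_000, out_dir="chunks"):
    """Write the big input file out as numbered chunk files."""
    Path(out_dir).mkdir(exist_ok=True)
    with open(path, encoding="utf8") as fh:
        for i in itertools.count():
            chunk = list(itertools.islice(fh, lines_per_chunk))
            if not chunk:
                break
            Path(out_dir, f"chunk_{i:05d}.txt").write_text("".join(chunk), encoding="utf8")


# Each chunk can then be processed by a separate invocation of the parsing
# script (e.g. from a shell loop), so the operating system reclaims all memory
# when that process exits.
```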
Closing this in favour of #3618; please direct memory issues there.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
Cross-post from Stack Overflow.
I am not entirely sure whether there is an error in my multiprocessing implementation, or whether spaCy is leaking memory, or whether this is expected behaviour.
In the code below, you'll see that for very large files (in my case 36GB), the memory usage will keep rising over time. We have a machine with 384GB of RAM, so it can take a lot. Still, after 14M sentences, the RAM usage is already at 70% and rising.
In issue #3618 it was mentioned that it is to be expected that new strings will increase the memory size. But that alone doesn't seem to explain the continuous increase in memory consumption (after a while you'd expect all strings to be 'known', with only rare exceptions still being added).
The one thing that I can think of is that each subprocess uses its own spaCy instance (is that true?), and as such the string memory/vocabulary is specific to each process. That would mean that the size of the whole vocabulary is basically duplicated across all subprocesses. Is that true? If so, is there a way to share only one 'lookup table'/Vocab instance across multiple spaCy instances? If this is not the issue, do you have any other idea what may be wrong?
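One way to probe the per-process hypothesis directly is to parse some previously unseen tokens in worker processes and compare string-store sizes: the parent's store stays the same while each worker's copy grows on its own. A small sketch, again assuming `en_core_web_sm` as a stand-in model and made-up words to force new strings:

```python
import multiprocessing as mp

import spacy

nlp = spacy.load("en_core_web_sm")  # example model, loaded once per process


def count_after_parsing(text):
    # Runs in a separate process with its own copy of the vocab/string store.
    list(nlp.pipe([text]))
    return len(nlp.vocab.strings)


if __name__ == "__main__":
    before = len(nlp.vocab.strings)
    with mp.Pool(2) as pool:
        child_counts = pool.map(
            count_after_parsing,
            ["Zorblatt frumious bandersnatch!", "Another xyzzy-ish sentence."],
        )
    after = len(nlp.vocab.strings)
    # The parent's store is unchanged: strings interned in the children never
    # reach it, while each child's own store grew independently.
    print("parent before:", before, "parent after:", after)
    print("child counts:", child_counts)
```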
Info about spaCy