Memory issues for long-running parsing processes #3618
Comments
Unfortunately it's really hard to tell whether a Python program is leaking memory or whether the Python interpreter is just refusing to give memory back to the operating system. This blog post explains some good tips: https://medium.com/zendesk-engineering/hunting-for-memory-leaks-in-python-applications-6824d0518774 A small amount of increasing memory usage is expected as you pass over new text, as we add new strings to the string store.
|
Hi Honnibal Best Olivier |
Hi Honnibal, I have a similar issue where I loop over different texts over time and observe increasing memory usage until the memory fills up completely, because I process millions of different texts (using nlp.pipe). Thanks, |
@zsavvas The simplest strategy is to just reload the NLP object periodically: `nlp = spacy.load("en_core_web_lg")`. It's not a very satisfying solution, but everything we've tried to implement to flush the string store periodically ends up really difficult, because we can't guarantee what other objects are still alive that are trying to reference those strings or vocab items. |
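A minimal sketch of that "reload periodically" workaround (not code from the thread; psutil, the 4 GB threshold and the `batches_of_texts` iterable are assumptions for illustration):

```python
# Hedged sketch of the periodic-reload workaround described above.
import gc
import os

import psutil
import spacy

def rss_gb():
    """Resident set size of the current process, in GB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3

nlp = spacy.load("en_core_web_lg")
for i, batch in enumerate(batches_of_texts):   # placeholder iterable of lists of strings
    for doc in nlp.pipe(batch):
        pass  # ... do the real work here ...
    if i % 100 == 0 and rss_gb() > 4.0:        # arbitrary threshold
        nlp = spacy.load("en_core_web_lg")     # drop the old pipeline and its string store
        gc.collect()
```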
@honnibal Just to know: what is the string store for? A cache for faster lookup? How does it differ from the Vocab? |
@honnibal I'm not totally convinced that reloading the NLP object actually helps. I tried with this little program, and memory grows in the same way.
So I'm a bit puzzled by Python memory management... Best, Olivier |
Interesting. Could you try a more explicit approach where you explicitly delete the nlp object and trigger garbage collection? |
@BramVanroy, sure, here is the modified code snippet:
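The commenter's actual snippet isn't preserved in this copy of the thread; the following is a hedged sketch of the variant being discussed, where the nlp object is explicitly deleted and garbage collection is forced before reloading (psutil and the batch iterable are assumptions):

```python
# Hedged sketch: explicit deletion of the pipeline plus forced GC between batches.
import gc
import os

import psutil
import spacy

def print_rss(label):
    rss = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
    print(f"{label}: {rss:.1f} MB")

for i, batch in enumerate(batches_of_texts):  # placeholder iterable of lists of strings
    nlp = spacy.load("en_core_web_lg")
    for doc in nlp.pipe(batch):
        pass  # ... do the real work here ...
    del nlp            # drop the pipeline explicitly...
    gc.collect()       # ...and force a collection before the next reload
    print_rss(f"after batch {i}")
```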
And the output:
I would say that the memory is still growing, but maybe more slowly... Best, Olivier |
Okay, then I am out of ideas. I understand @honnibal's response that it is hard to tell who is responsible, spaCy or Python itself. I'm not sure how to debug this, especially since it is happening at a low level where deleting the object itself doesn't help. But that only makes this issue all the more dangerous/important, I believe. Even if this issue only arises when processing large batches of text, a memory leak is still a memory leak. Personally, I think that this should be a priority bug. But of course I have all the respect for the maintainers of this package and their priorities, and I'm not skilled enough to contribute to this issue, except for testing. Some good Stack Overflow posts:
From further testing, I also found that this is not only related to |
One sneaky thing that I found that you can do is use a multiprocessing Pool, but limit the number of tasks a child process can do. E.g. (untested, from the top of my head):

```python
from functools import partial
from multiprocessing import Pool

import spacy
from spacy.util import minibatch

with Pool(16, maxtasksperchild=5) as pool:
    nlp = spacy.load('en')
    proc_func = partial(process_batch_func, nlp)
    for batch in minibatch(...):
        pool.apply_async(proc_func, (batch,))
```

This way, after having processed 5 batches, a child process will be killed and replaced by a new one. This ensures that all 'leaked memory' from that child process is freed. I can confirm that with this method, I could parse a single file of 300M sentences, using batches of 80K sentences. Here is a worked-out example script, with comments. |
We were having a discussion about this on Twitter, but I propose to keep the discussion here to make it easily accessible to others. @honnibal asked how the lingering memory leak should be discussed and communicated to users going forward. I do not think this should be stated explicitly anywhere. It may scare off people who would never run into the problem anyway. The problem only arises when you have a huge dataset, and I assume that people who do are knowledgeable enough to go look on GitHub for issues they run into. (But I might be too naive here?) To that end, though, I think that the title of this issue should be changed. The issue is not specific to
As I said before, I do not have the skills to debug this on a low level, nor do I have the time if I did. I will, however, try to improve the repo I shared above. I will also add an example where, instead of one huge file, you want to parse many (many) smaller files, i.e. where you want to parse all files in a directory efficiently. If requested, these examples can reside in |
Hi all, Best Olivier |
Hello, I'm facing issues which fit the description of "Memory issues for long-running processes". It doesn't seem to involve memory leaks. I'm getting inconsistent errors, around the 200th processed text (approx. 2000 characters each). I'm running Python under gdb. I mostly get SIGSEGV reporting just a segmentation fault. I also get a SIGABRT, which is usually a double free or corruption. The backtrace from these SIGABRT cases usually has a sequence of calls from spacy/strings.cpython (std::_Rb_tree<unsigned long, unsigned long, std::_Identity, std::less, std::allocator >::_M_erase(std::_Rb_tree_node*)) leading up to the problem. Due to the inconsistency of the issue and the fact that the problematic code is part of a large project, I've so far been unable to create a reasonably sized reproduction, but I'm working on it. If there is any extra information I can provide to help, or tips on how I can better explain the problem, please let me know. |
@fabio-reale Thanks for the detailed report. Which version of spaCy are you running? |
I'm running version 2.1.3 |
explosion/srsly#4 – source of at least one leak. Hopefully there are no more! In our case spaCy actually held up pretty well on not-so-long-running batch jobs (meaning we didn't hit OOM), but we've been using srsly.ujson a lot and it's been a huge pain trying to understand why we were leaking memory and crashing. I guess trying to monkey-patch srsly.ujson with the stdlib json before importing spaCy could be a temporary workaround. |
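A rough, untested sketch of that monkey-patching idea (it assumes srsly exposes its bundled ujson as the submodule srsly.ujson with loads/dumps functions, and that callers look them up at call time):

```python
# Hedged sketch of the temporary workaround mentioned above, not a verified fix.
import json

import srsly.ujson  # assumption: srsly ships its bundled ujson here

# Route (de)serialization through the stdlib json module instead of the
# C ujson extension suspected of leaking.
srsly.ujson.loads = json.loads
srsly.ujson.dumps = json.dumps

import spacy  # import spaCy only after the patch is in place
```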
Thanks to @sadovnychyi's hard sleuthing and patch, we've now fixed the ujson memory leak. I'm still not sure whether this would be the problem in this thread, since the problem should have been the same between v2.0.18 and v2.1, as both were using the same ujson code. It could be that we converted some call over from json to ujson, though? @BramVanroy and @oterrier If you have time, could you check whether your long-running jobs are more memory-stable after doing
@fabio-reale I would say that's a different issue, so we should move it out into a different thread. If possible, could you log the texts during parsing? Then see if parsing just those texts triggers the issue for you. This should help us isolate the problem. |
@honnibal, sorry Matthew, but I confirm that the problem with long-running jobs remains, so +1 to move this ujson issue into a different thread. Best, Olivier |
I was able to process all the texts I needed in small batches, and those also ran into the problem sometimes. This tells me two things:
I will keep investigating it further. When I have a meaningful issue title, I'll be sure to start a new one. Thanks |
I ran into a memory leak as well; maybe it is related. When I was using beam parse extensively while training an NER model and calculating the loss on a test dataset for every epoch, memory usage quickly grew to 10 GB after about 20 epochs. If I remove the beam parser, everything is fine and RAM usage stays at around 500 MB. |
I also suffered from a memory leak, but it was actually in the ruler. Finally, I wrote the following script that checks the memory use: if it is more than a threshold x, I dump the ruler to a file and start a new one.
I used the following patch to do so. |
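The actual patch is not preserved in this copy of the thread; below is a hedged reconstruction of the kind of workaround described, assuming spaCy v2.x with an EntityRuler pipe, psutil for memory measurement, and an arbitrary threshold (the data source `work_items()` is a placeholder):

```python
# Hedged sketch, not the author's actual patch: dump and rebuild the
# EntityRuler whenever the process RSS exceeds a threshold.
import os

import psutil
import spacy
from spacy.pipeline import EntityRuler

MEM_THRESHOLD_MB = 4000  # the "threshold x" – value is an assumption

def rss_mb():
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

nlp = spacy.load("en_core_web_sm")
ruler = EntityRuler(nlp)
nlp.add_pipe(ruler)

for patterns, texts in work_items():  # placeholder data source
    ruler.add_patterns(patterns)
    for doc in nlp.pipe(texts):
        pass  # ... use the doc ...
    if rss_mb() > MEM_THRESHOLD_MB:
        ruler.to_disk("patterns.jsonl")   # dump the accumulated patterns
        nlp.remove_pipe("entity_ruler")   # drop the bloated ruler
        ruler = EntityRuler(nlp)          # start a new one
        ruler.from_disk("patterns.jsonl") # and reload the patterns
        nlp.add_pipe(ruler)
```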
@shaked571 Thanks for updating and sharing your code. That's very interesting 🤔 I wonder if this could be related to #3541 – we've been suspecting that there might be some memory issue in the |
I am having the same issue with Python 3.6.8 on Mac with spaCy 2.1.4. My script loads the "en_core_web_lg" model and gets NER tags for relatively short text utterances.
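A minimal sketch of the kind of workload being described (assumed example, not the commenter's script):

```python
# Hedged illustration: load en_core_web_lg and stream short utterances
# through the pipeline to collect NER tags.
import spacy

nlp = spacy.load("en_core_web_lg")
utterances = ["Apple is looking at buying a U.K. startup.", "Book a flight to Paris."]

for doc in nlp.pipe(utterances):
    print([(ent.text, ent.label_) for ent in doc.ents])
```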
|
It seems like disabling the parser and NER prevents memory from leaking:
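The original snippet isn't preserved here; this is a hedged sketch of what disabling those components might look like:

```python
# Hedged sketch, not the commenter's original code: load the model with the
# parser and NER disabled and stream texts through the remaining pipeline.
import spacy

nlp = spacy.load("en_core_web_lg", disable=["parser", "ner"])
texts = ["Some example text.", "Another example text."]

for doc in nlp.pipe(texts):
    tokens = [token.text for token in doc]
```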
Spacy: Hope this helps. |
@azhuchkov Except I need NER in my application. It may be different for others, of course. |
Running with spaCy 2.1.8 seems at least to run smoother? With the script from @oterrier, without deletion of the nlp object and gc, memory now no longer grows in only one direction...
|
Hello.
The results of processing 1000 examples are the following:
I think you can draw conclusions from these results: using nlp.pipe(), especially with as_tuples=True, can lead to a huge memory leak, while using the standard approach without nlp.pipe() on the same number of examples gives much better results in terms of memory usage, and actually not much worse results in terms of processing time. What really surprised me was the difference between nlp.pipe() and nlp.pipe(as_tuples=True); why does using as_tuples almost triple the RAM usage? spaCy version: 2.1.8 Edit: I confirm that dividing the document into smaller chunks and then reloading nlp helped. Best regards. |
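For reference, a minimal illustration of the two calling patterns being compared (assumed example, not the commenter's benchmark code):

```python
# Hedged illustration of nlp.pipe() with and without as_tuples, the two code
# paths whose memory behaviour is compared above.
import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["First example text.", "Second example text."]

# Plain streaming over strings:
for doc in nlp.pipe(texts):
    pass  # ... process doc ...

# Streaming (text, context) pairs; the context is passed through unchanged:
pairs = [(text, {"id": i}) for i, text in enumerate(texts)]
for doc, context in nlp.pipe(pairs, as_tuples=True):
    pass  # ... process doc together with its context ...
```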
I think I might be hitting this issue too. I'm feeding a spacy
On Python 3.6.5, spaCy 2.2.0 with model
That's quite the memory allocation request (18 exabytes, for the curious). Maybe the memory issue is with thinc and not spaCy? I'm not much of a
as well as:

```
implementing_args = {<numpy.ndarray at remote 0x7f9fc036c210>, <unknown at remote 0x7f9f00000001>, (None,), <code at remote 0x7f9fe28ff4b0>, , '__builtins__', <unknown at remote 0xffffffff>,
Python Exception <class 'UnicodeDecodeError'> 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128):
<unknown at remote 0x7ffc8cf046c0>, , <unknown at remote 0x7ffc8cf046a0>, <unknown at remote 0x7ffc8cf04698>, <unknown at remote 0x5a2db0>, <unknown at remote 0x8>,
Python Exception <class 'UnicodeDecodeError'> 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128):
<unknown at remote 0x2732ef0>, 0x0, <unknown at remote 0x7f9fffffffff>, 'drop', <unknown at remote 0x55565613444d6b00>, , <code at remote 0x7f9fe28ff4b0>,
Python Exception <class 'UnicodeDecodeError'> 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128):
```
I'm happy to provide any other info or re-run under different configurations as might be helpful. |
@flawaetz Very useful! Especially the RecursionError seems to indicate an issue that can explain the memory leak. |
@flawaetz Great analysis, thanks! The huge allocation is especially interesting. I wonder whether there might be a memory corruption behind that? If something wrote out-of-bounds, that might explain how we end up with such a ridiculous size to allocate... |
@honnibal Could be memory corruption. It's interesting to me that spaCy 1.10.1 with the same
I'm happy to provide (out of band) code and source data to reproduce if that would be helpful. |
Closing this one since we're pretty sure the leak fixed in #4486 was the underlying problem. |
After applying #4486, my memory usage for the script from @oterrier's comment above without reloading
For the unusual behavior related to thinc/numpy, I wonder if these bugs might be related? |
This looks great. Thanks for your fix @adrianeboyd ! |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
How to reproduce the behaviour
Hi,
I suspect a memory leak when using nlp.pipe() intensively; my process keeps growing in memory and it looks like it is never garbage collected. Do you think that is possible?
Here is the little Python script I'm using to reproduce the problem:
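The original script is not preserved in this copy of the thread; the following is a hedged sketch of the kind of loop being described, with psutil assumed for memory reporting and the texts invented for illustration:

```python
# Hedged reconstruction of the described reproduction, not the original script:
# repeatedly push batches of text through nlp.pipe() and print the process RSS.
import os

import psutil
import spacy

nlp = spacy.load("en_core_web_lg")
base = "This is a sample sentence to push through the pipeline. " * 20
process = psutil.Process(os.getpid())

for i in range(100):
    # Vary the strings so new entries keep being added to the string store.
    docs = list(nlp.pipe(base + str(j) for j in range(1000)))
    rss_mb = process.memory_info().rss / 1024 ** 2
    print(f"iteration {i}: rss = {rss_mb:.1f} MB")
```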
For me the output looks like
And at the end the process pmap output is:
total 4879812K
Can you confirm?
Your Environment
Best regards, and congrats on the amazing work on spaCy.
Olivier Terrier