Distributed error: SystemError: error return without exception set #377
Comments
As with the sibling issue, this is probably related to the serialized LDA objects being larger than 4 GB (2 GB?), with so many topics. Objects that large may confuse serialization protocols (pickle, Pyro) which assume 32-bit lengths (or 31-bit, if signed). Gensim itself doesn't care and numpy doesn't care, but when sending data around in the distributed/multicore version, the serialization protocols cannot handle it. A workaround may be using the newest pickle protocol (4) in Python 3, which supports 64-bit object sizes. Worth a shot.
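Something like this, as an untested sketch (the matrix shape just mimics a large LDA model; you need enough RAM to hold both the array and its pickle):

```python
import pickle

import numpy as np

# Illustrative only: a float32 matrix in the ballpark of a 100k x 10k
# topic matrix weighs in around 4 GB once serialized.
big = np.ones((100000, 10000), dtype=np.float32)

# Protocols <= 2 use 32-bit length fields in places and can fail on
# payloads this large; protocol 4 (Python 3.4+) uses 64-bit framing.
data = pickle.dumps(big, protocol=4)
print(len(data))
```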
OK, I tried that. I cannot figure out what is trying to call `dict.iteritems()` instead of `items()`:

```
ubuntu@ip-172-31-42-60:~$ python3 lda_en.py
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
```
OK, I seem to have fixed this, although I am not exactly sure what I did. I suspect it was that I went through gensim and replaced all instances of `.iteritems()` with `.items()`.
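For reference, the Python 2 vs. 3 difference in question; cross-version code often goes through a compatibility shim such as `six.iteritems` instead of calling either method directly:

```python
d = {"topic": 42}

# Python 2 only -- on Python 3 this raises AttributeError:
#   d.iteritems()

# Works on both (a list of pairs on Python 2, a view on Python 3):
items = d.items()

# The usual cross-version spelling, via the six library:
from six import iteritems
for key, value in iteritems(d):
    print(key, value)
```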
Hmm, can you send the diff please? I can't see any such instances in LDA (which would be a bug, since gensim is both py3k and py2k compatible). We're running unit tests under py3k too, so this would have to be in some untested code path... |
I was not hacking on source-controlled code, unfortunately. I am using Ubuntu Vivid and installed gensim[distributed]; if you go to the python3.4 gensim directory and grep for `.iteritems()`, you will find some instances.
I did; there was only one, so I'm not sure what you mean.
I had them in distributed LSI and LDA. Perhaps the code in pip is out of date?
It is still happening, although I seem to have delayed when it happens; I'm not sure. I uninstalled and reinstalled gensim[distributed]:

```
ubuntu@ip-172-31-42-60:/usr/local/lib/python3.4/dist-packages/gensim$ grep -sE '.iteritems()' .//
```

But fixing these is not enough to fix the problem.
(Note that my grep command above came out mangled because I did not mark it as code.)
Yeah, these have all been resolved some time ago already. Can you try gensim from the `develop` branch?
Using the develop branch works, thanks! |
I see this error now:

```
2015-07-01 20:44:54,329 : INFO : initializing corpus reader from wiki_en_tfidf.mm.bz2
2015-07-01 20:58:37,206 : INFO : running online LDA training, 10000 topics, 1 passes over the supplied corpus of 3831719 documents, updating model once every 150000 documents, evaluating perplexity every 1500000 documents, iterating 50x with a convergence threshold of 0.001000
```
Yeah, the same "32-bit danger" comment applies. One idea to overcome it is to use Python 3, which has a new pickle protocol (version 4, IIRC). That one should be large-object-friendly. I never tried it (esp. with Pyro), but in your case, it's worth trying!
Hmm, looking at the traceback, Pyro uses `pickle.dumps` with its own `Pyro4.config.PICKLE_PROTOCOL_VERSION` setting. It's still possible large objects are supported in Pyro, through some extra settings. It's been a while since I used it; I don't remember exactly, sorry. CC @irmen.
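If you want to experiment: since the traceback shows Pyro feeding `Pyro4.config.PICKLE_PROTOCOL_VERSION` straight into `pickle.dumps`, something like this might be worth a try on Python 3 (untested sketch; whether Pyro's own wire format then accepts the larger result is a separate question):

```python
import Pyro4

# Must be set before any proxies or daemons are created.
Pyro4.config.SERIALIZER = "pickle"         # select the pickle serializer
Pyro4.config.PICKLE_PROTOCOL_VERSION = 4   # 64-bit framing, Python 3.4+ only
```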
FYI, I am using Python 3 now.
Pyro uses `struct.pack` to create the wire protocol messages that travel over the network. They consist of a small header and the serialization payload (which can be pickled data if you enabled that, or json, or serpent, etc.). The Pyro header enforces a 32-bit size limitation, sorry. I never suspected people would transfer such large payloads using Pyro; it's not really meant to handle that efficiently. See https://pythonhosted.org/Pyro4/tipstricks.html#binarytransfer for more information.

Can you perhaps verify if a more efficient datatype can be used? Or can you not chunk the transfer? Or use another, optimized form of data transfer for just these large blobs? If it's a big issue, then please raise a ticket over at https://github.com/irmen/Pyro4 (although I don't think it will be fixed, because I'd rather not break Pyro's wire protocol).

Edit: actually, it is a 32-bit signed number that is used for the message length, so effectively Pyro has a 2-gigabyte message size limit.
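To make that limit concrete (this is not Pyro's actual header layout, just the same `struct` behaviour):

```python
import struct

# A signed 32-bit length field tops out at 2**31 - 1 bytes (~2 GiB):
header = struct.pack("!i", 2**31 - 1)   # fine
try:
    struct.pack("!i", 2**31)            # one byte too many
except struct.error as e:
    print("over the 2 GiB limit:", e)
```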
Thanks for the comments. This looks like a serious logistical issue w.r.t. scaling. |
I'm inclined to disagree, if we're talking about Pyro. Pyro is remote-method-invocation middleware, not a large-data-transfer tool. There are other protocols designed to do that, as I pointed out in the paragraph of the docs I linked earlier. (This doesn't mean we cannot talk about possible ways to improve matters!) Can gensim perhaps not send the whole data in one go, but chop it up into <2 GB chunks?
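Roughly what I mean by chunking, as an untested sketch (the helper names and the 1 GiB budget are made up for illustration):

```python
import numpy as np

def iter_row_blocks(matrix, max_bytes=1 << 30):
    """Yield (start_row, block) pairs, each block well under max_bytes raw."""
    rows_per_block = max(1, max_bytes // (matrix.shape[1] * matrix.itemsize))
    for start in range(0, matrix.shape[0], rows_per_block):
        yield start, matrix[start:start + rows_per_block]

def reassemble(shape, dtype, blocks):
    """Rebuild the full matrix from (start_row, block) pairs on the receiver."""
    out = np.empty(shape, dtype=dtype)
    for start, block in blocks:
        out[start:start + block.shape[0]] = block
    return out
```

Each block would then travel as its own Pyro call, staying under the message size limit.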
(I was just referring to gensim!)
@irmen I'm wondering what the best path forward here is. I'm thinking about forking Pyro4, but I have no idea how deep the rabbit hole goes on this.
Thanks for the info @irmen, super useful! @brianmingus IIRC, there are three places where large objects are sent:

1. the dispatcher sending the full model state out to each worker (the `dispatcher.reset(self.state)` call visible in the traceback);
2. the workers sending their model updates back to be merged;
3. the dispatcher sending out chunks of documents (jobs) to the workers.
Number 3 is trivial to mitigate (just use a smaller `chunksize`).

For both 1. and 2., the largest part of a model is a 2D words x topics matrix of floats (in your case 100k x 10k). We could add support for sending it out in pieces (multiple smaller matrices). But I'm afraid your use case is so special, and the required code complexity so high, that it doesn't warrant being included in gensim directly. So the change would have to be a one-off change in your own code / copy of gensim. I'm not sure; maybe it's not so hairy and I'm wrong.

I was also thinking whether making use of sparsity for 2. could help (sending the matrices for model updates in sparse form = only the diff, or some such). But sparse formats have a lot of overhead, so the sparsity has to be >60% for this to make sense. Plus, it would introduce an extra layer of complexity -- if some worker happened to produce a very dense update on some document chunk, we'd be hitting the same error suddenly, unexpectedly.
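A back-of-the-envelope check of that overhead claim, for the curious: scipy's CSR format stores the values plus two index arrays, so float32 data with 32-bit indices costs roughly 8 bytes per non-zero versus 4 bytes per dense cell:

```python
import numpy as np
from scipy import sparse

dense = np.random.rand(1000, 1000).astype(np.float32)
dense[dense < 0.6] = 0            # make the "update" roughly 60% sparse
csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
# Around 60% sparsity the two sizes are close; below that, CSR is larger.
print(dense_bytes, csr_bytes)
```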
By the way, a version of online LDA was also merged into Spark's MLlib recently. MLlib is a new library, full of bugs, but it looks like that makes no difference in this particular case. It may be worth trying, as Spark touts distributed computing directly, and it will make for an interesting benchmark comparison too. AWS supports Spark out of the box (but check the versions; you probably want the latest MLlib = fewest bugs).
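If you try it from Python, a minimal sketch (this assumes a Spark version whose Python API already exposes MLlib's LDA; the toy corpus is made up):

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import LDA
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="lda-sketch")

# An MLlib LDA corpus is an RDD of [document id, term-count vector].
corpus = sc.parallelize([
    [0, Vectors.dense([1.0, 2.0, 0.0])],
    [1, Vectors.dense([0.0, 1.0, 3.0])],
]).cache()

model = LDA.train(corpus, k=2)
print(model.topicsMatrix())   # terms x topics matrix
sc.stop()
```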
If you're going to use PySpark, you'll notice that it uses my Pyrolite library to pickle/unpickle data for the Python workers. But they only use the (un)pickler; they're not touching the Pyro protocol part. I don't know if pickle itself has a size limitation.
Yes, it does; we have come full circle :) Anyway, Brian could use the command-line / Scala API for Spark. The MLlib implementation of LDA has nothing to do with Python AFAIK; it's in Scala.
@brianmingus Have you tried enabling Pyro compression? This may reduce the message payload to a size that fits under 2 GB (although I wonder if Python is able to compress a >4 GB chunk of data at all...).
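For reference, turning that on is a one-liner (set it before any proxies are created; whether zlib can chew through a multi-gigabyte buffer in one go is exactly my doubt above):

```python
import Pyro4

Pyro4.config.COMPRESSION = True   # zlib-compress message payloads on the wire
```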
@brianmingus Is this still an issue when using compression? Does MLlib provide an alternative?
This is with:

```
ubuntu@ip-172-31-33-28:~$ python lda_en.py
2015-07-01 01:15:42,687 : INFO : initializing corpus reader from <bz2.BZ2File object at 0x7f5587d67b90>
2015-07-01 01:15:42,717 : INFO : accepted corpus with 3831719 documents, 100000 features, 595701551 non-zero entries
MmCorpus(3831719 documents, 100000 features, 595701551 non-zero entries)
2015-07-01 01:15:42,724 : INFO : using symmetric alpha at 0.0001
2015-07-01 01:15:42,862 : INFO : registering worker #0 at PYRO:[email protected]:55865
2015-07-01 01:15:42,960 : INFO : initializing worker #0
2015-07-01 01:15:42,964 : INFO : using symmetric alpha at 0.0001
2015-07-01 01:15:42,965 : INFO : using serial LDA version on this node
2015-07-01 01:19:06,128 : INFO : registering worker #1 at PYRO:[email protected]:60184
2015-07-01 01:19:06,229 : INFO : initializing worker #1
2015-07-01 01:19:06,233 : INFO : using symmetric alpha at 0.0001
2015-07-01 01:19:06,234 : INFO : using serial LDA version on this node
2015-07-01 01:22:29,332 : INFO : registering worker #2 at PYRO:[email protected]:43388
2015-07-01 01:22:29,333 : WARNING : unresponsive worker at PYRO:[email protected]:43388, deleting it from the name server
2015-07-01 01:22:29,341 : INFO : using distributed version with 2 workers
2015-07-01 01:25:51,758 : INFO : running online LDA training, 10000 topics, 1 passes over the supplied corpus of 3831719 documents, updating model once every 4000 documents, evaluating perplexity every 40000 documents, iterating 50x with a convergence threshold of 0.001000
2015-07-01 01:25:51,759 : INFO : initializing 2 workers
Traceback (most recent call last):
  File "lda_en.py", line 13, in <module>
    lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10000, distributed=True)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/ldamodel.py", line 317, in __init__
    self.update(corpus)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/ldamodel.py", line 558, in update
    self.dispatcher.reset(self.state)
  File "/usr/local/lib/python2.7/dist-packages/Pyro4/core.py", line 168, in __call__
    return self.__send(self.__name, args, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/Pyro4/core.py", line 376, in _pyroInvoke
    compress=Pyro4.config.COMPRESSION)
  File "/usr/local/lib/python2.7/dist-packages/Pyro4/util.py", line 167, in serializeCall
    data = self.dumpsCall(obj, method, vargs, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/Pyro4/util.py", line 415, in dumpsCall
    return pickle.dumps((obj, method, vargs, kwargs), Pyro4.config.PICKLE_PROTOCOL_VERSION)
SystemError: error return without exception set
```