
Distributed error: SystemError: error return without exception set #377

Open
ghost opened this issue Jul 1, 2015 · 27 comments
Labels
bug (Issue described a bug) · difficulty hard (Hard issue: required deep gensim understanding & high python/cython skills)

Comments

@ghost commented Jul 1, 2015

This is with

  • gensim: 0.11.1.post1
  • scipy: 0.15.1
  • numpy: 1.9.2
  • Pyro4: 4.38

ubuntu@ip-172-31-33-28:~$ python lda_en.py
2015-07-01 01:15:42,687 : INFO : initializing corpus reader from <bz2.BZ2File object at 0x7f5587d67b90>
2015-07-01 01:15:42,717 : INFO : accepted corpus with 3831719 documents, 100000 features, 595701551 non-zero entries
MmCorpus(3831719 documents, 100000 features, 595701551 non-zero entries)
2015-07-01 01:15:42,724 : INFO : using symmetric alpha at 0.0001
2015-07-01 01:15:42,862 : INFO : registering worker #0 at PYRO:[email protected]:55865
2015-07-01 01:15:42,960 : INFO : initializing worker #0
2015-07-01 01:15:42,964 : INFO : using symmetric alpha at 0.0001
2015-07-01 01:15:42,965 : INFO : using serial LDA version on this node
2015-07-01 01:19:06,128 : INFO : registering worker #1 at PYRO:[email protected]:60184
2015-07-01 01:19:06,229 : INFO : initializing worker #1
2015-07-01 01:19:06,233 : INFO : using symmetric alpha at 0.0001
2015-07-01 01:19:06,234 : INFO : using serial LDA version on this node
2015-07-01 01:22:29,332 : INFO : registering worker #2 at PYRO:[email protected]:43388
2015-07-01 01:22:29,333 : WARNING : unresponsive worker at PYRO:[email protected]:43388, deleting it from the name server
2015-07-01 01:22:29,341 : INFO : using distributed version with 2 workers
2015-07-01 01:25:51,758 : INFO : running online LDA training, 10000 topics, 1 passes over the supplied corpus of 3831719 documents, updating model once every 4000 documents, evaluating perplexity every 40000 documents, iterating 50x with a convergence threshold of 0.001000
2015-07-01 01:25:51,759 : INFO : initializing 2 workers
Traceback (most recent call last):
  File "lda_en.py", line 13, in <module>
    lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10000, distributed=True)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/ldamodel.py", line 317, in __init__
    self.update(corpus)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/ldamodel.py", line 558, in update
    self.dispatcher.reset(self.state)
  File "/usr/local/lib/python2.7/dist-packages/Pyro4/core.py", line 168, in __call__
    return self.__send(self.__name, args, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/Pyro4/core.py", line 376, in _pyroInvoke
    compress=Pyro4.config.COMPRESSION)
  File "/usr/local/lib/python2.7/dist-packages/Pyro4/util.py", line 167, in serializeCall
    data = self.dumpsCall(obj, method, vargs, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/Pyro4/util.py", line 415, in dumpsCall
    return pickle.dumps((obj, method, vargs, kwargs), Pyro4.config.PICKLE_PROTOCOL_VERSION)
SystemError: error return without exception set

@piskvorky (Owner)

As with the sibling issue, this is probably related to the serialized LDA objects being larger than 4 GB (2 GB?), with so many topics.

That can confuse serialization protocols (pickle, Pyro) which assume 32-bit lengths (or 31-bit, if signed).

Gensim itself doesn't care and numpy doesn't care, but when sending data around in the distributed/multicore version, the serialization protocols cannot handle it.

A workaround may be to use the newest pickle protocol (4) in Python 3, which supports objects larger than 4 GB. Worth a shot.
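A quick sketch of that workaround (the config name comes straight from the traceback above; the dict below is just a tiny stand-in for the multi-gigabyte LDA state):

```python
import pickle

# Pickle protocol 4 (Python 3.4+) adds framing and 8-byte length
# prefixes, so it can serialize objects larger than 4 GiB; older
# protocols use 32-bit lengths. The traceback shows Pyro4 reads the
# protocol from Pyro4.config.PICKLE_PROTOCOL_VERSION, so it would
# have to be set to 4 on the dispatcher and on every worker.
state = {"alpha": [0.0001] * 10, "num_topics": 10000}  # tiny stand-in
blob = pickle.dumps(state, protocol=4)
assert blob[:2] == b"\x80\x04"          # protocol-4 header
assert pickle.loads(blob) == state      # round-trips intact
```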

@ghost (Author) commented Jul 1, 2015

OK, I tried that. I cannot figure out what is trying to call dict.iteritems() instead of .items().

ubuntu@ip-172-31-42-60:~$ python3 lda_en.py
2015-07-01 17:03:38,134 : INFO : initializing corpus reader from wiki_en_tfidf.mm.bz2
2015-07-01 17:03:38,180 : INFO : accepted corpus with 3831719 documents, 100000 features, 595701551 non-zero entries
2015-07-01 17:03:38,188 : INFO : using symmetric alpha at 0.0001
2015-07-01 17:03:38,278 : ERROR : failed to initialize distributed LDA ('dict' object has no attribute 'iteritems')
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/gensim/models/ldamodel.py", line 302, in __init__
    chunksize=chunksize, alpha=alpha, eta=eta, distributed=False)
  File "/usr/local/lib/python3.4/dist-packages/Pyro4/core.py", line 168, in __call__
    return self.__send(self.__name, args, kwargs)
  File "/usr/local/lib/python3.4/dist-packages/Pyro4/core.py", line 408, in _pyroInvoke
    raise data
AttributeError: 'dict' object has no attribute 'iteritems'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "lda_en.py", line 8, in <module>
    lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10000, chunksize=1000, passes=1, distributed=True)
  File "/usr/local/lib/python3.4/dist-packages/gensim/models/ldamodel.py", line 308, in __init__
    raise RuntimeError("failed to initialize distributed LDA (%s)" % err)
RuntimeError: failed to initialize distributed LDA ('dict' object has no attribute 'iteritems')

@ghost (Author) commented Jul 1, 2015

OK, I seem to have fixed this, although I am not exactly sure what did it. I suspect it was that I went through gensim and replaced all instances of .iteritems() with .items().
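For anyone hitting the same thing: dict.iteritems() exists only in Python 2; on Python 3, .items() already returns a lazy view, so the blanket replacement is safe. A minimal compatibility shim (a sketch, not gensim code) that works on both versions:

```python
def iteritems(d):
    """Lazily iterate over (key, value) pairs on both Python 2 and 3."""
    try:
        return d.iteritems()    # Python 2: lazy iterator
    except AttributeError:
        return iter(d.items())  # Python 3: items() is already a lazy view

counts = {"topic_0": 3, "topic_1": 7}
assert sorted(iteritems(counts)) == [("topic_0", 3), ("topic_1", 7)]
```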

@piskvorky (Owner)

Hmm, can you send the diff please?

I can't see any such instances in LDA (which would be a bug, since gensim is both py3k and py2k compatible).

We're running unit tests under py3k too, so this would have to be in some untested code path...

@ghost (Author) commented Jul 1, 2015

Unfortunately, I was not hacking on version-controlled code. I am using Ubuntu Vivid and installed gensim[distributed]; if you go to the python3.4 gensim directory and grep for .iteritems(), you will find some instances.

@piskvorky (Owner)

I did; there was only one .iteritems() instance, in distributed LSI (not LDA). I fixed it here just now: 73d8167

So I'm not sure what you mean.

@ghost (Author) commented Jul 1, 2015

I had them in both distributed LSI and LDA. Perhaps the code on PyPI is out of date?

@ghost (Author) commented Jul 1, 2015

It is still happening, although I seem to have delayed the point at which it happens; not sure. I uninstalled and reinstalled gensim[distributed].

ubuntu@ip-172-31-42-60:/usr/local/lib/python3.4/dist-packages/gensim$ grep -sE '.iteritems()' .//
./models/__init__.py: # id2word = dict((newid, oldid2word[oldid]) for oldid, newid in old2new.iteritems())
./models/lda_dispatcher.py: for name, uri in ns.list(prefix='gensim.lda_worker').iteritems():
./models/lda_dispatcher.py: for workerid, worker in self.workers.iteritems():
./models/lda_dispatcher.py: for workerid, worker in self.workers.iteritems():
./models/lsi_dispatcher.py: for name, uri in ns.list(prefix='gensim.lsi_worker').iteritems():
./models/lsi_dispatcher.py: for workerid, worker in self.workers.iteritems():
./models/lsi_dispatcher.py: for workerid, worker in self.workers.iteritems():
./test/test_corpora_dictionary.py: self.assertEqual(list(d.items()), list(d.iteritems()))

But fixing these is not enough to fix the problem.

@ghost (Author) commented Jul 1, 2015

(Note that my grep command came out modified because I did not format it as code here.)

@piskvorky (Owner)

Yeah these have all been resolved some time ago already.

Can you try gensim from the develop branch? Or I'll be making a new release this weekend, in case you prefer pip.

@ghost (Author) commented Jul 1, 2015

Using the develop branch works, thanks!

@ghost (Author) commented Jul 1, 2015

I see this error now:

2015-07-01 20:44:54,329 : INFO : initializing corpus reader from wiki_en_tfidf.mm.bz2
2015-07-01 20:44:54,372 : INFO : accepted corpus with 3831719 documents, 100000 features, 595701551 non-zero entries
2015-07-01 20:44:54,380 : INFO : using symmetric alpha at 0.0001
2015-07-01 20:55:11,047 : INFO : using distributed version with 3 workers

2015-07-01 20:58:37,206 : INFO : running online LDA training, 10000 topics, 1 passes over the supplied corpus of 3831719 documents, updating model once every 150000 documents, evaluating perplexity every 1500000 documents, iterating 50x with a convergence threshold of 0.001000
2015-07-01 20:58:37,207 : INFO : initializing 3 workers
Traceback (most recent call last):
  File "./lda_en", line 13, in <module>
    lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10000, chunksize=50000, passes=1, distributed=True)
  File "/usr/local/lib/python3.4/dist-packages/gensim-0.11.1_1-py3.4-linux-x86_64.egg/gensim/models/ldamodel.py", line 314, in __init__
    self.update(corpus)
  File "/usr/local/lib/python3.4/dist-packages/gensim-0.11.1_1-py3.4-linux-x86_64.egg/gensim/models/ldamodel.py", line 553, in update
    self.dispatcher.reset(self.state)
  File "/usr/local/lib/python3.4/dist-packages/Pyro4/core.py", line 168, in __call__
    return self.__send(self.__name, args, kwargs)
  File "/usr/local/lib/python3.4/dist-packages/Pyro4/core.py", line 387, in _pyroInvoke
    self._pyroConnection.send(msg.to_bytes())
  File "/usr/local/lib/python3.4/dist-packages/Pyro4/message.py", line 111, in to_bytes
    return self.__header_bytes() + self.__annotations_bytes() + self.data
  File "/usr/local/lib/python3.4/dist-packages/Pyro4/message.py", line 115, in __header_bytes
    return struct.pack(self.header_format, b"PYRO", constants.PROTOCOL_VERSION, self.type, self.flags, self.seq, self.data_size, self.serializer_id, self.annotations_size, 0, checksum)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
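The failing line is easy to reproduce in isolation: the 'i' code in Pyro's header format is a signed 32-bit field, so any data_size of 2**31 bytes (~2 GiB) or more fails at pack time:

```python
import struct

LIMIT = 2**31 - 1  # largest signed 32-bit value ('i' in struct)

header = struct.pack("i", LIMIT)   # the biggest size the header can carry
assert len(header) == 4            # packed into exactly four bytes

try:
    struct.pack("i", LIMIT + 1)    # one byte past ~2 GiB: overflows the field
    raise AssertionError("expected struct.error")
except struct.error:
    pass                           # same failure mode as the traceback above
```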


@piskvorky (Owner)

Yeah, the same "32-bit danger" comment applies.

One idea to overcome it is to use Python 3, which has a new pickle protocol (version 4, IIRC). That one should be large-object-friendly. I've never tried it (especially with Pyro), but in your case it's worth a try!

@piskvorky (Owner)

Hmm, looking at the traceback, Pyro uses struct.pack here, not pickle. So this is probably unrelated.

It's still possible that large objects are supported in Pyro through some extra settings. It's been a while since I used it, so I don't remember exactly, sorry. CC @irmen.

@ghost (Author) commented Jul 1, 2015

FYI, I am using Python 3 now.

@irmen commented Jul 1, 2015

Pyro uses struct.pack to create the wire protocol messages that travel over the network. They consist of a small header and the serialization payload (which can be pickled data if you enabled that, or json, or serpent, etc.). The Pyro header enforces a 32-bit size limitation, sorry. I never suspected people would transfer such large payloads using Pyro; it's not really meant to handle that efficiently. See https://pythonhosted.org/Pyro4/tipstricks.html#binarytransfer for more information.

Can you perhaps verify whether a more efficient datatype can be used? Or could you chunk the transfer? Or use another, optimized form of data transfer for just these large blobs?

Is it a big issue? Then please raise a ticket over at https://github.com/irmen/Pyro4 (although I don't think it will be fixed because I'd rather not break Pyro's wire protocol)

Edit: actually, it is a signed 32-bit number that is used for the message length, so effectively Pyro has a 2-gigabyte message size limit.

@ghost (Author) commented Jul 1, 2015

Thanks for the comments. This looks like a serious logistical issue w.r.t. scaling.

@irmen commented Jul 1, 2015

I'm inclined to disagree, if we're talking about Pyro. Pyro is remote method invocation middleware, not a large data transfer tool. There are other protocols designed to do that, as I pointed out in the paragraph of the docs I linked earlier. (This doesn't mean we cannot talk about possible ways to improve matters!)

Can Gensim perhaps not send the whole data in one go but chop it up in <2Gb chunks?

@ghost (Author) commented Jul 1, 2015

(I was just referring to gensim!)

@ghost (Author) commented Jul 2, 2015

@irmen i'm wondering what the best path forward here is. I'm thinking about forking Pyro4, but I have no idea how deep the rabbit hole goes on this.

@piskvorky (Owner)

Thanks for the info @irmen, super useful!

@brianmingus IIRC there are three places where large objects are sent:

  1. master sending out LDA model to the distributed workers
  2. workers sending out LDA model updates back to the master
  3. documents (corpus chunks) being sent out to workers

Number 3 is trivial to mitigate (just use a smaller chunksize), so I will ignore it. The challenge is points 1 and 2.

For both 1 and 2, the largest part of a model is a 2D words × topics matrix of floats (in your case 100k × 10k). We could add support for sending it out in pieces (as multiple smaller matrices).
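A rough sketch of what "sending it out in pieces" could look like; split_rows/join_rows are hypothetical helpers, not gensim API. The idea is just to keep each transmitted block safely under the 2 GiB message ceiling:

```python
import numpy as np

def split_rows(mat, max_bytes=1 << 30):
    """Yield (row_offset, block) pairs whose raw size stays under max_bytes."""
    rows = max(1, max_bytes // (mat.shape[1] * mat.itemsize))
    for start in range(0, mat.shape[0], rows):
        yield start, mat[start:start + rows]

def join_rows(shape, dtype, pieces):
    """Reassemble the full words x topics matrix on the receiving side."""
    out = np.empty(shape, dtype=dtype)
    for start, block in pieces:
        out[start:start + block.shape[0]] = block
    return out

m = np.arange(12.0).reshape(4, 3)  # tiny stand-in for the 100k x 10k matrix
copy = join_rows(m.shape, m.dtype, split_rows(m, max_bytes=48))
assert np.array_equal(copy, m)
```

Each block would still be serialized and sent as a separate Pyro call, so the per-message header limit is never hit.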

But I'm afraid your use case is so special, and the required code complexity so high, that it doesn't warrant being included in gensim directly. So the change would have to be a one-off change in your own code / copy of gensim. I'm not sure. Maybe it's not so hairy and I'm wrong.

I was also wondering whether making use of sparsity for 2 could help (sending the model-update matrices in sparse form = only the diff, or some such). But sparse formats have a lot of overhead, so the sparsity has to be >60% for this to make sense. Plus, it would introduce an extra layer of complexity: if some worker happened to produce a very dense update on some document chunk, we'd suddenly, unexpectedly hit the same error.
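The overhead is easy to quantify: in CSR form each nonzero float64 costs 8 bytes of data plus (typically) 4 bytes of column index, versus a flat 8 bytes per cell in the dense array, so the break-even point depends entirely on density. A rough comparison with scipy:

```python
import numpy as np
from scipy import sparse

def csr_payload_bytes(m):
    """Bytes actually shipped for a CSR matrix: values + column indices + row pointers."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

dense = np.zeros((2000, 1000))
dense[:, ::10] = 1.0  # sparse-ish update: 10% of cells nonzero
assert csr_payload_bytes(sparse.csr_matrix(dense)) < dense.nbytes  # sparse wins

dense[:, :] = 1.0     # an unlucky, fully dense update from some worker
assert csr_payload_bytes(sparse.csr_matrix(dense)) > dense.nbytes  # sparse loses
```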

@piskvorky (Owner)

By the way, a version of online LDA was also merged into Spark's MLlib recently.

MLlib is a new library, full of bugs, but it looks like that makes no difference in this particular case.

May be worth trying that, as Spark touts distributed computing directly, and it will make for an interesting benchmark comparison too.

AWS supports Spark out of the box (but check the versions; you probably want the latest MLlib = fewest bugs).

@irmen commented Jul 2, 2015

If you're going to use PySpark you'll notice that it uses my Pyrolite library to pickle/unpickle data for the Python workers. But they only use the (un)pickler; they're not touching the Pyro protocol part. I don't know if pickle itself has a size limitation.

@piskvorky (Owner)

Yes, it does. We have come full circle :)

Anyway, Brian could use the command-line / Scala API for Spark. The MLlib implementation of LDA has nothing to do with Python AFAIK; it's in Scala.

@irmen commented Jul 2, 2015

@brianmingus have you tried enabling Pyro compression? It may reduce the message payload to a size that fits under 2 GB (although I wonder whether Python can compress a >4 GB chunk of data at all...).
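The switch itself is just `Pyro4.config.COMPRESSION = True` (the flag already appears in the first traceback). How much it buys depends entirely on the entropy of the model's floats: a freshly initialized model is full of near-identical values and compresses dramatically, while a trained one may barely shrink. A zlib sanity check (zlib here is an assumption about Pyro's internals; measure on real model state before relying on this):

```python
import pickle
import zlib

# Highly repetitive float payloads (e.g. a just-initialized model)
# compress very well; trained, near-random floats may not.
payload = pickle.dumps([0.0001] * 500000, protocol=2)
squeezed = zlib.compress(payload)
assert len(squeezed) < len(payload)  # compressed form is smaller
assert pickle.loads(zlib.decompress(squeezed)) == [0.0001] * 500000
```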

@tmylk (Contributor) commented Jan 10, 2016

@brianmingus Is this still an issue when using compression? Does MLlib provide an alternative?
I will close if either answer is positive.

@menshikh-iv added the bug and difficulty hard labels on Oct 3, 2017