
Distributed error: SystemError: error return without exception set #377

Open
ghost opened this issue Jul 1, 2015 · 27 comments
Labels
bug (Issue described a bug) · difficulty hard (Hard issue: required deep gensim understanding & high python/cython skills)

Comments

@ghost commented Jul 1, 2015

This is with

  • gensim: 0.11.1.post1
  • scipy: 0.15.1
  • numpy: 1.9.2
  • Pyro4: 4.38

ubuntu@ip-172-31-33-28:~$ python lda_en.py
2015-07-01 01:15:42,687 : INFO : initializing corpus reader from <bz2.BZ2File object at 0x7f5587d67b90>
2015-07-01 01:15:42,717 : INFO : accepted corpus with 3831719 documents, 100000 features, 595701551 non-zero entries
MmCorpus(3831719 documents, 100000 features, 595701551 non-zero entries)
2015-07-01 01:15:42,724 : INFO : using symmetric alpha at 0.0001
2015-07-01 01:15:42,862 : INFO : registering worker #0 at PYRO:[email protected]:55865
2015-07-01 01:15:42,960 : INFO : initializing worker #0
2015-07-01 01:15:42,964 : INFO : using symmetric alpha at 0.0001
2015-07-01 01:15:42,965 : INFO : using serial LDA version on this node
2015-07-01 01:19:06,128 : INFO : registering worker #1 at PYRO:[email protected]:60184
2015-07-01 01:19:06,229 : INFO : initializing worker #1
2015-07-01 01:19:06,233 : INFO : using symmetric alpha at 0.0001
2015-07-01 01:19:06,234 : INFO : using serial LDA version on this node
2015-07-01 01:22:29,332 : INFO : registering worker #2 at PYRO:[email protected]:43388
2015-07-01 01:22:29,333 : WARNING : unresponsive worker at PYRO:[email protected]:43388, deleting it from the name server
2015-07-01 01:22:29,341 : INFO : using distributed version with 2 workers
2015-07-01 01:25:51,758 : INFO : running online LDA training, 10000 topics, 1 passes over the supplied corpus of 3831719 documents, updating model once every 4000 documents, evaluating perplexity every 40000 documents, iterating 50x with a convergence threshold of 0.001000
2015-07-01 01:25:51,759 : INFO : initializing 2 workers
Traceback (most recent call last):
  File "lda_en.py", line 13, in <module>
    lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10000, distributed=True)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/ldamodel.py", line 317, in __init__
    self.update(corpus)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/ldamodel.py", line 558, in update
    self.dispatcher.reset(self.state)
  File "/usr/local/lib/python2.7/dist-packages/Pyro4/core.py", line 168, in __call__
    return self.__send(self.__name, args, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/Pyro4/core.py", line 376, in _pyroInvoke
    compress=Pyro4.config.COMPRESSION)
  File "/usr/local/lib/python2.7/dist-packages/Pyro4/util.py", line 167, in serializeCall
    data = self.dumpsCall(obj, method, vargs, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/Pyro4/util.py", line 415, in dumpsCall
    return pickle.dumps((obj, method, vargs, kwargs), Pyro4.config.PICKLE_PROTOCOL_VERSION)
SystemError: error return without exception set

@piskvorky (Owner)

As with the sibling issue, this is probably related to the serialized LDA objects being larger than 4 GB (2 GB?), with so many topics.

That can confuse serialization protocols (pickle, Pyro) which assume 32-bit lengths (or 31-bit, if signed).

Gensim itself doesn't care and numpy doesn't care, but when sending data around in the distributed/multicore version, the serialization protocols cannot handle it.

A workaround may be to use the newest pickle protocol (4) in Python 3, which supports objects larger than 4 GB. Worth a shot.
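A quick sketch of that workaround (the config name comes straight from the traceback above; the dict below is just a tiny stand-in for the multi-gigabyte LDA state):

```python
import pickle

# Pickle protocol 4 (Python 3.4+) adds framing and 8-byte length
# prefixes, so it can serialize objects larger than 4 GiB; older
# protocols use 32-bit lengths. The traceback shows Pyro4 reads the
# protocol from Pyro4.config.PICKLE_PROTOCOL_VERSION, so it would
# have to be set to 4 on the dispatcher and on every worker.
state = {"alpha": [0.0001] * 10, "num_topics": 10000}  # tiny stand-in
blob = pickle.dumps(state, protocol=4)
assert blob[:2] == b"\x80\x04"          # protocol-4 header
assert pickle.loads(blob) == state      # round-trips intact
```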

@ghost (Author) commented Jul 1, 2015

OK, I tried that. I cannot figure out what is trying to call dict.iteritems() instead of .items().

ubuntu@ip-172-31-42-60:~$ python3 lda_en.py
2015-07-01 17:03:38,134 : INFO : initializing corpus reader from wiki_en_tfidf.mm.bz2
2015-07-01 17:03:38,180 : INFO : accepted corpus with 3831719 documents, 100000 features, 595701551 non-zero entries
2015-07-01 17:03:38,188 : INFO : using symmetric alpha at 0.0001
2015-07-01 17:03:38,278 : ERROR : failed to initialize distributed LDA ('dict' object has no attribute 'iteritems')
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/gensim/models/ldamodel.py", line 302, in __init__
    chunksize=chunksize, alpha=alpha, eta=eta, distributed=False)
  File "/usr/local/lib/python3.4/dist-packages/Pyro4/core.py", line 168, in __call__
    return self.__send(self.__name, args, kwargs)
  File "/usr/local/lib/python3.4/dist-packages/Pyro4/core.py", line 408, in _pyroInvoke
    raise data
AttributeError: 'dict' object has no attribute 'iteritems'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "lda_en.py", line 8, in <module>
    lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10000, chunksize=1000, passes=1, distributed=True)
  File "/usr/local/lib/python3.4/dist-packages/gensim/models/ldamodel.py", line 308, in __init__
    raise RuntimeError("failed to initialize distributed LDA (%s)" % err)
RuntimeError: failed to initialize distributed LDA ('dict' object has no attribute 'iteritems')

@ghost (Author) commented Jul 1, 2015

OK, I seem to have fixed this, although I am not exactly sure what did it. I suspect it was that I went through gensim and replaced all instances of .iteritems() with .items().
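For anyone hitting the same thing: dict.iteritems() exists only in Python 2; on Python 3, .items() already returns a lazy view, so the blanket replacement is safe. A minimal compatibility shim (a sketch, not gensim code) that works on both versions:

```python
def iteritems(d):
    """Lazily iterate over (key, value) pairs on both Python 2 and 3."""
    try:
        return d.iteritems()    # Python 2: lazy iterator
    except AttributeError:
        return iter(d.items())  # Python 3: items() is already a lazy view

counts = {"topic_0": 3, "topic_1": 7}
assert sorted(iteritems(counts)) == [("topic_0", 3), ("topic_1", 7)]
```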

@piskvorky (Owner)

Hmm, can you send the diff please?

I can't see any such instances in LDA (which would be a bug, since gensim is both py3k and py2k compatible).

We're running unit tests under py3k too, so this would have to be in some untested code path...

@ghost (Author) commented Jul 1, 2015

Unfortunately, I was not hacking on version-controlled code. I am using Ubuntu Vivid and installed gensim[distributed]; if you go to the python3.4 gensim directory and grep for .iteritems(), you will find some instances.

@piskvorky (Owner)

I did; there was only one .iteritems() instance, in distributed LSI (not LDA). I fixed it here just now: 73d8167

So I'm not sure what you mean.

@ghost (Author) commented Jul 1, 2015

I had them in both distributed LSI and LDA. Perhaps the code on PyPI is out of date?

@ghost (Author) commented Jul 1, 2015

It is still happening, although I seem to have delayed the point at which it happens; not sure. I uninstalled and reinstalled gensim[distributed].

ubuntu@ip-172-31-42-60:/usr/local/lib/python3.4/dist-packages/gensim$ grep -sE '.iteritems()' .//
./models/__init__.py: # id2word = dict((newid, oldid2word[oldid]) for oldid, newid in old2new.iteritems())
./models/lda_dispatcher.py: for name, uri in ns.list(prefix='gensim.lda_worker').iteritems():
./models/lda_dispatcher.py: for workerid, worker in self.workers.iteritems():
./models/lda_dispatcher.py: for workerid, worker in self.workers.iteritems():
./models/lsi_dispatcher.py: for name, uri in ns.list(prefix='gensim.lsi_worker').iteritems():
./models/lsi_dispatcher.py: for workerid, worker in self.workers.iteritems():
./models/lsi_dispatcher.py: for workerid, worker in self.workers.iteritems():
./test/test_corpora_dictionary.py: self.assertEqual(list(d.items()), list(d.iteritems()))

But fixing these is not enough to fix the problem.

@ghost (Author) commented Jul 1, 2015

(Note that my grep command came out modified because I did not format it as code here.)

@piskvorky (Owner)

Yeah these have all been resolved some time ago already.

Can you try gensim from the develop branch? Or I'll be making a new release this weekend, in case you prefer pip.

@ghost (Author) commented Jul 1, 2015

Using the develop branch works, thanks!

@ghost (Author) commented Jul 1, 2015

I see this error now:

2015-07-01 20:44:54,329 : INFO : initializing corpus reader from wiki_en_tfidf.mm.bz2
2015-07-01 20:44:54,372 : INFO : accepted corpus with 3831719 documents, 100000 features, 595701551 non-zero entries
2015-07-01 20:44:54,380 : INFO : using symmetric alpha at 0.0001
2015-07-01 20:55:11,047 : INFO : using distributed version with 3 workers

2015-07-01 20:58:37,206 : INFO : running online LDA training, 10000 topics, 1 passes over the supplied corpus of 3831719 documents, updating model once every 150000 documents, evaluating perplexity every 1500000 documents, iterating 50x with a convergence threshold of 0.001000
2015-07-01 20:58:37,207 : INFO : initializing 3 workers
Traceback (most recent call last):
  File "./lda_en", line 13, in <module>
    lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10000, chunksize=50000, passes=1, distributed=True)
  File "/usr/local/lib/python3.4/dist-packages/gensim-0.11.1_1-py3.4-linux-x86_64.egg/gensim/models/ldamodel.py", line 314, in __init__
    self.update(corpus)
  File "/usr/local/lib/python3.4/dist-packages/gensim-0.11.1_1-py3.4-linux-x86_64.egg/gensim/models/ldamodel.py", line 553, in update
    self.dispatcher.reset(self.state)
  File "/usr/local/lib/python3.4/dist-packages/Pyro4/core.py", line 168, in __call__
    return self.__send(self.__name, args, kwargs)
  File "/usr/local/lib/python3.4/dist-packages/Pyro4/core.py", line 387, in _pyroInvoke
    self._pyroConnection.send(msg.to_bytes())
  File "/usr/local/lib/python3.4/dist-packages/Pyro4/message.py", line 111, in to_bytes
    return self.__header_bytes() + self.__annotations_bytes() + self.data
  File "/usr/local/lib/python3.4/dist-packages/Pyro4/message.py", line 115, in __header_bytes
    return struct.pack(self.header_format, b"PYRO", constants.PROTOCOL_VERSION, self.type, self.flags, self.seq, self.data_size, self.serializer_id, self.annotations_size, 0, checksum)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
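The failing line is easy to reproduce in isolation: the 'i' code in Pyro's header format is a signed 32-bit field, so any data_size of 2**31 bytes (~2 GiB) or more fails at pack time:

```python
import struct

LIMIT = 2**31 - 1  # largest signed 32-bit value ('i' in struct)

header = struct.pack("i", LIMIT)   # the biggest size the header can carry
assert len(header) == 4            # packed into exactly four bytes

try:
    struct.pack("i", LIMIT + 1)    # one byte past ~2 GiB: overflows the field
    raise AssertionError("expected struct.error")
except struct.error:
    pass                           # same failure mode as the traceback above
```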


@piskvorky (Owner)

Yeah, the same "32-bit danger" comment applies.

One idea to overcome it is to use Python 3, which has a new pickle protocol (version 4, IIRC). That one should be large-object-friendly. I've never tried it (especially with Pyro), but in your case it's worth a try!

@piskvorky (Owner)

Hmm, looking at the traceback, Pyro uses struct.pack here, not pickle. So this is probably unrelated.

It's still possible that large objects are supported in Pyro through some extra settings. It's been a while since I used it, so I don't remember exactly, sorry. CC @irmen.

@ghost (Author) commented Jul 1, 2015

FYI, I am using Python 3 now.

@irmen commented Jul 1, 2015

Pyro uses struct.pack to create the wire protocol messages that travel over the network. They consist of a small header and the serialization payload (which can be pickled data if you enabled that, or json, or serpent, etc.). The Pyro header enforces a 32-bit size limitation, sorry. I never suspected people would transfer such large payloads using Pyro; it's not really meant to handle that efficiently. See https://pythonhosted.org/Pyro4/tipstricks.html#binarytransfer for more information.

Can you perhaps verify whether a more efficient datatype can be used? Or could you chunk the transfer? Or use another, optimized form of data transfer for just these large blobs?

Is it a big issue? Then please raise a ticket over at https://github.com/irmen/Pyro4 (although I don't think it will be fixed because I'd rather not break Pyro's wire protocol)

Edit: actually, it is a signed 32-bit number that is used for the message length, so effectively Pyro has a 2-gigabyte message size limit.

@ghost (Author) commented Jul 1, 2015

Thanks for the comments. This looks like a serious logistical issue w.r.t. scaling.

@irmen commented Jul 1, 2015

I'm inclined to disagree, if we're talking about Pyro. Pyro is remote method invocation middleware, not a large data transfer tool. There are other protocols designed to do that, as I pointed out in the paragraph of the docs I linked earlier. (This doesn't mean we cannot talk about possible ways to improve matters!)

Can Gensim perhaps not send the whole data in one go but chop it up in <2Gb chunks?

@ghost (Author) commented Jul 1, 2015

(I was just referring to gensim!)

@ghost (Author) commented Jul 2, 2015

@irmen i'm wondering what the best path forward here is. I'm thinking about forking Pyro4, but I have no idea how deep the rabbit hole goes on this.

@piskvorky (Owner)

Thanks for the info @irmen, super useful!

@brianmingus IIRC there are three places where large objects are sent:

  1. master sending out LDA model to the distributed workers
  2. workers sending out LDA model updates back to the master
  3. documents (corpus chunks) being sent out to workers

Number 3 is trivial to mitigate (just use a smaller chunksize), so I will ignore it. The challenge is points 1 and 2.

For both 1 and 2, the largest part of a model is a 2D words × topics matrix of floats (in your case 100k × 10k). We could add support for sending it out in pieces (as multiple smaller matrices).
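A rough sketch of what "sending it out in pieces" could look like; split_rows/join_rows are hypothetical helpers, not gensim API. The idea is just to keep each transmitted block safely under the 2 GiB message ceiling:

```python
import numpy as np

def split_rows(mat, max_bytes=1 << 30):
    """Yield (row_offset, block) pairs whose raw size stays under max_bytes."""
    rows = max(1, max_bytes // (mat.shape[1] * mat.itemsize))
    for start in range(0, mat.shape[0], rows):
        yield start, mat[start:start + rows]

def join_rows(shape, dtype, pieces):
    """Reassemble the full words x topics matrix on the receiving side."""
    out = np.empty(shape, dtype=dtype)
    for start, block in pieces:
        out[start:start + block.shape[0]] = block
    return out

m = np.arange(12.0).reshape(4, 3)  # tiny stand-in for the 100k x 10k matrix
copy = join_rows(m.shape, m.dtype, split_rows(m, max_bytes=48))
assert np.array_equal(copy, m)
```

Each block would still be serialized and sent as a separate Pyro call, so the per-message header limit is never hit.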

But I'm afraid your use case is so special, and the required code complexity so high, that it doesn't warrant being included in gensim directly. So the change would have to be a one-off change in your own code / copy of gensim. I'm not sure. Maybe it's not so hairy and I'm wrong.

I was also wondering whether making use of sparsity for 2 could help (sending the model-update matrices in sparse form = only the diff, or some such). But sparse formats have a lot of overhead, so the sparsity has to be >60% for this to make sense. Plus, it would introduce an extra layer of complexity: if some worker happened to produce a very dense update on some document chunk, we'd suddenly, unexpectedly hit the same error.
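The overhead is easy to quantify: in CSR form each nonzero float64 costs 8 bytes of data plus (typically) 4 bytes of column index, versus a flat 8 bytes per cell in the dense array, so the break-even point depends entirely on density. A rough comparison with scipy:

```python
import numpy as np
from scipy import sparse

def csr_payload_bytes(m):
    """Bytes actually shipped for a CSR matrix: values + column indices + row pointers."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

dense = np.zeros((2000, 1000))
dense[:, ::10] = 1.0  # sparse-ish update: 10% of cells nonzero
assert csr_payload_bytes(sparse.csr_matrix(dense)) < dense.nbytes  # sparse wins

dense[:, :] = 1.0     # an unlucky, fully dense update from some worker
assert csr_payload_bytes(sparse.csr_matrix(dense)) > dense.nbytes  # sparse loses
```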

@piskvorky (Owner)

By the way, a version of online LDA was also merged into Spark's MLlib recently.

MLlib is a new library, full of bugs, but it looks like that makes no difference in this particular case.

May be worth trying that, as Spark touts distributed computing directly, and it will make for an interesting benchmark comparison too.

AWS supports Spark out of the box (but check the versions; you probably want the latest MLlib = fewest bugs).

@irmen commented Jul 2, 2015

If you're going to use PySpark you'll notice that it uses my Pyrolite library to pickle/unpickle data for the Python workers. But they only use the (un)pickler; they're not touching the Pyro protocol part. I don't know if pickle itself has a size limitation.

@piskvorky (Owner)

Yes, it does. We have come full circle :)

Anyway, Brian could use the command-line / Scala API for Spark. The MLlib implementation of LDA has nothing to do with Python AFAIK; it's in Scala.

@irmen commented Jul 2, 2015

@brianmingus have you tried enabling Pyro compression? It may reduce the message payload to a size that fits under 2 GB (although I wonder whether Python can compress a >4 GB chunk of data at all...).
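The switch itself is just `Pyro4.config.COMPRESSION = True` (the flag already appears in the first traceback). How much it buys depends entirely on the entropy of the model's floats: a freshly initialized model is full of near-identical values and compresses dramatically, while a trained one may barely shrink. A zlib sanity check (zlib here is an assumption about Pyro's internals; measure on real model state before relying on this):

```python
import pickle
import zlib

# Highly repetitive float payloads (e.g. a just-initialized model)
# compress very well; trained, near-random floats may not.
payload = pickle.dumps([0.0001] * 500000, protocol=2)
squeezed = zlib.compress(payload)
assert len(squeezed) < len(payload)  # compressed form is smaller
assert pickle.loads(zlib.decompress(squeezed)) == [0.0001] * 500000
```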

@tmylk (Contributor) commented Jan 10, 2016

@brianmingus Is this still an issue when using compression? Does MLlib provide an alternative?
I will close if either answer is positive.

@menshikh-iv added the bug and difficulty hard labels on Oct 3, 2017