Word2Vec keeps on training during on_batch_end call #2182
Hello @jbayardo, thanks for the report. I reproduced an issue (not the exact same one, but it looks similar; a race condition too):

```python
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec
from gensim.test.utils import get_tmpfile
import gensim.downloader as api

corpus = api.load("text8")


class BatchSaver(CallbackAny2Vec):
    def __init__(self, path_prefix):
        self.path_prefix = path_prefix
        self.batch = 0

    def on_batch_end(self, model):
        output_path = get_tmpfile('{}_batch_{}.model'.format(self.path_prefix, self.batch))
        model.save(output_path)
        print("Model saved to {}".format(output_path))
        self.batch += 1


bs = BatchSaver("w2v")
model = Word2Vec(corpus, iter=5, callbacks=[bs])
```

Expected result
Actual result

First variant (almost always):

```
Model saved to /tmp/w2v_batch_0.model
Exception in thread Thread-10:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/base_any2vec.py", line 164, in _worker_loop
    tally, raw_tally = self._do_train_job(data_iterable, job_parameters, thread_private_mem)
  File "/home/ivan/.virtualenvs/math/local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 773, in _do_train_job
    tally += train_batch_cbow(self, sentences, alpha, work, neu1, self.compute_loss)
  File "gensim/models/word2vec_inner.pyx", line 663, in gensim.models.word2vec_inner.train_batch_cbow
    cum_table = <np.uint32_t *>(np.PyArray_DATA(model.vocabulary.cum_table))
AttributeError: 'Word2VecVocab' object has no attribute 'cum_table'
Model saved to /tmp/w2v_batch_0.model
Exception in thread Thread-11:
[same traceback as above]
AttributeError: 'Word2VecVocab' object has no attribute 'cum_table'
Model saved to /tmp/w2v_batch_0.model
Model saved to /tmp/w2v_batch_3.model
Model saved to /tmp/w2v_batch_4.model
[... identical "Model saved" lines for batches 5 through 99 ...]
Model saved to /tmp/w2v_batch_100.model
```

Second variant (happened only once):

```
Model saved to /tmp/w2v_batch_0.model
Model saved to /tmp/w2v_batch_0.model
Exception in thread Thread-9:
[same traceback as above]
AttributeError: 'Word2VecVocab' object has no attribute 'cum_table'
Model saved to /tmp/w2v_batch_0.model
Exception in thread Thread-10:
[same traceback as above]
AttributeError: 'Word2VecVocab' object has no attribute 'cum_table'
Model saved to /tmp/w2v_batch_1.model
Model saved to /tmp/w2v_batch_4.model
Fatal Python error: GC object already tracked
Aborted (core dumped)
```
For some reason, this happened to me every time I ran the training. A fact to keep in mind: I was using 14 workers.
I tried to do something very similar to @jbayardo, and hit a very similar problem. I'm also saving the model from a batch-level callback.

The first time the model is saved, 5 out of 6 worker threads crash.

After this, training continues with one worker thread until the first epoch is finished, at which point the process starts waiting for the workers that were killed during the first save. Note that in my case the threads crash because of the configuration I'm using.
In fact, I'd recommend removing the on_batch_end callback entirely.
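For readers hitting this today, a safer pattern is to checkpoint from on_epoch_end, which gensim invokes between epochs, after the worker threads of the pass have finished, so it avoids the mid-training race described above. The sketch below is illustrative (class and path names are mine); the try/except fallback only exists so the snippet runs without gensim installed.

```python
try:
    from gensim.models.callbacks import CallbackAny2Vec
except ImportError:  # fallback so the sketch can run without gensim installed
    class CallbackAny2Vec(object):
        """Stand-in for gensim's callback base class."""


class EpochSaver(CallbackAny2Vec):
    """Save a checkpoint at the end of each epoch.

    on_epoch_end fires between epochs, when no worker thread is mid-batch,
    unlike on_batch_end, which races with still-running workers.
    """

    def __init__(self, path_prefix):
        self.path_prefix = path_prefix
        self.epoch = 0

    def on_epoch_end(self, model):
        output_path = '{}_epoch_{}.model'.format(self.path_prefix, self.epoch)
        model.save(output_path)  # safe: training is paused between epochs
        print('Model saved to {}'.format(output_path))
        self.epoch += 1
```

Usage is the same as in the reproduction above, e.g. `Word2Vec(corpus, iter=5, callbacks=[EpochSaver("w2v")])`.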
Also, per a question on StackOverflow, I've just noticed this failure reported again. So I'd again recommend removing on_batch_end.
Thanks for following up @gojomo. I'm marking this ticket for 4.0.0; I think it fits a major release well. I'll do another Gensim sprint soon, to finish 4.0.0. I haven't seen any worthwhile feedback from beta users, so the plan is to just tie up any loose ends & release.
Is there any recommended way to run arbitrary code at a finer granularity than the epoch level? For example, it might be useful to interleave a secondary training objective on the word vectors, which would require saving/loading the vectors. It would be nice to be able to do this every 1000 batches, say, rather than every epoch.
To an extent, the 'epochs' are arbitrary. While you'd want to provide your whole corpus to the vocabulary-building step, you could present it to training in smaller slices, so that each gensim 'epoch' (and thus each per-epoch callback) covers only a fraction of the corpus.

I've not written code to do this, but it's probably just a few lines of custom Iterable-wrapper. You'd want to increase the number of epochs proportionally, so the same total number of training passes still occurs.

Alternatively: you can always edit any of the source code to do anything at any step/interval, but it can get a bit hairy in the Cython code.

(Potentially, also, Gensim could in the future offer more callbacks in a better-thought-out way - probably just before/after the batches sent to each thread.)
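Such an Iterable-wrapper might look like the following sketch (ChunkedCorpus is a hypothetical name, not gensim API): each full iteration yields only the next chunk of sentences, so each gensim 'epoch' covers one chunk, and per-epoch callbacks fire at that finer granularity.

```python
class ChunkedCorpus(object):
    """Yield the corpus in fixed-size chunks, one chunk per iteration.

    Each call to iter() returns only the next chunk_size sentences,
    wrapping around at the end, so a gensim "epoch" over this object is
    really a fraction of a true pass over the corpus. (Sketch only; the
    name and behaviour are assumptions, not part of gensim.)
    """

    def __init__(self, sentences, chunk_size):
        self.sentences = list(sentences)
        self.chunk_size = chunk_size
        self.pos = 0

    def __iter__(self):
        chunk = self.sentences[self.pos:self.pos + self.chunk_size]
        self.pos += len(chunk)
        if self.pos >= len(self.sentences):
            self.pos = 0  # wrap to the start for the next "epoch"
        for sentence in chunk:
            yield sentence
```

To keep the total amount of training the same, you would multiply the epochs parameter by the number of chunks, e.g. call build_vocab on the whole corpus, then train on `ChunkedCorpus(corpus, chunk_size)` with `epochs=true_epochs * n_chunks`.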
@piskvorky Just wanted to confirm that the action to take here is to drop the on_batch_end callback (and update the migration docs). Is my understanding correct?
Not sure, I'll have to read this thread. I'll try that after fixing my 4.0 tickets, OK?
Sure, I have other 4.0 stuff to work on in the meanwhile, so this isn't a blocker.
TODO for Misha:
I re-read the thread; dropping on_batch_end. Plus update migration docs, yes.
Description
Saving a Word2Vec model during an on_batch_end call fails with something that looks a lot like a race condition: some internal dict within gensim is still being changed during the call to save.
Steps/Code/Corpus to Reproduce
Train W2V with a callback that looks like:
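The original snippet didn't survive formatting; below is a hedged reconstruction of a time-based on_batch_end checkpointer matching the expected result ("checkpoint after every hour of training"). Names are illustrative, and this is precisely the pattern that triggers the race (see the tracebacks earlier in the thread). The try/except fallback only exists so the sketch runs without gensim installed.

```python
import time

try:
    from gensim.models.callbacks import CallbackAny2Vec
except ImportError:  # fallback so the sketch can run without gensim installed
    class CallbackAny2Vec(object):
        """Stand-in for gensim's callback base class."""


class HourlySaver(CallbackAny2Vec):
    """Checkpoint the model from on_batch_end at most once per interval.

    Illustrative reconstruction only: model.save() here runs while worker
    threads are still training, which is what races with gensim's internals.
    """

    def __init__(self, path_prefix, interval=3600.0):
        self.path_prefix = path_prefix
        self.interval = interval  # seconds between checkpoints
        self.last_save = time.time()
        self.count = 0

    def on_batch_end(self, model):
        now = time.time()
        if now - self.last_save >= self.interval:
            model.save('{}_checkpoint_{}.model'.format(self.path_prefix, self.count))
            self.last_save = now
            self.count += 1
```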
Expected Results
Model checkpoint after every hour of training
Actual Results
While running train:
Versions
```
Linux-4.4.0-1062-aws-x86_64-with-Ubuntu-16.04-xenial
Python 2.7.12 (default, Dec 4 2017, 14:50:18) [GCC 5.4.0 20160609]
NumPy 1.15.1
SciPy 0.19.1
gensim 3.5.0
FAST_VERSION 1
```