Segfault when training doc2vec #2942

Paul-E · 2020-09-10T22:59:30Z

Problem description

When attempting to train doc2vec, gensim segfaults.

Steps/code/corpus to reproduce

I run the code:

import faulthandler
import gensim
faulthandler.enable()
model = gensim.models.doc2vec.Doc2Vec(corpus_file = "yelp_tripadvisor_linesentence.txt", vector_size=250, min_count=10, epochs=40, workers = 5)

I get the output:

Fatal Python error: Segmentation fault

Current thread 0x00007f2d9effd700 (most recent call first):
  File "/home/paul/.local/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 431 in _do_train_epoch
  File "/home/paul/.local/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 172 in _worker_loop_corpusfile
  File "/usr/lib/python3.8/threading.py", line 870 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f2d9f7fe700 (most recent call first):
  File "/home/paul/.local/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 431 in _do_train_epoch
  File "/home/paul/.local/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 172 in _worker_loop_corpusfile
  File "/usr/lib/python3.8/threading.py", line 870 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f2d9ffff700 (most recent call first):
  File "/home/paul/.local/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 431 in _do_train_epoch
  File "/home/paul/.local/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 172 in _worker_loop_corpusfile
  File "/usr/lib/python3.8/threading.py", line 870 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f2da48df700 (most recent call first):
  File "/home/paul/.local/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 431 in _do_train_epoch
  File "/home/paul/.local/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 172 in _worker_loop_corpusfile
  File "/usr/lib/python3.8/threading.py", line 870 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f2da50e0700 (most recent call first):
  File "/home/paul/.local/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 431 in _do_train_epoch
  File "/home/paul/.local/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 172 in _worker_loop_corpusfile
  File "/usr/lib/python3.8/threading.py", line 870 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f3055bd1740 (most recent call first):
  File "/usr/lib/python3.8/threading.py", line 302 in wait
  File "/usr/lib/python3.8/queue.py", line 170 in get
  File "/home/paul/.local/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 345 in _log_epoch_progress
  File "/home/paul/.local/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 430 in _train_epoch_corpusfile
  File "/home/paul/.local/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 554 in train
  File "/home/paul/.local/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 1063 in train
  File "/home/paul/.local/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 554 in train
  File "/home/paul/.local/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 360 in __init__
  File "reproduce_segfault.py", line 4 in <module>
Segmentation fault (core dumped)

When run in gdb I get:

Thread 36 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffd450ca700 (LWP 112905)]
0x00007fffc9347737 in saxpy_kernel_16 ()
   from /home/paul/.local/lib/python3.8/site-packages/scipy/spatial/../../scipy.libs/libopenblasp-r0-085ca80a.3.9.so

The backtrace I get is:

(gdb) backtrace
#0  0x00007fffc9347737 in saxpy_kernel_16 ()
   from /home/paul/.local/lib/python3.8/site-packages/scipy/spatial/../../scipy.libs/libopenblasp-r0-085ca80a.3.9.so
#1  0x00007fffc934792f in saxpy_k_ZEN ()
   from /home/paul/.local/lib/python3.8/site-packages/scipy/spatial/../../scipy.libs/libopenblasp-r0-085ca80a.3.9.so
#2  0x00007fffc84402cb in saxpy_ ()
   from /home/paul/.local/lib/python3.8/site-packages/scipy/spatial/../../scipy.libs/libopenblasp-r0-085ca80a.3.9.so
#3  0x00007fffa0e81782 in ?? ()
   from /home/paul/.local/lib/python3.8/site-packages/gensim/models/doc2vec_corpusfile.cpython-38-x86_64-linux-gnu.so
#4  0x00007fffa0e8243f in ?? ()
   from /home/paul/.local/lib/python3.8/site-packages/gensim/models/doc2vec_corpusfile.cpython-38-x86_64-linux-gnu.so
#5  0x00000000005f17e5 in cfunction_call_varargs (kwargs=<optimized out>, args=<optimized out>, 
    func=<built-in function d2v_train_epoch_dm>) at ../Objects/call.c:772
#6  PyCFunction_Call (func=<built-in function d2v_train_epoch_dm>, args=<optimized out>, kwargs=<optimized out>) at ../Objects/call.c:772
#7  0x00000000005f2406 in _PyObject_MakeTpCall (callable=<built-in function d2v_train_epoch_dm>, args=<optimized out>, 
    nargs=<optimized out>, keywords=<optimized out>) at ../Include/internal/pycore_pyerrors.h:13
#8  0x000000000056cfd4 in _PyObject_Vectorcall (kwnames=('doctag_vectors', 'doctag_locks'), nargsf=<optimized out>, 
    args=<optimized out>, callable=<built-in function d2v_train_epoch_dm>) at ../Include/cpython/abstract.h:125
#9  _PyObject_Vectorcall (kwnames=('doctag_vectors', 'doctag_locks'), nargsf=<optimized out>, args=<optimized out>, 
    callable=<built-in function d2v_train_epoch_dm>) at ../Include/cpython/abstract.h:115
#10 call_function (kwnames=('doctag_vectors', 'doctag_locks'), oparg=<optimized out>, pp_stack=<synthetic pointer>, 
    tstate=<optimized out>) at ../Python/ceval.c:4987
#11 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:3515
#12 0x0000000000565972 in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x7ffd34001710, for file /home/paul/.local/lib/python3.8/site-packages/gensim/models/doc2vec.py, line 1199, in _do_train_epoch (self=<Doc2Vec(sg=0, alpha=<float at remote 0x7fffa19bd8d0>, window=5, random=<numpy.random.mtrand.RandomState at remote 0x7fffa09f4640>, min_alpha=<float at remote 0x7fffa19bd910>, hs=0, negative=5, ns_exponent=<float at remote 0x7fffa19bd930>, cbow_mean=1, compute_loss=False, running_training_loss=<float at remote 0x7fffa19bd870>, min_alpha_yet_reached=<float at remote 0x7fffa19bd8d0>, corpus_count=9643078, corpus_total_words=1099181249, vector_size=250, workers=5, epochs=40, train_count=0, total_train_time=0, batch_words=10000, model_trimmed_post_training=False, callbacks=(), load=<function at remote 0x7ffff412f310>, dbow_words=0, dm_concat=0, dm_tag_count=1, vocabulary=<Doc2VecVocab(max_vocab_size=None, min_count=10, sample=<float at remote 0x7fffa12f9670>, sorted_vocab=True, null_word=0, cum_table=<numpy.ndarray at remote 0x7fffa0998c10>, raw_vocab={}, max_final_vocab=None,...(truncated)) at ../Python/ceval.c:741
#13 _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, 
    kwnames=<optimized out>, kwargs=0x7fffa0a01d68, kwcount=<optimized out>, kwstep=1, defs=0x7fffa12ad0f8, defcount=4, kwdefs=0x0, closure=0x0, name='_do_train_epoch', 
    qualname='Doc2Vec._do_train_epoch') at ../Python/ceval.c:4298
#14 0x00000000005f1d85 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7fffa0a01d30, nargsf=<optimized out>, kwnames=<optimized out>) at ../Objects/call.c:435
#15 0x0000000000507729 in _PyObject_Vectorcall (
    kwnames=('total_examples', 'total_words', 'start_alpha', 'end_alpha', 'word_count', 'compute_loss', 'offsets', 'start_doctags'), nargsf=7, args=0x7fffa0a01d30, 
    callable=<function at remote 0x7fffa12ba3a0>) at ../Include/cpython/abstract.h:127
#16 method_vectorcall (method=<optimized out>, args=<optimized out>, nargsf=<optimized out>, 
    kwnames=('total_examples', 'total_words', 'start_alpha', 'end_alpha', 'word_count', 'compute_loss', 'offsets', 'start_doctags')) at ../Objects/classobject.c:89
#17 0x00000000005f1107 in PyVectorcall_Call (kwargs=<optimized out>, tuple=<optimized out>, callable=<method at remote 0x7fff9d8c7600>) at ../Objects/call.c:199
#18 PyObject_Call (callable=<method at remote 0x7fff9d8c7600>, args=<optimized out>, kwargs=<optimized out>) at ../Objects/call.c:227
#19 0x0000000000568e1f in do_call_core (
    kwdict={'total_examples': 9643078, 'total_words': 1099181249, 'start_alpha': <float at remote 0x7fffa19bd8d0>, 'end_alpha': <float at remote 0x7fffa19bd910>, 'word_count': 0, 'compute_loss': False, 'offsets': [0, 1186792315, 2373585688, 3560378663, 4747171525], 'start_doctags': [0, 1296629, 3235497, 5388103, 7520884]}, 
    callargs=('yelp_tripadvisor_linesentence.txt', 4, <float at remote 0x7fff98254b10>, <gensim.models.word2vec_corpusfile.CythonVocab at remote 0x7fff9dde38e0>, (<numpy.ndarray at remote 0x7fff9de26170>, <numpy.ndarray at remote 0x7fff9de26990>), 0), func=<method at remote 0x7fff9d8c7600>, tstate=<optimized out>)
    at ../Python/ceval.c:5034
#20 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:3559
#21 0x0000000000565972 in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x7ffd34000ba0, for file /home/paul/.local/lib/python3.8/site-packages/gensim/models/base_any2vec.py, line 940, in _worker_loop_corpusfile (self=<Doc2Vec(sg=0, alpha=<float at remote 0x7fffa19bd8d0>, window=5, random=<numpy.random.mtrand.RandomState at remote 0x7fffa09f4640>, min_alpha=<float at remote 0x7fffa19bd910>, hs=0, negative=5, ns_exponent=<float at remote 0x7fffa19bd930>, cbow_mean=1, compute_loss=False, running_training_loss=<float at remote 0x7fffa19bd870>, min_alpha_yet_reached=<float at remote 0x7fffa19bd8d0>, corpus_count=9643078, corpus_total_words=1099181249, vector_size=250, workers=5, epochs=40, train_count=0, total_train_time=0, batch_words=10000, model_trimmed_post_training=False, callbacks=(), load=<function at remote 0x7ffff412f310>, dbow_words=0, dm_concat=0, dm_tag_count=1, vocabulary=<Doc2VecVocab(max_vocab_size=None, min_count=10, sample=<float at remote 0x7fffa12f9670>, sorted_vocab=True, null_word=0, cum_table=<numpy.ndarray at remote 0x7fffa0998c10>, raw_vocab={}, max_final...(truncated)) at ../Python/ceval.c:741
#22 _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, 
    kwnames=<optimized out>, kwargs=0x7fffa0a01ce0, kwcount=<optimized out>, kwstep=1, defs=0x7fffa178ba58, defcount=3, kwdefs=0x0, closure=0x0, 
    name='_worker_loop_corpusfile', qualname='BaseAny2VecModel._worker_loop_corpusfile') at ../Python/ceval.c:4298
#23 0x00000000005f1d85 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7fffa0a01cb0, nargsf=<optimized out>, kwnames=<optimized out>) at ../Objects/call.c:435
#24 0x0000000000507729 in _PyObject_Vectorcall (
    kwnames=('start_alpha', 'end_alpha', 'word_count', 'compute_loss', 'offsets', 'start_doctags', 'cur_epoch', 'total_examples', 'total_words'), nargsf=6, 
    args=0x7fffa0a01cb0, callable=<function at remote 0x7fffa17970d0>) at ../Include/cpython/abstract.h:127
#25 method_vectorcall (method=<optimized out>, args=<optimized out>, nargsf=<optimized out>, 
    kwnames=('start_alpha', 'end_alpha', 'word_count', 'compute_loss', 'offsets', 'start_doctags', 'cur_epoch', 'total_examples', 'total_words'))
    at ../Objects/classobject.c:89
#26 0x00000000005f1107 in PyVectorcall_Call (kwargs=<optimized out>, tuple=<optimized out>, callable=<method at remote 0x7fff9d84cdc0>) at ../Objects/call.c:199
#27 PyObject_Call (callable=<method at remote 0x7fff9d84cdc0>, args=<optimized out>, kwargs=<optimized out>) at ../Objects/call.c:227
#28 0x0000000000568e1f in do_call_core (
    kwdict={'start_alpha': <float at remote 0x7fffa19bd8d0>, 'end_alpha': <float at remote 0x7fffa19bd910>, 'word_count': 0, 'compute_loss': False, 'offsets': [0, 1186792315, 2373585688, 3560378663, 4747171525], 'start_doctags': [0, 1296629, 3235497, 5388103, 7520884], 'cur_epoch': 0, 'total_examples': 9643078, 'total_words': 1099181249}, 
    callargs=('yelp_tripadvisor_linesentence.txt', 4, <float at remote 0x7fff98254b10>, <gensim.models.word2vec_corpusfile.CythonVocab at remote 0x7fff9dde38e0>, <Queue(maxsize=0, queue=<collections.deque at remote 0x7fff9dde3d00>, mutex=<_thread.lock at remote 0x7fff98710420>, not_empty=<Condition(_lock=<_thread.lock at remote 0x7fff98710420>, acquire=<built-in method acquire of _thread.lock object at remote 0x7fff98710420>, release=<built-in method release of _thread.lock object at remote 0x7fff98710420>, _waiters=<collections.deque at remote 0x7fff9dde3ca0>) at remote 0x7fff98710460>, not_full=<Condition(_lock=<_thread.lock at remote 0x7fff98710420>, acquire=<built-in method acquire of _thread.lock object at remote 0x7fff98710420>, release=<built-in method release of _thread.lock object at remote 0x7fff98710420>, _waiters=<collections.deque at remote 0x7fff9dde3c40>) at remote 0x7fff987104c0>, all_tasks_done=<Condition(_lock=<_thread.lock at remote 0x7fff98710420>, acquire=<built-in method acquire of _thread.lock objec...(truncated), func=<method at remote 0x7fff9d84cdc0>, tstate=<optimized out>) at ../Python/ceval.c:5034
#29 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:3559
#30 0x00000000005f1b8b in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x7fff9f3ee740, for file /usr/lib/python3.8/threading.py, line 870, in run (self=<Thread(_target=<method at remote 0x7fff9d84cdc0>, _name='Thread-5', _args=('yelp_tripadvisor_linesentence.txt', 4, <float at remote 0x7fff98254b10>, <gensim.models.word2vec_corpusfile.CythonVocab at remote 0x7fff9dde38e0>, <Queue(maxsize=0, queue=<collections.deque at remote 0x7fff9dde3d00>, mutex=<_thread.lock at remote 0x7fff98710420>, not_empty=<Condition(_lock=<_thread.lock at remote 0x7fff98710420>, acquire=<built-in method acquire of _thread.lock object at remote 0x7fff98710420>, release=<built-in method release of _thread.lock object at remote 0x7fff98710420>, _waiters=<collections.deque at remote 0x7fff9dde3ca0>) at remote 0x7fff98710460>, not_full=<Condition(_lock=<_thread.lock at remote 0x7fff98710420>, acquire=<built-in method acquire of _thread.lock object at remote 0x7fff98710420>, release=<built-in method release of _thread.lock object at remote 0x7fff98710420>, _waiters=<collections.deque at remote 0x7fff9dd...(truncated)) at ../Python/ceval.c:741
#31 function_code_fastcall (globals=<optimized out>, nargs=<optimized out>, args=<optimized out>, co=<optimized out>) at ../Objects/call.c:283
#32 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at ../Objects/call.c:410
#33 0x00000000005677c7 in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7fff9f35c7b8, callable=<function at remote 0x7ffff732e9d0>)
    at ../Include/cpython/abstract.h:127
#34 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0xac0530) at ../Python/ceval.c:4987
#35 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:3486
#36 0x00000000005f1b8b in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x7fff9f35c640, for file /usr/lib/python3.8/threading.py, line 932, in _bootstrap_inner (self=<Thread(_target=<method at remote 0x7fff9d84cdc0>, _name='Thread-5', _args=('yelp_tripadvisor_linesentence.txt', 4, <float at remote 0x7fff98254b10>, <gensim.models.word2vec_corpusfile.CythonVocab at remote 0x7fff9dde38e0>, <Queue(maxsize=0, queue=<collections.deque at remote 0x7fff9dde3d00>, mutex=<_thread.lock at remote 0x7fff98710420>, not_empty=<Condition(_lock=<_thread.lock at remote 0x7fff98710420>, acquire=<built-in method acquire of _thread.lock object at remote 0x7fff98710420>, release=<built-in method release of _thread.lock object at remote 0x7fff98710420>, _waiters=<collections.deque at remote 0x7fff9dde3ca0>) at remote 0x7fff98710460>, not_full=<Condition(_lock=<_thread.lock at remote 0x7fff98710420>, acquire=<built-in method acquire of _thread.lock object at remote 0x7fff98710420>, release=<built-in method release of _thread.lock object at remote 0x7fff98710420>, _waiters=<collections.deque at rem...(truncated)) at ../Python/ceval.c:741
#37 function_code_fastcall (globals=<optimized out>, nargs=<optimized out>, args=<optimized out>, co=<optimized out>) at ../Objects/call.c:283
#38 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at ../Objects/call.c:410
#39 0x00000000005677c7 in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7fff9f3ee6f8, callable=<function at remote 0x7ffff732eca0>)
    at ../Include/cpython/abstract.h:127
#40 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0xac0530) at ../Python/ceval.c:4987
#41 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:3486
#42 0x00000000005f1b8b in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x7fff9f3ee580, for file /usr/lib/python3.8/threading.py, line 890, in _bootstrap (self=<Thread(_target=<method at remote 0x7fff9d84cdc0>, _name='Thread-5', _args=('yelp_tripadvisor_linesentence.txt', 4, <float at remote 0x7fff98254b10>, <gensim.models.word2vec_corpusfile.CythonVocab at remote 0x7fff9dde38e0>, <Queue(maxsize=0, queue=<collections.deque at remote 0x7fff9dde3d00>, mutex=<_thread.lock at remote 0x7fff98710420>, not_empty=<Condition(_lock=<_thread.lock at remote 0x7fff98710420>, acquire=<built-in method acquire of _thread.lock object at remote 0x7fff98710420>, release=<built-in method release of _thread.lock object at remote 0x7fff98710420>, _waiters=<collections.deque at remote 0x7fff9dde3ca0>) at remote 0x7fff98710460>, not_full=<Condition(_lock=<_thread.lock at remote 0x7fff98710420>, acquire=<built-in method acquire of _thread.lock object at remote 0x7fff98710420>, release=<built-in method release of _thread.lock object at remote 0x7fff98710420>, _waiters=<collections.deque at remote 0x...(truncated)) at ../Python/ceval.c:741
#43 function_code_fastcall (globals=<optimized out>, nargs=<optimized out>, args=<optimized out>, co=<optimized out>) at ../Objects/call.c:283
#44 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at ../Objects/call.c:410
#45 0x000000000050722c in _PyObject_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=<optimized out>)
    at ../Include/cpython/abstract.h:127
#46 method_vectorcall (method=<optimized out>, args=0x7ffff7634058, nargsf=<optimized out>, kwnames=<optimized out>) at ../Objects/classobject.c:89
#47 0x00000000005f1107 in PyVectorcall_Call (kwargs=<optimized out>, tuple=<optimized out>, callable=<method at remote 0x7fff9d8c7540>) at ../Objects/call.c:199
#48 PyObject_Call (callable=<method at remote 0x7fff9d8c7540>, args=<optimized out>, kwargs=<optimized out>) at ../Objects/call.c:227
#49 0x000000000064fb98 in t_bootstrap (boot_raw=boot_raw@entry=0x7fff9f33a150) at ../Modules/_threadmodule.c:1002
#50 0x000000000066ee14 in pythread_wrapper (arg=<optimized out>) at ../Python/thread_pthread.h:237
#51 0x00007ffff7d96609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#52 0x00007ffff7ed2103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I can provide the corpus on request.

Versions

Linux-5.4.0-45-generic-x86_64-with-glibc2.29
Python 3.8.2 (default, Jul 16 2020, 14:00:26) 
[GCC 9.3.0]
Bits 64
NumPy 1.19.2
SciPy 1.5.2
gensim 3.8.3
FAST_VERSION 1

gojomo · 2020-09-11T01:47:00Z

Thanks for the detailed report! Can you say a little more about the corpus size? If you enable logging at the INFO level, how much progress is shown before the fault? Does it always fault at the same point?
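For anyone following along, a minimal sketch of turning on INFO-level logging before the training call. The format string is only a suggestion, and `force=True` assumes Python 3.8 or later (it replaces any handlers that were already installed on the root logger):

```python
import logging

# Send INFO-level messages (including gensim's training progress) to stderr.
# force=True (Python 3.8+) replaces any handlers configured earlier, so the
# call takes effect even if something else already set up logging.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
    force=True,
)
```

Run this before constructing the model; the last progress lines printed before the crash show roughly how far training got.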

Paul-E · 2020-09-11T03:18:57Z

The corpus has 9,643,078 documents and 1,099,181,249 total words.

I forgot to include that I am running this on Ubuntu 20.04.

Attached is the output from setting logging to INFO:

train_logs.txt

gojomo · 2020-09-11T07:46:39Z

This may be the same issue as #2894 - fixed in the develop branch. If you're able to test with a development code checkout (which might require other changes in your code, though not in the single line of instantiation code you've shown above), you might not see the crash.

Essentially: instead of using a package from the PyPI or Conda repos, do a git checkout; ensure your system has key Ubuntu packages like build-essential and Python packages like Cython; then do a pip install -e . from within the project directory.

gojomo · 2020-09-11T08:04:47Z

(Also: that bug is only in the corpus_file path, so another workaround could be to supply your docs via the traditional iterable-of-TaggedDocument-instances API. That won't achieve as much utilization/throughput with as many workers, but if training succeeds that way, it gives you a slower working option and confirms the problem is specific to corpus_file and probably the same as #2894.)
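A rough sketch of that iterable-based workaround, assuming the corpus file holds one whitespace-tokenized document per line. The names `iter_token_lines`, `TaggedLineCorpus` and `train_from_iterable` are made up for illustration, and using the line number as each document's tag is an assumption meant to mirror what corpus_file mode does:

```python
def iter_token_lines(path):
    """Yield (line_number, tokens) for each line of a one-document-per-line file."""
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle):
            yield line_no, line.split()


class TaggedLineCorpus:
    """Restartable iterable: Doc2Vec makes several passes, so a bare generator won't do."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Imported lazily so the file-reading helper above works without gensim.
        from gensim.models.doc2vec import TaggedDocument

        for line_no, tokens in iter_token_lines(self.path):
            # Assumption: tag each document with its line number.
            yield TaggedDocument(words=tokens, tags=[line_no])


def train_from_iterable(path):
    """Train with the same hyperparameters as the report, but without corpus_file."""
    from gensim.models.doc2vec import Doc2Vec

    return Doc2Vec(
        documents=TaggedLineCorpus(path),
        vector_size=250,
        min_count=10,
        epochs=40,
        workers=5,
    )
```

Calling `train_from_iterable("yelp_tripadvisor_linesentence.txt")` would then stand in for the original one-liner. Expect it to be slower, since the iterable path does not keep as many workers busy.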

Paul-E · 2020-09-15T01:04:58Z

I have successfully trained my model by installing gensim from github. Thank you :)

Paul-E changed the title ~~Segfault when trainding doc2vec~~ Segfault when training doc2vec Sep 10, 2020

Paul-E closed this as completed Sep 15, 2020

mpenkov mentioned this issue Oct 28, 2020

Update changelog for 4.0.0 release #2981

Merged
