word embeddings from word2vec fails to load correctly #1656

lambdaofgod · 2022-07-31T12:53:19Z

Loading from word2vec format fails if the first line of the .txt file contains the number of words and the dimensionality.

For example, gensim exports to a .txt file with a first line like this:

10000 50

In the current version, the loader treats this header as a word followed by a one-element vector, so it sets the dimensionality to 1 and then fails to load the actual vectors.

I fixed this by checking whether the first line contains exactly two tokens and, if so, treating the second one as the dimensionality.
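The check described above can be sketched roughly as follows. This is a minimal standalone illustration, not the code from this PR or from #1875: the function name `load_word2vec_text` and its return shape are hypothetical, and it assumes space-separated tokens with one word per line.

```python
def load_word2vec_text(lines):
    """Parse word2vec-style text lines into (vocab, vectors).

    Hypothetical helper for illustration only.
    """
    vocab, vectors = [], []
    dim = None
    for line_no, line in enumerate(lines):
        tokens = line.rstrip().split(" ")
        if line_no == 0 and len(tokens) == 2:
            # Header of the form "<num_words> <dimensionality>":
            # use the second token as the dimensionality and skip the line.
            dim = int(tokens[1])
            continue
        if tokens == [""]:
            # Ignore blank lines.
            continue
        word, values = tokens[0], tokens[1:]
        if dim is None:
            # No header was present: infer dimensionality from the first vector.
            dim = len(values)
        if len(values) != dim:
            raise ValueError(
                f"line {line_no + 1}: expected {dim} values, got {len(values)}"
            )
        vocab.append(word)
        vectors.append([float(v) for v in values])
    return vocab, vectors


# Usage: the header line is skipped instead of being read as a 1-dimensional vector.
vocab, vectors = load_word2vec_text(["2 3", "hello 0.1 0.2 0.3", "world 0.4 0.5 0.6"])
```

One caveat of the two-token heuristic: a headerless file whose vectors really are 1-dimensional would have its first entry mistaken for a header, so a real implementation may want an extra sanity check.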

tomaarsen · 2023-12-18T14:17:43Z

Hello!

Thanks for your PR pointing me to this issue! I've resolved it by merging #1875, which also had a clean solution. As a result, I will be closing this, and the next version of Sentence Transformers should include word2vec support.

Tom Aarsen

lambdaofgod added 2 commits July 31, 2022 14:42

fix word2vec header issue

ad832b5

proper addition of padding token

a0cf436

lambdaofgod force-pushed the fix_word2vec branch from c725ec8 to a0cf436 Compare July 31, 2022 13:27

tomaarsen mentioned this pull request Dec 18, 2023

Acccept word2vec formats too. #1875

Merged

tomaarsen closed this Dec 18, 2023
