
Acccept word2vec formats too. #1875

Merged
2 commits merged into UKPLab:master on Dec 18, 2023
Conversation

@mokha (Contributor) commented on Mar 23, 2023

The difference between word2vec and GloVe embeddings (in text format) is just that in the word2vec format the first line indicates the vocabulary size and the embedding dimensionality. With this commit, that line is ignored and the WordEmbeddings class is able to read embeddings in both formats.

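For illustration, a minimal sketch of the two text formats and of skipping the word2vec header line when reading. The helper name read_text_embeddings and the use of gzip/numpy are illustrative assumptions, not the actual code from this PR:

import gzip

import numpy as np

def read_text_embeddings(path):
    # Hypothetical helper: reads GloVe- or word2vec-style text embeddings into a dict of vectors.
    open_fn = gzip.open if path.endswith(".gz") else open
    embeddings = {}
    with open_fn(path, "rt", encoding="utf8") as f_in:
        first = f_in.readline().rstrip().split(" ")
        if len(first) == 2:
            # word2vec text format: header line "<vocab_size> <dimensions>" -> skip it
            pass
        else:
            # GloVe text format: no header, the first line is already "word v1 v2 ... vN"
            embeddings[first[0]] = np.asarray(first[1:], dtype=np.float32)
        for line in f_in:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings
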
@tomaarsen (Collaborator)

Hello!

I didn't realise there was a difference in formats; I was under the impression that word2vec was supported. For example:

from sentence_transformers import models
word_embedding_model = models.WordEmbeddings.from_text_file('GoogleNews-vectors-negative300.txt.gz')

Do you have an example of a common file that doesn't currently work?

  • Tom Aarsen

@tomaarsen (Collaborator)

I've prepared a word2vec file via gensim and its model.wv.save_word2vec_format, and it indeed showed me the issue here. I also found #1656, but I prefer the fix from this PR.

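For context, a small sketch of how such a word2vec-format file can be produced with gensim (4.x API) and then loaded; the toy sentences and the file name toy_word2vec.txt are made up for illustration:

from gensim.models import Word2Vec
from sentence_transformers import models

# Train a tiny toy model just to obtain a word2vec-format text file
# (its first line is the header "<vocab_size> <dimensions>").
sentences = [["hello", "world"], ["word2vec", "format", "test"]]
w2v = Word2Vec(sentences, vector_size=50, min_count=1)
w2v.wv.save_word2vec_format("toy_word2vec.txt", binary=False)

# With this fix, WordEmbeddings can read the file even though it starts with the header line.
word_embedding_model = models.WordEmbeddings.from_text_file("toy_word2vec.txt")
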
@tomaarsen merged commit 9b1c33f into UKPLab:master on Dec 18, 2023
8 checks passed