Update on-disk word vector binary format for faster load #788

Closed
wants to merge 56 commits into from

mattmacy

  • Speed up word vector loading
  • Change the on-disk format
  • Store vectors normalized

To use the new loading code, a new binary file must be generated:

import spacy
import spacy.txtvec2bin as v2b
v2b.vec2bin("glove.840B.300d.txt", "vec.bin", 1000000)

Description

  • Create simplified version of vectors.pyx from sense2vec

  • Store all word vectors in memory normalized

  • Automatically update the norm when updating the vector

  • Update all references to lexeme vector to use the new vectors structure living in Vocab

  • Create a more efficient binary format for word vectors

    • lets the user pay only the overhead of reading in the vectors for the vocabulary that is actually used
    • pre-computes all norms at file creation time
    • in principle, avoids allocating any memory for the original strings themselves
    • adds a magic check to validate that the file is a spaCy word vector file
    • adds a version check so the code automatically knows when it does not match
      the version on disk and can, if desired, maintain loaders for multiple versions
  • The document save code does not currently use the new vector code path, but we can easily add that
    if so desired
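
As a rough illustration of the format properties listed above (magic check, version check, norms computed at file creation time, reading only the rows that are needed, and no stored strings), here is a minimal Python 3 sketch. It is not the layout implemented in this PR: the magic bytes, the header fields, the use of a 64-bit hash as the key, and the names `key_of`, `write_vectors` and `load_vectors` are all assumptions made for the example.

```python
import hashlib
import math
import struct

# Hypothetical constants: the real magic value and version are not given in this PR.
MAGIC = b"SPVC"
VERSION = 1
# Header: magic (4 bytes), version, number of rows, vector width (little-endian).
HEADER = struct.Struct("<4sIII")


def key_of(word):
    """Map a word to a 64-bit key so the file never has to store the string."""
    digest = hashlib.blake2b(word.encode("utf8"), digest_size=8).digest()
    return struct.unpack("<Q", digest)[0]


def write_vectors(path, words, vectors):
    """Write header, keys, original norms, then unit-length float32 rows."""
    dim = len(vectors[0])
    # Norms are computed once here, at file creation time.
    norms = [math.sqrt(sum(x * x for x in vec)) for vec in vectors]
    with open(path, "wb") as f:
        f.write(HEADER.pack(MAGIC, VERSION, len(words), dim))
        f.write(struct.pack("<%dQ" % len(words), *(key_of(w) for w in words)))
        f.write(struct.pack("<%df" % len(norms), *norms))
        for vec, norm in zip(vectors, norms):
            scale = norm or 1.0  # leave all-zero vectors untouched
            f.write(struct.pack("<%df" % dim, *(x / scale for x in vec)))


def load_vectors(path, wanted):
    """Return {word: (unit_vector, original_norm)} for the wanted words only."""
    with open(path, "rb") as f:
        magic, version, n_rows, dim = HEADER.unpack(f.read(HEADER.size))
        if magic != MAGIC:
            raise ValueError("not a word vector file")
        if version != VERSION:
            raise ValueError("unsupported format version %d" % version)
        keys = struct.unpack("<%dQ" % n_rows, f.read(8 * n_rows))
        norms = struct.unpack("<%df" % n_rows, f.read(4 * n_rows))
        data_start = f.tell()
        row_of = {key: row for row, key in enumerate(keys)}
        result = {}
        for word in wanted:
            row = row_of.get(key_of(word))
            if row is None:
                continue  # word has no vector in this file
            # Fixed-width rows let us seek straight to the one we need.
            f.seek(data_start + row * dim * 4)
            unit = struct.unpack("<%df" % dim, f.read(dim * 4))
            result[word] = (unit, norms[row])
    return result
```

Because every row has the same width, a loader along these lines only reads the vector data for words that are requested; the key and norm tables are still read in full. A version mismatch is reported as an error here, but the same check could instead dispatch to a loader for an older version.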

Although rather invasive, the change is quite straightforward.

Ran pytest under 2.7 and 3.6m. Ran nn_text_class.py under 3.6m.

Types of changes

  • [x] Bug fix (non-breaking change fixing an issue)
  • [ ] New feature (non-breaking change adding functionality to spaCy)
  • [x] Breaking change (fix or feature causing change to spaCy's existing functionality)
  • [ ] Documentation (addition to documentation of spaCy)

Checklist:

  • [ ] My change requires a change to spaCy's documentation.
  • [ ] I have updated the documentation accordingly.
  • [ ] I have added tests to cover my changes.
  • [x] All new and existing tests passed.

Matt Macy and others added 30 commits January 22, 2017 19:39
…e all sense2vec code that doesn't make sense for a generalized vectors implementation
@mattmacy
Author

NB 4 of 6 tests actually succeeded. The sdist builds are misconfigured.

@mattmacy
Author

mattmacy commented Feb 5, 2017

Need to clean up the commit history first.

@mattmacy mattmacy closed this Feb 5, 2017