Potential refactor: a 'NamedVectors' class for reuse by Word2Vec, Doc2Vec, etc #549
Comments
The challenge here is designing a flexible enough API, i.e. identifying relevant use cases. From the linked tickets, it looks like a good-enough API may be just a plain Python mapping (e.g., We could also offer a method to extract this dict from word2vec explicitly ( For doc2vec, I'd like to make its inference syntax conform to the standard gensim transformation API (now it's a special method
Yes, the API would definitely support For example:
The interface wouldn't just be a dict-interface, because ranges/slicing/raw-array access are also needed for some operations. Moving the vector-access into such a helper object would make it easier to restore But personally, I find that convention very non-intuitive. Long runs of text aren't often keys/indexes, so feeding them to []-indexing causes a double-take when I'm reading code. This is especially the case if the selector is a mutable list (or even iterator!) of string tokens – that's very different from the best-practice keys/indexes that tend to be immutable or even primitive types. Or if multiple transformations are applied via nested-[[[]]]. Further, if I know the text I'm providing is truly 'new', it feels odd to be using the language syntax for 'looking it up' rather than 'transforming it'. An explicit method call better indicates that on-demand transformation is happening, and helpfully gives that calculation a descriptive name. (Specifically for the case of inference, it's already somewhat confusing to people; more explicitness in what's being requested can only help.)
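To make the "more than a dict" point concrete, here is a minimal sketch of a helper whose []-indexing accepts a name, a row position, or a slice. The class name VectorLookup is hypothetical, plain Python lists stand in for the raw array to keep the example dependency-free, and vocab is simplified to a name-to-position dict; none of this is an existing gensim API.

```python
class VectorLookup:
    """Hypothetical helper: rows addressable by name, position, or slice."""

    def __init__(self, names, rows):
        if len(names) != len(rows):
            raise ValueError("names and rows must have the same length")
        self.index2word = list(names)  # row position -> name
        # name -> row position (simplified stand-in for the vocab dict)
        self.vocab = {name: i for i, name in enumerate(self.index2word)}
        self.syn0 = [list(row) for row in rows]  # stand-in for the raw array

    def __contains__(self, name):
        return name in self.vocab

    def __len__(self):
        return len(self.index2word)

    def __getitem__(self, key):
        if isinstance(key, str):
            # Dict-style access by name.
            return self.syn0[self.vocab[key]]
        # Raw-array style access: an int returns one row, a slice a range of rows.
        return self.syn0[key]


vecs = VectorLookup(["cat", "dog", "fish"],
                    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
by_name = vecs["dog"]   # lookup by string name
by_range = vecs[0:2]    # first two rows, in index2word order
```

Lookups of known names go through []-indexing here, while anything that computes a vector on demand (such as inference for new text) would stay an explicitly named method, in line with the argument above.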
I'm working on this for #809
@droudy - Cool, let me know if I can help with any ideas/review.
Fixed #980
Some wishlist features for Word2Vec/Doc2Vec vector sets, like approximate-neighbor-indexing (#51, #527) or post-training transformations ('retrofit' #547 or translation), really just need a set-of-vectors-of-which-some-are-string-named. That is, the bundle of state now mixed into Word2Vec/DocvecsArray as the syn0 array, index2word list, and vocab dict.
It may ease those features and improve the code-organization logic to refactor that state out into a new class, tentatively called 'NamedVectors', to be reused via composition into these other classes, but also available for separate use when you don't need the training-model-wrapper.
Many operations – loading from a plain word2vec.c model, similarity-lookups, (future) approximate-indexing, (future) learned-projections, etc – would live inside or operate on these NamedVectors objects, which could be loaded/saved separately from trainable models. For example, if you only want the vectors and the similarity-operations, why wrap that state in a broken full training model (or keep around the expansive extra state that's not needed)?
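As a rough illustration of the proposal, the sketch below bundles the three pieces of state into one object, puts a similarity lookup on it, and lets a training wrapper hold it by composition. Everything here is an assumption for illustration: the add and most_similar signatures, the ToyModel wrapper, and the use of plain lists and a name-to-position dict in place of the real array and vocab entries. It is not how gensim implements any of this.

```python
import math


class NamedVectors:
    """Hypothetical sketch of the proposed class; not an existing gensim API."""

    def __init__(self):
        self.syn0 = []        # list of vectors (stand-in for a 2-D array)
        self.index2word = []  # row position -> name
        self.vocab = {}       # name -> row position (simplified)

    def add(self, name, vector):
        if name in self.vocab:
            raise KeyError(f"{name!r} already present")
        self.vocab[name] = len(self.index2word)
        self.index2word.append(name)
        self.syn0.append([float(x) for x in vector])

    def __getitem__(self, name):
        return self.syn0[self.vocab[name]]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def most_similar(self, name, topn=3):
        """Return up to topn (name, cosine similarity) pairs, best first."""
        query = self[name]
        scored = [
            (other, self._cosine(query, self.syn0[i]))
            for i, other in enumerate(self.index2word)
            if other != name
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:topn]


class ToyModel:
    """Hypothetical training wrapper that holds its vectors by composition."""

    def __init__(self):
        self.vectors = NamedVectors()


model = ToyModel()
model.vectors.add("cat", [1.0, 0.0])
model.vectors.add("dog", [0.9, 0.1])
model.vectors.add("fish", [0.0, 1.0])

standalone = model.vectors  # keep only the vectors
del model                   # the training wrapper is no longer needed
nearest = standalone.most_similar("cat", topn=1)
```

Because the similarity code lives on the vectors object rather than on the model, the wrapper can be discarded once training is done and the lookups still work.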