Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential refactor: a 'NamedVectors' class for reuse by Word2Vec, Doc2Vec, etc #549

Closed
gojomo opened this issue Nov 26, 2015 · 7 comments
Closed
Labels
wishlist Feature request

Comments

@gojomo
Copy link
Collaborator

gojomo commented Nov 26, 2015

Some wishlist features for Word2Vec/Doc2Vec vector sets, like approximate-neighbor-indexing (#51, #527) or post-training transformations ('retrofit' #547 or translation), really just need a set-of-vectors-of-which-some-are-string-named. That is, the bundle of state now mixed-into Word2Vec/DocvecsArray as syn0 array, index2word list, and vocab dict.

It may ease those features and improve the code-organization logic to refactor that state out into a new class, tentatively called 'NamedVectors', to be reused via composition into these other classes, but also available for separate use when you don't need the training-model-wrapper.

Many operations – loading from a plain word2vec.c model, similarity-lookups, (future) approximate-indexing, (future) learned-projections, etc – would live inside or operate on these NamedVectors objects, which could be loaded/saved separately from trainable models. For example, if you only want the vectors and the similarity-operations, why wrap that state in a broken full training model (or keep around the expansive extra state that's not needed)?

@piskvorky
Copy link
Owner

The challenge here is designing a flexible enough API, i.e. identifying relevant use cases.

From the linked tickets, it looks like a good-enough API may be just plain Python mapping (e.g., dict). Word2vec already conforms -- model['some_word'] returns a vector. A new class (NamedVectors) seems superfluous.

We could also offer a method to extract this dict from word2vec explicitly (model.to_dict()), for advanced users who know what they're doing.

For doc2vec, I'd like to make its inference syntax conform to standard gensim transformation API (now it's a special method infer_vector). IMO it makes more sense for the pre-trained tag vectors to be accessed via some special syntax, keeping the dynamic inference as the "standard use case", using standard __getitem__. Or maybe keep both under __getitem__, no special syntax, and distinguish between the two types of input at runtime, based on the actual input type?

@piskvorky piskvorky added the wishlist Feature request label Nov 26, 2015
@gojomo
Copy link
Collaborator Author

gojomo commented Nov 26, 2015

Yes, the API would definitely support __getitem__() lookup; the benefit would be disentangling that state from other state/behavior that isn't necessarily attached. So it'd be easier to process/talk-about the output of a Word2Vec/Doc2Vec process, as something separate from its parameters/algorithms/internal-state.

For example:

  • Say you want just a set of vectors for similarity-searches? You wouldn't have to 'strip down' a larger Word2Vec model, leaving it in some halfway state where some operations give cryptic errors. If you only want vectors-keyed-by-strings, you'd just pull out (or in fact only ever instantiate) one of these new objects.
  • You've finished Doc2Vec training and you only want the model for future inference, not caring about the bulk-calculated vectors? You'd save just the Doc2Vec model, skipping (as one clearly-distinct unit) the composed-in state of vectors-and-names.... rather than nulling-out ~3 different properties, and again having a model in a halfway state.
  • Or alternatively, you only want the vectors, and don't need the extra model state? You pull out (or save aside) just the NamedVectors component. It's maximally compact and usable without the rest, and acceptable operations are fully enumerated by just its own methods.
  • You have a Doc2Vec model, and want to bulk-infer another 1M documents, perhaps with their own string doc-tag IDs – and then do most_similar() among those, and/or save them for later? At the moment you can't reuse the string-to-index mapping, most_similar(), and save/load facilities mixed-inside Word2Vec/Doc2Vec classes. But those would be the core of the new class, so a bulk-infer could just return a complete, usable, self-contained NamedVectors.

The interface wouldn't just be a dict-interface, because ranges/slicing/raw-array access are also needed for some operations.

Moving the vector-access into such a helper object would make it easier to restore __getitem__() to match gensim's other topic-calculation transformation use. (For example, in Doc2Vec models, both word-vecs and doctag-vecs could be looked up via subsidiary objects, leaving the model itself to offer []-transformation.)

But personally, I find that convention very non-intuitive. Long-runs-of-text aren't often keys/indexes, so feeding them to []-indexing causes a double-take when I'm reading code. This is especially the case if the selector is a mutable-list (or even iterator!) of string tokens – that's very different from the best-practice keys/indexes that tend to be immutable or even primitive types. Or if multiple transformations are applied via nested-[[[]]].

Further, if I know the text I'm providing is truly 'new', it feels odd to be using the language syntax for 'looking it up' rather than 'transforming it'. An explicit method call better indicates that on-demand-transformation is happening, and helpfully gives that calculation a descriptive name. (Specifically for the case of inference, it's already somewhat confusing to people; more explicitness in what's being requested can only help.)

@droudy
Copy link
Contributor

droudy commented Aug 3, 2016

I'm working on this for #809

@gojomo
Copy link
Collaborator Author

gojomo commented Aug 3, 2016

@droudy - Cool, let me know if I can help with any ideas/review.

@piskvorky
Copy link
Owner

Ping @droudy -- any progress on this or #809?

@tmylk
Copy link
Contributor

tmylk commented Nov 9, 2016

Latest version is waiting for @gojomo review in #980

@tmylk
Copy link
Contributor

tmylk commented Nov 13, 2016

Fixed #980

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wishlist Feature request
Projects
None yet
Development

No branches or pull requests

4 participants