Potential refactor: a 'NamedVectors' class for reuse by Word2Vec, Doc2Vec, etc #549

gojomo · 2015-11-26T02:19:05Z

Some wishlist features for Word2Vec/Doc2Vec vector sets, like approximate-neighbor-indexing (#51, #527) or post-training transformations ('retrofit' #547 or translation), really just need a set-of-vectors-of-which-some-are-string-named. That is, the bundle of state now mixed-into Word2Vec/DocvecsArray as syn0 array, index2word list, and vocab dict.

It may ease those features and improve the code-organization logic to refactor that state out into a new class, tentatively called 'NamedVectors', to be reused via composition into these other classes, but also available for separate use when you don't need the training-model-wrapper.

Many operations – loading from a plain word2vec.c model, similarity-lookups, (future) approximate-indexing, (future) learned-projections, etc – would live inside or operate on these NamedVectors objects, which could be loaded/saved separately from trainable models. For example, if you only want the vectors and the similarity-operations, why wrap that state in a broken full training model (or keep around the expansive extra state that's not needed)?

The text was updated successfully, but these errors were encountered:

piskvorky · 2015-11-26T02:43:05Z

The challenge here is designing a flexible enough API, i.e. identifying relevant use cases.

From the linked tickets, it looks like a good-enough API may be just plain Python mapping (e.g., dict). Word2vec already conforms -- model['some_word'] returns a vector. A new class (NamedVectors) seems superfluous.

We could also offer a method to extract this dict from word2vec explicitly (model.to_dict()), for advanced users who know what they're doing.

For doc2vec, I'd like to make its inference syntax conform to standard gensim transformation API (now it's a special method infer_vector). IMO it makes more sense for the pre-trained tag vectors to be accessed via some special syntax, keeping the dynamic inference as the "standard use case", using standard __getitem__. Or maybe keep both under __getitem__, no special syntax, and distinguish between the two types of input at runtime, based on the actual input type?

gojomo · 2015-11-26T03:55:05Z

Yes, the API would definitely support __getitem__() lookup; the benefit would be disentangling that state from other state/behavior that isn't necessarily attached. So it'd be easier to process/talk-about the output of a Word2Vec/Doc2Vec process, as something separate from its parameters/algorithms/internal-state.

For example:

Say you want just a set of vectors for similarity-searches? You wouldn't have to 'strip down' a larger Word2Vec model, leaving it in some halfway state where some operations give cryptic errors. If you only want vectors-keyed-by-strings, you'd just pull out (or in fact only ever instantiate) one of these new objects.
You've finished Doc2Vec training and you only want the model for future inference, not caring about the bulk-calculated vectors? You'd save just the Doc2Vec model, skipping (as one clearly-distinct unit) the composed-in state of vectors-and-names.... rather than nulling-out ~3 different properties, and again having a model in a halfway state.
Or alternatively, you only want the vectors, and don't need the extra model state? You pull out (or save aside) just the NamedVectors component. It's maximally compact and usable without the rest, and acceptable operations are fully enumerated by just its own methods.
You have a Doc2Vec model, and want to bulk-infer another 1M documents, perhaps with their own string doc-tag IDs – and then do most_similar() among those, and/or save them for later? At the moment you can't reuse the string-to-index mapping, most_similar(), and save/load facilities mixed-inside Word2Vec/Doc2Vec classes. But those would be the core of the new class, so a bulk-infer could just return a complete, usable, self-contained NamedVectors.

The interface wouldn't just be a dict-interface, because ranges/slicing/raw-array access are also needed for some operations.

Moving the vector-access into such a helper object would make it easier to restore __getitem__() to match gensim's other topic-calculation transformation use. (For example, in Doc2Vec models, both word-vecs and doctag-vecs could be looked up via subsidiary objects, leaving the model itself to offer []-transformation.)

But personally, I find that convention very non-intuitive. Long-runs-of-text aren't often keys/indexes, so feeding them to []-indexing causes a double-take when I'm reading code. This is especially the case if the selector is a mutable-list (or even iterator!) of string tokens – that's very different from the best-practice keys/indexes that tend to be immutable or even primitive types. Or if multiple transformations are applied via nested-[[[]]].

Further, if I know the text I'm providing is truly 'new', it feels odd to be using the language syntax for 'looking it up' rather than 'transforming it'. An explicit method call better indicates that on-demand-transformation is happening, and helpfully gives that calculation a descriptive name. (Specifically for the case of inference, it's already somewhat confusing to people; more explicitness in what's being requested can only help.)

droudy · 2016-08-03T17:00:14Z

I'm working on this for #809

gojomo · 2016-08-03T22:57:18Z

@droudy - Cool, let me know if I can help with any ideas/review.

piskvorky · 2016-11-09T04:10:28Z

Ping @droudy -- any progress on this or #809?

tmylk · 2016-11-09T09:35:58Z

Latest version is waiting for @gojomo review in #980

tmylk · 2016-11-13T09:15:04Z

Fixed #980

gojomo mentioned this issue Nov 26, 2015

Feature Suggestion: Wang2Vec #547

Open

piskvorky added the wishlist Feature request label Nov 26, 2015

gojomo mentioned this issue Aug 2, 2016

[WIP] Add ability to use Tensorflow to train a word2vec model #809

Closed

droudy mentioned this issue Aug 9, 2016

NamedVectors refactor for word2vec #819

Closed

droudy mentioned this issue Aug 19, 2016

KeyedVectors refactor for word2vec #833

Closed

tmylk closed this as completed Nov 13, 2016

jayantj mentioned this issue Dec 18, 2016

[MRG] Wrapper for FastText #847

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Potential refactor: a 'NamedVectors' class for reuse by Word2Vec, Doc2Vec, etc #549

Potential refactor: a 'NamedVectors' class for reuse by Word2Vec, Doc2Vec, etc #549

gojomo commented Nov 26, 2015

piskvorky commented Nov 26, 2015

gojomo commented Nov 26, 2015

droudy commented Aug 3, 2016 •

edited

Loading

gojomo commented Aug 3, 2016

piskvorky commented Nov 9, 2016

tmylk commented Nov 9, 2016

tmylk commented Nov 13, 2016

Potential refactor: a 'NamedVectors' class for reuse by Word2Vec, Doc2Vec, etc #549

Potential refactor: a 'NamedVectors' class for reuse by Word2Vec, Doc2Vec, etc #549

Comments

gojomo commented Nov 26, 2015

piskvorky commented Nov 26, 2015

gojomo commented Nov 26, 2015

droudy commented Aug 3, 2016 • edited Loading

gojomo commented Aug 3, 2016

piskvorky commented Nov 9, 2016

tmylk commented Nov 9, 2016

tmylk commented Nov 13, 2016

droudy commented Aug 3, 2016 •

edited

Loading