
API call for Topic Distribution of words. #683

Closed
bhargavvader opened this issue Apr 25, 2016 · 26 comments
Labels
difficulty medium Medium issue: required good gensim understanding & python skills wishlist Feature request

Comments

@bhargavvader
Contributor

bhargavvader commented Apr 25, 2016

As discussed here and here in the mailing list, there is currently no way to find the topic index for a word - i.e. the topic a word belongs to - basically, the distribution over all topics for a single word.

As @piskvorky mentioned in the second link, the ideal way now is:

"With variational inference LDA (such as the implementation in gensim), you can get the per-word topic distribution from phi, one of the variational parameters. When you call lda.inference() with collect_sstats=True, it will return a 2-tuple of (gamma, sufficient statistics). In your single-word case, these sufficient statistics directly correspond to (already normalized) phi, so you can read the word topics directly off of that."

This also discusses the same issue, and describes the problem and what is needed in simpler words.

I'll open a PR for this soon, and would like suggestions: should it be a method in the ldamodel class which just takes a word as a parameter, uses the VB internal parameters, and gives back the topic and probability, or is there a different approach?

@piskvorky
Owner

piskvorky commented Apr 25, 2016

I haven't given this API enhancement deep thought, but I think a good way may be to add a new, optional parameter: get_document_topics(per_word_topics=True).

  • The default would be per_word_topics=False, so nothing changes unless the user specifically asks for those per-word topic assignments;
  • but if per_word_topics=True, the method returns a 2-tuple: (per-topic-probability-for-this-document, per-word-type-best-topic-for-this-document) (instead of just the normal per-topic-probability-for-this-document output).

The second link in your description seems unrelated -- it asks for static topic x vocab assignments, independent of input document. This issue asks for dynamic topic x document words assignment, dependent on input document: the same word can be assigned to different topics for different documents, depending on other words in the document -- the context matters.

@tmylk tmylk added wishlist Feature request difficulty easy Easy issue: required small fix labels Apr 25, 2016
@bhargavvader
Contributor Author

I'm still a little confused, because the initial question says:

" want to assign a topic index to each word in the corpus."

Can that not mean the vocabulary? Or does it mean every word of document in the corpus? If it is certainly the latter and it is like you said the dynamic topic x document words assignment, then your suggestion does seem like the way to go.

@piskvorky
Owner

If it's topic x vocabulary, then we don't need anything special. We already have the various print_topics etc methods.

What we need here are the topic x word assignments (assigning a topic to each word in a concrete document).

@piskvorky piskvorky added difficulty medium Medium issue: required good gensim understanding & python skills and removed difficulty easy Easy issue: required small fix labels Apr 28, 2016
@bhargavvader
Contributor Author

bhargavvader commented Apr 28, 2016

I've tried looking around the ldamodel class but I seem to be stuck... How exactly will I get the value? Should I use the gamma value returned by inference for that word and go from there?

For example, if my LDA model has 80 topics and my doc is a bow, what exactly will, say, lda.inference(doc, collect_sstats=True)[1][67] return? It obviously returns details about the 68th topic, but I'm not sure exactly what...

Have you mentioned the logic in a previous mailing list post? I can't find it.

Edit:

If I say phi_value = lda.inference(doc, collect_sstats=True)[1][i][j], where i is the topic and j is the word - if I iterate over all values and find the topic where phi_value is highest for the jth word, is that the ith topic which the jth word corresponds to?

If my explanation doesn't make sense or I'm missing out on something let me know, I'll start over.

@bhargavvader
Contributor Author

@piskvorky , could you have a look if I'm going the right way?

@piskvorky
Owner

piskvorky commented May 3, 2016

I can maybe get to it at the end of this week; @tmylk should be able to assist you more swiftly.

Note: the VB algorithm assigns the word-topic distribution per word type, meaning every occurrence of a word type in a document gets the same distribution = the same "best topic". Example: in the document "sentence within a sentence", both occurrences of "sentence" will get the same topic.

This is in contrast to Gibbs sampling (Mallet), where each word token is sampled independently = "sentence" can be topic #1 in its first occurrence, but topic #2 in another.

@tmylk
Contributor

tmylk commented May 3, 2016

@bhargavvader You are right. In single-document inference, lda.inference(doc, collect_sstats=True)[1][i][j] is just phi_ij, approximating P(topic_i | word_j) in the document. For an example, see Figure 8 in this paper, where the words of an inferred document are colored according to their topic assignment.

@bhargavvader
Contributor Author

Great. I'll keep all this in mind and open a PR soon.

@bhargavvader
Contributor Author

bhargavvader commented May 31, 2016

@piskvorky , we use VB, right? So a word type in a sentence will get only one topic (or one distribution of likely topics)? For example, if the sentence is "the financial bank is by the river bank", and there are two topics (one to do with financial banks, the other with river banks), the word bank, since it's in a bow format, will be given only one list of likely topics, right?

What I intend to confirm is that with our inference, the word bank can only get one set of topics; it is not possible for each occurrence of bank in our sentence to get its own 'appropriate' topic.

edit: okay, your previous comment pretty much confirms this - but I'll still wait for you to confirm it again.

@piskvorky
Owner

Yes, VB works over word types; it doesn't sample topics for individual word occurrences.

@bhargavvader
Contributor Author

Cool. I think #704 is going in the right direction then.

@graychan

graychan commented Jun 4, 2016

I am a user of Gensim and need to compute {P(t|w), for any t and w where t is a topic and w is a word}. I took a look at the commit for #704 and saw that only topic indices were returned in word_phi.

I wonder if you can also return phi_values that are, based on my shallow understanding, the probabilities.

Thanks a lot.

@bhargavvader
Contributor Author

Hey @graychan , I've been going back and forth on how much information this API should return... In the current version of the PR I decided to return only the topic_ids, because the phi_values are not exactly probabilities - they are also scaled by the feature weight. That said, the phi_values still give a decent idea of the word-topic proportions, so we could definitely also expose this information.

@piskvorky , @tmylk , should we give users another option to get a 'raw' return format, similar to the first version of the PR? This way, users who just want the information to, say, color documents (like illustrated in the tutorial) can get the currently proposed return format, and others can opt for the more verbose version.

Something like - get_document_topics(per_word_topics=True, per_word_phi_values=True)

Or will this be too bloated?

@graychan

graychan commented Jun 4, 2016

Hey @bhargavvader, Thanks a lot for the quick reply.

I did notice from the source code that phi_{dwk} is multiplied by n_{dw}. Forgive my ignorance - could you please help me understand where phi_{dwk} is scaled by feature_weight in the code?

Thanks a million.

@bhargavvader
Contributor Author

I'll need to dig around to find that out myself... But in the meantime, here is a small example of what goes on (assume a model trained on the corpus provided in the notebook of #704):

If the bag of words is ('river', 'water'), the phi value for river returned by inference for topic_0 is 0.94561538; if you add another river, the bag of words ('river', 'water', 'river') returns a value of 1.92400534 - roughly double the previous value, after we doubled the number of occurrences of river. Intuitively this makes sense too, because the sstats change with the feature weight - the more so when the word count is higher for the topic the word is likely contributing to.

That's what I meant by scaled: it takes the feature weight into account... Is that what you meant too, @graychan? As for where the scaling happens in the code, line 459 of ldamodel.py, which you pointed out, seems to be doing it: if you look at line 429, the cts variable stores the word counts. Of course, I may be wrong; @piskvorky or @tmylk can confirm this when they get the time, I think (and also weigh in on the possible change to the output format, to include phi_values if the user needs them).

@graychan

graychan commented Jun 7, 2016

@bhargavvader, thanks a lot for your quick reply.

I did observe that phi values were scaled by word count, both in the Gensim source code and in my experiments. However, when I read the discussions among you, @tmylk, and @piskvorky , I got the impression that word count and feature weight were two different concepts. For instance, @tmylk stated here, "this needs a deeper check -- are the phis directly comparable like this? Aren't they scaled by word count / feature weight?", which led me to believe they were distinct. Of course, another interpretation is that @tmylk meant "word count (i.e., feature weight)". I would very much appreciate it if you or someone else could clarify.

@bhargavvader
Contributor Author

@graychan , yup, word count and feature weight mean the same thing :)

@graychan

graychan commented Jun 8, 2016

@bhargavvader , Thanks for your quick reply.

@tmylk
Contributor

tmylk commented Jun 9, 2016

This is a very important point by @graychan . When we return phi in #704, we have to emphasize clearly in the docstring that it is multiplied by the word count.

@piskvorky
Owner

...or, more generally, by the feature weight, in case someone submits tfidf vectors rather than plain bow word counts to LDA.

@bhargavvader @tmylk have you verified the phi topic sorting to get the "top topic" is correct, in the face of this feature weight scaling?

@piskvorky
Owner

Actually, that shouldn't matter, since all topics will be scaled by the same feature weight, for the particular feature, right?

@bhargavvader
Contributor Author

@tmylk , I'll make sure to add information about this.
@piskvorky , you are right, it doesn't matter - we are comparing among topics, and for a given word they are all scaled by the same factor.
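A minimal sketch of this invariance (illustrative numpy only, not gensim code - the phi vector is made up):

```python
import numpy as np

# Hypothetical per-topic phi for one word in one document.
phi = np.array([0.1, 0.7, 0.2])

# Scaling all topics by the same word count / feature weight
# cannot change which topic has the largest value.
for count in (1.0, 2.0, 5.0):
    scaled = count * phi
    assert int(np.argmax(scaled)) == int(np.argmax(phi))
```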

@graychan

graychan commented Jun 9, 2016

@bhargavvader ,@tmylk, and @piskvorky , thanks a lot for the clarification and quick reply.

Earlier, @bhargavvader pointed out,

Something like - get_document_topics(per_word_topics=True, per_word_phi_values=True)

So, can get_document_topics(...) be revised to return phi_values, then? If the two parameters per_word_topics and per_word_phi_values are too bloated, perhaps the function can just return a list of (word_type, phi_values) and let users retrieve the topic_id or phi_value themselves?

@bhargavvader
Contributor Author

@graychan , I just revised get_document_topics to return both sorted topics as well as phi_values if per_word_topics is true.

Do have a look at the notebook tutorial - I've explained the new return format there. You can also follow the conversation at #704.

@bhargavvader
Contributor Author

@tmylk , @piskvorky , can we close this issue?

@bhargavvader
Contributor Author

I'm closing this issue, as all PRs related to it have been merged. Feel free to re-open if something comes up.
