
API call for Topic Distribution of words. #683

Closed
bhargavvader opened this issue Apr 25, 2016 · 26 comments
Labels
difficulty medium Medium issue: required good gensim understanding & python skills wishlist Feature request

Comments

@bhargavvader
Contributor

bhargavvader commented Apr 25, 2016

As discussed here and here in the mailing list, there is currently no way to find the topic index for a word - i.e. the topic a word belongs to - basically, the distribution over all topics for a single word.

As @piskvorky mentioned in the second link, the ideal way now is:

"With variational inference LDA (such as the implementation in gensim), you can get the per-word topic distribution from phi, one of the variational parameters. When you call lda.inference() with collect_sstats=True, it will return a 2-tuple of (gamma, sufficient statistics). In your single-word case, these sufficient statistics directly correspond to (already normalized) phi, so you can read the word topics directly off of that."

This also discusses the same issue, and describes the problem and what is needed in simpler words.

I'll open a PR for this soon, and would like suggestions: should it be a method in the ldamodel class which just takes a word as a parameter, uses the VB internal parameters, and gives back the topic and probability, or is there a different approach?

@piskvorky
Owner

piskvorky commented Apr 25, 2016

I haven't given this API enhancement deep thought, but I think a good way may be to add a new, optional parameter: get_document_topics(per_word_topics=True).

  • The default would be per_word_topics=False, so nothing changes unless the user specifically asks for those per-word topic assignments;
  • but if per_word_topics=True, the method returns a 2-tuple: (per-topic-probability-for-this-document, per-word-type-best-topic-for-this-document) (instead of just the normal per-topic-probability-for-this-document output).

The second link in your description seems unrelated -- it asks for static topic x vocab assignments, independent of input document. This issue asks for dynamic topic x document words assignment, dependent on input document: the same word can be assigned to different topics for different documents, depending on other words in the document -- the context matters.

@tmylk tmylk added wishlist Feature request difficulty easy Easy issue: required small fix labels Apr 25, 2016
@bhargavvader
Contributor Author

I'm still a little confused, because the initial question says:

" want to assign a topic index to each word in the corpus."

Can that not mean the vocabulary? Or does it mean every word of document in the corpus? If it is certainly the latter and it is like you said the dynamic topic x document words assignment, then your suggestion does seem like the way to go.

@piskvorky
Owner

If it's topic x vocabulary, then we don't need anything special. We already have the various print_topics etc methods.

What we need here are the topic x word assignments (assigning a topic to each word in a concrete document).

@piskvorky piskvorky added difficulty medium Medium issue: required good gensim understanding & python skills and removed difficulty easy Easy issue: required small fix labels Apr 28, 2016
@bhargavvader
Contributor Author

bhargavvader commented Apr 28, 2016

I've tried looking around the ldamodel class but I seem to be stuck... How exactly will I get the value? Should I use the gamma value returned by inference for that word and go from there?

For example, if my LDA model has 80 topics and my doc is a bow, what exactly will, say, lda.inference(doc, collect_sstats=True)[1][67] return? It obviously returns details about the 68th topic, but I'm not sure exactly what...

Have you mentioned the logic in a previous mailing list post? I can't find it.

Edit:

If I say phi_value = lda.inference(doc, collect_sstats=True)[1][i][j], where i is the topic and j is the word - if I iterate over all values and find the topic where phi_value is highest for the jth word, is that the ith topic which the jth word corresponds to?

If my explanation doesn't make sense or I'm missing out on something let me know, I'll start over.

@bhargavvader
Contributor Author

@piskvorky , could you have a look if I'm going the right way?

@piskvorky
Owner

piskvorky commented May 3, 2016

I can maybe get to it at the end of this week; @tmylk should be able to assist you more swiftly.

Note: the VB algorithm assigns the word-topic distribution per word type, meaning every occurrence of a word type in a document gets the same distribution = the same "best topic". Example: in the document "sentence within a sentence", both occurrences of "sentence" will get the same topic.

This is in contrast to Gibbs sampling (Mallet), where each word token is sampled independently = "sentence" can be topic #1 in its first occurrence, but topic #2 in another.

@tmylk
Contributor

tmylk commented May 3, 2016

@bhargavvader You are right. In single-document inference, lda.inference(doc, collect_sstats=True)[1][i][j] is just phi_ij, approximating P(topic_i | word_j) in the document. For an example, see Figure 8 in this paper, where the words of an inferred document are colored according to their topic assignment.

@bhargavvader
Contributor Author

Great. I'll keep all this in mind and open a PR soon.

@bhargavvader
Contributor Author

bhargavvader commented May 31, 2016

@piskvorky , we use VB, right? So a word type in a sentence will get only one topic (or one distribution of likely topics)? For example, if the sentence is "the financial bank is by the river bank", and there are two topics (one to do with financial banks, the other with river banks), the word bank, since it's in a bow format, will be given only one list of likely topics, right?

What I intend to confirm is that with our inference, the word bank can only get one set of topics; it is not possible for each occurrence of bank in our sentence to get its own 'appropriate' topic.

edit: okay, your previous comment pretty much confirms this - but I'll still wait for you to confirm it again.

@piskvorky
Owner

Yes, VB works over word types; it doesn't sample topics for individual word occurrences.

@bhargavvader
Contributor Author

Cool. I think #704 is going in the right direction then.

@graychan

graychan commented Jun 4, 2016

I am a user of Gensim and need to compute {P(t|w), for any t and w where t is a topic and w is a word}. I took a look at the commit for #704 and saw that only topic indices were returned in word_phi.

I wonder if you can also return phi_values that are, based on my shallow understanding, the probabilities.

Thanks a lot.

@bhargavvader
Contributor Author

Hey @graychan , I've been going back and forth on how much information this API should return... In the current version of the PR I decided to return only the topic_ids, because the phi_values are not exactly probabilities - they are also scaled by the feature weight. That said, the phi_values still give a decent idea of the word-topic proportions, so we could definitely also expose this information.

@piskvorky , @tmylk , should we give users another option to get a 'raw' return format, similar to the first version of the PR? This way, users who just want the information to, say, color documents (like illustrated in the tutorial) can get the currently proposed return format, and others can opt for the more verbose version.

Something like - get_document_topics(per_word_topics=True, per_word_phi_values=True)

Or will this be too bloated?

@graychan

graychan commented Jun 4, 2016

Hey @bhargavvader, Thanks a lot for the quick reply.

I did notice from the source code that phi_{dwk} is multiplied by n_{dw}. Forgive my ignorance - could you please help me understand where phi_{dwk} is scaled by feature_weight in the code?

Thanks a million.

@bhargavvader
Contributor Author

I'll need to dig around to find that out myself... But in the meantime, here is a small example of what goes on (assume a model trained on the corpus provided in the notebook of #704):

If the bag of words is ('river', 'water'), the phi value for river returned by inference for topic_0 is 0.94561538; if you add another river, the bag of words ('river', 'water', 'river') returns a value of 1.92400534 - roughly double the previous value, after we doubled the number of occurrences of river. Intuitively this makes sense too, because the sstats change with the feature weight - the more so when the word count is higher for the topic the word is likely contributing to.

That's what I meant by scaled: it takes the feature weight into account... Is that what you meant too, @graychan? As for where the scaling happens in the code, line 459 of ldamodel.py, which you pointed out, seems to be doing it: if you look at line 429, the cts variable stores the word counts. Of course, I may be wrong; @piskvorky or @tmylk can confirm this when they get the time, I think (and also weigh in on the possible change to the output format, to include phi_values if the user needs them).

@graychan

graychan commented Jun 7, 2016

@bhargavvader, thanks a lot for your quick reply.

I did observe that phi values were scaled by word count, both in the Gensim source code and in my experiments. However, when I read the discussions among you, @tmylk, and @piskvorky , I got the impression that word count and feature weight were two different concepts. For instance, @tmylk stated here, "this needs a deeper check -- are the phis directly comparable like this? Aren't they scaled by word count / feature weight?", which led me to believe they were distinct. Of course, another interpretation is that @tmylk meant "word count (i.e., feature weight)". I would very much appreciate it if you or someone else could clarify.

@bhargavvader
Contributor Author

@graychan , yup, word count and feature weight mean the same thing :)

@graychan

graychan commented Jun 8, 2016

@bhargavvader , Thanks for your quick reply.

@tmylk
Contributor

tmylk commented Jun 9, 2016

This is a very important point by @graychan . When we return phi in #704, we have to emphasize clearly in the docstring that it is multiplied by the word count.

@piskvorky
Owner

...or, more generally, by the feature weight, in case someone submits tfidf vectors rather than plain bow word counts to LDA.

@bhargavvader @tmylk have you verified the phi topic sorting to get the "top topic" is correct, in the face of this feature weight scaling?

@piskvorky
Owner

Actually, that shouldn't matter, since all topics will be scaled by the same feature weight, for the particular feature, right?

@bhargavvader
Contributor Author

@tmylk , I'll make sure to add information about this.
@piskvorky , you are right, it doesn't matter - we are comparing among topics, and for a given word they are all scaled by the same factor.
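A minimal sketch of this invariance (illustrative numpy only, not gensim code - the phi vector is made up):

```python
import numpy as np

# Hypothetical per-topic phi for one word in one document.
phi = np.array([0.1, 0.7, 0.2])

# Scaling all topics by the same word count / feature weight
# cannot change which topic has the largest value.
for count in (1.0, 2.0, 5.0):
    scaled = count * phi
    assert int(np.argmax(scaled)) == int(np.argmax(phi))
```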

@graychan

graychan commented Jun 9, 2016

@bhargavvader ,@tmylk, and @piskvorky , thanks a lot for the clarification and quick reply.

Earlier, @bhargavvader pointed out,

Something like - get_document_topics(per_word_topics=True, per_word_phi_values=True)

So, can get_document_topics(...) be revised to return phi_values, then? If the two parameters per_word_topics and per_word_phi_values are too bloated, perhaps the function can just return a list of (word_type, phi_values) and let users retrieve the topic_id or phi_value themselves?

@bhargavvader
Contributor Author

@graychan , I just revised get_document_topics to return both sorted topics as well as phi_values if per_word_topics is true.

Do have a look at the notebook tutorial - I've explained the new return format there. You can also follow the conversation at #704.

@bhargavvader
Contributor Author

@tmylk , @piskvorky , can we close this issue?

@bhargavvader
Contributor Author

I'm closing this issue, as all PRs related to it have been merged. Feel free to re-open if something comes up.
