API call for Topic Distribution of words. #683

As discussed here and here in the mailing list, there is no current way to find the topic index for a word - i.e. the topic a word belongs to - basically the distribution over all topics for a single word. As @piskvorky mentioned in the second link, the ideal way now is:

"With variational inference LDA (such as the implementation in gensim), you can get the per-word topic distribution from `phi`, one of the variational parameters. When you call `lda.inference()` with `collect_sstats=True`, it will return a 2-tuple of (gamma, sufficient statistics). In your single-word case, these sufficient statistics directly correspond to (already normalized) phi, so you can read the word topics directly off of that."

This also talks about the same, and describes the problem/what is needed in simpler words.

I'll open a PR soon for this, and would like to take suggestions - would it be a method in the `ldamodel` class, which just takes a word as a parameter, uses the VB internal parameters, and gives back the topic and probability, or is there a different approach?
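To make the quoted suggestion concrete, here is a minimal sketch of reading a word's topic distribution off the sufficient statistics. The toy corpus and the names `texts`, `dct`, and `lda` are illustrative assumptions, not from the thread:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Tiny illustrative corpus (assumed for this sketch only).
texts = [["bank", "river", "water"], ["bank", "money", "loan"]]
dct = Dictionary(texts)
corpus = [dct.doc2bow(t) for t in texts]
lda = LdaModel(corpus, id2word=dct, num_topics=2, passes=10, random_state=0)

# A single-word "document": per the quote above, the sufficient statistics
# returned by inference(collect_sstats=True) correspond to phi here.
bow = dct.doc2bow(["bank"])
gamma, sstats = lda.inference([bow], collect_sstats=True)
word_id = dct.token2id["bank"]
phi = sstats[:, word_id]
phi = phi / phi.sum()          # normalize defensively into a distribution
print(list(enumerate(phi)))    # [(topic_id, P(topic|"bank")), ...]
```

Later sketches in this thread reuse `lda` and `dct` from this block.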
I haven't given this API enhancement a deep thought, but I think a good way may be to add a new, optional parameter: […]
The second link in your description seems unrelated -- it asks for the static […]
I'm still a little confused, because the initial question says: "I want to assign a topic index to each word in the corpus." Can that not mean the vocabulary? Or does it mean every word of every document in the corpus? If it is certainly the latter, and it is, like you said, the dynamic […]
If it's topic x vocabulary, then we don't need anything special. We already have the various […] methods for that. What we need here are the topic x word assignments (assigning a topic to each word in a concrete document).
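For the static topic x vocabulary side that the model already covers, a quick sketch (reusing `lda` from the first block; these are existing gensim accessors, shown here purely as an illustration):

```python
# Topic x vocabulary is a property of the trained model itself:
print(lda.show_topic(0, topn=5))   # [(word, P(word|topic 0)), ...]
print(lda.get_topics().shape)      # full (num_topics, vocab_size) matrix
```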
I've tried looking around the […]. For e.g., if my […] Have you mentioned the logic in a previous mailing list post? I can't find it. Edit: If I say […] If my explanation doesn't make sense or I'm missing out on something, let me know and I'll start over.
@piskvorky, could you have a look and see if I'm going the right way?
I can maybe get to it at the end of this week; @tmylk should be able to assist you more swiftly.

Note: the VB algo assigns the word-topic distribution per word type, meaning each word type in a document gets the same distribution = will get the same "best topic". Example: in document […].

This is in contrast to Gibbs sampling (Mallet), where each word is sampled independently = each occurrence can get its own topic.
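A small sketch of this "per word type" behaviour, reusing `lda` and `dct` from the first block (an illustration under those assumptions, not code from the thread):

```python
# "bank" occurs twice, but VB inference yields a single phi column for
# the word *type*, so both occurrences share one topic distribution.
bow = dct.doc2bow(["bank", "river", "bank"])
gamma, sstats = lda.inference([bow], collect_sstats=True)
w = dct.token2id["bank"]
phi_bank = sstats[:, w] / sstats[:, w].sum()
print(phi_bank)   # one distribution; no per-token assignment is possible
```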
@bhargavvader You are right. In a single document inference […]
Great. I'll keep all this in mind and open a PR soon.
@piskvorky, we use VB, right? So a word type in a sentence will get only one topic (or one distribution of likely topics)? For e.g., if the sentence is […], what I intend to confirm is that with our sampling, it is only possible for the word bank to get one bunch of topics, and not possible for each of the banks in our sentence to get an 'appropriate' topic.

Edit: okay, your previous comment pretty much confirms this - but I'll still wait for you to confirm it again.
Yes, VB works over word types; it doesn't sample topics for individual words.
Cool. I think #704 is going in the right direction then.
I am a user of Gensim and need to compute {P(t|w) for any t and w, where t is a topic and w is a word}. I took a look at the commit for #704 and saw that only topic indices were returned in `word_phi`. I wonder if you can also return `phi_values`, which are, based on my shallow understanding, the probabilities. Thanks a lot.
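One way to get P(t|w) for every topic and word from the trained model alone, as a hedged sketch via Bayes' rule (reusing `lda` and `dct` from the first block; the uniform topic prior is an assumption of this sketch, not something the thread states):

```python
# get_topics() returns the (num_topics, vocab_size) matrix of P(w|t);
# with a uniform prior over topics, P(t|w) is its column, renormalized.
topics = lda.get_topics()
w = dct.token2id["bank"]
p_t_given_w = topics[:, w] / topics[:, w].sum()
print(list(enumerate(p_t_given_w)))   # [(topic_id, P(t|"bank")), ...]
```

Note this is the static, corpus-level quantity; the per-document `phi_values` discussed below carry the document-specific version.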
Hey @graychan, I've been going back and forth with the idea of the amount of information I should return for this API... The current version of the PR […]. I decided to return only the […].

@piskvorky, @tmylk, should we give another option to users where they can get a 'raw' return format, similar to the first version of the PR? This way, users who just want the information to, say, color documents (like illustrated in the tutorial) can get the current proposed return format, and others can opt for the more verbose version. Something like - […]

Or will this be too bloated?
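For reference, a sketch of the per-word return format that was eventually merged (reusing `lda` and `dct`; exact values depend on training, so treat the comments as illustrative):

```python
bow = dct.doc2bow(["bank", "river"])
doc_topics, word_topics, phi_values = lda.get_document_topics(
    bow, per_word_topics=True)
print(doc_topics)    # [(topic_id, P(topic|doc)), ...]
print(word_topics)   # [(word_id, [most likely topic ids]), ...]
print(phi_values)    # [(word_id, [(topic_id, phi)]), ...] -- the verbose part
```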
Hey @bhargavvader, thanks a lot for the quick reply. I did notice from the source code that phi_{dwk} is multiplied by n_{dw}. Forgive my ignorance, but could you please help me understand where phi_{dwk} is scaled by feature_weight in the code? Thanks a million.
I'll need to dig around to find it out myself... But in the meantime, here is a small example to maybe explain what goes on (assume a model is trained based on the corpus provided in the notebook of #704): […]

That's what I meant by scaled, that it takes into account the feature weight... Is that what you meant too, @graychan? As for where it is being scaled in the code, line […].
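A hedged illustration of the scaling, reusing `lda` and `dct` from the first block (the toy corpus is an assumption; the point is the relative magnitudes, not the exact numbers):

```python
# The phi values returned per word are scaled by the word's count n_dw:
# with one occurrence they sum to ~1, with two occurrences to ~2.
_, _, phis_once = lda.get_document_topics(
    dct.doc2bow(["bank"]), per_word_topics=True)
_, _, phis_twice = lda.get_document_topics(
    dct.doc2bow(["bank", "bank"]), per_word_topics=True)
print(phis_once)    # phi mass for "bank" sums to roughly 1
print(phis_twice)   # phi mass for "bank" sums to roughly 2
```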
@bhargavvader, thanks a lot for your quick reply. I did observe that phi values were scaled by word count in the Gensim source code and in my experiments. However, when I read the discussions among you, @tmylk, and @piskvorky, I got the impression that word count and feature weight were two different concepts. For instance, I saw @tmylk state here, "this needs a deeper check -- are the phis directly comparable like this? Aren't they scaled by word count / feature weight?", which led me to believe that word count and feature weight were two different concepts. Of course, another interpretation is that @tmylk meant "word count (i.e., feature weight)". I would very much like you or someone else to clarify it.
@graychan, yup, word count and feature weight mean the same thing :)
@bhargavvader, thanks for your quick reply.
...or, more generally, by the feature weight, in case someone submits tfidf vectors rather than plain bow word counts to LDA. @bhargavvader @tmylk, have you verified the […]?
Actually, that shouldn't matter, since all topics will be scaled by the same feature weight for the particular feature, right?
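A tiny numeric sketch of why a shared per-feature scale cancels out (plain numpy; the numbers are illustrative only):

```python
import numpy as np

phi_col = np.array([0.2, 0.6, 0.2])   # per-topic phi mass for one feature
scaled = 3.0 * phi_col                # same feature weight applied to every topic
print(scaled / scaled.sum())          # [0.2, 0.6, 0.2] -- the common scale cancels
```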
@tmylk, I'll make sure to add information about this.
@bhargavvader, @tmylk, and @piskvorky, thanks a lot for the clarification and quick reply. Earlier, @bhargavvader pointed out: […]

So, can `get_document_topics(...)` be revised to return `phi_values` then? If two parameters […]
@tmylk, @piskvorky, can we close this issue?
Am closing this issue as all PRs to do with this are merged. Feel free to re-open if something comes up.