WARNING : supplied example count did not equal expected count #801
In particular, if you're trying to use a generator, make sure you're passing in the function that returns an iterator, not the single iterator returned from a single call. More helpful background is in the blog post: http://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
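The distinction can be illustrated with a short sketch (the class and names here are illustrative, not gensim API). A generator is exhausted after one pass; an object whose `__iter__` returns a fresh generator on every call can be iterated repeatedly, which is what multi-pass training needs:

```python
# A generator object is a single-use iterator: once exhausted, a second
# pass over it yields nothing.
def sentence_stream():
    for line in ["first sentence", "second sentence"]:
        yield line.split()

gen = sentence_stream()
print(len(list(gen)))  # 2
print(len(list(gen)))  # 0 -- the generator is already exhausted

# An object whose __iter__ returns a fresh generator on every call
# can be iterated as many times as training needs.
class SentenceCorpus:
    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()

corpus = SentenceCorpus(["first sentence", "second sentence"])
print(len(list(corpus)))  # 2
print(len(list(corpus)))  # 2 -- a fresh iterator each time
```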
From this page I gathered that the corpus is iterated twice -- once for …
Yes, it's typical to do multiple training passes over a corpus – unless it is already gigantic. The original word2vec.c tool defaults to 5 training passes over a supplied input file; and the current version of gensim Word2Vec also defaults to 5 training passes over the supplied iterable corpus object. The blog post is still a little unclear, given how atypical a single training pass is in practice. @piskvorky – can the blog post be tightened a bit further? I would suggest (a) changing that 'second+' to something clearer like 'second and subsequent'; (b) deemphasizing the …
I reopened the issue because I believe @gojomo's suggestions make sense.
@gojomo How about now?
Better! But I'd prefer to either eliminate or move-to-bottom the 'advanced users' stuff. The overwhelmingly-common case seems to be (1) less-advanced-user; with (2) small-dataset; and (3) potential confusion about iterators-vs-iterables. In that case, the important thing to emphasize is that the corpus be multiply-iterable, and any other details around that point are just 'attractive hazards'.
I don't think they are overwhelmingly common -- just the most vocal, for obvious reasons.
Looking for a volunteer to add these improvements to https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb
Hey, I'm looking into the code, and I guess the best approach would be to just throw an error whenever sentences is not iterable?
@Doppler010 What specific test would you propose, and could it distinguish between a single-use iterator and something that is re-iterable?
@gojomo - I was thinking something along the lines of …
@Doppler010 - That's an interesting test! I'd prefer not to create a throwaway iterator just as a test, but perhaps this could be combined with the iteration-start that needs to happen anyway, generating a warning when that's not a different-object (and thus the source is likely not a repeatably-iterable object). We'll still want the warning about mismatched-counts, as well – that will also catch places where the user has varied the corpus since …
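The check being discussed could look something like this (a sketch; `looks_single_use` is a hypothetical name, not gensim code). It relies on the iterator protocol: calling `iter()` on a plain iterator returns the object itself, while a re-iterable container or a class with `__iter__` returns a fresh iterator object:

```python
def looks_single_use(corpus):
    # For a plain iterator (e.g. a generator), iter(corpus) is corpus,
    # so a second training pass would find it already exhausted.
    return iter(corpus) is corpus

print(looks_single_use([["a", "b"], ["c"]]))             # False: a list re-iterates
print(looks_single_use(s for s in [["a", "b"], ["c"]]))  # True: a generator does not
```

As @gojomo notes, this kind of check could piggyback on the iteration that training starts anyway, rather than consuming a throwaway iterator up front.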
@gojomo - Can you please point me to the location of the iteration-start in word2vec.py? I'm not able to figure it out.
@Doppler010 word2vec.py sets it up at https://github.com/RaRe-Technologies/gensim/blob/192792688b1e7439cf10076648ff499f557142f9/gensim/models/word2vec.py#L784 though since it uses …
Has this warning been fixed yet? And how?
@lampda - this warning typically means you've done something wrong: not supplying the expected number of texts. So any fix would be in your code; run the checks above to confirm your corpus iterable is correct. If you have other questions, the project discussion list is more appropriate: https://groups.google.com/forum/#!forum/gensim
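As a concrete version of that self-check (a sketch; the function name is illustrative, not part of gensim), you can iterate the corpus twice before training and compare counts. A single-use iterator will come up empty on the second pass:

```python
def check_reiterable(corpus):
    # Count items on two consecutive passes; a repeatably-iterable
    # corpus yields the same count both times.
    first = sum(1 for _ in corpus)
    second = sum(1 for _ in corpus)
    if first != second:
        print("corpus is not repeatably iterable: %d then %d items" % (first, second))
    return first == second

print(check_reiterable([["a"], ["b"]]))             # True
print(check_reiterable(s for s in [["a"], ["b"]]))  # warning, then False
```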
I tried to learn a word2vec embedding with gensim:
With logging switched on, I can see that the training stops after processing 10% of the corpus, and then I get this:
Why does this happen? I found nothing in the documentation that hints at gensim not processing the whole corpus.