
WARNING : supplied example count did not equal expected count #801

Closed

DavidNemeskey opened this issue Jul 23, 2016 · 16 comments
Labels
difficulty easy (Easy issue: required small fix), documentation (Current issue related to documentation)

Comments

DavidNemeskey (Contributor) commented Jul 23, 2016

I tried to learn a word2vec embedding with gensim:

model = gensim.models.Word2Vec(size=300, window=5, min_count=1, workers=4, iter=10, sg=0)
model.build_vocab(sentences)
model.train(sentences)

With logging switched on, I can see that the training stops after processing 10% of the corpus, and then I get this:

2016-07-23 09:26:25,201 : INFO : collected 197546 word types from a corpus of 686363594 raw words and 35463442 sentences
2016-07-23 09:26:25,697 : INFO : min_count=1 retains 197546 unique words (drops 0)
2016-07-23 09:26:25,697 : INFO : min_count leaves 686363594 word corpus (100% of original 686363594)
2016-07-23 09:26:25,962 : INFO : deleting the raw counts dictionary of 197546 items
2016-07-23 09:26:25,966 : INFO : sample=0.001 downsamples 35 most-common words
2016-07-23 09:26:25,967 : INFO : downsampling leaves estimated 437707717 word corpus (63.8% of prior 686363594)
2016-07-23 09:26:25,967 : INFO : estimated required memory for 197546 words and 300 dimensions: 572883400 bytes
2016-07-23 09:26:26,437 : INFO : resetting layer weights
...
...
...
2016-07-23 09:39:42,895 : INFO : PROGRESS: at 9.99% examples, 868278 words/s, in_qsize 8, out_qsize 0
2016-07-23 09:39:43,578 : INFO : worker thread finished; awaiting finish of 3 more threads
2016-07-23 09:39:43,579 : INFO : worker thread finished; awaiting finish of 2 more threads
2016-07-23 09:39:43,584 : INFO : worker thread finished; awaiting finish of 1 more threads
2016-07-23 09:39:43,589 : INFO : worker thread finished; awaiting finish of 0 more threads
2016-07-23 09:39:43,589 : INFO : training on 686363594 raw words (437701650 effective words) took 504.0s, 868387 effective words/s
2016-07-23 09:39:43,589 : WARNING : supplied example count (35463442) did not equal expected count (354634420)

Why does this happen? I found nothing in the documentation that hints at gensim not processing the whole corpus.

gojomo (Collaborator) commented Jul 23, 2016

Your sentences needs to be an iterable object, which can be iterated over multiple times – not merely an iterator that is exhausted after one pass. For example, the following code should print the same count each time:

print(sum(1 for _ in sentences))
print(sum(1 for _ in sentences))
print(sum(1 for _ in sentences))

In particular, if you're trying to use a generator, make sure you're passing in the function that returns an iterator, not the single iterator returned from a single call.

More helpful background is in the blog post: http://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
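To make the iterator-vs-iterable distinction concrete, here is a minimal sketch of a restartable corpus class (the class name and one-sentence-per-line file format are illustrative, not part of gensim): each call to __iter__ opens the file afresh, so the object survives the multiple passes that build_vocab() and train() make, while a bare generator is spent after one pass.

```python
import os
import tempfile

class SentenceCorpus:
    """Restartable corpus: every __iter__ call opens a fresh file handle."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield line.split()

# Demo corpus file with two whitespace-tokenized sentences.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("the quick brown fox\njumps over the lazy dog\n")
    path = f.name

corpus = SentenceCorpus(path)
first = sum(1 for _ in corpus)   # 2 sentences
second = sum(1 for _ in corpus)  # 2 again: the object re-iterates

gen = (line.split() for line in open(path))  # single-use iterator
spent_first = sum(1 for _ in gen)   # 2 sentences
spent_second = sum(1 for _ in gen)  # 0: the generator is exhausted

os.remove(path)
```

Passing an object like SentenceCorpus (or any list) avoids the truncated-training symptom reported above, because the second and later passes see the full corpus rather than an empty iterator.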

DavidNemeskey (Contributor, Author)

From this page I gathered that the corpus is iterated twice -- once for build_vocab() and once for train(). Now I see that "The second+ passes train the neural model." -- so I guess this really must be the problem.

gojomo (Collaborator) commented Aug 7, 2016

Yes, it's typical to do multiple training passes over a corpus – unless it is already gigantic. The original word2vec.c tool defaults to 5 training passes over a supplied input file; and the current version of gensim Word2Vec also defaults to 5 training passes over the supplied iterable corpus object.

The blog post is still a little unclear, given how atypical a single training pass is in practice. @piskvorky – can the blog post be tightened a bit further? I would suggest (a) changing that 'second+' to something clearer like 'second and subsequent'; (b) deemphasize the iter=1 case, perhaps by putting it in a different-color DIV; (c) include a link to your other "Data Streaming in Python: generators, iterators, iterables" post.

DavidNemeskey (Contributor, Author)

I reopened the issue because I believe @gojomo's suggestions make sense.

piskvorky (Owner)

@gojomo How about now?

gojomo (Collaborator) commented Aug 8, 2016

Better! But I'd prefer to either eliminate or move-to-bottom the 'advanced users' stuff.

The overwhelmingly-common case seems to be (1) less-advanced-user; with (2) small-dataset; and (3) potential confusion about iterators-vs-iterables. In that case, the important thing to emphasize is that the corpus be multiply-iterable, and any other details around that point are just 'attractive hazards'.

piskvorky (Owner) commented Aug 9, 2016

I don't think they are overwhelmingly common -- just the most vocal, for obvious reasons.

@tmylk added the documentation and difficulty easy labels on Oct 5, 2016
tmylk (Contributor) commented Oct 5, 2016

Looking for volunteer to add these improvements to https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb

PNR-1 commented Oct 8, 2016

Hey, I'm looking into the code, and I guess the best approach would be to just throw an error whenever sentences is not iterable?

gojomo (Collaborator) commented Oct 8, 2016

@Doppler010 What specific test would you propose, and could it distinguish between a single-use iterator and something that is re-iterable?

PNR-1 commented Oct 9, 2016

@gojomo - I was thinking something along the lines of
'iterator' if obj is iter(obj) else 'iterable'
We can add this try/except check in addition to the documentation changes.
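The obj is iter(obj) idea above can be sketched as a small helper (the function name is hypothetical, not gensim's): iterators such as generators and file handles return themselves from iter(), while re-iterable containers and corpus classes hand back a fresh iterator on every call.

```python
def looks_like_single_use(obj):
    # Iterators (generators, file objects, results of iter(...)) are their
    # own iterators, so iter() returns the very same object. Re-iterable
    # containers produce a new, distinct iterator each time.
    return obj is iter(obj)

print(looks_like_single_use([["a", "sentence"]]))             # False: list is re-iterable
print(looks_like_single_use(s for s in [["a", "sentence"]]))  # True: generator is single-use
```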

gojomo (Collaborator) commented Oct 9, 2016

@Doppler010 - That's an interesting test! I'd prefer not to create a throwaway iterator just as a test, but perhaps this could be combined with the iteration-start that needs to happen anyway, generating a warning when the result is not a different object (and thus the source is likely not a repeatably-iterable object). We'll still want the warning about mismatched counts as well – that will also catch cases where the user has changed the corpus since build_vocab(), or otherwise not provided an accurate expected size (which is necessary for proper alpha decay scheduling and accurate progress logging).
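The mismatched-count warning in the log above reduces to a short sketch (function and parameter names are illustrative, not gensim's internals): the expected example count is corpus_count * epochs, so a single-use iterator that only survives one pass falls short by a factor of epochs - exactly the 35463442 vs 354634420 in the original report.

```python
import warnings

def check_example_count(trained_examples, corpus_count, epochs):
    # Post-training check: compare what the workers actually consumed
    # against the expected corpus_count * epochs examples.
    expected = corpus_count * epochs
    if trained_examples != expected:
        warnings.warn("supplied example count (%d) did not equal expected "
                      "count (%d)" % (trained_examples, expected))

# Numbers from the log: 35463442 sentences with iter=10, but the exhausted
# iterator yielded only one pass worth of examples -> warning fires.
check_example_count(35463442, corpus_count=35463442, epochs=10)
```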

PNR-1 commented Oct 9, 2016

@gojomo - Can you please point me to the location of the iteration-start in word2vec.py . I'm not able to figure it out.

gojomo (Collaborator) commented Oct 9, 2016

pamdla commented Jul 5, 2020

Has this warning been fixed yet? And how?

gojomo (Collaborator) commented Jul 6, 2020

@pamdla - this warning typically means you've done something wrong: not supplying the expected number of texts. So any fix would be in your code; run the checks above to verify that your corpus iterable is correct. If you have other questions, the project discussion list is more appropriate: https://groups.google.com/forum/#!forum/gensim
