
Warning in word2vec when batch_words > MAX_WORDS_IN_BATCH #2861

Closed

dsandeep0138 wants to merge 6 commits
Conversation

dsandeep0138 (Contributor)

Also added a test case to verify that the warning is raised when batch_words is greater than 10000.

Fixes #2801
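
For context while reading the thread, here is a minimal sketch of the kind of guard being proposed. It is illustrative only, not the actual diff in this PR; the helper name check_batch_words is hypothetical, while batch_words and MAX_WORDS_IN_BATCH are the real gensim names under discussion.

```python
# Illustrative sketch only (not the exact patch in this PR): warn when the
# user-supplied batch_words exceeds the compiled-in Cython batch limit.
# check_batch_words is a hypothetical helper; MAX_WORDS_IN_BATCH is the real
# module-level constant in gensim.models.word2vec.
import logging

from gensim.models.word2vec import MAX_WORDS_IN_BATCH

logger = logging.getLogger(__name__)


def check_batch_words(batch_words):
    if batch_words > MAX_WORDS_IN_BATCH:
        logger.warning(
            "batch_words=%d is greater than MAX_WORDS_IN_BATCH=%d; "
            "each training job will still be capped at %d words",
            batch_words, MAX_WORDS_IN_BATCH, MAX_WORDS_IN_BATCH,
        )
    return batch_words
```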

dsandeep97 and others added 4 commits June 17, 2020 19:25
@piskvorky (Owner) left a comment

Looks good – left minor language change suggestions in the review.

Review threads on gensim/models/word2vec.py and gensim/test/test_word2vec.py (outdated, resolved)
@piskvorky (Owner) left a comment

Minor code suggestion, otherwise LGTM. @mpenkov WDYT?

Review thread on gensim/models/word2vec.py (outdated, resolved)
@piskvorky piskvorky added this to the 4.0.0 milestone Jun 18, 2020
@dsandeep0138 (Contributor, Author)

@piskvorky Thanks for the review. I have made the changes based on your feedback.

@mpenkov The keras-related tests are failing. Can you please help with these? Thanks.

@gojomo (Collaborator) commented Jun 19, 2020

If the hope is for the #2698 cleanup to be in 4.0 (and that's my hope!), these smaller tweaks to related classes should either (a) be held until after that's merged, or (b) be based off the massive changes there. Otherwise, they make its integration/progress harder.

The related FastText & Doc2Vec classes would have the same issue, so any fix that only touches word2vec* files is incomplete.

From my view, this doesn't fix the deeper problems with the Cython limit:

  • submitted texts are silently truncated to the 10000-word limit without warning, which users may not realize
  • modern users with far more memory (and larger texts) still face this limit, and can only change it via the onerous, expert-level approach of editing the Cython and recompiling, or by pre-splitting their input texts

A proper & better fix would be to essentially eliminate this limit entirely. One way would be to auto-split the users' input texts – somewhat rough in its handling of the users' data, but less so than silently truncating it! Another would be to use the alloca() C function in the Cython code, which could stack-allocate whatever is needed for the user's batch without recourse to a compiled-in constant. (Alas, alloca() isn't technically part of the POSIX standard, but I believe in practice it's available everywhere we care about. Alternatively, variable-length arrays might do the trick – they are part of ISO C99 – but are apparently not supported by Cython through 3.0a5.)

Perhaps the best-of-both-worlds would be to do both: use alloca() to handle any-sized text the user chooses to send, but also include an optional, configurable 'split-if-over-this-size' cap, to help prevent them from shooting themselves in the foot with giant stack-overflowing texts. I ran experiments improving this in 2015, viewable in the old batching branch (develop...gojomo:batching), which could be adapted forward if alloca() is considered OK to use.
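
(As a rough, purely illustrative sketch of the auto-split option above – not the batching branch and not gensim API; the helper name and default cap are invented for the example:)

```python
# Illustration of the auto-split idea only: break overly long texts into
# chunks no larger than the Cython batch limit before training, so that no
# words are silently dropped. split_long_texts is a hypothetical helper.
def split_long_texts(corpus, max_words=10000):
    for text in corpus:  # each text is a list of word tokens
        if len(text) <= max_words:
            yield text
        else:
            # emit consecutive max_words-sized chunks of the long text
            for start in range(0, len(text), max_words):
                yield text[start:start + max_words]
```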

Given that the default for batch_words is already set to match the Cython limit – so only those rare users who choose to deviate from the default could run into any confusion – and that the best fix would be removing this parameter/complication entirely, I'd rather not even add such warnings-for-rare-cases. Better to just prioritize the better fix, or remove the ability to configure Python-side batch_words at all. (I think the only reason for setting a smaller value might be to simulate actual job-batching in tiny unit tests, which is also of minuscule importance.)

@piskvorky (Owner) commented Jun 19, 2020

We'd very much like to prioritize the better fix (your PR). We cannot stall all development until that happens though – especially since its ETA is unknown and the PR is not open for review yet ("early WIP").

Would it be possible to split up your PR (@mpenkov's idea)? Get some parts of the changes in now, the rest later – proceed more incrementally?

The changes will be essential I believe. So basically anything to speed up the process and avoid a behemoth PR that blocks all other development.

@gojomo (Collaborator) commented Jun 19, 2020

There's no current fix for this in #2698, nor is one planned.

It's more that this is an incomplete fix in the wrong direction, for a tiny theoretical issue that there's no evidence anyone has ever hit. The bug it's linked to is just, "I think a larger value should not be accepted". But this patch doesn't actually reject larger values, it just issues a warning – which, from experience, many users simply ignore. And it only does so for one of the 3 classes where there could be a concern.

So in my opinion all the discussion/effort so far around this bug/PR has been wasted on misguided minutiae. (Similarly: #2537/#2551 - a bunch of go-round about a non-issue.)

There is a closely-related problem - that texts are silently truncated in the Cython code – but that deserves a totally different better fix that would further obsolete this make-work.

But even if, theoretically, this change were a good 'housekeeping' fix – a slight new usability comment/warning – I believe it would still make sense for it to either (1) wait for later application, as such tiny changes should be lower-priority and wouldn't drive a big-number release anyway; or (2) build on the #2698 work in progress, meaning less conflict-resolution work for other authors or myself. No one need be 'blocked'; they just shouldn't be putting a fresh coat of paint onto walls already marked to be knocked out.

I don't want my work-in-progress to be holding anything else up, but so far I've not heard of any priority features/fixes that need to be blocked that way. I'll post more on the status of #2698 there.

@mpenkov (Collaborator) commented Jun 25, 2020

Finally found the time to catch up on this.

I agree with @gojomo that this PR is far from a perfect solution and it may make merging #2698 more difficult. At the same time, I think the changes here are an improvement over the existing code, and are relatively innocuous. I think it is best to merge this PR once the keras weirdness gets sorted out.

@piskvorky WDYT?

@gojomo (Collaborator) commented Jun 25, 2020

I disagree very strongly with the 'improvement' and 'innocuous' characterizations: this adds lines of code for a merely theoretical issue, while creating extra merge work for others (mostly me), when the real fix for the related underlying limitation will be something completely different that will cause this code to be thrown away (yet more work for whoever tackles that, which is likely me).

@piskvorky (Owner) commented Jun 25, 2020

Yeah the warning seems useless. My review was superficial, I didn't realize this would help approximately no one.

My (incorrect) reading was that the user would see a warning when texts are truncated (good). But that's not the case with this PR – longer texts are still truncated silently. The warning in this PR is only shown if the user explicitly overrides an obscure parameter (vanishingly rare).

IMO the pros and cons are both inconsequential, with the cons slightly outweighing the pros. The cost of this debate already dwarfs the delta.

@mpenkov (Collaborator) commented Jun 25, 2020

OK, so in that case, the most appropriate action is to close this PR.

@dsandeep0138 Thank you for your contribution, but as you can see, the situation around the problem is quite difficult, and we cannot accept this PR. I hope this does not discourage you from future contributions to gensim.

@mpenkov closed this Jun 25, 2020
@dsandeep0138 (Contributor, Author)

Thank you @mpenkov for the encouraging words! Will surely try and contribute!

Linked issue: Warning when batch_words > MAX_WORDS_IN_BATCH in word2vec

5 participants