
Warning in word2vec when batch_words > MAX_WORDS_IN_BATCH #2861

Closed

dsandeep0138 wants to merge 6 commits
Conversation

dsandeep0138 (Contributor)

Also added a test case to verify that the warning is raised when batch_words is greater than 10000.

Fixes #2801
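
For context while reading the thread, here is a minimal sketch of the kind of guard being proposed. It is illustrative only, not the actual diff in this PR; the helper name check_batch_words is hypothetical, while batch_words and MAX_WORDS_IN_BATCH are the real gensim names under discussion.

```python
# Illustrative sketch only (not the exact patch in this PR): warn when the
# user-supplied batch_words exceeds the compiled-in Cython batch limit.
# check_batch_words is a hypothetical helper; MAX_WORDS_IN_BATCH is the real
# module-level constant in gensim.models.word2vec.
import logging

from gensim.models.word2vec import MAX_WORDS_IN_BATCH

logger = logging.getLogger(__name__)


def check_batch_words(batch_words):
    if batch_words > MAX_WORDS_IN_BATCH:
        logger.warning(
            "batch_words=%d is greater than MAX_WORDS_IN_BATCH=%d; "
            "each training job will still be capped at %d words",
            batch_words, MAX_WORDS_IN_BATCH, MAX_WORDS_IN_BATCH,
        )
    return batch_words
```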

dsandeep97 and others added 4 commits June 17, 2020 19:25
@piskvorky (Owner) left a comment

Looks good – left minor language change suggestions in the review.

Review threads on gensim/models/word2vec.py and gensim/test/test_word2vec.py (outdated, resolved)
@piskvorky (Owner) left a comment

Minor code suggestion, otherwise LGTM. @mpenkov WDYT?

Review thread on gensim/models/word2vec.py (outdated, resolved)
@piskvorky piskvorky added this to the 4.0.0 milestone Jun 18, 2020
@dsandeep0138 (Contributor, Author)

@piskvorky Thanks for the review. I have made the changes based on your feedback.

@mpenkov The keras-related tests are failing. Can you please help with these? Thanks.

@gojomo (Collaborator) commented Jun 19, 2020

If the hope is for the #2698 cleanup to be in 4.0 (and that's my hope!), these smaller tweaks to related classes should either (a) be held until after that's merged, or (b) be based off the massive changes there. Otherwise, they make its integration/progress harder.

The related FastText & Doc2Vec classes would have the same issue, so any fix that only touches word2vec* files is incomplete.

From my view, this doesn't fix the deeper problems with the Cython limit:

  • submitted texts are silently truncated to the 10000-word limit without warning, which users may not realize
  • modern users with far more memory (and larger texts) still face this limit, and can only change it via the onerous, expert-level approach of editing the Cython and recompiling, or by pre-splitting their input texts

A proper & better fix would be to essentially eliminate this limit entirely. One way would be to auto-split the users' input texts – somewhat rough in its handling of the users' data, but less so than silently truncating it! Another would be to use the alloca() C function in the Cython code, which could stack-allocate whatever is needed for the user's batch without recourse to a compiled-in constant. (Alas, alloca() isn't technically part of the POSIX standard, but I believe in practice it's available everywhere we care about. Alternatively, variable-length arrays might do the trick – they are part of ISO C99 – but are apparently not supported by Cython through 3.0a5.)

Perhaps the best-of-both-worlds would be to do both: use alloca() to handle any-sized text the user chooses to send, but also include an optional, configurable 'split-if-over-this-size' cap, to help prevent them from shooting themselves in the foot with giant stack-overflowing texts. I ran experiments improving this in 2015, viewable in the old batching branch (develop...gojomo:batching), which could be adapted forward if alloca() is considered OK to use.
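
(As a rough, purely illustrative sketch of the auto-split option above – not the batching branch and not gensim API; the helper name and default cap are invented for the example:)

```python
# Illustration of the auto-split idea only: break overly long texts into
# chunks no larger than the Cython batch limit before training, so that no
# words are silently dropped. split_long_texts is a hypothetical helper.
def split_long_texts(corpus, max_words=10000):
    for text in corpus:  # each text is a list of word tokens
        if len(text) <= max_words:
            yield text
        else:
            # emit consecutive max_words-sized chunks of the long text
            for start in range(0, len(text), max_words):
                yield text[start:start + max_words]
```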

Given that the default for batch_words is already set to match the Cython limit – so only those rare users who choose to deviate from the default could run into any confusion – and that the best fix would be removing this parameter/complication entirely, I'd rather not even add such warnings-for-rare-cases. Better to just prioritize the better fix, or remove the ability to configure Python-side batch_words at all. (I think the only reason for setting a smaller value might be to simulate actual job-batching in tiny unit tests, which is also of minuscule importance.)

@piskvorky (Owner) commented Jun 19, 2020

We'd very much like to prioritize the better fix (your PR). We cannot stall all development until that happens though – especially since its ETA is unknown and the PR is not open for review yet ("early WIP").

Would it be possible to split up your PR (@mpenkov's idea)? Get some parts of the changes in now, the rest later – proceed more incrementally?

The changes will be essential I believe. So basically anything to speed up the process and avoid a behemoth PR that blocks all other development.

@gojomo (Collaborator) commented Jun 19, 2020

There's no current fix for this in #2698, nor is one planned.

It's more that this is an incomplete fix in the wrong direction, for a tiny theoretical issue that there's no evidence anyone has ever hit. The bug it's linked to is just, "I think a larger value should not be accepted". But this patch doesn't actually reject larger values, it just issues a warning – which, from experience, many users simply ignore. And it only does so for one of the 3 classes where there could be a concern.

So in my opinion all the discussion/effort so far around this bug/PR has been wasted on misguided minutiae. (Similarly: #2537/#2551 - a bunch of go-round about a non-issue.)

There is a closely-related problem - that texts are silently truncated in the Cython code – but that deserves a totally different better fix that would further obsolete this make-work.

But even if, theoretically, this change were a good 'housekeeping' fix – a slight new usability comment/warning – I believe it would still make sense for it to either (1) wait for later application, as such tiny changes should be lower-priority and wouldn't drive a big-number release anyway; or (2) build on the #2698 work in progress, meaning less conflict-resolution work for other authors or myself. No one need be 'blocked'; they just shouldn't be putting a fresh coat of paint onto walls already marked to be knocked out.

I don't want my work-in-progress to be holding anything else up, but so far I've not heard of any priority features/fixes that need to be blocked that way. I'll post more on the status of #2698 there.

@mpenkov (Collaborator) commented Jun 25, 2020

Finally found the time to catch up on this.

I agree with @gojomo that this PR is far from a perfect solution and it may make merging #2698 more difficult. At the same time, I think the changes here are an improvement over the existing code, and are relatively innocuous. I think it is best to merge this PR once the keras weirdness gets sorted out.

@piskvorky WDYT?

@gojomo (Collaborator) commented Jun 25, 2020

I disagree very strongly with the 'improvement' and 'innocuous' characterizations: this adds lines of code for a merely theoretical issue, while creating extra merge work for others (mostly me), when the real fix for the related underlying limitation will be something completely different that will cause this code to be thrown away (yet more work for whoever tackles that, which is likely me).

@piskvorky (Owner) commented Jun 25, 2020

Yeah the warning seems useless. My review was superficial, I didn't realize this would help approximately no one.

My (incorrect) reading was that the user would see a warning when texts are truncated (good). But that's not the case with this PR – longer texts are still truncated silently. The warning in this PR is only shown if the user explicitly overrides an obscure parameter (vanishingly rare).

IMO the pros and cons are both inconsequential, with the cons slightly outweighing the pros. The cost of this debate already dwarfs the delta.

@mpenkov (Collaborator) commented Jun 25, 2020

OK, so in that case, the most appropriate action is to close this PR.

@dsandeep0138 Thank you for your contribution, but as you can see, the situation around the problem is quite difficult, and we cannot accept this PR. I hope this does not discourage you from future contributions to gensim.

@mpenkov closed this Jun 25, 2020
@dsandeep0138 (Contributor, Author)

Thank you @mpenkov for the encouraging words! Will surely try and contribute!

Linked issue: Warning when batch_words > MAX_WORDS_IN_BATCH in word2vec

5 participants