-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Warning in word2vec when batch_words > MAX_WORDS_IN_BATCH #2861
Conversation
Also add a test case to verify that the warning is thrown
Also add a test case to verify that the warning is thrown
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good – left minor language change suggestions in the review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor code suggestion, otherwise looks LGTM. @mpenkov WDYT?
@piskvorky Thanks for the review. I have made the changes based on your feedback. @mpenkov keras related tests are failing. Can you please help with these? Thanks. |
if the hope is for #2698 cleanup to be in 4.0 (and that's my hope!), these smaller tweaks to related classes should either (a) be held until after that's merged; or (b) be based off the massive changes there. Otherwise, they make its integration/progress harder. The related FastText & Doc2Vec classes would have the same issue, so any fix that only touches word2vec* files is incomplete. From my view, this doesn't fix the deeper problems with the Cython limit:
A proper & better fix would be to essentially eliminate this limit entirely. One way would be to auto-split the users' input texts - somewhat rough in its handling of the users' data, but less so than silently truncating it! Another would be to use the Perhaps best-of-both-worlds would be to do both, use Given that the default for |
We'd very much prioritize a better fix (your PR). We cannot stall all development until that happens though – especially since its ETA is unknown and the PR is not open for review yet ("early WIP"). Would it be possible to split up your PR (@mpenkov 's idea)? Get some parts of the changes in now, rest later, proceed more incrementally? The changes will be essential I believe. So basically anything to speed up the process and avoid a behemoth PR that blocks all other development. |
There's no current fix for this in #2698, nor is one planned. It's more that this is an incomplete fix in the wrong direction, for a tiny theoretical issue that there's not even any evidence anyone has ever hit. The bug it's linked-to is just, "I think a larger value should not be accepted". But this patch doesn't actually even reject larger values, it just issues a warning – which from experience many users simply ignore. And it only does so for one of the 3 classes where there could be a concern. So in my opinion all the discussion/effort so far around this bug/PR has been wasted on misguided minutiae. (Similarly: #2537/#2551 - a bunch of go-round about a non-issue.) There is a closely-related problem - that texts are silently truncated in the Cython code – but that deserves a totally different better fix that would further obsolete this make-work. But even if theoretically this change was a good 'housekeeping' fix, a slight new usability comment/warning. I believe it would still makes sense for that to either (1) wait for later application, as such tiny changes should be lower-priority and wouldn't drive a big number release, anyway; or (2) build on the #2698 in progress, meaning less conflict-resolution work by other authors or myself. No one need be 'blocked', they just shouldn't be putting a fresh coat of paint onto walls already marked to be knocked-out. I don't want my work-in-progress to be holding anything else up, but so far I've not heard of any priority features/fixes that need be blocked that way. I'll post more on the status of #2698 there. |
Finally found the time to catch up this. I agree with @gojomo that this PR is far from a perfect solution and it may make merging #2698 more difficult. At the same time, I think the changes here are an improvement over the existing code, and are relatively innocuous. I think it is best to merge this PR once the keras weirdness gets sorted out. @piskvorky WDYT? |
Disagree with 'improvement' and 'innocuous' characterizations very strongly: it's adding lines-of-code for a merely-theoretical issue, while creating extra merge work for others (mostly me), when the real fix for the related underlying limitation will be something completely different which will cause this code to be thrown away (which will be yet more work for whoever tackles that, which is likely me). |
Yeah the warning seems useless. My review was superficial, I didn't realize this would help approximately no one. My (incorrect) reading was that the user will see a warning when texts are truncated (good). But that's not the case with this PR – longer texts are still truncated silently. The warning in this PR is only shown if the user explicitly overrides an obscure parameter (vanishingly rare). IMO pros and cons are both inconsequential, with cons slightly outweighing pros. The cost of this debate already dwarves the delta. |
OK, so in that case, the most appropriate action is to close this PR. @dsandeep0138 Thank you for your contribution, but as you can see, the situation around the problem is quite difficult, and we cannot accept this PR. I hope this does not discourage you from future contributions to gensim. |
Thank you @mpenkov for the encouraging words! Will surely try and contribute! |
Also added a test case to verify if the warning is thrown when batch_words is more than 10000.
Fixes #2801