-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix doctag unicode problem. Fix 1543 #1544
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -30,11 +30,16 @@ | |
|
||
|
||
class DocsLeeCorpus(object): | ||
def __init__(self, string_tags=False): | ||
def __init__(self, string_tags=False, unicode_tags=False): | ||
self.string_tags = string_tags | ||
self.unicode_tags = unicode_tags | ||
|
||
def _tag(self, i): | ||
return i if not self.string_tags else '_*%d' % i | ||
if self.unicode_tags: | ||
return u'_\xa1_%d' % i | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looks strange, why is it so different with line 41, why you use a `\xa1_'? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I enforce a title that is not ASCII encodable, i.e. a unicode character '¡'. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @gojomo wdyt? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Testing with a non-ASCII character seems sensible/necessary to me! |
||
elif self.string_tags: | ||
return '_*%d' % i | ||
return i | ||
|
||
def __iter__(self): | ||
with open(datapath('lee_background.cor')) as f: | ||
|
@@ -95,6 +100,13 @@ def testPersistenceWord2VecFormat(self): | |
binary_model_dv = keyedvectors.KeyedVectors.load_word2vec_format(test_word, binary=True) | ||
self.assertEqual(len(model.wv.vocab), len(binary_model_dv.vocab)) | ||
|
||
def test_unicode_in_doctag(self): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please add a test with an exception (and check it with assertraise). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The test should rather be the inverse of assertraise. The exception should not be thrown with this PR. I will add a inverse test. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's a good idea to add 2 tests: first with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think you are mixing up what really happens, let me explain: |
||
"""Test storing document vectors of a model with unicode titles.""" | ||
model = doc2vec.Doc2Vec(DocsLeeCorpus(unicode_tags=True), min_count=1) | ||
model.save_word2vec_format(testfile(), doctag_vec=True, word_vec=True, binary=True) | ||
binary_model_dv = keyedvectors.KeyedVectors.load_word2vec_format(testfile(), binary=True) | ||
self.assertEqual(len(model.wv.vocab) + len(model.docvecs), len(binary_model_dv.vocab)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How is this assert related with your PR / with reason for test? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It is true that lines 107/108 are not related to the PR. I can remove them. |
||
|
||
def test_load_mmap(self): | ||
"""Test storing/loading the entire model.""" | ||
model = doc2vec.Doc2Vec(sentences, min_count=1) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should pick one type (byte string or unicode) and stick with it.
Supporting both types at the same time like this (
doctag
will be silently upcast to unicode in this solution, if the argument is unicode) looks very brittle.I'm not as familiar with this codebase as @gojomo is, but converting the user input to one fixed type (either utf8 bytestring or unicode, but consistently) is preferable. At this point, during model save, the type ought to be fixed, no silent guessing needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the hidden cast to unicode is not optimal. The problem is that the method
index_to_doctag
can return a string as well as an int. The fix is similar to another place in the class here.I am open for other suggestions though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, so the expected types are either unicode or int. A
u"%s" % index_to_doctag
should work either way, right? No change needed (beside dropping thatstr
). IIRC that int was an optimization hack (uses less memory).My suggestion is to keep the tags explicitly as "unicode or int", or "bytestring or int". And then during string formatting, rely on the fact it's unicode-or-int (or byte-or-int), possibly even with an assert to drive this contract home.
@gojomo What are the invariants here, how to do this cleanly? That unicode-or-str-or-int is really confusing, and apparently not encapsulated well (fixes in one place leave bugs in other places).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the type of
tags
should be as permissive as possible – conceivably even anything that works as a dict key would work everywhere inDoc2Vec
except in this new bit of convenience for shoehorning doc-vecs into the (limited, somewhat-'legacy') word2vec.c format. For people who want to use this format, they should be sure to use tags that can write as a nice string-token – I'd think it'd be OK here if this code is tolerant of any reasonable such choice. (They may also need to make sure, themselves, that their tags don't include internal spaces if using the 'text' word2vec.c format.)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@piskvorky I changed it to
u"%s%s" % ..
so that the cast is less hidden.@gojomo Your last point is especially important, yeah.