-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix doctag unicode problem. Fix 1543 #1544
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -30,11 +30,16 @@ | |
|
||
|
||
class DocsLeeCorpus(object): | ||
def __init__(self, string_tags=False): | ||
def __init__(self, string_tags=False, unicode_tags=False): | ||
self.string_tags = string_tags | ||
self.unicode_tags = unicode_tags | ||
|
||
def _tag(self, i): | ||
return i if not self.string_tags else '_*%d' % i | ||
if self.unicode_tags: | ||
return u'_\xa1_%d' % i | ||
elif self.string_tags: | ||
return '_*%d' % i | ||
return i | ||
|
||
def __iter__(self): | ||
with open(datapath('lee_background.cor')) as f: | ||
|
@@ -95,6 +100,14 @@ def testPersistenceWord2VecFormat(self): | |
binary_model_dv = keyedvectors.KeyedVectors.load_word2vec_format(test_word, binary=True) | ||
self.assertEqual(len(model.wv.vocab), len(binary_model_dv.vocab)) | ||
|
||
def test_unicode_in_doctag(self): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please add a test with an exception (and check it with assertraise). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The test should rather be the inverse of assertraise. The exception should not be thrown with this PR. I will add a inverse test. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's a good idea to add 2 tests: first with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think you are mixing up what really happens, let me explain: |
||
"""Test storing document vectors of a model with unicode titles.""" | ||
model = doc2vec.Doc2Vec(DocsLeeCorpus(unicode_tags=True), min_count=1) | ||
try: | ||
model.save_word2vec_format(testfile(), doctag_vec=True, word_vec=True, binary=True) | ||
except UnicodeEncodeError: | ||
self.fail('Failed storing unicode title.') | ||
|
||
def test_load_mmap(self): | ||
"""Test storing/loading the entire model.""" | ||
model = doc2vec.Doc2Vec(sentences, min_count=1) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks strange, why is it so different with line 41, why you use a `\xa1_'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I enforce a title that is not ASCII encodable, i.e. a unicode character '¡'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gojomo wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing with a non-ASCII character seems sensible/necessary to me!