Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix doctag unicode problem. Fix 1543 #1544

Merged
merged 4 commits into from
Sep 19, 2017
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion gensim/models/doc2vec.py
Original file line number Diff line number Diff line change
Expand Up @@ -847,7 +847,7 @@ def save_word2vec_format(self, fname, doctag_vec=False, word_vec=True, prefix='*
fout.write(utils.to_utf8("%s %s\n" % (total_vec, self.vector_size)))
# store as in input order
for i in range(len(self.docvecs)):
doctag = prefix + str(self.docvecs.index_to_doctag(i))
doctag = u"%s%s" % (prefix, self.docvecs.index_to_doctag(i))
row = self.docvecs.doctag_syn0[i]
if binary:
fout.write(utils.to_utf8(doctag) + b" " + row.tostring())
Expand Down
17 changes: 15 additions & 2 deletions gensim/test/test_doc2vec.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,11 +30,16 @@


class DocsLeeCorpus(object):
def __init__(self, string_tags=False):
def __init__(self, string_tags=False, unicode_tags=False):
self.string_tags = string_tags
self.unicode_tags = unicode_tags

def _tag(self, i):
return i if not self.string_tags else '_*%d' % i
if self.unicode_tags:
return u'_\xa1_%d' % i
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks strange, why is it so different with line 41, why you use a `\xa1_'?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I enforce a title that is not ASCII encodable, i.e. a unicode character '¡'.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gojomo wdyt?

Copy link
Collaborator

@gojomo gojomo Sep 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing with a non-ASCII character seems sensible/necessary to me!

elif self.string_tags:
return '_*%d' % i
return i

def __iter__(self):
with open(datapath('lee_background.cor')) as f:
Expand Down Expand Up @@ -95,6 +100,14 @@ def testPersistenceWord2VecFormat(self):
binary_model_dv = keyedvectors.KeyedVectors.load_word2vec_format(test_word, binary=True)
self.assertEqual(len(model.wv.vocab), len(binary_model_dv.vocab))

def test_unicode_in_doctag(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test with an exception (and check it with assertraise).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test should rather be the inverse of assertraise. The exception should not be thrown with this PR. I will add a inverse test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good idea to add 2 tests: first with unicode_tags=False and assertRaise and second is a current test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are mixing up what really happens, let me explain:
In gensim 2.1 a modified save_word2vec_format for Doc2Vec was added in #1256. The storing handled to export any utf8 symbols doc2vec.py#L853. Unfortunately, the str(..) call in doc2vec.py#L850 fails in Python 2.7 when the doctags contain chars not ASCII encodable.
Currently the develop branch is broken and the added test will fail, i.e. UnicodeEncodeError is thrown for Python 2.7.
The test now checks whether it can successfully store unicode doctags and fails otherwise. I have to generate them (with the code above) since the default string doctags are ASCII encodable (-> addition of the '¡' character).
So, what should the second test actually be then?

"""Test storing document vectors of a model with unicode titles."""
model = doc2vec.Doc2Vec(DocsLeeCorpus(unicode_tags=True), min_count=1)
try:
model.save_word2vec_format(testfile(), doctag_vec=True, word_vec=True, binary=True)
except UnicodeEncodeError:
self.fail('Failed storing unicode title.')

def test_load_mmap(self):
"""Test storing/loading the entire model."""
model = doc2vec.Doc2Vec(sentences, min_count=1)
Expand Down