Won't read data from UTF-8 model created by C version of word2vec #44

gerryhocks · 2017-07-10T14:33:26Z

Hello,

As it stands, the code cannot read a UTF-8 vocab from a word2vec binary model created using the C version of word2vec.

This is because the vocab's bytes are appended to a string buffer one at a time, as if each byte were a complete character. A multi-byte UTF-8 sequence is therefore split into several wrong characters.

A workaround/hack like this in the fromBinFile() method of Word2VecModel.java gets around the issue and should still work for single-byte (ASCII) characters, which UTF-8 encodes unchanged:

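            // Needs: import java.nio.charset.StandardCharsets;
            // Assumes no vocab entry is longer than 1024 bytes.
            byte[] buff = new byte[1024];
            for (int lineno = 0; lineno < vocabSize; lineno++) {
                // read vocab: collect the raw bytes up to the separating space
                int bpos = 0;
                byte b = buffer.get();
                while (b != ' ') {
                    if (b != '\n') {
                        buff[bpos++] = b;
                    }
                    b = buffer.get();
                }
                // decode the whole word at once so multi-byte sequences stay intact
                vocabs.add(new String(buff, 0, bpos, StandardCharsets.UTF_8));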
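
To make the difference concrete, here is a small self-contained sketch that reads the same bytes both ways. The class name `VocabDecodeSketch` and both method names are made up for illustration, and `readWordBytewise` is only my assumption of what the byte-as-character behaviour amounts to (a plain `(char)` cast); it is not copied from the library. The sketch also swaps the fixed 1024-byte array for a `ByteArrayOutputStream` so that long words cannot overflow.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class VocabDecodeSketch {

    // Hypothetical stand-in for the problematic behaviour: every byte is
    // cast to a char and appended, so multi-byte sequences are broken up.
    static String readWordBytewise(ByteBuffer buffer) {
        StringBuilder sb = new StringBuilder();
        byte b = buffer.get();
        while (b != ' ') {
            if (b != '\n') {
                sb.append((char) b);
            }
            b = buffer.get();
        }
        return sb.toString();
    }

    // Same loop, but the raw bytes are collected first and decoded as UTF-8
    // in one step once the separating space has been reached.
    static String readWordUtf8(ByteBuffer buffer) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        byte b = buffer.get();
        while (b != ' ') {
            if (b != '\n') {
                bytes.write(b);
            }
            b = buffer.get();
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "für" followed by the space that separates a word from its vector.
        byte[] raw = "f\u00fcr ".getBytes(StandardCharsets.UTF_8);

        String bytewise = readWordBytewise(ByteBuffer.wrap(raw));
        String utf8 = readWordUtf8(ByteBuffer.wrap(raw));

        System.out.println("bytewise length: " + bytewise.length());
        System.out.println("utf8 length:     " + utf8.length());
        System.out.println("utf8 equals original word: " + utf8.equals("f\u00fcr"));
    }
}
```

The letter "ü" takes two bytes in UTF-8, so the byte-wise reader returns four chars for a three-letter word, while the UTF-8 reader returns the original word. A pure ASCII word comes out identical from both readers, which is why the problem only shows up with non-ASCII vocabularies.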
