Encoding issue running on javasphinx + py3 + windows 10 #63

Poddster · 2018-02-27T15:34:41Z

Hi,

Related to issue #56 and issue #37

On Windows 10, javasphinx-apidoc fails when run on Python 3.6.4 but works when run on Python 2.7. Both installs use javasphinx==0.9.15.

It looks like the script, or possibly the Python stdlib, expects the files it reads to be encoded in cp1252, but the files are actually UTF-8. This fails on any byte that isn't a valid cp1252 character.

For example, when reading the character 🐍 (U+1F40D, encoded in UTF-8 as b'\xF0\x9F\x90\x8D'), the script throws an exception: it treats the sequence as 4 separate single-byte characters, and byte 0x90 is not a cp1252 character.
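The failure can be reproduced outside javasphinx with just the four bytes quoted above. This is a minimal sketch using only built-in Python 3 calls; the variable names are mine, not anything from javasphinx. The reported position differs from the traceback (2 here, 24 there) because the traceback's offset is relative to the start of the file being read.

```python
# The UTF-8 encoding of U+1F40D, as quoted in this report.
snake_bytes = b"\xF0\x9F\x90\x8D"

# Decoding as UTF-8 yields the single intended character.
as_utf8 = snake_bytes.decode("utf-8")

# Decoding as cp1252 fails: Python's cp1252 table has no mapping for 0x90.
try:
    snake_bytes.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

# ascii() keeps the output printable even on a cp1252 console.
print(ascii(as_utf8))
print(cp1252_error)
```

On Python 3.6, open() in text mode without an encoding argument falls back to locale.getpreferredencoding(False), which is commonly cp1252 on Western-locale Windows installs. That would explain why f.read() ends up in cp1252.py.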

The stack trace shown is:

  File "C:\dev\env\python\Python36\Scripts\javasphinx-apidoc-script.py", line 11, in <module>
    load_entry_point('javasphinx==0.9.15', 'console_scripts', 'javasphinx-apidoc')()
  File "c:\dev\env\python\python36\lib\site-packages\javasphinx\apidoc.py", line 347, in main
    opts.member_headers, opts.parser_lib)
  File "c:\dev\env\python\python36\lib\site-packages\javasphinx\apidoc.py", line 228, in generate_documents
    this_file_documents = generate_from_source_file(doc_compiler, source_file, cache_dir)
  File "c:\dev\env\python\python36\lib\site-packages\javasphinx\apidoc.py", line 191, in generate_from_source_file
    source = f.read()
  File "c:\dev\env\python\python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 24: character maps to <undefined>

Whilst it works in py2, I feel this is purely by accident, due to Python 2's very "liberal" string decoding policies and the fact that it's a UTF-8 file. If my file were encoded in something weird, e.g. EUCJIS/SJIS, then the tool would fail. The official javadoc tool has an encoding option.

It would be good if javasphinx-apidoc could take an --encoding parameter and ensure that all files are read/decoded in that format.
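A rough sketch of what I mean, not a patch against javasphinx: `read_source` and `build_parser` are hypothetical names, the `--encoding` flag does not exist in javasphinx 0.9.15, and I have not checked how apidoc.py actually parses its options. The only point is that the file is opened with an explicit encoding instead of the platform default.

```python
import argparse
import io


def read_source(path, encoding="utf-8"):
    """Read a source file using an explicit encoding (hypothetical helper)."""
    # io.open accepts an encoding argument on both Python 2.7 and Python 3.
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


def build_parser():
    """Build a tiny stand-in option parser with the proposed flag."""
    parser = argparse.ArgumentParser(prog="apidoc-sketch")
    parser.add_argument(
        "--encoding",
        default="utf-8",
        help="encoding used to decode the Java source files",
    )
    parser.add_argument("input_path")
    return parser


# Parse an explicit argument list so the sketch does not depend on sys.argv.
opts = build_parser().parse_args(["--encoding", "utf-8", "java/"])
print(opts.encoding, opts.input_path)
```

Defaulting to UTF-8 is my assumption. Whatever the default is, the value would need to be passed down to wherever the source files are opened.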

Full Example

This was done using PowerShell ISE to "ensure" that the Unicode characters were printed correctly, but the same error occurs in cmd.exe, Git Bash, etc.

PS C:\dev\work\Mobile-SDK-Android\docs> Get-Content .\java\utf8.java -Encoding UTF8
package java;

/**
 * 🐍 U+1F40D -> \xF0\x9F\x90\x8D
 * 👐 U+1F450 -> \xF0\x9F\x91\x90
 */
public class EncodingProblems {
    public static void main(String[] args) {
        System.out.println("Hello!");
    }
}

PS C:\dev\work\Mobile-SDK-Android\docs> C:\dev\env\python\Python36\Scripts\javasphinx-apidoc.exe --output-dir=tmp/ java/
C:\dev\env\python\Python36\Scripts\javasphinx-apidoc.exe : Traceback (most recent call last):
At line:1 char:1
+ C:\dev\env\python\Python36\Scripts\javasphinx-apidoc.exe --output-dir ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
 
  File "C:\dev\env\python\Python36\Scripts\javasphinx-apidoc-script.py", line 11, in <module>
    load_entry_point('javasphinx==0.9.15', 'console_scripts', 'javasphinx-apidoc')()
  File "c:\dev\env\python\python36\lib\site-packages\javasphinx\apidoc.py", line 347, in main
    opts.member_headers, opts.parser_lib)
  File "c:\dev\env\python\python36\lib\site-packages\javasphinx\apidoc.py", line 228, in generate_documents
    this_file_documents = generate_from_source_file(doc_compiler, source_file, cache_dir)
  File "c:\dev\env\python\python36\lib\site-packages\javasphinx\apidoc.py", line 191, in generate_from_source_file
    source = f.read()
  File "c:\dev\env\python\python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 24: character maps to <undefined>

PS C:\dev\work\Mobile-SDK-Android\docs> C:\dev\env\python\Python27\Scripts\javasphinx-apidoc.exe --output-dir=tmp/ java/


FoxNeo · 2018-04-05T09:35:05Z

I have the same problem. My workaround: create a shell script that re-encodes all the .rst files from cp1252 to UTF-8.

The HTML is then generated from files encoded in UTF-8.
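For anyone who wants the same workaround without a shell script, here is a minimal Python sketch of the re-encoding step. The function name `reencode_rst` and its parameters are made up for illustration, and it assumes every .rst file under the given directory really is cp1252.

```python
import io
import os


def reencode_rst(root, src_encoding="cp1252", dst_encoding="utf-8"):
    """Rewrite every .rst file under root from src_encoding to dst_encoding."""
    converted = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".rst"):
                continue
            path = os.path.join(dirpath, name)
            # newline="" keeps the original line endings untouched.
            with io.open(path, "r", encoding=src_encoding, newline="") as f:
                text = f.read()
            with io.open(path, "w", encoding=dst_encoding, newline="") as f:
                f.write(text)
            converted.append(path)
    return converted
```

Call it once on the generated output directory, e.g. reencode_rst("tmp"), before building the HTML. Running it a second time on files that are already UTF-8 would garble any non-ASCII characters, so treat it as a one-shot step.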
