avoid encoding errors with unicode content piped through stdio on Windows #997

mharbison72 · 2025-01-10T06:27:38Z

Consider this trivial file (with a trailing LF):

print('This is a unicode character: ≠'.encode("UTF-8"))

This command worked in cmd.exe or an MSYS terminal, and printed ≠ correctly:

$ cat test.py | pyupgrade.exe --py38-plus -

This crashed with an encoding error:

$ cat test.py | pyupgrade.exe --py38-plus - > reformated.py
Traceback (most recent call last):
  File "C:\hgdev\python39-x64\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\hgdev\python39-x64\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "c:\Users\Matt\.local\bin\pyupgrade.exe\__main__.py", line 7, in <module>
  File "C:\Users\Matt\pipx\venvs\pyupgrade\lib\site-packages\pyupgrade\_main.py", line 389, in main
    ret |= _fix_file(filename, args)
  File "C:\Users\Matt\pipx\venvs\pyupgrade\lib\site-packages\pyupgrade\_main.py", line 330, in _fix_file
    print(contents_text, end='')
  File "C:\hgdev\python39-x64\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2260' in position 36: character maps to <undefined>
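
The failing step can be reproduced without pyupgrade or a redirect. This is only a sketch: it assumes, as the traceback suggests, that the redirected stdout falls back to cp1252, and the variable names are illustrative.

```python
# The contents of test.py from above, including the trailing LF.
text = "print('This is a unicode character: \u2260'.encode(\"UTF-8\"))\n"

try:
    # cp1252 has no mapping for U+2260 (the "not equal" sign), so this raises.
    text.encode("cp1252")
except UnicodeEncodeError as exc:
    failure = (exc.start, exc.object[exc.start])
else:
    failure = None

# UTF-8 can represent every character, so the same text encodes cleanly.
utf8_bytes = text.encode("utf-8")

print(failure)
print(utf8_bytes)
```

The reported offset matches the "position 36" in the traceback, which is where the ≠ sits in the file.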

Since bytes are read from stdin.buffer and decoded as UTF-8 when the input file is '-', it makes sense to write UTF-8 bytes to stdout.buffer, and avoid using the default codepage. The use case here is wiring this up to the hg fix extension, which writes content to the tool's stdin and reads it back from its stdout to reformat files. That shouldn't change the encoding.
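
As a rough illustration of the idea (not the actual diff in this PR), the output side could look something like the sketch below. `write_text` is a hypothetical helper name, and falling back to a plain text write when the stream has no `buffer` attribute is an assumption made for the example.

```python
import io
import sys


def write_text(contents_text, stream=None):
    """Hypothetical helper: emit text as UTF-8 bytes when a binary buffer exists."""
    stream = sys.stdout if stream is None else stream
    buffer = getattr(stream, "buffer", None)
    if buffer is None:
        # Streams such as io.StringIO have no underlying binary buffer.
        stream.write(contents_text)
        return
    # Flush pending text first so output order is preserved, then bypass
    # the text layer (and its codepage and newline handling) entirely.
    stream.flush()
    buffer.write(contents_text.encode("utf-8"))
    buffer.flush()


# Simulate a redirected stdout on Windows: cp1252 encoding, CRLF translation.
raw = io.BytesIO()
fake_stdout = io.TextIOWrapper(raw, encoding="cp1252", newline="\r\n")
write_text("x = '\u2260'\n", fake_stdout)
print(raw.getvalue())
```

Because the bytes go straight to the buffer, the configured encoding of the text wrapper never gets a chance to reject the character.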

I made this conditional to play it safe, since the characters showed up in the terminal correctly without the redirect. It also seems to display fine if the bytes are written unconditionally, though.

A workaround using the existing code is to set PYTHONUTF8=1 in the environment, but that's not obvious or always easily done. This change also has the nice side effect of no longer changing LF input to CRLF output. (You'd think that print(..., end='') would avoid that, but end only controls what is appended after the text. The newline translation is apparently baked into the TextIO object that is sys.stdout, and is not something the print function can override.)
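
The LF-to-CRLF effect can be seen in isolation with an in-memory stream. This is a sketch under one assumption: `newline="\r\n"` is passed explicitly to mimic what a text-mode stdout does on Windows, so that the example behaves the same on any platform.

```python
import io

source = "a = 1\nb = 2\n"

# Text path: the wrapper translates every "\n" it is asked to write,
# regardless of the end='' argument given to print().
raw_text = io.BytesIO()
text_out = io.TextIOWrapper(raw_text, encoding="utf-8", newline="\r\n")
print(source, end="", file=text_out)
text_out.flush()
translated = raw_text.getvalue()

# Bytes path: the encoded content is written through unchanged.
raw_bytes = io.BytesIO()
raw_bytes.write(source.encode("utf-8"))
untranslated = raw_bytes.getvalue()

print(translated)
print(untranslated)
```

The text path ends up with CRLF line endings while the bytes path keeps the original LFs, which is the side effect described above.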


asottile · 2025-01-10T13:43:20Z

no thanks. your terminal is misconfigured

mharbison72 force-pushed the main branch from c1644a3 to 18c1e24 · January 10, 2025 06:43

asottile closed this Jan 10, 2025
