avoid encoding errors with unicode content piped through stdio on Windows

Consider this trivial file (with a trailing LF):

    print('This is a unicode character: ≠'.encode("UTF-8"))

This command worked in cmd.exe or an MSYS terminal, and printed ≠ correctly:

    $ cat test.py | pyupgrade.exe --py38-plus -

This crashed with an encoding error:

    $ cat test.py | pyupgrade.exe --py38-plus - > reformated.py
    Traceback (most recent call last):
      File "C:\hgdev\python39-x64\lib\runpy.py", line 197, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "C:\hgdev\python39-x64\lib\runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "c:\Users\Matt\.local\bin\pyupgrade.exe\__main__.py", line 7, in <module>
      File "C:\Users\Matt\pipx\venvs\pyupgrade\lib\site-packages\pyupgrade\_main.py", line 389, in main
        ret |= _fix_file(filename, args)
      File "C:\Users\Matt\pipx\venvs\pyupgrade\lib\site-packages\pyupgrade\_main.py", line 330, in _fix_file
        print(contents_text, end='')
      File "C:\hgdev\python39-x64\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u2260' in position 36: character maps to <undefined>

Since bytes are read from `stdin.buffer` and decoded as UTF-8 when the input
file is '-', it makes sense to write UTF-8 bytes to `stdout.buffer`, and avoid
using the default codepage.  The use case here is wiring this up to the `hg fix`
extension, which writes content to the tool's stdin and reads it back from its
stdout to reformat files.  That round trip shouldn't change the file's encoding.
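
As a minimal sketch of that round trip (not the actual pyupgrade code; `filter_stream` and `transform` are hypothetical names used only for illustration), a filter can decode UTF-8 bytes from a binary input stream and write UTF-8 bytes back to a binary output stream.  In-memory `io.BytesIO` objects stand in for `sys.stdin.buffer` and `sys.stdout.buffer` here:

```python
import io


def filter_stream(src, dst, transform=lambda text: text):
    """Read UTF-8 bytes from src, transform the text, write UTF-8 bytes to dst.

    src and dst are binary streams.  In a real tool they would be
    sys.stdin.buffer and sys.stdout.buffer, which bypass the console
    code page and the newline translation of the text-mode streams.
    """
    text = src.read().decode('UTF-8')
    dst.write(transform(text).encode('UTF-8'))


# Stand-ins for the piped stdin and the redirected stdout.
src = io.BytesIO("print('This is a unicode character: ≠')\n".encode('UTF-8'))
dst = io.BytesIO()

# The default transform is the identity, so the bytes come back unchanged.
filter_stream(src, dst)
```

Because nothing between the two binary streams depends on the platform's default encoding, the same bytes come out on every system when the transform leaves the text alone.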

A workaround using the existing code is to set `PYTHONUTF8=1` in the environment,
but that's not obvious, and not always easy to do.  This change also has the nice
side effect of no longer changing LF input to CRLF output.  (You'd think that
`print(..., end='')` would avoid that, but `end` only controls what is appended
after the text.  The LF to CRLF translation is done by the text-mode `sys.stdout`
object itself, and is not something the print function can override.)
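
Both effects can be reproduced on any platform with a hedged stand-in for a redirected Windows `sys.stdout`.  The `cp1252` encoding and `newline='\r\n'` setting below are assumptions that mimic the environment in the traceback above; the real values depend on the system.  The variable names are illustrative only:

```python
import io

text = "print('≠')\n"

# Stand-in for a redirected text-mode stdout on Windows (assumed settings).
raw = io.BytesIO()
stdout_like = io.TextIOWrapper(raw, encoding='cp1252', newline='\r\n')

# cp1252 has no mapping for U+2260, so the text-mode write fails.
try:
    print(text, end='', file=stdout_like)
    stdout_like.flush()
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# ASCII-only text encodes fine, but the wrapper rewrites the LF,
# even though print() was told not to append anything.
raw_ascii = io.BytesIO()
ascii_stdout = io.TextIOWrapper(raw_ascii, encoding='cp1252', newline='\r\n')
print('x = 1\n', end='', file=ascii_stdout)
ascii_stdout.flush()
crlf_output = raw_ascii.getvalue()

# Writing encoded bytes to the binary layer keeps the character and the LF.
raw_bytes = io.BytesIO()
raw_bytes.write(text.encode())
```

Here `encode_failed` ends up true and `crlf_output` ends in CRLF, while `raw_bytes` holds the original text as UTF-8 with its LF untouched.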
mharbison72 committed Jan 10, 2025
1 parent 1a2a5ae commit 18c1e24
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion pyupgrade/_main.py
@@ -327,7 +327,7 @@ def _fix_file(filename: str, args: argparse.Namespace) -> int:
     contents_text = _fix_tokens(contents_text)
 
     if filename == '-':
-        print(contents_text, end='')
+        sys.stdout.buffer.write(contents_text.encode())
     elif contents_text != contents_text_orig:
         print(f'Rewriting {filename}', file=sys.stderr)
         with open(filename, 'w', encoding='UTF-8', newline='') as f:
