avoid encoding errors with unicode content piped through stdio on Windows #997

Closed
wants to merge 1 commit into from

mharbison72

Consider this trivial file (with a trailing LF):

print('This is a unicode character: ≠'.encode("UTF-8"))

This command worked in cmd.exe or an MSYS terminal, and printed ≠ correctly:

$ cat test.py | pyupgrade.exe --py38-plus -

This crashed with an encoding error:

$ cat test.py | pyupgrade.exe --py38-plus - > reformated.py
Traceback (most recent call last):
  File "C:\hgdev\python39-x64\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\hgdev\python39-x64\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "c:\Users\Matt\.local\bin\pyupgrade.exe\__main__.py", line 7, in <module>
  File "C:\Users\Matt\pipx\venvs\pyupgrade\lib\site-packages\pyupgrade\_main.py", line 389, in main
    ret |= _fix_file(filename, args)
  File "C:\Users\Matt\pipx\venvs\pyupgrade\lib\site-packages\pyupgrade\_main.py", line 330, in _fix_file
    print(contents_text, end='')
  File "C:\hgdev\python39-x64\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2260' in position 36: character maps to <undefined>

Since bytes are read from stdin.buffer and decoded as UTF-8 when the input file is '-', it makes sense to write UTF-8 bytes to stdout.buffer, and avoid using the default codepage. The use case here is wiring this up to the hg fix extension, which writes content to the tool's stdin and reads it back from its stdout to reformat files. That shouldn't change the encoding.
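To make the reasoning above concrete, here is a minimal sketch of the idea, not the actual diff in this PR and not pyupgrade's API. The helper names `write_utf8` and `codepage_write_fails` are hypothetical. The first function bypasses the text layer and writes UTF-8 bytes directly; the second reproduces the failure mode from the traceback by pushing the same text through a text stream that uses the cp1252 codepage.

```python
import io
import sys


def write_utf8(text, out=None):
    """Write text as UTF-8 bytes to a binary stream.

    Hypothetical helper for illustration. Defaults to sys.stdout.buffer,
    so the encoding of the sys.stdout text layer is never consulted.
    """
    out = sys.stdout.buffer if out is None else out
    out.write(text.encode("UTF-8"))
    out.flush()


def codepage_write_fails(text, codepage="cp1252"):
    """Return True if a text stream using `codepage` cannot encode `text`."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=codepage)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError:
        # Same error class as in the traceback above.
        return True
    return False
```

With these definitions, `codepage_write_fails('≠')` reports the failure that the redirect triggers, while `write_utf8('≠')` emits the three bytes `E2 89 A0` regardless of the active codepage.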

I made the change conditional to play it safe, since the characters already displayed correctly in the terminal without the redirect. That said, the output also seems to display fine when it is written as bytes unconditionally.
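A rough sketch of what such a conditional write could look like follows. The exact condition used in the commit is not shown here, so the `isatty()` check is an assumption, and `emit` is a hypothetical name rather than anything in pyupgrade.

```python
import io
import sys


def emit(text, stream=None):
    """Print to a terminal as text; write UTF-8 bytes when redirected.

    Assumption: "redirected" is detected with stream.isatty().
    """
    stream = sys.stdout if stream is None else stream
    if stream.isatty():
        # Interactive console: keep the existing text-mode behaviour.
        print(text, end="", file=stream)
    else:
        # Pipe or file: flush pending text, then bypass the text layer.
        stream.flush()
        stream.buffer.write(text.encode("UTF-8"))
        stream.buffer.flush()
```

The trade-off is that terminal output keeps going through the console's own encoding, while piped output is always UTF-8 with the original line endings.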

A workaround with the existing code is to set PYTHONUTF8=1 in the environment, but that is neither obvious nor always easy to do. This change also has the nice side effect of no longer converting LF input to CRLF output. (You'd think that print(..., end='') would avoid the EOL conversion, but end='' only suppresses the trailing newline that print appends. Translating each '\n' in the text is apparently done by the TextIO object that is sys.stdout, and is not something the print function can override.)
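The newline behaviour can be reproduced on any platform with a small sketch. It assumes that Windows' text-mode stdout behaves like an `io.TextIOWrapper` created with `newline="\r\n"`; the helper name `render` is hypothetical.

```python
import io


def render(text, newline):
    """Return the bytes produced by print(text, end='') on a text stream.

    `newline` is passed straight to io.TextIOWrapper: "\r\n" mimics the
    translation seen on Windows, "" disables translation entirely.
    """
    raw = io.BytesIO()
    wrapper = io.TextIOWrapper(raw, encoding="utf-8", newline=newline)
    # end="" stops print from appending a newline, but any "\n" already
    # in `text` is still translated by the wrapper on write.
    print(text, end="", file=wrapper)
    wrapper.flush()
    return raw.getvalue()
```

Calling `render("a\nb\n", "\r\n")` yields CRLF line endings even though `end=""` was passed, whereas `render("a\nb\n", "")` leaves the LFs alone. For scripts that own their stdout, `sys.stdout.reconfigure(encoding="utf-8", newline="")` (Python 3.7+) is another way to get the same effect without touching the environment.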

@asottile
Owner

no thanks. your terminal is misconfigured

@asottile asottile closed this Jan 10, 2025