Adopt more UTF-8 #4309

Merged
merged 24 commits into from
Apr 22, 2024

abravalheri
Contributor

Summary of changes

Previously, changes regarding UTF-8 avoided touching complicated parts of the code where it could cause incompatibility.

This PR is a bit more aggressive, but still tries to maintain backwards compatibility:

  • When reading files, first try to read them as UTF-8, then fall back to the locale encoding.
  • When writing files, use UTF-8.
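As a rough illustration of the two rules above (a minimal sketch, not the code in this PR): the helper names `read_utf8_with_fallback` and `write_utf8` are hypothetical, and treating `locale.getpreferredencoding(False)` as "the locale encoding" is an assumption.

```python
import locale
from pathlib import Path


def read_utf8_with_fallback(path, fallback_encoding=None):
    """Read text as UTF-8 first; retry with a fallback encoding on failure.

    Hypothetical helper for illustration only.
    """
    if fallback_encoding is None:
        # Assumption: "locale encoding" means the user's preferred encoding.
        fallback_encoding = locale.getpreferredencoding(False)
    try:
        return Path(path).read_text(encoding="utf-8")
    except UnicodeDecodeError:
        # The bytes are not valid UTF-8, so try the legacy encoding instead.
        return Path(path).read_text(encoding=fallback_encoding)


def write_utf8(path, text):
    """Always write text as UTF-8, regardless of the locale."""
    Path(path).write_text(text, encoding="utf-8")
```

With this shape, a file written by an older tool in a legacy encoding can still be read, while anything written afterwards is UTF-8, so the fallback path should be hit less and less over time.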

In my opinion, the highest risk here is in the easy_install/install_scripts commands, because some scripts might not be meant to be UTF-8... But hopefully that risk is minimal [1].

Closes

Pull Request Checklist

Footnotes

  1. Once file contents have been decoded into Python strings, they are Unicode text that can be encoded as UTF-8, so it should be OK to write them directly to files. Moreover, most popular scripting languages nowadays support UTF-8.

@abravalheri abravalheri marked this pull request as ready for review April 21, 2024 11:26
@abravalheri
Contributor Author

@jaraco, please let me know if you are OK with a more aggressive move towards UTF-8.
(In this PR I am still trying to avoid breaking changes by adding some fallbacks, but one never knows all the edge cases.)

Member

@jaraco jaraco left a comment

I'm enthusiastically in support. I'd also like to introduce a way to wean users off any reliance on the 'locale' encoding, though not necessarily in this change.

Comment on lines 3334 to 3336

    except UnicodeDecodeError:  # pragma: no cover
        with open(file, "r", encoding=fallback_encoding) as f:
            return f.read()

Member

Maybe emit a warning here so the fallback can be removed?

Contributor Author

Will we be able to remove this?
The question comes to mind mainly because of sdists that have been produced by old versions of setuptools and published to PyPI...

Contributor Author

Warning implemented in 3fbaa4c
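
For readers following this thread, the suggestion to warn on fallback could look roughly like the sketch below. This is not necessarily what the commit above does: the function name `read_text_warn_on_fallback`, the use of `DeprecationWarning`, the message wording, and the `"latin-1"` default are all assumptions made for illustration.

```python
import warnings


def read_text_warn_on_fallback(path, fallback_encoding="latin-1"):
    """Read a file as UTF-8; warn and retry with a fallback on decode errors.

    Hypothetical helper; the warning category and message are assumptions.
    """
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError:
        # Tell the user the file should be converted, so that the
        # fallback can eventually be removed.
        warnings.warn(
            f"{path!r} is not valid UTF-8; falling back to "
            f"{fallback_encoding!r}. Please re-save the file as UTF-8.",
            DeprecationWarning,
            stacklevel=2,
        )
        with open(path, "r", encoding=fallback_encoding) as f:
            return f.read()
```

The warning fires only on the fallback path, so files that are already UTF-8 stay silent. Whether the fallback can ever be dropped remains the open question raised above about sdists already published to PyPI.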

@abravalheri abravalheri merged commit 1ed7591 into pypa:main Apr 22, 2024
19 of 21 checks passed
@abravalheri abravalheri deleted the utf-8 branch April 22, 2024 10:59
2 participants