-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enforce UTF-8 encoding #1381
Enforce UTF-8 encoding #1381
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #1381 +/- ##
==========================================
+ Coverage 67.55% 67.67% +0.12%
==========================================
Files 69 69
Lines 7493 7518 +25
Branches 1844 1844
==========================================
+ Hits 5062 5088 +26
+ Misses 2159 2158 -1
Partials 272 272 ☔ View full report in Codecov by Sentry. |
@tsibley this needs a bit more work but I'd like to get your opinion on the direction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a small but reasonable start!
Being explicit about encodings is best. Once Python's UTF-8 mode is the default (not for a while) we could be less explicit, but it'd really still be better to declare "Augur is UTF-8" explicitly.
It might be helpful to enable the default encoding warning in conjunction with the test suite to help us find places that rely on the implicit default of the current locale's encoding.
Are there bits in particular you wanted feedback on?
@tsibley no, that's helpful! Just wanted to see if you also thought declaring "Augur is UTF-8" is a good idea. I'll update this PR with your suggestions. |
e244009
to
f263054
Compare
I added it to CI in 1b52deb and triggered a manual run to expose warnings. It's very noisy since much external code doesn't set encoding explicitly, but nonetheless there is still some Augur code that is flagged. Example:
I think the most I'll do here is use the CI run as reference and specify encoding explicitly for all warnings that arise from within the Augur codebase. |
be00ef3
to
9325ca9
Compare
9325ca9
to
cd479fd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a13f3ef's message:
Maybe a similar central function should be used for pandas.read_csv().
For common/shared/required arguments to pandas.read_csv()
, yeah. But if just for encoding/compression, we can pass read_csv()
a filehandle returned by open_file()
instead of a filename, right?
cd479fd
to
c2ff05e
Compare
Good point. Will look into this. EDIT: Implemented as f0d1177 |
c2ff05e
to
b695954
Compare
Ah, what I meant was turning: pandas.read_csv(args.filename, ...) into: pandas.read_csv(open_file(args.filename), ...) or the conceptual equivalent with By opening the file for Pandas, we standardize the compression and encoding support for Augur. |
b695954
to
657d524
Compare
This is the default used by xopen as of v1.3.0.¹ Make it explicit here. Also, since this module is used for IO across the codebase, store the default encoding of UTF-8 in a variable to be used in other parts of the codebase in future commits. ¹ <https://github.com/pycompression/xopen/blob/v1.3.0/README.rst#changes>
This is common user error, so handle it explicitly with a helpful error message.
This allows setting defaults and supporting various compression formats all from one central function. Maybe a similar central function should be used for pandas.read_csv().
Start with just an explicit encoding. Applied to all invocations of read_csv(). The current motivation of this change is to enforce UTF-8 encoding. Another way to address that motivation would be to use Augur's internal open_file() in place of the first argument to read_csv() to handle encoding options as well as supported compression formats - but that's not a trivial change.
657d524
to
5762baa
Compare
Oh whoops, I registered that at first but forgot to write down the change in direction after I tried it out. Replacing the first argument to Instead of trying to make that work, I thought it would be easiest to enforce UTF-8 through shared options. I've summarized this in the commit message of a4601a2 (previously f0d1177). |
Ahhh, makes sense! Thanks for following up. |
Description of proposed changes
Enforce UTF-8 encoding when reading and writing files. Improve error messages when a non-UTF-8 file is used.
See commit messages for details.
Related issue(s)
Checklist