-
Notifications
You must be signed in to change notification settings - Fork 964
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove RFC2047 decoder #967
Conversation
@@ -75,24 +75,16 @@ pub enum Error { | |||
impl Metadata21 { | |||
/// Parse distribution metadata from metadata bytes | |||
pub fn parse(content: &[u8]) -> Result<Self, Error> { | |||
// HACK: trick mailparse to parse as UTF-8 instead of ASCII |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This piece I am slightly less confident in...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this change works fine when resolving defity==0.1.3
, which was the originating issue for PyO3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even the author name is resolved correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm it looks like this was lifted from PyO3? https://github.com/PyO3/python-pkginfo-rs/blob/d719988323a0cfea86d4737116d7917f30e819e2/src/metadata.rs#L89
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's linked in the PR summary :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like it is theory possible to remove stacker
, but rfc2047-decoder
does not re-export chumsky
features and chumsky
enables the stacker
dependency by default. So to get stacker
out while keeping rfc2047-decoder
in, you'd need to patch rfc2047-decoder
to re-export the spill-stack
feature from chumsky
.
It seems like the goal was to support Unicode in certain header fields, but I don't think this is necessary for us. We only use get_first_value for Requires-Python, which has to be ASCII, doesn't it?
This part confuses me. get_first_value
is used for several things. And indeed, without RFC 2047 support, it seems like it might get stuff wrong? But I'm not an RFC 2047 expert. Does it only apply to the body? Or also to the header values? If it's the former, then assuming you don't care about the body here, you should be fine without RFC 2047 decoding. But if things like =E2=82=AC
can appear in header values, then you probably need RFC 2047 decoding.
My intuition here is that
let meta = Metadata21::parse(s.as_bytes()).unwrap(); | ||
assert_eq!(meta.metadata_version, "1.0"); | ||
assert_eq!(meta.name, PackageName::from_str("asdf").unwrap()); | ||
assert_eq!(meta.version, Version::new([1, 0])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about a test with a non-ASCII package name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like maybe RFC 2047 is for reading UTF-8 encoded data in a format like =E2=82=AC
? So maybe a test needs to include that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works correctly as-is -- if I use Author: =?utf-8?q?=C3=A4_space?= <[email protected]>
, it gets parsed as author: ä space <[email protected]>
.
The thing that the author pointed out in the linked mailparse issue is that this decoding happens via mailparse::parse_mail
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, name has to be ASCII: https://packaging.python.org/en/latest/specifications/core-metadata/#name
@@ -75,24 +75,16 @@ pub enum Error { | |||
impl Metadata21 { | |||
/// Parse distribution metadata from metadata bytes | |||
pub fn parse(content: &[u8]) -> Result<Self, Error> { | |||
// HACK: trick mailparse to parse as UTF-8 instead of ASCII |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm it looks like this was lifted from PyO3? https://github.com/PyO3/python-pkginfo-rs/blob/d719988323a0cfea86d4737116d7917f30e819e2/src/metadata.rs#L89
2a380bd
to
933980a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay, so mailparse
is already doing RFC 2047 decoding? If so, then yeah, this change LGTM.
Yeah, that's why I'm slightly confused as to why this was applied upstream. |
The
|
Honestly, I think this might've been fixed in mailparse later: staktrace/mailparse#104 |
Summary
It seems like the goal was to support Unicode in certain header fields, but I don't think this is necessary for us. We only use
get_first_value
forRequires-Python
, which has to be ASCII, doesn't it?In my testing, it seems like the
charset
hack can also be removed. The tests I copied over actually work without it, which makes me a bit skeptical.The main benefit here is that we get to a remove a big dependency stack, including Chumsky and Stacker and psm which have limited cross-platform support.