Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove RFC2047 decoder #967

Merged
merged 1 commit into from
Jan 18, 2024
Merged

Remove RFC2047 decoder #967

merged 1 commit into from
Jan 18, 2024

Conversation

charliermarsh
Copy link
Member

@charliermarsh charliermarsh commented Jan 18, 2024

Summary

It seems like the goal was to support Unicode in certain header fields, but I don't think this is necessary for us. We only use get_first_value for Requires-Python, which has to be ASCII, doesn't it?

In my testing, it seems like the charset hack can also be removed. The tests I copied over actually work without it, which makes me a bit skeptical.

The main benefit here is that we get to a remove a big dependency stack, including Chumsky and Stacker and psm which have limited cross-platform support.

@charliermarsh charliermarsh added the internal A refactor or improvement that is not user-facing label Jan 18, 2024
@@ -75,24 +75,16 @@ pub enum Error {
impl Metadata21 {
/// Parse distribution metadata from metadata bytes
pub fn parse(content: &[u8]) -> Result<Self, Error> {
// HACK: trick mailparse to parse as UTF-8 instead of ASCII
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This piece I am slightly less confident in...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this change works fine when resolving defity==0.1.3, which was the originating issue for PyO3.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even the author name is resolved correctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's linked in the PR summary :)

Copy link
Member

@BurntSushi BurntSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it is theory possible to remove stacker, but rfc2047-decoder does not re-export chumsky features and chumsky enables the stacker dependency by default. So to get stacker out while keeping rfc2047-decoder in, you'd need to patch rfc2047-decoder to re-export the spill-stack feature from chumsky.

It seems like the goal was to support Unicode in certain header fields, but I don't think this is necessary for us. We only use get_first_value for Requires-Python, which has to be ASCII, doesn't it?

This part confuses me. get_first_value is used for several things. And indeed, without RFC 2047 support, it seems like it might get stuff wrong? But I'm not an RFC 2047 expert. Does it only apply to the body? Or also to the header values? If it's the former, then assuming you don't care about the body here, you should be fine without RFC 2047 decoding. But if things like =E2=82=AC can appear in header values, then you probably need RFC 2047 decoding.

My intuition here is that

let meta = Metadata21::parse(s.as_bytes()).unwrap();
assert_eq!(meta.metadata_version, "1.0");
assert_eq!(meta.name, PackageName::from_str("asdf").unwrap());
assert_eq!(meta.version, Version::new([1, 0]));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a test with a non-ASCII package name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like maybe RFC 2047 is for reading UTF-8 encoded data in a format like =E2=82=AC? So maybe a test needs to include that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works correctly as-is -- if I use Author: =?utf-8?q?=C3=A4_space?= <[email protected]>, it gets parsed as author: ä space <[email protected]>.

The thing that the author pointed out in the linked mailparse issue is that this decoding happens via mailparse::parse_mail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -75,24 +75,16 @@ pub enum Error {
impl Metadata21 {
/// Parse distribution metadata from metadata bytes
pub fn parse(content: &[u8]) -> Result<Self, Error> {
// HACK: trick mailparse to parse as UTF-8 instead of ASCII
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@BurntSushi BurntSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, so mailparse is already doing RFC 2047 decoding? If so, then yeah, this change LGTM.

@charliermarsh
Copy link
Member Author

Yeah, that's why I'm slightly confused as to why this was applied upstream.

@charliermarsh
Copy link
Member Author

The mailparse author does say here: staktrace/mailparse#50 (comment)

It kind of looks like you're passing in utf-8 or other non-ascii data into mailparse (from here) and then expecting that to not get mangled. But really you're using mailparse in a way that it's not meant to be used.

@charliermarsh
Copy link
Member Author

Honestly, I think this might've been fixed in mailparse later: staktrace/mailparse#104

@charliermarsh charliermarsh merged commit 96a61fb into main Jan 18, 2024
3 checks passed
@charliermarsh charliermarsh deleted the charlie/metadata branch January 18, 2024 20:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal A refactor or improvement that is not user-facing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants