Remove RFC2047 decoder #967

charliermarsh · 2024-01-18T19:16:16Z

Summary

This was inherited from https://github.com/PyO3/python-pkginfo-rs/blob/d719988323a0cfea86d4737116d7917f30e819e2/src/metadata.rs#LL78C2-L91C26
...which introduced this code here: PyO3/python-pkginfo-rs@9cd1d43
...with the originating issue here: Maturin seems to make PyPI failed to render Unicode meta data PyO3/maturin#612
...and the upstream issue here: Failure to parse RFC2047 encoded display-names in email addresses staktrace/mailparse#50

It seems like the goal was to support Unicode in certain header fields, but I don't think this is necessary for us. We only use get_first_value for Requires-Python, which has to be ASCII, doesn't it?

In my testing, it seems like the charset hack can also be removed. The tests I copied over actually work without it, which makes me a bit skeptical.

The main benefit here is that we get to remove a big dependency stack, including Chumsky, Stacker, and psm, which have limited cross-platform support.

charliermarsh · 2024-01-18T19:16:39Z

crates/pypi-types/src/metadata.rs

@@ -75,24 +75,16 @@ pub enum Error {
 impl Metadata21 {
    /// Parse distribution metadata from metadata bytes
    pub fn parse(content: &[u8]) -> Result<Self, Error> {
-        // HACK: trick mailparse to parse as UTF-8 instead of ASCII


This piece I am slightly less confident in...

Although this change works fine when resolving defity==0.1.3, which was the originating issue for PyO3.

Even the author name is resolved correctly.

Hmm it looks like this was lifted from PyO3? https://github.com/PyO3/python-pkginfo-rs/blob/d719988323a0cfea86d4737116d7917f30e819e2/src/metadata.rs#L89

It's linked in the PR summary :)

BurntSushi

It looks like it is in theory possible to remove stacker, but rfc2047-decoder does not re-export chumsky features and chumsky enables the stacker dependency by default. So to get stacker out while keeping rfc2047-decoder in, you'd need to patch rfc2047-decoder to re-export the spill-stack feature from chumsky.

It seems like the goal was to support Unicode in certain header fields, but I don't think this is necessary for us. We only use get_first_value for Requires-Python, which has to be ASCII, doesn't it?

This part confuses me. get_first_value is used for several things. And indeed, without RFC 2047 support, it seems like it might get stuff wrong? But I'm not an RFC 2047 expert. Does it only apply to the body? Or also to the header values? If it's the former, then assuming you don't care about the body here, you should be fine without RFC 2047 decoding. But if things like =E2=82=AC can appear in header values, then you probably need RFC 2047 decoding.

My intuition here is that

BurntSushi · 2024-01-18T19:27:50Z

crates/pypi-types/src/metadata.rs

+        let meta = Metadata21::parse(s.as_bytes()).unwrap();
+        assert_eq!(meta.metadata_version, "1.0");
+        assert_eq!(meta.name, PackageName::from_str("asdf").unwrap());
+        assert_eq!(meta.version, Version::new([1, 0]));


What about a test with a non-ASCII package name?

It looks like maybe RFC 2047 is for reading UTF-8 encoded data in a format like =E2=82=AC? So maybe a test needs to include that?

It works correctly as-is -- if I use Author: =?utf-8?q?=C3=A4_space?= <[email protected]>, it gets parsed as author: ä space <[email protected]>.

The thing that the author pointed out in the linked mailparse issue is that this decoding happens via mailparse::parse_mail.
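For reference, here is a minimal sketch of what decoding an RFC 2047 "Q" encoded word involves, using only the standard library. It is not how mailparse implements it; `decode_q_word` is a hypothetical helper written for illustration, and it assumes a single encoded word with a UTF-8 charset (no "B" encoding, no other charsets, no surrounding plain text).

```rust
/// Decode one RFC 2047 encoded word of the form `=?utf-8?q?...?=`.
/// Hypothetical helper for illustration: handles only UTF-8 + "Q" encoding
/// and returns `None` for anything else or for malformed input.
fn decode_q_word(word: &str) -> Option<String> {
    // Strip the `=?` ... `?=` delimiters.
    let inner = word.strip_prefix("=?")?.strip_suffix("?=")?;

    // The remainder is `charset?encoding?encoded-text`.
    let mut parts = inner.splitn(3, '?');
    let charset = parts.next()?;
    let encoding = parts.next()?;
    let text = parts.next()?;
    if !charset.eq_ignore_ascii_case("utf-8") || !encoding.eq_ignore_ascii_case("q") {
        return None;
    }

    let mut bytes = Vec::with_capacity(text.len());
    let mut iter = text.bytes();
    while let Some(b) = iter.next() {
        match b {
            // In "Q" encoding an underscore stands for a space.
            b'_' => bytes.push(b' '),
            // `=XX` is one byte written as two hex digits.
            b'=' => {
                let hi = (iter.next()? as char).to_digit(16)?;
                let lo = (iter.next()? as char).to_digit(16)?;
                bytes.push((hi * 16 + lo) as u8);
            }
            other => bytes.push(other),
        }
    }

    // The decoded bytes must be valid UTF-8 for the declared charset.
    String::from_utf8(bytes).ok()
}
```

With this sketch, `decode_q_word("=?utf-8?q?=C3=A4_space?=")` yields `Some("ä space")`, which matches the author value reported above.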

FWIW, name has to be ASCII: https://packaging.python.org/en/latest/specifications/core-metadata/#name

BurntSushi · 2024-01-18T19:33:29Z

crates/pypi-types/src/metadata.rs

@@ -75,24 +75,16 @@ pub enum Error {
 impl Metadata21 {
    /// Parse distribution metadata from metadata bytes
    pub fn parse(content: &[u8]) -> Result<Self, Error> {
-        // HACK: trick mailparse to parse as UTF-8 instead of ASCII


Hmm it looks like this was lifted from PyO3? https://github.com/PyO3/python-pkginfo-rs/blob/d719988323a0cfea86d4737116d7917f30e819e2/src/metadata.rs#L89

BurntSushi

Ah okay, so mailparse is already doing RFC 2047 decoding? If so, then yeah, this change LGTM.

charliermarsh · 2024-01-18T20:07:34Z

Yeah, that's why I'm slightly confused as to why this was applied upstream.

charliermarsh · 2024-01-18T20:08:58Z

The mailparse author does say here: staktrace/mailparse#50 (comment)

It kind of looks like you're passing in utf-8 or other non-ascii data into mailparse (from here) and then expecting that to not get mangled. But really you're using mailparse in a way that it's not meant to be used.
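As a rough illustration of what "mangled" can mean in general: when raw UTF-8 bytes are read one byte per character, the way a Latin-1 style decoder would read them, multi-byte characters turn into mojibake. This is an assumption about the general failure mode, not a trace of what mailparse actually did at the time; `misread_as_latin1` is a hypothetical helper.

```rust
/// Interpret each byte as its own character (Latin-1 style).
/// Hypothetical helper for illustration; not taken from mailparse.
fn misread_as_latin1(bytes: &[u8]) -> String {
    // Every byte 0x00..=0xFF maps to the code point with the same value.
    bytes.iter().map(|&b| b as char).collect()
}

/// Decode the same bytes as UTF-8, which is what the metadata actually contains.
fn read_as_utf8(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}
```

For the two UTF-8 bytes of "ä" (`0xC3 0xA4`), `read_as_utf8` returns `Some("ä")` while `misread_as_latin1` returns "Ã¤". ASCII-only input comes out the same either way.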

charliermarsh · 2024-01-18T20:09:42Z

Honestly, I think this might've been fixed in mailparse later: staktrace/mailparse#104

charliermarsh requested review from BurntSushi and konstin January 18, 2024 19:16

charliermarsh added the internal A refactor or improvement that is not user-facing label Jan 18, 2024

charliermarsh commented Jan 18, 2024


BurntSushi reviewed Jan 18, 2024


Remove RFC2047 decoder

933980a

charliermarsh force-pushed the charlie/metadata branch from 2a380bd to 933980a Compare January 18, 2024 20:01

BurntSushi approved these changes Jan 18, 2024


charliermarsh merged commit 96a61fb into main Jan 18, 2024
3 checks passed

charliermarsh deleted the charlie/metadata branch January 18, 2024 20:09
