Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Potentially) tweak build tag return values for utils.parse_wheel_filename() #389

Closed
brettcannon opened this issue Jan 26, 2021 · 6 comments · Fixed by #390
Closed

(Potentially) tweak build tag return values for utils.parse_wheel_filename() #389

brettcannon opened this issue Jan 26, 2021 · 6 comments · Fixed by #390

Comments

@brettcannon
Copy link
Member

Right now, packaging.utils.parse_wheel_filename() returns None for a build tag if it isn't defined, str if it is. There are two potential ways of tweaking this.

(See #387 for background.)

Return an empty string

One possible change is simply return the empty string instead of None. This just normalizes the return type.

Return a tuple

Another option is to return a tuple. Pip uses () as a default and (int, str) if there is a build tag (as per PEP 427). The default could also be (0, '') to get similar sorting results as PEP 427 outlines. But the key point here is to automatically do the build tag parsing as appropriate to get people results.

Perf consideration

Is it worth spending the (literal) energy on parsing build tags every time? For instance, if one was to parse every wheel file name in a large Simple repository index (e.g. https://download.pytorch.org/whl/torch_stable.html) and it happened to use build tags extensively, would it be worth parsing things ahead of time, or is it better to punt until absolutely necessary?

I think this comes down to whether build tags take precedence over wheel tags in wheel selection, or are they more of a tiebreaker when wheel tags are the same? E.g. is pkg-2021.1-cp39-cp39-manylinux.whl a better fit than pkg-2021.1-42-py3-none-any.whl, or the other way around if you want the best fit of a wheel? If wheel tags take precedence (typically), then doing the parsing work should only be done if necessary. But if a wheel tag (typically) takes precedence then pre-emptively doing it for a simpler API makes sense to me.

@uranusjr
Copy link
Member

Regarding performance, since the curent implementation does not validate the tags at all anyway, maybe we can just return everything between version and .whl unparsed, and provide an additional function to parse it (maybe directly into Tag instances)?

@pfmoore
Copy link
Member

pfmoore commented Jan 27, 2021

If we're considering performance to this level, we could also say we shouldn't parse the version into a Version object, or the tags into a set of Tags objects, as that has a cost. IMO, we should parse all of the data into user-friendly forms, and applications that are doing significant amounts of bulk parsing should consider whether they want to do something lower-level. That may mean that this function isn't appropriate for all use cases, but that's fine by me.

As a data point, I just scanned all of the wheels on a slightly out of data local copy of all the PyPI data, and I see 1.4M wheels that don't use build tags, and 1700 that do. So I don't think parsing build tags is a problem in practice 😉 (Disclaimer - I did this by counting dashes, and there were some wheels with 3 dashes and one with 6, which suggests that at least some of the 5-dashes cases may be malformed wheel names, which means the number actually using build tags could be even lower).

@brettcannon
Copy link
Member Author

Thanks for that data point, @pfmoore ! It sounds like just going ahead and doing the build tag parsing is the right move as they are seemingly very rare.

Then I'm advocating for () to be the default value (for easy not checks), we do the parse eagerly, and we introduce the following:

BuildTag = typing.NewType("BuildTag", Union[Tuple[], Tuple[int, str]])

See https://github.com/pypa/packaging/blob/main/packaging/utils.py#L15 for an example of us already using type aliases for structured/formatted types.

If that works for people I can do a PR to update the code for that.

@pfmoore
Copy link
Member

pfmoore commented Jan 27, 2021

Works for me.

@brettcannon
Copy link
Member Author

I practically have a PR ready with the proposed change; should have the PR opened before the end of the day.

@brettcannon
Copy link
Member Author

PR is up for review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants