-
-
Notifications
You must be signed in to change notification settings - Fork 9.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(livecheck/pypi): update to use json endpoint to query version #18895
fix(livecheck/pypi): update to use json endpoint to query version #18895
Conversation
459350b
to
67b555f
Compare
67b555f
to
c65498d
Compare
Signed-off-by: Rui Chen <[email protected]>
c65498d
to
d49e01b
Compare
May be able to reuse
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To provide context for this PR: some upstream PyPI project pages now involve a Fastly verification step before the page is page is loaded (e.g., https://pypi.org/pypi/fred-py-api for me). In this case, the HTML response for the project page URL (that the Pypi
strategy checks) only contains a noscript
"JavaScript is disabled in your browser" message along with some JavaScript. This appears to occur with some projects and not others, so the existing strategy logic still works in some cases (e.g., cryptography
for me). From limited testing, the JSON API doesn't involve the same JavaScript verification (as expected).
Up to now, we've scraped the project HTML because it provides the version information we need but requires considerably less data transfer than the JSON API. Using cryptography
as an example, the HTML page is 25 KB compressed (369 KB uncompressed) and the JSON API response is 376 KB compressed (2 MB uncompressed). We have 437 formulae using Pypi
, so that difference adds up (e.g., in autobump, --tap homebrew/core
runs, etc.). I agree that scraping the HTML page isn't ideal (and that upstream would probably prefer we use the API) but it was a practical choice when the JSON API involves over 10x more data transfer and grows with each release much more rapidly than the HTML page (i.e., the JSON contains lengthy metadata on all the files for every release ever made).
It would be ideal if upstream provided a /pypi/<project>/latest/json
API endpoint that would redirect to the JSON API URL for the latest non-yanked version. With that setup, we could match the version from the Location
header without even having to parse the JSON response, so that would be simple, fast, and lightweight. Barring that, removing the releases
and urls
arrays from the main JSON API response would dramatically reduce the size (that mostly duplicates the Index API functionality anyway).
That said, the JSON API appears to be our only option [that differentiates yanked releases] at this point, so we'll just have to live with the higher data transfer and talk to upstream about options that would work for both us and them.
I did a comparative livecheck run across all the formulae using Pypi
and only ~28% are working for me locally (and many without a livecheck
block fall back to Git, etc.). I didn't see any failures when testing with the JSON API (using my own implementation of this idea) and the time to check them was comparable. Looking at the uncompressed content size of the JSON responses, thankfully only twelve are larger than 1 MB (and another twelve are larger than 500 KB). It's still a lot more data transfer overall but it could be worse.
In terms of the changes here:
- This should use a similar approach to the
Crates
strategy, which usesStrategy#page_content
andJson#versions_from_content
with a default block. #generate_input_values
should only return a:url
value (e.g.,values[:url] = "https://pypi.org/pypi/#{T.must(match[:package_name]).gsub(/%20|_/, "-")}/json"
) and the related tests need to be updated.
I'll try to add some review suggestions here where possible (it can be tricky, since GitHub doesn't allow for suggestions across removed lines).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't add a suggestion here for the default block, but please add this between NICE_NAME
and FILENAME_REGEX
:
# The default `strategy` block used to extract version information when
# a `strategy` block isn't provided.
DEFAULT_BLOCK = T.let(proc do |json|
json.dig("info", "version")
end.freeze, T.proc.params(
arg0: T::Hash[String, T.untyped],
).returns(T.nilable(String)))
@chenrui333 For what it's worth, I can just push a commit to incorporate these changes, if you're okay with them. I don't want you to have to go through the effort of manually incorporating all these suggestions when I already have a commit ready 😆 |
I am totally okay with it :) |
36fa958
to
7a4de1e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed a commit that reworks this to use Json::find_versions
in Pypi::find_versions
, borrowing some of the approach from the Crate
strategy. Besides that, this pares down the fields in the ::generate_input_values
return hash to only :url
, as we're not using a generated regex to match version information in this setup.
I've added a provided_content
parameter to ::find_versions
as part of this process and I will expand the Pypi
tests to increase coverage (like the Crates
tests and others where we fetch content in ::find_versions
) in a later PR. 75% of Pypi
checks are failing at the moment (with some returning inaccurate version information), so the current priority is getting this fix merged in the short-term.
I'll be busy for the next few hours and I'll come back to this in the evening but I think this is most of what I wanted to address before merging.
This comment has been minimized.
This comment has been minimized.
5e3696b
to
e71b86e
Compare
This reworks the new `Pypi` JSON API implementation to use `Json::find_versions` in `Pypi::find_versions`, borrowing some of the approach from the `Crate` strategy. Besides that, this pares down the fields in the `::generate_input_values` return hash to only `:url`, as we're not using a generated regex to match version information in this setup. This adds a `provided_content` parameter to `::find_versions` as part of this process and I will expand the `Pypi` tests to increase coverage (like the `Crates` tests) in a later PR. 75% of `Pypi` checks are failing at the moment (with some returning inaccurate version information), so the current priority is getting this fix merged in the short-term.
e71b86e
to
1c161ab
Compare
Among other things, the previous commit added a `provided_content` paramter to `Pypi::find_versions`, so this takes advantage of that to expand `Pypi` test coverage to 100%.
1c161ab
to
ac4854e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took another pass through this to tweak a few things and added a commit that increases Pypi
test coverage to 100% (for lines and branches). One thing to note is that we're able to use Json::find_versions
here instead of Json::versions_from_content
, which allows us to minimize the required logic in Pypi::find_versions
(much like we do for strategies that use PageMatch::find_versions
, like Apache
). [I forget if there was a specific reason why I didn't do the same thing in the Crate
strategy but I'll create a follow-up PR for that if it's possible.]
I think this is in good shape now, so we just need someone to review these changes to double-check our work. I've tested this again across the existing Pypi
formulae and it continues to work as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
Thanks, @nandahkrishna! I don't anticipate this causing issues overnight based on the testing I've done, so I'm going to go ahead and merge this to fix the related checks. Thanks @chenrui333 for getting this started and @cho-m for digging into this so deeply and resolving my questions ❤️ |
Thanks @samford @cho-m @nandahkrishna! ❤️ |
Thanks @chenrui333 @samford @cho-m @nandahkrishna! |
brew style
with your changes locally?brew typecheck
with your changes locally?brew tests
with your changes locally?Right now, pypi strategy is crawling the page for checking the latest version, but we should just use json to find out the latest version.