-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cache parse_links() by --find-links html page url #7729
Conversation
A decorator would help us keep this logic out of from functools import lru_cache
class CacheablePage(object):
def __init__(self, page):
self.page = page
def __hash__(self):
return hash((page.content, page.encoding))
def with_cached_html_pages(fn):
@lru_cache
def wrapper(cacheable_page):
return fn(cacheable_page.page)
def wrapper_wrapper(page):
return wrapper(CacheablePage(page))
return wrapper_wrapper
...
@with_cached_html_pages
def parse_links(page):
... |
Thank you -- that sounds great! I will make this change! |
Is there a reason behind caching the result against the content, instead of the On |
Super reasonable! This change was motivated because we saw the repeated message:
which we pinned down to pip/src/pip/_internal/index/collector.py Lines 130 to 143 in e633741
I agree that caching by the url path seems sufficient, and I can definitely see how that might be more "expected" behavior in some respect. I'm thinking that (with the use case of #5670 in mind) if some build process publishes a change to a But using the url path seems maybe cleaner to implement, and possibly easier to fold into some other caching via |
Yup, I was thinking along a similar line as well, and |
Great! Was being overly cautious! I think the |
1d05011
to
b208169
Compare
@uranusjr An unexpected consequence of keying by page url is that the testing in |
I've detailed in pex-tool#5 some quick testing I did using the benchmark script in this PR's description on the effects of adding a similar cache for fetching of Since the separate |
Cherry-pick of pypa#7729. Fixes pex-tool/pex#887.
I've added type annotations for all of the things! I was unable to figure out how to get mypy to accept the: try:
from functools import lru_cache
except ImportError:
def lru_cache(...): pattern, as it would always see both the "try" and the "except" and therefore be unable to satisfy mypy for python 2 and 3. I have therefore created a method |
Cherry-pick of pypa#7729. Fixes pex-tool/pex#887.
@uranusjr would love to know your thoughts on the change to additionally support caching |
I would guess the performance benefits (compared to only caching find-links pages) are marginal. Index pages are already per-package, and each package page is only read once in the vast majority of cases (exceptions being build-time dependencies such as setuptools). On the flip side though, the added memory requirements could be significant on constrained systems (cache entry of a well-known package is likely to be in the 100kb-500kb range) that I don’t feel comfortable blindly caching them all without adding a flag to disable it. I would suggest sticking to only caching find-links pages for now. |
Thank you for the extremely thoughtful response! I had not considered at all memory-constrained systems, and I agree that the benefits are marginal enough not to warrant these tradeoffs (we do not have any benchmarks showing any further speedup -- this was a mistake on my part, being too eager to find another optimization). I really appreciate you being so specific and helpful in your response! I will revert the change to add an |
acdd06e
to
89c53fb
Compare
@uranusjr ping! I have removed the separate caching by page content and have just keyed |
Doesn’t |
Thank you! I will check that and avoid caching index pages if so! |
89c53fb
to
beac52a
Compare
@uranusjr in beac52a I have marked I also added a unit test for this selective caching logic (see |
803d820
to
b19ca97
Compare
@uranusjr ping! I have added appropriate testing and a couple docstrings. |
Ping again! |
Hi! Sorry I’ve been a bit tied up recently, I will find some time to review this entirely this weekend. |
Thanks!! Not a problem!! No rush on my end ^_^ Thanks so much for your efforts to make this change work! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation looks sound to me, but I still have reservations about vendoring lru_cache
(see inline comments).
src/pip/_internal/index/collector.py
Outdated
cache[cache_key] = value | ||
return value | ||
return wrapped | ||
return wrapper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether we should simply make the cache a no-op on Python 2. Something like:
F = TypeVar('F', Callable)
def stub_lru_cache(maxsize=None):
# type: (Optional[int]) -> Callable[[F], F]
def _wrapper(f):
# type: (F) -> F
return f
return _wrapper
try:
from functools import lru_cache as real_lru_cache
except ImportError:
_lru_cache = stub_lru_cache
else:
_lru_cache = real_lru_cache
There are probably a couple of typing hoops we need to jump through, but the point is not avoid vendoring a duplicated lru_cache
altogether.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of references for convinience, if we are going down this route:
- Issues with conditional imports python/mypy#1297 How to deal with conditional imports
- type hint: how should type of lru_cache be defined? python/mypy#5107 How to type-check
lru_cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have slightly modified your example to make mypy and flake8 accept it:
pip/src/pip/_internal/index/collector.py
Lines 49 to 57 in 8315353
try: | |
from functools import lru_cache as _lru_cache # type: ignore | |
except ImportError: | |
def _lru_cache(maxsize=None): # type: ignore | |
# type: (Optional[int]) -> Callable[[F], F] | |
def _wrapper(f): | |
# type: (F) -> F | |
return f | |
return _wrapper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed a patch against your repo to replace the type: ignore
comments with getattr()
and a callback protocol that IMO is easier to follow. You can merge that PR to include it in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for diving into this and figuring out the Protocol usage!! I have merged the PR!
b19ca97
to
8315353
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems weird to set uncacheable_links
to True
for PyPI as the links could actually be cached: it's just that it would not be useful (if I'm not mistaken).
If that's the case, a simple cache_link_parsing=False
argument might make more sense ?
(with
(Link(url) for url in index_url_loc),
(Link(url, cache_link_parsing=True) for url in fl_url_loc),
)
Absolutely!! I'll make this change! |
50030c6
to
d7318e6
Compare
add failing test apply the fix add template NEWS entry according to https://pip.pypa.io/en/latest/development/contributing/#news-entries (wrong PR #) rename news entry to the current PR # respond to review comments fix test failures fix tests by adding uuid salt in urls cache html page fetching by link make CI pass (?) make the types much better finally listen to the maintainer and cache parse_links() by url :) avoid caching parse_links() when the url is an index url cleanup add testing for uncachable marking only conditionally vendor _lru_cache for py2 bugfix => feature python 2 does not cache! Do away with type: ignore with getattr() respond to review comments
d7318e6
to
6e7b16c
Compare
@pradyunsg this should be ready for version 20.1 |
@xavfernandez feel free to click the merge button! :) |
Thanks @cosmicexplorer :) |
118: Update pip to 20.1 r=duckinator a=pyup-bot This PR updates [pip](https://pypi.org/project/pip) from **20.0.2** to **20.1**. <details> <summary>Changelog</summary> ### 20.1 ``` ================= Process ------- - Document that pip 21.0 will drop support for Python 2.7. Features -------- - Add ``pip cache dir`` to show the cache directory. (`7350 <https://github.com/pypa/pip/issues/7350>`_) Bug Fixes --------- - Abort pip cache commands early when cache is disabled. (`8124 <https://github.com/pypa/pip/issues/8124>`_) - Correctly set permissions on metadata files during wheel installation, to permit non-privileged users to read from system site-packages. (`8139 <https://github.com/pypa/pip/issues/8139>`_) ``` ### 20.1b1 ``` =================== Deprecations and Removals ------------------------- - Remove emails from AUTHORS.txt to prevent usage for spamming, and only populate names in AUTHORS.txt at time of release (`5979 <https://github.com/pypa/pip/issues/5979>`_) - Remove deprecated ``--skip-requirements-regex`` option. (`7297 <https://github.com/pypa/pip/issues/7297>`_) - Building of local directories is now done in place, instead of a temporary location containing a copy of the directory tree. (`7555 <https://github.com/pypa/pip/issues/7555>`_) - Remove unused ``tests/scripts/test_all_pip.py`` test script and the ``tests/scripts`` folder. (`7680 <https://github.com/pypa/pip/issues/7680>`_) Features -------- - pip now implements PEP 610, so ``pip freeze`` has better fidelity in presence of distributions installed from Direct URL requirements. (`609 <https://github.com/pypa/pip/issues/609>`_) - Add ``pip cache`` command for inspecting/managing pip's wheel cache. (`6391 <https://github.com/pypa/pip/issues/6391>`_) - Raise error if ``--user`` and ``--target`` are used together in ``pip install`` (`7249 <https://github.com/pypa/pip/issues/7249>`_) - Significantly improve performance when ``--find-links`` points to a very large HTML page. (`7729 <https://github.com/pypa/pip/issues/7729>`_) - Indicate when wheel building is skipped, due to lack of the ``wheel`` package. (`7768 <https://github.com/pypa/pip/issues/7768>`_) - Change default behaviour to always cache responses from trusted-host source. (`7847 <https://github.com/pypa/pip/issues/7847>`_) - An alpha version of a new resolver is available via ``--unstable-feature=resolver``. (`988 <https://github.com/pypa/pip/issues/988>`_) Bug Fixes --------- - Correctly freeze a VCS editable package when it is nested inside another VCS repository. (`3988 <https://github.com/pypa/pip/issues/3988>`_) - Correctly handle ``%2F`` in URL parameters to avoid accidentally unescape them into ``/``. (`6446 <https://github.com/pypa/pip/issues/6446>`_) - Reject VCS URLs with an empty revision. (`7402 <https://github.com/pypa/pip/issues/7402>`_) - Warn when an invalid URL is passed with ``--index-url`` (`7430 <https://github.com/pypa/pip/issues/7430>`_) - Use better mechanism for handling temporary files, when recording metadata about installed files (RECORD) and the installer (INSTALLER). (`7699 <https://github.com/pypa/pip/issues/7699>`_) - Correctly detect global site-packages availability of virtual environments created by PyPA’s virtualenv>=20.0. (`7718 <https://github.com/pypa/pip/issues/7718>`_) - Remove current directory from ``sys.path`` when invoked as ``python -m pip <command>`` (`7731 <https://github.com/pypa/pip/issues/7731>`_) - Stop failing uninstallation, when trying to remove non-existent files. (`7856 <https://github.com/pypa/pip/issues/7856>`_) - Prevent an infinite recursion with ``pip wheel`` when ``$TMPDIR`` is within the source directory. (`7872 <https://github.com/pypa/pip/issues/7872>`_) - Significantly speedup ``pip list --outdated`` by parallelizing index interaction. (`7962 <https://github.com/pypa/pip/issues/7962>`_) - Improve Windows compatibility when detecting writability in folder. (`8013 <https://github.com/pypa/pip/issues/8013>`_) Vendored Libraries ------------------ - Update semi-supported debundling script to reflect that appdirs is vendored. - Add ResolveLib as a vendored dependency. - Upgrade certifi to 2020.04.05.1 - Upgrade contextlib2 to 0.6.0.post1 - Upgrade distro to 1.5.0. - Upgrade idna to 2.9. - Upgrade msgpack to 1.0.0. - Upgrade packaging to 20.3. - Upgrade pep517 to 0.8.2. - Upgrade pyparsing to 2.4.7. - Remove pytoml as a vendored dependency. - Upgrade requests to 2.23.0. - Add toml as a vendored dependency. - Upgrade urllib3 to 1.25.8. Improved Documentation ---------------------- - Emphasize that VCS URLs using git, git+git and git+http are insecure due to lack of authentication and encryption (`1983 <https://github.com/pypa/pip/issues/1983>`_) - Clarify the usage of --no-binary command. (`3191 <https://github.com/pypa/pip/issues/3191>`_) - Clarify the usage of freeze command in the example of Using pip in your program (`7008 <https://github.com/pypa/pip/issues/7008>`_) - Add a "Copyright" page. (`7767 <https://github.com/pypa/pip/issues/7767>`_) - Added example of defining multiple values for options which support them (`7803 <https://github.com/pypa/pip/issues/7803>`_) ``` </details> <details> <summary>Links</summary> - PyPI: https://pypi.org/project/pip - Changelog: https://pyup.io/changelogs/pip/ - Homepage: https://pip.pypa.io/ </details> Co-authored-by: pyup-bot <[email protected]>
pip._internal.index.collector.parse_links()
can take a nontrivial amount of time for large--find-links
html repos. This cost is paid every single time a--find-links
page is consulted (once per transitive requirement resolved), and can therefore become quite lengthy for large resolves. See pex-tool/pex#887 and pex-tool/pex#888 for further background on how we arrived at this issue while overcoming blockers to https://github.com/pantsbuild/pants upgrading to pex version 2, which uses pip to resolve and fetch wheels.This change caches the result of invoking
parse_links()
for a--find-links
html repo, keyed by theHTMLPage
'surl
. This means that other code in pip callingparse_links()
can safely expect to idempotently resolve a set of links from an html page from any other thread, for example, without having to share any state between threads.To reproduce the pathological behavior and demonstrate that this fix improves performance, check out this branch and execute this script: