Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

List full path of wheel files in cache #8474

Merged
merged 4 commits into from
Sep 16, 2020

Conversation

deveshks
Copy link
Contributor

@deveshks deveshks commented Jun 20, 2020

Closes #8355

Add an option --not-accessed-since to subcommands list and purge of pip cache
to list and remove wheels older than x days from current time.

cc @xavfernandez since the main idea was inspired from their PR #3146 , so I would appreciate thoughts on whether it's in the right direction

EDIT:
Based on further discussion about baking in a specific cache eviction mechanism not the correct way going forward, and letting the user decide based on the output to consume it and use in their own cache eviction script, I have gone ahead and updated the PR to include the --abspath option.

$ pip cache list --abspath
/Users/devesh/Library/Caches/pip/wheels/01/d8/aa/eef4a6fa336467a340674ec0a6d5133c0e8a2725c730cb884b/pip-20.2.dev0-py2.py3-none-any.whl
/Users/devesh/Library/Caches/pip/wheels/08/9b/f4/87ddd64472107c95aa33211eb85c53c564d911a61567f29fb5/twine-3.1.2.dev81+ge55869a.d20200511-cp38-none-any.whl
/Users/devesh/Library/Caches/pip/wheels/0c/88/ac/41500883ea902d3409a83a827870a726346b5ebfd0523e91df/distlib-0.3.0-py2-none-any.whl
<snip>
$ pip cache list --abspath
$

@deveshks deveshks changed the title Remove items of age cache List/Remove wheels from cache older than x days Jun 20, 2020
@deveshks deveshks force-pushed the remove-items-of-age-cache branch 2 times, most recently from 147e3d0 to 9eb49f2 Compare June 26, 2020 19:50
@deveshks
Copy link
Contributor Author

deveshks commented Jul 2, 2020

Hi @xavfernandez

I would really appreciate if you could take a look at this, and provide further reviews, since I have built this on top of your older PR.

@pradyunsg , I would really appreciate your thought as well on this.

@uranusjr
Copy link
Member

uranusjr commented Jul 3, 2020

I like this as a concept, but the interface makes me wonder whether we’d be heading toward inventing a query language via flags. We should probably design this in a more composible way, like maybe generating a JSON output for jq concumption, or a row-based (for awk), instead of adding filtering logic ourselves. Maybe something like

# The special purge operator "-" takes list of files from stdin.
pip cache list --format=json | jq -r ".[] | select(.timestamp > $ACCESS_THRESHOLD) | .name" | pip cache purge -

@deveshks
Copy link
Contributor Author

deveshks commented Jul 3, 2020

I like this as a concept, but the interface makes me wondering whether we’d be heading toward inventing a query language via flags. We should probably design this in a more composible way, like maybe generating a JSON output for jq concumption, or a row-based (for awk), instead of adding filtering logic ourselves. Maybe something like

# The special purge operator "-" takes list of files from stdin.
pip cache list --format=json | jq -r ".[] | select(.timestamp > $ACCESS_THRESHOLD) | .name" | pip cache purge -

So if I understand correctly, we are letting the output from pip cache list fed into jq and awk, which then filter the list of results to the timestamp, and then that list is fed to pip cache purge, which then looks up the paths to those wheels and removes them.

This removes the burden of maintaining the logic of --not-accessed-since from pip, and hands it over to command line tools.

But would such tools be available for windows users who want to use this feature? Also I am not sure of the learning curve of jq and awk for normal users who can format such queries. (Unless we provide them these examples under "pip cache recipes: remove cache items longer than x days)

I would also wait for other maintainers to chime in before updating the PR to follow that approach.

Edit: Seems like @pfmoore already agrees to it.

@uranusjr
Copy link
Member

uranusjr commented Jul 3, 2020

But would such tools be available for windows users who want to use this feature? Also I am not sure of the learning curve of jq and awk for normal users who can format such queries. (Unless we provide them these examples under "pip cache recipes: remove cache items longer than x days)

Yeah, having recipes sounds like a great idea, since it is indeed non-trivial to learn either tools. Windows is not a problem for jq since it’s quite easy to install with either Chocolatey or Scoop. Or failing that, we always have Python, which is pretty good at this as well (albeit significantly longer and more difficult to read) 😛

pip cache list --format=json | python -c "import json, os, sys; [print(entry['name']) for entry in json.load(sys.stdin) if entry['timestamp'] < os.environ['ACCESS_THRESHOLD']]" | pip cache purge -

@deveshks
Copy link
Contributor Author

deveshks commented Jul 3, 2020

Thanks, should I go ahead and make the changes according to your suggestions @uranusjr or should I wait for more opinions on the same?

@pfmoore
Copy link
Member

pfmoore commented Jul 3, 2020

But would such tools be available for windows users who want to use this feature

As a Windows user, I'd confirm that they are, although I concede that they are not as much part of the "usual culture" for Windows. I consider that a failing of Windows users, not of Windows or of the proposal, though 🙂

Also, I'd like to see more examples of why people need this functionality, with a particular emphasis on the question of "are we likely to see people wanting other means of selecting things to purge?" I note that the motivating use case given in #8355 didn't actually need date-based removal - it was driven simply by "my cache is taking up too much space". The idea of pruning by date was not justified - presumably "reduce the space used by the cache by xx%" or "remove big files" would just as well have solved the original problem.

@edmorley
Copy link
Contributor

edmorley commented Jul 3, 2020

I like this as a concept, but the interface makes me wonder whether we’d be heading toward inventing a query language via flags. We should probably design this in a more composible way, like maybe generating a JSON output for jq concumption

Hi! I agree that for less common use-cases that this approach would be best.

However I think "clear old things out of a cache" should be treated as a common use-case, and as such handled by pip directly. I would posit that that the next most essential thing for caches after correctness/performance is handling cache expiration, and so it's not really an optional feature.

@edmorley
Copy link
Contributor

edmorley commented Jul 3, 2020

Also, I'd like to see more examples of why people need this functionality, with a particular emphasis on the question of "are we likely to see people wanting other means of selecting things to purge?"

On Heroku, the use case is preventing unbounded growth of the pip cache, since (if we used it,) it would be stored in the app build cache and so be fetched/uploaded to S3 before/after every build.

One alternative to expiring cache based on access time that would also work for us, is the ability to purge anything from the cache that isn't currently installed (ie reported by pip list), or isn't in the dependency graph of a specified requirements file.

@deveshks
Copy link
Contributor Author

deveshks commented Jul 3, 2020

Seems like from the discussions above, the scope of this PR can be narrowed to provide pip cache purge a list of wheels, or a list of packages for which the wheels to remove.

The side-effect of that would automatically include clearing old items out of the cache, the old items being provided by the jq or awk output for example. The other use cases of purge anything from the cache that isn't currently installed (ie reported by pip list), or isn't in the dependency graph of a specified requirements file. can then easily be integrated into this by generating those list of packages and passing them to pip cache purge

This ensures that the different possibilities of removing wheels from cache is not baked into pip, but is user-provided and driven, which we can suggest by the means of recipes like the one above.

@pfmoore
Copy link
Member

pfmoore commented Jul 3, 2020

the next most essential thing for caches after correctness/performance is handling cache expiration

I agree, cache expiration makes sense, but what I'm trying to understand is why date-based is the only policy we should support from pip. Also, if we are going down this route, then why isn't the argument that as pip is maintaining a cache, it's up to pip to automatically handle expiration, with no user intervention at all?

On Heroku, the use case is preventing unbounded growth of the pip cache

So wouldn't size-based expiration (remove big files) be more useful? If someone installs a huge package like tensorflow, then realises that wasn't what they meant, why wait 2 weeks to reclaim that space?

My concern here is that we either don't lock users into one possible policy, or conversely commit pip to supporting all sorts of options (as @uranusjr said, build a query language in options).

Deleting outdated or large files from a filesystem location is a pretty common task, I'm not clear why we need pip to reinvent that wheel (sorry for the pun!). Why can't this simply be built on top of pip cache info?

@pfmoore
Copy link
Member

pfmoore commented Jul 3, 2020

One alternative to expiring cache based on access time that would also work for us, is the ability to purge anything from the cache that isn't currently installed

... and see? That's when we start getting scope creep 🙂 I would have thought that it's possible to build something like this relatively easily using pip list and pip cache list and a bit of shell scripting (or Python code!)?

@uranusjr
Copy link
Member

uranusjr commented Jul 3, 2020

I’m starting to think what we ought to do is to simply have a flag to output the full path of each cache entry (e.g. pip cache list --abspath). Whatever logic is needed can be trivially implemented by the consumer by querying the filesystem afterwards (which is what pip cache does anyway).

@deveshks
Copy link
Contributor Author

deveshks commented Jul 3, 2020

I’m starting to think what we ought to do is to simply have a flag to output the full path of each cache entry (e.g. pip cache list --abspath). Whatever logic is needed can be trivially implemented by the consumer by querying the filesystem afterwards (which is what pip cache does anyway).

I agree, using the full path will give us the caller to do anything with those paths, including deleting them from cache via rm , but would we be fine to allow external programs to do that on pip's behalf (deleting wheels from the cache), and that's not affecting cache negatively?

It should be a trivial change to provide that and write tests for that in this PR, so I am more than happy to do it if folks agree :)

@edmorley
Copy link
Contributor

edmorley commented Jul 3, 2020

... and see? That's when we start getting scope creep

Yeah fair point :-)

I agree it's likely not worth trying to satisfy every use case, and perhaps those that are related to file-system stats (like last accessed) are best handled outside of pip. However for those that are very much pip domain specific ("which of these packages aren't currently installed?") it seems a bit clunky to have to script multiple pip commands/manipulate output.

Even if more advanced cache handling support was a wontfix, I wonder if there are a few small more-generic changes that could simplify implementations - for example if pip cache purge supported an --except <package list> then callers wouldn't need to do things like programmatically diff package lists before passing to pip cache remove etc.

@edmorley
Copy link
Contributor

edmorley commented Jul 3, 2020

then why isn't the argument that as pip is maintaining a cache, it's up to pip to automatically handle expiration, with no user intervention at all?

Good point - I think pip handling this automatically would be ideal.

So wouldn't size-based expiration (remove big files) be more useful? If someone installs a huge package like tensorflow, then realises that wasn't what they meant, why wait 2 weeks to reclaim that space?

Thinking about this more, for the CI/CD use-case (where incremental build performance is the priority) I believe the preferred solution would actually be purging anything unused immediately, since unless there was a regression/revert a package-version will likely never be used again after its initial removal or upgrade.

@pfmoore
Copy link
Member

pfmoore commented Jul 3, 2020

I wonder if there are a few small more-generic changes that could simplify implementations

Maybe. That's where we need actual real-world use cases, rather than "maybe it would help if" suggestions (not criticising what you're saying here, just pointing out that we need to justify the implementation and maintenance cost with real, demonstrated benefits - the pip developer team is way too small to keep adding features that don't turn out to get much real-world use).

@deveshks
Copy link
Contributor Author

deveshks commented Jul 3, 2020

Hi @uranusjr ,

In terms of your suggestion at #8474 (comment) , are we looking at something like this?

pip cache list --abspath
/Users/devesh/Library/Caches/pip/wheels/01/d8/aa/eef4a6fa336467a340674ec0a6d5133c0e8a2725c730cb884b/pip-20.2.dev0-py2.py3-none-any.whl
/Users/devesh/Library/Caches/pip/wheels/08/9b/f4/87ddd64472107c95aa33211eb85c53c564d911a61567f29fb5/twine-3.1.2.dev81+ge55869a.d20200511-cp38-none-any.whl
/Users/devesh/Library/Caches/pip/wheels/0c/88/ac/41500883ea902d3409a83a827870a726346b5ebfd0523e91df/distlib-0.3.0-py2-none-any.whl
/Users/devesh/Library/Caches/pip/wheels/11/17/00/2a4439eeb3a42a66b3649af2ef52ead483766361dc65b31498/regex-2020.5.14-cp38-cp38-macosx_10_15_x86_64.whl
/Users/devesh/Library/Caches/pip/wheels/13/90/db/290ab3a34f2ef0b5a0f89235dc2d40fea83e77de84ed2dc05c/PyYAML-5.3.1-cp38-cp38-macosx_10_15_x86_64.whl
/Users/devesh/Library/Caches/pip/wheels/1b/32/75/6205480d3e78221888ce99a3717fa349d373b256329408b289/Flask-0.10.1-py3-none-any.whl
/Users/devesh/Library/Caches/pip/wheels/1d/43/5d/5e84d744824f0c49fd2fef1d26d86d860926ffcc7574f6fd7a/gdown-3.11.0-py3-none-any.whl
......

and

$ pip cache list *redis* --abspath
/Users/devesh/Library/Caches/pip/wheels/f5/5a/44/3dc9ee136e3496b337cdbd0b65ea8a6128bbe9a04adc5c0ea0/redis-2.8.0-py3-none-any.whl

@stuaxo
Copy link

stuaxo commented Jul 8, 2020

Outputting all the paths so the user can delete them is not a great solution.

Not all users have rm, since some are on Windows.

When I was looking for functionality like this in the help I was looking for words like remove or delete, I shouldn't have to care or know where the directory is.

Now, the user has to write a different script on Windows and everything else.

@deveshks
Copy link
Contributor Author

deveshks commented Jul 9, 2020

From the comments and +1s on this PR and the issue, seems like @uranusjr , @pfmoore and @pradyunsg are against introducing a special flag --not-accessed-since for clearing the cache, and wants to instead output a list of paths which can be processed by the consumer.

And @xavfernandez and @stuaxo , (the original ticket creator) is against that change and wants pip to handle this, since LRU cache is a common and easy strategy used in caches, and pip can have that baked in

I personally am more tilted towards outputting the list of paths, which can be used by a consumer in form of a shell, or a python script for now, and perhaps based on future requirements from other users, think of baking it into pip at some point.
Also I think that outputting the list of paths can have other use cases outside deleting the wheels as well.

Given these facts, if every one agrees, I can leave this PR as-is, or update it to output the list of paths. Please let me know. (Or if there is a third solution to this, that would be welcome too)

@pfmoore
Copy link
Member

pfmoore commented Jul 9, 2020

I remain concerned about the limited flexibility of an option specifically to delete by age, but others seem comfortable that this won't result in a series of further options being requested. I do accept that "clearing out old cache entries" is a reasonable thing to want to do.

I'm also concerned by the broader question of whether pip should have custom commands that manage things like this, as opposed to taking the approach of providing mechanisms so that users can build their own solutions using that mechanism. I find the assertion that we shouldn't be expecting users to be willing to write small scripts or command line invocations to handle tasks to be disconcerting, given that pip is a tool for people developing code. I also don't give much weight to the idea that "Unix solutions don't work on Windows" - Windows users have access to scripting tools just as much as Unix users (if nothing else they have Python!) and I find it a little condescending to assume that they aren't able to use them.

However, I'm going to withdraw from this discussion now. It's reached a point where it's more about high-level design principles than about technical choices, and I'm not sure that a ticket like this is where such decisions should be made.

@stuaxo
Copy link

stuaxo commented Jul 9, 2020

pip cache list --abspath looks like the easiest way forward for now, once that's there we can see how it's used.

It would be good if it was documented that scripts may delete files mentioned here (are there any caveats to this?).

@chrahunt
Copy link
Member

chrahunt commented Jul 14, 2020

are there any caveats to this?

As long as another pip process than the one providing the output isn't running, it should be OK.

@deveshks deveshks force-pushed the remove-items-of-age-cache branch from 9eb49f2 to 8ebe7b1 Compare July 18, 2020 08:40
@deveshks
Copy link
Contributor Author

Thank you all for your comments and suggestions. It seems that most of the folks here are okay with going with the listing of full path of wheels using the --abspath option. I have updated the PR with the same.

@deveshks deveshks changed the title List/Remove wheels from cache older than x days List full path of wheel files in cache Jul 18, 2020
@stuaxo
Copy link

stuaxo commented Jul 18, 2020

Awesome, thanks for putting this together.

@deveshks deveshks force-pushed the remove-items-of-age-cache branch from 8ebe7b1 to fda2195 Compare July 19, 2020 20:30
@deveshks
Copy link
Contributor Author

Hi @chrahunt , @pradyunsg , @uranusjr

May I request if you folks can look at this PR, and provide review comments, so that this can go in the 20.2 release train?

@uranusjr
Copy link
Member

Coming back to this, I am second-guessing my --abspath proposal. Maybe it’d make more sense to have something like pip list --format=? pip cache list --format=abspath to list the absolute paths (the current default can be called something like human).

The implementation looks good to me.

@deveshks deveshks force-pushed the remove-items-of-age-cache branch 2 times, most recently from fdec3b4 to 545dcc5 Compare July 30, 2020 21:26
@deveshks deveshks requested a review from uranusjr July 30, 2020 21:45
@deveshks deveshks force-pushed the remove-items-of-age-cache branch 3 times, most recently from 8ebc968 to 8adecc2 Compare July 31, 2020 08:24
@deveshks
Copy link
Contributor Author

deveshks commented Aug 2, 2020

Thanks for the approval.

@pradyunsg , may I get this merged, if there are no more reviews needed for this?

@deveshks deveshks force-pushed the remove-items-of-age-cache branch from 8adecc2 to e751b07 Compare August 7, 2020 14:21
@deveshks
Copy link
Contributor Author

Now that the releases are out of the way, may I get this merged @pradyunsg and @chrahunt ?

@deveshks deveshks force-pushed the remove-items-of-age-cache branch from e751b07 to a81447d Compare August 24, 2020 06:23
@deveshks deveshks force-pushed the remove-items-of-age-cache branch from a81447d to 9450f88 Compare September 11, 2020 03:54
@pradyunsg pradyunsg added C: cache Dealing with cache and files in it type: enhancement Improvements to functionality labels Sep 16, 2020
@pradyunsg pradyunsg merged commit 53efda7 into pypa:master Sep 16, 2020
@pradyunsg
Copy link
Member

Thanks @deveshks! ^>^

@deveshks deveshks deleted the remove-items-of-age-cache branch September 16, 2020 14:08
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 9, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
C: cache Dealing with cache and files in it type: enhancement Improvements to functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Way to clear items from pip cache of specified age.
7 participants