Checking if installed packages are up to date is slow and uses lots of CPU #2207
Comments
I'm also concerned about slow `pipenv install` in an already created venv and would like to understand the difference between the two.
@kalekseev Pipenv caches your dependency tree, so if you run the two in order, the second one will be way faster because it just uses the cache. You need to test them in fully isolated environments (e.g. Docker images).
@uranusjr well, I'm talking mostly about this case:
I tried to use pyflame to get a better idea about what takes so much time when pipenv is checking for new packages. I patched pipenv so that every pip invocation is wrapped in a pyflame trace:

```diff
diff --git a/pipenv/core.py b/pipenv/core.py
index 6fe230c8..732f92d8 100644
--- a/pipenv/core.py
+++ b/pipenv/core.py
@@ -1480,7 +1480,8 @@ def pip_install(
     quoted_pip = which_pip(allow_global=allow_global)
     quoted_pip = escape_grouped_arguments(quoted_pip)
     upgrade_strategy = '--upgrade --upgrade-strategy=only-if-needed' if selective_upgrade else ''
-    pip_command = '{0} install {4} {5} {6} {7} {3} {1} {2} --exists-action w'.format(
+    from uuid import uuid4
+    pip_command = 'pyflame --threads --output=/pipenv/flamegraph/{8}.txt --trace {0} install {4} {5} {6} {7} {3} {1} {2} --exists-action w'.format(
         quoted_pip,
         install_reqs,
         ' '.join(prepare_pip_source_args(sources)),
@@ -1489,6 +1490,7 @@ def pip_install(
         src,
         verbose_flag,
         upgrade_strategy,
+        uuid4(),
     )
     if verbose:
         click.echo('$ {0}'.format(pip_command), err=True)
```

I then concatenated the resulting files into one and created a flamegraph out of it. I did this both for a fresh install of the packages from the test repository, where all packages had to be installed, and then ran it again afterwards, where no new packages were installed.

I am by no means an expert on pyflame output, but from what I can grasp by looking at the flamegraphs, a large share of the time goes to starting a separate pip process for every package. Could we change pipenv to only spawn as many pip processes as it needs, instead of one per package? I think the start-up cost for pip makes it worth spending a bit more time up front on preparing a list of dependencies for pip, rather than only sending one to each process.

As far as I can see, the downside of not having a pip process per package is that the progress bar becomes less useful, as it will potentially jump from 0% to 100% due to one invocation of pip, but I see that as a small downside compared to the current resource usage.

If there's support for changing this, I wouldn't mind giving it a go. Any hints as to what to be aware of would be appreciated as well.
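Roughly what I have in mind, as a minimal sketch (this is not pipenv's actual code, and the package pins are just examples):

```python
import subprocess
import sys

def install_in_batches(requirements, max_processes=16):
    """Install requirements with at most `max_processes` pip
    invocations, instead of one pip process per package."""
    if not requirements:
        return
    # Ceiling division: spread the requirements evenly across the chunks.
    chunk_size = -(-len(requirements) // max_processes)
    chunks = [requirements[i:i + chunk_size]
              for i in range(0, len(requirements), chunk_size)]
    # Start one pip process per chunk, then wait for them all to finish.
    procs = [subprocess.Popen([sys.executable, "-m", "pip", "install", *chunk])
             for chunk in chunks]
    for proc in procs:
        proc.wait()

install_in_batches(["requests==2.18.4", "flask==1.0.2"], max_processes=2)
```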
Hey @Tenzer, seems like we came to the same conclusion! I don't think there's much to watch out for; we were hoping to tackle this in the coming weeks. For the sake of keeping progress bars useful I was considering consuming requirements with a threadpool or something, but I haven't really had time to experiment. If you want to take a stab at this it would be much appreciated!
I had a chance to look through the code today to get an idea about how it could be implemented. My idea for now is to split the full list of dependencies into batches and hand each batch to a single pip invocation, rather than spawning one pip process per package.

Does that sound like an okay approach? I imagine the majority of dependencies people have don't need any special handling. The alternative would be to try and group the list of dependencies by which pip options they require, and send each group to its own pip process, as sketched below.
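A rough sketch of that grouping alternative (the requirement records and the index URL are invented for illustration):

```python
from collections import defaultdict
import subprocess
import sys

# Invented examples: (specifier, extra pip arguments that package needs).
REQUIREMENTS = [
    ("requests==2.18.4", ()),
    ("flask==1.0.2", ()),
    ("internal-pkg==1.0", ("--index-url", "https://pypi.example.com/simple")),
]

def install_grouped(requirements):
    """One pip invocation per distinct set of extra pip arguments,
    instead of one invocation per package."""
    groups = defaultdict(list)
    for specifier, extra_args in requirements:
        groups[extra_args].append(specifier)
    for extra_args, specifiers in groups.items():
        subprocess.run(
            [sys.executable, "-m", "pip", "install", *extra_args, *specifiers],
            check=True)

install_grouped(REQUIREMENTS)
```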
@techalchemy Do you have any feedback on the above idea?
One problem I can think of immediately is that you'd break the installation progress bar by batching packages into chunks. How would the parent process know how far a pip process is into installing its chunk?
I mentioned that in my comment in #2207 (comment):

> As far as I can see, the downside of not having a pip process per package is that the progress bar becomes less useful, as it will potentially jump from 0% to 100% due to one invocation of pip, but I see that as a small downside compared to the current resource usage.

In my mind, the resource usage and speed of the install are more important than the progress bar. An alternative idea would be to use pip as a library instead of starting a pip process per package.
While I wholly agree the speed is more important than the progress bar, this is unfortunately one of those endowment-effect situations. From a project management standpoint, if we need to sacrifice something to achieve the gain, the gain needs to greatly outweigh the loss; otherwise it would not be worth dealing with the backlash from users. This is more of a political situation than a technical one, but no less important for a project.

Using pip as a library would be a good solution to this problem. pip, however, does not really encourage this kind of usage, and the script calling pip would need constant maintenance. It would also need to be a standalone script (instead of part of Pipenv), because the installation would need to be performed with the virtual environment's Python, not the one Pipenv is installed into. We do already tap into pip internals a little bit (but in other parts, not the installation), so this is not impossible. I would, however, encourage starting a new, standalone project for this (given it is a standalone script anyway). We will be able to integrate the script once it is usable.
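To sketch what such a standalone script might look like (pip's internal entry point is not a public API and has moved between releases, so treat this as illustrative only):

```python
# install_helper.py -- hypothetical standalone script, meant to be run
# with the virtualenv's own interpreter, e.g.:
#   /path/to/venv/bin/python install_helper.py requests==2.18.4
import sys

def pip_main(args):
    # pip's internal entry point is unsupported and has moved around;
    # try the newer location first, fall back to the older one.
    try:
        from pip._internal.cli.main import main  # pip >= 19.3
    except ImportError:
        from pip._internal import main  # older pip releases
    return main(args)

if __name__ == "__main__":
    sys.exit(pip_main(["install", *sys.argv[1:]]))
```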
Would adding a flag or environment variable to change the behaviour be acceptable? Meaning that the current behaviour is kept as the default, and people who want the speed boost instead of the progress bar can switch to the batched behaviour instead. It could be thought of as a feature flag, and it would perhaps help assess how big a difference batching makes to package installation speed.
That could work. Maybe introduce a dedicated environment variable for it. This should be quite simple to implement IMO. I would be very interested to hear what other maintainers think about this idea @techalchemy @erinxocon
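Something along these lines, say — the variable name `PIPENV_BATCH_INSTALL` is only a placeholder:

```python
import os
import subprocess
import sys

# Placeholder name for the proposed feature flag; off by default, so the
# current behaviour (and its per-package progress bar) is unchanged.
BATCHED = os.environ.get("PIPENV_BATCH_INSTALL", "") == "1"

def install(requirements):
    if BATCHED:
        # One pip invocation for the whole list: fast, coarse progress.
        batches = [requirements]
    else:
        # One pip invocation per package: slower, fine-grained progress.
        batches = [[req] for req in requirements]
    for batch in batches:
        subprocess.run([sys.executable, "-m", "pip", "install", *batch],
                       check=True)
```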
A couple of things -- I've actually implemented the batched approach in the past for comparison, and it wasn't faster. I didn't spend any time profiling it, because it wasn't that straightforward either. One fundamentally negative element of this entire problem is that pip's own installation itself uses a subprocess call, so there will never really be a way to avoid this issue unless we do something incredibly drastic.

Breaking the progress bar is not an option IMO. I would rather do something drastic, like have a thread-safe construct that allows us to pass requirements around and consume them.
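Something shaped like this, maybe — a pool of workers consuming requirements while the parent keeps an accurate count. All names here are illustrative, and note it still pays the per-package pip start-up cost:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed

def install_one(requirement):
    # Capture pip's output so it doesn't interleave with the progress line.
    subprocess.run([sys.executable, "-m", "pip", "install", requirement],
                   check=True, capture_output=True)
    return requirement

def install_with_progress(requirements, workers=4):
    done = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(install_one, req) for req in requirements]
        for future in as_completed(futures):
            done += 1
            print("\rInstalled {0}/{1} ({2})".format(
                done, len(requirements), future.result()),
                end="", flush=True)
    print()

install_with_progress(["requests==2.18.4", "flask==1.0.2"])
```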
Installing these packages using the following command downloads some dependencies, doesn't work at all for some, and is just really slow for others, but only with the python:3.7-alpine image; it crashes after 10+ minutes. With the python:3.7-stretch image it is reasonably fast and works fine. Why could that be? I think it sometimes tried to connect to PyPI, even though that's not possible. Please let me know if I should create a new issue for this.

Command used: `pipenv install --verbose --pypi-mirror $PIP_INDEX --dev --deploy`
@max-wittig It's a different issue to what's otherwise been talked about here, so you are probably better off creating another issue.
Any update on this? Could an option be provided to simply disable the progress bar, if it's so expensive? If the bottleneck is process invocations, I'd think that moving away from O(n) process invocations would make the already-installed case so much faster that the progress bar becomes irrelevant there. Does that seem right?
Such an option already exists. I read through this thread today, but being from 2018, a lot has changed and the original contributors have moved on, so I suggest that any remaining issues/feature requests be managed through new issue reports.
Funny story: I came to a similar conclusion as @Tenzer recently without realising it -- I worked on a performance optimisation for `batch_install` which cut our install time in the python package manager shootout benchmark in half: #5301. It did require dropping the progress bar, but the actual progress can still be observed with verbose output.
I've been trying to migrate a project to use Pipenv, but I'm slightly blocked by how much longer it takes for Pipenv to check that the installed dependencies are up to date, compared to pointing `pip install` at a requirements file.

In our setup we run tests inside Docker containers. The image we run the tests on comes pre-installed with the dependencies our project has at the time the image is built. Before we run any tests, we make sure the dependencies are up to date, in case any new dependencies are needed for any new code/tests that might have been added. For this we have just been using `pip install -r requirements.txt`, which normally completes in around 30 seconds when there are no new dependencies to install.

I then tried to switch this to Pipenv: I pre-installed the dependencies in the image using a `Pipfile` and `Pipfile.lock` and ran `pipenv install --deploy --dev --system` against the files. That works fine and I got an image created, but the problem comes when we want to run tests and need to check that the dependencies are up to date first. I've done this using the same `pipenv install --deploy --dev --system` command, and instead of 30 seconds it now takes 5 minutes and 30 seconds! On top of that, the CPU usage is much, much higher.

I've made a small test with the `Pipfile` and `Pipfile.lock` we are using (only slightly modified): https://github.com/Tenzer/pipenv-test. A simple test to run with it is to first install the dependencies, then check afterwards that they are up to date in the local environment, and see how much time and CPU the second operation takes.
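A comparison along these lines can be scripted; the sketch below assumes the test repository above, a requirements.txt exported from the lock file, and a Unix system (for the `resource` module):

```python
import resource
import subprocess
import time

def measure(command):
    """Report wall-clock time and CPU time spent in child processes."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(command, check=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    print("{0}: {1:.1f}s wall, {2:.1f}s CPU".format(" ".join(command), wall, cpu))

# The second run of each command is the interesting one: everything is
# already installed, so it only checks that packages are up to date.
for _ in range(2):
    measure(["pipenv", "install", "--dev"])
for _ in range(2):
    measure(["pip", "install", "-r", "requirements.txt"])
```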
Note that this was run on my laptop rather than our CI system, and with a slightly simpler `Pipfile`, hence it's faster than what I described above. It can however be compared to checking whether all the packages are installed with `pip`.

So according to this non-scientific test, Pipenv takes 36 times as long and uses 94 times more CPU than `pip`.

I know there's a big difference between what's going on under the hood, but my point here is that the vastly longer time and resource usage may be a deal breaker for some with lots of dependencies.

While digging into this, I noticed that Pipenv spawns one `pip` process for each package, and I wonder how much of a slowdown that is compared to `pip` doing everything inside one process. Would it potentially make sense to split the list of dependencies into 16 chunks (or whatever `PIPENV_MAX_SUBPROCESS` is set to), in order to avoid having to spawn 212 `pip` processes, as is the case here?

It might also be that this is all down to `pip`, and that the fix is to make it faster for the operations Pipenv runs. I just thought I would start here and see if there perhaps could be some possible optimisations on the Pipenv side of things.