concurrent package validation #43
Comments
You are 100% correct that parallelization would dramatically reduce the runtime. I did play around with adding such a feature but was not pleased with the increased complexity. Since I run this as a nightly cron job, the extra time is not posing a real issue for me at the moment. That being said, if you are interested in parallelizing the validation, I'd be more than happy to review the pull request.
+1 for this feature. It looks like https://github.com/maxpoint/conda-mirror/blob/master/conda_mirror/conda_mirror.py#L359 could be rewritten to a function:

    def func_instead_of_inner_loop_validate(package):
        # ensure the packages in this directory are in the upstream
        # repodata.json
        try:
            package_metadata = package_repodata[package]
        except KeyError:
            logger.warning("%s is not in the upstream index. Removing...",
                           package)
            _remove_package(os.path.join(package_directory, package),
                            reason="Package is not in the repodata index")
        else:
            # validate the integrity of the package, the size of the package and
            # its hashes
            logger.info('Validating package {}'.format(package))
            _validate(os.path.join(package_directory, package),
                      md5=package_metadata.get('md5'),
                      size=package_metadata.get('size'))

(Note that I dropped [...] This allows for using [...]) Then,

    [...]
    import multiprocessing
    [...]
    def _validate_packages(package_repodata, package_directory):
        [...]
        p = multiprocessing.Pool()  # careful! This uses _all_ CPUs
        p.map(func_instead_of_inner_loop_validate, sorted(local_packages))
        p.close()
        p.terminate()
        p.join()
        [...]
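For anyone who wants to try the idea outside of conda-mirror, here is a minimal self-contained sketch of the same pattern. It is not the code from the project: `file_md5`, `validate_one`, `validate_packages`, the `num_workers` parameter and the status strings are all hypothetical names made up for illustration, and the worker only reports a status instead of removing anything. It assumes the repodata entries carry `md5` and `size` keys, as in the snippet above. The extra arguments are bound with `functools.partial` so the worker stays a picklable module-level function.

```python
import functools
import hashlib
import multiprocessing
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_one(package, package_repodata, package_directory):
    """Hypothetical per-package check; returns (package, status)."""
    metadata = package_repodata.get(package)
    if metadata is None:
        return package, "not in index"
    path = os.path.join(package_directory, package)
    # Cheap check first: skip hashing when the size is already wrong.
    if os.path.getsize(path) != metadata.get("size"):
        return package, "size mismatch"
    if file_md5(path) != metadata.get("md5"):
        return package, "md5 mismatch"
    return package, "ok"


def validate_packages(package_repodata, package_directory, num_workers=None):
    """Validate every file in package_directory using a process pool.

    num_workers=None lets multiprocessing use all available CPUs.
    """
    local_packages = sorted(os.listdir(package_directory))
    # Bind the shared arguments so Pool.map only has to pass the package name.
    worker = functools.partial(
        validate_one,
        package_repodata=package_repodata,
        package_directory=package_directory,
    )
    with multiprocessing.Pool(num_workers) as pool:
        # map() blocks until every package has been checked.
        return dict(pool.map(worker, local_packages))
```

When running this as a script, call `validate_packages(...)` from under an `if __name__ == "__main__":` guard, since some start methods re-import the main module in each worker. Passing a small `num_workers` is one way to avoid taking over every CPU on a shared machine.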
@willirath Thanks for the interest. That seems like a sensible solution. Would you be able to submit a PR with this code change and we can discuss its details in that PR?
I am working on it: #45. Won't make it past a very rough sketch today. I'll be busy the next two weeks but will definitely get back to this at the end of April.
Awesome 🎉 ! Thanks for the effort. Ping me on the PR when you'd like some feedback.
Thanks for writing this in the first place. This tool already helped a lot in getting rid of all my "maintain conda behind a tight firewall" problems.
Closed by #48. Thanks for the substantial effort here @willirath
Includes the requested concurrent package validation: adtech-labs/conda-mirror#43
It appears that the vast majority of the run-time is spent validating package digests. In my last couple of tests, it took ~1 hour 15 mins for a single platform of pkgs/free on an ec2 c4.2xlarge instance. This is acceptable but would likely see a near linear speed up with some simple parallelization.