Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

concurrent package validation #43

Closed
jhoblitt opened this issue Mar 27, 2017 · 7 comments
Closed

concurrent package validation #43

jhoblitt opened this issue Mar 27, 2017 · 7 comments

Comments

@jhoblitt
Copy link

It appears that the vast majority of the run-time is spent validating package digests. In my last couple of test, it took ~1 hour 15mins for a single platform of pkgs/free on an ec2 c4.2xlarge instance. This is acceptable but would likely see a near linear speed up with some simple parallelization.

@ericdill
Copy link
Contributor

You are 100% correct that parallelization would dramatically reduce the runtime. I did play around with adding such a feature but was not pleased with the increased complexity. Since I run this as a nightly cron job, the extra time is not posing a real issue for me at the moment. That being said, if you are interested in parallelizing the validation, I'd be more than happy to review the pull request.

@willirath
Copy link
Contributor

+1 for this feature.

It looks like https://github.com/maxpoint/conda-mirror/blob/master/conda_mirror/conda_mirror.py#L359 could be rewritten to a function:

def func_instead_of_inner_loop_validate(package):

    # ensure the packages in this directory are in the upstream
    # repodata.json
    try:
        package_metadata = package_repodata[package]
        except KeyError:
            logger.warning("%s is not in the upstream index. Removing...",
                           package)
            _remove_package(os.path.join(package_directory, package),
                            reason="Package is not in the repodata index")
        else:
            # validate the integrity of the package, the size of the package and
            # its hashes
            logger.info('Validating package {}'.format(package)
            _validate(os.path.join(package_directory, package),
                      md5=package_metadata.get('md5'),
                      size=package_metadata.get('size'))

(Note that I dropped idx for now.)

This allows for using map(func_instead_of_inner_loop_validate, sorted(local_packages)) instead of the loop.

Then, multiprocessing.map is all we need:

[...]
import multiprocessing
[...]
def _validate_packages(package_repodata, package_directory):
    [...]
    p = multiprocessing.Pool() # careful! This uses _all_ CPUs
    p.map(func_instead_of_inner_loop_validate, sorted(local_packages))
    p.close()
    p.terminate()
    p.join()
    
[...]

@ericdill
Copy link
Contributor

ericdill commented Apr 5, 2017

@willirath Thanks for the interest. That seems like a sensible solution. Would you be able to submit a PR with this code change and we can discuss its details in that PR?

@willirath
Copy link
Contributor

I am working on it: #45. Won't make it past a very rough sketch today. I'll be busy the next two weeks but definitely get back to this at the end of April.

@ericdill
Copy link
Contributor

ericdill commented Apr 5, 2017

Awesome 🎉 ! Thanks for the effort. Ping me on the PR when you'd like some feedback.

@willirath
Copy link
Contributor

Thanks for writing this in the first place. This tool already helped a log in getting rid of all my "maintain conda behind a tight firewall" problems.

@ericdill
Copy link
Contributor

Closed by #48 . Thanks for the substantial effort here @willirath

jhoblitt added a commit to jhoblitt/terraform-conda-mirror that referenced this issue Sep 6, 2017
Includes the requested concurrent package validation:

adtech-labs/conda-mirror#43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants