Improve file extractor performance #268

Open
christian-monch opened this issue Jul 18, 2022 · 3 comments · May be fixed by #298

@christian-monch
Collaborator

During a single file-level metadata extraction with legacy extractors, 17 (seventeen) subprocess calls to git are executed. They comprise only 12 distinct commands, 5 of which are executed twice.

Here is a list of the subprocess calls:

Run ['git', 'annex', 'version', '--raw'] (cwd=None) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'annex', 'find', '--copies', '0', '--json', '--json-error-messages', '-c', 'annex.dotfiles=true', '--', '/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep/sub-02/anat/sub-02_label-CSF_probseg.nii.gz'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'annex', 'findref', '--copies', '0', 'HEAD', '--json', '--json-error-messages', '-c', 'annex.dotfiles=true'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'annex', 'whereis', '--json', '--json-error-messages', '--batch'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'config', '-z', '-l', '--file', '.gitmodules'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'config', '-z', '-l', '--file', '.gitmodules'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'ls-files', '--stage', '-z'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'ls-files', '--stage', '-z'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'ls-files', '--stage', '-z', '--', 'sub-02/anat/sub-02_label-CSF_probseg.nii.gz'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'ls-files', '-z', '-m', '-d', '--', 'sub-02/anat/sub-02_label-CSF_probseg.nii.gz'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'ls-tree', 'HEAD', '-z', '-r', '--full-tree', '-l', '--', 'sub-02/anat/sub-02_label-CSF_probseg.nii.gz'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'rev-parse', '--quiet', '--verify', 'HEAD^{commit}'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'rev-parse', '--quiet', '--verify', 'HEAD^{commit}'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', 'config', '-z', '-l', '--show-origin'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', 'config', '-z', '-l', '--show-origin'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', 'config', '-z', '-l', '--show-origin', '--file', '/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep/.datalad/config'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', 'config', '-z', '-l', '--show-origin', '--file', '/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep/.datalad/config'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep)
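A quick way to spot the repeated calls in such a log is to count identical (arguments, cwd) pairs. The following is only a minimal sketch; it assumes the log consists of lines in the "Run [...] (cwd=...)" format shown above, and the helper names count_calls and repeated_calls are made up for illustration.

```python
import ast
import re
from collections import Counter

# Matches log lines of the form: Run [<argv list>] (cwd=<dir>)
RUN_LINE = re.compile(r"^Run (\[.*\]) \(cwd=(.*?)\)\s*$")


def count_calls(log_lines):
    """Count how often each (argv, cwd) pair occurs in the log."""
    counts = Counter()
    for line in log_lines:
        match = RUN_LINE.match(line)
        if match is None:
            continue  # skip anything that is not a "Run" line
        # The argv part is a Python list literal, so it can be parsed safely.
        argv = tuple(ast.literal_eval(match.group(1)))
        counts[(argv, match.group(2))] += 1
    return counts


def repeated_calls(log_lines):
    """Return only the calls that were executed more than once."""
    return {call: n for call, n in count_calls(log_lines).items() if n > 1}
```

Feeding the log lines above into repeated_calls() should list exactly the calls that appear more than once.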

As a first candidate for improvement, a number of git-calls stem from the following function:

def annex_status(annex_repo, paths=None):
    info = annex_repo.get_content_annexinfo(
        paths=paths,
        eval_availability=False,
        init=annex_repo.get_content_annexinfo(
            paths=paths,
            ref="HEAD",
            eval_availability=False,
            init=annex_repo.status(
                paths=paths,
                untracked="no",
                eval_submodule_state="full")
        )
    )
    annex_repo._mark_content_availability(info)
    return info
@christian-monch
Collaborator Author

christian-monch commented Jul 18, 2022

The described behavior partially explains the behavior observed in issue #261.

@christian-monch
Collaborator Author

christian-monch commented Jul 22, 2022

I am currently looking into the following approach (see branch: enh; NB: this is in an experimental state):

  • Create a service that runs commands, caches their results, and returns the cached result if an "identical" command execution is requested from the service
  • Commands are considered to be "identical" if the arguments, the work directory, the environment, and the static input-data are identical
  • Provide the service across processes to support re-use of calculated results throughout a complete pipeline run
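The caching idea in the list above can be sketched in-process as follows. This is not the code from the branch; the class name CachingRunner and its interface are assumptions made for illustration, and the cross-process part (the service) is left out. The cache key is built from the four properties that make two commands "identical".

```python
import hashlib
import json
import os
import subprocess
import sys


class CachingRunner:
    """Hypothetical runner that reuses the result of an identical earlier run."""

    def __init__(self):
        self._cache = {}
        self.executions = 0  # number of real subprocess invocations

    @staticmethod
    def _key(cmd, cwd, env, stdin):
        # Identical means: same arguments, work directory, environment,
        # and static input data.
        payload = json.dumps([list(cmd), cwd, sorted(env.items()), stdin])
        return hashlib.sha256(payload.encode()).hexdigest()

    def run(self, cmd, cwd=None, env=None, stdin=None):
        # env=None means "inherit", so resolve it before building the key.
        effective_env = dict(os.environ if env is None else env)
        key = self._key(cmd, cwd, effective_env, stdin)
        if key not in self._cache:
            proc = subprocess.run(
                cmd, cwd=cwd, env=effective_env, input=stdin,
                capture_output=True, text=True, check=False,
            )
            self.executions += 1
            self._cache[key] = (proc.returncode, proc.stdout, proc.stderr)
        return self._cache[key]


if __name__ == "__main__":
    runner = CachingRunner()
    command = [sys.executable, "-c", "print('hello')"]
    runner.run(command)
    runner.run(command)  # served from the cache, no second process
    print(runner.executions)
```

Note that such a cache is only valid as long as the repository state does not change between calls; invalidation is not handled in this sketch.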

The current implementation consists of three components: a server, a client, and a patcher.

Server

Currently, a simple HTTP server that accepts commands from localhost, executes a command if its result has not yet been calculated, and returns the cached result.

Client

A simple request-based client that provides a remote-execution call that is similar to WitlessRunner.run()

Patcher

A module that provides a method that patches the run method of datalad.runner.gitrunner.GitWitlessRunner to redirect all non-batch and non-generator calls to the server. This transparently delegates command execution to the server.
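The patching technique can be illustrated with a stand-in class. This is a minimal sketch, not the patcher from the branch: DemoRunner is a hypothetical placeholder rather than the datalad runner, and patch_run, remote_run, and should_delegate are names invented here.

```python
import functools


class DemoRunner:
    """Hypothetical stand-in for a runner class; not the datalad API."""

    def run(self, cmd, cwd=None):
        return f"executed {cmd!r} locally"


def patch_run(runner_cls, remote_run, should_delegate):
    """Replace runner_cls.run so that selected calls go to remote_run.

    Returns a function that restores the original method.
    """
    original_run = runner_cls.run

    @functools.wraps(original_run)
    def patched_run(self, cmd, *args, **kwargs):
        # Delegate only the calls the predicate accepts, e.g. calls that
        # are neither batch nor generator calls.
        if should_delegate(cmd, kwargs):
            return remote_run(cmd, *args, **kwargs)
        return original_run(self, cmd, *args, **kwargs)

    runner_cls.run = patched_run

    def restore():
        runner_cls.run = original_run

    return restore
```

Because the method is replaced on the class, existing callers need no changes, which is what makes the delegation transparent.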

@christian-monch
Collaborator Author

christian-monch commented Jul 29, 2022

Dropping the server approach for meta-conduct

I am moving away from the caching command server. The service might be useful in other contexts, but it is not very beneficial here. Although it reduces the number of process invocations in extract-pipelines by roughly 3/5, there are better ways. Nevertheless, the idea of caching elements, shared either via the file system or via network transport, might help to improve speed at the top end.

A better way

I am looking into the following approach: collect as much information as possible in the pipeline providers, with the minimal necessary number of process invocations. Hand the information down to the downstream pipeline elements, i.e. processors and consumers. This can be done via method-call parameters or via an exchange service. This should replace the O(n) process calls in the individual extractor processes with O(log(n)) process calls in the provider (n is the number of files/datasets to run extraction on), at the cost of increased storage requirements.
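A minimal sketch of the provider idea, assuming the information comes from a single git ls-files --stage -z call for the whole dataset: the provider runs the process once, and each processor works only on the data handed to it. The function names and the element layout (a dict with "path" and "index_info") are assumptions for illustration, not the interface used in the branch.

```python
import subprocess


def parse_ls_files_stage(output):
    """Parse `git ls-files --stage -z` output into {path: (mode, sha, stage)}."""
    index = {}
    for record in output.split("\0"):
        if not record:
            continue  # trailing NUL leaves an empty last record
        meta, path = record.split("\t", 1)
        mode, sha, stage = meta.split(" ")
        index[path] = (mode, sha, int(stage))
    return index


def provide(dataset_path, run=subprocess.run):
    """Provider: a single process call covers every file in the dataset."""
    proc = run(
        ["git", "ls-files", "--stage", "-z"],
        cwd=dataset_path, capture_output=True, text=True, check=True,
    )
    for path, info in parse_ls_files_stage(proc.stdout).items():
        # Hand the collected information down with each pipeline element.
        yield {"path": path, "index_info": info}


def process(element):
    """Processor: uses the handed-down information and spawns no process."""
    mode, sha, _stage = element["index_info"]
    return f"{element['path']} {mode} {sha[:7]}"
```

The run parameter exists only so that the process call can be replaced, for example in tests; the trade-off is that the provider has to hold the collected information for all files at once.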

Development is done in the branch enh-reduce-pipeline-processes (NB: this is in an experimental state).
