Improve file extractor performance #268

christian-monch · 2022-07-18T14:32:49Z

During a single file-level metadata extraction with a legacy extractor, 17 (seventeen) subprocess calls to git are executed. Only 12 of them are distinct: 5 commands are executed twice.

Here is a list of the subprocess calls:

Run ['git', 'annex', 'version', '--raw'] (cwd=None) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'annex', 'find', '--copies', '0', '--json', '--json-error-messages', '-c', 'annex.dotfiles=true', '--', '/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep/sub-02/anat/sub-02_label-CSF_probseg.nii.gz'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'annex', 'findref', '--copies', '0', 'HEAD', '--json', '--json-error-messages', '-c', 'annex.dotfiles=true'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'annex', 'whereis', '--json', '--json-error-messages', '--batch'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'config', '-z', '-l', '--file', '.gitmodules'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'config', '-z', '-l', '--file', '.gitmodules'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'ls-files', '--stage', '-z'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'ls-files', '--stage', '-z'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'ls-files', '--stage', '-z', '--', 'sub-02/anat/sub-02_label-CSF_probseg.nii.gz'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'ls-files', '-z', '-m', '-d', '--', 'sub-02/anat/sub-02_label-CSF_probseg.nii.gz'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'ls-tree', 'HEAD', '-z', '-r', '--full-tree', '-l', '--', 'sub-02/anat/sub-02_label-CSF_probseg.nii.gz'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'rev-parse', '--quiet', '--verify', 'HEAD^{commit}'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', '-c', 'diff.ignoreSubmodules=none', 'rev-parse', '--quiet', '--verify', 'HEAD^{commit}'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', 'config', '-z', '-l', '--show-origin'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', 'config', '-z', '-l', '--show-origin'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', 'config', '-z', '-l', '--show-origin', '--file', '/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep/.datalad/config'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep) 
Run ['git', 'config', '-z', '-l', '--show-origin', '--file', '/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep/.datalad/config'] (cwd=/mnt/btrfs/datasets-metalad-cm/datalad/crawl/labs/poldrack/ds003_fmriprep)
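
The duplicates in the list above can be found mechanically. A minimal sketch, assuming log lines of the `Run [...] (cwd=...)` shape shown above; the helper name `count_duplicate_calls`, the shortened sample, and the `/data/ds` directory are illustrative only:

```python
import ast
import re
from collections import Counter

# Matches lines of the form: Run [<python list literal>] (cwd=<dir>)
RUN_LINE = re.compile(r"^Run (\[.*\]) \(cwd=(.*)\)\s*$")


def count_duplicate_calls(log_lines):
    """Return {(argv, cwd): count} for every call that appears more than once."""
    counts = Counter()
    for line in log_lines:
        match = RUN_LINE.match(line)
        if match is None:
            continue  # not a "Run" line
        argv = tuple(ast.literal_eval(match.group(1)))
        counts[(argv, match.group(2))] += 1
    return {call: n for call, n in counts.items() if n > 1}


# Shortened sample with a hypothetical working directory.
sample = [
    "Run ['git', 'annex', 'version', '--raw'] (cwd=None) ",
    "Run ['git', 'config', '-z', '-l', '--show-origin'] (cwd=/data/ds) ",
    "Run ['git', 'config', '-z', '-l', '--show-origin'] (cwd=/data/ds) ",
]
duplicates = count_duplicate_calls(sample)
```

A call counts as a duplicate here only if both the argument list and the working directory match.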

As a first candidate for improvement, a number of git-calls stem from the following function:

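def annex_status(annex_repo, paths=None):
    # Three nested queries, evaluated innermost first. Each result is passed
    # to the next call via `init`, so that call can add to it.
    info = annex_repo.get_content_annexinfo(
        # 3. annex information for the worktree
        paths=paths,
        eval_availability=False,
        init=annex_repo.get_content_annexinfo(
            # 2. annex information for the committed state (HEAD)
            paths=paths,
            ref="HEAD",
            eval_availability=False,
            init=annex_repo.status(
                # 1. git-level status of `paths`, ignoring untracked files
                paths=paths,
                untracked="no",
                eval_submodule_state="full")
        )
    )
    # Availability is marked once, on the combined result.
    annex_repo._mark_content_availability(info)
    return info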


christian-monch · 2022-07-18T19:07:12Z

The described behavior partially explains the behavior observed in issue #261.

christian-monch · 2022-07-22T09:24:00Z

I am currently looking into the following approach (see branch: enh; NB: this is in an experimental state):

Create a service that runs commands, caches their results, and returns the cached result if an "identical" command execution is requested from the service.
Commands are considered "identical" if the arguments, the working directory, the environment, and the static input data are identical.
Provide the service across processes to support reuse of calculated results throughout a complete pipeline run.
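
The caching rule described above can be sketched in a few lines. This is an in-process illustration under stated assumptions, not the code in the branch: the class name `CommandCache`, its `run()` signature, and the choice of a SHA-256 key are all hypothetical, and the cross-process part (the service) is left out.

```python
import hashlib
import json
import os
import subprocess
import sys


class CommandCache:
    """Run commands and reuse the result of identical earlier runs."""

    def __init__(self):
        self._results = {}
        self.executions = 0  # number of real subprocess invocations

    @staticmethod
    def _key(args, cwd, env, stdin_data):
        # Two commands are "identical" if all four components match.
        payload = json.dumps(
            [list(args), cwd, sorted(env.items()), stdin_data.hex()]
        )
        return hashlib.sha256(payload.encode()).hexdigest()

    def run(self, args, cwd=None, env=None, stdin_data=b""):
        env = dict(os.environ) if env is None else dict(env)
        key = self._key(args, cwd, env, stdin_data)
        if key not in self._results:
            completed = subprocess.run(
                args, cwd=cwd, env=env, input=stdin_data,
                capture_output=True, check=False,
            )
            self.executions += 1
            self._results[key] = (
                completed.returncode, completed.stdout, completed.stderr
            )
        return self._results[key]


cache = CommandCache()
command = [sys.executable, "-c", "print('hello')"]
first = cache.run(command)
second = cache.run(command)  # served from the cache, no new process
```

The sketch ignores invalidation: a cached result is only valid as long as the state the command reads (for example, the repository) does not change between calls.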

The current implementation consists of three components: a server, a client, and a patcher.

Server

Currently a simple HTTP server that accepts commands from localhost, executes them if the result has not yet been calculated, and returns the cached result.

Client

A simple request-based client that provides a remote-execution call that is similar to WitlessRunner.run()

Patcher

A module that provides a method that patches the run method of datalad.runner.gitrunner.GitWitlessRunner to redirect all non-batch and non-generator calls to the server. This transparently delegates command execution to the server
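
The patching idea can be illustrated without DataLad. In the sketch below, `DemoRunner` is a hypothetical stand-in for a runner class (it is not `GitWitlessRunner` and does not mirror its signature), and a local dictionary takes the place of the server, so only the redirect mechanism is shown:

```python
import functools


class DemoRunner:
    """Hypothetical stand-in for a runner class; not DataLad's API."""

    calls = 0  # counts how often the real run() body executes

    def run(self, cmd, cwd=None):
        DemoRunner.calls += 1
        return {"cmd": list(cmd), "cwd": cwd, "status": 0}


def patch_run(runner_class, cache):
    """Replace runner_class.run with a wrapper that consults `cache` first."""
    original_run = runner_class.run

    @functools.wraps(original_run)
    def cached_run(self, cmd, cwd=None):
        key = (tuple(cmd), cwd)
        if key not in cache:
            cache[key] = original_run(self, cmd, cwd=cwd)
        return cache[key]

    runner_class.run = cached_run
    return original_run  # lets the caller undo the patch


results = {}
original = patch_run(DemoRunner, results)
runner = DemoRunner()
runner.run(["git", "config", "-l"], cwd="/data/ds")
runner.run(["git", "config", "-l"], cwd="/data/ds")
calls_while_patched = DemoRunner.calls  # the second call hit the cache
DemoRunner.run = original               # restore the unpatched method
```

Because the patch is applied to the class, existing callers need no changes. A real patcher would also have to let batch and generator calls pass through unmodified, as described above.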

christian-monch · 2022-07-29T08:55:30Z

Dropping the server approach for `meta-conduct`

I am moving away from the caching command server. The service might be useful in other contexts, but it is not very beneficial here. Although it reduces the number of process invocations in extract pipelines by roughly 3/5, there are better ways. Nevertheless, the idea of caching elements, shared either via the file system or via network transport, might help to improve speed at the top end.

A better way

I am looking into the following approach: collect as much information as possible, with the minimal necessary number of process invocations, in the pipeline providers. Hand the information down to the downstream pipeline elements, i.e. processors and consumers. This can be done via method-call parameters or via an exchange service. This should replace the O(n) process calls in the individual extractor processes by O(log(n)) process calls in the provider (n is the number of files/datasets to run extraction on), at the cost of increased storage requirements.
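
One way to picture this: the provider issues a single `git ls-files --stage -z` call for the whole dataset and hands the per-file records downstream, instead of each extractor run asking git about its own file. The sketch below is an assumption about what such a provider could look like; the function names `parse_ls_files_stage` and `provide` are hypothetical, and which information the providers should actually collect is not settled in this thread.

```python
import subprocess


def parse_ls_files_stage(output):
    """Parse `git ls-files --stage -z` output into {path: record}.

    Each NUL-terminated record has the form
    "<mode> <object> <stage><TAB><path>".
    """
    records = {}
    for entry in output.split("\0"):
        if not entry:
            continue  # the trailing NUL leaves an empty last element
        meta, path = entry.split("\t", 1)
        mode, object_id, stage = meta.split(" ")
        records[path] = {
            "mode": mode, "object_id": object_id, "stage": int(stage)
        }
    return records


def provide(dataset_path):
    """Yield (path, record) pairs from a single git invocation."""
    completed = subprocess.run(
        ["git", "ls-files", "--stage", "-z"],
        cwd=dataset_path, capture_output=True, text=True, check=True,
    )
    yield from parse_ls_files_stage(completed.stdout).items()


# Parsing demonstrated on a hand-written sample; the object ids are made up.
sample = "\0".join([
    "100644 " + "a" * 40 + " 0\tREADME.md",
    "120000 " + "b" * 40 + " 0\tsub-02/anat/sub-02_T1w.nii.gz",
]) + "\0"
parsed = parse_ls_files_stage(sample)
```

A downstream processor would then receive its record as a parameter, e.g. `for path, record in provide(dataset_path): ...`, rather than starting its own git process.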

Development is done in the branch enh-reduce-pipeline-processes (NB: this is in an experimental state).

christian-monch added the performance label Jul 18, 2022

christian-monch mentioned this issue Jul 22, 2022

Lots of identical git processes , --batch and not, operating in the same repository -- expected? #261

Open

christian-monch linked a pull request Jan 16, 2023 that will close this issue

Reduce pipeline processes #298

Open

christian-monch mentioned this issue Feb 28, 2023

Hackathon 2023-02 topic board #335

Open
