Lots of identical git processes, --batch and not, operating in the same repository -- expected? #261
Comments
Thx for the issue, looking into it.
@yarikoptic: this is a side effect of the function

```python
def annex_status(annex_repo, paths=None):
    info = annex_repo.get_content_annexinfo(
        paths=paths,
        eval_availability=False,
        init=annex_repo.get_content_annexinfo(
            paths=paths,
            ref="HEAD",
            eval_availability=False,
            init=annex_repo.status(
                paths=paths,
                untracked="no",
                eval_submodule_state="full")
        )
    )
    annex_repo._mark_content_availability(info)
    return info
```

Execution of this function leads to the execution of the following commands:

1. `git -c diff.ignoreSubmodules=none rev-parse --quiet --verify HEAD^{commit}`
2. `git -c diff.ignoreSubmodules=none ls-files --stage -z -- <file-path>`
3. `git -c diff.ignoreSubmodules=none ls-files -z -m -d -- <file-path>`
4. `git -c diff.ignoreSubmodules=none ls-tree HEAD -z -r --full-tree -l -- <file-path>`
5. `git annex version --raw`
6. `git -c diff.ignoreSubmodules=none annex findref --copies 0 HEAD --json --json-error-messages -c annex.dotfiles=true`
7. `git -c diff.ignoreSubmodules=none annex find --copies 0 --json --json-error-messages -c annex.dotfiles=true -- <file-path>`

In words: seven subprocesses for every file-level extraction. The first, the fifth, and the sixth could obviously be cached. The other four contain the path of the file being operated on and have to be executed for every file path. Nevertheless, there might be an opportunity to opportunistically run them on multiple files at once and cache the results, assuming that an extraction is rarely limited to a single file. WDYT?
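A minimal sketch of the caching/batching idea (hypothetical helper names, not MetaLad's actual API): call `annex_status()` once per repository with `paths=None` and serve per-file lookups from the cached mapping, so the seven subprocesses are spawned once per repo rather than once per file. It assumes the returned info mapping is keyed by path objects, as `get_content_annexinfo()` results are.

```python
from pathlib import Path

# Hypothetical sketch, not MetaLad's actual code: run annex_status() once
# per repository and answer per-file queries from the cached result.

_status_cache = {}   # repo path -> info mapping returned by annex_status()

def cached_file_status(annex_repo, file_path):
    info = _status_cache.get(annex_repo.path)
    if info is None:
        # paths=None makes git/git-annex report on the whole work tree.
        info = annex_status(annex_repo, paths=None)
        _status_cache[annex_repo.path] = info
    # Assumption: the mapping is keyed by absolute Path objects; adjust
    # the lookup if the actual keys differ.
    return info.get(Path(file_path).absolute())
```

The obvious caveat is invalidation: a cache entry would have to be dropped whenever the repository (work tree, HEAD, or git-annex branch) changes between extractions.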
BTW: I don't know where the following two processes originate:
long term -- for the extractor I am thinking of #257 or RFing extraction to operate on an entire tree/list of files via
In both long & short term -- what metadata do we aim to extract? Maybe we could just minimize the number of calls to
PS
Thx. Will check the numbers and see whether this is expected (probably not).
I cannot properly analyze the problem because I have no access to
now you (and others in
Thx
I still cannot read
I am looking at
It contains 28.5 million files and directories:

```
time find /mnt/btrfs/datasets-metalad-cm/datalad/crawl|wc -l
28530451

real    62m22.862s
user    0m46.760s
sys     2m10.747s
```

With `.git` entries filtered out:

```
> time find /mnt/btrfs/datasets-metalad-cm/datalad/crawl |grep -v \\\.git|wc -l
20162909

real    2m10.793s
user    0m47.851s
sys     1m10.549s
```

The dataset traverser emits 700 entities per minute. It should therefore take about 20 days to traverse 20,000,000 files. :-( That is not good! I will look into the traverser performance.
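The numbers above also show that roughly 8.4 million of the 28.5 million entries are under `.git`, so the traverser should not descend into those directories at all. A minimal sketch of that pruning, assuming an `os.walk`-style traversal (this is not the actual traverser code):

```python
import os

# Sketch only: prune ".git" directories during traversal instead of
# filtering the results afterwards.  With topdown=True (the default),
# removing entries from dirnames in place keeps os.walk from descending
# into those subtrees.

def iter_dataset_files(top):
    for dirpath, dirnames, filenames in os.walk(top):
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            yield os.path.join(dirpath, name)
```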
fixing that ... check again tomorrow or so -- running across entire archive:

```
(git)smaug:/mnt/datasets/datalad/crawl[master]git
$> echo * | xargs -n 1 -P 4 chmod g+rX -R
```
I have started an attempt to solve this issue with a caching command server, i.e. a process that provides command execution for its clients, caches the results, and returns cached results if the "same" command is executed twice (see #268 (comment)).
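For illustration, the core of the caching logic would look roughly like this (the actual proposal is a separate server process serving multiple clients; the class and method names here are made up, and cache invalidation is deliberately left out):

```python
import subprocess

# Illustrative sketch of the caching idea, not the actual command server:
# results are cached per (working directory, command line), so the "same"
# command is executed at most once.

class CachingRunner:
    def __init__(self):
        self._cache = {}

    def run(self, cmd, cwd):
        key = (cwd, tuple(cmd))
        if key not in self._cache:
            self._cache[key] = subprocess.run(
                cmd, cwd=cwd, capture_output=True, text=True, check=True)
        return self._cache[key]

# Repository-level calls such as `git annex version --raw` would then hit
# the cache after their first execution:
#   runner = CachingRunner()
#   runner.run(["git", "annex", "version", "--raw"], cwd="/some/repo")
```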
Hmm, not sure why such caching is needed (instead of fixing the code so there are no duplicate calls) and when it is safe to use it -- many calls depend on the external state of the repository, etc.
Some time during OHBM 2022 with @jsheunis we started metadata extraction on datasets.datalad.org collection (AKA ///) to populate a data catalog. Nearly a week after, when I came home, I found smaug still sweating (at load ~20) doing that drill. Decided to look into those processes and discovered that they all run for the same dataset:
So there are all those git processes working in the same /mnt/datasets/datalad/crawl/adhd200/surfaces. I wonder if that is expected, e.g. due to multiprocessing or smth like that? I would have expected that with multiprocessing, parallelization would happen across datasets, but I could be wrong. FWIW -- that dataset has a considerable number of files -- almost 300k:
edit: FTR running

```
(git)smaug:/mnt/datasets/datalad/crawl-catalog[master]git $> tools/extract_filelevel ../crawl/ extracts/filelevel-r.json
```
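One quick way to confirm this kind of duplication is to group the running git/git-annex processes by working directory, e.g. with the third-party `psutil` package (a sketch for illustration, not what was used above):

```python
import collections
import psutil  # third-party

# Count git/git-annex processes per working directory to spot a repository
# that is being hit by many identical subprocesses at once.
by_cwd = collections.Counter()
for proc in psutil.process_iter(["name"]):
    try:
        if proc.info["name"] in ("git", "git-annex"):
            by_cwd[proc.cwd()] += 1
    except (psutil.AccessDenied, psutil.NoSuchProcess, psutil.ZombieProcess):
        continue

for cwd, count in by_cwd.most_common(10):
    print(f"{count:5d}  {cwd}")
```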