Lots of identical git processes, --batch and not, operating in the same repository -- expected? #261
Comments
Thx for the issue, looking into it.
@yarikoptic: this is a side effect of the function

```python
def annex_status(annex_repo, paths=None):
    info = annex_repo.get_content_annexinfo(
        paths=paths,
        eval_availability=False,
        init=annex_repo.get_content_annexinfo(
            paths=paths,
            ref="HEAD",
            eval_availability=False,
            init=annex_repo.status(
                paths=paths,
                untracked="no",
                eval_submodule_state="full")
        )
    )
    annex_repo._mark_content_availability(info)
    return info
```

Execution of this function leads to the execution of the following commands:

1. `git -c diff.ignoreSubmodules=none rev-parse --quiet --verify HEAD^{commit}`
2. `git -c diff.ignoreSubmodules=none ls-files --stage -z -- <file-path>`
3. `git -c diff.ignoreSubmodules=none ls-files -z -m -d -- <file-path>`
4. `git -c diff.ignoreSubmodules=none ls-tree HEAD -z -r --full-tree -l -- <file-path>`
5. `git annex version --raw`
6. `git -c diff.ignoreSubmodules=none annex findref --copies 0 HEAD --json --json-error-messages -c annex.dotfiles=true`
7. `git -c diff.ignoreSubmodules=none annex find --copies 0 --json --json-error-messages -c annex.dotfiles=true -- <file-path>`

In words: seven subprocesses for every file-level extraction. The first, the fifth, and the sixth could obviously be cached. The other four contain the path of the file being operated on and have to be executed for every file path. Nevertheless, there might be an opportunity to opportunistically run them on multiple files at once and cache the results, assuming that an extraction is rarely limited to a single file. WDYT?
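A minimal sketch of the caching/batching idea (hypothetical helper names, not MetaLad's actual API): call `annex_status()` once per repository with `paths=None` and serve per-file lookups from the cached mapping, so the seven subprocesses are spawned once per repo rather than once per file. It assumes the returned info mapping is keyed by path objects, as `get_content_annexinfo()` results are.

```python
from pathlib import Path

# Hypothetical sketch, not MetaLad's actual code: run annex_status() once
# per repository and answer per-file queries from the cached result.

_status_cache = {}   # repo path -> info mapping returned by annex_status()

def cached_file_status(annex_repo, file_path):
    info = _status_cache.get(annex_repo.path)
    if info is None:
        # paths=None makes git/git-annex report on the whole work tree.
        info = annex_status(annex_repo, paths=None)
        _status_cache[annex_repo.path] = info
    # Assumption: the mapping is keyed by absolute Path objects; adjust
    # the lookup if the actual keys differ.
    return info.get(Path(file_path).absolute())
```

The obvious caveat is invalidation: a cache entry would have to be dropped whenever the repository (work tree, HEAD, or git-annex branch) changes between extractions.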
BTW: I don't know where the following two processes originate:
long term -- for the extractor I am thinking of #257 or RFing extraction to operate on an entire tree/list of files via
In both long & short term -- what metadata do we aim to extract? Maybe we could just minimize the number of calls to
PS
Thx. Will check the numbers and see whether this is expected (probably not).
I cannot properly analyze the problem because I have no access to
now you (and others in
Thx
I still cannot read
I am looking at
It contains 28.5 million files and directories:

```
time find /mnt/btrfs/datasets-metalad-cm/datalad/crawl|wc -l
28530451

real    62m22.862s
user    0m46.760s
sys     2m10.747s
```

With `.git` entries filtered out:

```
> time find /mnt/btrfs/datasets-metalad-cm/datalad/crawl |grep -v \\\.git|wc -l
20162909

real    2m10.793s
user    0m47.851s
sys     1m10.549s
```

The dataset traverser emits 700 entities per minute. It should therefore take about 20 days to traverse 20,000,000 files. :-( That is not good! I will look into the traverser performance.
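The numbers above also show that roughly 8.4 million of the 28.5 million entries are under `.git`, so the traverser should not descend into those directories at all. A minimal sketch of that pruning, assuming an `os.walk`-style traversal (this is not the actual traverser code):

```python
import os

# Sketch only: prune ".git" directories during traversal instead of
# filtering the results afterwards.  With topdown=True (the default),
# removing entries from dirnames in place keeps os.walk from descending
# into those subtrees.

def iter_dataset_files(top):
    for dirpath, dirnames, filenames in os.walk(top):
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            yield os.path.join(dirpath, name)
```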
fixing that ... check again tomorrow or so -- running across entire archive:

```
(git)smaug:/mnt/datasets/datalad/crawl[master]git
$> echo * | xargs -n 1 -P 4 chmod g+rX -R
```
I have started an attempt to solve this issue with a caching command server, i.e. a process that provides command execution for its clients, caches the results, and returns cached results if the "same" command is executed twice (see #268 (comment)).
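For illustration, the core of the caching logic would look roughly like this (the actual proposal is a separate server process serving multiple clients; the class and method names here are made up, and cache invalidation is deliberately left out):

```python
import subprocess

# Illustrative sketch of the caching idea, not the actual command server:
# results are cached per (working directory, command line), so the "same"
# command is executed at most once.

class CachingRunner:
    def __init__(self):
        self._cache = {}

    def run(self, cmd, cwd):
        key = (cwd, tuple(cmd))
        if key not in self._cache:
            self._cache[key] = subprocess.run(
                cmd, cwd=cwd, capture_output=True, text=True, check=True)
        return self._cache[key]

# Repository-level calls such as `git annex version --raw` would then hit
# the cache after their first execution:
#   runner = CachingRunner()
#   runner.run(["git", "annex", "version", "--raw"], cwd="/some/repo")
```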
Hmm, not sure why such caching is needed (instead of fixing the code so there are no duplicate calls) and when it is safe to use it -- many calls depend on the external state of the repository, etc.
Some time during OHBM 2022 with @jsheunis we started metadata extraction on datasets.datalad.org collection (AKA ///) to populate a data catalog. Nearly a week after, when I came home, I found smaug still sweating (at load ~20) doing that drill. Decided to look into those processes and discovered that they all run for the same dataset:
So there are all those git processes working in the same /mnt/datasets/datalad/crawl/adhd200/surfaces. I wonder if that is expected, e.g. due to multiprocessing or smth like that? I would have expected that with multiprocessing, parallelization would happen across datasets, but I could be wrong. FWIW -- that dataset has a considerable number of files -- almost 300k:
edit: FTR running

```
(git)smaug:/mnt/datasets/datalad/crawl-catalog[master]git $> tools/extract_filelevel ../crawl/ extracts/filelevel-r.json
```
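One quick way to confirm this kind of duplication is to group the running git/git-annex processes by working directory, e.g. with the third-party `psutil` package (a sketch for illustration, not what was used above):

```python
import collections
import psutil  # third-party

# Count git/git-annex processes per working directory to spot a repository
# that is being hit by many identical subprocesses at once.
by_cwd = collections.Counter()
for proc in psutil.process_iter(["name"]):
    try:
        if proc.info["name"] in ("git", "git-annex"):
            by_cwd[proc.cwd()] += 1
    except (psutil.AccessDenied, psutil.NoSuchProcess, psutil.ZombieProcess):
        continue

for cwd, count in by_cwd.most_common(10):
    print(f"{count:5d}  {cwd}")
```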