Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add ..HEAD to since statement if defined upon publish #4448

Merged
merged 3 commits into from
Sep 7, 2020

Conversation

yarikoptic
Copy link
Member

This should avoid lengthy annotatepath in heavy hierarchies while trying to
look for possible changes in the wortree.

I think this would not be entirelly correct in case if subdataset is
progressed forward but change was not recorded in the superdaset yet, but it
might work as desired (publish modified in its forwarded on remote end as well)
which might be current behavior as well anyways (did not test).

Might be that it Closes #4446 ;-)

@yarikoptic
Copy link
Member Author

FWIW it does publish reasonably faster for me.

Additionally I had spotted when running -l debug that it for some reason still does something to subsubdatasets which it should not even bother looking at since they were not changed thus with correct --since functioning `publish` should not have anything to do with them:
$> datalad -l debug publish --since= --to=datalad-public --missing=inherit -r
...
[DEBUG  ] Query repo: ['git', 'ls-files', '--stage', '-z', '-d', '-m', '--exclude-standard'] 
[DEBUG  ] Done query repo: ['git', 'ls-files', '--stage', '-z', '-d', '-m', '--exclude-standard'] 
[DEBUG  ] Done <AnnexRepo path=/mnt/datasets/datalad/crawl/openneuro (<class 'datalad.support.annexrepo.AnnexRepo'>)>.get_content_info(...) 
[DEBUG  ] Parsed version of PathRI '/mnt/datasets/datalad/crawl/openneuro/ds000001' differs from original PosixPath('/mnt/datasets/datalad/crawl/openneuro/ds000001') 
[DEBUG  ] Parsed version of PathRI '/mnt/datasets/datalad/crawl/openneuro/ds000002' differs from original PosixPath('/mnt/datasets/datalad/crawl/openneuro/ds000002') 
[DEBUG  ] Parsed version of PathRI '/mnt/datasets/datalad/crawl/openneuro/ds000003' differs from original PosixPath('/mnt/datasets/datalad/crawl/openneuro/ds000003') 
[DEBUG  ] Parsed version of PathRI '/mnt/datasets/datalad/crawl/openneuro/ds000005' differs from original PosixPath('/mnt/datasets/datalad/crawl/openneuro/ds000005') 
...

which takes notable amount of time to get through all hundreds. My wild guess is that it is probably due to diff coming up with full "records" for every subdataset, even when later they are not considered for actual action.

@codecov
Copy link

codecov bot commented Apr 27, 2020

Codecov Report

Merging #4448 into maint will increase coverage by 0.02%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##            maint    #4448      +/-   ##
==========================================
+ Coverage   89.62%   89.65%   +0.02%     
==========================================
  Files         288      288              
  Lines       40359    40361       +2     
==========================================
+ Hits        36173    36185      +12     
+ Misses       4186     4176      -10     
Impacted Files Coverage Δ
datalad/distribution/publish.py 88.92% <100.00%> (+0.11%) ⬆️
datalad/distribution/tests/test_publish.py 100.00% <100.00%> (ø)
datalad/interface/diff.py 93.16% <0.00%> (-0.63%) ⬇️
datalad/tests/test_utils.py 95.59% <0.00%> (+0.27%) ⬆️
datalad/support/tests/test_annexrepo.py 96.73% <0.00%> (+0.38%) ⬆️
datalad/core/local/tests/test_save.py 97.96% <0.00%> (+0.45%) ⬆️
datalad/distribution/create_sibling.py 85.59% <0.00%> (+0.56%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eac6cfc...2a3abc0. Read the comment docs.

@kyleam
Copy link
Contributor

kyleam commented Apr 28, 2020

This should avoid lengthy annotatepath in heavy hierarchies while trying to
look for possible changes in the wortree.

Yeah, I think comparing to HEAD rather than the working tree is preferable here, even without considering the improved performance. FWIW push compares to HEAD.

I think this would not be entirelly correct in case if subdataset is
progressed forward but change was not recorded in the superdaset yet, but it
might work as desired (publish modified in its forwarded on remote end as well)
which might be current behavior as well anyways (did not test).

I'm not sure I get what you're trying to say here, but I'd feel okay about assuming that no one is relying on the (mis)behavior of publish pushing out unregistered changes to subdatasets (if that is indeed what it does).

@yarikoptic
Copy link
Member Author

I think this would not be entirelly correct in case if subdataset is
progressed forward but change was not recorded in the superdaset yet, but it
might work as desired (publish modified in its forwarded on remote end as well)
which might be current behavior as well anyways (did not test).

I'm not sure I get what you're trying to say here, but I'd feel okay about assuming that no one is relying on the (mis)behavior of publish pushing out unregistered changes to subdatasets (if that is indeed what it does).

Following script in the best traditions of @kyleam demonstrates it
#!/bin/bash

set -eu
cd "$(mktemp -d ${TMPDIR:-/tmp}/dl-XXXXXXX)"

datalad create src
datalad create -d src src/subds

(
cd src
datalad create-sibling -r -s target ../target
# now modify subds without saving updated state in superdataset
touch subds/file
datalad save -d subds -m "added a file" -t 0.1
datalad publish --to target -r
)

git -C target status
by resulting at the end in
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   subds (new commits)

no changes added to commit (use "git add" and/or "git commit -a")
But then messing with this script on this branch (based on maint btw although targetting master, so added to use ssh remote) and adding `--since=`
#!/bin/bash

set -eu
cd "$(mktemp -d ${TMPDIR:-/tmp}/dl-XXXXXXX)"

datalad create src
datalad create -d src src/subds

(
cd src
datalad create-sibling -r -s target localhost:$PWD/../target
# now modify subds without saving updated state in superdataset
touch subds/file
datalad save -d subds -m "added a file" -t 0.1
datalad publish --since= --to target -r
)

git -C target status

results in

[WARNING] fatal: bad revision 'target/master..HEAD' [diff(/home/yoh/.tmp/dl-AKim06i/src)] 
[WARNING] fatal: bad revision 'target/master..HEAD' [publish(/home/yoh/.tmp/dl-AKim06i/src)] 
publish(impossible): . (dataset) [fatal: bad revision 'target/master..HEAD']
[WARNING] fatal: bad revision 'target/master..HEAD' [diff(/home/yoh/.tmp/dl-AKim06i/src)] 
[WARNING] fatal: bad revision 'target/master..HEAD' [publish(/home/yoh/.tmp/dl-AKim06i/src)] 
publish(impossible): . (dataset) [fatal: bad revision 'target/master..HEAD']

and overall run failing... so I need to tune up this PR to work in those cases :-/ and probably ideally irrespectfully of --since been specified or not. heh heh

@mih
Copy link
Member

mih commented Apr 29, 2020

FTR: I have no objections to anything concerning publish(). In master it is a leaf command that isn't used outside the tests.

@yarikoptic
Copy link
Member Author

d'oh -- need to get back to this PR... current "performance" is simply not acceptable -- takes over 10 minutes to publish datasets.datalad.org whenever there is nothing to publish according to --since

This should avoid lengthy annotatepath in heavy hierarchies while trying to
look for possible changes in the wortree.

I think this would not be entirelly correct in case if subdataset is
progressed forward but change was not recorded in the superdaset yet, but it
might work as desired (publish modified in its forwarded on remote end as well)
which might be current behavior as well anyways (did not tesT)
@yarikoptic
Copy link
Member Author

those bad revision happen also with current maint version : so adding ..HEAD is not a problem -- it is just the fact that there is no remote branch yet (we have not pushed before). In such cases effective since should be None... will push a fix shortly

I have contributed to existing test -- it should not affect the tested below
logic
… effects

we used to not support paths for create-sibling, that is why used ssh.
No longer needed, and might speed up running the test in some scenarios
and AFAIK now should work on Windows (did not try though yet)
@yarikoptic yarikoptic added the merge-if-ok OP considers this work done, and requests review/merge label Sep 4, 2020
@yarikoptic yarikoptic changed the base branch from master to maint September 4, 2020 17:53
@kyleam kyleam merged commit b38acb5 into datalad:maint Sep 7, 2020
@yarikoptic yarikoptic deleted the opt-publish-since= branch October 7, 2020 15:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-if-ok OP considers this work done, and requests review/merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

publish should not assess worktree but use HEAD
3 participants