Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize ignore performance #4120

Merged
merged 19 commits into from
Jul 14, 2020
Merged

Conversation

karajan1001
Copy link
Contributor

@karajan1001 karajan1001 commented Jun 26, 2020

fix #3869

  1. Remove unnecessary file check during .dvcignore file finding
  2. Merge ignore rules in different .dvcignore files
  3. Use a trie tree to make .dvcignore files only influence it sub dirs.
  • ❗ I have followed the Contributing to DVC checklist.

  • 📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

This is the benchmark

Screenshot from 2020-06-27 12-08-12

The first is <1.0.1>, it is the baseline. Because this optimization involving merging of different .dvcignore files. I add another Benchmark testing dvc status with three files each having 10 rules.

Second is this PR. Now ignore file check takes little time both in 3 files or 30 files
@Suor you are right, we had rechecked files found via walk, run ignores twice. And it happened within the time when we are walking to find all .dvcignore files.

dirs[:], files[:] = self(root, dirs, files)

This PR solve another issue in

dirs, files = ignore(root, dirs, files)

If some files were ignored by some dvcignore files, then it could not be reincluded by other dvcignore files.

Git did not work like this.
Screenshot from 2020-06-27 11-49-30

And in gitignore

Patterns read from a .gitignore file in the same directory as the path, or in any parent directory, with patterns in the higher level> files (up to the toplevel of the work tree) being overridden by those in lower level files down to the directory containing the file. These patterns match relative to the location of the .gitignore file. A project normally includes such .gitignore files in its repository, containing patterns for files generated as part of the project build.

@efiop efiop requested a review from pared June 26, 2020 14:21
@karajan1001
Copy link
Contributor Author

karajan1001 commented Jun 27, 2020

(0.00 durations hidden.  Use -vv to show these durations.)
=========================== short test summary info ===========================
FAILED tests/func/test_ignore.py::test_multi_ignore_file - AssertionError: as...
===== 1 failed, 1323 passed, 89 skipped, 8 warnings in 713.38s (0:11:53) ======
The command "bash ./scripts/ci/script.sh" exited with 1.
cache.2
store build cache

Might there be more details in the log from Travis CI ?

@karajan1001 karajan1001 changed the title Fix ignore performance Optimize ignore performance Jun 27, 2020
@pared
Copy link
Contributor

pared commented Jun 27, 2020

@karajan1001
The log from Travis indicates that test_ignore_multi failed.

   def test_multi_ignore_file(tmp_dir, dvc, monkeypatch):
        tmp_dir.gen({"dir": {"subdir": {"should_ignore": "1", "not_ignore": "1"}}})
        tmp_dir.gen(DvcIgnore.DVCIGNORE_FILE, "dir/subdir/*_ignore")
        tmp_dir.gen({"dir": {DvcIgnore.DVCIGNORE_FILE: "!subdir/not_ignore"}})
    
>       assert _files_set("dir", dvc.tree) == {
            "dir/subdir/not_ignore",
            "dir/{}".format(DvcIgnore.DVCIGNORE_FILE),
        }
E       AssertionError: assert {'dir/.dvcignore'} == {'dir/.dvcign...r/not_ignore'}
E         Extra items in the right set:
E         'dir/subdir/not_ignore'
E         Full diff:
E         - {'dir/.dvcignore', 'dir/subdir/not_ignore'}
E         + {'dir/.dvcignore'}

@karajan1001
Copy link
Contributor Author

@karajan1001
The log from Travis indicates that test_ignore_multi failed.

   def test_multi_ignore_file(tmp_dir, dvc, monkeypatch):
        tmp_dir.gen({"dir": {"subdir": {"should_ignore": "1", "not_ignore": "1"}}})
        tmp_dir.gen(DvcIgnore.DVCIGNORE_FILE, "dir/subdir/*_ignore")
        tmp_dir.gen({"dir": {DvcIgnore.DVCIGNORE_FILE: "!subdir/not_ignore"}})
    
>       assert _files_set("dir", dvc.tree) == {
            "dir/subdir/not_ignore",
            "dir/{}".format(DvcIgnore.DVCIGNORE_FILE),
        }
E       AssertionError: assert {'dir/.dvcignore'} == {'dir/.dvcign...r/not_ignore'}
E         Extra items in the right set:
E         'dir/subdir/not_ignore'
E         Full diff:
E         - {'dir/.dvcignore', 'dir/subdir/not_ignore'}
E         + {'dir/.dvcignore'}

Thank you, but the question is where to get this.

@pared
Copy link
Contributor

pared commented Jun 27, 2020

@karajan1001, you don't see the 'Details' link near check?
travis

If you follow it, there should be build summary, and you should see Jobs and Stages section, where you can see particular builds.

@codecov
Copy link

codecov bot commented Jun 27, 2020

Codecov Report

Merging #4120 into master will increase coverage by 0.15%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4120      +/-   ##
==========================================
+ Coverage   92.60%   92.76%   +0.15%     
==========================================
  Files         162      162              
  Lines       11335    11278      -57     
==========================================
- Hits        10497    10462      -35     
+ Misses        838      816      -22     
Impacted Files Coverage Δ
dvc/repo/update.py 88.23% <0.00%> (-5.89%) ⬇️
dvc/dependency/repo.py 96.87% <0.00%> (-1.57%) ⬇️
dvc/remote/azure.py 93.58% <0.00%> (-0.67%) ⬇️
dvc/cli.py 94.11% <0.00%> (-0.43%) ⬇️
dvc/repo/tree.py 92.55% <0.00%> (-0.41%) ⬇️
dvc/scm/tree.py 95.74% <0.00%> (-0.26%) ⬇️
dvc/external_repo.py 86.59% <0.00%> (-0.14%) ⬇️
dvc/config.py 97.71% <0.00%> (-0.14%) ⬇️
dvc/repo/plots/template.py 96.42% <0.00%> (-0.10%) ⬇️
dvc/command/git_hook.py 95.74% <0.00%> (-0.09%) ⬇️
... and 41 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 66def03...3791998. Read the comment docs.

@karajan1001
Copy link
Contributor Author

Needs more tests. And Some Improvements

@karajan1001
Copy link
Contributor Author

If you follow it, there should be build summary, and you should see Jobs and Stages section, where you can see particular builds.
@pared

For example in
https://travis-ci.com/github/iterative/dvc/jobs/354665083

I can only get this.
image

No more detail.

@efiop
Copy link
Contributor

efiop commented Jun 29, 2020

Hi @karajan1001 ! Full logs are starting here https://travis-ci.com/github/iterative/dvc/jobs/354665083#L3357

@karajan1001
Copy link
Contributor Author

Hi @karajan1001 ! Full logs are starting here https://travis-ci.com/github/iterative/dvc/jobs/354665083#L3357

@efiop Thank you, it seems that only members of the DVC have the permission.
image

@karajan1001
Copy link
Contributor Author

karajan1001 commented Jun 29, 2020

More tests are needed, especially the merging of two ignore files and the influence of the changing of the dirname .

@skshetry
Copy link
Member

@karajan1001, from your #4120 (comment), it looks like you are able to access it. You have to search manually in that huge log for specific test failure. Or, download the raw log and grep locally.

If you are not able to, you can download the log of a specific job with following url:

https://api.travis-ci.com/v3/job/{job-id}/log.txt (subsitute job id)

Eg: https://api.travis-ci.com/v3/job/354665083/log.txt

@karajan1001
Copy link
Contributor Author

@karajan1001, from your #4120 (comment), it looks like you are able to access it. You have to search manually in that huge log for specific test failure. Or, download the raw log and grep locally.

If you are not able to, you can download the log of a specific job with following url:

https://api.travis-ci.com/v3/job/{job-id}/log.txt (subsitute job id)

Eg: https://api.travis-ci.com/v3/job/354665083/log.txt

@skshetry Thank you, but, I had tried this on this PR 2 days ago. The result is the same as what is in the https://api.travis-ci.com/v3/job/354665083/log.txt

image

Copy link
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 thing that I found.

More tests are needed, especially the merging of two ignore files and the influence of the changing of the dirname .

Besides that LGTM. @karajan1001 Do you intend to introduce those tests in this PR?

dvc/ignore.py Outdated Show resolved Hide resolved
@karajan1001
Copy link
Contributor Author

karajan1001 commented Jun 29, 2020

1 thing that I found.

More tests are needed, especially the merging of two ignore files and the influence of the changing of the dirname .

Besides that LGTM. @karajan1001 Do you intend to introduce those tests in this PR?

Of course. I would like to add every test I could think of according to the gitignore document. It would be a systematic test.

Because it may get DVC users in trouble if this algorithm changed frequently. Like what PathSpec's author said in cpburnz/python-path-specification#19

I suspect the discrepancy is that a .gitignore does additional filtering beyond that of Git's wildmatch algorithm. If I fixed this in my implementation of wildmatch, it could unexpectedly change the results of anyone already using this for a purpose other than emulating .gitingore.

@karajan1001
Copy link
Contributor Author

Screenshot from 2020-07-05 16-06-01

Current benchmark result

@karajan1001
Copy link
Contributor Author

karajan1001 commented Jul 5, 2020

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received
The build has been terminated

@pared Excuse me, What is this error? Never saw this before, seems not caused by my PR?

@pared
Copy link
Contributor

pared commented Jul 6, 2020

@karajan1001 That does not seem like like an issue with your PR, I got the same error in other repository handled by Travis. Maybe it's their issue. For now, I am restarting the build, lets see how does it go this time.

@karajan1001
Copy link
Contributor Author

@karajan1001 That does not seem like like an issue with your PR, I got the same error in other repository handled by Travis. Maybe it's their issue. For now, I am restarting the build, lets see how does it go this time.

@pared Thank you very much.

Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you are reimplementing or completing implementing GitWildMatchPattern:

  1. Is this unavodable?
  2. If it is then can this be made in GitWildMatchPattern?
  3. If it can't then can it be separated from dvc/ignore, i.e. any logic not specific to dvc moved separately?

dvc/ignore.py Show resolved Hide resolved
dvc/ignore.py Outdated Show resolved Hide resolved
dvc/ignore.py Outdated Show resolved Hide resolved
dvc/ignore.py Outdated Show resolved Hide resolved
dvc/ignore.py Outdated Show resolved Hide resolved
@karajan1001
Copy link
Contributor Author

karajan1001 commented Jul 6, 2020

It looks like you are reimplementing or completing implementing GitWildMatchPattern:

  1. Is this unavodable?
  2. If it is then can this be made in GitWildMatchPattern?
  3. If it can't then can it be separated from dvc/ignore, i.e. any logic not specific to dvc moved separately?

@Suor Thank you.
I learned a lot from your following advices.
But, I didn't reimplement GitWildMatchPattern. What I'm doing is only changing the relative path of the .dvcignore file. And this is required if we merged two .dvcignore rules in different paths.
For example, if we had

a
b

in /home/.dvcignore
and

!aa
!bb

in /home/dir/.dvcignore

And if we want to replace these two files in one. We must translate one of them to another's directory.

We can choose:

  1. change /home/.dvcignore into /home/dir but it is harder to implement
**/dir/a

had to be overwritten into

**/dir/a # match /home/dir/dir/a
/a # match /home/a
  1. change /home/dir/.dvcignore into /home/ and then it would be
!/dir/**/aa
!/dir/**/bb

And this is what I had done.

@Suor
Copy link
Contributor

Suor commented Jul 6, 2020

@karajan1001 Yes, this what I meant. You are implementing some math on top of GitWildMatchPattern. It looks like pathspec is a proper place to place this math or at least make it separate from dvc/ignore, i.e. make some pathspec_math.py with all those ops, which might be used from dvc/ignore:

pattern = join_patterns(
    (prefix1, pattern1),
    (prefix2, pattern2),
    ...
)

Then this pathspec_math might be separated or even contributed back to pathspec.

@karajan1001
Copy link
Contributor Author

@karajan1001 Yes, this what I meant. You are implementing some math on top of GitWildMatchPattern. It looks like pathspec is a proper place to place this math or at least make it separate from dvc/ignore, i.e. make some pathspec_math.py with all those ops, which might be used from dvc/ignore:

pattern = join_patterns(
    (prefix1, pattern1),
    (prefix2, pattern2),
    ...
)

Then this pathspec_math might be separated or even contributed back to pathspec.

@Suor . Thank you.
Sorry that I overlooked point 3.

I'm busy at work and can't pay much attention here.

@Suor
Copy link
Contributor

Suor commented Jul 6, 2020

@karajan1001 NP. This is really nice performance tuning work.

dvc/pathspec_math.py Outdated Show resolved Hide resolved
@pared pared requested a review from Suor July 10, 2020 12:59
dvc/pathspec_math.py Outdated Show resolved Hide resolved
dvc/pathspec_math.py Outdated Show resolved Hide resolved
dvc/pathspec_math.py Outdated Show resolved Hide resolved
dvc/ignore.py Outdated Show resolved Hide resolved
Comment on lines +125 to +130
common_dirname, merged_pattern = merge_patterns(
base_pattern.dirname,
base_pattern.pattern_list,
ignore_pattern.dirname,
ignore_pattern.pattern_list,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we collect everything as we go in. We still leave partial results along the way. This means the overall structure will take quadratic memory depending on tree depth. Is this ok?

Copy link
Contributor Author

@karajan1001 karajan1001 Jul 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Suor
Yes, the inherited pattern would be duplicated several times if we got several .dvcignore files. If not so, we would merge all of them at the file's checking. Space or speed? It depends on which is more important, I think.

}


def test_dvcignore_pattern_merge(tmp_dir, dvc):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be easier done with a unit test calling merge_patterns directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests mainly test the pattern trie tree not only merge patterns. I had renamed it.

karajan1001 and others added 2 commits July 13, 2020 17:54
list comprehension

Co-authored-by: Alexander Schepanovski <[email protected]>
dvc/ignore.py Outdated Show resolved Hide resolved
dvc/ignore.py Outdated Show resolved Hide resolved
@efiop
Copy link
Contributor

efiop commented Jul 14, 2020

Thank you @karajan1001 !

@efiop efiop merged commit 58f6534 into iterative:master Jul 14, 2020
@karajan1001 karajan1001 deleted the fix_ignore_performance branch April 7, 2021 07:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

optimize dvcignore
5 participants