-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize ignore performance #4120
Conversation
Might there be more details in the log from Travis CI ? |
@karajan1001
|
Thank you, but the question is where to get this. |
@karajan1001, you don't see the 'Details' link near check? If you follow it, there should be build summary, and you should see Jobs and Stages section, where you can see particular builds. |
Codecov Report
@@ Coverage Diff @@
## master #4120 +/- ##
==========================================
+ Coverage 92.60% 92.76% +0.15%
==========================================
Files 162 162
Lines 11335 11278 -57
==========================================
- Hits 10497 10462 -35
+ Misses 838 816 -22
Continue to review full report at Codecov.
|
Needs more tests. And Some Improvements |
For example in No more detail. |
Hi @karajan1001 ! Full logs are starting here https://travis-ci.com/github/iterative/dvc/jobs/354665083#L3357 |
@efiop Thank you, it seems that only members of the DVC have the permission. |
More tests are needed, especially the merging of two ignore files and the influence of the changing of the |
@karajan1001, from your #4120 (comment), it looks like you are able to access it. You have to search manually in that huge log for specific test failure. Or, download the raw log and grep locally. If you are not able to, you can download the log of a specific job with following url:
|
@skshetry Thank you, but, I had tried this on this PR 2 days ago. The result is the same as what is in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 thing that I found.
More tests are needed, especially the merging of two ignore files and the influence of the changing of the dirname .
Besides that LGTM. @karajan1001 Do you intend to introduce those tests in this PR?
Of course. I would like to add every test I could think of according to the gitignore document. It would be a systematic test. Because it may get DVC users in trouble if this algorithm changed frequently. Like what
|
@pared Excuse me, What is this error? Never saw this before, seems not caused by my PR? |
@karajan1001 That does not seem like like an issue with your PR, I got the same error in other repository handled by Travis. Maybe it's their issue. For now, I am restarting the build, lets see how does it go this time. |
@pared Thank you very much. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you are reimplementing or completing implementing GitWildMatchPattern
:
- Is this unavodable?
- If it is then can this be made in
GitWildMatchPattern
? - If it can't then can it be separated from
dvc/ignore
, i.e. any logic not specific to dvc moved separately?
@Suor Thank you.
in
in And if we want to replace these two files in one. We must translate one of them to another's directory. We can choose:
had to be overwritten into
And this is what I had done. |
@karajan1001 Yes, this what I meant. You are implementing some math on top of pattern = join_patterns(
(prefix1, pattern1),
(prefix2, pattern2),
...
) Then this |
@Suor . Thank you. I'm busy at work and can't pay much attention here. |
@karajan1001 NP. This is really nice performance tuning work. |
common_dirname, merged_pattern = merge_patterns( | ||
base_pattern.dirname, | ||
base_pattern.pattern_list, | ||
ignore_pattern.dirname, | ||
ignore_pattern.pattern_list, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we collect everything as we go in. We still leave partial results along the way. This means the overall structure will take quadratic memory depending on tree depth. Is this ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Suor
Yes, the inherited pattern would be duplicated several times if we got several .dvcignore
files. If not so, we would merge all of them at the file's checking. Space or speed? It depends on which is more important, I think.
tests/func/test_ignore.py
Outdated
} | ||
|
||
|
||
def test_dvcignore_pattern_merge(tmp_dir, dvc): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be easier done with a unit test calling merge_patterns
directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests mainly test the pattern trie tree not only merge patterns. I had renamed it.
list comprehension Co-authored-by: Alexander Schepanovski <[email protected]>
Thank you @karajan1001 ! |
fix #3869
.dvcignore
file finding.dvcignore
files.dvcignore
files only influence it sub dirs.❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)
Thank you for the contribution - we'll try to review it as soon as possible. 🙏
This is the benchmark
The first is <1.0.1>, it is the baseline. Because this optimization involving merging of different
.dvcignore
files. I add another Benchmark testingdvc status
with three files each having 10 rules.Second is this PR. Now ignore file check takes little time both in 3 files or 30 files
@Suor you are right, we had rechecked files found via walk, run ignores twice. And it happened within the time when we are walking to find all
.dvcignore
files.dvc/dvc/ignore.py
Line 126 in 83a2afc
This PR solve another issue in
dvc/dvc/ignore.py
Line 135 in 83a2afc
If some files were ignored by some
dvcignore
files, then it could not be reincluded by otherdvcignore
files.Git did not work like this.
And in gitignore