Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvement of dvcignore #3967

Merged
merged 11 commits into from
Jun 15, 2020
Merged

Conversation

karajan1001
Copy link
Contributor

@karajan1001 karajan1001 commented Jun 7, 2020

Fix #3869

  • ❗ I have followed the Contributing to DVC checklist.

  • πŸ“– If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

karajan1001 and others added 7 commits May 13, 2020 21:59
@efiop efiop requested a review from pared June 8, 2020 08:46
Copy link
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, though I would not close #3869 just yet.
@karajan1001 you mentioned the flashtext library, and I could not find the description of the experiment conducted, that gave the author conclusion that it makes sense to use trie-based approach only above some number of keywords. What about matching few gitignore patterns against hundreds of files? Maybe there could be gain too.

Also, I wonder whether re.compile has limit of pattern size. Checked against 100k, seems to not be a problem. I did not find resources pointing to the limit.

@pared
Copy link
Contributor

pared commented Jun 8, 2020

@karajan1001 Windows build is failing.

dvc/ignore.py Outdated Show resolved Hide resolved
@karajan1001
Copy link
Contributor Author

karajan1001 commented Jun 10, 2020

LGTM, though I would not close #3869 just yet.
@karajan1001 you mentioned the flashtext library, and I could not find the description of the experiment conducted, that gave the author conclusion that it makes sense to use trie-based approach only above some number of keywords. What about matching few gitignore patterns against hundreds of files? Maybe there could be gain too.

Also, I wonder whether re.compile has limit of pattern size. Checked against 100k, seems to not be a problem. I did not find resources pointing to the limit.

Maybe this is because regex is written in C, and flashtext is in Python. flashtext has a lower level O() but a much bigger constant term. Use a C version of Aho-Corasick Automata (variation of trie-tree) would beat regex in any cases.

@karajan1001
Copy link
Contributor Author

karajan1001 commented Jun 13, 2020

@pared @Suor
Because there is still something wrong with my ASV. I can only give a command-line result
Screenshot from 2020-06-13 20-40-30
This is the current version.

[ 81.82%] Β· For dvc commit fc42ca72 <1.0.0a1>:
[ 90.34%] Β·Β·Β· status.DVCIgnoreBench.time_status                                                                                                                          501Β±3ms
[ 90.91%] Β·Β·Β· status.DVCStatusBench.time_status                                                                                                                         84.3Β±2ms

This is the result after this submit.

[ 90.91%] Β· For dvc commit a86cedbf <1.0.0a1test>:
[ 99.43%] Β·Β·Β· status.DVCIgnoreBench.time_status                                                                                                                          104Β±1ms
[100.00%] Β·Β·Β· status.DVCStatusBench.time_status                                                                                                                       79.8Β±0.5ms

The new version took 24ms, and the old one took 417ms. (17.35 times faster)

And In the worst case where the rules in the order that ignore rules and include rules are separated from each other. And there is no big regex exists(Hardly could this happened in real).
The result is:
Screenshot from 2020-06-13 20-50-08

The new version took 84ms, and the old one took 416ms. (4.95 times faster)

@Suor Suor self-requested a review June 15, 2020 08:19
Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one looks good. We should merge it.

The thing is bothering me is that this doesn't help with many ignore files. Can we make a bench for that? How many ignore files do we need to be in trouble?

@pared pared merged commit a59f90f into iterative:master Jun 15, 2020
@pared
Copy link
Contributor

pared commented Jun 15, 2020

@karajan1001, thanks for this PR!

Now, that implementation of dvcignore changed, @Suor point makes sense. I did not foresee that in iterative/dvc-bench#30
@karajan1001, as you are the original author of dvcingore benchmarks, would you be interested in reintroducing multiple dvcignore files benchmark? We could pull it to the extreme and do 30 files x 1 rule, instead of original 3 x 10.

@pared pared mentioned this pull request Jun 15, 2020
1 task
@karajan1001 karajan1001 deleted the fix3869 branch April 7, 2021 07:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

optimize dvcignore
3 participants