ignore: solve re.error on group name redefinition in pathspec 0.10.x #8663
Conversation
Codecov Report — Base: 93.54% // Head: 93.55% // Increases project coverage by +0.01%
Additional details and impacted files@@ Coverage Diff @@
## main #8663 +/- ##
==========================================
+ Coverage 93.54% 93.55% +0.01%
==========================================
Files 457 457
Lines 36249 36249
Branches 5232 5232
==========================================
+ Hits 33908 33914 +6
+ Misses 1837 1834 -3
+ Partials 504 501 -3
☔ View full report at Codecov.
Related: #8553
Remove regex concatenation that causes re.error. Fixes iterative#8217.
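For context, the error can be reproduced in isolation; the regex strings below are simplified stand-ins for pathspec's actual output, which also uses a named group `ps_d` in every compiled pattern:

```python
import re

# Simplified stand-ins for the per-pattern regexes pathspec 0.10.x
# produces: each one contains the same named group "ps_d".
first = r"(?P<ps_d>foo)"
second = r"(?P<ps_d>bar)"

# Joining them with "|" redefines the group name, which re rejects.
try:
    re.compile(first + "|" + second)
except re.error as exc:
    print(exc)  # redefinition of group name 'ps_d' as group 2; was group 1 ...
```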
Looks good. pathspec should also be bumped.
@dtrifiro Thank you for reviewing! When I ran tests with the latest pathspec, I noticed that the interpretation of … has changed. Sorry that I submitted this PR earlier, but we may need to consider that breaking change when bumping. (cpburnz/python-pathspec#19 was closed this August.)
- for ignore, pattern in self.ignore_spec[::-1]:
-     if matches(pattern, path, is_dir):
+ for ignore, patterns in self.ignore_spec[::-1]:
+     if any(matches(pattern, path, is_dir) for pattern in patterns):
Hi, @hiroto7. I have a problem here: previously we concatenated these regular expressions mainly for performance. In the very beginning we actually used pathspec's API to do these checks, but it was really slow, and its inner implementation is this same kind of nested for loop. Any better method to solve this?
@karajan1001 Oh, I didn't know that the regex concatenation was deliberate, and I noticed you previously mentioned a bottleneck in pathspec's API in #3869 (comment).
Although pathspec seems to check every regex even when a pattern matches early, the `any()` function does short-circuit evaluation, so I think my PR never performs unnecessary regex matches.
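The short-circuit behaviour can be seen in a small self-contained sketch; the `matches` helper here is a hypothetical stand-in (not DVC's actual function) that records each invocation:

```python
import re

calls = []

def matches(regex, s):
    # Hypothetical stand-in for DVC's matches(); logs every call.
    calls.append(regex.pattern)
    return regex.search(s) is not None

patterns = [re.compile(p) for p in ("a", "b", "c")]

# any() short-circuits: once "a" matches, "b" and "c" are never tried.
result = any(matches(rx, "abc") for rx in patterns)
print(result, calls)  # True ['a']
```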
Yes, but your algorithm is still O(m) for m regular expressions, even if it is faster than pathspec's solution. A combined regex builds a state machine internally: for example, if you have both `1` and `10` as patterns in one long regex and `1` doesn't match, we also know `10` doesn't, while in the for loop it will still be checked.
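The one-scan-versus-m-scans difference can be illustrated with a rough micro-benchmark (timings vary by machine; this is a sketch of the idea, not DVC's own benchmark):

```python
import re
import timeit

words = [str(n) for n in range(100)]
combined = re.compile("|".join(words))     # one alternation regex
separate = [re.compile(w) for w in words]  # one regex per pattern

# Worst case for the loop: no pattern matches, so all m are tried.
text = "path/with/no/matching/component"

one_scan = timeit.timeit(lambda: combined.search(text), number=10_000)
m_scans = timeit.timeit(
    lambda: any(rx.search(text) for rx in separate), number=10_000
)
print(f"combined: {one_scan:.3f}s  loop: {m_scans:.3f}s")
```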
Here is a simple test on my computer: I created 100 patterns for the dvc status benchmark.
$ cat tests/benchmarks/cli/commands/test_status_ignore.py
def test_status(bench_dvc, tmp_dir, dvc, make_dataset):
    dataset = make_dataset(files=True, dvcfile=True, cache=True)
    ignore_list = "\n".join([str(n) for n in range(100)])
    (tmp_dir / ".dvcignore").write_text(ignore_list)
    bench_dvc("status")
    bench_dvc("status", name="noop")
    (dataset / "new").write_text("new")
    bench_dvc("status", name="changed")
    bench_dvc("status", name="changed-noop")
And the final benchmark shows that the time cost after this PR increased from 280ms to 310ms.
Thank you for the benchmark result.
If that's the case, it seems that simply looping over the regexes one by one, as I did, is not a good idea.
Concatenating regexes is likely the effective approach, unfortunately, and I have no idea for any other way.
I have two suggestions here:
- Use some kind of hacky way to override the regex pattern created by pathspec (replace `<ps_d>` with `<ps_d1>`, `<ps_d2>`, ... `<ps_dn>`).
- Another, more radical choice is to contribute back to pathspec to improve its performance and then use its API directly.
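The first suggestion could be sketched like this; the input regexes are hypothetical pathspec-style patterns, and a naive string replace stands in for what real code would do with a more careful rewrite of the group names:

```python
import re

# Hypothetical per-pattern regexes in pathspec's style, each using
# the same group name "ps_d", so they can't be joined with "|" as-is.
regexes = [r"(?P<ps_d>foo)", r"(?P<ps_d>bar)"]

# Rename the group in each pattern to ps_d1, ps_d2, ... before joining,
# so the combined alternation compiles under pathspec 0.10.x semantics.
renamed = [rx.replace("ps_d", f"ps_d{i}") for i, rx in enumerate(regexes, 1)]
combined = re.compile("|".join(renamed))

print(bool(combined.search("bar")))  # True
```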
Closing as #8767.
Fixes #8217 by getting rid of the regex concatenation that causes
ERROR: unexpected error - redefinition of group name 'ps_d' as group 2; was group 1 at position 46
when the latest pathspec (0.10.x) is installed. Instead of concatenating regexes with `|`, run `matches()` for each regular expression and take `any()`.
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏