ignore: solve re.error on group name redefinition in pathspec 0.10.x #8663
Conversation
Codecov Report — Base: 93.54% // Head: 93.55% // Increases project coverage by +0.01%
Additional details and impacted files@@ Coverage Diff @@
## main #8663 +/- ##
==========================================
+ Coverage 93.54% 93.55% +0.01%
==========================================
Files 457 457
Lines 36249 36249
Branches 5232 5232
==========================================
+ Hits 33908 33914 +6
+ Misses 1837 1834 -3
+ Partials 504 501 -3
☔ View full report at Codecov.
Related: #8553
Remove regex concatenation that causes re.error. Fixes iterative#8217.
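For context, the error can be reproduced in isolation; the regex strings below are simplified stand-ins for pathspec's actual output, which also uses a named group `ps_d` in every compiled pattern:

```python
import re

# Simplified stand-ins for the per-pattern regexes pathspec 0.10.x
# produces: each one contains the same named group "ps_d".
first = r"(?P<ps_d>foo)"
second = r"(?P<ps_d>bar)"

# Joining them with "|" redefines the group name, which re rejects.
try:
    re.compile(first + "|" + second)
except re.error as exc:
    print(exc)  # redefinition of group name 'ps_d' as group 2; was group 1 ...
```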
Looks good. pathspec should also be bumped.
@dtrifiro Thank you for reviewing! When I ran tests with the latest pathspec, I noticed that the interpretation of … has changed. Sorry that I submitted this PR earlier, but we may need to consider that breaking change when bumping. (cpburnz/python-pathspec#19 was closed this August.)
- for ignore, pattern in self.ignore_spec[::-1]:
-     if matches(pattern, path, is_dir):
+ for ignore, patterns in self.ignore_spec[::-1]:
+     if any(matches(pattern, path, is_dir) for pattern in patterns):
Hi, @hiroto7. I have a problem here: previously we concatenated these regular expressions mainly for performance. In the very beginning we actually used pathspec's API to do these checks, but it was really slow, and its inner implementation is this same kind of nested for loop. Any better method to solve this?
@karajan1001 Oh, I didn't know that the regex concatenation was deliberate, and I noticed you previously mentioned a bottleneck in pathspec's API in #3869 (comment).
Although pathspec seems to check every regex even when a pattern matches early, the `any()` function does short-circuit evaluation, so I think my PR never performs unnecessary regex matches.
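The short-circuit behaviour can be seen in a small self-contained sketch; the `matches` helper here is a hypothetical stand-in (not DVC's actual function) that records each invocation:

```python
import re

calls = []

def matches(regex, s):
    # Hypothetical stand-in for DVC's matches(); logs every call.
    calls.append(regex.pattern)
    return regex.search(s) is not None

patterns = [re.compile(p) for p in ("a", "b", "c")]

# any() short-circuits: once "a" matches, "b" and "c" are never tried.
result = any(matches(rx, "abc") for rx in patterns)
print(result, calls)  # True ['a']
```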
Yes, but your algorithm is still O(m) for m regular expressions, even if it is faster than pathspec's solution. A combined regex builds a state machine internally: for example, if you have both `1` and `10` as patterns in one long regex and `1` doesn't match, we also know `10` doesn't, while in the for loop it will still be checked.
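The one-scan-versus-m-scans difference can be illustrated with a rough micro-benchmark (timings vary by machine; this is a sketch of the idea, not DVC's own benchmark):

```python
import re
import timeit

words = [str(n) for n in range(100)]
combined = re.compile("|".join(words))     # one alternation regex
separate = [re.compile(w) for w in words]  # one regex per pattern

# Worst case for the loop: no pattern matches, so all m are tried.
text = "path/with/no/matching/component"

one_scan = timeit.timeit(lambda: combined.search(text), number=10_000)
m_scans = timeit.timeit(
    lambda: any(rx.search(text) for rx in separate), number=10_000
)
print(f"combined: {one_scan:.3f}s  loop: {m_scans:.3f}s")
```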
Here is a simple test on my computer: I created 100 patterns for the dvc status benchmark.
$ cat tests/benchmarks/cli/commands/test_status_ignore.py
def test_status(bench_dvc, tmp_dir, dvc, make_dataset):
    dataset = make_dataset(files=True, dvcfile=True, cache=True)
    ignore_list = "\n".join([str(n) for n in range(100)])
    (tmp_dir / ".dvcignore").write_text(ignore_list)
    bench_dvc("status")
    bench_dvc("status", name="noop")
    (dataset / "new").write_text("new")
    bench_dvc("status", name="changed")
    bench_dvc("status", name="changed-noop")
And the final benchmark shows that the time cost after this PR increased from 280ms to 310ms.
Thank you for the benchmark result.
If that's the case, it seems that simply looping over the regexes one by one, as I did, is not a good idea.
Concatenating regexes is likely the effective approach, unfortunately, and I have no idea for any other way.
I have two suggestions here:
- Use some kind of hacky way to override the regex pattern created by pathspec (replace `<ps_d>` with `<ps_d1>`, `<ps_d2>`, ... `<ps_dn>`).
- Another, more radical choice is to contribute back to pathspec to improve its performance and then use its API directly.
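The first suggestion could be sketched like this; the input regexes are hypothetical pathspec-style patterns, and a naive string replace stands in for what real code would do with a more careful rewrite of the group names:

```python
import re

# Hypothetical per-pattern regexes in pathspec's style, each using
# the same group name "ps_d", so they can't be joined with "|" as-is.
regexes = [r"(?P<ps_d>foo)", r"(?P<ps_d>bar)"]

# Rename the group in each pattern to ps_d1, ps_d2, ... before joining,
# so the combined alternation compiles under pathspec 0.10.x semantics.
renamed = [rx.replace("ps_d", f"ps_d{i}") for i, rx in enumerate(regexes, 1)]
combined = re.compile("|".join(renamed))

print(bool(combined.search("bar")))  # True
```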
Closing as #8767.
Fixes #8217 by getting rid of the regex concatenation that causes
ERROR: unexpected error - redefinition of group name 'ps_d' as group 2; was group 1 at position 46
when the latest pathspec (0.10.x) is installed. Instead of concatenating regexes with `|`, run `matches()` for each regular expression and take `any()`.
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏