Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-72904: Add glob.translate() function #106703

Merged
merged 27 commits into from
Nov 13, 2023

Conversation

barneygale
Copy link
Contributor

@barneygale barneygale commented Jul 12, 2023

Add glob.translate() function that converts a pathname with shell wildcards to a regular expression. The regular expression is used by pathlib to implement match() and glob().

This function differs from fnmatch.translate() in that wildcards do not match path separators by default, and that a * pattern segment matches precisely one path segment. When recursive is set to true, ** pattern segments match any number of path segments, and ** cannot appear outside its own segment.

In pathlib, this change speeds up directory walking (because _make_child_relpath() does less work), makes path objects smaller (they don't need a _lines slot), and removes the need for some gnarly code.


📚 Documentation preview 📚: https://cpython-previews--106703.org.readthedocs.build/

If a sequence of path separators is given to the new argument,
`translate()` produces a pattern that matches similarly to
`pathlib.Path.glob()`. Specifically:

- A `*` pattern segment matches precisely one path segment.
- A `**` pattern segment matches any number of path segments
- If `**` appears in any other position within the pattern, `ValueError` is
  raised.
- `*` and `?` wildcards in other positions don't match path separators.

This change allows us to factor out a lot of complex code in pathlib.
@barneygale
Copy link
Contributor Author

~20% globbing speedup:

$ ./python -m timeit -s 'from pathlib import Path; p = Path()' 'list(p.glob("**/*", follow_symlinks=False))'
2 loops, best of 5: 175 msec per loop  # before
2 loops, best of 5: 146 msec per loop  # after

@barneygale barneygale changed the title GH-72904: Add optional *seps* argument to fnmatch.translate() GH-72904: Add optional *sep* argument to fnmatch.translate() Jul 26, 2023
Doc/library/fnmatch.rst Outdated Show resolved Hide resolved
Co-authored-by: Jason R. Coombs <[email protected]>
Doc/library/fnmatch.rst Outdated Show resolved Hide resolved
Lib/fnmatch.py Outdated Show resolved Hide resolved
Lib/fnmatch.py Outdated Show resolved Hide resolved
Lib/fnmatch.py Outdated Show resolved Hide resolved
Lib/test/test_fnmatch.py Outdated Show resolved Hide resolved
Lib/test/test_fnmatch.py Outdated Show resolved Hide resolved
Lib/fnmatch.py Outdated Show resolved Hide resolved
@barneygale barneygale changed the title GH-72904: Add optional *sep* argument to fnmatch.translate() GH-72904: Add glob.translate() function Aug 12, 2023
@barneygale
Copy link
Contributor Author

I've moved this to a new glob.translate() function.

It was easy enough to implement a recursive argument, so I did that and made its default False to match glob().

It's much harder to implement an include_hidden argument, so I've left that for now. I don't feel great about it, tbh.

@barneygale barneygale requested a review from jaraco August 12, 2023 02:19
@barneygale
Copy link
Contributor Author

Right, after some futzing around I'm going to mark this PR as ready again.

In fnmatch.py, I've split the translate() method into _translate() and _join_translated_parts(). This minimises the diff, risk, and performance impact in that module.

(In #109879 I've re-implemented fnmatch.translate(), but that PR is an optional side-quest now)

In glob.py, I've spent some time making the implementation of translate(include_hidden=False) as clear and performant as I can. There's still a fair whack of regex involved but I think it's followable.

In pathlib.py, I've made pattern matching use str(path) directly, rather than a slight variant that represents empty paths as '' rather than '.'.

@barneygale
Copy link
Contributor Author

barneygale commented Sep 30, 2023

Timings:

$ ./python -m timeit -n 5 -s 'from glob import glob' 'list(glob("**/*", recursive=True, include_hidden=True))'
5 loops, best of 5: 105 msec per loop  # for interest
$ ./python -m timeit -n 5 -s 'from pathlib import Path; p = Path()' 'list(p.glob("**/*", follow_symlinks=True))'
5 loops, best of 5: 86.6 msec per loop  # before
5 loops, best of 5: 72.7 msec per loop  # after

So Path.glob() is ~20% faster than before, and ~45% faster than glob.glob(), at least for this test case.

Copy link
Member

@AA-Turner AA-Turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A desk review:

A

Lib/glob.py Outdated Show resolved Hide resolved
Lib/glob.py Outdated Show resolved Hide resolved
Lib/glob.py Outdated Show resolved Hide resolved
Lib/glob.py Show resolved Hide resolved
Lib/pathlib.py Outdated Show resolved Hide resolved
Lib/glob.py Show resolved Hide resolved
@barneygale barneygale requested a review from AA-Turner October 3, 2023 19:59
@barneygale
Copy link
Contributor Author

I've realised that the docs are a bit skew-whiff. Fix is in a separate PR: #110418

@encukou
Copy link
Member

encukou commented Nov 3, 2023

Is it correct to keep duplicated path separators?

>>> glob.translate('a//b')
'(?s:a//b)\\Z'

@barneygale
Copy link
Contributor Author

barneygale commented Nov 3, 2023

Is it correct to keep duplicated path separators?

>>> glob.translate('a//b')
'(?s:a//b)\\Z'

glob() keeps them:

>>> os.makedirs('a/b')
>>> glob.glob('a//b')
['a//b']

So I reckon yes?

@barneygale
Copy link
Contributor Author

Also, the number of additional slashes is meaningful in some cases, e.g. in Windows UNC paths or POSIX paths starting with two forward slashes. I don't think a pattern like /foo should match a path like //foo, and vice-versa.

@barneygale
Copy link
Contributor Author

barneygale commented Nov 12, 2023

Hey @encukou, do you think I can merge this, or should I wait for a more complete review from someone?

@encukou
Copy link
Member

encukou commented Nov 13, 2023

Oh, I should have been more clear that I wouldn't get to a thorough review in any reasonable time.
You already have a few approvals, merge away!

@barneygale barneygale merged commit cf67ebf into python:main Nov 13, 2023
aisk pushed a commit to aisk/cpython that referenced this pull request Feb 11, 2024
Add `glob.translate()` function that converts a pathname with shell wildcards to a regular expression. The regular expression is used by pathlib to implement `match()` and `glob()`.

This function differs from `fnmatch.translate()` in that wildcards do not match path separators by default, and that a `*` pattern segment matches precisely one path segment. When *recursive* is set to true, `**` pattern segments match any number of path segments, and `**` cannot appear outside its own segment.

In pathlib, this change speeds up directory walking (because `_make_child_relpath()` does less work), makes path objects smaller (they don't need a `_lines` slot), and removes the need for some gnarly code.

Co-authored-by: Jason R. Coombs <[email protected]>
Co-authored-by: Adam Turner <[email protected]>
Glyphack pushed a commit to Glyphack/cpython that referenced this pull request Sep 2, 2024
Add `glob.translate()` function that converts a pathname with shell wildcards to a regular expression. The regular expression is used by pathlib to implement `match()` and `glob()`.

This function differs from `fnmatch.translate()` in that wildcards do not match path separators by default, and that a `*` pattern segment matches precisely one path segment. When *recursive* is set to true, `**` pattern segments match any number of path segments, and `**` cannot appear outside its own segment.

In pathlib, this change speeds up directory walking (because `_make_child_relpath()` does less work), makes path objects smaller (they don't need a `_lines` slot), and removes the need for some gnarly code.

Co-authored-by: Jason R. Coombs <[email protected]>
Co-authored-by: Adam Turner <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage topic-pathlib
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants