-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-116380: Speed up glob.glob()
by removing some system calls
#116392
base: main
Are you sure you want to change the base?
Conversation
Needs a fix for #116377 to land. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
Misc/NEWS.d/next/Library/2024-03-05-23-08-11.gh-issue-116380.56HU7I.rst
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not hurry to merge. This is an old code. The main advantage of the initial code was its simplicity, but since then it was complicated by adding new features and optimizations. In particularly the use of os.scandir()
instead of os.listdir()
significantly improved performance. The new implementation should be benchmarked with different test cases: deep and wide threes, files and directories domination.
Thanks Serhiy! We use
If neither of these are true, then we don't need to edit: to further illustrate what I mean, here's where
|
I've been looking into this! The randomfiletree project is helpful - it can repeatedly walk a tree and create child files/folders according to a gaussian distribution, which seems to me like a good approximation for an average "shallow and wide" filesystem structure, including tweaking for file or folder distribution. It's difficult to produce "deep and narrow" trees this way, as the file/folder probability would need to change with the depth (I think?). I've been considering writing a tree generator that works this way, e.g.:
... but is that overly arbitrary? Is there a better way? Or do I just need to come up with a bunch of test cases along those lines? |
A test of 100 nested directories named "deep" from my Linux machine:
|
Misc/NEWS.d/next/Library/2024-03-05-23-08-11.gh-issue-116380.56HU7I.rst
Outdated
Show resolved
Hide resolved
@picnixz to address some of your comments on using |
That's... an interesting functionality I wasn't aware of :) If someone could explain to me the reason I'd be happy. Anyway, let's keep your loops. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The journey was a bit long but I have nothing else to say here! Thank you, as always, for addressing my nitpicking comments and considering my suggestions!
I'd be more comfortable if @serhiy-storchaka can have a final look at it just to see whether I've missed some logic.
Thank you very much for the thorough review. Your comments were very helpful as ever, thank you for being patient. |
Hey @serhiy-storchaka, please could you review this? I'd appreciate your feedback. TiA |
Speed up
glob.glob()
andglob.iglob()
by reducing the number of system calls made.This unifies the implementations of globbing in the
glob
andpathlib
modules.Depends on
pathlib.Path.glob()
by working with strings #117589pathlib.Path.glob()
by not scanning literal parts #117732Filtered recursive walk
Expanding a recursive
**
segment entails walking the entire directory tree, and so any subsequent pattern segments (except special segments) can be evaluated by filtering the expanded paths through a regex. For example,glob.glob("foo/**/*.py", recursive=True)
recursively walksfoo/
withos.scandir()
, and then filters paths through a regex based on "**/*.py
, with no further filesystem access needed.This solves #104269 as a side-effect.
Tracking path existence
We store a flag alongside each path indicating whether the path is guaranteed to exist. As we process the pattern:
""
,"."
and".."
) leave the flag unchangedfoo/bar
) set the flag to false*/*.py
) set the flag to true (because children are found viaos.scandir()
)**
) leave the flag unchanged for the root path, and set it to true for descendants discovered viaos.scandir()
.If the flag is false at the end, we call
lstat()
on each path to filter out missing paths.Minor speed-ups
We:
is_dir()
.glob
module.bytes
(a minor use-case) iniglob()
rather than supportingbytes
throughout. This particularly simplifies the code needed to handle relative bytes paths withdir_fd
.os.path.join()
; instead we keep paths in a normalized form and append trailing slashes when needed.os.path.normcase()
; instead we use case-insensitive regex matching.Implementation notes
Much of this functionality is already present in pathlib's implementation of globbing. The specific additions we make are:
dir_fd
include_hidden
root_dir
Results
Speedups via
python -m timeit -s "from glob import glob" "glob(pattern, recursive=True, include_hidden=True)"
from CPython source directory on Linux:Lib/*
Lib/*/
Lib/*.py
Lib/**
Lib/**/
Lib/**/*
Lib/**/**
Lib/**/*/
Lib/**/*.py
Lib/**/__init__.py
Lib/**/*/*.py
Lib/**/*/__init__.py
glob.glob()
by reducing number of system calls made #116380