-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimise the manifest update #12553
Optimise the manifest update #12553
Conversation
c6facec
to
e124c3a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to be a bit weary of fixing performance just by writing a bunch of caches versus fixing the underlying performance issues, which is essentially what this does for gitignore
.
On the mtime side, I wonder if we should be using the user's cache directory ($XDG_CACHE_HOME or whatever), rather than introducing our own cache files in random places in the filesystem?
@@ -1,6 +1,5 @@ | |||
{"manifest": | |||
{"path": "update.py", "script": "run", "parser": "create_parser", "help": "Update the MANIFEST.json file", | |||
"virtualenv": false}, | |||
"virtualenv": false, "install": ["ujson"]}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'd previously been against adding non-vendored dependencies for manifest/lint? Is the intention to require it for wpt manifest
but not using manifest
as a package? How are we going to test it works under both environments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't mean to add this. I intend to use ujson if it's available but not to require it. For a no-op update the time saving is almost exactly cancelled by the time to setup the virtualenv.
.gitignore
Outdated
@@ -28,5 +22,5 @@ config.json | |||
scratch | |||
.idea/ | |||
.vscode/ | |||
.DS_Store | |||
*.rej | |||
.DS_Store/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.DS_Store
is a file and not a directory?
.DS_Store | ||
*.rej | ||
.DS_Store/ | ||
*.rej |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've removed end terminating new line here?
|
||
end_space = re.compile(r"([^\\]\s)*$") | ||
|
||
|
||
def fnmatch_translate(pat, allow_component_only=True): | ||
def fnmatch_translate(pat): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given you've already added an external dependency, is there any reason not to just move gitignore to use a better regex impl that does O(n) matching?
self.data[key] = value | ||
|
||
|
||
def walk(root): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason we're doing this rather than using the scandir package and returning DirEntrys?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just because it's one less required dependency, and a C one at that. Also since we typically require the stat data anyway it's not clear how much of a win it will be.
e124c3a
to
8d76637
Compare
8d76637
to
0820607
Compare
0ca6eee
to
99c1701
Compare
I think this is bascially ready for review. The branch is a bit of a mess and the cache stuff probably needs better tests, but for my sanity I think I need to work on something else for a bit :) |
else: | ||
any_char = "." | ||
# By default match the entire path up to a / | ||
# but if / doesn't appear in the pattern we will mark is as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will what now?
tools/wpt/wpt.py
Outdated
pdb.post_mortem() | ||
else: | ||
raise | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you still need the try/except?
99c1701
to
6bb5621
Compare
We end up with a lot of rules like (?:.*)/.*\.ext which are basically trying to find the last component in a path and match against that. These are rather slow to run so the easiest thing tdo is just pass in the last component of the path when we know that's the only thing the rule can match. The changes to surrounding code to use this API will be made in future commits.
This updates the gitignore implemenation to take input like os.walk but with additional stat data for the files. It also makes several useful optimistaions: * Avoid using regex when just matching a literal * Identify patterns that can only match the final component of a path and run those against that component rather than the full path. * Add the possibility of providing a dictionary of paths to gitignore statuses as a cache. This dramatically reduces the amount of time we spend in gitignore processing when updating the manifest.
When processing the manifest using the worktree, instead of reading all files to see if the content changed, instead only process files where the mtime has been updated since the previous run. Also cache the result of running gitignore, so we can save a couple of seconds processing the gitignore rules.
Compared to the normal os.walk this has a couple of differences: * It returns lists of (name, stat) for filenames and directories, allowing callers to reuse the stat data without going back to the system to re-request it. * Directories are always returned as paths relative to the root, and the root itself is returned as the empty string. * It is non-recursive. There are also a few features missing that aren't required for our use cases.
The caches weren't being invalidated when the manifest itself changed. To fix this the manifest itself has to be written before the cache and the cache has to include data about the manifest that it's associated with (the mtime and path are used for this purpose). To make all this work requires a single method that can load the manifest, update it, write the manifest and write the caches. Therefore we introduce a single load_and_update method that is intended to replace all previous use of the load() or update() methods (and as a bonus handles manifest version mismatches in a single place).
Since there's basically no traction on getting this landed here, I created gecko bug https://bugzilla.mozilla.org/show_bug.cgi?id=1497898 with these patches applied to mozilla-central. I really don't mind which way we land this, but it is important to me that we either land this or something functionally equivalent (i.e. something that can drop the zero-change manifest update time using the worktree to around 1s). |
6bb5621
to
7ccdbbe
Compare
Please ignore the CircleCI failure on this PR. I was experimenting with CircleCI on a branch and did not expect it to show up on all PRs. I've revoked access for CircleCI, and it should be possible to merge even with this spurious failure. Please let me know if there are problems. |
Landed via #13618 |
Not ready to merge etc. Mostly concerned with making no-op updates as fast as possible.