Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise the manifest update #12553

Closed
wants to merge 12 commits into from
Closed

Optimise the manifest update #12553

wants to merge 12 commits into from

Conversation

jgraham
Copy link
Contributor

@jgraham jgraham commented Aug 17, 2018

Not ready to merge etc. Mostly concerned with making no-op updates as fast as possible.

Copy link
Member

@gsnedders gsnedders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to be a bit weary of fixing performance just by writing a bunch of caches versus fixing the underlying performance issues, which is essentially what this does for gitignore.

On the mtime side, I wonder if we should be using the user's cache directory ($XDG_CACHE_HOME or whatever), rather than introducing our own cache files in random places in the filesystem?

@@ -1,6 +1,5 @@
{"manifest":
{"path": "update.py", "script": "run", "parser": "create_parser", "help": "Update the MANIFEST.json file",
"virtualenv": false},
"virtualenv": false, "install": ["ujson"]},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd previously been against adding non-vendored dependencies for manifest/lint? Is the intention to require it for wpt manifest but not using manifest as a package? How are we going to test it works under both environments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't mean to add this. I intend to use ujson if it's available but not to require it. For a no-op update the time saving is almost exactly cancelled by the time to setup the virtualenv.

.gitignore Outdated
@@ -28,5 +22,5 @@ config.json
scratch
.idea/
.vscode/
.DS_Store
*.rej
.DS_Store/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.DS_Store is a file and not a directory?

.DS_Store
*.rej
.DS_Store/
*.rej
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've removed end terminating new line here?


end_space = re.compile(r"([^\\]\s)*$")


def fnmatch_translate(pat, allow_component_only=True):
def fnmatch_translate(pat):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given you've already added an external dependency, is there any reason not to just move gitignore to use a better regex impl that does O(n) matching?

self.data[key] = value


def walk(root):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason we're doing this rather than using the scandir package and returning DirEntrys?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just because it's one less required dependency, and a C one at that. Also since we typically require the stat data anyway it's not clear how much of a win it will be.

@jgraham
Copy link
Contributor Author

jgraham commented Aug 24, 2018

I think this is bascially ready for review. The branch is a bit of a mess and the cache stuff probably needs better tests, but for my sanity I think I need to work on something else for a bit :)

else:
any_char = "."
# By default match the entire path up to a /
# but if / doesn't appear in the pattern we will mark is as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will what now?

tools/wpt/wpt.py Outdated
pdb.post_mortem()
else:
raise
raise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you still need the try/except?

Cactusmachete and others added 11 commits October 10, 2018 11:44
We end up with a lot of rules like (?:.*)/.*\.ext which are basically
trying to find the last component in a path and match against
that. These are rather slow to run so the easiest thing tdo is just
pass in the last component of the path when we know that's the only
thing the rule can match.

The changes to surrounding code to use this API will be made in future
commits.
This updates the gitignore implemenation to take input like os.walk
but with additional stat data for the files. It also makes several
useful optimistaions:

 * Avoid using regex when just matching a literal
 * Identify patterns that can only match the final component of a path
   and run those against that component rather than the full path.
 * Add the possibility of providing a dictionary of paths to gitignore
   statuses as a cache.

This dramatically reduces the amount of time we spend in gitignore
processing when updating the manifest.
When processing the manifest using the worktree, instead of reading
all files to see if the content changed, instead only process files where
the mtime has been updated since the previous run. Also cache the
result of running gitignore, so we can save a couple of seconds
processing the gitignore rules.
Compared to the normal os.walk this has a couple of differences:

* It returns lists of (name, stat) for filenames and directories,
  allowing callers to reuse the stat data without going back to the
  system to re-request it.

* Directories are always returned as paths relative to the root, and
  the root itself is returned as the empty string.

* It is non-recursive.

There are also a few features missing that aren't required for our use
cases.
The caches weren't being invalidated when the manifest itself
changed. To fix this the manifest itself has to be written before the
cache and the cache has to include data about the manifest that it's
associated with (the mtime and path are used for this purpose).

To make all this work requires a single method that can load the
manifest, update it, write the manifest and write the
caches. Therefore we introduce a single load_and_update method that is
intended to replace all previous use of the load() or update() methods
(and as a bonus handles manifest version mismatches in a single
place).
@jgraham
Copy link
Contributor Author

jgraham commented Oct 10, 2018

Since there's basically no traction on getting this landed here, I created gecko bug https://bugzilla.mozilla.org/show_bug.cgi?id=1497898 with these patches applied to mozilla-central. I really don't mind which way we land this, but it is important to me that we either land this or something functionally equivalent (i.e. something that can drop the zero-change manifest update time using the worktree to around 1s).

@foolip
Copy link
Member

foolip commented Oct 10, 2018

Please ignore the CircleCI failure on this PR. I was experimenting with CircleCI on a branch and did not expect it to show up on all PRs. I've revoked access for CircleCI, and it should be possible to merge even with this spurious failure. Please let me know if there are problems.

@jgraham
Copy link
Contributor Author

jgraham commented Nov 19, 2018

Landed via #13618

@jgraham jgraham closed this Nov 19, 2018
@gsnedders gsnedders deleted the manifest_optimise branch January 24, 2020 18:07
@gsnedders gsnedders restored the manifest_optimise branch January 24, 2020 18:47
@foolip foolip deleted the manifest_optimise branch April 14, 2021 11:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci gitignore infra lint manifest wpt wptrunner The automated test runner, commonly called through ./wpt run
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants