Optimize fine-grained update by using Graph as the cache #4622
Conversation
This makes a big performance difference, excellent! Using my randomized tester script, a cold run was about 70s (several files with small changes, time includes cache download) in the Dropbox S repo, and incremental updates were typically 5-10s (and no more than 20s). I did a fair amount of testing, and found some common false positives. I isolated them into these test cases that pass on master:
It looks like, for coarse-grained incremental mode, the existence of a file not included in the build affects build results. Fine-grained incremental mode considers the file non-existent even if it exists (but apparently only when propagating fine-grained dependencies). Example failure:

The problem is due to the handling of missing_modules.
This has been a bit of an adventure. Since we are no longer doing a full `load_graph()`, we can't rely on it fully populating `missing_modules` if it is cleared between updates. If `missing_modules` isn't fully populated, then semanal will spuriously reject `from x import y` where `x.y` is a missing module.

We can't just *not* clear `missing_modules`, however, because `parse_file` uses it to decide that modules are suppressed. But this is actually wrong! It can lead to an import failure message not being generated because some other module has already failed to import it (and perhaps even ignored the error).

`parse_file()` shouldn't actually *need* to compute `suppressed`, though. If it is called on a file with no cache, `load_graph()` will handle moving deps to `suppressed`, and if called on a file that has had cache info loaded, then the dependency information has all been loaded from the cache. So we refactor things so that dep information from the cache is used when available and `parse_file` doesn't need to deal with it.

I strongly suspect that this refactor obviates the need for `fix_suppressed_dependencies`, but I was not able to quickly produce a test case from the description in #2036, so I am not comfortable making that change.
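A minimal sketch of that idea, with invented names (DepInfo, deps_for_file) standing in for the real build internals, and assuming the cache already records which dependencies were suppressed on the previous run:

from typing import List, NamedTuple, Optional

class DepInfo(NamedTuple):
    dependencies: List[str]  # modules the build will try to process
    suppressed: List[str]    # modules known to be unavailable

def deps_for_file(module_id: str,
                  cached: Optional[DepInfo],
                  imported_ids: List[str]) -> DepInfo:
    """Pick dependency info for one module (illustrative only).

    If cache info was loaded for this file, it already knows which deps
    ended up suppressed, so reuse it unchanged.  If there is no cache,
    report every import as a tentative dependency and let the graph
    loading step move missing ones into `suppressed`, so the import
    error is reported for this module instead of being swallowed
    because some other module failed (or ignored) the same import.
    """
    if cached is not None:
        return cached
    return DepInfo(dependencies=[i for i in imported_ids if i != module_id],
                   suppressed=[])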
Force-pushed from 4da63be to 7520f89.
Fixed the missing_modules related failures.
Great to see so many lines deleted! Looks good, but I don't have enough context on mypy.build to say if the logic change related to suppressed is without problems. Worth asking for a second opinion from Guido.
Feel free to merge once you've addressed my feedback and asked Guido about this (unless you make some major changes, in which case I'd like to have another look).
mypy/build.py
Outdated
@@ -1896,49 +1883,54 @@ def parse_file(self) -> None:
        # TODO: Why can't SemanticAnalyzerPass1.analyze() do this?
        self.tree.names = manager.semantic_analyzer.globals

        for pri, id, line in manager.all_imported_modules_in_file(self.tree):
            if id == '':
It's not obvious why an empty string indicates a relative import. Add a longer description, probably here and in the docstring of all_imported_modules_in_file.
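For context, a rough illustration (not mypy's actual helper; resolve_relative_prefix is invented) of how a relative import's target id is computed and why it can come out empty:

def resolve_relative_prefix(current_module: str, level: int) -> str:
    # A relative import drops `level` trailing components from the
    # importing module's id to find the package it resolves against.
    parts = current_module.split(".")
    return ".".join(parts[:-level])

# 'from . import x' inside pkg.mod resolves against the package 'pkg'.
assert resolve_relative_prefix("pkg.mod", 1) == "pkg"
# Inside a top-level module there is no enclosing package left, so the
# computed id is the empty string; that is why '' signals this case.
assert resolve_relative_prefix("mod", 1) == ""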
Moving the check into all_imported_modules_in_file instead.
mypy/build.py
Outdated
                # least one imported name isn't a submodule)
                # cur_id is also a dependency, and we should
                # insert it *before* any submodules.
                if not all_are_submodules:
Why was this check removed?
The lack of the parent module in the dependency list was involved in a weird interaction with the original changes that resulted in the wrong module name appearing in an error message in one case.
The other changes to build's dependency handling appear to solve this without this check, though, so I'm going to undo this change.
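For readers without the full diff, the behaviour under discussion is roughly the following (a simplified sketch; known_modules stands in for however the real code decides whether an imported name is a submodule):

from typing import List, Set

def from_import_deps(cur_id: str, names: List[str],
                     known_modules: Set[str]) -> List[str]:
    """Sketch of dependency ordering for `from cur_id import name, ...`."""
    res: List[str] = []          # in the real code this already holds
    pos = len(res)               # earlier imports from the same file
    all_are_submodules = True
    for name in names:
        sub_id = cur_id + "." + name
        if sub_id in known_modules:
            res.append(sub_id)   # imported name is itself a submodule
        else:
            all_are_submodules = False
    if not all_are_submodules:
        # At least one imported name isn't a submodule, so cur_id is
        # also a dependency, inserted *before* any submodules.
        res.insert(pos, cur_id)
    return res

# from pkg import submod, helper_func  ->  ['pkg', 'pkg.submod']
print(from_import_deps("pkg", ["submod", "helper_func"],
                       {"pkg", "pkg.submod"}))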
mypy/checker.py
Outdated
""" | ||
self.partial_reported.clear() | ||
assert self.partial_types == [] | ||
assert self.deferred_nodes == [] |
Maybe also reset type_map? We could ensure that the binder and scope are clear (or reset them as well). Also, consider clearing inferred_attribute_types and module_refs. Or maybe it would be easier to create a new TypeChecker object (doesn't need to happen in this PR).
I got a noticeable (though not particularly big) performance win by resetting the TypeChecker instead of creating a new one.
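A minimal sketch of the reset-and-reuse pattern being discussed, using a toy class whose attribute names loosely mirror the ones mentioned above (the real TypeChecker carries much more state than this):

from typing import Dict, List, Set

class ToyChecker:
    """Stand-in for a checker object that is reused across runs."""

    def __init__(self) -> None:
        self.type_map: Dict[int, str] = {}
        self.module_refs: Set[str] = set()
        self.partial_reported: Set[str] = set()
        self.partial_types: List[object] = []
        self.deferred_nodes: List[object] = []

    def reset(self) -> None:
        # Clearing per-run state is cheaper than constructing a fresh
        # object on every fine-grained update; the asserts catch state
        # that should already have been drained by the previous run.
        self.partial_reported.clear()
        self.type_map.clear()
        self.module_refs.clear()
        assert self.partial_types == []
        assert self.deferred_nodes == []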
        mark_all_meta_as_memory_only(graph, manager)
        manager.saved_cache = preserve_full_cache(graph, manager)
        self.type_maps = extract_type_maps(graph)
        manager.saved_cache = {}
Update the docstring of BuildManager to reflect the fact that saved_cache is no longer used in fine-grained incremental mode.
Will do.
Once this is landed I'm also going to just totally rip out the coarse-grained daemon and all of the saved_cache machinery entirely.
[delete c.py.2]
[file b.py.2]
class Foo:
    def foo(self) -> None: pass
Maybe add a third step where the signature of foo is changed, which would result in an error in the main module.
✅
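In the same test format, the added third step might look roughly like this (a sketch; the exact expected error message is omitted rather than guessed):

[file b.py.3]
class Foo:
    def foo(self, x: int) -> None: pass

The corresponding output step would then gain an error at the call site in the main module, since callers of foo no longer pass the required argument.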
This reverts commit c21e216. Also fix tests, since this reintroduces a cache/no-cache divergence.
mypy/build.py
Outdated
@@ -658,23 +656,15 @@ def correct_rel_imp(imp: Union[ImportFrom, ImportAll]) -> str:
            elif isinstance(imp, ImportFrom):
                cur_id = correct_rel_imp(imp)
                pos = len(res)
                all_are_submodules = True
I'm suspicious of the removal of this behavior -- I'm pretty sure it was needed at some point (perhaps to handle import cycles in incremental mode).
"""Given some source files, load the full dependency graph. | ||
|
||
As this may need to parse files, this can raise CompileError in case | ||
there are syntax errors. | ||
""" | ||
graph = {} # type: Graph | ||
|
||
graph = old_graph or {} # type: Graph |
Does it matter that this may mutate the old_graph argument in place? (Except when called with an empty dict.)
Yep, that's the goal.
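A tiny sketch of what that implies (simplified signature; the real load_graph also takes the sources and the BuildManager):

from typing import Dict, List, Optional

Graph = Dict[str, object]  # module id -> State, heavily simplified

def load_graph(sources: List[str],
               old_graph: Optional[Graph] = None) -> Graph:
    graph = old_graph or {}  # type: Graph
    # A non-empty old_graph is extended in place; an *empty* one is
    # replaced by a fresh dict, matching the parenthetical above.
    # ... new and changed modules are added to `graph` here ...
    return graph

# Because the returned dict *is* the caller's dict, the fine-grained
# manager can keep holding one Graph across updates and pass it back
# in as old_graph on the next call.
graph: Graph = {"m": object()}
graph = load_graph(["m"], old_graph=graph)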
mypy/dmypy_server.py
Outdated
            self.fswatcher.set_file_data(
                mypyfile.path,
                state.xpath,
Using xpath is dubious. It may be '<string>'. I assume you need something that's not Optional, so I'd use path and add an assert that it's not None.
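The suggested pattern in isolation (a sketch; watched_path is an invented helper, not part of mypy):

from typing import Optional

def watched_path(path: Optional[str]) -> str:
    # xpath may be the '<string>' placeholder for inline sources, so
    # prefer the real path, and make the Optional type explicit with an
    # assert instead of passing a possibly fake filename to the watcher.
    assert path is not None
    return path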
        priorities = {}  # type: Dict[str, int]  # id -> priority
        dep_line_map = {}  # type: Dict[str, int]  # id -> line
        for pri, id, line in manager.all_imported_modules_in_file(self.tree):
            priorities[id] = min(pri, priorities.get(id, PRI_ALL))
            if id == self.id:
                continue
            # Omit missing modules, as otherwise we could not type-check
            # programs with missing modules.
            if id in manager.missing_modules:
I hope you know what you're doing by removing this block of code. I can't assess the effect easily by just reading the code (though it's plausible that the code around line 1628 takes care of it).
The graph loading code is pretty subtle, so I wouldn't say I am 100% confident, but I am reasonably so---I discussed my reasoning some in my comment above.
OK, I'm fine with this.
The updates LGTM
* master:
  New files shouldn't trigger a coarse-grained rebuild in fg cache mode (python#4669)
  Bump version to 0.580-dev
  Update revision history for 0.570 (python#4662)
  Fine-grained: Fix crashes when refreshing synthetic types (python#4667)
  Fine-grained: Support NewType and reset subtype caches (python#4656)
  Fine-grained: Detect changes in additional TypeInfo attributes (python#4659)
  Fine-grained: Apply semantic analyzer patch callbacks (python#4658)
  Optimize fine-grained update by using Graph as the cache (python#4622)
  Cleanup check_reverse_op_method (python#4017)
  Fine-grained: Fix AST merge issues (python#4652)
  Optionally check that we don't have duplicate nodes after AST merge (python#4647)
Make fine-grained update use a Graph as the source of truth instead of a SavedCache. We modify load_graph to optionally use an existing graph as a starting point. This allows us to skip the expensive full load_graph and preserve_full_cache operations that essentially only convert between Graph and SavedCache.

This requires a nontrivial change to the loading of dependency information in build. Since we are no longer doing a full `load_graph()`, we can't rely on it fully populating `missing_modules` if it is cleared between updates. If `missing_modules` isn't fully populated, then semanal will spuriously reject `from x import y` where `x.y` is a missing module.

We can't just *not* clear `missing_modules`, however, because `parse_file` uses it to decide that modules are suppressed. But this is actually wrong! It can lead to an import failure message not being generated because some other module has already failed to import it (and perhaps even ignored the error).

`parse_file()` shouldn't actually *need* to compute `suppressed`, though. If it is called on a file with no cache, `load_graph()` will handle moving deps to `suppressed`, and if called on a file that has had cache info loaded, then the dependency information has all been loaded from the cache. So we refactor things so that dep information from the cache is used when available and `parse_file` doesn't need to deal with it.

I strongly suspect that this refactor obviates the need for `fix_suppressed_dependencies`, but I was not able to quickly produce a test case from the description in python#2036, so I am not comfortable making that change.
Make fine-grained update use a Graph as the source of truth instead of a SavedCache. We modify load_graph to optionally use an existing graph as a starting point.

This allows us to skip the expensive full load_graph (~500ms on S) and preserve_full_cache (~200ms on S) operations that essentially only convert between Graph and SavedCache.
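Putting the pieces together, the change to the fine-grained update loop amounts to roughly this, sketched with invented names (FineGrainedManager, load_graph_sketch) around the ideas in the description; it is not mypy's actual code:

from typing import Dict, List, Optional

Graph = Dict[str, object]  # module id -> State, heavily simplified

def load_graph_sketch(changed_modules: List[str],
                      old_graph: Optional[Graph] = None) -> Graph:
    # Reuse the caller's graph when one is supplied; only changed
    # modules get (re)loaded into it.
    graph = old_graph or {}  # type: Graph
    for module_id in changed_modules:
        graph[module_id] = object()  # stand-in for a reloaded State
    return graph

class FineGrainedManager:
    """Sketch of the new shape of the fine-grained update loop."""

    def __init__(self, graph: Graph) -> None:
        # Old approach: after every build, convert the Graph into a
        # SavedCache (preserve_full_cache) and rebuild a Graph from it
        # on the next update.  New approach: keep the Graph itself.
        self.graph = graph

    def update(self, changed_modules: List[str]) -> None:
        # Pass the existing graph back in as the starting point instead
        # of reconstructing it from a SavedCache.
        self.graph = load_graph_sketch(changed_modules,
                                       old_graph=self.graph)

manager = FineGrainedManager(load_graph_sketch(["a", "b"]))
manager.update(["b"])         # only 'b' is reprocessed
print(sorted(manager.graph))  # ['a', 'b']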