Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize fine-grained update by using Graph as the cache #4622

Merged
merged 10 commits into from
Mar 1, 2018

Conversation

msullivan
Copy link
Collaborator

Make fine-grained update use a Graph as the source of truth instead of a SavedCache. We modify load_graph to optionally use an existing graph as a starting point.

This allows us to skip the expensive full load_graph (~500ms on S) and preserve_full_cache (~200ms on S) operations that essentially only convert between Graph and SavedCache.

@JukkaL
Copy link
Collaborator

JukkaL commented Feb 23, 2018

This makes a big performance difference, excellent! Using my randomized tester script, a cold run was about 70s (several files with small changes, time includes cache download) in the Dropbox S repo, and incremental updates were typically 5-10s (and no more than 20s).

I did a fair amount of testing, and found some common false positives. I isolated them into these test cases that pass on master:

[case testRefreshImportOfIgnoredModule1]
# flags: --follow-imports=skip --ignore-missing-imports
# cmd: mypy c.py a/__init__.py b.py
[file c.py]
from a import a2
import b
b.x
[file a/__init__.py]
[file b.py]
x = 0
[file b.py.2]
x = ''
[file b.py.3]
x = 0
[file a/a2.py]
[out]
==
==

[case testRefreshImportOfIgnoredModule2]
# flags: --follow-imports=skip --ignore-missing-imports
# cmd: mypy c.py a/__init__.py b.py
[file c.py]
from a import a2
import b
b.x
[file a/__init__.py]
[file b.py]
x = 0
[file b.py.2]
x = ''
[file b.py.3]
x = 0
[file a/a2/__init__.py]
[out]
==
==

It looks like that for coarse-grained incremental the existence of a file not included in a build affects build results. Fine-grained incremental mode considers the file as non-existent even if it exists (but apparently only when propagating fine-grained dependencies).

Example failure:

______________________________ testRefreshImportOfIgnoredModule2_cached ______________________________
data: /Users/jukka/src/mypy/test-data/unit/fine-grained-modules.test:868:
/Users/jukka/venv/mypy/lib/python3.6/site-packages/pluggy/__init__.py:617: in __call__
    return self._hookexec(self, self._nonwrappers + self._wrappers, kwargs)
/Users/jukka/venv/mypy/lib/python3.6/site-packages/pluggy/__init__.py:222: in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
/Users/jukka/venv/mypy/lib/python3.6/site-packages/pluggy/__init__.py:216: in <lambda>
    firstresult=hook.spec_opts.get('firstresult'),
/Users/jukka/venv/mypy/lib/python3.6/site-packages/_pytest/runner.py:109: in pytest_runtest_call
    item.runtest()
/Users/jukka/src/mypy/mypy/test/data.py:647: in runtest
    suite.run_case(self.case)
/Users/jukka/src/mypy/mypy/test/testfinegrained.py:122: in run_case
    testcase.file, testcase.line))
/Users/jukka/src/mypy/mypy/test/helpers.py:95: in assert_string_arrays_equal
    raise AssertionError(msg)
E   AssertionError: Invalid output (/Users/jukka/src/mypy/test-data/unit/fine-grained-modules.test, line 868)
---------------------------------------- Captured stderr call ----------------------------------------
Expected:
  ==
  ==
Actual:
  ==
  ==
  c.py:1: error: Module 'a' has no attribute 'a2' (diff)

@msullivan
Copy link
Collaborator Author

The problem is due to the handling of missing_modules. Hopefully I'll have a fix tomorrow.

This has been a bit of an adventure.

Since we are no longer doing a full `load_graph()`, we can't rely on
it fully populating `missing_modules` if it is cleared between
updates. If `missing_modules` isn't fully populated, then semanal will
spuriously reject `from x import y` where `x.y` is a missing module.

We can't just *not* clear `missing_modules`, however, because
`parse_file` uses it to decide that modules are suppressed. But this
is actually wrong! It can lead to an import failure message not being
generated because some other module has already failed to import it
(and perhaps even ignored the error).

`parse_file()` shouldn't actually *need* to compute `suppressed`,
though. If it is called on a file with no cache, `load_graph()` will
handle moving deps to `suppressed`, and if called on a file that has
had cache info loaded, then the dependency information has all been
loaded from the cache.

So we refactor things so that dep information from the cache is used
when available and `parse_file` doesn't need to deal with it.

I strongly suspect that this refactor obviates the need for
`fix_suppressed_dependencies`, but I was not able to quickly produce a
test case from the description in #2036, so I am not comfortable
making that change.
@msullivan
Copy link
Collaborator Author

Fixed the missing_modules related failures.

This has been a bit of an adventure.

Since we are no longer doing a full load_graph(), we can't rely on
it fully populating missing_modules if it is cleared between
updates. If missing_modules isn't fully populated, then semanal will
spuriously reject from x import y where x.y is a missing module.

We can't just not clear missing_modules, however, because
parse_file uses it to decide that modules are suppressed. But this
is actually wrong! It can lead to an import failure message not being
generated because some other module has already failed to import it
(and perhaps even ignored the error).

parse_file() shouldn't actually need to compute suppressed,
though. If it is called on a file with no cache, load_graph() will
handle moving deps to suppressed, and if called on a file that has
had cache info loaded, then the dependency information has all been
loaded from the cache.

So we refactor things so that dep information from the cache is used
when available and parse_file doesn't need to deal with it.

I strongly suspect that this refactor obviates the need for
fix_suppressed_dependencies, but I was not able to quickly produce a
test case from the description in #2036, so I am not comfortable
making that change.

Copy link
Collaborator

@JukkaL JukkaL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to see so many lines deleted! Looks good, but I don't have enough context on mypy.build to say if the logic change related to suppressed is without problems. Worth asking for a second opinion from Guido.

Feel free to merge once you've addressed my feedback and asked Guido about this (unless you make some major changes, in which case I'd like to have another look).

mypy/build.py Outdated
@@ -1896,49 +1883,54 @@ def parse_file(self) -> None:
# TODO: Why can't SemanticAnalyzerPass1 .analyze() do this?
self.tree.names = manager.semantic_analyzer.globals

for pri, id, line in manager.all_imported_modules_in_file(self.tree):
if id == '':
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not obvious why an empty string indicates a relative import. Add a longer description, probably here and in the docstring of all_imported_modules_in_file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the check into all_imported_modules_in_file instead

mypy/build.py Outdated
# least one imported name isn't a submodule)
# cur_id is also a dependency, and we should
# insert it *before* any submodules.
if not all_are_submodules:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this check removed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lack of the parent module in the dependency list was involved in a weird interaction with the original changes that resulted in the wrong module name appearing in an error message in one case.

The other changes to build's dependency handling appear to solve this without this check, though, so I'm going to undo this change.

mypy/checker.py Outdated
"""
self.partial_reported.clear()
assert self.partial_types == []
assert self.deferred_nodes == []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also reset type_map? We could ensure that the binder and scope are clear (or reset them as well). Also, consider clearing inferred_attribute_types and module_refs. Or maybe it would be easier to create a new TypeChecker object (doesn't need to happen in this PR).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got a noticeable (though not particularly big) performance win by reseting the TypeChecker instead of creating a new one.

mark_all_meta_as_memory_only(graph, manager)
manager.saved_cache = preserve_full_cache(graph, manager)
self.type_maps = extract_type_maps(graph)
manager.saved_cache = {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the docstring of BuildManager to reflect the fact that saved_cache is no longer used in fine-grained incremental mode.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

Once this is landed I'm also going to just totally rip out the coarse-grained daemon and all of the saved_cache machinery entirely.

[delete c.py.2]
[file b.py.2]
class Foo:
def foo(self) -> None: pass
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a third step where the signature of foo is changed, which would result in an error in the main module.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@msullivan msullivan requested a review from gvanrossum February 28, 2018 20:56
mypy/build.py Outdated
@@ -658,23 +656,15 @@ def correct_rel_imp(imp: Union[ImportFrom, ImportAll]) -> str:
elif isinstance(imp, ImportFrom):
cur_id = correct_rel_imp(imp)
pos = len(res)
all_are_submodules = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm suspicious of the removal of this behavior -- I'm pretty sure it was needed at some point (perhaps to handle import cycles in incremental mode).

"""Given some source files, load the full dependency graph.

As this may need to parse files, this can raise CompileError in case
there are syntax errors.
"""
graph = {} # type: Graph

graph = old_graph or {} # type: Graph
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it matter that this may mutate the old_graph argument in place? (Except when called with an empty dict.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's the goal.

self.fswatcher.set_file_data(
mypyfile.path,
state.xpath,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using xpath is dubious. It may be '<string>'. I assume you need something that's not Optional, so I'd use path and add an assert that it's not None.

priorities = {} # type: Dict[str, int] # id -> priority
dep_line_map = {} # type: Dict[str, int] # id -> line
for pri, id, line in manager.all_imported_modules_in_file(self.tree):
priorities[id] = min(pri, priorities.get(id, PRI_ALL))
if id == self.id:
continue
# Omit missing modules, as otherwise we could not type-check
# programs with missing modules.
if id in manager.missing_modules:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope you know what you're doing by removing this block of code. I can't assess the effect easily by just reading the code (though it's plausible that the code around line 1628 takes care of it).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The graph loading code is pretty subtle, so I wouldn't say I am 100% confident, but I am reasonably so---I discussed my reasoning some in my comment above.

@gvanrossum
Copy link
Member

OK, I'm fine with this.

Copy link
Collaborator

@JukkaL JukkaL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updates LGTM

@msullivan msullivan merged commit 3e1baae into master Mar 1, 2018
@msullivan msullivan deleted the load_graph_opt branch March 1, 2018 20:17
carljm added a commit to carljm/mypy that referenced this pull request Mar 6, 2018
* master:
  New files shouldn't trigger a coarse-grained rebuild in fg cache mode (python#4669)
  Bump version to 0.580-dev
  Update revision history for 0.570 (python#4662)
  Fine-grained: Fix crashes when refreshing synthetic types (python#4667)
  Fine-grained: Support NewType and reset subtype caches (python#4656)
  Fine-grained: Detect changes in additional TypeInfo attributes (python#4659)
  Fine-grained: Apply semantic analyzer patch callbacks (python#4658)
  Optimize fine-grained update by using Graph as the cache (python#4622)
  Cleanup check_reverse_op_method (python#4017)
  Fine-grained: Fix AST merge issues (python#4652)
  Optionally check that we don't have duplicate nodes after AST merge (python#4647)
yedpodtrzitko pushed a commit to kiwicom/mypy that referenced this pull request Mar 15, 2018
Make fine-grained update use a Graph as the source of truth instead of a SavedCache. We modify load_graph to optionally use an existing graph as a starting point. This allows us to skip the expensive full load_graph and preserve_full_cache operations that essentially only convert between Graph and SavedCache.

This requires a nontrivial change to the loading of dependency information in build.
Since we are no longer doing a full `load_graph()`, we can't rely on
it fully populating `missing_modules` if it is cleared between
updates. If `missing_modules` isn't fully populated, then semanal will
spuriously reject `from x import y` where `x.y` is a missing module.

We can't just *not* clear `missing_modules`, however, because
`parse_file` uses it to decide that modules are suppressed. But this
is actually wrong! It can lead to an import failure message not being
generated because some other module has already failed to import it
(and perhaps even ignored the error).

`parse_file()` shouldn't actually *need* to compute `suppressed`,
though. If it is called on a file with no cache, `load_graph()` will
handle moving deps to `suppressed`, and if called on a file that has
had cache info loaded, then the dependency information has all been
loaded from the cache.

So we refactor things so that dep information from the cache is used
when available and `parse_file` doesn't need to deal with it.

I strongly suspect that this refactor obviates the need for
`fix_suppressed_dependencies`, but I was not able to quickly produce a
test case from the description in python#2036, so I am not comfortable
making that change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants