Optimize fine-grained update by using Graph as the cache #4622

msullivan · 2018-02-22T22:22:34Z

Make fine-grained update use a Graph as the source of truth instead of a SavedCache. We modify load_graph to optionally use an existing graph as a starting point.

This allows us to skip the expensive full load_graph (~500ms on S) and preserve_full_cache (~200ms on S) operations that essentially only convert between Graph and SavedCache.

JukkaL · 2018-02-23T16:10:20Z

This makes a big performance difference, excellent! Using my randomized tester script, a cold run was about 70s (several files with small changes, time includes cache download) in the Dropbox S repo, and incremental updates were typically 5-10s (and no more than 20s).

I did a fair amount of testing, and found some common false positives. I isolated them into these test cases that pass on master:

[case testRefreshImportOfIgnoredModule1]
# flags: --follow-imports=skip --ignore-missing-imports
# cmd: mypy c.py a/__init__.py b.py
[file c.py]
from a import a2
import b
b.x
[file a/__init__.py]
[file b.py]
x = 0
[file b.py.2]
x = ''
[file b.py.3]
x = 0
[file a/a2.py]
[out]
==
==

[case testRefreshImportOfIgnoredModule2]
# flags: --follow-imports=skip --ignore-missing-imports
# cmd: mypy c.py a/__init__.py b.py
[file c.py]
from a import a2
import b
b.x
[file a/__init__.py]
[file b.py]
x = 0
[file b.py.2]
x = ''
[file b.py.3]
x = 0
[file a/a2/__init__.py]
[out]
==
==

It looks like that for coarse-grained incremental the existence of a file not included in a build affects build results. Fine-grained incremental mode considers the file as non-existent even if it exists (but apparently only when propagating fine-grained dependencies).

Example failure:

______________________________ testRefreshImportOfIgnoredModule2_cached ______________________________
data: /Users/jukka/src/mypy/test-data/unit/fine-grained-modules.test:868:
/Users/jukka/venv/mypy/lib/python3.6/site-packages/pluggy/__init__.py:617: in __call__
    return self._hookexec(self, self._nonwrappers + self._wrappers, kwargs)
/Users/jukka/venv/mypy/lib/python3.6/site-packages/pluggy/__init__.py:222: in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
/Users/jukka/venv/mypy/lib/python3.6/site-packages/pluggy/__init__.py:216: in <lambda>
    firstresult=hook.spec_opts.get('firstresult'),
/Users/jukka/venv/mypy/lib/python3.6/site-packages/_pytest/runner.py:109: in pytest_runtest_call
    item.runtest()
/Users/jukka/src/mypy/mypy/test/data.py:647: in runtest
    suite.run_case(self.case)
/Users/jukka/src/mypy/mypy/test/testfinegrained.py:122: in run_case
    testcase.file, testcase.line))
/Users/jukka/src/mypy/mypy/test/helpers.py:95: in assert_string_arrays_equal
    raise AssertionError(msg)
E   AssertionError: Invalid output (/Users/jukka/src/mypy/test-data/unit/fine-grained-modules.test, line 868)
---------------------------------------- Captured stderr call ----------------------------------------
Expected:
  ==
  ==
Actual:
  ==
  ==
  c.py:1: error: Module 'a' has no attribute 'a2' (diff)

msullivan · 2018-02-26T06:50:58Z

The problem is due to the handling of missing_modules. Hopefully I'll have a fix tomorrow.

This has been a bit of an adventure. Since we are no longer doing a full `load_graph()`, we can't rely on it fully populating `missing_modules` if it is cleared between updates. If `missing_modules` isn't fully populated, then semanal will spuriously reject `from x import y` where `x.y` is a missing module. We can't just *not* clear `missing_modules`, however, because `parse_file` uses it to decide that modules are suppressed. But this is actually wrong! It can lead to an import failure message not being generated because some other module has already failed to import it (and perhaps even ignored the error). `parse_file()` shouldn't actually *need* to compute `suppressed`, though. If it is called on a file with no cache, `load_graph()` will handle moving deps to `suppressed`, and if called on a file that has had cache info loaded, then the dependency information has all been loaded from the cache. So we refactor things so that dep information from the cache is used when available and `parse_file` doesn't need to deal with it. I strongly suspect that this refactor obviates the need for `fix_suppressed_dependencies`, but I was not able to quickly produce a test case from the description in #2036, so I am not comfortable making that change.

msullivan · 2018-02-28T05:35:25Z

Fixed the missing_modules related failures.

This has been a bit of an adventure.

Since we are no longer doing a full load_graph(), we can't rely on
it fully populating missing_modules if it is cleared between
updates. If missing_modules isn't fully populated, then semanal will
spuriously reject from x import y where x.y is a missing module.

We can't just not clear missing_modules, however, because
parse_file uses it to decide that modules are suppressed. But this
is actually wrong! It can lead to an import failure message not being
generated because some other module has already failed to import it
(and perhaps even ignored the error).

parse_file() shouldn't actually need to compute suppressed,
though. If it is called on a file with no cache, load_graph() will
handle moving deps to suppressed, and if called on a file that has
had cache info loaded, then the dependency information has all been
loaded from the cache.

So we refactor things so that dep information from the cache is used
when available and parse_file doesn't need to deal with it.

I strongly suspect that this refactor obviates the need for
fix_suppressed_dependencies, but I was not able to quickly produce a
test case from the description in #2036, so I am not comfortable
making that change.

JukkaL

Great to see so many lines deleted! Looks good, but I don't have enough context on mypy.build to say if the logic change related to suppressed is without problems. Worth asking for a second opinion from Guido.

Feel free to merge once you've addressed my feedback and asked Guido about this (unless you make some major changes, in which case I'd like to have another look).

JukkaL · 2018-02-28T17:19:43Z

mypy/build.py

@@ -1896,49 +1883,54 @@ def parse_file(self) -> None:
        # TODO: Why can't SemanticAnalyzerPass1 .analyze() do this?
        self.tree.names = manager.semantic_analyzer.globals

+        for pri, id, line in manager.all_imported_modules_in_file(self.tree):
+            if id == '':


It's not obvious why an empty string indicates a relative import. Add a longer description, probably here and in the docstring of all_imported_modules_in_file.

Moving the check into all_imported_modules_in_file instead

JukkaL · 2018-02-28T17:22:17Z

mypy/build.py

-                    # least one imported name isn't a submodule)
-                    # cur_id is also a dependency, and we should
-                    # insert it *before* any submodules.
-                    if not all_are_submodules:


Why was this check removed?

The lack of the parent module in the dependency list was involved in a weird interaction with the original changes that resulted in the wrong module name appearing in an error message in one case.

The other changes to build's dependency handling appear to solve this without this check, though, so I'm going to undo this change.

JukkaL · 2018-02-28T17:25:29Z

mypy/checker.py

+        """
+        self.partial_reported.clear()
+        assert self.partial_types == []
+        assert self.deferred_nodes == []


Maybe also reset type_map? We could ensure that the binder and scope are clear (or reset them as well). Also, consider clearing inferred_attribute_types and module_refs. Or maybe it would be easier to create a new TypeChecker object (doesn't need to happen in this PR).

I got a noticeable (though not particularly big) performance win by reseting the TypeChecker instead of creating a new one.

JukkaL · 2018-02-28T17:29:37Z

mypy/server/update.py

-        mark_all_meta_as_memory_only(graph, manager)
-        manager.saved_cache = preserve_full_cache(graph, manager)
-        self.type_maps = extract_type_maps(graph)
+        manager.saved_cache = {}


Update the docstring of BuildManager to reflect the fact that saved_cache is no longer used in fine-grained incremental mode.

Will do.

Once this is landed I'm also going to just totally rip out the coarse-grained daemon and all of the saved_cache machinery entirely.

JukkaL · 2018-02-28T17:32:34Z

test-data/unit/fine-grained-modules.test

+[delete c.py.2]
+[file b.py.2]
+class Foo:
+    def foo(self) -> None: pass


Maybe add a third step where the signature of foo is changed, which would result in an error in the main module.

This reverts commit c21e216. Also fix tests, since this reintroduces a cache/no-cache divergence.

gvanrossum · 2018-02-23T18:49:33Z

mypy/build.py

@@ -658,23 +656,15 @@ def correct_rel_imp(imp: Union[ImportFrom, ImportAll]) -> str:
                elif isinstance(imp, ImportFrom):
                    cur_id = correct_rel_imp(imp)
                    pos = len(res)
-                    all_are_submodules = True


I'm suspicious of the removal of this behavior -- I'm pretty sure it was needed at some point (perhaps to handle import cycles in incremental mode).

gvanrossum · 2018-02-23T21:56:18Z

mypy/build.py

    """Given some source files, load the full dependency graph.

    As this may need to parse files, this can raise CompileError in case
    there are syntax errors.
    """
-    graph = {}  # type: Graph
+
+    graph = old_graph or {}  # type: Graph


Does it matter that this may mutate the old_graph argument in place? (Except when called with an empty dict.)

Yep, that's the goal.

gvanrossum · 2018-02-28T22:34:12Z

mypy/dmypy_server.py

                self.fswatcher.set_file_data(
-                    mypyfile.path,
+                    state.xpath,


Using xpath is dubious. It may be '<string>'. I assume you need something that's not Optional, so I'd use path and add an assert that it's not None.

gvanrossum · 2018-02-28T22:37:26Z

mypy/build.py

        priorities = {}  # type: Dict[str, int]  # id -> priority
        dep_line_map = {}  # type: Dict[str, int]  # id -> line
        for pri, id, line in manager.all_imported_modules_in_file(self.tree):
            priorities[id] = min(pri, priorities.get(id, PRI_ALL))
            if id == self.id:
                continue
-            # Omit missing modules, as otherwise we could not type-check
-            # programs with missing modules.
-            if id in manager.missing_modules:


I hope you know what you're doing by removing this block of code. I can't assess the effect easily by just reading the code (though it's plausible that the code around line 1628 takes care of it).

The graph loading code is pretty subtle, so I wouldn't say I am 100% confident, but I am reasonably so---I discussed my reasoning some in my comment above.

gvanrossum · 2018-03-01T04:20:41Z

OK, I'm fine with this.

JukkaL

The updates LGTM

* master: New files shouldn't trigger a coarse-grained rebuild in fg cache mode (python#4669) Bump version to 0.580-dev Update revision history for 0.570 (python#4662) Fine-grained: Fix crashes when refreshing synthetic types (python#4667) Fine-grained: Support NewType and reset subtype caches (python#4656) Fine-grained: Detect changes in additional TypeInfo attributes (python#4659) Fine-grained: Apply semantic analyzer patch callbacks (python#4658) Optimize fine-grained update by using Graph as the cache (python#4622) Cleanup check_reverse_op_method (python#4017) Fine-grained: Fix AST merge issues (python#4652) Optionally check that we don't have duplicate nodes after AST merge (python#4647)

Make fine-grained update use a Graph as the source of truth instead of a SavedCache. We modify load_graph to optionally use an existing graph as a starting point. This allows us to skip the expensive full load_graph and preserve_full_cache operations that essentially only convert between Graph and SavedCache. This requires a nontrivial change to the loading of dependency information in build. Since we are no longer doing a full `load_graph()`, we can't rely on it fully populating `missing_modules` if it is cleared between updates. If `missing_modules` isn't fully populated, then semanal will spuriously reject `from x import y` where `x.y` is a missing module. We can't just *not* clear `missing_modules`, however, because `parse_file` uses it to decide that modules are suppressed. But this is actually wrong! It can lead to an import failure message not being generated because some other module has already failed to import it (and perhaps even ignored the error). `parse_file()` shouldn't actually *need* to compute `suppressed`, though. If it is called on a file with no cache, `load_graph()` will handle moving deps to `suppressed`, and if called on a file that has had cache info loaded, then the dependency information has all been loaded from the cache. So we refactor things so that dep information from the cache is used when available and `parse_file` doesn't need to deal with it. I strongly suspect that this refactor obviates the need for `fix_suppressed_dependencies`, but I was not able to quickly produce a test case from the description in python#2036, so I am not comfortable making that change.

msullivan requested review from JukkaL and ilevkivskyi February 22, 2018 22:22

msullivan mentioned this pull request Feb 23, 2018

Overhaul file-system operations to use FileSystemCache #4623

Merged

msullivan added 5 commits February 26, 2018 17:19

Less "Skipping" debug spew

7e22778

Remove the all_are_submodules handling

c21e216

Eliminate serializing into and out of SavedCache

d844227

Clean up a lot of cruft

2218c11

msullivan force-pushed the load_graph_opt branch from 4da63be to 7520f89 Compare February 28, 2018 05:34

Add a missing type annotation

95c3244

JukkaL reviewed Feb 28, 2018

View reviewed changes

msullivan added 3 commits February 28, 2018 10:22

Revert "Remove the all_are_submodules handling"

934588b

This reverts commit c21e216. Also fix tests, since this reintroduces a cache/no-cache divergence.

Various cleanups

f4e55f4

More clearing and asserts in the typechecker reset

8322c8c

msullivan requested a review from gvanrossum February 28, 2018 20:56

gvanrossum reviewed Feb 28, 2018

View reviewed changes

Minor cleanups

d47920d

JukkaL approved these changes Mar 1, 2018

View reviewed changes

msullivan merged commit 3e1baae into master Mar 1, 2018

msullivan deleted the load_graph_opt branch March 1, 2018 20:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize fine-grained update by using Graph as the cache #4622

Optimize fine-grained update by using Graph as the cache #4622

msullivan commented Feb 22, 2018

JukkaL commented Feb 23, 2018

msullivan commented Feb 26, 2018

msullivan commented Feb 28, 2018

JukkaL left a comment

JukkaL Feb 28, 2018

msullivan Feb 28, 2018

JukkaL Feb 28, 2018

msullivan Feb 28, 2018

JukkaL Feb 28, 2018

msullivan Feb 28, 2018

JukkaL Feb 28, 2018

msullivan Feb 28, 2018

JukkaL Feb 28, 2018

msullivan Feb 28, 2018

gvanrossum Feb 23, 2018

gvanrossum Feb 23, 2018

msullivan Feb 28, 2018

gvanrossum Feb 28, 2018

gvanrossum Feb 28, 2018

msullivan Feb 28, 2018

gvanrossum commented Mar 1, 2018

JukkaL left a comment

Optimize fine-grained update by using Graph as the cache #4622

Optimize fine-grained update by using Graph as the cache #4622

Conversation

msullivan commented Feb 22, 2018

JukkaL commented Feb 23, 2018

msullivan commented Feb 26, 2018

msullivan commented Feb 28, 2018

JukkaL left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gvanrossum commented Mar 1, 2018

JukkaL left a comment

Choose a reason for hiding this comment