Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect and record implicit dependencies in incremental mode #2041

Merged
merged 18 commits into from
Sep 2, 2016

Conversation

Michael0x2a
Copy link
Collaborator

@Michael0x2a Michael0x2a commented Aug 19, 2016

This pull request modifies the typechecking phase and adds a new visitor to check types used within a particular module to find indirect module dependencies. These indirect module dependencies are then added to the set of a module's dependencies with a lowered priority so that they may be easily referenced when attempting to check if a module's dependencies have a stale interface or not. Now, instead of having to search through the import graph to see if a module's interface is stale or not, I only need to check the set of dependencies for that module, which is a potential optimization gain.

I've done some informal testing, and it seems like this optimization will decrease the time it takes for mypy to recheck a project after changing the interface of a single file by roughly 20% to 50%, depending on exactly which file is rechecked/the size of the repo. (If a project happens to have very few indirect dependencies, the percentage will naturally trend closer to 0%.)

However, this optimization does slow down how long it takes to run mypy with either a completely cold cache or a completely warm cache by approximately 5% to 10%. Figuring out how to speed up the changes in these cases is something I will most likely try solving in a separate pull request, unless I'm hit by a sudden burst of inspiration.


More details:

I define the term "indirect dependency" as a module that is not explicitly imported but still affects the types of the current module. For example, consider the following program:

[file main.py]
from mod1 import A
localC = A().makeB()  # implicit dependency on mod3
val = localC.foo()  # implicit dependency on mod3

[file unrelated.py]
import mod2
def foo() -> None:
    mod2.unrelated_function_here()  # etc...

[file mod1.py]
from mod2 import B
class A:
    def makeB(self) -> B: return B()

[file mod2.py]
from mod3 import B
def unrelated_function_here() -> None:
    bar = B().foo() # etc...

[file mod3.py]
class B:
    def foo(self) -> int: return 0

In this test case, mod1 is an direct dependency of main, since it is directly imported. Module mod3.py is an indirect dependency of main, since it was not directly imported, but does affect the type of the val global variable of main and therefore its public interface.

Note that mod2 is neither a direct or indirect dependency of main, since main does not in any way use types or symbols that were defined within mod2.

However, mod2 is an direct dependency of unrelated. Note that mod3 is neither a direct nor indirect dependency of unrelated.

This test case forms the following import graph (omitting the builtins module and all associated machinery):

  . . . . . . . . . . . . . . . . . . . .
  :                         :           :
  :                         V           V
mod3 ------> mod2 ------> mod1 ------> main
              |
              +-----> unrelated

The solid lines indicate direct dependencies, and the dotted lines indicate indirect dependencies. (The dependees point to the dependers).

The problem here is that if mod3's interface is changed, we would also need to check mod2 and all of its dependencies recursively since we have no idea what modules later in the import graph indirectly rely on us. So, in the current version of incremental mode, an interface change to mod3 forces us to recheck the entire graph, including the unrelated module.

Proposal:

My proposal is to perform an operation similar to a transitive closure and explicitly record the indirect dependencies within the dependencies field of the BuildState class, and assign each indirect dependency a lowered priority compared to any import.

I then propose to remove the current requirement that an interface change in a module automatically forces an interface change in anything that imports it.

Instead, since we explicitly keep track of both direct and indirect dependencies, it should be sufficient to just check those enumerated files. So that means in this case, an interface change in mod3 would trigger a recheck in mod2, mod1, and main, but NOT in unrelated. Since mod2's interface would not be changed by an interface change in mod2 (unless we delete or rename the B class, the unrelated module would not need to be rechecked in any way in most cases.

More visually, to determine if we need to recheck a module, we follow the arrows backwards exactly once.

This doesn't seem like a huge win here since we get to save rechecking only a single file, but if the unrelated module is subsequently imported by many different modules, this change allows us to avoid having to recheck a potentially large, unconnected SCC.

Potential concerns:

The main danger of this change is that it works by changing the import graph AFTER the files are fully typechecked. (The reason why we can only record these changes after the typechecking phase is because we need to know what the finalized types in the AST are before we can reliably determine what our indirect dependencies are).

This means in incremental mode, the import graph will look different on the second run of mypy compared to the first (though all subsequent runs should have the same import graph).

For example, if we take the above test case, the import graph would look like:

mod3 ------> mod2 ------> mod1 ------> main
              |
              +-----> unrelated

...at the start of the first run (when incremental mode is enabled), but would look like:

  . . . . . . . . . . . . . . . . . . . .
  :                         :           :
  :                         V           V
mod3 ------> mod2 ------> mod1 ------> main
              |
              +-----> unrelated

...at the start of all subsequent runs. This ends up breaking the invariant that incremental mode does not change the import graph and that the import graph looks the same between incremental mode and non-incremental mode. This may have subtle implications in complex graphs involving many cycles.

Sketch of correctness:

However, I believe this change is a safe one. If we have an indirect dependency between modules A and B, that means there must have been a direct chain of dependencies of the form A -> mod1 -> mod2 -> ... -> modN -> B. This means that module A would have been naturally loaded before B anyways, and adding a low-priority arrow connecting the two does not change the results. (That is, the existing SCC generation code produces a partial ordering of the modules by performing a modified topographical sort, and adding more arrows in the manner described above shouldn't seem like it should change that partial ordering).

Since the indirect dependencies are also given a lowered priority then any other module, they also shouldn't drastically affect how the SCC generation/sorting code works, since every other kind of import will take priority.

Rejected idea:

The original plan was to simply just outlaw indirect dependencies altogether, which worked beautifully well in the following case:

[file main.py]
import mod1
mod1.mod2.value  #  Implicit dependency on mod2!

[file mod1.py]
import mod2

[file mod2.py]
value = 3

...but proved impossible in cases similar to the first test case listed above.

It also turned out that regular code contained far too many indirect dependencies of the first form in regular code, making outlawing them a non-starter.

@Michael0x2a Michael0x2a force-pushed the incremental-interface-3 branch from 2a8c180 to 8ebcdf2 Compare August 26, 2016 00:21
@Michael0x2a Michael0x2a changed the title [WIP] Disallow accessing modules indirectly [WIP] Detect and record implicit dependencies in incremental mode Aug 26, 2016
@Michael0x2a Michael0x2a force-pushed the incremental-interface-3 branch 3 times, most recently from 9ad052f to 8d3d327 Compare August 31, 2016 20:28
@Michael0x2a Michael0x2a changed the title [WIP] Detect and record implicit dependencies in incremental mode Detect and record implicit dependencies in incremental mode Aug 31, 2016
@Michael0x2a Michael0x2a force-pushed the incremental-interface-3 branch from 69e5d31 to cc90602 Compare August 31, 2016 23:47
@@ -70,11 +102,16 @@ def __init__(self,
self.msg = msg
self.strfrm_checker = StringFormatterChecker(self, self.chk, self.msg)

def _visit_typeinfo(self, info: nodes.TypeInfo) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unused?

@gvanrossum
Copy link
Member

Thanks for the extensive write-up! I've only found some nits; after those are addressed I'll re-review and merge.

This commit modifies the incremental tests (and in particular, the
'rechecked' and 'stale' annotations) to more accurately describe what
the expected behavior should be after making interfaces changes
propagate only one step out.
This commit modifies TypeInfo so it records the name of the module in
which the class (or NewType or namedtuple) was defined in. This does not
modify the behavior of mypy, but will make it easier to perform
introspection and analysis at later stages.

This information is also a part of the TypeInfo's fullname, but is
difficult to extract in the case of nested classes. For example, if we
have a fully qualified name of a.b.c.d, it's not always obvious whether
or not c is a module name or is a class name, and so forth.
This commit modifies the typechecking stage to record all module
references that a particular module makes. This information will be
used in a future commit.

The motivation for this change is that it turns out that the set of
all module references is often larger then the set of explicitly
imported modules, and having this information would allow us to make
incremental mode more aggressive.
This commit adds a new visitor to extract all module references from
arbitrary types.

This visitor should be considered provisional -- it will most likely be
merged in with the typechecker at a later stage to make it a less
expensive call.
This commit modifies the build process to record all implicit module
references a given file makes when writing cache data. Previously, the
cache data would only save all explicit dependencies (i.e. modules used
within a direct import).

This posed a problem because it turned out it's trivially easy to
construct a series of interlinked files wherein changing the interface
of one module produces or fixes an error in a seemingly unrelated file
(which did NOT import anything from the first module).

This made optimizing incremental mode harder since an interface change
could have unknown effects on the codebase as a whole.

This change modifies how dependencies are recorded so BOTH explicit
(directly imported) and implicit (indirectly reliant) dependences are
recorded for each file.

This means that checking to see if any of our dependents' interfaces
have changed requires simply checking every module in the list of
dependencies (moving one "step") rather then having to potentially crawl
the entire graph.

In order to avoid accidentally perturbing the import graph in
unpredictable ways, all implicit dependences will be given a lower
priority then regular imports (lower then PRI_LOW).
Previously, the indirect dependency patcher would always add in indirect
dependencies, even if they were silenced or ignored. This commit makes
the patcher more discriminating.
This commit modifies Vars and adds an extra field to indicate if they
refer to imports or not. Normally, modules are not represented as Var
nodes, except in the case where that particular module was silenced or
suppressed in some manner.

This commit will make it easier to perform analysis on the AST during
the typechecking phase.
Previously, the RefExpr module analyzer did not correctly handle the
case where a silenced import was used inside of a class. For example,
suppose we have a file named `mod1.py`:

```python
class Foo:
    def bar(self) -> None:
        import unrelated   # this is suppressed
        self.val = unrelated.test()
```

...the analyzer would end up reporting that this file indirectly relied
on module `mod1.Foo`, which is clearly incorrect.

This commit attempts to handle this case.
The `# options: ...` directive was replaced with the `# flag: ...`
directive.
This commit adds an (ad-hoc) check to the indirect dependencies patcher
to filter out computed dependencies that are not, in fact, actually
modules. Ideally, this ad-hoc check shouldn't be necessary but there are
enough weird edge cases that it seems like a useful thing to have to
prevent unexpected behavior.
This commit adds several more tests to try and check if mypy behaves
correctly under a variety silent import related scenarios, and if it
behaves correctly starting from an initial error.
It seems like the `info` attribute inside NameExpr is not actually set
or used anywhere. This commit removes it altogether.
This commit contains mainly more fixups and refactoring in response to
the first wave[let] of feedback.
@Michael0x2a Michael0x2a force-pushed the incremental-interface-3 branch from e904af4 to fcbdd4b Compare September 2, 2016 18:50
@Michael0x2a
Copy link
Collaborator Author

@gvanrossum -- I'm ready for the re-review now!

@gvanrossum gvanrossum merged commit 5e822a8 into python:master Sep 2, 2016
@Michael0x2a Michael0x2a deleted the incremental-interface-3 branch September 9, 2016 19:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants