Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes for various --exclude usability issues #415

Merged
merged 13 commits into from
Jan 19, 2024
Merged

Conversation

jherland
Copy link
Member

Thanks to Maria for actually testing the new --exclude option from a user's POV, and notifying me of various griveous issues found. Here we try to tackle the ones that were found and raise in the later stages of #388:

  • Running fawltydeps with --exclude and passing relative paths instead of absolute (our test suite relies too heavily on absolute paths)
  • Passing "anchored" exclude patterns to --exclude was initially unsupported. Fix that.
  • Give the user a warning when the --exclude option directly overlaps with a given path (in which case the given path will always win).
  • Fix issues where the precedence between --exclude patterns and given paths were not followed in practice.

@jherland jherland force-pushed the jherland/exclude-fixes branch 2 times, most recently from 81fd1cd to 00aad46 Compare January 18, 2024 19:36
@jherland jherland marked this pull request as ready for review January 18, 2024 19:44
Copy link
Collaborator

@mknorps mknorps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Thank you @jherland !

The changes you did to fix listed issues are very clever :) I like how you expanded the test suite to account for both - relative and absolute paths and how you added the overlaping check. I also tested the changes manually and all works as expected.
This is 🚀

tests/test_real_projects.py Show resolved Hide resolved
fawltydeps/traverse_project.py Show resolved Hide resolved
@jherland jherland force-pushed the jherland/integrate_ignore_parser branch from ee2f663 to 1049383 Compare January 19, 2024 13:08
@jherland jherland force-pushed the jherland/exclude-fixes branch from 00aad46 to ecf86c3 Compare January 19, 2024 13:09
Base automatically changed from jherland/integrate_ignore_parser to main January 19, 2024 13:19
This helps multi-line descriptions render more nicely in the failed test
output.
Add an earlier version of FawltyDeps to real_projects and use it to
verify new --exclude option, specifically use it to replicate the
commands that @mknorps used to find issues while reviewing #388.

For now we test these scenarios:

 1. Running `fawltydeps` with no options. This should pick up our
    configuration in pyproject.toml and produce zero undeclared/unused
    dependencies.

 2. Running `fawltydeps --exclude fawltydeps`. At first glance this
    should produce a lot of _unused_ dependencies, however, there is a
    `--code=fawltydeps` option hidden in our pyproject.toml which
    (correctly) overrides the `--exclude fawltydeps` option. Thus the
    expected output here is still zero undeclared/unused deps. (Still,
    the conflicting/overlapping options should produce a warning to
    inform the user of this overlap. This will be tested elsewhere.)

 3. Running `fawltydeps --exclude fawltydeps/`. This is very similar to
    the above, except that we add a trailing `/` to signal a directory-
    only match. This is _not_ supposed to change the logic explained in
    scenario 2 above, however, it currently triggers a bug to cause the
    `--exclude` option to seemingly override the `--code` option in
    practice. This will be fixed in a later commit.

  4. Running `fawltydeps --exclude fawltydeps/*`. This does in fact
     _not_ conflict with `--code=fawltydep`, as the pattern applied to
     everything inside the directory, but not the directory itself. This
     should therefore succeed in excluding those files from analysis,
     and the end result should be that FD reports a bunch of unused
     dependencies. Currently this fails because the pattern is
     "anchored" and CLI-supplied patterns have no associated base_dir.
     This will be fixed in a later commit.

  5. Running `fawltydeps --exclude fawltydeps/extract_*`. This should
     exclude a smaller set of files from analysis, and result is fewer
     unused deps than scenario 4. Currently it fails for the same reason
     as scenario 4. Will be fixed in a later commit.

Several of these tests currently fail. Consider them a collection of
acceptance tests for issues that will be fixed in the following commits.

In addition to the test scenarios themselves, we need to make a few
small changes in order for test_real_projects to successfully run on
the FawltyDeps project itself:

- Removing the `--config-file=/dev/null` option: This was added way back
  in 9f817b6 in order to make sure that FD options from the
  pyproject.toml in _this_ repo did not get applied when running FD on a
  real_project. However, AFAICS the real_project tests have always run
  FD in a subprocess with CWD set to the project directory, meaning that
  this should never be an issue for real_project tests. Instead, we want
  the usual rules to apply, where FD does pick up configuration from the
  pyproject.toml in the current directory (the real_project's project
  dir).

- Since bd9d9d1 we run FD in a subprocess using this command line:
  `$current_python -m fawltydeps ...`, which is AFAICS the most correct
  way to invoke it. However, it is susceptible to one annoying issue:
  When running this command, if the current directory happens to contain
  a `fawltydeps` subdir that contains a valid Python package, the
  default `sys.path` in Python will cause the `-m fawltydeps` option to
  use _that_ package, instead of the intended one from the current
  Python environment! The simplest way to avoid this is to run Python in
  "isolated" mode with the `-I` option. Without this fix, trying to test
  an older FD version with real_projects would cause the subprocess to
  not run _this_ FD version on that older FD checkout, but rather it
  would run that older FD version on itself (failing immediately due to
  the `--exclude` not existing in that older version).

- Finally, in addition to the virtualenv that we have prepared for a
  real_project (via 'requirements' in the tests/real_projects/*.toml),
  we also want to look for any Python environments that would happen to
  be embedded in the project directory itself. Thus supply an additional
  `--pyenv=.` option in order to preserve the default pyenv search path
  in addition to our extra venv.
Currently all these tests pass /some/dir/... absolute paths to the
gitignore_parser module. We want to verify that the gitignore_parser
works just as well with relative paths.

To prepare for this, we remove the /some/dir/ prefix from all test
vectors, and instead prepend the /some/dir/ prefix inside the test case
code. This does not change the existing test, but allows adding a new
test case later that will work with relative paths.
This copies the existing test_gitignore_parser_w_abs_paths() into a
corresponding test_gitignore_parser_w_rel_paths() which does _not_
prepend the /some/dir/ absolute path to everything but instead uses
relative paths only.

This currently fails due to gitignore_parser's over-reliance on absolute
paths, but this will be fixed in the next commits.
This removes an unnecessary reliance on absolute paths in
gitignore_parser:

 - Allow the base_dir that optionally accompanies a relative path to be
   a relative path.

 - Rename the Path variable being matched against a pattern to signal
   that it no longer need to be an absolute path.

Also improve the docstring for Rule.from_pattern() to bettern explain
the meaning/function of the base_dir argument.

This commit fixes the gitignore_parser test failures introduced in the
previous commit.
Consider a dir-only patttern like "foo/". This should obviously match
"foo" directories like ./foo/ and ./some/other/foo/. But should it
match paths inside those directories, like ./foo/bar as well?

Until now, gitignore_parser has answered this questions with:

  "Yes, the ./foo/bar path is inside a directory ./foo/ that should be
  ignored, and therefore everything inside that directory should also be
  ignored."

This makes sense when you consider gitignore_parser in isolation: It
does not know or require any kind of context around what kind of paths
are passed, or in which order.

However, consider the following command line:

  fawltydeps --code=foo/ --exclude=foo/

On the surface, this instructs FD to look for code under foo/, but
also to exclude any foo/ directory from traversal. We have already
resolved this by declaring that --exclude shall lose a direct conflict
like this, and this is followed by our traversal code, we _do_ in fact
traverse into the foo/ directory.

However, since foo/ is a dir-only pattern, gitignore_parser will now
proceed to succesfully match any path that we traverse inside foo/
against this pattern, and the end result is the same as if we never
traversed into foo/: Nothing under foo/ is found/looked at by FD.

To solve this, we need to take some more context into account: The
directory traversal code always checks parent dirs before descending
into them, and when it has decided to traverse into a directory, we
must assume that any match against that directory has already been
considered. We cannot have an exclude pattern that matches this
directory only _also_ match paths inside that directory (and thus
in effect override the decision of traverse into it).

With this commit we will know only apply a dir-only pattern to the
_final_ path component, meaning that children of the matched directory
will no longer match the same pattern.

In effect we are _coupling_ gitignore_parser to the dir_traversal code,
here, and changing its answer to the above question as follows:

  "No, even though I think the ./foo directory should be ignored, if you
  ask me about the ./foo/bar path, I will trust that you have already
  considered my answer to ./foo before deciding to traverse into it. My
  answer will therefore reflect the bar path under foo/ specifically,
  and this bar path does not match the foo/ pattern."

This tighter coupling ends up actually simplifying the gitignore_parser
code, and changing the expectations of a gitignore_parser test.

For good measure, we add another gitignore_parser test and a few
dir_traversal tests to verify this new behavior.

This also fixes the fawltydeps:exclude_dir_overlap real_projects test,
which has been failing until now.
No functional changes, simply move code from test_DirectoryTraversal()
into two new methods of DirectoryTraversalVector: .setup() and
.verify_traversal(). These take a setup_dir argument, the path to the
experiment/project, that can be _either_ absolute or relative.

For now, we rename test_DirectoryTraversal() to
test_DirectoryTraversal_w_abs_paths(), and pass the absolute tmp_path to
these two methods. The actual vectors and resulting paths that end up
being tested are unchanged.
Add a test_DirectoryTraversal_w_rel_paths() test case that tests the
same DirectoryTraversalVectors as are already tested by
test_DirectoryTraversal_w_abs_paths(), but instead of passing an
absolute tmp_path into the traversal code, we instead run the test from
_inside_ tmp_path, and pass a simple "." relative path to the traversal
code.

This ends up testing tranversal scenarios that are closer to what users
typically run in the Real World: You run FawltyDpes on a project in (or
relative to) the current directory, you don't pass absolute paths to
FawltyDeps.

A number of the DirectoryTraversalVectors currently fail when run in
this new scenario. These failures will be fixed in the next commits.
We used @lru_cache to cache the results of DirId.from_path() (so as to
avoid unnecesary path.stat() calls). However, when passing a relative
path to DirId.from_path() this cache breaks down as soon as the current
working directory is changed: E.g. when calling DirId.from_path(".") we
would keep returning the same DirId without realizing that "." now
points to a _different_ directory.

Fix this by limiting the caching to absolute paths only, and then
replacing DirId.from_path() with a wrapper that makes all relative paths
absolute before calling the caching version.

This fixes a few of the currently-failing dir_traversal tests.
Finally a simple fix: dir_traversal was turning all relative paths into
absolute paths before passing them on to gitignore_parser. This is no
longer necessary, as gitignore_parser now handles relative paths
correctly.

This fixed the remaining dir_traversal test failures.
Same exercise as we've done for gitignore_parser and dir_traversal in
previuos commits: Copy the test case with absolute paths, and run the
same vectors through a scenario where the code is exercised with
relative paths instead.

No new test failures!
Paths to be traversed (basepath/code/deps/pyenv) and exclude patterns
can be specified in various different places (config file, environment,
command line), and it is not always obvious to the used if/when they are
in direct conflict with each other. Consider for example:

  [tool.fawltydeps]
  code=["foo"]

in pyproject.toml, and

  --exclude=foo

or vice versa. The exclude pattern and the code path are in direct
conflict, and the code path will win (i.e. the exclude will be
disregarded). When this happens with paths and patterns that are given
by the user, we should produce a warning to help reduce user confusion.

Fix this by comparing given paths to exclude patterns in
traverse_project.find_sources(), and logging a warning message when a
conflict is detected (once per unique given path).

Add tests to verify the generation of appropriate warning messages.
Until now, we've refused "anchored" exclude patterns on the command-line
as we could not determine the appropriate base_dir for these patterns.

This turns out not to be very usable in practice: Most users are not
(and should not be) aware of why passing "foo/" on the command-line is
fine, but passing "foo/*" is not possible (and can only happen from a
.gitignore file after the future --exclude-from feature is merged).

We _should_ allow anchored patterns on the command line, and somehow
work around the fact that with --code/--deps/--pyenv we can end up with
very many paths indeed.

The implementation we settle on here and now is this: When an anchored
patterns is supplied, we will pass it to DirectoryTraversal.exclude()
_multiple_ times, once for each unique directory where we are searching
for code, deps, or pyenvs.

This should hopefully cause the least amount of surprise for our users:
Exclude patterns are applied and obeyed insofar as they do not
_directly_ conflict with a given path.

With this change, a couple of tests are removed, and their corresponding
behavior (aborting with "Anchored pattern without base_dir") is no
longer reachable from the traverse_project level or above.

A couple of other tests are adjusted to retain a more plausible scenario
(eliminating all code/deps/pyenvs from the given settings don't happen
in practice).

With this, we also fix the two remaining real_projects test cases:
fawltydeps:exclude_no_overlap and fawltydeps:exclude_anchored, and we're
finally back to an all-green test suite!
@jherland jherland force-pushed the jherland/exclude-fixes branch from ecf86c3 to 18a746e Compare January 19, 2024 13:20
@jherland jherland merged commit a4c9327 into main Jan 19, 2024
65 checks passed
@jherland jherland deleted the jherland/exclude-fixes branch January 19, 2024 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants