Fixes for various `--exclude` usability issues #415

jherland · 2024-01-18T08:31:04Z

Thanks to Maria for actually testing the new --exclude option from a user's POV, and notifying me of various griveous issues found. Here we try to tackle the ones that were found and raise in the later stages of #388:

Running fawltydeps with --exclude and passing relative paths instead of absolute (our test suite relies too heavily on absolute paths)
Passing "anchored" exclude patterns to --exclude was initially unsupported. Fix that.
Give the user a warning when the --exclude option directly overlaps with a given path (in which case the given path will always win).
Fix issues where the precedence between --exclude patterns and given paths were not followed in practice.

mknorps

This is great! Thank you @jherland !

The changes you did to fix listed issues are very clever :) I like how you expanded the test suite to account for both - relative and absolute paths and how you added the overlaping check. I also tested the changes manually and all works as expected.
This is 🚀

tests/test_real_projects.py

fawltydeps/traverse_project.py

This helps multi-line descriptions render more nicely in the failed test output.

@mknorps

Add an earlier version of FawltyDeps to real_projects and use it to verify new --exclude option, specifically use it to replicate the commands that @mknorps used to find issues while reviewing #388. For now we test these scenarios: 1. Running `fawltydeps` with no options. This should pick up our configuration in pyproject.toml and produce zero undeclared/unused dependencies. 2. Running `fawltydeps --exclude fawltydeps`. At first glance this should produce a lot of _unused_ dependencies, however, there is a `--code=fawltydeps` option hidden in our pyproject.toml which (correctly) overrides the `--exclude fawltydeps` option. Thus the expected output here is still zero undeclared/unused deps. (Still, the conflicting/overlapping options should produce a warning to inform the user of this overlap. This will be tested elsewhere.) 3. Running `fawltydeps --exclude fawltydeps/`. This is very similar to the above, except that we add a trailing `/` to signal a directory- only match. This is _not_ supposed to change the logic explained in scenario 2 above, however, it currently triggers a bug to cause the `--exclude` option to seemingly override the `--code` option in practice. This will be fixed in a later commit. 4. Running `fawltydeps --exclude fawltydeps/*`. This does in fact _not_ conflict with `--code=fawltydep`, as the pattern applied to everything inside the directory, but not the directory itself. This should therefore succeed in excluding those files from analysis, and the end result should be that FD reports a bunch of unused dependencies. Currently this fails because the pattern is "anchored" and CLI-supplied patterns have no associated base_dir. This will be fixed in a later commit. 5. Running `fawltydeps --exclude fawltydeps/extract_*`. This should exclude a smaller set of files from analysis, and result is fewer unused deps than scenario 4. Currently it fails for the same reason as scenario 4. Will be fixed in a later commit. Several of these tests currently fail. Consider them a collection of acceptance tests for issues that will be fixed in the following commits. In addition to the test scenarios themselves, we need to make a few small changes in order for test_real_projects to successfully run on the FawltyDeps project itself: - Removing the `--config-file=/dev/null` option: This was added way back in 9f817b6 in order to make sure that FD options from the pyproject.toml in _this_ repo did not get applied when running FD on a real_project. However, AFAICS the real_project tests have always run FD in a subprocess with CWD set to the project directory, meaning that this should never be an issue for real_project tests. Instead, we want the usual rules to apply, where FD does pick up configuration from the pyproject.toml in the current directory (the real_project's project dir). - Since bd9d9d1 we run FD in a subprocess using this command line: `$current_python -m fawltydeps ...`, which is AFAICS the most correct way to invoke it. However, it is susceptible to one annoying issue: When running this command, if the current directory happens to contain a `fawltydeps` subdir that contains a valid Python package, the default `sys.path` in Python will cause the `-m fawltydeps` option to use _that_ package, instead of the intended one from the current Python environment! The simplest way to avoid this is to run Python in "isolated" mode with the `-I` option. Without this fix, trying to test an older FD version with real_projects would cause the subprocess to not run _this_ FD version on that older FD checkout, but rather it would run that older FD version on itself (failing immediately due to the `--exclude` not existing in that older version). - Finally, in addition to the virtualenv that we have prepared for a real_project (via 'requirements' in the tests/real_projects/*.toml), we also want to look for any Python environments that would happen to be embedded in the project directory itself. Thus supply an additional `--pyenv=.` option in order to preserve the default pyenv search path in addition to our extra venv.

Currently all these tests pass /some/dir/... absolute paths to the gitignore_parser module. We want to verify that the gitignore_parser works just as well with relative paths. To prepare for this, we remove the /some/dir/ prefix from all test vectors, and instead prepend the /some/dir/ prefix inside the test case code. This does not change the existing test, but allows adding a new test case later that will work with relative paths.

This copies the existing test_gitignore_parser_w_abs_paths() into a corresponding test_gitignore_parser_w_rel_paths() which does _not_ prepend the /some/dir/ absolute path to everything but instead uses relative paths only. This currently fails due to gitignore_parser's over-reliance on absolute paths, but this will be fixed in the next commits.

This removes an unnecessary reliance on absolute paths in gitignore_parser: - Allow the base_dir that optionally accompanies a relative path to be a relative path. - Rename the Path variable being matched against a pattern to signal that it no longer need to be an absolute path. Also improve the docstring for Rule.from_pattern() to bettern explain the meaning/function of the base_dir argument. This commit fixes the gitignore_parser test failures introduced in the previous commit.

Consider a dir-only patttern like "foo/". This should obviously match "foo" directories like ./foo/ and ./some/other/foo/. But should it match paths inside those directories, like ./foo/bar as well? Until now, gitignore_parser has answered this questions with: "Yes, the ./foo/bar path is inside a directory ./foo/ that should be ignored, and therefore everything inside that directory should also be ignored." This makes sense when you consider gitignore_parser in isolation: It does not know or require any kind of context around what kind of paths are passed, or in which order. However, consider the following command line: fawltydeps --code=foo/ --exclude=foo/ On the surface, this instructs FD to look for code under foo/, but also to exclude any foo/ directory from traversal. We have already resolved this by declaring that --exclude shall lose a direct conflict like this, and this is followed by our traversal code, we _do_ in fact traverse into the foo/ directory. However, since foo/ is a dir-only pattern, gitignore_parser will now proceed to succesfully match any path that we traverse inside foo/ against this pattern, and the end result is the same as if we never traversed into foo/: Nothing under foo/ is found/looked at by FD. To solve this, we need to take some more context into account: The directory traversal code always checks parent dirs before descending into them, and when it has decided to traverse into a directory, we must assume that any match against that directory has already been considered. We cannot have an exclude pattern that matches this directory only _also_ match paths inside that directory (and thus in effect override the decision of traverse into it). With this commit we will know only apply a dir-only pattern to the _final_ path component, meaning that children of the matched directory will no longer match the same pattern. In effect we are _coupling_ gitignore_parser to the dir_traversal code, here, and changing its answer to the above question as follows: "No, even though I think the ./foo directory should be ignored, if you ask me about the ./foo/bar path, I will trust that you have already considered my answer to ./foo before deciding to traverse into it. My answer will therefore reflect the bar path under foo/ specifically, and this bar path does not match the foo/ pattern." This tighter coupling ends up actually simplifying the gitignore_parser code, and changing the expectations of a gitignore_parser test. For good measure, we add another gitignore_parser test and a few dir_traversal tests to verify this new behavior. This also fixes the fawltydeps:exclude_dir_overlap real_projects test, which has been failing until now.

No functional changes, simply move code from test_DirectoryTraversal() into two new methods of DirectoryTraversalVector: .setup() and .verify_traversal(). These take a setup_dir argument, the path to the experiment/project, that can be _either_ absolute or relative. For now, we rename test_DirectoryTraversal() to test_DirectoryTraversal_w_abs_paths(), and pass the absolute tmp_path to these two methods. The actual vectors and resulting paths that end up being tested are unchanged.

Add a test_DirectoryTraversal_w_rel_paths() test case that tests the same DirectoryTraversalVectors as are already tested by test_DirectoryTraversal_w_abs_paths(), but instead of passing an absolute tmp_path into the traversal code, we instead run the test from _inside_ tmp_path, and pass a simple "." relative path to the traversal code. This ends up testing tranversal scenarios that are closer to what users typically run in the Real World: You run FawltyDpes on a project in (or relative to) the current directory, you don't pass absolute paths to FawltyDeps. A number of the DirectoryTraversalVectors currently fail when run in this new scenario. These failures will be fixed in the next commits.

We used @lru_cache to cache the results of DirId.from_path() (so as to avoid unnecesary path.stat() calls). However, when passing a relative path to DirId.from_path() this cache breaks down as soon as the current working directory is changed: E.g. when calling DirId.from_path(".") we would keep returning the same DirId without realizing that "." now points to a _different_ directory. Fix this by limiting the caching to absolute paths only, and then replacing DirId.from_path() with a wrapper that makes all relative paths absolute before calling the caching version. This fixes a few of the currently-failing dir_traversal tests.

Finally a simple fix: dir_traversal was turning all relative paths into absolute paths before passing them on to gitignore_parser. This is no longer necessary, as gitignore_parser now handles relative paths correctly. This fixed the remaining dir_traversal test failures.

Same exercise as we've done for gitignore_parser and dir_traversal in previuos commits: Copy the test case with absolute paths, and run the same vectors through a scenario where the code is exercised with relative paths instead. No new test failures!

Paths to be traversed (basepath/code/deps/pyenv) and exclude patterns can be specified in various different places (config file, environment, command line), and it is not always obvious to the used if/when they are in direct conflict with each other. Consider for example: [tool.fawltydeps] code=["foo"] in pyproject.toml, and --exclude=foo or vice versa. The exclude pattern and the code path are in direct conflict, and the code path will win (i.e. the exclude will be disregarded). When this happens with paths and patterns that are given by the user, we should produce a warning to help reduce user confusion. Fix this by comparing given paths to exclude patterns in traverse_project.find_sources(), and logging a warning message when a conflict is detected (once per unique given path). Add tests to verify the generation of appropriate warning messages.

Until now, we've refused "anchored" exclude patterns on the command-line as we could not determine the appropriate base_dir for these patterns. This turns out not to be very usable in practice: Most users are not (and should not be) aware of why passing "foo/" on the command-line is fine, but passing "foo/*" is not possible (and can only happen from a .gitignore file after the future --exclude-from feature is merged). We _should_ allow anchored patterns on the command line, and somehow work around the fact that with --code/--deps/--pyenv we can end up with very many paths indeed. The implementation we settle on here and now is this: When an anchored patterns is supplied, we will pass it to DirectoryTraversal.exclude() _multiple_ times, once for each unique directory where we are searching for code, deps, or pyenvs. This should hopefully cause the least amount of surprise for our users: Exclude patterns are applied and obeyed insofar as they do not _directly_ conflict with a given path. With this change, a couple of tests are removed, and their corresponding behavior (aborting with "Anchored pattern without base_dir") is no longer reachable from the traverse_project level or above. A couple of other tests are adjusted to retain a more plausible scenario (eliminating all code/deps/pyenvs from the given settings don't happen in practice). With this, we also fix the two remaining real_projects test cases: fawltydeps:exclude_no_overlap and fawltydeps:exclude_anchored, and we're finally back to an all-green test suite!

jherland mentioned this pull request Jan 18, 2024

Support exclude patterns #388

Merged

jherland force-pushed the jherland/exclude-fixes branch 2 times, most recently from 81fd1cd to 00aad46 Compare January 18, 2024 19:36

jherland marked this pull request as ready for review January 18, 2024 19:44

jherland requested review from mknorps, Nour-Mws and zz1874 January 18, 2024 19:44

mknorps approved these changes Jan 19, 2024

View reviewed changes

tests/test_real_projects.py Show resolved Hide resolved

fawltydeps/traverse_project.py Show resolved Hide resolved

jherland force-pushed the jherland/integrate_ignore_parser branch from ee2f663 to 1049383 Compare January 19, 2024 13:08

jherland force-pushed the jherland/exclude-fixes branch from 00aad46 to ecf86c3 Compare January 19, 2024 13:09

Base automatically changed from jherland/integrate_ignore_parser to main January 19, 2024 13:19

jherland added 13 commits January 19, 2024 14:20

project_helpers: Clean up whitespace in project/experiment descriptions

1ad48ae

This helps multi-line descriptions render more nicely in the failed test output.

jherland force-pushed the jherland/exclude-fixes branch from ecf86c3 to 18a746e Compare January 19, 2024 13:20

jherland merged commit a4c9327 into main Jan 19, 2024
65 checks passed

jherland deleted the jherland/exclude-fixes branch January 19, 2024 13:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes for various `--exclude` usability issues #415

Fixes for various `--exclude` usability issues #415

jherland commented Jan 18, 2024

mknorps left a comment

Fixes for various --exclude usability issues #415

Fixes for various --exclude usability issues #415

Conversation

jherland commented Jan 18, 2024

mknorps left a comment

Choose a reason for hiding this comment

Fixes for various `--exclude` usability issues #415

Fixes for various `--exclude` usability issues #415