Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process dependent files #6256

Closed

Conversation

carlos-granados
Copy link
Contributor

When using the cache, process dependent files of modified files to make sure we have covered all possible needed changes. See rectorphp/rector#8801

This is implemented by using PHPStan's DependencyResolver. When we analyse a file, we save all the dependencies of this file in the cache. When we do a later analysis using the cache, we use these lists of dependencies to find the dependent files for any analysed file.

The implementation is quite simple and quite performant, I have not been able to measure a noticeable degradation in performance. Seems to cover all the cases that I could think of, probably needs more eyes on it and more testing to make sure everything works as supposed.

Includes a new end 2 end test for this functionality

See the PR comments for more information

@@ -24,6 +24,9 @@
$output = str_replace(__DIR__, '.', $output);

$expectedDiff = 'expected-output.diff';
if (isset($argv[1]) && $argv[1] === '--diff') {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We allow setting which output/diff file should be used so that we can test different outputs in our new end 2 end test

@@ -114,6 +114,7 @@ parameters:
- rules/DeadCode/NodeManipulator/LivingCodeManipulator.php
- rules/EarlyReturn/Rector/If_/ChangeAndIfToEarlyReturnRector.php
- src/PhpParser/NodeTransformer.php
- src/FileSystem/FilesFinder.php
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to add this exception here because now the findInDirectoriesAndFiles returns an array by unpacking two arrays and this is incorrectly seen as wrong

) {
}

public function run(Configuration $configuration, InputInterface $input): ProcessResult
{
$filePaths = $this->filesFinder->findFilesInPaths($configuration->getPaths(), $configuration);
$allFiles = $this->filesFinder->findFilesInPaths($configuration->getPaths(), $configuration);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We split this function in two, first we find all possible files to process and then we filter them using the cache

$allFiles = $this->filesFinder->findFilesInPaths($configuration->getPaths(), $configuration);
$filePaths = $this->filesFinder->filterUnchangedFiles($allFiles);

$filePaths = [...$filePaths, ...$this->findDependentFiles($allFiles, $filePaths)];
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We add any dependent files to the list of files that we want to analyse

@@ -94,9 +100,16 @@ public function run(Configuration $configuration, InputInterface $input): Proces
}

if ($configuration->isParallel()) {
$this->fileDependenciesCache->cacheAllFiles($allFiles);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are running in parallel, we save the list of all possible files in the cache. We do this because calculating this list is an expensive operation and calculating it for each worker would slow the process. Not sure this is the best option to pass this array to the workers, if you can think of a better option, please let me know

@@ -139,6 +139,7 @@ private function runWorker(
$processResult = $this->applicationFileProcessor->processFiles(
$filePaths,
$configuration,
null,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When processing files from the worker, we set the allFiles parameter to null so that this list of files is loaded from the cache

@@ -326,6 +327,7 @@ final class LazyContainerFactory
NodeScopeResolver::class,
ReflectionProvider::class,
CachingVisitor::class,
DependencyResolver::class,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We make sure we can get this service from PHPStan

@@ -70,9 +70,16 @@ function (string $file): bool {
// filtering files in directories collection
$directories = $this->fileAndDirectoryFilter->filterDirectories($filesAndDirectories);
$filteredFilePathsInDirectories = $this->findInDirectories($directories, $suffixes, $sortByName);
return [...$filteredFilePaths, ...$filteredFilePathsInDirectories];
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function now returns a list of all possible files to analyse, without filtering out unchanged files

* @param string[] $allFiles
* @return string[]
*/
public function filterUnchangedFiles(array $allFiles): array
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we add a new functions that filters out the unchanged files

@@ -40,6 +40,18 @@ public function hashFiles(array $files): string
return $configHash;
}

public function resolvePath(string $filePath): string
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved here so that it could be used both by the file changed detector and the dependency cache

@TomasVotruba
Copy link
Member

Thanks for the PR. This is rather complex to review. Is there a way to use more from PHPStan to delegate the low-lever work?

Could you share few performance comparison before/after?

@staabm This is rather outside my area of focus, so I'd like to ask you for advisory opinion on this one 🙏

@staabm
Copy link
Contributor

staabm commented Aug 27, 2024

the caching on this level is a super-complex topic, therefore I would suggest to make it a opt-in feature before it stabilized. in addition its super frustrating when the cache doesn't work properly.

Could you share few performance comparison before/after?

I can only second this. we need to be aware how much benefit we get for the added complexity.


AFAIR rector already has some kind of cache which we need to make sure are somehow synchronized (so we should make sure that one cache does not depend on already and maybe expired other cached stuff)

we should also make sure we have a common understand what rector already caches and what this PR will add on top.
(maybe already existing caching gets obsolete)


let me write down a few corner stones from the top of my head, which are fundamental for the PHPStan caching.

  • caching is disabled when --debug is used on the cli
  • caching is disabled when passing the files to process via cli args, instead of using the actual config analyze-path
  • phpstan outputs info about whether cache is used or not when -vv is used
  • validates whether the cache is corrupt
  • cache is only persisted in certain circumstances
  • cache not used when older then x days
  • cache needs to be invalidated on config changes (maybe with a few exceptions)
  • AFAIR cache is only written from the master/main process. workers only read it.
  • the FileDependenciesCache likely needs a cache-key which can be used to invalidate caches on fundamental rector-internal changes which turn a future rector release incompatible with caches already existant from previous rector versions

very likely this list is not complete.

Implementation details can be found in PHPStan ResultCacheManager

@carlos-granados
Copy link
Contributor Author

Regarding performance, this PR does not aim to increase performance, in fact it will decrease it a little since it needs to calculate the dependencies for each file that it processes and then we need to use this data to calculate the dependent files for the files that need to be processed. I tried to measure the loss in performance when doing a full analysis with an empty cache and was not able to measure any real difference when using parallel processing, the uncertainty in the parallel processing seems to be bigger than the loss of performance that we are trying to measure. Not using parallel I got these results:
Without this code:
bin/rector --clear-cache 96.95s user 3.27s system 97% cpu 1:42.73 total
With this code:
bin/rector --clear-cache 102.15s user 4.39s system 97% cpu 1:49.03 total
So, there is a performance loss of around 5%
Running rector with a full cache and no files being processed shows these results:
Without this code:
bin/rector 0.52s user 0.38s system 60% cpu 1.472 total
With this code:
bin/rector 0.54s user 0.45s system 54% cpu 1.808 total
So, again there is a small performance loss but it will not be really noticeable
All these tests were done running Rector against the rector-src code base

@carlos-granados
Copy link
Contributor Author

carlos-granados commented Aug 27, 2024

Regarding the Rector cache, right now it is a very simple cache, much simpler than the one used by PHPStan, it just saves the hash of any file processed so that this can be used later to find out if a file has changed. PHPStan saves a lot more data, including the dependencies for each file.
I could have modified the existing Rector cache so that for each file we saved both the hash and the dependencies but the end code would have been more complex than what I did, where the dependencies are cached separately. By caching them separately we don't have to touch the existing cache which will continue working in the same way.
I did not touch the code that clears the cache so it will continue to be cleared in the same cases that it was cleared before.
The FileDependenciesCache already has something that could be considered a cache-key as it uses the string 'file_dependencies' as a suffix for the hash used to cache the data, but I guess I could change it to 'file_dependencies_v1' to make it more explicit that this can be used to invalidate caches in future versions
Regarding making this an opt-in feature, I agree this could be good until this has stabilised. @TomasVotruba what about adding something like ->withProcessDependentFiles() to enable this feature?

@carlos-granados
Copy link
Contributor Author

Regarding using more of PHPStan, we cannot use its full caching capabilities because their cache is much more complex than the one used by Rector. The only possible piece of code that we could have maybe reused is the one that finds the dependent files from the saved dependencies but PHPStan has not extracted this code in a class or function that can be reused, it is embedded in the result cache manager code

@TomasVotruba
Copy link
Member

TomasVotruba commented Aug 27, 2024

I'll be prompt to avoid investing more time on your side. I share similar view as @staabm .
At the moment we have pretty good caching and it's good balance between complexity and speed. While this feature might catch 1-2 more files, it would make the future maintenance very costly and slow.

Also, depedent files do not solve the need for re-run Rector in case of adding types. E.g. when new string type is added in single method the other dependent file still see it as file without a type as PHPStan often hold old types it parsed initially. Best practise is to re-run Rector in such cases, to catch them all in one PR.

I do re-runs for PHPStan and coding standard tools as well as they have the very same design 👍

Saying that, I want to keep current state of balanced equilibrium.

Thank you @carlos-granados 🙏

@carlos-granados
Copy link
Contributor Author

@TomasVotruba no problem, just wanted to point out something. You said Best practise is to re-run Rector in such cases, to catch them all in one PR. The problem is that with the current functionality re-running Rector does not solve the problem unless you don't use the cache. So it happens that the results provided by Rector when you use the cache and when you don't use it can be different and I think that is a bad thing and what this PR aimed to solve

@carlos-granados
Copy link
Contributor Author

If anyone else is interested in this option, it is available in this fork: https://github.com/carlos-granados/rector

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants