Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(plugin-essentials): yarn dedupe #1558

Merged
merged 27 commits into from
Aug 23, 2020
Merged

feat(plugin-essentials): yarn dedupe #1558

merged 27 commits into from
Aug 23, 2020

Conversation

paul-soporan
Copy link
Member

@paul-soporan paul-soporan commented Jul 8, 2020

What's the problem this PR addresses?

To quote from the command details:

Duplicates are defined as descriptors with overlapping ranges being resolved and locked to different locators. They are a natural consequence of Yarn's deterministic installs, but they can sometimes pile up and unnecessarily increase the size of your project.

How did you fix it?

This PR adds a new yarn deduplicate command that does what its description says.

It's inspired by (but the implementation is different):

The current algorithm currently takes 30 seconds to run on our repo (after a lot of optimizations, it was 10 minutes before). I couldn't think of a better one that's also fast and produces results while resolving descriptors (so that the output is populated progressively while the deduplication runs; it would look weird if it took 30 seconds to resolve everything and then we would show all 200 results at once).

The current algorithm prefers reusing packages with the highest usage count, rather than with the highest version, which means that packages can be both upgraded and downgraded.
After discussing with the rest of the team, we came to the conclusion that the best strategy would be to sort the locators by version, rather than by usage count. This means that packages can only be upgraded, never downgraded.

Note: I haven't added tests yet, but I'm working on it.

Checklist

  • I have set the packages that need to be released for my changes to be effective.
  • I have verified that all automated PR checks pass.

@paul-soporan paul-soporan requested a review from arcanis as a code owner July 8, 2020 12:15
@7rulnik
Copy link
Contributor

7rulnik commented Jul 8, 2020

One question: I thought that it's yarn's v1 bug and it should deduplicate deps by default. Am I missing something?

@paul-soporan
Copy link
Member Author

@7rulnik Yarn doesn't deduplicate dependencies by default, otherwise installs wouldn't be deterministic and the lockfile would be useless.
What it actually does is that it tries to not duplicate dependencies in the first place.

Example:

Lockfile:

foo@^2.3.4:
  resolution: [email protected]

Now if I run yarn add foo@*, it will reuse [email protected], even if the latest foo is actually 2.10.14, thus preventing unnecessary duplication.


The problem (not actually a problem, this is how it's intended to be) is that Yarn doesn't unlock dependencies that are already locked in the lockfile.

This means that with the following lockfile:

foo@^2.3.4:
  resolution: [email protected]

If I run yarn add [email protected], it will install [email protected] because the existing resolutions don't satisfy the range 2.10.14. This, however, leads to (sometimes) unwanted duplication, since now I've got the following lockfile:

foo@^2.3.4:
  resolution: [email protected]

[email protected]:
  resolution: [email protected]

This lockfile has 2 separate resolutions for the 2 foo descriptors, even though they have overlapping ranges and the lockfile could be simplified to the following:

"foo@^2.3.4, [email protected]":
  resolution: [email protected]

This is, essentially, what the yarn deduplicate command does.

@7rulnik
Copy link
Contributor

7rulnik commented Jul 8, 2020

Got it, thanks!

I thought that it's a bug because yarn v1 did not have a deduplicate command. I think we should add your explanation to docs.

@eps1lon
Copy link
Member

eps1lon commented Jul 9, 2020

Great to see some improving on the initial work! So the only problem with https://github.com/eps1lon/yarn-plugin-deduplicate was that it is slow? I started working on it again after the 2.1 announcement and it should be working with 2.1 now after yarnplugins/yarn-plugin-deduplicate#6 is merged.

@paul-soporan
Copy link
Member Author

@eps1lon I'm sorry for how I worded that, it's not that it was slow.

I wanted to open a PR in this repo that deduplicates some dependencies (because they started to add up in size) - #1556 - and I tried using your plugin by copying your implementation inside a new command in our repo (because we have some limitations that prevent us from installing third-party plugins in this repo), just to use it for deduplication purposes. Unfortunately, when I tested it, I first encountered a bunch of TypeError: Cannot read property 'reference' of undefined because of unsafe non-null assertions. When I fixed those, the plugin produced no results. I admit that I haven't tried debugging it further, probably also because I hadn't seen any updates to the plugin since September and because I've seen your message on Discord that I interpreted as you not wishing to continue developing this plugin - looking back, there's a big chance I might have misinterpreted that message and I'm really sorry if that's the case.
After that, I decided to write a quick (and really inefficient - that's what I was talking about when I said it was slow) implementation my own way, without using any of your code (looking at it, the locatorsByIdent part seems to be the only thing that's overlapping). After that implementation proved to be a success, I talked with the rest of the team and they agreed that this command could live in the berry repo if it's well tested. After that, I greatly improved the performance of my original implementation (I still have to tweak the current one, there are quite a few unneeded parts remaining) and opened this PR, and that's how we got here. 😅

I realize that we are now in a really unfortunate situation (that's mostly my fault), now that you've decided to revive yarn-plugin-deduplicate. I'm open to merging my work with yours if you wish, but I think that it would be best if this command lived inside the berry repo as a built-in command - seeing how many people would like to use it.

Again, I'm sorry for how this played out. I'm open to suggestions about what we should with this, as long as they benefit the community. What do you think about this? How should we proceed?

@eps1lon
Copy link
Member

eps1lon commented Jul 9, 2020

Unfortunately, when I tested it, I first encountered a bunch of TypeError: Cannot read property 'reference' of undefined because of unsafe non-null assertions.

Should be fixed now if you install from master. It wasn't about non-null assertions. They were only the symptoms of an underlying problem that was fixed with 03a68f7 (#6).

I'm open to merging my work with yours if you wish, but I think that it would be best if this command lived inside the berry repo as a built-in command - seeing how many people would like to use it.

I'd prefer that as well so that we can test integration.

Again, I'm sorry for how this played out. I'm open to suggestions about what we should with this, as long as they benefit the community. What do you think about this? How should we proceed?

All good. Having this live in an official repo was always my preference.

It being slow wouldn't have bothered me. It wasn't intended to be fast. Build something working first and then make it fast!

@Bnaya
Copy link

Bnaya commented Aug 1, 2020

Will it support the strategy flag and selective dedup that yarn-deduplicate supports?

@paul-soporan
Copy link
Member Author

@Bnaya It already supports selective dedupe (via glob pattern rest args). I was initially thinking of leaving strategies out of the first iteration, but now that there's demand I will implement it in this PR.

Note: I'll probably get back to this PR in a few days / next week.

@paul-soporan paul-soporan changed the title feat(plugin-essentials): yarn deduplicate feat(plugin-essentials): yarn dedupe Aug 17, 2020
@paul-soporan paul-soporan marked this pull request as ready for review August 19, 2020 16:58
@paul-soporan
Copy link
Member Author

So... the PR is now finished. I've decided to leave out the fewer strategy for now, so that we can merge this PR sooner (hopefully it will be a part of 2.2.0, which is just around the corner).

More details about the yarn dedupe command:

  • It's part of plugin-essentials, which means that now it's a builtin command (no plugins needed).
  • I've decided to rename it to yarn dedupe from yarn deduplicate, as it's shorter to type and it doesn't conflict with the yarn-plugin-deduplicate command name. I also think I like this name more.
  • It only supports the highest strategy for now. Quoting from the docs:

Reuses (where possible) the locators with the highest versions. This means that dependencies can only be upgraded, never downgraded. It's also guaranteed that it never takes more than a single pass to dedupe the entire dependency tree.

  • It's not tied to semver, it can dedupe any kind of packages (e.g. Python packages which use the PEP440 versioning format).
  • It has a -c,--check flag that can be used on CI (doesn't persist the changes on disk, only exits with exit code 1 when there are packages that can be deduped).
  • It supports selective dedupe via glob pattern rest args (yarn dedupe lodash 'react-*' '@babel/*')
  • It supports a --json flag that makes the output be NDJSON.
  • It comes with an in-depth explanation in the docs (https://deploy-preview-1558--yarn2.netlify.app/cli/dedupe#in-depth-explanation).
  • It's fast enough because of concurrency. This is important because we have to send requests to the registry to get the candidates for each package (each package that should be checked - the number is dependent on the selective dedupe), since the command is not tied to semver. It takes about 8 seconds to execute on our repo without selective dedupe.
  • It comes with a footer that shows how many packages can be deduped (maybe TODO in the future: show how much space can be saved?).
  • I've added in-depth tests for everything.

Now I'm going to summon @arcanis to review this PR and hopefully yarn dedupe can be shipped in 2.2.0.

@arcanis
Copy link
Member

arcanis commented Aug 19, 2020

Great work! I'll look in details tomorrow, but I have some questions based on your abstract:

It's fast enough because of concurrency. This is important because we have to send requests to the registry to get the candidates for each package (each package that should be checked - the number is dependent on the selective dedupe), since the command is not tied to semver. It takes about 8 seconds to execute on our repo without selective dedupe.

Why do we need to make network requests when doing the highest strategy (which is the only one supported atm)? Will it upgrade the packages on top of deduping them? Shouldn't it just update the resolutions to use whatever is the highest in the lockfile? Since those packages have already been resolved, no further request should be needed.

It's not tied to semver, it can dedupe any kind of packages (e.g. Python packages which use the PEP440 versioning format).

I'm not familiar with PEP440 - how does it support non-semver ranges (in particular with the highest strategy, which implies sorting the versions as much as possible).

It has a -c,--check flag that can be used on CI (doesn't persist the changes on disk, only exits with exit code 1 when there are packages that can be deduped).

I'm not necessarily against using -c,--check, but we'll have to add to the v3 that we need to normalize yarn version check, yarn dedupe -c, and yarn constraints.

@paul-soporan
Copy link
Member Author

paul-soporan commented Aug 19, 2020

Why do we need to make network requests when doing the highest strategy (which is the only one supported atm)? Will it upgrade the packages on top of deduping them? Shouldn't it just update the resolutions to use whatever is the highest in the lockfile? Since those packages have already been resolved, no further request should be needed.

Nope, it doesn't update the packages, it just uses whatever is the highest in the lockfile. The problem is that there's no clear definition of "highest". For npm semver ranges, the highest is the highest version that satisfies the range, but what is the highest for npm tags? Let's take the following example:

Lockfile:

foo@npm:next:
  resolution: foo@npm:1.0.0

foo@npm:^1.1.0:
  resolution: foo@npm:1.1.0

Let's now say that the next tag points to foo@npm:1.1.0 at the moment I want to run yarn dedupe. Now how do I know what to do with foo@npm:next? There's nothing in the lockfile telling me that foo@npm:1.1.0 is a candidate of foo@npm:next. The only way to solve this is to call resolver.getCandidates on foo@npm:next.

There's also the problem of non-semver ranges, which we wouldn't be able to deduplicate unless we use the resolver, which already has a definition of "highest" - the first candidate.

I'm not familiar with PEP440 - how does it support non-semver ranges (in particular with the highest strategy, which implies sorting the versions as much as possible).

PEP440 is just a versioning format used in the Python ecosystem. It's similar to semver, but many things don't overlap (examples of valid PEP440 versions that aren't valid semver versions: 0.9, 2012.4, 1!1.0, 1.0rc1). Hardcoding semver here would be wrong (especially because we already have the resolver system which is versioning-scheme agnostic), as Yarn isn't just a Node package manager anymore.

I'm not necessarily against using -c,--check, but we'll have to add to the v3 that we need to normalize yarn version check, yarn dedupe -c, and yarn constraints.

🤔 It makes sense for yarn version check to be a separate command from yarn version, as they have nothing in common apart from being part of the same plugin. Regarding yarn constraints, I've typed yarn constraints check more times than I'd like to admit, so... I'm not sure what's the best thing to do here. yarn constraints and yarn constraints --fix don't really do the same thing - the first one produces warnings while the second one just fixes everything silently. The way this PR is currently implemented, both yarn dedupe and yarn dedupe --check output what can be deduplicated, the difference being that the former runs an install after, while the latter exits the process with a specific code.

@paul-soporan paul-soporan marked this pull request as ready for review August 20, 2020 22:47
@paul-soporan
Copy link
Member Author

So, after more discussions, we've decided to drop both the network access requirement and the reliance on semver by refactoring the core resolvers to have a new getSatisfying method (that we might also use for running yarn add without relying on the network in the future) that statically (more or less) computes which locator references satisfy a descriptor range. This means that the NpmSemverResolver can now use the semver package internally, which reduces the execution time by about 95%.
This means that the yarn dedupe command will be fully universal and also very fast.

The remaining problems (?):

  • some resolvers (e.g. ExecResolver, PatchResolver, FileResolver) currently do their entire dynamic candidate resolving even with getSatisfying. I'm not sure how we can avoid that, as we can't bypass the FS access requirement and only do static resolving. This isn't a problem in practice, as it doesn't make any sense to dedupe a patch: or an exec: dependency, but it's still an interesting problem to solve.
  • "foo": "npm:bar@*" dependencies (aka remapped npm dependencies) can't be currently deduplicated, as the NpmRemapResolver currently takes ownership of the final resolution, rather than proxying it through a resolutionDependency. This problem will be solved in a future PR and we will also add a test to make sure that deduping works in that case.

Unless there are any design problems remaining, I think this PR can be reviewed and I will do my best to address any remaining small problems so that we can hopefully ship this in 2.2.0.

packages/plugin-essentials/sources/commands/dedupe.ts Outdated Show resolved Hide resolved
packages/plugin-essentials/sources/commands/dedupe.ts Outdated Show resolved Hide resolved
packages/plugin-essentials/sources/commands/dedupe.ts Outdated Show resolved Hide resolved
packages/plugin-essentials/sources/commands/dedupe.ts Outdated Show resolved Hide resolved
packages/plugin-exec/sources/ExecResolver.ts Outdated Show resolved Hide resolved
packages/plugin-essentials/sources/commands/dedupe.ts Outdated Show resolved Hide resolved
packages/plugin-essentials/sources/commands/dedupe.ts Outdated Show resolved Hide resolved
packages/plugin-essentials/sources/commands/dedupe.ts Outdated Show resolved Hide resolved
@paul-soporan paul-soporan requested a review from arcanis August 23, 2020 10:59
@arcanis arcanis merged commit d4ec648 into master Aug 23, 2020
@arcanis arcanis deleted the paul/feat/deduplicate branch August 23, 2020 16:10
@arcanis
Copy link
Member

arcanis commented Aug 23, 2020

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants