-
Notifications
You must be signed in to change notification settings - Fork 136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fairer queueing for parallel workers in RBI generation #889
Conversation
717347d
to
7aad73c
Compare
0c56b97
to
7084757
Compare
6f4b10e
to
ab49250
Compare
tapioca.gemspec
Outdated
@@ -24,6 +24,7 @@ Gem::Specification.new do |spec| | |||
spec.metadata["allowed_push_host"] = "https://rubygems.org" | |||
|
|||
spec.add_dependency("bundler", ">= 1.17.3") | |||
spec.add_dependency("parallel") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought I commented before requesting review but I never submitted them 🤦🏻♀️
@Morriar can you explain what the difference is between adding a dependency in the Gemfile vs in the gemspec? I assumed parallel went in the gemspec because it's now a required dependency to actually run the gem, but I'm not sure I really understand the distinction.
Also, do you think I should specify a version here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, what you did is correct. The top level idea is this:
- A gem's
Gemfile
is only for doing development on the gem source code itself. It only brings in things that will be needed as you are working on the gem. Part of the things it brings in are the items listed inside the.gemspec
file, since (as I'll mention in the next section) those are also needed for the gem to run. - The gem, when packaged and published, needs to declare what other gems it has a "runtime" dependency on, so that tools like Bundler, etc can make sure to install those as well. These are gems that, if they are not present in a user's gem folder, will prevent the main gem from working. Those are listed in the
.gemspec
file asadd_dependency
entries.
In your case, Tapioca would not work on a user's machine if Parallel gem wasn't installed, so that dependency needs to be declared in the gemspec, since that is the only thing that gets packages inside a released gem (along with the actual source code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding it that:
-
the
.gemspec
file is used to export the list of declared runtime dependencies. So for example the gems in this file usingadd_dependency
will be displayed as runtime dependencies on RubyGems: https://rubygems.org/gems/tapioca. Furthermore, when a project will installtapioca
as a gem, they will get the dependencies declared in the.gemspec
as transitive dependencies. -
the
Gemfile
is used for "internal" dependencies. The ones you do not need when executing the gem. This would be gems required to run the tests, to debug, to lint etc. All the need you would only need to contribute to the project. Those dependencies will not be installed as transitive ones when installingtapioca
in a project.
In your case, since parallel
is required to execute tapioca
it is a runtime dependency and should be declared as such in the .gemspec
as you did 👍
For the version, it seems you've been using 1.21.0 to run the tests (as it is the version we already had in the Gemfile.lock
file). So it might be safer to add a restriction to require at least this version since we're not sure it works with an older version of the gem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What complicates all of this is the notion of add_development_dependency
that goes into a gemspec
and declares optional dependencies that should be used for development. I choose to completely ignore that, since it is a hold over from before the time we had Gemfile
s and that was the only way a gem could declare things it needed for development as opposed to runtime. I just always use a Gemfile
for those now and never declare my dev dependencies in the gemspec
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a version constraint to the parallel gem!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you also need to do a bundle lock
to update the Gemfile.lock
after that, so that the lockfile also picks up that version bound.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦🏻♀️ Love when CI turns red immediately lol!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, thank you. 👏
ab49250
to
6873e71
Compare
6873e71
to
d41c995
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing, thanks! 👍
Motivation
Closes #760
Currently, the queue for RBI generation is executed in the following manner:
Executor
determines how many worker forks to use for RBI generation (call this numberN
)N
groups, and gives one group of jobs to each workerThe ideal behavior would be to give one job to each worker at a time, then give the next job to whichever worker frees up first. This would mean that the jobs are fairly split up between workers and on the whole, RBI generation should be more efficient.
Implementation
As @paracycle pointed out in #760, there is a gem that already implements this behavior: the parallel gem! I spent a long time reading the source code of the gem's
map
function, and I believe that it implements the exact algorithm we're looking for: rather than splitting up jobs and giving them to workers up front, it waits for workers to complete and then gives them the next job in the queue.Tests
I kept most of the
Executor
tests and switched one that was no longer relevant.Tophatting 🎩
I have tested my change on the Shopify codebase. Not only does it work, but anecdotally,
it looks like it actually generates all the gem RBIs a few minutes faster! My approach definitely was not scientific (deleted gem RBIs and then regenerated them a few times), but with the currently published version of tapioca, the process takes about 1500 seconds, and with my change, it takes about 1300.(Ignore this last part -- this result is not consistent.)