-
-
Notifications
You must be signed in to change notification settings - Fork 322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Remote caching support #3971
base: master
Are you sure you want to change the base?
Conversation
This change introduces remote caching support to SCons. Currently SCons has support for caching built objects in directories, but remote caching adds support for sharing built objects across multiple machines. For example at VMware, we have our official builds push to cache and all developers will fetch from cache. The end result of this is that developers are able to only build what they have changed locally. All other cacheable build actions will be fetched from the remote cache server, dramatically speeding up builds. One piece that I introduced as part of this change is a new scheduler as a new class ParallelV2 that is used if remote caching is enabled or if a developer passes --use-scheduler-v2. Whereas the existing parallel scheduler waits on jobs if the job queue is full, this new scheduler will attempt to continue scanning for new jobs whenever possible. In addition, it supports remote caching by draining the "pending remote cache fetches" queue just like it drains the "pending jobs" queue. I believe that this scheduler is an alternative to SCons#3386 that should have the same effect in keeping jobs as high as possible. I introduced it as a separate class for now, but it could replace the Parallel class if we wanted. The new RemoteCache class owns the job of fetch from and pushing to a Bazel remote cache server or any other similar server that supports /ac/ and /cas/ GET and PUT requests using SHA-256 file names. See https://github.com/buchgr/bazel-remote for more details on the server. This class uses urllib3 for network requests. I chose it because it has good support for concurrency across threads using its ConnectionPool class. As part of implementing this functionality, the following new parameters are introduced: --remote-cache-fetch-enabled: Enables fetch of build output from the server --remote-cache-push-enabled: Enables push of build output to the server --remote-cache-url: Required if fetch or push is enabled --remote-cache-connections: Connection count (defaults to 100)
Some are inherited simply from moving code around, but this fixes the issues my change introduced.
try: | ||
if ((remote_cache and remote_cache.fetch_enabled) or | ||
use_scheduler_v2): | ||
self.job = ParallelV2(taskmaster, num, stack_size, remote_cache) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could this be isolated into a tool and duck typed in when the tool is loaded? Then scons would not be forced to take on urllib3 dependency. Is the dependency needed if you are just using the scheduler aspect?
Something similar was done with the ninja tool: https://github.com/SCons/scons/blob/master/SCons/Tool/ninja/__init__.py#L382
its a bit hacky but these internal class don't really have any other exposure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW urllib3 isn't strictly required by SCons with my change; it is still only required if you enable remote caching. Otherwise, you can still run SCons without urllib3.
I am honestly not sure about the idea of putting the new scheduler in a tool. It may also depend on whether people want this ParallelV2 to replace Parallel as a new-and-improved scheduler.
Any traction on this ? |
Not yet. It's on the list, just requires a big chunk of time to review and likely modify before being able to pull into the repo. |
@bdbaddog I am going to hold off on doing any future updates or merges until you think there is a shot of finalizing a design and landing this. Let me know if you want me to start doing any of that, but I am fine holding off or even closing this if the team doesn't want it. I think the biggest reason to not include remote caching in SCons is that SCons has no dependency enforcement functionality. We implemented dependency enforcement internally at VMware but it was done in such a VMware-specific way that I don't have any way to upstream it without significant support from the SCons team. I am open to finding a way to upstream it by providing the code we have written in VMware-specific layers, but it would take a decent amount of effort from VMware+SCons maintainers to adapt and integrate. That said, the benefit of this logic is that you can run SCons in a mode where it actually fails compilation if (1) a build action reads from a file that is not in its dependencies or side effects or (2) a build action writes to a file that is not in its targets or side effects. Right now SCons can easily be wrong about its dependencies. Remote caching without dependency enforcement is risky in that if SCons is missing a dependency (e.g. the scanner didn't find a .h file), if that dependency is changed we will continue to get remote cache hits when they should be cache misses. So in the example I just mentioned, if you change that .h file that SCons doesn't know about, you'll still get cache hits when you shouldn't. This could cause a lot of confusion because the SCons build output is different than you expect. Notably, this is also the case with the existing SCons "cache dir" functionality. But that is why we never used cache dir functionality internally. |
Dependency enforcement hasn't been an issue for most users, or at least that anyone has let the project know. That said, for some environments, it's worth the cost to do so. (VMWare being one of those environments I'm guessing.. ;) Does your dependency enforcement implementation work on windows, linux & mac? or only some of those? Or require admin type access on some of those platforms? Probably worth moving any discussion of dependency enforcement to a discussion or user's mailing list as IMHO it's not really a required part of this remote caching functionality. Ideally you'd push whatever updates you have to this functionality to this PR. As an aside, it's likely the new parallel scheduler will become the default in the near future. Realistically reviewing this code will take a solid day or two, ideally with access to you @grossag to answer questions. Any chance we can get you to join our Discord server for such consultation? |
Sorry for the delay in responding. I've been on the Discord for a while, although I don't pay real close attention. I'm on Discord as @ adamgross . For the dependency enforcement, it depends on the platform:
Dependency enforcement would be really hard to upstream, so I am fine not focusing on that. But you said "In general if the scanner's missing things, it should get fixed." and the issue is that it's nearly impossible to figure out if the scanner is really missing things without any guarantees from SCons. I am open to updating the pull request if you think this is mostly unblocked and worth proceeding on. But I vaguely recall from our last conversation that you wanted to wait until some of the Task logic was refactored until this could be considered. I am also unsure on whether it's a dealbreaker that this uses urllib3. I just don't want to burn a day updating this PR if it's not worth proceeding on. Let me know how you'd like me to proceed and I'll do my best to accommodate, thanks. |
What do we need to be able to test against remote caching server? |
Sorry I never responded to your question @bdbaddog . This PR is designed to use https://github.com/buchgr/bazel-remote with no additional modifications to that project. That repo's README has a "Docker" section which describes how to install and run it. It's effectively content-addressable storage where the ac/ folder stores the action metadata and cas/ folder stores the binary. ac/ is addressed by a unique action signature and cas/ is addressed by SHA-256 hash of the binary. |
Thanks! Any chance you'd be willing to resolve the conflicts in this PR? |
@bdbaddog : I was able to resolve all other conflicts other than the big one: integrating remote cache functionality into the When remote caching is enabled, any time a task is pulled off of the todo list (code line In my PR this was implemented using local variable Can you let me know if you have any thoughts on how best to implement asynchronous |
@acmorrow Can you provide some guidance here as to how to resolve the conflicts in the scheduler as described in #3971 (comment) ? |
So one design question is whether to favor command-line arguments, or internal project configuration. It feels to me like if you're going to make use of a supplemental remote cache, you generally want to do so (almost) all the time, and typing long options repeatedly is cumbersome. Yes, you can hide that in something that launches SCons for you, but still... at the least the options should be "settable" so you can Here's how |
This change introduces remote caching support to SCons. Currently SCons has support for caching built objects in directories, but remote caching adds support for sharing built objects across multiple machines. For example at VMware, we have our official builds push to cache and all developers will fetch from cache.
The end result of this is that developers are able to only build what they have changed locally. All other cacheable build actions will be fetched from the remote cache server, dramatically speeding up builds.
One piece that I introduced as part of this change is a new scheduler as a new class ParallelV2 that is used if remote caching is enabled or if a developer passes --use-scheduler-v2. Whereas the existing parallel scheduler waits on jobs if the job queue is full, this new scheduler will attempt to continue scanning for new jobs whenever possible. In addition, it supports remote caching by draining the "pending remote cache fetches" queue just like it drains the "pending jobs" queue. I believe that this scheduler is an
alternative to #3386 that should have the same effect in keeping jobs as high as possible. I introduced it as a separate class for now, but it could replace the Parallel class if we wanted.
The new RemoteCache class owns the job of fetch from and pushing to a Bazel remote cache server or any other similar server that supports /ac/ and /cas/ GET and PUT requests using SHA-256 file names. See https://github.com/buchgr/bazel-remote for more details on the server. This class uses urllib3 for network requests (see below acknowledgement that this violates SCons guidelines). I chose it because it has good support for concurrency across threads using its ConnectionPool class. If remote caching is requested and urllib3 is not present, it raises an exception. If remote caching is not requested, urllib3 is not required.
The change to the scheduler is needed because cache fetches necessarily must be asynchronous due to network latency. If a request can take 200ms, we can't hang the main thread while waiting on it. And we also can't do the requests inside of job threads (like CacheDir does) because we want to have more active network requests than job threads given that network requests are bound by network I/O, not disk I/O or CPU.
Given the aforementioned change to the scheduler, I don't see a way to make this into a Tool, which was a possibility mentioned on Discord a while back. That is, I wasn't able to find a good way to make this self-contained enough to be a Tool.
As part of implementing this functionality, the following new parameters are introduced:
I understand that this patch violates the SCons guidelines in two main ways:
Contributor Checklist:
CHANGES.txt
(and read theREADME.rst
)