Limiting the concurrency of jobs on a worker #963
@cormacrelf Thank you for the very thoughtful post and the helpful insight for all users. We started to stress the need to expand the Buck2 documentation in #958, but your post has made it clear that there's a lot more we can do. We are going to prioritize this work going forward, as our team has just expanded quite a bit. We will discuss this in Buck2 office hours.
Hell yeah, nice job getting funding I suppose!
Something that might be useful, re the note I made about multiple execution platforms (see the Buck2 execution platform resolution docs): I think you can define multiple execution platforms and let execution platform resolution pick between them.
Haven't tried this yet, but I think that's the idea. For a lightweight & speedy LRE setup you would probably want a single image containing all the dependencies, and just alter the PATH in each ExecutionPlatformInfo. For more of a cloud deployment, you would probably use either different docker images, or just forward the constraint value verbatim to nativelink and have nativelink's scheduler figure out which workers to run your stuff on. You couldn't do anything more fancy, like automatically turning the closure of nix store paths into a big set of properties. You can obviously combine this with other constraints -- you could have a worker running under a different OS or architecture, for example.
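For concreteness, here's a rough Starlark sketch of that idea (untested; the rule shape is loosely modelled on the nativelink example in the buck2 repo, and `worker-flavor` is a made-up property name for illustration):

```python
# platforms/defs.bzl -- untested sketch of "one execution platform per flavor".
# Each platform forwards its flavor to nativelink as a platform property so the
# scheduler can route actions to workers that advertise the same property.

def _platforms_impl(ctx):
    platforms = []
    for flavor in ctx.attrs.flavors:
        platforms.append(ExecutionPlatformInfo(
            label = ctx.label.raw_target(),
            # A real setup would attach constraint values here so that
            # execution platform resolution can actually pick between flavors.
            configuration = ConfigurationInfo(constraints = {}, values = {}),
            executor_config = CommandExecutorConfig(
                local_enabled = False,
                remote_enabled = True,
                remote_execution_use_case = "buck2-default",
                remote_execution_properties = {
                    "worker-flavor": flavor,
                },
            ),
        ))
    return [DefaultInfo(), ExecutionPlatformRegistrationInfo(platforms = platforms)]

platforms = rule(
    impl = _platforms_impl,
    attrs = {
        "flavors": attrs.list(attrs.string(), default = ["gcc", "clang"]),
    },
)
```

Each toolchain flavor would then be marked exec-compatible with its platform, and the nativelink scheduler would match `worker-flavor` against whatever the workers advertise in their platform properties.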
Hi @cormacrelf, have you tried using the `cpu_count` platform property with a `property_modifier` scheduler? We do this internally with something like:

```json
{
  ...
  "schedulers": {
    "MAIN_SCHEDULER": {
      "property_modifier": {
        "modifications": [
          {"add": {"name": "cpu_count", "value": "1"}}
        ],
        "scheduler": {
          "simple": {
            "supported_platform_properties": {
              "cpu_count": "minimum"
            }
          }
        }
      }
    }
  },
  ...
}
```

This should make it so that if a job comes in without a `cpu_count` property, one is added with a value of 1, and the scheduler treats `cpu_count` as a minimum when matching jobs to workers, so a worker only takes as many concurrent jobs as its advertised `cpu_count` allows.

We'll talk about updating the documentation internally. Thanks for bringing this up :-)
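Not from this thread, but to make the `cpu_count` idea concrete on the buck2 side: if I understand the `minimum` property type correctly, an execution platform can claim more than one unit per action via `CommandExecutorConfig.remote_execution_properties`, so a worker advertising e.g. 16 `cpu_count` would only run a few such actions at once. A rough, untested sketch:

```python
# Untested sketch: executor_config for an execution platform whose actions each
# claim 4 units of the worker's advertised cpu_count. With the property_modifier
# config above, actions that request nothing still claim 1. This goes inside an
# ExecutionPlatformInfo, as in the buck2 nativelink example.
executor_config = CommandExecutorConfig(
    local_enabled = False,
    remote_enabled = True,
    remote_execution_use_case = "buck2-default",
    remote_execution_properties = {
        "cpu_count": "4",  # should not exceed what any single worker advertises
    },
)
```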
@allada this worked, thanks! I was going to benchmark a build just for kicks, but when I did it with the infinite parallelism turned back on, my computer thrashed the swap space so hard it froze. Safe to say the new config is much better, because it didn't do that.
Yeah, the performance overhead should be negligible. It's just adding a single string to each job's property map.
It appears nativelink workers are happy to execute hundreds, even thousands of jobs at a time. On a laptop.
I've been trying a setup, before deploying a proper shared RE cluster, where you just use `nativelink basic_cas.json` roughly as-is and use it much like `sccache` or a persistent build output folder on a single machine. I think you folks call this Local Remote Execution.

Some notes on how this can be done with Buck2, if anyone's interested.
For buck2, the overall plan looks a lot like the Bazel strategy. You can try it out just with `nix run github:TraceMachina/nativelink basic_cas.json` and otherwise following the nativelink example in the buck2 repo. (Note that the `container-image` property in that setup has no effect unless you add a custom entrypoint to the nativelink worker config.)

But ultimately the plan for LRE looks like this:
- Build the `nativelink-worker-lre-cc` image from the nix config.
- `sed -i` updating the docker-compose file, and possibly wiping the storage if the image hash changes.
- Have `direnv` / nix write out some config files in `.buckconfig.d/`.
- One wrinkle: buck2 ends up with `RunInfo(args = ["env", "PATH=...", "rustc"])` and puts the rest in an argfile, so you get a command line like `env @/tmp/rustc-args-1234.txt`, which is nonsense. So maybe some tweaks to the buck2 prelude are needed.
- Alternatively, use the `"additional_environment": { "PATH": { "source": "some_worker_path_platform_property" } }` config, and then put the path in your remote execution platform's `CommandExecutorConfig.remote_execution_properties`. Then you only get one PATH for the entire RE execution platform in buck, though (rough sketch below).

I think the Bazel LRE config that's there is a pretty good start for nix users. If I get it working well with Buck, I'll try to contribute some of that config back.
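For that last bullet, the buck2 side might look roughly like this (untested; `some_worker_path_platform_property` is the placeholder name from the worker config snippet, and the path value is made up):

```python
# Untested sketch: forward a single PATH for the whole RE execution platform.
# The nativelink worker's "additional_environment" config would read this
# platform property and export it as PATH inside each job.
executor_config = CommandExecutorConfig(
    local_enabled = False,
    remote_enabled = True,
    remote_execution_use_case = "buck2-default",
    remote_execution_properties = {
        # Made-up toolchain path; in the LRE case this would be the nix-built PATH.
        "some_worker_path_platform_property": "/nix/store/xxxx-lre-toolchain/bin",
    },
)
```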
On a fresh build, our buck2 immediately spawns 2000+ jobs, thankfully mostly small ones for writing args files etc. If it goes well, all the way until the last few targets, it keeps dozens to hundreds of jobs going, most of them intensive multi-threaded compile tasks. You can't affect this using the `-j` flag in buck; that only applies to local execution.
In my terminal just now:
That's surely not a very healthy way to drive the Linux scheduler. By some miracle the computer doesn't fall over.
At the end of the day you're getting backpressure from Linux itself and sometimes resource exhaustion if you underestimate buck2's ability to send a firehose of jobs at RE. I think a lot of those processes are fully paged out / 0% CPU, so it probably ends up mostly as disk traffic. But you have to basically pick a value for resource limits that is big enough for the widest antichain in your build graph. Without a limit on concurrent jobs, it's really hard to say what a good number of file descriptors is, for example. Whatever limit you set, you can construct a build graph that will exceed it. And nativelink does fail pretty hard in that case -- it fails to delete work directories, hits all the retry limits, and then refuses to build anything until you restart it and delete some things in the work directory.
Questions:

- Is this what's happening on your average nativelink RE worker? I can't find any backpressure in the configuration or in the codebase.
- Surely that's a problem beyond just laptops, right? Is the idea that usually you just spin up enough workers that this is never a problem for the demand you have?