Limiting the concurrency of jobs on a worker #963

Closed

cormacrelf opened this issue Jun 4, 2024 · 6 comments
@cormacrelf
Contributor

cormacrelf commented Jun 4, 2024

It appears nativelink workers are happy to execute hundreds, even thousands of jobs at a time. On a laptop.

I've been trying a setup, before deploying a proper shared RE cluster, where you just use nativelink basic_cas.json roughly as-is and use it much like sccache or a persistent build output folder on a single machine. I think you folks call this Local Remote Execution.

Some notes on how this can be done with Buck2 if anyone's interested

For buck2, the overall plan looks a lot like the Bazel strategy. You can try it out just with nix run github:TraceMachina/nativelink basic_cas.json and otherwise following the nativelink example in the buck2 repo. (Note that the container-image property in that setup has no effect unless you add a custom entrypoint to the nativelink worker config.)

But ultimately the plan for LRE looks like this:

  • Probably docker-compose, using an image like nativelink-worker-lre-cc from the nix config.
  • Probably automate the process of sed -i updating the docker-compose file, and possibly wiping the storage if the image hash changes
  • Have direnv / nix write out some config files in .buckconfig.d/
  • Have your buck toolchains read from the config keys you wrote in there.
    • It can be tricky to get the PATH in there for some of the toolchains. For example, the rust toolchain splits up "env" and the rest in RunInfo(args = ["env", "PATH=...", "rustc"]) and puts the rest in an argfile, so you get a nonsense command line like env @/tmp/rustc-args-1234.txt. So the buck2 prelude may need some tweaks.
    • Or you can pull the PATH in using the nativelink local worker's "additional_environment": { "PATH": { "source": "some_worker_path_platform_property" } } config, and then put the path in your remote execution platform's CommandExecutorConfig.remote_execution_properties. Then you only get one PATH for the entire RE execution platform in buck, though.
    • Hm, maybe buck supports multiple RE execution platforms being configured at once? Haven't tried.
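A minimal sketch of the additional_environment route above, for the worker side (the property name some_worker_path_platform_property is illustrative, and the surrounding workers/local nesting is my best guess at the nativelink config schema, so double-check it against the schema docs):

```json
{
  "workers": [{
    "local": {
      "additional_environment": {
        "PATH": { "source": "some_worker_path_platform_property" }
      }
    }
  }]
}
```

Buck would then supply the property via CommandExecutorConfig.remote_execution_properties, e.g. {"some_worker_path_platform_property": "/path/for/this/platform"}.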

I think the Bazel LRE config that's there is a pretty good start for nix users. If I get it working well with Buck, I'll try to contribute some of that config back.


On a fresh build, our buck2 immediately spawns 2000+ jobs, thankfully mostly small ones for writing args files etc. If it goes well, all the way until the last few targets, it keeps dozens to hundreds of jobs going, most of them intensive multi-threaded compile tasks. You can't affect this using the -j flag in buck, that only applies to local execution.

In my terminal just now:

$ ps aux | grep rustc | wc -l
313

That's surely not a very healthy way to drive the Linux scheduler. By some miracle the computer doesn't fall over.

At the end of the day you're getting backpressure from Linux itself and sometimes resource exhaustion if you underestimate buck2's ability to send a firehose of jobs at RE. I think a lot of those processes are fully paged out / 0% CPU, so it probably ends up mostly as disk traffic. But you have to basically pick a value for resource limits that is big enough for the widest antichain in your build graph. Without a limit on concurrent jobs, it's really hard to say what a good number of file descriptors is, for example. Whatever limit you set, you can construct a build graph that will exceed it. And nativelink does fail pretty hard in that case -- it fails to delete work directories, hits all the retry limits, and then refuses to build anything until you restart it and delete some things in the work directory.

Is this what's happening on your average nativelink RE worker? I can't find any backpressure in the configuration or in the codebase. Surely that's a problem beyond just laptops, right? Is the idea that usually you just spin up enough workers that this is never a problem for the demand you have?

Questions:

  • Is there anywhere for the backpressure to go? Can workers report a max concurrency value to the scheduler and have the scheduler keep count, and delay sending new jobs?
  • Can the worker itself use the GNU make jobserver protocol (i.e. https://docs.rs/jobserver or the async version https://docs.rs/jobslot) to cooperatively schedule with the compilers it's running? Especially for rustc, which typically splits into 16 codegen units and runs them all through LLVM in parallel, and sometime this year will start running the frontend in parallel too.
  • Do you get a sane result if you report a worker's max concurrency, and also use the jobserver protocol at the same time? I think you would.
@MarcusSorealheis
Collaborator

@cormacrelf Thank you for the very thoughtful post and the helpful insight for all users.

We've started to stress the need to expand Buck2 documentation in #958, but your post has made it clear that there's a lot more we can do. We are going to prioritize this work going forward, as our team has just expanded quite a bit. We will discuss Buck2 in office hours.

@cormacrelf
Contributor Author

Hell yeah, nice job getting funding I suppose!

@cormacrelf
Contributor Author

cormacrelf commented Jun 4, 2024

Something that might be useful, re the note I made about multiple execution platforms.

Buck2 execution platform resolution docs

I think you can:

  • Have a constraint_setting(name="re") + constraint_value(name="cc", constraint_setting = ":re") ...
  • And set exec_compatible_with = ["//config/re:cc"] on your toolchains//:cxx target, or on your custom toolchain rule (either way works)
  • And set exec_compatible_with = ["//config/re:rust"] on your toolchains//:rust target, or on your custom toolchain rule (either way works)
  • And add one ExecutionPlatformInfo per constraint_value with a different value for the setting.
  • And add that constraint in the ExecutionPlatformInfo.configuration
  • And add a different container-image or custom PATH parameter as you wish to each one.
    • Either method must be handled by the nativelink worker either via its entrypoint script or additional_environment.
  • Pass those platforms to your ExecutionPlatformRegistrationInfo(platforms = platforms)

Haven't tried this yet. But I think that's the idea. For a lightweight & speedy LRE setup you would probably want a single image containing all the dependencies, and just alter the PATH in each ExecutionPlatformInfo. For more of a cloud deployment, you would probably use either different docker images, or just forward the constraint value verbatim to nativelink and have nativelink's scheduler figure out which workers to run your stuff on.

You couldn't do anything more fancy, like automatically turning the closure of nix store paths into a big set of constraint_values, because constraint settings can only take on one value at a time, I think. But that would probably be overkill anyway.

You can obviously combine this with other constraints -- you could have a worker running under qemu emulating arm64, and that one would get different PATH values. So you would want to generate most of this code to give you something resembling a sparse matrix of //config/re: X config//cpu: X config//os:. Nix should be able to evaluate the full set of paths for each of its nix systems without actually building those images, and dump the PATH configuration for each in a .bzl file or a .buckconfig.
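The bullets above could be wired up with a small Starlark rule, roughly like this. This is a sketch, not tested against the current buck2 prelude; the attribute names and some_worker_path_platform_property are placeholders:

```python
# platforms.bzl sketch (Buck2 Starlark)

def _exec_platform_impl(ctx):
    constraint = ctx.attrs.re_constraint[ConstraintValueInfo]
    return [
        DefaultInfo(),
        ExecutionPlatformInfo(
            label = ctx.label.raw_target(),
            # one constraint_value per platform, e.g. //config/re:cc
            configuration = ConfigurationInfo(
                constraints = {constraint.setting.label: constraint},
                values = {},
            ),
            executor_config = CommandExecutorConfig(
                local_enabled = False,
                remote_enabled = True,
                # forwarded to nativelink; consumed on the worker via
                # additional_environment or the entrypoint script
                remote_execution_properties = {
                    "some_worker_path_platform_property": ctx.attrs.path,
                },
                remote_execution_use_case = "buck2-default",
            ),
        ),
    ]

exec_platform = rule(
    impl = _exec_platform_impl,
    attrs = {
        "re_constraint": attrs.dep(providers = [ConstraintValueInfo]),
        "path": attrs.string(),
    },
)
```

You'd instantiate one exec_platform per constraint_value with its own PATH, then collect them all into ExecutionPlatformRegistrationInfo(platforms = ...).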

@allada
Member

allada commented Jun 4, 2024

Hi @cormacrelf, have you tried using the PlatformPropertiesModifier scheduler to help with this?

We do this internally with something like:

{
  ...
  "schedulers": {
    "MAIN_SCHEDULER": {
      "property_modifier": {
        "modifications": [
          {"add": {"name": "cpu_count", "value": "1"}}
        ],
        "scheduler": {
          "simple": {
            "supported_platform_properties": {
              "cpu_count": "minimum"
            }
          }
        }
      }
    }
  }
  ...
}

This should make it so that if a job comes in without a cpu_count, the modifier adds it with a default of cpu_count = 1.
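For the scheduler to keep count, the worker also needs to advertise a capacity on the same key. Roughly like this — the platform_properties shape here is a sketch from memory and may not match the current nativelink worker schema exactly:

```json
"workers": [{
  "local": {
    "platform_properties": {
      "cpu_count": { "values": ["8"] }
    }
  }
}]
```

With "cpu_count": "minimum" on the scheduler, each running job subtracts its cpu_count (defaulted to 1 by the modifier) from the worker's advertised capacity, so the scheduler stops dispatching to that worker once it is saturated.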

We'll talk about updating the documentation internally. Thanks for bringing this up :-)

@cormacrelf
Contributor Author

@allada this worked, thanks!

I was going to benchmark a build just for kicks, but when I did it with the infinite parallelism turned back on, my computer thrashed the swap space so hard it froze. Safe to say the new config is much better, because it didn't do that.

@allada
Member

allada commented Jun 7, 2024

Yeah, the performance overhead should be negligible. It's just adding a single string to a HashMap on execution requests. Compared to the gRPC overhead, this is nothing.
