perf: reduce memory usage of bes-uploader #20579

Open
christianscott opened this issue Dec 18, 2023 · 3 comments
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Performance Issues for Performance teams type: bug

Comments

christianscott commented Dec 18, 2023

Description of the bug:

we've been able to improve the overall performance of our builds by improving the throughput of bes-uploader. when events are not processed quickly, the eventQueue and ackQueue can grow faster than they are drained, so these queues consume more and more of the heap. sometimes this causes bazel to OOM; in other cases the build just slows down because of the extra competition for memory.

we've seen two reasons for bes-uploader processing events slowly:

  1. slow remote server. we've seen this when the CAS is slow, but I imagine a slow BES would have the same effect
  2. bes-uploader does too much work, see "DigestUtils: avoid throwing on invalid digest function name" (#20574) and "ByteStreamBuildEventArtifactUploader: skip reading metadata for files that won't be uploaded" (#20575)

there are things we can do to address both of these, but it would be nice if the bes-uploader wasn't able to cause the rest of the build to perform poorly, even if it can't clear the events quickly.

note that --bes_upload_mode=fully_async does not help, because the pending events still need to be held in memory until they are uploaded.

some ideas:

  1. reduce the size of SendRegularBuildCommand by applying the PathConverter as early as possible. if paths were converted when an event is pushed to the queue, the queue would not need to keep PathConverter instances alive and they could be collected immediately. some PathConverter instances for our monorepo are 22 MB (see screenshot below)
  2. serialize events before appending to the eventQueue
  3. offload some of the queue to disk (maybe using mmap)

[Image: Screenshot 2023-12-18 at 12 54 54 pm]
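
A rough sketch of how ideas 1 and 2 could combine: convert the path and serialize the event at enqueue time, so the queue only retains a compact byte array and never holds a reference to the converter. Everything here is hypothetical and for illustration only. `EagerEventQueue`, `PendingEvent`, and the tab-separated encoding are made-up stand-ins, not Bazel's actual classes or wire format, and the converter is modelled as a plain `Function<String, String>`.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Function;

/** Sketch: resolve paths and serialize when enqueuing, so the queue holds only bytes. */
final class EagerEventQueue {

  /** Hypothetical stand-in for a build event that still refers to a local path. */
  static final class PendingEvent {
    final String id;
    final String localPath;

    PendingEvent(String id, String localPath) {
      this.id = id;
      this.localPath = localPath;
    }
  }

  // Only serialized payloads are stored; no event objects, no converters.
  private final Queue<byte[]> queue = new ArrayDeque<>();

  /**
   * Converts the path and serializes immediately. The converter is used only
   * inside this call, so it becomes collectable as soon as the caller drops it.
   */
  void enqueue(PendingEvent event, Function<String, String> pathConverter) {
    String uri = pathConverter.apply(event.localPath);
    // Assumption: a trivial text encoding stands in for real protobuf serialization.
    String wire = event.id + "\t" + uri;
    queue.add(wire.getBytes(StandardCharsets.UTF_8));
  }

  /** Returns the next serialized event, or null if the queue is empty. */
  byte[] poll() {
    return queue.poll();
  }

  int size() {
    return queue.size();
  }
}
```

The trade-off in this sketch is that the conversion and serialization cost moves onto whichever thread enqueues the event, in exchange for a smaller and more predictable per-event footprint in the queue.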

Which category does this issue belong to?

Performance

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

macos, linux

What is the output of bazel info release?

6.4.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

christianscott commented

How crucial is it for all build events to be submitted to the remote server? Should the build crash if events can't be submitted? How bad would it be to drop events when the alternative is a risk of crashing the build? I wonder if Bazel could stop accepting events once the queue grows beyond some length.
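
To make the idea concrete, here is a minimal sketch of a queue that stops accepting events at a fixed capacity and counts what it drops instead of growing without bound. This is not how the BES uploader works today; `BoundedEventQueue` and its capacity parameter are hypothetical names. It relies only on `java.util.concurrent.LinkedBlockingQueue`, whose `offer` returns `false` rather than blocking when the queue is full.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Sketch: a capacity-limited event queue that drops new events when full. */
final class BoundedEventQueue<E> {
  private final LinkedBlockingQueue<E> queue;
  private final AtomicLong dropped = new AtomicLong();

  BoundedEventQueue(int capacity) {
    // The capacity bound is what keeps heap usage from growing with the backlog.
    this.queue = new LinkedBlockingQueue<>(capacity);
  }

  /**
   * Tries to enqueue without blocking. Returns true if the event was accepted,
   * false if the queue was full and the event was dropped.
   */
  boolean offer(E event) {
    boolean accepted = queue.offer(event);
    if (!accepted) {
      dropped.incrementAndGet();
    }
    return accepted;
  }

  /** Returns the next event, or null if the queue is empty. */
  E poll() {
    return queue.poll();
  }

  int size() {
    return queue.size();
  }

  /** Number of events rejected so far; could be reported at the end of the build. */
  long droppedCount() {
    return dropped.get();
  }
}
```

Dropping the newest event is only one possible policy. Blocking the producer, or dropping only events that are not needed to finish the stream, are alternatives with different trade-offs.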

meisterT added the "P2 We'll consider working on this in future. (Assignee optional)" label and removed the "untriaged" label on Dec 19, 2023
tjgq commented Dec 19, 2023

We should fix #20576 first and then reanalyze this. My hope is that once we're no longer digesting files unnecessarily in the BES uploader, both the queuing and memory problems go away.

joeljeske commented

I am seeing intermittent OOMs in 7.1.1 due to the growing size of the BES uploader queue.

Is there any progress on this issue, or any information I can provide to assist?
