Collect CPU utilization statistics of CI builders #48828

Closed
alexcrichton opened this issue Mar 7, 2018 · 4 comments · Fixed by #61632
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue.

Comments

@alexcrichton
Member

One of the easiest ways to make CI faster is to make things parallel and simply use the hardware we have available to us. Unfortunately though we don't have a lot of data about how parallel our build is. Are there steps we think are parallel but actually aren't? Are we pegged to one core for long durations when there's other work we could be doing?

The general idea here is that we'd spin up a daemon at the very start of the build which would sample CPU utilization every so often. This daemon would then update a file that's either displayed or uploaded at the end of the build.
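A minimal sketch of such a sampler, assuming a Linux builder where `/proc/stat` is available. The function names, the 30-second interval, and the `cpu-usage.log` output path are illustrative assumptions, not part of any existing tooling:

```python
import time


def parse_cpu_line(line):
    """Return (idle_ticks, total_ticks) from the aggregate "cpu" line of /proc/stat."""
    # Layout: cpu user nice system idle iowait irq softirq steal [guest guest_nice]
    # Guest time is already counted in user/nice, so only the first eight
    # fields are summed.
    fields = [int(value) for value in line.split()[1:9]]
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)  # idle + iowait
    return idle, sum(fields)


def busy_percent(prev, cur):
    """Percentage of CPU time spent non-idle between two samples, over all cores."""
    total_delta = cur[1] - prev[1]
    idle_delta = cur[0] - prev[0]
    if total_delta <= 0:
        return 0.0
    return 100.0 * (total_delta - idle_delta) / total_delta


def read_sample(path="/proc/stat"):
    """Read one (idle, total) sample from the first line of /proc/stat."""
    with open(path) as stat:
        return parse_cpu_line(stat.readline())


def sample_forever(out_path="cpu-usage.log", interval=30.0):
    """Append one "timestamp,percent" line per interval; meant to run in the background."""
    prev = read_sample()
    with open(out_path, "a") as out:
        while True:
            time.sleep(interval)
            cur = read_sample()
            stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
            out.write("%s,%.1f\n" % (stamp, busy_percent(prev, cur)))
            out.flush()
            prev = cur
```

Because the percentage is computed over all cores, a build pegged to one core on a 4-core machine would hover around 25%.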

Hopefully we could then use these logs to get a better view into how the builders are working during the build, diagnose non-parallel portions of the build, and implement fixes to use all the cpus we've got.

cc @rust-lang/infra

@alexcrichton alexcrichton added A-build T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. labels Mar 7, 2018
@retep998
Member

retep998 commented Mar 8, 2018

On Windows this can be done by taking advantage of job objects. If the entire build is wrapped in a job object then we can call QueryInformationJobObject with JobObjectBasicAccountingInformation to get a bunch of useful data.
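A rough sketch of that query through `ctypes`, assuming the build already runs inside a job object and a handle to it is available. The helper names are hypothetical, and the structure layout follows the documented `JOBOBJECT_BASIC_ACCOUNTING_INFORMATION` definition; this has not been run on a CI builder:

```python
import ctypes

# Value of JobObjectBasicAccountingInformation in the JOBOBJECTINFOCLASS enum.
JOB_OBJECT_BASIC_ACCOUNTING_INFORMATION = 1


class JobBasicAccounting(ctypes.Structure):
    """Mirrors JOBOBJECT_BASIC_ACCOUNTING_INFORMATION; times are in 100 ns units."""
    _fields_ = [
        ("TotalUserTime", ctypes.c_int64),
        ("TotalKernelTime", ctypes.c_int64),
        ("ThisPeriodTotalUserTime", ctypes.c_int64),
        ("ThisPeriodTotalKernelTime", ctypes.c_int64),
        ("TotalPageFaultCount", ctypes.c_uint32),
        ("TotalProcesses", ctypes.c_uint32),
        ("ActiveProcesses", ctypes.c_uint32),
        ("TotalTerminatedProcesses", ctypes.c_uint32),
    ]


def query_job_accounting(job_handle):
    """Windows only: read accounting data for an existing job object handle."""
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    query = kernel32.QueryInformationJobObject
    query.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p,
                      ctypes.c_uint32, ctypes.POINTER(ctypes.c_uint32)]
    query.restype = ctypes.c_int
    info = JobBasicAccounting()
    ok = query(job_handle, JOB_OBJECT_BASIC_ACCOUNTING_INFORMATION,
               ctypes.byref(info), ctypes.sizeof(info), None)
    if not ok:
        raise ctypes.WinError(ctypes.get_last_error())
    return info


def average_busy_cores(info, wall_seconds):
    """Average number of cores kept busy by the job over wall_seconds."""
    cpu_seconds = (info.TotalUserTime + info.TotalKernelTime) / 1e7
    return cpu_seconds / wall_seconds
```

Calling `query_job_accounting` periodically and differencing the totals would give a per-interval figure comparable to the system-wide sampling discussed above, but limited to processes in the job.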

@matthiaskrgr
Member

matthiaskrgr commented Mar 8, 2018

I made a script that prints top output into the Travis log every 30 seconds (log, raw log).

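# launch in travis as 'pathto/script.sh &'
# Every 30 seconds, append the first four lines of top's batch output,
# joined onto a single line, to the build log and to /tmp/top.log.
while sleep 30
do
    top -ibn 1 | head -n4 | tr "\n" " " | tee -a /tmp/top.log
    echo "" | tee -a /tmp/top.log
done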

Some findings:

Cloning the submodules jemalloc, libcompiler_builtins and liblibc alone takes 30 seconds.

While building bootstrap, compiling serde_derive, serde_json and bootstrap crates seems to take 30 seconds (total build time: 47 seconds).

stage0:
Compiling tidy crate seems to take around 30 seconds.
Compiling rustc_errors takes at least 2 minutes, only one codegen-unit is used
Compiling syntax_ext takes 9 minutes, only one CGU used

stage0 codegen artifacts:
Compiling rustc_llvm takes 1.5 minutes, one CGU

During stage1, the rustc_errors and syntax_ext builds are approximately as slow as during stage0; rustc_plugins takes 2 minutes, one CGU.

stage2:
rustdoc took 2 minutes to build, one CGU

compiletest suite=run-make mode=run-make:
It looks like there is a single test that takes around 3 minutes to complete and has no parallelization.

Testing alloc stage1:
building liballoc takes around a minute

Testing syntax stage1:
building syntax takes 1.5 minutes, one CGU

Notes:
When the load average dropped towards 1, I assumed only one codegen unit was active.
The script was only applied to the default pull request Travis CI configuration.
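The load-average heuristic from the notes can be sketched as a small log filter. This assumes each line of `/tmp/top.log` contains top's usual `load average: a, b, c` summary; the 1.5 threshold and the function names are arbitrary illustrative choices:

```python
import re

# Matches the "load average: 1.02, 2.00, 1.60" part of top's summary line.
LOAD_RE = re.compile(
    r"load average:\s*(\d+(?:\.\d+)?),\s*(\d+(?:\.\d+)?),\s*(\d+(?:\.\d+)?)"
)


def one_minute_load(line):
    """Return the 1-minute load average found in a top summary line, or None."""
    match = LOAD_RE.search(line)
    return float(match.group(1)) if match else None


def serial_samples(lines, threshold=1.5):
    """Return indices of samples whose 1-minute load is at or below threshold."""
    result = []
    for index, line in enumerate(lines):
        load = one_minute_load(line)
        if load is not None and load <= threshold:
            result.append(index)
    return result
```

With one sample every 30 seconds, a run of consecutive indices from `serial_samples` gives a rough lower bound on how long the build stayed on about one core. The load average is a smoothed figure, so short serial phases may not show up.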

@kennytm
Member

kennytm commented Mar 12, 2018

As shown in #48480 (comment), the CPUs assigned to each job may have some performance difference:

  • Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
  • Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

The clock rates (2.4 GHz vs 2.5 GHz) shouldn't make any noticeable difference though: if everything is CPU-bound, the slower CPU would cost at most 7.2 minutes out of 3 hours. That is not enough to explain the timeout in #48480.
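The 7.2-minute figure can be checked with a short calculation. It assumes run time scales inversely with clock rate and that the 3 hours were measured on the slower CPU; the function name is illustrative:

```python
def minutes_saved(total_minutes, slow_ghz, fast_ghz):
    """Time saved if a fully CPU-bound job measured on the slower CPU ran on the faster one."""
    return total_minutes * (1.0 - slow_ghz / fast_ghz)


# 3 hours on the 2.40 GHz part versus the 2.50 GHz part: roughly 7.2 minutes.
saved = minutes_saved(180, 2.4, 2.5)
```

The ratio 2.4 / 2.5 is 0.96, so the faster CPU would finish the same work in 96% of the time, a 4% difference.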

@alexcrichton
Member Author

I was working on https://github.com/alexcrichton/cpu-usage-over-time recently for this. It periodically prints out the CPU usage as a percentage for the whole system (so one busy core on a 4-core machine shows as 25%). I only got Linux/OSX working though and was unable to figure out a good way to do it on Windows.

My thinking for how we'd do this is to download a binary near the beginning of the build (or set up some script). We'd then run stamp that-script > some-output.log just before we run stamp run-the-build.sh. That way we could correlate the timestamps of the two logs (the main log and some-output.log) to similar moments in time.

Initially I was also thinking we'd just cat some-output.log at the end of the build and scrape it later if need be.
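One way the later scraping step could line the two logs up is sketched below. It assumes both logs have already been parsed into `(seconds, value)` pairs sorted by time; the exact timestamp format produced by `stamp` is not assumed here, and the function names are hypothetical:

```python
import bisect


def nearest_sample(sample_times, sample_values, when):
    """Return the CPU sample whose timestamp is closest to `when` (all in seconds)."""
    if not sample_times:
        return None
    pos = bisect.bisect_left(sample_times, when)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(sample_times)]
    best = min(candidates, key=lambda i: abs(sample_times[i] - when))
    return sample_values[best]


def annotate(build_log, cpu_log):
    """Pair each (seconds, text) build line with the closest (seconds, percent) CPU sample."""
    times = [t for t, _ in cpu_log]
    values = [v for _, v in cpu_log]
    return [(t, text, nearest_sample(times, values, t)) for t, text in build_log]
```

Each build log line then carries the CPU percentage recorded closest to it in time, which makes low-utilization steps easy to spot.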

@XAMPPRocky XAMPPRocky added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label May 14, 2018
@jonas-schievink jonas-schievink added T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) and removed T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) A-build labels Apr 21, 2019
alexcrichton added a commit to alexcrichton/rust that referenced this issue Jun 11, 2019
This commit adds a script which we'll execute on Azure Pipelines which
is intended to run in the background and passively collect CPU usage
statistics for our builders. The intention here is that we can use this
information over time to diagnose issues with builders, see where we can
optimize our build, fix parallelism issues, etc. This might not end up
being too useful in the long run but it's data we've wanted to collect
for quite some time now, so here's a stab at it!

Comments about how this is intended to work can be found in the python
script used here to collect CPU usage statistics.

Closes rust-lang#48828
Centril added a commit to Centril/rust that referenced this issue Jun 12, 2019
…=pietroalbini

ci: Collect CPU usage statistics on Azure
