Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI (Buildkite, code coverage): increase the value of JULIA_WORKER_TIMEOUT on the code coverage job #42193

Merged
merged 1 commit into from
Sep 10, 2021

Conversation

DilumAluthge
Copy link
Member

Let's start with something really big (20 minutes), and then I can decrease it once we get this working.

@DilumAluthge DilumAluthge added ci Continuous integration backport 1.6 Change should be backported to release-1.6 backport 1.7 labels Sep 10, 2021
@DilumAluthge DilumAluthge requested a review from a team as a code owner September 10, 2021 02:32
@DilumAluthge DilumAluthge requested review from staticfloat and vtjnash and removed request for a team September 10, 2021 02:32
@DilumAluthge
Copy link
Member Author

@vtjnash I pushed a commit to this PR that sets export JULIA_NUM_THREADS=1 before we start the code coverage job.

@DilumAluthge DilumAluthge force-pushed the dpa/ci-buildkite-coverage-worker-timeout branch from 5bf76ae to 0e0a460 Compare September 10, 2021 02:47
@@ -29,6 +29,9 @@ steps:
git config --global init.defaultBranch master

echo "--- Run Julia tests in parallel with code coverage enabled"
export JULIA_NUM_THREADS=1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
export JULIA_NUM_THREADS=1
unset JULIA_NUM_THREADS

But why is this set at all? It is not a good base configuration.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My memory is hazy, but if I recall correctly, we didn't use to have this set, and then some job started 256 Julia threads (because the underlying machine had 128 physical cores, 256 CPU threads) and clobbered the whole machine, so we set this variable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also set JULIA_CPU_THREADS to 16. Maybe that is sufficient, and we can remove JULIA_NUM_THREADS from the default configuration?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@staticfloat will probably remember why we set export JULIA_NUM_THREADS=16 on all the Buildkite agents.

Maybe we should set the following configuration on all of the Buildkite agents, just to be safe:

export JULIA_CPU_THREADS=16
export JULIA_NUM_THREADS=1

@DilumAluthge DilumAluthge force-pushed the dpa/ci-buildkite-coverage-worker-timeout branch from 0e0a460 to 30a0970 Compare September 10, 2021 02:55
@DilumAluthge
Copy link
Member Author

Follow the progress of the Buildkite job here: https://buildkite.com/julialang/julia-master-scheduled/builds/228

@DilumAluthge
Copy link
Member Author

DilumAluthge commented Sep 10, 2021

@vtjnash The cmdlineargs test set takes almost 2 hours (see below) when --sysimage-native-code=no is set. Would it be okay if we excluded the cmdlineargs test set for the code coverage Buildkite job? We'd still run the cmdlineargs tests during all of the regular jobs (e.g. tester_*); we'd just skip it from the code coverage job.

Log output (from https://buildkite.com/julialang/julia-master-scheduled/builds/228):

cmdlineargs                        (13) |        started at 2021-09-10T03:17:19.982
cmdlineargs                        (13) |         failed at 2021-09-10T05:10:16.980

@DilumAluthge
Copy link
Member Author

Alternatively, instead of excluding the cmdlineargs tests, we could start those tests first. Then, by the time we have finished all the other tests, the cmdlineargs tests have finished, but it hasn't significantly increased our overall wall time.

Is there a way that I can tell Base.runtests that I want to force the cmdlineargs test to be started first?

@DilumAluthge
Copy link
Member Author

Alternatively, instead of excluding the cmdlineargs tests, we could start those tests first. Then, by the time we have finished all the other tests, the cmdlineargs tests have finished, but it hasn't significantly increased our overall wall time.

Is there a way that I can tell Base.runtests that I want to force the cmdlineargs test to be started first?

#42198

@vtjnash
Copy link
Member

vtjnash commented Sep 10, 2021

Perhaps we should attempt to run that test with sysimage-native-code=yes? Launching that many processes is going to be hard (slow).

@DilumAluthge
Copy link
Member Author

Perhaps we should attempt to run that test with sysimage-native-code=yes? Launching that many processes is going to be hard (slow).

Is there a way for us to run one specific test set (cmdlineargs) with sysimage-native-code=yes, but run all the other test sets with sysimage-native-code=no?

@vtjnash
Copy link
Member

vtjnash commented Sep 10, 2021

You can append sysimage-native-code=yes to the arguments then fork a new process

@codecov
Copy link

codecov bot commented Sep 10, 2021

Codecov Report

Merging #42193 (5349e69) into master (5405994) will increase coverage by 73.18%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #42193       +/-   ##
===========================================
+ Coverage   16.17%   89.35%   +73.18%     
===========================================
  Files         349      351        +2     
  Lines       78250    78657      +407     
===========================================
+ Hits        12654    70283    +57629     
+ Misses      65596     8374    -57222     
Impacted Files Coverage Δ
base/int.jl 54.76% <ø> (+36.74%) ⬆️
base/opaque_closure.jl 100.00% <0.00%> (ø)
base/osutils.jl 94.44% <0.00%> (ø)
base/compiler/inferencestate.jl 85.44% <0.00%> (+0.46%) ⬆️
base/compiler/ssair/slot2ssa.jl 87.72% <0.00%> (+3.10%) ⬆️
stdlib/Logging/src/ConsoleLogger.jl 94.38% <0.00%> (+3.37%) ⬆️
base/compiler/types.jl 67.85% <0.00%> (+3.57%) ⬆️
stdlib/LinearAlgebra/src/lbt.jl 34.93% <0.00%> (+3.61%) ⬆️
base/compiler/inferenceresult.jl 87.93% <0.00%> (+4.31%) ⬆️
base/compiler/utilities.jl 87.61% <0.00%> (+4.72%) ⬆️
... and 330 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 50ab3a9...5349e69. Read the comment docs.

@DilumAluthge DilumAluthge force-pushed the dpa/ci-buildkite-coverage-worker-timeout branch from 30a0970 to 5349e69 Compare September 10, 2021 07:13
@DilumAluthge
Copy link
Member Author

@DilumAluthge
Copy link
Member Author

The job (https://buildkite.com/julialang/julia-master-scheduled/builds/230) passed, and coverage info was uploaded, so I'll merge this PR.

@DilumAluthge DilumAluthge merged commit 1709f79 into master Sep 10, 2021
@DilumAluthge DilumAluthge deleted the dpa/ci-buildkite-coverage-worker-timeout branch September 10, 2021 18:40
@DilumAluthge
Copy link
Member Author

@vtjnash I ran a new coverage job on master. Can you take a look at the coverage information on master now, and let me know if it looks correct, or if there are still issues?

@vtjnash
Copy link
Member

vtjnash commented Sep 10, 2021

Looks great. +10%

@DilumAluthge
Copy link
Member Author

🎉 🎉 🎉

@DilumAluthge
Copy link
Member Author

Let me know if you start seeing issues again.

KristofferC pushed a commit that referenced this pull request Sep 11, 2021
…MEOUT` on the code coverage job (#42193)

(cherry picked from commit 1709f79)
KristofferC pushed a commit that referenced this pull request Sep 15, 2021
…MEOUT` on the code coverage job (#42193)

(cherry picked from commit 1709f79)
KristofferC pushed a commit that referenced this pull request Nov 11, 2021
…MEOUT` on the code coverage job (#42193)

(cherry picked from commit 1709f79)
@KristofferC KristofferC removed the backport 1.6 Change should be backported to release-1.6 label Nov 13, 2021
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
staticfloat pushed a commit that referenced this pull request Dec 23, 2022
…MEOUT` on the code coverage job (#42193)

(cherry picked from commit 1709f79)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci Continuous integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants