Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compiler hang #9980

Closed
dmorneau opened this issue Apr 27, 2020 · 34 comments · Fixed by glennr/docker-elixir#1
Closed

Compiler hang #9980

dmorneau opened this issue Apr 27, 2020 · 34 comments · Fixed by glennr/docker-elixir#1

Comments

@dmorneau
Copy link
Contributor

Environment

  • Elixir & Erlang/OTP versions (elixir --version):
Erlang/OTP 22 [erts-10.7.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe] [dtrace]
Elixir 1.10.3 (compiled with Erlang/OTP 22)
  • Operating system: macOS Catalina

Behavior

While doing a clean build of an app (via mix test) after upgrading to Elixir 1.10.3 (from 1.10.2), I hit a hang in the compiler. It printed the "taking longer than 15s..." warning for many files, then stopped making any progress.

After 10 ~ 15 min I took a crash dump using SIGUSR1. At that point, the only process with messages in its queue was code_server (MsgQ=7). The code server is inside do_purge, waiting on a reply from erts_code_purge, which is itself inside a do_purge function, purging a module called elixir_compiler_18.

image

The request looks like it came from another process that was compiling some
LiveComponent part of the app:

image

I can't share the crash dump, but is there something I could check that might help?

This didn't reproduce after I killed the compiler and tried again.

@lukaszsamson
Copy link
Contributor

I've also noticed similar hangs and slowdowns lately. Those happen both on local MacOS and CI (Debian Linux) and started happening before 10.3 was released.

Unfortunately I don't have anything that can help in pinpointing the issue but CI logs like

==> ecto
Compiling 55 files (.ex)
Compiling lib/ecto/repo/queryable.ex (it's taking more than 15s)
Compiling lib/ecto/query/inspect.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/filter.ex (it's taking more than 15s)
Compiling lib/ecto/repo/preloader.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/windows.ex (it's taking more than 15s)
Compiling lib/ecto/schema.ex (it's taking more than 15s)
Compiling lib/ecto/repo/schema.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/preload.ex (it's taking more than 15s)
Compiling lib/ecto/embedded.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/order_by.ex (it's taking more than 15s)
Compiling lib/ecto/type.ex (it's taking more than 15s)
Compiling lib/ecto/uuid.ex (it's taking more than 15s)
Compiling lib/ecto/association.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/join.ex (it's taking more than 15s)
Compiling lib/ecto/log_entry.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/cte.ex (it's taking more than 15s)
Compiling lib/ecto/query/planner.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/select.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/distinct.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/limit_offset.ex (it's taking more than 15s)
Compiling lib/ecto/changeset.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/group_by.ex (it's taking more than 15s)
Compiling lib/ecto/changeset/relation.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/dynamic.ex (it's taking more than 15s)
Compiling lib/ecto/multi.ex (it's taking more than 15s)
Compiling lib/ecto.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder.ex (it's taking more than 15s)
Compiling lib/ecto/query/builder/update.ex (it's taking more than 15s)

Too long with no output (exceeded 10m0s): context deadline exceeded

@josevalim
Copy link
Member

Thanks @dmorneau. We have recently heard similar reports. Can you check if there is any process waiting on the code_server/code:purge call? If so, what is the callsite?

@lukaszsamson, do you remember if this happened with v1.9? Or did it start with v1.10? Tracking this down would help us a lot.

@lukaszsamson
Copy link
Contributor

@josevalim IIRC it started happening 2-3 weeks ago and since over a month all of our build pipelines run on 1.10. The first CI failure from our servers is from 19 days ago. If I had to guess I'd blame it on otp 22.3.2 released 21 days ago.

@josevalim
Copy link
Member

@lukaszsamson can you please use OTP 22.2 on CI and see if it addresses the issue?

@wojtekmach
Copy link
Member

If your CI doesn't allow you to choose a particular OTP version, but you can use a custom Docker image, you can try one of these: https://hub.docker.com/r/hexpm/elixir/tags?name=1.10.3-erlang-22.2.8&page=1

@dmorneau
Copy link
Contributor Author

Can you check if there is any process waiting on the code_server/code:purge call? If so, what is the callsite?

@josevalim Yes, it's the second screenshot I pasted. It looks like this is the call:

https://github.com/elixir-lang/elixir/blob/v1.10.3/lib/elixir/src/elixir_compiler.erl#L77
https://github.com/erlang/otp/blob/OTP-22.3.2/lib/kernel/src/code.erl#L199

@josevalim
Copy link
Member

Thanks! It seems there was a bugfix in the code purger on 22.3.0 to 22.3.1, so I would recommend those running into it to revert to 22.2 and see if the issue persists.

@lukaszsamson
Copy link
Contributor

@lukaszsamson can you please use OTP 22.2 on CI and see if it addresses the issue?

We can but it's not going to be a drop in replacement as we currently use circleci/elixir images which do not tag otp releases

@ericmj
Copy link
Member

ericmj commented Apr 27, 2020

We can but it's not going to be a drop in replacement as we currently use circleci/elixir images which do not tag otp releases

As @wojtekmach said you can use these docker images which do tag the OTP version: https://hub.docker.com/r/hexpm/elixir/tags?name=1.10.3-erlang-22.2.8&page=1

@adrienmo
Copy link
Contributor

On our side we have seen the problem happening when we switched from 22.3 to 22.3.2. Currently trying to downgrade to see if it will fix it.

@jesse-c
Copy link

jesse-c commented Apr 27, 2020

We use CircleCI images as well and we're unable to use the Hex images due to them lacking git (and possibly other things)

@josevalim
Copy link
Member

If anyone runs into this issue, please do:

  1. If you are on Linux, do kill -ABRT <pid of beam.smp>
  2. If you are on elsewhere, press Ctrl+C, then in the break menu type "A" (uppercase) and press enter

I have opened up an issue with Erlang/OTP here. @dmorneau, if it is an option, please consider sending the crash dump privately to them.

@dimitarvp
Copy link
Contributor

Just upgraded to Erlang 22.3.3, the hang happened still (and on Elixir 1.10.3), and finally got an erl_crash.dump file.

Will I be able to upload it to the Erlang issue tracker if I sign up there?

@josevalim
Copy link
Member

@dimitarvp yes, you should. If not, feel free to email me (email is on my profile), and I can make sure it is delivered.

@lukaszsamson
Copy link
Contributor

@josevalim I was able to get a dump on macos with

lldb --attach-pid <PID>
process save-core "filename"

will that be helpful? In the OTP bug only linux dumps were mentioned.

@josevalim
Copy link
Member

@lukaszsamson send it there just in case it can help someone. :)

@josevalim
Copy link
Member

OTP 22.3.4 has been released with some fixes that may (or may not) affect this. If folks can upgrade to latest 22.3.4 and see if the issues persist or not, it would be very welcome. Thank you!

@thbar
Copy link
Contributor

thbar commented May 9, 2020

I have made a first test: on my computer (Mac), I don't remember strictly seeing the slow-down/hang, but currently the compilation seems fast with OTP 22.3.4. I'll be able to make a second test once 22.3.4 is available through https://packages.erlang-solutions.com/erlang-solutions_2.0_all.deb (currently it is apparently not), because this is what I use in Docker and where the slow compilation occurred for me. I will try again later!

@dimitarvp
Copy link
Contributor

As mentioned in the linked Erlang ticket: I checked with a big-ish project of mine and I have seen no hangs or slowdowns during compilation with 22.3.4.

@victorolinasc
Copy link
Contributor

Didn't want to be the bringer of evil news but I still experience hangs on CI with 22.3.4 and elixir 1.10.3 on a large project. It is kinda random when it triggers and it is difficult to re-run with ssh the build job so that I could get any dumps.

@devonestes
Copy link
Contributor

Same here, unfortunately - still seeing compiler timeouts after 10 minutes in CI occasionally. Here's the info for our build that just timed out:

Build-agent version 1.0.31788-6ad45373 (2020-05-07T13:35:11+0000)
Docker Engine Version: 18.09.6
Kernel Version: Linux 370ffda9bb7e 4.15.0-1052-aws #54-Ubuntu SMP Tue Oct 1 15:43:26 UTC 2019 x86_64 Linux
Starting container hexpm/elixir:1.10.3-erlang-22.3.4-alpine-3.11.3
  image cache not found on this host, downloading hexpm/elixir:1.10.3-erlang-22.3.4-alpine-3.11.3
1.10.3-erlang-22.3.4-alpine-3.11.3: Pulling from hexpm/elixir
c9b1b535fdd9: Already exists
cee2ac6549f7: Pulling fs layer
b5216944cd8d: Pulling fs layer
ecbca7291e76: Pulling fs layer
cee2ac6549f7: Verifying Checksum
cee2ac6549f7: Download complete
ecbca7291e76: Verifying Checksum
ecbca7291e76: Download complete
cee2ac6549f7: Pull complete
b5216944cd8d: Verifying Checksum
b5216944cd8d: Download complete
b5216944cd8d: Pull complete
ecbca7291e76: Pull complete
Digest: sha256:f815d0b1a551954023535184f75bbf2c8b90f81afbcb0baacbecf0ac3d502a7a
Status: Downloaded newer image for hexpm/elixir:1.10.3-erlang-22.3.4-alpine-3.11.3
  using image hexpm/elixir@sha256:f815d0b1a551954023535184f75bbf2c8b90f81afbcb0baacbecf0ac3d502a7a

It appears to be random for us as well, as I can retry the build on the same git SHA and get a build that compiles just fine.

@asummers
Copy link
Contributor

asummers commented May 11, 2020

We are also seeing this behavior with the same random behavior from the same SHAs. We're on 1.10.2, as a data point. I had to cancel compile steps that had been running for 1.5h. Rerunning with same inputs succeeded in 1-2m. Please let me know if I can provide data to help diagnose.

@lukaszsamson
Copy link
Contributor

As others have reported 22.3.4 does not fix the issue. I captured another dump and attached to the erlang bug report.

@josevalim
Copy link
Member

The OTP team has provided a modified version of Erlang's source: https://bugs.erlang.org/browse/ERL-1236?focusedCommentId=17936&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17936

If someone could run it on Debian, it would be very much appreciated. You can use something like kerl to compile such version locally. Please ask if you have any questions.

@pmarreck
Copy link

pmarreck commented May 19, 2020

Just FYI, I'm also seeing this issue on Ubuntu (compiles that hang). Env invo:

OS: Ubuntu 18.04.4 LTS x86_64 
CPU: Intel Xeon E5-2676 v3 (1) @ 2.399GHz 
Erlang/OTP 23 [erts-11.0] [source] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:1] [hipe]
Elixir 1.10.3 (compiled with Erlang/OTP 22)

Seems to have started appearing after I updated Erlang recently. Also, not sure if this can result in problems such as this but Elixir seems to be compiled against OTP22 while I now have OTP23 installed.

As others have stated, the behavior seems nondeterministic (ugh). Killing the process and rerunning the same command again has succeeded most of the time so far.

If I can help at all with additional info or diagnostics, please let me know!

@josevalim
Copy link
Member

This should be fixed on Erlang 22.3.4.1 released today. A fix for 23.0 is coming out soon. Thanks everyone for helping!

@victorolinasc
Copy link
Contributor

Nasty fix!

image

And this, ladies and gentlemen, is the reason we ALL love programming right?

@metelik
Copy link

metelik commented Jul 31, 2020

Hello,

We are trying to switch from Erlang OTP v22.3 to OTP v23.0.3 in our project.

_@ubuntu:/mnt/src/agilis_fw/common_core.fw.x86# elixir --version
Erlang/OTP 23 [erts-11.0.3] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe]

Elixir 1.10.3 (compiled with Erlang/OTP 23)_

We encounter the hand followed by the erl_crash.dump on each and every module doing the:

import OurApp.Gettext

I deem it might be related to the issue above.

I can share the crashdump:

erl_crash.dump.gz.part-ab.gz
erl_crash.dump.gz.part-ac.gz
erl_crash.dump.gz.part-aa.gz

The ^^ is a one gzip archive split with the split -b 9m command to circumvent the github's restrictions.

With my best wishes
Tomasz Motyl

@DerKastellan
Copy link
Contributor

DerKastellan commented Dec 3, 2020

This should be fixed on Erlang 22.3.4.1 released today. A fix for 23.0 is coming out soon. Thanks everyone for helping!

We switched to Erlang 22.3.4.1. We still observe Erlang compiler hangs about 5-10% of our runs in CI - we have almost 800K of Erlang in our project spread over 8 files. (ASN.1 codecs generated by asn1ct.) It seems that the compiler hangs until the job is killed about 40 minutes later.

Opened this: https://bugs.erlang.org/browse/ERL-1433

@natewallis
Copy link

I wonder how many people reporting it as being hung are just impatient. It does take some time to complete. I came here thinking that my compilation was inactive only for it to complete 5 minutes later..

@pmarreck
Copy link

In my case I definitely saw hangs (on the order of a half hour or more). I've tried to reduce the number of cross-dependencies in my app, eliminate imports where it makes sense to, etc. I've also been keeping OTP and elixir updated. I stopped seeing the hangs, but they were definitely there. I'm still kinda unhappy with compilation speed (maybe it's my particular dependency hierarchy, but it only seems to run single-threaded, and there is 1 very large form in particular that takes almost a minute alone to compile)... but runtime speed is quite good, so...

@DerKastellan
Copy link
Contributor

Once taking OTP 22.3.4.11 into use, I stopped seeing compiler hangs from the Erlang side. Since my main problem are some ridiculously large Erlang files (that's how asn1ct works) I first suspected the compiler hang to be still in place only to find that Erlang compilation times had increased by at least 50% between OTP20 and OTP22. In the end I did see no more compiler hangs, times have been rather deterministic.

What I ended up doing is replacing the default compiler wrapper for Erlang in mix with one that compiles my large Erlang files in parallel. By using more than one core the problem became manageable (= total build time back to OTP20 average or less).

@josevalim
Copy link
Member

A couple notes: on Elixir v1.11 we have improved the tooling so mix compile --force --verbose --profile=time should give you a good idea about which module is being slow. From there on, you usually have two approaches:

  1. You have slow modules that take long to compile. This article explains how to debug it: https://dashbit.co/blog/how-to-debug-elixir-erlang-compiler-performance - (also try Erlang master. The JIT and other improvements have really good benefits on compilation times)

  2. Your modules are fast but small changes trigger large recompilations: check this article and mix xref docs

In both cases Elixir v1.11 is a requirement. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging a pull request may close this issue.