Kotlin coroutines with loops are 18 times slower under the Graal CI. #1330

qwwdfsad · 2019-05-24T09:31:24Z

Reproducing project:

https://github.com/qwwdfsad/coroutines-graal

Overview:

FlowBenchmark is constructed to expose a non-standard pattern which Graal fails to compile.
Flow is a very simplified version of kotlinx.coroutines.Flow, a suspension-based primitive for operating with reactive streams.

Benchmark results

How to run: ./gradlew --no-daemon cleanJmhJar jmhJar && java -jar benchmarks.jar from the root folder.

Results:

// Java 1.8.0_162-b12
FlowBenchmark.flowBaseline  avgt  7  3.542 ± 0.026  us/op

// Graalvm-ce-19.0.0
FlowBenchmark.flowBaseline  avgt  7  54.129 ± 0.387  us/op

The dtraceasm profiler shows that all the time is spent in the interpreter, mostly in fast_aputfield (probably coroutine state-machine spilling).

Native call stacks obtained via async-profiler are polluted with InterpreterRuntime::frequency_counter_overflow from the uppermost Java frame (flow.FlowBenchmark$numbers$$inlined$flow$1::collect), which is, by the way, compiled with C1.

Compilation log

The compilation log contains pretty suspicious statements about the target method:

294  498  3  flow.FlowBenchmark$numbers$$inlined$flow$1::collect (255 bytes) COMPILE SKIPPED: live_in set of first block not empty (retry at different tier)
337  535  4  flow.FlowBenchmark$numbers$$inlined$flow$1::collect (255 bytes) COMPILE SKIPPED: Non-reducible loop (not retryable)

qwwdfsad · 2019-05-24T09:32:51Z

@shelajev pointed out that it all comes from the fact that only structured loops are supported in the IR, and this is not likely to change in the near future.

To understand where this irreducible loop comes from (and why we can't simply fix the Kotlin compiler here), one should understand what coroutines are.
A coroutine, aka a suspend function, is a function whose execution can be "suspended" (as opposed to blocked) and later resumed using a Continuation handle.
Suspend functions can be invoked only from other suspend functions or started with the Kotlin compiler intrinsic startCoroutine (on a block or method reference).

Functions like this:

suspend fun foo(): Int {
  println("a")
  val result: Int = suspendSleep(1) // Can be suspended
  println("b")
  return result
}

are roughly (I'll ignore all exception handling here) translated to the following form:

fun foo(continuation: Continuation<Int>): Object /* INT | SUSPENDED union */ {
  val me = continuation as? FooContinuation ?: FooContinuation(caller = continuation)
  switch (me.label) { // current state of computation
    0 -> {
      println("a")
      me.label = 1
      val result = suspendSleep(1, me)
      if (result === SUSPENDED) return SUSPENDED // Unroll stack
      else goto 1
    }

    1 -> {
      println("b")
      return me.resultOrException.getOrThrow()
    }
  }
}

Additionally, we generate a separate class FooContinuation with a single resume method that is pretty similar to the foo method:

inner class FooContinuation(val caller: Continuation<Int>) : Continuation<Int> {

  var label = 0 // current state of computation, read by foo
  var resultOrException: Result<Int>

  // May be called only from within `suspendSleep` if it decided to suspend
  override fun resume(result: Result<Int> /* exception or Int */) {
    assert(label == 1)
    resultOrException = result
    foo(this)
  }

}

Don't hesitate to ask me for further explanations of this example because I am slightly biased about coroutines internals simplicity :)
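
For readers who want to see the suspend/resume round trip without reading generated code, here is a minimal sketch using only the standard library's kotlin.coroutines package. The names pending, trace and runFoo are made up for this illustration, and suspendSleep is a hypothetical stand-in that always suspends; it is not the function from the reproducing project.

```kotlin
import kotlin.coroutines.Continuation
import kotlin.coroutines.EmptyCoroutineContext
import kotlin.coroutines.resume
import kotlin.coroutines.startCoroutine
import kotlin.coroutines.suspendCoroutine

// Hypothetical stand-in for suspendSleep: it always suspends and parks the
// continuation in `pending`, so the caller decides when to resume it.
var pending: Continuation<Int>? = null

suspend fun suspendSleep(value: Int): Int =
    suspendCoroutine { continuation -> pending = continuation }

// Records the order in which things happen.
val trace = mutableListOf<String>()

suspend fun foo(): Int {
    trace += "a"
    val result: Int = suspendSleep(1) // suspends here, the stack unwinds
    trace += "b"
    return result
}

fun runFoo(): Int {
    var outcome: Int? = null
    val block: suspend () -> Int = { foo() }
    // Start the coroutine: it runs until the first suspension, then returns.
    block.startCoroutine(Continuation<Int>(EmptyCoroutineContext) { outcome = it.getOrThrow() })
    check(outcome == null) { "foo() should be suspended, not finished" }
    trace += "suspended"
    // Resume from the outside: execution re-enters foo() right after suspendSleep.
    pending!!.resume(42)
    return checkNotNull(outcome)
}

fun main() {
    println(runFoo()) // value passed to resume
    println(trace)    // order of events around the suspension
}
```

The "suspended" entry is recorded between "a" and "b", which is the observable effect of the `return SUSPENDED` / `foo(this)` pair in the hand-translated version above.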

Now back to irreducible loops.
What if suspend fun contains a loop?

suspend fun bar(): Int {
  var sum = 0
  for (i in 1..100) {
    sum += suspendSleep(i)
  }
  
  return sum
}

Applying the same transformation procedure, we end up in a situation where a code path can jump directly into the loop's body: i is part of the continuation state, and it is restored when we jump right into the middle of the loop according to the corresponding label value. Unfortunately, changing the translation strategy is both hard and implies a serious performance overhead (e.g. because we would have to jump to the loop prologue and guard all statements there with an if).

And this is exactly an irreducible loop that Graal cannot compile :(
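
To make the spilled state concrete, here is a hand-written model of the state machine for bar(). It is only a sketch: BarState, suspendSleepModel, barStep and runBar are hypothetical names, and since Kotlin source has no goto, re-entry is emulated with a dispatch loop. The real bytecode instead jumps from the label switch straight to the resume point inside the loop body, and that direct jump is the second loop entry that makes the loop irreducible.

```kotlin
// Hypothetical hand-written model of the state machine for bar().
// This is NOT the code the Kotlin compiler emits; it only shows which state
// has to survive a suspension and where execution re-enters.
val SUSPENDED = Any()

class BarState {
    var label = 0   // which resume point to enter next
    var i = 0       // loop counter, spilled across suspensions
    var sum = 0     // accumulator, spilled across suspensions
    var result = 0  // value delivered when the suspended call completes
}

// Stand-in for suspendSleep: suspends on even arguments, completes at once on odd ones.
fun suspendSleepModel(i: Int): Any = if (i % 2 == 0) SUSPENDED else i

fun barStep(state: BarState): Any {
    while (true) {
        when (state.label) {
            0 -> { // function entry: loop prologue
                state.sum = 0
                state.i = 1
                state.label = 1
            }
            1 -> { // loop header: the normal loop entry
                if (state.i > 100) return state.sum
                state.label = 2
                val r = suspendSleepModel(state.i)
                if (r === SUSPENDED) return SUSPENDED // unwind the stack
                state.result = r as Int
            }
            2 -> { // resume point in the middle of the loop body
                state.sum += state.result
                state.i++
                state.label = 1
            }
        }
    }
}

// Drives the state machine, completing every suspended "sleep" with the value i.
fun runBar(): Pair<Int, Int> {
    val state = BarState()
    var suspensions = 0
    while (true) {
        val r = barStep(state)
        if (r !== SUSPENDED) return (r as Int) to suspensions
        suspensions++
        state.result = state.i
    }
}

fun main() {
    val (sum, suspensions) = runBar()
    println("sum=$sum suspensions=$suspensions") // sum=5050 suspensions=50
}
```

Note that the dispatch loop used here is essentially the "jump to the prologue and guard everything" strategy mentioned above: it keeps the control flow reducible, at the cost of going through the switch on every iteration.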

gilles-duboscq · 2019-05-24T12:26:21Z

This is the same root cause as #366

As you diagnosed, this is caused by bytecode containing irreducible loops, which are not supported in the Graal compiler.
From our side there is one strategy to support this: just duplicate the portion of the loop between the "extra" entry and the back-edges or loop exits so that we are back to a normal loop entry. We should do that in the bytecode parser.

Hopefully Kotlin only produces cases where there are "few" extra entries (or those extra entries jump towards the end of the loop, close to the back-edges), otherwise it will cause an explosion of the code size for large loops.
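
For reference, whether a control-flow graph is reducible can be checked with the textbook T1/T2 reduction: repeatedly drop self-loops (T1) and merge any node that has exactly one predecessor into that predecessor (T2); the graph is reducible exactly when it collapses to a single node. The sketch below is only an illustration of that classic test and says nothing about how Graal's bytecode parser detects or duplicates such loops. The function name, the node names and the two example graphs are made up, and every node is assumed to be reachable from the entry.

```kotlin
// Textbook T1/T2 reducibility test on a small CFG (not Graal's algorithm).
// `succ` maps each node to its successors; all nodes must be reachable from `entry`.
fun isReducible(succ: Map<String, Set<String>>, entry: String): Boolean {
    // Mutable copy of the graph, including nodes that only appear as targets.
    val edges = mutableMapOf<String, MutableSet<String>>()
    edges.getOrPut(entry) { mutableSetOf() }
    for ((node, targets) in succ) {
        edges.getOrPut(node) { mutableSetOf() }.addAll(targets)
        for (t in targets) edges.getOrPut(t) { mutableSetOf() }
    }
    var changed = true
    while (changed) {
        changed = false
        // T1: remove self-loops.
        for ((node, targets) in edges) if (targets.remove(node)) changed = true
        // T2: merge a non-entry node with exactly one predecessor into it.
        val victim = edges.keys.firstOrNull { n ->
            n != entry && edges.count { (_, targets) -> n in targets } == 1
        }
        if (victim != null) {
            val pred = edges.entries.first { victim in it.value }.key
            val outgoing = edges.remove(victim)!!
            edges.getValue(pred).remove(victim)
            edges.getValue(pred).addAll(outgoing)
            changed = true
        }
    }
    return edges.size == 1
}

// An ordinary loop: the only way into the body is through the header.
val structuredLoop = mapOf(
    "entry" to setOf("header"),
    "header" to setOf("body", "exit"),
    "body" to setOf("header")
)

// The coroutine shape: the label switch can also jump straight into the body.
val coroutineLoop = mapOf(
    "entry" to setOf("header", "body"),
    "header" to setOf("body", "exit"),
    "body" to setOf("header")
)

fun main() {
    println(isReducible(structuredLoop, "entry")) // true
    println(isReducible(coroutineLoop, "entry"))  // false
}
```

In the second graph neither "header" nor "body" can be merged, because each has two predecessors. Duplicating "body" for the extra entry, as described above, gives every copy a single way in and makes the graph collapse again.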

qwwdfsad · 2019-05-24T15:15:25Z

It is very unfortunate to hear as we had plans to build native images of coroutines-based applications as well.

> Hopefully Kotlin only produces cases where there are "few" extra entries

It grows linearly with the count of suspension points (calls to another suspend function) in the loop. But where it really matters (in hot loops), there are usually one or two of them. Not sure about native image though.

Feel free to ping me if you need help with evaluation/testing of the potential (?) change on the Graal side, we have a bunch of applications that exploit coroutines.

gilles-duboscq · 2019-05-27T09:35:45Z

Hi @qwwdfsad, I have implemented a simple duplication strategy (not merged yet), so to evaluate it I'd be interested in other workloads that run into this issue.

FYI on 1.8.0_212

# C2
FlowBenchmark.flowBaseline  avgt    7   3.805 ± 0.081  us/op
# Graal
FlowBenchmark.flowBaseline  avgt    7  47.374 ± 4.753  us/op
# Graal with duplication strategy
FlowBenchmark.flowBaseline  avgt    7   0.052 ± 0.001  us/op

qwwdfsad · 2019-05-27T11:00:11Z

Nice! It is hard for me to extract such workloads into separate self-contained projects, but I can point to a couple of our specific benchmarks in different projects (with steps on how to configure and run them). Is that okay?
Alternatively (or additionally), I can also try a Graal build with the fix in our projects and see how it goes.

For example:
kotlinx.coroutines, develop branch, benchmarks from flow package. Run with ./gradlew --no-daemon cleanJmhJar jmhJar && java -jar benchmarks.jar "benchmarks.flow.*"

gilles-duboscq · 2019-05-27T12:03:36Z

> I can point a couple of our specific benchmarks in different projects (with steps how to configure and run them). Is it okay?

That would be great

> Alternatively (or additionally), I can also try a Graal build with fixes in our projects and see how it's going.

I'll probably put that code behind a flag at first so you can do that, but I wanted to do some basic testing on my side first to avoid too many round-trips.

> kotlinx.coroutines, develop branch, benchmarks from flow package. Run with ./gradlew --no-daemon cleanJmhJar jmhJar && java -jar benchmarks.jar "benchmarks.flow.*"

I will start with that.

chintana-zz · 2019-06-01T00:00:47Z

@gilles-duboscq I bumped into the same issue when trying to get GraalVM to do native image generation for Ballerina. I would be more than happy to test your fix. In the meantime, you can reproduce this as follows:

$ git clone https://github.com/chintana/ballerina && cd ballerina
$ ./gradlew build -x test -x check -x :composer-library:npmBuild
$ cd distribution/zip/jballerina-tools/build/distributions
$ unzip jballerina-tools-0.992.0-m2-SNAPSHOT.zip
$ cat >file.bal
import ballerina/io;
public function main() {
	int i = 0;
	int j = 0;
	while (i < 20) {
		i+=1;
		j+=1;
	}
	io:println(j);
}
$ ./bin/jballerina build file.bal
Compiling source
    file.bal

Generating executable
    file.jar
$ # set GRAALVM_HOME
$ ./bin/jballerina native-img file.jar
[file:98676]    classlist:  14,329.22 ms
[file:98676]        (cap):   2,224.56 ms
[file:98676]        setup:   4,481.97 ms
[file:98676]     analysis:   8,488.17 ms
Warning: Abort stand-alone image build. Non-reducible loop
Detailed message:
Call path from entry point to file.main(Strand):
	at file.main(file.bal)
	at ___init.$lambda$main$(.)
	at ___init$$Lambda$364/1803931637.accept(Unknown Source)
	at org.ballerinalang.jvm.SchedulerItem.execute(Scheduler.java:401)
	at org.ballerinalang.jvm.Scheduler.run(Scheduler.java:194)
	at org.ballerinalang.jvm.Scheduler.runSafely(Scheduler.java:166)
	at org.ballerinalang.jvm.Scheduler.start(Scheduler.java:158)
	at ___init.main(.)
	at com.oracle.svm.core.JavaMainWrapper.run(JavaMainWrapper.java:153)
	at com.oracle.svm.core.code.IsolateEnterStub.JavaMainWrapper_run_5087f5482cc9a6abc971913ece43acb471d2631b(generated:0)

Warning: Use -H:+ReportExceptionStackTraces to print stacktrace of underlying exception
Build on Server(pid: 98720, port: 53135)*
[file:98720]    classlist:   2,068.29 ms
[file:98720]        (cap):   1,633.39 ms
[file:98720]        setup:   3,670.54 ms
[file:98720]   (typeflow):   2,593.28 ms
[file:98720]    (objects):   1,193.84 ms
[file:98720]   (features):     380.28 ms
[file:98720]     analysis:   4,270.38 ms
[file:98720]     universe:     256.61 ms
[file:98720]      (parse):     455.73 ms
[file:98720]     (inline):   1,759.22 ms
[file:98720]    (compile):   8,062.22 ms
[file:98720]      compile:  10,790.02 ms
[file:98720]        image:     819.42 ms
[file:98720]        write:     388.00 ms
[file:98720]      [total]:  22,451.48 ms
Warning: Image 'file' is a fallback image that requires a JDK for execution (use --no-fallback to suppress fallback image generation).

qwwdfsad · 2019-08-07T16:13:20Z

Hi, could you please elaborate on the status of this fix? Do you need any additional help from me, e.g. new benchmarks or test suites?

gilles-duboscq · 2019-08-07T16:21:08Z

I had a first version, but I noticed some issues while adding more tests. I have ideas about how to fix it and still plan to do it, but I have no ETA.

tlvenn · 2020-01-23T00:06:24Z

Hi @gilles-duboscq, is there any chance you could share a status update? Thanks in advance.

gilles-duboscq · 2020-01-23T08:25:07Z

Hi, no, i have not been able to allocate any time to that.

tlvenn · 2020-01-25T11:09:21Z

Is there any chance you could open a PR with your version so the community could potentially take it from there? Thanks in advance.

gilles-duboscq · 2020-03-04T18:04:25Z

As I said in #366, I'm planning to take a new look at this for 20.1.0.

gilles-duboscq · 2020-03-25T09:04:28Z

This should be fixed by 4662877. The fix is included in the latest 20.1 dev build (e.g., 20.1.0-dev-20200325_0537).

Using the FlowBenchmark.flowBaseline benchmark from @qwwdfsad:

                           time (μs/op)
Graal without duplication  37.119 ± 1.837
C2                          3.210 ± 0.063
Graal with duplication      0.002 ± 0.001

qwwdfsad · 2020-03-25T10:42:10Z

Amazing!

I've tested it on some of our workloads. When no suspension happens, it is on par with C2, but as soon as a benchmark has a hot-loop with a suspension, it is significantly faster.

E.g.:

Graal, jdk 11
Benchmark                                            Mode  Cnt    Score    Error  Units
ChannelSinkBenchmark.channelPipeline                 avgt    5  180.444 ±  5.758  ms/op
ChannelSinkBenchmark.channelPipelineOneThreadLocal   avgt    5  209.112 ± 11.974  ms/op
ChannelSinkBenchmark.channelPipelineTwoThreadLocals  avgt    5  331.341 ± 21.411  ms/op

C2, jdk 11
Benchmark                                            Mode  Cnt    Score    Error  Units
ChannelSinkBenchmark.channelPipeline                 avgt    5  215.993 ± 20.545  ms/op
ChannelSinkBenchmark.channelPipelineOneThreadLocal   avgt    5  242.537 ±  5.427  ms/op
ChannelSinkBenchmark.channelPipelineTwoThreadLocals  avgt    5  590.823 ± 48.004  ms/op

(I'd say that channelPipelineTwoThreadLocals is considerably faster mostly because of more advanced escape analysis (EA), though I didn't dig deep enough to verify it.)

Great job!

gilles-duboscq · 2020-03-25T10:46:59Z

Glad to hear it helped your use-case. Thank you for the report.

dougxc assigned gilles-duboscq May 24, 2019

qwwdfsad changed the title ~~Kotlin coroutines with loops are 10 times slower under the Graal CI.~~ Kotlin coroutines with loops are 18 times slower under the Graal CI. May 24, 2019

gilles-duboscq mentioned this issue May 28, 2019

[native-image] Classes with loops inside Kotlin coroutines fail to generate native code #366

Closed

gilles-duboscq added this to the 20.1 milestone Mar 4, 2020

gilles-duboscq closed this as completed Mar 25, 2020

sherl0cks mentioned this issue Mar 31, 2020

Kotlin Coroutines Breaks Jackson Serialization quarkusio/quarkus#7999

Closed

akoufa mentioned this issue Jun 22, 2020

Kotlin Coroutines in Quarkus quarkusio/quarkus#10162

Closed
