Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kotlin coroutines with loops are 18 times slower under the Graal CI. #1330

Closed
qwwdfsad opened this issue May 24, 2019 · 17 comments
Closed

Kotlin coroutines with loops are 18 times slower under the Graal CI. #1330

qwwdfsad opened this issue May 24, 2019 · 17 comments
Assignees
Milestone

Comments

@qwwdfsad
Copy link

Reproducing project:

https://github.com/qwwdfsad/coroutines-graal

Overview:

FlowBenchmark is constructed to expose a non-standard pattern which Graal fails to compile.
Flow is a very simplified version of kotlinx.coroutines.Flow, a suspension-based primitive for operating with reactive streams.

Benchmark results

How to run: ./gradlew --no-daemon cleanJmhJar jmhJar && java -jar benchmarks.jar from the root folder.

Results:

// Java 1.8.0_162-b12
FlowBenchmark.flowBaseline  avgt  7  3.542 ± 0.026  us/op

// Graalvm-ce-19.0.0
FlowBenchmark.flowBaseline  avgt  7  54.129 ± 0.387  us/op

dtraceasm profiler shows that all the time spent in the interpreter, mostly in fast_aputfield (probably it is a coroutine state machine spilling).

Native call-stacks obtained via async-profiler are polluted with InterpreterRuntime::frequency_counter_overflow from the uppermost Java frame (flow.FlowBenchmark$numbers$$inlined$flow$1::collect), that is, by the way, compiled with C1.

Compilation log

Compilation log contains pretty suspicious statements about target method:

294  498  3  flow.FlowBenchmark$numbers$$inlined$flow$1::collect (255 bytes) COMPILE SKIPPED: live_in set of first block not empty (retry at different tier)
337  535  4  flow.FlowBenchmark$numbers$$inlined$flow$1::collect (255 bytes) COMPILE SKIPPED: Non-reducible loop (not retryable)
@qwwdfsad
Copy link
Author

qwwdfsad commented May 24, 2019

@shelajev pointed out that it all comes from the fact that only structurred loops are supported in IR and this will not likely to change in the nearest future.

To understand where this irreducible loop comes from (and why we can't simply fix Kotlin compiler here), one should understand what coroutines are.
Coroutine aka suspend function is a function which execution can be "suspended" (as opposed to blocking) and later resumed using Continuation handle.
suspend functions can be invoked only from other suspend functions or started with a Kotlin compiler intrinsic startCoroutine(block or method reference).

Functions like this:

suspend fun foo(): Int {
  println("a")
  val result: Int = suspendSleep(1) // Can be suspended
  println("b")
  return result
}

we roughly (I'll ignore all exception handling here) translate it to the following form:

fun foo(continuation: Continuation<Int>): Object /* INT | SUSPENDED union */ {
  val me = continuation as? FooContinuation ?: FooContinuation(caller = continuation)
  switch (me.label) { // current state of computation
    0 -> {
      println("a")
      me.label = 1
      val result = suspendSleep(1, me)
      if (result === SUSPENDED) return SUSPENDED // Unroll stack
      else goto 1
    }

    1 -> {
       println("b")
       return me.resultOrException.getOrThrow()
    }
  }
}

additionally, we generate a separate class FooContinuation with a single resume method that is pretty similar to the foo method:

inner class FooContinuation : Continuation<Unit> {
  
  var resultOrException: Result<Unit>

  // May be called only from within `suspendSleep` if it decided to suspend
  override fun resume(result: Result<Unit> /* exception or Unit */ ) {
    assert label == 1;
    resultOrException = result
    foo(this)
  }

}

Don't hesitate to ask me for further explanations of this example because I am slightly biased about coroutines internals simplicity :)

Now back to irreducible loops.
What if suspend fun contains a loop?

suspend fun bar() {
  var sum = 0
  for (i in 1..100) {
    sum += suspendSleep(i)
  }
  
  return sum
}

Applying the same transformation procedure, we end up in a situation when code path can jump directly into loop's body, i is a continuation state that is restored when we jump right in the middle of the loop by the corresponding label value. Unfortunately, changing translation strategy is both hard and implies a serious performance overhead (e.g. because now we have to jump to the loop prologue and guard all statements here with an if).

And this is exactly an irreducible loop that Graal cannot compile :(

@qwwdfsad qwwdfsad changed the title Kotlin coroutines with loops are 10 times slower under the Graal CI. Kotlin coroutines with loops are 18 times slower under the Graal CI. May 24, 2019
@gilles-duboscq
Copy link
Member

This is the same root cause as #366

As you diagnosed, this is caused by bytecode containing irreducible loops which is not supported in the Graal compiler.
From our side there there is one strategy to support this: just duplicate the portion of the loop between the "extra" entry and the back-edges or loop exits so that we are back to a normal loop entry. We should do that in the bytecode parser.

Hopefully Kotlin only produce cases where there are "few" extra entries (or those extra entries jump towards the end of the loop, close to the back-edges) otherwise it will cause an explosion of the code size for large loops.

@qwwdfsad
Copy link
Author

qwwdfsad commented May 24, 2019

It is very unfortunate to hear as we had plans to build native images of coroutines-based applications as well.

Hopefully Kotlin only produce cases where there are "few" extra entries

It grows linearly with the count of suspension points (calls to another suspend function) in the loop. But where it really matters (in hot loops), there are usually one or two of them. Not sure about native image though.

Feel free to ping me if you need help with evaluation/testing of the potential (?) change on the Graal side, we have a bunch of applications that exploit coroutines.

@gilles-duboscq
Copy link
Member

gilles-duboscq commented May 27, 2019

Hi @qwwdfsad, i have implemented a simple duplication strategy (not merged yet) so to evaluate it i'd be interested in other workloads that run into this issue.

FYI on 1.8.0_212

# C2
FlowBenchmark.flowBaseline  avgt    7   3.805 ± 0.081  us/op
# Graal
FlowBenchmark.flowBaseline  avgt    7  47.374 ± 4.753  us/op
# Graal with duplication strategy
FlowBenchmark.flowBaseline  avgt    7   0.052 ± 0.001  us/op

@qwwdfsad
Copy link
Author

Nice! It is hard for me to extract such workloads into separate self-containing projects, but I can point a couple of our specific benchmarks in different projects (with steps how to configure and run them). Is it okay?
Alternatively (or additionally), I can also try a Graal build with fixes in our projects and see how it's going.

For example:
kotlinx.coroutines, develop branch, benchmarks from flow package. Run with ./gradlew --no-daemon cleanJmhJar jmhJar && java -jar benchmarks.jar "benchmarks.flow.*"

@gilles-duboscq
Copy link
Member

I can point a couple of our specific benchmarks in different projects (with steps how to configure and run them). Is it okay?

That would be great

Alternatively (or additionally), I can also try a Graal build with fixes in our projects and see how it's going.

I'll probably put that code behind a flag at first so you can do that but i wanted to do some basic testing on my side first to avoid too many round-trips.

kotlinx.coroutines, develop branch, benchmarks from flow package. Run with ./gradlew --no-daemon cleanJmhJar jmhJar && java -jar benchmarks.jar "benchmarks.flow.*"

I will start with that.

@chintana-zz
Copy link

@gilles-duboscq I bumped into the same issue when trying to get GraalVM to do native image generation for Ballerina. I would be more than happy to test your fix. In the mean time you can reproduce this as follows,

$ git clone https://github.com/chintana/ballerina && cd ballerina
$ ./gradlew build -x test -x check -x :composer-library:npmBuild
$ cd distribution/zip/jballerina-tools/build/distributions
$ unzip jballerina-tools-0.992.0-m2-SNAPSHOT.zip
$ cat >file.bal
import ballerina/io;
public function main() {
	int i = 0;
	int j = 0;
	while (i < 20) {
		i+=1;
		j+=1;
	}
	io:println(j);
}
$ ./bin/jballerina build file.bal
Compiling source
    file.bal

Generating executable
    file.jar
$ # set GRAALVM_HOME
$ ./bin/jballerina native-img file.jar
[file:98676]    classlist:  14,329.22 ms
[file:98676]        (cap):   2,224.56 ms
[file:98676]        setup:   4,481.97 ms
[file:98676]     analysis:   8,488.17 ms
Warning: Abort stand-alone image build. Non-reducible loop
Detailed message:
Call path from entry point to file.main(Strand):
	at file.main(file.bal)
	at ___init.$lambda$main$(.)
	at ___init$$Lambda$364/1803931637.accept(Unknown Source)
	at org.ballerinalang.jvm.SchedulerItem.execute(Scheduler.java:401)
	at org.ballerinalang.jvm.Scheduler.run(Scheduler.java:194)
	at org.ballerinalang.jvm.Scheduler.runSafely(Scheduler.java:166)
	at org.ballerinalang.jvm.Scheduler.start(Scheduler.java:158)
	at ___init.main(.)
	at com.oracle.svm.core.JavaMainWrapper.run(JavaMainWrapper.java:153)
	at com.oracle.svm.core.code.IsolateEnterStub.JavaMainWrapper_run_5087f5482cc9a6abc971913ece43acb471d2631b(generated:0)

Warning: Use -H:+ReportExceptionStackTraces to print stacktrace of underlying exception
Build on Server(pid: 98720, port: 53135)*
[file:98720]    classlist:   2,068.29 ms
[file:98720]        (cap):   1,633.39 ms
[file:98720]        setup:   3,670.54 ms
[file:98720]   (typeflow):   2,593.28 ms
[file:98720]    (objects):   1,193.84 ms
[file:98720]   (features):     380.28 ms
[file:98720]     analysis:   4,270.38 ms
[file:98720]     universe:     256.61 ms
[file:98720]      (parse):     455.73 ms
[file:98720]     (inline):   1,759.22 ms
[file:98720]    (compile):   8,062.22 ms
[file:98720]      compile:  10,790.02 ms
[file:98720]        image:     819.42 ms
[file:98720]        write:     388.00 ms
[file:98720]      [total]:  22,451.48 ms
Warning: Image 'file' is a fallback image that requires a JDK for execution (use --no-fallback to suppress fallback image generation).

@qwwdfsad
Copy link
Author

qwwdfsad commented Aug 7, 2019

Hi, could you please elaborate on the status of this fix? Do you need any additional help from me, e.g. new benchmarks or test suites?

@gilles-duboscq
Copy link
Member

I had a first version but i noticed some issues while adding more tests. I have ideas about how to fix it and still plan to do it but i have no ETA.

@tlvenn
Copy link

tlvenn commented Jan 23, 2020

Hi @gilles-duboscq , any chance to share some status update if any ? Thanks in advance.

@gilles-duboscq
Copy link
Member

Hi, no, i have not been able to allocate any time to that.

1 similar comment
@gilles-duboscq
Copy link
Member

Hi, no, i have not been able to allocate any time to that.

@tlvenn
Copy link

tlvenn commented Jan 25, 2020

Is there any chance you could open a PR with your version so the community could potentially take it from there ? Thanks in advance.

@gilles-duboscq gilles-duboscq added this to the 20.1 milestone Mar 4, 2020
@gilles-duboscq
Copy link
Member

As i said in #366, i'm planning to take a new look at this for 20.1.0

@gilles-duboscq
Copy link
Member

This should be fixed by 4662877. The fix is included in the latest 20.1 dev build (e.g., 20.1.0-dev-20200325_0537).

Using the FlowBenchmark.flowBaseline benchmark from @qwwdfsad:

time (μs/op)
Graal without duplication 37.119 ± 1.837
C2 3.210 ± 0.063
Graal with duplication 0.002 ± 0.001

@qwwdfsad
Copy link
Author

Amazing!

I've tested it on some of our workloads. When no suspension happens, it is on par with C2, but as soon as a benchmark has a hot-loop with a suspension, it is significantly faster.

E.g.:

Graal, jdk 11
Benchmark                                            Mode  Cnt    Score    Error  Units
ChannelSinkBenchmark.channelPipeline                 avgt    5  180.444 ±  5.758  ms/op
ChannelSinkBenchmark.channelPipelineOneThreadLocal   avgt    5  209.112 ± 11.974  ms/op
ChannelSinkBenchmark.channelPipelineTwoThreadLocals  avgt    5  331.341 ± 21.411  ms/op

С2, jdk 11
Benchmark                                            Mode  Cnt    Score    Error  Units
ChannelSinkBenchmark.channelPipeline                 avgt    5  215.993 ± 20.545  ms/op
ChannelSinkBenchmark.channelPipelineOneThreadLocal   avgt    5  242.537 ±  5.427  ms/op
ChannelSinkBenchmark.channelPipelineTwoThreadLocals  avgt    5  590.823 ± 48.004  ms/op

(I'd say that channelPipelineTwoThreadLocals is considerably faster mostly because of more advanced EA, though I didn't dig deep enough to verify it).

Great job!

@gilles-duboscq
Copy link
Member

Glad to hear it helped your use-case. Thank you for the report.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants