Third optimization batch — map fusion #95
Conversation
Codecov Report

```
@@            Coverage Diff            @@
##           master      #95    +/-   ##
=========================================
+ Coverage    87.4%   88.01%   +0.6%
=========================================
  Files          20       20
  Lines         413      434     +21
  Branches       35       35
=========================================
+ Hits          361      382     +21
  Misses         52       52
```

I might need to improve test coverage.
```scala
case Map(source, g, index) =>
  // Allowed to do 32 map operations in sequence before
  // triggering `flatMap` in order to avoid stack overflows
  if (index != 31) Map(source, g.andThen(f), index + 1)
```
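For intuition on why fusion must be capped: every `andThen` composition adds one nested call frame when the fused function finally runs. A tiny standalone sketch of that effect (not the actual cats-effect code):

```scala
// Build a function fused from 32 `andThen`-chained increments,
// mimicking what repeated `g.andThen(f)` fusion produces.
// Applying it performs 32 nested calls, i.e. ~32 stack frames,
// which is why the fusion counter caps how many maps get composed.
val fused: Int => Int =
  (1 to 32).foldLeft((x: Int) => x)((g, _) => g.andThen(_ + 1))

println(fused(0)) // prints 32: one increment per fused map
```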
Quick question, out of curiosity: where does the 32 number come from?
Right now it's a number pulled out of thin air (well, I've seen it used somewhere else). The default stack size varies depending on platform, AFAIK ranging from 64 KB (32-bit Windows) to 1 MB (64-bit Linux).
And we need a comfortable value to not blow people's stacks.
But now that you've mentioned it, I decided to do some testing and be thorough about it.
Sounds like it deserves a comment in the code telling the reader why this number was chosen.
Thanks for the answer
I've got some refactoring ideas and some more testing to do, so I've placed this PR in the WIP state again.

^^^ No Shame Sundays :D Feel free to review too @alexandru

I've updated the description of the PR with the methodology for picking the right maximum number of fused `map` operations.

Update: added full benchmarking results and current interpretation.
I like it. A couple quick questions.
```scala
private[effect] final val fusionMaxStackDepth =
  Option(System.getProperty("cats.effect.fusionMaxStackDepth", ""))
    .filter(s => s != null && s.nonEmpty)
    .flatMap(s => Try(s.toInt).toOption)
```
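The parsing is deliberately forgiving: an absent, empty, or malformed property silently falls back to the default. A standalone sketch of the same logic (the helper name and the `31` default are assumptions taken from this thread, not the actual cats-effect source):

```scala
import scala.util.Try

// Mirrors the snippet above: use the sysprop if present and a positive
// integer, otherwise fall back to the hard-coded default.
def readMaxStackDepth(raw: String, default: Int): Int =
  Option(raw)
    .filter(_.nonEmpty)
    .flatMap(s => Try(s.toInt).toOption)
    .filter(_ > 0)
    .getOrElse(default)

println(readMaxStackDepth("128", 31))  // 128
println(readMaxStackDepth("", 31))     // 31 (empty -> default)
println(readMaxStackDepth("oops", 31)) // 31 (malformed -> default)
println(readMaxStackDepth("-5", 31))   // 31 (non-positive -> default)
```

This is why no error is ever surfaced: bad input is indistinguishable from no input, which is the behavior being questioned below.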
We don't want to depend on a logger, but is it worth it to explain on stderr why we choked?
I'd prefer not to do it, since it introduces extra code; but I don't care much, and if it's in popular demand, then OK.
What I'm thinking is that people won't modify this parameter unless they are in big trouble, and we can have two issues:
- given my calculations, the default value seems fine, but we might be underestimating stack growth in common usage
- we don't control all possible virtual machines; I have no idea, for example, what the default stack size is on Android or other non-Oracle JVMs

So increasing it won't improve performance except in very narrow use-cases. And if people hit the stack limit because of this default, then we probably need to lower the limit in the library itself, with the overriding option made available only to empower people to fix it without having to wait for another release.
Okay, I'll buy that.
```diff
@@ -93,9 +93,9 @@ sealed abstract class IO[+A] {
   this match {
     case ref @ RaiseError(_) => ref
     case Map(source, g, index) =>
-      // Allowed to do 32 map operations in sequence before
+      // Allowed to do 128 map operations in sequence before
```
Comment is not true in Scala.js or sometimes in the presence of the sysprop.
Yes, I need to update the comment, thanks for pointing it out.
I changed that comment.
```scala
/**
 * Establishes the maximum stack depth for `IO#map` operations
 * for JavaScript.
 */
private[effect] final val fusionMaxStackDepth = 32
```
Isn't the default supposed to be 128?
That is the JavaScript version. I've updated the comment in IO.scala to not mention 128. Also, this particular default is now changed to 31, since we are using inequality on the current index for the actual test, as a very small performance optimization.
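To make the off-by-one convention concrete: with a 0-based index and the `!=` test, a limit of 31 lets 32 consecutive `map`s fuse into one `Map` node before a `flatMap` boundary is inserted. A toy simulation of just the counter (an assumption about the logic being discussed, not the real implementation):

```scala
// Counts how many flatMap boundaries `n` consecutive maps produce
// under the `if (index != maxDepth)` fusion rule.
def flatMapBoundaries(n: Int, maxDepth: Int): Int = {
  var index = -1     // -1: no Map node wrapping the source yet
  var boundaries = 0
  for (_ <- 1 to n) {
    if (index < 0) index = 0                // first map: Map(src, f, 0)
    else if (index != maxDepth) index += 1  // fuse: g.andThen(f), index + 1
    else { boundaries += 1; index = -1 }    // limit hit: FlatMap boundary
  }
  boundaries
}

println(flatMapBoundaries(32, 31)) // 0: all 32 maps fused into one node
println(flatMapBoundaries(33, 31)) // 1: the 33rd map triggers a flatMap
```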
```scala
Option(System.getProperty("cats.effect.fusionMaxStackDepth", ""))
  .filter(s => s != null && s.nonEmpty)
  .flatMap(s => Try(s.toInt).toOption)
  .filter(_ > 0)
```
Just confirming: what this looks like at 1 (which is then reduced to 0) is that every operation gets flatMapped rather than wrapped in a `Map`.
Yes, that's the intention, which made me realize that when that counter gets reset we should use a `Map(this, f, 0)` instead of a `FlatMap(this, f.andThen(pure))`.
The PR is ready for merging if you folks agree.

Any objections on merging this PR?
Third PR addressing performance issues. Previous ones are #90 and #91.
Also addresses issue #92, but without the proposed laws.
In order to avoid stack-overflow errors, repeated `.map` calls trigger a `.flatMap` every 128 calls.

The rationale is thus: from my research, the default stack size ranges from 320 KB on 32-bit operating systems to 1 MB on 64-bit operating systems. From my own tests, a stack frame triggered by `Function1` chaining consumes approximately 32 bytes. So 128 fused `map` calls will consume `32 bytes * 128 = 4 KB` at a minimum (to which you add whatever else the user is doing on that stack).

In a real `flatMap` chain, which is what happens with `IO`, the performance does end up being dominated by `.flatMap` calls even with `.map` calls mixed in (a benchmark proving this will follow); however, the improvements for `.map` are still good.

The important thing to watch out for in this optimization is that degrading `.flatMap` to make `.map` faster isn't acceptable, but if we can manage a performance boost for fused `.map` calls without a `.flatMap` regression, then it's all good.

For benchmarking I've introduced 2 new benchmarks:
- `MapCallsBenchmark` measures the performance of pure `map` calls (without being mixed within `flatMap` loops)
- `MapStreamBenchmark` measures the performance impact of a real-world use-case, showing what to expect from Monix's `Iterant` and possibly FS2

The other benchmarks: here I seem to have suffered a slight regression. Not sure if this is a fluke or not, since the differences are very small; however, if it is real, I need to investigate whether I can gain some extra throughput from somewhere else (win some, lose some).
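The stack-budget arithmetic above can be sanity-checked directly (the 32-bytes-per-frame figure is the author's own measurement, taken as an assumption here, and 320 KB is the smallest default stack size mentioned):

```scala
// Worst-case stack consumed by a fully fused chain of maps:
// one ~32-byte Function1 frame per fused call.
val bytesPerFrame   = 32
val fusedCalls      = 128
val fusedStackBytes = bytesPerFrame * fusedCalls

println(fusedStackBytes)                        // 4096 bytes = 4 KB
println(fusedStackBytes * 100.0 / (320 * 1024)) // 1.25 (% of a 320 KB stack)
```

So even on the most constrained platform considered, a fully fused chain eats only about 1.25% of the stack, which is the "comfortable value" argument made earlier in the thread.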