Optimization challenge #395
There's some low-hanging fruit in the micro benches I'm exploring. Things like faster map access, faster branching vs. the default.
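As an illustrative sketch (not code from this thread), the "faster map access" idea can be as simple as calling the ILookup interface directly instead of going through `get`, which re-dispatches through clojure.lang.RT on every call:

```clojure
;; sketch: with the map hinted as ILookup, this compiles to a direct
;; interface call, skipping the type checks inside clojure.core/get.
(defn fast-get [^clojure.lang.ILookup m k]
  (.valAt m k))

(fast-get {:a 1} :a) ;;=> 1
```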
@joinr Thanks. Improvements welcome, preferably in small chunks so they are easy to measure and reason about. Sci is used in https://github.com/borkdude/babashka, which allows you to write Clojure scripts with fast startup (thanks to GraalVM). There is a smoke test executing various scripts in …
I figure any improvement to sci would directly benefit babashka downstream. Although it's also possible the native-image-produced binary may not benefit from all the optimizations that sci on a JIT'd JVM would (or a profile-guided binary). There could be some mutually exclusive optimizations that benefit one more than the other.
Correct. But there may also be obvious optimizations that will always result in better performance. There are also the JS (browser and Node) runtimes, btw.
@borkdude FYI, looking at the implementation of `sci.impl.interpreter/fn-call` as defined by the macro expansion:

```clojure
(defn my-fn-call
  [ctx f args]
  (apply f args))
```

```clojure
sci.examples.repl> (time (dotimes [i 10000000] (sci.impl.interpreter/fn-call ctx + [1 2 3 4])))
"Elapsed time: 12099.366 msecs"
sci.examples.repl> (time (dotimes [i 10000000] (my-fn-call ctx + [1 2 3 4])))
"Elapsed time: 2679.6723 msecs"
```

You get even better speed if you reduce the args passed in. I think …
Lol, hold up, I just noticed I didn't interpret the args.
Yeah, still faster apparently:

```clojure
(defn my-fn-call [ctx f args]
  (apply f (map #(sci.impl.interpreter/interpret ctx %) args)))
```

```clojure
sci.examples.repl> (time (dotimes [i 10000000] (my-fn-call ctx + [1 2 3 4])))
"Elapsed time: 6922.1428 msecs"
```
@joinr After introducing the … where …
This may be because we're calling different kinds of functions here. Maybe the functions / vars produced by sci are slower when called with apply, whereas those from Clojure itself are faster. Worth looking into!
My bench was off (updated and corrected). Invoking built-in functions is still faster this way by 2x, although in your aggregate bench it doesn't cost as much. Very interesting. With the loop example, I repeated your eval but isolated just the interpretation time. I'm running this from the REPL, and hot patched in the new `fn-call`:

```clojure
(loop [val 0 cnt 1000000]
  (if (pos? cnt) (recur (inc val) (dec cnt)) val))
```

```clojure
(let [frm (sci.impl.analyzer/analyze ctx
            '(loop [val 0 cnt 1000000]
               (if (pos? cnt) (recur (inc val) (dec cnt)) val)))]
  (time (sci.impl.interpreter/interpret ctx frm)))
;;"Elapsed time: 4567.0222 msecs"

(in-ns 'sci.impl.interpreter)
(defn fn-call [ctx f args]
  (apply f (map #(interpret ctx %) args)))
(in-ns 'sci.examples.repl)
```

```clojure
sci.examples.repl> (let [frm (sci.impl.analyzer/analyze ctx
                               '(loop [val 0 cnt 1000000]
                                  (if (pos? cnt) (recur (inc val) (dec cnt)) val)))]
                     (time (sci.impl.interpreter/interpret ctx frm)))
"Elapsed time: 4256.787 msecs"
```

I get a minor speed boost on the interpretation side. It would be better to do this via criterium and get an expected value, though.
Looks like there is a good bit of time just spent on hashing (makes sense: looking up values in the environment). There might be opportunities to push common stuff into more efficient structures (e.g. field-based access, or some faster scheme than the default hashing). Most of the cost is seq traversal, hashing, hasheq, and counting.
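An illustrative sketch of what "field-based access" could mean here, with hypothetical type and field names (not sci's actual structures):

```clojure
;; sketch: a deftype whose fields hold the hot entries. Reading a field
;; is a direct JVM field access: no hashing, no hasheq, no seq walking.
(deftype Env [bindings parent])

(defn env-bindings [^Env env]
  (.-bindings env)) ; field read

;; vs. the map-based idiom it would replace:
;; (:bindings env) ; keyword lookup -> hashing on every access
```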
> none of this is probably new :)

It's new to me though.
Toying with some gross JVM-centric optimizations here (joinr@712929d). Mostly aimed at avoiding all the explicit seq overhead and first/rest calls. Also using faster case dispatching for keywords. The biggest gains came from getting arg application to be faster by not processing the args as a lazy seq with the stock core libs. Rewriting a variant to replace the map interpret idiom knocked my runtime down to <= 3s (about a 25% reduction immediately). I do not know if this is reproducible downstream in bb as of yet. With the current hacks:

```clojure
sci.examples.repl> (let [frm (sci.impl.analyzer/analyze ctx
                               '(loop [val 0 cnt 1000000]
                                  (if (pos? cnt) (recur (inc val) (dec cnt)) val)))]
                     (time (sci.impl.interpreter/interpret ctx frm)))
"Elapsed time: 3097.8561 msecs"
1000000
```

(Edited because the original link was messed up.)
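A sketch of the kind of rewrite meant by replacing the map interpret idiom (illustrative; the actual change is in the linked commit):

```clojure
;; sketch: evaluate args eagerly into an ArrayList instead of building
;; a lazy (map #(interpret ctx %) args), which allocates a thunk and a
;; seq cell per element.
(defn arg-vals [ctx args]
  (let [acc (java.util.ArrayList.)]
    (loop [xs (seq args)]
      (when xs
        (.add acc (sci.impl.interpreter/interpret ctx (first xs)))
        (recur (next xs))))
    acc))

(defn my-fn-call [ctx f args]
  (apply f (arg-vals ctx args))) ; apply seqs over the ArrayList once
```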
Metadata lookups are costing about a full 1/3 of the remaining runtime, based on profiling, followed by basic hash/equiv stuff for operators and environment lookup. Curious to see if the metadata cost could be cached or otherwise eliminated.
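An illustrative sketch of one way the metadata cost could be eliminated, using a hypothetical node type and op key (not what sci actually does):

```clojure
;; sketch: have the analyzer emit a wrapper that carries the op as a
;; plain field, so dispatch reads a field instead of hashing into each
;; form's metadata map.
(deftype AnalyzedNode [op form])

(defn node-op [node]
  (if (instance? AnalyzedNode node)
    (.-op ^AnalyzedNode node) ; direct field read
    (:op (meta node))))       ; hypothetical metadata fallback
```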
FYI, revisiting ch. 23 of PAIP to see if it would be easy to set up a little bytecode compiler for sci (there's also one in Lisp System Implementation, but I'm less familiar with it). That might be really cool...
For some reason, I figured that clojure's …
Changes here get my runtime on that loop demo down to 2.6s. I think there's a lot more room to improve with `parse-fn-args+body`, actually.
Using unsafe (or at least I haven't proven it safe) mutable binds eliminates a decent amount of hashing. It got me down to 2.3s at best. Most of the redundant work appears to be grinding through args inefficiently. Might be able to trim that, since the only things that have to be seqs are the varargs coming in.
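A sketch of what unsafe mutable binds could look like, assuming a plain java.util.HashMap in place of the persistent bindings map (not the actual change):

```clojure
;; sketch: mutate one HashMap per invocation instead of assoc'ing a
;; fresh persistent map on every loop iteration. Unsafe if the bindings
;; ever escape (closures, lazy seqs), hence the caveat above.
(defn bind-args! [^java.util.HashMap bindings params vals]
  (loop [ps (seq params) vs (seq vals)]
    (when ps
      (.put bindings (first ps) (first vs))
      (recur (next ps) (next vs))))
  bindings)
```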
@joinr Exciting stuff! I'm going to try out several of your approaches. But if you are relatively sure of an improvement, consider making separate PRs for those. Sci is tested in CI and I will also test the changes with babashka.
@borkdude I think everything up to using mutation is pretty low-hanging fruit. The cowmap thing works in "this" limited case, but it would probably have plenty of leaks (it's more of a proof of concept). I tried eliminating additional seq traversal inside the binding loading, swapping out arrays for the params, but didn't get much benefit. I think that having a different data structure with extremely fast access (e.g. field access) to the bindings would be good. There are other designs using types or structs that would supplant the map-based idioms (and require quite a bit of work), but there could be some decent gains, e.g. in eliminating metadata lookups for ops and constant hashing for common fields like bindings. Just ideas.
For a more focused synopsis: I think the biggest changes were `arg-vals`, getting `parse-fn-args+body` more efficient, and moving stuff out of the loop. I mixed in the gross optimizations (the `fast-` ops) without testing quite as granularly.
Does the CI have any indication of performance regression?
There is a performance test script, but it's currently not hooked up in CI: https://github.com/borkdude/sci/blob/master/test/sci/performance_test.clj. This can be used to compare against master. I think I would test such improvements locally in various scenarios right now.
Run with …
I now created three issues out of the above: …
New benchmarks compared to joker: …
We're roughly 2x as fast now, except for `letfn`, which has approached joker's speed but is not yet there. Will have a look at `letfn`.
Nicely done.
`letfn` now also faster: …
These examples: … are a lot faster with joker. Maybe there is something we can optimize in sci to get it on par.
Maybe we can borrow a trick or two from this post: https://tech.redplanetlabs.com/2020/09/02/clojure-faster/
One idea is to get rid of `apply` in fns.cljc and replace it with the same trick as in interpreter.cljc.
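A sketch of the apply-avoiding trick presumably meant here (not the actual interpreter.cljc code): count the evaluated args once and dispatch to a fixed-arity call, falling back to `apply` only for long arg lists.

```clojure
;; sketch: fixed-arity calls for the common cases. apply has to seq
;; through its final argument, so skipping it helps on hot paths.
(defn fast-call [f args]
  (case (count args)
    0 (f)
    1 (f (nth args 0))
    2 (f (nth args 0) (nth args 1))
    3 (f (nth args 0) (nth args 1) (nth args 2))
    (apply f args)))
```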