
Optimization challenge #395

Closed · borkdude opened this issue Sep 2, 2020 · 28 comments

borkdude (Collaborator) commented Sep 2, 2020

These examples:

$ time bb -e "((fn a [n] (if (#{0 1} n) 1 (+ (a (- n 2)) (a (- n 1))))) 30)"
1346269
bb -e "((fn a [n] (if (#{0 1} n) 1 (+ (a (- n 2)) (a (- n 1))))) 30)"   7.77s  user 0.09s system 99% cpu 7.868 total
$ time bb -e "(letfn [(a1 [n] (if (#{0 1} n) 1 (+ (a2 (- n 2)) (a3 (- n 1)))))
        (a2 [n] (a1 n))
        (a3 [n] (a1 n))]
  (a1 30))"
1346269
bb -e    13.49s  user 0.09s system 99% cpu 13.601 total
$ time bb -e "(declare a2 a3)
(def a1 (fn [n] (if (#{0 1} n) 1 (+ (a2 (- n 2)) (a3 (- n 1))))))
(def a2 a1)
(def a3 a1)
(println (a1 30))
"
1346269
bb -e    7.50s  user 0.09s system 99% cpu 7.610 total

are a lot faster with joker. Maybe there is something we can optimize in sci to bring it on par.

Maybe we can borrow a trick or two from this post: https://tech.redplanetlabs.com/2020/09/02/clojure-faster/

One idea is to get rid of apply in fns.cljc and replace it with the same trick as in interpreter.cljc.
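
Presumably that trick amounts to dispatching on the counted arity and invoking positionally. A minimal sketch (illustrative only, not sci's actual code):

;; Illustrative sketch, not sci's implementation: dispatch on the counted
;; arity and invoke positionally, falling back to apply for higher arities.
;; This avoids apply's seq traversal for the common small-arity cases.
(defn fast-call [f args]
  (case (count args)
    0 (f)
    1 (f (nth args 0))
    2 (f (nth args 0) (nth args 1))
    3 (f (nth args 0) (nth args 1) (nth args 2))
    (apply f args)))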

joinr commented Sep 18, 2020

There is some low-hanging fruit in the micro-benchmarks I'm exploring: faster map access, faster branching versus default case statements, lifting variable resolution, figuring out a better way to detect and inline intrinsics, constant folding, etc. I'm not a big user of sci, but optimizing this seems very interesting. Just looking at VisualVM for an iterative stress test, most of the time is spent resolving and re-resolving vars. There may be some caching tricks to play with. Seems like a ripe area for research.
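
One such caching trick, as a self-contained toy (illustrative only, not sci's code): resolve a symbol once at analysis time and close over the mutable cell, so the hot path is a deref rather than a repeated lookup:

;; Toy environment: symbol -> mutable cell. Resolution happens once, at
;; analysis time; evaluation is a plain deref with no hashing.
(def env (atom {'x (atom 41)}))

(defn analyze-symbol [sym]
  (let [cell (get @env sym)]   ; resolved once, up front
    (fn [] @cell)))            ; hot path: just a deref

(def compiled-x (analyze-symbol 'x))
(compiled-x)                 ;=> 41
(reset! (get @env 'x) 42)
(compiled-x)                 ;=> 42, still sees updates without re-resolving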

borkdude (Collaborator) commented Sep 18, 2020

@joinr Thanks. Improvements welcome, preferably in small chunks so they are easy to measure and reason about.

Sci is used in https://github.com/borkdude/babashka, which allows you to write Clojure scripts with fast startup (thanks to GraalVM).

There is a smoke test executing various scripts in script/run_lib_tests in the babashka repo - it might also be worth looking at that, since that's probably the most typical way sci is used right now.

joinr commented Sep 18, 2020

I figure any improvement to sci would directly benefit babashka downstream. Although it's also possible that the native-image-produced binary may not benefit from all the optimizations that sci on a JIT'd JVM (or a profile-guided binary) would. There could be some mutually exclusive optimizations that benefit one more than the other.

borkdude (Collaborator) commented:

> I figure any improvement to sci would directly benefit babashka downstream.

Correct. But there may also be obvious optimizations that will always result in better performance. There are also the JS (browser and Node) runtimes, btw.

joinr commented Sep 18, 2020

@borkdude FYI, looking at the implementation of sci.impl.interpreter/fn-call, as defined by the macro expansion from def-fn-call, it doesn't look like there is any gain from counting the args and expanding them out explicitly via first/rest. That's actually pretty costly compared to the equivalent built-in with apply:

(defn my-fn-call
  [ctx f args]
  (apply f args))
sci.examples.repl> (time (dotimes [i 10000000]  (sci.impl.interpreter/fn-call ctx + [1 2 3 4])))
"Elapsed time: 12099.366 msecs"

sci.examples.repl> (time (dotimes [i 10000000]  (my-fn-call ctx + [1 2 3 4])))
"Elapsed time: 2679.6723 msecs"

You get even better speed if you reduce the args passed in. I think apply will traverse the input and efficiently derive the appropriate arity for IFn.invoke naturally (at least on the JVM; no idea about cljs).

joinr commented Sep 18, 2020

Lol, hold up, I just noticed I didn't interpret the args.

joinr commented Sep 18, 2020

Yeah, still faster apparently:

(defn my-fn-call [ctx f args]
   (apply f (map #(sci.impl.interpreter/interpret ctx %) args)))

sci.examples.repl> (time (dotimes [i 10000000]  (my-fn-call ctx + [1 2 3 4])))
"Elapsed time: 6922.1428 msecs"

borkdude (Collaborator) commented Sep 18, 2020

@joinr After introducing the fn-call macro instead of using my-fn-call (that's exactly how it was before), I got a significant speed boost running this example:

$ multitime -n10 -s0 ./sci "(loop [val 0 cnt 1000000] (if (pos? cnt) (recur (inc val) (dec cnt)) val))"

where sci is the binary as produced by script/compile. I think I got at least a 2x speed up with that. This example used to run in 12 seconds and now runs below 2 seconds (on my Macbook Pro) after various optimizations.

borkdude (Collaborator) commented:

This may be because we're calling different kinds of functions here. Maybe the functions / vars produced by sci are slower when called with apply whereas those from Clojure itself are faster. Worth looking into!

joinr commented Sep 18, 2020

My bench was off (updated and corrected). Invoking built-in functions is still faster this way by 2x, although in your aggregate bench that doesn't cost as much.

Very interesting. With the loop example, I repeated your eval but isolated just the interpretation time. I'm running this from the REPL, and hot-patched in the apply variant to replace the existing fn-call. Also on Java 8 (might make a big difference somehow).

(loop [val 0 cnt 1000000]
  (if (pos? cnt) (recur (inc val) (dec cnt)) val))

(let [frm (sci.impl.analyzer/analyze ctx
            '(loop [val 0 cnt 1000000]
               (if (pos? cnt) (recur (inc val) (dec cnt)) val)))]
  (time (sci.impl.interpreter/interpret ctx frm)))
;;"Elapsed time: 4567.0222 msecs"

(in-ns 'sci.impl.interpreter)
(defn fn-call [ctx f args]
  (apply f (map #(interpret ctx %) args)))
(in-ns 'sci.examples.repl)

sci.examples.repl> (let [frm (sci.impl.analyzer/analyze ctx
                                  '(loop [val 0 cnt 1000000]
                                     (if (pos? cnt) (recur (inc val) (dec cnt)) val)))]
                     (time (sci.impl.interpreter/interpret ctx frm)))
"Elapsed time: 4256.787 msecs"

I get a minor speed boost on the interpretation side. It would be better to do this via criterium and get an expected value, though.
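
For reference, the same measurement via criterium (assuming criterium is on the classpath) would look like:

;; criterium's quick-bench runs the expression repeatedly and reports a
;; statistically stable estimate instead of a single timed run
(require '[criterium.core :as crit])

(let [frm (sci.impl.analyzer/analyze ctx
            '(loop [val 0 cnt 1000000]
               (if (pos? cnt) (recur (inc val) (dec cnt)) val)))]
  (crit/quick-bench (sci.impl.interpreter/interpret ctx frm)))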

joinr commented Sep 18, 2020

Looks like there is a good bit of time just spent on hashing (makes sense, looking up values in the environment). There might be opportunities to push common stuff into more efficient structures (e.g. field-based access or some faster scheme than the default hashing). Most of the cost is seq traversal, hashing, hasheq, counting.

joinr commented Sep 18, 2020

None of this is probably new :) It's new to me, though.

joinr commented Sep 18, 2020

Toying with some gross JVM-centric optimizations here, mostly aimed at avoiding all the explicit seq overhead and first/rest calls, and also using faster case dispatching for keywords. The biggest gains came from making arg application faster by not processing the args as a lazy seq with the stock core libs. Rewriting a variant to replace the map interpret idiom knocked my runtime down to <= 3s (about a 25% reduction immediately). I do not know yet whether this is reproducible downstream in bb.
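
A minimal sketch of that rewrite (illustrative; the actual change may differ): interpret the args eagerly into an array instead of going through a lazy map, then invoke positionally:

;; Eagerly interpret each arg into an object array: no lazy seq, no
;; per-element thunk allocation in the hot path. The name arg-vals matches
;; the change mentioned later in this thread; the body here is a sketch.
(defn arg-vals [ctx args]
  (let [out (object-array (count args))]
    (loop [i 0, xs (seq args)]
      (if xs
        (do (aset out i (sci.impl.interpreter/interpret ctx (first xs)))
            (recur (inc i) (next xs)))
        out))))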

With current hacks:

sci.examples.repl> (let [frm (sci.impl.analyzer/analyze ctx
                                     '(loop [val 0 cnt 1000000]
                                          (if (pos? cnt) (recur (inc val) (dec cnt)) val)))]
                     (time (sci.impl.interpreter/interpret ctx frm)))
"Elapsed time: 3097.8561 msecs"
1000000

joinr@712929d (edited because the original link was messed up).

joinr commented Sep 18, 2020

Metadata lookups are costing about a full third of the remaining runtime, based on profiling, followed by basic hash/equiv work for operators and environment lookup. Curious to see if the metadata cost could be cached or otherwise eliminated.

joinr commented Sep 19, 2020

FYI, revisiting ch. 23 of PAIP to see if it would be easy to set up a little bytecode compiler for sci (there's also one in Lisp System Implementation, but I'm less familiar with it). That might be really cool...

joinr commented Sep 19, 2020

For some reason, I figured that Clojure's fn form would be invalid at runtime for native-image, since it emits bytecode and a new class, right? I guess that is a flawed assumption, since you are apparently using fn during analysis and interpretation, e.g. in sci.impl.fns/parse-fn-args+body. Since it's a lexically scoped function, defined under defn at compile time outside of the interpreter, it's actually not being defined at runtime. That's wonderful news :)
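
A toy illustration of why this is fine under native-image: fn inside a compiled defn doesn't generate bytecode at runtime; its class is emitted when sci itself is compiled, and each call merely allocates a new instance with different captured data:

;; The class for this closure is compiled ahead of time; calling make-adder
;; only allocates a new instance capturing n, so no runtime class definition.
(defn make-adder [n]
  (fn [x] (+ x n)))

((make-adder 5) 10)   ;=> 15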

joinr commented Sep 19, 2020

Changes here get my runtime on that loop demo down to 2.6s. I think there's a lot more room to improve with parse-fn-args+body actually.

joinr commented Sep 19, 2020

Using unsafe (or at least I haven't proven it safe) mutable binds eliminates a decent amount of hashing. It got me down to 2.3s at best. Most of the redundant work appears to be grinding through args inefficiently. Might be able to trim that, since the only things that have to be seqs are the varargs coming in.
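
For illustration only (not the actual patch), the mutable-binds idea amounts to reusing one mutable map for the loop locals instead of assoc'ing a fresh persistent map on every recur:

;; One mutable frame reused across iterations; each .put replaces what would
;; otherwise be a persistent-map assoc (allocation + hashing) per step.
(import 'java.util.HashMap)

(defn run-loop [^long n]
  (let [binds (HashMap.)]
    (.put binds 'val 0)
    (.put binds 'cnt n)
    (loop []
      (let [cnt (long (.get binds 'cnt))]
        (if (pos? cnt)
          (do (.put binds 'val (inc (long (.get binds 'val))))
              (.put binds 'cnt (dec cnt))
              (recur))
          (.get binds 'val))))))

(run-loop 1000000)   ;=> 1000000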

borkdude (Collaborator) commented:

@joinr Exciting stuff! I'm going to try out several of your approaches. But if you are relatively sure of an improvement, consider making separate PRs for those. Sci is tested in CI and I will also test the changes with babashka.

joinr commented Sep 19, 2020

@borkdude I think everything up to using mutation is pretty low hanging fruit. The cowmap thing works in "this" limited case, but it would probably have plenty of leaks (more of a proof of concept). I tried eliminating additional seq traversal inside of the binding loading, swapping out arrays for the params, but didn't get much benefit.

I think that having a different data structure with extremely fast access (e.g. field access) to the bindings would be good. There are other designs using types or structs that would supplant the map-based idioms (and require quite a bit of work), but there could be some decent gains, e.g. in eliminating metadata lookups for ops and constant hashing for common fields like bindings. Just ideas.
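
A sketch of the field-access idea (Ctx here is hypothetical, not sci's actual representation):

;; deftype fields are plain JVM fields: reading one is a single field load,
;; versus (:bindings ctx-map), which hashes the keyword and probes a
;; persistent map on every access.
(deftype Ctx [bindings env])

(let [ctx (Ctx. {'x 1} {})]
  (.-bindings ctx))   ;=> {x 1}, via a direct field read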

joinr commented Sep 19, 2020

For a more focused synopsis: I think the biggest changes were arg-vals, making parse-fn-args+body more efficient, and moving stuff out of the loop.

I mixed in the gross optimizations (fast- ops) without testing quite as granularly.

joinr commented Sep 19, 2020

Does the CI have any indication of performance regression?

borkdude (Collaborator) commented:

There is a performance test script, but currently not hooked up in CI:

https://github.com/borkdude/sci/blob/master/test/sci/performance_test.clj

This can be used to run comparisons against master. I think I would test such improvements locally in various scenarios right now.

borkdude (Collaborator) commented:

Run with script/perf

borkdude (Collaborator) commented Dec 19, 2020

New benchmarks compared to joker:

$ time ./sci "((fn a [n] (if (#{0 1} n) 1 (+ (a (- n 2)) (a (- n 1))))) 30)"
1346269
./sci "((fn a [n] (if (#{0 1} n) 1 (+ (a (- n 2)) (a (- n 1))))) 30)"   2.37s  user 0.09s system 99% cpu 2.466 total
$ time /usr/local/bin/joker -e "((fn a [n] (if (#{0 1} n) 1 (+ (a (- n 2)) (a (- n 1))))) 30)"
1346269
/usr/local/bin/joker -e    6.14s  user 0.65s system 148% cpu 4.564 total
$ time ./sci "(letfn [(a1 [n] (if (#{0 1} n) 1 (+ (a2 (- n 2)) (a3 (- n 1)))))
        (a2 [n] (a1 n))
        (a3 [n] (a1 n))]
  (a1 30))"
1346269
./sci    6.45s  user 0.09s system 99% cpu 6.560 total

$ time /usr/local/bin/joker -e "(letfn [(a1 [n] (if (#{0 1} n) 1 (+ (a2 (- n 2)) (a3 (- n 1)))))
        (a2 [n] (a1 n))
        (a3 [n] (a1 n))]
  (a1 30))"
1346269
/usr/local/bin/joker -e    7.03s  user 0.71s system 147% cpu 5.243 total
$ time ./sci "(declare a2 a3)
(def a1 (fn [n] (if (#{0 1} n) 1 (+ (a2 (- n 2)) (a3 (- n 1))))))
(def a2 a1)
(def a3 a1)
(println (a1 30))
"
1346269
./sci    1.94s  user 0.09s system 99% cpu 2.032 total

$ time /usr/local/bin/joker -e "(declare a2 a3)
(def a1 (fn [n] (if (#{0 1} n) 1 (+ (a2 (- n 2)) (a3 (- n 1))))))
(def a2 a1)
(def a3 a1)
(println (a1 30))
"
#'user/a3
#'user/a1
#'user/a2
#'user/a3
1346269
/usr/local/bin/joker -e    6.00s  user 0.62s system 148% cpu 4.445 total

We're roughly 2x as fast now, except for letfn, which has approached joker's speed but is not there yet. Will have a look at letfn.

joinr commented Dec 20, 2020

Nicely done.

borkdude (Collaborator) commented:

letfn is now also faster:

#480 (comment)
