
Optimization challenge #395

Closed · borkdude opened this issue Sep 2, 2020 · 28 comments

borkdude (Collaborator) commented Sep 2, 2020

These examples:

$ time bb -e "((fn a [n] (if (#{0 1} n) 1 (+ (a (- n 2)) (a (- n 1))))) 30)"
1346269
bb -e "((fn a [n] (if (#{0 1} n) 1 (+ (a (- n 2)) (a (- n 1))))) 30)"   7.77s  user 0.09s system 99% cpu 7.868 total
$ time bb -e "(letfn [(a1 [n] (if (#{0 1} n) 1 (+ (a2 (- n 2)) (a3 (- n 1)))))
        (a2 [n] (a1 n))
        (a3 [n] (a1 n))]
  (a1 30))"
1346269
bb -e    13.49s  user 0.09s system 99% cpu 13.601 total
$ time bb -e "(declare a2 a3)
(def a1 (fn [n] (if (#{0 1} n) 1 (+ (a2 (- n 2)) (a3 (- n 1))))))
(def a2 a1)
(def a3 a1)
(println (a1 30))
"
1346269
bb -e    7.50s  user 0.09s system 99% cpu 7.610 total

are a lot faster with joker. Maybe there is something we can optimize in sci to bring it on par.

Maybe we can borrow a trick or two from this post: https://tech.redplanetlabs.com/2020/09/02/clojure-faster/

One idea is to get rid of apply in fns.cljc and replace it with the same trick as in interpreter.cljc.
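
Presumably that trick amounts to dispatching on the counted arity and invoking positionally. A minimal sketch (illustrative only, not sci's actual code):

;; Illustrative sketch, not sci's implementation: dispatch on the counted
;; arity and invoke positionally, falling back to apply for higher arities.
;; This avoids apply's seq traversal for the common small-arity cases.
(defn fast-call [f args]
  (case (count args)
    0 (f)
    1 (f (nth args 0))
    2 (f (nth args 0) (nth args 1))
    3 (f (nth args 0) (nth args 1) (nth args 2))
    (apply f args)))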

joinr commented Sep 18, 2020

There is some low-hanging fruit in the micro-benchmarks I'm exploring: faster map access, faster branching versus default case statements, lifting variable resolution, figuring out a better way to detect and inline intrinsics, constant folding, etc. I'm not a big user of sci, but optimizing this seems very interesting. Just looking at VisualVM for an iterative stress test, most of the time is spent resolving and re-resolving vars. There may be some caching tricks to play with. Seems like a ripe area for research.
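
One such caching trick, as a self-contained toy (illustrative only, not sci's code): resolve a symbol once at analysis time and close over the mutable cell, so the hot path is a deref rather than a repeated lookup:

;; Toy environment: symbol -> mutable cell. Resolution happens once, at
;; analysis time; evaluation is a plain deref with no hashing.
(def env (atom {'x (atom 41)}))

(defn analyze-symbol [sym]
  (let [cell (get @env sym)]   ; resolved once, up front
    (fn [] @cell)))            ; hot path: just a deref

(def compiled-x (analyze-symbol 'x))
(compiled-x)                 ;=> 41
(reset! (get @env 'x) 42)
(compiled-x)                 ;=> 42, still sees updates without re-resolving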

borkdude (Collaborator) commented Sep 18, 2020

@joinr Thanks. Improvements welcome, preferably in small chunks so they are easy to measure and reason about.

Sci is used in https://github.com/borkdude/babashka, which allows you to write Clojure scripts with fast startup (thanks to GraalVM).

There is a smoke test executing various scripts in script/run_lib_tests in the babashka repo - it might also be worth looking at that, since that's probably the most typical way sci is used right now.

joinr commented Sep 18, 2020

I figure any improvement to sci would directly benefit babashka downstream. Although it's also possible that the native-image-produced binary may not benefit from all the optimizations that sci on a JIT'd JVM (or a profile-guided binary) would. There could be some mutually exclusive optimizations that benefit one more than the other.

borkdude (Collaborator) commented:

> I figure any improvement to sci would directly benefit babashka downstream.

Correct. But there may also be obvious optimizations that will always result in better performance. There are also the JS (browser and Node) runtimes, btw.

joinr commented Sep 18, 2020

@borkdude FYI, looking at the implementation of sci.impl.interpreter/fn-call, as defined by the macro expansion from def-fn-call, it doesn't look like there is any gain from counting the args and expanding them out explicitly via first/rest. That's actually pretty costly compared to the equivalent built-in with apply:

(defn my-fn-call
  [ctx f args]
  (apply f args))
sci.examples.repl> (time (dotimes [i 10000000]  (sci.impl.interpreter/fn-call ctx + [1 2 3 4])))
"Elapsed time: 12099.366 msecs"

sci.examples.repl> (time (dotimes [i 10000000]  (my-fn-call ctx + [1 2 3 4])))
"Elapsed time: 2679.6723 msecs"

You get even better speed if you reduce the args passed in. I think apply will traverse the input and efficiently derive the appropriate arity for IFn.invoke naturally (at least on the JVM; no idea about cljs).

joinr commented Sep 18, 2020

Lol, hold up, I just noticed I didn't interpret the args.

joinr commented Sep 18, 2020

Yeah, still faster apparently:

(defn my-fn-call [ctx f args]
   (apply f (map #(sci.impl.interpreter/interpret ctx %) args)))

sci.examples.repl> (time (dotimes [i 10000000]  (my-fn-call ctx + [1 2 3 4])))
"Elapsed time: 6922.1428 msecs"

borkdude (Collaborator) commented Sep 18, 2020

@joinr After introducing the fn-call macro instead of using my-fn-call (that's exactly how it was before), I got a significant speed boost running this example:

$ multitime -n10 -s0 ./sci "(loop [val 0 cnt 1000000] (if (pos? cnt) (recur (inc val) (dec cnt)) val))"

where sci is the binary as produced by script/compile. I think I got at least a 2x speed up with that. This example used to run in 12 seconds and now runs below 2 seconds (on my Macbook Pro) after various optimizations.

borkdude (Collaborator) commented:

This may be because we're calling different kinds of functions here. Maybe the functions / vars produced by sci are slower when called with apply whereas those from Clojure itself are faster. Worth looking into!

joinr commented Sep 18, 2020

My bench was off (updated and corrected). Invoking built-in functions is still faster this way by 2x, although in your aggregate bench that doesn't cost as much.

Very interesting. With the loop example, I repeated your eval but isolated just the interpretation time. I'm running this from the REPL, and hot-patched in the apply variant to replace the existing fn-call. Also on Java 8 (might make a big difference somehow).

(loop [val 0 cnt 1000000]
  (if (pos? cnt) (recur (inc val) (dec cnt)) val))

(let [frm (sci.impl.analyzer/analyze ctx
            '(loop [val 0 cnt 1000000]
               (if (pos? cnt) (recur (inc val) (dec cnt)) val)))]
  (time (sci.impl.interpreter/interpret ctx frm)))
;;"Elapsed time: 4567.0222 msecs"

(in-ns 'sci.impl.interpreter)
(defn fn-call [ctx f args]
  (apply f (map #(interpret ctx %) args)))
(in-ns 'sci.examples.repl)

sci.examples.repl> (let [frm (sci.impl.analyzer/analyze ctx
                                  '(loop [val 0 cnt 1000000]
                                     (if (pos? cnt) (recur (inc val) (dec cnt)) val)))]
                     (time (sci.impl.interpreter/interpret ctx frm)))
"Elapsed time: 4256.787 msecs"

I get a minor speed boost on the interpretation side. It would be better to do this via criterium and get an expected value, though.
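
For reference, the same measurement via criterium (assuming criterium is on the classpath) would look like:

;; criterium's quick-bench runs the expression repeatedly and reports a
;; statistically stable estimate instead of a single timed run
(require '[criterium.core :as crit])

(let [frm (sci.impl.analyzer/analyze ctx
            '(loop [val 0 cnt 1000000]
               (if (pos? cnt) (recur (inc val) (dec cnt)) val)))]
  (crit/quick-bench (sci.impl.interpreter/interpret ctx frm)))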

joinr commented Sep 18, 2020

Looks like there is a good bit of time just spent on hashing (makes sense, looking up values in the environment). There might be opportunities to push common stuff into more efficient structures (e.g. field-based access or some faster scheme than the default hashing). Most of the cost is seq traversal, hashing, hasheq, counting.

joinr commented Sep 18, 2020

None of this is probably new :) It's new to me, though.

joinr commented Sep 18, 2020

Toying with some gross JVM-centric optimizations here, mostly aimed at avoiding all the explicit seq overhead and first/rest calls, and also using faster case dispatching for keywords. The biggest gains came from making arg application faster by not processing the args as a lazy seq with the stock core libs. Rewriting a variant to replace the map interpret idiom knocked my runtime down to <= 3s (about a 25% reduction immediately). I do not know yet whether this is reproducible downstream in bb.
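
A minimal sketch of that rewrite (illustrative; the actual change may differ): interpret the args eagerly into an array instead of going through a lazy map, then invoke positionally:

;; Eagerly interpret each arg into an object array: no lazy seq, no
;; per-element thunk allocation in the hot path. The name arg-vals matches
;; the change mentioned later in this thread; the body here is a sketch.
(defn arg-vals [ctx args]
  (let [out (object-array (count args))]
    (loop [i 0, xs (seq args)]
      (if xs
        (do (aset out i (sci.impl.interpreter/interpret ctx (first xs)))
            (recur (inc i) (next xs)))
        out))))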

With current hacks:

sci.examples.repl> (let [frm (sci.impl.analyzer/analyze ctx
                                     '(loop [val 0 cnt 1000000]
                                          (if (pos? cnt) (recur (inc val) (dec cnt)) val)))]
                     (time (sci.impl.interpreter/interpret ctx frm)))
"Elapsed time: 3097.8561 msecs"
1000000

joinr@712929d (edited because the original link was messed up).

joinr commented Sep 18, 2020

Metadata lookups are costing about a full third of the remaining runtime, based on profiling, followed by basic hash/equiv work for operators and environment lookup. Curious to see if the metadata cost could be cached or otherwise eliminated.

joinr commented Sep 19, 2020

FYI, revisiting ch. 23 of PAIP to see if it would be easy to set up a little bytecode compiler for sci (there's also one in Lisp System Implementation, but I'm less familiar with it). That might be really cool...

joinr commented Sep 19, 2020

For some reason, I figured that Clojure's fn form would be invalid at runtime for native-image, since it emits bytecode and a new class, right? I guess that is a flawed assumption, since you are apparently using fn during analysis and interpretation, e.g. in sci.impl.fns/parse-fn-args+body. Since it's a lexically scoped function, defined under defn at compile time outside of the interpreter, it's actually not being defined at runtime. That's wonderful news :)
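
A toy illustration of why this is fine under native-image: fn inside a compiled defn doesn't generate bytecode at runtime; its class is emitted when sci itself is compiled, and each call merely allocates a new instance with different captured data:

;; The class for this closure is compiled ahead of time; calling make-adder
;; only allocates a new instance capturing n, so no runtime class definition.
(defn make-adder [n]
  (fn [x] (+ x n)))

((make-adder 5) 10)   ;=> 15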

joinr commented Sep 19, 2020

Changes here get my runtime on that loop demo down to 2.6s. I think there's a lot more room to improve with parse-fn-args+body actually.

joinr commented Sep 19, 2020

Using unsafe (or at least I haven't proven it safe) mutable binds eliminates a decent amount of hashing. It got me down to 2.3s at best. Most of the redundant work appears to be grinding through args inefficiently. Might be able to trim that, since the only things that have to be seqs are the varargs coming in.
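
For illustration only (not the actual patch), the mutable-binds idea amounts to reusing one mutable map for the loop locals instead of assoc'ing a fresh persistent map on every recur:

;; One mutable frame reused across iterations; each .put replaces what would
;; otherwise be a persistent-map assoc (allocation + hashing) per step.
(import 'java.util.HashMap)

(defn run-loop [^long n]
  (let [binds (HashMap.)]
    (.put binds 'val 0)
    (.put binds 'cnt n)
    (loop []
      (let [cnt (long (.get binds 'cnt))]
        (if (pos? cnt)
          (do (.put binds 'val (inc (long (.get binds 'val))))
              (.put binds 'cnt (dec cnt))
              (recur))
          (.get binds 'val))))))

(run-loop 1000000)   ;=> 1000000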

borkdude (Collaborator) commented:

@joinr Exciting stuff! I'm going to try out several of your approaches. But if you are relatively sure of an improvement, consider making separate PRs for those. Sci is tested in CI and I will also test the changes with babashka.

joinr commented Sep 19, 2020

@borkdude I think everything up to using mutation is pretty low hanging fruit. The cowmap thing works in "this" limited case, but it would probably have plenty of leaks (more of a proof of concept). I tried eliminating additional seq traversal inside of the binding loading, swapping out arrays for the params, but didn't get much benefit.

I think that having a different data structure with extremely fast access (e.g. field access) to the bindings would be good. There are other designs using types or structs that would supplant the map-based idioms (and require quite a bit of work), but there could be some decent gains, e.g. in eliminating metadata lookups for ops and constant hashing for common fields like bindings. Just ideas.
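
A sketch of the field-access idea (Ctx here is hypothetical, not sci's actual representation):

;; deftype fields are plain JVM fields: reading one is a single field load,
;; versus (:bindings ctx-map), which hashes the keyword and probes a
;; persistent map on every access.
(deftype Ctx [bindings env])

(let [ctx (Ctx. {'x 1} {})]
  (.-bindings ctx))   ;=> {x 1}, via a direct field read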

joinr commented Sep 19, 2020

For a more focused synopsis: I think the biggest changes were arg-vals, making parse-fn-args+body more efficient, and moving stuff out of the loop.

I mixed in the gross optimizations (fast- ops) without testing quite as granularly.

joinr commented Sep 19, 2020

Does the CI have any indication of performance regression?

borkdude (Collaborator) commented:

There is a performance test script, but currently not hooked up in CI:

https://github.com/borkdude/sci/blob/master/test/sci/performance_test.clj

This can be used to run comparisons against master. I think I would test such improvements locally in various scenarios right now.

borkdude (Collaborator) commented:

Run with script/perf

borkdude (Collaborator) commented Dec 19, 2020

New benchmarks compared to joker:

$ time ./sci "((fn a [n] (if (#{0 1} n) 1 (+ (a (- n 2)) (a (- n 1))))) 30)"
1346269
./sci "((fn a [n] (if (#{0 1} n) 1 (+ (a (- n 2)) (a (- n 1))))) 30)"   2.37s  user 0.09s system 99% cpu 2.466 total
$ time /usr/local/bin/joker -e "((fn a [n] (if (#{0 1} n) 1 (+ (a (- n 2)) (a (- n 1))))) 30)"
1346269
/usr/local/bin/joker -e    6.14s  user 0.65s system 148% cpu 4.564 total
$ time ./sci "(letfn [(a1 [n] (if (#{0 1} n) 1 (+ (a2 (- n 2)) (a3 (- n 1)))))
        (a2 [n] (a1 n))
        (a3 [n] (a1 n))]
  (a1 30))"
1346269
./sci    6.45s  user 0.09s system 99% cpu 6.560 total

$ time /usr/local/bin/joker -e "(letfn [(a1 [n] (if (#{0 1} n) 1 (+ (a2 (- n 2)) (a3 (- n 1)))))
        (a2 [n] (a1 n))
        (a3 [n] (a1 n))]
  (a1 30))"
1346269
/usr/local/bin/joker -e    7.03s  user 0.71s system 147% cpu 5.243 total
$ time ./sci "(declare a2 a3)
(def a1 (fn [n] (if (#{0 1} n) 1 (+ (a2 (- n 2)) (a3 (- n 1))))))
(def a2 a1)
(def a3 a1)
(println (a1 30))
"
1346269
./sci    1.94s  user 0.09s system 99% cpu 2.032 total

$ time /usr/local/bin/joker -e "(declare a2 a3)
(def a1 (fn [n] (if (#{0 1} n) 1 (+ (a2 (- n 2)) (a3 (- n 1))))))
(def a2 a1)
(def a3 a1)
(println (a1 30))
"
#'user/a3
#'user/a1
#'user/a2
#'user/a3
1346269
/usr/local/bin/joker -e    6.00s  user 0.62s system 148% cpu 4.445 total

We're roughly 2x as fast now, except for letfn, which has approached joker's speed but is not there yet. Will have a look at letfn.

joinr commented Dec 20, 2020

Nicely done.

borkdude (Collaborator) commented:

letfn is now also faster:

#480 (comment)
