For when you need a function that won’t give up at the first sign of failure.
True Grit enables flexible responses to function failure. True Grit is a data-driven, functionally-oriented, idiomatic wrapper library for using Resilience4j, one of the top resilience Java libraries.
It offers:
-
timeouts - throw an exception if a function takes too long
-
rate limiters - limit the number of calls to a function per time period
-
retries - retry a function if it fails
-
circuit breakers - track function failure rates, and block calls to the fn if it fails too often
-
bulkheads - to reserve capacity and partition usage
The most common way to use True Grit is to wrap functions that call out to an underlying network service. With True Grit, you can do things like say "Give this network call 5s before it’s considered to time out, retry it up to 3 times in case of intermittent network failure, track how many calls fail overall, and if more than 50% do, temporarily halt the function so we don’t hammer the downstream service."
Start with the net.modulolotus.truegrit
namespace, and see the individual
policy namespaces if you have more advanced needs. It contains all-in-one
functions that take a function and a config map, and return a wrapped
function with the resilience policy attached. All wrapped functions return
the same results as the original (except for thread-pool-based bulkheads,
which return Futures).
Docs are available here.
True Grit is stable and largely done. The lack of recent commits does not mean it is abandoned. I am still actively maintaining it, but the only planned future work is for bug fixes and Resilience4j updates. Suggestions for feature requests welcome, but no promises.
Add the following to your Leiningen project.clj
:
[net.modulolotus/truegrit "2.3.35"]
Add the following to your deps.edn
:
net.modulolotus/truegrit {:mvn/version "2.3.35"}
Wrapper libraries are frequently more hassle than doing interop directly. But in Resilience4j’s case, while it’s an excellent library, it requires coordinating dozens of Java classes to get anything done. So, if you don’t feel like wrangling a bunch of Config/Builder/Registry/etc classes, True Grit is for you.
As of Resilience4j 2.x and thus, True Grit 2.x, the minimum Java version is 17. If you need support for earlier Java versions, stick with the 1.x versions of True Grit. (It has the exact same API and functionality.)
Before using circuit breakers and bulkheads, be sure to understand how they operate. I highly recommend Release It! to understand the ways distributed systems can fail and how to compensate for them.
See:
-
True Grit docs - https://cljdoc.org/d/net.modulolotus/truegrit
-
Resilience4j docs - https://resilience4j.readme.io/
-
Circuit breaker pattern - https://www.martinfowler.com/bliki/CircuitBreaker.html
-
Release It! book - https://pragprog.com/titles/mnee2/release-it-second-edition/
-
Hystrix circuit breaker wiki - https://github.com/Netflix/Hystrix/wiki
(require '[net.modulolotus.truegrit :as tg])
(def resilient-fn
(-> flaky-fn
;; Give each individual call up to 10s to complete
(tg/with-time-limiter {:timeout-duration 10000})
;; Try up to 5 times, waiting 1s between failures
;; Will retry if an exception is thrown by default, but will also retry if
;; the return value is nil
(tg/with-retry {:name "my-retry"
:max-attempts 5
:wait-duration 1000
:retry-on-result nil?})
;; If it still fails after 5 tries, record it as a failure in the CB
;; CB will go into OPEN status if 20% of calls end up failures
;; CB will wait for at least 40 calls before considering a change in status,
;; giving it time to warm up.
;; Ignores UserCanceledExceptions, since if the user hit "Cancel", it's not a
;; problem in the underlying service
(tg/with-circuit-breaker {:name "my-circuit-breaker"
:failure-rate-threshold 20
:minimum-number-of-calls 40
:ignore-exceptions [UserCanceledException]})))
(require '[net.modulolotus.truegrit.circuit-breaker :as cb])
(def rest-service-cb (cb/circuit-breaker "shared-rest-service"
{:failure-rate-threshold 30
:minimum-number-of-calls 10}))
(def resil-get (cb/wrap flaky-get rest-service-cb))
(def resil-post (cb/wrap flaky-post rest-service-cb))
(def resil-put (cb/wrap flaky-put rest-service-cb))
(def resil-patch (cb/wrap flaky-patch rest-service-cb))
(def resil-delete (cb/wrap flaky-delete rest-service-cb))
(require '[net.modulolotus.truegrit.circuit-breaker :as cb])
(if (-> resilient-fn
(cb/retrieve) ; retrieve associated CircuitBreaker
(cb/call-allowed?)) ; is a call allowed right now?
(resilient-fn) ; if so, make the call
(some-fallback-fn)) ; if not, we can't wait, try a fallback
Use semaphore-based bulkheads to limit database access, keep 20% capacity in reserve, and log reserved metrics
(require '[net.modulolotus.truegrit.bulkhead :as bh])
(defn database-query-fn
"Some database fn that we've determined can only handle 100 simultaneous queries"
[user]
;; do some db stuff
)
;; Make a default version that can use up to 80% of the database's capacity
(def default-database-query (tg/with-bulkhead database-query-fn
{:name "default-db-bulkhead"
:max-concurrent-calls 80}))
;; Make a version that reserves 20% for special needs
(def reserved-database-query (tg/with-bulkhead database-query-fn
{:name "reserved-db-bulkhead"
:max-concurrent-calls 20}))
;; Usage
(defn some-handler-fn
[user]
(if (user-is-special-somehow user) ; Is the user a VIP, sysadmin, etc?
(reserved-database-query user) ; Make reserved call - the default bulkhead being full has no impact here
(default-database-query user))) ; Make standard call, blocking if unavailable
;; Log reserved bulkhead metrics every 10s
(future
(loop []
(-> reserved-database-query
(bh/retrieve)
(bh/metrics)
(log/debug))
(Thread/sleep 10000)
(recur)))
Circuit breaker status shorthand |
CLOSED is good, OPEN is bad. Think of electricity flowing. |
Read up on bulkheads and circuit breakers before using them |
Seriously. |
Circuit breakers should never be created on-demand |
Circuit breakers work by collecting data about a function’s success/failure rate over time. If you create a CB on the fly (like for an anonymous fn), but you only call that particular fn one time, the CB is useless. If you need to construct fns on the fly, but still track their overall success, you should create a CB ahead of time, and share it with all the anonymous fns by using |
Retries only make sense if there’s a reasonable expectation the fn will succeed within an acceptable time frame |
They’re better-suited for temporary glitches in the matrix, not a service being down all day. If the fn doesn’t succeed in time, retries can make things worse, by adding to the downstream load, which is why pairing them with circuit breakers works well. |
Be mindful of interactions at different levels of the system |
E.g., wrapping a high-level fn with a retry policy of 3 attempts that calls an AWS client lower down that also has its own internal retry policy of 3 attempts can result in up to 3x3=9 calls under failure modes, exacerbating things. Another common example is having multiple timeouts; it’s confusing and pointless, since the shortest timeout will trigger first. |
You still need to handle errors |
No amount of resilience policies can ensure a function will always succeed. |
Order of wrapping matters |
E.g.: (-> my-fn
(with-retry some-retry-config)
(with-time-limiter some-timeout-config)) will retry several times, but if the time limit is up before the tries succeed, it will fail. This is probably not what you want. On the other hand: (-> my-fn
(with-time-limiter some-timeout-config)
(with-retry some-retry-config)) will make calls with a certain time limit, and only if they return
failure or exceed their time limit, will it attempt a retry. If you want
a canonical "good" ordering, see the |
The r4j cache module is currently unsupported, since many Clojure/Java caching libraries already exist. However, it could be included, if people are interested. Let me know if you want it, or better still, submit a patch.
Supporting all the Java frameworks that r4j interoperates with is also a non-goal for now.