A redis-backed hyperloglog implementation in Clojure
HyperLogLog is a probabilistic cardinality-estimation algorithm that achieves high accuracy with constant storage, first proposed by Flajolet et al. [1]. This Clojure library is designed for the use case where multiple frontends update shared HyperLogLog counters stored in Redis. HyperLogLog time series are also supported, enabling cardinality estimates over arbitrary periods of time (e.g., the number of distinct users over the past week).
According to this blog post, upcoming versions of Redis will have built-in support for HyperLogLog counters. Once that feature is released, this library should be considered deprecated.
HyperLogLog leverages the fantastic carmine library for connecting to Redis. Add the following dependencies to your project.clj:
[com.taoensso/carmine "2.4.0"]
[hyperloglog "0.2.1"]
First we need to set up a carmine connection pool. For example, for a simple Redis instance running on redis.company.com, use:
(ns my-app (:require [taoensso.carmine :as car]))
(defmacro wcar* [& body] `(car/wcar {:pool {} :spec {:host "redis.company.com"}} ~@body))
Require the core namespace:
(ns my-app (:require [hyperloglog.core :as hll]))
Populate the counter:
(wcar* (hll/add "5e0b4287-e88d-4d95-99c9-3bac4dded572"))
(wcar* (hll/add "2561d259-6924-4e9b-b9cf-cea7c9eb4cc2"))
...
(wcar* (hll/add "697d99d1-2cd9-4423-a4f8-6577dfd3f923"))
Get the estimated number of distinct items:
(wcar* (hll/count-distinct))
The add and count-distinct functions accept a number of options. See their respective docstrings for the full list.
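As a sketch of how options are passed (the :hashing-fn key is described later in this README; all other option names should be taken from the docstrings rather than from this example):

```clojure
;; Options are supplied as a trailing map. :hashing-fn is documented
;; further down; consult the add and count-distinct docstrings for the
;; full set of supported keys.
(def opts {:hashing-fn my-hashing-function})
(wcar* (hll/add "5e0b4287-e88d-4d95-99c9-3bac4dded572" opts))
```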
Require the time-series namespace:
(ns my-app (:require [hyperloglog.time-series :as hll-time]))
Populate the counter:
(wcar* (hll-time/add-now "5e0b4287-e88d-4d95-99c9-3bac4dded572"))
(wcar* (hll-time/add-now "2561d259-6924-4e9b-b9cf-cea7c9eb4cc2"))
...
(wcar* (hll-time/add-now "697d99d1-2cd9-4423-a4f8-6577dfd3f923"))
Get the estimated number of distinct items over recent time periods:
(wcar* (hll-time/count-latest-distinct hll-time/day)) ; last day
(wcar* (hll-time/count-latest-distinct (* 6 hll-time/hour))) ; last 6 hours
Furthermore, an hll-time/add-at function is available for adding items at a specific time. By default, the time series are bucketed per hour, but this is configurable (see the hll-time/add-at docstring for details).
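For example, assuming add-at takes the item followed by a timestamp (the argument order and time units are not confirmed here; check the docstring), back-filling a past observation might look like:

```clojure
;; Hypothetical sketch: verify the timestamp argument's position and
;; units (seconds vs. milliseconds) against the add-at docstring.
(wcar* (hll-time/add-at "5e0b4287-e88d-4d95-99c9-3bac4dded572"
                        1380000000)) ; some past Unix timestamp
```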
This library makes the following implementation choices:
- Uses a 64-bit hash function for increased accuracy at higher cardinalities (> 1 billion), as suggested by Heule et al. [2], instead of the 32-bit hash function with large-range corrections from the original paper [1].
- No small-range corrections. If m is the number of observables, the HyperLogLog asymptotic expected relative error is typically not attained for cardinalities below n = 5/2 m.
The table below shows the various hyperloglog operating points depending on the choice of the number of observables.
Number of observables (m) | Minimum number of samples (5/2m) | Standard error (1.04/sqrt(m)) |
16 | 40 | 26% |
32 | 80 | 18.38% |
64 | 160 | 13.00% |
128 | 320 | 9.19% |
256 | 640 | 6.50% |
512 | 1280 | 4.60% |
1024 | 2560 | 3.25% |
2048 | 5120 | 2.30% |
4096 | 10240 | 1.63% |
- The memory requirements scale linearly with the number of observables. Note that the observables for a given HyperLogLog counter are backed by a Redis hash (storing m fields and m values).
- The minimum number of samples (5/2 m) indicates the cardinality beyond which the estimate is usually within its expected relative error.
- The standard error is the square root of the variance of the estimate divided by the number of samples; its asymptotic value is 1.04/sqrt(m). See [1] for full details.
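The table rows follow directly from these two formulas. For example, for the default m = 1024:

```clojure
(let [m 1024]
  {:min-samples    (/ (* 5 m) 2)            ; 5/2 * 1024 = 2560
   :standard-error (/ 1.04 (Math/sqrt m))}) ; 1.04/32 = 0.0325, i.e. 3.25%
```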
This library defaults to m = 1024, but this is of course configurable (see, for example, the hyperloglog.core/add docstring).
The HyperLogLog algorithm relies on a good 64-bit hash function. By default, this library uses the MurmurHash3 hash function in its 128-bit variant and keeps only the first 64 bits. Heule et al. tested MurmurHash3 against the common hash functions (i.e., MD5, SHA1, SHA256) and found no significant difference in estimation quality [2]. Under the hood, we use the MurmurHash3 implementation from the byte-streams library. Therefore, the hyperloglog.core/add function should handle anything that byte-streams does (e.g., String, byte[], CharSequence, ByteBuffer). If you need to specify a custom 64-bit hash function, you can pass your own as follows:
(def hll-opts {:hashing-fn my-hashing-function})
(wcar* (hll/add my-custom-object hll-opts))
- Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In Proc. AofA, 2007.
- Stefan Heule, Marc Nunkesser, and Alex Hall. HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm. In Proc. EDBT, 2013.
Copyright © 2013 John Whitbeck
Distributed under the Apache License 2.0.