A redis-backed hyperloglog implementation in Clojure
HyperLogLog is a probabilistic cardinality-estimation algorithm that achieves high accuracy with constant storage, first proposed by Flajolet et al. [1]. This Clojure library is designed for the use case where multiple frontends update shared HyperLogLog counters stored in Redis. HyperLogLog time series are also supported, enabling cardinality estimates over arbitrary periods of time (e.g., the number of distinct users over the past week).
According to this blog post, upcoming versions of Redis will have built-in support for HyperLogLog counters. Once that feature is released, this library should be considered deprecated.
HyperLogLog leverages the fantastic carmine library for connecting to Redis. Add the following dependencies to your project.clj:
[com.taoensso/carmine "2.4.0"]
[hyperloglog "0.2.1"]
First we need to set up a carmine connection pool. For example, for a simple Redis instance running on redis.company.com, use:
(ns my-app (:require [taoensso.carmine :as car]))
(defmacro wcar* [& body] `(car/wcar {:pool {} :spec {:host "redis.company.com"}} ~@body))
Require the core namespace:
(ns my-app (:require [hyperloglog.core :as hll]))
Populate the counter:
(wcar* (hll/add "5e0b4287-e88d-4d95-99c9-3bac4dded572"))
(wcar* (hll/add "2561d259-6924-4e9b-b9cf-cea7c9eb4cc2"))
...
(wcar* (hll/add "697d99d1-2cd9-4423-a4f8-6577dfd3f923"))
Get the estimated number of distinct items:
(wcar* (hll/count-distinct))
The add and count-distinct functions accept a number of options. See their respective docstrings for the full list.
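As a sketch of how options are passed (the :hashing-fn key is described later in this README; all other option names should be taken from the docstrings rather than from this example):

```clojure
;; Options are supplied as a trailing map. :hashing-fn is documented
;; further down; consult the add and count-distinct docstrings for the
;; full set of supported keys.
(def opts {:hashing-fn my-hashing-function})
(wcar* (hll/add "5e0b4287-e88d-4d95-99c9-3bac4dded572" opts))
```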
Require the time-series namespace:
(ns my-app (:require [hyperloglog.time-series :as hll-time]))
Populate the counter:
(wcar* (hll-time/add-now "5e0b4287-e88d-4d95-99c9-3bac4dded572"))
(wcar* (hll-time/add-now "2561d259-6924-4e9b-b9cf-cea7c9eb4cc2"))
...
(wcar* (hll-time/add-now "697d99d1-2cd9-4423-a4f8-6577dfd3f923"))
Get the estimated number of distinct items over recent time periods:
(wcar* (hll-time/count-latest-distinct hll-time/day)) ; last day
(wcar* (hll-time/count-latest-distinct (* 6 hll-time/hour))) ; last 6 hours
Furthermore, an hll-time/add-at function is available for adding items at a specific time. By default, the time series are bucketed per hour, but this is configurable (see the hll-time/add-at docstring for details).
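For example, assuming add-at takes the item followed by a timestamp (the argument order and time units are not confirmed here; check the docstring), back-filling a past observation might look like:

```clojure
;; Hypothetical sketch: verify the timestamp argument's position and
;; units (seconds vs. milliseconds) against the add-at docstring.
(wcar* (hll-time/add-at "5e0b4287-e88d-4d95-99c9-3bac4dded572"
                        1380000000)) ; some past Unix timestamp
```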
This library makes the following implementation choices:
- Uses a 64-bit hash function for increased accuracy at higher cardinalities (> 1 billion), as suggested by Heule et al. [2], instead of the 32-bit hash function with large-range corrections from the original paper [1].
- No small-range corrections. If m is the number of observables, the HyperLogLog asymptotic expected relative error is typically not attained for cardinalities below n = 5/2 m.
The table below shows the various hyperloglog operating points depending on the choice of the number of observables.
Number of observables (m) | Minimum number of samples (5/2m) | Standard error (1.04/sqrt(m)) |
16 | 40 | 26% |
32 | 80 | 18.38% |
64 | 160 | 13.00% |
128 | 320 | 9.19% |
256 | 640 | 6.50% |
512 | 1280 | 4.60% |
1024 | 2560 | 3.25% |
2048 | 5120 | 2.30% |
4096 | 10240 | 1.63% |
- The memory requirements scale linearly with the number of observables. Note that the observables for a given HyperLogLog counter are backed by a Redis hash (storing m fields and m values).
- The minimum number of samples (5/2 m) indicates the cardinality beyond which the estimate is usually within its expected relative error.
- The standard error is the square root of the variance of the estimate divided by the number of samples; its asymptotic value is 1.04/sqrt(m). See [1] for full details.
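The table rows follow directly from these two formulas. For example, for the default m = 1024:

```clojure
(let [m 1024]
  {:min-samples    (/ (* 5 m) 2)            ; 5/2 * 1024 = 2560
   :standard-error (/ 1.04 (Math/sqrt m))}) ; 1.04/32 = 0.0325, i.e. 3.25%
```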
This library defaults to m = 1024, but this is of course configurable (see, for example, the hyperloglog.core/add docstring).
The HyperLogLog algorithm relies on a good 64-bit hash function. By default, this library uses the MurmurHash3 hash function in its 128-bit variant and keeps only the first 64 bits. Heule et al. tested MurmurHash3 against the common hash functions (i.e., MD5, SHA1, SHA256) and found no significant difference in estimation quality [2]. Under the hood, we use the MurmurHash3 implementation from the byte-streams library. Therefore, the hyperloglog.core/add function should handle anything that byte-streams does (e.g., String, byte[], CharSequence, ByteBuffer). If you need to specify a custom 64-bit hash function, you can pass your own as follows:
(def hll-opts {:hashing-fn my-hashing-function})
(wcar* (hll/add my-custom-object hll-opts))
- Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In Proc. AofA, 2007.
- Stefan Heule, Marc Nunkesser, and Alex Hall. HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm. In Proc. EDBT, 2013.
Copyright © 2013 John Whitbeck
Distributed under the Apache License 2.0.