From 565445bbc7ec3b38fb9dcc92050ce69a1f92e59b Mon Sep 17 00:00:00 2001
From: "kedia,Akanksha"
Date: Sat, 26 Aug 2023 09:51:26 +0530
Subject: [PATCH] Document Set Digest functions

Cherry-pick of https://github.com/trinodb/trino/pull/8269/commits/b9fc92d4c14cb6f76545674dc9973536833c75c7

Co-authored-by: Marius Grama
---
 presto-docs/src/main/sphinx/functions.rst     |   1 +
 .../src/main/sphinx/functions/setdigest.rst   | 142 ++++++++++++++++++
 2 files changed, 143 insertions(+)
 create mode 100644 presto-docs/src/main/sphinx/functions/setdigest.rst

diff --git a/presto-docs/src/main/sphinx/functions.rst b/presto-docs/src/main/sphinx/functions.rst
index e48c7cf08cea3..e3a63772292fa 100644
--- a/presto-docs/src/main/sphinx/functions.rst
+++ b/presto-docs/src/main/sphinx/functions.rst
@@ -34,3 +34,4 @@ Functions and Operators
     functions/session
     functions/teradata
     functions/internationalization
+    functions/setdigest
diff --git a/presto-docs/src/main/sphinx/functions/setdigest.rst b/presto-docs/src/main/sphinx/functions/setdigest.rst
new file mode 100644
index 0000000000000..d14eeb8feecec
--- /dev/null
+++ b/presto-docs/src/main/sphinx/functions/setdigest.rst
@@ -0,0 +1,142 @@
+====================
+Set Digest functions
+====================
+
+MinHash, or the min-wise independent permutations locality-sensitive hashing scheme,
+is a technique used in computer science to quickly estimate how similar two sets are.
+MinHash serves as a probabilistic data structure that estimates the Jaccard similarity
+coefficient - the measure of the overlap between two sets as a percentage of the total unique elements in both sets.
+Presto offers several functions that deal with the
+`MinHash `_ technique.
+
+MinHash is used to quickly estimate the
+`Jaccard similarity coefficient `_
+between two sets.
+It is commonly used in data mining to detect near-duplicate web pages at scale.
+Using this information, search engines can avoid showing two nearly identical
+pages within the same search results.
+
+Data structures
+---------------
+
+Presto implements Set Digest data sketches by encapsulating the following components:
+
+- `HyperLogLog `_
+- `MinHash with a single hash function `_
+
+Both ``HyperLogLog`` and ``MinHash`` are approximation techniques that certain Presto functions
+use to handle large data sets.
+
+``HyperLogLog (HLL)``: HyperLogLog is an algorithm used to estimate the cardinality
+of a set, that is, the number of distinct elements in a large data set.
+Presto uses it to provide the ``approx_distinct`` function, which estimates the number
+of distinct entries in a column.
+
+Examples::
+
+    SELECT approx_distinct(column_name) FROM table_name;
+
+``MinHash``: MinHash is used to estimate the similarity between two sets, measured by the Jaccard similarity coefficient.
+It is particularly effective when dealing with large data sets and is generally used in data clustering
+and near-duplicate detection.
+
+Examples::
+
+    WITH mh1 AS (SELECT minhash_agg(to_utf8(value)) AS minhash FROM table1), mh2 AS (SELECT minhash_agg(to_utf8(value))
+    AS minhash FROM table2) SELECT jaccard_index(mh1.minhash, mh2.minhash) AS similarity FROM mh1, mh2;
+
+The Presto type for this data structure is called ``setdigest``.
+Presto offers the ability to merge multiple Set Digest data sketches.
+
+Serialization
+-------------
+
+Data sketches such as those created with MinHash or HyperLogLog can be serialized into the ``varbinary`` data type.
+Serializing these data structures allows them to be efficiently stored and, if needed, transferred between different
+systems or sessions.
+Once stored, they can be deserialized back into their original state when they are needed again.
+In Presto, you would normally do this by converting these data sketches to and from ``varbinary``.
+For example, a ``setdigest`` can be cast to ``varbinary`` and cast back to ``setdigest``.
+
+Functions
+---------
+
+.. function:: make_set_digest(x) -> setdigest
+
+    Composes all input values of ``x`` into a ``setdigest``.
+
+    Examples:
+
+    Create a ``setdigest`` from ``bigint`` values::
+
+        SELECT make_set_digest(value)
+        FROM (VALUES 1, 2, 3) T(value);
+
+    Creating a ``setdigest`` from ``varchar`` values fails, because only ``bigint`` input is accepted::
+
+        SELECT make_set_digest(value)
+        FROM (VALUES 'Trino', 'SQL', 'on', 'everything') T(value);
+
+
+        Error : Unexpected parameters (varchar(10)) for function make_set_digest.
+        Expected: make_set_digest(bigint)
+
+
+.. function:: merge_set_digest(setdigest) -> setdigest
+
+    Returns the ``setdigest`` of the aggregate union of the individual
+    ``setdigest`` structures.
+
+    Examples::
+
+        SELECT merge_set_digest(a) FROM (SELECT make_set_digest(value) AS a FROM (VALUES 4, 3, 2, 1) T(value));
+
+.. function:: cardinality(setdigest) -> bigint
+
+    Returns the cardinality of the set digest from its internal
+    ``HyperLogLog`` component.
+
+    Examples::
+
+        SELECT cardinality(make_set_digest(value))
+        FROM (VALUES 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5) T(value);
+        -- 5
+
+.. function:: intersection_cardinality(x, y) -> bigint
+
+    Returns an estimate of the cardinality of the intersection of the two set digests.
+
+    ``x`` and ``y`` must be of type ``setdigest``.
+
+    Examples::
+
+        SELECT intersection_cardinality(make_set_digest(v1), make_set_digest(v2))
+        FROM (VALUES (1, 1), (NULL, 2), (2, 3), (3, 4)) T(v1, v2);
+        -- 3
+
+.. function:: jaccard_index(x, y) -> double
+
+    Returns an estimate of the `Jaccard index `_ for
+    the two set digests.
+
+    ``x`` and ``y`` must be of type ``setdigest``.
+
+    Examples::
+
+        SELECT jaccard_index(make_set_digest(v1), make_set_digest(v2))
+        FROM (VALUES (1, 1), (NULL, 2), (2, 3), (NULL, 4)) T(v1, v2);
+        -- 0.5
+
+.. function:: hash_counts(x) -> map(bigint, smallint)
+
+    Returns a map containing the `Murmur3Hash128 `_
+    hashed values and the count of their occurrences within
+    the internal ``MinHash`` structure belonging to ``x``.
+
+    ``x`` must be of type ``setdigest``.
+
+    Examples::
+
+        SELECT hash_counts(make_set_digest(value))
+        FROM (VALUES 1, 1, 1, 2, 2) T(value);
+        -- {19144387141682250=3, -2447670524089286488=2}
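
Reviewer note (not part of the patch): the ``jaccard_index`` example in the patch can be checked by hand, since the non-NULL values give {1, 2} and {1, 2, 3, 4}, so the exact index is 2 / 4 = 0.5. The Python sketch below reproduces that arithmetic and also illustrates the general idea behind a single-hash-function MinHash (keeping the k smallest hash values of each set). It is only an illustration under stated assumptions: the helper names, the SHA-256 stand-in hash, and ``k=64`` are hypothetical choices and are not Presto's implementation, which the patch describes as using Murmur3Hash128.

```python
import hashlib


def exact_jaccard(a, b):
    """Exact Jaccard index |A n B| / |A u B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hash64(value):
    """Deterministic 64-bit hash (SHA-256 prefix; a stand-in, not Presto's hash)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def bottom_k_sketch(values, k=64):
    """Keep the k smallest distinct hash values, skipping NULLs (None)."""
    hashes = {hash64(v) for v in values if v is not None}
    return set(sorted(hashes)[:k])


def estimate_jaccard(sketch_a, sketch_b, k=64):
    """Estimate the Jaccard index from two bottom-k sketches.

    Take the k smallest hashes of the combined sketches and count how
    many of them occur in both sketches.
    """
    smallest = sorted(sketch_a | sketch_b)[:k]
    if not smallest:
        return 0.0
    shared = sum(1 for h in smallest if h in sketch_a and h in sketch_b)
    return shared / len(smallest)


# Same input columns as the jaccard_index example; None plays the role of NULL.
v1 = [1, None, 2, None]
v2 = [1, 2, 3, 4]

exact = exact_jaccard([v for v in v1 if v is not None], v2)
estimate = estimate_jaccard(bottom_k_sketch(v1), bottom_k_sketch(v2))
print(exact, estimate)  # prints "0.5 0.5"
```

For sets smaller than ``k`` the sketch holds every hash, so the estimate equals the exact value; the approximation only matters once a set has more than ``k`` distinct elements.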