A fresh look at implementing TLSH in Java.
This source implements TLSH, the Trend Locality Sensitive Hash method, in Java.
While there are already Java implementations of the method, the current one was written with specific design goals in mind.
- Specification compliant and unit tested
- Java look-and-feel compliant
- Performant
The module is built with Maven. Details on the Maven build are at the kTLSH site. There are also API docs.
Usage follows the pattern used by the other hash methods available in Java.
MessageDigest tlshDigest = MessageDigest.getInstance("TLSH");
tlshDigest.update("Hello world!".getBytes());
final byte[] hash1 = tlshDigest.digest();
final String encoded1 = TLSHUtil.encoded(hash1);
final byte[] hash2 = tlshDigest.digest("Goodbye Cruel World".getBytes());
final String encoded2 = TLSHUtil.encoded(hash2);
final int score = TLSHUtil.score(hash1, hash2, false);
All published TLSH algorithm variants are supported using the name selector `TLSH-(48|128|256)-(1|3)/[4-8]`, where `48`, `128` or `256` is the number of buckets, `1` or `3` is the number of checksum bytes, and the optional `/4` to `/8` suffix is the window size. The window size defaults to 5 bytes and may be omitted. Note that with 48 buckets only 1 checksum byte is specified.
That is, the full list of algorithms is:
4B window | 5B window | 6B window | 7B window | 8B window |
---|---|---|---|---|
TLSH-48-1/4 | TLSH-48-1/5 aka TLSH-48-1 | TLSH-48-1/6 | TLSH-48-1/7 | TLSH-48-1/8 |
TLSH-128-1/4 | TLSH-128-1/5 aka TLSH-128-1 aka TLSH | TLSH-128-1/6 | TLSH-128-1/7 | TLSH-128-1/8 |
TLSH-128-3/4 | TLSH-128-3/5 aka TLSH-128-3 | TLSH-128-3/6 | TLSH-128-3/7 | TLSH-128-3/8 |
TLSH-256-1/4 | TLSH-256-1/5 aka TLSH-256-1 | TLSH-256-1/6 | TLSH-256-1/7 | TLSH-256-1/8 |
TLSH-256-3/4 | TLSH-256-3/5 aka TLSH-256-3 | TLSH-256-3/6 | TLSH-256-3/7 | TLSH-256-3/8 |
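The naming scheme above can be sketched as a small validator. This is only an illustration of the rules as stated in this README, not the parser used by the module: the class `TlshNames` and its methods are hypothetical names, and the regular expression is an assumption derived from the selector and the table (including the bare `TLSH` alias and the restriction of 48 buckets to 1 checksum byte).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper illustrating the algorithm naming rules; not part of the module. */
final class TlshNames {
    // "TLSH" alone, or TLSH-<buckets>-<checksum bytes> with an optional /<window> suffix.
    // 48 buckets are only combined with 1 checksum byte.
    private static final Pattern NAME =
        Pattern.compile("TLSH(?:-(?:48-1|(?:128|256)-[13])(?:/[4-8])?)?");

    private TlshNames() { }

    /** Returns true if the name matches one of the documented variants or aliases. */
    static boolean isValid(final String name) {
        return NAME.matcher(name).matches();
    }

    /** Expands aliases to the full form, e.g. "TLSH" becomes "TLSH-128-1/5". */
    static String canonical(final String name) {
        if (!isValid(name)) {
            throw new IllegalArgumentException("Not a TLSH algorithm name: " + name);
        }
        if (name.equals("TLSH")) {
            return "TLSH-128-1/5";
        }
        // The window size defaults to 5 bytes when the suffix is omitted.
        return name.indexOf('/') >= 0 ? name : name + "/5";
    }

    /** Enumerates the full names, row by row as in the table. */
    static List<String> allCanonical() {
        final String[] families = {"48-1", "128-1", "128-3", "256-1", "256-3"};
        final List<String> names = new ArrayList<>();
        for (final String family : families) {
            for (int window = 4; window <= 8; window++) {
                names.add("TLSH-" + family + "/" + window);
            }
        }
        return names;
    }
}
```

For instance, `TlshNames.canonical("TLSH-256-3")` yields `TLSH-256-3/5`, while `TlshNames.isValid("TLSH-48-3/5")` is false because 48 buckets are only specified with 1 checksum byte.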
The module only exports the `TLSHUtil` utility class. It contains the `score` utility function, which computes the difference score between two hashes, as well as formatters for the hexadecimal representation of the TLSH hash.
Where relevant, the source code follows the choices made in the C reference implementation of the algorithm. The computed hashes and scores are unit tested against the published test data and expected results of the C reference implementation. All tests pass.
The module exposes the TLSH algorithm the standard way by defining a MessageDigest service provider. This ensures that the TLSH hash can be computed by the library clients the same way as any other hash would be computed.
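Because the algorithm is reached through `MessageDigest`, client code that hashes a stream does not need to know which algorithm it is using. The following is a minimal sketch of such a generic helper; the class `StreamDigests`, its method names and the 8 KiB buffer size are assumptions made for illustration and are not part of this module. It uses only standard JDK calls.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: hashes a stream with any registered MessageDigest algorithm. */
final class StreamDigests {
    private StreamDigests() { }

    /** Reads the stream in fixed-size chunks, so large inputs are never held in memory. */
    static byte[] digest(final String algorithm, final InputStream in) {
        try {
            final MessageDigest md = MessageDigest.getInstance(algorithm);
            final byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown digest algorithm: " + algorithm, e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Lower-case hexadecimal rendering of a digest. */
    static String toHex(final byte[] bytes) {
        final StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (final byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```

With this module on the classpath, the same call would presumably work with `"TLSH"` as the algorithm name, just as it does with `"MD5"` or `"SHA-256"`; the TLSH-specific text encoding would still come from `TLSHUtil` rather than from the generic `toHex` shown here.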
The code follows the Java coding guidelines published by Sun, as checked by Checkstyle.
The algorithm is implemented using Java runtime features wherever possible, and the code is written to be easy to read and follow.
The implementation was tuned with care to perform well even on large and very large input streams. For this purpose a separate JMH benchmark sub-module was created. Optimisations were applied only when they did not conflict with the preceding design goals. Performance was defined as raw hash bandwidth as well as stress on the Java GC (i.e. no unnecessary creation of objects).
The output of the benchmark harness is as follows:
Benchmark Mode Cnt Score Error Units
HashBenchmark.testKLarge16MiB thrpt 2 114142.230 ops/s
HashBenchmark.testKSmall32KiB thrpt 2 115001.934 ops/s
HashBenchmark.testTMLarge16MiB thrpt 2 92331.462 ops/s
HashBenchmark.testTMSmall32KiB thrpt 2 94514.575 ops/s
HashBenchmark.testMD5Large16MiB thrpt 2 507247.038 ops/s
HashBenchmark.testMD5Small32KiB thrpt 2 497034.176 ops/s
The results show that the implementation is approximately 24% faster on large datasets and approximately 22% faster on small datasets compared to the reference Java port. A preliminary comparison on large datasets shows that the C++ implementation is comparable in speed to the reference Java implementation, i.e. slower than this Java implementation.
For comparison, the performance of the MD5 hash on the same datasets is also measured.
All numbers are scaled to KiB/s hashing bandwidth, i.e. the implementation hashes 112MiB/s on the developer's MacBook Pro.
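The percentages quoted above follow directly from the benchmark scores. As a small worked check, assuming the `testK*` rows are this implementation and the `testTM*` rows are the reference Java port (an interpretation of the row names, not something the benchmark output states), the arithmetic can be sketched as follows. The class `BenchmarkMath` is a hypothetical name used only for this illustration.

```java
/** Hypothetical helper reproducing the arithmetic behind the quoted figures. */
final class BenchmarkMath {
    private BenchmarkMath() { }

    /** Relative speed advantage of candidate over baseline, in percent. */
    static double speedupPercent(final double candidate, final double baseline) {
        return (candidate / baseline - 1.0) * 100.0;
    }

    /** Converts a bandwidth in KiB/s to MiB/s. */
    static double kibToMib(final double kibPerSecond) {
        return kibPerSecond / 1024.0;
    }

    public static void main(final String[] args) {
        // Scores copied from the benchmark output (KiB/s).
        final double large = speedupPercent(114142.230, 92331.462);
        final double small = speedupPercent(115001.934, 94514.575);
        System.out.printf("large: %.1f%%, small: %.1f%%%n", large, small);
        System.out.printf("bandwidth: %.1f MiB/s%n", kibToMib(114142.230));
    }
}
```

The two ratios round to the 24% and 22% quoted above, and the converted bandwidth lands close to the 112MiB/s figure.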
Further increase in performance is currently being investigated.
The reference implementation of the algorithm, including a reference port to Java, can be found in the authors' repository.
Another implementation can be found under https://github.com/idealista/tlsh.
This implementation follows the original C++ reference implementation and not the Java reference implementation.
This implementation will compute hashes for all input sizes between 0 and Long.MAX_VALUE, i.e. 8EiB. The C++ reference implementation will only produce a hash if at least 50 bytes and a minimum level of entropy have been fed into the digester. The C++ implementation also has an upper bound of 2GiB.
The author welcomes any feedback and suggestions.