gcs: Implement version 2 filters. #1856
Conversation
Force-pushed from 562c2f4 to 518172a.
This has been updated to support deduplication of filter data for the version 2 filters. The PR description and all benchmarks have also been updated accordingly. For a concrete example of the difference the deduplication can make, consider the following values for block 1096:
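The deduplication pass described above can be sketched as follows. This is a minimal illustration, not the actual gcs implementation, and the values are hypothetical stand-ins rather than the real block 1096 data:

```go
package main

import (
	"fmt"
	"sort"
)

// dedup sorts the reduced hash values and drops duplicates so that
// hash collisions are only encoded once in the version 2 filter.
func dedup(values []uint64) []uint64 {
	sort.Slice(values, func(i, j int) bool { return values[i] < values[j] })
	out := values[:0]
	var prev uint64
	for i, v := range values {
		if i == 0 || v != prev {
			out = append(out, v)
		}
		prev = v
	}
	return out
}

func main() {
	// Hypothetical reduced hash values with a collision at 17.
	fmt.Println(dedup([]uint64{42, 17, 5, 17})) // [5 17 42]
}
```

Since the values are sorted before delta encoding anyway, removing the duplicates costs only a single extra pass over the sorted slice.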
Looks good.
Force-pushed from 5285405 to ec73c7f.
Force-pushed from ec73c7f to f07e16a.
This implements new version 2 filters, which have 4 changes as compared to version 1 filters:

- Support for independently specifying the false positive rate and Golomb coding bin size, which allows minimizing the filter size
- A faster (incompatible with version 1) reduction function
- A more compact serialization for the number of members in the set
- Deduplication of all hash collisions prior to reducing and serializing the deltas

In addition, it adds a full set of tests and updates the benchmarks to use the new version 2 filters.

The primary motivating factor for these changes is the ability to minimize the size of the filters; however, the following is a before-and-after comparison of version 1 and 2 filters in terms of performance and allocations.

It is interesting to note that the results for matching a single item are not very representative, since the hash calculation itself dominates to the point that the very low ns timings can vary significantly. Those differences average out when matching multiple items, which is the much more realistic scenario, and the performance increase is in line with the expected values. It is also worth noting that filter construction now takes a bit longer due to the additional deduplication step. While the filter construction numbers are about 25% larger in relative terms, the difference is only a few ms in practice and is therefore an acceptable trade-off for the size savings provided.
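Two of the listed changes can be sketched in Go: the faster reduction maps a 64-bit hash into [0, n) with a 128-bit multiply instead of a modulo, and the deltas between the sorted, deduplicated values are Golomb-Rice coded with an independently chosen bin size b. This is a minimal sketch under those assumptions; fastReduce, bitWriter, and golombEncode are illustrative names, not the actual gcs API:

```go
package main

import (
	"fmt"
	"math/bits"
)

// fastReduce maps a 64-bit hash x uniformly onto [0, n) by taking the
// high 64 bits of x*n. It is faster than x % n but produces different
// values, which is one reason version 2 filters are incompatible with
// version 1.
func fastReduce(x, n uint64) uint64 {
	hi, _ := bits.Mul64(x, n)
	return hi
}

// bitWriter accumulates bits most-significant first.
type bitWriter struct {
	bytes []byte
	nbits uint
}

func (w *bitWriter) writeBit(bit uint64) {
	if w.nbits%8 == 0 {
		w.bytes = append(w.bytes, 0)
	}
	if bit != 0 {
		w.bytes[len(w.bytes)-1] |= 1 << (7 - w.nbits%8)
	}
	w.nbits++
}

// golombEncode writes delta with Golomb-Rice parameter b: the quotient
// delta>>b in unary (q ones then a zero), then the low b remainder bits.
func golombEncode(w *bitWriter, delta uint64, b uint) {
	for q := delta >> b; q > 0; q-- {
		w.writeBit(1)
	}
	w.writeBit(0)
	for i := int(b) - 1; i >= 0; i-- {
		w.writeBit((delta >> uint(i)) & 1)
	}
}

func main() {
	var w bitWriter
	golombEncode(&w, 5, 2) // quotient 1, remainder 01 -> bits 1 0 0 1
	fmt.Printf("%08b (%d bits)\n", w.bytes[0], w.nbits)
	fmt.Println(fastReduce(1<<63, 100)) // 50: the midpoint of the hash range maps to n/2
}
```

Decoupling b from the false positive rate is what enables the size minimization: the unary quotient is cheap only when b is matched to the actual density of the reduced values, so being able to tune it independently lets the encoder pick the smallest total encoding.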
benchmark                     old ns/op     new ns/op     delta
-----------------------------------------------------------------
BenchmarkFilterBuild50000     16194920      20279043      +25.22%
BenchmarkFilterBuild100000    32609930      41629998      +27.66%
BenchmarkFilterMatch          620           593           -4.35%
BenchmarkFilterMatchAny       2687          2302          -14.33%

benchmark                     old allocs    new allocs    delta
-----------------------------------------------------------------
BenchmarkFilterBuild50000     6             17            +183.33%
BenchmarkFilterBuild100000    6             18            +200.00%
BenchmarkFilterMatch          0             0             +0.00%
BenchmarkFilterMatchAny       0             0             +0.00%

benchmark                     old bytes     new bytes     delta
-----------------------------------------------------------------
BenchmarkFilterBuild50000     688366        2074653       +201.39%
BenchmarkFilterBuild100000    1360064       4132627       +203.86%
BenchmarkFilterMatch          0             0             +0.00%
BenchmarkFilterMatchAny       0             0             +0.00%
Force-pushed from f07e16a to 2c3a4e3.
This requires #1851 and #1854.