[core] Introduce dynamic bloom filter file index #4723
Thanks @herefree for the contribution. We are preparing 1.0, so we don't have time to review your PR right now; we may take a look in the next two weeks. |
+1 |
I think this dynamic bloom filter may perform poorly. First, it creates a new BloomFilter64 once the required number of records is reached, but it does not deduplicate the records. For example, if one file contains 8,000,000 records and each BloomFilter holds 1,000,000 items, 8 BloomFilters are needed to hold these values. What this ignores is that the 8,000,000 records might deduplicate to 1,000,000 distinct values. So the first improvement, which may work, is to test each record before writing it to the dynamic bloom filter; if it already exists, we can skip it. Second, we store small byte arrays in the metadata. If the dynamic bloom filter's item size is set too large, we have to store it as a file even when there are only a few distinct values. But if it is set too small, we end up with too many bloom filters in one dynamic bloom filter, which costs more time at query time. We need to work out this trade-off. Can you test it and give us some performance results? |
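The test-before-write idea in the comment above can be illustrated with a minimal sketch. Everything here is hypothetical and written only for illustration: the class name `DynamicBloomSketch`, the `testBeforeWrite` flag, and the double-hashing scheme are assumptions, not the PR's code and not Paimon's BloomFilter64.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical sketch of a growing list of fixed-size bloom filters. Not the PR's implementation. */
class DynamicBloomSketch {
    private final int itemsPerFilter;      // how many inserts one inner filter accepts
    private final int bitsPerFilter;       // bit vector size of one inner filter
    private final int numHashFunctions;
    private final boolean testBeforeWrite; // assumption: skip values that already test positive
    private final List<BitSet> filters = new ArrayList<>();
    private int itemsInCurrent = 0;

    DynamicBloomSketch(int itemsPerFilter, int bitsPerFilter, int numHashFunctions, boolean testBeforeWrite) {
        this.itemsPerFilter = itemsPerFilter;
        this.bitsPerFilter = bitsPerFilter;
        this.numHashFunctions = numHashFunctions;
        this.testBeforeWrite = testBeforeWrite;
        filters.add(new BitSet(bitsPerFilter));
    }

    /** Adds a 64-bit hash; returns false if it was skipped as a probable duplicate. */
    boolean add(long hash64) {
        if (testBeforeWrite && mightContain(hash64)) {
            return false; // already (probably) present, do not consume capacity
        }
        if (itemsInCurrent >= itemsPerFilter) {
            filters.add(new BitSet(bitsPerFilter)); // current filter is full, start a new one
            itemsInCurrent = 0;
        }
        BitSet current = filters.get(filters.size() - 1);
        for (int i = 0; i < numHashFunctions; i++) {
            current.set(position(hash64, i));
        }
        itemsInCurrent++;
        return true;
    }

    /** A lookup has to probe every inner filter, so its cost grows with numFilters(). */
    boolean mightContain(long hash64) {
        for (BitSet filter : filters) {
            boolean allSet = true;
            for (int i = 0; i < numHashFunctions && allSet; i++) {
                allSet = filter.get(position(hash64, i));
            }
            if (allSet) {
                return true;
            }
        }
        return false;
    }

    int numFilters() {
        return filters.size();
    }

    /** Double hashing over the two 32-bit halves of the 64-bit hash. */
    private int position(long hash64, int i) {
        int h1 = (int) hash64;
        int h2 = (int) (hash64 >>> 32);
        return Math.floorMod(h1 + i * h2, bitsPerFilter);
    }

    /** SplitMix64 finalizer, used here only to turn small test values into well-spread hashes. */
    static long mix(long x) {
        x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
        x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
        return x ^ (x >>> 31);
    }
}
```

Scaling the reviewer's example down by 1000: feeding 8,000 records that contain only 1,000 distinct values into a filter with `itemsPerFilter = 1000` creates 8 inner filters without the check and stays at 1 with it. The price is one extra lookup over all existing inner filters per write, and a false positive during that check can cause a genuinely new value to be skipped (it will still test positive later, so no false negative is introduced).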
Thanks for your reply. |
Actually, I have already implemented a version of what you describe; please see #3115. |
1: If we have N BloomFilters, do we have to test all N of them? Will the performance be good? |
The number of bloom filters is limited by max_items, so it does not grow indefinitely. The count only becomes very large when the user sets a particularly large max_items together with a particularly small items. Perhaps we should let the user specify this coefficient, with no expansion by default. |
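As a rough illustration of that bound, the sketch below assumes the cap on the number of inner filters is ceil(max_items / items). The exact formula used by the PR is not shown in this thread, so both the formula and the names `FilterCountBound`, `maxItems` and `itemsPerFilter` are assumptions.

```java
/** Hypothetical helper; the formula is an assumption about how max_items and items interact. */
class FilterCountBound {
    /** Upper bound on inner bloom filters, i.e. the worst-case number of probes per lookup. */
    static int maxFilters(long maxItems, long itemsPerFilter) {
        if (maxItems <= 0 || itemsPerFilter <= 0) {
            throw new IllegalArgumentException("maxItems and itemsPerFilter must be positive");
        }
        // Ceiling division without floating point.
        long bound = (maxItems + itemsPerFilter - 1) / itemsPerFilter;
        return Math.toIntExact(bound);
    }
}
```

Under this assumption, `max_items == items` gives a bound of 1 (no expansion), 8,000,000 over 1,000,000 gives 8, and only an extreme ratio such as 100,000,000 over 1,000 produces a six-digit filter count.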
Sorry for jumping into this discussion suddenly, please excuse my intrusion. Has anyone studied Xor Filters and Binary Fuse Filters? They have been implemented in this project [1] and are backed by papers [2][3]. They seem to have an advantage over bloom filters, but I haven't had time to look into them yet.
|
Thanks for your reply. I have seen the project https://github.com/FastFilter/fastfilter_java before. The Xor filter does not seem to support inserting keys one at a time; it has to be rebuilt from the full set of keys each time. |
Thanks, @herefree. I glanced briefly at the conclusion of the paper Binary Fuse Filters: Fast and Smaller Than Xor Filters, and you are right.
|
I’ve been a little busy these days, so I’ll continue this PR in the next two weeks. |
Purpose
Support a dynamic bloom filter file index. DynamicBloomFilter is largely based on org.apache.hudi.common.bloom.InternalDynamicBloomFilter.
Compared with Hadoop's DynamicBloomFilter, Hoodie's dynamic bloom filter is bounded. Once the number of entries added reaches the bound, the number of bloom filters stops increasing and the false positive ratio may no longer be guaranteed.
For the metadata, I kept only version, numBloomFilters, bloomFilterVectorSize and numHashFunctions. This is enough to build a FileIndexReader.
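To make the metadata description concrete, here is a minimal sketch of a header carrying those four fields. The class name `DynamicBloomFilterHeader`, the use of four big-endian 32-bit integers, and the field order are all assumptions; the PR's actual byte layout is not given in this description.

```java
import java.nio.ByteBuffer;

/** Hypothetical header with the four fields named above; layout and types are assumptions. */
final class DynamicBloomFilterHeader {
    static final int BYTES = 4 * Integer.BYTES;

    final int version;
    final int numBloomFilters;
    final int bloomFilterVectorSize;
    final int numHashFunctions;

    DynamicBloomFilterHeader(int version, int numBloomFilters, int bloomFilterVectorSize, int numHashFunctions) {
        this.version = version;
        this.numBloomFilters = numBloomFilters;
        this.bloomFilterVectorSize = bloomFilterVectorSize;
        this.numHashFunctions = numHashFunctions;
    }

    /** Serializes the fields in declaration order (ByteBuffer is big-endian by default). */
    byte[] toBytes() {
        return ByteBuffer.allocate(BYTES)
                .putInt(version)
                .putInt(numBloomFilters)
                .putInt(bloomFilterVectorSize)
                .putInt(numHashFunctions)
                .array();
    }

    /** Reads the header back from the first BYTES bytes; the bit vectors would follow it. */
    static DynamicBloomFilterHeader fromBytes(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data, 0, BYTES);
        return new DynamicBloomFilterHeader(in.getInt(), in.getInt(), in.getInt(), in.getInt());
    }
}
```

With these four values a reader knows how many bit vectors follow, how long each one is, and how many hash probes to run per vector, which is all a lookup needs.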
Tests
API and Format
Documentation
docs/content/concepts/spec/fileindex.md