[core] Introduce dynamic bloom filter file index #4723

Draft
herefree wants to merge 2 commits into master
Conversation

@herefree (Contributor) commented Dec 16, 2024

Purpose

Support a dynamic bloom filter file index. DynamicBloomFilter is largely based on org.apache.hudi.common.bloom.InternalDynamicBloomFilter.
Compared with Hadoop's DynamicBloomFilter, Hudi's dynamic bloom filter is bounded: once the number of added entries reaches the bound, the number of bloom filters stops growing and the false positive ratio is no longer guaranteed.
For the metadata, I only kept version, numBloomFilters, bloomFilterVectorSize and numHashFunctions. This is enough to build the FileIndexReader.
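
A rough sketch of what this metadata implies is below. The class name, field layout, and serialization order are illustrative assumptions for this discussion, not the actual format in the PR.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical container for the four header fields the PR description keeps.
public class DynamicBloomFilterMeta {
    private final int version;
    private final int numBloomFilters;
    private final long bloomFilterVectorSize;
    private final int numHashFunctions;

    public DynamicBloomFilterMeta(
            int version, int numBloomFilters, long bloomFilterVectorSize, int numHashFunctions) {
        this.version = version;
        this.numBloomFilters = numBloomFilters;
        this.bloomFilterVectorSize = bloomFilterVectorSize;
        this.numHashFunctions = numHashFunctions;
    }

    /** Serialize only the header fields; the per-filter bit sets would follow in the index body. */
    public byte[] serialize() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(version);
        out.writeInt(numBloomFilters);
        out.writeLong(bloomFilterVectorSize);
        out.writeInt(numHashFunctions);
        return bytes.toByteArray();
    }
}
```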

Tests

API and Format

Documentation

docs/content/concepts/spec/fileindex.md

@JingsongLi (Contributor)

Thanks @herefree for the contribution. We are preparing the 1.0 release, so we don't have time to review your PR right now; we may take a look in the next two weeks.

@LinMingQiang (Contributor)

+1

@leaves12138 (Contributor)

I think this dynamic bloom filter may have poor performance.

Firstly, it creates a new BloomFilter64 whenever the required number of records is reached, but it does not deduplicate records. For example, if I have one file containing 8,000,000 records and each BloomFilter holds 1,000,000, then I need to create 8 BloomFilters to hold these values. What this ignores is that those 8,000,000 records might deduplicate down to only 1,000,000 distinct keys. So a first improvement, which may help, is to test whether a record already exists in the dynamic bloom filter before writing it; if it already exists, we can skip it.

Secondly, we store small byte arrays in the metadata. If we set the dynamic bloom filter item size too big, we have to store the index as a separate file even when there are only a few distinct values. But if we set it too small, we can get too many bloom filters in one dynamic bloom filter, which costs more time to query. Maybe we need to figure out this problem.

Can you test it and give us some performance results?
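
A minimal sketch of the "test before add" idea from the first point might look like the following. The BloomFilter interface and method names here are assumptions for illustration and do not reflect Paimon's actual BloomFilter64 API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical filter interface; only what the sketch needs.
interface BloomFilter {
    boolean testHash(long hash64);
    void addHash(long hash64);
}

class DynamicBloomFilterSketch {
    private final List<BloomFilter> filters = new ArrayList<>();
    private final int itemsPerFilter;
    private final int maxFilters;
    private int itemsInCurrentFilter;

    DynamicBloomFilterSketch(int itemsPerFilter, int maxFilters) {
        this.itemsPerFilter = itemsPerFilter;
        this.maxFilters = maxFilters;
    }

    /** Test before add: hashes that any existing filter already (probably) contains are skipped. */
    void addHash(long hash64, Supplier<BloomFilter> factory) {
        for (BloomFilter f : filters) {
            if (f.testHash(hash64)) {
                return; // probably a duplicate key; do not grow the structure for it
            }
        }
        if (filters.isEmpty()
                || (itemsInCurrentFilter >= itemsPerFilter && filters.size() < maxFilters)) {
            filters.add(factory.get()); // open a new filter until the bound is reached
            itemsInCurrentFilter = 0;
        }
        // Once the bound is reached, keep writing into the last filter (FPP no longer guaranteed).
        filters.get(filters.size() - 1).addHash(hash64);
        itemsInCurrentFilter++;
    }
}
```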

@herefree (Contributor, Author) commented Jan 7, 2025

Thanks for your reply.
For the first problem, data duplication, I will fix it.
For the second problem: if the user sets a small items value and a large max_items, then as the amount of data grows it will indeed produce many small bloom filters.
Maybe we could consider adding a new parameter, weight, to dynamically increase the items per bloom filter. For example, if the number of items in the first bloom filter is 1000, the number of items in the second bloom filter is 1000 + weight, the third is 1000 + 2 * weight, and so on. What do you think of this?

@leaves12138 (Contributor)

Actually, I have already implemented one version of what you said, please see: #3115
But it was rejected by the Paimon community. We solved the first problem the same way. The key is the second problem: what coefficient should we use to expand the bloom filter? Or is there any plan better than the expanding solution?

@LinMingQiang (Contributor)

1: If we have N BloomFilters, do we then have to test a key N times (illustrated in the sketch after this list)? Will the performance be good?
2: As for the expansion plan, the proportion can be dynamically adjusted according to the expansion rate.
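
Regarding point 1, a lookup against a dynamic bloom filter does have to probe every internal filter in the worst case, i.e. when the key is absent. The sketch below only illustrates that cost; it is not the actual reader code in this PR.

```java
import java.util.List;

class DynamicBloomFilterLookup {
    // Hypothetical read-side view of one internal filter.
    interface BloomFilter {
        boolean testHash(long hash64);
    }

    static boolean mightContain(List<BloomFilter> filters, long hash64) {
        for (BloomFilter f : filters) {
            if (f.testHash(hash64)) {
                return true; // keys that are present can short-circuit early
            }
        }
        return false; // absent keys always cost N filter probes
    }
}
```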

@herefree (Contributor, Author)

The number of bloom filters is limited by max_items, so it does not grow forever. Only when the user sets a particularly large max_items and a particularly small items will the number of bloom filters become particularly large. Perhaps we should let the user specify this coefficient, with the default being no expansion.
There are two ways to expand. The first is linear growth, using coefficient * nums as the growth step: nums, nums + coefficient * nums, nums + 2 * coefficient * nums, ... The other, as you wrote before, is exponential growth: nums, nums * coefficient, nums * coefficient * coefficient, ...
As for other plans, maybe we can use multi-threading to speed up the query.
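
A small sketch comparing the two expansion schemes discussed here; the method and parameter names are illustrative only.

```java
// Compare linear vs. exponential capacity growth for the i-th internal bloom filter.
// "nums" is the capacity of the first filter, "coefficient" the growth parameter.
class CapacityGrowth {

    /** Linear growth: nums, nums + coefficient*nums, nums + 2*coefficient*nums, ... */
    static long linearCapacity(long nums, double coefficient, int filterIndex) {
        return (long) (nums + filterIndex * coefficient * nums);
    }

    /** Exponential growth: nums, nums*coefficient, nums*coefficient^2, ... */
    static long exponentialCapacity(long nums, double coefficient, int filterIndex) {
        return (long) (nums * Math.pow(coefficient, filterIndex));
    }

    public static void main(String[] args) {
        // e.g. nums=1000: linear (coefficient 0.5) gives 1000, 1500, 2000, 2500, 3000;
        // exponential (coefficient 2.0) gives 1000, 2000, 4000, 8000, 16000.
        for (int i = 0; i < 5; i++) {
            System.out.printf("filter %d: linear=%d, exponential=%d%n",
                    i, linearCapacity(1000, 0.5, i), exponentialCapacity(1000, 2.0, i));
        }
    }
}
```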

@Tan-JiaLiang (Contributor) commented Jan 10, 2025

Sorry for jumping into this discussion suddenly, please excuse my intrusion.

Has anyone studied Xor Filters and Binary Fuse Filters? They have been implemented in this project [1] and are backed by papers [2][3]. They seem to have an advantage over bloom filters, but I haven't had time to look into them yet.

[1]: https://github.com/FastFilter/fastfilter_java
[2]: Thomas Mueller Graf, Daniel Lemire, Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters, Journal of Experimental Algorithmics 25 (1), 2020. DOI: 10.1145/3376122
[3]: Thomas Mueller Graf, Daniel Lemire, Binary Fuse Filters: Fast and Smaller Than Xor Filters, Journal of Experimental Algorithmics (to appear). DOI: 10.1145/3510449

@herefree (Contributor, Author)

Thanks for your reply. I have seen the project https://github.com/FastFilter/fastfilter_java before. The Xor filter does not seem to support inserting keys one at a time; you need to provide all the keys up front to build it.

@Tan-JiaLiang (Contributor)

Thanks, @herefree. I glanced briefly at the conclusion of the paper Binary Fuse Filters: Fast and Smaller Than Xor Filters, and you are right.

Xor and binary fuse filters require access to the full set of keys at construction time. In this sense, they are immutable. Alternatives have typically a fixed memory usage and a maximal capacity, but they also allow more flexibility such as progressive construction (adding keys one by one). This observation suggests at least two directions for future research. On the one hand, we could design more flexible algorithms to build binary fuse filters: e.g., so that they become bulk updatable. On the other hand, we could seek to better document applications where immutable probabilistic filters are best suited: e.g., when the filter is sent over a network, bundled with a front-end or used in a multi-threaded environment where locking is undesirable (Voras and Žagar, 2010; Li et al., 2014).

herefree marked this pull request as draft on January 14, 2025 04:12
@herefree (Contributor, Author)

I’ve been a little busy these days, so I’ll continue this PR in the next two weeks.
