Mv bloom size limit
- development started June 5, 2013
- developer tested code checked-in June 11, 2013
2i_slf_stressmix is a common basho_bench test scenario used to verify features in Riak. It has always produced some timeout errors: forty to fifty in a 5 hour test. It was recently discovered that the errors increase into the thousands when the test is executed on a machine with a FusionIO card instead of SATA / SCSI drives.
Bloom filters are stored as part of every .sst file's metadata. They are typically small, less than 20K on a 300 Mbyte file. While reviewing the 2i_slf_stressmix timeouts, it became obvious that files containing mostly 2i data carry bloom filters of 5 to 10 Mbytes. At that size a random "Get" operation takes a long, long time, because all of the metadata (including the bloom filter) must be read from disk and CRC32-verified before the file can be used.
Google does not document the intent behind several sections of its leveldb code. One such section, the grandparent test, caused compactions to create many small files. Why? The apparent impact of the small files was that each consumed a full file handle as accounted by max_open_files, which seemed inefficient. What was missed is that this code was critical to reducing the size of compactions at the next higher level.
Function Compaction::ShouldStopBefore() was modified to take the current key count as a parameter. The key count becomes a second test for whether a newly created .sst file should be terminated before reaching its size limit. The first test, the grandparent test, had been disabled in a previous branch and is now reinstated.
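A minimal sketch of the two-part stop test described above. The real Compaction::ShouldStopBefore() also computes grandparent overlap from file metadata; here that first test is reduced to a precomputed flag, and all names besides the function itself are hypothetical:

```cpp
#include <cstdint>

// Hypothetical hard-coded key-count limit (see discussion below).
const uint64_t kKeyLimit = 75000;

// Sketch: decide whether to close the current output .sst file early.
bool ShouldStopBefore(bool grandparent_overlap_exceeded,
                      uint64_t keys_in_current_output) {
  // First test (grandparent test): too much overlap with the grandparent
  // level would make the next higher level's compaction expensive.
  if (grandparent_overlap_exceeded)
    return true;
  // Second test (key-count test): cap the number of keys so the file's
  // bloom filter, and thus its metadata read cost, stays bounded.
  return keys_in_current_output >= kKeyLimit;
}
```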
The key test is hard coded to 75,000 keys. This number comes from back-calculating how many keys will fit within a 100K bloom filter, 100K being an estimated maximum for a reasonable metadata load during a random Get that forces the read of a closed file. (Yes, there are now plans to better optimize how the code opens an .sst table.)
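The back-calculation can be sketched as follows. The bits-per-key figure is an assumption on my part: Google's stock bloom filter defaults to 10 bits per key, and roughly 11 bits per key reproduces the 75,000 figure from a 100K filter; the actual value used by this branch may differ:

```cpp
#include <cstdint>

// How many keys fit in a bloom filter of the given size, assuming a
// fixed bits-per-key ratio (hypothetical helper, illustration only).
uint64_t KeysPerBloomFilter(uint64_t filter_bytes, uint64_t bits_per_key) {
  return (filter_bytes * 8) / bits_per_key;
}
```

For example, KeysPerBloomFilter(100 * 1024, 11) yields 74,472 keys, which rounds naturally to the hard-coded 75,000.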
The grandparent test was disabled in hopes of better "packing" the file cache. The small files it created caused the file cache, limited by max_open_files, to be poorly utilized, i.e. thrashing occurred. The side effect, longer compactions at the next level, was not understood at the time. Another recent branch completely changed how the accounting for the file cache works: it now counts the bytes allocated by file objects, not the number of file objects. This allows small files to work as efficiently as larger files. The reason to disable the grandparent test is therefore eliminated, and the side effect of better higher level compactions is restored.
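The accounting change can be sketched as a cache that charges each entry by the bytes its file object allocates rather than counting entries against max_open_files. All names here are hypothetical; this is an illustration of the idea, not the branch's implementation:

```cpp
#include <cstdint>

// Sketch: track file cache usage in bytes of metadata (index blocks,
// bloom filter) rather than in number of open file objects, so many
// small files cost no more than one large one.
class FileCacheUsage {
 public:
  explicit FileCacheUsage(uint64_t capacity_bytes)
      : capacity_(capacity_bytes), used_(0) {}

  // Returns true if the file's metadata fits in the remaining budget;
  // a real cache would evict entries instead of refusing outright.
  bool Charge(uint64_t metadata_bytes) {
    if (used_ + metadata_bytes > capacity_) return false;
    used_ += metadata_bytes;
    return true;
  }

  void Release(uint64_t metadata_bytes) { used_ -= metadata_bytes; }
  uint64_t used() const { return used_; }

 private:
  uint64_t capacity_;
  uint64_t used_;
};
```

Under this scheme a dozen tiny files with 20K filters occupy less budget than one file carrying a 5 Mbyte filter, which is exactly why the small files produced by the grandparent test no longer thrash the cache.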
Several strategies were coded; some are part of earlier commits to this branch.