Skip to content

The BTL C/C++ Common bloom filters for bioinformatics projects, as well as any APIs created for other programming languages.

License

Notifications You must be signed in to change notification settings

bcgsc/btl_bloomfilter

Folders and files

NameName
Last commit message
Last commit date

Latest commit

0f19567 · Feb 18, 2021
Jun 26, 2019
Oct 10, 2019
Jun 26, 2019
Feb 18, 2021
Aug 26, 2020
Jun 26, 2019
Jun 26, 2019
Oct 13, 2015
Jul 31, 2020
Jul 19, 2019
Oct 13, 2015
Jul 22, 2019
Jul 18, 2019
Aug 9, 2019
Feb 18, 2021
Aug 26, 2020
Jun 26, 2019
Aug 26, 2020
Oct 13, 2015
Oct 13, 2015
Aug 13, 2019
Feb 18, 2021
Aug 9, 2019
Aug 12, 2019

Repository files navigation

bloomfilter

The BTL C/C++ Common Bloom filters for bioinformatics projects, as well as any APIs created for other programming languages.

Build Status

Build Status

usage example (C++)

Fast Bloom filter loading using the rolling hash function.

#include "BloomFilter.hpp"
#include <vector>
#include <string>
#include "vendor/ntHashIterator.hpp"

using namespace std;

int main(int argc, char** argv)
{
    /* test sequence */
    const string seq = "TAGAATCACCCAAAGA";
    /* k-mer size */
    const unsigned k = 5;
    /* number of Bloom filter hash functions */
    const unsigned numHashes = 4;
    /* size of Bloom filter (in bits) */
    const unsigned size = 1000;
	//Building the filter
	{
		/* init Bloom filter */
		BloomFilter bloom(size, numHashes, k);

		/* init rolling hash state and compute hash values for first k-mer */
		ntHashIterator itr(seq, numHashes, k);
		while (itr != itr.end()) {
			bloom.insert(*itr);
			++itr;
		}
		/* store the bloom filter */
		bloom.storeFilter("filterPathname.bf");
	}

	//After building
	{
		/* load the bloom filter */
		BloomFilter bloom("filterPathname.bf");

		/* query the bloom filter */

		/* init rolling hash state and compute hash values for first k-mer */
		ntHashIterator itr(seq, numHashes, k);
		while (itr != itr.end()) {
			bloom.contains(*itr);
			++itr;
		}
	}
	return 0;
}

Fast Counting Bloom filter loading using the rolling hash function.

#include "CountingBloomFilter.hpp"
#include <vector>
#include <string>
#include "vendor/ntHashIterator.hpp"

using namespace std;

int main(int argc, char** argv)
{
    /* test sequence */
    const string seq = "TAGAATCACCCAAAGA";
    /* k-mer size */
    const unsigned k = 5;
    /* number of Bloom filter hash functions */
    const unsigned numHashes = 4;
    /* size of Bloom filter (in bytes) */
    size_t size = 1000;
    /* counts to threshold bloom filter on*/
    const unsigned threshold = 1;
	//Building the filter
	{
		/* init Bloom filter */
		CountingBloomFilter<uint8_t> bloom(size, numHashes, k, threshold);

		/* init rolling hash state and compute hash values for first k-mer */
		ntHashIterator itr(seq, numHashes, k);
		while (itr != itr.end()) {
			bloom.insert(*itr);
			++itr;
		}
		/* store the bloom filter */
		bloom.storeFilter("filterPathname.bf");
	}

	//After building
	{
		/* load the bloom filter */
		CountingBloomFilter<uint8_t> bloom("filterPathname.bf", threshold);

		/* query the bloom filter */

		/* init rolling hash state and compute hash values for first k-mer */
		ntHashIterator itr(seq, numHashes, k);
		while (itr != itr.end()) {
			bloom.contains(*itr);
			++itr;
		}
	}
	return 0;
}

files

  • BloomFilter.hpp: main Bloom filter class
  • CountingBloomFilter.hpp: Counting Bloom filter class
  • RollingHashIterator.h: Enable rolling hashing on a string
  • RollingHash.h: rolling hash interface (required by RollingHashIterator.h)
  • rolling.h: rolling hash function (required by BloomFilter.hpp and RollingHash.h)
  • Tests/Unit: unit tests
  • Tests/AdHoc: ad-hoc tests

unit tests

The unit tests may be compiled and run with:

$ ./autogen.sh
$ ./configure
$ make check

To see more detailed output for the individual tests, run the binaries in Tests/Unit from the command line. (The ad-hoc tests in Tests/AdHoc may also be run in this way.)

acknowledgements

This projects uses:

  • CATCH unit test framework for C/C++
  • nthash rolling hash implementation by Hamid Mohamadi
  • cpptoml TOML parser and serializer implemented by Chase Geigle

Bloom filter file format

The specification of the Bloom filter file format is as follows:

  1. magic header string
  • Description: bf magic string
  • Type: string
  • Value: BTLBloomFilter_v1 or BTLCountingBloomFilter_v1
  1. header
  • Description: Plain header text
  • Type: string
  • Value:
    • size
      • Description: The size of Bloom filter
      • Type: size_t
      • Value:
    • sizeInBytes
      • Description: The size of Bloom filter in bytes
      • Type: size_t
      • Value:
    • hashNum
      • Description: number of hashes
      • Type: unsigned
      • Value:
    • kmerSize
      • Description: k-mer size
      • Type: unsigned
      • Value:
    • dFPR [optional]
      • Description: desired false positve rate
      • Type: double
      • Value:
    • nEntry [optional]
      • Description: number of entries
      • Type: uint64_t
      • Value:
    • tEntry [optional]
      • Description: total number of entries
      • Type: uint64_t
      • Value:
    • seed [optional] (Not yet implimented)
      • Description: initial seeds for different hashes
      • Type: uint64_t[nhash]
      • Value: [0,1, ..., nhash-1]
    • bitsPerCounter [optional]
      • Description: bits per each counter in the counting bloom filter
      • Type: unsigned
      • Value: 8
  1. filter
  • Description: Bloom filter content
  • Type: uchar[sizeInBytes]
  • Value:

About

The BTL C/C++ Common bloom filters for bioinformatics projects, as well as any APIs created for other programming languages.

Resources

License

Stars

Watchers

Forks

Packages

No packages published