BitSet-backed Bloom filter #44

dleppik · 2021-11-30T22:12:36Z

BloomFilter uses a TypedArray-backed BitSet which is 64x more space efficient. Serialization uses base64-arraybuffer, which is the only new dependency.

I added BitSet to api.ts's exports for unit test access. That wasn't my first inclination, but I didn't see a different way to access a class from dist.

I've tried to follow the style of the rest of the classes, with two exceptions. Class properties in BitSet are not backed by _underscored fields, as the TypeScript Handbook explicitly recommends otherwise. (Adding a backing field can be done without an API change, removing one can require more code changes. Omitting the backing field yields, if anything, slightly more efficient JavaScript.) Second, BitSet is not AutoExportable although it could be. I ultimately decided that because it is a low-level data structure, it is better to have the flexibility and compactness of an explicit serializer/deserializer.

BTW, it turns out the problem creating a pull request before was related to authentication problems on my end. Sorry for the inconvenience!

…erializing BitSet. BitSet is much more compact than arrays of 1 & 0, but speed is essentially unchanged because the hash is the bottleneck.

Callidon

A pretty good job, congrats 👍 👏
However, there's still some polishing to do. I've left a first batch of comments ;)

src/bloom/bit-set.ts

test/bit-set-test.js

src/bloom/bit-set.ts

Callidon · 2021-12-01T09:57:45Z

test/bloom-filter-test.js

@@ -102,9 +102,7 @@ describe('BloomFilter', () => {
      exported._seed.should.equal(filter.seed)
      exported.type.should.equal('BloomFilter')
      exported._size.should.equal(filter.size)
-      exported._length.should.equal(filter.length)
-      exported._nbHashes.should.equal(filter._nbHashes)


Why did you remove this assertion?

Filter.length was changed to be backed by this._filter.bitCount(), so exported._length no longer exists.

The other line was erroneously removed. Thanks for catching this!

I think it happened in part because my IDE (WebStorm) complained about accessing a private member directly. The test only succeeds because it is JavaScript using the compiled JavaScript. For now I am putting the line back verbatim, but I would be just as happy to add a public accessor for nbHashes.

Actually, public accessors is one of the enhancements we need to do for all structures for inheritance and usage reasons of the classes. Feel free to change the visibility of exported attributes to public. But if you choose to, do it for all!

Callidon · 2021-12-01T09:58:48Z

test/bloom-filter-test.js

-        { type: 'BloomFilter', _size: 1, _length: 1, _nbHashes: 2 },
-        { type: 'BloomFilter', _size: 1, _length: 1, _nbHashes: 2, seed: 1 }


Why did you remove these tests? nbHashes is a parameter of the Bloom Filter and should be allowed in the exports

Looks like I meant to remove _length, which no longer exists. This caused the test to pass erroneously, as the base64 decoding doesn't throw an Error.

Actually, if you want to test that all of those fields are required, you want to start with a valid one and check with only field removed at a time. I'll fix that.

…in bloom-filter-test.js.

Callidon

Thank you for taking the feedback into account 👍
On my opinion, it's good for merging. I just need the review of @folkvir on it, and when it's good for him, we will merge your changes, and release a new version soon after.
Thank you again for your contribution!

folkvir

Just a few minor changes and good to merge.

src/bloom/bit-set.ts

Co-authored-by: A.G <[email protected]>

lovasoa · 2021-12-10T01:45:34Z

src/bloom/bit-set.ts

+    add(index: number) {
+        const wordIndex = Math.floor(index / bitsPerWord)
+        const mask = 1 << (index % bitsPerWord)
+        this.array[wordIndex] = this.array[wordIndex] | mask


You can index the array just once

Suggested change

this.array[wordIndex] = this.array[wordIndex] | mask

this.array[wordIndex] |= mask

lovasoa

there is a bug in the remove method

src/bloom/bit-set.ts

lovasoa · 2021-12-10T02:17:49Z

package.json

@@ -47,6 +47,7 @@
    "typescript": "^3.7.5"
  },
  "dependencies": {
+    "base64-arraybuffer": "^1.0.1",


Could we let the users encode the arraybuffers only if they need to ?

I understand that dealing with strings may be easier in some cases, but today, the web platform allows dealing with raw binary data in most contexts (network communication, file access, webworker messages, ...), and since the stated goal of this new feature is performance and storage size, maybe it would be wiser to let the user access the raw data (and encode it to some encoding if they need to) ?

As a long-term goal, I think a binary format is a great idea. However, there is not currently a drop-in solution for converting JavaScript objects into binary. BSON does not support raw binary data, and Protobuf requires a fair amount of schema setup. This gives us a big win quickly, and is not an unreasonable format if we wish to support multiple serialization formats. Other open source formats support both binary & text modes for convenience (e.g. STL 3D model files.)

Base64 is such a common format on the web that MDN has sample code in its glossary. The new dependency on base64-arraybuffer is only 8k of code.

I don't want to require that users learn implementation details in order to serialize efficiently. Just ditching the Base64 encoding would force users to search our objects for TypedArrays in order to serialize them compactly, and then figure out how to deserialize them in a memory-efficient manner. They would also have to figure out what sort of TypedArrays we're using, avoid storing an ArrayBuffer-backed TypedArray twice, and deal with endian issues if they find anything other than a UInt8Array.

A binary format is great, but it's a bigger conversation.

I understand what you are saying, but I still feel like it should be done the other way: provide the native implementation now (it's less code), and maybe add the base64 serialization as a usability improvement for those who need it later. Providing the base64 directly blocks the people who need the binary representation, but providing the binary representation still lets the ones who need base64 use it if they need to.

@Callidon what do you think ?

I side with @dleppik on this one. We want users to be able to use the data structures out of the box, without to handle topics like binary representation. For most basic usages, it's fine.
We could provide another export method/API point to allow users to customize their binary representation, but it's outside of the scope of this PR IMO.

@lovasoa, I'm not sure what you mean by the native implementation. TypedArrays break JSON serialization. JSON.parse(JSON.stringify(new Uint8Array(1))) yields { '0': 0 }, not a Uint8Array. JavaScript has no native binary format, and we need to store the entire Bloom filter, not just a byte array.

When you do a fully binary version, there are several libraries you may want to consider. BSON is a solid choice, but it may require some tweaking with TypedArrays. I've had good luck with Protobuf (Protocol Buffers), although it is definitely not plug-and-play; you need to write a .proto type definition which is geared toward high-volume data transfer. I've never tried MessagePack, but it may be the strongest contender since their sample JavaScript code serializes a UInt8Array.

Or maybe just create the byte array directly ? There are only three integers and a byte array to serialize, it may not be worth integrating yet another library.

There's also metadata to include; that's often what trips up binary formats. I was assuming we'd want to support all the top-level classes, some of which are highly structured, while leveraging the existing reflection classes. That gets error-prone quickly.

You're right that adding another library could increase the footprint dramatically, especially if the classes are directly responsible for serializing themselves. If you're using a 1 MB library to save 0.5 MB of data transfer, it's a net loss. Even if you're using a 1 MB library to save 2 MB of data transfer, it may be a net loss in terms of latency, since loading the library blocks the data load. (I considered that before including base64-arraybuffer.) Anything significantly less than 12kB is a wash due to ethernet frame size.

The cleanest solution is to have the encoding in its own separate package, so that users don't have to download it if they don't want it. That also means one particular format doesn't preclude any other.

I don't think we need to include any metadata. We'll have a magic number to identify the format, the few integers we need, then the byte array. We don't need anything more.

@lovasoa I will not open a PR once this is merged, but I will create an issue to track your suggestion. Thus, if you or someone else want to tackle it, they are free to do it. But for more, this topic is outside the scope of this PT.

folkvir

Please, all the remaining feedback must be addressed before merging.

src/bloom/bit-set.ts

folkvir · 2021-12-17T08:26:55Z

Merged, thank you @dleppik for your work and thank you @lovasoa, @Callidon

lovasoa · 2021-12-31T09:10:20Z

@Callidon @folkvir : Would it be possible to push a new version with this code to npm ?

folkvir · 2021-12-31T09:19:09Z

Not until the develop branch will be under review. And this will be the 2.0.X version. Because of breaking changes.

lovasoa · 2022-02-06T22:25:31Z

Hello everyone :)
Since 2.0 seems to be taking a long time to land, maybe this could be backported to 1.x ?
This is a really useful feature !

folkvir · 2022-02-07T13:54:18Z

Soon soon! Sorry for the delay. New work, very busy and little time to work. I just miss one additional feature, public keywords on everything to let developers customize everything without forking the library. You can follow the develop branch for news.

Callidon · 2022-02-18T09:47:04Z

@lovasoa @dleppik Hi 👋 We've finally released the 2.0.0 npm version of the package, which skips this feature!
Sorry for the wait, I hope this release meets your expectations.

dleppik added 2 commits November 24, 2021 17:37

Add BitSet with tests.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23
Expired
Learn about vigilant mode

7dfa3c1

Bloom filter uses BitSet. Adds dependency on base64-arraybuffer for s…

2c70611

…erializing BitSet. BitSet is much more compact than arrays of 1 & 0, but speed is essentially unchanged because the hash is the bottleneck.

Callidon requested review from Callidon and folkvir December 1, 2021 07:06

Callidon requested changes Dec 1, 2021

View reviewed changes

dleppik added 5 commits December 1, 2021 18:40

BitSetData.base64 renamed content. More BitSet import/export tests.

a0b7904

Fixed #saveAsJSON tests.

3832f3a

Added optional seed parameter to BoomFilter.from(), since it is used …

cf1a0bb

…in bloom-filter-test.js.

Fixed invalid argument in filter.has.

29659ec

Style fixes: added documentation and terminating newline.

0aa8c47

Callidon approved these changes Dec 2, 2021

View reviewed changes

folkvir requested changes Dec 7, 2021

View reviewed changes

src/bloom/bit-set.ts Show resolved Hide resolved

src/bloom/bit-set.ts Outdated Show resolved Hide resolved

dleppik and others added 2 commits December 7, 2021 07:58

Update src/bloom/bit-set.ts

2ea5f42

Co-authored-by: A.G <[email protected]>

Style fixes: expanded doc comments.

cffb44a

lovasoa reviewed Dec 10, 2021

View reviewed changes

lovasoa suggested changes Dec 10, 2021

View reviewed changes

src/bloom/bit-set.ts Outdated Show resolved Hide resolved

lovasoa reviewed Dec 10, 2021

View reviewed changes

Removed buggy, unused BitSet.prototype.remove.

95ce810

folkvir requested changes Dec 13, 2021

View reviewed changes

src/bloom/bit-set.ts Show resolved Hide resolved

folkvir approved these changes Dec 14, 2021

View reviewed changes

folkvir merged commit 8f5bfce into Callidon:master Dec 17, 2021

lovasoa mentioned this pull request Jan 8, 2022

Optimisation de la gestion de la blacklist lovasoa/sanipasse#162

Open

folkvir mentioned this pull request Feb 13, 2022

road to 2.0.0 #52

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BitSet-backed Bloom filter #44

BitSet-backed Bloom filter #44

dleppik commented Nov 30, 2021 •

edited

Loading

Callidon left a comment •

edited

Loading

Callidon Dec 1, 2021

dleppik Dec 1, 2021

Callidon Dec 2, 2021

folkvir Dec 2, 2021 •

edited

Loading

Callidon Dec 1, 2021

dleppik Dec 1, 2021 •

edited

Loading

Callidon Dec 2, 2021

Callidon left a comment

folkvir left a comment

lovasoa Dec 10, 2021 •

edited

Loading

lovasoa left a comment

lovasoa Dec 10, 2021 •

edited

Loading

dleppik Dec 10, 2021

lovasoa Dec 10, 2021

Callidon Dec 10, 2021

dleppik Dec 10, 2021

dleppik Dec 10, 2021

lovasoa Dec 10, 2021 •

edited

Loading

dleppik Dec 10, 2021

lovasoa Dec 10, 2021

Callidon Dec 11, 2021

folkvir left a comment

folkvir commented Dec 17, 2021

lovasoa commented Dec 31, 2021

folkvir commented Dec 31, 2021

lovasoa commented Feb 6, 2022

folkvir commented Feb 7, 2022

Callidon commented Feb 18, 2022 •

edited

Loading

		{ type: 'BloomFilter', _size: 1, _length: 1, _nbHashes: 2 },
		{ type: 'BloomFilter', _size: 1, _length: 1, _nbHashes: 2, seed: 1 }

	this.array[wordIndex] = this.array[wordIndex] \| mask
	this.array[wordIndex] \|= mask

BitSet-backed Bloom filter #44

BitSet-backed Bloom filter #44

Conversation

dleppik commented Nov 30, 2021 • edited Loading

Callidon left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

folkvir Dec 2, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dleppik Dec 1, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Callidon left a comment

Choose a reason for hiding this comment

folkvir left a comment

Choose a reason for hiding this comment

lovasoa Dec 10, 2021 • edited Loading

Choose a reason for hiding this comment

lovasoa left a comment

Choose a reason for hiding this comment

lovasoa Dec 10, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lovasoa Dec 10, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

folkvir left a comment

Choose a reason for hiding this comment

folkvir commented Dec 17, 2021

lovasoa commented Dec 31, 2021

folkvir commented Dec 31, 2021

lovasoa commented Feb 6, 2022

folkvir commented Feb 7, 2022

Callidon commented Feb 18, 2022 • edited Loading

dleppik commented Nov 30, 2021 •

edited

Loading

Callidon left a comment •

edited

Loading

folkvir Dec 2, 2021 •

edited

Loading

dleppik Dec 1, 2021 •

edited

Loading

lovasoa Dec 10, 2021 •

edited

Loading

lovasoa Dec 10, 2021 •

edited

Loading

lovasoa Dec 10, 2021 •

edited

Loading

Callidon commented Feb 18, 2022 •

edited

Loading