Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BitSet-backed Bloom filter #44

Merged
merged 10 commits into from
Dec 17, 2021
Merged

BitSet-backed Bloom filter #44

merged 10 commits into from
Dec 17, 2021

Conversation

dleppik
Copy link
Contributor

@dleppik dleppik commented Nov 30, 2021

BloomFilter uses a TypedArray-backed BitSet which is 64x more space efficient. Serialization uses base64-arraybuffer, which is the only new dependency.

I added BitSet to api.ts's exports for unit test access. That wasn't my first inclination, but I didn't see a different way to access a class from dist.

I've tried to follow the style of the rest of the classes, with two exceptions. Class properties in BitSet are not backed by _underscored fields, as the TypeScript Handbook explicitly recommends otherwise. (Adding a backing field can be done without an API change, removing one can require more code changes. Omitting the backing field yields, if anything, slightly more efficient JavaScript.) Second, BitSet is not AutoExportable although it could be. I ultimately decided that because it is a low-level data structure, it is better to have the flexibility and compactness of an explicit serializer/deserializer.

BTW, it turns out the problem creating a pull request before was related to authentication problems on my end. Sorry for the inconvenience!

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
…erializing BitSet. BitSet is much more compact than arrays of 1 & 0, but speed is essentially unchanged because the hash is the bottleneck.
@Callidon Callidon requested review from Callidon and folkvir December 1, 2021 07:06
Copy link
Owner

@Callidon Callidon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A pretty good job, congrats 👍 👏
However, there's still some polishing to do. I've left a first batch of comments ;)

src/bloom/bit-set.ts Outdated Show resolved Hide resolved
src/bloom/bit-set.ts Outdated Show resolved Hide resolved
src/bloom/bit-set.ts Show resolved Hide resolved
src/bloom/bit-set.ts Show resolved Hide resolved
src/bloom/bit-set.ts Show resolved Hide resolved
test/bit-set-test.js Outdated Show resolved Hide resolved
test/bit-set-test.js Outdated Show resolved Hide resolved
src/bloom/bit-set.ts Outdated Show resolved Hide resolved
@@ -102,9 +102,7 @@ describe('BloomFilter', () => {
exported._seed.should.equal(filter.seed)
exported.type.should.equal('BloomFilter')
exported._size.should.equal(filter.size)
exported._length.should.equal(filter.length)
exported._nbHashes.should.equal(filter._nbHashes)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove this assertion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filter.length was changed to be backed by this._filter.bitCount(), so exported._length no longer exists.

The other line was erroneously removed. Thanks for catching this!

I think it happened in part because my IDE (WebStorm) complained about accessing a private member directly. The test only succeeds because it is JavaScript using the compiled JavaScript. For now I am putting the line back verbatim, but I would be just as happy to add a public accessor for nbHashes.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Collaborator

@folkvir folkvir Dec 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, public accessors is one of the enhancements we need to do for all structures for inheritance and usage reasons of the classes. Feel free to change the visibility of exported attributes to public. But if you choose to, do it for all!

Comment on lines -130 to -131
{ type: 'BloomFilter', _size: 1, _length: 1, _nbHashes: 2 },
{ type: 'BloomFilter', _size: 1, _length: 1, _nbHashes: 2, seed: 1 }
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove these tests? nbHashes is a parameter of the Bloom Filter and should be allowed in the exports

Copy link
Contributor Author

@dleppik dleppik Dec 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like I meant to remove _length, which no longer exists. This caused the test to pass erroneously, as the base64 decoding doesn't throw an Error.

Actually, if you want to test that all of those fields are required, you want to start with a valid one and check with only field removed at a time. I'll fix that.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Owner

@Callidon Callidon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking the feedback into account 👍
On my opinion, it's good for merging. I just need the review of @folkvir on it, and when it's good for him, we will merge your changes, and release a new version soon after.
Thank you again for your contribution!

Copy link
Collaborator

@folkvir folkvir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few minor changes and good to merge.

src/bloom/bit-set.ts Show resolved Hide resolved
src/bloom/bit-set.ts Outdated Show resolved Hide resolved
add(index: number) {
const wordIndex = Math.floor(index / bitsPerWord)
const mask = 1 << (index % bitsPerWord)
this.array[wordIndex] = this.array[wordIndex] | mask
Copy link

@lovasoa lovasoa Dec 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can index the array just once

Suggested change
this.array[wordIndex] = this.array[wordIndex] | mask
this.array[wordIndex] |= mask

Copy link

@lovasoa lovasoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a bug in the remove method

src/bloom/bit-set.ts Outdated Show resolved Hide resolved
@@ -47,6 +47,7 @@
"typescript": "^3.7.5"
},
"dependencies": {
"base64-arraybuffer": "^1.0.1",
Copy link

@lovasoa lovasoa Dec 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we let the users encode the arraybuffers only if they need to ?

I understand that dealing with strings may be easier in some cases, but today, the web platform allows dealing with raw binary data in most contexts (network communication, file access, webworker messages, ...), and since the stated goal of this new feature is performance and storage size, maybe it would be wiser to let the user access the raw data (and encode it to some encoding if they need to) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a long-term goal, I think a binary format is a great idea. However, there is not currently a drop-in solution for converting JavaScript objects into binary. BSON does not support raw binary data, and Protobuf requires a fair amount of schema setup. This gives us a big win quickly, and is not an unreasonable format if we wish to support multiple serialization formats. Other open source formats support both binary & text modes for convenience (e.g. STL 3D model files.)

Base64 is such a common format on the web that MDN has sample code in its glossary. The new dependency on base64-arraybuffer is only 8k of code.

I don't want to require that users learn implementation details in order to serialize efficiently. Just ditching the Base64 encoding would force users to search our objects for TypedArrays in order to serialize them compactly, and then figure out how to deserialize them in a memory-efficient manner. They would also have to figure out what sort of TypedArrays we're using, avoid storing an ArrayBuffer-backed TypedArray twice, and deal with endian issues if they find anything other than a UInt8Array.

A binary format is great, but it's a bigger conversation.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand what you are saying, but I still feel like it should be done the other way: provide the native implementation now (it's less code), and maybe add the base64 serialization as a usability improvement for those who need it later. Providing the base64 directly blocks the people who need the binary representation, but providing the binary representation still lets the ones who need base64 use it if they need to.

@Callidon what do you think ?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I side with @dleppik on this one. We want users to be able to use the data structures out of the box, without to handle topics like binary representation. For most basic usages, it's fine.
We could provide another export method/API point to allow users to customize their binary representation, but it's outside of the scope of this PR IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lovasoa, I'm not sure what you mean by the native implementation. TypedArrays break JSON serialization. JSON.parse(JSON.stringify(new Uint8Array(1))) yields { '0': 0 }, not a Uint8Array. JavaScript has no native binary format, and we need to store the entire Bloom filter, not just a byte array.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you do a fully binary version, there are several libraries you may want to consider. BSON is a solid choice, but it may require some tweaking with TypedArrays. I've had good luck with Protobuf (Protocol Buffers), although it is definitely not plug-and-play; you need to write a .proto type definition which is geared toward high-volume data transfer. I've never tried MessagePack, but it may be the strongest contender since their sample JavaScript code serializes a UInt8Array.

Copy link

@lovasoa lovasoa Dec 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe just create the byte array directly ? There are only three integers and a byte array to serialize, it may not be worth integrating yet another library.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also metadata to include; that's often what trips up binary formats. I was assuming we'd want to support all the top-level classes, some of which are highly structured, while leveraging the existing reflection classes. That gets error-prone quickly.

You're right that adding another library could increase the footprint dramatically, especially if the classes are directly responsible for serializing themselves. If you're using a 1 MB library to save 0.5 MB of data transfer, it's a net loss. Even if you're using a 1 MB library to save 2 MB of data transfer, it may be a net loss in terms of latency, since loading the library blocks the data load. (I considered that before including base64-arraybuffer.) Anything significantly less than 12kB is a wash due to ethernet frame size.

The cleanest solution is to have the encoding in its own separate package, so that users don't have to download it if they don't want it. That also means one particular format doesn't preclude any other.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to include any metadata. We'll have a magic number to identify the format, the few integers we need, then the byte array. We don't need anything more.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lovasoa I will not open a PR once this is merged, but I will create an issue to track your suggestion. Thus, if you or someone else want to tackle it, they are free to do it. But for more, this topic is outside the scope of this PT.

Copy link
Collaborator

@folkvir folkvir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, all the remaining feedback must be addressed before merging.

src/bloom/bit-set.ts Show resolved Hide resolved
@folkvir folkvir merged commit 8f5bfce into Callidon:master Dec 17, 2021
@folkvir
Copy link
Collaborator

folkvir commented Dec 17, 2021

Merged, thank you @dleppik for your work and thank you @lovasoa, @Callidon

@lovasoa
Copy link

lovasoa commented Dec 31, 2021

@Callidon @folkvir : Would it be possible to push a new version with this code to npm ?

@folkvir
Copy link
Collaborator

folkvir commented Dec 31, 2021

Not until the develop branch will be under review. And this will be the 2.0.X version. Because of breaking changes.

@lovasoa
Copy link

lovasoa commented Feb 6, 2022

Hello everyone :)
Since 2.0 seems to be taking a long time to land, maybe this could be backported to 1.x ?
This is a really useful feature !

@folkvir
Copy link
Collaborator

folkvir commented Feb 7, 2022

Soon soon! Sorry for the delay. New work, very busy and little time to work. I just miss one additional feature, public keywords on everything to let developers customize everything without forking the library. You can follow the develop branch for news.

@folkvir folkvir mentioned this pull request Feb 13, 2022
@Callidon
Copy link
Owner

Callidon commented Feb 18, 2022

@lovasoa @dleppik Hi 👋 We've finally released the 2.0.0 npm version of the package, which skips this feature!
Sorry for the wait, I hope this release meets your expectations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants