-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BitSet-backed Bloom filter #44
Conversation
…erializing BitSet. BitSet is much more compact than arrays of 1 & 0, but speed is essentially unchanged because the hash is the bottleneck.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A pretty good job, congrats 👍 👏
However, there's still some polishing to do. I've left a first batch of comments ;)
@@ -102,9 +102,7 @@ describe('BloomFilter', () => { | |||
exported._seed.should.equal(filter.seed) | |||
exported.type.should.equal('BloomFilter') | |||
exported._size.should.equal(filter.size) | |||
exported._length.should.equal(filter.length) | |||
exported._nbHashes.should.equal(filter._nbHashes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you remove this assertion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filter.length was changed to be backed by this._filter.bitCount(), so exported._length no longer exists.
The other line was erroneously removed. Thanks for catching this!
I think it happened in part because my IDE (WebStorm) complained about accessing a private member directly. The test only succeeds because it is JavaScript using the compiled JavaScript. For now I am putting the line back verbatim, but I would be just as happy to add a public accessor for nbHashes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, public accessors is one of the enhancements we need to do for all structures for inheritance and usage reasons of the classes. Feel free to change the visibility of exported attributes to public. But if you choose to, do it for all!
{ type: 'BloomFilter', _size: 1, _length: 1, _nbHashes: 2 }, | ||
{ type: 'BloomFilter', _size: 1, _length: 1, _nbHashes: 2, seed: 1 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you remove these tests? nbHashes
is a parameter of the Bloom Filter and should be allowed in the exports
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like I meant to remove _length, which no longer exists. This caused the test to pass erroneously, as the base64 decoding doesn't throw an Error.
Actually, if you want to test that all of those fields are required, you want to start with a valid one and check with only field removed at a time. I'll fix that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
…in bloom-filter-test.js.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for taking the feedback into account 👍
On my opinion, it's good for merging. I just need the review of @folkvir on it, and when it's good for him, we will merge your changes, and release a new version soon after.
Thank you again for your contribution!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few minor changes and good to merge.
Co-authored-by: A.G <[email protected]>
add(index: number) { | ||
const wordIndex = Math.floor(index / bitsPerWord) | ||
const mask = 1 << (index % bitsPerWord) | ||
this.array[wordIndex] = this.array[wordIndex] | mask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can index the array just once
this.array[wordIndex] = this.array[wordIndex] | mask | |
this.array[wordIndex] |= mask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a bug in the remove method
@@ -47,6 +47,7 @@ | |||
"typescript": "^3.7.5" | |||
}, | |||
"dependencies": { | |||
"base64-arraybuffer": "^1.0.1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we let the users encode the arraybuffers only if they need to ?
I understand that dealing with strings may be easier in some cases, but today, the web platform allows dealing with raw binary data in most contexts (network communication, file access, webworker messages, ...), and since the stated goal of this new feature is performance and storage size, maybe it would be wiser to let the user access the raw data (and encode it to some encoding if they need to) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a long-term goal, I think a binary format is a great idea. However, there is not currently a drop-in solution for converting JavaScript objects into binary. BSON does not support raw binary data, and Protobuf requires a fair amount of schema setup. This gives us a big win quickly, and is not an unreasonable format if we wish to support multiple serialization formats. Other open source formats support both binary & text modes for convenience (e.g. STL 3D model files.)
Base64 is such a common format on the web that MDN has sample code in its glossary. The new dependency on base64-arraybuffer is only 8k of code.
I don't want to require that users learn implementation details in order to serialize efficiently. Just ditching the Base64 encoding would force users to search our objects for TypedArrays in order to serialize them compactly, and then figure out how to deserialize them in a memory-efficient manner. They would also have to figure out what sort of TypedArrays we're using, avoid storing an ArrayBuffer-backed TypedArray twice, and deal with endian issues if they find anything other than a UInt8Array.
A binary format is great, but it's a bigger conversation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand what you are saying, but I still feel like it should be done the other way: provide the native implementation now (it's less code), and maybe add the base64 serialization as a usability improvement for those who need it later. Providing the base64 directly blocks the people who need the binary representation, but providing the binary representation still lets the ones who need base64 use it if they need to.
@Callidon what do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I side with @dleppik on this one. We want users to be able to use the data structures out of the box, without to handle topics like binary representation. For most basic usages, it's fine.
We could provide another export method/API point to allow users to customize their binary representation, but it's outside of the scope of this PR IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lovasoa, I'm not sure what you mean by the native implementation. TypedArrays break JSON serialization. JSON.parse(JSON.stringify(new Uint8Array(1)))
yields { '0': 0 }
, not a Uint8Array. JavaScript has no native binary format, and we need to store the entire Bloom filter, not just a byte array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you do a fully binary version, there are several libraries you may want to consider. BSON is a solid choice, but it may require some tweaking with TypedArrays. I've had good luck with Protobuf (Protocol Buffers), although it is definitely not plug-and-play; you need to write a .proto type definition which is geared toward high-volume data transfer. I've never tried MessagePack, but it may be the strongest contender since their sample JavaScript code serializes a UInt8Array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe just create the byte array directly ? There are only three integers and a byte array to serialize, it may not be worth integrating yet another library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's also metadata to include; that's often what trips up binary formats. I was assuming we'd want to support all the top-level classes, some of which are highly structured, while leveraging the existing reflection classes. That gets error-prone quickly.
You're right that adding another library could increase the footprint dramatically, especially if the classes are directly responsible for serializing themselves. If you're using a 1 MB library to save 0.5 MB of data transfer, it's a net loss. Even if you're using a 1 MB library to save 2 MB of data transfer, it may be a net loss in terms of latency, since loading the library blocks the data load. (I considered that before including base64-arraybuffer.) Anything significantly less than 12kB is a wash due to ethernet frame size.
The cleanest solution is to have the encoding in its own separate package, so that users don't have to download it if they don't want it. That also means one particular format doesn't preclude any other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to include any metadata. We'll have a magic number to identify the format, the few integers we need, then the byte array. We don't need anything more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lovasoa I will not open a PR once this is merged, but I will create an issue to track your suggestion. Thus, if you or someone else want to tackle it, they are free to do it. But for more, this topic is outside the scope of this PT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, all the remaining feedback must be addressed before merging.
Not until the develop branch will be under review. And this will be the 2.0.X version. Because of breaking changes. |
Hello everyone :) |
Soon soon! Sorry for the delay. New work, very busy and little time to work. I just miss one additional feature, public keywords on everything to let developers customize everything without forking the library. You can follow the develop branch for news. |
BloomFilter uses a TypedArray-backed BitSet which is 64x more space efficient. Serialization uses base64-arraybuffer, which is the only new dependency.
I added BitSet to api.ts's exports for unit test access. That wasn't my first inclination, but I didn't see a different way to access a class from dist.
I've tried to follow the style of the rest of the classes, with two exceptions. Class properties in BitSet are not backed by _underscored fields, as the TypeScript Handbook explicitly recommends otherwise. (Adding a backing field can be done without an API change, removing one can require more code changes. Omitting the backing field yields, if anything, slightly more efficient JavaScript.) Second, BitSet is not AutoExportable although it could be. I ultimately decided that because it is a low-level data structure, it is better to have the flexibility and compactness of an explicit serializer/deserializer.
BTW, it turns out the problem creating a pull request before was related to authentication problems on my end. Sorry for the inconvenience!