-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add filter
option to de-duplicate blocks in car files
#557
Conversation
Fixes #556 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for submitting this. The code changes look fine to me, but CI is was failing. I would love to get thoughts from @lidel and @achingbrain on this one before proceeding. Additional thoughts below:
Some car specifications
carv1 - https://ipld.io/specs/transport/car/carv1/#duplicate-blocks
carv2 - https://ipld.io/specs/transport/car/carv2/
It doesn't seem specified that duplicate blocks should or should not be included, but the only reason I can think to include them would be issues around https://ipld.io/specs/transport/car/carv1/#determinism.
Maybe carv1 would require duplicate blocks for the final car to be the same as ones folks expect (and to rebuild dags that a CAR may represent), but with v2, it should be okay to remove them?
I'm not sure that we're clear which spec we're attempting to adhere to with @helia/car
.
For the sake of consistency Kubo does not export duplicate blocks. At least from my limited testing. |
Dropping a drive-by comment with some historical context, hopefully its useful and not noise :) What specs say?Last time we looked into this the default behavior is left unspecified, and trustless gateway responses have explicit note about this: Gateway's CAR responses in Kubo and Rainbow (both backed by There is
Kubo/Rainbow use it to signal the response does NOT have duplicates. Sidenote: Ok, but when are duplicates useful?Duplicates have niche utility when you have a light client that has limited memory, so it can't cache blocks for later user, and has to “consume” blocks by unpacking data on the fly, and then discarding them (no blockstore, no cache, even in-memory). Such client would explicitly opt-in for receiving duplicates via On this PRI think the only concern here is that Having the ability to control this potential memory leak at the library level (and disable deduplication, and the memory cost that comes with it) would be nice, but can be added once this becomes an actual and not theoretical problem. Up to Helia maintainers. |
This can be mitigated by using a filter instead of a set, that way we don't need to store every CID encountered. See createScalableCuckooFilter in import { createScalableCuckooFilter } from '@libp2p/utils/filters'
const filter = createScaleableCuckooFilter(maxItems, errorRate)
filter.has(bytes) // returns boolean
filter.add(bytes) // `.has` will probably now return `true` The only thing to consider is the reliability of the filter. False positives are possible but false negatives are not. That is, if The implementation should be structured so that there may be the (very) occasional duplicate block but there should never be a missed block. |
Good to know about the Kubo behaviour too, we should align with that though an |
I need the |
In this PR I am sticking with a set as I need no duplicates in the resulting CAR files and I need that to be consistent always. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to use a filter instead of a set as the set will cause the process to OOM for very large or streaming CAR files.
If making the filter reliable enough for the use case is a concern you may wish to expose a way to configure it or otherwise pass a preconfigured one in, since the size of it will largely be dictated by the size of the DAGs in the CAR being exported.
Can you please also add some tests that ensure there are no regressions around the allowDuplicateBlocks
option.
If you merge main, the current CI error will go away. |
Switched to passing in an external filter to allow the consumer to choose how blocks are filtered without a specific purpose.
|
Are any other changes needed? |
Apologies for the radio silence, this fell by the wayside while prepping for IPFS Camp. Normal service will resume on Monday! |
seems like just a merge from main will fix CI here thanks to #572 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code seems much safer than previous.. two small suggestions.
it('exports a car file without duplicates', async () => { | ||
const otherBlockstore = new MemoryBlockstore() | ||
const otherUnixFS = unixfs({ blockstore: otherBlockstore }) | ||
const otherDatastore = new MemoryDatastore() | ||
const otherMFS = mfs({ blockstore: otherBlockstore, datastore: otherDatastore }) | ||
const otherCar = car({ blockstore: otherBlockstore, dagWalkers }) | ||
|
||
await otherMFS.mkdir('/testDups') | ||
await otherMFS.mkdir('/testDups/sub') | ||
|
||
const sourceCid = await otherUnixFS.addBytes(smallFile) | ||
await otherMFS.cp(sourceCid, '/testDups/a.smallfile') | ||
await otherMFS.cp(sourceCid, '/testDups/sub/b.smallfile') | ||
|
||
const rootObject = await otherMFS.stat('/testDups/') | ||
const rootCid = rootObject.cid | ||
|
||
const writer = memoryCarWriter(rootCid) | ||
const blockFilter = createScalableCuckooFilter(5) | ||
await otherCar.export(rootCid, writer, { | ||
blockFilter | ||
}) | ||
|
||
const carBytes = await writer.bytes() | ||
expect(carBytes.length).to.equal(349) | ||
}) | ||
|
||
it('exports a car file with duplicates', async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@achingbrain should we merge these two tests and export one with filter and one without?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a lot of duplication, it'd be good to either define it as a fixture or put it in a new test file that has a setup block that creates the car exporters, writer, etc.
i think the error at https://github.com/ipfs/helia/actions/runs/10043251811/job/27755609508?pr=557#step:4:183 is caused by multiformats/js-multiformats#300 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, thanks for opening this.
Just a few stylistic nits but LGTM.
it('exports a car file without duplicates', async () => { | ||
const otherBlockstore = new MemoryBlockstore() | ||
const otherUnixFS = unixfs({ blockstore: otherBlockstore }) | ||
const otherDatastore = new MemoryDatastore() | ||
const otherMFS = mfs({ blockstore: otherBlockstore, datastore: otherDatastore }) | ||
const otherCar = car({ blockstore: otherBlockstore, dagWalkers }) | ||
|
||
await otherMFS.mkdir('/testDups') | ||
await otherMFS.mkdir('/testDups/sub') | ||
|
||
const sourceCid = await otherUnixFS.addBytes(smallFile) | ||
await otherMFS.cp(sourceCid, '/testDups/a.smallfile') | ||
await otherMFS.cp(sourceCid, '/testDups/sub/b.smallfile') | ||
|
||
const rootObject = await otherMFS.stat('/testDups/') | ||
const rootCid = rootObject.cid | ||
|
||
const writer = memoryCarWriter(rootCid) | ||
const blockFilter = createScalableCuckooFilter(5) | ||
await otherCar.export(rootCid, writer, { | ||
blockFilter | ||
}) | ||
|
||
const carBytes = await writer.bytes() | ||
expect(carBytes.length).to.equal(349) | ||
}) | ||
|
||
it('exports a car file with duplicates', async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a lot of duplication, it'd be good to either define it as a fixture or put it in a new test file that has a setup block that creates the car exporters, writer, etc.
Co-authored-by: Alex Potsides <[email protected]>
Make blockFilter optional and explain it Co-authored-by: Alex Potsides <[email protected]>
Co-authored-by: Alex Potsides <[email protected]>
Confirmed no errors on my side using the bytes of the multihash once I switched from using a built in Set to using the ScalableCuckooFilter. Built in SET performs comparisons on bytes by reference versus looking at each pointer. No blocker from me anymore |
filter
option to de-duplicate blocks in car files
Title
Skip exporting duplicate blocks in @helia/car
Description
When calling export on an @helia/car instance the yielded export contains duplicate blocks. This PR skips yielding duplicate blocks resulting in compact CAR files.
Notes & open questions
Added a SET to track which CID(s) have already been written to the stream. Might need to add an option to toggle behavior if any users are relying on duplicate blocks in the output of CAR exports.
Change checklist