Piece Range retrieval #36

Closed
Tracked by #17
xmcai2016 opened this issue Sep 15, 2023 · 1 comment

xmcai2016 commented Sep 15, 2023

Piece range retrieval
Use the pieceCID field of the deal proposal to retrieve the piece from the HTTP endpoint
Make a range retrieval for the first 100 bytes and verify that they contain a valid CAR v1 or v2 header
If it is a CAR v2 header, check data_offset and data_size in the header to calculate how much padding has been used. In the next steps, we only need to perform range retrievals between data_offset and data_offset + data_size
Make a range retrieval at a random offset within that piece, up to 8 MiB in length
Check whether the retrieved data is all zeroes. Over time, this gives a ratio of how much datacap is underutilized because the data is padded with zeroes
Try to find the sequence [varint, CID, block, varint, CID], which indicates a valid IPLD data block. A valid IPLD block is <= 4 MiB, so we should expect to find at least one IPLD data block within that range
Calculate the compression ratio of the block bytes using zstd compression
A high compression ratio (low entropy) means the data is highly repetitive (e.g. repeating "hello world")
A low compression ratio (high entropy) means the data is noisy (e.g. random bytes, or data that is already compressed or encrypted)
Useful data usually has neither extremely high nor extremely low entropy, and its compression ratio can be compared with that of the original data source
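
The zero check and the compression-ratio check described above can be sketched roughly as follows. This is only an illustration: the function names `zero_fraction` and `compression_ratio` are hypothetical, and the default compressor is `zlib` from the Python standard library as a stand-in for zstd, so the absolute ratios will differ from what zstd would report. A zstd compressor can be passed in through the `compress` argument if one is available.

```python
import os
import zlib
from typing import Callable


def zero_fraction(data: bytes) -> float:
    """Return the fraction of bytes in `data` that are zero (0.0 for empty input)."""
    if not data:
        return 0.0
    return data.count(0) / len(data)


def compression_ratio(
    data: bytes,
    compress: Callable[[bytes], bytes] = zlib.compress,  # stand-in for zstd
) -> float:
    """Return len(original) / len(compressed); higher means more repetitive data."""
    if not data:
        raise ValueError("cannot compute a ratio for empty data")
    return len(data) / len(compress(data))


# Three synthetic samples standing in for retrieved range data.
padding = bytes(1024)                  # all zeroes, as with padded pieces
repetitive = b"hello world" * 10_000   # low entropy
noisy = os.urandom(len(repetitive))    # high entropy

for name, sample in [("padding", padding), ("repetitive", repetitive), ("noisy", noisy)]:
    print(name, round(zero_fraction(sample), 3), round(compression_ratio(sample), 2))
```

The thresholds for "too repetitive" or "too noisy" are not specified in this issue and would have to be chosen per dataset.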
The purpose of this retrieval type is to make sure that clients are not padding with too many zeroes or storing data that is not useful. Since the retrieval is lightweight, most retrieval testing will use this kind
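
The header check on the first 100 bytes could look something like the sketch below. The names `read_uvarint`, `inspect_car_prefix` and `range_header` are hypothetical helpers, not part of any existing tool. The CAR v2 branch assumes the fixed layout of an 11-byte pragma followed by a 40-byte header (16 bytes of characteristics, then data offset, data size and index offset as little-endian uint64 values). The CAR v1 branch is only a heuristic that looks for the encoded `version: 1` entry instead of fully decoding the DAG-CBOR header.

```python
import struct
from typing import Tuple

CARV2_PRAGMA = bytes.fromhex("0aa16776657273696f6e02")  # 11-byte CAR v2 pragma
CARV2_HEADER_LEN = 40  # 16 bytes characteristics + 3 little-endian uint64


def range_header(start: int, length: int) -> str:
    """Build an HTTP Range header value; HTTP byte ranges are inclusive."""
    return f"bytes={start}-{start + length - 1}"


def read_uvarint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode an unsigned LEB128 varint, returning (value, next_position)."""
    value, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7
        if shift > 63:
            raise ValueError("varint too long")


def inspect_car_prefix(prefix: bytes) -> dict:
    """Classify the first bytes of a piece as a CAR v1 or v2 header."""
    if prefix.startswith(CARV2_PRAGMA):
        start = len(CARV2_PRAGMA)
        header = prefix[start:start + CARV2_HEADER_LEN]
        if len(header) < CARV2_HEADER_LEN:
            raise ValueError("prefix too short for a CAR v2 header")
        # Skip the 16 characteristics bytes, then read the three offsets/sizes.
        data_offset, data_size, index_offset = struct.unpack_from("<QQQ", header, 16)
        return {"version": 2, "data_offset": data_offset,
                "data_size": data_size, "index_offset": index_offset}
    header_len, pos = read_uvarint(prefix)
    header = prefix[pos:pos + header_len]
    # Heuristic: a two-entry CBOR map containing the text key "version" with value 1.
    if header[:1] == b"\xa2" and b"\x67version\x01" in header:
        return {"version": 1, "header_len": header_len, "data_start": pos + header_len}
    raise ValueError("not a recognisable CAR header")
```

For example, `range_header(0, 100)` gives the value for the initial request, and for a CAR v2 piece the later random ranges would be drawn from within `data_offset` to `data_offset + data_size`.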

@github-project-automation github-project-automation bot moved this to 🍇 Backlog in ActionArena Oct 31, 2023
@stephen-pl stephen-pl moved this from 🍇 Backlog to 🍰 Todo / Commited in ActionArena Oct 31, 2023
@stephen-pl stephen-pl moved this from 🍰 Todo / Commited to 👨‍💻 In Progress in ActionArena Nov 2, 2023
@stephen-pl stephen-pl moved this from 👨‍💻 In Progress to 🍰 Todo / Commited in ActionArena Nov 2, 2023
jcace (Contributor) commented Nov 6, 2023

As described, this ticket is very specific to open dataset retrieval validation, not necessarily other datasets. It is not useful for the more general case of validating the contents of piece range retrieval tests.

@jcace jcace closed this as not planned Nov 6, 2023
@github-project-automation github-project-automation bot moved this from 🍰 Todo / Commited to 🚢 Done in ActionArena Nov 6, 2023