
FRC: Retrieval Checking Requirements #1089

Open · wants to merge 9 commits into base: master

Conversation

@bajtos commented Dec 4, 2024

When we set out to build Spark, a protocol for testing whether the payload of Filecoin deals can be retrieved, we designed it based on how Boost worked at that time (mid-2023). Soon after FIL+ allocator compliance started to use the Spark retrieval success score (Spark RSR) in mid-2024, we learned that Venus Droplet, an alternative miner software, is implemented slightly differently and requires tweaks to support Spark. Things have evolved quite a bit since then. We need to overhaul most of the Spark protocol to support Direct Data Onboarding deals. We will need all miner software projects (Boost, Curio, Venus) to accommodate the new requirements imposed by the upcoming Spark v2 release.

This FRC has the following goals:

  1. Document the retrieval process based on IPFS/IPLD.
  2. Specify what Spark needs from miner software.
  3. Collaborate with the community to tweak the requirements to work well for all parties involved.
  4. Let this spec and the building blocks like IPNI Reverse Index empower other builders to design & implement their own retrieval-checking networks as alternatives to Spark.

Discussion

#1086

Progress

  • Simple Summary
  • Abstract
  • Change Motivation
  • Specification
  • Design Rationale
  • Backwards Compatibility
  • Test Cases
  • Security Considerations
  • Incentive Considerations
  • Product Considerations
  • Implementation
  • TODO


bajtos commented Dec 4, 2024

Tagging @steven004 @LexLuthr @magik6k @masih @willscott @juliangruber @patrickwoodhead for visibility.


#### Link on-chain MinerId and IPNI provider identity

Storage providers are required to use the same libp2p peer ID for their block-chain identity as returned by `Filecoin.StateMinerInfo` and for the index provider identity used when communicating with IPNI instances like [cid.contact](https://cid.contact).
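A hedged sketch of how a retrieval checker might verify this linkage. The `PeerId` field of the `Filecoin.StateMinerInfo` JSON-RPC response and the `AddrInfo.ID` field of the cid.contact `/providers` response are assumed shapes; treat this as illustration, not a normative check.

```python
# Sketch (not normative): verify that a miner's on-chain PeerID matches
# the identity it uses when advertising to IPNI. The response shapes
# below are assumptions modeled on lotus and cid.contact.

def peer_ids_linked(miner_info: dict, ipni_provider: dict) -> bool:
    """Return True when the on-chain PeerId equals the IPNI provider ID."""
    on_chain = miner_info.get("PeerId")                 # Filecoin.StateMinerInfo
    ipni = ipni_provider.get("AddrInfo", {}).get("ID")  # cid.contact /providers
    return on_chain is not None and on_chain == ipni
```

Note that, as discussed below, this check only tests equality of the two IDs; it does not require the PeerID to be unique to one miner.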


This requirement cannot be fulfilled in Curio. We no longer have a concept of a minerID <> unique peerID binding. IPNI must be extended to support other key types, like the worker key, to sign ads.

@bajtos (Author):

I am aware of that; see the note in the text below this paragraph.

> [!NOTE]
> This is open to extensions in the future, we can support more than one form of linking
> index-provides to filecoin-miners. See e.g. [ipni/spec#33](https://github.com/ipni/specs/issues/33).

From my point of view, I prefer not to block progress on this FRC until the Curio team figures out how to extend IPNI to support other key types. Instead, I'd like this FRC to document the solution that works with Boost & Venus now and then enhance it with the new mechanism Curio needs once that new solution is agreed on.

@lanzafame (Contributor):

I am probably mistaken here but Droplet (Venus' Boost) supports multiple minerIDs being associated with a single PeerID (see docs), does that mean if I am using Droplet, I need to limit myself to a 1:1 relationship to meet this requirement?

@bajtos (Author):

Great call, @lanzafame! I am still learning more about how Venus Droplet works and what features it offers.

Based on the docs you linked to, I believe you can have multiple minerIDs associated with a single Droplet PeerID and still meet this requirement.

In Spark, we need the PeerID returned by Filecoin.StateMinerInfo to match the PeerID used in IPNI advertisements. Spark does not check whether that PeerID is unique or shared by multiple miners.

@bajtos (Author):

Referencing for visibility - we are discussing a possible solution for this problem here:

filecoin-project/curio#377

@bajtos marked this pull request as ready for review December 18, 2024 12:56
Signed-off-by: Miroslav Bajtoš <[email protected]>
@jsoares (Member) left a comment:

Left a few editorial comments. I do not know enough about the specific topic to be able to opine on a technical level. I also found the explanation somewhat unclear, but that could be a consequence of my lack of knowledge, so not holding that against the draft.

Others will be better suited to provide a full review.

Comment on lines +37 to +43
When we set out to build [Spark](https://filspark.com), a protocol for testing whether _payload_ of Filecoin deals can be retrieved back, we designed it based on how [Boost](https://github.com/filecoin-project/boost) worked at that time (mid-2023). Soon after FIL+ allocator compliance started to use Spark retrieval success score (Spark RSR) in mid-2024, we learned that [Venus](https://github.com/filecoin-project/venus) [Droplet](https://github.com/ipfs-force-community/droplet), an alternative miner software, is implemented slightly differently and requires tweaks to support Spark. Things evolved quite a bit since then. We need to overhaul most of the Spark protocol to support Direct Data Onboarding deals. We will need all miner software projects (Boost, Curio, Venus) to accommodate the new requirements imposed by the upcoming Spark v2 release.

This FRC has the following goals:
1. Document the retrieval process based on IPFS/IPLD.
2. Specify what Spark needs from miner software.
3. Collaborate with the community to tweak the requirements to work well for all parties involved.
4. Let this spec and the building blocks like [IPNI Reverse Index](https://github.com/filecoin-project/devgrants/issues/1781) empower other builders to design & implement their own retrieval-checking networks as alternatives to Spark.
Member:

This reads more like motivation than an abstract. It'd be useful for the abstract to summarise the actual requirements/spec.

3. Map `(PieceCID, PieceSize)` to IPNI `ContextID` value.
4. Query IPNI reverse index for a sample of payload blocks advertised by `ProviderID` with
`ContextID` (see the [proposed API
spec](https://github.com/ipni/xedni/blob/526f90f5a6001cb50b52e6376f8877163f8018af/openapi.yaml)).
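For illustration only: the FRC defines the exact `ContextID` byte layout, so the encoding below — an 8-byte big-endian `PieceSize` prefix followed by the raw `PieceCID` bytes — is an assumption used to make the mapping in step 3 concrete.

```python
# Hypothetical sketch of step 3: derive an IPNI ContextID from
# (PieceCID, PieceSize). The byte layout here is an assumption for
# illustration; consult the FRC text for the normative encoding.

def context_id(piece_cid_bytes: bytes, piece_size: int) -> bytes:
    """Map (PieceCID, PieceSize) to a deterministic ContextID value."""
    # 8-byte big-endian PieceSize, then the raw binary PieceCID
    return piece_size.to_bytes(8, "big") + piece_cid_bytes
```

Because the mapping is deterministic, any checker that knows a deal's `(PieceCID, PieceSize)` pair can reconstruct the `ContextID` and use it in the reverse-index query of step 4.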
Member:

Should this be in the FRC or is it out of scope? The link is fine, but trying to understand whether we see it as central.


#### Link on-chain MinerId and IPNI provider identity

Storage providers are requires to use the same libp2p peer ID for their block-chain identity as returned by `Filecoin.StateMinerInfo` and for the index provider identity used when communicating with IPNI instances like [cid.contact](https://cid.contact).
Member:

Suggested change
Storage providers are requires to use the same libp2p peer ID for their block-chain identity as returned by `Filecoin.StateMinerInfo` and for the index provider identity used when communicating with IPNI instances like [cid.contact](https://cid.contact).
Storage providers are required to use the same libp2p peer ID for their block-chain identity as returned by `Filecoin.StateMinerInfo` and for the index provider identity used when communicating with IPNI instances like [cid.contact](https://cid.contact).

}
```

IPNI provider status ([query](https://cid.contact/providers/12D3KooWPNbkEgjdBNeaCGpsgCrPRETe4uBZf1ShFXStobdN18ys)):
Member:

Not a huge fan of these arbitrary links on a document that's intended to be frozen for a long time.

Comment on lines +268 to +274
1. It's inefficient.

1. Each retrieval check requires two requests - one to download ~8MB chunk of a piece, the second one to download the payload block found in that chunk.

1. Spark typically repeats every retrieval check 40-100 times. Scanning CAR byte range 40-100 times does not bring enough value to justify the network bandwidth & CPU cost.

1. It's not clear how retrieval checkers can discover the address where the SP serves piece retrievals.
Member:

The 1 numbered list renders fine, but is not great for reading in raw md.

Member:

Just like this FRC is dictating a convention, you could dictate a convention that at least one of the on-chain Multiaddrs stored in the miner actor of the SP is where you could retrieve from.

Comment on lines +288 to +290
[Retrieval Checking Requirements](#retrieval-checking-requirements) introduce the following breaking changes:
- Miner software must construct IPNI `ContextID` values in a specific way.
- Because such ContextIDs are scoped per piece (not per deal), miner software must de-duplicate advertisements for deals storing the same piece.
Member:

Just to be clear, and given this is an FRC, what are we breaking exactly?

Comment on lines +24 to +31
To make Filecoin a usable data storage offering, we need the content to be retrievable. It's difficult to improve what you don't measure; therefore, we need to measure the quality of retrieval service provided by each storage provider. To allow 3rd-party networks like [Spark](https://filspark.com) to sample active deals from the on-chain activity and check whether the SP is serving retrievals for the stored content, we need SPs to meet the following requirements:

1. Link on-chain MinerId and IPNI provider identity ([spec](#link-on-chain-minerid-and-ipni-provider-identity)).
2. Provide retrieval service using the [IPFS Trustless HTTP Gateway protocol](https://specs.ipfs.tech/http-gateways/trustless-gateway/).
3. Advertise retrievals to IPNI.
4. In IPNI advertisements, construct the `ContextID` field from `(PieceCID, PieceSize)` ([spec](#construct-ipni-contextid-from-piececid-piecesize))

Meeting these requirements needs support in software implementations like Boost, Curio & Venus Droplet but potentially also updates in settings configured by the individual SPs.
Member:

Suggested change
To make Filecoin a usable data storage offering, we need the content to be retrievable. It's difficult to improve what you don't measure; therefore, we need to measure the quality of retrieval service provided by each storage provider. To allow 3rd-party networks like [Spark](https://filspark.com) to sample active deals from the on-chain activity and check whether the SP is serving retrievals for the stored content, we need SPs to meet the following requirements:
1. Link on-chain MinerId and IPNI provider identity ([spec](#link-on-chain-minerid-and-ipni-provider-identity)).
2. Provide retrieval service using the [IPFS Trustless HTTP Gateway protocol](https://specs.ipfs.tech/http-gateways/trustless-gateway/).
3. Advertise retrievals to IPNI.
4. In IPNI advertisements, construct the `ContextID` field from `(PieceCID, PieceSize)` ([spec](#construct-ipni-contextid-from-piececid-piecesize))
Meeting these requirements needs support in software implementations like Boost, Curio & Venus Droplet but potentially also updates in settings configured by the individual SPs.
To make Filecoin a usable data storage offering, we need the content to be retrievable. It's difficult to improve what you don't measure; therefore, we need to measure the quality of retrieval service provided by each storage provider. This FRC outlines requirements that SPs and their software stacks should meet to allow 3rd-party networks to sample active deals from the on-chain activity and check whether the SP is serving retrievals for the stored content.

The goal here is not to go into technical detail. I left a non-binding suggestion; something along these lines would be preferable.

The content in 26-29 would potentially be a good fit for the abstract; see comment below.


### Retrieval Requirements

1. Whenever a deal is activated, the SP MUST advertise all IPFS/IPLD payload block CIDs found in the Piece to IPNI. See the [IPNI Specification](https://github.com/ipni/specs/blob/main/IPNI.md) and [IPNI HTTP Provider](https://github.com/ipni/specs/blob/main/IPNI_HTTP_PROVIDER.md) for technical details.
Contributor:

Can a client opt out of their deal payload being indexed?

Member:

It'll depend on the market software, Boost should support optionality here (see) and I imagine Curio will follow suit because there's already been demonstrated demand for "private" deals.

Contributor:

this is all in the context of data that's claimed to be 'retrievable'

We're not defining a standard here for signalling between client and SP / deal making for how a client communicates to an SP that data should be retrievable or not.


bajtos commented Jan 23, 2025

Thank you for the feedback! I'll take a look and respond to your comments (early) next week.


<!--"If you can't explain it simply, you don't understand it well enough." Provide a simplified and layman-accessible explanation of the FIP.-->

To make Filecoin a usable data storage offering, we need the content to be retrievable. It's difficult to improve what you don't measure; therefore, we need to measure the quality of retrieval service provided by each storage provider. To allow 3rd-party networks like [Spark](https://filspark.com) to sample active deals from the on-chain activity and check whether the SP is serving retrievals for the stored content, we need SPs to meet the following requirements:
Member:

My quibble with this paragraph is that you're asserting a need for "3rd-party networks", yet the structure you outline below isn't the only way to achieve the goal you've stated (because it's a "need"). This is more like Spark's preferred path, which is fine, it's just not a generalisable need. Maybe it'd be best to not give the impression to the reader that this is the only way to do this.

I'd prefer a more direct approach here: To allow Spark to sample deals, we need the following to be true.

It's nice to have a standard that's specified via FRC but it's not like we have a queue of retrieval checkers waiting for such a standard in order to get going.

1. Let's assume the maximum CAR block size is 4 MB and we have deal's `PieceCID` and `PieceSize`.
2. Pick a random offset $o$ in the piece so that $0 <= o <= PieceSize - 2*4 MB$.
3. Send an HTTP range-retrieval request to retrieve the bytes in the range `(o, o+2*4MB)`.
4. Parse the bytes to find a sequence that looks like a CARv1 block header.
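Steps 1–3 above can be sketched as follows, with the 4 MB maximum CAR block size taken from step 1 as a working assumption:

```python
# Sketch of steps 1-3: pick a random window spanning two maximum-size CAR
# blocks inside the piece, then build the HTTP range request for it.
import random

MAX_BLOCK = 4 * 1024 * 1024  # assumed maximum CAR block size (step 1)


def sample_range(piece_size: int, rng=random):
    """Return (start, end) byte offsets of a 2*4MB window in the piece."""
    window = 2 * MAX_BLOCK
    assert piece_size >= window, "piece too small to sample"
    # step 2: 0 <= o <= PieceSize - 2*4MB (inclusive upper bound)
    start = rng.randrange(0, piece_size - window + 1)
    return start, start + window


def range_header(start: int, end: int) -> dict:
    # step 3: HTTP Range is inclusive on both ends
    return {"Range": f"bytes={start}-{end - 1}"}
```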
Member:

A sequence that looks like a CARv1 will either start at offset 0, or be discoverable from a CARv2 header (which is 51 bytes long, starting at offset 0), or alternatively be a PoDSI container, which should also be discoverable from offset 0.
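A minimal sketch of the CARv2 case described above: the fixed-size prelude (11-byte pragma, 16-byte characteristics bitfield, then three little-endian uint64 offsets, 51 bytes in total per the CARv2 spec) locates the inner CARv1 without any scanning.

```python
# Sketch: parse the fixed 51-byte CARv2 prelude to locate the inner
# CARv1 payload deterministically, instead of scanning from a random
# offset.
import struct

CARV2_PRAGMA = bytes.fromhex("0aa16776657273696f6e02")  # dag-cbor {"version": 2}


def parse_carv2_header(buf: bytes):
    """Return (data_offset, data_size, index_offset) of the inner CARv1."""
    assert buf[:11] == CARV2_PRAGMA, "not a CARv2 file"
    # skip the 16-byte characteristics bitfield, then read three LE uint64s
    data_off, data_size, idx_off = struct.unpack_from("<QQQ", buf, 11 + 16)
    return data_off, data_size, idx_off
```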

Comment on lines +263 to +264
5. Extract the CID from the CARv1 block header.
6. Hash the block's payload bytes and verify that the digest equals to the CID.
Member:

CID in a CARv1 block header isn't a CID of the payload, it's typically (but not always, and may even be omitted) the root of the entire DAG within the CAR.

But the approach here could be generalised as something like the following: download a small chunk from the start of a piece, determine whether it's readable as either a CARv2, CARv1 or PoDSI container, then use that knowledge to read further sections of the containing CAR(s) to find CIDs to retrieve and build a catalog by progressive byte-range scanning of CAR section headers.

It's true that this is inefficient and would require a stored state to build up knowledge of a piece, but it is possible to do an IPLD block discovery by doing many, progressive, and small, piece retrievals if the SP is exposing the /piece/ endpoint.

You're still back at the problem of trusting the SP that the "piece" they are serving you corresponds to the PieceCID which you trust. As long as the SP is responsible for serving you the response or reporting CIDs to IPNI, you're at the mercy of the SP to tell you what a piece contains rather than the client who ought to have a canonical mapping of piece->blocks (or at least did have at the beginning). Or, to be properly trustless, downloading the entire piece and verifying the PieceCID matches the piece they gave you, but I guess you have to rule this option out and therefore be trusting the SP to play nice.
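The distinction made above — the bytes are verified against the CID in each length-prefixed CAR *section*, not against the root CID in the CARv1 header — can be sketched for the common case of a CIDv1 with a sha2-256 multihash (other CID versions, codecs, and hash functions are out of scope for this sketch):

```python
# Sketch: parse one CARv1 section (varint length, CID, block bytes) and
# verify the block bytes hash to the section's own CID. Covers only
# CIDv1 + sha2-256 (multihash code 0x12).
import hashlib


def read_varint(buf: bytes, pos: int):
    """Decode an unsigned LEB128 varint; return (value, next_pos)."""
    shift = value = 0
    while True:
        b = buf[pos]; pos += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, pos
        shift += 7


def verify_section(buf: bytes, pos: int = 0) -> bool:
    """Parse one CAR section and check its block bytes against its CID."""
    length, pos = read_varint(buf, pos)    # section length (CID + data)
    end = pos + length
    version, pos = read_varint(buf, pos)   # CID version (expect 1)
    codec, pos = read_varint(buf, pos)     # multicodec of the block data
    hash_code, pos = read_varint(buf, pos) # multihash function code
    digest_len, pos = read_varint(buf, pos)
    digest = buf[pos:pos + digest_len]; pos += digest_len
    block = buf[pos:end]
    return hash_code == 0x12 and hashlib.sha256(block).digest() == digest
```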

Contributor:

we should extend range reads over pieces to include the intermediate proof tree (which may already be in the PoDSI index, or can be generated on the side like the CID index) to allow validation that the range is indeed the expected sub-section of the overall PieceCID


_We need the server to return an inclusion proof up to the PieceCID root._

4. Parse the bytes to find a sequence that looks like a CARv1 block header.
Member:

see my point above, this doesn't need to be a random offset


8 participants