build bridge to IPFS-land #253
from https://docs.google.com/document/d/1_vL-hxsHGcy85g7EIUdLesztXFofQ9QW4VdZG3K5J8g/edit
Agenda: with demo document attached and retrieved from https://nodes.desci.com/PZIlDkMRS_iM3HF3rAPZe1E8UsCDa-ncbM4dnsgfxA4 . Running ./ipfs add Demo_Research_Report.pdf produced:

with

producing:

so, again, the challenge is to generate the claim 003e19ab870d338fbd3983c17904cfbdaa4dca3ca89493756519afb15a39ad0c isHashOf X, so that we can say that hash://sha256/003e19ab870d338fbd3983c17904cfbdaa4dca3ca89493756519afb15a39ad0c and QmUMs6LdCNXugG489dt64Tjd4oT1BJuE4DqQtF8JSF12Y7 are hashes derived from the same content.
It appears that content in IPFS cannot be easily bridged to something outside of IPFS land. If there are examples out there that do build this bridge, I'd be happy to reconsider construction of an IPFS integration.
Note that the difficulty of bridging to/from IPFS-land is not a new topic; see e.g., ipfs/kubo#1953 . This is why I was a bit surprised to read DeSci contributors say in Hill et al. 2024, "Guest Post — Navigating the Drift: Persistence Challenges in the Digital Scientific Record and the Promise of dPIDs," in The Scholarly Kitchen, accessed via https://scholarlykitchen.sspnet.org/2024/03/14/guest-post-navigating-the-drift-persistence-challenges-in-the-digital-scientific-record-and-the-promise-of-dpids/
Yes, it is easy to calculate a sha256 hash, but . . . not so easy to calculate its associated address in IPFS land (the CID). Note that the authors did disclose their business associations with DeSci via

and that their product is built on IPFS.
fyi @mielliott @cboettig @mbjones - a continuation of the IPFS and sha256 discussion.
Hello @jhpoelen @mielliott @cboettig @mbjones 👋 I work with engineering at DeSci Labs. For the sake of knowledge sharing, I'll take the liberty of posting here after I was forwarded your email touching on the Scholarly Kitchen article. CIDs are capable of expressing rich information about data encoding, allowing content validation of partial transfers while being agnostic to the hashing scheme. This is a superpower, but unfortunately it means that computing a CID is not as simple as running a hash function over the entire set of data. However, under certain constraints, there can be a 1-to-1 relationship between sha256sum and the IPFS CID. Below a limit called the chunk size, IPFS will not break the file into smaller pieces and build a DAG out of it:

```
/tmp
❯ head -c 256KB /dev/urandom > 256kb.txt

/tmp
❯ ipfs add --only-hash --cid-version 1 --raw-leaves 256kb.txt
added bafkreid2flnnm6jvoxrowuuzpsvk4ppjtwl3p6mr2fsqg4l7mag64qllya 256kb.txt

/tmp
❯ ipfs cid format bafkreid2flnnm6jvoxrowuuzpsvk4ppjtwl3p6mr2fsqg4l7mag64qllya -b base16
f015512207a2adad6793575e2eb52997caaae3de99d97b7f991d16503717f600dee416bc0
# sha|--------------------------------------------------------------|

/tmp
❯ sha256sum 256kb.txt
7a2adad6793575e2eb52997caaae3de99d97b7f991d16503717f600dee416bc0  256kb.txt
```

Those extra bytes at the front of the CID (the f01551220 prefix: f for the base16 multibase, 01 for CIDv1, 55 for the raw codec, 12 for sha2-256, and 20 for a 32-byte digest) are the self-describing metadata that makes a CID more than a bare hash.

If we add a file larger than the default chunk size (256 KiB), the resulting CID says that it's now encoding something different, namely a dag-pb structure. See the breakdown of the CID at https://cid.ipfs.tech/#bafybeibs3xpg4jkoytj4vdqkiymipixfikfyfo45ywpnfdowj2meb64kxy. IPFS has split the file into a tree of chunks, each with its own CID:
```
/tmp
❯ head -c 1MB /dev/urandom > 1mb.txt

/tmp
❯ ipfs add --only-hash --cid-version 1 --raw-leaves 1mb.txt
added bafybeibs3xpg4jkoytj4vdqkiymipixfikfyfo45ywpnfdowj2meb64kxy 1mb.txt

/tmp
❯ ipfs cid format bafybeibs3xpg4jkoytj4vdqkiymipixfikfyfo45ywpnfdowj2meb64kxy -b base16
f0170122032ddde6e254ec4d3ca8e0a461887a2e5428b82bb9dc59ed28dd64e9840fb8abe
# NOT the sha below, because file split into chunk tree

/tmp
❯ sha256sum 1mb.txt
143252d3384e455625e3ab709a50ab08f9f7472e2a610384157aef610d631ba5  1mb.txt
```

We can explicitly chunk the file in 1 MB segments, under which the correlation with sha256 still holds. 1 MB is however the ceiling, as this is the size where the practicalities of the transport layer, bitswap, come in. IPFS clients simply communicate in content-addressed 1 MB chunks, so beyond that point things will always be DAGs of pieces rather than uniform files. This is key: IPFS CIDs aren't built just for file integrity checking; they are built to express file structures that allow piecewise download from multiple sources, while allowing continuous integrity checking of the individual pieces:
```
/tmp
❯ ipfs add --only-hash --cid-version 1 --raw-leaves --chunker size-1048576 1mb.txt
added bafkreiaugjjngocoivlcly5locnfbkyi7h3uolrkmebyifl255qq2yy3uu 1mb.txt

/tmp
❯ ipfs cid format bafkreiaugjjngocoivlcly5locnfbkyi7h3uolrkmebyifl255qq2yy3uu -b base16
f01551220143252d3384e455625e3ab709a50ab08f9f7472e2a610384157aef610d631ba5
# sha|--------------------------------------------------------------|
```

The intuition here is that there is an n-to-1 relation between CIDs and data, as you can describe the same data in different encodings and under different hashing schemes. A CID expresses richer information than a hash, out of which you are looking for a smaller subset. Putting these pieces together: by taking the sha256 hash of data under 1 MB, one can create the (hex) CID through this function:

```
❯ cat <(echo -n "f01551220") <(sha256sum 256kb.txt)
f015512207a2adad6793575e2eb52997caaae3de99d97b7f991d16503717f600dee416bc0  256kb.txt
```

This is a CID identifying a single-chunk, binary encoded file with the sha256 hash 7a2adad6793575e2eb52997caaae3de99d97b7f991d16503717f600dee416bc0. Here is the breakdown: https://cid.ipfs.tech/#f015512207a2adad6793575e2eb52997caaae3de99d97b7f991d16503717f600dee416bc0 . Note that since this CID contains information about the format of the CID itself (i.e., it is self-describing), one can't simply run the bare hash through a base encoder to get other representations; ipfs cid format performs the conversion:
```
/tmp
❯ ipfs cid format f015512207a2adad6793575e2eb52997caaae3de99d97b7f991d16503717f600dee416bc0 -b base32
bafkreid2flnnm6jvoxrowuuzpsvk4ppjtwl3p6mr2fsqg4l7mag64qllya
```

I hope this information was useful in building a better understanding of how these mysterious CIDs work :)
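For reference, this single-chunk correspondence can be checked without the ipfs CLI at all. Below is a minimal Python sketch (standard library only; the function name is illustrative) that builds the hex and base32 CIDv1 forms straight from a file's sha256 digest, using the 0x01 0x55 0x12 0x20 prefix explained in the comment above:

```python
import base64
import hashlib

def cidv1_from_sha256(path: str) -> tuple[str, str]:
    """Build the CIDv1 for a single-chunk, raw-leaves file from its sha256.

    Prefix bytes, as shown in the thread: 0x01 = CIDv1, 0x55 = raw codec,
    0x12 = sha2-256, 0x20 = 32-byte digest length.
    """
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).digest()
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    cid_hex = "f" + cid_bytes.hex()  # multibase 'f' = lowercase base16
    # multibase 'b' = RFC 4648 base32, lowercase, without padding
    cid_b32 = "b" + base64.b32encode(cid_bytes).decode().lower().rstrip("=")
    return cid_hex, cid_b32

# For the 256kb.txt example above, this should print the same two
# identifiers reported by `ipfs add --cid-version 1 --raw-leaves`
# and `ipfs cid format`.
print(cidv1_from_sha256("256kb.txt"))
```

If this matches the transcripts above, it demonstrates the bridge under discussion: for this restricted class of CIDs, the sha256 digest is directly recoverable from, and convertible to, the CID.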
@m0ar Thanks for providing the example. I agree that calculating CIDs is not magic, and your example confirms my understanding of the complexities related to calculating CIDs. I'd urge you to update the blog post at https://scholarlykitchen.sspnet.org/2024/03/14/guest-post-navigating-the-drift-persistence-challenges-in-the-digital-scientific-record-and-the-promise-of-dpids/ to reflect these complexities. Right now, the text makes it seem like the calculation of a CID is as easy as calculating a sha256 hash, and I think this is misleading, especially given your example above.
@jhpoelen In my opinion, this is splitting hairs. The CID is indeed a digital fingerprint, and it is created using sha256. In the context of IPFS, it's just as easy to use. The complexity you mention only appears when popping the hood and trying to translate between different ways of hashing things, which is a discussion both interesting and worth having. This discussion, or technical documentation, is probably a better forum for it, though.
@m0ar thanks for taking the time to reply.
Yes, if the ipfs client were as universally accessible as sha256/md5/sha1 algorithms (which it is not), this might be the case. However, as far as I understand, there are many implicit variables required to verify a CID (e.g., block size).

I'd like to be able to independently verify retrieved CID content, and the design of IPFS, which combines content addressing, a content graph, and content blocking with a content exchange protocol, makes it difficult for me to implement IPFS support. So, equating IPFS CID computation with sha256 content hashing is far from splitting hairs in my mind: as you demonstrated, the CID computation involves many (implicitly configured) processing steps (e.g., content blocking, content hashing, putting blocks in a content graph). In contrast, hashing digital content using a sha256 algorithm is a parameter-free operation supported by a diverse collection of software libraries across many platforms. But hey, I can be convinced otherwise if you are willing to add some IPFS bridge for Preston. This way, you can demonstrate how to retrieve independently verifiable content from the IPFS universe. I got stuck on the complexities. Thanks again for engaging in discussion - I feel that we have a lot in common in realizing that content addressing is a useful way to point to digital content regardless of where/how it may be stored in the future.
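The contrast being drawn here can be made concrete in a few lines. The sketch below is illustrative only (hypothetical helper name, Python standard library):

```python
import hashlib

# Verifying a hash://sha256 URI needs no parameters beyond the algorithm itself:
def verify_sha256(path: str, expected_hex: str) -> bool:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest() == expected_hex

# Recomputing a CID from raw bytes, by contrast, additionally depends on
# add-time choices: chunker size, DAG layout, raw leaves vs UnixFS wrapping,
# and CID version -- the implicit variables discussed above.
```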
@m0ar thanks for sharing, and thanks to @jhpoelen and all for an engaging discussion. I think @jhpoelen's suggested test that these systems based on SHA hashes should be interoperable is a good one. As another case in point, you are probably all familiar with the Open Container Registry spec, which also tracks objects based on SHA hashes. I think we can all agree the spec has proven its ability to scale and be replicated independently -- for instance, GitHub notes that the Homebrew project alone uses OCI-compatible GHCR to

and we have seen many independent software implementations of the spec by major players. Because OCI is transparently SHA-256-based, it is also easy for third-party tools to retrieve content from these registries by SHA-256 checksums (e.g., I believe @jhpoelen has done this already in Preston).
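As a hedged illustration of what retrieval by checksum looks like, the sketch below follows the OCI distribution spec's blob endpoint (GET /v2/&lt;name&gt;/blobs/&lt;digest&gt;) and GHCR's anonymous pull-token flow. The repository name and digest are placeholders, not real Preston artifacts:

```python
import hashlib
import json
import urllib.request

REGISTRY = "ghcr.io"
REPO = "example-org/example-data"   # hypothetical repository name
DIGEST = "sha256:" + "00" * 32      # placeholder digest

# GHCR issues anonymous pull tokens from its token endpoint
token_url = f"https://{REGISTRY}/token?service={REGISTRY}&scope=repository:{REPO}:pull"
token = json.load(urllib.request.urlopen(token_url))["token"]

# Per the OCI distribution spec, blobs are addressed directly by digest
req = urllib.request.Request(
    f"https://{REGISTRY}/v2/{REPO}/blobs/{DIGEST}",
    headers={"Authorization": f"Bearer {token}"},
)
content = urllib.request.urlopen(req).read()

# The address doubles as the checksum: sha256 of the bytes must equal DIGEST
assert "sha256:" + hashlib.sha256(content).hexdigest() == DIGEST
```

The point of the example is the last line: in OCI land, independent verification is a plain sha256 over the retrieved bytes, with no implicit parameters.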
The OCI support that @cboettig mentioned can be found at #255, which describes some events that led to OCI support first being introduced in Preston v0.7.2 in July 2023. This enabled content retrieval using:

with the first 24 lines being:

@m0ar What would it take to implement a similar bridge to IPFS, with the same (exact) dataset to be retrieved from DeSci's IPFS universe, allowing:

to produce the independently verified (via sha256sum) fingerprint of the retrieved content:

?
IPFS aims to store content without relying on some centralized service like DNS.
Preston keeps track of (biodiversity) content.
Idea - make preston and IPFS interoperable.
Step 1. Small example
Using https://docs.ipfs.tech/install/command-line/#install-official-binary-distributions, I was able to add a file to a local ipfs store.
produced
and the vanilla sha256 hash -
aka
hash://sha256/9f7807097477f4f480130cefd2521e033534ac967ec36119e18392bce24d81d3
The question is: how to calculate QmYou3ngXxSek7rfbATTWw8gduBqKXwMebb6ber5J2SwMh independently?
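One hedged answer, for files small enough to fit in a single chunk: kubo's default ipfs add wraps the bytes in a UnixFS/dag-pb node and the CIDv0 is the base58btc-encoded sha2-256 multihash of that node. The Python sketch below reconstructs this under those assumptions (defaults only, no raw leaves; larger files are split into a chunk DAG and need more layout logic):

```python
import hashlib

def varint(n: int) -> bytes:
    # unsigned LEB128, as used by protobuf and multiformats
    out = bytearray()
    while True:
        out.append((n & 0x7F) | (0x80 if n > 0x7F else 0))
        n >>= 7
        if not n:
            return bytes(out)

def unixfs_file(content: bytes) -> bytes:
    # UnixFS "Data" protobuf: Type=File(2), Data=content, filesize=len(content)
    return (b"\x08\x02"
            + b"\x12" + varint(len(content)) + content
            + b"\x18" + varint(len(content)))

def dag_pb_node(content: bytes) -> bytes:
    # dag-pb PBNode with no links: a single Data field wrapping the UnixFS message
    inner = unixfs_file(content)
    return b"\x0a" + varint(len(inner)) + inner

B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58btc(data: bytes) -> str:
    n = int.from_bytes(data, "big")
    out = ""
    while n:
        n, r = divmod(n, 58)
        out = B58[r] + out
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out

def cidv0(path: str) -> str:
    with open(path, "rb") as f:
        node = dag_pb_node(f.read())
    multihash = b"\x12\x20" + hashlib.sha256(node).digest()  # sha2-256, 32 bytes
    return base58btc(multihash)  # CIDv0 is the bare base58btc multihash

# e.g. cidv0("some_small_file") should match `ipfs add --only-hash some_small_file`
# for single-chunk files under the default settings (file name hypothetical).
```

If this holds for kubo's defaults, a Qm... CID for a small file would be reproducible from the original bytes without running an IPFS node, which is the independence property this issue is after.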