Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide Way of Representing CidV0 objects in an alternative base #34

Closed
kevina opened this issue Aug 29, 2017 · 31 comments
Closed

Provide Way of Representing CidV0 objects in an alternative base #34

kevina opened this issue Aug 29, 2017 · 31 comments

Comments

@kevina
Copy link
Contributor

kevina commented Aug 29, 2017

It may be useful to represent a CidV0 object in an alternative base. For example if we need to include it in as the domain part of a URL.

I propose we provide allow a non non-standard encoding of:

<multibase prefix><version><multihash>

where version is always 0.

Related to ipfs/kubo#4143.

Note: I made this more complicated than it needs be, see this comment: #34 (comment)

@ghost
Copy link

ghost commented Aug 29, 2017

I've been thinking about this too, for the case where we want to represent a CIDv0 in base32 in a URL. The less obvious thing here is that we probably need to be able to recognize such CIDs and convert them back to CIDv0, so that DHT and Bitswap return the expected results.

Do you agree this is a concern? @whyrusleeping was skeptical when I first brought it up a while ago.

It could be done with a multicodec header that says "what's in here is CIDv0".

@daviddias
Copy link
Member

for the case where we want to represent a CIDv0 in base32 in a URL.

For those cases, we upgrade the CID to CIDv1 and base encoded it with base32.

@kevina
Copy link
Contributor Author

kevina commented Aug 29, 2017

@lgierth

It could be done with a multicodec header that says "what's in here is CIDv0

Yes, and I think that should just be the byte '0x00' which is the variable-length encoding of the number 0. Hence the equivalent of setting the version number to 0.

@diasdavid

For those cases, we upgrade the CID to CIDv1 and base encoded it with base32.

The ideas is we want to be able to retrieve a Cidv0 object using base32. Upgrading the CID will not work in this case.

@kevina
Copy link
Contributor Author

kevina commented Aug 29, 2017

The rules for displaying CidV0 in an alternative base can be as follows:

If the base is in 'base58btc' than just encode the multihash without a base prefix, otherwise prepend a byte of value '0x00' (the version 0 string) to the binary representation and encode that with a base prefix.

This should be fairly easy to implement, I can create a p.r. later today if it will make my idea clearer.

@ghost
Copy link

ghost commented Aug 29, 2017

The ideas is we want to be able to retrieve a Cidv0 object using base32. Upgrading the CID will not work in this case.

@diasdavid what Kevin and I are thinking about is the next step after receiving a gateway request with a CIDv0-in-CIDv1: content routing and bitswap, where we need the original CIDv0 because that's what provider records are published for, and what's used as a key in the blockstore of the provider nodes. It's kind of a matter of backward compatibility: there will always be nodes who continue to create CIDv0 stuff.

@Stebalien
Copy link
Member

@kevina I'd go a step further and start using the CIDv1 formatted CIDv0s by default. I believe that will actually be less work and, given that we're planning on switching to base32 by default anyways, shouldn't cause any additional breakage. Does that make sense or am I missing something?

@kevina
Copy link
Contributor Author

kevina commented Aug 29, 2017

@Stebalien What exactly will be a CIDv1 formatted CIDv0?

The base used to encode a CID is for the most part an interface issue. The base is is not stored in the key of objects stored as part of a link or in the blockstore. The Cid version and hash used is, on the other hand, part of the key.

You can't retrieve an Cidv0 object if it is reformatted to be Cidv1 and sometimes you may still want to store a Cidv0 key in another object.

@Stebalien
Copy link
Member

What exactly will be a CIDv1 formatted CIDv0?

The multibase-cidv0-multihash sequence. That is, if we switch to base32 as the default, we'll now be returning base32-cidv0-multihash from interfaces so we might as well start returning base58btc-cidv0-multihash when the user requests base58btc.

@kevina
Copy link
Contributor Author

kevina commented Aug 30, 2017

@Stebalien sorry I misinterpreted your reply (as I am prone to do :( ).

The idea behind not returning base58btc-cidv0-multihash was (1) to provide only one way to represent a Cid in a given base and (2) to simplify the U.I. If we do switch to base58btc-cidv0-multihash then we probably need to provide a way to get to the old style hash (for compatibility, if no other reason) and will need to provide an extra option for that. If we have the rule that base58btc implies the old style hash string (no multibase-version) prefix we avoid this problem.

@Stebalien
Copy link
Member

I kind of agree... I'd just like to phase out non-multibase strings as much as possible as soon as possible. But, I guess, if we move to base32 by default we'll be doing that anyways.

@kevina
Copy link
Contributor Author

kevina commented Aug 30, 2017

It turns out I made this more complicated then necessary. The CidV0 bit is NOT needed. Just prefixing a CidV0 string with a multibase prefix will already work as expected. For example
QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco zQmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco BCIQIZOL2ACTMSCQU5TYYFWBWGWB5IAXWTEM6HDYETDW5TWJKLQBKPNA will all work already.

The only problem is this little bit of code (which gave me the impression that a CidV0 are incompatible with multibase)

func (c *Cid) StringOfBase(base mbase.Encoding) (string, error) {
	switch c.version {
	case 0:
		if base != mbase.Base58BTC {
			return "", ErrInvalidEncoding
		}
		return c.hash.B58String(), nil
	case 1:
		return mbase.Encode(base, c.bytesV1())
	default:
		panic("not possible to reach this point")
	}
}

which should likely work on CidV0 string and return a string with the multibase prefix if the code is not 'z'.

Thus I propose we fix this function and call it a day. 😄

@Stebalien
Copy link
Member

Here's an update to the spec to take this into account: multiformats/cid#14

@daviddias
Copy link
Member

daviddias commented Aug 31, 2017

@diasdavid what Kevin and I are thinking about is the next step after receiving a gateway request with a CIDv0-in-CIDv1: content routing and bitswap, where we need the original CIDv0 because that's what provider records are published for, and what's used as a key in the blockstore of the provider nodes. It's kind of a matter of backward compatibility: there will always be nodes who continue to create CIDv0 stuff.

There are two issues here. One is:

  • a) where we need the original CIDv0 because that's what provider records.
  • b) what's used as a key in the blockstore of the provider nodes.

a)

This was accepted as kind of fine since that is also a valid concern for when someone picks different hashes or bases for their content. Essentially, the content should be identified by it's multihash and not by a key that includes type and multibase. If the hash matches, it is the same content. If an app wants to interpret content "asdldasjdkla" as multicodec A and another app as multicodec B, that's fine. This way we ensure deduplication

b)

This is what I think will be the be crux.

From memory, the Lisbon Treaty (the 2~3 hack nights that me, @jbenet and @whyrusleeping spent figuring all of the details out), the actual solution for this problem was not to use CIDs as Keys in the Repo, instead, keep using the multihashes as it was before. This would solve this issue from the bottom because independently of reading CIDv0 or CIDv1, the repo would always pick up the block from the multihash and that would be the same for both.

Right now, go-ipfs uses base32.encode(rawCID) to get the keys for storing blocks in the repo. Eventually, js-ipfs followed so that both repos are compatible. However, I can't recall if there was a valid reason for this change. Please help me refresh my and everyone's memory if you know where that was discussed and approved.

Using base32.encode(rawCID) also means that if I have a block that for an IPLD graph, one of the node is considered _Raw) (because, for that graph, it really doesn't care for those bytes nor it knows what it is) and for another graph, it consuming the same content but now it is a git Object, now we effectively have to store the block 2x, one with CID having the multicodec for Raw and another with multicodec for CID.

PS: This brings a old subject where we had similar issues for having updates on how the Repo works (exactly the same module) without having those updates first proposed to the Repo Spec on https://github.com/ipfs/specs/tree/master/repo

@kevina
Copy link
Contributor Author

kevina commented Aug 31, 2017

@diasdavid Thanks for this. Was this tidbit of information documented somewhere?

Based on what you are saying all network requests for blocks should be made using the multihash only and not the full CID.

I agree that we should be storing blocks using just the multihash. The change to base32 was made before the CID proposal to fix ipfs/kubo#2601, some additional discussion was done here ipfs/go-datastore#39, and the change was made in ipfs/kubo#2903.

It should be rather simple to store just the multihash component in the datastore. The biggest pain will be the repo migration.

@daviddias
Copy link
Member

daviddias commented Aug 31, 2017

@diasdavid Thanks for this. Was this tidbit of information should of been documented somewhere?

Agreed and I believe it was, at least 2 times. Post Lisbon Treaty and somewhere in a go-ipfs PR. It should have been, however, done through the IPFS Specs Repo.

Based on what you are saying all network requests for blocks should be made using the multihash only and not the full CID.

The CID can go on the wire (because of in the future, with graphswap that will be useful), however, the assertions should be done in the hash level. If the Repo was checking by Hash, that will be granted by default.

@kevina
Copy link
Contributor Author

kevina commented Aug 31, 2017

@diasdavid you replied before I edited my sentence 😄

@Kubuxu
Copy link
Member

Kubuxu commented Aug 31, 2017

Right now, go-ipfs uses base32.encode(rawCID) to get the keys for storing blocks in the repo.

This was also cleared up but after Lisbon (I think this point might not have been considered , without this it would be impossible to enumerate blocks given only repo for example as in case of pinset loss, it has saved us once already). This also means that the data isn't self descriptive in storage in blockstore.

It might be possible to store it aside (in other part of the datastore) but it would require us to reevaluate the leveldb (it might just not scale to number of entries we keep).

EDIT: I had this comment sitting in comment windows for ~1h

@kevina
Copy link
Contributor Author

kevina commented Aug 31, 2017

@diasdavid I fully agree with all your points. Nevertheless, I still think it will be useful to be able to represent CidV0 in an alternative base. For example if we switch to Base32 users might still have a need to store a CidV0 string as part of a link in a dag. Also converting a CidV0 link in a dag to a Cidv1 for display in same cases can cause problems. If we don't allow multibase prefixed for CidV0 then it will be base32 for everything but the case where CidV0 is still required. That seams odd to me.

@kevina
Copy link
Contributor Author

kevina commented Aug 31, 2017

Could I have a link to the "Post Lisbon Treaty"?

@ghost
Copy link

ghost commented Sep 12, 2017

Could I have a link to the "Post Lisbon Treaty"?

I want too

@Kubuxu
Copy link
Member

Kubuxu commented Mar 1, 2018

I would like to renew this discussion. I don't think that supporting CIDv0 in multibase is wise, it will linger forever everywhere.

We should flesh out clear upgrade path that will lead to us not using CIDv0 internally anymore.

Quick example of possible upgrade path:

  1. Finally upgrade to outputting CIDv1
  2. Create repo migration of v0 -> v1
  3. If we ever decode CIDv0, upgrade it to v1
  4. Push out both CIDv0 and CIDv1 where compatible to the network.
  5. In places where legacy CIDv0 is need (network, DHT, so on) make dual requests and track metrics on how many times CIDv0 exclusively allowed us to access content (so making requests to CIDv0 saved us)
  6. When this number falls low enough (enough nodes upgraded), stop providing compatibility layer of publishing CIDv0.

This way we can get CIDv0 mostly out of our codebases (apart from UI layer) and upgrade to new protocol. If we don't do this soon enough we will be stuck with it forever and it decreases performance, increases complexity and reduces our velocity (complexity is bane of moving protocols forward).

We need a clear way to move forward with CIDv1 and how finally embrace it fully.

@kevina
Copy link
Contributor Author

kevina commented Mar 1, 2018

I don't think that supporting CIDv0 in multibase is wise

I do not agree with this and don't see the argument as "it will linger forever then everywhere" very convincing.

@whyrusleeping
Copy link
Member

We do want to move away from supporting cidv0. Adding new features to it seems unwise if thats really our goal.

I'm not really sure what a valid usecase for this is, as @diasdavid said earlier, we should just upgrade them to cidv1 and provide an internal conversion layer in cases where it might be needed.

@Stebalien
Copy link
Member

Stebalien commented Mar 1, 2018

@Kubuxu ideally, the datastore wouldn't care about the CID, just the multihash. However, to keep GC working correctly, we'd need to introduce a separate "linkstore" for keeping track of the local DAG. @lgierth indicated he was interested in working on this. (#34 (comment))

@kevina
Copy link
Contributor Author

kevina commented Jun 8, 2018

I'm not really sure what a valid usecase for this is, as @diasdavid said earlier, we should just upgrade them to cidv1 and provide an internal conversion layer in cases where it might be needed.

I will repeat by earlier comment #34 (comment) so its not lost in the noise:

Nevertheless, I still think it will be useful to be able to represent CidV0 in an alternative base. For example if we switch to Base32 users might still have a need to store a CidV0 string as part of a link in a dag. Also converting a CidV0 link in a dag to a Cidv1 for display in same cases can cause problems. If we don't allow multibase prefixed for CidV0 then it will be base32 for everything but the case where CidV0 is still required. That seams odd to me.

@kevina
Copy link
Contributor Author

kevina commented Jun 8, 2018

Also, as a side note, we can not fully convert CidV0 Dag objects to CidV1 without changing the hash of the object as the links inside will also need to be converted to CidV1. If we do this it will no longer be possible to look up old content. If we don't do this we will need to support CidV0 forever, at least to some capacity.

@kevina
Copy link
Contributor Author

kevina commented Jun 9, 2018

However, to keep GC working correctly, we'd need to introduce a separate "linkstore" for keeping track of the local DAG.

I do not see why this will be necessary. Pinned hashes will still be stored by the full CID. The GC does not care about the structure of anything not pinned.

@Stebalien
Copy link
Member

@kevina you're right.

@kevina
Copy link
Contributor Author

kevina commented Jun 29, 2018

In addition to all the points I said above I would also like to add that not allowing CidV0 to be represented in Base32 breaks abstraction and creates this weird special case. As I see it multibase is something that is independent of CID. That is it can be used to encode any sort of (short) binary string. If we first convert to binary and then decode the CID then any sort of base will be permissible (as the multibase code has no idea it is decoding a CID, and the CID decoder just sees the binary string), however if we do it all at once then the CID decoder will reject it.

I fully agree with should work to migrate away from CIDv0, however for this reason and reasons in previous comments I think not allowing CIDv0 to be encoded using multibase creates unnecessary complications.

In all fairness I did think of one valid use case for not allowing multibase prefixed for CIDv0 that is to distinguish raw multihashes from multihashes that are also CIDv0. That is we switch to storing raw multihashes in the datastore we will need to list them to the user. We could in this case list them with a multibase encoding and define any multihash with a multibase prefix as a raw multihash and not a CIDv0. However, as I see it this is a fragile distinction and it still doesn't solve the problem on how to handle working with existing CIDv0 stored as link in objects.

@kevina
Copy link
Contributor Author

kevina commented Jul 2, 2018

I'm not sure, but maybe this is something that we can work through in the upcoming developers meeting. See ipfs-inactive/developer-meetings#16.

@lidel
Copy link
Member

lidel commented Feb 6, 2020

Closing:

To follow migration to CIDv1 (default base32) see: https://github.com/ipfs/ipfs/issues/337

@lidel lidel closed this as completed Feb 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants