JSON encoding corrupts non-UTF-8 binary data #1582
Comments
@davidar I'll take a look at this. Thanks for reporting!
I will try to provide a test for this, but according to this page it looks like the UTF-8-test.txt file might have the following license: https://creativecommons.org/licenses/by/4.0/. I will contact the author about this, but maybe it's acceptable to put it into "test/sharness/t0051-object-data" anyway.
The test in #1599 should help fix the issue even if we cannot merge it now.
Looked into this a bit, and it seems to me the problem is not with the JSON marshaller/unmarshaller. I found that the data of the object returned from the call to […]. It seems like the only reason […]. The easiest way to replicate the error is as follows (since […]):

[…]

So somewhere along the line those extra bytes are added to the data, or there's some encoding I've missed, and it's just not correctly stripped for the […]. Looking in the blockstore of the repo:

[…]

we see that those bytes are already there. So if these bytes are actually a bug and not intended, then they're probably being added by the […].

This is my first dive into the code base, so I might have missed or misunderstood something however. This is with go 1.5 by the way, haven't tested with anything else.
@martingms i think what you're seeing is the unixfs protobuf: https://github.com/ipfs/go-ipfs/blob/master/unixfs/pb/unixfs.proto which currently is stored in the data field of the ipfs object. so, more clearly, we basically have something like:

```
# ipfs add
data := request.body
file := unixfs.NewFile(data)
obj := Node()
obj.Data = proto.Marshal(file)

# ipfs object data
obj := get(request.params.arg[0])
return obj.Data  # with the serialized protobuf

# ipfs cat
obj := get(request.params.arg[0])
var file unixfs.File
file := proto.Unmarshal(obj.Data, &file)
return file.Data()  # the raw data
```

tl;dr: i think this is fine.
Aha, I see. So the core issue then is that the data is not being unmarshalled from protobuf before being put into the Data field of the output json.
The problem is that the json encoder is interpreting the data field as a unicode string rather than a base64 encoded blob.
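This is straightforward to demonstrate with Go's `encoding/json` directly. A minimal sketch (not code from the repo) of why round-tripping non-UTF-8 data through a JSON string field is lossy, while `[]byte` is not:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Two bytes that are not valid UTF-8, e.g. from a binary file.
	raw := []byte{0xff, 0xfe}

	// Putting them in a Go string and JSON-encoding that string is lossy:
	// encoding/json coerces strings to valid UTF-8, replacing invalid bytes
	// with the U+FFFD replacement rune.
	enc, _ := json.Marshal(string(raw))

	var back string
	_ = json.Unmarshal(enc, &back)
	fmt.Println(back == string(raw)) // false: the round trip corrupted the data

	// A []byte value, on the other hand, is base64-encoded and round-trips losslessly.
	enc, _ = json.Marshal(raw)
	fmt.Println(string(enc)) // "//4="

	var backBytes []byte
	_ = json.Unmarshal(enc, &backBytes)
	fmt.Println(string(backBytes) == string(raw)) // true
}
```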
Hm okay. Looks like there are similar issues with the XML encoder if that's the case... but […]
Just to clarify things. @jbenet wrote:

> […]

Do you mean that it should be expected that the hash from […]? Or do you just mean that it is fine that […]?
this o/
@martingms @chriscool This might fix it, but I haven't tested.
@davidar I get the following test failure with your change:

[…]
@chriscool Then the test is broken, "CAISCkhlbGxvIE1hcnMYCg==" is the correct output.

Edit: the fix for this bug will change the format of object get, so I would expect tests relying on the old (incorrect) format to break.
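For anyone following along, that base64 string is just the serialized unixfs protobuf @jbenet described above. A quick sketch that decodes it (field meanings assumed from the linked unixfs.proto):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

func main() {
	raw, err := base64.StdEncoding.DecodeString("CAISCkhlbGxvIE1hcnMYCg==")
	if err != nil {
		panic(err)
	}
	// Prints: 08 02 12 0a 48 65 6c 6c 6f 20 4d 61 72 73 18 0a
	// Read as protobuf wire format, that is roughly:
	//   field 1 (varint) = 2            -> DataType: File
	//   field 2 (bytes)  = "Hello Mars" -> the actual file contents
	//   field 3 (varint) = 10           -> filesize
	fmt.Printf("% x\n", raw)
}
```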
If I 'fix' the above test, then there is the following error:

[…]
@chriscool That test is also broken. The old behaviour was that Data was interpreted as a utf8-encoded unicode string. However, that doesn't match the protobuf schema, so it breaks on binary data that isn't valid utf8. So, any tests that assume `"Data":"some unicode string"` need to be changed to `"Data":"SomeBase64String=="`.

@jbenet @whyrusleeping jump in if I'm saying anything incorrect
could swear we had it that way, not sure why it's not... oh i think @whyrusleeping removed it so it would look pretty and be easy to edit. (which is true, is nicer to edit in plain text). just dont think we can do that in json land
@chriscool any further progress on this?
I think this one falls in my lap. I wanted the data fields to be easy to edit... But previously i thought that it did escapes like […]
Hey guys, me and TrekDev have been discussing this issue... what's the consensus here? Is […]
I think the solution here might be to have different encoding types for these workflows. The […]

While we're on this topic, we should think about how ipld will handle this sort of thing...
CBOR distinguishes between bytestrings and unicode (text) strings, so only the former would need to be base64'd when displayed as JSON.

Edit: but yeah, you'd need some way of indicating which strings are actually encoded bytestrings
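As a concrete illustration (using the fxamacker/cbor Go library purely for demonstration, not anything from this repo), the two kinds of string really are distinct on the wire:

```go
package main

import (
	"fmt"

	"github.com/fxamacker/cbor/v2"
)

func main() {
	text, _ := cbor.Marshal("Hi")          // text string: major type 3
	bytes, _ := cbor.Marshal([]byte("Hi")) // byte string: major type 2

	fmt.Printf("% x\n", text)  // 62 48 69 (0x60|2, then "Hi")
	fmt.Printf("% x\n", bytes) // 42 48 69 (0x40|2, then "Hi")
}
```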
This patch fixes some of the inconsistencies in the IPLD spec, as well as resolving some existing issues, namely:

- ipfs/kubo#2053 - allow raw data in leaf nodes
- ipfs/kubo#1582 - use EJSON for encoding (non-UTF8) binary data
- mandate CBOR Strict Mode to avoid having to deal with duplicated keys
- change `link` to `content` (although we could also use `data` or something similar), as file content needn't be merkle-links, and can in fact be directly embedded into small files
- removed the extraneous `subfiles` wrapper, as file content can be either (a merkle-link to) a bytestring, a unicode string, or a list thereof
Will this be fixed?
Any update on this?
There is now an option to set data field encoding to base64: https://github.com/ipfs/go-ipfs/blob/master/core/commands/object/object.go#L389

(use […])
The `Data` field of `ipfs object get --enc=json` tries to decode binary data as a UTF-8 unicode string (rather than a base64 encoded blob). This causes problems with binary data that isn't valid UTF-8:

[…]

Note that the hashes don't match, as the json encoding has corrupted the data.

The problem seems to be line `core/commands/object.go:202`. I've tried fixing it myself, but I don't speak go. It looks like go's json module is supposed to do base64 conversion automatically, but it isn't working for me.
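For context on that last point: Go's json package does do the base64 conversion automatically, but only for fields declared as `[]byte`; a field declared as `string` is emitted as (coerced) UTF-8 text, which is presumably what the command's output struct did at the time. A hypothetical sketch of the distinction, not the actual fix:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for the output type in core/commands/object.go;
// the real struct has more fields, these exist only to illustrate the point.
type nodeAsString struct{ Data string }
type nodeAsBytes struct{ Data []byte }

func main() {
	data := []byte{0x08, 0x02, 0x12, 0x0a, 0xff, 0xfe} // some arbitrary binary data

	s, _ := json.Marshal(nodeAsString{Data: string(data)})
	b, _ := json.Marshal(nodeAsBytes{Data: data})

	fmt.Println(string(s)) // invalid bytes replaced with U+FFFD -- data corrupted
	fmt.Println(string(b)) // {"Data":"CAISCv/+"} -- base64, lossless
}
```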