
JSON encoding corrupts non-UTF-8 binary data #1582

Closed
davidar opened this issue Aug 15, 2015 · 25 comments
Labels: kind/bug, topic/commands

@davidar (Member) commented Aug 15, 2015

The Data field of ipfs object get --enc=json tries to decode binary data as a UTF-8 unicode string (rather than emitting a base64-encoded blob). This causes problems with binary data that isn't valid UTF-8:

$ wget https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
$ ipfs add UTF-8-test.txt 
added QmVDcJvu7CiqeUeHhnkvej9UZu6PMDXxED5r2nuuzn2G7M UTF-8-test.txt
$ ipfs object get --enc=json QmVDcJvu7CiqeUeHhnkvej9UZu6PMDXxED5r2nuuzn2G7M
{
  "Links": [],
  "Data": "\u0008\u0002\u0012��\u0001UTF-8 decoder capability and stress test\n
<snip>
$ ipfs object get --enc=json QmVDcJvu7CiqeUeHhnkvej9UZu6PMDXxED5r2nuuzn2G7M | ipfs object put --inputenc=json
added QmegjSvPuqqZPu5PHuevWmj8rLoGADTo1dnfCpcNWs77Po

Note that the hashes don't match, as the json encoding has corrupted the data.

The problem seems to be at core/commands/object.go:202. I've tried fixing it myself, but I don't speak Go. It looks like Go's encoding/json package is supposed to do base64 conversion automatically, but it isn't working for me.
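
(For context: Go's encoding/json does base64-encode []byte values automatically, but only when the field is actually typed as []byte; a string field is written as escaped UTF-8 text instead. A minimal sketch, with a hypothetical Node struct rather than the real go-ipfs type:)

package main

import (
	"encoding/json"
	"fmt"
)

// Node is a stand-in for the command's output struct, not the actual go-ipfs type.
type Node struct {
	AsString string // bytes forced through UTF-8: invalid sequences become U+FFFD
	AsBytes  []byte // encoding/json base64-encodes this automatically
}

func main() {
	raw := []byte{0x08, 0x02, 0x12, 0xc0, 0xc1} // 0xc0/0xc1 can never appear in valid UTF-8
	out, err := json.Marshal(Node{AsString: string(raw), AsBytes: raw})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	// {"AsString":"\u0008\u0002\u0012\ufffd\ufffd","AsBytes":"CAISwME="}
	// The string field is lossy (the original bytes are gone); the []byte field round-trips.
}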

@whyrusleeping (Member)

@davidar I'll take a look at this. Thanks for reporting!

@whyrusleeping self-assigned this Aug 15, 2015
@whyrusleeping added the kind/bug and topic/commands labels Aug 15, 2015
@jbenet mentioned this issue Aug 22, 2015
@chriscool (Contributor)

I will try to provide a test for this, but according to this page it looks like the UTF-8-test.txt file might have the following license:

https://creativecommons.org/licenses/by/4.0/

I will contact the author about this, but maybe it's acceptable to put it into "test/sharness/t0051-object-data" anyway.

@chriscool (Contributor)

The test in #1599 should help fix the issue even if we cannot merge it now.

@martingms

Looked into this a bit, and it seems to me the problem is not with the JSON marshaller/unmarshaller. I found that the data of the object returned from the call to core.Resolve() (here) has extraneous data wrapping it.

It seems like the only reason ipfs get is not affected by this is that its output is run through the archiver whether you ask for archiving or not, which presumably strips the wrapping.

The easiest way to reproduce the error is as follows (since ipfs object data returns the data and nothing else, through the same mechanism as all the other get-commands):

$ echo "hello world" > helloworld.txt
$ ipfs add helloworld.txt
added QmT78zSuBmuS4z925WZfrqQ1qHaJ56DQaTfyMUF7F8ff5o helloworld.txt
$ ipfs object data QmT78zSuBmuS4z925WZfrqQ1qHaJ56DQaTfyMUF7F8ff5o > actual
$ diff <(hexdump helloworld.txt) <(hexdump actual)
1,2c1,3
< 0000000 6568 6c6c 206f 6f77 6c72 0a64          
< 000000c
---
> 0000000 0208 0c12 6568 6c6c 206f 6f77 6c72 0a64
> 0000010 0c18                                   
> 0000012

So somewhere along the line those extra bytes are added to the data, or there's some encoding I've missed that isn't being correctly stripped for the ipfs object {data,get,...} commands.

Looking in the blockstore of the repo:

$ hexdump 122046d4/122046d44814b9c5af141c3aaab7c05dc5e844ead5f91f12858b021eba45768b4c0e.data
0000000 120a 0208 0c12 6568 6c6c 206f 6f77 6c72
0000010 0a64 0c18                              
0000014

we see that those bytes are already there. So if these bytes are actually a bug and not intended, they're probably being added by the add call, not on fetching.

This is my first dive into the code base, though, so I might have missed or misunderstood something.

This is with Go 1.5, by the way; I haven't tested with anything else.

@jbenet (Member) commented Aug 27, 2015

@martingms I think what you're seeing is the unixfs protobuf: https://github.com/ipfs/go-ipfs/blob/master/unixfs/pb/unixfs.proto which is currently stored in the Data field of the ipfs object.

So, more clearly, we basically have something like:

# ipfs add
data := request.body
file := unixfs.NewFile(data)
obj := Node()
obj.Data = proto.Marshal(file)

# ipfs object data
obj := get(request.params.arg[0])
return obj.Data # with the serialized protobuf

# ipfs cat
obj := get(request.params.arg[0])
var file unixfs.File
proto.Unmarshal(obj.Data, &file)
return file.Data() # the raw data

tl;dr: I think this is fine.
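
(The extra bytes in the hexdumps above decode exactly as this layering suggests. A minimal hand-decode of the "hello world" block, as a sketch; the offsets are hard-coded and only valid for this particular block:)

package main

import "fmt"

func main() {
	// The raw block from the blockstore. Note that plain `hexdump` prints
	// little-endian 16-bit words, so "120a 0208 ..." is the byte sequence
	// 0a 12 08 02 ...
	block := []byte{
		0x0a, 0x12, // dag-pb field 1 (Data), length-delimited, 18 bytes
		0x08, 0x02, // unixfs field 1 (Type) = 2 (File)
		0x12, 0x0c, // unixfs field 2 (Data), length-delimited, 12 bytes
		'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '\n',
		0x18, 0x0c, // unixfs field 3 (filesize) = 12
	}
	inner := block[2 : 2+block[1]]   // what `ipfs object data` returns: the unixfs protobuf
	payload := inner[4 : 4+inner[3]] // the raw file contents
	fmt.Printf("%q\n", payload)      // "hello world\n"
}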

@martingms

Aha, I see. So the core issue then is that the data is not being unmarshalled from protobuf before being put into the Data field of the output JSON.

@davidar (Member, Author) commented Aug 27, 2015

The problem is that the JSON encoder is interpreting the data field as a UTF-8 string, but it should be base64-encoding it instead. I don't think protobuf has anything to do with it.


@martingms

Hm, okay. Looks like there are similar issues with the XML encoder, if that's the case...

$ ipfs object get --enc=xml QmNY5sQeH9ttVCg24sizH71dNbcZTpGd7Yb3YwsKZ4jiFP
<Node><Data>������UTF-8 decoder capability and stress test
<snip>

but object put with --inputenc=xml is not yet supported, so I cannot verify.

@chriscool (Contributor)

Just to clarify things.

@jbenet wrote:

tl;dr: i think this is fine.

Do you mean that it should be expected that the hashes from ipfs add and from ipfs object put --inputenc=json are not the same?

Or do you just mean that it is fine that ipfs object data is displaying protobuf encoded data?

@jbenet (Member) commented Aug 28, 2015

Or do you just mean that it is fine that ipfs object data is displaying protobuf encoded data?

this o/

@davidar (Member, Author) commented Aug 28, 2015

@martingms @chriscool This might fix it, but I haven't tested.

@chriscool (Contributor)

@davidar I get the following test failure with your change:

ok 5 - 'ipfs object get' succeeds

expecting success: 
                test_cmp ../t0051-object-data/expected_getOut actual_getOut

> diff -u ../t0051-object-data/expected_getOut actual_getOut
--- ../t0051-object-data/expected_getOut        2015-06-12 14:04:03.680326279 +0000
+++ actual_getOut       2015-08-28 09:03:04.721395159 +0000
@@ -1,4 +1,4 @@
 {
   "Links": [],
-  "Data": "\u0008\u0002\u0012\nHello Mars\u0018\n"
+  "Data": "CAISCkhlbGxvIE1hcnMYCg=="
 }
\ No newline at end of file

@davidar (Member, Author) commented Aug 28, 2015

@chriscool Then the test is broken; "CAISCkhlbGxvIE1hcnMYCg==" is the correct output.

Edit: the fix for this bug will change the format of object get, so I would expect tests relying on the old (incorrect) format to break
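
(The two encodings do denote the same bytes; a quick check, using the values from the diff above:)

package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
)

func main() {
	// The old escaped-string form of the "Hello Mars" node data...
	oldForm := []byte("\u0008\u0002\u0012\nHello Mars\u0018\n")
	// ...and the base64 form the fix produces.
	newForm, err := base64.StdEncoding.DecodeString("CAISCkhlbGxvIE1hcnMYCg==")
	if err != nil {
		panic(err)
	}
	fmt.Println(bytes.Equal(oldForm, newForm)) // true: same bytes, two JSON spellings
}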

@chriscool (Contributor)

If I 'fix' the above test, then there is the following error:

expecting success: 
                ipfs object put  ../t0051-object-data/testPut.json > actual_putOut

Error: illegal base64 data at input byte 4
not ok 9 - 'ipfs object put file.json' succeeds
#
#                       ipfs object put  ../t0051-object-data/testPut.json > actual_putOut
#

@davidar (Member, Author) commented Aug 28, 2015

@chriscool That test is also broken. The old behaviour was that Data was interpreted as a UTF-8-encoded unicode string. However, that doesn't match the protobuf schema, so it breaks on binary data that isn't valid UTF-8. So, any tests that assume "Data":"some unicode string" need to be changed to "Data":"SomeBase64String=="

@jbenet @whyrusleeping jump in if I'm saying anything incorrect
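
(Assuming the fixed behaviour, a hand-written node for object put would then look something like this, reusing the "Hello Mars" payload from the test above; hash output omitted:)

$ echo '{ "Links": [], "Data": "CAISCkhlbGxvIE1hcnMYCg==" }' > node.json
$ ipfs object put --inputenc=json node.json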

@jbenet (Member) commented Aug 28, 2015

So, any tests that that assume "Data":"some unicode string" need to be changed to "Data":"SomeBase64String=="

I could swear we had it that way; not sure why it's not... oh, I think @whyrusleeping removed it so it would look pretty and be easy to edit (which is true, it is nicer to edit in plain text). I just don't think we can do that in JSON land.

@davidar (Member, Author) commented Oct 11, 2015

@chriscool any further progress on this?

@whyrusleeping (Member)

I think this one falls in my lap. I wanted the data fields to be easy to edit... but previously I thought that it did escapes like \uABCD or whatever.

@slothbag commented Apr 1, 2016

Hey guys, TrekDev and I have been discussing this issue. What's the consensus here? Is ipfs object put supposed to just accept base64 and decode it internally on receipt? Or will it continue to accept plain text?

@whyrusleeping (Member)

I think the solution here might be to have different encoding types for these workflows: the JSON encoding should have a base64-encoded Data field all the time, and then we might want another type (not sure on naming) that does JSON but doesn't base64-encode the data, for ease of use.

While we're on this topic, we should think about how ipld will handle this sort of thing...

@davidar (Member, Author) commented Apr 2, 2016

While we're on this topic, we should think about how ipld will handle this sort of thing...

CBOR distinguishes between bytestrings and unicode (text) strings, so only the former would need to be base64'd when displayed as JSON.

Edit: but yeah, you'd need some way of indicating which strings are actually encoded bytestrings
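
(For illustration only — using the fxamacker/cbor library, which is not what go-ipfs uses here — the byte string/text string distinction is visible on the wire:)

package main

import (
	"fmt"

	"github.com/fxamacker/cbor/v2"
)

func main() {
	bs, _ := cbor.Marshal([]byte("hi")) // CBOR major type 2: byte string
	ts, _ := cbor.Marshal("hi")         // CBOR major type 3: text (UTF-8) string
	fmt.Printf("% x\n% x\n", bs, ts)    // 42 68 69 vs 62 68 69
}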

davidar added a commit to davidar/specs that referenced this issue Jun 13, 2016
This patch fixes some of the inconsistencies in the IPLD spec, as well as resolving some existing issues, namely:

- ipfs/kubo#2053 - allow raw data in leaf nodes
- ipfs/kubo#1582 - use EJSON for encoding (non-UTF8) binary data
- mandate CBOR Strict Mode to avoid having to deal with duplicated keys
- change `link` to `content` (although we could also use `data` or something similar), as file content needn't be merkle-links, and can in fact be directly embedded into small files
- removed the extraneous `subfiles` wrapper, as file content can be either (a merkle-link to) a bytestring, a unicode string, or a list thereof
@thisconnect (Contributor)

Will this be fixed?

@ydennisy commented Apr 4, 2018

Any update on this?

@magik6k (Member) commented Apr 7, 2018

There is now an option to set the data field encoding to base64: https://github.com/ipfs/go-ipfs/blob/master/core/commands/object/object.go#L389

@magik6k (Member) commented May 14, 2019

(use ipfs object get --data-encoding=base64 QmThing for blocks with binary data)

@magik6k closed this as completed May 14, 2019
@ajnavarro mentioned this issue Aug 24, 2022