Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some popular services still produces invalid JSON with <0x20 characters #26

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Kentzo
Copy link

@Kentzo Kentzo commented May 21, 2011

Do not break parsing if <0x20 is occurred but JKParseOptionLooseUnicode is specified

@johnezang
Copy link
Owner

I have mixed feelings about this.

Can you give some specific examples of services that generate bad JSON? I'm not aware of any off the top of my head, but I have no doubt they exist.

In this particular case, these services are very clearly "in the wrong". RFC 4627 is unambiguous that characters < 0x20 are verboten. In cases like there, where something is clearly violating the standard, my default response is that "It's the other persons (web service) problem." The standard is the standard, and it is Right(tm), even its mistakes.

The Unicode standard specifies how to deal with malformed Unicode text. For the purposes of JSONKit, this means UTF8, and characters < 0x20 can not, by definition, be malformed UTF8. Dealing with malformed UTF8 isn't straightforward, and there's a lot of security issues that have been raised and addressed over the years. It's not clear to me that one can safely replace characters < 0x20 with U+FFFD, or do so without potentially introducing subtle security problems.

If there are services that are generating JSON with characters < 0x20, then I think it's safe to assume that they are doing so for a reason. A few that I can think of is not escaping 0x0d (newline) to \n, or embedding binary data directly in to JSON " " strings. I can't think of a case where replacing < 0x20 with U+FFFD would not significantly alter the intended meaning of the string. Dealing with malformed UTF8 text, and using U+FFFD to replace the malformed text, tends to have a much smaller chance of significantly altering the semantics of the string. In fact, it is not uncommon for the malformed UTF8 to be very carefully constructed to specifically exploit security problems, in which case replacing the malformed text with U+FFFD usually helps to protect against these types of problems.

I'll certainly entertain reasons why your patch, or something along the lines of your patch, should be incorporated in to JSONKit- if this is a common, wide spread problem that people have to deal with, then pragmatism says an exception to the rule should be made (this was done with a recent bug report regarding Microsofts way of encoding dates as /Date(...)/).

Also, I do not think your patch works as-is. You need to use something like:

if(jk_string_add_unicodeCodePoint(parseState, UNI_REPLACEMENT_CHAR, &tokenBufferIdx, &stringHash)) {
  jk_error(parseState, @"Internal error: Unable to add UTF8 sequence to internal string buffer. %@ line #%ld", [NSString stringWithUTF8String:__FILE__], (long)__LINE__);
  stringState = JSONStringStateError;
  goto finishedParsing;
}

In your patch, the above should (at least) replace your line 1457 (and probably the other places where you try to substitute U+FFFD), followed by a continue; (at least for line 1457, I haven't worked through the other cases), since otherwise the else {} block would finish and go on to the calculateHash() line, which you don't want since jk_string_add_unicodeCodePoint manages that part for you.

@Kentzo
Copy link
Author

Kentzo commented May 24, 2011

According to the comment on redmine.org all services based on pre-rails 3 are affected. As you can see from the comment, the problem is redmine does not pre-process user input.

I do not insist on fixing the problem that way, but at least it would be good if we could have an option to process even technically invalid data somehow. Probably by replacing character with it's meaning (following to my example, ^ACKNOWLEDGE or ^F instead of 0x06 character). This way we do not alter the intended meaning of the original string and it would be simple for a user-side application to process it.

@johnezang
Copy link
Owner

I can at least suggest a work around. I'm sort of of the opinion that something along these lines is probably the better way to handle this particular quirk.

The solution involves RegexKitLite, which you can download here: RegexKitLite-4.0.tar.bz2.

Basically, there's two obvious work arounds. The first is fairly simple: Use a regex to match all the characters from U+0000 to U+001F (i.e., < 0x20), and replace them with U+FFFD. This can be accomplished with the following line of code:

fixedJSONString = [jsonString stringByReplacingOccurrencesOfRegex:@"[\\u0000-\\u001f]" withString:@"\\\\uFFFD"];

This is essentially what your patch does, but moves it to an extra, external to JSONKit pre-processing step.

A second work around is to make use of RegexKitLites ^Blocks functionality. The example below also takes advantage of some of the advanced ^Block enumeration options, RKLRegexEnumerationReleaseStringReturnedByReplacementBlock and RKLRegexEnumerationFastCapturedStringsXXX. With the first option, you return a non-autoreleased string, and RegexKitLite sends the returned string a -release as soon as its done with it. This keeps memory consumption done. The second option is a "wizard level" option, and you should probably read the docs to understand what it does, but in a nutshell it speeds things up and keeps memory usage down, but there are some important caveats to using it (again, see the documentation). The usage of the option below is perfectly safe (in fact, the usage below is precisely why the option exists).

By using a ^Block, you have much finer control of what the replacement text is for the matched characters. In this case we use this to match the same range of characters, < 0x20, but every character that is matched is replaced with \u00xx, where xx is the value of the matched control character. Essentially, this transforms all the characters that are < 0x20 to their equivalent \uxxxx escaped form, which JSONKit will parse and the parsed string will contain the unescaped version of the character.

NSString *fixedJSONString = [jsonString stringByReplacingOccurrencesOfRegex:@"[\\u0000-\\u001f]"
                                                                    options:RKLNoOptions
                                                                    inRange:NSMakeRange(0UL, [jsonString length])
                                                                      error:NULL
                                                         enumerationOptions:(RKLRegexEnumerationReleaseStringReturnedByReplacementBlock | RKLRegexEnumerationFastCapturedStringsXXX)
                                                                 usingBlock:^(NSInteger captureCount, NSString * const capturedStrings[captureCount], const NSRange capturedRanges[captureCount], volatile BOOL * const stop) {
    return((NSString *)[[NSString alloc] initWithFormat:@"\\u%4.4x", [capturedStrings[0] characterAtIndex:0ul]]);
  }];

Then, you pass fixedJSONString to JSONKit to parse for both of the workarounds above.

@ghost ghost assigned johnezang May 24, 2011
@johnezang
Copy link
Owner

@Kentzo,

Any thoughts on my suggestion for dealing with this particular problem at the NSString level instead of at the JSONKit level? I'm leaning towards closing this issue with the recommendation that users who encounter this kind of malformed JSON deal with it by preprocessing the JSON using other means and then passing the preprocessed result to JSONKit.

Disadvantages to this approach:

  • External preprocessing takes longer.
  • If you process the JSON using NSString methods, I suspect that this would most likely result in some amount of "conversion to NSString" overhead, along with potential NSString instantiation overhead.
  • Any way of dealing with this particular problem that is integrated in to JSONKit is likely to be hard coded and immutable. If that hard coded way of dealing with it inside JSONKit works for you, then great. If you need to handle it a different way, then....

Advantages:

  • By externally preprocessing the JSON before having JSONKit parse it, you can tailor the solution to your particular situation.
  • The RegexKitLite example above is an example of this- some people might want to deal with < 0x20 characters by replacing them with U+FFFD, while other people may need to change the 0x20 characters to their JSON \uXXXX equivalent.

Probably the most compelling reason I can think of is that it's not entirely clear that there is a best way, let alone a single way, for dealing with JSON that is non-RFC 4627 compliant. Most of the real world situations that I can think of really require a solution that is tailored to each situation. Anyone who is forced to deal with JSON like this can most likely encapsulate the details of dealing with it behind a Objective-C category addition, though- a NSArray / NSDictionary addition like -objectFromBuggyJSONString- it does all the required preprocessing, then calls the appropriate JSONKit method, and hands back the result.

Since you're actually stuck having to deal with this problem, I'm interested to hear your thoughts.

@Kentzo
Copy link
Author

Kentzo commented Jun 11, 2011

@johnezang

Unfortunately I need to parse likely big chunks of JSON data (10-20MB divided into ~100KB parts). So the "External preprocessing takes longer" disadvantage is serious enough.

I think it is important to have a built-in option to process data whenever it contains errors since data may be passed from the external resource. I agree that encapsulating details of dealing with corrupted data is often a good solution, but for this situation I would prefer a built-in in solution.

@tciuro
Copy link

tciuro commented Jun 13, 2011

@johnezang

The second method posted on May 24 leaks memory. The block should return the following instead:

(NSString *)[[[NSString alloc] initWithFormat:@"\\u%4.4x", [capturedStrings[0] characterAtIndex:0ul]]autorelease]);

@johnezang
Copy link
Owner

@tciuro

The method I posted in my comment does not leak memory. In fact, the fix that you suggested (autorelease the returned string) is buggy because it will over-release the returned string.

The reason why the code I originally posted does not require the returned NSString to be released, as one would normally expect due to standard memory management practices, is the ^Block is invoked with the RKLRegexEnumerationReleaseStringReturnedByReplacementBlock option. With this option, RegexKit_Lite_ will send a -release to the string returned by the ^Block.

The reason behind this is that a very large number of temporary strings can be created by the ^Block as it does its work. If those strings are added to a NSAutoreleasePool, those strings, and their memory, can not be reclaimed until the NSAutoreleasePool is popped. By having RegexKit_Lite_ send the -release as soon as it is finished with the returned ^Block string, that memory can be reclaimed immediately.

FYI (for anyone who is thinking about using the solution I posted), the RKLRegexEnumerationFastCapturedStringsXXX option used is somewhat special too. You should read the RegexKit_Lite_ documentation before using any of the strings that RegexKit_Lite_ passes to the ^Block because there are some extremely important caveats and rules that must be followed when using that option (hence the XXX designation).

@tciuro
Copy link

tciuro commented Jun 13, 2011

OK, that makes more sense. I didn't know RegexKitLite had this type of rules. It's too bad, because it certainly can cause confusion.

@johnezang
Copy link
Owner

By default, RegexKit_Lite_ behaves as you would intuitively expect (from the perspective of Cocoa memory management rules). Without any extra enumerationOptions:, your bug fix suggestion would have been the correct thing to do.

To get the non-standard behavior (where you return a non-autoreleased string), you need to explicitly enable it with the RKLRegexEnumerationReleaseStringReturnedByReplacementBlock option flag. I can only hope that anyone who writes code that explicitly uses that enumeration option has either actually read the documentation before using it, or is able to figure out what it does from the description. :) The point being that RegexKit_Lite_ will use the standard Cocoa memory management patterns unless you've explicitly asked it not to. It's really meant for those that have spent some time profiling their code and they need to improve the performance or memory footprint of a particular ^Block that profiling has identified as a problem spot. But it's easy to miss if you're doing a quick "code-review" once over of the code and you're not familiar with the particulars of RegexKit_Lites_ enumeration options.

@tciuro
Copy link

tciuro commented Jun 14, 2011

"But it's easy to miss if you're doing a quick "code-review" once over of the code and you're not familiar with the particulars of RegexKitLites enumeration options."

Precisely. I understand our point though. As long as these cases are well documented, it's all fine. Thanks for explaining it.

@johnezang
Copy link
Owner

As long as these cases are well documented, it's all fine.

:) From the comment with the code:

A second work around is to make use of RegexKitLites ^Blocks functionality. The example below also takes advantage of some of the advanced ^Block enumeration options, RKLRegexEnumerationReleaseStringReturnedByReplacementBlock and RKLRegexEnumerationFastCapturedStringsXXX. _With the first option, you return a non-autoreleased string, and RegexKitLite sends the returned string a -release as soon as its done with it._ This keeps memory consumption done [sic, should be down]. The second option is a "wizard level" option, and you should probably read the docs to understand what it does, but in a nutshell it speeds things up and keeps memory usage down, but there are some important caveats to using it (again, see the documentation). The usage of the option below is perfectly safe (in fact, the usage below is precisely why the option exists).

@fpillet
Copy link

fpillet commented Sep 29, 2011

John,

Regardless of whether you consider unicode chars < \u0020 as erroneous, I (at least) have a real need for accepting these characters AND convert them as bytes of their values. I'm heavily using JSON as the internal exchange medium between a product and its scripting engine which is JavaScript. To this end, I exchange JSON between native and JavaScript, and I need to be able to exchange strings containing binary data as well.

Right now, for every string that's being exchanged I'm going through a very costly encoding phase that converts chars < 0x20 to the \\x## text representation, for the sole purpose of JSONKit not choking on them.

Needless to say, given the cost of this on large strings (i.e. images), this totally wipes the benefit of using JSONKit.

So what I need is an option that says "allow binary chars < 0x20 to be decoded to their byte value".

I tried patching isValidCodePoint() to accept them, but apparently it's not enough :-(

Any chance you can add this? would be fantastic!

@johnezang
Copy link
Owner

Regardless of whether you consider unicode chars < \u0020 as erroneous, I (at least) have a real need for accepting these characters AND convert them as bytes of their values.

With respect, it's not my opinion, and to be blunt, my, or anyone else's, opinion on the matter is irrelevant.

RFC 4627, the JSON standard, says that raw, unescaped Unicode code points < \u0020 (modulo certain "obvious" white space characters in specific contexts) are erroneous.

That is to say, "unicode chars < \u0020" are erroneous _by definition_.

The RFC 4627 standard is Right, even its mistakes.

Right now, for every string that's being exchanged I'm going through a very costly encoding phase that converts chars < 0x20 to the \\x## text representation, for the sole purpose of JSONKit not choking on them.

Needless to say, given the cost of this on large strings (i.e. images), this totally wipes the benefit of using JSONKit.

... I see. I understand your need, and I, along with a great many other people, have often wanted the ability to transport raw binary data within some JSON. There's been times when I'd kill for this ability.

That being said, the change that you'd like me to make is, quite simply, _never_ going to happen. The reasons for why it shouldn't be done simply dwarf and trump _any_ argument in favor of doing it.

Needless to say, given the cost of this on large strings (i.e. images), this totally wipes the benefit of using JSONKit.

You very obviously were using a different JSON parser before JSONKit, and when you switched to JSONKit, you had a performance regression.

The answer is obvious: switch back. :) (I say this tongue in cheek).

So what I need is an option that says "allow binary chars < 0x20 to be decoded to their byte value".

I tried patching isValidCodePoint() to accept them, but apparently it's not enough :-(

It wouldn't matter if I ripped it all out and allowed raw, unfiltered and unmodified raw binary data straight from the original JSON. It still wouldn't work. You could write a new JSON parser from scratch and it still wouldn't work.

In fact, you should have immediately stopped what you were doing and reconsidered the entire approach when you saw "CodePoint". This should have been a big clue that you are not dealing with raw, binary data, but a Unicode string. A Unicode string is made up of a number of logical Unicode Code Points. These logical Code Points do not have a canonical binary representation- the same Unicode string can be encoded as UTF-8, UTF-16BE, UTF-16LE, UTF-32*, etc, and each of which has a very different binary / byte representation while still "meaning" the same thing.

This is one of, but not the only, reasons why the idea of using JSON to transport raw, binary data in the way you'd like is fundamentally broken by design.

Any chance you can add this? would be fantastic!

No.

I strongly suggest you take a look at one of the NSString Base64 extensions.

@fpillet
Copy link

fpillet commented Sep 29, 2011

Thanks for your detailed answer, John. I suppose it doesn't matter to you that in the browser, JSON.stringify() exactly does what is said being illegal: generate \u#### entities for values < 0x20.

I get your point anyway. I wasn't using another parser before, I went straight to the fastest I could find.

Either way, I ended up properly patching JSONKit to support my needs, and I'll keep my patched version for the sole purpose of not killing my product's performance just because the RFCs say that pretty much what WebKit does (and most certainly every browser on the planet) is illegal.

And in any case, thanks for writing JSONKit! It's very helpful (and we use both JSONKit and RegexKitLite quite heavily in our product, as we have to maintain iOS 3.x compatibility for the time being, NSRegularExpression is out of question).

@fpillet
Copy link

fpillet commented Sep 29, 2011

Oh and for what it's worth, as long as I'll be able to avoid a costly phase of JS preprocessing, I won't even consider it (and encoding to base64 would have to be done in JS too). I timed the encoding time on a device at roughly 1ms per kilobyte of data. When you need to transport a 150k image, it's 150ms wasted (and on the main thread, but that's another story).

@grgcombs
Copy link

For a while I was transporting very large geographic polygons via base64 preprocessing in MySQL to RestKit and then unpacking the MKPolygons for deployment on the iOS device. This worked okay, but then I realized I wasn't saving any bandwidth even when comparing the long base64 encoded strings to the millions of latitude and longitude point coordinate pairs in JSON. Neat. Strings aren't as compact or as flexible as raw data streams, go figure.

Point being, people need JSONKit to do different shit all the time. However, it's not really @johnezang's duty to make it do all things for all situations. Moreover, the biggest tenet John has for the project is sticking to the standard. After that, the next tenet is making it as fast as a muther-fudge-licker. What you want is clearly not in the realm of the standard, and beyond that, the proposed modifications could jeopardize the rock solid stability the rest of us have come to depend on. So, to the extent that you can trick out whatever you need to in order to get your project running fast and reliably, you've got the power. No one, not John, not Apple, not even The Internets will stand in your way.

@fpillet
Copy link

fpillet commented Sep 30, 2011

I'm not arguing that John should jeopardize JSONKit's stability, power nor speed. I was merely asking for an option that allows letting binary data through, to let JSONKit match reality (and the needs of modern applications), not just RFCs.

If there's no way this can be added to the official repo, fair enough. I can live with that.

@atnan
Copy link
Contributor

atnan commented Sep 30, 2011

@fpillet Have you tried using binary plists as a wire transport format?

@fpillet
Copy link

fpillet commented Sep 30, 2011

@atnan Can binary plists be natively generated by mobile webkit, or do they require JS code to encode them?

Let me make things straight - I'm in the context of a headless, in-app UIWebView which runs Javascript that interacts with the application. This view talks with the underlying native application, both ways (i.e the native app can send data to it, script can call native app APIs and send data to it too). Think about it a bit like Phonegap, but certainly more advanced in many ways.

When JS running there exchanges information with the native app, it uses JSON. The reason is that JSON is natively generated and parsed by the webkit engine. If I were to use ANY other non-native format, this would have an impact on performance as every bit of data exchanged would have to be encoded (JS -> native) or decoded (native -> JS) using JS code, which as you may know, runs interpreted (NOT jitted) in third party apps.

In other words, the only way to get decent performance is to go straight to JSON. Generated data size is of less importance, as there is no "wire" (it's all in-app). What matters here is speed. This is the reason why I selected JSONKit in the first place, and why I'll be maintaining a modified version that accepts binary data.

@adamjernst
Copy link

@fpillet Check out NSURLProtocol. You can create your own protocol that can either serve up images directly (by just dumping an NSData in response to a request) or accept images uploaded from the UIWebView (by POSTing them to your protocol).

JSON is the wrong tool for the task here, and what's more NSURLProtocol will be even faster than JSON.

@fpillet
Copy link

fpillet commented Sep 30, 2011

@adamjernst I don't think this would pan out. Our product is basically a UI / communications engine and our customers drive it using Javascript. They need to be able to do everything in JS, because they don't have access to the native source code.

It's not just images, it's binary data all accross the board. For example, data received from a remote system can be routed to JS by the native code, JS processes it the way it wants and can send stuff back by calling back into native, which handles the comms with the remote system.

And before you suggest, no XHR won't cut the deal. We support full range of network comms, like direct TCP, SSL, UDP, broadcast, multicast, everything that's not currently possible to do right in JS/HTML alone. So we are exposing a relatively rich API for JS to use, and people start banging it quite seriously. Believe me, for what we're doing JSON is pretty much the fastest game in town.

I get your point regarding NSURLProtocol and this is something I have thought about before (not specifically this, but using an internal web server to dialog with JS). But due to the various kind of data that can be exchanged, again the time taken to encode the data pretty much nullifies the benefit of using an internal server to POST binary data.

@serkrapiv
Copy link

@johnezang

Can you quote the part of RFC4627 that says that characters < 0x20 are verboten? can't find it, seems like I'm missing something

@keverets
Copy link

@serkrapiv Not necessarily verboten, just that they must be escaped. The first paragraph of 2.5 Strings says:

The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

@yuuki3125y
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants