-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Some popular services still produces invalid JSON with <0x20 characters #26
base: master
Are you sure you want to change the base?
Conversation
I have mixed feelings about this. Can you give some specific examples of services that generate bad JSON? I'm not aware of any off the top of my head, but I have no doubt they exist. In this particular case, these services are very clearly "in the wrong". RFC 4627 is unambiguous that characters < The Unicode standard specifies how to deal with malformed Unicode text. For the purposes of JSONKit, this means If there are services that are generating JSON with characters < I'll certainly entertain reasons why your patch, or something along the lines of your patch, should be incorporated in to JSONKit- if this is a common, wide spread problem that people have to deal with, then pragmatism says an exception to the rule should be made (this was done with a recent bug report regarding Microsofts way of encoding dates as Also, I do not think your patch works as-is. You need to use something like: if(jk_string_add_unicodeCodePoint(parseState, UNI_REPLACEMENT_CHAR, &tokenBufferIdx, &stringHash)) {
jk_error(parseState, @"Internal error: Unable to add UTF8 sequence to internal string buffer. %@ line #%ld", [NSString stringWithUTF8String:__FILE__], (long)__LINE__);
stringState = JSONStringStateError;
goto finishedParsing;
} In your patch, the above should (at least) replace your line 1457 (and probably the other places where you try to substitute |
According to the comment on redmine.org all services based on pre-rails 3 are affected. As you can see from the comment, the problem is redmine does not pre-process user input. I do not insist on fixing the problem that way, but at least it would be good if we could have an option to process even technically invalid data somehow. Probably by replacing character with it's meaning (following to my example, ^ACKNOWLEDGE or ^F instead of 0x06 character). This way we do not alter the intended meaning of the original string and it would be simple for a user-side application to process it. |
I can at least suggest a work around. I'm sort of of the opinion that something along these lines is probably the better way to handle this particular quirk. The solution involves RegexKitLite, which you can download here: RegexKitLite-4.0.tar.bz2. Basically, there's two obvious work arounds. The first is fairly simple: Use a regex to match all the characters from fixedJSONString = [jsonString stringByReplacingOccurrencesOfRegex:@"[\\u0000-\\u001f]" withString:@"\\\\uFFFD"]; This is essentially what your patch does, but moves it to an extra, external to JSONKit pre-processing step. A second work around is to make use of RegexKitLites ^Blocks functionality. The example below also takes advantage of some of the advanced ^Block enumeration options, By using a ^Block, you have much finer control of what the replacement text is for the matched characters. In this case we use this to match the same range of characters, < NSString *fixedJSONString = [jsonString stringByReplacingOccurrencesOfRegex:@"[\\u0000-\\u001f]"
options:RKLNoOptions
inRange:NSMakeRange(0UL, [jsonString length])
error:NULL
enumerationOptions:(RKLRegexEnumerationReleaseStringReturnedByReplacementBlock | RKLRegexEnumerationFastCapturedStringsXXX)
usingBlock:^(NSInteger captureCount, NSString * const capturedStrings[captureCount], const NSRange capturedRanges[captureCount], volatile BOOL * const stop) {
return((NSString *)[[NSString alloc] initWithFormat:@"\\u%4.4x", [capturedStrings[0] characterAtIndex:0ul]]);
}]; Then, you pass |
Any thoughts on my suggestion for dealing with this particular problem at the Disadvantages to this approach:
Advantages:
Probably the most compelling reason I can think of is that it's not entirely clear that there is a best way, let alone a single way, for dealing with JSON that is non-RFC 4627 compliant. Most of the real world situations that I can think of really require a solution that is tailored to each situation. Anyone who is forced to deal with JSON like this can most likely encapsulate the details of dealing with it behind a Objective-C category addition, though- a Since you're actually stuck having to deal with this problem, I'm interested to hear your thoughts. |
Unfortunately I need to parse likely big chunks of JSON data (10-20MB divided into ~100KB parts). So the "External preprocessing takes longer" disadvantage is serious enough. I think it is important to have a built-in option to process data whenever it contains errors since data may be passed from the external resource. I agree that encapsulating details of dealing with corrupted data is often a good solution, but for this situation I would prefer a built-in in solution. |
The second method posted on May 24 leaks memory. The block should return the following instead:
|
The method I posted in my comment does not leak memory. In fact, the fix that you suggested ( The reason why the code I originally posted does not require the returned The reason behind this is that a very large number of temporary strings can be created by the ^Block as it does its work. If those strings are added to a FYI (for anyone who is thinking about using the solution I posted), the |
OK, that makes more sense. I didn't know RegexKitLite had this type of rules. It's too bad, because it certainly can cause confusion. |
By default, RegexKit_Lite_ behaves as you would intuitively expect (from the perspective of Cocoa memory management rules). Without any extra To get the non-standard behavior (where you return a non-autoreleased string), you need to explicitly enable it with the |
"But it's easy to miss if you're doing a quick "code-review" once over of the code and you're not familiar with the particulars of RegexKitLites enumeration options." Precisely. I understand our point though. As long as these cases are well documented, it's all fine. Thanks for explaining it. |
:) From the comment with the code:
|
John, Regardless of whether you consider unicode chars < Right now, for every string that's being exchanged I'm going through a very costly encoding phase that converts chars < Needless to say, given the cost of this on large strings (i.e. images), this totally wipes the benefit of using JSONKit. So what I need is an option that says "allow binary chars < 0x20 to be decoded to their byte value". I tried patching isValidCodePoint() to accept them, but apparently it's not enough :-( Any chance you can add this? would be fantastic! |
With respect, it's not my opinion, and to be blunt, my, or anyone else's, opinion on the matter is irrelevant. RFC 4627, the JSON standard, says that raw, unescaped Unicode code points < That is to say, "unicode chars < The RFC 4627 standard is Right, even its mistakes.
... I see. I understand your need, and I, along with a great many other people, have often wanted the ability to transport raw binary data within some JSON. There's been times when I'd kill for this ability. That being said, the change that you'd like me to make is, quite simply, _never_ going to happen. The reasons for why it shouldn't be done simply dwarf and trump _any_ argument in favor of doing it.
You very obviously were using a different JSON parser before JSONKit, and when you switched to JSONKit, you had a performance regression. The answer is obvious: switch back. :) (I say this tongue in cheek).
It wouldn't matter if I ripped it all out and allowed raw, unfiltered and unmodified raw binary data straight from the original JSON. It still wouldn't work. You could write a new JSON parser from scratch and it still wouldn't work. In fact, you should have immediately stopped what you were doing and reconsidered the entire approach when you saw " This is one of, but not the only, reasons why the idea of using JSON to transport raw, binary data in the way you'd like is fundamentally broken by design.
No. I strongly suggest you take a look at one of the |
Thanks for your detailed answer, John. I suppose it doesn't matter to you that in the browser, I get your point anyway. I wasn't using another parser before, I went straight to the fastest I could find. Either way, I ended up properly patching JSONKit to support my needs, and I'll keep my patched version for the sole purpose of not killing my product's performance just because the RFCs say that pretty much what WebKit does (and most certainly every browser on the planet) is illegal. And in any case, thanks for writing JSONKit! It's very helpful (and we use both JSONKit and RegexKitLite quite heavily in our product, as we have to maintain iOS 3.x compatibility for the time being, |
Oh and for what it's worth, as long as I'll be able to avoid a costly phase of JS preprocessing, I won't even consider it (and encoding to base64 would have to be done in JS too). I timed the encoding time on a device at roughly 1ms per kilobyte of data. When you need to transport a 150k image, it's 150ms wasted (and on the main thread, but that's another story). |
For a while I was transporting very large geographic polygons via base64 preprocessing in MySQL to RestKit and then unpacking the MKPolygons for deployment on the iOS device. This worked okay, but then I realized I wasn't saving any bandwidth even when comparing the long base64 encoded strings to the millions of latitude and longitude point coordinate pairs in JSON. Neat. Strings aren't as compact or as flexible as raw data streams, go figure. Point being, people need JSONKit to do different shit all the time. However, it's not really @johnezang's duty to make it do all things for all situations. Moreover, the biggest tenet John has for the project is sticking to the standard. After that, the next tenet is making it as fast as a muther-fudge-licker. What you want is clearly not in the realm of the standard, and beyond that, the proposed modifications could jeopardize the rock solid stability the rest of us have come to depend on. So, to the extent that you can trick out whatever you need to in order to get your project running fast and reliably, you've got the power. No one, not John, not Apple, not even The Internets will stand in your way. |
I'm not arguing that John should jeopardize JSONKit's stability, power nor speed. I was merely asking for an option that allows letting binary data through, to let JSONKit match reality (and the needs of modern applications), not just RFCs. If there's no way this can be added to the official repo, fair enough. I can live with that. |
@fpillet Have you tried using binary plists as a wire transport format? |
@atnan Can binary plists be natively generated by mobile webkit, or do they require JS code to encode them? Let me make things straight - I'm in the context of a headless, in-app UIWebView which runs Javascript that interacts with the application. This view talks with the underlying native application, both ways (i.e the native app can send data to it, script can call native app APIs and send data to it too). Think about it a bit like Phonegap, but certainly more advanced in many ways. When JS running there exchanges information with the native app, it uses JSON. The reason is that JSON is natively generated and parsed by the webkit engine. If I were to use ANY other non-native format, this would have an impact on performance as every bit of data exchanged would have to be encoded (JS -> native) or decoded (native -> JS) using JS code, which as you may know, runs interpreted (NOT jitted) in third party apps. In other words, the only way to get decent performance is to go straight to JSON. Generated data size is of less importance, as there is no "wire" (it's all in-app). What matters here is speed. This is the reason why I selected JSONKit in the first place, and why I'll be maintaining a modified version that accepts binary data. |
@fpillet Check out JSON is the wrong tool for the task here, and what's more |
@adamjernst I don't think this would pan out. Our product is basically a UI / communications engine and our customers drive it using Javascript. They need to be able to do everything in JS, because they don't have access to the native source code. It's not just images, it's binary data all accross the board. For example, data received from a remote system can be routed to JS by the native code, JS processes it the way it wants and can send stuff back by calling back into native, which handles the comms with the remote system. And before you suggest, no XHR won't cut the deal. We support full range of network comms, like direct TCP, SSL, UDP, broadcast, multicast, everything that's not currently possible to do right in JS/HTML alone. So we are exposing a relatively rich API for JS to use, and people start banging it quite seriously. Believe me, for what we're doing JSON is pretty much the fastest game in town. I get your point regarding |
Can you quote the part of RFC4627 that says that characters < 0x20 are verboten? can't find it, seems like I'm missing something |
@serkrapiv Not necessarily verboten, just that they must be escaped. The first paragraph of 2.5 Strings says: The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F). |
あ |
Do not break parsing if <0x20 is occurred but JKParseOptionLooseUnicode is specified