-
-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
URIs are open()ed for no reason #148
Comments
I was looking at that code today (see #147). When you run In your case it will try to load json from 'http://foo.bar/?bar=qux#quux'. When I run this in irb it does take a long time.
The uri could be changed to something else, but I'm not sure what to suggest without knowing what the test is doing. Or perhaps json-schema should set it's own timeout? * Probably not ideal but I have plans for that |
It's not clear to me why initialize_data is trying to load remote data at all; shouldn't we only fetch a URI if there's a schema we're trying to load from it (eg, it was the value of a json-schema definitely shouldn't make requests to any arbitrary URI that just-so-happens to be embedded in an object. For example, my common use-case with rack-schema is for validating API responses, which have a ton of URIs inside of them: {
"widget": { "manufacturer_id": 1, "size": 10, "price": "99.99" },
"links": { "manufacturer": "https://some.api/manufacturers/1" }
} I haven't had a chance to test locally, but I think that means we're actually hitting those endpoints during validation? |
@pd We should most definitely only be hitting a URI if there's a schema trying to load from it in a I'll take a look at this today as well. |
By the way, autoloading of ref schemas is something I want to make optional in version 3 of this library. |
Oh I see your point. I think you're misunderstanding here. Your schema definition says you expect a string, formatted as a uri. However, when you pass 'http://foo.bar/?baz=qux#quux' as data, json schema doesn't think you're passing it something to validate at all - it thinks your data is at that url, and tries to load it. I suspect that if your data was |
Basically, if you pass json schema a string, and it's not valid json (any arbitrary string is not necessarily valid json, to my knowledge - it must start with "{" and end with "}") json schema will assume it's a url, and try to load that. Your example isn't even hitting the validation code. |
Ah, this makes more sense. Thanks for clarifying before I went on a huge witch hunt, @iainbeeston . Will monitor this before I dive into the code during the work day... |
The test that is triggering this is here: https://github.com/json-schema/JSON-Schema-Test-Suite/blob/develop/tests/draft4/optional/format.json#L23-L42 That entire file is based around the premise that the validator can be run on arbitrary JSON types; I think we have a lot of local tests that do the same -- eg, validate @iainbeeston's description makes sense and explains why my example above does not trigger a bunch of web requests. I think this is still a bug though; only schemas should cause the validator to (optionally) go fetch remote data, right? Another example:
If I understand this correctly, that triggers a GET to the Github API, then throw away the result and tell me that What's the use case for |
I imagine it's for convenience (@hoxworth?) Right now, whether JSON schema can validate an arbitrary string depends on |
Okay, there's a couple of things happening here... As @iainbeeston points out, when the data is simply a bare string, it is not valid JSON as per the JSON spec. This library attempts to open the URI and use the data at the specified URI as the data to validate. If the data that is fetched is valid JSON and is parsable, the data is NOT discarded; otherwise, it is discarded and the original string is used to represent the JSON data. In the case of the GitHub API fetch, the JSON returned actually does validate against the simple The fact of the matter, however, is that we are attempting to validate invalid JSON against a JSON schema. I personally don't really like this, and think the common test suite should change and not expect validators to accept invalid JSON. |
That's a really interesting point. I've double checked the ietf spec and it's explicitly for validating json objects, not just arbitrary data. Which means that many of the common test suite tests are invalid (e.g. the format spec linked by @pd) Does anyone want to raise the issue on the JSON-Schema-Test-Suite repo? https://github.com/json-schema/JSON-Schema-Test-Suite/issues |
@iainbeeston Yeah, I'll go open a ticket. |
Issue opened on the common test suite: json-schema-org/JSON-Schema-Test-Suite#64 |
Are you talking about the JSON Schema spec? It does not say anything like this... |
As @fge pointed out on the above issue, JSON Schema explicitly validates JSON values, not JSON texts. This only affects URIs at the present due to the magic resolution attempt, but is definitely something we need to support and fix. (I still am not enamored with validating non-valid JSON texts, but whatever 😄 ) |
@fge My mistake - you're right. I confused JSON data with JSON objects. If I read through it again and look up every definition:
|
Also, (just for completeness, in case this ever helps anyone find the definition of what valid JSON is) RFC4627 has been superseded by RFC7159, which replaces point 4 with:
|
Possible option: require 'json'
JSON.parse('1')
#=> JSON::ParserError: A JSON text must at least contain two octets!
JSON.parse('"http://foo.com"')
#=> JSON::ParserError: 757: unexpected token at '"http://foo.com"'
JSON.load('1') #=> 1
JSON.load('"http://foo.com"') #=> "http://foo.com"
require 'oj' # the only other json library i have installed atm
Oj.load('1') #=> 1
Oj.load('"http://foo.com"') #=> "http://foo.com" But note that All of this is complicated by the |
I think that (maybe for v3) validate should assume it's a string to be validated. I think if the user wants to load a url they should have to be explicit by using the uri hash parameter (eg. "uri: 'http://example.com/some.json'" argument). That parameter already exists, as a way of skipping the json parsing step and downloading the uri right away. Otherwise the string should be validated against the schema.= |
Yes, currently we'll either break the API or the spec with any attempt to fix this issue. It's an unfortunate issue that most definitely needs to be addressed in v3, probably in the manner that @iainbeeston proposes above. I'm beginning to think v3 is going to have to be a sooner-rather-than-later migration. |
While we’re on the subject of loading strings that happen to be a uri, it might be worth noting that if the string is interpreted (by URI.parse()) as a relative url then it’s turned into a file url. So for strings like ‘/foo/bar’, json-schema will try to load them from the filesystem (aside: I wonder if there’s potential there for an xss like attack if a server is validating json in http requests...) |
I was thinking basically the opposite; that we should expect to receive ruby objects for both schema and data. I typically use json-schema in rack apps, where I've often already got the JSON parsed, handled parse errors, etc. I wouldn't want to have to pay the cost of JSON parsing all over again each time I hit json-schema to validate the payload. I think the problem is mostly that |
Sorry to chime in, but do I understand there that "Ruby JSON" expects JSON Texts as defined by RFC 4627? If yes, why not use an alternate JSON parsing library which parses JSON values instead? |
@pd I've been thinking about this, and now I agree completely. Ideally, we shouldn't try to parse json at all - there are already good libraries to do that - and just focus on the validation. This would remove a large amount of code, and any ruby developer that can call |
@fge Unfortunately all of the ruby json parsers that I'm aware of default to parsing json text, not json values. However, I've just discovered that Ruby JSON has a "quirks mode" that will parse both json values and json text, e.g. It's also interesting to see that |
@iainbeeston OK, I'll chime in again but something seems a bit off here; decomposing the flow, you have:
At least, that is how I view it. It looks like Ruby does things quite differently here... |
No, I think Ruby does it all as you suggested... It's just that in ruby, the default that has evolved is to parse json text, not just json values. The only way I've found (so far) to parse json values from a character stream is to use the ruby json parser, in quirks mode. Without quirks mode (ie. the default mode) it will raise an error if it tries to parse json values (as opposed to json text). Does that make sense? (Even if you disagree with that implementation?) |
@iainbeeston it makes sense and yes, I have trouble with this particular part:
It was OK when RFC 4627 was the reference, but it isn't anymore ;) |
@iainbeeston I think the correct way for us to parse values is
But we'll still have to figure out how this interacts with all the different MultiJson adapters. =\ I think you're right about just dropping support for parsing JSON, and we should only accept pre-loaded objects, so that it's the caller's responsibility to deal with the craziness! Which would mean this is finally unambiguous: JSON::Validator.validate({ type: 'string' , format: 'uri' }, 'http://foo.com') |
@fge It seems like it's an open issue in Ruby JSON (ruby/json#206) but nobody is actively working on it |
@pd Apparently quirks mode does more weird stuff than just allowing top-level json values (see ruby/json#206) - it sounds like what we're doing right now is probably closer to being correct (bizarrely) |
I got alerted to this via a pen test. We are using this library to validate user input against a schema. If the user puts in a URI as a value, then this will make a request to the URI. |
I found this while working through more of the common test suite failures. The
draft4/optional/format.json
test currently pauses during a URI validation test because it's trying to make a request to load data from the URI. You can reproduce this with:The final line will eventually return true (which is correct), but if you wrap it in
Timeout.timeout(15) { ... }
or whatever, it reliably throws an exception.I've tracked this down to one of these
open()
calls insideValidator#initialize_data
, but haven't yet had a chance to dig in any further to figure out when those are necessary and when they aren't:https://github.com/hoxworth/json-schema/blob/8daf3de35ab834bc9c2325a967be25322b909cdd/lib/json-schema/validator.rb#L584-L598
The text was updated successfully, but these errors were encountered: