Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relaxed parsing mode #483

Closed
vorner opened this issue Feb 11, 2019 · 7 comments · Fixed by #484
Closed

Relaxed parsing mode #483

vorner opened this issue Feb 11, 2019 · 7 comments · Fixed by #484

Comments

@vorner
Copy link

vorner commented Feb 11, 2019

Hello

I understand the library tries to follow the standard as close as possible.

However, there are URLs out there in the wild that exist, work and are rejected by this library as invalid. As an example, http://canada-region-70-24-.static-apple-com.center/ (rejected because of the trailing -) ‒ if you point a browser there, you'll get a content (not a very useful one, granted).

Would it be possible to have some relaxed parsing mode (if there is, I haven't found it in the documentation) where I would still get the methods to get the host and path and query out of the URL and canonicalize it, while allowing for certain violations against how a good URL looks like? I have little choice over the URLs that I need to handle ‒ basically, whatever happens in the wild might come my way and I have to do something meaningful with it.

@nox
Copy link
Contributor

nox commented Feb 11, 2019

The spec for parsing hosts says:

  1. Let asciiDomain be the result of running domain to ASCII on domain.

The spec for the domain to ASCII algorithm says:

Let result be the result of running Unicode ToASCII with domain_name set to domain, UseSTD3ASCIIRules set to beStrict, CheckHyphens set to false, CheckBidi set to true, CheckJoiners set to true, Transitional_Processing set to false, and VerifyDnsLength set to beStrict.

The spec for Unicode ToASCII refers to the validity criteria section which says:

If CheckHyphens, the label must neither begin nor end with a U+002D HYPHEN-MINUS character.

But as mentioned earlier, the Unicode ToASCII algorithm is to be invoked with CheckHyphens set to false.

Seems like a rust-url bug to me.

@nox
Copy link
Contributor

nox commented Feb 11, 2019

rust-url/idna/src/uts46.rs

Lines 253 to 262 in a1d8c88

// NOTE: Spec says that the label must not contain a HYPHEN-MINUS character in both the
// third and fourth positions. But nobody follows this criteria. See the spec issue below:
// https://github.com/whatwg/url/issues/53
//
// TODO: Add *CheckHyphens* flag.
// V3: neither begin nor end with a U+002D HYPHEN-MINUS
else if label.starts_with("-") || label.ends_with("-") {
errors.push(Error::ValidityCriteria);
}

@nox
Copy link
Contributor

nox commented Feb 11, 2019

Ouch. idna::uts46::Flags is a public struct with public fields, so adding one for the CheckHyphens flag is a breaking change for the idna crate, and idna::uts46::Errors is exposed in the url crate through the From<idna::uts46::Errors> for url::ParseError, meaning the breaking change should be propagated as a breaking bump in url too. 😕

@nox
Copy link
Contributor

nox commented Feb 11, 2019

I suggest we add a type idna::uts46::Config with builder methods similar to the public fields of idna::uts46::Flags and two methods to_ascii and to_unicode.

We can then reimplement idna::uts46::to_ascii (and similarly do the same thing for to_unicode) as:

pub fn to_ascii(domain: &str, flags: Flags) -> Result<String, Errors> {
    Config::from(flags).to_ascii(domain)
}

@SimonSapin
Copy link
Member

It is increasingly time for url 2.0: #463

@mixalturek
Copy link

Hi, the changes fixed the reported issue, all our unit tests are passing. Thank you very much for your extremely quick feedback! Do you have any ETA for release of the crate?

[patch.crates-io]
url = { git = "https://github.com/servo/rust-url.git", branch="hyphens" }

@SimonSapin
Copy link
Member

SimonSapin commented Feb 14, 2019

In general, I think we probably shouldn’t have parsing modes. If we ever do, they should be precisely specified. Just calling it “relaxed” doesn’t say what exactly is accepted or not.

This library is intended to be used (among others) in browser a implementation, so if its behavior differs from interoperable behavior in other browsers, that’s a bug either in this implementation on in the specification https://url.spec.whatwg.org/.

In this case, it looks like the specification has changed and we hadn’t been keeping up. #484 fixes this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants