Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add IDNA support and integrate with DNS lookup #2543

Merged
merged 13 commits into from
Apr 10, 2018

Conversation

makenowjust
Copy link
Contributor

Punycode is the encoding for IDNA, which is defined in RFC 3492. And IDNA means Internationalized Domain Names in Application defined in RFC 5980, which is method to use non-ascii characters as host name of URI. For example, when you open http://日本語ドメイン.com/ in browser, browser converts this URL to http://xn--eckwd4c7c5976acvb2w6i.com/ internally. On this, 日本語ドメイン.com is IDN and xn--eckwd4c7c5976acvb2w6i.com is IDNA. I added punycode support.

RFC 3986 says:

When a non-ASCII registered name represents an internationalized domain name
intended for resolution via the DNS, the name must be transformed to the IDNA
encoding [RFC3490] prior to name lookup.

so I added integration of DNS lookup. Of course, it is better that getaddrinfo supports to convert IDN to IDNA internally, and in fact getaddrinfo supports such a thing since glibc 2.3.4 or Windows 8. However, this is extension and it is not portable, in other words it is not defined in POSIX (Especially, it is disabled by default on glibc). Finally, I think this is the best way to support IDNA.

Thanks.

@kostya
Copy link
Contributor

kostya commented May 2, 2016

funny, yesterday doing the same https://github.com/kostya/simple_idn, even code is similar

@makenowjust
Copy link
Contributor Author

@kostya I guess that you use punycode.js implementation as reference, and so, I refered to it.

@makenowjust
Copy link
Contributor Author

Formatted!

src/punycode.cr Outdated
end
end

raise Error.new "invalid input" unless init
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need a custom error class here? ArgumentError seems fine to me.

@jhass
Copy link
Member

jhass commented May 30, 2016

I wonder how hard versions of this would be that directly read/write to IO, if possible at all.

@asterite
Copy link
Member

We could accept this without support for appending to an IO and later improve it.

After applying the suggestions made by @jhass we could merge this.

@ysbaddaden
Copy link
Contributor

What about moving it under the URI namespace as URI::Punycode and uri/punycode.cr?

@straight-shoota
Copy link
Member

@makenowjust Could you address the open comments?

@sdogruyol
Copy link
Member

This is also a good PR, @makenowjust how about updating?

@Sija
Copy link
Contributor

Sija commented Sep 22, 2017

@makenowjust 🏓

@makenowjust
Copy link
Contributor Author

@Sija Sorry I'm busy

@makenowjust
Copy link
Contributor Author

makenowjust commented Mar 4, 2018

Updated! Thank you @epergo!

end

def self.encode(string : String)
encode string.chars
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to use a Char::Reader instead of allocating an array of chars. I don't think there is a real usecase for supplying a char array, is it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

end

def self.encode(string : String, io)
encode string.chars, io
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

And it uses `each_char` instaed of `chars`, it is memory efficient.
all = true
others = [] of Char

chars.each do |c|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line could just be string.each_char do |c| - this inlines the block instead of using an iterator instance.

String.build { |io| encode string, io }
end

def self.encode(string, io)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor nit bit: methods receiving an optional IO should have it prepended as first argument. This makes it easier if maybe additional arguments are added later to keep both signatures similar. Even if that's unlikely I'd recommend to stick with this strategy.

Copy link
Contributor Author

@makenowjust makenowjust Mar 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

URI.encode is also accept String as first argument and IO as second argument. I think no problem.

end

def self.encode(string, io)
h = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if these variables could be more descriptive. What does h mean?

Copy link
Contributor Author

@makenowjust makenowjust Mar 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't describe this correctly (because the date I wrote this is 2 years ago.) Sorry 🙇 . But h and other variable names are traditional, they are used in RFC 3492's reference implementation.


return if others.empty?
others.sort!
io << DELIMITER unless all
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need all at all. It can be replaced by h == 0 (note the next comment, then this needs to be after that line and h == 1).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right.

firsttime = true
prev = nil

h += 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably set the value to h = string.size - others.size + 1 instead of incrementing it in the loop (just a minor improvement).

end

def self.decode(string)
if delim = string.rindex(DELIMITER)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could just replace this entire conditional with

rest, _, output = string.rpartition(DELIMITER)
output = output.chars
# and later loop over `rest.each_char`:
rest.each_char do |c|

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good, but correct code is:

output, _, rest = ...

@makenowjust
Copy link
Contributor Author

ping

@Sija
Copy link
Contributor

Sija commented Apr 9, 2018

Catching dust while waiting for merge... does anyone care?

@bcardiff bcardiff merged commit e4f637c into crystal-lang:master Apr 10, 2018
@bcardiff bcardiff added this to the Next milestone Apr 10, 2018
@bcardiff
Copy link
Member

Thanks @makenowjust

@makenowjust makenowjust deleted the feature/punycode branch April 10, 2018 23:21
@paulkass paulkass mentioned this pull request Jul 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.