-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add IDNA support and integrate with DNS lookup #2543
Conversation
funny, yesterday doing the same https://github.com/kostya/simple_idn, even code is similar |
@kostya I guess that you use punycode.js implementation as reference, and so, I refered to it. |
dccb071
to
87efb3e
Compare
Formatted! |
src/punycode.cr
Outdated
end | ||
end | ||
|
||
raise Error.new "invalid input" unless init |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need a custom error class here? ArgumentError
seems fine to me.
I wonder how hard versions of this would be that directly read/write to IO, if possible at all. |
We could accept this without support for appending to an IO and later improve it. After applying the suggestions made by @jhass we could merge this. |
What about moving it under the URI namespace as URI::Punycode and |
@makenowjust Could you address the open comments? |
This is also a good PR, @makenowjust how about updating? |
@Sija Sorry I'm busy |
Changed condition: `'A' <= c && c <= 'z'` Into: `'A' <= c && c <= 'Z'`
1d77610
to
86ac0d7
Compare
Updated! Thank you @epergo! |
src/uri/punycode.cr
Outdated
end | ||
|
||
def self.encode(string : String) | ||
encode string.chars |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to use a Char::Reader instead of allocating an array of chars. I don't think there is a real usecase for supplying a char array, is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
src/uri/punycode.cr
Outdated
end | ||
|
||
def self.encode(string : String, io) | ||
encode string.chars, io |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
And it uses `each_char` instaed of `chars`, it is memory efficient.
src/uri/punycode.cr
Outdated
all = true | ||
others = [] of Char | ||
|
||
chars.each do |c| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line could just be string.each_char do |c|
- this inlines the block instead of using an iterator instance.
String.build { |io| encode string, io } | ||
end | ||
|
||
def self.encode(string, io) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a minor nit bit: methods receiving an optional IO
should have it prepended as first argument. This makes it easier if maybe additional arguments are added later to keep both signatures similar. Even if that's unlikely I'd recommend to stick with this strategy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
URI.encode
is also accept String
as first argument and IO
as second argument. I think no problem.
src/uri/punycode.cr
Outdated
end | ||
|
||
def self.encode(string, io) | ||
h = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if these variables could be more descriptive. What does h
mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't describe this correctly (because the date I wrote this is 2 years ago.) Sorry 🙇 . But h
and other variable names are traditional, they are used in RFC 3492's reference implementation.
src/uri/punycode.cr
Outdated
|
||
return if others.empty? | ||
others.sort! | ||
io << DELIMITER unless all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need all
at all. It can be replaced by h == 0
(note the next comment, then this needs to be after that line and h == 1
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right.
src/uri/punycode.cr
Outdated
firsttime = true | ||
prev = nil | ||
|
||
h += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could probably set the value to h = string.size - others.size + 1
instead of incrementing it in the loop (just a minor improvement).
src/uri/punycode.cr
Outdated
end | ||
|
||
def self.decode(string) | ||
if delim = string.rindex(DELIMITER) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could just replace this entire conditional with
rest, _, output = string.rpartition(DELIMITER)
output = output.chars
# and later loop over `rest.each_char`:
rest.each_char do |c|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good, but correct code is:
output, _, rest = ...
ping |
Catching dust while waiting for merge... does anyone care? |
Thanks @makenowjust |
Punycode is the encoding for IDNA, which is defined in RFC 3492. And IDNA means Internationalized Domain Names in Application defined in RFC 5980, which is method to use non-ascii characters as host name of URI. For example, when you open http://日本語ドメイン.com/ in browser, browser converts this URL to http://xn--eckwd4c7c5976acvb2w6i.com/ internally. On this,
日本語ドメイン.com
is IDN andxn--eckwd4c7c5976acvb2w6i.com
is IDNA. I added punycode support.RFC 3986 says:
so I added integration of DNS lookup. Of course, it is better that
getaddrinfo
supports to convert IDN to IDNA internally, and in factgetaddrinfo
supports such a thing since glibc 2.3.4 or Windows 8. However, this is extension and it is not portable, in other words it is not defined in POSIX (Especially, it is disabled by default on glibc). Finally, I think this is the best way to support IDNA.Thanks.