Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emoji character in domains not recognized correctly (punycode) #231

Open
redsk opened this issue May 4, 2020 · 8 comments
Open

Emoji character in domains not recognized correctly (punycode) #231

redsk opened this issue May 4, 2020 · 8 comments
Assignees
Labels

Comments

@redsk
Copy link

redsk commented May 4, 2020

val u = "https://i❤.ws/"
url(u).url

leads to

java.net.URISyntaxException: Expected authority at index 7: http://
  java.net.URI$Parser.fail(URI.java:2848)
  java.net.URI$Parser.failExpecting(URI.java:2854)
  java.net.URI$Parser.parseHierarchical(URI.java:3102)
  java.net.URI$Parser.parse(URI.java:3053)
  java.net.URI.<init>(URI.java:588)
  dispatch.RawUri.toUri(uri.scala:23)
  dispatch.RawUri.toString(uri.scala:24)
  dispatch.url$.$anonfun$apply$3(requests.scala:102)
  dispatch.Req.requestBuilder$1(requests.scala:48)
  dispatch.Req.toRequestBuilder(requests.scala:53)
  dispatch.Req.toRequest(requests.scala:61)
  dispatch.UrlVerbs.url(requests.scala:125)
  dispatch.UrlVerbs.url$(requests.scala:125)
  dispatch.Req.url(requests.scala:17)

Its punycode [1] conversion [2] (https://xn--i-7iq.ws/) works as expected.

https://i❤.ws/ points to a domain registrar and is SFW.

"https://i❤.ws/" visualization when not between backticks.

Opening a new issue because I cannot reopen an issue closed by a collaborator.

[1] https://en.wikipedia.org/wiki/Punycode
[2] used converter: https://www.punycoder.com/

@redsk
Copy link
Author

redsk commented May 4, 2020

In case you're wondering, java.net.IDN.toASCII(u) fails with

java.lang.IllegalArgumentException: java.text.ParseException: An unassigned code point was found in the input

as it only supports IDNA2003. For emoji IDNA2008 is needed and the icu4j library [1] supports it:

import com.ibm.icu.text.IDNA
val uts46 = IDNA.getUTS46Instance(IDNA.DEFAULT)

val u = "i❤.ws/"
val punycodedDomain = uts46.nameToASCII(u, new java.lang.StringBuilder(), new IDNA.Info()).toString
// punycodedDomain == "xn--i-7iq.ws/"

Apparently, nameToASCII works with domain names and paths but not with protocol, that has to be stripped off before the conversion.

[1] https://mvnrepository.com/artifact/com.ibm.icu/icu4j

@farmdawgnation farmdawgnation self-assigned this May 5, 2020
@redsk
Copy link
Author

redsk commented May 6, 2020

I've given a more thorough explanation of the same issue here along with a PR. Maybe you'd like to take the same approach.

@farmdawgnation
Copy link
Member

icu4j is using the ICU license. This seems to be a tweaked version of the X11 License which is compatible with our own LGPL.

@redsk
Copy link
Author

redsk commented May 25, 2020

Yeah as you wrote, they use the ICU license which is deemed compatible with GPL. So it should be ok, no?

@farmdawgnation
Copy link
Member

Yeah, that's good. I was just making a note here as I worked through figuring out how I want to integrate it. No action needed.

@farmdawgnation
Copy link
Member

Ok, this is a lot more complex than I originally thought looking at the bug report. TL;DR there's no good way for us to support this type of encoding from the url helper because we lean heavily on the URL parsing provided by the JVM by default. That, in turn, is based on a regex that doesn't properly support this type of encoding.

The good news is that there's an existing API which will handle this correctly: host:

@ host("i❤.ws").url
res38: String = "http://xn--i-7iq.ws/"

This API works a bit differently because it doesn't accept a full URL, but that's the same thing that makes the IDN conversion work as expected. For example, to get https you would use:

@ host("i❤.ws").secure.url
res40: String = "https://xn--i-7iq.ws/"

In order to support this from the url API we'd need to re-implement breaking up URLs into their component parts for parsing. I'm willing to investigate that, but it's a much larger project because we've got to be sure we don't incidentally break something else.

Does this unblock your use case for the time being?

@redsk
Copy link
Author

redsk commented May 27, 2020

Well, if I understand correctly, the host API would work only for hosts and not for full URLs which of course would be rather inconvenient as the parsing would be needed to be done by the caller.

My use case is not blocked as I use a different library to detect URLs in strings and the library can do URL normalization, including using ICU (I did the PR), so I simply pass dispatch the normalized URL.

I created this issue because I believe that other users of Dispatch might have the same problem and because I think this library should be able to handle URL normalization properly (not only emojis, also other characters, as illustrated here), which the current url API does not.

@farmdawgnation
Copy link
Member

Well, if I understand correctly, the host API would work only for hosts and not for full URLs which of course would be rather inconvenient as the parsing would be needed to be done by the caller.

How inconvenient this is largely depends on the application.

I created this issue because I believe that other users of Dispatch might have the same problem and because I think this library should be able to handle URL normalization properly (not only emojis, also other characters, as illustrated here), which the current url API does not.

Yep, I hear you. I'm not closing the issue, just pointing out that this isn't going to be a quick, drop-in fix like I originally thought. This will take some time to get right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

2 participants