URI encoding/decoding fixes #192

wch · 2019-02-05T20:53:35Z

This is a follow-up to #185, and it does what @hadley suggested: mark the encoding of characters in C++ instead of R.

It also ensures that vector inputs and NAs are handled properly.

hadley · 2019-02-05T22:44:33Z

src/httpuv.cpp

+    if (value[i] == NA_STRING) {
+      out[i] = NA_STRING;
+    } else {
+      const char* s = doEncodeURI(Rcpp::as<std::string>(value[i]), false).c_str();


Don't you have an encoding problem here too? What does doEncodeURI assume about the input? You might want a helper like this:

inline const char* string_utf8(SEXP x, int i) { return Rf_translateCharUTF8(STRING_ELT(x, i)); }

wch · 2019-02-06

Yes, that's true. Our current documentation for encodeURI says the following, but that is not generally useful behavior:

If conformant non-ASCII behavior is important, ensure that your input vector is UTF-8 encoded before calling encodeURI or encodeURIComponent.

This differs from the behavior of base::URLencode. I agree that we should change it.

    utf8_str <- "\ue5"  # "å", in UTF-8
    latin1_str <- iconv(utf8_str, "UTF-8", "latin1")
    utf8_str
    #> [1] "å"
    latin1_str
    #> [1] "å"

    # Look at raw bytes
    charToRaw(utf8_str)
    #> [1] c3 a5
    charToRaw(latin1_str)
    #> [1] e5

    # base::URLencode
    URLencode(utf8_str)
    #> [1] "%C3%A5"
    URLencode(latin1_str)
    #> [1] "%C3%A5"

    # httpuv::encodeURI
    httpuv::encodeURI(utf8_str)
    #> [1] "%C3%A5"
    httpuv::encodeURI(latin1_str)
    #> [1] "%E5"
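The difference in the comparison above comes down to which bytes reach the percent-encoder. The following is a minimal, self-contained sketch of that effect in plain C++, not httpuv's actual code: `latin1_to_utf8` and `percent_encode` are hypothetical helpers written for illustration, and `percent_encode` is deliberately simpler than encodeURI (it escapes every byte outside the RFC 3986 unreserved set).

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: convert ISO-8859-1 (latin1) bytes to UTF-8.
// Each latin1 byte maps to the Unicode code point with the same value,
// so bytes >= 0x80 always become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in) {
  std::string out;
  for (unsigned char c : in) {
    if (c < 0x80) {
      out += static_cast<char>(c);
    } else {
      out += static_cast<char>(0xC0 | (c >> 6));
      out += static_cast<char>(0x80 | (c & 0x3F));
    }
  }
  return out;
}

// Hypothetical helper: percent-encode every byte that is not an
// RFC 3986 unreserved character. It works on raw bytes and has no
// knowledge of the encoding those bytes are in.
std::string percent_encode(const std::string& in) {
  std::string out;
  char buf[4];
  for (unsigned char c : in) {
    bool unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                      (c >= '0' && c <= '9') || c == '-' || c == '_' ||
                      c == '.' || c == '~';
    if (unreserved) {
      out += static_cast<char>(c);
    } else {
      std::snprintf(buf, sizeof buf, "%%%02X", static_cast<unsigned>(c));
      out += buf;
    }
  }
  return out;
}
```

With these helpers, `percent_encode("\xE5")` yields `%E5`, while `percent_encode(latin1_to_utf8("\xE5"))` yields `%C3%A5`, the same result as encoding the UTF-8 bytes directly. Translating to UTF-8 before encoding is what makes the output independent of the input's original encoding.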

hadley · 2019-02-05T22:45:23Z

src/httpuv.cpp

-    *it = doEncodeURI(*it, false);
+  for (int i = 0; i < value.size(); i++) {
+    if (value[i] == NA_STRING) {
+      out[i] = NA_STRING;


You might want to see if the CharacterVector is already filled with NAs.

wch · 2019-02-06

It looks like CharacterVector(n) will return a vector filled with "", but CharacterVector(n, NA_STRING) returns a vector filled with NAs.
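The pre-populate-with-NA pattern discussed here can be sketched without Rcpp. In the illustration below, `std::optional<std::string>` is an assumed stand-in for an R string that may be NA, and `MaybeString`, `transform_one`, and `map_keep_na` are hypothetical names, not part of httpuv or Rcpp. It requires C++17.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Stand-in for an R string element: std::nullopt plays the role of NA_STRING.
using MaybeString = std::optional<std::string>;

// Hypothetical per-element transformation (a placeholder for URI encoding).
std::string transform_one(const std::string& s) {
  return s + "!";
}

// Start with an output that is all "NA", then overwrite only the
// elements whose input is present. Missing inputs need no else-branch.
std::vector<MaybeString> map_keep_na(const std::vector<MaybeString>& value) {
  std::vector<MaybeString> out(value.size(), std::nullopt);
  for (std::size_t i = 0; i < value.size(); i++) {
    if (value[i].has_value()) {
      out[i] = transform_one(*value[i]);
    }
  }
  return out;
}
```

Because the output starts out as all-missing, the loop body only has to handle the non-NA case, which mirrors the simplification from the `if (value[i] == NA_STRING) ... else ...` form to a single `if (value[i] != NA_STRING)` check.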

wch · 2019-02-06T18:01:29Z

src/httpuv.cpp

+    if (value[i] != NA_STRING) {
+      const char* s = Rf_translateCharUTF8(value[i]);
+      s = doEncodeURI(s, false).c_str();
+      out[i] = Rf_mkCharCE(s, CE_UTF8);


I don't think this needs a PROTECT, but if someone else could confirm, I'd appreciate it.

hadley · 2019-02-06

It should not, because you're immediately assigning into an object that Rcpp should be PROTECTing.

wch added 2 commits February 5, 2019 14:31

Mark UTF-8 encoding in C++ instead of R

e2828f8

Fix NA handling for encode/decode URI functions

b2b5e9a

wch changed the title ~~Fix uri~~ URI encoding/decoding fixes Feb 5, 2019

wch requested a review from jcheng5 February 5, 2019 20:53

jcheng5 approved these changes Feb 5, 2019

hadley reviewed Feb 5, 2019

wch added 3 commits February 6, 2019 10:51

Pre-populate vectors with NA

9e409e5

Convert to UTF-8 before URL-encoding

4e9f8ec

Update NEWS

41cce87

wch force-pushed the fix-uri branch from c89a9d0 to 41cce87 Compare February 6, 2019 17:58

wch commented Feb 6, 2019

Bump version to 1.4.5.9003

6a72dd9

wch merged commit e59dbe8 into master Feb 8, 2019

wch deleted the fix-uri branch February 8, 2019 19:12
