[windows] non-ASCII Unicode characters are garbled on Windows #29
Comments
First, thanks for pointing out this issue. Encoding is an important topic and we have to put some effort into it. Feel free to send a pull request; otherwise I'll try to look into it within the next few weeks.
Thanks for the fast reply. I know very little C, so I can't help you much with a pull request. Personally, I think that using
Here is some info about UTF-8 strings; I hope it will be useful:
Thanks for the very informative bug report. I've looked into the issue, and I think I know what's going on: we're converting the input to UTF-8 for storage, but not back to the current encoding when reading from the database. I can see two obvious solutions to this, each of which has some disadvantages:
Another question is how to handle DB identifiers and such (Clickhouse accepts arbitrary byte strings for string fields; but column names, enum levels etc. must apparently be valid UTF-8, though I can't seem to find any mention of that in the docs). Here, the first approach is likely the only feasible solution, though that again leads to the issue of query results containing characters which cannot be represented in the current encoding.

Generally, I have to say that, seeing what a mess these encoding issues are, I strongly recommend always using UTF-8 everywhere. I'm not saying we shouldn't try to fix the bug though, and I understand that this isn't always a feasible solution on Windows in particular. I'd personally prefer the second solution, I think, but as I said, that won't fix all issues either.
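The asymmetry described in this comment (strings converted to UTF-8 on write, but not declared or converted on read) can be reproduced in base R without a database. This is only a sketch under assumptions: the variable names are made up, CP1251 is taken from the bug report as the native encoding, and `iconv` is used merely to simulate a CP1251 session reading the unmarked bytes. It is not the package's actual code path.

```r
# Build a Cyrillic string from escapes so this script stays pure ASCII.
# Strings written with \u escapes are marked as UTF-8 by the R parser.
x <- "\u043f\u0440\u0438\u0432\u0435\u0442"

# Write side: the UTF-8 byte sequence is what ends up in storage.
stored <- charToRaw(enc2utf8(x))

# Read side without any handling: the same bytes, but with no declared
# encoding, so R will interpret them in the session's native encoding.
raw_back <- rawToChar(stored)
Encoding(raw_back)  # "unknown"

# Simulate a CP1251 session: treat the UTF-8 bytes as CP1251 text.
# Each of the 12 bytes becomes one unrelated character (mojibake).
garbled <- iconv(raw_back, from = "CP1251", to = "UTF-8")

# One possible repair: declare the bytes as UTF-8. No bytes are changed.
fixed <- raw_back
Encoding(fixed) <- "UTF-8"
identical(fixed, x)  # TRUE
```

The other direction would be to convert to the session encoding, for example with `enc2native(fixed)`. That keeps every string in the native encoding, but characters that do not exist in that encoding cannot be represented faithfully, which is the drawback mentioned above.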
If you want to choose the most flexible solution, you can add
UTF-8 is recommended by Clickhouse documentation:
All non-ASCII Unicode characters are garbled in dataframes obtained via SELECT queries. Data is stored in UTF-8 in Clickhouse, but native R encoding on my Windows machine is CP1251.
Is there a way to preserve character encoding in RClickhouse? For now, I can only correct it manually with a set_chr_encoding function. For example, the RMariaDB package works correctly with UTF-8 strings on Windows.
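A manual workaround of the kind mentioned here might look like the following base R sketch. The helper name `mark_utf8` is hypothetical; it is not the reporter's set_chr_encoding function, whose definition is not shown in this thread. It assumes that the character columns of the query result already hold valid UTF-8 bytes that simply lack the encoding mark.

```r
# Hypothetical helper: declare every character column of a data frame
# as UTF-8. Only the encoding mark changes; the bytes are left as-is.
# Factor columns are not handled here.
mark_utf8 <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(col) {
    Encoding(col) <- "UTF-8"
    col
  })
  df
}

# Usage: a stand-in for a query result with one unmarked UTF-8 string.
df <- data.frame(
  id = 1:2,
  name = c(rawToChar(as.raw(c(0xd0, 0xbf))), "abc"),
  stringsAsFactors = FALSE
)
df <- mark_utf8(df)
Encoding(df$name)  # "UTF-8" "unknown" (pure ASCII strings are never marked)
```

Marking instead of converting avoids losing characters that the native encoding cannot represent, at the cost of having to apply the helper to every result.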