Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Float parsing fails with LANG=cs_CZ.UTF-8 and setlocale(LC_ALL, "") #2638

Closed
eregon opened this issue Mar 28, 2024 · 7 comments · Fixed by #2657
Closed

Float parsing fails with LANG=cs_CZ.UTF-8 and setlocale(LC_ALL, "") #2638

eregon opened this issue Mar 28, 2024 · 7 comments · Fixed by #2657
Labels
bug Something isn't working

Comments

@eregon
Copy link
Member

eregon commented Mar 28, 2024

We had a report that truffleruby does not work with LANG=cs_CZ.UTF-8 (which uses , instead of . as the decimal separator):

$ LANG=en_US.UTF-8 ruby -e 'p 3.4'
3.4
$ LANG=cs_CZ.UTF-8 ruby -e 'p 3.4'
truffleruby: -e:1: could not parse the float '3.4' (SyntaxError)

It turns out this is caused by Prism not being able to parse Float literals when LANG=cs_CZ.UTF-8 is set, and setlocale(LC_ALL, "") has been called.
This is because Prism uses strtod(), which is locale-sensitive.

It seems CRuby does setlocale(LC_CTYPE, "") (in main.c) and not setlocale(LC_ALL, ""), which means it doesn't reproduce as-is on CRuby.

On TruffleRuby JVM, it seems HotSpot does setlocale(LC_ALL, ""), e.g. here.
On TruffleRuby Native, we do the setlocale(LC_ALL, "") ourselves to call nl_langinfo(CODESET). Maybe we could change that one to LC_CTYPE instead, but it wouldn't solve the JVM case.
And of course this would be a problem for any process which does setlocale(LC_ALL, "") before using Prism.

@eregon eregon changed the title Float prasing fails with LANG=cs_CZ.UTF-8 and setlocale(LC_ALL, "") Float parsing fails with LANG=cs_CZ.UTF-8 and setlocale(LC_ALL, "") Mar 28, 2024
@eregon
Copy link
Member Author

eregon commented Mar 28, 2024

I guess JRuby is also affected since it runs on HotSpot, cc @enebo

@eregon
Copy link
Member Author

eregon commented Mar 28, 2024

FWIW I found this way to find out what setlocale has been called with (on Linux, where LC_ALL is 6):

$ LANG=en_GB ruby -rffi -e 'module LibC; extend FFI::Library; ffi_lib "c"; attach_function :setlocale, [:int, :string], :string; end; p LibC.setlocale(6, nil)'

CRuby 3.2:
"LC_CTYPE=en_GB;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C"

TruffleRuby Native:
"en_GB"

TruffleRuby JVM:
"en_GB"

So I think that means for both cases on TruffleRuby all locale categories are set to en_GB, while on CRuby it's only LC_CTYPE, due to setlocale(LC_CTYPE, "") vs setlocale(LC_ALL, "") calls.

@eregon
Copy link
Member Author

eregon commented Mar 28, 2024

BTW it's not just cs_CZ but also for example fr_FR and I guess any locale using something else than . for the decimal separator:

LANG=fr_FR.UTF-8 ruby -e 'p 3.4'
truffleruby: -e:1: could not parse the float '3.4' (SyntaxError)

@eregon
Copy link
Member Author

eregon commented Mar 28, 2024

A workaround/solution seems to be to setlocale(LC_NUMERIC, "C") before calling Prism.

@eregon
Copy link
Member Author

eregon commented Mar 28, 2024

oracle/truffleruby#3513 is what I'm going to use to workaround for now in TruffleRuby.
It does:

setlocale(LC_ALL, "C");
setlocale(LC_CTYPE, "");

which I think should be the same as CRuby doing just setlocale(LC_CTYPE, "");, but also working regardless of other setlocale() calls before.

It is a workaround because this overwrites the locale for the whole process globally and might cause other things which rely on using the locale from the environment for all locale categories to fail.
It feels particularly suboptimal for the case TruffleRuby or JRuby is embedded in a Java program (i.e. using java and not ruby to start), where setting the locale there at runtime could very well break things.
So it would be great to find a solution in Prism if that's possible.

@kddnewton
Copy link
Collaborator

Dang, I figured since we had already validated it, it would work. I didn't realize it would disallow .. I wish I had access to strtod_l, but I doubt that is available to me. I might need to vendor an implementation of that if it's not present, or do something like grab the locale, switch it to LC_CTYPE, then switch it back after parsing. I'll see what can be done.

@kddnewton kddnewton added the bug Something isn't working label Mar 28, 2024
@kddnewton kddnewton added this to the CRuby unblocked milestone Mar 28, 2024
@eregon
Copy link
Member Author

eregon commented Mar 28, 2024

I wish I had access to strtod_l, but I doubt that is available to me.

FWIW it seems glibc has it on Linux (looking at nm). I don't know about other platforms.

https://stackoverflow.com/questions/41794607/locale-invariant-string-processing-with-strtod-strtof-atof-printf mentions uselocale() is POSIX and only affects the current thread (unlike setlocale which is global).
That sounds like the best solution for me, although not sure how that would work on Windows.

I might need to vendor an implementation of that if it's not present,

I think double parsing is rather complicated, needs some large integers operations (not sure which ones), etc, so it's probably not so nice to vendor.
https://github.com/ruby/ruby/blob/master/missing/dtoa.c is such a vendored implementation.
FWIW I recall a known security issue some years ago where float/double parsing would waste too much time to decide between 2 adjacent doubles, so there needs to be some limit there.

or do something like grab the locale, switch it to LC_CTYPE, then switch it back after parsing. I'll see what can be done.

Right, that should work well with uselocale() setting LC_NUMERIC to "C".

One last thought is if none of the above works well, maybe we could try on failure to replace the . by , and remember it, but maybe some locales use yet another decimal separator and it does feel a bit hacky obviously.

graalvmbot pushed a commit to oracle/truffleruby that referenced this issue Apr 5, 2024
* ruby/prism#2638 got fixed so we don't need a C locale for Prism anymore.
graalvmbot pushed a commit to oracle/truffleruby that referenced this issue Apr 5, 2024
* ruby/prism#2638 got fixed so we don't need a C locale for Prism anymore.
graalvmbot pushed a commit to oracle/truffleruby that referenced this issue Feb 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants