locale.getdefaultlocale() fails on Mac OS X with default language set to English #62578
On Mac OS X 10.8 with the default language set to English (System Preferences | Language and Text), the default terminal application sets the LC_CTYPE environment variable to "UTF-8". If you run Python from the terminal and try to use locale.getdefaultlocale(), you get the following error: > python
Python 2.7.2 (default, Oct 11 2012, 20:14:37)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getdefaultlocale()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/locale.py", line 496, in getdefaultlocale
return _parse_localename(localename)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/locale.py", line 428, in _parse_localename
raise ValueError, 'unknown locale: %s' % localename
ValueError: unknown locale: UTF-8 (The stack trace is from Python 2.7, but Python 3.3 suffers from the same problem.) There are numerous workarounds for this problem (turning off the "Set locale environment variables on startup" option in the terminal settings, adding "export LC_CTYPE=en_US.UTF8" to .bash_profile, or selecting a language other than English in the Language & Text settings), but these require additional configuration on the user's side. I think the more useful behavior is for Python to handle this system behavior and not crash, even though it doesn't strictly comply with the POSIX standard. The attached patch (against the current Python 3.4 master branch) is one possible fix. |
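The idea behind such a fix can be sketched as follows. This is a much-reduced, hypothetical stand-in for the private `_parse_localename` helper in locale.py, not the attached patch itself; the name `parse_localename_sketch` and the simplified parsing rules are assumptions made purely for illustration.

```python
def parse_localename_sketch(localename):
    """Split a locale name into (language, encoding).

    Hypothetical, simplified model of locale._parse_localename;
    the real function also consults the alias table.
    """
    if localename == "C":
        return None, None
    if localename == "UTF-8":
        # Codeset-only LC_CTYPE value set by the OS X terminal:
        # there is no language part, but the encoding is known.
        return None, "UTF-8"
    language, dot, encoding = localename.partition(".")
    if dot and language and encoding:
        return language, encoding
    raise ValueError("unknown locale: %s" % localename)
```

With this special case, a bare "UTF-8" yields a result with no language instead of raising `ValueError`.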
Strange, I have LANG=en_US.UTF-8 in my environment and no LC_CTYPE. A clean test account does have the same behavior as you are seeing. |
The UTF-8 value seems suspect to me, but it is actually supported by the system: changing it to a nonsense value results in failure in the C function setlocale. As for the patch: I'd add this workaround only on the OSX platform (that is, test for sys.platform == 'darwin' before checking for UTF-8 as a value). |
Judging from the results of Googling for the error message, I'm far from the only one seeing this problem. What exactly would be the benefit of adding the code to check for the platform? |
The test for darwin is needed because other platforms don't support "UTF-8" as a valid LC_CTYPE name, on a recent linux box: >>> locale.setlocale(locale.LC_CTYPE, "UTF-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/python2.7/lib/python2.7/locale.py", line 539, in setlocale
return _setlocale(category, locale)
locale.Error: unsupported locale setting (And just calling setlocale to check if the value is valid is not an option because that changes process-global state) |
Why exactly does this matter? UTF-8 not being a valid LC_CTYPE value simply means that no one running Linux will ever have LC_CTYPE set to UTF-8, and the branch will never be hit. OTOH, adding the check will make the code harder to test and simply larger (no code is always better than any non-zero amount of code). |
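The two positions in this exchange can be put side by side in a small sketch. The helper `accepts_utf8_locale` and its `darwin_only` switch are hypothetical names introduced here for illustration; neither appears in any of the patches.

```python
import sys


def accepts_utf8_locale(localename, platform=sys.platform, darwin_only=True):
    """Decide whether a bare "UTF-8" locale name should be accepted.

    darwin_only=True models the suggestion to guard the special case
    with sys.platform == 'darwin'; darwin_only=False models accepting
    the name on every platform.
    """
    if localename != "UTF-8":
        return False
    return platform == "darwin" or not darwin_only
```

Passing `platform` as a parameter keeps the guarded variant testable on any machine, which addresses part of the testability concern raised above.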
A related issue (with a patch that touches the same locale parsing code) is http://bugs.python.org/issue5815 |
Why do you need the "getdefaultlocale" function in the first place? I'd advise against using it, precisely because it can trigger problems like this one. |
I personally don't, but the function is used by Sphinx, which is what I was trying to get to work when I ran into this problem. |
Regardless of the resolution here, the use of getdefaultlocale could be reported as a bug on the Sphinx tracker. |
FWIW, I couldn't find any use of getdefaultlocale in any of the hg revisions (using hg grep) in https://bitbucket.org/birkenfeld/sphinx/ Instead, it's (probably) docutils, which has this code: locale_encoding = locale.getlocale()[1] or locale.getdefaultlocale()[1]
# locale.getpreferredencoding([do_setlocale=True|False])
# has side-effects | might return a wrong guess.
# (cf. Update 1 in http://stackoverflow.com/questions/4082645/using-python-2-xs-locale-module-to-format-numbers-and-currency) I find that quite unfortunate, since locale.getpreferredencoding() would have done the right thing (IMO). |
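A more defensive variant of that lookup might look like the sketch below. It is not the docutils code; the function name `guess_locale_encoding` is hypothetical. It assumes that falling back to `locale.getpreferredencoding(False)` is acceptable; passing `False` avoids the `setlocale()` side effect that the quoted comment worries about.

```python
import locale


def guess_locale_encoding():
    """Return an encoding name without calling getdefaultlocale()."""
    try:
        # getlocale() parses the current locale name and can raise
        # ValueError for names that locale.py does not recognise.
        encoding = locale.getlocale()[1]
    except ValueError:
        encoding = None
    # do_setlocale=False: do not touch the process-wide locale.
    return encoding or locale.getpreferredencoding(False)
```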
I just ran into this problem myself. On fresh installs of OSX 10.9 LC_CTYPE is set to "UTF-8" (at least for English language users), and now sphinx won't work :-( Is Dmitry's patch acceptable (either as is, or with my suggestion of checking for sys.platform == "darwin")? |
Ronald or Dmitry, can you elaborate under what conditions you start your login shell on 10.9? I cannot reproduce the behavior you observe. With 10.9 Terminal.app and the default language settings in System Preferences and with the default Terminal.app preferences, specifically Settings -> (Profile) -> Advanced -> Character encoding -> Unicode (UTF-8) and "Set LANG environment variable on startup" checked, login sessions have LANG=en_US.UTF-8 defined and LC_CTYPE is not defined at all. Are you sure that isn't being created by a shell profile somewhere? (I can't check earlier OS X releases at the moment.) That said, I agree that, if OS X accepts "UTF-8" as a valid locale, the locale module should, too. |
I didn't get this on my previous system (which was basically a 10.4 system updated through 10.5, 10.7, ..., to 10.9), but did get it on my current system, which has a fresh 10.9 install where I did not use the Migration Assistant to migrate settings. Thus for me to get the behavior with LC_CTYPE:
I have not tried to reproduce this in a VM. BTW, I have the same system settings as you. |
With the following C code: #include <locale.h>
#include <stdio.h>
int main(void)
{
char* res = setlocale(LC_CTYPE, "UTF-8");
printf("Result: %s\n", res);
res = setlocale(LC_CTYPE, "UTF-9");
printf("Result: %s\n", res);
return 0;
}
/* EOF */ I get the following output: Result: UTF-8 That is, UTF-8 is a valid locale for LC_CTYPE, and as expected some other string isn't. BTW. "UTF-8" is only a valid locale for LC_CTYPE, not for other categories (when you change LC_CTYPE to LC_ALL both calls return NULL). |
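A comparable check can be made from Python without disturbing the current process, since, as noted earlier in this thread, calling setlocale changes process-global state. The sketch below runs the probe in a child interpreter; the helper name `locale_name_is_valid` is hypothetical, and whether a given name is accepted depends on the platform, so no expected output is shown.

```python
import subprocess
import sys

# Code executed in the child interpreter: exit status 0 if the name is
# accepted for LC_CTYPE, 1 otherwise.
_PROBE = (
    "import locale, sys\n"
    "try:\n"
    "    locale.setlocale(locale.LC_CTYPE, sys.argv[1])\n"
    "except locale.Error:\n"
    "    sys.exit(1)\n"
)


def locale_name_is_valid(name):
    """Return True if setlocale(LC_CTYPE, name) succeeds in a child process."""
    result = subprocess.run([sys.executable, "-c", _PROBE, name])
    return result.returncode == 0


if __name__ == "__main__":
    for candidate in ("UTF-8", "UTF-9"):
        print(candidate, locale_name_is_valid(candidate))
```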
That is seriously broken on Apple's part. But I guess we have no choice but to emulate their bug. |
I've looked at this a bit, primarily on OS X 10.9 Mavericks, although I expect mostly similar behavior on older recent releases of OS X. On 10.9, the setting of locale variables is done by whatever program is used to launch a shell. I looked at the behavior of the built-in Terminal.app, the third-party iTerm2.app, the MacPorts distribution of xterm, and the built-in sshd. By default, the latter two do not set any locale env variables. Both Terminal.app and iTerm2.app set either LANG or LC_CTYPE based on the user's settings for "Region" and "Preferred Language" in the "System Preferences" -> "Language & Region" control panel. Three examples:
So it is almost certainly the last case that is under discussion here. Whether or not that is a bug is not as clear as it might seem at first. BSD implementations of locale differ from the GNU Linux version. Both FreeBSD and OS X define a "UTF-8" locale that has only one locale category defined in it: LC_CTYPE. It appears to be a fallback locale used when there is no applicable region / language combination, in this case no "en_DE*" locales. $ ls /usr/share/locale/UTF*
LC_CTYPE Compare with the en_US* locales: $ ls /usr/share/locale/en_US*
/usr/share/locale/en_US:
LC_COLLATE LC_CTYPE LC_MESSAGES LC_MONETARY LC_NUMERIC LC_TIME /usr/share/locale/en_US.ISO8859-1: /usr/share/locale/en_US.ISO8859-15: /usr/share/locale/en_US.US-ASCII: /usr/share/locale/en_US.UTF-8: Now as I read the current POSIX standard, there is nothing wrong with this. AFAICT, the standard places no restriction on the format of locale names, in particular, it does not mandate that they conform to RFC 1766 or its successors. Further, the standard provides for implementation-specific locales (other than the mandatory "POSIX" aka "C" locale) and some platforms provide tools to create custom locales, e.g. mklocale(1) on FreeBSD and OS X, localedef(1) on GNU Linux. So I wonder if the locale module should really be imposing its own restrictions on locale names as it does currently. From IEEE Std 1003.1, 2013 Edition: There is a further complication for OS X. Apple provides a richer native API for locales, CFLocale (and its Cocoa equivalent, NSLocale). So some nuances may get lost in the imperfect mapping between CFLocale and the conventional LC_* environment variables and between them and Python. We could look at trying to support the native APIs as well. http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07 |
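The `ls` comparison above can also be scripted. The following sketch lists which category entries a locale directory contains; the helper name `locale_categories` is hypothetical, and the default root `/usr/share/locale` is taken from the listings above (other systems lay this directory out differently, so results will vary).

```python
import os


def locale_categories(name, root="/usr/share/locale"):
    """Return the sorted LC_* entries found in root/name, or [] if absent."""
    path = os.path.join(root, name)
    try:
        entries = os.listdir(path)
    except OSError:
        # The locale directory does not exist or is not readable.
        return []
    return sorted(entry for entry in entries if entry.startswith("LC_"))


if __name__ == "__main__":
    for locale_name in ("UTF-8", "en_US.UTF-8"):
        print(locale_name, locale_categories(locale_name))
```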
Mac OS X uses the __CF_USER_TEXT_ENCODING env var to set up the locale for native libraries. I found that for GUI python code I needed to convert the value in __CF_USER_TEXT_ENCODING into a suitable call to setlocale(). The code I use is attached to bpo-23797. |
As an aside to 2), CoreFoundation and any other Apple "Cocoa" frameworks should be assumed to use threads, and hence the comment about threads in the fork specification (link below) applies. Currently Apple doesn't appear to use pthread_atfork to make sure library state is valid in child processes after fork. <http://pubs.opengroup.org/onlinepubs/009695399/functions/fork.html> |
Dmitry's patch looks good; I added my patch before checking whether there already was a patch. The only thing that might cause discussion is when to accept 'UTF-8' as a valid locale name. My patch only accepts it on OSX, while Dmitry's patch accepts it everywhere. Writing this, I'm slightly in favour of Dmitry's approach: I quite often run into problems when using SSH to log in to a Linux box from my OSX laptop (with LC_CTYPE=UTF-8). Almost everything works correctly, except for Python code that uses the locale module (which craps out with the exception in the first message in this issue). IMHO Dmitry's patch should be applied as is. |
ping... I think the current behavior is a bug in Python and should be fixed in 2.7, 3.4, 3.5 and default (using Dmitry's patch). I'd like to commit the patch, but would like someone else's review of the patch before doing so. |
Needed tests. With the patch: $ LC_CTYPE=UTF-8 ./python
>>> import locale
>>> locale.getdefaultlocale()
(None, 'UTF-8')
>>> locale.getpreferredencoding()
'ANSI_X3.4-1968'
>>> locale.getlocale()
(None, None)
$ LC_CTYPE=en_US_UTF-8 ./python
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')
>>> locale.getpreferredencoding()
'UTF-8'
>>> locale.getlocale()
('en_US', 'UTF-8') I think getpreferredencoding() and getlocale() should return the UTF-8 encoding. |
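Because these results depend on the operating system and Python version, a comparison like the one above is easiest to repeat with a small script. The sketch below runs the three calls in a child interpreter with LC_CTYPE set to a chosen value; the helper name `locale_report` is hypothetical, and LC_ALL is removed from the child environment on the assumption that it would otherwise override LC_CTYPE.

```python
import json
import os
import subprocess
import sys

# Code run in the child: record each function's result, or the exception
# it raised, so that one failing call does not hide the others.
_CHILD = r"""
import json, locale
out = {}
for name in ("getdefaultlocale", "getlocale", "getpreferredencoding"):
    try:
        out[name] = repr(getattr(locale, name)())
    except Exception as exc:
        out[name] = "raised %s: %s" % (type(exc).__name__, exc)
print(json.dumps(out))
"""


def locale_report(lc_ctype):
    """Return the three locale results as strings for the given LC_CTYPE."""
    env = dict(os.environ)
    env.pop("LC_ALL", None)  # LC_ALL would take precedence over LC_CTYPE
    env["LC_CTYPE"] = lc_ctype
    proc = subprocess.run(
        [sys.executable, "-c", _CHILD],
        env=env, capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)


if __name__ == "__main__":
    for value in ("UTF-8", "en_US.UTF-8"):
        print(value, locale_report(value))
```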
Perhaps the better way to solve this issue is to use the aliases table. What is the LC_CTYPE environment variable set to when the default language is set to non-English? How do different native Mac OS X command-line programs behave when LC_CTYPE is set to another encoding (e.g. ASCII, US-ASCII, ISO8859-1, ISO-8859-1, Latin1)? What if it is set to UTF8 (no minus) or utf-8 (lower case)? |
The only locale that doesn't include language information is the UTF-8 one; there is no locale named "US-ASCII". See /usr/share/locale on an OSX system. PS. The more I look at locale.py the more problems I find with it. The code makes unwarranted assumptions about locales that aren't actually true on all systems. For example: >>> locale.normalize('ja_JP')
'ja_JP.eucJP' That's not true on OSX: /usr/share/locale/ja_JP/LC_CTYPE is a symlink to /usr/share/locale/UTF-8/LC_CTYPE. AFAIK *all* locales on OSX use UTF-8. |
The alias mechanism cannot be used because LC_CTYPE=UTF-8 as the locale doesn't imply anything about languages. In Linux terms it is more or less equal to "C.UTF-8" or "POSIX.UTF-8", except that those two aren't valid locales on OSX. |
Testing this is interesting to say the least due to the dynamic way the module interface is built. Serhiy: are you testing on a Linux machine? On my machine getpreferredencoding() returns 'UTF-8' because it hits the CODESET path (which ends up calling |
I've attached a patch with more tests, but I'm not too happy about the new test because it is too much of a white box test and is therefore fairly fragile w.r.t. the actual implementation of the module. |
Yes, I was testing on a Linux machine and forgot that the results are OS-dependent. I agree that the test should depend less on implementation details. Since _locale._getdefaultlocale is defined only on Windows and "UTF-8" is not a valid locale on Windows, I think there is no need to patch _locale for testing. But getlocale() and getpreferredencoding() should be consistent with getdefaultlocale() (and getlocale() is yet another way to test the private function _parse_localename()). setlocale() should work with the result of getlocale() and getdefaultlocale(). Do the following tests pass on OSX? |
ping? Just ran into this issue on OS X El Capitan with Region set to Germany and Language to English. Just as Ned pointed out 2 years ago, this results in LC_CTYPE set to 'UTF-8' in the terminal and docutils still can't cope with it. |
Could someone provide a patch for Python 3.5? |
OSX Sierra + Python, the bug still exists. subscribing |
To me this issue seems quite related to PEP-538. Maybe the LC_CTYPE coercion proposed in the PEP could be extended to cover the case of LC_CTYPE=UTF-8? |
PEP-538 wouldn't help here, as there's nothing wrong with CPython's assumptions about the text encoding to use for operating system interfaces - it's assuming UTF-8 (because it's Mac OS X) and that assumption is correct (because it's Mac OS X). The problem appears to be that locale.py was written primarily for Linux, and hence makes assumptions that aren't valid on BSD and Mac OS X. Dmitry's suggested solution of taking the BSD/Mac OS X specific locale of "UTF-8" and universally accepting it as meaning (None, "UTF-8") seems like a sensible step forward, even if it doesn't resolve all the discrepancies. Where PEP-538 and PEP-540 would come into play is when this setting gets forwarded over SSH to Linux servers (as then CPython *will* get the nominal system text encoding wrong), but that's independent of getting the locale module to handle it more gracefully. |
I think PEP-538 extended to the UTF-8 locale *would* help here. Specifically, it would coerce only LC_CTYPE to en_US.UTF-8 (unless OS X has C.UTF-8), which I guess is good enough for the purpose here. I do agree that it is not the kind of problem that PEP-538 tries to solve right now, but it could be extended to cover other types of problematic locales like this one. Just wanted to make you aware of this possibility. |
I think Ronald's patch bpo-18378-2015-07-25-py36.txt with added darwin check would be the best way forward. In the current form, it would allow using 'UTF-8' as locale string on all platforms - which is not such a good idea. |
SSH environment forwarding will propagate this "LC_CTYPE=UTF-8" setting from Mac OS X clients to Linux servers. At present, that breaks in multiple ways, as CPython will interpret it as being the "C" locale (since Linux servers don't offer a "UTF-8" locale, even when they do offer "C.UTF-8") PEPs 538 and 540 aim to help CPython itself to deal with that case, but that won't be sufficient to help code that tries to pass the nominal LC_CTYPE setting to the locale module. Accepting "UTF-8" and interpreting it as functionally equivalent to C.UTF-8 will mean that this setting will at least work as desired on servers that offer C.UTF-8. |
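Treating a forwarded "UTF-8" as functionally equivalent to C.UTF-8 could be sketched as an environment rewrite like the one below. This is only an illustration of the idea, not what CPython, PEP 538, or PEP 540 actually do; the helper name `coerce_forwarded_ctype` is hypothetical, and it assumes the target server offers a "C.UTF-8" locale.

```python
def coerce_forwarded_ctype(environ, platform):
    """Return a copy of environ with a bare LC_CTYPE=UTF-8 mapped to C.UTF-8.

    The value is left untouched on darwin, where "UTF-8" is a real
    LC_CTYPE locale. Assumes the server provides "C.UTF-8".
    """
    env = dict(environ)
    if platform != "darwin" and env.get("LC_CTYPE") == "UTF-8":
        env["LC_CTYPE"] = "C.UTF-8"
    return env
```

The function works on a plain mapping rather than `os.environ` directly, so the rewrite can be checked without modifying the running process.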
On 13.01.2017 04:47, Nick Coghlan wrote:
I don't think that's within the scope of this patch. "UTF-8" is not Please also note that SSH does not forward arbitrary env vars. Aside: While looking into this I found that the locale module |
That alias (C.UTF-8 to en_US.UTF-8) is surely a bug in itself nowadays. I've filed bpo-30755 . |
I still have this issue on MacOS Mojave 10.14 Python 3.7.2 (default, Dec 27 2018, 07:35:06)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getdefaultlocale()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/locale.py", line 568, in getdefaultlocale
return _parse_localename(localename)
File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/locale.py", line 495, in _parse_localename
raise ValueError('unknown locale: %s' % localename)
ValueError: unknown locale: UTF-8
>>>
$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL= |
LC_CTYPE=UTF-8 is a valid configuration on macOS, and is in the default environment when you install a fresh system. This includes the betas for macOS 10.15 and is therefore unlikely to change anytime soon. Interestingly enough I get this error even when I unset the relevant environment variables. For some reason LC_CTYPE is reset when I start the interpreter, even if it is set to something else. This means the usual way of working around this problem no longer works. I'll create a pull request with an up-to-date version of my latest patch for further discussion. BTW. I'm testing with the current tip of the tree, but 3.7.3 fails in the same way. |
As promised there is now a pull request. I'd love a review (and a chance to approve the pull request when reviewers are happy, I'm trying to get back into actively contributing). --- I now understand why locale.getdefaultlocale() fails even when LC_CTYPE is not set: pylifecycle sets LC_CTYPE to UTF-8 in the UTF-8 coercion code. |
Ronald's PR 14738 LGTM. I merged it to master and backported for 3.8.0b4 and 3.7.5. Thanks, everyone! |