_localemodule.c: str2uni() with different LC_NUMERIC and LC_CTYPE #51691
Comments
Hi, the following works in 2.7 but not in 3.x:
>>> import locale
>>> from decimal import *
>>> locale.setlocale(locale.LC_NUMERIC, 'fi_FI')
'fi_FI'
>>> format(Decimal('1000'), 'n')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.2/decimal.py", line 3632, in __format__
spec = _parse_format_specifier(specifier, _localeconv=_localeconv)
File "/usr/lib/python3.2/decimal.py", line 5628, in
_parse_format_specifier
_localeconv = _locale.localeconv()
File "/usr/lib/python3.2/locale.py", line 111, in localeconv
d = _localeconv()
ValueError: Cannot convert byte to string |
This fails in _localemodule.c: str2uni(). mbstowcs(NULL, s, 0) is If I set LC_CTYPE and LC_NUMERIC together, things work. This raises the question: If LC_CTYPE and LC_NUMERIC differ (and a) call setlocale(LC_CTYPE, setlocale(LC_NUMERIC, NULL)) before b) use some kind of _mbstowcs_l |
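Option (a) above can be sketched at the Python level. This is only an illustration under stated assumptions, not the patch discussed in this issue: the helper name `localeconv_with_matching_ctype` is hypothetical, and since setlocale() is process-wide the approach is not safe if other threads rely on LC_CTYPE while it runs.

```python
import locale


def localeconv_with_matching_ctype():
    """Call locale.localeconv() with LC_CTYPE temporarily set to LC_NUMERIC.

    Hypothetical helper mirroring option (a): align LC_CTYPE with
    LC_NUMERIC for the duration of the call, then restore it.
    """
    saved_ctype = locale.setlocale(locale.LC_CTYPE)   # query only, no change
    numeric = locale.setlocale(locale.LC_NUMERIC)     # query only, no change
    locale.setlocale(locale.LC_CTYPE, numeric)
    try:
        return locale.localeconv()
    finally:
        # Always restore the caller's LC_CTYPE, even if localeconv() raised.
        locale.setlocale(locale.LC_CTYPE, saved_ctype)
```

The sketch only aligns LC_CTYPE with LC_NUMERIC; monetary fields governed by LC_MONETARY could presumably still hit the same mismatch.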
I'm failing to reproduce this (with py3k) on OS X:
Python 3.2a0 (py3k:76866:76867, Dec 17 2009, 09:19:26)
[GCC 4.2.1 (Apple Inc. build 5646) (dot 1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> from decimal import *
>>> locale.setlocale(locale.LC_NUMERIC, 'fi_FI')
'fi_FI'
>>> format(Decimal('1000'), 'n')
'1.000'
The locale command, from the same Terminal prompt, gives me:
LANG="en_IE.UTF-8"
LC_COLLATE="en_IE.UTF-8"
LC_CTYPE="en_IE.UTF-8"
LC_MESSAGES="en_IE.UTF-8"
LC_MONETARY="en_IE.UTF-8"
LC_NUMERIC="en_IE.UTF-8"
LC_TIME="en_IE.UTF-8"
LC_ALL=
Just to be clear, is it true that you still get the same result without
>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, 'fi_FI')
'fi_FI'
>>> locale.localeconv() also gives you that ValueError? |
What are the multibyte strings that mbstowcs is failing to convert? |
I can reproduce it on a Fedora (fc6) Linux box. It's not a decimal
>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, 'fi_FI')
'fi_FI'
>>> locale.localeconv()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/root/python/py3k/Lib/locale.py", line 111, in localeconv
d = _localeconv()
ValueError: Cannot convert byte to string
>>> Here's the contents of the struct lconv as returned by localeconv(): ((gdb) p *l The problem is thousands_sep: |
Yes, it's a problem in _localemodule.c. This situation always static PyObject*mbstowcs
str2uni(const char* s)
{
#ifdef HAVE_BROKEN_MBSTOWCS
size_t needed = strlen(s);
#else
size_t needed = mbstowcs(NULL, s, 0);
#endif
I can't see a portable way to fix this except: block threads I don't think this issue is important enough to do that. What |
Changed title (was: decimal.py: format failure with locale specifier) |
I have a patch that fixes this specific issue. Probably there are similar I suspect that this is related: |
Could we have a patch review please. Also note that bpo-5905 has been closed. |
Is this the same as when the tests with python-3.3.2 fail for me on RHEL-6?
test_locale (test.test_format.FormatTest) ... ERROR
======================================================================
Traceback (most recent call last):
File "/home/matej/build/Extras/python3/Python-3.3.2/Lib/test/test_format.py", line 295, in test_locale
localeconv = locale.localeconv()
File "/home/matej/build/Extras/python3/Python-3.3.2/Lib/locale.py", line 111, in localeconv
d = _localeconv()
UnicodeDecodeError: 'locale' codec can't decode byte 0xe2 in position 0: Invalid or incomplete multibyte or wide character |
Matej Cepl <report@bugs.python.org> wrote:
If LC_CTYPE is UTF-8 and LC_NUMERIC something like ISO-8859-2 then it's |
Hmm, so with this patch
diff -up Python-3.3.2/Lib/test/test_format.py.fixFormatTest Python-3.3.2/Lib/test/test_format.py
--- Python-3.3.2/Lib/test/test_format.py.fixFormatTest 2013-10-22 10:05:12.253426746 +0200
+++ Python-3.3.2/Lib/test/test_format.py 2013-10-22 10:16:58.510530570 +0200
@@ -288,7 +288,7 @@ class FormatTest(unittest.TestCase):
def test_locale(self):
try:
oldloc = locale.setlocale(locale.LC_ALL)
- locale.setlocale(locale.LC_ALL, '')
+ locale.setlocale(locale.LC_ALL, 'ps_AF')
except locale.Error as err:
self.skipTest("Cannot set locale: {}".format(err))
try:
(or any other explicit locale; I have also tried en_IE) the test doesn't fail. Using Python-3.3.2 on RHEL-6 (kernel 2.6.32-358.23.2.el6.i686). |
To bpo-200912: now, system locale is UTF-8 all the way: |
Matej Cepl <report@bugs.python.org> wrote:
The test passes here with these values (Debian). What is the output of: a) locale.localeconv() b) locale.setlocale(locale.LC_ALL, '') |
This issue is very close to the issue bpo-13706 which I solved with the new function PyUnicode_DecodeLocale(): see get_locale_info() in Python/formatter_unicode.c. We might copy/paste the code, or we should maybe add a private API to get locale information: get_locale_info() => _Py_get_locale_info() and expose the LocaleInfo structure. It may be added to unicodeobject.h for example, there is already a function related to locales: _PyUnicode_InsertThousandsGrouping(). |
>>> import locale
>>> locale.localeconv()
{'p_cs_precedes': 127, 'n_sep_by_space': 127, 'n_sign_posn': 127, 'n_cs_precedes': 127, 'grouping': [], 'positive_sign': '', 'mon_grouping': [], 'p_sep_by_space': 127, 'mon_thousands_sep': '', 'currency_symbol': '', 'mon_decimal_point': '', 'int_curr_symbol': '', 'thousands_sep': '', 'frac_digits': 127, 'int_frac_digits': 127, 'negative_sign': '', 'decimal_point': '.', 'p_sign_posn': 127}
>>> locale.setlocale(locale.LC_ALL, '')
'LC_CTYPE=en_US.utf8;LC_NUMERIC=en_IE.utf8;LC_TIME=en_IE.utf8;LC_COLLATE=en_US.utf8;LC_MONETARY=en_IE.utf8;LC_MESSAGES=en_US.utf8;LC_PAPER=en_IE.utf8;LC_NAME=en_US.utf8;LC_ADDRESS=en_US.utf8;LC_TELEPHONE=en_US.utf8;LC_MEASUREMENT=en_IE.utf8;LC_IDENTIFICATION=en_US.utf8'
>>> locale.localeconv()
{'p_cs_precedes': 1, 'n_sep_by_space': 0, 'n_sign_posn': 1, 'n_cs_precedes': 1, 'grouping': [3, 3, 0], 'positive_sign': '', 'mon_grouping': [3, 3, 0], 'p_sep_by_space': 0, 'mon_thousands_sep': ',', 'currency_symbol': '€', 'mon_decimal_point': '.', 'int_curr_symbol': 'EUR ', 'thousands_sep': ',', 'frac_digits': 2, 'int_frac_digits': 2, 'negative_sign': '-', 'decimal_point': '.', 'p_sign_posn': 1}
>>> |
Perhaps version of glibc might be interesting as well? glibc-2.12-1.107.el6_4.5.i686 |
Matej Cepl <report@bugs.python.org> wrote:
> >>> import locale
> >>> locale.localeconv()
> {'p_cs_precedes': 127, 'n_sep_by_space': 127, 'n_sign_posn': 127, 'n_cs_precedes': 127, 'grouping': [], 'positive_sign': '', 'mon_grouping': [], 'p_sep_by_space': 127, 'mon_thousands_sep': '', 'currency_symbol': '', 'mon_decimal_point': '', 'int_curr_symbol': '', 'thousands_sep': '', 'frac_digits': 127, 'int_frac_digits': 127, 'negative_sign': '', 'decimal_point': '.', 'p_sign_posn': 127}
> >>> locale.setlocale(locale.LC_ALL, '')
> 'LC_CTYPE=en_US.utf8;LC_NUMERIC=en_IE.utf8;LC_TIME=en_IE.utf8;LC_COLLATE=en_US.utf8;LC_MONETARY=en_IE.utf8;LC_MESSAGES=en_US.utf8;LC_PAPER=en_IE.utf8;LC_NAME=en_US.utf8;LC_ADDRESS=en_US.utf8;LC_TELEPHONE=en_US.utf8;LC_MEASUREMENT=en_IE.utf8;LC_IDENTIFICATION=en_US.utf8'
> >>> locale.localeconv()
> {'p_cs_precedes': 1, 'n_sep_by_space': 0, 'n_sign_posn': 1, 'n_cs_precedes': 1, 'grouping': [3, 3, 0], 'positive_sign': '', 'mon_grouping': [3, 3, 0], 'p_sep_by_space': 0, 'mon_thousands_sep': ',', 'currency_symbol': '€', 'mon_decimal_point': '.', 'int_curr_symbol': 'EUR ', 'thousands_sep': ',', 'frac_digits': 2, 'int_frac_digits': 2, 'negative_sign': '-', 'decimal_point': '.', 'p_sign_posn': 1} These look normal. I'm puzzled, because that's what is going on in the test ./python -m test test_format If this passes, there might be some interaction with other tests. If it doesn't pass, I would step through the test in gdb (break PyLocale_localeconv) |
Title: _localemodule.c: str2uni() with different LC_NUMERIC and LC_CTYPE
Oh, I just realized that the issue is an LC_NUMERIC using an encoding A with an LC_CTYPE using an encoding B. It looks like glibc does not support this setup, at least for the fi_FI locale, which has a non-ASCII thousands separator (non-breaking space: U+00A0). Try the attached inconsistent_locale_encodings.py script (it uses locale names for Fedora 19; you may have to adapt it to your OS). Output on Fedora 19:
fi_FI numeric (ISO-8859-1) with fr_FR.utf8 ctype (UTF-8)
fi_FI@euro numeric (ISO-8859-15) with fr_FR.utf8 ctype (UTF-8)
fi_FI.iso88591 numeric (ISO-8859-1) with fr_FR.utf8 ctype (UTF-8)
fi_FI.iso885915@euro numeric (ISO-8859-15) with fr_FR.utf8 ctype (UTF-8)
fi_FI.utf8 numeric (UTF-8) with fr_FR.utf8 ctype (UTF-8) |
msg95988> Hi, the following works in 2.7 but not in 3.x: ...
Sure, it works because Python 2 passes the raw byte string; it does not try to decode it. But did you try to display the result in a terminal, for example? Example with Python 2 in a UTF-8 terminal:
$ python
Python 2.7.5 (default, Oct 8 2013, 12:19:40)
[GCC 4.8.1 20130603 (Red Hat 4.8.1-1)] on linux2
>>> import locale
>>> # set the locale encoding to UTF-8
... locale.setlocale(locale.LC_CTYPE, 'fr_FR.utf8')
'fr_FR.utf8'
>>> # set the thousand separator to U+00A0
... locale.setlocale(locale.LC_NUMERIC, 'fi_FI')
'fi_FI'
>>> locale.getlocale(locale.LC_CTYPE)
('fr_FR', 'UTF-8')
>>> locale.getlocale(locale.LC_NUMERIC)
('fi_FI', 'ISO8859-15')
>>> locale.format('%d', 123456, True)
'123\xa0456'
>>> print(locale.format('%d', 123456, True))
123�456 Mojibake! � means that b'\xA0' cannot be decoded from the locale encoding (UTF-8). There is probably the same issue with a LC_MONETARY using a different encoding than LC_CTYPE.
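The decoding mismatch can be shown without touching the process locale at all. The snippet below is a minimal sketch that hard-codes the byte string from the session above and decodes it with both encodings; the variable names are purely illustrative.

```python
# Byte string produced above: the fi_FI thousands separator is the single
# byte 0xA0 when LC_NUMERIC uses ISO-8859-15.
raw = b"123\xa0456"

# Decoding with the LC_NUMERIC encoding yields U+00A0 (no-break space).
as_latin9 = raw.decode("iso8859-15")

# A lone 0xA0 byte is not valid UTF-8, so a strict decode fails...
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ...and a lenient decode substitutes U+FFFD, the replacement character
# that a UTF-8 terminal shows as mojibake.
as_utf8_lenient = raw.decode("utf-8", errors="replace")
```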
It is unrelated: time.strftime() uses LC_CTYPE, but Python was using the wrong encoding. Python used the locale encoding read at startup, whereas the *current* locale encoding must be used. This issue is specific to LC_NUMERIC with an LC_CTYPE using a different encoding.
Sure, because in this case, LC_NUMERIC produces data in the same encoding as LC_CTYPE.
Setting a locale is process-wide and should be avoided. FYI, locale.getpreferredencoding() temporarily changes LC_CTYPE by default; it only uses the current LC_CTYPE if you pass False. open() temporarily changed LC_CTYPE because of that in Python 3.0-3.2 (see issue bpo-11022). The following PostgreSQL issue looks to be the same as this Python issue: The fix temporarily changes the LC_CTYPE encoding:
#ifdef WIN32
setlocale(LC_CTYPE, locale_monetary);
#endif (I don't know why the code is specific to Windows.) |
STINNER Victor <report@bugs.python.org> wrote:
So does my patch. :) |
What I did:
santiago:python3 (el6) $ cd Python-3.3.2/build/optimized/ Oh well |
Victor, thanks for the comments. I also think we should set LC_CTYPE closer So it should happen somewhere in PyUnicode_DecodeLocaleAndSize(). Perhaps |
Matej Cepl <report@bugs.python.org> wrote:
It looks like some other test in the test suite did not restore a changed |
"So it should happen somewhere in PyUnicode_DecodeLocaleAndSize(). Perhaps we can create _PyUnicode_DecodeLocaleAndSize() which would take an LC_CTYPE parameter?" For this issue, it means that Python localeconv() will have to change the LC_CTYPE locale many times, once for each monetary and each numeric value. I prefer your patch :-) |
On 22/10/13 17:32, Stefan Krah wrote:
Very plain checkout of git and ./configure && make && make test leads to Curiouser and curiouser. |
STINNER Victor <report@bugs.python.org> wrote:
Windows and OS X have mbstowcs_l(), which takes a locale arg. Linux doesn't |
Here's a quick-and-dirty version of mbstowcs_l(). The difference in the |
@Stefan: Did you see my comments on Rietveld? |
Yes, I saw the comments. I'm still wondering if we should just write an Even then, there would still be a small chance that a C extension that |
What is this locale_t type used for the locale parameter of mbstowcs_l()? Are you sure that it is a string? According to this patch, it looks like a structure: |
STINNER Victor <report@bugs.python.org> wrote:
Yes, the string was mainly for benchmarking. FreeBSD seems to have thread safe |
Well, I originally opened this issue but personally I'm not that Victor, do you want to keep it open? |
The issue has a workaround: use LC_NUMERIC and LC_CTYPE locales which use the same encoding. To avoid issues, it's probably safer to only use UTF-8 locales, which are now available on modern Linux distros. I don't like the idea of calling setlocale() just for this corner case, because it changes the locale for all threads. Even if Python is protected by the GIL, Python can be embedded, or modules written in C may spawn threads which don't care about the GIL. Usually, if it can fail, it will fail :-) I see that various people contributed to the issue, but it looks like the only user asking for this is Stefan Krah. I prefer to close the issue and wait until more users ask for it before considering the patch again, or find a different way to implement the feature (support LC_NUMERIC and LC_CTYPE locales using different encodings). To be clear, I'm closing the issue right now. |
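For code that wants to apply the workaround defensively, a check along the following lines could detect the mismatch before formatting numbers. It is a sketch under assumptions: `same_codec` and `numeric_and_ctype_agree` are hypothetical helper names, and treating a missing encoding (as reported for the "C" locale) as ASCII is a simplification chosen for this example.

```python
import codecs
import locale


def same_codec(enc_a, enc_b):
    """Return True if two encoding names resolve to the same Python codec.

    None (reported for the "C" locale) is treated as ASCII here, which is
    an assumption of this sketch.
    """
    name_a = codecs.lookup(enc_a or "ascii").name
    name_b = codecs.lookup(enc_b or "ascii").name
    return name_a == name_b


def numeric_and_ctype_agree():
    """Compare the encodings that locale.getlocale() reports for
    LC_CTYPE and LC_NUMERIC in the current process."""
    ctype_enc = locale.getlocale(locale.LC_CTYPE)[1]
    numeric_enc = locale.getlocale(locale.LC_NUMERIC)[1]
    return same_codec(ctype_enc, numeric_enc)
```

With the locales from the Python 2 session earlier in the thread (LC_CTYPE reported as UTF-8, LC_NUMERIC as ISO8859-15), `same_codec` would report a mismatch, which is the situation the workaround tells users to avoid.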
Here we are: https://bugs.python.org/issue31900 |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.