Issue 31900: localeconv() should decode numeric fields from LC_NUMERIC encoding, not from LC_CTYPE encoding

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/76081

classification

Title:	localeconv() should decode numeric fields from LC_NUMERIC encoding, not from LC_CTYPE encoding
Type:		Stage:	resolved
Components:	Tests	Versions:	Python 3.8, Python 3.7, Python 3.6

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	cstratak, lemburg, schwab, serhiy.storchaka, skrah, vstinner
Priority:	normal	Keywords:	patch

Created on 2017-10-30 13:41 by cstratak, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description
inconsistent_locale_encodings.py	vstinner, 2017-10-30 16:14
lc_numeric.py	vstinner, 2018-01-10 16:59
lc_numeric2.py	vstinner, 2018-01-15 15:41

Pull Requests
URL	Status	Linked
PR 4174	merged	vstinner, 2017-10-30 14:57
PR 5191	closed	vstinner, 2018-01-15 15:40
PR 5192	merged	vstinner, 2018-01-15 15:51

Messages (33)
msg305227 - (view)	Author: Charalampos Stratakis (cstratak) *	Date: 2017-10-30 13:41
Original bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1484497

It seems that on the development branch of Fedora, when we updated glibc from 2.26 to 2.26.90, test_float_with_comma started failing.

Details from the original bug report:

Under certain circumstances, when LC_NUMERIC is fr_FR.ISO8859-1 but LC_ALL is C.UTF-8, locale.localeconv() fails with:

    UnicodeDecodeError: 'locale' codec can't decode byte 0xa0 in position 0: Invalid or incomplete multibyte or wide character

Apparently, the thousands separator (or something else) in the lconv is "\xa0" (no-break space in fr_FR.ISO8859-1), and it's being decoded with UTF-8. This is tripped by Python's test suite, namely test_float.GeneralFloatCases.test_float_with_comma.
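The decoding failure described in this report can be illustrated without touching any locale settings. The sketch below uses Python's own codecs as a stand-in for the C library's mbstowcs(), which is an assumption made for illustration: the real error comes from the 'locale' codec, but the underlying reason is the same.

```python
# The byte reported for the thousands separator under fr_FR.ISO8859-1.
raw = b"\xa0"

# Decoded with the LC_NUMERIC encoding (ISO-8859-1) it is NO-BREAK SPACE.
as_latin1 = raw.decode("iso8859-1")

# Decoded with the LC_CTYPE encoding (UTF-8) it fails: 0xA0 is a
# continuation byte and cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(ascii(as_latin1), utf8_failed)  # '\xa0' True
```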
msg305229 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-10-30 14:53
I can reproduce the bug with Python 3.6 on Fedora 26 and these locales:

* LC_ALL = LC_CTYPE = fr_FR (encoding = ISO8859-1)
* LC_NUMERIC = es_MX.utf8 (encoding = UTF-8)

Good: LC_NUMERIC = LC_CTYPE = LC_ALL = "es_MX.utf8"

    haypo@selma$ env -i python3 -c 'import locale; locale.setlocale(locale.LC_ALL, "es_MX.utf8"); print(ascii(locale.localeconv()["thousands_sep"]))'
    => '\u2009'

Bug: LC_NUMERIC = "es_MX.utf8" but LC_CTYPE = LC_ALL = "fr_FR"

    haypo@selma$ env -i python3 -c 'import locale; locale.setlocale(locale.LC_ALL, "fr_FR"); locale.setlocale(locale.LC_NUMERIC, "es_MX.utf8"); print(ascii(locale.localeconv()["thousands_sep"]))'
    => '\xe2\x80\x89'
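The two results in this message are the same three bytes decoded with two different encodings. A minimal sketch, again assuming Python's codecs behave like the locale decoders for these bytes:

```python
# thousands_sep bytes for es_MX.utf8, as shown in the message above.
raw = b"\xe2\x80\x89"

# Decoding with the LC_NUMERIC encoding gives a single THIN SPACE.
correct = raw.decode("utf-8")

# Decoding with the LC_CTYPE encoding (ISO-8859-1) never fails, because
# every byte maps to a character, so the result is silent mojibake.
mojibake = raw.decode("iso8859-1")

print(ascii(correct))   # '\u2009'
print(ascii(mojibake))  # '\xe2\x80\x89'
```

Unlike the UTF-8 case in the original report, a single-byte LC_CTYPE encoding hides the problem instead of raising UnicodeDecodeError.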
msg305230 - (view)	Author: Charalampos Stratakis (cstratak) *	Date: 2017-10-30 15:25
Tested the PR on a system with glibc 2.26.90 where the test was failing, and it successfully passed.
msg305231 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-10-30 15:34
This is a duplicate of issue28604. See also issue25812.
msg305235 - (view)	Author: Stefan Krah (skrah) *	Date: 2017-10-30 16:02
Same as #7442, I think.
msg305236 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-10-30 16:13
Oh wow, this bug is older than what I expected :-) I added support for non-ASCII thousands separator in 2012: https://bugs.python.org/issue13706#msg151733
msg305237 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-10-30 16:14
inconsistent_locale_encodings.py of closed issue #7442 is interesting: I copy it here.
msg307230 - (view)	Author: Charalampos Stratakis (cstratak) *	Date: 2017-11-29 14:46
Pinging here. Is there some way I can help to move the issue forward?
msg308561 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-12-18 14:27
Oh. Another Python function is impacted by the bug, str.format:

    $ env -i python3 -c 'import locale; locale.setlocale(locale.LC_ALL, "fr_FR"); locale.setlocale(locale.LC_NUMERIC, "es_MX.utf8"); print(ascii(f"{1000:n}"))'
    '1\xe2\x80\x89000'

It should be '1\u2009000' ('1', '\u2009', '000').
msg309774 - (view)	Author: STINNER Victor (vstinner) *	Date: 2018-01-10 16:59
I completed my change. It now fixes locale.localeconv(), str.format() for int, float, complex and decimal.Decimal:

    vstinner@apu$ ./python lc_numeric.py
    LC_CTYPE: ('fr_FR', 'ISO8859-1')
    LC_NUMERIC: ('es_MX', 'UTF-8')
    decimal_point: '.'
    thousands_sep: '\u2009'
    grouping: [3, 3, 0]
    int.__format__: '1\u2009234'
    float.__format__: '1\u2009234'
    complex.__format__: '1\u2009234+0j'
    Decimal.__format__: '1\u2009234'
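To see how thousands_sep and grouping combine into the formatted values listed in this message, here is a rough pure-Python sketch. The helper name group_digits is hypothetical and this is not CPython's implementation; it also assumes that a grouping list which ends without a 0 behaves as if it ended with one.

```python
import locale


def group_digits(digits, sep, grouping):
    """Insert sep into a digit string following a localeconv()-style grouping list."""
    parts = []
    rest = digits
    sizes = iter(grouping)
    size = 0
    while rest:
        nxt = next(sizes, 0)  # assumption: running off the end acts like 0
        if nxt == locale.CHAR_MAX:
            break  # CHAR_MAX means "no further grouping"
        if nxt != 0:
            size = nxt  # 0 means "repeat the previous group size"
        if size == 0 or len(rest) <= size:
            break
        parts.append(rest[-size:])
        rest = rest[:-size]
    parts.append(rest)
    return sep.join(reversed(parts))


print(ascii(group_digits("1234", "\u2009", [3, 3, 0])))  # '1\u2009234'
```

With the separator decoded correctly, the join yields '1\u2009234'; with the mojibake separator it would yield '1\xe2\x80\x89234', which is the str.format symptom reported earlier in the thread.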
msg309960 - (view)	Author: STINNER Victor (vstinner) *	Date: 2018-01-15 09:54
Update: I pushed a large change to fix locale encodings in bpo-29240: commit 7ed7aead9503102d2ed316175f198104e0cd674c.
msg309962 - (view)	Author: STINNER Victor (vstinner) *	Date: 2018-01-15 11:08
Oops, lc_numeric.py contains a typo:

    d = decimal.Decimal(1234)
    print("Decimal.__format__: %a" % f"{i:n}")

=> it should be f"{d:n}"
msg309966 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2018-01-15 11:50
Just FYI: LC_ALL has precedence over all other more specific LC_* settings:

http://pubs.opengroup.org/onlinepubs/7908799/xbd/envvar.html
http://man7.org/linux/man-pages/man7/locale.7.html

Please confirm the bug without having LC_ALL or LANG set. Thanks.
msg309969 - (view)	Author: STINNER Victor (vstinner) *	Date: 2018-01-15 12:04
> Please confirm the bug without having LC_ALL or LANG set.

lc_numeric.py uses: locale.setlocale(locale.LC_ALL, "fr_FR")

Are you talking about that? What is the problem with this configuration? I'm sure that there is a bug :-) You aren't able to reproduce it? What is your operating system?
msg309970 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2018-01-15 12:37
I just wanted to note that the description and title may cause a wrong interpretation of what should happen: If you first set LC_ALL and then one of the other categories such as LC_NUMERIC, locale C functions will still use the LC_ALL setting for everything. LC_NUMERIC does not override the LC_ALL setting. I tested this on OpenSUSE and get the same wrong results. Apparently, locale.localeconv() does not respect the above order. That's a bug. I'm not sure whether the OP's quoted behavior is a bug, though, since if the locale encoding is not UTF-8, you cannot really expect using UTF-8 numeric separators to output correctly.
msg309971 - (view)	Author: Stefan Krah (skrah) *	Date: 2018-01-15 12:52
On Mon, Jan 15, 2018 at 12:37:28PM +0000, Marc-Andre Lemburg wrote:
> If you first set LC_ALL and then one of the other categories such as LC_NUMERIC, locale C functions will still use the LC_ALL setting for everything. LC_NUMERIC does not override the LC_ALL setting.

I have the exact same questions as Marc-Andre. This is one of the reasons why I blocked the _decimal change. I don't fully understand the role of the new glibc, since #7442 has existed for ages -- and it is an open question whether it is a bug or not. Both views are reasonable IMO.
msg309973 - (view)	Author: STINNER Victor (vstinner) *	Date: 2018-01-15 12:57
Marc-Andre Lemburg: "If you first set LC_ALL and then one of the other categories such as LC_NUMERIC, locale C functions will still use the LC_ALL setting for everything. LC_NUMERIC does not override the LC_ALL setting."

The root of this issue is https://bugzilla.redhat.com/show_bug.cgi?id=1484497#c0: Petr Viktorin's reproducer script uses Python's locale.setlocale(), not environment variables:

https://gist.github.com/encukou/70b3d3f9ef3e29ac1dbc23a5f7bd6431

---
locale.setlocale(locale.LC_ALL, 'C.UTF-8')
locale.setlocale(locale.LC_NUMERIC, 'fr_FR.ISO8859-1')
---
msg309974 - (view)	Author: STINNER Victor (vstinner) *	Date: 2018-01-15 13:05
Example on Fedora 27 with Python 3.6:

    vstinner@apu$ env -i LC_NUMERIC=uk_UA.koi8u python3 -c 'import locale; print(locale.setlocale(locale.LC_ALL, "")); print(locale.getpreferredencoding(), ascii(locale.localeconv()["thousands_sep"]))'
    LC_CTYPE=C.UTF-8;LC_NUMERIC=uk_UA.koi8u;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/usr/lib64/python3.6/locale.py", line 110, in localeconv
        d = _localeconv()
    UnicodeDecodeError: 'locale' codec can't decode byte 0x9a in position 0: Invalid or incomplete multibyte or wide character

"env -i" starts Python in an empty environment. It seems like LC_CTYPE defaults to C.UTF-8 in this case.

* LC_CTYPE = C.UTF-8, encoding = UTF-8
* LC_NUMERIC = uk_UA.koi8u, encoding = KOI8-U

With my PR, it works:

    vstinner@apu$ env -i LC_NUMERIC=uk_UA.koi8u ./python -c 'import locale; print(locale.setlocale(locale.LC_ALL, "")); print(locale.getpreferredencoding(), ascii(locale.localeconv()["thousands_sep"]))'
    LC_CTYPE=C.UTF-8;LC_NUMERIC=uk_UA.koi8u;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C
    UTF-8 '\xa0'

=> thousands_sep byte string b'\x9A' is decoded as the Unicode string '\xa0'.

    vstinner@apu$ env -i LC_NUMERIC=uk_UA.koi8u ./python -c 'import locale; locale.setlocale(locale.LC_ALL, ""); print(ascii(f"{1234:n}"))'
    '1\xa0234'

=> the number is properly formatted

    vstinner@apu$ env -i LC_NUMERIC=uk_UA.koi8u ./python -c 'import locale; locale.setlocale(locale.LC_ALL, ""); print(f"{1234:n}")'
    1 234

It's possible to display the result using print().
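The composite string printed by setlocale(LC_ALL, "") in this message makes the mismatch visible. A small sketch for splitting it into per-category values follows; the helper name parse_composite_locale is hypothetical, and the "NAME=value;NAME=value" layout is what glibc printed here, not something guaranteed on other platforms.

```python
def parse_composite_locale(value):
    """Split a glibc-style composite locale string into {category: locale}."""
    if "=" not in value:
        # All categories share one locale, e.g. "C" or "fr_FR.UTF-8".
        return {"LC_ALL": value}
    return dict(item.split("=", 1) for item in value.split(";") if item)


composite = (
    "LC_CTYPE=C.UTF-8;LC_NUMERIC=uk_UA.koi8u;LC_TIME=C;LC_COLLATE=C;"
    "LC_MONETARY=C;LC_MESSAGES=C"
)
categories = parse_composite_locale(composite)

# A difference between these two is the precondition for the bug.
mismatch = categories["LC_CTYPE"] != categories["LC_NUMERIC"]
print(categories["LC_CTYPE"], categories["LC_NUMERIC"], mismatch)
```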
msg309975 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2018-01-15 13:13
Ok, it seems that the C setlocale() itself does not follow the conventions set forth for environment variables:

http://pubs.opengroup.org/onlinepubs/7908799/xsh/setlocale.html (see the example at the bottom)

So the behavior shown by Python's setlocale() is fine. However, that still doesn't magically make this work:

    locale.setlocale(locale.LC_ALL, 'C.UTF-8')
    locale.setlocale(locale.LC_NUMERIC, 'fr_FR.ISO8859-1')

If LC_NUMERIC uses a different encoding than LC_ALL, there's really no surprise in having numeric formatting fail. localeconv() will output the set encoding for the numeric string conversion and Python will decode this using the locale encoding set by LC_ALL. If those two are different, you run into problems.

I would not consider this a bug in Python, but rather in the locale settings passed to setlocale().
msg309977 - (view)	Author: STINNER Victor (vstinner) *	Date: 2018-01-15 13:14
The technical issue here is that the libc has no "stateless" function to process bytes and text with one specific locale. All functions rely on the current locales. To decode byte strings, we use mbstowcs(), and this function relies on the current LC_CTYPE locale, whereas decimal_point and thousands_sep should be decoded from the current LC_NUMERIC locale.
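The workaround implied here is to make LC_CTYPE temporarily match LC_NUMERIC while the numeric fields are read. The sketch below expresses that idea at the Python level. It is only an illustration under stated assumptions: the function name is hypothetical, the actual fix lives in CPython's C code, and, like every setlocale()-based approach, it is not thread safe.

```python
import locale


def localeconv_with_numeric_ctype():
    """Call localeconv() while LC_CTYPE temporarily equals LC_NUMERIC."""
    old_ctype = locale.setlocale(locale.LC_CTYPE)  # query only
    numeric = locale.setlocale(locale.LC_NUMERIC)  # query only
    if numeric == old_ctype:
        # Same locale for both categories: nothing to switch.
        return locale.localeconv()
    locale.setlocale(locale.LC_CTYPE, numeric)
    try:
        return locale.localeconv()
    finally:
        # Always restore the caller's LC_CTYPE, even if decoding fails.
        locale.setlocale(locale.LC_CTYPE, old_ctype)


conv = localeconv_with_numeric_ctype()
print(ascii(conv["decimal_point"]), ascii(conv["thousands_sep"]))
```

The switch is skipped when both categories already agree, so the common configuration pays no extra setlocale() calls.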
msg309978 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2018-01-15 13:18
Indeed. The major problem with all libc locale functions is that they are not thread safe. The GIL does help a bit protecting against corrupted data, though.
msg309980 - (view)	Author: STINNER Victor (vstinner) *	Date: 2018-01-15 13:20
> I would not consider this a bug in Python, but rather in the locale settings passed to setlocale().

For the past 10 years, I have repeated to every single user I met that "Python 3 is right, your system setup is wrong". But that's a waste of time. People continue to associate Python 3 and Unicode with annoying bugs, because they don't understand how locales work.

Instead of having to repeat to each user that "hum, maybe your config is wrong", I prefer to support this unconventional setup and have it work as expected ("it just works"). With my latest implementation, setlocale() is only called when LC_CTYPE and LC_NUMERIC are different, which is the corner case that "shouldn't occur in practice".
msg309981 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2018-01-15 13:21
Sounds like a good compromise :-)
msg309986 - (view)	Author: STINNER Victor (vstinner) *	Date: 2018-01-15 14:44
I tested localeconv() with PR 4174 on FreeBSD:

    locale.setlocale(locale.LC_ALL, "C")
    locale.setlocale(locale.LC_NUMERIC, "ar_SA.UTF-8")

It works as expected, result:

    decimal_point: '\u066b'
    thousands_sep: '\u066c'

Compare it to Python 3.6, which returns mojibake; it seems like bytes are decoded from Latin1:

    decimal_point: '\xd9\xab'
    thousands_sep: '\xd9\xac'

Raw byte strings, Python 2.7:

* decimal_point: b'\xd9\xab'
* thousands_sep: b'\xd9\xac'
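On an unpatched Python 3.6, the Latin1 mojibake shown in this message can be reversed by hand, since ISO-8859-1 maps every byte to exactly one code point. A minimal sketch, assuming the bytes really were UTF-8 (the helper name is hypothetical):

```python
def repair_latin1_mojibake(text):
    """Undo a wrong ISO-8859-1 decode of bytes that were really UTF-8."""
    return text.encode("iso8859-1").decode("utf-8")


decimal_point = repair_latin1_mojibake("\xd9\xab")
thousands_sep = repair_latin1_mojibake("\xd9\xac")

print(ascii(decimal_point))  # '\u066b'
print(ascii(thousands_sep))  # '\u066c'
```

This round trip only works when LC_CTYPE is a single-byte encoding that decodes every byte; with a UTF-8 LC_CTYPE the original decode raises instead, as in the first report.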
msg309987 - (view)	Author: STINNER Victor (vstinner) *	Date: 2018-01-15 14:46
Test on Linux (Fedora 27, glibc 2.26):

    locale.setlocale(locale.LC_ALL, "fr_FR")
    locale.setlocale(locale.LC_NUMERIC, "es_MX.utf8")

It works as expected, result:

    decimal_point: '.'
    thousands_sep: '\u2009'

Python 3.6 returns mojibake:

    decimal_point: '.'
    thousands_sep: '\xe2\x80\x89'

Python 2.7 raw strings, thousands_sep = b'\xE2\x80\x89'.
msg309988 - (view)	Author: STINNER Victor (vstinner) *	Date: 2018-01-15 14:56
On macOS 10.13.2, I failed to find any non-ASCII decimal_point or thousands_sep in localeconv(). I wrote a script to find all non-ASCII data in all locales: https://github.com/vstinner/misc/blob/master/python/all_locales.py
msg309989 - (view)	Author: STINNER Victor (vstinner) *	Date: 2018-01-15 14:58
New changeset cb064fc2321ce8673fe365e9ef60445a27657f54 by Victor Stinner in branch 'master': bpo-31900: Fix localeconv() encoding for LC_NUMERIC (#4174) https://github.com/python/cpython/commit/cb064fc2321ce8673fe365e9ef60445a27657f54
msg309993 - (view)	Author: STINNER Victor (vstinner) *	Date: 2018-01-15 15:41
lc_numeric.py contains a typo; use the fixed lc_numeric2.py instead to test my PR 5191, which fixes decimal.Decimal.
msg310020 - (view)	Author: STINNER Victor (vstinner) *	Date: 2018-01-15 22:23
New changeset 5f959c4f9eca404b8bc4bc6348fed27c4b907b89 by Victor Stinner in branch '3.6': [3.6] bpo-31900: Fix localeconv() encoding for LC_NUMERIC (#4174) (#5192) https://github.com/python/cpython/commit/5f959c4f9eca404b8bc4bc6348fed27c4b907b89
msg310940 - (view)	Author: Andreas Schwab (schwab) *	Date: 2018-01-28 11:54
> The technical issue here is that the libc has no "stateless" function to process bytes and text with one specific locale. That's not true. There is a rich set of *_l functions that take a locale_t object and operate on that locale.
msg327905 - (view)	Author: STINNER Victor (vstinner) *	Date: 2018-10-17 15:19
Victor:
> The technical issue here is that the libc has no "stateless" function to process bytes and text with one specific locale.

Andreas Schwab:
> That's not true. There is a rich set of *_l functions that take a locale_t object and operate on that locale.

Oh. Do you want to work on a patch to use these functions? If yes, please open a new issue to enhance the code.
msg330610 - (view)	Author: STINNER Victor (vstinner) *	Date: 2018-11-28 16:52
See also bpo-28604: localeconv() doesn't support LC_MONETARY encoding different than LC_CTYPE encoding.
msg330611 - (view)	Author: STINNER Victor (vstinner) *	Date: 2018-11-28 16:53
The initial bug has been fixed, I close the issue.

History
Date	User	Action	Args
2022-04-11 14:58:53	admin	set	github: 76081
2018-11-28 16:53:58	vstinner	set	status: open -> closed resolution: fixed messages: + msg330611 stage: patch review -> resolved
2018-11-28 16:52:47	vstinner	set	messages: + msg330610
2018-10-17 15:19:12	vstinner	set	messages: + msg327905
2018-01-28 11:54:59	schwab	set	nosy: + schwab messages: + msg310940
2018-01-15 22:23:50	vstinner	set	messages: + msg310020
2018-01-15 15:51:32	vstinner	set	pull_requests: + pull_request5046
2018-01-15 15:41:31	vstinner	set	files: + lc_numeric2.py messages: + msg309993
2018-01-15 15:40:01	vstinner	set	pull_requests: + pull_request5045
2018-01-15 14:58:04	vstinner	set	messages: + msg309989
2018-01-15 14:56:16	vstinner	set	messages: + msg309988
2018-01-15 14:46:06	vstinner	set	messages: + msg309987
2018-01-15 14:44:00	vstinner	set	messages: + msg309986
2018-01-15 13:21:52	lemburg	set	messages: + msg309981
2018-01-15 13:20:26	vstinner	set	messages: + msg309980
2018-01-15 13:18:27	lemburg	set	messages: + msg309978
2018-01-15 13:14:07	vstinner	set	messages: + msg309977
2018-01-15 13:13:44	lemburg	set	messages: + msg309975
2018-01-15 13:05:27	vstinner	set	messages: + msg309974
2018-01-15 12:57:20	vstinner	set	messages: + msg309973
2018-01-15 12:52:30	skrah	set	messages: + msg309971
2018-01-15 12:37:28	lemburg	set	messages: + msg309970
2018-01-15 12:04:08	vstinner	set	messages: + msg309969
2018-01-15 11:50:18	lemburg	set	nosy: + lemburg messages: + msg309966
2018-01-15 11:08:21	vstinner	set	messages: + msg309962
2018-01-15 09:54:44	vstinner	set	messages: + msg309960
2018-01-10 16:59:11	vstinner	set	files: + lc_numeric.py messages: + msg309774
2017-12-18 14:27:34	vstinner	set	messages: + msg308561
2017-11-29 14:46:19	cstratak	set	messages: + msg307230
2017-10-30 16:14:00	vstinner	set	files: + inconsistent_locale_encodings.py messages: + msg305237
2017-10-30 16:13:01	vstinner	set	messages: + msg305236
2017-10-30 16:02:37	skrah	set	nosy: + skrah messages: + msg305235
2017-10-30 15:34:11	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg305231
2017-10-30 15:32:38	vstinner	set	title: localeconv() should decide numeric fields from LC_NUMERIC encoding, not from LC_CTYPE encoding -> localeconv() should decode numeric fields from LC_NUMERIC encoding, not from LC_CTYPE encoding
2017-10-30 15:25:58	cstratak	set	messages: + msg305230
2017-10-30 14:58:49	vstinner	set	versions: - Python 3.5
2017-10-30 14:57:38	vstinner	set	keywords: + patch stage: patch review pull_requests: + pull_request4142
2017-10-30 14:56:20	vstinner	set	title: UnicodeDecodeError in localeconv() makes test_float fail with glibc 2.26.90 -> localeconv() should decide numeric fields from LC_NUMERIC encoding, not from LC_CTYPE encoding
2017-10-30 14:53:19	vstinner	set	nosy: + vstinner messages: + msg305229
2017-10-30 13:41:05	cstratak	create