Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

localeconv() should decode numeric fields from LC_NUMERIC encoding, not from LC_CTYPE encoding #76081

Closed
stratakis mannequin opened this issue Oct 30, 2017 · 33 comments
Closed
Labels
3.7 (EOL) end of life 3.8 only security fixes tests Tests in the Lib/test dir

Comments

@stratakis
Copy link
Mannequin

stratakis mannequin commented Oct 30, 2017

BPO 31900
Nosy @malemburg, @vstinner, @skrah, @serhiy-storchaka, @andreas-schwab, @stratakis
PRs
  • bpo-31900: Fix localeconv() encoding for LC_NUMERIC #4174
  • bpo-31900: Fix decimal for LC_NUMERIC != LC_CTYPE #5191
  • [3.6] bpo-31900: Fix localeconv() encoding for LC_NUMERIC (#4174) #5192
  • Files
  • inconsistent_locale_encodings.py
  • lc_numeric.py
  • lc_numeric2.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-11-28.16:53:58.445>
    created_at = <Date 2017-10-30.13:41:05.693>
    labels = ['3.8', '3.7', 'tests']
    title = 'localeconv() should decode numeric fields from LC_NUMERIC encoding, not from LC_CTYPE encoding'
    updated_at = <Date 2018-11-28.16:53:58.444>
    user = 'https://github.com/stratakis'

    bugs.python.org fields:

    activity = <Date 2018-11-28.16:53:58.444>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-11-28.16:53:58.445>
    closer = 'vstinner'
    components = ['Tests']
    creation = <Date 2017-10-30.13:41:05.693>
    creator = 'cstratak'
    dependencies = []
    files = ['47246', '47377', '47386']
    hgrepos = []
    issue_num = 31900
    keywords = ['patch']
    message_count = 33.0
    messages = ['305227', '305229', '305230', '305231', '305235', '305236', '305237', '307230', '308561', '309774', '309960', '309962', '309966', '309969', '309970', '309971', '309973', '309974', '309975', '309977', '309978', '309980', '309981', '309986', '309987', '309988', '309989', '309993', '310020', '310940', '327905', '330610', '330611']
    nosy_count = 6.0
    nosy_names = ['lemburg', 'vstinner', 'skrah', 'serhiy.storchaka', 'schwab', 'cstratak']
    pr_nums = ['4174', '5191', '5192']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue31900'
    versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

    @stratakis
    Copy link
    Mannequin Author

    stratakis mannequin commented Oct 30, 2017

    Original bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1484497

    It seems that on the development branch of Fedora, when we updated glibc from 2.26 to 2.26.90, test_float_with_comma started failing.

    Details from the original bug report:

    Under certain circumstances, when LC_NUMERIC is fr_FR.ISO8859-1 but LC_ALL is C.UTF-8, locale.localeconv() fails with
    UnicodeDecodeError: 'locale' codec can't decode byte 0xa0 in position 0: Invalid or incomplete multibyte or wide character

    Apparently, the thousands separator (or something else) in the lconv is "\xa0" (unbreakable space in fr_FR.ISO8859-1), and it's being decoded with UTF-8.

    This is tripped by Python's test suite, namely test_float.GeneralFloatCases.test_float_with_comma

    @stratakis stratakis mannequin added 3.7 (EOL) end of life 3.8 only security fixes tests Tests in the Lib/test dir labels Oct 30, 2017
    @vstinner
    Copy link
    Member

    I can reproduce the bug with Python 3.6 on Fedora 26 and these locales:

    • LC_ALL = LC_CTYPE = fr_FR (encoding = ISO8859-1)
    • LC_NUMERIC= es_MX.utf8 (encoding = UTF-8)

    Good: LC_NUMERIC = LC_CTYPE = LC_ALL = "es_MX.utf8"

    haypo@selma$ env -i python3 -c 'import locale; locale.setlocale(locale.LC_ALL, "es_MX.utf8"); print(ascii(locale.localeconv()["thousands_sep"]))'

    => '\u2009'

    Bug: LC_NUMERIC = "es_MX.utf8" but LC_CTYPE = LC_ALL = "fr_FR"

    haypo@selma$ env -i python3 -c 'import locale; locale.setlocale(locale.LC_ALL, "fr_FR"); locale.setlocale(locale.LC_NUMERIC, "es_MX.utf8"); print(ascii(locale.localeconv()["thousands_sep"]))'

    => '\xe2\x80\x89'

    @vstinner vstinner changed the title UnicodeDecodeError in localeconv() makes test_float fail with glibc 2.26.90 localeconv() should decide numeric fields from LC_NUMERIC encoding, not from LC_CTYPE encoding Oct 30, 2017
    @stratakis
    Copy link
    Mannequin Author

    stratakis mannequin commented Oct 30, 2017

    Tested the PR on a system with glibc 2.26.90 where the test was failing, and it successfully passed.

    @vstinner vstinner changed the title localeconv() should decide numeric fields from LC_NUMERIC encoding, not from LC_CTYPE encoding localeconv() should decode numeric fields from LC_NUMERIC encoding, not from LC_CTYPE encoding Oct 30, 2017
    @serhiy-storchaka
    Copy link
    Member

    This is a duplicate of bpo-28604. See also bpo-25812.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Oct 30, 2017

    Same as bpo-7442, I think.

    @vstinner
    Copy link
    Member

    Oh wow, this bug is older than what I expected :-) I added support for non-ASCII thousands separator in 2012:

    https://bugs.python.org/issue13706#msg151733

    @vstinner
    Copy link
    Member

    inconsistent_locale_encodings.py of closed issue bpo-7442 is interesting: I copy it here.

    @stratakis
    Copy link
    Mannequin Author

    stratakis mannequin commented Nov 29, 2017

    Pinging here. Is there some way I can help to move the issue forward?

    @vstinner
    Copy link
    Member

    Oh. Another Python function is impacted by the bug, str.format:

    $ env -i python3 -c 'import locale; locale.setlocale(locale.LC_ALL, "fr_FR"); locale.setlocale(locale.LC_NUMERIC, "es_MX.utf8"); print(ascii(f"{1000:n}"))'
    '1\xe2\x80\x89000'

    It should be '1\u2009000' ('1', '\u2009', '000').

    @vstinner
    Copy link
    Member

    I completed my change. It now fixes locale.localeconv(), str.format() for int, float, complex and decimal.Decimal:

    vstinner@apu$ ./python lc_numeric.py
    LC_CTYPE: ('fr_FR', 'ISO8859-1')
    LC_NUMERIC: ('es_MX', 'UTF-8')
    decimal_point: '.'
    thousands_sep: '\u2009'
    grouping: [3, 3, 0]
    int.__format__: '1\u2009234'
    float.__format__: '1\u2009234'
    complex.__format__: '1\u2009234+0j'
    Decimal.__format__: '1\u2009234'

    @vstinner
    Copy link
    Member

    Update: I pushed a large change to fix locale encodings in bpo-29240: commit 7ed7aea.

    @vstinner
    Copy link
    Member

    Oops lc_numeric.py contains a typo:

    d = decimal.Decimal(1234)
    print("Decimal.__format__: %a" % f"{i:n}")

    => it should be f"{d:n}"

    @malemburg
    Copy link
    Member

    Just FYI: LC_ALL has precedence over all other more specific LC_* settings:

    http://pubs.opengroup.org/onlinepubs/7908799/xbd/envvar.html
    http://man7.org/linux/man-pages/man7/locale.7.html

    Please confirm the bug without having LC_ALL or LANG set. Thanks.

    @vstinner
    Copy link
    Member

    Please confirm the bug without having LC_ALL or LANG set.

    lc_numeric.py uses:

      locale.setlocale(locale.LC_ALL, "fr_FR")

    Are you talking about that? What is the problem with this configuration?

    I'm sure that there is a bug :-) You aren't able to reproduce it? What is your operating system?

    @malemburg
    Copy link
    Member

    I just wanted to note that the description and title may cause a wrong interpretation of what should happen:

    If you first set LC_ALL and then one of the other categories such as LC_NUMERIC, locale C functions will still use the LC_ALL setting for everything. LC_NUMERIC does not override the LC_ALL setting.

    I tested this on OpenSUSE and get the same wrong results. Apparently, locale.localeconv() does not respect the above order. That's a bug.

    I'm not sure whether the OP's quoted behavior is a bug, though, since if the locale encoding is not UTF-8, you cannot really expect using UTF-8 numeric separators to output correctly.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jan 15, 2018

    On Mon, Jan 15, 2018 at 12:37:28PM +0000, Marc-Andre Lemburg wrote:

    If you first set LC_ALL and then one of the other categories such as LC_NUMERIC, locale C functions will still use the LC_ALL setting for everything. LC_NUMERIC does not override the LC_ALL setting.

    I have the exact same questions as Marc-Andre. This is one of the reasons why I
    blocked the _decimal change. I don't fully understand the role of the new glibc,
    since bpo-7442 has existed for ages -- and it is a open question whether it is a bug
    or not.

    Both views are reasonable IMO.

    @vstinner
    Copy link
    Member

    Marc-Andre Lemburg: "If you first set LC_ALL and then one of the other categories such as LC_NUMERIC, locale C functions will still use the LC_ALL setting for everything. LC_NUMERIC does not override the LC_ALL setting."

    The root of this issue is https://bugzilla.redhat.com/show_bug.cgi?id=1484497#c0:

    Petr Viktorin reproducer scripts uses Python locale.setlocale(), not environment variables:
    https://gist.github.com/encukou/70b3d3f9ef3e29ac1dbc23a5f7bd6431
    ---

    locale.setlocale(locale.LC_ALL, 'C.UTF-8')
    locale.setlocale(locale.LC_NUMERIC, 'fr_FR.ISO8859-1')

    @vstinner
    Copy link
    Member

    Example of Fedora 27 and Python 3.6:

    vstinner@apu$ env -i LC_NUMERIC=uk_UA.koi8u python3 -c 'import locale; print(locale.setlocale(locale.LC_ALL, "")); print(locale.getpreferredencoding(), ascii(locale.localeconv()["thousands_sep"]))'
    LC_CTYPE=C.UTF-8;LC_NUMERIC=uk_UA.koi8u;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/usr/lib64/python3.6/locale.py", line 110, in localeconv
        d = _localeconv()
    UnicodeDecodeError: 'locale' codec can't decode byte 0x9a in position 0: Invalid or incomplete multibyte or wide character

    "env -i" starts Python in an empty environment. It seems like LC_CTYPE defaults to C.UTF-8 in this case.

    • LC_CTYPE = C.UTF-8, encoding = UTF-8
    • LC_NUMERIC = uk_UA.koi8u, encoding = KOI8-U

    With my PR, it works:

    vstinner@apu$ env -i LC_NUMERIC=uk_UA.koi8u ./python -c 'import locale; print(locale.setlocale(locale.LC_ALL, "")); print(locale.getpreferredencoding(), ascii(locale.localeconv()["thousands_sep"]))'
    LC_CTYPE=C.UTF-8;LC_NUMERIC=uk_UA.koi8u;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C
    UTF-8 '\xa0'

    => thousands_sep byte string b'\x9A' is decoded as the Uniode string '\xa0'.

    vstinner@apu$ env -i LC_NUMERIC=uk_UA.koi8u ./python -c 'import locale; locale.setlocale(locale.LC_ALL, ""); print(ascii(f"{1234:n}"))'
    '1\xa0234'

    => the number is properly formatted

    vstinner@apu$ env -i LC_NUMERIC=uk_UA.koi8u ./python -c 'import locale; locale.setlocale(locale.LC_ALL, ""); print(f"{1234:n}")'
    1 234

    It's possible to display the result using print().

    @malemburg
    Copy link
    Member

    Ok, it seems that the C setlocale() itself does not follow the conventions set forth for environment variables:

    http://pubs.opengroup.org/onlinepubs/7908799/xsh/setlocale.html

    (see the example at the bottom)

    So the behavior shown by Python's setlocale() is fine.

    However, that still doesn't magically make this work:

    locale.setlocale(locale.LC_ALL, 'C.UTF-8')
    locale.setlocale(locale.LC_NUMERIC, 'fr_FR.ISO8859-1')

    If LC_NUMERIC uses a different encoding than LC_ALL, there's really no surprise in having numeric formatting fail. localeconv() will output the set encoding for the numeric string conversion and Python will decode this using the locale encoding set by LC_ALL. If those two are different, you run into problems.

    I would not consider this a bug in Python, but rather in the locale settings passed to setlocale().

    @vstinner
    Copy link
    Member

    The technical issue here is that the libc has no "stateless" function to process bytes and text with one specific locale. All functions rely on the *current* locales. To decode byte strings, we use mbstowcs(), and this function relies on the current LC_CTYPE locale, whereas decimal_point and thousands_sep should be decoded from the current LC_NUMERIC locale.

    @malemburg
    Copy link
    Member

    Indeed. The major problem with all libc locale functions is that they are not thread safe. The GIL does help a bit protecting against corrupted data, though.

    @vstinner
    Copy link
    Member

    I would not consider this a bug in Python, but rather in the locale settings passed to setlocale().

    Past 10 years, I repeated to every single user I met that "Python 3 is right, your system setup is wrong". But that's a waste of time. People continue to associate Python3 and Unicode to annoying bugs, because they don't understand how locales work.

    Instead of having to repeat to each user that "hum, maybe your config is wrong", I prefer to support this non convential setup and work as expected ("it just works"). With my latest implementation, setlocale() is only done when LC_CTYPE and LC_NUMERIC are different, which is the corner case which "shouldn't occur in practice".

    @malemburg
    Copy link
    Member

    Sounds like a good compromise :-)

    @vstinner
    Copy link
    Member

    I tested localeconv() with PR 4174 on FreeBSD:

    locale.setlocale(locale.LC_ALL, "C")
    locale.setlocale(locale.LC_NUMERIC, "ar_SA.UTF-8")

    It works as expected, result:

    decimal_point: '\u066b'
    thousands_sep: '\u066c'

    Compare it to Python 3.6 which returns mojibake, it seems like bytes are decoded from Latin1:

    decimal_point: '\xd9\xab'
    thousands_sep: '\xd9\xac'

    Raw byte strings, Python 2.7:

    • decimal_point: b'\xd9\xab'
    • thousands_sep: b'\xd9\xac'

    @vstinner
    Copy link
    Member

    Test on Linux (Fedora 27, glibc 2.26):

    locale.setlocale(locale.LC_ALL, "fr_FR")
    locale.setlocale(locale.LC_NUMERIC, "es_MX.utf8")

    It works as expected, result:

    decimal_point: '.'
    thousands_sep: '\u2009'

    Python 3.6 returns mojibake:

    decimal_point: '.'
    thousands_sep: '\xe2\x80\x89'

    Python 2.7 raw strings, thousands_sep = b'\xE2\x80\x89'.

    @vstinner
    Copy link
    Member

    On macOS 10.13.2, I failed to find any non-ASCII decimal_point or thousands_sep in localeconv(). I wrote a script to find all non-ASCII data in all locales:
    https://github.com/vstinner/misc/blob/master/python/all_locales.py

    @vstinner
    Copy link
    Member

    New changeset cb064fc by Victor Stinner in branch 'master':
    bpo-31900: Fix localeconv() encoding for LC_NUMERIC (bpo-4174)
    cb064fc

    @vstinner
    Copy link
    Member

    lc_numeric.py contains a typo, used fixed lc_numeric2.py instead to test my PR 5191 which fixes decimal.Decimal.

    @vstinner
    Copy link
    Member

    New changeset 5f959c4 by Victor Stinner in branch '3.6':
    [3.6] bpo-31900: Fix localeconv() encoding for LC_NUMERIC (bpo-4174) (bpo-5192)
    5f959c4

    @andreas-schwab
    Copy link
    Mannequin

    andreas-schwab mannequin commented Jan 28, 2018

    The technical issue here is that the libc has no "stateless" function to process bytes and text with one specific locale.

    That's not true. There is a rich set of *_l functions that take a locale_t object and operate on that locale.

    @vstinner
    Copy link
    Member

    Victor:

    The technical issue here is that the libc has no "stateless" function to process bytes and text with one specific locale.

    Andreas Schwab:

    That's not true. There is a rich set of *_l functions that take a locale_t object and operate on that locale.

    Oh. Do you want to work on a patch to use these functions? If yes, please open a new issue to enhance the code.

    @vstinner
    Copy link
    Member

    See also bpo-28604: localeconv() doesn't support LC_MONETARY encoding different than LC_CTYPE encoding.

    @vstinner
    Copy link
    Member

    The initial bug has been fixed, I close the issue.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes tests Tests in the Lib/test dir
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants