_localemodule.c: str2uni() with different LC_NUMERIC and LC_CTYPE #51691

Closed
skrah mannequin opened this issue Dec 5, 2009 · 35 comments
Labels
type-bug An unexpected behavior, bug, or error

Comments

@skrah
Mannequin

skrah mannequin commented Dec 5, 2009

BPO 7442
Nosy @loewis, @mdickinson, @vstinner, @ericvsmith, @mcepl, @skrah
Files
  • set_ctype_before_mbstowcs.patch
  • inconsistent_locale_encodings.py
  • mbstowcs_l.c
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2014-10-14.19:30:57.384>
    created_at = <Date 2009-12-05.10:44:18.463>
    labels = ['type-bug']
    title = '_localemodule.c: str2uni() with different LC_NUMERIC and LC_CTYPE'
    updated_at = <Date 2018-01-10.15:46:42.607>
    user = 'https://github.com/skrah'

    bugs.python.org fields:

    activity = <Date 2018-01-10.15:46:42.607>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2014-10-14.19:30:57.384>
    closer = 'vstinner'
    components = []
    creation = <Date 2009-12-05.10:44:18.463>
    creator = 'skrah'
    dependencies = []
    files = ['16221', '32303', '32306']
    hgrepos = []
    issue_num = 7442
    keywords = ['patch']
    message_count = 35.0
    messages = ['95988', '96008', '96534', '96535', '96544', '96556', '96557', '99317', '190055', '200892', '200912', '200916', '200917', '200927', '200928', '200929', '200933', '200944', '200954', '200957', '200960', '200961', '200968', '200969', '200970', '200974', '200976', '200983', '202117', '202127', '202132', '202145', '229318', '229338', '309770']
    nosy_count = 6.0
    nosy_names = ['loewis', 'mark.dickinson', 'vstinner', 'eric.smith', 'mcepl', 'skrah']
    pr_nums = []
    priority = 'normal'
    resolution = 'wont fix'
    stage = 'needs patch'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue7442'
    versions = ['Python 3.2']

    @skrah
    Mannequin Author

    skrah mannequin commented Dec 5, 2009

    Hi, the following works in 2.7 but not in 3.x:

    >>> import locale
    >>> from decimal import *
    >>> locale.setlocale(locale.LC_NUMERIC, 'fi_FI')
    'fi_FI'
    >>> format(Decimal('1000'), 'n')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.2/decimal.py", line 3632, in __format__
        spec = _parse_format_specifier(specifier, _localeconv=_localeconv)
      File "/usr/lib/python3.2/decimal.py", line 5628, in _parse_format_specifier
        _localeconv = _locale.localeconv()
      File "/usr/lib/python3.2/locale.py", line 111, in localeconv
        d = _localeconv()
    ValueError: Cannot convert byte to string

    @skrah
    Mannequin Author

    skrah mannequin commented Dec 5, 2009

    This fails in _localemodule.c: str2uni(). mbstowcs(NULL, s, 0) is
    LC_CTYPE sensitive, but LC_CTYPE is UTF-8 in my terminal.

    If I set LC_CTYPE and LC_NUMERIC together, things work.

    This raises the question: If LC_CTYPE and LC_NUMERIC differ (and
    since they are separate entities I assume they may differ), what
    is the correct way to convert the separator and the decimal point?

    a) call setlocale(LC_CTYPE, setlocale(LC_NUMERIC, NULL)) before
    mbstowcs. This is not really an option.

    b) use some kind of _mbstowcs_l
    (http://msdn.microsoft.com/en-us/library/k1f9b8cy(VS.80).aspx), which
    takes a locale parameter. But I don't
    find such a thing on Linux.

    @mdickinson
    Member

    I'm failing to reproduce this (with py3k) on OS X:

    Python 3.2a0 (py3k:76866:76867, Dec 17 2009, 09:19:26) 
    [GCC 4.2.1 (Apple Inc. build 5646) (dot 1)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import locale
    >>> from decimal import *
    >>> locale.setlocale(locale.LC_NUMERIC, 'fi_FI')
    'fi_FI'
    >>> format(Decimal('1000'), 'n')
    '1.000'

    The locale command, from the same Terminal prompt, gives me:

    LANG="en_IE.UTF-8"
    LC_COLLATE="en_IE.UTF-8"
    LC_CTYPE="en_IE.UTF-8"
    LC_MESSAGES="en_IE.UTF-8"
    LC_MONETARY="en_IE.UTF-8"
    LC_NUMERIC="en_IE.UTF-8"
    LC_TIME="en_IE.UTF-8"
    LC_ALL=

    Just to be clear, is it true that you still get the same result without
    involving Decimal at all? That is, am I correct in assuming that:

    >>> import locale
    >>> locale.setlocale(locale.LC_NUMERIC, 'fi_FI')
    'fi_FI'
    >>> locale.localeconv()

    also gives you that ValueError?

    @mdickinson
    Member

    What are the multibyte strings that mbstowcs is failing to convert?
    On my machine, the separators come out as plain ASCII '.' (for thousands)
    and ',' (for the decimal point).

    @ericvsmith
    Member

    I can reproduce it on a Fedora (fc6) Linux box. It's not a decimal
    problem, but a plain locale problem:

    >>> import locale
    >>> locale.setlocale(locale.LC_NUMERIC, 'fi_FI')
    'fi_FI'
    >>> locale.localeconv()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/root/python/py3k/Lib/locale.py", line 111, in localeconv
        d = _localeconv()
    ValueError: Cannot convert byte to string
    >>> 

    Here's the contents of the struct lconv as returned by localeconv():

    (gdb) p *l
    $1 = {decimal_point = 0xb7b54020 ",", thousands_sep = 0xb7b54022 " ",
    grouping = 0xb7b54024 "\003\003",
    int_curr_symbol = 0x998858 "", currency_symbol = 0x998858 "",
    mon_decimal_point = 0x998858 "", mon_thousands_sep = 0x998858 "",
    mon_grouping = 0x998858 "", positive_sign = 0x998858 "", negative_sign
    = 0x998858 "", int_frac_digits = 127 '\177',
    frac_digits = 127 '\177', p_cs_precedes = 127 '\177', p_sep_by_space =
    127 '\177', n_cs_precedes = 127 '\177',
    n_sep_by_space = 127 '\177', p_sign_posn = 127 '\177', n_sign_posn =
    127 '\177', int_p_cs_precedes = 127 '\177',
    int_p_sep_by_space = 127 '\177', int_n_cs_precedes = 127 '\177',
    int_n_sep_by_space = 127 '\177', int_p_sign_posn = 127 '\177',
    int_n_sign_posn = 127 '\177'}

    The problem is thousands_sep:
    (gdb) p l->thousands_sep
    $2 = 0xb7b54022 " "
    (gdb) p (unsigned char)l->thousands_sep[0]
    $3 = 160 ' '
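To make the failing byte concrete: 160 is 0xA0, which ISO-8859-1 and ISO-8859-15 map to U+00A0 (NO-BREAK SPACE), while a lone 0xA0 byte is not a valid UTF-8 sequence. A minimal Python sketch of that difference (the variable names are only illustrative):

```python
# The thousands separator byte reported by gdb above.
raw_sep = bytes([160])  # b'\xa0'

# Decoding with a Latin encoding succeeds and yields NO-BREAK SPACE.
latin_sep = raw_sep.decode("iso8859-15")

# Decoding the same byte as UTF-8 fails: 0xA0 is a continuation byte
# and cannot start a character on its own.
try:
    raw_sep.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(latin_sep), utf8_ok)
```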

    @skrah
    Mannequin Author

    skrah mannequin commented Dec 18, 2009

    Yes, it's a problem in _localemodule.c. This situation always
    occurs when LC_NUMERIC is something like ISO8859-15, LC_CTYPE
    is UTF-8 AND the decimal point or separator are in the range
    128-255. Then mbstowcs tries to decode the character according
    to LC_CTYPE and finds that the character is not valid UTF-8:

    static PyObject*
    str2uni(const char* s)
    {
    #ifdef HAVE_BROKEN_MBSTOWCS
        size_t needed = strlen(s);
    #else
        size_t needed = mbstowcs(NULL, s, 0);
    #endif

    I can't see a portable way to fix this except:

    block threads
    set temporary LC_CTYPE
    call mbstowcs
    restore LC_CTYPE
    unblock threads

    I don't think this issue is important enough to do that. What
    I do in cdecimal is raise an error "Invalid separator or
    unsupported combination of LC_NUMERIC and LC_CTYPE".
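The steps listed above can be sketched at the Python level roughly as follows. This is only an illustration of the idea, not the attached patch: the lock and the function name are hypothetical, the lock only serializes callers that go through this function, and the monetary fields would follow LC_MONETARY rather than LC_NUMERIC.

```python
import locale
import threading

# Hypothetical module-level lock. It only serializes callers that use it;
# threads that call setlocale() directly (e.g. from C code) are not covered.
_ctype_lock = threading.Lock()


def localeconv_with_numeric_ctype():
    """Call locale.localeconv() while LC_CTYPE matches LC_NUMERIC."""
    with _ctype_lock:
        saved_ctype = locale.setlocale(locale.LC_CTYPE)     # query only
        numeric_name = locale.setlocale(locale.LC_NUMERIC)  # query only
        try:
            # Make the decoder agree with the encoding of the numeric fields.
            locale.setlocale(locale.LC_CTYPE, numeric_name)
            return locale.localeconv()
        finally:
            # Always restore the caller's LC_CTYPE.
            locale.setlocale(locale.LC_CTYPE, saved_ctype)


conv = localeconv_with_numeric_ctype()
print(conv["decimal_point"], repr(conv["thousands_sep"]))
```

The window between the two setlocale() calls is process-wide, which is the main cost of this approach.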

    @skrah
    Mannequin Author

    skrah mannequin commented Dec 18, 2009

    Changed title (was: decimal.py: format failure with locale specifier)

    @skrah skrah mannequin changed the title from "decimal.py: format failure with locale specifier" to "_localemodule.c: str2uni() with different LC_NUMERIC and LC_CTYPE" Dec 18, 2009
    @pitrou pitrou added the type-bug label Dec 19, 2009
    @skrah
    Mannequin Author

    skrah mannequin commented Feb 13, 2010

    I have a patch that fixes this specific issue. Probably there are similar
    issues in other places, e.g. when LC_TIME and LC_CTYPE differ.

    I suspect that this is related:

    http://bugs.python.org/issue5905

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented May 26, 2013

    Could we have a patch review please. Also note that bpo-5905 has been closed.

    @mcepl
    Mannequin

    mcepl mannequin commented Oct 22, 2013

    Is this the same as when tests with python-3.3.2 fail for me on RHEL-6?

    test_locale (test.test_format.FormatTest) ... ERROR
    test_non_ascii (test.test_format.FormatTest) ... test test_format failed
    '\u20ac=%f' % (1.0,) =? '\u20ac=1.000000' ... yes
    ok

    ======================================================================
    ERROR: test_locale (test.test_format.FormatTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/matej/build/Extras/python3/Python-3.3.2/Lib/test/test_format.py", line 295, in test_locale
        localeconv = locale.localeconv()
      File "/home/matej/build/Extras/python3/Python-3.3.2/Lib/locale.py", line 111, in localeconv
        d = _localeconv()
    UnicodeDecodeError: 'locale' codec can't decode byte 0xe2 in position 0: Invalid or incomplete multibyte or wide character

    @skrah
    Mannequin Author

    skrah mannequin commented Oct 22, 2013

    Matej Cepl <report@bugs.python.org> wrote:

    > Is this the same as when tests with python-3.3.2 fail for me on RHEL-6?

    If LC_CTYPE is UTF-8 and LC_NUMERIC something like ISO-8859-2 then it's
    the same issue.

    @mcepl
    Mannequin

    mcepl mannequin commented Oct 22, 2013

    Hmm, so with this patch

    diff -up Python-3.3.2/Lib/test/test_format.py.fixFormatTest Python-3.3.2/Lib/test/test_format.py
    --- Python-3.3.2/Lib/test/test_format.py.fixFormatTest  2013-10-22 10:05:12.253426746 +0200
    +++ Python-3.3.2/Lib/test/test_format.py        2013-10-22 10:16:58.510530570 +0200
    @@ -288,7 +288,7 @@ class FormatTest(unittest.TestCase):
         def test_locale(self):
             try:
                 oldloc = locale.setlocale(locale.LC_ALL)
    -            locale.setlocale(locale.LC_ALL, '')
    +            locale.setlocale(locale.LC_ALL, 'ps_AF')
             except locale.Error as err:
                 self.skipTest("Cannot set locale: {}".format(err))
             try:

    (or any other explicit locale, I have tried also en_IE) test doesn't fail.

    Using Python-3.3.2 on RHEL-6 (kernel 2.6.32-358.23.2.el6.i686).

    @mcepl
    Mannequin

    mcepl mannequin commented Oct 22, 2013

    To msg200912: now, the system locale is UTF-8 all the way:
    santiago:python3 (el6) $ locale
    LANG=en_US.utf8
    LC_CTYPE="en_US.utf8"
    LC_NUMERIC=en_IE.utf8
    LC_TIME=en_IE.utf8
    LC_COLLATE="en_US.utf8"
    LC_MONETARY=en_IE.utf8
    LC_MESSAGES="en_US.utf8"
    LC_PAPER=en_IE.utf8
    LC_NAME="en_US.utf8"
    LC_ADDRESS="en_US.utf8"
    LC_TELEPHONE="en_US.utf8"
    LC_MEASUREMENT=en_IE.utf8
    LC_IDENTIFICATION="en_US.utf8"
    LC_ALL=
    santiago:python3 (el6) $

    @skrah
    Mannequin Author

    skrah mannequin commented Oct 22, 2013

    Matej Cepl <report@bugs.python.org> wrote:

    > To msg200912: now, the system locale is UTF-8 all the way:
    > santiago:python3 (el6) $ locale
    > LANG=en_US.utf8
    > LC_CTYPE="en_US.utf8"
    > LC_NUMERIC=en_IE.utf8
    > LC_TIME=en_IE.utf8
    > LC_COLLATE="en_US.utf8"
    > LC_MONETARY=en_IE.utf8
    > LC_MESSAGES="en_US.utf8"
    > LC_PAPER=en_IE.utf8
    > LC_NAME="en_US.utf8"
    > LC_ADDRESS="en_US.utf8"
    > LC_TELEPHONE="en_US.utf8"
    > LC_MEASUREMENT=en_IE.utf8
    > LC_IDENTIFICATION="en_US.utf8"
    > LC_ALL=
    > santiago:python3 (el6) $

    The test passes here with these values (Debian).

    What is the output of:

    a) locale.localeconv()

    b) locale.setlocale(locale.LC_ALL, '')
    locale.localeconv()

    @vstinner
    Member

    This issue is very close to the issue bpo-13706 which I solved with the new function PyUnicode_DecodeLocale(): see get_locale_info() in Python/formatter_unicode.c.

    We might copy/paste the code, or we should maybe add a private API to get locale information: get_locale_info() => _Py_get_locale_info() and expose the LocaleInfo structure. It may be added to unicodeobject.h for example, there is already a function related to locales: _PyUnicode_InsertThousandsGrouping().

    @mcepl
    Mannequin

    mcepl mannequin commented Oct 22, 2013

    >>> import locale
    >>> locale.localeconv()
    {'p_cs_precedes': 127, 'n_sep_by_space': 127, 'n_sign_posn': 127, 'n_cs_precedes': 127, 'grouping': [], 'positive_sign': '', 'mon_grouping': [], 'p_sep_by_space': 127, 'mon_thousands_sep': '', 'currency_symbol': '', 'mon_decimal_point': '', 'int_curr_symbol': '', 'thousands_sep': '', 'frac_digits': 127, 'int_frac_digits': 127, 'negative_sign': '', 'decimal_point': '.', 'p_sign_posn': 127}
    >>> locale.setlocale(locale.LC_ALL, '')
    'LC_CTYPE=en_US.utf8;LC_NUMERIC=en_IE.utf8;LC_TIME=en_IE.utf8;LC_COLLATE=en_US.utf8;LC_MONETARY=en_IE.utf8;LC_MESSAGES=en_US.utf8;LC_PAPER=en_IE.utf8;LC_NAME=en_US.utf8;LC_ADDRESS=en_US.utf8;LC_TELEPHONE=en_US.utf8;LC_MEASUREMENT=en_IE.utf8;LC_IDENTIFICATION=en_US.utf8'
    >>> locale.localeconv()
    {'p_cs_precedes': 1, 'n_sep_by_space': 0, 'n_sign_posn': 1, 'n_cs_precedes': 1, 'grouping': [3, 3, 0], 'positive_sign': '', 'mon_grouping': [3, 3, 0], 'p_sep_by_space': 0, 'mon_thousands_sep': ',', 'currency_symbol': '€', 'mon_decimal_point': '.', 'int_curr_symbol': 'EUR ', 'thousands_sep': ',', 'frac_digits': 2, 'int_frac_digits': 2, 'negative_sign': '-', 'decimal_point': '.', 'p_sign_posn': 1}
    >>>

    @mcepl
    Mannequin

    mcepl mannequin commented Oct 22, 2013

    Perhaps version of glibc might be interesting as well?

    glibc-2.12-1.107.el6_4.5.i686

    @skrah
    Mannequin Author

    skrah mannequin commented Oct 22, 2013

    Matej Cepl <report@bugs.python.org> wrote:
    > >>> import locale
    > >>> locale.localeconv()
    > {'p_cs_precedes': 127, 'n_sep_by_space': 127, 'n_sign_posn': 127, 'n_cs_precedes': 127, 'grouping': [], 'positive_sign': '', 'mon_grouping': [], 'p_sep_by_space': 127, 'mon_thousands_sep': '', 'currency_symbol': '', 'mon_decimal_point': '', 'int_curr_symbol': '', 'thousands_sep': '', 'frac_digits': 127, 'int_frac_digits': 127, 'negative_sign': '', 'decimal_point': '.', 'p_sign_posn': 127}
    > >>> locale.setlocale(locale.LC_ALL, '')
    > 'LC_CTYPE=en_US.utf8;LC_NUMERIC=en_IE.utf8;LC_TIME=en_IE.utf8;LC_COLLATE=en_US.utf8;LC_MONETARY=en_IE.utf8;LC_MESSAGES=en_US.utf8;LC_PAPER=en_IE.utf8;LC_NAME=en_US.utf8;LC_ADDRESS=en_US.utf8;LC_TELEPHONE=en_US.utf8;LC_MEASUREMENT=en_IE.utf8;LC_IDENTIFICATION=en_US.utf8'
    > >>> locale.localeconv()
    > {'p_cs_precedes': 1, 'n_sep_by_space': 0, 'n_sign_posn': 1, 'n_cs_precedes': 1, 'grouping': [3, 3, 0], 'positive_sign': '', 'mon_grouping': [3, 3, 0], 'p_sep_by_space': 0, 'mon_thousands_sep': ',', 'currency_symbol': '€', 'mon_decimal_point': '.', 'int_curr_symbol': 'EUR ', 'thousands_sep': ',', 'frac_digits': 2, 'int_frac_digits': 2, 'negative_sign': '-', 'decimal_point': '.', 'p_sign_posn': 1}

    These look normal. I'm puzzled, because that's what is going on in the test
    as well. Do you get the failure when running the test in isolation:

    ./python -m test test_format

    If this passes, there might be some interaction with other tests.

    If it doesn't pass, I would step through the test in gdb (break PyLocale_localeconv)
    and see which member of struct lconv is the troublemaker.

    @vstinner
    Member

    Title: _localemodule.c: str2uni() with different LC_NUMERIC and LC_CTYPE

    Oh, I just realized that the issue is an LC_NUMERIC using an encoding A with an LC_CTYPE using an encoding B. It looks like glibc does not support this setup, at least for the fi_FI locale, which has a non-ASCII thousands separator (no-break space: U+00A0).

    Try the attached inconsistent_locale_encodings.py script (it uses the locale names from Fedora 19; you may have to adapt them to your OS).

    Output on Fedora 19:

    fi_FI numeric (ISO-8859-1) with fr_FR.utf8 ctype (UTF-8)
    UnicodeDecodeError: 'locale' codec can't decode byte 0xa0 in position 0: Virheellinen tai epätäydellinen monitavumerkki tai leveä merkki

    fi_FI@euro numeric (ISO-8859-15) with fr_FR.utf8 ctype (UTF-8)
    UnicodeDecodeError: 'locale' codec can't decode byte 0xa0 in position 0: Virheellinen tai epätäydellinen monitavumerkki tai leveä merkki

    fi_FI.iso88591 numeric (ISO-8859-1) with fr_FR.utf8 ctype (UTF-8)
    UnicodeDecodeError: 'locale' codec can't decode byte 0xa0 in position 0: Virheellinen tai epätäydellinen monitavumerkki tai leveä merkki

    fi_FI.iso885915@euro numeric (ISO-8859-15) with fr_FR.utf8 ctype (UTF-8)
    UnicodeDecodeError: 'locale' codec can't decode byte 0xa0 in position 0: Virheellinen tai epätäydellinen monitavumerkki tai leveä merkki

    fi_FI.utf8 numeric (UTF-8) with fr_FR.utf8 ctype (UTF-8)
    {'grouping': [3, 3, 0], 'p_cs_precedes': 0, 'mon_thousands_sep': '\xa0', 'decimal_point': ',', 'n_sep_by_space': 1, 'n_sign_posn': 1, 'mon_decimal_point': ',', 'frac_digits': 2, 'positive_sign': '', 'mon_grouping': [3, 3, 0], 'n_cs_precedes': 0, 'thousands_sep': '\xa0', 'p_sep_by_space': 1, 'p_sign_posn': 1, 'int_frac_digits': 2, 'currency_symbol': '€', 'negative_sign': '-', 'int_curr_symbol': 'EUR '}
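A probe along these lines can be sketched as follows. This is not the attached script, whose contents are not shown in this thread; the locale names are taken from the Fedora output above and are assumptions that may be missing or spelled differently elsewhere, and later Python versions may no longer raise for these combinations.

```python
import locale

# Locale names follow the Fedora output above; they are assumptions and
# may not be installed (or may be spelled differently) on other systems.
NUMERIC_LOCALES = ["fi_FI", "fi_FI@euro", "fi_FI.iso88591", "fi_FI.utf8"]
CTYPE_LOCALE = "fr_FR.utf8"


def probe(numeric_name, ctype_name=CTYPE_LOCALE):
    """Return a short status string for one LC_NUMERIC/LC_CTYPE pair."""
    saved_ctype = locale.setlocale(locale.LC_CTYPE)      # query only
    saved_numeric = locale.setlocale(locale.LC_NUMERIC)  # query only
    try:
        locale.setlocale(locale.LC_CTYPE, ctype_name)
        locale.setlocale(locale.LC_NUMERIC, numeric_name)
        sep = locale.localeconv()["thousands_sep"]
        return "ok: thousands_sep=%r" % sep
    except locale.Error as exc:
        # The requested locale is not available on this system.
        return "unavailable: %s" % exc
    except ValueError as exc:
        # UnicodeDecodeError is a subclass of ValueError.
        return "decode error: %s" % exc
    finally:
        # Restore both categories so the probe has no lasting effect.
        locale.setlocale(locale.LC_NUMERIC, saved_numeric)
        locale.setlocale(locale.LC_CTYPE, saved_ctype)


for name in NUMERIC_LOCALES:
    print(name, "->", probe(name))
```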

    @vstinner
    Member

    msg95988> Hi, the following works in 2.7 but not in 3.x: ...

    Sure, it works because Python 2 passes the raw byte string through; it does not try to decode it. But did you try to display the result, in a terminal for example?

    Example with Python 2 in an UTF-8 terminal:

    $ python
    Python 2.7.5 (default, Oct  8 2013, 12:19:40) 
    [GCC 4.8.1 20130603 (Red Hat 4.8.1-1)] on linux2
    >>> import locale
    >>> # set the locale encoding to UTF-8
    ... locale.setlocale(locale.LC_CTYPE, 'fr_FR.utf8')
    'fr_FR.utf8'
    >>> # set the thousand separator to U+00A0
    ... locale.setlocale(locale.LC_NUMERIC, 'fi_FI')
    'fi_FI'
    >>> locale.getlocale(locale.LC_CTYPE)
    ('fr_FR', 'UTF-8')
    >>> locale.getlocale(locale.LC_NUMERIC)
    ('fi_FI', 'ISO8859-15')
    >>> locale.format('%d', 123456, True)
    '123\xa0456'
    >>> print(locale.format('%d', 123456, True))
    123�456

    Mojibake! � means that b'\xA0' cannot be decoded from the locale encoding (UTF-8).
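For comparison, a rough Python 3 analogue of what the UTF-8 terminal does with those bytes, assuming it substitutes U+FFFD for undecodable input (the byte string is the one from the session above):

```python
# Bytes produced by the Python 2 session above.
formatted = b"123\xa0456"

# A UTF-8 consumer that replaces invalid input shows U+FFFD instead.
shown_as_utf8 = formatted.decode("utf-8", errors="replace")

# Decoded with the encoding LC_NUMERIC actually used, the separator survives.
shown_as_latin = formatted.decode("iso8859-15")

print(ascii(shown_as_utf8), ascii(shown_as_latin))
```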

    There is probably the same issue with a LC_MONETARY using a different encoding than LC_CTYPE.

    > I suspect that this is related: bpo-5905

    It is unrelated: time.strftime() uses LC_CTYPE, but Python was using the wrong encoding. Python used the locale encoding read at startup, whereas the *current* locale encoding must be used.

    This issue is specific to LC_NUMERIC with a LC_CTYPE using different encoding.

    > If I set LC_CTYPE and LC_NUMERIC together, things work.

    Sure, because in this case, LC_NUMERIC produces data in the same encoding as LC_CTYPE.

    > call setlocale(LC_CTYPE, setlocale(LC_NUMERIC, NULL)) before
    > mbstowcs. This is not really an option.

    Setting a locale is process-wide and should be avoided. FYI, locale.getpreferredencoding() temporarily changes LC_CTYPE by default; it only uses the current LC_CTYPE if you pass False. Because of that, open() temporarily changed LC_CTYPE in Python 3.0-3.2 (see issue bpo-11022).

    The following PostgreSQL issue looks to be the same as this Python issue:
    http://www.postgresql.org/message-id/20100422015552.4B7E07541D0@cvs.postgresql.org

    The fix temporarily changes the LC_CTYPE encoding:

    #ifdef WIN32
    setlocale(LC_CTYPE, locale_monetary);
    #endif

    (I don't know why the code is specific to Windows.)

    @skrah
    Mannequin Author

    skrah mannequin commented Oct 22, 2013

    STINNER Victor <report@bugs.python.org> wrote:

    > The following PostgreSQL issue looks to be the same as this Python issue:
    > http://www.postgresql.org/message-id/20100422015552.4B7E07541D0@cvs.postgresql.org
    >
    > The fix temporarily changes the LC_CTYPE encoding:

    So does my patch. :)

    @mcepl
    Mannequin

    mcepl mannequin commented Oct 22, 2013

    What I did:

    1. run the build (it is a building of Fedora Rawhide python3 package on RHEL), and see it failed in this test.
    2. see below

    santiago:python3 (el6) $ cd Python-3.3.2/build/optimized/
    santiago:optimized (el6) $ ./python -m test test_format
    [1/1] test_format
    1 test OK.
    santiago:optimized (el6) $

    Oh well

    @skrah
    Mannequin Author

    skrah mannequin commented Oct 22, 2013

    Victor, thanks for the comments. I also think we should set LC_CTYPE closer
    to the actual call to mbstowcs(), otherwise there are many API calls in
    between.

    So it should happen somewhere in PyUnicode_DecodeLocaleAndSize(). Perhaps
    we can create _PyUnicode_DecodeLocaleAndSize() which would take an LC_CTYPE
    parameter?

    @skrah
    Mannequin Author

    skrah mannequin commented Oct 22, 2013

    Matej Cepl <report@bugs.python.org> wrote:

    > santiago:optimized (el6) $ ./python -m test test_format
    > [1/1] test_format
    > 1 test OK.

    It looks like some other test in the test suite did not restore a changed
    locale. If you're motivated enough, you could try if it still happens in
    3.4 and open a new issue (it's unrelated to this one).

    @vstinner
    Member

    "So it should happen somewhere in PyUnicode_DecodeLocaleAndSize(). Perhaps we can create _PyUnicode_DecodeLocaleAndSize() which would take an LC_CTYPE parameter?"

    For this issue, it means that Python localeconv() will have to change the LC_CTYPE locale many times, for each monetary and each number value. I prefer your patch :-)

    @mcepl
    Mannequin

    mcepl mannequin commented Oct 22, 2013

    On 22/10/13 17:32, Stefan Krah wrote:

    > It looks like some other test in the test suite did not restore a changed
    > locale. If you're motivated enough, you could try if it still happens in
    > 3.4 and open a new issue (it's unrelated to this one).

    Very plain checkout of git and ./configure && make && make test leads to
    another failed test, but not this one (issue bpo-19353). So either we do
    something wrong in all those Fedora patches, or this has been fixed
    since 3.3.2.

    Curiouser and curiouser.

    @skrah
    Mannequin Author

    skrah mannequin commented Oct 22, 2013

    STINNER Victor <report@bugs.python.org> wrote:

    > "So it should happen somewhere in PyUnicode_DecodeLocaleAndSize(). Perhaps we can create _PyUnicode_DecodeLocaleAndSize() which would take an LC_CTYPE parameter?"
    >
    > For this issue, it means that Python localeconv() will have to change the LC_CTYPE locale many times, for each monetary and each number value. I prefer your patch :-)

    Windows and OS X have mbstowcs_l(), which takes a locale arg. Linux doesn't
    (as far as I can see). I agree this solution is ugly, but it probably won't
    have an impact on benchmarks (naively assuming that setlocale() is fast).

    @skrah
    Mannequin Author

    skrah mannequin commented Oct 22, 2013

    Here's a quick-and-dirty version of mbstowcs_l(). The difference in the
    benchmark to the plain mbstowcs() is 0.82s vs 0.55s. In the context of
    a Python function this is unlikely to be measurable.

    @vstinner
    Member

    vstinner commented Nov 4, 2013

    @Stefan: Did you see my comments on Rietveld?
    http://bugs.python.org/review/7442/#ps1473

    @skrah
    Mannequin Author

    skrah mannequin commented Nov 4, 2013

    Yes, I saw the comments. I'm still wondering if we should just write an
    mbstowcs_l() function instead.

    Even then, there would still be a small chance that a C extension that
    creates its own thread picks up the wrong LC_CTYPE.

    @vstinner
    Member

    vstinner commented Nov 4, 2013

    What is this locale_t type used for the locale parameter of mbstowcs_l()? Are you sure that it is a string? According to this patch, it looks like a structure:
    http://www.winehq.org/pipermail/wine-cvs/2010-May/067264.html

    @skrah
    Mannequin Author

    skrah mannequin commented Nov 4, 2013

    STINNER Victor <report@bugs.python.org> wrote:

    > What is this locale_t type used for the locale parameter of mbstowcs_l()?
    > Are you sure that it is a string? According to this patch, it looks like a structure:
    > http://www.winehq.org/pipermail/wine-cvs/2010-May/067264.html

    Yes, the string was mainly for benchmarking. FreeBSD seems to have thread safe
    versions (xlocale.h), that take a locale_t as the extra parameter.

    @skrah
    Mannequin Author

    skrah mannequin commented Oct 14, 2014

    Well, I originally opened this issue but personally I'm not that
    bothered by it any more.

    Victor, do you want to keep it open?

    @vstinner
    Member

    The issue has a workaround: use LC_NUMERIC and LC_CTYPE locales which use the same encoding. To avoid issues, it's probably safer to only use UTF-8 locales, which are now available on modern Linux distros.
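One way an application could check for this workaround before formatting numbers is sketched below. The helper names are hypothetical, and the check is only a heuristic: locale.getlocale() guesses the encoding from an alias table when the locale name does not spell it out, so the result is an assumption rather than what the C library actually uses.

```python
import codecs
import locale


def category_codec(category):
    """Return the normalized codec name for a locale category, or None."""
    try:
        _lang, encoding = locale.getlocale(category)
    except ValueError:
        return None  # locale name the locale module cannot parse
    if encoding is None:
        return None  # e.g. the "C" locale
    try:
        return codecs.lookup(encoding).name
    except LookupError:
        return encoding.lower()


def numeric_matches_ctype():
    """Heuristic: True if LC_NUMERIC and LC_CTYPE appear to share an encoding."""
    numeric = category_codec(locale.LC_NUMERIC)
    ctype = category_codec(locale.LC_CTYPE)
    # Assumption: an unknown numeric encoding (such as the "C" locale)
    # only produces ASCII separators, so it is treated as compatible.
    return numeric is None or numeric == ctype


print(numeric_matches_ctype())
```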

    I don't like the idea of calling setlocale() just for this corner case, because it changes the locale for all threads. Even if Python is protected by the GIL, Python can be embedded, or modules written in C may spawn threads which don't care about the GIL. Usually, if it can fail, it will fail :-)

    I see that various people contributed to the issue, but it looks like the only user asking for this is Stefan Krah. I prefer to close the issue and wait until more users ask for it before considering the patch again, or find a different way to implement the feature (support LC_NUMERIC and LC_CTYPE locales using a different encoding).

    To be clear, I'm closing the issue right now.

    @vstinner
    Member

    > I prefer to close the issue and wait until more users ask for it before considering again the patch, or find a different way to implement the feature (support LC_NUMERIC and LC_CTYPE locales using a different encoding).

    Here we are: https://bugs.python.org/issue31900

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022