classification
Title: _localemodule.c: str2uni() with different LC_NUMERIC and LC_CTYPE
Type: behavior Stage: needs patch
Components: Versions: Python 3.2, Python 3.1
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: eric.smith, loewis, mark.dickinson, skrah
Priority: normal Keywords: patch

Created on 2009-12-05 10:44 by skrah, last changed 2010-02-13 14:05 by skrah.

Files
File name Uploaded Description
set_ctype_before_mbstowcs.patch skrah, 2010-02-13 14:05
Messages (8)
msg95988 - Author: Stefan Krah (skrah) (Python committer) Date: 2009-12-05 10:44
Hi, the following works in 2.7 but not in 3.x:

>>> import locale
>>> from decimal import *
>>> locale.setlocale(locale.LC_NUMERIC, 'fi_FI')
'fi_FI'
>>> format(Decimal('1000'), 'n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.2/decimal.py", line 3632, in __format__
    spec = _parse_format_specifier(specifier, _localeconv=_localeconv)
  File "/usr/lib/python3.2/decimal.py", line 5628, in _parse_format_specifier
    _localeconv = _locale.localeconv()
  File "/usr/lib/python3.2/locale.py", line 111, in localeconv
    d = _localeconv()
ValueError: Cannot convert byte to string
msg96008 - Author: Stefan Krah (skrah) (Python committer) Date: 2009-12-05 21:03
This fails in _localemodule.c: str2uni(). mbstowcs(NULL, s, 0) is
LC_CTYPE-sensitive, and LC_CTYPE is UTF-8 in my terminal, so the
separators supplied by the fi_FI LC_NUMERIC locale are decoded as UTF-8.

If I set LC_CTYPE and LC_NUMERIC together, things work.
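
As a rough illustration of that workaround, the sketch below switches LC_CTYPE and LC_NUMERIC to the same locale, so the separators are decoded with the encoding they were written in. The helper name set_numeric_and_ctype is made up for this example, and whether 'fi_FI' is installed depends on the system.

```python
import locale


def set_numeric_and_ctype(name):
    """Set LC_CTYPE and LC_NUMERIC to the same locale.

    Hypothetical helper for illustration. Returns True on success and
    False (with both categories restored) if the locale is unavailable.
    """
    old_ctype = locale.setlocale(locale.LC_CTYPE)
    old_numeric = locale.setlocale(locale.LC_NUMERIC)
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        locale.setlocale(locale.LC_NUMERIC, name)
    except locale.Error:
        # Unknown locale: put both categories back the way they were.
        locale.setlocale(locale.LC_CTYPE, old_ctype)
        locale.setlocale(locale.LC_NUMERIC, old_numeric)
        return False
    return True


if set_numeric_and_ctype('fi_FI'):
    # ascii() keeps the output printable even for a no-break space.
    print(ascii(locale.localeconv()['thousands_sep']))
else:
    print('fi_FI locale is not installed here')
```

Note that this changes LC_CTYPE for the whole process, which may not be acceptable in every program.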

This raises the question: If LC_CTYPE and LC_NUMERIC differ (and
since they are separate entities I assume they may differ), what
is the correct way to convert the separator and the decimal point?


a) call setlocale(LC_CTYPE, setlocale(LC_NUMERIC, NULL)) before
   mbstowcs. This is not really an option.


b) use some kind of _mbstowcs_l
(http://msdn.microsoft.com/en-us/library/k1f9b8cy(VS.80).aspx), which
takes a locale parameter. But I can't find such a thing on Linux.
msg96534 - Author: Mark Dickinson (mark.dickinson) (Python committer) Date: 2009-12-17 21:02
I'm failing to reproduce this (with py3k) on OS X:

Python 3.2a0 (py3k:76866:76867, Dec 17 2009, 09:19:26) 
[GCC 4.2.1 (Apple Inc. build 5646) (dot 1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> from decimal import *
>>> locale.setlocale(locale.LC_NUMERIC, 'fi_FI')
'fi_FI'
>>> format(Decimal('1000'), 'n')
'1.000'

The locale command, from the same Terminal prompt, gives me:

LANG="en_IE.UTF-8"
LC_COLLATE="en_IE.UTF-8"
LC_CTYPE="en_IE.UTF-8"
LC_MESSAGES="en_IE.UTF-8"
LC_MONETARY="en_IE.UTF-8"
LC_NUMERIC="en_IE.UTF-8"
LC_TIME="en_IE.UTF-8"
LC_ALL=

Just to be clear, is it true that you still get the same result without
involving Decimal at all?  That is, am I correct in assuming that:

>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, 'fi_FI')
'fi_FI'
>>> locale.localeconv()

also gives you that ValueError?
msg96535 - Author: Mark Dickinson (mark.dickinson) (Python committer) Date: 2009-12-17 21:07
What are the multibyte strings that mbstowcs is failing to convert?
On my machine, the separators come out as plain ASCII '.' (for thousands) 
and ',' (for the decimal point).
msg96544 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2009-12-18 00:35
I can reproduce it on a Fedora (fc6) Linux box. It's not a decimal
problem, but a plain locale problem:

>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, 'fi_FI')
'fi_FI'
>>> locale.localeconv()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/python/py3k/Lib/locale.py", line 111, in localeconv
    d = _localeconv()
ValueError: Cannot convert byte to string
>>> 

Here's the contents of the struct lconv as returned by localeconv():

(gdb) p *l
$1 = {decimal_point = 0xb7b54020 ",", thousands_sep = 0xb7b54022 " ",
  grouping = 0xb7b54024 "\003\003",
  int_curr_symbol = 0x998858 "", currency_symbol = 0x998858 "",
  mon_decimal_point = 0x998858 "", mon_thousands_sep = 0x998858 "",
  mon_grouping = 0x998858 "", positive_sign = 0x998858 "",
  negative_sign = 0x998858 "", int_frac_digits = 127 '\177',
  frac_digits = 127 '\177', p_cs_precedes = 127 '\177',
  p_sep_by_space = 127 '\177', n_cs_precedes = 127 '\177',
  n_sep_by_space = 127 '\177', p_sign_posn = 127 '\177',
  n_sign_posn = 127 '\177', int_p_cs_precedes = 127 '\177',
  int_p_sep_by_space = 127 '\177', int_n_cs_precedes = 127 '\177',
  int_n_sep_by_space = 127 '\177', int_p_sign_posn = 127 '\177',
  int_n_sign_posn = 127 '\177'}

The problem is thousands_sep:
(gdb) p l->thousands_sep
$2 = 0xb7b54022 " "
(gdb) p (unsigned char)l->thousands_sep[0]
$3 = 160 ' '
msg96556 - Author: Stefan Krah (skrah) (Python committer) Date: 2009-12-18 09:38
Yes, it's a problem in _localemodule.c. This situation always
occurs when LC_NUMERIC uses an encoding like ISO8859-15, LC_CTYPE
is UTF-8, and the decimal point or separator is in the range
128-255. Then mbstowcs tries to decode the character according
to LC_CTYPE and finds that the character is not valid UTF-8:


static PyObject*
str2uni(const char* s)
{
#ifdef HAVE_BROKEN_MBSTOWCS
    size_t needed = strlen(s);
#else
    size_t needed = mbstowcs(NULL, s, 0);
#endif
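
The decoding failure can be illustrated without changing the process locale. The sketch below uses Python's codecs as a stand-in for what mbstowcs does under each LC_CTYPE setting; this is an approximation for illustration, not the C code path itself.

```python
# Byte 160 (0xA0) is the thousands separator reported by gdb above.
sep = b'\xa0'

# Under an ISO8859-15 LC_CTYPE the byte is a complete character:
# U+00A0, NO-BREAK SPACE.
as_latin9 = sep.decode('iso8859-15')

# Under a UTF-8 LC_CTYPE, 0xA0 is a continuation byte with no lead
# byte, so decoding fails, much as mbstowcs reports an error.
try:
    sep.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(ascii(as_latin9), utf8_ok)
```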


I can't see a portable way to fix this except:

block threads
set temporary LC_CTYPE
call mbstowcs
restore LC_CTYPE
unblock threads
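
The steps above can be sketched at the Python level as follows. This is only an approximation: a real fix would live in C inside _localemodule.c, the helper name localeconv_with_numeric_ctype is invented here, and the lock only serializes callers of this helper rather than truly blocking other threads.

```python
import locale
import threading

# Assumption: a module-level lock stands in for "block threads". It only
# protects against other callers of this helper, not arbitrary threads.
_locale_lock = threading.Lock()


def localeconv_with_numeric_ctype():
    """Call localeconv() with LC_CTYPE temporarily matching LC_NUMERIC."""
    with _locale_lock:
        saved_ctype = locale.setlocale(locale.LC_CTYPE)
        numeric = locale.setlocale(locale.LC_NUMERIC)
        try:
            # Set temporary LC_CTYPE, then convert the separators.
            locale.setlocale(locale.LC_CTYPE, numeric)
            return locale.localeconv()
        finally:
            # Restore LC_CTYPE even if the conversion failed.
            locale.setlocale(locale.LC_CTYPE, saved_ctype)
```

The try/finally ensures LC_CTYPE is restored on every path, which is the part that is easy to get wrong when the same sequence is written by hand.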


I don't think this issue is important enough to do that. What
I do in cdecimal is raise an error "Invalid separator or
unsupported combination of LC_NUMERIC and LC_CTYPE".
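
That error-raising strategy might look roughly like the following in Python. The function name decode_separator and its ctype_encoding parameter are invented for this sketch; cdecimal's actual C implementation is not shown in this thread.

```python
def decode_separator(raw, ctype_encoding):
    """Decode a localeconv() separator or reject it with a clear error.

    raw is the byte string taken from struct lconv; ctype_encoding is
    the encoding implied by LC_CTYPE (passed explicitly in this sketch).
    """
    try:
        return raw.decode(ctype_encoding)
    except UnicodeDecodeError:
        raise ValueError(
            "Invalid separator or unsupported combination of "
            "LC_NUMERIC and LC_CTYPE") from None
```

With this approach, decode_separator(b',', 'utf-8') returns the comma unchanged, while the byte 0xA0 combined with 'utf-8' raises the ValueError.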
msg96557 - Author: Stefan Krah (skrah) (Python committer) Date: 2009-12-18 09:47
Changed title (was: decimal.py: format failure with locale specifier)
msg99317 - Author: Stefan Krah (skrah) (Python committer) Date: 2010-02-13 14:05
I have a patch that fixes this specific issue. Probably there are similar
issues in other places, e.g. when LC_TIME and LC_CTYPE differ.

I suspect that this is related:

http://bugs.python.org/issue5905
History
Date User Action Args
2010-02-13 14:05:58 skrah set files: + set_ctype_before_mbstowcs.patch
    keywords: + patch
    messages: + msg99317
2009-12-19 21:40:37 pitrou set priority: normal
    nosy: + loewis
    type: behavior
    stage: needs patch
2009-12-18 09:47:04 skrah set messages: + msg96557
    title: decimal.py: format failure with locale specifier -> _localemodule.c: str2uni() with different LC_NUMERIC and LC_CTYPE
2009-12-18 09:38:04 skrah set messages: + msg96556
2009-12-18 00:35:59 eric.smith set nosy: + eric.smith
    messages: + msg96544
2009-12-17 21:07:00 mark.dickinson set messages: + msg96535
2009-12-17 21:02:04 mark.dickinson set messages: + msg96534
2009-12-05 21:03:37 skrah set messages: + msg96008
2009-12-05 10:44:18 skrah create