classification
Title: format: minimum width: UTF-8 separators and decimal points
Type: behavior Stage:
Components: Versions: Python 2.7, Python 2.6
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: eric.smith Nosy List: eric.smith, mark.dickinson, mrabarnett, r.david.murray, skrah
Priority: normal Keywords:

Created on 2009-11-15 10:29 by skrah, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (15)
msg95283 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2009-11-15 10:29
This issue affects the format functions of float and decimal.

When calculating the padding necessary to reach the minimum width,
UTF-8 separators and decimal points are calculated by their byte
lengths. This can lead to printed representations that are too short.


Real world example (separator):

>>> import locale
>>> from decimal import *
>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> s = format(Decimal("-1.5"),  ' 019.18n')
>>> len(s)
19
>>> len(s.decode('utf-8'))
16
>>> s
'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'
>>> 
>>> 
>>> s = format(-1.5,  ' 019.18n')
>>> s
'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'
>>> len(s.decode('utf-8'))
16
>>> 


Constructed example (separator and decimal point):

>>> u = {'decimal_point': "\xc2\xbf", 'grouping': [3, 3, 0], 'thousands_sep': "\xc2\xb4"}
>>> def get_fmt(x, locale, fmt='n'):
...     return Decimal.__format__(Decimal(x), fmt, _localeconv=locale)
... 
>>> s = get_fmt(Decimal("1.5"), u, "020n")
>>> s
'00\xc2\xb4000\xc2\xb4000\xc2\xb4001\xc2\xbf5'
>>> len(s.decode('utf-8'))
16
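
The mismatch in the examples above can be reproduced on a Python 3 interpreter without installing the Czech locale, by writing the Python 2 result as a bytes literal. This is only a minimal sketch; the variable names are illustrative and do not come from the report.

```python
# The Python 2 result from the report, written as a Python 3 bytes literal.
raw = b'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'

byte_width = len(raw)                  # what the padding logic counted
char_width = len(raw.decode('utf-8'))  # what a UTF-8 terminal displays

# Each no-break space is 2 bytes but 1 character, so the three
# separators account for the whole shortfall.
shortfall = byte_width - char_width
print(byte_width, char_width, shortfall)  # 19 16 3
```

The requested minimum width of 19 is met in bytes but missed by three columns once the string is decoded.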
msg95796 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2009-11-28 17:53
Surely this is to be expected when working with bytestrings. You should
be working in Unicode and using UTF-8 only for input and output.
msg95836 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2009-11-30 13:04
What do you mean by "working with bytestrings"? The UTF-8 separators or
decimal points come directly from struct lconv (man localeconv). The
logical way to reach a minimum width of 19 is to have 19 UTF-8
characters, which can subsequently be converted to other formats.
msg95884 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-12-01 23:19
In python3:

>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> s = format(Decimal("-1.5"),  ' 019.18n')
>>> len(s)
20
>>> print(s)
-0 000 000 000 001,5

Python3 uses unicode for strings.  Python2 uses bytes.  To format
unicode in python2, you do:

>>> s2 = locale.format("% 019.18g", Decimal("-1.5"))
>>> len(s2)
19
>>> print s2
-0000000000000001,5

Not quite the same thing, clearly.  So, is there a way to access the
python3 unicode format semantics in python2?  Just passing format a
unicode format string results in a UnicodeDecodeError.
msg95887 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2009-12-02 00:05
In 2.7, I get:

$ ./python.exe 
Python 2.7a0 (trunk:76501, Nov 24 2009, 14:57:21) 
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> from decimal import Decimal
>>> s = format(Decimal("-1.5"),  ' 019.18n')
>>> s
'-0 000 000 000 001,5'
>>> len(s)
20
>>> s = format(Decimal("-1.5"),  u' 019.18n')                           
>>> s
u'-0 000 000 000 001,5'
>>> len(s)
20
>>> 

Could you give more details on the UnicodeDecodeError you get? Any
traceback?
msg95888 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-12-02 00:29
Interesting.  My regular locale is LC_CTYPE=en_US.UTF-8, and here is
what I get:

Python 2.7a0 (trunk:76501, Nov 24 2009, 13:59:01) 
[GCC 4.4.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> from decimal import Decimal
>>> s = format(Decimal("-1.5"),  ' 019.18n')
>>> s
'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'
>>> len(s)
19
>>> print s
-0 000 000 001,5

sys.stdout.encoding gives 'UTF-8'.

And here's the traceback from trying to use unicode:

>>> s = format(Decimal("-1.5"),  u' 019.18n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rdmurray/python/trunk/Lib/decimal.py", line 3609, in __format__
    return _format_number(self._sign, intpart, fracpart, exp, spec)
  File "/home/rdmurray/python/trunk/Lib/decimal.py", line 5704, in _format_number
    return _format_align(sign, intpart+fracpart, spec)
  File "/home/rdmurray/python/trunk/Lib/decimal.py", line 5595, in _format_align
    result = unicode(result)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)
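
The last frame of the traceback converts a byte string to unicode without naming an encoding, which in Python 2 falls back to the ASCII codec. The same failure can be sketched in Python 3 with an explicit decode. The bytes literal is the result shown earlier in this message; everything else is illustrative.

```python
raw = b'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'

error = None
try:
    # Rough equivalent of Python 2's unicode(result) on a byte string.
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Naming the encoding explicitly avoids the error.
decoded = raw.decode('utf-8')
print(len(decoded))
```

The first separator byte, 0xc2, sits at index 2 (after "-0"), which matches the position reported in the traceback.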
msg95894 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2009-12-02 03:00
I can duplicate this on Linux. The difference is the values in the
locale for the separators, specifically,
locale.localeconv()['thousands_sep'].

>>> locale.localeconv()['thousands_sep']
'\xc2\xa0'

The question is: since a struct lconv contains char*s, how to interpret
them? The code in decimal interprets them as ascii, apparently. floats
do the same thing, so this isn't strictly a decimal problem. I'll have
to give it some thought.
msg95901 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2009-12-02 10:42
In python3.2, the output of decimal looks good. With float, the
separator is printed as two spaces on my Unicode terminal (export
LC_ALL=cs_CZ.UTF-8).

So decimal (3.2) interprets the separator string as a single UTF-8 char
and the final output is a UTF-8 string. I'd say that in C, this is the
intended way of using struct lconv.

If there is an agreement that the final output should be a UTF-8 string,
this looks correct to me.



Python 3.2a0 (py3k:76081M, Nov  6 2009, 15:23:48) 
[GCC 4.1.3 20080623 (prerelease) (Ubuntu 4.1.2-23ubuntu3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale, decimal
>>> locale.setlocale(locale.LC_NUMERIC, 'cs_CZ.UTF-8')
'cs_CZ.UTF-8'
>>> x = format(decimal.Decimal("-1.5"),  '019.18n')
>>> y = format(float("-1.5"),  '019.18n')
>>> x
'-0\xa0000\xa0000\xa0000\xa0001,5'
>>> y
'-0Â 000Â 000Â 001,5'
>>> print(x)
-0 000 000 000 001,5
>>> print(y)
-0Â 000Â 000Â 001,5
>>>
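
The float result in the session above is consistent with each byte of the UTF-8 separator becoming its own character. That is an assumption about the cause, not something confirmed at this point in the thread. A small sketch of the difference, using Latin-1 as a stand-in for "one byte, one character":

```python
sep_bytes = b'\xc2\xa0'  # thousands_sep as reported for cs_CZ.UTF-8

as_utf8 = sep_bytes.decode('utf-8')      # one character: U+00A0
as_latin1 = sep_bytes.decode('latin-1')  # two characters: U+00C2, U+00A0

# Rebuild a float-style result with the two-character separator.
y_like = '-0' + ''.join(as_latin1 + group for group in ('000', '000', '001')) + ',5'
print(len(as_utf8), len(as_latin1), len(y_like))  # 1 2 19
```

With the two-character separator the string reaches 19 characters with only three separators, which would explain why the float output has one group fewer than the decimal output.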
msg95902 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-12-02 11:53
So when the format string has type 'str' (as in Stefan's original example) 
rather than type 'unicode', I'd say Python is doing the right thing 
already:  everything in sight, including the separators coming from 
localeconv(), has type 'str', so trying to interpret things as unicode 
seems a bit of a stretch.

If the '\xc2\xa0' from localeconv()['thousands_sep'] is to be interpreted 
as a single unicode character, shouldn't it be a unicode
string already?

However, if localeconv()['thousands_sep'] *were* to give a unicode string, 
then I suppose Decimal.__format__ should be returning a unicode result;  I 
don't think it currently does this.  (Should this be true even if the 
number being formatted is so short that no thousands separators actually 
appear in it?)
msg95904 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2009-12-02 13:15
I don't see any documentation that a struct lconv should be interpreted
as UTF-8. In fact Googling "struct lconv utf-8" gives this bug report as
the first hit.

lconv.thousands_sep is char*. It's never been clear to me if this means
"pointer to a single char", or "pointer to a null terminated string of
char". In Objects/stringlib/localeutil.h I treat it as a string of char.
msg95906 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2009-12-02 13:58
In trunk, Modules/_localemodule.c also treats these as "string of char",
so at least we're consistent.

In py3k, mbstowcs is used and the result passed to PyUnicode_FromWideChar.

I'm not sure how you'd address this in locale in trunk, or if we want to
do something similar in localeutil.h in trunk (for the Unicode case).
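
A rough Python analogue of the py3k approach is to decode every byte-valued field of a localeconv()-style mapping before using it. The helper name decode_lconv_fields and the explicit encoding argument are assumptions made for this sketch; mbstowcs in C uses the current LC_CTYPE encoding rather than a parameter, and the grouping value is borrowed from the constructed example earlier in the thread.

```python
def decode_lconv_fields(conv, encoding):
    """Return a copy of `conv` with every bytes value decoded to str."""
    return {
        key: value.decode(encoding) if isinstance(value, bytes) else value
        for key, value in conv.items()
    }

# Hypothetical byte-valued fields modelled on the cs_CZ.UTF-8 values
# discussed in this thread.
raw_conv = {
    'decimal_point': b',',
    'thousands_sep': b'\xc2\xa0',
    'grouping': [3, 3, 0],
}

conv = decode_lconv_fields(raw_conv, 'utf-8')
print(len(raw_conv['thousands_sep']), len(conv['thousands_sep']))  # 2 1
```

After decoding, the separator has length 1, so any width calculation based on len() counts it once.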
msg95907 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2009-12-02 14:10
Googling "multi-byte thousands separator" gives better results. From
those results, it is clear to me that decimal_point and thousands_sep
are strings that may be interpreted as multi-byte characters. The Czech
separator appears to be a no-break space multi-byte character.


http://sourceware.org/ml/libc-hacker/2007-01/msg00005.html
http://drupal.org/node/353897


My point is that if a multi-byte character appears, it should be
counted as a single character for the purposes of calculating
min-width. Otherwise, the printed representation is too short.
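
The counting rule proposed here can be sketched as zero-padding the integer part until the grouped result is wide enough in characters. This is not the code in decimal or in localeutil.h; group_digits and zero_pad_grouped are hypothetical helpers, and a fixed group size of 3 is assumed.

```python
def group_digits(digits, sep, group=3):
    """Insert `sep` between groups of `group` digits, counting from the right."""
    chunks = []
    while digits:
        chunks.append(digits[-group:])
        digits = digits[:-group]
    return sep.join(reversed(chunks))


def zero_pad_grouped(sign, intpart, rest, sep, min_width):
    """Left-pad `intpart` with zeros until the whole result has at least
    `min_width` characters, with each separator counted as one character."""
    target = min_width - len(sign) - len(rest)
    digits = intpart
    while len(group_digits(digits, sep)) < target:
        digits = '0' + digits
    return sign + group_digits(digits, sep) + rest


s = zero_pad_grouped('-', '1', ',5', '\xa0', 19)
print(len(s), len(s.encode('utf-8')))  # 20 24
```

The result has 20 characters rather than 19 because adding one more digit also adds a separator, so the width jumps from 18 to 20. That agrees with the length of 20 reported for Python 3 earlier in the thread.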
msg95926 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-12-03 11:19
Reassigning to Eric.
msg95927 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2009-12-03 11:21
I've raised the issue with unicode and locale on python-dev:
http://mail.python.org/pipermail/python-dev/2009-December/094408.html

Pending the outcome of that decision, I'll move forward on this issue.
msg95966 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2009-12-04 14:25
See the discussion on python-dev, in particular Martin's comment at
http://mail.python.org/pipermail/python-dev/2009-December/094412.html

The solutions to this seem too complex for 2.x. It is not a problem in 3.x.
History
Date                 User            Action  Args
2022-04-11 14:56:54  admin           set     github: 51576
2009-12-04 14:25:41  eric.smith      set     status: open -> closed; resolution: wont fix; messages: + msg95966
2009-12-03 11:21:25  eric.smith      set     messages: + msg95927
2009-12-03 11:19:14  mark.dickinson  set     assignee: mark.dickinson -> eric.smith; messages: + msg95926
2009-12-02 14:10:39  skrah           set     messages: + msg95907
2009-12-02 13:58:16  eric.smith      set     messages: + msg95906
2009-12-02 13:15:11  eric.smith      set     messages: + msg95904
2009-12-02 11:53:49  mark.dickinson  set     messages: + msg95902
2009-12-02 10:42:05  skrah           set     messages: + msg95901
2009-12-02 03:00:19  eric.smith      set     messages: + msg95894
2009-12-02 00:29:04  r.david.murray  set     messages: + msg95888
2009-12-02 00:05:57  eric.smith      set     messages: + msg95887
2009-12-01 23:19:39  r.david.murray  set     priority: normal; versions: + Python 2.6, Python 2.7; nosy: + r.david.murray; messages: + msg95884; type: behavior
2009-11-30 13:04:04  skrah           set     messages: + msg95836
2009-11-28 17:53:00  mrabarnett      set     nosy: + mrabarnett; messages: + msg95796
2009-11-28 16:43:46  mark.dickinson  set     assignee: mark.dickinson
2009-11-15 10:29:29  skrah           create