Issue7327
Created on 2009-11-15 10:29 by skrah, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Messages (15) | |||
---|---|---|---|
msg95283 - (view) | Author: Stefan Krah (skrah) * | Date: 2009-11-15 10:29 | |
This issue affects the format functions of float and decimal. When calculating the padding necessary to reach the minimum width, UTF-8 separators and decimal points are calculated by their byte lengths. This can lead to printed representations that are too short.

Real world example (separator):

>>> import locale
>>> from decimal import *
>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> s = format(Decimal("-1.5"), ' 019.18n')
>>> len(s)
19
>>> len(s.decode('utf-8'))
16
>>> s
'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'
>>> s = format(-1.5, ' 019.18n')
>>> s
'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'
>>> len(s.decode('utf-8'))
16

Constructed example (separator and decimal point):

>>> u = {'decimal_point' : "\xc2\xbf", 'grouping' : [3, 3, 0], 'thousands_sep': "\xc2\xb4"}
>>> def get_fmt(x, locale, fmt='n'):
...     return Decimal.__format__(Decimal(x), fmt, _localeconv=locale)
...
>>> s = get_fmt(Decimal("1.5"), u, "020n")
>>> s
'00\xc2\xb4000\xc2\xb4000\xc2\xb4001\xc2\xbf5'
>>> len(s.decode('utf-8'))
16 |
|||
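The length mismatch in the report above can be reproduced without a Czech locale. The following is a minimal Python 3 sketch, not code from the tracker: it hard-codes the reported byte string as a `bytes` literal (the names `raw`, `byte_length`, `char_length` and `separators` are illustrative) and compares its byte length with its character length.

```python
# Byte string copied from the report; in Python 2 this was a plain `str`.
raw = b'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'

byte_length = len(raw)                  # what the padding logic counted
char_length = len(raw.decode('utf-8'))  # what a reader sees on screen

# Each U+00A0 separator occupies two bytes but only one character.
separators = raw.decode('utf-8').count('\xa0')

print(byte_length, char_length, separators)
```

The difference between the two lengths equals the number of two-byte separators, which is why the printed field comes out that many columns short.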
msg95796 - (view) | Author: Matthew Barnett (mrabarnett) * | Date: 2009-11-28 17:53 | |
Surely this is to be expected when working with bytestrings. You should be working in Unicode and using UTF-8 only for input and output. |
|||
msg95836 - (view) | Author: Stefan Krah (skrah) * | Date: 2009-11-30 13:04 | |
What do you mean by "working with bytestrings"? The UTF-8 separators or decimal points come directly from struct lconv (man localeconv). The logical way to reach a minimum width of 19 is to have 19 UTF-8 characters, which can subsequently be converted to other formats. |
|||
msg95884 - (view) | Author: R. David Murray (r.david.murray) * | Date: 2009-12-01 23:19 | |
In python3:

>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> s = format(Decimal("-1.5"), ' 019.18n')
>>> len(s)
20
>>> print(s)
-0 000 000 000 001,5

Python3 uses unicode for strings. Python2 uses bytes. To format unicode in python2, you do:

>>> s2 = locale.format("% 019.18g", Decimal("-1.5"))
>>> len(s2)
19
>>> print s2
-0000000000000001,5

Not quite the same thing, clearly. So, is there a way to access the python3 unicode format semantics in python2? Just passing format a unicode format string results in a UnicodeDecodeError. |
|||
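The Python 3 behaviour described above can be illustrated without installing a Czech locale. The sketch below assumes CPython 3, where the pure-Python `_pydecimal` module is importable, and it relies on the private `_localeconv` hook that the original report also uses. Both are implementation details rather than public API, so treat this as an illustration only. The separators are the ones from the constructed example in the first message, written as text instead of UTF-8 bytes; `fake_locale` is a hypothetical name.

```python
import _pydecimal  # private, pure-Python decimal implementation in CPython 3

# Separators from the constructed example, as str instead of UTF-8 bytes.
fake_locale = {
    'decimal_point': '\u00bf',
    'grouping': [3, 3, 0],
    'thousands_sep': '\u00b4',
}

def get_fmt(x, localeconv, fmt='n'):
    # `_localeconv` is a private testing hook and may change between versions.
    return _pydecimal.Decimal.__format__(
        _pydecimal.Decimal(x), fmt, _localeconv=localeconv)

s = get_fmt("1.5", fake_locale, "020n")
print(len(s))
```

Because each separator is a single `str` character here, the minimum width of 20 is reached in characters, unlike the 16 characters reported for Python 2.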
msg95887 - (view) | Author: Eric V. Smith (eric.smith) * | Date: 2009-12-02 00:05 | |
In 2.7, I get:

$ ./python.exe
Python 2.7a0 (trunk:76501, Nov 24 2009, 14:57:21)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> from decimal import Decimal
>>> s = format(Decimal("-1.5"), ' 019.18n')
>>> s
'-0 000 000 000 001,5'
>>> len(s)
20
>>> s = format(Decimal("-1.5"), u' 019.18n')
>>> s
u'-0 000 000 000 001,5'
>>> len(s)
20
>>>

Could you give more details on the UnicodeDecodeError you get? Any traceback? |
|||
msg95888 - (view) | Author: R. David Murray (r.david.murray) * | Date: 2009-12-02 00:29 | |
Interesting. My regular locale is LC_CTYPE=en_US.UTF-8, and here is what I get:

Python 2.7a0 (trunk:76501, Nov 24 2009, 13:59:01)
[GCC 4.4.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import local
>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> from decimal import Decimal
>>> s = format(Decimal("-1.5"), ' 019.18n')
>>> s
'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'
>>> len(s)
19
>>> print s
-0 000 000 001,5

sys.stdout.encoding gives 'UTF-8'.

And here's the traceback from trying to use unicode:

>>> s = format(Decimal("-1.5"), u' 019.18n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rdmurray/python/trunk/Lib/decimal.py", line 3609, in __format__
    return _format_number(self._sign, intpart, fracpart, exp, spec)
  File "/home/rdmurray/python/trunk/Lib/decimal.py", line 5704, in _format_number
    return _format_align(sign, intpart+fracpart, spec)
  File "/home/rdmurray/python/trunk/Lib/decimal.py", line 5595, in _format_align
    result = unicode(result)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128) |
|||
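The traceback above ends in `unicode(result)`, which in Python 2 decodes a byte string with the default ASCII codec. A rough Python 3 analogue, offered only as an illustration, is to call `.decode('ascii')` on the same bytes. The variable names are hypothetical and the byte string is copied from the session above.

```python
result = b'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'

try:
    result.decode('ascii')   # roughly what Python 2's unicode(result) attempted
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The first undecodable byte is the 0xc2 that starts the separator.
print(error.start, hex(result[error.start]))

# Decoding with the encoding the bytes were produced in succeeds.
text = result.decode('utf-8')
```

The failing offset matches the "position 2" in the traceback: the sign and the first digit are ASCII, and the separator starts immediately after them.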
msg95894 - (view) | Author: Eric V. Smith (eric.smith) * | Date: 2009-12-02 03:00 | |
I can duplicate this on Linux. The difference is the values in the locale for the separators, specifically, locale.localeconv()['thousands_sep'].

>>> locale.localeconv()['thousands_sep']
'\xc2\xa0'

The question is: since a struct lconv contains char*s, how to interpret them? The code in decimal interprets them as ascii, apparently. floats do the same thing, so this isn't strictly a decimal problem. I'll have to give it some thought. |
|||
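The question of how to read the `char*` fields can be made concrete. The sketch below (Python 3, with the separator bytes hard-coded instead of read from `locale.localeconv()`) shows what three possible readings of the same two bytes produce. It does not claim to show what any particular Python version actually did; the names are illustrative.

```python
sep = b'\xc2\xa0'  # value reported for thousands_sep under cs_CZ.UTF-8

# 1. Read as UTF-8: a single character, U+00A0 NO-BREAK SPACE.
as_utf8 = sep.decode('utf-8')

# 2. Read one character per byte (Latin-1 style widening): two characters.
as_latin1 = sep.decode('latin-1')

# 3. Read as ASCII: the bytes are not decodable at all.
try:
    sep.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(len(as_utf8), len(as_latin1), ascii_ok)
```

Only the first reading yields one character per separator, which is the count a minimum-width calculation would need.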
msg95901 - (view) | Author: Stefan Krah (skrah) * | Date: 2009-12-02 10:42 | |
In python3.2, the output of decimal looks good. With float, the separator is printed as two spaces on my Unicode terminal (export LC_ALL=cs_CZ.UTF-8).

So decimal (3.2) interprets the separator string as a single UTF-8 char and the final output is a UTF-8 string. I'd say that in C, this is the intended way of using struct lconv. If there is an agreement that the final output should be a UTF-8 string, this looks correct to me.

Python 3.2a0 (py3k:76081M, Nov 6 2009, 15:23:48)
[GCC 4.1.3 20080623 (prerelease) (Ubuntu 4.1.2-23ubuntu3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale, decimal
>>> locale.setlocale(locale.LC_NUMERIC, 'cs_CZ.UTF-8')
'cs_CZ.UTF-8'
>>> x = format(decimal.Decimal("-1.5"), '019.18n')
>>> y = format(float("-1.5"), '019.18n')
>>> x
'-0\xa0000\xa0000\xa0000\xa0001,5'
>>> y
'-0ᅡᅠ000ᅡᅠ000ᅡᅠ001,5'
>>> print(x)
-0 000 000 000 001,5
>>> print(y)
-0ᅡᅠ000ᅡᅠ000ᅡᅠ001,5
>>> |
|||
msg95902 - (view) | Author: Mark Dickinson (mark.dickinson) * | Date: 2009-12-02 11:53 | |
So when the format string has type 'str' (as in Stefan's original example) rather than type 'unicode', I'd say Python is doing the right thing already: everything in sight, including the separators coming from localeconv(), has type 'str', so trying to interpret things as unicode seems a bit of a stretch. If the '\xc2\xa0' from localeconv()['thousands_sep'] is to be interpreted as a single unicode character, shouldn't it be a unicode string already? However, if localeconv()['thousands_sep'] *were* to give a unicode string, then I suppose Decimal.__format__ should be returning a unicode result; I don't think it currently does this. (Should this be true even if the number being formatted is so short that no thousands separators actually appear in it?) |
|||
msg95904 - (view) | Author: Eric V. Smith (eric.smith) * | Date: 2009-12-02 13:15 | |
I don't see any documentation that a struct lconv should be interpreted as UTF-8. In fact Googling "struct lconv utf-8" gives this bug report as the first hit. lconv.thousands_sep is char*. It's never been clear to me if this means "pointer to a single char", or "pointer to a null terminated string of char". In Objects/stringlib/localeutil.h I treat it as a string of char. |
|||
msg95906 - (view) | Author: Eric V. Smith (eric.smith) * | Date: 2009-12-02 13:58 | |
In trunk, Modules/_localemodule.c also treats these as "string of char", so at least we're consistent. In py3k, mbstowcs is used and the result passed to PyUnicode_FromWideChar. I'm not sure how you'd address this in locale in trunk, or if we want to do something similar in localeutil.h in trunk (for the Unicode case). |
|||
msg95907 - (view) | Author: Stefan Krah (skrah) * | Date: 2009-12-02 14:10 | |
Googling "multi-byte thousands separator" gives better results. From those results, it is clear to me that decimal_point and thousands_sep are strings that may be interpreted as multi-byte characters. The Czech separator appears to be a no-break space multi-byte character. http://sourceware.org/ml/libc-hacker/2007-01/msg00005.html http://drupal.org/node/353897 My point is that if a multi-byte character appears, it should be counted as a single character for the purposes of calculating min-width. Otherwise, the printed representation is too short. |
|||
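The counting rule argued for above can be sketched as follows. This is an illustration in Python 3, not the fix that was considered for CPython: `missing_padding` is a hypothetical helper, and UTF-8 is assumed as the locale encoding.

```python
def missing_padding(body: bytes, min_width: int, encoding: str = 'utf-8') -> dict:
    """Return how many fill characters are needed under each counting rule."""
    # Counting bytes: multi-byte separators inflate the apparent width.
    by_bytes = max(min_width - len(body), 0)
    # Counting characters: each separator contributes one position.
    by_chars = max(min_width - len(body.decode(encoding)), 0)
    return {'by_bytes': by_bytes, 'by_chars': by_chars}

# Unpadded-looking result from the original report, minimum width 19.
body = b'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'
need = missing_padding(body, 19)
print(need)
```

Under byte counting the field already looks full, while character counting shows that three more positions are needed. The sketch counts code points only and ignores terminal display width.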
msg95926 - (view) | Author: Mark Dickinson (mark.dickinson) * | Date: 2009-12-03 11:19 | |
Reassigning to Eric. |
|||
msg95927 - (view) | Author: Eric V. Smith (eric.smith) * | Date: 2009-12-03 11:21 | |
I've raised the issue with unicode and locale on python-dev: http://mail.python.org/pipermail/python-dev/2009-December/094408.html Pending the outcome of that decision, I'll move forward on this issue. |
|||
msg95966 - (view) | Author: Eric V. Smith (eric.smith) * | Date: 2009-12-04 14:25 | |
See the discussion on python-dev, in particular Martin's comment at http://mail.python.org/pipermail/python-dev/2009-December/094412.html The solutions to this seem too complex for 2.x. It is not a problem in 3.x. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:56:54 | admin | set | github: 51576 |
2009-12-04 14:25:41 | eric.smith | set | status: open -> closed resolution: wont fix messages: + msg95966 |
2009-12-03 11:21:25 | eric.smith | set | messages: + msg95927 |
2009-12-03 11:19:14 | mark.dickinson | set | assignee: mark.dickinson -> eric.smith messages: + msg95926 |
2009-12-02 14:10:39 | skrah | set | messages: + msg95907 |
2009-12-02 13:58:16 | eric.smith | set | messages: + msg95906 |
2009-12-02 13:15:11 | eric.smith | set | messages: + msg95904 |
2009-12-02 11:53:49 | mark.dickinson | set | messages: + msg95902 |
2009-12-02 10:42:05 | skrah | set | messages: + msg95901 |
2009-12-02 03:00:19 | eric.smith | set | messages: + msg95894 |
2009-12-02 00:29:04 | r.david.murray | set | messages: + msg95888 |
2009-12-02 00:05:57 | eric.smith | set | messages: + msg95887 |
2009-12-01 23:19:39 | r.david.murray | set | priority: normal versions: + Python 2.6, Python 2.7 nosy: + r.david.murray messages: + msg95884 type: behavior |
2009-11-30 13:04:04 | skrah | set | messages: + msg95836 |
2009-11-28 17:53:00 | mrabarnett | set | nosy: + mrabarnett messages: + msg95796 |
2009-11-28 16:43:46 | mark.dickinson | set | assignee: mark.dickinson |
2009-11-15 10:29:29 | skrah | create |