Issue 7327: format: minimum width: UTF-8 separators and decimal points

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/51576

classification

Title:	format: minimum width: UTF-8 separators and decimal points
Type:	behavior	Stage:
Components:		Versions:	Python 2.7, Python 2.6

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:	eric.smith	Nosy List:	eric.smith, mark.dickinson, mrabarnett, r.david.murray, skrah
Priority:	normal	Keywords:

Created on 2009-11-15 10:29 by skrah, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (15)
msg95283 - (view)	Author: Stefan Krah (skrah) *	Date: 2009-11-15 10:29
This issue affects the format functions of float and decimal. When calculating the padding necessary to reach the minimum width, UTF-8 separators and decimal points are calculated by their byte lengths. This can lead to printed representations that are too short.

Real world example (separator):

>>> import locale
>>> from decimal import *
>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> s = format(Decimal("-1.5"), ' 019.18n')
>>> len(s)
19
>>> len(s.decode('utf-8'))
16
>>> s
'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'
>>>
>>>
>>> s = format(-1.5, ' 019.18n')
>>> s
'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'
>>> len(s.decode('utf-8'))
16
>>>

Constructed example (separator and decimal point):

>>> u = {'decimal_point' : "\xc2\xbf", 'grouping' : [3, 3, 0], 'thousands_sep': "\xc2\xb4"}
>>> def get_fmt(x, locale, fmt='n'):
...     return Decimal.__format__(Decimal(x), fmt, _localeconv=locale)
...
>>> s = get_fmt(Decimal("1.5"), u, "020n")
>>> s
'00\xc2\xb4000\xc2\xb4000\xc2\xb4001\xc2\xbf5'
>>> len(s.decode('utf-8'))
16
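The discrepancy reported above can be reproduced without any locale set up by measuring the reported byte string directly. This is a minimal sketch written for Python 3 (where bytes and str are distinct types); the variable names are illustrative only and do not come from the original report.

```python
# The byte string from the report, exactly as the Python 2 formatter returned it.
raw = b"-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5"

# Width as seen by padding logic that counts bytes.
byte_width = len(raw)

# Width as seen by a reader once the bytes are decoded as UTF-8.
char_width = len(raw.decode("utf-8"))

# Each no-break space (U+00A0) takes two bytes but only one character,
# so three separators account for the missing three columns.
shortfall = byte_width - char_width

print(byte_width, char_width, shortfall)
```

The requested minimum width is satisfied in bytes, while the decoded string falls short by one column per two-byte separator.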
msg95796 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2009-11-28 17:53
Surely this is to be expected when working with bytestrings. You should be working in Unicode and using UTF-8 only for input and output.
msg95836 - (view)	Author: Stefan Krah (skrah) *	Date: 2009-11-30 13:04
What do you mean by "working with bytestrings"? The UTF-8 separators or decimal points come directly from struct lconv (man localeconv). The logical way to reach a minimum width of 19 is to have 19 UTF-8 characters, which can subsequently be converted to other formats.
msg95884 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2009-12-01 23:19
In python3:

>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> s = format(Decimal("-1.5"), ' 019.18n')
>>> len(s)
20
>>> print(s)
-0 000 000 000 001,5

Python3 uses unicode for strings. Python2 uses bytes. To format unicode in python2, you do:

>>> s2 = locale.format("% 019.18g", Decimal("-1.5"))
>>> len(s2)
19
>>> print s2
-0000000000000001,5

Not quite the same thing, clearly. So, is there a way to access the python3 unicode format semantics in python2? Just passing format a unicode format string results in a UnicodeDecodeError.
msg95887 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2009-12-02 00:05
In 2.7, I get:

$ ./python.exe
Python 2.7a0 (trunk:76501, Nov 24 2009, 14:57:21)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> from decimal import Decimal
>>> s = format(Decimal("-1.5"), ' 019.18n')
>>> s
'-0 000 000 000 001,5'
>>> len(s)
20
>>> s = format(Decimal("-1.5"), u' 019.18n')
>>> s
u'-0 000 000 000 001,5'
>>> len(s)
20
>>>

Could you give more details on the UnicodeDecodeError you get? Any traceback?
msg95888 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2009-12-02 00:29
Interesting. My regular locale is LC_CTYPE=en_US.UTF-8, and here is what I get:

Python 2.7a0 (trunk:76501, Nov 24 2009, 13:59:01)
[GCC 4.4.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import local
>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> from decimal import Decimal
>>> s = format(Decimal("-1.5"), ' 019.18n')
>>> s
'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'
>>> len(s)
19
>>> print s
-0 000 000 001,5

sys.stdout.encoding gives 'UTF-8'.

And here's the traceback from trying to use unicode:

>>> s = format(Decimal("-1.5"), u' 019.18n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rdmurray/python/trunk/Lib/decimal.py", line 3609, in __format__
    return _format_number(self._sign, intpart, fracpart, exp, spec)
  File "/home/rdmurray/python/trunk/Lib/decimal.py", line 5704, in _format_number
    return _format_align(sign, intpart+fracpart, spec)
  File "/home/rdmurray/python/trunk/Lib/decimal.py", line 5595, in _format_align
    result = unicode(result)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)
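The UnicodeDecodeError in the traceback comes from an implicit ASCII decode of a byte string that contains a UTF-8 separator. The sketch below is a Python 3 analogue, assuming that `unicode(result)` in Python 2 behaves like an explicit `.decode("ascii")`; it is not the code in decimal.py.

```python
# Byte string with UTF-8 encoded no-break spaces, as in the session above.
raw = b"-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5"

# Decoding as ASCII fails at the first byte >= 0x80.
try:
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the encoding the locale actually uses succeeds.
decoded = raw.decode("utf-8")

print(error)
print(len(decoded))
```

The failing offset is index 2, the first byte of the first separator, which matches the position reported in the traceback.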
msg95894 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2009-12-02 03:00
I can duplicate this on Linux. The difference is the values in the locale for the separators, specifically, locale.localeconv()['thousands_sep'].

>>> locale.localeconv()['thousands_sep']
'\xc2\xa0'

The question is: since a struct lconv contains char*s, how to interpret them? The code in decimal interprets them as ascii, apparently. floats do the same thing, so this isn't strictly a decimal problem. I'll have to give it some thought.
msg95901 - (view)	Author: Stefan Krah (skrah) *	Date: 2009-12-02 10:42
In python3.2, the output of decimal looks good. With float, the separator is printed as two spaces on my Unicode terminal (export LC_ALL=cs_CZ.UTF-8).

So decimal (3.2) interprets the separator string as a single UTF-8 char and the final output is a UTF-8 string. I'd say that in C, this is the intended way of using struct lconv. If there is an agreement that the final output should be a UTF-8 string, this looks correct to me.

Python 3.2a0 (py3k:76081M, Nov 6 2009, 15:23:48)
[GCC 4.1.3 20080623 (prerelease) (Ubuntu 4.1.2-23ubuntu3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale, decimal
>>> locale.setlocale(locale.LC_NUMERIC, 'cs_CZ.UTF-8')
'cs_CZ.UTF-8'
>>> x = format(decimal.Decimal("-1.5"), '019.18n')
>>> y = format(float("-1.5"), '019.18n')
>>> x
'-0\xa0000\xa0000\xa0000\xa0001,5'
>>> y
'-0ￂﾠ000ￂﾠ000ￂﾠ001,5'
>>> print(x)
-0 000 000 000 001,5
>>> print(y)
-0ￂﾠ000ￂﾠ000ￂﾠ001,5
>>>
msg95902 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2009-12-02 11:53
So when the format string has type 'str' (as in Stefan's original example) rather than type 'unicode', I'd say Python is doing the right thing already: everything in sight, including the separators coming from localeconv(), has type 'str', so trying to interpret things as unicode seems a bit of a stretch. If the '\xc2\xa0' from localeconv()['thousands_sep'] is to be interpreted as a single unicode character, shouldn't it be a unicode string already? However, if localeconv()['thousands_sep'] were to give a unicode string, then I suppose Decimal.__format__ should be returning a unicode result; I don't think it currently does this. (Should this be true even if the number being formatted is so short that no thousands separators actually appear in it?)
msg95904 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2009-12-02 13:15
I don't see any documentation that a struct lconv should be interpreted as UTF-8. In fact Googling "struct lconv utf-8" gives this bug report as the first hit. lconv.thousands_sep is char*. It's never been clear to me if this means "pointer to a single char", or "pointer to a null terminated string of char". In Objects/stringlib/localeutil.h I treat it as a string of char.
msg95906 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2009-12-02 13:58
In trunk, Modules/_localemodule.c also treats these as "string of char", so at least we're consistent. In py3k, mbstowcs is used and the result passed to PyUnicode_FromWideChar. I'm not sure how you'd address this in locale in trunk, or if we want to do something similar in localeutil.h in trunk (for the Unicode case).
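A rough Python 3 analogue of the py3k approach mentioned above is to decode each char* field with the locale's encoding before doing any width arithmetic. The helper name `decode_lconv_fields`, the sample dictionary, and the hard-coded "utf-8" are assumptions for illustration; the C code itself goes through mbstowcs and PyUnicode_FromWideChar, as noted.

```python
def decode_lconv_fields(raw_fields, encoding):
    """Decode byte-valued lconv entries to text (hypothetical helper).

    Non-bytes values, such as the grouping list, are passed through unchanged.
    """
    return {
        key: value.decode(encoding) if isinstance(value, bytes) else value
        for key, value in raw_fields.items()
    }


# Sample values modelled on the cs_CZ.UTF-8 fields quoted in this thread.
raw = {
    "decimal_point": b",",
    "thousands_sep": b"\xc2\xa0",
    "grouping": [3, 3, 0],
}

decoded = decode_lconv_fields(raw, "utf-8")

# Two bytes on the C side become one character after decoding.
print(len(raw["thousands_sep"]), len(decoded["thousands_sep"]))
```

Once the separator is a one-character text string, length-based padding counts it as a single column.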
msg95907 - (view)	Author: Stefan Krah (skrah) *	Date: 2009-12-02 14:10
Googling "multi-byte thousands separator" gives better results. From those results, it is clear to me that decimal_point and thousands_sep are strings that may be interpreted as multi-byte characters. The Czech separator appears to be a no-break space multi-byte character. http://sourceware.org/ml/libc-hacker/2007-01/msg00005.html http://drupal.org/node/353897 My point is that if a multi-byte character appears, it should be counted as a single character for the purposes of calculating min-width. Otherwise, the printed representation is too short.
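The proposal to count each multi-byte separator as one character can be sketched as zero-padding on text and encoding only at the end. This is a simplified illustration for Python 3, not CPython's or decimal's algorithm: the function names are hypothetical, grouping is fixed at three digits, and width is measured in code points.

```python
def group_digits(digits, sep, group=3):
    """Insert sep between groups of `group` digits, counting from the right."""
    parts = []
    while digits:
        parts.append(digits[-group:])
        digits = digits[:-group]
    return sep.join(reversed(parts))


def zero_pad_grouped(sign, int_digits, frac_digits, sep, point, min_width):
    """Zero-pad the integer part until the text is at least min_width characters."""
    tail = point + frac_digits if frac_digits else ""
    fixed = len(sign) + len(tail)
    n = max(len(int_digits), 1)
    # n digits need (n - 1) // 3 separators; grow n until the width is reached.
    while n + (n - 1) // 3 * len(sep) + fixed < min_width:
        n += 1
    return sign + group_digits(int_digits.zfill(n), sep) + tail


text = zero_pad_grouped("-", "1", "5", "\xa0", ",", 19)
encoded = text.encode("utf-8")

print(repr(text), len(text), len(encoded))
```

For -1.5 with a minimum width of 19 this yields 20 characters, the same string Python 3 printed earlier in the thread; the width overshoots by one because the padded digits cannot start with a separator.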
msg95926 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2009-12-03 11:19
Reassigning to Eric.
msg95927 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2009-12-03 11:21
I've raised the issue with unicode and locale on python-dev: http://mail.python.org/pipermail/python-dev/2009-December/094408.html Pending the outcome of that decision, I'll move forward on this issue.
msg95966 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2009-12-04 14:25
See the discussion on python-dev, in particular Martin's comment at http://mail.python.org/pipermail/python-dev/2009-December/094412.html The solutions to this seem too complex for 2.x. It is not a problem in 3.x.

History
Date	User	Action	Args
2022-04-11 14:56:54	admin	set	github: 51576
2009-12-04 14:25:41	eric.smith	set	status: open -> closed resolution: wont fix messages: + msg95966
2009-12-03 11:21:25	eric.smith	set	messages: + msg95927
2009-12-03 11:19:14	mark.dickinson	set	assignee: mark.dickinson -> eric.smith messages: + msg95926
2009-12-02 14:10:39	skrah	set	messages: + msg95907
2009-12-02 13:58:16	eric.smith	set	messages: + msg95906
2009-12-02 13:15:11	eric.smith	set	messages: + msg95904
2009-12-02 11:53:49	mark.dickinson	set	messages: + msg95902
2009-12-02 10:42:05	skrah	set	messages: + msg95901
2009-12-02 03:00:19	eric.smith	set	messages: + msg95894
2009-12-02 00:29:04	r.david.murray	set	messages: + msg95888
2009-12-02 00:05:57	eric.smith	set	messages: + msg95887
2009-12-01 23:19:39	r.david.murray	set	priority: normal versions: + Python 2.6, Python 2.7 nosy: + r.david.murray messages: + msg95884 type: behavior
2009-11-30 13:04:04	skrah	set	messages: + msg95836
2009-11-28 17:53:00	mrabarnett	set	nosy: + mrabarnett messages: + msg95796
2009-11-28 16:43:46	mark.dickinson	set	assignee: mark.dickinson
2009-11-15 10:29:29	skrah	create