Issue 13706: non-ascii fill characters no longer work in formatting

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/57915

classification

Title:	non-ascii fill characters no longer work in formatting
Type:	behavior	Stage:
Components:	Interpreter Core	Versions:	Python 3.3

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	Jim.Jewett, benjamin.peterson, eric.smith, ezio.melotti, loewis, mark.dickinson, python-dev, skrah, vstinner
Priority:	normal	Keywords:	patch

Created on 2012-01-03 22:36 by skrah, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
thousands_separator.patch	vstinner, 2012-02-14 01:35		review

Messages (20)
msg150548 - (view)	Author: Stefan Krah (skrah) *	Date: 2012-01-03 22:36
It used to be possible to specify Unicode fill characters in numeric formatting: Python 3.3.0a0 (default:1dd6908df8f5, Jul 16 2011, 11:16:00) [GCC 4.4.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> format(1234, "\u2007<7") '1234\u2007\u2007\u2007' [64682 refs] >>> Now it doesn't work: Python 3.3.0a0 (default:65ac469d30e6, Jan 3 2012, 23:23:07) [GCC 4.4.3] on linux Type "help", "copyright", "credits" or "license" for more information. >>> format(1234, "\u2007<7") Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: fill character too large
msg150550 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2012-01-03 22:44
It's still possible; it's just apparently limited to ASCII characters.
msg150551 - (view)	Author: Stefan Krah (skrah) *	Date: 2012-01-03 22:48
Hum, somehow I always refuse to acknowledge that ASCII is a subset of Unicode. :)
msg150552 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2012-01-03 22:51
I assume this is left over from the PEP 393 changes. I think the right thing to do is delete this code from line 277 of formatter_unicode.c: if (format->fill_char > 127 \|\| format->align > 127 \|\| format->sign > 127) { PyErr_SetString(PyExc_ValueError, "fill character too large"); return 0; } I'm not sure such a restriction needs to exist any more. But I'll admit to not having thought it through.
msg150553 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-01-03 22:59
> I assume this is left over from the PEP 393 changes. Correct. > I'm not sure such a restriction needs to exist any more. The restriction was introduced to simplify the implementation. maxchar has to be computed exactly in format_string_internal(), format_int_or_long_internal(), format_float_internal() and format_complex_internal(). For format_string_internal(), the following change is enough (untested): if (lpad != 0 \|\| rpad != 0) maxchar = Py_MAX(maxchar, format->fill_char); For number formatting functions, spec->n_lpadding, spec->n_spadding and spec->n_rpadding have to be checked. Something like: if (spec->n_lpadding \|\| spec->n_spadding \|\| spec->n_rpadding) maxchar = Py_MAX(maxchar, format->fill_char);
msg150554 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2012-01-03 23:07
Removing the if condition would be incorrect. The maximum char is computed at the beginning of the formatting. If, during formatting, need for a padding character is determined, the padding character must not be larger than the maximum char of the target string - which is 127 unless 'c' formatting is used. One solution would be to determine whether the padding character is used in advance. Another solution would be to widen the string when the need for non-ASCII padding is detected. I have no intention of fixing this issue myself; I don't mind if non-ASCII padding characters are not supported in 3.3.
msg150557 - (view)	Author: Stefan Krah (skrah) *	Date: 2012-01-03 23:44
Actually the issue is not restricted to numeric formatting. It's not possible to pad a Unicode string with a non-ascii whitespace: >>> format("abcd", "\u2007<7") Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: fill character too large I'd be more than happy to restrict all numerical I/O operations to ASCII. This includes input strings for int(), float(), Decimal(). It does break backwards compatibility though and the situation for string formatting above seems odd to me. It is worse when the rejected fill character is already present in the string: >>> format("\u2007abcd", "\u2007<7") Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: fill character too large
msg151138 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2012-01-12 16:41
Sorry for the off-the-cuff diagnosis. I had assumed this was the unintended result of the conversion, but of course I'm wrong. I'd like to fix this.
msg151732 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-01-21 14:49
New changeset 231c6042c40c by Victor Stinner in branch 'default': Issue #13706: Support non-ASCII fill characters http://hg.python.org/cpython/rev/231c6042c40c
msg151733 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-01-21 14:50
I fixed the original report, but there is still an issue with non-ASCII thousands separator.
msg152414 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-01-31 23:22
New changeset 056f5cc8885d by Victor Stinner in branch 'default': Issue #13706: Add assertions to detect bugs earlier http://hg.python.org/cpython/rev/056f5cc8885d
msg152418 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-02-01 00:03
With the changeset 056f5cc8885d, format(1234, 'n') fails in a locale using a non-ASCII thousands separator. Patch showing the problem: diff --git a/Lib/test/test_format.py b/Lib/test/test_format.py --- a/Lib/test/test_format.py +++ b/Lib/test/test_format.py @@ -1,4 +1,5 @@ from test.support import verbose, TestFailed +import locale import sys import test.support as support import unittest @@ -282,6 +283,19 @@ class FormatTest(unittest.TestCase): self.assertEqual(format(1+2j, "\u2007^8"), "\u2007(1+2j)\u2007") self.assertEqual(format(0j, "\u2007^4"), "\u20070j\u2007") + def test_locale(self): + try: + oldloc = locale.setlocale(locale.LC_ALL, '') + except locale.Error as err: + self.skipTest("Cannot set locale: {}".format(err)) + try: + sep = locale.localeconv()['thousands_sep'] + self.assertEqual(format(123456789, "n"), + sep.join(('123', '456', '789'))) + finally: + locale.setlocale(locale.LC_ALL, oldloc) + + def test_main(): support.run_unittest(FormatTest)
msg153314 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-02-14 01:35
thousands_separator.patch: - Remove the unused _PyUnicode_InsertThousandsGroupingLocale() function - Add a test for non-ASCII thousands separator (depend on the locale) - _PyUnicode_InsertThousandsGrouping() expects thousands separator as a Unicode object and return the maximum character if unicode is NULL - Fix str.__format__() for non-ASCII thousands separator Note: the patch handles also non-ASCII decimal point, even I don't know any locale using such point.
msg153333 - (view)	Author: Stefan Krah (skrah) *	Date: 2012-02-14 09:19
I'm using ps_AF for testing: $ ./localeconv_wchar ps_AF size of wchar_t: 32 bits decimal_point byte string: "\xd9\xab" (2 bytes) decimal_point wide string: L"\u066b" (1 characters) thousands_sep byte string: "\xd9\xac" (2 bytes) thousands_sep wide string: L"\u066c" (1 characters) currency_symbol byte string: "\xd8\xa7\xd9\x81\xd8\xba\xd8\xa7\xd9\x86\xdb\x8d" (12 bytes) currency_symbol wide string: L"\u0627\u0641\u063a\u0627\u0646\u06cd" (6 characters)
msg154098 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-02-23 23:49
New changeset f89e2f4cda88 by Victor Stinner in branch 'default': Issue #13706: Fix format(int, "n") for locale with non-ASCII thousands separator http://hg.python.org/cpython/rev/f89e2f4cda88
msg154099 - (view)	Author: Stefan Krah (skrah) *	Date: 2012-02-24 00:26
The ps_AF locale fails with an assert after the latest commit: Python 3.3.0a0 (default:f89e2f4cda88+, Feb 24 2012, 01:14:50) [GCC 4.4.3] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import locale [67103 refs] >>> locale.setlocale(locale.LC_ALL, "ps_AF") 'ps_AF' [67108 refs] >>> format(13232434234.23423, "n") python: Python/formatter_unicode.c:606: fill_number: Assertion `r == spec->n_grouped_digits' failed. Aborted
msg154100 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-02-24 00:47
New changeset a29d20fa85b4 by Victor Stinner in branch 'default': Issue #13706: Fix format(float, "n") for locale with non-ASCII decimal point (e.g. ps_aF) http://hg.python.org/cpython/rev/a29d20fa85b4
msg154101 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-02-24 00:56
> The ps_AF locale fails with an assert after the latest commit: > ... > format(13232434234.23423, "n") > python: Python/formatter_unicode.c:606: fill_number: > Assertion `r == spec->n_grouped_digits' failed. > Aborted Oh, you found a locale with a non-ASCII decimal point, cool! I failed to find such locale. The last commit makes Python supports non-ASCII decimal point. Your comment is incorrect, it was already failing before my commit ;-) Example at changeset 548a023c8230: $ LANG=ps_AF ./python Python 3.3.0a0 (default:548a023c8230, Feb 24 2012, 01:48:01) >>> import locale >>> locale.setlocale(locale.LC_ALL, 'ps_AF') 'ps_AF' >>> format(0.1, 'n') python: Objects/unicodeobject.c:391: _PyUnicode_CheckConsistency: Assertion `maxchar < 128' failed. Abandon -- By the way, Python 3.2 fails also to handle non-ASCII thousands separator or non-ASCII decimal point: $ LANG=ps_AF python3 Python 3.2.1 (default, Jul 11 2011, 18:54:42) [GCC 4.6.1 20110627 (Red Hat 4.6.1-1)] on linux2 >>> import locale >>> locale.setlocale(locale.LC_ALL, 'ps_AF') 'ps_AF' >>> format(1234, 'n') '1\Uffffffd9\Uffffffac234' >>> format(0.1, 'n') '0\Uffffffd9\Uffffffab1' D9 AC/AB are byte strings b'\xD9\xAC' and b'\xD9\xAB' which are UTF-8 encode strings corresponding to U+066C (arabic thousands separator) and U+066B (arabic decimal separator). \Uffffffab is a bug in a cast from signed char to 32-bit unsigned integer (Py_UNICODE on Linux).
msg154127 - (view)	Author: Stefan Krah (skrah) *	Date: 2012-02-24 09:57
STINNER Victor <report@bugs.python.org> wrote: > Your comment is incorrect, it was already failing before my commit ;-) Example at changeset 548a023c8230: Ah, sorry about that. I was lazy and tested against 585d3664da89 (which is a couple of weeks old). That version didn't crash but produced UTF-8 output. :) >>> format(0.1, 'n') '0\xd9\xab1' Now things look great, thanks for fixing all of that.
msg154160 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-02-24 21:02
> By the way, Python 3.2 fails also to handle > non-ASCII thousands separator or non-ASCII > decimal point: (...) Hum, it is not trivial to redo the work on Python 3.2. I prefer to leave the code unchanged to not introduce a regression, and I wait until a Python 3.2 user complains (the bug exists since Python 3.0 and nobody complained).

History
Date	User	Action	Args
2022-04-11 14:57:25	admin	set	github: 57915
2012-02-24 21:02:08	vstinner	set	status: open -> closed resolution: fixed messages: + msg154160
2012-02-24 09:57:53	skrah	set	messages: + msg154127
2012-02-24 00:56:28	vstinner	set	messages: + msg154101
2012-02-24 00:47:04	python-dev	set	messages: + msg154100
2012-02-24 00:26:22	skrah	set	messages: + msg154099
2012-02-23 23:49:31	python-dev	set	messages: + msg154098
2012-02-14 09:19:18	skrah	set	messages: + msg153333
2012-02-14 01:35:26	vstinner	set	files: + thousands_separator.patch keywords: + patch messages: + msg153314
2012-02-01 00:03:31	vstinner	set	messages: + msg152418
2012-01-31 23:22:12	python-dev	set	messages: + msg152414
2012-01-21 14:50:05	vstinner	set	messages: + msg151733
2012-01-21 14:49:55	python-dev	set	nosy: + python-dev messages: + msg151732
2012-01-12 16:41:41	eric.smith	set	nosy: + Jim.Jewett messages: + msg151138
2012-01-03 23:44:16	skrah	set	messages: + msg150557 title: non-ascii fill characters no longer work in numeric formatting -> non-ascii fill characters no longer work in formatting
2012-01-03 23:07:34	loewis	set	messages: + msg150554
2012-01-03 22:59:52	vstinner	set	messages: + msg150553
2012-01-03 22:51:26	eric.smith	set	messages: + msg150552
2012-01-03 22:49:16	skrah	set	title: Unicode fill characters no longer work in numeric formatting -> non-ascii fill characters no longer work in numeric formatting
2012-01-03 22:48:54	skrah	set	messages: + msg150551
2012-01-03 22:47:50	ezio.melotti	set	nosy: + ezio.melotti
2012-01-03 22:44:10	benjamin.peterson	set	nosy: + benjamin.peterson messages: + msg150550
2012-01-03 22:36:24	skrah	create