Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

non-ascii fill characters no longer work in formatting #57915

Closed
skrah mannequin opened this issue Jan 3, 2012 · 20 comments
Closed

non-ascii fill characters no longer work in formatting #57915

skrah mannequin opened this issue Jan 3, 2012 · 20 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error

Comments

@skrah
Copy link
Mannequin

skrah mannequin commented Jan 3, 2012

BPO 13706
Nosy @loewis, @mdickinson, @vstinner, @ericvsmith, @benjaminp, @ezio-melotti, @skrah, @JimJJewett
Files
  • thousands_separator.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2012-02-24.21:02:08.204>
    created_at = <Date 2012-01-03.22:36:24.835>
    labels = ['interpreter-core', 'type-bug']
    title = 'non-ascii fill characters no longer work in formatting'
    updated_at = <Date 2012-02-24.21:02:08.203>
    user = 'https://github.com/skrah'

    bugs.python.org fields:

    activity = <Date 2012-02-24.21:02:08.203>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-02-24.21:02:08.204>
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation = <Date 2012-01-03.22:36:24.835>
    creator = 'skrah'
    dependencies = []
    files = ['24516']
    hgrepos = []
    issue_num = 13706
    keywords = ['patch']
    message_count = 20.0
    messages = ['150548', '150550', '150551', '150552', '150553', '150554', '150557', '151138', '151732', '151733', '152414', '152418', '153314', '153333', '154098', '154099', '154100', '154101', '154127', '154160']
    nosy_count = 9.0
    nosy_names = ['loewis', 'mark.dickinson', 'vstinner', 'eric.smith', 'benjamin.peterson', 'ezio.melotti', 'skrah', 'python-dev', 'Jim.Jewett']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue13706'
    versions = ['Python 3.3']

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Jan 3, 2012

    It used to be possible to specify Unicode fill characters in numeric
    formatting:

    Python 3.3.0a0 (default:1dd6908df8f5, Jul 16 2011, 11:16:00) 
    [GCC 4.4.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> format(1234, "\u2007<7")
    '1234\u2007\u2007\u2007'
    [64682 refs]
    >>> 

    Now it doesn't work:

    Python 3.3.0a0 (default:65ac469d30e6, Jan  3 2012, 23:23:07) 
    [GCC 4.4.3] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> format(1234, "\u2007<7")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: fill character too large

    @skrah skrah mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels Jan 3, 2012
    @benjaminp
    Copy link
    Contributor

    It's still possible; it's just apparently limited to ASCII characters.

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Jan 3, 2012

    Hum, somehow I always refuse to acknowledge that ASCII is a subset
    of Unicode. :)

    @skrah skrah mannequin changed the title Unicode fill characters no longer work in numeric formatting non-ascii fill characters no longer work in numeric formatting Jan 3, 2012
    @ericvsmith
    Copy link
    Member

    I assume this is left over from the PEP-393 changes. I think the right thing to do is delete this code from line 277 of formatter_unicode.c:

        if (format->fill_char > 127 || format->align > 127 ||
            format->sign > 127) {
            PyErr_SetString(PyExc_ValueError, "fill character too large");
            return 0;
        }

    I'm not sure such a restriction needs to exist any more. But I'll admit to not having thought it through.

    @vstinner
    Copy link
    Member

    vstinner commented Jan 3, 2012

    I assume this is left over from the PEP-393 changes.

    Correct.

    I'm not sure such a restriction needs to exist any more.

    The restriction was introduced to simplify the implementation. maxchar has to be computed exactly in format_string_internal(), format_int_or_long_internal(), format_float_internal() and format_complex_internal().

    For format_string_internal(), the following change is enough (untested):

    if (lpad != 0 || rpad != 0)
        maxchar = Py_MAX(maxchar, format->fill_char);
    

    For number formatting functions, spec->n_lpadding, spec->n_spadding and spec->n_rpadding have to be checked. Something like:

    if (spec->n_lpadding || spec->n_spadding || spec->n_rpadding)
        maxchar = Py_MAX(maxchar, format->fill_char);
    

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Jan 3, 2012

    Removing the if condition would be incorrect. The maximum char is computed at the beginning of the formatting. If, during formatting, need for a padding character is determined, the padding character must not be larger than the maximum char of the target string - which is 127 unless 'c' formatting is used.

    One solution would be to determine whether the padding character is used in advance. Another solution would be to widen the string when the need for non-ASCII padding is detected.

    I have no intention of fixing this issue myself; I don't mind if non-ASCII padding characters are not supported in 3.3.

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Jan 3, 2012

    Actually the issue is not restricted to numeric formatting. It's not
    possible to pad a Unicode string with a non-ascii whitespace:

    >>> format("abcd", "\u2007<7")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: fill character too large

    I'd be more than happy to restrict all numerical I/O operations to
    ASCII. This includes input strings for int(), float(), Decimal().

    It does break backwards compatibility though and the situation
    for string formatting above seems odd to me. It is worse when
    the rejected fill character is already present in the string:

    >>> format("\u2007abcd", "\u2007<7")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: fill character too large

    @skrah skrah mannequin changed the title non-ascii fill characters no longer work in numeric formatting non-ascii fill characters no longer work in formatting Jan 3, 2012
    @ericvsmith
    Copy link
    Member

    Sorry for the off-the-cuff diagnosis. I had assumed this was the unintended result of the conversion, but of course I'm wrong.

    I'd like to fix this.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jan 21, 2012

    New changeset 231c6042c40c by Victor Stinner in branch 'default':
    Issue bpo-13706: Support non-ASCII fill characters
    http://hg.python.org/cpython/rev/231c6042c40c

    @vstinner
    Copy link
    Member

    I fixed the original report, but there is still an issue with non-ASCII thousands separator.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jan 31, 2012

    New changeset 056f5cc8885d by Victor Stinner in branch 'default':
    Issue bpo-13706: Add assertions to detect bugs earlier
    http://hg.python.org/cpython/rev/056f5cc8885d

    @vstinner
    Copy link
    Member

    vstinner commented Feb 1, 2012

    With the changeset 056f5cc8885d, format(1234, 'n') fails in a locale using a non-ASCII thousands separator. Patch showing the problem:

    diff --git a/Lib/test/test_format.py b/Lib/test/test_format.py
    --- a/Lib/test/test_format.py
    +++ b/Lib/test/test_format.py
    @@ -1,4 +1,5 @@
     from test.support import verbose, TestFailed
    +import locale
     import sys
     import test.support as support
     import unittest
    @@ -282,6 +283,19 @@ class FormatTest(unittest.TestCase):
             self.assertEqual(format(1+2j, "\u2007^8"), "\u2007(1+2j)\u2007")
             self.assertEqual(format(0j, "\u2007^4"), "\u20070j\u2007")
     
    +    def test_locale(self):
    +        try:
    +            oldloc = locale.setlocale(locale.LC_ALL, '')
    +        except locale.Error as err:
    +            self.skipTest("Cannot set locale: {}".format(err))
    +        try:
    +            sep = locale.localeconv()['thousands_sep']
    +            self.assertEqual(format(123456789, "n"),
    +                             sep.join(('123', '456', '789')))
    +        finally:
    +            locale.setlocale(locale.LC_ALL, oldloc)
    +
    +
     
     def test_main():
         support.run_unittest(FormatTest)

    @vstinner
    Copy link
    Member

    thousands_separator.patch:

    • Remove the unused _PyUnicode_InsertThousandsGroupingLocale() function
    • Add a test for non-ASCII thousands separator (depend on the locale)
    • _PyUnicode_InsertThousandsGrouping() expects thousands separator as a Unicode object and return the maximum character if unicode is NULL
    • Fix str.__format__() for non-ASCII thousands separator

    Note: the patch handles also non-ASCII decimal point, even I don't know any locale using such point.

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Feb 14, 2012

    I'm using ps_AF for testing:

    $ ./localeconv_wchar ps_AF
    size of wchar_t: 32 bits
    decimal_point byte string: "\xd9\xab" (2 bytes)
    decimal_point wide string: L"\u066b" (1 characters)
    thousands_sep byte string: "\xd9\xac" (2 bytes)
    thousands_sep wide string: L"\u066c" (1 characters)
    currency_symbol byte string: "\xd8\xa7\xd9\x81\xd8\xba\xd8\xa7\xd9\x86\xdb\x8d" (12 bytes)
    currency_symbol wide string: L"\u0627\u0641\u063a\u0627\u0646\u06cd" (6 characters)

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Feb 23, 2012

    New changeset f89e2f4cda88 by Victor Stinner in branch 'default':
    Issue bpo-13706: Fix format(int, "n") for locale with non-ASCII thousands separator
    http://hg.python.org/cpython/rev/f89e2f4cda88

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Feb 24, 2012

    The ps_AF locale fails with an assert after the latest commit:

    Python 3.3.0a0 (default:f89e2f4cda88+, Feb 24 2012, 01:14:50) 
    [GCC 4.4.3] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import locale
    [67103 refs]
    >>> locale.setlocale(locale.LC_ALL, "ps_AF")
    'ps_AF'
    [67108 refs]
    >>> format(13232434234.23423, "n")
    python: Python/formatter_unicode.c:606: fill_number: Assertion `r == spec->n_grouped_digits' failed.
    Aborted

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Feb 24, 2012

    New changeset a29d20fa85b4 by Victor Stinner in branch 'default':
    Issue bpo-13706: Fix format(float, "n") for locale with non-ASCII decimal point (e.g. ps_aF)
    http://hg.python.org/cpython/rev/a29d20fa85b4

    @vstinner
    Copy link
    Member

    The ps_AF locale fails with an assert after the latest commit:
    ...
    format(13232434234.23423, "n")
    python: Python/formatter_unicode.c:606: fill_number:
    Assertion `r == spec->n_grouped_digits' failed.
    Aborted

    Oh, you found a locale with a non-ASCII decimal point, cool! I failed to find such locale. The last commit makes Python supports non-ASCII decimal point.

    Your comment is incorrect, it was already failing before my commit ;-) Example at changeset 548a023c8230:

    $ LANG=ps_AF ./python 
    Python 3.3.0a0 (default:548a023c8230, Feb 24 2012, 01:48:01) 
    >>> import locale
    >>> locale.setlocale(locale.LC_ALL, 'ps_AF')
    'ps_AF'
    >>> format(0.1, 'n')
    python: Objects/unicodeobject.c:391: _PyUnicode_CheckConsistency: Assertion `maxchar < 128' failed.
    Abandon

    --

    By the way, Python 3.2 fails also to handle non-ASCII thousands separator or non-ASCII decimal point:

    $ LANG=ps_AF python3
    Python 3.2.1 (default, Jul 11 2011, 18:54:42) 
    [GCC 4.6.1 20110627 (Red Hat 4.6.1-1)] on linux2
    >>> import locale
    >>> locale.setlocale(locale.LC_ALL, 'ps_AF')
    'ps_AF'
    >>> format(1234, 'n')
    '1\Uffffffd9\Uffffffac234'
    >>> format(0.1, 'n')
    '0\Uffffffd9\Uffffffab1'

    D9 AC/AB are byte strings b'\xD9\xAC' and b'\xD9\xAB' which are UTF-8 encode strings corresponding to U+066C (arabic thousands separator) and U+066B (arabic decimal separator).

    \Uffffffab is a bug in a cast from signed char to 32-bit unsigned integer (Py_UNICODE on Linux).

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Feb 24, 2012

    STINNER Victor <report@bugs.python.org> wrote:

    Your comment is incorrect, it was already failing before my commit ;-) Example at changeset 548a023c8230:

    Ah, sorry about that. I was lazy and tested against 585d3664da89 (which is a
    couple of weeks old).

    That version didn't crash but produced UTF-8 output. :)

    >>> format(0.1, 'n')
    '0\xd9\xab1'

    Now things look great, thanks for fixing all of that.

    @vstinner
    Copy link
    Member

    By the way, Python 3.2 fails also to handle
    non-ASCII thousands separator or non-ASCII
    decimal point: (...)

    Hum, it is not trivial to redo the work on Python 3.2. I prefer to leave the code unchanged to not introduce a regression, and I wait until a Python 3.2 user complains (the bug exists since Python 3.0 and nobody complained).

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants