str methods don't accept non-BMP fillchar on a narrow Unicode build #54730

Closed
abalkin opened this issue Nov 24, 2010 · 18 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error

Comments

@abalkin
Member

abalkin commented Nov 24, 2010

BPO 10521
Nosy @malemburg, @terryjreedy, @amauryfa, @abalkin, @pitrou, @vstinner, @ericvsmith, @benjaminp, @ezio-melotti
Files
  • issue10521-isalpha.diff: Proof of concept that fixes isalpha
  • issue10521-unicode-next.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2012-01-05.21:12:48.498>
    created_at = <Date 2010-11-24.15:25:23.416>
    labels = ['interpreter-core', 'type-bug']
    title = "str methods don't accept non-BMP fillchar on a narrow Unicode build"
    updated_at = <Date 2012-01-05.21:12:48.490>
    user = 'https://github.com/abalkin'

    bugs.python.org fields:

    activity = <Date 2012-01-05.21:12:48.490>
    actor = 'benjamin.peterson'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-01-05.21:12:48.498>
    closer = 'benjamin.peterson'
    components = ['Interpreter Core']
    creation = <Date 2010-11-24.15:25:23.416>
    creator = 'belopolsky'
    dependencies = []
    files = ['19809', '19810']
    hgrepos = []
    issue_num = 10521
    keywords = ['patch']
    message_count = 18.0
    messages = ['122280', '122284', '122285', '122296', '122310', '122329', '122330', '122336', '122339', '122340', '122483', '122487', '122488', '122507', '122548', '144630', '144632', '150691']
    nosy_count = 9.0
    nosy_names = ['lemburg', 'terry.reedy', 'amaury.forgeotdarc', 'belopolsky', 'pitrou', 'vstinner', 'eric.smith', 'benjamin.peterson', 'ezio.melotti']
    pr_nums = []
    priority = 'normal'
    resolution = 'out of date'
    stage = 'needs patch'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue10521'
    versions = ['Python 2.7', 'Python 3.2']

    @abalkin
    Member Author

    abalkin commented Nov 24, 2010

    >>> 'xyz'.center(20, '\U00100140')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: The fill character must be exactly one character long

    str.ljust and str.rjust are similarly affected.
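A minimal sketch of why the narrow build rejects this fill character, assuming Python 3.3 or later as the interpreter running the example. The helper `utf16_units` is hypothetical (it is not a standard library function); it only counts UTF-16 code units, which is the unit that `len()` reported on a narrow build.

```python
# Sketch only: assumes Python 3.3+ (PEP 393 strings).
# utf16_units is a hypothetical helper, not part of the standard library.
def utf16_units(s):
    """Return how many UTF-16 code units s needs (what len() saw on a narrow build)."""
    return len(s.encode("utf-16-le")) // 2

fill = "\U00100140"              # the non-BMP fill character from the report
padded = "xyz".center(20, fill)  # accepted on 3.3+, one code point per fill position

print(len(fill))          # code points in the fill string
print(utf16_units(fill))  # UTF-16 code units for the same string
print(len(padded))        # padded length in code points
```

On a narrow build the fill string is stored as a surrogate pair, so the "exactly one character long" check sees two units and raises.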

    @abalkin abalkin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels Nov 24, 2010
    @pitrou
    Member

    pitrou commented Nov 24, 2010

    The question is, what should it do with such an input? Pretend it's a single char (but other chars in the source string won't get the same treatment)? Treat it as a two-char string (but then center() and friends should logically be extended to accept strings of arbitrary lengths)?

    @ericvsmith
    Member

    str.__format__ and friends (int, float, complex) also have this same problem. For example, when they're computing the "fill" character:

    >>> format('', 'x^')
    ''
    
    >>> format('', '\U00100140^')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: Invalid conversion specification
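For comparison, a small sketch of the same format specifications on a current interpreter, assuming Python 3.3 or later. The explicit width of 7 in the second call is an addition for illustration and does not appear in the report.

```python
# Sketch only: assumes Python 3.3+, where a non-BMP fill is a single character.
fill = "\U00100140"

# Same shape as the failing call above: fill plus alignment, no width.
empty = format("", fill + "^")

# With an explicit width (an assumption added for illustration),
# the fill is applied on both sides of the value.
padded = format("abc", fill + "^7")

print(len(empty), len(padded))
```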

    @abalkin
    Member Author

    abalkin commented Nov 24, 2010

    On Wed, Nov 24, 2010 at 10:33 AM, Antoine Pitrou <report@bugs.python.org> wrote:
    ..

    > The question is, what should it do with such an input?

    I think the rule for such functions should be that if
    input.encode('utf-8') is the same on wide and narrow builds, then the
    output.encode('utf-8') should be the same.

    > Pretend it's a single char (but other chars in the source string won't get the same treatment)?

    Yes, *and* surrogate pairs in the source string should count for one
    char as well.

    > Treat it as a two-char string (but then center() and friends should logically be
    > extended to accept strings of arbitrary lengths)?

    No. For better or worse, on wide builds these methods effectively
    operate on code points. They don't interpret multi-code-point-
    graphemes or take grapheme width into account:

    --------------------
    ​​​​​​​​​​​​​​​​​123
    --------------------

    Application code has to ascertain that it is dealing with fixed
    width characters in the target font before using these methods for
    text alignment.
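The point about code points versus graphemes can be illustrated with a short sketch, assuming Python 3.3 or later. The choice of 'e' plus COMBINING ACUTE ACCENT as the sample grapheme is an assumption made for this example.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: one visible grapheme, two code points.
decomposed = "e\u0301"
# NFC normalization composes it into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

padded_decomposed = decomposed.center(5, "-")
padded_composed = composed.center(5, "-")

# Both results have 5 code points, but the decomposed one gets one fill
# character fewer, so it would occupy fewer columns in a fixed-width font.
print(len(padded_decomposed), padded_decomposed.count("-"))
print(len(padded_composed), padded_composed.count("-"))
```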

    @malemburg
    Member

    Alexander Belopolsky wrote:
    > 
    > New submission from Alexander Belopolsky <belopolsky@users.sourceforge.net>:
    > 
    >>>> 'xyz'.center(20, '\U00100140')
    > Traceback (most recent call last):
    >   File "<stdin>", line 1, in <module>
    > TypeError: The fill character must be exactly one character long
    > 
    > str.ljust and str.rjust are similarly affected.

    I don't think we should change that for the formatting methods.

    See my reply on python-dev:

    str.center(n) centers the string in a padded string that
    is composed of n code units. Whether that operation will result
    in a text that's centered visually on output is a completely
    different story. The original string could contain surrogates,
    it could also contain combining code points, so the visual
    presentation of the result may very well not be centered at
    all; it may not even appear as having the length n to the user.

    Since we're not going to change the semantics of those APIs,
    it is OK to not support padding with non-BMP code points on
    UCS-2 builds.

    Supporting such cases would only cause problems:

    • if the methods would pad with surrogates, the resulting
      string would no longer have length n; breaking the
      assumption that len(str.center(n)) == n

    • if the methods would pad with half the number of surrogates
      to make sure that len(str.center(n)) == n, the resulting
      output to e.g. a terminal would be further off, than what
      you already have with surrogates and combining code points
      in the original string.
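The two options listed above can be put into numbers with a small sketch, assuming Python 3.3 or later as the host interpreter. `utf16_units` is a hypothetical helper that counts UTF-16 code units, standing in for `len()` on a UCS-2 build.

```python
# Sketch only: utf16_units is a hypothetical helper, not a standard function.
def utf16_units(s):
    """Count UTF-16 code units, i.e. what len() would report on a UCS-2 build."""
    return len(s.encode("utf-16-le")) // 2

fill = "\U00100140"
width = 20
padded = "xyz".center(width, fill)

# Option 1: pad with whole surrogate pairs, one pair per fill position.
# The result has `width` code points but more than `width` code units.
units_with_pairs = utf16_units(padded)

# Option 2: keep len() == width in code units. Only the leftover code units
# are available, and each fill character consumes two of them.
unit_budget = width - utf16_units("xyz")
whole_fills, leftover_units = divmod(unit_budget, utf16_units(fill))

print(len(padded), units_with_pairs)
print(unit_budget, whole_fills, leftover_units)
```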

    @abalkin
    Member Author

    abalkin commented Nov 25, 2010

    On Wed, Nov 24, 2010 at 3:37 PM, Marc-Andre Lemburg
    <report@bugs.python.org> wrote:
    ..

    > I don't think we should change that for the formatting methods.

    That's a reasonable position. What about

    >>> unicodedata.category('\N{OLD ITALIC LETTER A}')
    'Lo'
    >>> '\N{OLD ITALIC LETTER A}'.isalpha()
    False

    The str.isalpha() method is underspecified in the reference manual,
    but a comment in unicodectype.c describes Py_UNICODE_ISALPHA as
    follows:

    /* Returns 1 for Unicode characters having the category 'Ll', 'Lu', 'Lt',
       'Lo' or 'Lm', 0 otherwise. */

    I don't have a wide build handy, but I am fairly sure '\N{OLD ITALIC
    LETTER A}'.isalpha() would produce True there. The result above is
    simply a consequence of surrogates being considered non-letters:

    >>> [c.isalpha() for c in '\N{OLD ITALIC LETTER A}']
    [False, False]
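The narrow-build result can be reproduced on a current interpreter by splitting the character into its surrogate pair by hand. This is a sketch assuming Python 3.3 or later; the variable names are illustrative only.

```python
import unicodedata

ch = "\N{OLD ITALIC LETTER A}"  # U+10300, outside the BMP

# Split the code point into the UTF-16 surrogate pair a narrow build stored.
offset = ord(ch) - 0x10000
high = chr(0xD800 + (offset >> 10))
low = chr(0xDC00 + (offset & 0x3FF))

# Each half is a lone surrogate (category 'Cs'), so neither is a letter.
halves_alpha = [high.isalpha(), low.isalpha()]
halves_category = [unicodedata.category(high), unicodedata.category(low)]

# The whole code point is a letter when treated as one character.
whole_alpha = ch.isalpha()

print(halves_alpha, halves_category, whole_alpha)
```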

    @abalkin
    Member Author

    abalkin commented Nov 25, 2010

    Here is another str method not ready for non-BMP chars:

    >>> u = '\U00010140'
    >>> u.translate({ord(u):ord('A')})
    '𐅀'

    (expected 'A')

    >>> u = 'B'
    >>> u.translate({ord(u):ord('A')})
    'A'
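For reference, a sketch of the expected behaviour, assuming Python 3.3 or later, where the translation table is looked up per code point. The surrounding "x" and "y" characters are added only for illustration.

```python
# Sketch only: assumes Python 3.3+.
u = "\U00010140"                 # the non-BMP character from the example above
table = {ord(u): ord("A")}       # map its code point to 'A'

alone = u.translate(table)
embedded = ("x" + u + "y").translate(table)

print(alone, embedded)
```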

    @ezio-melotti
    Member

    I think that methods like str.isalpha can and should be fixed. Since _PyUnicode_IsAlpha now accepts a Py_UCS4, the body of unicode_isalpha can be changed to convert normal chars and surrogate pairs to a Py_UCS4 before calling Py_UNICODE_ISALPHA.
    The attached patch is a proof of concept of this approach and returns True for '\N{OLD ITALIC LETTER A}'.isalpha() on a narrow build.
    It still has a number of issues that should be addressed (check for narrow builds, check for lone surrogates, check for high surrogate at the end of a string, fix compiler warnings ...) but it should be good enough as a PoC.

    I would also suggest introducing a set of macros to handle surrogates (e.g. detect, combine) and using them in all the functions that need to work with surrogates.
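The kind of helpers suggested here can be sketched in Python rather than as C macros. All four function names below are hypothetical and are not the names used in CPython or in the attached patches; the arithmetic is the standard UTF-16 surrogate encoding.

```python
# Sketch only: hypothetical helper names, written in Python for illustration.
def is_high_surrogate(unit):
    """True if the 16-bit code unit is a high (leading) surrogate."""
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    """True if the 16-bit code unit is a low (trailing) surrogate."""
    return 0xDC00 <= unit <= 0xDFFF

def join_surrogates(high, low):
    """Combine a high/low surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def split_code_point(cp):
    """Split a non-BMP code point into its (high, low) surrogate pair."""
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

high, low = split_code_point(0x10300)  # OLD ITALIC LETTER A
print(hex(high), hex(low), hex(join_surrogates(high, low)))
```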

    @abalkin
    Member Author

    abalkin commented Nov 25, 2010

    Here is another proof of concept patch for the isalpha issue that introduces a higher level abstraction macro - Py_UNICODE_NEXT. It should be possible to reuse this macro in all isxyz methods and other places where surrogates are currently processed. It should be possible to come up with a pure macro definition of Py_UNICODE_NEXT.
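The idea of a "next code point" abstraction can be sketched in Python over a list of UTF-16 code units. This is not the Py_UNICODE_NEXT macro from the patch; `next_code_point` and `isalpha_units` are hypothetical names, and returning lone surrogates unchanged is an assumption about how unpaired halves would be handled.

```python
# Sketch only: a Python analogue of a "read next code point" helper.
def next_code_point(units, i):
    """Return (code_point, next_index) for the code units starting at index i.

    A valid high/low surrogate pair is combined into one code point;
    a lone surrogate is returned as-is (an assumption of this sketch).
    """
    unit = units[i]
    if 0xD800 <= unit <= 0xDBFF and i + 1 < len(units):
        low = units[i + 1]
        if 0xDC00 <= low <= 0xDFFF:
            return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00), i + 2
    return unit, i + 1

def isalpha_units(units):
    """isalpha() over UTF-16 code units, checking whole code points."""
    if not units:
        return False
    i = 0
    while i < len(units):
        cp, i = next_code_point(units, i)
        if not chr(cp).isalpha():
            return False
    return True

# OLD ITALIC LETTER A (U+10300) as the surrogate pair a narrow build stored.
print(isalpha_units([0xD800, 0xDF00]))
```

The same loop shape would serve any per-character predicate, which is the reuse the comment above has in mind.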

    @amauryfa
    Member

    bpo-9200 already proposes a similar change to str.is* methods.

    @terryjreedy
    Member

    As a practical matter, I think that for at least the next decade, people are at least as likely to want to fill with a composed, multi-BMP-codepoint 'char' (grapheme) as with a non-BMP char. So to me, failure with the latter is no worse than failure with the former.

    The underlying problem is that centering k chars within n spaces with fill i is based on one-char per code encodings *and* fixed pitch fonts with one-char per space. That model is not universally applicable, so I do not consider it a bug that functions based on that model are also not universally applicable. Perhaps docs should be clearer about the limitations of many of the string methods in the new context.

    A full general solution to the general problem of centering requires a shift to physical units (points or mm) and detailed font information, including kerning. This is beyond the scope of a string method.

    So I consider this a feature request for a partial generalization of unclear utility and unclear definition.

    @abalkin
    Member Author

    abalkin commented Nov 27, 2010

    On Fri, Nov 26, 2010 at 6:37 PM, Terry J. Reedy <report@bugs.python.org> wrote:

    > Terry J. Reedy <tjreedy@udel.edu> added the comment:
    >
    > As a practical matter, I think that for at least the next decade, people are at least as likely to
    > want to fill with a composed, multi-BMP-codepoint 'char' (grapheme) as with a non-BMP char.
    > So to me, failure with the latter is no worse than failure with the former.

    I disagree. '\N{AEGEAN WORD SEPARATOR DOT}' ('𐄁') looks like a
    reasonably shaped fill character, while say 'Z\N{COMBINING ACUTE
    ACCENT}\N{COMBINING GRAVE ACCENT}' ('Ź̀') does not. Yet this is not
    the point of this bug report. The point is that a Python user should
    not care (much) about how many bytes per character Python uses under
    the hood or what is the numeric value of the character that she can
    enter in her program.

    > The underlying problem is that centering k chars within n spaces with fill i is based
    > on one-char per code encodings *and* fixed pitch fonts with one-char per space.

    No. ' Section Title '.center(40, '*') will look good regardless of
    font width and even more so when combined with <center> tag or its
    equivalent in a given application.

    @ericvsmith
    Member

    I think these macros would be a reasonable approach. I think str.center, etc. should support non-BMP chars, because to not do so can raise an exception. Supporting composed graphemes seems like another problem altogether. And while we could fix that, it's clearly a larger step.

    @ezio-melotti
    Member

    I agree that s.center(n, char).encode('utf-8') should be the same on both builds, even if their len() differs, for the following reasons:

    1. the string will eventually be encoded, and if the result is the same on both builds, it will look the same too;
    2. trying to keep the same len() will generate different results, and it won't work for an odd width like 'foo'.center(5, surrogate_pair) because you can't insert half a surrogate pair.
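Both reasons can be checked with a short sketch, assuming Python 3.3 or later as the host interpreter. `utf16_units` is a hypothetical helper that stands in for `len()` on a narrow build, and the fill character U+10140 is borrowed from an earlier comment in this thread.

```python
# Sketch only: utf16_units is a hypothetical helper, not a standard function.
def utf16_units(s):
    """Count UTF-16 code units, i.e. what len() reported on a narrow build."""
    return len(s.encode("utf-16-le")) // 2

fill = "\U00010140"
padded = "foo".center(5, fill)

# Reason 1: padding per code point gives one well-defined UTF-8 byte string,
# whatever the internal representation of the build is.
encoded = padded.encode("utf-8")

# Reason 2: keeping len() == 5 in code units leaves 2 units of padding,
# i.e. 1 unit per side, which is less than the 2 units one fill needs.
unit_budget = 5 - utf16_units("foo")
units_per_side = unit_budget // 2
units_per_fill = utf16_units(fill)

print(len(padded), utf16_units(padded))
print(unit_budget, units_per_side, units_per_fill)
```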

    @terryjreedy
    Member

    After reading the additional messages here and on a similar issue Alexander opened after this, I see the point of wanting to make the difference between the two types of builds as transparent as sensibly possible. From that viewpoint, rejection of composed chars is not as bad because both types of builds act the same.

    @vstinner
    Member

    This issue has been fixed in Python 3.3 thanks to PEP 393.

    @ezio-melotti
    Member

    It can still be fixed on 2.7/3.2 though.

    @benjaminp
    Contributor

    I'm just going to close this and say "use 3.3".

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022