str methods don't accept non-BMP fillchar on a narrow Unicode build #54730
>>> 'xyz'.center(20, '\U00100140')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: The fill character must be exactly one character long
str.ljust and str.rjust are similarly affected. |
The question is, what should it do with such an input? Pretend it's a single char (but other chars in the source string won't get the same treatment)? Treat it as a two-char string (but then center() and friends should logically be extended to accept strings of arbitrary lengths)? |
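To make the two options concrete: a narrow build stores a str as UTF-16 code units, so a non-BMP character takes two units (a surrogate pair), which is why the fill character is rejected as "not one character long". A minimal sketch that shows the count on any current Python; `utf16_length` is a hypothetical helper, not a standard library function:

```python
def utf16_length(s):
    """Return the number of UTF-16 code units needed to store s."""
    # Every UTF-16 code unit is two bytes, so halve the encoded length.
    return len(s.encode('utf-16-le')) // 2

fill = '\U00100140'
print(utf16_length('x'))    # BMP character: one code unit
print(utf16_length(fill))   # non-BMP character: two code units (a surrogate pair)
```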
str.__format__ and friends (int, float, complex) also have this same problem. For example, when they're computing the "fill" character:
>>> format('', 'x^')
''
>>> format('', '\U00100140^')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: Invalid conversion specification |
On Wed, Nov 24, 2010 at 10:33 AM, Antoine Pitrou <report@bugs.python.org> wrote:
I think the rule for such functions should be that if
Yes, *and* surrogate pairs in the source string should count for one
No. For better or worse, on wide builds these methods effectively -------------------- Application code has to ascertain that it is dealing with fixed |
Alexander Belopolsky wrote:
>
> New submission from Alexander Belopolsky <belopolsky@users.sourceforge.net>:
>
>>>> 'xyz'.center(20, '\U00100140')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> TypeError: The fill character must be exactly one character long
>
> str.ljust and str.rjust are similarly affected.
I don't think we should change that for the formatting methods. See my reply on python-dev: str.center(n) centers the string in a padded string that Since we're not going to change the semantics of those APIs, Supporting such cases would only cause problems:
|
On Wed, Nov 24, 2010 at 3:37 PM, Marc-Andre Lemburg
That's a reasonable position. What about
>>> unicodedata.category('\N{OLD ITALIC LETTER A}')
'Lo'
>>> '\N{OLD ITALIC LETTER A}'.isalpha()
False
the str.isalpha() method is underspecified in the reference manual, /* Returns 1 for Unicode characters having the category 'Ll', 'Lu', I don't have a wide build handy, but I am fairly sure '\N{OLD ITALIC
>>> [c.isalpha() for c in '\N{OLD ITALIC LETTER A}']
[False, False] |
Here is another str method not ready for non-BMP chars:
>>> u = '\U00010140'
>>> u.translate({ord(u):ord('A')})
'𐅀' (expected 'A')
>>> u = 'B'
>>> u.translate({ord(u):ord('A')})
'A' |
I think that methods like str.isalpha can and should be fixed. Since _PyUnicode_IsAlpha now accepts a Py_UCS4, the body of unicode_isalpha can be changed to convert normal chars and surrogate pairs to a Py_UCS4 before calling Py_UNICODE_ISALPHA. I would also suggest introducing a set of macros to handle surrogates (e.g. detect, combine) and using it in all the functions that need to work with them. |
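For illustration only, the "detect" and "combine" operations could look roughly like this in Python. The function names are hypothetical and are not the macros that were proposed or added to CPython; the arithmetic is the standard UTF-16 surrogate decoding:

```python
def is_high_surrogate(unit):
    """True if unit is a leading (high) surrogate, U+D800..U+DBFF."""
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    """True if unit is a trailing (low) surrogate, U+DC00..U+DFFF."""
    return 0xDC00 <= unit <= 0xDFFF

def join_surrogates(high, low):
    """Combine a high/low surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+10300 OLD ITALIC LETTER A is stored as the pair D800 DF00 in UTF-16.
code_point = join_surrogates(0xD800, 0xDF00)
print(hex(code_point))
```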
Here is another proof of concept patch for the isalpha issue that introduces a higher level abstraction macro - Py_UNICODE_NEXT. It should be possible to reuse this macro in all isxyz methods and other places where surrogates are currently processed. It should be possible to come up with a pure macro definition of Py_UNICODE_NEXT. |
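The patch itself is not reproduced here, but the idea behind a "next code point" abstraction can be sketched in Python over a list of UTF-16 code units. All names below (`utf16_units`, `next_code_point`, `isalpha_units`) are made up for this sketch and only approximate what a Py_UNICODE_NEXT-style macro would do in C:

```python
def utf16_units(s):
    """Return s as a list of UTF-16 code units, as a narrow build stores it."""
    data = s.encode('utf-16-le')
    return [int.from_bytes(data[i:i + 2], 'little')
            for i in range(0, len(data), 2)]

def next_code_point(units, i):
    """Return (code_point, next_index), consuming a surrogate pair if present."""
    unit = units[i]
    if 0xD800 <= unit <= 0xDBFF and i + 1 < len(units):
        low = units[i + 1]
        if 0xDC00 <= low <= 0xDFFF:
            return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00), i + 2
    # BMP character or unpaired surrogate: pass the unit through unchanged.
    return unit, i + 1

def isalpha_units(units):
    """isalpha() over code units, treating each surrogate pair as one char."""
    if not units:
        return False
    i = 0
    while i < len(units):
        cp, i = next_code_point(units, i)
        if not chr(cp).isalpha():
            return False
    return True

print(isalpha_units(utf16_units('\N{OLD ITALIC LETTER A}')))
```

The same loop shape would serve the other isxyz methods by swapping the per-code-point predicate.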
bpo-9200 already proposes a similar change to str.is* methods. |
As a practical matter, I think that for at least the next decade, people are at least as likely to want to fill with a composed, multi-BMP-codepoint 'char' (grapheme) as with a non-BMP char. So to me, failure with the latter is no worse than failure with the former. The underlying problem is that centering k chars within n spaces with fill i is based on one-char per code encodings *and* fixed pitch fonts with one-char per space. That model is not universally applicable, so I do not consider it a bug that functions based on that model are also not universally applicable. Perhaps docs should be clearer about the limitations of many of the string methods in the new context. A full general solution to the general problem of centering requires a shift to physical units (points or mm) and detailed font information, including kerning. This is beyond the scope of a string method. So I consider this a feature request for a partial generalization of unclear utility and unclear definition. |
On Fri, Nov 26, 2010 at 6:37 PM, Terry J. Reedy <report@bugs.python.org> wrote:
I disagree. '\N{AEGEAN WORD SEPARATOR DOT}' ('𐄁') looks like a
No. ' Section Title '.center(40, '*') will look good regardless of |
I think these macros would be a reasonable approach. I think str.center, etc. should support non-BMP chars, because to not do so can raise an exception. Supporting composed graphemes seems like another problem altogether. And while we could fix that, it's clearly a larger step. |
I agree that s.center(n, char).encode('utf-8') should be the same on both builds -- even if their len() will be different -- for the following reasons:
|
After reading the additional messages here and on a similar issue Alexander opened after this, I see the point of wanting to make the difference between the two types of builds as transparent as sensibly possible. From that viewpoint, rejection of composed chars is not as bad because both types of builds act the same. |
This issue has been fixed in Python 3.3 thanks to PEP 393. |
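A quick check of the cases reported in this thread, assuming Python 3.3 or later, where every build indexes strings by code point. The variable names are only for this sketch:

```python
fill = '\U00100140'

# str.center with a non-BMP fill character no longer raises TypeError.
centered = 'xyz'.center(20, fill)
print(len(centered))

# format() accepts a non-BMP fill character in the format spec.
formatted = format('ab', fill + '^4')

# str.translate maps a non-BMP character like any other.
u = '\U00010140'
translated = u.translate({ord(u): ord('A')})
print(translated)

# str.isalpha sees OLD ITALIC LETTER A (category 'Lo') as one character.
print('\N{OLD ITALIC LETTER A}'.isalpha())
```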
It can still be fixed on 2.7/3.2 though. |
I'm just going to close this and say "use 3.3". |