classification
Title: Make the str.is* methods work with non-BMP chars on narrow builds
Type: behavior Stage: resolved
Components: Interpreter Core, Unicode Versions: Python 3.2, Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: Arfrever, Rhamphoryncus, amaury.forgeotdarc, belopolsky, ezio.melotti, haypo, lemburg, python-dev, tchrist
Priority: normal Keywords: patch

Created on 2010-07-08 13:50 by amaury.forgeotdarc, last changed 2011-08-22 20:55 by ezio.melotti. This issue is now closed.

Files
File name Uploaded Description Edit
join-surrogates.patch amaury.forgeotdarc, 2010-07-09 17:13 review
issue9200.diff belopolsky, 2010-11-27 01:57 review
issue9200.diff ezio.melotti, 2011-08-18 13:59 review
issue9200-2.diff ezio.melotti, 2011-08-19 15:54 review
Messages (18)
msg109542 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-07-08 13:50
On narrow unicode builds:
unicodedata.category(chr(0x10000)) == 'Lo'  # correct
Py_UNICODE_ISPRINTABLE(0x10000)    == 1     # correct 
str.isprintable(chr(0x10000))      == False # inconsistent

On narrow unicode builds, large code points are stored with a surrogate pair.  But str.isprintable() simply loops over the Py_UNICODE array, and test the surrogates separately.

There should be a way to walk a unicode string in C, character by character, and the str methods (str.is*, str.to*) should use it.
msg109766 - (view) Author: Adam Olsen (Rhamphoryncus) Date: 2010-07-09 16:38
There should be a way to walk the unicode string in Python too.  Afaik there isn't.
msg109775 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-07-09 17:13
A "proof of concept" patch, which shows the macros used to walk a unicode string and uses them in unicode_repr() (should not change behaviour) and in unicode_isprintable() (should fix the issue).

Other functions should be changed the same way, of course.
msg122465 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-26 16:35
AFAICT, all ctype methods (isalpha, isdigit, etc.) have the same problem.  I posted a patch at issue10542 that introduces a Py_UNICODE_NEXT() macro that can help fixing all these methods.  I am adding #10542 as a dependency and if there are no objections, I will change the title to extend the scope of this issue to cover all ctype methods.
msg122498 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-27 01:57
Attached patch fixes isprintable and other ctype-like methods.  I left isspace() out for now because I could not find a test character outside of BMP to test with, but I think we should fix that for completeness as well.

At this point the goal is mostly to showcase Py_UNICODE_NEXT(), not completeness.  See issue10542.

I also noticed that upper/lower/swapcase methods have the same issue:

>>> '\N{MATHEMATICAL BOLD CAPITAL A}'.lower() == '\N{MATHEMATICAL BOLD CAPITAL A}'
True

This will be a subject of a separate issue.
msg142116 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-15 11:01
I closed #12730 as a duplicate of this and updated the title of this issue.
msg142316 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-18 13:49
The attached patch fixes all the str.is* methods and makes them work on narrow builds with non-BMP chars.  It also includes the _Py_UNICODE_NEXT macro proposed in #10542.
msg142318 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-18 13:57
(Note: I copied the macros from the other patch without changing the name.  If the approach is good I'll get rid of the prefixes and separate the words in IS{HIGH|LOW}SURROGATE with an _.)
msg142355 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-08-18 16:18
I don't think that macros specific to unicodeobject.c should get the _PY_UNICODE_ prefix. "_Py_" prefix is reserved to exported symbols, but symbols reserved for the Python interprefer itself. For _Py_UNICODE_NEXT, you can call it NEXT_CHARACTER().

_Py_UNICODE_ISHIGHSURROGATE,_Py_UNICODE_ISLOWSURROGATE and _Py_UNICODE_JOIN_SURROGATES are only used once, I would prefer to see them inlined in _Py_UNICODE_NEXT.

The first cast to Py_UCS4 in _Py_UNICODE_JOIN_SURROGATES is useless.

It looks like the macro can be simplified to something like:

#define _Py_UNICODE_NEXT(ptr, end)                                      \
    (_Py_UNICODE_ISHIGHSURROGATE(*(ptr)) && (ptr) < (end)) && _Py_UNICODE_ISLOWSURROGATE((ptr)[1] ?           \
     ((ptr) += 2,_Py_UNICODE_JOIN_SURROGATES((ptr)[-2], (ptr)[-1])) :  \
     (Py_UCS4)*(ptr)++)

(you don't need two "a?b:c")

I would prefer to see _Py_UNICODE_NEXT as a function instead of a macro because it has border effect (ptr++ or ptr += 2). You cannot write Py_UNICODE_ISALNUM(_Py_UNICODE_NEXT(p, e)) for example, because Py_UNICODE_ISALNUM is defined as:

#define Py_UNICODE_ISALNUM(ch) (Py_UNICODE_ISALPHA(ch) || Py_UNICODE_ISDECIMAL(ch) || Py_UNICODE_ISDIGIT(ch) || Py_UNICODE_ISNUMERIC(ch))

And so _Py_UNICODE_NEXT is expanded 4 times. It is also horrible to debug a program having such long macro. If you really want to keep it as a macro, please add a least a big warning.

It's also confusing to have two attachments with the same name :-/
msg142363 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-08-18 16:38
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
> I don't think that macros specific to unicodeobject.c should get the _PY_UNICODE_ prefix. "_Py_" prefix is reserved to exported symbols, but symbols reserved for the Python interprefer itself. For _Py_UNICODE_NEXT, you can call it NEXT_CHARACTER().

Can we please stick to the things we discussed on issue10542.

The macros are intended to be public starting with 3.3, not private
and they are meant to be used outside the interpreter as well.
They will only be private for patch level release patches. For those
you don't need the Py-prefix, but it also doesn't hurt having it
there.

> _Py_UNICODE_ISHIGHSURROGATE,_Py_UNICODE_ISLOWSURROGATE and _Py_UNICODE_JOIN_SURROGATES are only used once, I would prefer to see them inlined in _Py_UNICODE_NEXT.
> 
> The first cast to Py_UCS4 in _Py_UNICODE_JOIN_SURROGATES is useless.
> 
> It looks like the macro can be simplified to something like:
> 
> #define _Py_UNICODE_NEXT(ptr, end)                                      \
>     (_Py_UNICODE_ISHIGHSURROGATE(*(ptr)) && (ptr) < (end)) && _Py_UNICODE_ISLOWSURROGATE((ptr)[1] ?           \
>      ((ptr) += 2,_Py_UNICODE_JOIN_SURROGATES((ptr)[-2], (ptr)[-1])) :  \
>      (Py_UCS4)*(ptr)++)
> 
> (you don't need two "a?b:c")
> 
> I would prefer to see _Py_UNICODE_NEXT as a function instead of a macro because it has border effect (ptr++ or ptr += 2). You cannot write Py_UNICODE_ISALNUM(_Py_UNICODE_NEXT(p, e)) for example, because Py_UNICODE_ISALNUM is defined as:

> #define Py_UNICODE_ISALNUM(ch) (Py_UNICODE_ISALPHA(ch) || Py_UNICODE_ISDECIMAL(ch) || Py_UNICODE_ISDIGIT(ch) || Py_UNICODE_ISNUMERIC(ch))
> 
> And so _Py_UNICODE_NEXT is expanded 4 times. It is also horrible to debug a program having such long macro. If you really want to keep it as a macro, please add a least a big warning.

Having it as function would kill the performance advantage, so that's not
really an option. The use case you mention is also not really realistic.

Adding an extra warning to the macro version is still a good idea, though.
msg142468 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-19 15:54
Here's a new version of the patch.
I decided to leave the prefix anyway, for consistency with what I'll commit to 3.3 and because without the prefix NEXT() looks ambiguous (and it's not entirely clear if it's private or not).
I rewrote the macro as Victor suggested and tested that it still works (I also added a test with surrogates).
The macros are now called _Py_UNICODE_IS_{LOW|HIGH}_SURROGATE, with '_'s.  I also tried the implementation proposed in #12751 and benchmarked with:
$ ./python -m timeit -s 's = "\uD800\uD8000\uDFFF\uDFFF\uDFFF"*1000' 's.islower()'
and got "1000 loops, best of 3: 345 usec per loop" on both, so I left the old version because I think it's more readable.
Finally, I rewrote the comment about the macro, adding a note about its side effects.
msg142720 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-22 12:13
It turned out that this can't be fixed in 2.7 unless we backport the patch in #5127 (it's in 3.2/3.3 but not in 2.7).

IIUC the macro works fine and joins surrogate pairs to a Py_UCS4 char, but since the Py_UNICODE_IS* macros still expect Py_UCS2 on narrow builds on 2.7, the higher bits gets truncated and the macros return wrong results.

So, for example
    >>> u'\ud800\udc42'.isupper()
    True
because \ud800 + \udc42 = \U000100429  →  \U000100429 gets truncated to \u0429  →  \u0429 is the CYRILLIC CAPITAL LETTER SHCHA  →  .isupper() returns True.

The current behavior is instead broken in another way, because it checks that u'\ud800'.isupper() and u'\udc42'.isupper() separately.

Would it make sense to backport #5127 or should I just give up and leave it broken?
msg142736 - (view) Author: Roundup Robot (python-dev) Date: 2011-08-22 17:31
New changeset 06b30c5bcc3d by Ezio Melotti in branch '3.2':
#9200: The str.is* methods now work with strings that contain non-BMP characters even in narrow Unicode builds.
http://hg.python.org/cpython/rev/06b30c5bcc3d

New changeset e3be2941c834 by Ezio Melotti in branch 'default':
#9200: merge with 3.2.
http://hg.python.org/cpython/rev/e3be2941c834
msg142744 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2011-08-22 20:03
This issue and #5127 should not be backported to 2.7: narrow builds don't even accept unichar(0x10000).
Only python 3 can slowly pretend to implement utf-16 features.
msg142745 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-08-22 20:20
> This issue and #5127 should not be backported to 2.7:
> narrow builds don't even accept unichar(0x10000).

I agee.
msg142746 - (view) Author: Roundup Robot (python-dev) Date: 2011-08-22 20:46
New changeset 75a4941d4d61 by Ezio Melotti in branch '2.7':
#9200: backport tests but run them on wide builds only.
http://hg.python.org/cpython/rev/75a4941d4d61
msg142747 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-22 20:52
Backporting #5127 is not possible anyway, because it would be necessary to recompile.
I backported only the tests, skipping them on wide builds.
msg142748 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-22 20:55
s/skipping them on wide builds/skipping them on narrow builds/
On wide builds they work fine, on narrow builds they don't work and they can't be fixed.
History
Date User Action Args
2011-08-22 20:55:17ezio.melottisetmessages: + msg142748
2011-08-22 20:52:25ezio.melottisetstatus: open -> closed
versions: - Python 2.7
messages: + msg142747

dependencies: - Py_UNICODE_NEXT and other macros for surrogates
resolution: fixed
stage: commit review -> resolved
2011-08-22 20:46:41python-devsetmessages: + msg142746
2011-08-22 20:20:51hayposetmessages: + msg142745
2011-08-22 20:03:23amaury.forgeotdarcsetmessages: + msg142744
2011-08-22 17:31:32python-devsetnosy: + python-dev
messages: + msg142736
2011-08-22 12:13:58ezio.melottisetmessages: + msg142720
title: Make str methods work with non-BMP chars on narrow builds -> Make the str.is* methods work with non-BMP chars on narrow builds
2011-08-22 11:08:07ezio.melottisetassignee: belopolsky -> ezio.melotti
2011-08-19 15:55:02ezio.melottisetfiles: + issue9200-2.diff

stage: patch review -> commit review
messages: + msg142468
versions: + Python 2.7, Python 3.3
2011-08-18 16:38:11lemburgsetmessages: + msg142363
2011-08-18 16:18:45hayposetnosy: + haypo
messages: + msg142355
2011-08-18 13:59:04ezio.melottisetfiles: + issue9200.diff
2011-08-18 13:58:54ezio.melottisetfiles: - issue9200.diff
2011-08-18 13:57:53ezio.melottisetmessages: + msg142318
2011-08-18 13:49:52ezio.melottisetfiles: + issue9200.diff

messages: + msg142316
2011-08-15 19:35:01Arfreversetnosy: + Arfrever
2011-08-15 11:02:22ezio.melottisetnosy: + tchrist
2011-08-15 11:01:48ezio.melottisetmessages: + msg142116
title: str.isprintable() is always False for large code points -> Make str methods work with non-BMP chars on narrow builds
2011-08-15 10:59:05ezio.melottilinkissue12730 superseder
2010-11-27 01:57:14belopolskysetfiles: + issue9200.diff

messages: + msg122498
2010-11-26 16:35:11belopolskysetassignee: belopolsky
dependencies: + Py_UNICODE_NEXT and other macros for surrogates
type: behavior
components: + Interpreter Core

nosy: + belopolsky
messages: + msg122465
stage: patch review
2010-07-09 17:13:08amaury.forgeotdarcsetfiles: + join-surrogates.patch
keywords: + patch
messages: + msg109775
2010-07-09 16:38:12Rhamphoryncussetnosy: + Rhamphoryncus
messages: + msg109766
2010-07-08 13:50:01amaury.forgeotdarccreate