msg109542 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2010-07-08 13:50 |
On narrow unicode builds:
unicodedata.category(chr(0x10000)) == 'Lo' # correct
Py_UNICODE_ISPRINTABLE(0x10000) == 1 # correct
str.isprintable(chr(0x10000)) == False # inconsistent
On narrow unicode builds, large code points are stored with a surrogate pair. But str.isprintable() simply loops over the Py_UNICODE array, and test the surrogates separately.
There should be a way to walk a unicode string in C, character by character, and the str methods (str.is*, str.to*) should use it.
|
msg109766 - (view) |
Author: Adam Olsen (Rhamphoryncus) |
Date: 2010-07-09 16:38 |
There should be a way to walk the unicode string in Python too. Afaik there isn't.
|
msg109775 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2010-07-09 17:13 |
A "proof of concept" patch, which shows the macros used to walk a unicode string and uses them in unicode_repr() (should not change behaviour) and in unicode_isprintable() (should fix the issue).
Other functions should be changed the same way, of course.
|
msg122465 - (view) |
Author: Alexander Belopolsky (belopolsky) *  |
Date: 2010-11-26 16:35 |
AFAICT, all ctype methods (isalpha, isdigit, etc.) have the same problem. I posted a patch at issue10542 that introduces a Py_UNICODE_NEXT() macro that can help fixing all these methods. I am adding #10542 as a dependency and if there are no objections, I will change the title to extend the scope of this issue to cover all ctype methods.
|
msg122498 - (view) |
Author: Alexander Belopolsky (belopolsky) *  |
Date: 2010-11-27 01:57 |
Attached patch fixes isprintable and other ctype-like methods. I left isspace() out for now because I could not find a test character outside of BMP to test with, but I think we should fix that for completeness as well.
At this point the goal is mostly to showcase Py_UNICODE_NEXT(), not completeness. See issue10542.
I also noticed that upper/lower/swapcase methods have the same issue:
>>> '\N{MATHEMATICAL BOLD CAPITAL A}'.lower() == '\N{MATHEMATICAL BOLD CAPITAL A}'
True
This will be a subject of a separate issue.
|
msg142116 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2011-08-15 11:01 |
I closed #12730 as a duplicate of this and updated the title of this issue.
|
msg142316 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2011-08-18 13:49 |
The attached patch fixes all the str.is* methods and makes them work on narrow builds with non-BMP chars. It also includes the _Py_UNICODE_NEXT macro proposed in #10542.
|
msg142318 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2011-08-18 13:57 |
(Note: I copied the macros from the other patch without changing the name. If the approach is good I'll get rid of the prefixes and separate the words in IS{HIGH|LOW}SURROGATE with an _.)
|
msg142355 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2011-08-18 16:18 |
I don't think that macros specific to unicodeobject.c should get the _PY_UNICODE_ prefix. "_Py_" prefix is reserved to exported symbols, but symbols reserved for the Python interprefer itself. For _Py_UNICODE_NEXT, you can call it NEXT_CHARACTER().
_Py_UNICODE_ISHIGHSURROGATE,_Py_UNICODE_ISLOWSURROGATE and _Py_UNICODE_JOIN_SURROGATES are only used once, I would prefer to see them inlined in _Py_UNICODE_NEXT.
The first cast to Py_UCS4 in _Py_UNICODE_JOIN_SURROGATES is useless.
It looks like the macro can be simplified to something like:
#define _Py_UNICODE_NEXT(ptr, end) \
(_Py_UNICODE_ISHIGHSURROGATE(*(ptr)) && (ptr) < (end)) && _Py_UNICODE_ISLOWSURROGATE((ptr)[1] ? \
((ptr) += 2,_Py_UNICODE_JOIN_SURROGATES((ptr)[-2], (ptr)[-1])) : \
(Py_UCS4)*(ptr)++)
(you don't need two "a?b:c")
I would prefer to see _Py_UNICODE_NEXT as a function instead of a macro because it has border effect (ptr++ or ptr += 2). You cannot write Py_UNICODE_ISALNUM(_Py_UNICODE_NEXT(p, e)) for example, because Py_UNICODE_ISALNUM is defined as:
#define Py_UNICODE_ISALNUM(ch) (Py_UNICODE_ISALPHA(ch) || Py_UNICODE_ISDECIMAL(ch) || Py_UNICODE_ISDIGIT(ch) || Py_UNICODE_ISNUMERIC(ch))
And so _Py_UNICODE_NEXT is expanded 4 times. It is also horrible to debug a program having such long macro. If you really want to keep it as a macro, please add a least a big warning.
It's also confusing to have two attachments with the same name :-/
|
msg142363 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2011-08-18 16:38 |
STINNER Victor wrote:
>
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>
> I don't think that macros specific to unicodeobject.c should get the _PY_UNICODE_ prefix. "_Py_" prefix is reserved to exported symbols, but symbols reserved for the Python interprefer itself. For _Py_UNICODE_NEXT, you can call it NEXT_CHARACTER().
Can we please stick to the things we discussed on issue10542.
The macros are intended to be public starting with 3.3, not private
and they are meant to be used outside the interpreter as well.
They will only be private for patch level release patches. For those
you don't need the Py-prefix, but it also doesn't hurt having it
there.
> _Py_UNICODE_ISHIGHSURROGATE,_Py_UNICODE_ISLOWSURROGATE and _Py_UNICODE_JOIN_SURROGATES are only used once, I would prefer to see them inlined in _Py_UNICODE_NEXT.
>
> The first cast to Py_UCS4 in _Py_UNICODE_JOIN_SURROGATES is useless.
>
> It looks like the macro can be simplified to something like:
>
> #define _Py_UNICODE_NEXT(ptr, end) \
> (_Py_UNICODE_ISHIGHSURROGATE(*(ptr)) && (ptr) < (end)) && _Py_UNICODE_ISLOWSURROGATE((ptr)[1] ? \
> ((ptr) += 2,_Py_UNICODE_JOIN_SURROGATES((ptr)[-2], (ptr)[-1])) : \
> (Py_UCS4)*(ptr)++)
>
> (you don't need two "a?b:c")
>
> I would prefer to see _Py_UNICODE_NEXT as a function instead of a macro because it has border effect (ptr++ or ptr += 2). You cannot write Py_UNICODE_ISALNUM(_Py_UNICODE_NEXT(p, e)) for example, because Py_UNICODE_ISALNUM is defined as:
> #define Py_UNICODE_ISALNUM(ch) (Py_UNICODE_ISALPHA(ch) || Py_UNICODE_ISDECIMAL(ch) || Py_UNICODE_ISDIGIT(ch) || Py_UNICODE_ISNUMERIC(ch))
>
> And so _Py_UNICODE_NEXT is expanded 4 times. It is also horrible to debug a program having such long macro. If you really want to keep it as a macro, please add a least a big warning.
Having it as function would kill the performance advantage, so that's not
really an option. The use case you mention is also not really realistic.
Adding an extra warning to the macro version is still a good idea, though.
|
msg142468 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2011-08-19 15:54 |
Here's a new version of the patch.
I decided to leave the prefix anyway, for consistency with what I'll commit to 3.3 and because without the prefix NEXT() looks ambiguous (and it's not entirely clear if it's private or not).
I rewrote the macro as Victor suggested and tested that it still works (I also added a test with surrogates).
The macros are now called _Py_UNICODE_IS_{LOW|HIGH}_SURROGATE, with '_'s. I also tried the implementation proposed in #12751 and benchmarked with:
$ ./python -m timeit -s 's = "\uD800\uD8000\uDFFF\uDFFF\uDFFF"*1000' 's.islower()'
and got "1000 loops, best of 3: 345 usec per loop" on both, so I left the old version because I think it's more readable.
Finally, I rewrote the comment about the macro, adding a note about its side effects.
|
msg142720 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2011-08-22 12:13 |
It turned out that this can't be fixed in 2.7 unless we backport the patch in #5127 (it's in 3.2/3.3 but not in 2.7).
IIUC the macro works fine and joins surrogate pairs to a Py_UCS4 char, but since the Py_UNICODE_IS* macros still expect Py_UCS2 on narrow builds on 2.7, the higher bits gets truncated and the macros return wrong results.
So, for example
>>> u'\ud800\udc42'.isupper()
True
because \ud800 + \udc42 = \U000100429 → \U000100429 gets truncated to \u0429 → \u0429 is the CYRILLIC CAPITAL LETTER SHCHA → .isupper() returns True.
The current behavior is instead broken in another way, because it checks that u'\ud800'.isupper() and u'\udc42'.isupper() separately.
Would it make sense to backport #5127 or should I just give up and leave it broken?
|
msg142736 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2011-08-22 17:31 |
New changeset 06b30c5bcc3d by Ezio Melotti in branch '3.2':
#9200: The str.is* methods now work with strings that contain non-BMP characters even in narrow Unicode builds.
http://hg.python.org/cpython/rev/06b30c5bcc3d
New changeset e3be2941c834 by Ezio Melotti in branch 'default':
#9200: merge with 3.2.
http://hg.python.org/cpython/rev/e3be2941c834
|
msg142744 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2011-08-22 20:03 |
This issue and #5127 should not be backported to 2.7: narrow builds don't even accept unichar(0x10000).
Only python 3 can slowly pretend to implement utf-16 features.
|
msg142745 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2011-08-22 20:20 |
> This issue and #5127 should not be backported to 2.7:
> narrow builds don't even accept unichar(0x10000).
I agee.
|
msg142746 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2011-08-22 20:46 |
New changeset 75a4941d4d61 by Ezio Melotti in branch '2.7':
#9200: backport tests but run them on wide builds only.
http://hg.python.org/cpython/rev/75a4941d4d61
|
msg142747 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2011-08-22 20:52 |
Backporting #5127 is not possible anyway, because it would be necessary to recompile.
I backported only the tests, skipping them on wide builds.
|
msg142748 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2011-08-22 20:55 |
s/skipping them on wide builds/skipping them on narrow builds/
On wide builds they work fine, on narrow builds they don't work and they can't be fixed.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:57:03 | admin | set | github: 53446 |
2011-08-22 20:55:17 | ezio.melotti | set | messages:
+ msg142748 |
2011-08-22 20:52:25 | ezio.melotti | set | status: open -> closed versions:
- Python 2.7 messages:
+ msg142747
dependencies:
- Py_UNICODE_NEXT and other macros for surrogates resolution: fixed stage: commit review -> resolved |
2011-08-22 20:46:41 | python-dev | set | messages:
+ msg142746 |
2011-08-22 20:20:51 | vstinner | set | messages:
+ msg142745 |
2011-08-22 20:03:23 | amaury.forgeotdarc | set | messages:
+ msg142744 |
2011-08-22 17:31:32 | python-dev | set | nosy:
+ python-dev messages:
+ msg142736
|
2011-08-22 12:13:58 | ezio.melotti | set | messages:
+ msg142720 title: Make str methods work with non-BMP chars on narrow builds -> Make the str.is* methods work with non-BMP chars on narrow builds |
2011-08-22 11:08:07 | ezio.melotti | set | assignee: belopolsky -> ezio.melotti |
2011-08-19 15:55:02 | ezio.melotti | set | files:
+ issue9200-2.diff
stage: patch review -> commit review messages:
+ msg142468 versions:
+ Python 2.7, Python 3.3 |
2011-08-18 16:38:11 | lemburg | set | messages:
+ msg142363 |
2011-08-18 16:18:45 | vstinner | set | nosy:
+ vstinner messages:
+ msg142355
|
2011-08-18 13:59:04 | ezio.melotti | set | files:
+ issue9200.diff |
2011-08-18 13:58:54 | ezio.melotti | set | files:
- issue9200.diff |
2011-08-18 13:57:53 | ezio.melotti | set | messages:
+ msg142318 |
2011-08-18 13:49:52 | ezio.melotti | set | files:
+ issue9200.diff
messages:
+ msg142316 |
2011-08-15 19:35:01 | Arfrever | set | nosy:
+ Arfrever
|
2011-08-15 11:02:22 | ezio.melotti | set | nosy:
+ tchrist
|
2011-08-15 11:01:48 | ezio.melotti | set | messages:
+ msg142116 title: str.isprintable() is always False for large code points -> Make str methods work with non-BMP chars on narrow builds |
2011-08-15 10:59:05 | ezio.melotti | link | issue12730 superseder |
2010-11-27 01:57:14 | belopolsky | set | files:
+ issue9200.diff
messages:
+ msg122498 |
2010-11-26 16:35:11 | belopolsky | set | assignee: belopolsky dependencies:
+ Py_UNICODE_NEXT and other macros for surrogates type: behavior components:
+ Interpreter Core
nosy:
+ belopolsky messages:
+ msg122465 stage: patch review |
2010-07-09 17:13:08 | amaury.forgeotdarc | set | files:
+ join-surrogates.patch keywords:
+ patch messages:
+ msg109775
|
2010-07-09 16:38:12 | Rhamphoryncus | set | nosy:
+ Rhamphoryncus messages:
+ msg109766
|
2010-07-08 13:50:01 | amaury.forgeotdarc | create | |