Issue 9200: Make the str.is* methods work with non-BMP chars on narrow builds

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53446

classification

Title:	Make the str.is* methods work with non-BMP chars on narrow builds
Type:	behavior	Stage:	resolved
Components:	Interpreter Core, Unicode	Versions:	Python 3.2, Python 3.3

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	ezio.melotti	Nosy List:	Arfrever, Rhamphoryncus, amaury.forgeotdarc, belopolsky, ezio.melotti, lemburg, python-dev, tchrist, vstinner
Priority:	normal	Keywords:	patch

Created on 2010-07-08 13:50 by amaury.forgeotdarc, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
join-surrogates.patch	amaury.forgeotdarc, 2010-07-09 17:13		review
issue9200.diff	belopolsky, 2010-11-27 01:57		review
issue9200.diff	ezio.melotti, 2011-08-18 13:59		review
issue9200-2.diff	ezio.melotti, 2011-08-19 15:54		review

Messages (18)
msg109542 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-07-08 13:50
On narrow unicode builds: unicodedata.category(chr(0x10000)) == 'Lo' # correct Py_UNICODE_ISPRINTABLE(0x10000) == 1 # correct str.isprintable(chr(0x10000)) == False # inconsistent On narrow unicode builds, large code points are stored with a surrogate pair. But str.isprintable() simply loops over the Py_UNICODE array, and test the surrogates separately. There should be a way to walk a unicode string in C, character by character, and the str methods (str.is, str.to) should use it.
msg109766 - (view)	Author: Adam Olsen (Rhamphoryncus)	Date: 2010-07-09 16:38
There should be a way to walk the unicode string in Python too. Afaik there isn't.
msg109775 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-07-09 17:13
A "proof of concept" patch, which shows the macros used to walk a unicode string and uses them in unicode_repr() (should not change behaviour) and in unicode_isprintable() (should fix the issue). Other functions should be changed the same way, of course.
msg122465 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-26 16:35
AFAICT, all ctype methods (isalpha, isdigit, etc.) have the same problem. I posted a patch at issue10542 that introduces a Py_UNICODE_NEXT() macro that can help fixing all these methods. I am adding #10542 as a dependency and if there are no objections, I will change the title to extend the scope of this issue to cover all ctype methods.
msg122498 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-27 01:57
Attached patch fixes isprintable and other ctype-like methods. I left isspace() out for now because I could not find a test character outside of BMP to test with, but I think we should fix that for completeness as well. At this point the goal is mostly to showcase Py_UNICODE_NEXT(), not completeness. See issue10542. I also noticed that upper/lower/swapcase methods have the same issue: >>> '\N{MATHEMATICAL BOLD CAPITAL A}'.lower() == '\N{MATHEMATICAL BOLD CAPITAL A}' True This will be a subject of a separate issue.
msg142116 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-08-15 11:01
I closed #12730 as a duplicate of this and updated the title of this issue.
msg142316 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-08-18 13:49
The attached patch fixes all the str.is* methods and makes them work on narrow builds with non-BMP chars. It also includes the _Py_UNICODE_NEXT macro proposed in #10542.
msg142318 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-08-18 13:57
(Note: I copied the macros from the other patch without changing the name. If the approach is good I'll get rid of the prefixes and separate the words in IS{HIGH\|LOW}SURROGATE with an _.)
msg142355 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-08-18 16:18
I don't think that macros specific to unicodeobject.c should get the _PY_UNICODE_ prefix. "_Py_" prefix is reserved to exported symbols, but symbols reserved for the Python interprefer itself. For _Py_UNICODE_NEXT, you can call it NEXT_CHARACTER(). _Py_UNICODE_ISHIGHSURROGATE,_Py_UNICODE_ISLOWSURROGATE and _Py_UNICODE_JOIN_SURROGATES are only used once, I would prefer to see them inlined in _Py_UNICODE_NEXT. The first cast to Py_UCS4 in _Py_UNICODE_JOIN_SURROGATES is useless. It looks like the macro can be simplified to something like: #define _Py_UNICODE_NEXT(ptr, end) \ (_Py_UNICODE_ISHIGHSURROGATE((ptr)) && (ptr) < (end)) && _Py_UNICODE_ISLOWSURROGATE((ptr)[1] ? \ ((ptr) += 2,_Py_UNICODE_JOIN_SURROGATES((ptr)[-2], (ptr)[-1])) : \ (Py_UCS4)(ptr)++) (you don't need two "a?b:c") I would prefer to see _Py_UNICODE_NEXT as a function instead of a macro because it has border effect (ptr++ or ptr += 2). You cannot write Py_UNICODE_ISALNUM(_Py_UNICODE_NEXT(p, e)) for example, because Py_UNICODE_ISALNUM is defined as: #define Py_UNICODE_ISALNUM(ch) (Py_UNICODE_ISALPHA(ch) \|\| Py_UNICODE_ISDECIMAL(ch) \|\| Py_UNICODE_ISDIGIT(ch) \|\| Py_UNICODE_ISNUMERIC(ch)) And so _Py_UNICODE_NEXT is expanded 4 times. It is also horrible to debug a program having such long macro. If you really want to keep it as a macro, please add a least a big warning. It's also confusing to have two attachments with the same name :-/
msg142363 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2011-08-18 16:38
STINNER Victor wrote: > > STINNER Victor <victor.stinner@haypocalc.com> added the comment: > > I don't think that macros specific to unicodeobject.c should get the _PY_UNICODE_ prefix. "_Py_" prefix is reserved to exported symbols, but symbols reserved for the Python interprefer itself. For _Py_UNICODE_NEXT, you can call it NEXT_CHARACTER(). Can we please stick to the things we discussed on issue10542. The macros are intended to be public starting with 3.3, not private and they are meant to be used outside the interpreter as well. They will only be private for patch level release patches. For those you don't need the Py-prefix, but it also doesn't hurt having it there. > _Py_UNICODE_ISHIGHSURROGATE,_Py_UNICODE_ISLOWSURROGATE and _Py_UNICODE_JOIN_SURROGATES are only used once, I would prefer to see them inlined in _Py_UNICODE_NEXT. > > The first cast to Py_UCS4 in _Py_UNICODE_JOIN_SURROGATES is useless. > > It looks like the macro can be simplified to something like: > > #define _Py_UNICODE_NEXT(ptr, end) \ > (_Py_UNICODE_ISHIGHSURROGATE((ptr)) && (ptr) < (end)) && _Py_UNICODE_ISLOWSURROGATE((ptr)[1] ? \ > ((ptr) += 2,_Py_UNICODE_JOIN_SURROGATES((ptr)[-2], (ptr)[-1])) : \ > (Py_UCS4)(ptr)++) > > (you don't need two "a?b:c") > > I would prefer to see _Py_UNICODE_NEXT as a function instead of a macro because it has border effect (ptr++ or ptr += 2). You cannot write Py_UNICODE_ISALNUM(_Py_UNICODE_NEXT(p, e)) for example, because Py_UNICODE_ISALNUM is defined as: > #define Py_UNICODE_ISALNUM(ch) (Py_UNICODE_ISALPHA(ch) \|\| Py_UNICODE_ISDECIMAL(ch) \|\| Py_UNICODE_ISDIGIT(ch) \|\| Py_UNICODE_ISNUMERIC(ch)) > > And so _Py_UNICODE_NEXT is expanded 4 times. It is also horrible to debug a program having such long macro. If you really want to keep it as a macro, please add a least a big warning. Having it as function would kill the performance advantage, so that's not really an option. The use case you mention is also not really realistic. Adding an extra warning to the macro version is still a good idea, though.
msg142468 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-08-19 15:54
Here's a new version of the patch. I decided to leave the prefix anyway, for consistency with what I'll commit to 3.3 and because without the prefix NEXT() looks ambiguous (and it's not entirely clear if it's private or not). I rewrote the macro as Victor suggested and tested that it still works (I also added a test with surrogates). The macros are now called _Py_UNICODE_IS_{LOW\|HIGH}_SURROGATE, with '_'s. I also tried the implementation proposed in #12751 and benchmarked with: $ ./python -m timeit -s 's = "\uD800\uD8000\uDFFF\uDFFF\uDFFF"*1000' 's.islower()' and got "1000 loops, best of 3: 345 usec per loop" on both, so I left the old version because I think it's more readable. Finally, I rewrote the comment about the macro, adding a note about its side effects.
msg142720 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-08-22 12:13
It turned out that this can't be fixed in 2.7 unless we backport the patch in #5127 (it's in 3.2/3.3 but not in 2.7). IIUC the macro works fine and joins surrogate pairs to a Py_UCS4 char, but since the Py_UNICODE_IS* macros still expect Py_UCS2 on narrow builds on 2.7, the higher bits gets truncated and the macros return wrong results. So, for example >>> u'\ud800\udc42'.isupper() True because \ud800 + \udc42 = \U000100429 → \U000100429 gets truncated to \u0429 → \u0429 is the CYRILLIC CAPITAL LETTER SHCHA → .isupper() returns True. The current behavior is instead broken in another way, because it checks that u'\ud800'.isupper() and u'\udc42'.isupper() separately. Would it make sense to backport #5127 or should I just give up and leave it broken?
msg142736 - (view)	Author: Roundup Robot (python-dev)	Date: 2011-08-22 17:31
New changeset 06b30c5bcc3d by Ezio Melotti in branch '3.2': #9200: The str.is* methods now work with strings that contain non-BMP characters even in narrow Unicode builds. http://hg.python.org/cpython/rev/06b30c5bcc3d New changeset e3be2941c834 by Ezio Melotti in branch 'default': #9200: merge with 3.2. http://hg.python.org/cpython/rev/e3be2941c834
msg142744 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2011-08-22 20:03
This issue and #5127 should not be backported to 2.7: narrow builds don't even accept unichar(0x10000). Only python 3 can slowly pretend to implement utf-16 features.
msg142745 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-08-22 20:20
> This issue and #5127 should not be backported to 2.7: > narrow builds don't even accept unichar(0x10000). I agee.
msg142746 - (view)	Author: Roundup Robot (python-dev)	Date: 2011-08-22 20:46
New changeset 75a4941d4d61 by Ezio Melotti in branch '2.7': #9200: backport tests but run them on wide builds only. http://hg.python.org/cpython/rev/75a4941d4d61
msg142747 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-08-22 20:52
Backporting #5127 is not possible anyway, because it would be necessary to recompile. I backported only the tests, skipping them on wide builds.
msg142748 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-08-22 20:55
s/skipping them on wide builds/skipping them on narrow builds/ On wide builds they work fine, on narrow builds they don't work and they can't be fixed.

History
Date	User	Action	Args
2022-04-11 14:57:03	admin	set	github: 53446
2011-08-22 20:55:17	ezio.melotti	set	messages: + msg142748
2011-08-22 20:52:25	ezio.melotti	set	status: open -> closed versions: - Python 2.7 messages: + msg142747 dependencies: - Py_UNICODE_NEXT and other macros for surrogates resolution: fixed stage: commit review -> resolved
2011-08-22 20:46:41	python-dev	set	messages: + msg142746
2011-08-22 20:20:51	vstinner	set	messages: + msg142745
2011-08-22 20:03:23	amaury.forgeotdarc	set	messages: + msg142744
2011-08-22 17:31:32	python-dev	set	nosy: + python-dev messages: + msg142736
2011-08-22 12:13:58	ezio.melotti	set	messages: + msg142720 title: Make str methods work with non-BMP chars on narrow builds -> Make the str.is* methods work with non-BMP chars on narrow builds
2011-08-22 11:08:07	ezio.melotti	set	assignee: belopolsky -> ezio.melotti
2011-08-19 15:55:02	ezio.melotti	set	files: + issue9200-2.diff stage: patch review -> commit review messages: + msg142468 versions: + Python 2.7, Python 3.3
2011-08-18 16:38:11	lemburg	set	messages: + msg142363
2011-08-18 16:18:45	vstinner	set	nosy: + vstinner messages: + msg142355
2011-08-18 13:59:04	ezio.melotti	set	files: + issue9200.diff
2011-08-18 13:58:54	ezio.melotti	set	files: - issue9200.diff
2011-08-18 13:57:53	ezio.melotti	set	messages: + msg142318
2011-08-18 13:49:52	ezio.melotti	set	files: + issue9200.diff messages: + msg142316
2011-08-15 19:35:01	Arfrever	set	nosy: + Arfrever
2011-08-15 11:02:22	ezio.melotti	set	nosy: + tchrist
2011-08-15 11:01:48	ezio.melotti	set	messages: + msg142116 title: str.isprintable() is always False for large code points -> Make str methods work with non-BMP chars on narrow builds
2011-08-15 10:59:05	ezio.melotti	link	issue12730 superseder
2010-11-27 01:57:14	belopolsky	set	files: + issue9200.diff messages: + msg122498
2010-11-26 16:35:11	belopolsky	set	assignee: belopolsky dependencies: + Py_UNICODE_NEXT and other macros for surrogates type: behavior components: + Interpreter Core nosy: + belopolsky messages: + msg122465 stage: patch review
2010-07-09 17:13:08	amaury.forgeotdarc	set	files: + join-surrogates.patch keywords: + patch messages: + msg109775
2010-07-09 16:38:12	Rhamphoryncus	set	nosy: + Rhamphoryncus messages: + msg109766
2010-07-08 13:50:01	amaury.forgeotdarc	create