Issue 5127: Use Py_UCS4 instead of Py_UNICODE in unicodectype.c

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/49377

classification

Title:	Use Py_UCS4 instead of Py_UNICODE in unicodectype.c
Type:	behavior	Stage:	resolved
Components:	Unicode	Versions:	Python 3.1, Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:	narrow build incorrectly translates cases for non-BMP code points View: 7663
Assigned To:	amaury.forgeotdarc	Nosy List:	Rhamphoryncus, amaury.forgeotdarc, bupjae, ezio.melotti, lemburg, vstinner
Priority:	normal	Keywords:	patch

Created on 2009-02-02 01:31 by bupjae, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
unicodectype_ucs4.patch	amaury.forgeotdarc, 2009-02-03 23:25
unicodectype_ucs4-2.patch	amaury.forgeotdarc, 2009-10-05 12:28
unicodectype_ucs4_3.patch	amaury.forgeotdarc, 2009-10-06 22:02	refreshed; don't forget to run makeunicodedata.py
unicodectype_ucs4_4.patch	amaury.forgeotdarc, 2010-07-08 22:27
unicodectype_ucs4_5.patch	amaury.forgeotdarc, 2010-07-08 23:18
unicodectype_ucs4_6.patch	amaury.forgeotdarc, 2010-07-09 21:00

Messages (54)
msg80924 - (view)	Author: Bupjae Lee (bupjae)	Date: 2009-02-02 01:31
>>> license Traceback (most recent call last): File "<stdin>", line 1, in <module> File "C:\Python30\lib\site.py", line 372, in __repr__ self.__setup() File "C:\Python30\lib\site.py", line 359, in __setup data = fp.read() File "C:\Python30\lib\io.py", line 1724, in read decoder.decode(self.buffer.read(), final=True)) File "C:\Python30\lib\io.py", line 1295, in decode output = self.decoder.decode(input, final=final) UnicodeDecodeError: 'cp949' codec can't decode bytes in position 15164- 15165: il legal multibyte sequence >>> chr(0x10000) '\U00010000' >>> chr(0x11000) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "C:\Python30\lib\io.py", line 1491, in write b = encoder.encode(s) UnicodeEncodeError: 'cp949' codec can't encode character '\ud804' in position 1: illegal multibyte sequence >>> I also can't understand why chr(0x10000) and chr(0x11000) has different behavior
msg80928 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2009-02-02 01:56
Here (winxpsp2, Py3, cp850-terminal) the license works fine: >>> license Type license() to see the full license text and license() works as well. I get this output for the chr()s: >>> chr(0x10000) '\U00010000' >>> chr(0x11000) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "C:\Programs\Python30\lib\io.py", line 1491, in write b = encoder.encode(s) File "C:\Programs\Python30\lib\encodings\cp850.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_map)[0] UnicodeEncodeError: 'charmap' codec can't encode characters in position 1-2: character maps to <undefined> I believe that chr(0x10000) and chr(0x11000) should have the opposite behavior. U+10000 (LINEAR B SYLLABLE B008 A) belongs to the 'Lo' category and should be printed (and possibly raise a UnicodeError, see issue5110 [1]), U+11000 belongs to the 'Cn' category and should be escaped[2]. On Linux with Py3 and a UTF-8 terminal, chr(0x10000) prints '\U00010000' and chr(0x11000) prints the char (actually I see two boxes, but it shouldn't be a problem of Python). The license() works fine too. Also note that with cp850 the error message is 'character maps to <undefined>' and with cp949 is 'illegal multibyte sequence'. [1]: http://bugs.python.org/issue5110 [2]: http://www.python.org/dev/peps/pep-3138/#specification
msg80971 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2009-02-02 18:05
There were non-ascii characters in the Windows license file. This was corrected with r67860. > I believe that chr(0x10000) and chr(0x11000) should have the > opposite behavior. This other problem is because on a narrow unicode build, Py_UNICODE_ISPRINTABLE takes a 16bit integer. And indeed, >>> unicodedata.category(chr(0x10000 % 65536)) 'Cc' >>> unicodedata.category(chr(0x11000 % 65536)) 'Lo'
msg81044 - (view)	Author: STINNER Victor (vstinner) *	Date: 2009-02-03 11:34
I don't understand the behaviour of unichr(): Python 2.7a0 (trunk:68963M, Jan 30 2009, 00:49:28) >>> import unicodedata >>> unicodedata.category(u"\U00010000") 'Lo' >>> unicodedata.category(u"\U00011000") 'Cn' >>> unicodedata.category(unichr(0x10000)) Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: unichr() arg not in range(0x10000) (narrow Python build) Why unichr() fails whereas \Uxxxxxxxx works? >>> len(u"\U00010000") 2 >>> ord(u"\U00010000") Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: ord() expected a character, but string of length 2 found
msg81048 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2009-02-03 12:26
FWIW, on Python3 it seems to work: >>> import unicodedata >>> unicodedata.category("\U00010000") 'Lo' >>> unicodedata.category("\U00011000") 'Cn' >>> unicodedata.category(chr(0x10000)) 'Lo' >>> unicodedata.category(chr(0x11000)) 'Cn' >>> ord(chr(0x10000)), 0x10000 (65536, 65536) >>> ord(chr(0x11000)), 0x11000 (69632, 69632) I'm using a narrow build too: >>> import sys >>> sys.maxunicode 65535 >>> len('\U00010000') 2 >>> ord('\U00010000') 65536 On Python2 unichr() is supposed to raise a ValueError on a narrow build if the value is greater than 0xFFFF [1], but if the characters above 0xFFFF can be represented with u"\Uxxxxxxxx" there should be a way to fix unichr so it can return them. Python3 already does it with chr(). Maybe we should open a new issue for this if it's not present already. [1]: http://docs.python.org/library/functions.html#unichr
msg81049 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2009-02-03 12:39
Since r56395, ord() and chr() accept and return surrogate pairs even in narrow builds. The goal is to remove most differences between narrow and wide unicode builds (except for string lengths, indices or slices) To address this problem, I suggest to change all functions in unicodectype.c so that they accept Py_UCS4 characters (instead of Py_UNICODE). This would be a binary-incompatible change; and --with-wctype-functions would have an effect only if sizeof(wchar_t)==4 (instead of the current condition sizeof(wchar_t)==sizeof(PY_UNICODE_TYPE))
msg81051 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-02-03 13:12
On 2009-02-03 13:39, Amaury Forgeot d'Arc wrote: > Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment: > > Since r56395, ord() and chr() accept and return surrogate pairs even in > narrow builds. > > The goal is to remove most differences between narrow and wide unicode > builds (except for string lengths, indices or slices) > > To address this problem, I suggest to change all functions in > unicodectype.c so that they accept Py_UCS4 characters (instead of > Py_UNICODE). -1. That would cause major breakage in the C API and is not inline with the intention of having a Py_UNICODE type in the first place. Users who are interested in UCS4 builds should simply use UCS4 builds. > This would be a binary-incompatible change; and --with-wctype-functions > would have an effect only if sizeof(wchar_t)==4 (instead of the current > condition sizeof(wchar_t)==sizeof(PY_UNICODE_TYPE)) --with-wctype-functions was scheduled for removal many releases ago, but I never got around to it. The only reason it's still there is that some Linux distribution use this config option (AFAIR, RedHat). I'd be +1 on removing the option in 3.0.1 or deprecating it in 3.0.1 and removing it in 3.1. It's not useful in any way, and causes compatibility problems with regular builds.
msg81052 - (view)	Author: STINNER Victor (vstinner) *	Date: 2009-02-03 13:14
amaury> Since r56395, ord() and chr() accept and return surrogate pairs amaury> even in narrow builds. Note: My examples are made with Python 2.x. > The goal is to remove most differences between narrow and wide unicode > builds (except for string lengths, indices or slices) It would be nice to get the same behaviour in Python 2.x and 3.x to help migration from Python2 to Python3 ;-) unichr() (in Python 2.x) documentation is correct. But I would approciate to support surrogates using unichr() which means also changing ord() behaviour. > To address this problem, I suggest to change all functions in > unicodectype.c so that they accept Py_UCS4 characters (instead of > Py_UNICODE). Why? Using surrogates, you can use 16-bits Py_UNICODE to store non-BMP characters (code > 0xffff). -- I can open a new issue if you agree that we can change unichr() / ord() behaviour on narrow build. We may ask on the mailing list?
msg81053 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-02-03 13:18
On 2009-02-03 14:14, STINNER Victor wrote: > STINNER Victor <victor.stinner@haypocalc.com> added the comment: > > amaury> Since r56395, ord() and chr() accept and return surrogate pairs > amaury> even in narrow builds. > > Note: My examples are made with Python 2.x. > >> The goal is to remove most differences between narrow and wide unicode >> builds (except for string lengths, indices or slices) > > It would be nice to get the same behaviour in Python 2.x and 3.x to help > migration from Python2 to Python3 ;-) > > unichr() (in Python 2.x) documentation is correct. But I would approciate to > support surrogates using unichr() which means also changing ord() behaviour. This is not possible for unichr() in Python 2.x, since applications always expect len(unichr(x)) == 1. Changing ord() would be possible in Python 2.x is easier, since this would only extend the range of returned values for UCS2 builds. >> To address this problem, I suggest to change all functions in >> unicodectype.c so that they accept Py_UCS4 characters (instead of >> Py_UNICODE). > > Why? Using surrogates, you can use 16-bits Py_UNICODE to store non-BMP > characters (code > 0xffff). > > -- > > I can open a new issue if you agree that we can change unichr() / ord() > behaviour on narrow build. We may ask on the mailing list?
msg81057 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2009-02-03 13:50
> That would cause major breakage in the C API Not if you recompile. I don't see how this breaks the API at the C level. > and is not inline with the intention of having a Py_UNICODE > type in the first place. Py_UNICODE is still used as the allocation unit for unicode strings. To get correct results, we need a way to access the whole unicode database even on ucs2 builds; it's possible with the unicodedata module, why not from C? My motivation for the change is this post: http://mail.python.org/pipermail/python-dev/2008-July/080900.html
msg81061 - (view)	Author: STINNER Victor (vstinner) *	Date: 2009-02-03 14:18
lemburg> This is not possible for unichr() in Python 2.x, since applications lemburg> always expect len(unichr(x)) == 1 Oh, ok. lemburg> Changing ord() would be possible in Python 2.x is easier, since lemburg> this would only extend the range of returned values for UCS2 lemburg> builds. ord() of Python3 (narrow build) rejects surrogate characters: '\U00010000' >>> len(chr(0x10000)) 2 >>> ord(0x10000) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: ord() expected string of length 1, but int found --- It looks that narrow builds with surrogates have some more problems... Test with U+10000: "LINEAR B SYLLABLE B008 A", category: Letter, Other. Correct result (Python 2.5, wide build): $ python Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40) >>> unichr(0x10000) u'\U00010000' >>> unichr(0x10000).isalpha() True Error in Python3 (narrow build): marge$ ./python Python 3.1a0 (py3k:69105M, Feb 3 2009, 15:04:35) >>> chr(0x10000).isalpha() False >>> list(chr(0x10000)) ['\ud800', '\udc00'] >>> chr(0xd800).isalpha() False >>> chr(0xdc00).isalpha() False Unicode ranges, all in the category "Other, Surrogate": - U+D800..U+DB7F: Non Private Use High Surrogate - U+DB80..U+DBFF: Private Use High Surrogate - U+DC00..U+DFFF: Low Surrogate" range
msg81081 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-02-03 18:47
On 2009-02-03 14:50, Amaury Forgeot d'Arc wrote: > Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment: > >> That would cause major breakage in the C API > > Not if you recompile. I don't see how this breaks the API at the C level. Well, then try to look at such a change from a C extension writer's perspective. They'd have to change all their function calls and routines to work with Py_UCS4. Supporting both the old API and the new one would be nearly impossible and require either an adapter API or a lot of #ifdef'ery. Please remember that the public Python C API is not only meant for Python developers. It's main purpose is for it to be used by other developers extending or embedding Python and those developers use different release cycles and want to support more than just the bleeding edge Python version. Python has a long history of providing very stable APIs, both in C and in Python. FWIW: The last major change in the C API (the change to Py_ssize_t from Python 2.4 to 2.5) has not even propogated to all major C extensions yet. It's only now that people start to realize problems with this, since their extensions start failing with segfaults on 64-bit machines. That said, we can of course provide additional UCS4 APIs for certain things and also provide conversion helpers between Py_UNICODE and Py_UCS4 where needed. >> and is not inline with the intention of having a Py_UNICODE >> type in the first place. > > Py_UNICODE is still used as the allocation unit for unicode strings. > > To get correct results, we need a way to access the whole unicode > database even on ucs2 builds; it's possible with the unicodedata module, > why not from C? I must be missing some detail, but what does the Unicode database have to do with the unicodeobject.c C API ? > My motivation for the change is this post: > http://mail.python.org/pipermail/python-dev/2008-July/080900.html There are certainly other ways to make Python deal with surrogates in more cases than the ones we already support.
msg81090 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2009-02-03 19:19
haypo> ord() of Python3 (narrow build) rejects surrogate characters: haypo> '\U00010000' haypo> >>> len(chr(0x10000)) haypo> 2 haypo> >>> ord(0x10000) haypo> TypeError: ord() expected string of length 1, but int found ord() works fine on Py3, you probably meant to do >>> ord('\U00010000') 65536 or >>> ord(chr(0x10000)) 65536 In Py3 is also stated that it accepts surrogate pairs (help(ord)). Py2 instead doesn't support them: >>> ord(u'\U00010000') TypeError: ord() expected a character, but string of length 2 found
msg81115 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2009-02-03 23:25
> I must be missing some detail, but what does the Unicode database > have to do with the unicodeobject.c C API ? Ah, now I understand your concerns. My suggestion is to change only the 20 functions in unicodectype.c: _PyUnicode_IsAlpha, _PyUnicode_ToLowercase... and no change in unicodeobject.c at all. They all take a single code point as argument, some also return a single code point. Changing these functions is backwards compatible. I join a patch so we can argue on concrete code (tests are missing). Another effect of the patch: unicodedata.numeric('\N{AEGEAN NUMBER TWO}') can return 2.0. The str.isalpha() (and others) methods did not change: they still split the surrogate pairs.
msg93562 - (view)	Author: Adam Olsen (Rhamphoryncus)	Date: 2009-10-04 21:46
Surrogates aren't optional features of UTF-16, we really need to get this fixed. That includes .isalpha(). We might keep the old public API for compatibility, but it should be clearly marked as broken for non-BMP scalar values. I don't see a problem with changing 2.x. The existing behaviour is broken for non-BMP scalar values, so surely nobody can claim dependence on it.
msg93585 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-10-05 09:03
Adam Olsen wrote: > > Adam Olsen <rhamph@gmail.com> added the comment: > > Surrogates aren't optional features of UTF-16, we really need to get > this fixed. That includes .isalpha(). We use UCS2 on narrow Python builds, not UTF-16. > We might keep the old public API for compatibility, but it should be > clearly marked as broken for non-BMP scalar values. That has always been the case. UCS2 doesn't support surrogates. However, we have been slowly moving into the direction of making the UCS2 storage appear like UTF-16 to the Python programmer. This process is not yet complete and will likely never complete since it must still be possible to create things line lone surrogates for processing purposes, so care has to be taken when using non-BMP code points on narrow builds. > I don't see a problem with changing 2.x. The existing behaviour is > broken for non-BMP scalar values, so surely nobody can claim dependence > on it. No, but changing the APIs from 16-bit integers to 32-bit integers does require a recompile of all code using it. Otherwise you end up with segfaults. Also, the Unicode type database itself uses Py_UNICODE, so case mapping would fail for non-BMP code points. So if we want to support accessing non-BMP type information on narrow builds, we'd need to change the complete Unicode type database API to work with UCS4 code points and then provide a backwards compatible C API using Py_UNICODE. Due to the UCS2/UCS4 API renaming done in unicodeobject.h, this would amount to exposing both the UCS2 and the UCS4 variants of the APIs on narrow builds. With such an approach we'd not break the binary API and still get the full UCS4 range of code points in the type database. The change would be possible in Python 2.x and 3.x (which now both use the same strategy w/r to change management). Would someone be willing to work on this ?
msg93587 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2009-10-05 09:59
> No, but changing the APIs from 16-bit integers to 32-bit integers > does require a recompile of all code using it. Is it acceptable between 3.1 and 3.2 for example? ISTM that other changes already require recompilation of extension modules. > Also, the Unicode type database itself uses Py_UNICODE, so > case mapping would fail for non-BMP code points. Where, please? in unicodedata.c, getuchar and _getrecord_ex use Py_UCS4.
msg93588 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-10-05 10:15
Amaury Forgeot d'Arc wrote: > > Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment: > >> No, but changing the APIs from 16-bit integers to 32-bit integers >> does require a recompile of all code using it. > > Is it acceptable between 3.1 and 3.2 for example? ISTM that other > changes already require recompilation of extension modules. With the proposed approach, we'll keep binary compatibility, so this is not much of an issue. Note: Changes to the binary interface can be done in minor releases, but we should make sure that it's not possible to load an extension compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns. >> Also, the Unicode type database itself uses Py_UNICODE, so >> case mapping would fail for non-BMP code points. > > Where, please? in unicodedata.c, getuchar and _getrecord_ex use Py_UCS4. The change affects the Unicode type database which is implemented in unicodectype.c, not the Unicode database, which already uses UCS4.
msg93589 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2009-10-05 11:09
> we should make sure that it's not possible to load an extension > compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns. This is the case with this patch: today all these functions (_PyUnicode_IsAlpha, _PyUnicode_ToLowercase) are actually #defines to _PyUnicodeUCS2_* or _PyUnicodeUCS4_*. The patch removes the #defines: 3.1 modules that call _PyUnicodeUCS4_IsAlpha wouldn't load into a 3.2 interpreter. > The change affects the Unicode type database which is implemented > in unicodectype.c, not the Unicode database, which already uses UCS4. Are you referring to the _PyUnicode_TypeRecord structure? The first three fields only contains values up to 65535, so they could use "unsigned short" even for UCS4 builds. All the other uses are precisely changed by the patch...
msg93590 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2009-10-05 11:16
>> We might keep the old public API for compatibility, but it should be >> clearly marked as broken for non-BMP scalar values. > That has always been the case. UCS2 doesn't support surrogates. > However, we have been slowly moving into the direction of making > the UCS2 storage appear like UTF-16 to the Python programmer. UCS2 died long ago, is there any reason why we keep using an UCS2 that "appears" like UTF-16 instead of real UTF-16? > This process is not yet complete and will likely never complete > since it must still be possible to create things line lone > surrogates for processing purposes, so care has to be taken > when using non-BMP code points on narrow builds. I don't exactly know all the details of the current implementation, but -- from what I understand reading this (correct me if I'm wrong) -- it seems that the implementation is half-UCS2 to allow things like the processing of lone surrogates and half-UTF16 (or UTF-16-compatible) to work with surrogate pairs and hence with chars outside the BMP. What are the use cases for processing the lone surrogates? Wouldn't be better to use UTF-16 and disallow them (since they are illegal) and possibly provide some other way to deal with them (if it's really needed)?
msg93594 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-10-05 11:41
Amaury Forgeot d'Arc wrote: > > Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment: > >> we should make sure that it's not possible to load an extension >> compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns. > > This is the case with this patch: today all these functions > (_PyUnicode_IsAlpha, _PyUnicode_ToLowercase) are actually #defines to > _PyUnicodeUCS2_* or _PyUnicodeUCS4_. > The patch removes the #defines: 3.1 modules that call > _PyUnicodeUCS4_IsAlpha wouldn't load into a 3.2 interpreter. True, but we can do better. For narrow builds, the API currently exposes the UCS2 APIs. We'd need to expose the UCS4 APIs in addition* to those APIs and have the UCS2 APIs redirect to the UCS4 ones. For wide builds, we don't need to change anything. >> The change affects the Unicode type database which is implemented >> in unicodectype.c, not the Unicode database, which already uses UCS4. > > Are you referring to the _PyUnicode_TypeRecord structure? > The first three fields only contains values up to 65535, so they could > use "unsigned short" even for UCS4 builds. I haven't checked, but it's certainly possible to have a code point use a non-BMP lower/upper/title case mapping, so this should be made possible as well, if we're going to make changes to the type database.
msg93595 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-10-05 11:51
This is off-topic for the tracker item, but I'll reply anyway: Ezio Melotti wrote: > > Ezio Melotti <ezio.melotti@gmail.com> added the comment: > >>> We might keep the old public API for compatibility, but it should be >>> clearly marked as broken for non-BMP scalar values. > >> That has always been the case. UCS2 doesn't support surrogates. > >> However, we have been slowly moving into the direction of making >> the UCS2 storage appear like UTF-16 to the Python programmer. > > UCS2 died long ago, is there any reason why we keep using an UCS2 that > "appears" like UTF-16 instead of real UTF-16? UCS2 is how we store Unicode in Python for narrow builds internally. It's a storage format, not an encoding. However, on narrow builds such as the Windows builds, you will sometimes want to create Unicode strings that use non-BMP code points. Since both UCS2 and UCS4 can represent the UTF-16 encoding, it's handy to expose a bit of automatic conversion at the Python level to make things easier for the programmer. >> This process is not yet complete and will likely never complete >> since it must still be possible to create things line lone >> surrogates for processing purposes, so care has to be taken >> when using non-BMP code points on narrow builds. > > I don't exactly know all the details of the current implementation, but > -- from what I understand reading this (correct me if I'm wrong) -- it > seems that the implementation is half-UCS2 to allow things like the > processing of lone surrogates and half-UTF16 (or UTF-16-compatible) to > work with surrogate pairs and hence with chars outside the BMP. > > What are the use cases for processing the lone surrogates? Wouldn't be > better to use UTF-16 and disallow them (since they are illegal) and > possibly provide some other way to deal with them (if it's really needed)? No, because Python is meant to be used for working on all Unicode code points. Lone surrogates are not allowed in transfer encodings such as UTF-16 or UTF-8, but they are valid Unicode code points and you need to be able to work with them, since you may want to construct surrogate pairs by hand or get lone surrogates as a result of slicing a Unicode string.
msg93596 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2009-10-05 12:28
> We'd need to expose the UCS4 APIs in addition > to those APIs and have the UCS2 APIs redirect to the UCS4 ones. Why have two names for the same function? it's Python 3, after all. Or is this "no recompile" feature so important (as long as changes are clearly shown to the user)? It does not work on Windows, FWIW. > I haven't checked, but it's certainly possible to have a code point > use a non-BMP lower/upper/title case mapping, so this should be > made possible as well, if we're going to make changes to the type > database. OK, here is a new patch. Even if this does not happen with unicodedata up to 5.1, the table has only 175 entries so memory usage is not dramatically increased. Py_UNICODE is no more used at all in unicodectype.c.
msg93604 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-10-05 14:16
Amaury Forgeot d'Arc wrote: > > Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment: > >> We'd need to expose the UCS4 APIs in addition >> to those APIs and have the UCS2 APIs redirect to the UCS4 ones. > > Why have two names for the same function? it's Python 3, after all. It's not the same function... the UCS2 version would take a Py_UNICODE parameter, the UCS4 version a Py_UCS4 parameter. I don't understand the comment about Python 3.x. FWIW, we're no longer in the backwards incompatible changes are allowed mode for 3.x. > Or is this "no recompile" feature so important (as long as changes are > clearly shown to the user)? It does not work on Windows, FWIW. There are generally two options for API changes within a major release branch: 1. the changes are API backwards compatible and only the Python API version is changed 2. the changes are not API backwards compatible; in such a case, Python has to reject imports of old module (as it always does on Windows), so the Python API version has to be changed and the import mechanism must reject the import The second option was used when transitioning from 2.4 to 2.5 due to the Py_ssize_t changes. We could do the same for 2.7/3.2, but if it's just needed for this one change, then I'd rather stick to implementing the first option. >> I haven't checked, but it's certainly possible to have a code point >> use a non-BMP lower/upper/title case mapping, so this should be >> made possible as well, if we're going to make changes to the type >> database. > > OK, here is a new patch. Even if this does not happen with unicodedata > up to 5.1, the table has only 175 entries so memory usage is not > dramatically increased. > Py_UNICODE is no more used at all in unicodectype.c. Sorry, but this doesn't work: the functions have to return Py_UNICODE and raise an exception if the return value doesn't fit. Otherwise, you'd get completely wrong values in code downcasting the return value to Py_UNICODE on narrow builds. Another good reason to use two sets of APIs. The new set could indeed return Py_UCS4 values.
msg93611 - (view)	Author: Adam Olsen (Rhamphoryncus)	Date: 2009-10-05 17:25
On Mon, Oct 5, 2009 at 03:03, Marc-Andre Lemburg <report@bugs.python.org> wrote: > We use UCS2 on narrow Python builds, not UTF-16. > >> We might keep the old public API for compatibility, but it should be >> clearly marked as broken for non-BMP scalar values. > > That has always been the case. UCS2 doesn't support surrogates. > > However, we have been slowly moving into the direction of making > the UCS2 storage appear like UTF-16 to the Python programmer. > > This process is not yet complete and will likely never complete > since it must still be possible to create things line lone > surrogates for processing purposes, so care has to be taken > when using non-BMP code points on narrow builds. Balderdash. We expose UTF-16 code units, not UCS-2. Guido has made this quite clear. UTF-16 was designed as an easy transition from UCS-2. Indeed, if your code only does searches or joins existing strings then it will Just Work; declare it UTF-16 and you are done. We have a lot more work to do than that (as in this bug report), and we can't reasonably prevent the user from splitting surrogate pairs via poor code, but a 95% solution doesn't mean we suddenly revert all the way back to UCS-2. If the intent really was to use UCS-2 then a correctly functioning UTF-16 codec would join a surrogate pair into a single scalar value, then raise an error because it's outside the range representable in UCS-2. That's not very helpful though; obviously, it's much better to use UTF-16 internally. "The alternative (no matter what the configure flag is called) is UTF-16, not UCS-2 though: there is support for surrogate pairs in various places, including the \U escape and the UTF-8 codec." http://mail.python.org/pipermail/python-dev/2008-July/080892.html "If you find places where the Python core or standard library is doing Unicode processing that would break when surrogates are present you should file a bug. However this does not mean that every bit of code that slices a string at an arbitrary point (and hence risks slicing in the middle of a surrogate) is incorrect -- it all depends on what is done next with the slice." http://mail.python.org/pipermail/python-dev/2008-July/080900.html
msg93615 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-10-05 18:10
Adam Olsen wrote: > > Adam Olsen <rhamph@gmail.com> added the comment: > > On Mon, Oct 5, 2009 at 03:03, Marc-Andre Lemburg <report@bugs.python.org> wrote: >> We use UCS2 on narrow Python builds, not UTF-16. >> >>> We might keep the old public API for compatibility, but it should be >>> clearly marked as broken for non-BMP scalar values. >> >> That has always been the case. UCS2 doesn't support surrogates. >> >> However, we have been slowly moving into the direction of making >> the UCS2 storage appear like UTF-16 to the Python programmer. >> >> This process is not yet complete and will likely never complete >> since it must still be possible to create things line lone >> surrogates for processing purposes, so care has to be taken >> when using non-BMP code points on narrow builds. > > Balderdash. We expose UTF-16 code units, not UCS-2. Guido has made > this quite clear. > > UTF-16 was designed as an easy transition from UCS-2. Indeed, if your > code only does searches or joins existing strings then it will Just > Work; declare it UTF-16 and you are done. We have a lot more work to > do than that (as in this bug report), and we can't reasonably prevent > the user from splitting surrogate pairs via poor code, but a 95% > solution doesn't mean we suddenly revert all the way back to UCS-2. > > If the intent really was to use UCS-2 then a correctly functioning > UTF-16 codec would join a surrogate pair into a single scalar value, > then raise an error because it's outside the range representable in > UCS-2. That's not very helpful though; obviously, it's much better to > use UTF-16 internally. > > "The alternative (no matter what the configure flag is called) is > UTF-16, not UCS-2 though: there is support for surrogate pairs in > various places, including the \U escape and the UTF-8 codec." > http://mail.python.org/pipermail/python-dev/2008-July/080892.html > > "If you find places where the Python core or standard library is doing > Unicode processing that would break when surrogates are present you > should file a bug. However this does not mean that every bit of code > that slices a string at an arbitrary point (and hence risks slicing in > the middle of a surrogate) is incorrect -- it all depends on what is > done next with the slice." > http://mail.python.org/pipermail/python-dev/2008-July/080900.html All this is just nitpicking, really. UCS2 is a character set, UTF-16 an encoding. It so happens that when the Unicode consortium realized that 16 bit would not be enough to represent all scripts of the world, they added the concept of surrogates and reserved a few ranges of code points in UCS2 to represent these extra code points which are not part of UCS2, but the extensions UCS4. The conversion of these surrogate pairs to UCS4 code point values is what you find defined in UTF-16. If we were to implement Unicode using UTF-16 as storage format, we would not be able to store single lone surrogates, since these are not allowed in UTF-16. Ditto for unassigned ordinals, invalid code points, etc. PEP 100 really says it all: http://www.python.org/dev/peps/pep-0100/ """ This [internal] format will hold UTF-16 encodings of the corresponding Unicode ordinals. The Python Unicode implementation will address these values as if they were UCS-2 values. UCS-2 and UTF-16 are the same for all currently defined Unicode character points. ... Future implementations can extend the 16 bit restriction to the full set of all UTF-16 addressable characters (around 1M characters). """ Note that I wrote the PEP and worked on the implementation at a time when Unicode 2.x was still in use wide-spread use (mostly on Windows) and 3.0 was just being release: http://www.unicode.org/history/publicationdates.html But all that is off-topic for this ticket, so please let's just stop such discussions.
msg93617 - (view)	Author: Adam Olsen (Rhamphoryncus)	Date: 2009-10-05 18:48
On Mon, Oct 5, 2009 at 12:10, Marc-Andre Lemburg <report@bugs.python.org> wrote: > All this is just nitpicking, really. UCS2 is a character set, > UTF-16 an encoding. UCS is a character set, for most purposes synonymous with the Unicode character set. UCS-2 and UTF-16 are both encodings of that character set. However, UCS-2 can only represent the BMP, while UTF-16 can represent the full range. > If we were to implement Unicode using UTF-16 as storage format, > we would not be able to store single lone surrogates, since these > are not allowed in UTF-16. Ditto for unassigned ordinals, invalid > code points, etc. No. Internal usage may become temporarily ill-formed, but this is a compromise, and acceptable so long as we never export them to other systems. Not that I wouldn't prefer a system that wouldn't store lone surrogates, but.. pragmatics prevail. > Note that I wrote the PEP and worked on the implementation at a time > when Unicode 2.x was still in use wide-spread use (mostly on Windows) > and 3.0 was just being release: > > http://www.unicode.org/history/publicationdates.html I think you hit the nail on the head there. 10 years ago, unicode meant something different than it does today. That's reflected in PEP 100 and in the code. Now it's time to move on, switch to the modern terminology, modern usage, and modern specs. > But all that is off-topic for this ticket, so please let's just > stop such discussions. It needs to be discussed somewhere. It's a distraction from fixing the bug, but at least it's more private here. Would you prefer email?
msg93636 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2009-10-06 08:52
So the discussion is now on 2 points: 1. Is the change backwards compatible? (at the code level, after recompilation). My answer is yes, because all known case transformations stay in the same plane: if you pass a char in the BMP, they return a char in the BMP; if you pass a code >0x1000, you get another code >0x1000. In other words: in narrow builds, when you pass Py_UNICODE, the answer will be correct even when downcasted to Py_UNICODE. If you want, I can add checks to makeunicodedata.py to verify that future Unicode standards don't break this statement. "Naive" code that simply walks the Py_UNICODE* buffer will have identical behavior. (The current unicode methods are in this case. They should be fixed, later) 2. Is this change acceptable for 3.2? I'd say yes, because existing extension modules that use these functions will need to be recompiled; the functions names change, the modules won't load otherwise. There is no need to change the API number for this.
msg93659 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-10-06 20:41
It's not as easy as that. The functions for case conversion are used in a way that assumes they never fail (and indeed, the existing functions cannot fail). What we can do is change the input parameter to Py_UCS4, but not the Py_UNICODE output parameter, since that would cause lots of compiler warnings and implicit truncation on UCS2 builds, which would be a pretty disruptive change. However, this change would not really help anyone if there are no mappings from BMP to non-BMP or vice-versa, so I'm not sure whether this whole exercise is worth the effort. It appears to be better to just leave the case mapping APIs unchanged - or am I missing something ? The situation is different for the various Py_UNICODE_IS*() APIs: for these we can change the input parameter to Py_UCS4, remove the name mangling and add UCS2 helper functions to maintain API compatibility on UCS2 builds.
msg93662 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2009-10-06 21:27
> that would cause lots of compiler > warnings and implicit truncation on UCS2 builds Unfortunately, there is no such warning, or the initial problem we are trying to solve would have been spotted by such a warning (unicode_repr() calls Py_UNICODE_ISPRINTABLE() with a Py_UCS4 argument). gcc has a -Wconversion flag, (which I tried today on python) but it is far too verbose before version 4.3, and this newer version still has some false positives. http://gcc.gnu.org/wiki/NewWconversion But the most important thing is that implicit truncation on UCS2 builds is what happens already! The patch does not solve it, but at least it yields sensible results to wary code. Or can you imagine some (somewhat working) code which behavior will be worse after the change?
msg97441 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-01-08 21:26
I don't see the point in changing the various conversion APIs in the unicode database to return Py_UCS4 when there are no conversions that map code points between BMP and non-BMP. In order to solve the problem in question (unicode_repr() failing), we should change the various property checking APIs to accept Py_UCS4 input data. This needlessly increases the type database size without real benefit. For that to work properly we'll have to either make sure that extensions get recompiled if they use these changed APIs, or we provide an additional set of UCS2 APIs that extend the Py_UNICODE input value to a Py_UCS4 value before calling the underlying Py_UCS4 API.
msg97447 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-01-09 00:45
> I don't see the point in changing the various conversion APIs in the > unicode database to return Py_UCS4 when there are no conversions that > map code points between BMP and non-BMP. For consistency: if Py_UNICODE_ISPRINTABLE is changed to take Py_UCS4, Py_UNICODE_TOLOWER should also take Py_UCS4, and must return the same type. > In order to solve the problem in question (unicode_repr() failing), > we should change the various property checking APIs to accept Py_UCS4 > input data. This needlessly increases the type database size without > real benefit. [I'm not sure to understand. For me the 'real benefit' is that it solves the problem in question.] Yes this increases the type database: there are 300 more "case" statements in _PyUnicode_ToNumeric(), and the PyUnicode_TypeRecords array needs 1068 more bytes. On Windows, VS9.0 release build, unicodectype.obj grows from 86Kb to 94Kb; python32.dll is exactly 1.5Kb larger (from 2219Kb to 2221.5Kb); the memory usage of the just-started interpreter is about 32K larger (around 5M). These look reasonable figures to me. > For that to work properly we'll have to either make sure that > extensions get recompiled if they use these changed APIs, or we > provide an additional set of UCS2 APIs that extend the Py_UNICODE > input value to a Py_UCS4 value before calling the underlying Py_UCS4 > API. Extensions that use these changed APIs need to be recompiled, or they won't load: existing modules link with symbols like _PyUnicodeUCS2_IsPrintable, when the future interpreter will define _PyUnicode_IsPrintable.
msg97535 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-01-10 18:14
Amaury Forgeot d'Arc wrote: > > Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment: > >> I don't see the point in changing the various conversion APIs in the >> unicode database to return Py_UCS4 when there are no conversions that >> map code points between BMP and non-BMP. > > For consistency: if Py_UNICODE_ISPRINTABLE is changed to take Py_UCS4, Py_UNICODE_TOLOWER should also take Py_UCS4, and must return the same type. > >> In order to solve the problem in question (unicode_repr() failing), >> we should change the various property checking APIs to accept Py_UCS4 >> input data. This needlessly increases the type database size without >> real benefit. > [I'm not sure to understand. For me the 'real benefit' is that it solves the problem in question.] The problem in question is already solved by just changing the property checking APIs. Changing the conversion APIs fixes a non-problem, since there are no mappings that would require Py_UCS4 on a UCS2 build. > Yes this increases the type database: there are 300 more "case" statements in _PyUnicode_ToNumeric(), and the PyUnicode_TypeRecords array needs 1068 more bytes. > On Windows, VS9.0 release build, unicodectype.obj grows from 86Kb to 94Kb; python32.dll is exactly 1.5Kb larger (from 2219Kb to 2221.5Kb); > the memory usage of the just-started interpreter is about 32K larger (around 5M). These look reasonable figures to me. > >> For that to work properly we'll have to either make sure that >> extensions get recompiled if they use these changed APIs, or we >> provide an additional set of UCS2 APIs that extend the Py_UNICODE >> input value to a Py_UCS4 value before calling the underlying Py_UCS4 >> API. > > Extensions that use these changed APIs need to be recompiled, or they won't load: existing modules link with symbols like _PyUnicodeUCS2_IsPrintable, when the future interpreter will define _PyUnicode_IsPrintable. Hmm, that's a good point. OK, you got me convinced: let's go for it then.
msg109509 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-07-07 23:44
Now I wonder whether it's reasonable to consider this character U+10000 (LINEAR B SYLLABLE B008 A) as printable with repr(). Yes, its category is "Lo", but is there a font which can display it?
msg109513 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-07-08 05:29
Given that '\U00010000'.isprintable() returns True, I would say yes. If someone needs to print this char and has an appropriate font to do it, I don't see why it shouldn't work.
msg109518 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-07-08 07:50
Ezio Melotti wrote: > > Ezio Melotti <ezio.melotti@gmail.com> added the comment: > > Given that '\U00010000'.isprintable() returns True, I would say yes. If someone needs to print this char and has an appropriate font to do it, I don't see why it shouldn't work. Note that Python3 will send printable code points as-is to the console, so whether or not a code point is considered printable should take the common availability of fonts being able to display the code point into account. Otherwise, a user would just see a square box instead of the much more useful escape sequence. The "printable" property is a Python invention, not a Unicode property, so we do have some freedom is deciding what is printable and what is not. In recent years the situation has just started clearing up for fonts covering the assigned BMP range, mostly due to Microsoft actively working to get their fonts cover the Unicode 2.0 assigned code points (BMP only): http://support.microsoft.com/kb/287247 The only font set I know of that tries to go beyond BMP is this shareware one: http://code2000.net/ Most other fonts just cover small parts of the Unicode assigned code point ranges: http://unicode.org/resources/fonts.html http://www.wazu.jp/ http://www.unifont.org/fontguide/ I suppose that in a few years we'll see OS and GUIs mix and match the available fonts to display Unicode code points. Given the font situation, I don't think we should have repr() pass through printable non-BMP code points as is. Perhaps we shouldn't give those code points the printable property to begin with.
msg109519 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-07-08 08:16
[This should probably be discussed on python-dev or in another issue, so feel free to move the conversation there.] The current implementation considers printable """all the characters except those characters defined in the Unicode character database as following categories are considered printable. * Cc (Other, Control) * Cf (Other, Format) * Cs (Other, Surrogate) * Co (Other, Private Use) * Cn (Other, Not Assigned) * Zl Separator, Line ('\u2028', LINE SEPARATOR) * Zp Separator, Paragraph ('\u2029', PARAGRAPH SEPARATOR) * Zs (Separator, Space) other than ASCII space('\x20').""" We could also arbitrary exclude all the non-BMP chars, but that shouldn't be based on the availability of the fonts IMHO. > Note that Python3 will send printable code points as-is to the > console, so whether or not a code point is considered printable > should take the common availability of fonts being able to display > the code point into account. Otherwise, a user would just see a > square box instead of the much more useful escape sequence If the concern is about the usefulness of repr() in the console, note that on the Windows terminal trying to display most of the characters results in an error (see #5110), and that makes repr() barely usable. ascii() might be an alternative if the user wants to see the escape sequence instead of a square box.
msg109522 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-07-08 08:56
I suggest to go ahead and apply this patch, at least it correctly selects "printable" characters, whatever this means. I filed issue9198 to decide whether chr(0x10000) should be printable.
msg109523 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-07-08 09:15
Ezio Melotti wrote: > > Ezio Melotti <ezio.melotti@gmail.com> added the comment: > > [This should probably be discussed on python-dev or in another issue, so feel free to move the conversation there.] > > The current implementation considers printable """all the characters except those characters defined in the Unicode character database as following categories are considered printable. > * Cc (Other, Control) > * Cf (Other, Format) > * Cs (Other, Surrogate) > * Co (Other, Private Use) > * Cn (Other, Not Assigned) > * Zl Separator, Line ('\u2028', LINE SEPARATOR) > * Zp Separator, Paragraph ('\u2029', PARAGRAPH SEPARATOR) > * Zs (Separator, Space) other than ASCII space('\x20').""" > > We could also arbitrary exclude all the non-BMP chars, but that shouldn't be based on the availability of the fonts IMHO. Without fonts, you can't print the code points, even if the Unicode database defines the code point as not having one of the above classes. And that's probably also the reason why the Unicode database doesn't define a printable property :-) I also find the use of Zl, Zp and Zs in the definition somewhat arbitrary: whitespace is certainly printable. This also doesn't match the isprint() C lib API: http://www.cplusplus.com/reference/clibrary/cctype/isprint/ "A printable character is any character that is not a control character." >> Note that Python3 will send printable code points as-is to the >> console, so whether or not a code point is considered printable >> should take the common availability of fonts being able to display >> the code point into account. Otherwise, a user would just see a >> square box instead of the much more useful escape sequence > > If the concern is about the usefulness of repr() in the console, note that on the Windows terminal trying to display most of the characters results in an error (see #5110), and that makes repr() barely usable. > ascii() might be an alternative if the user wants to see the escape sequence instead of a square box. That's a different problem, but indeed also related to the printable property which was introduced as part of the Unicode repr() change: if the console encoding cannot represent the printable code points, you get an error. I was never a fan of the Unicode repr() change to begin with. The repr() of an object should work in almost all cases. Being able to read the repr() of an object in clear text is only secondary. IMHO, allowing all printable code points to pass through unescaped was not beneficial. We have str() for getting readable representations of objects. Anyway, we're stuck with it now, so have to work around the issues...
msg109526 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-07-08 09:26
Amaury Forgeot d'Arc wrote: > > Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment: > > I suggest to go ahead and apply this patch, at least it correctly selects "printable" characters, whatever this means. > I filed issue9198 to decide whether chr(0x10000) should be printable. +1
msg109527 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-07-08 09:33
Amaury, before applying the patch consider replacing the tab characters before the comments with spaces. The use of tabs is discouraged. Marc-Andre Lemburg wrote: > I was never a fan of the Unicode repr() change to begin with. The > repr() of an object should work in almost all cases. I still think that #5110 should be fixed (there's also a patch to fix the issue on Windows). If you agree please comment there and/or reopen that issue.
msg109529 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-07-08 09:35
Ezio Melotti wrote: > Marc-Andre Lemburg wrote: >> I was never a fan of the Unicode repr() change to begin with. The >> repr() of an object should work in almost all cases. > > I still think that #5110 should be fixed (there's also a patch to fix the issue on Windows). If you agree please comment there and/or reopen that issue. Let's discuss this on #9198.
msg109530 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-07-08 09:56
> consider replacing the tab characters before the comments with spaces It's actually already the case in my working copy.
msg109626 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-07-08 22:27
A new patch, generated on top of r82662
msg109631 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-07-08 22:36
Amaury Forgeot d'Arc wrote: > > Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment: > > A new patch, generated on top of r82662 Could you explain what this bit is about ? @@ -349,7 +313,7 @@ configure Python using --with-wctype-functions. This reduces the interpreter's code size. */ -#if defined(HAVE_USABLE_WCHAR_T) && defined(WANT_WCTYPE_FUNCTIONS) +#if defined(Py_UNICODE_WIDE) && defined(WANT_WCTYPE_FUNCTIONS) #include <wctype.h>
msg109644 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-07-08 23:18
A new patch that doesn't remove an important check, avoids a crash when the C macro is called with a huge number. thanks Ezio.
msg109646 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-07-08 23:34
> Could you explain what this bit is about ? > -#if defined(HAVE_USABLE_WCHAR_T) && defined(WANT_WCTYPE_FUNCTIONS) > +#if defined(Py_UNICODE_WIDE) && defined(WANT_WCTYPE_FUNCTIONS) On Windows at least, HAVE_USABLE_WCHAR_T is True, this means that Py_Unicode can be converted to wchar_t. But now that Py_UNICODE_ISSPACE() takes Py_UCS4, it cannot be converted to wchar_t anymore. Now that the unicode database functions claim to use Py_UCS4, the functions of wctypes.h are usable only if they also support Py_UCS4. OTOH the symbol WANT_WCTYPE_FUNCTIONS is defined only if ./configure is called with --with-wctype-functions, I don't expect it to be common. BTW, the comment says that "This reduces the interpreter's code size". I don't really agree, these functions are two-liners.
msg109687 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-07-09 08:13
Amaury Forgeot d'Arc wrote: > > Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment: > >> Could you explain what this bit is about ? >> -#if defined(HAVE_USABLE_WCHAR_T) && defined(WANT_WCTYPE_FUNCTIONS) >> +#if defined(Py_UNICODE_WIDE) && defined(WANT_WCTYPE_FUNCTIONS) > > On Windows at least, HAVE_USABLE_WCHAR_T is True, this means that Py_Unicode can be converted to wchar_t. But now that Py_UNICODE_ISSPACE() takes Py_UCS4, it cannot be converted to wchar_t anymore. Right, but you still have to check whether wchar_t is usable, don't you ? If wchar_t is 2 bytes on a non-Windows platform, using the wctypes functions won't work either. The line should probably read: #if defined(WANT_WCTYPE_FUNCTIONS) && defined(HAVE_USABLE_WCHAR_T) && defined(Py_UNICODE_WIDE) > Now that the unicode database functions claim to use Py_UCS4, the functions of wctypes.h are usable only if they also support Py_UCS4. > > OTOH the symbol WANT_WCTYPE_FUNCTIONS is defined only if ./configure is called with --with-wctype-functions, I don't expect it to be common. > BTW, the comment says that "This reduces the interpreter's code size". I don't really agree, these functions are two-liners. True. The support for the wctype functions should have been remove long ago, since they cause subtle incompatibilities between Python builds. I should have probably never added it in the first place... people were worried about the size of the type record tables at the time, which is why I thought it would be a good idea to try to optionally use the C lib functions. The comment was true before the Python type tables were changed into a type record database: the switch used to remove the Python tables required for those functions. With the type records database, this is no longer the case, since the records are also being used for properties that are not exposed via wctype functions.
msg109689 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-07-09 08:38
Amaury Forgeot d'Arc wrote: > > Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment: > > A new patch that doesn't remove an important check, avoids a crash when the C macro is called with a huge number. thanks Ezio. Could you please be more specific on what you changed ? At least visually, there don't appear to be any differences between v4 and v5 of the patch. Thanks, -- Marc-Andre Lemburg eGenix.com ________________________________________________________________________ 2010-07-19: EuroPython 2010, Birmingham, UK 9 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
msg109692 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-07-09 08:51
The 'if' in 'gettyperecord'. (I would also rewrite that as "if (code > 0x10FFFF)", it looks more readable to me.) The patch seems OK to me. In the NEWS message 'python' should be capitalized and I would also mention .isprintable() and possibly other functions that are affected directly -- mentioning repr() is good too, but it's only affected indirectly.
msg109693 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-07-09 08:59
Ezio Melotti wrote: > > Ezio Melotti <ezio.melotti@gmail.com> added the comment: > > The 'if' in 'gettyperecord'. (I would also rewrite that as "if (code > 0x10FFFF)", it looks more readable to me.) Ah, good catch ! > The patch seems OK to me. In the NEWS message 'python' should be capitalized and I would also mention .isprintable() and possibly other functions that are affected directly -- mentioning repr() is good too, but it's only affected indirectly.
msg109695 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-07-09 09:13
str.isprintable() &co are not changed by this patch, because they enumerate Py_UNICODE units and do not join surrogates. See issue9200
msg109795 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-07-09 21:00
In this 6th patch, the wctype part was changed as suggested. there is one more condition, Py_UNICODE_WIDE: -#if defined(HAVE_USABLE_WCHAR_T) && defined(WANT_WCTYPE_FUNCTIONS) +#if defined(WANT_WCTYPE_FUNCTIONS) && defined(HAVE_USABLE_WCHAR_T) && defined(Py_UNICODE_WIDE)
msg114360 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-08-19 12:56
Committed with r84177.

History
Date	User	Action	Args
2022-04-11 14:56:45	admin	set	github: 49377
2010-08-19 12:56:22	amaury.forgeotdarc	set	status: open -> closed resolution: fixed messages: + msg114360 stage: patch review -> resolved
2010-07-09 21:01:02	amaury.forgeotdarc	set	files: + unicodectype_ucs4_6.patch messages: + msg109795
2010-07-09 09:13:17	amaury.forgeotdarc	set	messages: + msg109695
2010-07-09 08:59:04	lemburg	set	messages: + msg109693
2010-07-09 08:51:08	ezio.melotti	set	messages: + msg109692
2010-07-09 08:38:09	lemburg	set	messages: + msg109689
2010-07-09 08:13:36	lemburg	set	messages: + msg109687
2010-07-08 23:34:23	amaury.forgeotdarc	set	messages: + msg109646
2010-07-08 23:18:32	amaury.forgeotdarc	set	files: + unicodectype_ucs4_5.patch messages: + msg109644
2010-07-08 22:36:12	lemburg	set	messages: + msg109631
2010-07-08 22:27:29	amaury.forgeotdarc	set	files: + unicodectype_ucs4_4.patch messages: + msg109626
2010-07-08 09:56:35	amaury.forgeotdarc	set	messages: + msg109530
2010-07-08 09:35:48	lemburg	set	messages: + msg109529
2010-07-08 09:33:49	ezio.melotti	set	messages: + msg109527
2010-07-08 09:26:02	lemburg	set	messages: + msg109526
2010-07-08 09:16:00	lemburg	set	messages: + msg109523
2010-07-08 08:56:11	amaury.forgeotdarc	set	messages: + msg109522
2010-07-08 08:16:37	ezio.melotti	set	messages: + msg109519
2010-07-08 07:50:48	lemburg	set	messages: + msg109518
2010-07-08 05:29:01	ezio.melotti	set	messages: + msg109513
2010-07-07 23:44:45	amaury.forgeotdarc	set	messages: + msg109509
2010-07-07 22:56:37	amaury.forgeotdarc	set	assignee: amaury.forgeotdarc
2010-06-14 22:33:59	vstinner	unlink	issue3297 dependencies
2010-01-10 18:14:50	lemburg	set	messages: + msg97535
2010-01-10 18:04:27	ezio.melotti	set	superseder: narrow build incorrectly translates cases for non-BMP code points
2010-01-09 00:45:19	amaury.forgeotdarc	set	messages: + msg97447
2010-01-08 21:26:53	lemburg	set	messages: + msg97441
2010-01-08 00:44:58	ezio.melotti	set	title: UnicodeEncodeError - I can't even see license -> Use Py_UCS4 instead of Py_UNICODE in unicodectype.c versions: + Python 3.1, Python 3.2, - Python 3.0
2009-10-28 21:57:03	benjamin.peterson	link	issue3297 dependencies
2009-10-06 22:02:28	amaury.forgeotdarc	set	files: + unicodectype_ucs4_3.patch
2009-10-06 21:27:46	amaury.forgeotdarc	set	messages: + msg93662
2009-10-06 20:41:39	lemburg	set	messages: + msg93659
2009-10-06 08:52:29	amaury.forgeotdarc	set	messages: + msg93636
2009-10-05 18:48:53	Rhamphoryncus	set	messages: + msg93617
2009-10-05 18:10:18	lemburg	set	messages: + msg93615
2009-10-05 17:25:49	Rhamphoryncus	set	messages: + msg93611
2009-10-05 14:16:25	lemburg	set	messages: + msg93604
2009-10-05 12:28:42	amaury.forgeotdarc	set	files: + unicodectype_ucs4-2.patch messages: + msg93596
2009-10-05 11:51:27	lemburg	set	messages: + msg93595
2009-10-05 11:41:15	lemburg	set	messages: + msg93594
2009-10-05 11:16:27	ezio.melotti	set	messages: + msg93590
2009-10-05 11:09:33	amaury.forgeotdarc	set	messages: + msg93589
2009-10-05 10:15:57	lemburg	set	messages: + msg93588
2009-10-05 09:59:55	amaury.forgeotdarc	set	messages: + msg93587
2009-10-05 09:03:08	lemburg	set	messages: + msg93585
2009-10-04 21:46:52	Rhamphoryncus	set	type: behavior messages: + msg93562 nosy: + Rhamphoryncus
2009-09-25 05:54:38	ezio.melotti	set	priority: normal stage: patch review
2009-02-03 23:25:48	amaury.forgeotdarc	set	files: + unicodectype_ucs4.patch keywords: + patch messages: + msg81115
2009-02-03 19:19:30	ezio.melotti	set	messages: + msg81090
2009-02-03 18:47:29	lemburg	set	messages: + msg81081
2009-02-03 14:18:26	vstinner	set	messages: + msg81061
2009-02-03 13:50:11	amaury.forgeotdarc	set	messages: + msg81057
2009-02-03 13:18:20	lemburg	set	messages: + msg81053
2009-02-03 13:14:04	vstinner	set	messages: + msg81052
2009-02-03 13:12:04	lemburg	set	nosy: + lemburg messages: + msg81051
2009-02-03 12:39:48	amaury.forgeotdarc	set	messages: + msg81049
2009-02-03 12:26:27	ezio.melotti	set	messages: + msg81048
2009-02-03 11:34:20	vstinner	set	nosy: + vstinner messages: + msg81044
2009-02-02 18:05:46	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg80971
2009-02-02 01:56:54	ezio.melotti	set	nosy: + ezio.melotti messages: + msg80928
2009-02-02 01:31:39	bupjae	create