
Use Py_UCS4 instead of Py_UNICODE in unicodectype.c #49377

Closed
bupjae mannequin opened this issue Feb 2, 2009 · 54 comments
Labels
topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@bupjae
Mannequin

bupjae mannequin commented Feb 2, 2009

BPO 5127
Nosy @malemburg, @amauryfa, @vstinner, @ezio-melotti
Superseder
  • bpo-7663: narrow build incorrectly translates cases for non-BMP code points

Files
  • unicodectype_ucs4.patch
  • unicodectype_ucs4-2.patch
  • unicodectype_ucs4_3.patch: refreshed; don't forget to run makeunicodedata.py
  • unicodectype_ucs4_4.patch
  • unicodectype_ucs4_5.patch
  • unicodectype_ucs4_6.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/amauryfa'
    closed_at = <Date 2010-08-19.12:56:22.263>
    created_at = <Date 2009-02-02.01:31:39.408>
    labels = ['type-bug', 'expert-unicode']
    title = 'Use Py_UCS4 instead of Py_UNICODE in unicodectype.c'
    updated_at = <Date 2010-08-19.12:56:22.261>
    user = 'https://bugs.python.org/bupjae'

    bugs.python.org fields:

    activity = <Date 2010-08-19.12:56:22.261>
    actor = 'amaury.forgeotdarc'
    assignee = 'amaury.forgeotdarc'
    closed = True
    closed_date = <Date 2010-08-19.12:56:22.263>
    closer = 'amaury.forgeotdarc'
    components = ['Unicode']
    creation = <Date 2009-02-02.01:31:39.408>
    creator = 'bupjae'
    dependencies = []
    files = ['12934', '15047', '15058', '17909', '17911', '17924']
    hgrepos = []
    issue_num = 5127
    keywords = ['patch']
    message_count = 54.0
    messages = ['80924', '80928', '80971', '81044', '81048', '81049', '81051', '81052', '81053', '81057', '81061', '81081', '81090', '81115', '93562', '93585', '93587', '93588', '93589', '93590', '93594', '93595', '93596', '93604', '93611', '93615', '93617', '93636', '93659', '93662', '97441', '97447', '97535', '109509', '109513', '109518', '109519', '109522', '109523', '109526', '109527', '109529', '109530', '109626', '109631', '109644', '109646', '109687', '109689', '109692', '109693', '109695', '109795', '114360']
    nosy_count = 6.0
    nosy_names = ['lemburg', 'amaury.forgeotdarc', 'Rhamphoryncus', 'vstinner', 'ezio.melotti', 'bupjae']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = '7663'
    type = 'behavior'
    url = 'https://bugs.python.org/issue5127'
    versions = ['Python 3.1', 'Python 3.2']

    @bupjae
    Mannequin Author

    bupjae mannequin commented Feb 2, 2009

    >>> license
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python30\lib\site.py", line 372, in __repr__
        self.__setup()
      File "C:\Python30\lib\site.py", line 359, in __setup
        data = fp.read()
      File "C:\Python30\lib\io.py", line 1724, in read
        decoder.decode(self.buffer.read(), final=True))
      File "C:\Python30\lib\io.py", line 1295, in decode
        output = self.decoder.decode(input, final=final)
    UnicodeDecodeError: 'cp949' codec can't decode bytes in position 15164-15165: illegal multibyte sequence
    >>> chr(0x10000)
    '\U00010000'
    >>> chr(0x11000)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python30\lib\io.py", line 1491, in write
        b = encoder.encode(s)
    UnicodeEncodeError: 'cp949' codec can't encode character '\ud804' in position 1: illegal multibyte sequence
    >>>

    I also can't understand why chr(0x10000) and chr(0x11000) have different
    behavior.

    @bupjae bupjae mannequin added the topic-unicode label Feb 2, 2009
    @ezio-melotti
    Member

    Here (winxpsp2, Py3, cp850-terminal) the license works fine:
    >>> license
    Type license() to see the full license text

    and license() works as well.

    I get this output for the chr()s:
    >>> chr(0x10000)
    '\U00010000'
    >>> chr(0x11000)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Programs\Python30\lib\io.py", line 1491, in write
        b = encoder.encode(s)
      File "C:\Programs\Python30\lib\encodings\cp850.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode characters in position
    1-2: character maps to <undefined>

    I believe that chr(0x10000) and chr(0x11000) should have the opposite
    behavior.
    U+10000 (LINEAR B SYLLABLE B008 A) belongs to the 'Lo' category and
    should be printed (and possibly raise a UnicodeError, see bpo-5110),
    whereas U+11000 belongs to the 'Cn' category and should be escaped.

    On Linux with Py3 and a UTF-8 terminal, chr(0x10000) prints '\U00010000'
    and chr(0x11000) prints the char (actually I see two boxes, but that's
    likely a font issue rather than a Python problem). license() works fine too.

    Also note that with cp850 the error message is 'character maps to
    <undefined>' and with cp949 it is 'illegal multibyte sequence'.

    @amauryfa
    Member

    amauryfa commented Feb 2, 2009

    There were non-ascii characters in the Windows license file. This was
    corrected with r67860.

    I believe that chr(0x10000) and chr(0x11000) should have the
    opposite behavior.

    This other problem occurs because, on a narrow unicode build,
    Py_UNICODE_ISPRINTABLE takes a 16-bit integer.
    And indeed,

    >>> unicodedata.category(chr(0x10000 % 65536))
    'Cc'
    >>> unicodedata.category(chr(0x11000 % 65536))
    'Lo'
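
    A minimal C sketch of the truncation in question (the integer types are
    stand-ins for Py_UCS4 and a narrow build's 16-bit Py_UNICODE; this is not
    CPython source): passing a full code point to a 16-bit parameter silently
    keeps only the low 16 bits, which is exactly the % 65536 behavior above.

        #include <stdio.h>

        int main(void)
        {
            /* code points produced by joining surrogate pairs */
            unsigned int cp1 = 0x10000, cp2 = 0x11000;
            /* implicit narrowing, as when calling a Py_UNICODE-taking function */
            unsigned short u1 = (unsigned short)cp1, u2 = (unsigned short)cp2;
            printf("0x%05X -> 0x%04X\n", cp1, (unsigned)u1); /* 0x10000 -> 0x0000 ('Cc') */
            printf("0x%05X -> 0x%04X\n", cp2, (unsigned)u2); /* 0x11000 -> 0x1000 ('Lo') */
            return 0;
        }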

    @vstinner
    Member

    vstinner commented Feb 3, 2009

    I don't understand the behaviour of unichr():

    Python 2.7a0 (trunk:68963M, Jan 30 2009, 00:49:28)
    >>> import unicodedata
    >>> unicodedata.category(u"\U00010000")
    'Lo'
    >>> unicodedata.category(u"\U00011000")
    'Cn'
    >>> unicodedata.category(unichr(0x10000))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: unichr() arg not in range(0x10000) (narrow Python build)

    Why does unichr() fail whereas \Uxxxxxxxx works?

    >>> len(u"\U00010000")
    2
    >>> ord(u"\U00010000")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: ord() expected a character, but string of length 2 found

    @ezio-melotti
    Member

    FWIW, on Python3 it seems to work:
    >>> import unicodedata
    >>> unicodedata.category("\U00010000")
    'Lo'
    >>> unicodedata.category("\U00011000")
    'Cn'
    >>> unicodedata.category(chr(0x10000))
    'Lo'
    >>> unicodedata.category(chr(0x11000))
    'Cn'
    >>> ord(chr(0x10000)), 0x10000
    (65536, 65536)
    >>> ord(chr(0x11000)), 0x11000
    (69632, 69632)
    
    I'm using a narrow build too:
    >>> import sys
    >>> sys.maxunicode
    65535
    >>> len('\U00010000')
    2
    >>> ord('\U00010000')
    65536

    On Python2 unichr() is supposed to raise a ValueError on a narrow build
    if the value is greater than 0xFFFF, but if the characters above
    0xFFFF can be represented with u"\Uxxxxxxxx" there should be a way to
    fix unichr() so it can return them. Python3 already does it with chr().

    Maybe we should open a new issue for this if it's not present already.

    @amauryfa
    Member

    amauryfa commented Feb 3, 2009

    Since r56395, ord() and chr() accept and return surrogate pairs even in
    narrow builds.

    The goal is to remove most differences between narrow and wide unicode
    builds (except for string lengths, indices or slices)

    To address this problem, I suggest to change all functions in
    unicodectype.c so that they accept Py_UCS4 characters (instead of
    Py_UNICODE).
    This would be a binary-incompatible change; and --with-wctype-functions
    would have an effect only if sizeof(wchar_t)==4 (instead of the current
    condition sizeof(wchar_t)==sizeof(PY_UNICODE_TYPE))

    @malemburg
    Member

    On 2009-02-03 13:39, Amaury Forgeot d'Arc wrote:

    > Since r56395, ord() and chr() accept and return surrogate pairs even in
    > narrow builds.
    >
    > The goal is to remove most differences between narrow and wide unicode
    > builds (except for string lengths, indices or slices)
    >
    > To address this problem, I suggest to change all functions in
    > unicodectype.c so that they accept Py_UCS4 characters (instead of
    > Py_UNICODE).

    -1.

    That would cause major breakage in the C API and is not in line with the
    intention of having a Py_UNICODE type in the first place.

    Users who are interested in UCS4 builds should simply use UCS4 builds.

    > This would be a binary-incompatible change; and --with-wctype-functions
    > would have an effect only if sizeof(wchar_t)==4 (instead of the current
    > condition sizeof(wchar_t)==sizeof(PY_UNICODE_TYPE))

    --with-wctype-functions was scheduled for removal many releases ago,
    but I never got around to it. The only reason it's still there is
    that some Linux distributions use this config option (AFAIR, RedHat).
    I'd be +1 on removing the option in 3.0.1 or deprecating it in
    3.0.1 and removing it in 3.1.

    It's not useful in any way, and causes compatibility problems
    with regular builds.

    @vstinner
    Member

    vstinner commented Feb 3, 2009

    amaury> Since r56395, ord() and chr() accept and return surrogate pairs
    amaury> even in narrow builds.

    Note: My examples are made with Python 2.x.

    amaury> The goal is to remove most differences between narrow and wide
    amaury> unicode builds (except for string lengths, indices or slices)

    It would be nice to get the same behaviour in Python 2.x and 3.x to help
    migration from Python2 to Python3 ;-)

    unichr() (in Python 2.x) documentation is correct. But I would appreciate
    support for surrogates in unichr(), which also means changing ord()'s behaviour.

    amaury> To address this problem, I suggest to change all functions in
    amaury> unicodectype.c so that they accept Py_UCS4 characters (instead of
    amaury> Py_UNICODE).

    Why? Using surrogates, you can use 16-bits Py_UNICODE to store non-BMP
    characters (code > 0xffff).

    --

    I can open a new issue if you agree that we can change unichr() / ord()
    behaviour on narrow builds. Should we ask on the mailing list?

    @malemburg
    Member

    On 2009-02-03 14:14, STINNER Victor wrote:

    > amaury> Since r56395, ord() and chr() accept and return surrogate pairs
    > amaury> even in narrow builds.
    >
    > Note: My examples are made with Python 2.x.
    >
    > amaury> The goal is to remove most differences between narrow and wide
    > amaury> unicode builds (except for string lengths, indices or slices)
    >
    > It would be nice to get the same behaviour in Python 2.x and 3.x to help
    > migration from Python2 to Python3 ;-)
    >
    > unichr() (in Python 2.x) documentation is correct. But I would appreciate
    > support for surrogates in unichr(), which also means changing ord()'s
    > behaviour.

    This is not possible for unichr() in Python 2.x, since applications
    always expect len(unichr(x)) == 1.

    Changing ord() in Python 2.x would be easier, since this would only
    extend the range of returned values for UCS2 builds.

    > amaury> To address this problem, I suggest to change all functions in
    > amaury> unicodectype.c so that they accept Py_UCS4 characters (instead of
    > amaury> Py_UNICODE).
    >
    > Why? Using surrogates, you can use 16-bits Py_UNICODE to store non-BMP
    > characters (code > 0xffff).
    >
    > I can open a new issue if you agree that we can change unichr() / ord()
    > behaviour on narrow builds. Should we ask on the mailing list?

    @amauryfa
    Member

    amauryfa commented Feb 3, 2009

    > That would cause major breakage in the C API

    Not if you recompile. I don't see how this breaks the API at the C level.

    > and is not in line with the intention of having a Py_UNICODE
    > type in the first place.

    Py_UNICODE is still used as the allocation unit for unicode strings.

    To get correct results, we need a way to access the whole unicode
    database even on ucs2 builds; it's possible with the unicodedata module,
    why not from C?

    My motivation for the change is this post:
    http://mail.python.org/pipermail/python-dev/2008-July/080900.html

    @vstinner
    Member

    vstinner commented Feb 3, 2009

    lemburg> This is not possible for unichr() in Python 2.x, since applications
    lemburg> always expect len(unichr(x)) == 1

    Oh, ok.

    lemburg> Changing ord() in Python 2.x would be easier, since
    lemburg> this would only extend the range of returned values for UCS2
    lemburg> builds.

    ord() of Python3 (narrow build) rejects surrogate characters:

    '\U00010000'
    >>> len(chr(0x10000))
    2
    >>> ord(0x10000)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: ord() expected string of length 1, but int found

    It looks like narrow builds with surrogates have some more problems...

    Test with U+10000: "LINEAR B SYLLABLE B008 A", category: Letter, Other.

    Correct result (Python 2.5, wide build):

       $ python
       Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
       >>> unichr(0x10000)
       u'\U00010000'
       >>> unichr(0x10000).isalpha()
       True

    Error in Python3 (narrow build):

       marge$ ./python
       Python 3.1a0 (py3k:69105M, Feb  3 2009, 15:04:35)
       >>> chr(0x10000).isalpha()
       False
       >>> list(chr(0x10000))
       ['\ud800', '\udc00']
       >>> chr(0xd800).isalpha()
       False
       >>> chr(0xdc00).isalpha()
       False

    Unicode ranges, all in the category "Other, Surrogate":

    • U+D800..U+DB7F: Non Private Use High Surrogate
    • U+DB80..U+DBFF: Private Use High Surrogate
    • U+DC00..U+DFFF: Low Surrogate range
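
    For reference, the standard UTF-16 arithmetic that maps a non-BMP code
    point into the surrogate ranges listed above; a self-contained C sketch
    (the function name is illustrative, not a CPython API):

        #include <assert.h>
        #include <stdio.h>

        /* split a non-BMP code point into a UTF-16 surrogate pair */
        static void split_surrogates(unsigned int cp,
                                     unsigned short *hi, unsigned short *lo)
        {
            assert(cp >= 0x10000 && cp <= 0x10FFFF);
            cp -= 0x10000;
            *hi = (unsigned short)(0xD800 + (cp >> 10));   /* U+D800..U+DBFF */
            *lo = (unsigned short)(0xDC00 + (cp & 0x3FF)); /* U+DC00..U+DFFF */
        }

        int main(void)
        {
            unsigned short hi, lo;
            split_surrogates(0x10000, &hi, &lo);
            /* prints U+D800 U+DC00, matching list(chr(0x10000)) above */
            printf("U+10000 -> U+%04X U+%04X\n", (unsigned)hi, (unsigned)lo);
            return 0;
        }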

    @malemburg
    Member

    On 2009-02-03 14:50, Amaury Forgeot d'Arc wrote:


    >> That would cause major breakage in the C API
    >
    > Not if you recompile. I don't see how this breaks the API at the C level.

    Well, then try to look at such a change from a C extension
    writer's perspective.

    They'd have to change all their function calls and routines to work
    with Py_UCS4.

    Supporting both the old API and the new one would
    be nearly impossible and require either an adapter API or a lot
    of #ifdef'ery.

    Please remember that the public Python C API is not only meant for
    Python developers. Its main purpose is to be used by
    other developers extending or embedding Python, and those developers
    use different release cycles and want to support more than just the
    bleeding edge Python version.

    Python has a long history of providing very stable APIs, both in
    C and in Python.

    FWIW: The last major change in the C API (the change to Py_ssize_t
    from Python 2.4 to 2.5) has not even propagated to all major C
    extensions yet. It's only now that people start to realize problems
    with this, since their extensions start failing with segfaults
    on 64-bit machines.

    That said, we can of course provide additional UCS4 APIs for
    certain things and also provide conversion helpers between
    Py_UNICODE and Py_UCS4 where needed.

    >> and is not in line with the intention of having a Py_UNICODE
    >> type in the first place.

    > Py_UNICODE is still used as the allocation unit for unicode strings.
    >
    > To get correct results, we need a way to access the whole unicode
    > database even on ucs2 builds; it's possible with the unicodedata module,
    > why not from C?

    I must be missing some detail, but what does the Unicode database
    have to do with the unicodeobject.c C API ?

    > My motivation for the change is this post:
    > http://mail.python.org/pipermail/python-dev/2008-July/080900.html

    There are certainly other ways to make Python deal with surrogates
    in more cases than the ones we already support.

    @ezio-melotti
    Member

    haypo> ord() of Python3 (narrow build) rejects surrogate characters:
    haypo> '\U00010000'
    haypo> >>> len(chr(0x10000))
    haypo> 2
    haypo> >>> ord(0x10000)
    haypo> TypeError: ord() expected string of length 1, but int found
    
    ord() works fine on Py3, you probably meant to do 
    >>> ord('\U00010000')
    65536
    or
    >>> ord(chr(0x10000))
    65536
    
    In Py3 it is also stated that it accepts surrogate pairs (see help(ord)).
    Py2 instead doesn't support them:
    >>> ord(u'\U00010000')
    TypeError: ord() expected a character, but string of length 2 found

    @amauryfa
    Member

    amauryfa commented Feb 3, 2009

    > I must be missing some detail, but what does the Unicode database
    > have to do with the unicodeobject.c C API ?

    Ah, now I understand your concerns. My suggestion is to change only the 20 functions in
    unicodectype.c: _PyUnicode_IsAlpha, _PyUnicode_ToLowercase... and no change in
    unicodeobject.c at all.
    They all take a single code point as argument, some also return a single code point.
    Changing these functions is backwards compatible.

    I attach a patch so we can argue over concrete code (tests are missing).

    Another effect of the patch: unicodedata.numeric('\N{AEGEAN NUMBER TWO}') can return 2.0.

    The str.isalpha() (and others) methods did not change: they still split the surrogate pairs.
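
    To illustrate what Py_UCS4-taking ctype functions buy a narrow build: a
    caller can first join a surrogate pair into a single code point and query
    that. A minimal C sketch with stand-in type and function names (standard
    UTF-16 math; not the actual patch):

        #include <stdio.h>

        typedef unsigned short py_unicode; /* stand-in for a narrow Py_UNICODE */
        typedef unsigned int   py_ucs4;    /* stand-in for Py_UCS4 */

        /* join a UTF-16 surrogate pair into one code point */
        static py_ucs4 join_surrogates(py_unicode hi, py_unicode lo)
        {
            return 0x10000 + (((py_ucs4)(hi - 0xD800) << 10) | (py_ucs4)(lo - 0xDC00));
        }

        int main(void)
        {
            /* '\U00010000' is stored as the pair D800 DC00 on a narrow build */
            py_unicode buf[2] = { 0xD800, 0xDC00 };
            /* a Py_UCS4-taking property function could now see U+10000 whole */
            printf("joined: U+%05X\n", join_surrogates(buf[0], buf[1]));
            return 0;
        }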

    @Rhamphoryncus
    Mannequin

    Rhamphoryncus mannequin commented Oct 4, 2009

    Surrogates aren't optional features of UTF-16, we really need to get
    this fixed. That includes .isalpha().

    We might keep the old public API for compatibility, but it should be
    clearly marked as broken for non-BMP scalar values.

    I don't see a problem with changing 2.x. The existing behaviour is
    broken for non-BMP scalar values, so surely nobody can claim dependence
    on it.

    @Rhamphoryncus Rhamphoryncus mannequin added the type-bug An unexpected behavior, bug, or error label Oct 4, 2009
    @malemburg
    Member

    Adam Olsen wrote:

    > Surrogates aren't optional features of UTF-16, we really need to get
    > this fixed. That includes .isalpha().

    We use UCS2 on narrow Python builds, not UTF-16.

    > We might keep the old public API for compatibility, but it should be
    > clearly marked as broken for non-BMP scalar values.

    That has always been the case. UCS2 doesn't support surrogates.

    However, we have been slowly moving into the direction of making
    the UCS2 storage appear like UTF-16 to the Python programmer.

    This process is not yet complete and will likely never complete,
    since it must still be possible to create things like lone
    surrogates for processing purposes, so care has to be taken
    when using non-BMP code points on narrow builds.

    > I don't see a problem with changing 2.x. The existing behaviour is
    > broken for non-BMP scalar values, so surely nobody can claim dependence
    > on it.

    No, but changing the APIs from 16-bit integers to 32-bit integers
    does require a recompile of all code using it. Otherwise you
    end up with segfaults.

    Also, the Unicode type database itself uses Py_UNICODE, so
    case mapping would fail for non-BMP code points.

    So if we want to support accessing non-BMP type information
    on narrow builds, we'd need to change the complete
    Unicode type database API to work with UCS4 code points and then
    provide a backwards compatible C API using Py_UNICODE. Due
    to the UCS2/UCS4 API renaming done in unicodeobject.h, this
    would amount to exposing both the UCS2 and the UCS4 variants
    of the APIs on narrow builds.

    With such an approach we'd not break the binary API and
    still get the full UCS4 range of code points in the type
    database. The change would be possible in Python 2.x and
    3.x (which now both use the same strategy w/r to change
    management).

    Would someone be willing to work on this ?

    @amauryfa
    Member

    amauryfa commented Oct 5, 2009

    > No, but changing the APIs from 16-bit integers to 32-bit integers
    > does require a recompile of all code using it.

    Is it acceptable between 3.1 and 3.2 for example? ISTM that other
    changes already require recompilation of extension modules.

    > Also, the Unicode type database itself uses Py_UNICODE, so
    > case mapping would fail for non-BMP code points.

    Where, please? in unicodedata.c, getuchar and _getrecord_ex use Py_UCS4.

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:


    >> No, but changing the APIs from 16-bit integers to 32-bit integers
    >> does require a recompile of all code using it.

    > Is it acceptable between 3.1 and 3.2 for example? ISTM that other
    > changes already require recompilation of extension modules.

    With the proposed approach, we'll keep binary compatibility, so
    this is not much of an issue.

    Note: Changes to the binary interface can be done in minor releases,
    but we should make sure that it's not possible to load an extension
    compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns.

    >> Also, the Unicode type database itself uses Py_UNICODE, so
    >> case mapping would fail for non-BMP code points.
    >
    > Where, please? in unicodedata.c, getuchar and _getrecord_ex use Py_UCS4.

    The change affects the Unicode type database which is implemented
    in unicodectype.c, not the Unicode database, which already uses UCS4.

    @amauryfa
    Member

    amauryfa commented Oct 5, 2009

    > we should make sure that it's not possible to load an extension
    > compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns.

    This is the case with this patch: today all these functions
    (_PyUnicode_IsAlpha, _PyUnicode_ToLowercase) are actually #defines to
    _PyUnicodeUCS2_* or _PyUnicodeUCS4_*.
    The patch removes the #defines: 3.1 modules that call
    _PyUnicodeUCS4_IsAlpha wouldn't load into a 3.2 interpreter.

    > The change affects the Unicode type database which is implemented
    > in unicodectype.c, not the Unicode database, which already uses UCS4.

    Are you referring to the _PyUnicode_TypeRecord structure?
    The first three fields only contain values up to 65535, so they could
    use "unsigned short" even for UCS4 builds.
    All the other uses are precisely changed by the patch...
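
    A simplified model of the #define mangling described here (demo names,
    not the real headers): the generic name resolves at compile time to a
    build-specific symbol, so an extension built against one flavor references
    a symbol the other flavor doesn't export and is rejected at load time.

        #include <stdio.h>

        /* the generic name is a #define onto a build-specific symbol */
        #ifdef DEMO_WIDE_BUILD
        #  define demo_IsAlpha demo_UCS4_IsAlpha
        #else
        #  define demo_IsAlpha demo_UCS2_IsAlpha
        #endif

        static int demo_UCS2_IsAlpha(unsigned short ch) { return ch == 'A'; } /* stub */
        static int demo_UCS4_IsAlpha(unsigned int ch)   { return ch == 'A'; } /* stub */

        int main(void)
        {
            /* code compiled here links against exactly one mangled symbol;
               removing the #define changes the name, so stale binaries fail
               to load instead of silently misbehaving */
            printf("%d %d\n", demo_IsAlpha('A'), demo_UCS4_IsAlpha('A'));
            return 0;
        }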

    @ezio-melotti
    Member

    >> We might keep the old public API for compatibility, but it should be
    >> clearly marked as broken for non-BMP scalar values.

    > That has always been the case. UCS2 doesn't support surrogates.
    >
    > However, we have been slowly moving into the direction of making
    > the UCS2 storage appear like UTF-16 to the Python programmer.

    UCS2 died long ago; is there any reason why we keep using a UCS2 that
    "appears" like UTF-16 instead of real UTF-16?

    > This process is not yet complete and will likely never complete,
    > since it must still be possible to create things like lone
    > surrogates for processing purposes, so care has to be taken
    > when using non-BMP code points on narrow builds.

    I don't exactly know all the details of the current implementation, but
    -- from what I understand reading this (correct me if I'm wrong) -- it
    seems that the implementation is half-UCS2 to allow things like the
    processing of lone surrogates and half-UTF16 (or UTF-16-compatible) to
    work with surrogate pairs and hence with chars outside the BMP.

    What are the use cases for processing lone surrogates? Wouldn't it be
    better to use UTF-16 and disallow them (since they are illegal) and
    possibly provide some other way to deal with them (if it's really needed)?

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:


    > we should make sure that it's not possible to load an extension
    > compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns.

    > This is the case with this patch: today all these functions
    > (_PyUnicode_IsAlpha, _PyUnicode_ToLowercase) are actually #defines to
    > _PyUnicodeUCS2_* or _PyUnicodeUCS4_*.
    > The patch removes the #defines: 3.1 modules that call
    > _PyUnicodeUCS4_IsAlpha wouldn't load into a 3.2 interpreter.

    True, but we can do better. For narrow builds, the API currently
    exposes the UCS2 APIs. We'd need to expose the UCS4 APIs *in addition*
    to those APIs and have the UCS2 APIs redirect to the UCS4 ones.

    For wide builds, we don't need to change anything.

    > The change affects the Unicode type database which is implemented
    > in unicodectype.c, not the Unicode database, which already uses UCS4.

    > Are you referring to the _PyUnicode_TypeRecord structure?
    > The first three fields only contain values up to 65535, so they could
    > use "unsigned short" even for UCS4 builds.

    I haven't checked, but it's certainly possible to have a code point
    use a non-BMP lower/upper/title case mapping, so this should be
    made possible as well, if we're going to make changes to the type
    database.

    @malemburg
    Member

    This is off-topic for the tracker item, but I'll reply anyway:

    Ezio Melotti wrote:
    >>> We might keep the old public API for compatibility, but it should be
    >>> clearly marked as broken for non-BMP scalar values.
    > 
    >> That has always been the case. UCS2 doesn't support surrogates.
    > 
    >> However, we have been slowly moving into the direction of making
    >> the UCS2 storage appear like UTF-16 to the Python programmer.
    > 
    > UCS2 died long ago, is there any reason why we keep using an UCS2 that
    > "appears" like UTF-16 instead of real UTF-16?

    UCS2 is how we store Unicode in Python for narrow builds internally.
    It's a storage format, not an encoding.

    However, on narrow builds such as the Windows builds, you will sometimes
    want to create Unicode strings that use non-BMP code points. Since
    both UCS2 and UCS4 can represent the UTF-16 encoding, it's handy to
    expose a bit of automatic conversion at the Python level to make
    things easier for the programmer.

    >> This process is not yet complete and will likely never complete,
    >> since it must still be possible to create things like lone
    >> surrogates for processing purposes, so care has to be taken
    >> when using non-BMP code points on narrow builds.

    > I don't exactly know all the details of the current implementation, but
    > -- from what I understand reading this (correct me if I'm wrong) -- it
    > seems that the implementation is half-UCS2 to allow things like the
    > processing of lone surrogates and half-UTF16 (or UTF-16-compatible) to
    > work with surrogate pairs and hence with chars outside the BMP.
    >
    > What are the use cases for processing lone surrogates? Wouldn't it be
    > better to use UTF-16 and disallow them (since they are illegal) and
    > possibly provide some other way to deal with them (if it's really needed)?

    No, because Python is meant to be used for working on all Unicode
    code points. Lone surrogates are not allowed in transfer encodings
    such as UTF-16 or UTF-8, but they are valid Unicode code points and
    you need to be able to work with them, since you may want to construct
    surrogate pairs by hand or get lone surrogates as a result of slicing a
    Unicode string.

    @amauryfa
    Member

    amauryfa commented Oct 5, 2009

    > We'd need to expose the UCS4 APIs *in addition*
    > to those APIs and have the UCS2 APIs redirect to the UCS4 ones.

    Why have two names for the same function? It's Python 3, after all.
    Or is this "no recompile" feature so important (as long as changes are
    clearly shown to the user)? It does not work on Windows, FWIW.

    > I haven't checked, but it's certainly possible to have a code point
    > use a non-BMP lower/upper/title case mapping, so this should be
    > made possible as well, if we're going to make changes to the type
    > database.

    OK, here is a new patch. Even if this does not happen with unicodedata
    up to 5.1, the table has only 175 entries, so memory usage is not
    dramatically increased.
    Py_UNICODE is no longer used at all in unicodectype.c.

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:


    > We'd need to expose the UCS4 APIs *in addition*
    > to those APIs and have the UCS2 APIs redirect to the UCS4 ones.

    > Why have two names for the same function? It's Python 3, after all.

    It's not the same function... the UCS2 version would take a
    Py_UNICODE parameter, the UCS4 version a Py_UCS4 parameter.

    I don't understand the comment about Python 3.x. FWIW, we're no
    longer in the "backwards incompatible changes are allowed" mode
    for 3.x.

    > Or is this "no recompile" feature so important (as long as changes are
    > clearly shown to the user)? It does not work on Windows, FWIW.

    There are generally two options for API changes within a
    major release branch:

    1. the changes are API backwards compatible and only the Python API
      version is changed

    2. the changes are not API backwards compatible; in such a case,
      Python has to reject imports of old module (as it always
      does on Windows), so the Python API version has to be changed
      *and* the import mechanism must reject the import

    The second option was used when transitioning from 2.4 to 2.5 due
    to the Py_ssize_t changes.

    We could do the same for 2.7/3.2, but if it's just needed for this
    one change, then I'd rather stick to implementing the first option.

    > I haven't checked, but it's certainly possible to have a code point
    > use a non-BMP lower/upper/title case mapping, so this should be
    > made possible as well, if we're going to make changes to the type
    > database.

    > OK, here is a new patch. Even if this does not happen with unicodedata
    > up to 5.1, the table has only 175 entries, so memory usage is not
    > dramatically increased.
    > Py_UNICODE is no longer used at all in unicodectype.c.

    Sorry, but this doesn't work: the functions have to return Py_UNICODE
    and raise an exception if the return value doesn't fit.

    Otherwise, you'd get completely wrong values in code downcasting
    the return value to Py_UNICODE on narrow builds.

    Another good reason to use two sets of APIs. The new set could
    indeed return Py_UCS4 values.

    @Rhamphoryncus
    Mannequin

    Rhamphoryncus mannequin commented Oct 5, 2009

    On Mon, Oct 5, 2009 at 03:03, Marc-Andre Lemburg <report@bugs.python.org> wrote:

    > We use UCS2 on narrow Python builds, not UTF-16.
    >
    >> We might keep the old public API for compatibility, but it should be
    >> clearly marked as broken for non-BMP scalar values.
    >
    > That has always been the case. UCS2 doesn't support surrogates.
    >
    > However, we have been slowly moving into the direction of making
    > the UCS2 storage appear like UTF-16 to the Python programmer.
    >
    > This process is not yet complete and will likely never complete,
    > since it must still be possible to create things like lone
    > surrogates for processing purposes, so care has to be taken
    > when using non-BMP code points on narrow builds.

    Balderdash. We expose UTF-16 code units, not UCS-2. Guido has made
    this quite clear.

    UTF-16 was designed as an easy transition from UCS-2. Indeed, if your
    code only does searches or joins existing strings then it will Just
    Work; declare it UTF-16 and you are done. We have a lot more work to
    do than that (as in this bug report), and we can't reasonably prevent
    the user from splitting surrogate pairs via poor code, but a 95%
    solution doesn't mean we suddenly revert all the way back to UCS-2.

    If the intent really was to use UCS-2 then a correctly functioning
    UTF-16 codec would join a surrogate pair into a single scalar value,
    then raise an error because it's outside the range representable in
    UCS-2. That's not very helpful though; obviously, it's much better to
    use UTF-16 internally.

    "The alternative (no matter what the configure flag is called) is
    UTF-16, not UCS-2 though: there is support for surrogate pairs in
    various places, including the \U escape and the UTF-8 codec."
    http://mail.python.org/pipermail/python-dev/2008-July/080892.html

    "If you find places where the Python core or standard library is doing
    Unicode processing that would break when surrogates are present you
    should file a bug. However this does not mean that every bit of code
    that slices a string at an arbitrary point (and hence risks slicing in
    the middle of a surrogate) is incorrect -- it all depends on what is
    done next with the slice."
    http://mail.python.org/pipermail/python-dev/2008-July/080900.html

    @malemburg
    Member

    Adam Olsen wrote:
    > On Mon, Oct 5, 2009 at 03:03, Marc-Andre Lemburg <report@bugs.python.org> wrote:
    >> We use UCS2 on narrow Python builds, not UTF-16.
    >>
    >>> We might keep the old public API for compatibility, but it should be
    >>> clearly marked as broken for non-BMP scalar values.
    >>
    >> That has always been the case. UCS2 doesn't support surrogates.
    >>
    >> However, we have been slowly moving into the direction of making
    >> the UCS2 storage appear like UTF-16 to the Python programmer.
    >>
    >> This process is not yet complete and will likely never complete
    >> since it must still be possible to create things line lone
    >> surrogates for processing purposes, so care has to be taken
    >> when using non-BMP code points on narrow builds.
    > 
    > Balderdash.  We expose UTF-16 code units, not UCS-2.  Guido has made
    > this quite clear.
    > 
    > UTF-16 was designed as an easy transition from UCS-2.  Indeed, if your
    > code only does searches or joins existing strings then it will Just
    > Work; declare it UTF-16 and you are done.  We have a lot more work to
    > do than that (as in this bug report), and we can't reasonably prevent
    > the user from splitting surrogate pairs via poor code, but a 95%
    > solution doesn't mean we suddenly revert all the way back to UCS-2.
    > 
    > If the intent really was to use UCS-2 then a correctly functioning
    > UTF-16 codec would join a surrogate pair into a single scalar value,
    > then raise an error because it's outside the range representable in
    > UCS-2.  That's not very helpful though; obviously, it's much better to
    > use UTF-16 internally.
    > 
    > "The alternative (no matter what the configure flag is called) is
    > UTF-16, not UCS-2 though: there is support for surrogate pairs in
    > various places, including the \U escape and the UTF-8 codec."
    > http://mail.python.org/pipermail/python-dev/2008-July/080892.html
    > 
    > "If you find places where the Python core or standard library is doing
    > Unicode processing that would break when surrogates are present you
    > should file a bug. However this does not mean that every bit of code
    > that slices a string at an arbitrary point (and hence risks slicing in
    > the middle of a surrogate) is incorrect -- it all depends on what is
    > done next with the slice."
    > http://mail.python.org/pipermail/python-dev/2008-July/080900.html

    All this is just nitpicking, really. UCS2 is a character set,
    UTF-16 an encoding.

    It so happens that when the Unicode consortium realized
    that 16 bits would not be enough to represent all scripts of the
    world, they added the concept of surrogates and reserved a few
    ranges of code points in UCS2 to represent these extra code
    points, which are not part of UCS2 but of its extension, UCS4.

    The conversion of these surrogate pairs to UCS4 code point
    values is what you find defined in UTF-16.

    If we were to implement Unicode using UTF-16 as storage format,
    we would not be able to store single lone surrogates, since these
    are not allowed in UTF-16. Ditto for unassigned ordinals, invalid
    code points, etc.

    PEP-100 really says it all:

    http://www.python.org/dev/peps/pep-0100/
    

    """
    This [internal] format will hold UTF-16 encodings of the corresponding
    Unicode ordinals. The Python Unicode implementation will address
    these values as if they were UCS-2 values. UCS-2 and UTF-16 are
    the same for all currently defined Unicode character points.
    ...
    Future implementations can extend the 16 bit restriction to the
    full set of all UTF-16 addressable characters (around 1M
    characters).
    """

    Note that I wrote the PEP and worked on the implementation at a time
    when Unicode 2.x was still in wide-spread use (mostly on Windows)
    and 3.0 was just being released:

    http://www.unicode.org/history/publicationdates.html
    

    But all that is off-topic for this ticket, so please let's just
    stop such discussions.

    @Rhamphoryncus
    Mannequin

    Rhamphoryncus mannequin commented Oct 5, 2009

    On Mon, Oct 5, 2009 at 12:10, Marc-Andre Lemburg <report@bugs.python.org> wrote:

    > All this is just nitpicking, really. UCS2 is a character set,
    > UTF-16 an encoding.

    UCS is a character set, for most purposes synonymous with the Unicode
    character set. UCS-2 and UTF-16 are both encodings of that character
    set. However, UCS-2 can only represent the BMP, while UTF-16 can
    represent the full range.

    > If we were to implement Unicode using UTF-16 as storage format,
    > we would not be able to store single lone surrogates, since these
    > are not allowed in UTF-16. Ditto for unassigned ordinals, invalid
    > code points, etc.

    No. Internal usage may become temporarily ill-formed, but this is a
    compromise, and acceptable so long as we never export them to other
    systems.

    Not that I wouldn't *prefer* a system that wouldn't store lone
    surrogates, but.. pragmatics prevail.

    > Note that I wrote the PEP and worked on the implementation at a time
    > when Unicode 2.x was still in wide-spread use (mostly on Windows)
    > and 3.0 was just being released:
    >
    >        http://www.unicode.org/history/publicationdates.html

    I think you hit the nail on the head there. 10 years ago, unicode
    meant something different than it does today. That's reflected in PEP-100 and in the code. Now it's time to move on, switch to the modern
    terminology, modern usage, and modern specs.

    > But all that is off-topic for this ticket, so please let's just
    > stop such discussions.

    It needs to be discussed somewhere. It's a distraction from fixing
    the bug, but at least it's more private here. Would you prefer email?

    @amauryfa
    Member

    amauryfa commented Oct 6, 2009

    So the discussion is now on 2 points:

    1. Is the change backwards compatible? (at the code level, after
       recompilation). My answer is yes, because all known case
       transformations stay in the same plane: if you pass a char in the BMP,
       they return a char in the BMP; if you pass a code >0xFFFF, you get
       another code >0xFFFF. In other words: in narrow builds, when you pass
       Py_UNICODE, the answer will be correct even when downcast to
       Py_UNICODE. If you want, I can add checks to makeunicodedata.py to
       verify that future Unicode standards don't break this statement.

       "Naive" code that simply walks the Py_UNICODE* buffer will have
       identical behavior. (The current unicode methods are in this case.
       They should be fixed later.)

    2. Is this change acceptable for 3.2? I'd say yes, because existing
       extension modules that use these functions will need to be recompiled;
       the function names change, so the modules won't load otherwise. There is
       no need to change the API number for this.

    @malemburg
    Member

    It's not as easy as that.

    The functions for case conversion are used in a way that assumes they
    never fail (and indeed, the existing functions cannot fail).

    What we can do is change the input parameter to Py_UCS4, but not the
    Py_UNICODE output parameter, since that would cause lots of compiler
    warnings and implicit truncation on UCS2 builds, which would be a pretty
    disruptive change.

    However, this change would not really help anyone if there are no
    mappings from BMP to non-BMP or vice-versa, so I'm not sure whether this
    whole exercise is worth the effort.

    It appears to be better to just leave the case mapping APIs unchanged -
    or am I missing something ?

    The situation is different for the various Py_UNICODE_IS*() APIs: for
    these we can change the input parameter to Py_UCS4, remove the name
    mangling and add UCS2 helper functions to maintain API compatibility on
    UCS2 builds.
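
    A sketch of the compatibility scheme proposed here, with stand-in names
    and a stubbed property check (assumptions, not CPython source): the real
    function accepts the full code point range, and a thin UCS2 entry point
    widens its argument and forwards, preserving the old binary interface.

        #include <stdio.h>

        typedef unsigned int   py_ucs4; /* stand-in for Py_UCS4 */
        typedef unsigned short py_ucs2; /* stand-in for Py_UNICODE on UCS2 builds */

        /* new-style API: accepts the full code point range (check is a stub) */
        static int demo_IsPrintable_UCS4(py_ucs4 ch)
        {
            return ch >= 0x20 && ch != 0x7F && ch <= 0x10FFFF;
        }

        /* UCS2 helper kept for binary compatibility: widen and forward */
        static int demo_IsPrintable_UCS2(py_ucs2 ch)
        {
            return demo_IsPrintable_UCS4((py_ucs4)ch);
        }

        int main(void)
        {
            printf("%d %d\n", demo_IsPrintable_UCS2(0x41),
                              demo_IsPrintable_UCS4(0x10000)); /* 1 1 */
            return 0;
        }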

    @amauryfa
    Member

    amauryfa commented Oct 6, 2009

    > that would cause lots of compiler
    > warnings and implicit truncation on UCS2 builds

    Unfortunately, there is no such warning, or the initial problem we are trying
    to solve would have been spotted by such a warning (unicode_repr() calls
    Py_UNICODE_ISPRINTABLE() with a Py_UCS4 argument).

    gcc has a -Wconversion flag, (which I tried today on python) but it is far too
    verbose before version 4.3, and this newer version still has some false
    positives. http://gcc.gnu.org/wiki/NewWconversion

    But the most important thing is that implicit truncation on UCS2 builds is what
    happens already! The patch does not solve it, but at least it yields sensible
    results to wary code.
    Or can you imagine some (somewhat working) code whose behavior will be worse
    after the change?

    @ezio-melotti ezio-melotti changed the title from "UnicodeEncodeError - I can't even see license" to "Use Py_UCS4 instead of Py_UNICODE in unicodectype.c" on Jan 8, 2010
    @malemburg
    Member

    I don't see the point in changing the various conversion APIs in the unicode database to return Py_UCS4 when there are no conversions that map code points between BMP and non-BMP.

    In order to solve the problem in question (unicode_repr() failing), we should change the various property checking APIs to accept Py_UCS4 input data. This needlessly increases the type database size without real benefit.

    For that to work properly we'll have to either make sure that extensions get recompiled if they use these changed APIs, or we provide an additional set of UCS2 APIs that extend the Py_UNICODE input value to a Py_UCS4 value before calling the underlying Py_UCS4 API.

    @amauryfa
    Member

    amauryfa commented Jan 9, 2010

    > I don't see the point in changing the various conversion APIs in the
    > unicode database to return Py_UCS4 when there are no conversions that
    > map code points between BMP and non-BMP.

    For consistency: if Py_UNICODE_ISPRINTABLE is changed to take Py_UCS4, Py_UNICODE_TOLOWER should also take Py_UCS4, and must return the same type.

    > In order to solve the problem in question (unicode_repr() failing),
    > we should change the various property checking APIs to accept Py_UCS4
    > input data. This needlessly increases the type database size without
    > real benefit.

    [I'm not sure I understand. For me the 'real benefit' is that it solves the problem in question.]

    Yes, this increases the type database: there are 300 more "case" statements in _PyUnicode_ToNumeric(), and the _PyUnicode_TypeRecords array needs 1068 more bytes.
    On Windows, VS9.0 release build, unicodectype.obj grows from 86Kb to 94Kb; python32.dll is exactly 1.5Kb larger (from 2219Kb to 2221.5Kb);
    the memory usage of the just-started interpreter is about 32K larger (around 5M). These look like reasonable figures to me.

    > For that to work properly we'll have to either make sure that
    > extensions get recompiled if they use these changed APIs, or we
    > provide an additional set of UCS2 APIs that extend the Py_UNICODE
    > input value to a Py_UCS4 value before calling the underlying Py_UCS4
    > API.

    Extensions that use these changed APIs need to be recompiled, or they won't load: existing modules link with symbols like _PyUnicodeUCS2_IsPrintable, when the future interpreter will define _PyUnicode_IsPrintable.

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:


    > I don't see the point in changing the various conversion APIs in the
    > unicode database to return Py_UCS4 when there are no conversions that
    > map code points between BMP and non-BMP.

    > For consistency: if Py_UNICODE_ISPRINTABLE is changed to take Py_UCS4, Py_UNICODE_TOLOWER should also take Py_UCS4, and must return the same type.

    > In order to solve the problem in question (unicode_repr() failing),
    > we should change the various property checking APIs to accept Py_UCS4
    > input data. This needlessly increases the type database size without
    > real benefit.
    > [I'm not sure I understand. For me the 'real benefit' is that it solves the problem in question.]

    The problem in question is already solved by just changing the property
    checking APIs. Changing the conversion APIs fixes a non-problem, since there
    are no mappings that would require Py_UCS4 on a UCS2 build.

    > Yes, this increases the type database: there are 300 more "case" statements in _PyUnicode_ToNumeric(), and the _PyUnicode_TypeRecords array needs 1068 more bytes.
    > On Windows, VS9.0 release build, unicodectype.obj grows from 86Kb to 94Kb; python32.dll is exactly 1.5Kb larger (from 2219Kb to 2221.5Kb);
    > the memory usage of the just-started interpreter is about 32K larger (around 5M). These look like reasonable figures to me.

    > For that to work properly we'll have to either make sure that
    > extensions get recompiled if they use these changed APIs, or we
    > provide an additional set of UCS2 APIs that extend the Py_UNICODE
    > input value to a Py_UCS4 value before calling the underlying Py_UCS4
    > API.

    > Extensions that use these changed APIs need to be recompiled, or they won't load: existing modules link with symbols like _PyUnicodeUCS2_IsPrintable, when the future interpreter will define _PyUnicode_IsPrintable.

    Hmm, that's a good point.

    OK, you got me convinced: let's go for it then.

    @amauryfa amauryfa self-assigned this Jul 7, 2010
    @amauryfa
    Member

    amauryfa commented Jul 7, 2010

    Now I wonder whether it's reasonable to consider this character
    U+10000 (LINEAR B SYLLABLE B008 A)
    as printable with repr(). Yes, its category is "Lo", but is there a font which can display it?

    @ezio-melotti
    Member

    Given that '\U00010000'.isprintable() returns True, I would say yes. If someone needs to print this char and has an appropriate font to do it, I don't see why it shouldn't work.

    @malemburg
    Member

    Ezio Melotti wrote:


    > Given that '\U00010000'.isprintable() returns True, I would say yes. If someone needs to print this char and has an appropriate font to do it, I don't see why it shouldn't work.

    Note that Python3 will send printable code points as-is to the console,
    so whether or not a code point is considered printable should take the
    common availability of fonts being able to display the code point
    into account. Otherwise, a user would just see a square box instead of
    the much more useful escape sequence.

    The "printable" property is a Python invention, not a Unicode property,
    so we do have some freedom in deciding what is printable and what
    is not.

    In recent years the situation has just started clearing up
    for fonts covering the assigned BMP range, mostly due to Microsoft
    actively working to get their fonts to cover the Unicode 2.0 assigned
    code points (BMP only):

    http://support.microsoft.com/kb/287247
    

    The only font set I know of that tries to go beyond BMP is this
    shareware one:

    http://code2000.net/
    

    Most other fonts just cover small parts of the Unicode assigned
    code point ranges:

    http://unicode.org/resources/fonts.html
    http://www.wazu.jp/
    http://www.unifont.org/fontguide/
    

    I suppose that in a few years we'll see OS and GUIs mix and match the
    available fonts to display Unicode code points.

    Given the font situation, I don't think we should have repr()
    pass through printable non-BMP code points as is. Perhaps we shouldn't
    give those code points the printable property to begin with.

    @ezio-melotti
    Member

    [This should probably be discussed on python-dev or in another issue, so feel free to move the conversation there.]

    The current implementation considers printable all the characters except those defined in the Unicode character database as belonging to one of the following categories:

    • Cc (Other, Control)
    • Cf (Other, Format)
    • Cs (Other, Surrogate)
    • Co (Other, Private Use)
    • Cn (Other, Not Assigned)
    • Zl (Separator, Line: '\u2028', LINE SEPARATOR)
    • Zp (Separator, Paragraph: '\u2029', PARAGRAPH SEPARATOR)
    • Zs (Separator, Space), other than ASCII space ('\x20')

    We could also arbitrarily exclude all the non-BMP chars, but that shouldn't be based on the availability of the fonts IMHO.

    > Note that Python3 will send printable code points as-is to the
    > console, so whether or not a code point is considered printable
    > should take the common availability of fonts being able to display
    > the code point into account. Otherwise, a user would just see a
    > square box instead of the much more useful escape sequence

    If the concern is about the usefulness of repr() in the console, note that on the Windows terminal trying to display most of the characters results in an error (see bpo-5110), and that makes repr() barely usable.
    ascii() might be an alternative if the user wants to see the escape sequence instead of a square box.

    @amauryfa
    Member

    amauryfa commented Jul 8, 2010

    I suggest going ahead and applying this patch; at least it correctly selects "printable" characters, whatever that means.
    I filed bpo-9198 to decide whether chr(0x10000) should be printable.

    @malemburg
    Member

    Ezio Melotti wrote:


    > [This should probably be discussed on python-dev or in another issue, so feel free to move the conversation there.]
    >
    > The current implementation considers printable all the characters except those defined in the Unicode character database as belonging to one of the following categories:
    >
    > • Cc (Other, Control)
    > • Cf (Other, Format)
    > • Cs (Other, Surrogate)
    > • Co (Other, Private Use)
    > • Cn (Other, Not Assigned)
    > • Zl (Separator, Line: '\u2028', LINE SEPARATOR)
    > • Zp (Separator, Paragraph: '\u2029', PARAGRAPH SEPARATOR)
    > • Zs (Separator, Space), other than ASCII space ('\x20')

    > We could also arbitrarily exclude all the non-BMP chars, but that shouldn't be based on the availability of the fonts IMHO.

    Without fonts, you can't print the code points, even if the Unicode
    database defines the code point as not having one of the above
    classes. And that's probably also the reason why the Unicode
    database doesn't define a printable property :-)

    I also find the use of Zl, Zp and Zs in the definition somewhat
    arbitrary: whitespace is certainly printable. This also doesn't
    match the isprint() C lib API:

    http://www.cplusplus.com/reference/clibrary/cctype/isprint/

    "A printable character is any character that is not a control character."

    > Note that Python3 will send printable code points as-is to the
    > console, so whether or not a code point is considered printable
    > should take the common availability of fonts being able to display
    > the code point into account. Otherwise, a user would just see a
    > square box instead of the much more useful escape sequence

    > If the concern is about the usefulness of repr() in the console, note that on the Windows terminal trying to display most of the characters results in an error (see bpo-5110), and that makes repr() barely usable.
    > ascii() might be an alternative if the user wants to see the escape sequence instead of a square box.

    That's a different problem, but indeed also related to the
    printable property which was introduced as part of the Unicode repr()
    change: if the console encoding cannot represent
    the printable code points, you get an error.

    I was never a fan of the Unicode repr() change to begin with. The
    repr() of an object should work in almost all cases. Being able to
    read the repr() of an object in clear text is only secondary.
    IMHO, allowing all printable code points to pass through unescaped
    was not beneficial. We have str() for getting readable representations
    of objects. Anyway, we're stuck with it now, so we have to work
    around the issues...

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:


    > I suggest going ahead and applying this patch; at least it correctly selects "printable" characters, whatever that means.
    > I filed bpo-9198 to decide whether chr(0x10000) should be printable.

    +1

    @ezio-melotti
    Member

    Amaury, before applying the patch consider replacing the tab characters before the comments with spaces. The use of tabs is discouraged.

    Marc-Andre Lemburg wrote:

    > I was never a fan of the Unicode repr() change to begin with. The
    > repr() of an object should work in almost all cases.

    I still think that bpo-5110 should be fixed (there's also a patch to fix the issue on Windows). If you agree please comment there and/or reopen that issue.

    @malemburg
    Member

    Ezio Melotti wrote:

    > Marc-Andre Lemburg wrote:
    >> I was never a fan of the Unicode repr() change to begin with. The
    >> repr() of an object should work in almost all cases.
    >
    > I still think that bpo-5110 should be fixed (there's also a patch to fix the issue on Windows). If you agree please comment there and/or reopen that issue.

    Let's discuss this on bpo-9198.

    @amauryfa
    Member

    amauryfa commented Jul 8, 2010

    > consider replacing the tab characters before the comments with spaces

    It's actually already the case in my working copy.

    @amauryfa
    Member

    amauryfa commented Jul 8, 2010

    A new patch, generated on top of r82662

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:


    > A new patch, generated on top of r82662

    Could you explain what this bit is about ?

    @@ -349,7 +313,7 @@
    configure Python using --with-wctype-functions. This reduces the
    interpreter's code size. */

    -#if defined(HAVE_USABLE_WCHAR_T) && defined(WANT_WCTYPE_FUNCTIONS)
    +#if defined(Py_UNICODE_WIDE) && defined(WANT_WCTYPE_FUNCTIONS)

     #include <wctype.h>

    @amauryfa
    Member

    amauryfa commented Jul 8, 2010

    A new patch that doesn't remove an important check; it avoids a crash when the C macro is called with a huge number. Thanks Ezio.

    @amauryfa
    Member

    amauryfa commented Jul 8, 2010

    > Could you explain what this bit is about ?
    > -#if defined(HAVE_USABLE_WCHAR_T) && defined(WANT_WCTYPE_FUNCTIONS)
    > +#if defined(Py_UNICODE_WIDE) && defined(WANT_WCTYPE_FUNCTIONS)

    On Windows at least, HAVE_USABLE_WCHAR_T is true; this means that Py_UNICODE can be converted to wchar_t. But now that Py_UNICODE_ISSPACE() takes Py_UCS4, it cannot be converted to wchar_t anymore.

    Now that the unicode database functions claim to use Py_UCS4, the functions of wctype.h are usable only if they also support Py_UCS4.

    OTOH the symbol WANT_WCTYPE_FUNCTIONS is defined only if ./configure is called with --with-wctype-functions, I don't expect it to be common.
    BTW, the comment says that "This reduces the interpreter's code size". I don't really agree, these functions are two-liners.

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:


    > Could you explain what this bit is about ?
    > -#if defined(HAVE_USABLE_WCHAR_T) && defined(WANT_WCTYPE_FUNCTIONS)
    > +#if defined(Py_UNICODE_WIDE) && defined(WANT_WCTYPE_FUNCTIONS)

    > On Windows at least, HAVE_USABLE_WCHAR_T is true; this means that Py_UNICODE can be converted to wchar_t. But now that Py_UNICODE_ISSPACE() takes Py_UCS4, it cannot be converted to wchar_t anymore.

    Right, but you still have to check whether wchar_t is usable, don't
    you ? If wchar_t is 2 bytes on a non-Windows platform, using the
    wctypes functions won't work either.

    The line should probably read:

    #if defined(WANT_WCTYPE_FUNCTIONS) && defined(HAVE_USABLE_WCHAR_T) && defined(Py_UNICODE_WIDE)
    

    > Now that the unicode database functions claim to use Py_UCS4, the functions of wctype.h are usable only if they also support Py_UCS4.
    >
    > OTOH the symbol WANT_WCTYPE_FUNCTIONS is defined only if ./configure is called with --with-wctype-functions, I don't expect it to be common.
    > BTW, the comment says that "This reduces the interpreter's code size". I don't really agree, these functions are two-liners.

    True.

    The support for the wctype functions should have been removed long ago,
    since they cause subtle incompatibilities between Python builds. I should
    probably never have added it in the first place... people were worried
    about the size of the type record tables at the time, which is why
    I thought it would be a good idea to try to optionally use the C lib
    functions.

    The comment was true before the Python type tables were changed
    into a type record database: the switch used to remove the
    Python tables required for those functions. With the type records
    database, this is no longer the case, since the records are also
    being used for properties that are not exposed via wctype functions.

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:


    > A new patch that doesn't remove an important check; it avoids a crash when the C macro is called with a huge number. Thanks Ezio.

    Could you please be more specific on what you changed ?

    At least visually, there don't appear to be any differences between
    v4 and v5 of the patch.

    Thanks,

    Marc-Andre Lemburg

    @ezio-melotti
    Member

    The 'if' in 'gettyperecord'. (I would also rewrite that as "if (code > 0x10FFFF)", it looks more readable to me.)

    The patch seems OK to me. In the NEWS message 'python' should be capitalized and I would also mention .isprintable() and possibly other functions that are affected directly -- mentioning repr() is good too, but it's only affected indirectly.
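
    For readers following along, a sketch of the shape of the guard being
    discussed, with a stubbed lookup (the real gettyperecord() in
    unicodectype.c indexes the generated two-level tables and returns a
    _PyUnicode_TypeRecord*; every name here is a demo stand-in):

        #include <stdio.h>

        typedef struct { int flags; } demo_type_record;

        /* record 0 holds the defaults used for out-of-range code points */
        static const demo_type_record demo_records[2] = { { 0 }, { 1 } };

        static const demo_type_record *demo_gettyperecord(unsigned int code)
        {
            int index;
            if (code > 0x10FFFF) /* the range check under discussion */
                index = 0;       /* huge values fall back to the default record */
            else
                index = 1;       /* stand-in for the real two-level table lookup */
            return &demo_records[index];
        }

        int main(void)
        {
            printf("%d %d\n", demo_gettyperecord(0x41)->flags,
                              demo_gettyperecord(0xFFFFFFFFu)->flags); /* 1 0 */
            return 0;
        }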

    @malemburg
    Member

    Ezio Melotti wrote:


    > The 'if' in 'gettyperecord'. (I would also rewrite that as "if (code > 0x10FFFF)", it looks more readable to me.)

    Ah, good catch !

    > The patch seems OK to me. In the NEWS message 'python' should be capitalized and I would also mention .isprintable() and possibly other functions that are affected directly -- mentioning repr() is good too, but it's only affected indirectly.

    @amauryfa
    Member

    amauryfa commented Jul 9, 2010

    str.isprintable() & co. are not changed by this patch, because they enumerate Py_UNICODE units and do not join surrogates. See bpo-9200.

    @amauryfa
    Member

    amauryfa commented Jul 9, 2010

    In this 6th patch, the wctype part was changed as suggested;
    there is one more condition, Py_UNICODE_WIDE:

    -#if defined(HAVE_USABLE_WCHAR_T) && defined(WANT_WCTYPE_FUNCTIONS)
    +#if defined(WANT_WCTYPE_FUNCTIONS) && defined(HAVE_USABLE_WCHAR_T) && defined(Py_UNICODE_WIDE)
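
    A compilable sketch of that guard (simplified demo macro and a stubbed
    fallback, not the real definitions): the C library's iswspace() can only
    back the property check when wchar_t is wide enough for a Py_UCS4 value.

        #include <stdio.h>
        #include <wctype.h>

        #if defined(WANT_WCTYPE_FUNCTIONS) && defined(HAVE_USABLE_WCHAR_T) \
            && defined(Py_UNICODE_WIDE)
        /* wchar_t is 4 bytes here, so a full code point fits into wint_t */
        #  define DEMO_ISSPACE(ch) iswspace((wint_t)(ch))
        #else
        /* database-backed path, stubbed to a few ASCII whitespace points */
        static int DEMO_ISSPACE(unsigned int ch)
        {
            return ch == 0x20 || (ch >= 0x09 && ch <= 0x0D);
        }
        #endif

        int main(void)
        {
            printf("%d %d\n", DEMO_ISSPACE(0x20) != 0, DEMO_ISSPACE(0x41) != 0); /* 1 0 */
            return 0;
        }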

    @amauryfa
    Member

    Committed with r84177.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022