Message 105656 - Python tracker

Author	vstinner
Recipients	stutzbach, theller, vstinner
Date	2010-05-13.21:10:19
Message-id	<1273785022.02.0.930366439933.issue8670@psf.upfronthosting.co.za>

Content
Support for characters outside the Unicode BMP (code points > 0xffff) is incomplete in narrow builds (sizeof(Py_UNICODE) == 2) of Python 2:

$ ./python
Python 2.7b2+ (trunk:81139M, May 13 2010, 18:45:37) 
>>> x=u'\U00010000'
>>> x[0], x[1]
(u'\ud800', u'\udc00')
>>> len(x)
2
>>> ord(x)
Traceback (most recent call last):
  ...
TypeError: ord() expected a character, but string of length 2 found
>>> unichr(0x10000)
Traceback (most recent call last):
  ...
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
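
The two code units seen above follow from the standard UTF-16 surrogate arithmetic. A minimal sketch, assuming plain integers rather than a real narrow build; the helper name to_surrogate_pair() is hypothetical and is not part of CPython:

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


print([hex(unit) for unit in to_surrogate_pair(0x10000)])
# → ['0xd800', '0xdc00']
```

This matches x[0] and x[1] in the session above: a narrow build stores U+10000 as the pair 0xd800, 0xdc00, which is why len(x) is 2.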

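It looks better in Python 3: ord() and chr() handle the surrogate pair, although len() and indexing still expose the two code units: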

$ ./python 
Python 3.2a0 (py3k:81137:81138, May 13 2010, 18:50:51) 
>>> x='\U00010000'
>>> x[0], x[1]
('\ud800', '\udc00')
>>> len(x)
2
>>> ord(x)
65536
>>> chr(0x10000)
'\U00010000'

As for this issue, the problem is in the u_set() function. It should use PyUnicode_AsWideChar(), but PyUnicode_AsWideChar() doesn't support surrogates, whereas PyUnicode_FromWideChar() does.
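
To illustrate what surrogate support would mean when converting from 16-bit code units to a 32-bit wchar_t, here is a rough sketch in Python. It is not the CPython implementation: the function name utf16_units_to_code_points() is hypothetical, and passing lone surrogates through unchanged is an assumption made for the example.

```python
def utf16_units_to_code_points(units):
    """Merge surrogate pairs in a sequence of 16-bit code units.

    Hypothetical helper; models a surrogate-aware conversion to
    32-bit wide characters using plain integers.
    """
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Valid pair: rebuild the single code point above 0xFFFF.
            low = units[i + 1]
            result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP character or lone surrogate: copied through unchanged
            # (an assumption of this sketch).
            result.append(unit)
            i += 1
    return result


print(utf16_units_to_code_points([0x41, 0xD800, 0xDC00]))
# → [65, 65536]
```

A conversion without this pairing step would copy 0xd800 and 0xdc00 as two separate wide characters instead of producing 0x10000.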

History
Date	User	Action	Args
2010-05-13 21:10:22	vstinner	set	recipients: + vstinner, theller, stutzbach
2010-05-13 21:10:22	vstinner	set	messageid: <1273785022.02.0.930366439933.issue8670@psf.upfronthosting.co.za>
2010-05-13 21:10:20	vstinner	link	issue8670 messages
2010-05-13 21:10:20	vstinner	create