Issue 4678: Unicode: multiple chars for high code points

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/48928

classification

Title:	Unicode: multiple chars for high code points
Type:		Stage:
Components:	Unicode	Versions:	Python 3.0

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	ede, lemburg, loewis
Priority:	normal	Keywords:

Created on 2008-12-16 23:25 by ede, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (4)
msg77940 - (view)	Author: Eric Eisner (ede)	Date: 2008-12-16 23:25
I discovered this when trying to splice a string containing unicode codepoints higher than U+FFFF.

All examples on 32-bit Ubuntu Linux.

python 2.5.2 (for comparison):

    sys.maxunicode        # 1114111
    len(unichr(66674))    # 1
    len(u'\U00010472')    # 1
    len(u'𐑲')             # 2
    unichr(66674)[0]      # u'\U00010472'

python 3.0 (same behavior on ubuntu's rc1 package and my build (r67781) from svn):

    sys.maxunicode        # 65535
    len(chr(66674))       # 2
    len('\U00010472')     # 2
    len('𐑲')              # 2
    chr(66674)[0]         # '\ud801'

I expect the nth element of a string to be the nth codepoint, regardless of unicode settings. I don't know why the maxunicode is configured differently (both compiled by ubuntu), but is this the expected behavior?

If this is actually the expected behavior, how can I configure a build of python to use the larger maxunicode value?
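The '\ud801' returned by chr(66674)[0] in the report above is the high half of a UTF-16 surrogate pair, which is how a build with sys.maxunicode == 65535 stores code points above U+FFFF. The following is only a sketch of that arithmetic, not code from the issue; the helper name to_surrogates is hypothetical.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair.

    Hypothetical helper for illustration; not part of the standard library.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary range")
    offset = code_point - 0x10000          # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return high, low


high, low = to_surrogates(66674)           # 66674 == 0x10472
print(hex(high), hex(low))                 # 0xd801 0xdc72
```

The high value matches the '\ud801' seen in the Python 3.0 session, which is why indexing returned half a character rather than the full code point.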
msg77941 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2008-12-16 23:41
On 2008-12-17 00:25, Eric Eisner wrote:
> New submission from Eric Eisner <ede@mit.edu>:
>
> I discovered this when trying to splice a string containing unicode
> codepoints higher than U+FFFF
>
> all examples on 32-bit Ubuntu Linux
>
> python 2.5.2 (for comparison):
> sys.maxunicode # 1114111
> len(unichr(66674)) # 1
> len(u'\U00010472') # 1
> len(u'𐑲') # 2
> unichr(66674)[0] # u'\U00010472'
>
> python 3.0: (same behavior on ubuntu's rc1 package and my build(r67781)
> from svn)
> sys.maxunicode # 65535
> len(chr(66674)) # 2
> len('\U00010472') # 2
> len('𐑲') # 2
> chr(66674)[0] # '\ud801'
>
> I expect the nth element of a string to be the nth codepoint, regardless
> of unicode settings. I don't know why the maxunicode is configured
> differently (both compiled by ubuntu), but is this the expected behavior?
>
> If this is actually the expected behavior, how can I configure a build
> of python to use the larger maxunicode value?

You are seeing the different behavior because you've probably built Python 3.0 from source and used the Ubuntu default Python install for comparison.

The default Python 3.0 build will create a UCS2 build unless you specify the --enable-unicode=ucs4 configure option. The Ubuntu Python build (like many other Linux distros) uses this option by default.
msg77944 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-12-16 23:53
As Marc-Andre says, this is not a bug. Finding out the exact name of the configure option is left as an exercise.
msg77945 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2008-12-16 23:58
On 2008-12-17 00:53, Martin v. Löwis wrote:
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> As Marc-Andre says, this is not a bug. Finding out the exact name of the
> configure option is left as an exercise.

Ah, so that changed as well... for Python 3.0 it's called --with-wide-unicode.
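Whichever configure option was used, the resulting build can be identified at run time from sys.maxunicode, as the original report did. Below is a minimal sketch, assuming only the standard library; the helper names is_wide_build and code_point_count are hypothetical. It also shows one way to count code points that should not depend on the build, by encoding to UTF-32, where every code point occupies four bytes.

```python
import sys


def is_wide_build():
    """Return True if this interpreter's str can hold code points above U+FFFF directly."""
    return sys.maxunicode > 0xFFFF


def code_point_count(text):
    """Count code points by encoding to UTF-32 (4 bytes each, no BOM with -le)."""
    return len(text.encode("utf-32-le")) // 4


sample = "\U00010472"
print("wide build:", is_wide_build())
print("len():", len(sample))
print("code points:", code_point_count(sample))
```

On a build like the reporter's Python 3.0 (sys.maxunicode of 65535), len() would report 2 for this sample while the UTF-32 count stays at 1. On Python 3.3 and later, sys.maxunicode is always 1114111, so the two numbers agree.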

History
Date	User	Action	Args
2022-04-11 14:56:42	admin	set	github: 48928
2008-12-16 23:58:25	lemburg	set	messages: + msg77945
2008-12-16 23:53:14	loewis	set	status: open -> closed resolution: not a bug messages: + msg77944 nosy: + loewis
2008-12-16 23:41:46	lemburg	set	nosy: + lemburg messages: + msg77941
2008-12-16 23:25:24	ede	create