Classification
Title: Unicode: multiple chars for high code points
Components: Unicode
Versions: Python 3.0

Process
Status: closed
Resolution: not a bug
Nosy List: ede, lemburg, loewis
Priority: normal

Created on 2008-12-16 23:25 by ede, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (4)
msg77940 - Author: Eric Eisner (ede) Date: 2008-12-16 23:25
I discovered this when trying to slice a string containing Unicode
code points higher than U+FFFF.


All examples were run on 32-bit Ubuntu Linux.

python 2.5.2 (for comparison):
sys.maxunicode     # 1114111
len(unichr(66674)) # 1
len(u'\U00010472') # 1
len(u'𐑲')          # 2
unichr(66674)[0]   # u'\U00010472'


Python 3.0 (same behavior with Ubuntu's rc1 package and with my own
build from svn, r67781):
sys.maxunicode    # 65535
len(chr(66674))   # 2
len('\U00010472') # 2
len('𐑲')          # 2
chr(66674)[0]     # '\ud801'

I expect the nth element of a string to be the nth code point, regardless
of the Unicode build settings. I don't know why maxunicode is configured
differently (both were compiled by Ubuntu), but is this the expected behavior?

If this is actually the expected behavior, how can I configure a build
of Python to use the larger maxunicode value?
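
For reference, the '\ud801' result above is the high half of a UTF-16 surrogate pair. The following is a minimal sketch of that arithmetic, assuming the narrow build stores code points above U+FFFF as two UTF-16 code units; the helper name to_surrogate_pair is made up for illustration and is not part of Python.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000      # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(66674)   # 66674 == 0x10472
print(hex(high), hex(low))             # prints: 0xd801 0xdc72
```

This is why indexing element 0 of the two-unit string yields '\ud801' rather than the full character.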
msg77941 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2008-12-16 23:41
You are seeing the different behavior because you've probably
built Python 3.0 from source and used the default Ubuntu Python
install for comparison.

A default Python 3.0 build is a UCS2 build unless you specify
the --enable-unicode=ucs4 configure option.

The Ubuntu Python build (like those of many other Linux distros) uses
this option by default.
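
A quick way to tell which kind of build is running is to look at sys.maxunicode, as in the report. The sketch below is only illustrative: the helper names is_wide_build and utf16_length are hypothetical, and it assumes that a narrow build's len() corresponds to the number of UTF-16 code units. Note that on Python 3.3 and later the narrow/wide distinction no longer exists and sys.maxunicode is always 1114111.

```python
import sys


def is_wide_build():
    """True if this interpreter can hold any code point in one string element."""
    return sys.maxunicode > 0xFFFF


def utf16_length(text):
    """Number of UTF-16 code units in text (what len() reports on a narrow build)."""
    # "utf-16-le" emits no byte order mark; each code unit is two bytes.
    return len(text.encode("utf-16-le")) // 2


s = "\U00010472"
print("wide build:", is_wide_build())
print("len:", len(s), "UTF-16 code units:", utf16_length(s))
```

When the two printed lengths differ, len() is counting code points rather than code units, which is the behavior the original report expected.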
msg77944 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2008-12-16 23:53
As Marc-Andre says, this is not a bug. Finding out the exact name of the
configure option is left as an exercise.
msg77945 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2008-12-16 23:58
Ah, so that changed as well... for Python 3.0 it's called
--with-wide-unicode.

History
Date                 User     Action  Args
2022-04-11 14:56:42  admin    set     github: 48928
2008-12-16 23:58:25  lemburg  set     messages: + msg77945
2008-12-16 23:53:14  loewis   set     status: open -> closed; resolution: not a bug; messages: + msg77944; nosy: + loewis
2008-12-16 23:41:46  lemburg  set     nosy: + lemburg; messages: + msg77941
2008-12-16 23:25:24  ede      create