Message 77941 - Python tracker

Author	lemburg
Recipients	ede, lemburg
Date	2008-12-16.23:41:45
Message-id	<49483CB8.6090100@egenix.com>
In-reply-to	<1229469924.95.0.71145202333.issue4678@psf.upfronthosting.co.za>

Content

On 2008-12-17 00:25, Eric Eisner wrote:
> New submission from Eric Eisner <ede@mit.edu>:
> 
> I discovered this when trying to splice a string containing unicode
> codepoints higher than U+FFFF
> 
> 
> all examples on 32-bit Ubuntu Linux
> 
> python 2.5.2 (for comparison):
> sys.maxunicode     # 1114111
> len(unichr(66674)) # 1
> len(u'\U00010472') # 1
> len(u'𐑲')          # 2
> unichr(66674)[0]   # u'\U00010472'
> 
> 
> python 3.0: (same behavior on ubuntu's rc1 package and my build(r67781)
> from svn)
> sys.maxunicode    # 65535
> len(chr(66674))   # 2
> len('\U00010472') # 2
> len('𐑲')          # 2
> chr(66674)[0]     # '\ud801'
> 
> I expect the nth element of a string to be the nth codepoint, regardless
> of unicode settings. I don't know why the maxunicode is configured
> differently (both compiled by ubuntu), but is this the expected behavior?
> 
> If this is actually the expected behavior, how can I configure a build
> of python to use the larger maxunicode value?

You are probably seeing the different behavior because you built
Python 3.0 from source and used the Ubuntu default Python install
for comparison.

By default, Python 3.0 is built as a UCS2 (narrow) build, with
sys.maxunicode == 65535, unless you specify the
--enable-unicode=ucs4 configure option.

The Ubuntu Python build (like those of many other Linux distros)
enables this option by default.
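
As a rough illustration (not part of the original reply), the sketch
below shows how one might check which kind of build is running and why
a UCS2 build reports '\ud801' for chr(66674)[0]: the code point is
stored as a UTF-16 surrogate pair. The helper names build_kind and
surrogate_pair are made up for this example.

```python
import sys


def build_kind():
    """Classify the running interpreter by its sys.maxunicode value."""
    # 65535 (0xFFFF) indicates a narrow/UCS2 build; 1114111 (0x10FFFF) a wide one.
    # Python 3.3 and later always report 1114111.
    return "wide (UCS4)" if sys.maxunicode > 0xFFFF else "narrow (UCS2)"


def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the range U+10000..U+10FFFF")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


print(build_kind())
high, low = surrogate_pair(66674)     # 66674 == 0x10472
print(hex(high), hex(low))            # 0xd801 0xdc72
```

On a UCS2 build the string holds these two code units separately, so
its length is 2 and index 0 yields the high surrogate, which matches
the '\ud801' result quoted above.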

History
Date	User	Action	Args
2008-12-16 23:41:47	lemburg	set	recipients: + lemburg, ede
2008-12-16 23:41:46	lemburg	link	issue4678 messages
2008-12-16 23:41:45	lemburg	create