Message77941
On 2008-12-17 00:25, Eric Eisner wrote:
> New submission from Eric Eisner <ede@mit.edu>:
>
> I discovered this when trying to splice a string containing unicode
> codepoints higher than U+FFFF
>
>
> all examples on 32-bit Ubuntu Linux
>
> python 2.5.2 (for comparison):
> sys.maxunicode # 1114111
> len(unichr(66674)) # 1
> len(u'\U00010472') # 1
> len(u'𐑲') # 2
> unichr(66674)[0] # u'\U00010472'
>
>
> python 3.0: (same behavior on ubuntu's rc1 package and my build(r67781)
> from svn)
> sys.maxunicode # 65535
> len(chr(66674)) # 2
> len('\U00010472') # 2
> len('𐑲') # 2
> chr(66674)[0] # '\ud801'
>
> I expect the nth element of a string to be the nth codepoint, regardless
> of unicode settings. I don't know why the maxunicode is configured
> differently (both compiled by ubuntu), but is this the expected behavior?
>
> If this is actually the expected behavior, how can I configure a build
> of python to use the larger maxunicode value?
You are probably seeing the different behavior because you built
Python 3.0 from source and used the Ubuntu default Python install
for comparison.
A default Python 3.0 build from source is a UCS2 build unless you
specify the --enable-unicode=ucs4 configure option.
The Ubuntu Python build (like those of many other Linux distros) uses
this option by default.
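
To check which kind of build is running, and to index by code point regardless of the build, something along these lines may help. This is only a rough sketch: the helper names is_wide_build and code_points are made up for illustration and are not part of Python. It assumes that on a UCS2 build a character above U+FFFF is stored as a UTF-16 surrogate pair, which is what the chr(66674)[0] result above shows.

```python
import sys


def is_wide_build():
    # Hypothetical helper: True when one string element can hold any
    # code point (UCS4 build), False on a UCS2 build (maxunicode == 65535).
    return sys.maxunicode > 0xFFFF


def code_points(s):
    # Hypothetical helper: return the code points of s as integers,
    # joining each high/low surrogate pair into a single code point.
    result = []
    i = 0
    while i < len(s):
        unit = ord(s[i])
        if 0xD800 <= unit <= 0xDBFF and i + 1 < len(s):
            low = ord(s[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                # Standard UTF-16 decoding of a surrogate pair.
                result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
        # Ordinary element (or an unpaired surrogate): keep as-is.
        result.append(unit)
        i += 1
    return result
```

With this, code_points(chr(66674)) should contain a single entry on either kind of build, even where len(chr(66674)) is 2.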

Date | User | Action | Args
2008-12-16 23:41:47 | lemburg | set | recipients: + lemburg, ede
2008-12-16 23:41:46 | lemburg | link | issue4678 messages
2008-12-16 23:41:45 | lemburg | create |