Title: egg platform names don't reflect unicode variant (UCS2, UCS4)
Priority: feature
Status: chatting
Superseder:
Nosy List: midnightmagic, pje, zooko
Assigned To:
Keywords:

Created on 2009-06-10.17:54:13 by zooko, last changed 2009-10-14.14:40:32 by zooko.

msg406 (view) Author: zooko Date: 2009-10-14.14:40:32
See also a related thread on python-dev: "please consider
changing --enable-unicode default to ucs4". Basically, the reason we have this
problem on Linux and not on other platforms is that the de facto standard (ucs4)
differs from the de jure standard (ucs2).  I'm pleading with Python developers
to swallow their pride and bless the de facto standard as the de jure standard
on Linux so that more Linux users can converge on a common binary format.
msg384 (view) Author: pje Date: 2009-10-10.23:31:28
Note: unless somebody implements a patch, per

and/or the distutils.get_platform() adds this info, this is probably dead in the
water.  In any case, it's not suitable for the 0.6 branch ATM, so reclassifying
as a feature. (Also per the above distutils-sig post.)
msg355 (view) Author: zooko Date: 2009-09-22.02:39:35
According to the original
problem reported by David Abrahams and by midnightmagic can *not* be explained
by a UCS4/UCS2 incompatibility.  This ticket is still valid (egg platform names
don't reflect unicode variant), but the symptoms described by David Abrahams and
by midnightmagic are probably symptoms of a different bug.
msg354 (view) Author: zooko Date: 2009-09-20.13:52:46
This issue causes serious problems.  Users occasionally get binaries built for a
compatible Linux and Python version but with a different UCS2-vs-UCS4 setting,
and those users get very mysterious memory corruption errors which are hard to
diagnose.  It is possible that these situations also open up security
vulnerabilities.  A couple such instances are documented on this ticket, but you
can find more by googling.  I would like to get this fixed!

In order to help address this issue I thought I would check out what UCS size is
used by python executables in the wild.  I instrumented a few buildslaves that
are contributed by various people to the Tahoe-LAFS project to print out their
platform, python version, and sys.maxunicode.  The full results are appended
below.  maxunicode: 1114111 means that python executable was configured with
--enable-unicode=ucs4, and maxunicode: 65535 means that python executable was
configured with --enable-unicode=ucs2 or just with --enable-unicode .  The only
incompatibilities that I found are because some packagers have deliberately set
UCS4 configuration and other packagers have left the default setting. 
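For anyone who wants to repeat the survey on their own machine, here is a minimal sketch of the kind of report described above. It is not the script that ran on the buildslaves; the function names unicode_variant and describe_interpreter are made up for illustration. Note that on Python 3.3 and later sys.maxunicode is always 1114111, so the check only distinguishes builds of older interpreters.

```python
import platform
import sys


def unicode_variant(maxunicode=None):
    """Map a sys.maxunicode value to the Unicode width of the build.

    Hypothetical helper: 1114111 is reported by UCS4 builds and
    65535 by UCS2 builds.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    if maxunicode == 0x10FFFF:  # 1114111
        return "ucs4"
    if maxunicode == 0xFFFF:  # 65535
        return "ucs2"
    raise ValueError("unexpected sys.maxunicode: %r" % (maxunicode,))


def describe_interpreter():
    """Return one report line: platform, Python version, maxunicode."""
    version = " ".join(sys.version.split())  # collapse newlines and runs of spaces
    return "%s: python: %s, maxunicode: %d (%s)" % (
        platform.platform(),
        version,
        sys.maxunicode,
        unicode_variant(),
    )


if __name__ == "__main__":
    print(describe_interpreter())
```

Running it prints a single line in roughly the same shape as the results listed below.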

In the three cases where someone configured python with UCS2, one is certainly
an accident (a custom-built python executable on an Ubuntu server).  The other
two just use the default instead of specifically configuring ucs2 in their
configurations of Python.  I suspect that those packagers don't know the
difference, and that it was an accident that they built a Python incompatible
with other distributions of their operating system.

In sum, while it would be good to add the unicode setting to the platform's ABI
(as discussed in this ticket), it would also be good to make the default value
be UCS4 instead of UCS2.  This would fix all three of the potential
incompatibilities that I found (listed below), and once we have proper inclusion
of the unicode setting in the ABI, 

(I'm sure that someone can come up with a reason why UCS2 is better than UCS4,
but I'm also sure that the benefits of compatibility outweigh any benefits of
UCS2 encoding, and that the widespread use of UCS4 demonstrates that there is
nothing fatally wrong with it, and that people who really value UCS2 encoding
more than compatibility can choose that for themselves by explicitly setting UCS2.)



Ubuntu 6.10 "edgy" i386: python: 2.4.4c1 (#2, Mar  7 2008, 03:03:38)  [GCC 4.1.2
20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)], maxunicode: 1114111
Ubuntu 7.04 "feisty": python: 2.5.1 (r251:54863, Jul 31 2008, 22:53:39)  [GCC
4.1.2 (Ubuntu 4.1.2-0ubuntu4)], maxunicode: 1114111
Ubuntu 7.10 "gutsy" i386: python: 2.5.1 (r251:54863, Jul 31 2008, 23:17:40) 
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)], maxunicode: 1114111
Ubuntu 8.04 "hardy" amd64: python: 2.5.2 (r252:60911, Jul 22 2009, 15:33:10) 
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)], maxunicode: 1114111
Ubuntu 8.04 "hardy" i386: *custom* python: 2.6 (r26:66714, Oct  2 2008,
13:40:28)  [GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)], maxunicode: 65535
Ubuntu 8.04 "hardy" i386: python: 2.5.2 (r252:60911, Jul 22 2009, 15:35:03) 
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)], maxunicode: 1114111
Ubuntu 9.04 "jaunty" amd64: *custom* python: 2.6.2 (release26-maint, Apr 19
2009, 01:58:18)  [GCC 4.3.3], maxunicode: 1114111

Debian 4.0 "etch" i386: python: 2.4.4 (#2, Oct 22 2008, 19:52:44)  [GCC 4.1.2
20061115 (prerelease) (Debian 4.1.1-21)], maxunicode: 1114111
Debian 5.0 "lenny" i386: python: 2.5.2 (r252:60911, Jan  4 2009, 17:40:26)  [GCC
4.3.2], maxunicode: 1114111
Debian 5.0 "lenny" amd64: python: 2.5.2 (r252:60911, Jan  4 2009, 21:59:32) 
[GCC 4.3.2], maxunicode: 1114111
Debian 5.0 "lenny" armv5tel: python: 2.5.2 (r252:60911, Jan  5 2009, 02:00:00) 
[GCC 4.3.2], maxunicode: 1114111
Debian unstable "squeeze/sid" i386: python: 2.5.4 (r254:67916, Feb 17 2009,
20:16:45)  [GCC 4.3.3], maxunicode: 1114111

Fedora 11 "leonidas" amd64: python: 2.6 (r26:66714, Jul  4 2009, 17:37:13)  [GCC
4.4.0 20090506 (Red Hat 4.4.0-4)], maxunicode: 1114111

ArchLinux: python: 2.6.2 (r262:71600, Jul 20 2009, 02:23:30)  [GCC 4.4.0
20090630 (prerelease)], maxunicode: 65535

NetBSD 4: python: 2.5.2 (r252:60911, Mar 20 2009, 14:00:07)  [GCC 4.1.2 20060628
prerelease (NetBSD nb2 20060711)], maxunicode: 65535

OpenSolaris SunOS-5.11-i86pc-i386-32bit: python: 2.4.4 (#1, Mar 10 2009,
09:35:36) [C], maxunicode: 65535
Nexenta NCP1 SunOS-5.11-i86pc-i386-32bit: python: 2.4.3 (#2, May  3 2006,
19:12:42)  [GCC 4.0.3 (GNU_OpenSolaris 4.0.3-1nexenta4)], maxunicode: 1114111

Mac OS 10.6 "snow leopard" i386: python: 2.6.1 (r261:67515, Jul  7 2009,
23:51:51)  [GCC 4.2.1 (Apple Inc. build 5646)], maxunicode: 65535
Mac OS 10.5 "leopard" i386: python: 2.5.1 (r251:54863, Feb  6 2009, 19:02:12) 
[GCC 4.0.1 (Apple Inc. build 5465)], maxunicode: 65535
Mac OS 10.4 "tiger" *custom* python: 2.5.4 (release25-maint:72153M, Apr 30 2009,
12:28:20)  [GCC 4.0.1 (Apple Computer, Inc. build 5367)], maxunicode: 65535

Cygwin CYGWIN_NT-5.1-1.5.25-0.156-4-2-i686-32bit-WindowsPE: python: 2.5.2
(r252:60911, Dec  2 2008, 09:26:14)  [GCC 3.4.4 (cygming special, gdc 0.12,
using dmd 0.125)], maxunicode: 65535

Windows: python: 2.6.2 (r262:71600, Apr 21 2009, 15:05:37) [MSC v.1500 32 bit
(Intel)], maxunicode: 65535
Windows: python: 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)], maxunicode: 65535
msg324 (view) Author: zooko Date: 2009-07-09.00:37:53
Here are some notes from PJE about how to implement this:
msg309 (view) Author: midnightmagic Date: 2009-06-19.20:50:51
I appear also to be experiencing this issue. Machine/OS details:

NetBSD drake 3.99.23 NetBSD 3.99.23 (GENERIC) #0: Thu Jul 27 08:22:27 PDT 2006
cpu0: Intel Pentium III (686-class), 731.29 MHz, id 0x683
total memory = 382 MB

python2.5 -V
Python 2.5.2

[... rest of stacktrace elided ...]
  File "/v/tahoe/allmydata-tahoe-1.4.1/Twisted-8.2.0-py2.5-netbsd-3.99.23-i386.egg/twisted/internet/", line 539, in signCertificateRequest
    hlreq = CertificateRequest.load(requestData, requestFormat)
  File "/v/tahoe/allmydata-tahoe-1.4.1/Twisted-8.2.0-py2.5-netbsd-3.99.23-i386.egg/twisted/internet/", line 310, in load
  File "/v/tahoe/allmydata-tahoe-1.4.1/Twisted-8.2.0-py2.5-netbsd-3.99.23-i386.egg/twisted/internet/", line 64, in _copyFrom
    value = getattr(x509name, name, None)
exceptions.UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-5: 
unsupported Unicode code range

Failed to load application: 'utf8' codec can't decode bytes in position 0-5: 
unsupported Unicode code range

More details available upon request.
msg303 (view) Author: zooko Date: 2009-06-10.17:57:28
Here's the ticket on the Tahoe-LAFS issue tracker: # eggs don't say whether they have
UCS2 or UCS4 unicode implementation
msg302 (view) Author: zooko Date: 2009-06-10.17:54:12
A user of Tahoe-LAFS encountered an error in which pyOpenSSL emitted:

exceptions.UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-5:
unsupported Unicode code range

It took some effort on the part of the user and the Tahoe-LAFS devs to delve
into the code and figure out how an invalid string got into a certificate inside
pyOpenSSL.  Eventually the user (David Abrahams) figured out that the issue was
this one:

He had installed a pyOpenSSL egg which had been built with UCS4, but his Python
interpreter was UCS2.  According to the thread linked above, the best way to fix
this is for distutils get_platform() to include the unicode variant in its
output, and then for setuptools to test the compatibility of that field when
choosing an egg.  Is that right?
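To make the proposal concrete, here is a rough sketch of what the two halves could look like. None of this is existing distutils or setuptools behaviour: the "-ucs2"/"-ucs4" suffix, the function names, and the rule that an untagged egg is still accepted are all assumptions made for illustration, and sysconfig.get_platform() stands in for the distutils get_platform() mentioned above.

```python
import sys
import sysconfig

UNICODE_VARIANTS = ("ucs2", "ucs4")


def unicode_tag(maxunicode=None):
    """Return 'ucs4' for wide builds and 'ucs2' for narrow builds."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return "ucs4" if maxunicode > 0xFFFF else "ucs2"


def platform_with_unicode(base=None, maxunicode=None):
    """Hypothetical platform string with the unicode variant appended."""
    if base is None:
        base = sysconfig.get_platform()
    return "%s-%s" % (base, unicode_tag(maxunicode))


def split_unicode(tag):
    """Split 'linux-i686-ucs4' into ('linux-i686', 'ucs4').

    Tags without a recognised suffix come back as (tag, None).
    """
    base, sep, variant = tag.rpartition("-")
    if sep and variant in UNICODE_VARIANTS:
        return base, variant
    return tag, None


def compatible(egg_tag, running_tag):
    """Decide whether an egg's platform tag fits the running interpreter.

    Assumption: an egg with no unicode suffix predates the scheme and is
    accepted as before; a tagged egg must match the interpreter's variant.
    """
    egg_base, egg_variant = split_unicode(egg_tag)
    run_base, run_variant = split_unicode(running_tag)
    if egg_base != run_base:
        return False
    return egg_variant is None or egg_variant == run_variant
```

With a check like this, the UCS4 pyOpenSSL egg in the report above would have been rejected by a UCS2 interpreter at install time instead of failing later with a UnicodeDecodeError.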

What's the next step?
Date                 User           Action  Args
2009-10-14 14:40:32  zooko          set     messages: + msg406
2009-10-10 23:31:28  pje            set     priority: bug -> feature, nosy: + pje, messages: + msg384
2009-09-22 02:39:36  zooko          set     messages: + msg355
2009-09-20 13:52:47  zooko          set     messages: + msg354
2009-07-09 00:37:53  zooko          set     messages: + msg324
2009-06-19 20:50:52  midnightmagic  set     nosy: + midnightmagic, messages: + msg309
2009-06-10 17:57:29  zooko          set     status: unread -> chatting, messages: + msg303
2009-06-10 17:54:13  zooko          create