This issue causes serious problems. Users occasionally get binaries built for a
compatible Linux and Python version but with a different UCS2-vs-UCS4 setting,
and those users get very mysterious memory corruption errors which are hard to
diagnose. It is possible that these situations also open up security
vulnerabilities. A couple such instances are documented on this ticket, but you
can find more by googling. I would like to get this fixed!
In order to help address this issue I thought I would check out what UCS size is
used by python executables in the wild. I instrumented a few buildslaves that
are contributed by various people to the Tahoe-LAFS project to print out their
platform, python version, and sys.maxunicode. The full results are appended
below. maxunicode: 1114111 means that python executable was configured with
--enable-unicode=ucs4, and maxunicode: 65535 means that python executable was
configured with --enable-unicode=ucs2 or just with --enable-unicode . The only
incompatibilities that I found are because some packagers have deliberately set
UCS4 configuration and other packagers have left the default setting.
In the three cases where someone configured python with UCS2, one of the three
is certainly an accident (a custom-built python executable on an Ubuntu server)
and the other two just use the default instead of specifically configuring ucs2
in their configurations of Python and I suspect that they don't know the
difference and that it was an accident that they built a Python incompatible
with other distributions of their operating system.
In sum, while it would be good to add the unicode setting to the platform's ABI
(as discussed in this ticket), it would also be good to make the default value
be UCS4 instead of UCS2. This would fix all three of the potential
incompatibilities that I found (listed below), and once we have proper inclusion
of the unicode setting in the ABI,
(I'm sure that someone can come up with a reason why UCS2 is better than UCS4,
but I'm also sure that the benefits of compatibility outweigh any benefits of
UCS2 encoding, and that the widespread use of UCS4 demonstrates that there is
nothing fatally wrong with it, and that people who really value UCS2 encoding
more than compatibility can choose that for themselves by explicitly setting UCS2.)
Regards,
Zooko
Ubuntu 6.10 "edgy" i386: python: 2.4.4c1 (#2, Mar 7 2008, 03:03:38) [GCC 4.1.2
20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)], maxunicode: 1114111
Ubuntu 7.04 "feisty": python: 2.5.1 (r251:54863, Jul 31 2008, 22:53:39) [GCC
4.1.2 (Ubuntu 4.1.2-0ubuntu4)], maxunicode: 1114111
Ubuntu 7.10 "gutsy" i386: python: 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)], maxunicode: 1114111
Ubuntu 8.04 "hardy" amd64: python: 2.5.2 (r252:60911, Jul 22 2009, 15:33:10)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)], maxunicode: 1114111
Ubuntu 8.04 "hardy" i386: *custom* python: 2.6 (r26:66714, Oct 2 2008,
13:40:28) [GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)], maxunicode: 65535
Ubuntu 8.04 "hardy" i386: python: 2.5.2 (r252:60911, Jul 22 2009, 15:35:03)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)], maxunicode: 1114111
Ubuntu 9.04 "jaunty" amd64: *custom* python: 2.6.2 (release26-maint, Apr 19
2009, 01:58:18) [GCC 4.3.3], maxunicode: 1114111
Debian 4.0 "etch" i386: python: 2.4.4 (#2, Oct 22 2008, 19:52:44) [GCC 4.1.2
20061115 (prerelease) (Debian 4.1.1-21)], maxunicode: 1114111
Debian 5.0 "lenny" i386: python: 2.5.2 (r252:60911, Jan 4 2009, 17:40:26) [GCC
4.3.2], maxunicode: 1114111
Debian 5.0 "lenny" amd64: python: 2.5.2 (r252:60911, Jan 4 2009, 21:59:32)
[GCC 4.3.2], maxunicode: 1114111
Debian 5.0 "lenny" armv5tel: python: 2.5.2 (r252:60911, Jan 5 2009, 02:00:00)
[GCC 4.3.2], maxunicode: 1114111
Debian unstable "squeeze/sid" i386: python: 2.5.4 (r254:67916, Feb 17 2009,
20:16:45) [GCC 4.3.3], maxunicode: 1114111
Fedora 11 "leonidas" amd64: python: 2.6 (r26:66714, Jul 4 2009, 17:37:13) [GCC
4.4.0 20090506 (Red Hat 4.4.0-4)], maxunicode: 1114111
ArchLinux: python: 2.6.2 (r262:71600, Jul 20 2009, 02:23:30) [GCC 4.4.0
20090630 (prerelease)], maxunicode: 65535
NetBSD 4: python: 2.5.2 (r252:60911, Mar 20 2009, 14:00:07) [GCC 4.1.2 20060628
prerelease (NetBSD nb2 20060711)], maxunicode: 65535
OpenSolaris SunOS-5.11-i86pc-i386-32bit: python: 2.4.4 (#1, Mar 10 2009,
09:35:36) [C], maxunicode: 65535
Nexenta NCP1 SunOS-5.11-i86pc-i386-32bit: python: 2.4.3 (#2, May 3 2006,
19:12:42) [GCC 4.0.3 (GNU_OpenSolaris 4.0.3-1nexenta4)], maxunicode: 1114111
Mac OS 10.6 "snow leopard" i386: python: 2.6.1 (r261:67515, Jul 7 2009,
23:51:51) [GCC 4.2.1 (Apple Inc. build 5646)], maxunicode: 65535
Mac OS 10.5 "leopard" i386: python: 2.5.1 (r251:54863, Feb 6 2009, 19:02:12)
[GCC 4.0.1 (Apple Inc. build 5465)], maxunicode: 65535
Mac OS 10.4 "tiger" *custom* python: 2.5.4 (release25-maint:72153M, Apr 30 2009,
12:28:20) [GCC 4.0.1 (Apple Computer, Inc. build 5367)], maxunicode: 65535
Cygwin CYGWIN_NT-5.1-1.5.25-0.156-4-2-i686-32bit-WindowsPE: python: 2.5.2
(r252:60911, Dec 2 2008, 09:26:14) [GCC 3.4.4 (cygming special, gdc 0.12,
using dmd 0.125)], maxunicode: 65535
Windows: python: 2.6.2 (r262:71600, Apr 21 2009, 15:05:37) [MSC v.1500 32 bit
(Intel)], maxunicode: 65535
Windows: python: 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)], maxunicode: 65535
|