Author lemburg
Recipients ajung, lemburg, loewis, mark.dickinson
Date 2009-12-21.09:24:31
Message-id <4B2F3ECE.6080709@egenix.com>
In-reply-to <1261331200.96.0.262015712179.issue7551@psf.upfronthosting.co.za>
Content
All string length calculations in Python 2.4 are done using C ints,
which are 32 bits wide even on 64-bit platforms.

Since UTF-8 can use up to 4 bytes per Unicode code point, the encoder
overallocates the needed chunk of memory to len*4 bytes. This
will go straight over the 2GB limit the 32-bit int imposes if
you try to encode a 512M code point Unicode string.
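
To make the arithmetic concrete, here is a small sketch of the size calculation. It is not the actual encoder code; the names INT_MAX, utf8_worst_case_size and fits_in_c_int are made up for illustration, and the ctypes call only models what a two's-complement wrap-around of a signed 32-bit value would look like.

```python
import ctypes

# Largest value a signed 32-bit C int can hold (assumed width of "int" here).
INT_MAX = 2**31 - 1


def utf8_worst_case_size(num_code_points):
    """Worst-case UTF-8 buffer size: 4 bytes per code point (len*4)."""
    return num_code_points * 4


def fits_in_c_int(n):
    """True if n can be represented as a non-negative signed 32-bit int."""
    return 0 <= n <= INT_MAX


# 512M code points, as in the example above.
n = 512 * 2**20
size = utf8_worst_case_size(n)

# The requested size is exactly 2**31 bytes, one past INT_MAX.
overflows = not fits_in_c_int(size)

# Illustration only: the same value squeezed into a signed 32-bit slot.
# ctypes does no overflow checking, so the value wraps around.
wrapped = ctypes.c_int32(size).value

print(size, overflows, wrapped)
```

One code point fewer and the worst-case size (2**31 - 4 bytes) still fits, so 512M code points is exactly the point where len*4 stops being representable.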

The reason for using ints to represent string length is simple:
no one really expected that someone would work with 2GB strings
in memory at the time the string API was designed (large hard
drives held around 2GB back then). Strings of that size are
simply not supported by Python 2.4.

BTW: I wouldn't really count on Python 2.4 working properly on
64-bit platforms. A lot of issues were fixed in Python 2.5
related to 32/64-bit differences.
History
Date User Action Args
2009-12-21 09:24:34  lemburg  set     recipients: + lemburg, loewis, ajung, mark.dickinson
2009-12-21 09:24:32  lemburg  link    issue7551 messages
2009-12-21 09:24:31  lemburg  create