Message 96732 - Python tracker


Author	lemburg
Recipients	ajung, lemburg, loewis, mark.dickinson
Date	2009-12-21.09:24:31
Message-id	<4B2F3ECE.6080709@egenix.com>
In-reply-to	<1261331200.96.0.262015712179.issue7551@psf.upfronthosting.co.za>

Content
All string length calculations in Python 2.4 are done using C ints,
which are 32-bit even on 64-bit platforms.

Since UTF-8 can use up to 4 bytes per Unicode code point, the encoder
allocates a worst-case buffer of len*4 bytes. That size goes
straight over the 2GB limit the 32-bit int imposes if you try to
encode a 512M code point Unicode string.
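
To make the arithmetic concrete, here is a minimal sketch in plain
Python. It is not the actual encoder code: the helper names
(utf8_worst_case_size, fits_in_c_int, wrap_to_int32) are made up for
illustration, and the wraparound shown is the typical two's-complement
result, which C does not guarantee for signed overflow.

```python
# Largest value a signed 32-bit C int can hold (assumed int width).
INT_MAX = 2**31 - 1


def utf8_worst_case_size(length):
    """Worst-case UTF-8 buffer size: 4 bytes per code point."""
    return length * 4


def fits_in_c_int(n):
    """True if n is representable as a non-negative 32-bit signed int."""
    return 0 <= n <= INT_MAX


def wrap_to_int32(n):
    """Value a 32-bit signed int would typically hold after wraparound."""
    return (n + 2**31) % 2**32 - 2**31


length = 512 * 2**20                  # a 512M code point string
needed = utf8_worst_case_size(length)

print(needed)                         # 2147483648, i.e. exactly 2**31
print(fits_in_c_int(needed))          # False: one past INT_MAX
print(wrap_to_int32(needed))          # -2147483648 on typical hardware
print(INT_MAX // 4)                   # 536870911: longest length that fits
```

So the worst-case size for a 512M code point string is exactly one
more than a 32-bit int can represent, and anything up to 2**29 - 1
code points still fits.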

The reason for using ints to represent string length is simple:
when the string API was designed, no one really expected that
someone would work with 2GB strings in memory (large hard drives
had around 2GB at that time). Strings of such size are simply not
supported by Python 2.4.

BTW: I wouldn't really count on Python 2.4 working properly on
64-bit platforms. A lot of issues related to 32/64-bit differences
were fixed in Python 2.5.
