classification
Title: SystemError/MemoryError/OverflowErrors on encode() a unicode string
Type:
Stage:
Components:
Versions: Python 2.4

process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To:
Nosy List: ajung, lemburg, loewis, mark.dickinson
Priority: normal
Keywords:

Created on 2009-12-20 16:39 by ajung, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (6)
msg96695 - Author: Andreas Jung (ajung) Date: 2009-12-20 16:39
We encountered some pretty bizarre behavior in Python 2.4.6 while encoding a unicode string 'data'
of about 600 million characters:

Python 2.4.6 (8GB RAM, 64 bit)

(Pdb) type(data)
<type 'unicode'>

(Pdb) len(data)
601794657

(Pdb) data2=data.encode('utf-8')
*** SystemError: Negative size passed to PyString_FromStringAndSize

Assuming that this has something to do with a 512MB limit:

(Pdb) data2=data[:512*1024*1024].encode('utf-8')
*** SystemError: Negative size passed to PyString_FromStringAndSize

Same bug...now with 512MB - 1 byte:

(Pdb) data2=data[:(256*1024*1024)-1].encode('utf-8')
OverflowError

Cross-check on a different Linux box (4GB RAM, 4 GB Swap, 64 bit)

ajung@blackmoon:~> python2.4
Python 2.4.5 (#1, Jun  9 2008, 10:35:12) 
[GCC 4.2.1 (SUSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> data = u'x'*601794657
>>> data2= data.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
MemoryError

Where does this difference in behavior come from?
msg96697 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-12-20 17:09
Is the first machine also a Linux machine?  Perhaps the difference is that 
the first machine has a wide-unicode build (i.e., it uses UCS4 internally) 
and the other doesn't?
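
One way to answer the narrow/wide question on each machine is to look at sys.maxunicode. The following is only a minimal sketch: the helper name unicode_build_kind is made up for illustration, and it assumes a Python 2.x interpreter (on Python 3.3 and later sys.maxunicode is always 0x10FFFF, so the distinction no longer applies).

```python
import sys

def unicode_build_kind(maxunicode=None):
    """Classify a Unicode build from its sys.maxunicode value.

    Hypothetical helper for illustration; not part of any Python API.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # Narrow (UCS2) builds report 0xFFFF, wide (UCS4) builds report 0x10FFFF.
    if maxunicode <= 0xFFFF:
        return "narrow"
    return "wide"

print(unicode_build_kind())
```

Running this on both boxes would show whether the builds differ in their internal Unicode width.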

Unfortunately there's not much that the python-devs can do about this 
unless the problem is still present in Python 2.6:  Python 2.4 is now more 
than 5 years old and is no longer maintained, while Python 2.5 is only 
receiving security fixes at this stage.  Can you reproduce the problem 
with Python 2.6?
msg96698 - Author: Andreas Jung (ajung) Date: 2009-12-20 17:19
Both systems are Linux systems running a narrow Python build.

The problem does not occur with Python 2.5 or 2.6.

Unfortunately this error occurs with Zope 2, which is tied to Python 2.4 
(at least for versions prior to Zope 2.12).
msg96702 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-12-20 17:46
Well, the signature of PyUnicode_Encode in Python 2.4 (see 
Objects/unicodeobject.c) is:

PyObject *PyUnicode_Encode(const Py_UNICODE *s,
			   int size,
			   const char *encoding,
			   const char *errors)

which looks like it might be relevant to the problems you're seeing.  In 
2.6, the size has type Py_ssize_t instead, which should be a 64-bit type 
on 64-bit Linux.
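
To make the difference between the two size types concrete, here is a rough sketch that uses ctypes to report how wide a C int and a ssize_t are on the current platform. It assumes Python 2.7 or later (where ctypes.c_ssize_t is available), and it treats ctypes.c_ssize_t as a stand-in for Py_ssize_t, which is an assumption that holds on common platforms.

```python
import ctypes

# Width in bits of the two C types on this platform.
int_bits = ctypes.sizeof(ctypes.c_int) * 8
ssize_bits = ctypes.sizeof(ctypes.c_ssize_t) * 8

# Largest value each signed type can hold.
max_int = 2 ** (int_bits - 1) - 1
max_ssize = 2 ** (ssize_bits - 1) - 1

print("int:     %d bits, max %d" % (int_bits, max_int))
print("ssize_t: %d bits, max %d" % (ssize_bits, max_ssize))
```

On a typical 64-bit Linux system the first line would be expected to show a 32-bit int and the second a 64-bit ssize_t, which is the gap between the 2.4 and 2.6 signatures.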

Closing this, since it's out of date for current Python.
msg96713 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-12-20 18:42
Just to support Mark's decision: Python 2.4 is no longer maintained; you
are on your own with any problems you encounter with it. So closing it
as "won't fix" would also have been appropriate.

The same holds for 2.5, unless you can demonstrate this to cause
security issues (e.g. crashing the Python interpreter).
msg96732 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-12-21 09:24
All string length calculations in Python 2.4 are done using ints
which are 32-bit, even on 64-bit platforms.

Since UTF-8 can use up to 4 bytes per Unicode code point, the encoder
overallocates the needed chunk of memory to len*4 bytes. This
will go straight over the 2GB limit the 32-bit int imposes if
you try to encode a 512M code point Unicode string.
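
The arithmetic can be sketched in Python by wrapping len*4 to a signed 32-bit value. This is only a model of the overflow, not the actual 2.4 encoder code: the helper names to_int32 and utf8_worst_case_alloc are made up for illustration, and two's-complement wraparound is assumed (signed overflow is formally undefined in C, although it usually wraps in practice).

```python
def to_int32(value):
    """Wrap an integer to a signed 32-bit value (two's complement)."""
    value &= 0xFFFFFFFF
    if value >= 0x80000000:
        return value - 0x100000000
    return value

def utf8_worst_case_alloc(length):
    """Model the len*4 overallocation as computed in a 32-bit C int."""
    return to_int32(length * 4)

for n in (601794657, 512 * 1024 * 1024, 512 * 1024 * 1024 - 1):
    print("%d -> %d" % (n, utf8_worst_case_alloc(n)))
```

Under these assumptions the first two lengths yield negative sizes (-1887788668 and -2147483648), which would match the "Negative size passed to PyString_FromStringAndSize" error, while 512M - 1 code points yields 2147483644, just under the 2GB limit.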

The reason for using ints to represent string length is simple:
no one really expected that someone would work with 2GB strings
in memory at the time the string API was designed (large hard
drives had around 2GB at that time) - strings of such size are
simply not supported by Python 2.4.

BTW: I wouldn't really count on Python 2.4 working properly on
64-bit platforms. A lot of issues were fixed in Python 2.5
related to 32/64-bit differences.
History
Date                 User            Action  Args
2022-04-11 14:56:55  admin           set     github: 51800
2009-12-21 09:24:33  lemburg         set     nosy: + lemburg; messages: + msg96732
2009-12-20 18:42:13  loewis          set     nosy: + loewis; messages: + msg96713
2009-12-20 17:46:39  mark.dickinson  set     status: open -> closed; messages: + msg96702
2009-12-20 17:19:14  ajung           set     status: pending -> open; messages: + msg96698
2009-12-20 17:09:05  mark.dickinson  set     status: open -> pending; nosy: + mark.dickinson; messages: + msg96697; resolution: out of date
2009-12-20 16:39:50  ajung           create