Message 51668 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	lemburg
Recipients
Date	2007-01-10.20:59:13
SpamBayes Score
Marked as misclassified
Message-id
In-reply-to

Content
Larry, I probably wasn't clear enough: PyUnicode_AS_UNICODE() returns a pointer to the underlying Py_UNICODE buffer. No API using this macro checks for a NULL return value of the macro since a Unicode object is guaranteed to have a non-NULL Py_UNICODE buffer. As a result, a memory caused during the concatenation process cannot be passed back up the call stack. The NULL return value would result in a plain segfault in the calling API. Regarding the tradeoff and trying such an approach: I've done such tests myself (not with Unicode but with 8-bit strings) and it didn't pay off. The memory consumption outweighs the performance you gain by using the 'x += y' approach. The ''.join(list) approach also doesn't really help if you're after performance (for much the same reasons). In mxTextTools I used slice integers pointing into the original parsed string to work around these problems, which works great and avoids creating short strings altogether (so you gain speed and memory). A patch I would find a lot more useful is one to create a Unicode alternative to cStringIO - for strings, this is by far the most performant way of creating a larger string from lots of small pieces. To complement this, a smart slice type might also be an attractive target; one that breaks up a larger string into slices and provides operations on these, including joining them to form a new string. I'm not convinced that murking with the underlying object type and doing "subtyping" on-the-fly is a clean design.

Larry, I probably wasn't clear enough:

PyUnicode_AS_UNICODE() returns a pointer to the underlying Py_UNICODE buffer. No API using this macro checks for a NULL return value of the macro since a Unicode object is guaranteed to have a non-NULL Py_UNICODE buffer. As a result, a memory caused during the concatenation process cannot be passed back up the call stack. The NULL return value would result in a plain segfault in the calling API.

Regarding the tradeoff and trying such an approach: I've done such tests myself (not with Unicode but with 8-bit strings) and it didn't pay off. The memory consumption outweighs the performance you gain by using the 'x += y' approach. The ''.join(list) approach also doesn't really help if you're after performance (for much the same reasons). 

In mxTextTools I used slice integers pointing into the original parsed string to work around these problems, which works great and avoids creating short strings altogether (so you gain speed and memory).

A patch I would find a lot more useful is one to create a Unicode alternative to cStringIO - for strings, this is by far the most performant way of creating a larger string from lots of small pieces. To complement this, a smart slice type might also be an attractive target; one that breaks up a larger string into slices and provides operations on these, including joining them to form a new string.

I'm not convinced that murking with the underlying object type and doing "subtyping" on-the-fly is a clean design.

History
Date	User	Action	Args
2007-08-23 15:56:02	admin	link	issue1629305 messages
2007-08-23 15:56:02	admin	create