OVERVIEW

This patch consists of three changes to CPython:

* changing PyStringObject.ob_sval,
* "lazy concatenations", and
* "lazy slices".

None of these changes adds new functionality to CPython; they are all speed or memory optimizations.  (I did add one experimental new string method, described at the end of the LAZY SLICES section.)

In detail:

PyStringObject.ob_sval was changed from a char[] array to a char *.  This is not in and of itself particularly desirable; it was necessary in order to implement the other two changes.

"Lazy concatenations" change string concatenation ("a" + "b") so that, instead of directly calculating the resulting string, it returns a placeholder object representing the result.  As a result, string concatenation in CPython is now more than 150% faster on average (as reported by pybench 2.0), and is approximately as fast as the standard string concatenation idiom ("".join([a, b, c])).

"Lazy slices" change string slicing ("abc"[1:], "a".strip()) so that, instead of directly calculating the resulting string, it returns a placeholder object representing the result.  As a result, string slicing in CPython is now more than 60% faster on average (as reported by pybench 2.0).

When considering this patch, please keep in mind that the two "lazy" changes are distinct and could be incorporated independently.  In particular, I'm guessing that "lazy concatenations" has a much higher chance of being accepted than "lazy slices".

These changes were implemented almost entirely in Include/stringobject.h and Objects/stringobject.c.  With this patch applied, trunk builds and passes all expected tests on Win32 and Linux.


HISTORY

I've never liked the "".join([]) idiom for string concatenation; in my opinion it violates the principles "Beautiful is better than ugly." and "There should be one-- and preferably only one --obvious way to do it."  (And perhaps several others.)  I discussed this with my friend Mike Thornburgh; on the spur of the moment, he suggested how I could fix it, describing a solution with surprising accuracy and detail.

My original "v1" patch, discussed on c.l.p here:

    http://groups.google.com/group/comp.lang.python/tree/browse_frm/thread/b8a8f20bc3c81bcf

also discussed on the Python-Dev mailing list here:

    http://mail.python.org/pipermail/python-dev/2006-October/069224.html

and posted here:

    http://sourceforge.net/tracker/index.php?func=detail&aid=1569040&group_id=5470&atid=305470

contained just the "PyStringObject.ob_sval" and "lazy concatenations" changes, and was more or less what Mike suggested.

After further consideration, I realized it would be possible to make "lazy slices" in much the same way that I had made "lazy concatenations".  This resulted in the "v2" patch, which I redubbed the "lazy strings" patch.


PYSTRINGOBJECT.OB_SVAL

In CPython, "struct PyStringObject" is the in-memory representation of a non-Unicode string.  It currently looks like this:

    typedef struct {
        PyObject_VAR_HEAD
        long ob_shash;
        int ob_sstate;
        char ob_sval[1];
    } PyStringObject;

This structure has not changed since Python 2.3, released in late July 2003.

The string is stored in ob_sval.  When the object is allocated, enough space is reserved so that the PyStringObject header plus the whole string fit in one contiguous allocation.  For example, to store a 20-character string, CPython would allocate:

    malloc(sizeof(PyStringObject) + 20)

It would then memcpy() the string directly into ob_sval.
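To make the layout concrete, here is a simplified, stand-alone sketch of the "one contiguous allocation" scheme described above.  It is illustrative only: the struct and function names are hypothetical stand-ins, not CPython's actual code (the real constructor is PyString_FromStringAndSize()).

    /* Illustrative sketch only -- hypothetical names, not CPython's code. */
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        size_t ob_size;      /* stand-in for PyObject_VAR_HEAD */
        long   ob_shash;
        int    ob_sstate;
        char   ob_sval[1];   /* string bytes live here, in the same
                                allocation as the header */
    } FakeStringObject;

    static FakeStringObject *
    fake_string_new(const char *data, size_t size)
    {
        /* sizeof(FakeStringObject) already includes one byte of ob_sval,
           which will hold the terminating '\0'. */
        FakeStringObject *op = malloc(sizeof(FakeStringObject) + size);
        if (op == NULL)
            return NULL;
        op->ob_size = size;
        op->ob_shash = -1;       /* hash not computed yet */
        op->ob_sstate = 0;
        memcpy(op->ob_sval, data, size);
        op->ob_sval[size] = '\0';
        return op;
    }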
My patch changes PyStringObject to look like this:

    typedef struct {
        PyObject_VAR_HEAD
        long ob_shash;
        int ob_sstate;
        char *ob_sval;
        char ob_svalstorage[1];
    } PyStringObject;

ob_sval is now a pointer.  In the conventional case, where this object represents a plain string, it simply points to &(ob_svalstorage[0]), and the malloc() and memcpy() proceed as before.

Upsides:

* None; this change is not in and of itself desirable.  However, it was necessary in order to create the more exotic string objects used in the other two changes.

Downsides:

* This change adds an extra pointer to every string object (four bytes on 32-bit platforms).

* It also changes the Python API, requiring a recompile of all extension modules.  (Or, at least, the ones that use PyStrings.)

* It adds an extra pointer dereference when using a string object, which theoretically slows down the interpreter slightly, though in practice this slowdown is immeasurably small.

* Finally, there is a self-described "hack" in Mac/Modules/MacOS.c that would need reworking as a result of this change.  (Fixing that is beyond my pay grade; I've never done Mac programming.)


LAZY CONCATENATIONS

In existing CPython, when concatenating two strings, the resulting object directly represents the concatenation.  When adding "a" + "b", the result is a PyStringObject containing "ab".

I changed CPython so that string concatenation returns a new object:

    typedef struct {
        PyObject_VAR_HEAD
        long ob_shash;
        int ob_sstate;
        char *ob_sval;   /* this object matches a PyStringObject to this point */
        unsigned short ob_srightrecursiondepth;
        unsigned short ob_sstringsindex;
        PyStringObject *ob_sstrings[PYSTRING_CONCATENATIONS];
    } PyStringConcatenationObject;

This object is, for all practical purposes, a "subclass" of PyStringObject.  When adding "a" + "b", the result is now a PyStringConcatenationObject containing references to the two strings:

    ob_sval = NULL
    ob_srightrecursiondepth = 0
    ob_sstringsindex = 2
    ob_sstrings[0] = "a"
    ob_sstrings[1] = "b"

The actual literal value of the string is not computed until someone calls PyString_AsString() on the concatenation object, which I refer to as "rendering" the string.  To render, PyString_AsString() allocates a buffer large enough to hold the resulting string, walks the tree of strings stored in the object and copies them into the buffer, releases its references to the objects in the tree, and stores a pointer to the buffer in ob_sval.

Adding a string concatenation object to a string, or adding two string concatenation objects together, works as expected.  In many cases one of the string concatenation objects involved can be modified to represent the result; this saves creating a new object.

The implementation is heavily optimized towards the common case of appending strings together.  Simple concatenation returns a tree with recursion only down the left-hand side: ob_sstrings[0] points to another PyStringConcatenationObject, and all other slots in ob_sstrings point to normal PyStringObjects.  In this case, both rendering and deallocation are purely iterative and require no actual recursion.  However, more exotic scenarios still work.  In cases where recursion is actually necessary, the patch limits the recursion depth, preventing stack overflows.
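The following stand-alone sketch illustrates the render-on-demand idea described above.  It is not the patch's actual code -- the type and function names are hypothetical, and it renders a flat array of C strings rather than a tree of PyStringObject references -- but it shows the basic mechanism: the concatenation object holds its pieces and builds the final zero-terminated buffer only when the C string value is first requested.

    /* Illustrative sketch only -- hypothetical names, not the patch's code. */
    #include <stdlib.h>
    #include <string.h>

    #define MAX_PIECES 8                /* stand-in for PYSTRING_CONCATENATIONS */

    typedef struct {
        char *rendered;                 /* plays the role of ob_sval; NULL = lazy */
        size_t npieces;                 /* plays the role of ob_sstringsindex */
        const char *pieces[MAX_PIECES]; /* plays the role of ob_sstrings */
    } LazyConcat;

    static const char *
    lazy_concat_as_string(LazyConcat *op)
    {
        if (op->rendered == NULL) {     /* first request: render */
            size_t total = 0, i;
            char *p;
            for (i = 0; i < op->npieces; i++)
                total += strlen(op->pieces[i]);
            p = malloc(total + 1);
            if (p == NULL)
                return NULL;
            op->rendered = p;
            for (i = 0; i < op->npieces; i++) {
                size_t n = strlen(op->pieces[i]);
                memcpy(p, op->pieces[i], n);
                p += n;
            }
            *p = '\0';
            /* the real patch would also release its references to the
               pieces here */
        }
        return op->rendered;
    }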
Upsides:

* String concatenation is more than 150% faster.  Without the patch, the "ConcatStrings" test in pybench took 204ms; with the patch it took 76ms.

* In particular, string *prepending* (x = "a" + x) is much faster.  There have already been many tweaks to CPython to speed up string appending, but they are much less applicable to string prepending.  In pathological cases, prepending with lazy concatenations has been observed to be 100x faster than stock 2.5.  (This is not a surprising result.)

Downsides:

* A PyStringConcatenationObject is thirty-six bytes larger than a PyStringObject on 32-bit platforms.  That space can actually be reclaimed when the string is rendered, if the result of the concatenation fits in the extra space.  But in general these extra thirty-six bytes are a dead loss once the string is rendered.

* It is now possible for PyStringObject.ob_sval to be NULL.  String objects with a NULL ob_sval are "lazy" objects; calling PyString_AsString() on one renders the object and changes PyStringObject.ob_sval to be non-NULL.  This adds an extra check, and a possible function call, in every place where C code refers to the char * value of a string.  The official way in CPython to get the char * value of a string object is the PyString_AS_STRING macro.  I changed that macro as follows, hiding the NULL check and the function call:

      #define PyString_AS_STRING(x) ( x->ob_sval ? x->ob_sval : PyString_AsString(x) )

  Code that calls this macro, instead of referring to ob_sval directly, will not need to be modified, just recompiled.  (PIL, NumPy, PyWin32, and SWIG all use this macro exclusively, never referring directly to ob_sval.)


LAZY SLICES

In existing CPython, when taking a "slice" of a string, the resulting object directly represents the string slice.  When computing "abcde"[2:4], the result is a PyStringObject containing "cd".

I changed CPython so that string slicing returns a new object:

    typedef struct {
        PyObject_VAR_HEAD
        long ob_shash;
        int ob_sstate;
        char *ob_sval;
        unsigned short ob_srightrecursiondepth;   /* this object matches a
                               PyStringConcatenationObject to this point */
        PyStringObject *ob_slchild;
        Py_ssize_t ob_slstart;
        Py_ssize_t ob_slend;
    } PyStringSliceObject;

This object is, for all practical purposes, a "subclass" of PyStringObject, and shares the additional ob_srightrecursiondepth field with PyStringConcatenationObject.

With lazy slices, the result of "abcde"[2:4] is now a PyStringSliceObject containing a reference to "abcde" in ob_slchild, with ob_slstart = 2, ob_slend = 4, and ob_sval = NULL.  The actual literal value of the string is not computed until someone calls PyString_AsString() on the lazy slice object, which I refer to as "rendering" the string.  To render, PyString_AsString() allocates a buffer large enough to hold the resulting string, copies the value directly in, stores a pointer to the buffer in ob_sval, and finally releases the reference to the original string (and sets ob_slchild to NULL).

(Actually, the previous paragraph started with a lie; "abcde"[2:4] would still return a new PyStringObject containing "cd".  There is a minimum slice size for a PyStringSliceObject, currently set experimentally to 20.)
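Here is a matching stand-alone sketch of rendering for a lazy slice.  Again the names are hypothetical and the child is a plain C string rather than a PyStringObject; the point is just the shape of the mechanism: the slice object keeps a reference to the original string plus start/end offsets, and copies the bytes out only when the C string value is first requested.

    /* Illustrative sketch only -- hypothetical names, not the patch's code. */
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char *rendered;        /* plays the role of ob_sval; NULL = lazy */
        const char *child;     /* plays the role of ob_slchild */
        size_t start, end;     /* play the roles of ob_slstart / ob_slend */
    } LazySlice;

    static const char *
    lazy_slice_as_string(LazySlice *op)
    {
        if (op->rendered == NULL) {        /* first request: render */
            size_t n = op->end - op->start;
            char *p = malloc(n + 1);
            if (p == NULL)
                return NULL;
            memcpy(p, op->child + op->start, n);
            p[n] = '\0';
            op->rendered = p;
            op->child = NULL;  /* the real patch releases its reference here */
        }
        return op->rendered;
    }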
If that were all this change contained, there would be no point.  Delaying calculating a string slice buys you little; the only potential speedup would be in taking a slice of a slice.  The real payoff of this approach is in working on string slices *without* rendering them.

The CPython API dictates that all strings are zero-terminated.  Any time an external module sees a PyStringObject, ob_sval[ob_size] *must* be '\0', even though technically that is beyond the "end" of the string.  It is not permissible to allow external modules to see unterminated strings, so changing PyStringObject so that strings might occasionally be unterminated is out of the question.

But consider this: in the vast majority of cases, strings live out their lives without ever being passed into an external module.  Such strings are processed solely by stringobject.c.  And nearly all the code in stringobject.c doesn't care whether or not the string has a '\0' at the end.  If you already know the length of the string, it's faster to process it based on its length than to scan for the '\0'.

This patch therefore adds a new function to stringobject.c:

    static char *
    pystring_as_unterminated_string(register PyStringObject *op)

If the PyStringObject * passed in is an unrendered string slice, it returns a pointer into the *original* string, which is explicitly *not* guaranteed to end with a '\0'.  Again: this behavior is *never* observed by anyone outside of stringobject.c.  External users of PyStringObjects call PyString_AS_STRING(), which renders all lazy concatenations and lazy slices so that they look just like normal zero-terminated PyStringObjects.  (With my patch applied, trunk still passes all expected tests.)
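Building on the LazySlice sketch above, the "unterminated" access path looks roughly like this (again with hypothetical names, not the patch's actual function): for an unrendered slice it hands back a pointer directly into the original buffer and skips the copy entirely, and the caller promises to work from the returned length rather than looking for a '\0'.

    /* Illustrative sketch only, reusing the LazySlice type and
       lazy_slice_as_string() from the previous sketch -- hypothetical
       names, not the patch's code. */
    static const char *
    lazy_slice_as_unterminated_string(LazySlice *op, size_t *len)
    {
        *len = op->end - op->start;
        if (op->rendered == NULL && op->child != NULL) {
            /* unrendered slice: point into the original string; the
               result is NOT guaranteed to be zero-terminated */
            return op->child + op->start;
        }
        /* already rendered: fall back to the zero-terminated buffer */
        return lazy_slice_as_string(op);
    }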
The patch also changed nearly every reference to ob_sval in stringobject.c so that it calls this function instead.  As a result, it is rarely necessary to "render" a string slice.  A test run of "pybench -n 2" (two rounds, warp 10) created a total of 640,041 lazy slices, of which only *19* were ever rendered.

Finally, the patch changed code that was effectively creating a slice so that it explicitly uses the string_slice() function.  There are several operations on strings that effectively create slices, like the strip(), partition(), and split() methods.  All of these now create PyStringSliceObjects.

Upsides:

* String slicing is more than 60% faster on average.  Without the patch, the "StringSlicing" test in pybench took 142ms; with the patch it took 86ms.  Note that the speed increase from lazy slices grows with the length of the slice.

Downsides:

* All the downsides listed previously for the "lazy concatenation" change.

Unknowns:

* This change will affect memory usage by CPython.  *How* is unclear.  There are specific scenarios where the change in memory use is predictable:

    * In some scenarios, memory usage would be worse: when there is at least one long-lived small lazy slice of a large string, the large string itself would otherwise have been dereferenced and freed, and the small slice is never examined by code outside of stringobject.c.  In that case the large string becomes long-lived too, and Python consumes more memory overall.  In pathological scenarios this memory usage could be characterized as "insane".

    * In other scenarios, memory usage would be better: when the slices of a string constitute all or nearly all of the original string, or in any scenario where the large string would not outlive the slices.

    * And there are many scenarios where memory usage is a wash: when the strings are passed in to external modules, or when the slices are short-lived.

  But it is impossible to predict this patch's *overall* effect on memory use without trying it.  My hunch is that memory usage will *decrease* on average, if only slightly.  (But it's nothing more than a hunch as of yet.)

To mitigate the "insane" scenario above, I added a new method on string objects: str.simplify().  All it does is force an unrendered string to render; if the string is already rendered, it is a no-op.  So in the following code:

    longLivedString = tenMegabyteString[5:50].simplify()

longLivedString would not contain a reference to tenMegabyteString.

I am torn about this; on the one hand, it seems silly to add a new string method just for this.  On the other hand, it does give the programmer relief from the "insane memory consumption" scenario.

It's a tuning thing.  99% of the time, you don't care, and you enjoy the benefit of faster string slices; the remaining 1% of the time, you call .simplify() on your slices and your code behaves much the same as it did under 2.5.  *Should* the programmer have to do this?  They didn't have to under 2.5...!  True enough.  But consider the "".join([]) idiom in Python: it is already accepted practice to use clunky idioms for the sake of performance, and I think this one is pretty readable by comparison.  And it would be the exception rather than the rule.


OTHER APPROACHES: ROPES AND STRINGVIEWS

My "lazy concatenation" and "lazy slice" objects are akin to simple, specialized "ropes":

    http://en.wikipedia.org/wiki/Rope_%28computer_science%29

I'm told that ABC (the precursor to Python) used ropes instead of simple string objects, and that "it was no fun for anyone":

    http://groups.google.com/group/comp.lang.python/tree/browse_frm/thread/b8a8f20bc3c81bcf/fa70810e56370312#doc_064ee3e0ebabd218

I'm guessing it was "no fun" because ropes can't be used like normal C strings, which forced ABC programmers at the C level to use the rope interfaces exclusively.  My patch doesn't have this problem; the lazy objects are "rendered" before their values are examined, at which point they are normal C strings.

On the temporal flip side, my "lazy slice" objects are similar to the "stringview" objects proposed on the Python 3000 mailing list:

    http://mail.python.org/pipermail/python-3000/2006-August/003224.html

While this is of academic interest, I see them as two distinct conversations: I'm proposing something for Python 2.6, not Python 3000, and I'm proposing changing the fundamental string object rather than introducing a new object.  My feeling is: either these optimizations are good enough for the general case, or they should be dropped entirely.  I don't think it would be helpful to add a new string-ish type to the language just to provide these optimizations; that would occlude the "one obvious way to do it".


POSSIBLE FUTURE DIRECTIONS

* Add more heuristics for string slicing:

    * Allow the user to set the minimum size of a slice?  Probably only useful for experimental purposes.

    * Add a "ratio of slice size to original string size" threshold?  I imagine this would be too pessimistic in most circumstances.

* Consider other ways of explicitly forcing a string object to render.  str.simplify() works, but it'd be nice if we didn't have to add a new string method in order to expose that functionality.  Is there somewhere else we could put it that would be better?