OVERVIEW

This patch consists of three changes to CPython:

* changing PyStringObject.ob_sval,
* "lazy concatenations", and
* "lazy slices".

None of these changes adds new functionality to CPython; they are all speed or memory optimizations.  (I did add one experimental new string method, described at the end of the LAZY SLICES section.)

In detail:

PyStringObject.ob_sval was changed from a char[] array to a char *.  This is not in and of itself particularly desirable; it was necessary in order to implement the other two changes.

"Lazy concatenations" change string concatenation ("a" + "b") so that, instead of directly calculating the resulting string, it returns a placeholder object representing the result.  As a result, string concatenation in CPython is now more than 150% faster on average (as reported by pybench 2.0), and is approximately as fast as the standard string concatenation idiom ("".join([a, b, c])).

"Lazy slices" change string slicing ("abc"[1:], "a".strip()) so that, instead of directly calculating the resulting string, it returns a placeholder object representing the result.  As a result, string slicing in CPython is now more than 60% faster on average (as reported by pybench 2.0).

When considering this patch, please keep in mind that the two "lazy" changes are distinct and could be incorporated independently.  In particular, I'm guessing that "lazy concatenations" has a much higher chance of being accepted than "lazy slices".

These changes were implemented almost entirely in Include/stringobject.h and Objects/stringobject.c.  With this patch applied, trunk builds and passes all expected tests on Win32 and Linux.


HISTORY

I've never liked the "".join([]) idiom for string concatenation; in my opinion it violates the principles "Beautiful is better than ugly." and "There should be one-- and preferably only one --obvious way to do it."  (And perhaps several others.)  I discussed this with my friend Mike Thornburgh; on the spur of the moment, he suggested how I could fix it, describing a solution with surprising accuracy and detail.

My original "v1" patch, discussed on c.l.p here:

    http://groups.google.com/group/comp.lang.python/tree/browse_frm/thread/b8a8f20bc3c81bcf

also discussed on the Python-Dev mailing list here:

    http://mail.python.org/pipermail/python-dev/2006-October/069224.html

and posted here:

    http://sourceforge.net/tracker/index.php?func=detail&aid=1569040&group_id=5470&atid=305470

contained just the "PyStringObject.ob_sval" and "lazy concatenations" changes, and was more or less what Mike suggested.

After further consideration, I realized it would be possible to make "lazy slices" in much the same way that I had made "lazy concatenations".  This resulted in the "v2" patch, which I redubbed the "lazy strings" patch.


PYSTRINGOBJECT.OB_SVAL

In CPython, "struct PyStringObject" is the in-memory representation of a non-Unicode string.  It currently looks like this:

    typedef struct {
        PyObject_VAR_HEAD
        long ob_shash;
        int ob_sstate;
        char ob_sval[1];
    } PyStringObject;

This structure has not changed since Python 2.3, released in late July 2003.

The string is stored in ob_sval.  When the object is allocated, enough space is reserved so that the PyStringObject header plus the whole string fit in one contiguous allocation.  For example, to store a 20-character string, CPython would allocate:

    malloc(sizeof(PyStringObject) + 20)

It would then memcpy() the string directly into ob_sval.
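To make the layout concrete, here is a simplified, stand-alone sketch of the "one contiguous allocation" scheme described above.  It is illustrative only: the struct and function names are hypothetical stand-ins, not CPython's actual code (the real constructor is PyString_FromStringAndSize()).

    /* Illustrative sketch only -- hypothetical names, not CPython's code. */
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        size_t ob_size;      /* stand-in for PyObject_VAR_HEAD */
        long   ob_shash;
        int    ob_sstate;
        char   ob_sval[1];   /* string bytes live here, in the same
                                allocation as the header */
    } FakeStringObject;

    static FakeStringObject *
    fake_string_new(const char *data, size_t size)
    {
        /* sizeof(FakeStringObject) already includes one byte of ob_sval,
           which will hold the terminating '\0'. */
        FakeStringObject *op = malloc(sizeof(FakeStringObject) + size);
        if (op == NULL)
            return NULL;
        op->ob_size = size;
        op->ob_shash = -1;       /* hash not computed yet */
        op->ob_sstate = 0;
        memcpy(op->ob_sval, data, size);
        op->ob_sval[size] = '\0';
        return op;
    }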
My patch changes PyStringObject to look like this:

    typedef struct {
        PyObject_VAR_HEAD
        long ob_shash;
        int ob_sstate;
        char *ob_sval;
        char ob_svalstorage[1];
    } PyStringObject;

ob_sval is now a pointer.  In the conventional case, where this object represents a plain string, it simply points to &(ob_svalstorage[0]), and the malloc() and memcpy() proceed as before.

Upsides:

* None; this change is not in and of itself desirable.  However, it was necessary in order to create the more exotic string objects used in the other two changes.

Downsides:

* This change adds an extra pointer to every string object (four bytes on 32-bit platforms).

* It also changes the Python API, requiring a recompile of all extension modules.  (Or, at least, the ones that use PyStrings.)

* It adds an extra pointer dereference when using a string object, which theoretically slows down the interpreter slightly, though in practice this slowdown is immeasurably small.

* Finally, there is a self-described "hack" in Mac/Modules/MacOS.c that would need reworking as a result of this change.  (Fixing that is beyond my pay grade; I've never done Mac programming.)


LAZY CONCATENATIONS

In existing CPython, when concatenating two strings, the resulting object directly represents the concatenation.  When adding "a" + "b", the result is a PyStringObject containing "ab".

I changed CPython so that string concatenation returns a new object:

    typedef struct {
        PyObject_VAR_HEAD
        long ob_shash;
        int ob_sstate;
        char *ob_sval;   /* this object matches a PyStringObject to this point */
        unsigned short ob_srightrecursiondepth;
        unsigned short ob_sstringsindex;
        PyStringObject *ob_sstrings[PYSTRING_CONCATENATIONS];
    } PyStringConcatenationObject;

This object is, for all practical purposes, a "subclass" of PyStringObject.  When adding "a" + "b", the result is now a PyStringConcatenationObject containing references to the two strings:

    ob_sval = NULL
    ob_srightrecursiondepth = 0
    ob_sstringsindex = 2
    ob_sstrings[0] = "a"
    ob_sstrings[1] = "b"

The actual literal value of the string is not computed until someone calls PyString_AsString() on the concatenation object, which I refer to as "rendering" the string.  To render, PyString_AsString() allocates a buffer large enough to hold the resulting string, walks the tree of strings stored in the object and copies them into the buffer, releases its references to the objects in the tree, and stores a pointer to the buffer in ob_sval.

Adding a string concatenation object to a string, or adding two string concatenation objects together, works as expected.  In many cases one of the string concatenation objects involved can be modified to represent the result; this saves creating a new object.

The implementation is heavily optimized towards the common case of appending strings together.  Simple concatenation returns a tree with recursion only down the left-hand side: ob_sstrings[0] points to another PyStringConcatenationObject, and all other slots in ob_sstrings point to normal PyStringObjects.  In this case, both rendering and deallocation are purely iterative and require no actual recursion.  However, more exotic scenarios still work.  In cases where recursion is actually necessary, the patch limits the recursion depth, preventing stack overflows.
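The following stand-alone sketch illustrates the render-on-demand idea described above.  It is not the patch's actual code -- the type and function names are hypothetical, and it renders a flat array of C strings rather than a tree of PyStringObject references -- but it shows the basic mechanism: the concatenation object holds its pieces and builds the final zero-terminated buffer only when the C string value is first requested.

    /* Illustrative sketch only -- hypothetical names, not the patch's code. */
    #include <stdlib.h>
    #include <string.h>

    #define MAX_PIECES 8                /* stand-in for PYSTRING_CONCATENATIONS */

    typedef struct {
        char *rendered;                 /* plays the role of ob_sval; NULL = lazy */
        size_t npieces;                 /* plays the role of ob_sstringsindex */
        const char *pieces[MAX_PIECES]; /* plays the role of ob_sstrings */
    } LazyConcat;

    static const char *
    lazy_concat_as_string(LazyConcat *op)
    {
        if (op->rendered == NULL) {     /* first request: render */
            size_t total = 0, i;
            char *p;
            for (i = 0; i < op->npieces; i++)
                total += strlen(op->pieces[i]);
            p = malloc(total + 1);
            if (p == NULL)
                return NULL;
            op->rendered = p;
            for (i = 0; i < op->npieces; i++) {
                size_t n = strlen(op->pieces[i]);
                memcpy(p, op->pieces[i], n);
                p += n;
            }
            *p = '\0';
            /* the real patch would also release its references to the
               pieces here */
        }
        return op->rendered;
    }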
Upsides:

* String concatenation is more than 150% faster.  Without the patch, the "ConcatStrings" test in pybench took 204ms; with the patch it took 76ms.

* In particular, string *prepending* (x = "a" + x) is much faster.  There have already been many tweaks to CPython to speed up string appending, but they are much less applicable to string prepending.  In pathological cases, prepending with lazy concatenations has been observed to be 100x faster than stock 2.5.  (This is not a surprising result.)

Downsides:

* A PyStringConcatenationObject is thirty-six bytes larger than a PyStringObject on 32-bit platforms.  That space can actually be reclaimed when the string is rendered, if the result of the concatenation fits in the extra space.  But in general these extra thirty-six bytes are a dead loss once the string is rendered.

* It is now possible for PyStringObject.ob_sval to be NULL.  String objects with a NULL ob_sval are "lazy" objects; calling PyString_AsString() on one renders the object and changes PyStringObject.ob_sval to be non-NULL.  This adds an extra check, and a possible function call, in every place where C code refers to the char * value of a string.  The official way in CPython to get the char * value of a string object is the PyString_AS_STRING macro.  I changed that macro as follows, hiding the NULL check and the function call:

      #define PyString_AS_STRING(x) ( x->ob_sval ? x->ob_sval : PyString_AsString(x) )

  Code that calls this macro, instead of referring to ob_sval directly, will not need to be modified, just recompiled.  (PIL, NumPy, PyWin32, and SWIG all use this macro exclusively, never referring directly to ob_sval.)


LAZY SLICES

In existing CPython, when taking a "slice" of a string, the resulting object directly represents the string slice.  When computing "abcde"[2:4], the result is a PyStringObject containing "cd".

I changed CPython so that string slicing returns a new object:

    typedef struct {
        PyObject_VAR_HEAD
        long ob_shash;
        int ob_sstate;
        char *ob_sval;
        unsigned short ob_srightrecursiondepth;   /* this object matches a
                               PyStringConcatenationObject to this point */
        PyStringObject *ob_slchild;
        Py_ssize_t ob_slstart;
        Py_ssize_t ob_slend;
    } PyStringSliceObject;

This object is, for all practical purposes, a "subclass" of PyStringObject, and shares the additional ob_srightrecursiondepth field with PyStringConcatenationObject.

With lazy slices, the result of "abcde"[2:4] is now a PyStringSliceObject containing a reference to "abcde" in ob_slchild, with ob_slstart = 2, ob_slend = 4, and ob_sval = NULL.  The actual literal value of the string is not computed until someone calls PyString_AsString() on the lazy slice object, which I refer to as "rendering" the string.  To render, PyString_AsString() allocates a buffer large enough to hold the resulting string, copies the value directly in, stores a pointer to the buffer in ob_sval, and finally releases the reference to the original string (and sets ob_slchild to NULL).

(Actually, the previous paragraph started with a lie; "abcde"[2:4] would still return a new PyStringObject containing "cd".  There is a minimum slice size for a PyStringSliceObject, currently set experimentally to 20.)
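Here is a matching stand-alone sketch of rendering for a lazy slice.  Again the names are hypothetical and the child is a plain C string rather than a PyStringObject; the point is just the shape of the mechanism: the slice object keeps a reference to the original string plus start/end offsets, and copies the bytes out only when the C string value is first requested.

    /* Illustrative sketch only -- hypothetical names, not the patch's code. */
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char *rendered;        /* plays the role of ob_sval; NULL = lazy */
        const char *child;     /* plays the role of ob_slchild */
        size_t start, end;     /* play the roles of ob_slstart / ob_slend */
    } LazySlice;

    static const char *
    lazy_slice_as_string(LazySlice *op)
    {
        if (op->rendered == NULL) {        /* first request: render */
            size_t n = op->end - op->start;
            char *p = malloc(n + 1);
            if (p == NULL)
                return NULL;
            memcpy(p, op->child + op->start, n);
            p[n] = '\0';
            op->rendered = p;
            op->child = NULL;  /* the real patch releases its reference here */
        }
        return op->rendered;
    }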
If that were all this change contained, there would be no point.  Delaying calculating a string slice buys you little; the only potential speedup would be in taking a slice of a slice.  The real payoff of this approach is in working on string slices *without* rendering them.

The CPython API dictates that all strings are zero-terminated.  Any time an external module sees a PyStringObject, ob_sval[ob_size] *must* be '\0', even though technically that is beyond the "end" of the string.  It is not permissible to allow external modules to see unterminated strings, so changing PyStringObject so that strings might occasionally be unterminated is out of the question.

But consider this: in the vast majority of cases, strings live out their lives without ever being passed into an external module.  Such strings are processed solely by stringobject.c.  And nearly all the code in stringobject.c doesn't care whether or not the string has a '\0' at the end.  If you already know the length of the string, it's faster to process it based on its length than to scan for the '\0'.

This patch therefore adds a new function to stringobject.c:

    static char *
    pystring_as_unterminated_string(register PyStringObject *op)

If the PyStringObject * passed in is an unrendered string slice, it returns a pointer into the *original* string, which is explicitly *not* guaranteed to end with a '\0'.  Again: this behavior is *never* observed by anyone outside of stringobject.c.  External users of PyStringObjects call PyString_AS_STRING(), which renders all lazy concatenations and lazy slices so that they look just like normal zero-terminated PyStringObjects.  (With my patch applied, trunk still passes all expected tests.)
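Building on the LazySlice sketch above, the "unterminated" access path looks roughly like this (again with hypothetical names, not the patch's actual function): for an unrendered slice it hands back a pointer directly into the original buffer and skips the copy entirely, and the caller promises to work from the returned length rather than looking for a '\0'.

    /* Illustrative sketch only, reusing the LazySlice type and
       lazy_slice_as_string() from the previous sketch -- hypothetical
       names, not the patch's code. */
    static const char *
    lazy_slice_as_unterminated_string(LazySlice *op, size_t *len)
    {
        *len = op->end - op->start;
        if (op->rendered == NULL && op->child != NULL) {
            /* unrendered slice: point into the original string; the
               result is NOT guaranteed to be zero-terminated */
            return op->child + op->start;
        }
        /* already rendered: fall back to the zero-terminated buffer */
        return lazy_slice_as_string(op);
    }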
The patch also changed nearly every reference to ob_sval in stringobject.c so that it calls this function instead.  As a result, it is rarely necessary to "render" a string slice.  A test run of "pybench -n 2" (two rounds, warp 10) created a total of 640,041 lazy slices, of which only *19* were ever rendered.

Finally, the patch changed code that was effectively creating a slice so that it explicitly uses the string_slice() function.  There are several operations on strings that effectively create slices, like the strip(), partition(), and split() methods.  All of these now create PyStringSliceObjects.

Upsides:

* String slicing is more than 60% faster on average.  Without the patch, the "StringSlicing" test in pybench took 142ms; with the patch it took 86ms.  Note that the speed increase from lazy slices grows with the length of the slice.

Downsides:

* All the downsides listed previously for the "lazy concatenation" change.

Unknowns:

* This change will affect memory usage by CPython.  *How* is unclear.  There are specific scenarios where the change in memory use is predictable:

    * In some scenarios, memory usage would be worse: when there is at least one long-lived small lazy slice of a large string, the large string itself would otherwise have been dereferenced and freed, and the small slice is never examined by code outside of stringobject.c.  In that case the large string becomes long-lived too, and Python consumes more memory overall.  In pathological scenarios this memory usage could be characterized as "insane".

    * In other scenarios, memory usage would be better: when the slices of a string constitute all or nearly all of the original string, or in any scenario where the large string would not outlive the slices.

    * And there are many scenarios where memory usage is a wash: when the strings are passed in to external modules, or when the slices are short-lived.

  But it is impossible to predict this patch's *overall* effect on memory use without trying it.  My hunch is that memory usage will *decrease* on average, if only slightly.  (But it's nothing more than a hunch as of yet.)

To mitigate the "insane" scenario above, I added a new method on string objects: str.simplify().  All it does is force an unrendered string to render; if the string is already rendered, it is a no-op.  So in the following code:

    longLivedString = tenMegabyteString[5:50].simplify()

longLivedString would not contain a reference to tenMegabyteString.

I am torn about this; on the one hand, it seems silly to add a new string method just for this.  On the other hand, it does give the programmer relief from the "insane memory consumption" scenario.

It's a tuning thing.  99% of the time, you don't care, and you enjoy the benefit of faster string slices; the remaining 1% of the time, you call .simplify() on your slices and your code behaves much the same as it did under 2.5.  *Should* the programmer have to do this?  They didn't have to under 2.5...!  True enough.  But consider the "".join([]) idiom in Python: it is already accepted practice to use clunky idioms for the sake of performance, and I think this one is pretty readable by comparison.  And it would be the exception rather than the rule.


OTHER APPROACHES: ROPES AND STRINGVIEWS

My "lazy concatenation" and "lazy slice" objects are akin to simple, specialized "ropes":

    http://en.wikipedia.org/wiki/Rope_%28computer_science%29

I'm told that ABC (the precursor to Python) used ropes instead of simple string objects, and that "it was no fun for anyone":

    http://groups.google.com/group/comp.lang.python/tree/browse_frm/thread/b8a8f20bc3c81bcf/fa70810e56370312#doc_064ee3e0ebabd218

I'm guessing it was "no fun" because ropes can't be used like normal C strings, which forced ABC programmers at the C level to use the rope interfaces exclusively.  My patch doesn't have this problem; the lazy objects are "rendered" before their values are examined, at which point they are normal C strings.

On the temporal flip side, my "lazy slice" objects are similar to the "stringview" objects proposed on the Python 3000 mailing list:

    http://mail.python.org/pipermail/python-3000/2006-August/003224.html

While this is of academic interest, I see them as two distinct conversations: I'm proposing something for Python 2.6, not Python 3000, and I'm proposing changing the fundamental string object rather than introducing a new object.  My feeling is: either these optimizations are good enough for the general case, or they should be dropped entirely.  I don't think it would be helpful to add a new string-ish type to the language just to provide these optimizations; that would occlude the "one obvious way to do it".


POSSIBLE FUTURE DIRECTIONS

* Add more heuristics for string slicing:

    * Allow the user to set the minimum size of a slice?  Probably only useful for experimental purposes.

    * Add a "ratio of slice size to original string size" threshold?  I imagine this would be too pessimistic in most circumstances.

* Consider other ways of explicitly forcing a string object to render.  str.simplify() works, but it'd be nice if we didn't have to add a new string method in order to expose that functionality.  Is there somewhere else we could put it that would be better?