classification
Title: improved allocation of PyUnicode objects
Type: enhancement Stage: patch review
Components: Interpreter Core, Unicode Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: BreamoreBoy, Rhamphoryncus, ajaksu2, amaury.forgeotdarc, benjamin.peterson, collinwinter, eric.smith, ezio.melotti, ferringb, gvanrossum, haypo, jafo, jimjjewett, lemburg, mark.dickinson, orivej, pitrou, rhettinger, terry.reedy
Priority: normal Keywords: patch

Created on 2008-01-26 22:44 by pitrou, last changed 2011-09-29 20:03 by haypo. This issue is now closed.

Files
File name Uploaded Description Edit
stringbench3k.py pitrou, 2008-01-26 22:45 a quick port to py3k of stringbench.py
unialloc4.patch pitrou, 2008-03-20 19:37
freelists2.patch pitrou, 2008-03-20 19:37
unialloc5.patch pitrou, 2009-05-23 23:09
unialloc6.patch pitrou, 2010-10-06 23:21
Messages (96)
msg61728 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-01-26 22:44
This is an attempt at improving allocation of str (PyUnicode) objects.
For that it does two things:
1. make str objects PyVarObjects, so that the object is allocated in one
pass rather than two
2. maintain an array of freelists, each for a given str length, so that
many str allocations can bypass actual memory allocation calls

There is a ~10% speedup in stringbench, and a slight improvement in
pybench (as invoked with "./python Tools/pybench/pybench.py -t String").
Also, memory consumption is a bit lower when running those benchmarks.
msg61730 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-01-26 23:27
The Unicode object was designed not to be a PyVarObject (in contrast to
the string object), so I'm not sure what you would want to break that
design in return for a mere 10% speedup.

Note that turning the objects into PyVarObjects removes the ability to
extend the type at C level. It also introduces lots of problems if you
want to resize a Unicode object, due to the fact that the memory address
of the whole PyUnicodeObject can change. Resizing is done quite often in
codecs.

Regarding your second point: Unicode objects already use a free list and
even keep the allocated Py_UNICODE buffer allocated (if it's not too
large). Furthermore, the allocation is handled by pymalloc, so you only
rarely need to go to the system malloc() to allocate more space.

Tuning the KEEPALIVE_SIZE_LIMIT will likely have similar effects w/r to
speedup.
msg61732 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-01-27 00:24
I just tried bumping KEEPALIVE_SIZE_LIMIT to 200. It makes up for a bit
of the speedup, but the PyVarObject version is still faster. Moreover,
and perhaps more importantly, it eats less memory (800KB less in
pybench, 3MB less when running the whole test suite).

(then there are of course microbenchmarks. For example:
# With KEEPALIVE_SIZE_LIMIT = 200
./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
1000 loops, best of 3: 556 usec per loop
# With this patch
./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
1000 loops, best of 3: 391 usec per loop
)

I don't understand the argument for codecs having to resize the unicode
object. Since codecs also have to resize the bytes object when encoding,
the argument should apply to bytes objects as well, yet bytes (and str
in 2.6) is a PyVarObject.

I admit I don't know the exact reasons for PyUnicode's design. I just
know that the patch doesn't break the API, and runs the test suite fine.
Are there any specific things to look for?
msg61735 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-01-27 13:52
Your microbenchmark is biased towards your patched version. The
KEEPALIVE_SIZE_LIMIT will only cut in when you deallocate and then
reallocate Unicode objects. The free list used for Unicode objects is
also limited to 1024 objects - which isn't all that much. You could tune
 MAX_UNICODE_FREELIST_SIZE as well.

Regarding memory usage: this is difficult to measure in Python, since
pymalloc will keep memory chunks allocated even if they are not in use
by Python. However, this is a feature of pymalloc and not specific to
the Unicode implementation. It can be tuned in pymalloc. To get more
realistic memory measurements, you'd have to switch off pymalloc
altogether and then create a separate process that consumes lots of
memory to force the OS to have it allocate only memory that's really
needed to the process you're running for memory measurements. Of course,
keeping objects alive in a free list will always use more memory than
freeing them altogether and returning the memory to the OS. It's a
speed/space tradeoff. The RAM/CPU costs ratio has shifted a lot towards
RAM nowadays, so using more RAM is usually more efficient than using
more CPU time.

Regarding resize: you're right - the string object is a PyVarObject as
well and couldn't be changed at the time due to backwards compatibility
reasons. You should also note that when I added Unicode to Python 1.6,
it was a new and not commonly used type. Codecs were not used much
either, so there was no incentive to make resizing strings work better.
Later on, other optimizations were added to the Unicode implementation
that caused the PyUnicode_Resize() API to also require being able to
change the object address. Still, in the common case, it doesn't change
the object address.

The reason for using an external buffer for the Unicode object was to be
able to do further optimizations, such as share buffers between Unicode
objects. We never ended up using this, though, but there's still a lot
of room for speedups and more memory efficiency because of this design.

Like I already mentioned, PyObjects are also easier to extend at C level
- adding new variables to the object at the end is easy with PyObjects.
It's difficult for PyVarObjects, since you always have to take the
current size of the object into account and you always have to use
indirection to get at the extra variables due to the undefined offset of
the variables.

How much speedup do you get when you compare the pybench test with
KEEPALIVE_SIZE_LIMIT = 200 compared to your patched version ?
msg61738 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-01-27 15:35
With KEEPALIVE_SIZE_LIMIT = 200, the pybench runtime is basically the
same as with my patched version. stringbench remains a bit faster though
(~8%).

You say that RAM size is cheaper than CPU power today, which is true but
misses one part of the picture: the importance of CPU caches, and thus
of working set size. There are also probably people wanting to use
Python in memory-constrained environments (embedded).

I understand the argument about possible optimizations with an external
buffer, but are such optimizations likely to be implemented? (see
#1590352 and #1629305). If they really are, then I'm happy with the
unicode type remaining a plain PyObject!
msg61740 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-01-27 16:03
I don't really see the connection with #1629305. 

An optimization that would be worth checking is hooking up the
Py_UNICODE pointer to interned Unicode objects if the contents match
(e.g. do a quick check based on the hash value and then a memcmp). That
would save memory and the call to the pymalloc allocator.

Another strategy could involve a priority queue style cache with the aim
of identifying often used Unicode strings and then reusing them. 

This could also be enhanced using an offline approach: you first run an
application with an instrumented Python interpreter to find the most
often used strings and then pre-fill the cache or interned dictionary on
the production Python interpreter at startup time.

Coming from a completely different angle, you could also use the
Py_UNICODE pointer to share slices of a larger data buffer. A Unicode
sub-type could handle this case, keeping a PyObject* reference to the
larger buffer, so that it doesn't get garbage collected before the
Unicode slice.

Regarding memory constrained environments: these should simply switch
off all free lists and pymalloc. OTOH, even mobile phones come with
gigabytes of RAM nowadays, so it's not really worth the trouble, IMHO.
msg61741 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-01-27 16:46
All of those proposals are much heavier to implement; they also make the
unicode type more complicated and difficult to maintain probably (while
turning it into a PyVarObject actually shortens the implementation a
bit). In any case, it's not something I want to tackle.

The reason that I mentioned #1629305 was that it was such an
optimization which complicated the unicode type while bringing very
significant speedups.
msg61743 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-01-27 18:05
Agreed, those optimizations do make the implementation more complicated.
It's also not clear whether they would really be worth it.

#1629305 only provided speedups for the case where you write s += 'abc'.
The usual idiom for this is to use a list and then concatenate in one
go. If you want a really fast approach, you'd use cStringIO or perhaps
the bufferarray. I don't think that optimizing for just one particular
use case warrants making the code more complicated or changing the C
interface radically.

In your case, I think that closing the door for being able to easily
extend the type implement at the C is the main argument against it. The
speedups are only marginal and can also be achieved (to some extent) by
tuning the existing implementation's parameters.
msg61747 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-01-27 19:03
I know it's not the place to discuss #1629305, but the join() solution
is not always faster. Why? Because 1) there's the list contruction and
method call overhead 2) ceval.c has some bytecode hackery to try and
make plain concatenations somewhat less slow.

As for cStringIO, I actually observed in some experiments that it was
slower than the other alternatives, at least for short strings. All in
all, choosing between the three idioms is far from obvious and needs
case-by-case analysis.
msg61832 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-01-29 21:54
FWIW, I tried using the freelist scheme introduced in my patch without
making PyUnicode a PyVarObject and, although it's better than the
standard version, it's still not as good as the PyVarObject version.
Would you be interested in that patch?
msg61862 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-01-30 11:45
Yes, definitely.

Some comments on style in your first patch:

 * please use unicode->length instead of the macro LENGTH you added
 * indents in unicodeobject.c are 4 spaces
 * line length should stay below 80
msg61867 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-01-30 15:56
After some more tests I must qualify what I said. The freelist patch is
an improvement in some situations. In others it does not really have any
impact. On the other hand, the PyVarObject version handles memory-bound
cases dramatically better, see below.

With a small string:
./python -m timeit -s "s=open('INTBENCH', 'r').read()" "s.split()"
-> Unpatched py3k: 26.4 usec per loop
-> Freelist patch: 21.5 usec per loop
-> PyVarObject patch: 20.2 usec per loop

With a medium-sized string:
./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
-> Unpatched py3k: 458 usec per loop
-> Freelist patch: 408 usec per loop
-> PyVarObject patch: 316 usec per loop

With a long string:
./python -m timeit -s "s=open('Misc/HISTORY', 'r').read()" "s.split()"
-> Unpatched py3k: 31.3 msec per loop
-> Freelist patch: 32.7 msec per loop
-> PyVarObject patch: 17.8 msec per loop

(the numbers are better than in my previous posts because the
split-accelerating patch has been integrated)

Also, given those results, it is also clear that neither pybench nor
stringbench really stress memory efficiency of strings, only processing
algorithms.

That said, the freelist patch is attached.
msg62332 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-02-12 19:58
Here is an updated patch against the current py3k branch, and with
spaces instead of tabs for indentation.
msg62467 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-02-16 20:32
Here is an updated patch, to comply with the introduction of the
PyUnicode_ClearFreeList() function.
msg64108 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2008-03-19 21:40
Marc-Andre: Wit the udpated patches, is this a set of patches we can accept?
msg64112 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-03-19 22:00
Thanks for your interest Sean :)
By the way, on python-3000 both GvR and Martin von Löwis were ok on the
proposed design change, although they did not review the patch itself.
http://mail.python.org/pipermail/python-3000/2008-February/012076.html
msg64161 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-03-20 10:03
Antoine, as I've already mentioned in my other comments, I'm -1 on
changing the Unicode object to a variable size object.

I also don't think that the micro-benchmarks you are applying really do
test the implementation in a real-life situations. The only case where
your patch appears significantly faster is the "long string" case. If
you look at the distribution of the Unicode strings generated by this
case, you'll find that most strings have less than 10-20 characters.
Optimizing pymalloc for these cases and tuning the parameters in the
Unicode implementation will likely give you the same performance
increase without having to sacrifice the advantages of using a pointer
in the object rather than a inlining the data.

I'm +1 on the free list changes, though, in the long run, I think that
putting more efforts into improving pymalloc and removing the free lists
altogether would be better.

BTW: Unicode slices would be a possible and fairly attractive target for
a C level subclass of Unicode objects. The pointer in the Unicode object
implementation could then point to the original Unicode object's buffer
and the subclass would add an extra pointer to the original object
itself (in order to keep it alive). The Unicode type (written by Fredrik
Lundh) I used as basis for the Unicode implementation worked with this idea.
msg64163 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-03-20 10:30
Marc-Andre: don't all your objections also apply to the 8bit string
type, which is already a variable-size structure?
Is extending unicode more common than extending str?

With python 3.0, all strings are unicode. Shouldn't this type be
optimized for the same use cases of the previous str type?
msg64165 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-03-20 11:00
Yes, all those objections apply to the string type as well. The fact
that strings are variable length objects makes it impossible to do apply
any of the possible optimizations I mentioned. If strings were a fixed
length object, it would have been possible to write a true basestring
object from which both 8-bit strings and Unicode subclass, making a lot
of things easier.

BTW: Please also see ticket #2321 to see how the change affects your
benchmarks.
msg64166 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-03-20 11:32
Hi,

Marc-André, I'm all for "real-life" benchmarks if someone proposes some.
Until that we have to live with micro-benchmarks, which seems to be the
method used for other CPython optimizations by the way.

You are talking about slicing optimizations but you forget that the
"lazy slices" idea has been shot down by Guido and others when proposed
by Larry Hastings (with a patch) some time ago. Also the "lazy slices"
patch was for 8-bit strings, which *are* variable-size objects, which
seems to counter your argument that variable-size Unicode objects would
prevent making such optimizations.

As I said the freelist changes actually have mixed consequences, and in
general don't improve much (and that improvement is just backed by
micro-benchmarks after all... why would it be more convincing than the
far greater improvement brought by making Unicode objects variable-size
objects?).

Why wouldn't you express your arguments in the python-3000 thread I
launched one month ago, so that at least there is a clear picture of the
different arguments for and against this approach?
msg64167 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-03-20 12:53
Regarding benchmarks: It's difficult to come up with decent benchmarks
for things like this. A possible strategy is to use an instrumented
interpreter that records which Unicode objects are created and when they
are deleted. If you then run this instrumented interpreter against a few
larger applications such as Zope for a while, you'll get a data set that
can be used as input for a more real-life like benchmark. I've done this
in the past for Python byte codes to see which are used how often -
based on such data you can create optimizations that have a real-life
effect.

Regarding the lazy slice patches: those were not using subclassing, they
were patches to the existing type implementations. With subclassing you
don't affect all uses of an object, but instead let the user control
which uses should take advantage of the slice operations. Since the user
also controls the objects which are kept alive this way, the main
argument against Unicode slice objects goes away. This is different than
patching the type itself and doesn't change the main object
implementation in any way. Furthermore, it's possible to implement and
provide such subclasses outside the Python core, which gives developers
a lot more freedom.

Regarding discussions on the py3k list: I'm not on that list, since I
already get more email than I can manage. I am on the python-dev list,
if you want to take up the discussions there.
msg64168 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-03-20 13:57
Well I'm not subscribed to the python-3k list either - too much traffic
indeed. You can read and post into it with gmane for example:
http://thread.gmane.org/gmane.comp.python.python-3000.devel/11768
(there is probably an NNTP gateway too)

As for instrumenting the interpreter, this would tell us when and which
strings are allocated, but not the precise effect it has on a modern
CPU's memory subsystem (which is quite a complicated thing). Also, this
is Py3k - we can't test any real-life applications until they are ported
to it...
(someone really motivated would have to learn Intel VTune or AMD
CodeAnalyst and run such a "py3k real-life application" with it :-))

As for the explicit slicing approach, "explicit string views" have been
discussed in 2006 on the python-3k list, including a proof-of-concept
patch by Josiah Carlson - without any definitive pronouncement it seems.
See the subthread here:
http://mail.python.org/pipermail/python-3000/2006-August/003280.html

The reason I'm bringing in those previous discussions is that, in
theory, I'm all for much faster concatenation and slicing thanks to
buffer sharing etc., but realistically the previous attempts have failed
for various reasons (either technical or political). And since those
optimizations are far from simple to implement and maintain, chances are
they will not be attempted again, let alone accepted. I'd be happy to be
proven wrong :)

regards

Antoine.
msg64169 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-03-20 14:34
I've read the comments from Guido and Martin, but they don't convince me
in changing my -1.

As you say: it's difficult to get support for optimizations such a
slicing and concatenation into the core. And that's exactly why I want
to keep the door open for developers to add such support via extension
modules. With the 8-bit string implementation this was never possible. 

With the Unicode implementation and the subclassing support for builtin
types, it is possible to add support without having to go through all
the hoops and discussions on python-dev or py3k lists.

I'm also for making Python faster, but not if it limits future
optimizations and produces dead-ends. The fact that such optimizations
haven't been implemented yet doesn't really mean anything either -
people are not yet using Unicode intensively enough to let the need arise.
msg64170 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-03-20 14:50
Well, I'm not gonna try to defend my patch eternally :)
I understand your opinion even if I find a bit disturbing that we refuse
a concrete, actual optimization on the basis of future hypothetical ones.

Since all the arguments have been laid down, I'll let other developers
decide whether further discussion of this proposal is useful or not.
msg64194 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-03-20 19:10
Regarding the benchmark: You can instrument a 2.x version of the
interpreter to build the data set and then have the test load this data
set in Py3k and have it replay the allocation/deallocation in the same
way it was done on the 2.x system.

I also expect that patch #2321 will have an effect on the performance
since that patch changed the way memory allocation of the buffer was
done (from using the system malloc() to using pymalloc's allocator for
the buffer as well).
msg64197 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-03-20 19:36
You are right, #2321 made the numbers a bit tighter:

With a small string:
./python -m timeit -s "s=open('INTBENCH', 'r').read()" "s.split()"
-> Unpatched py3k: 23.1 usec per loop
-> Freelist patch: 21.3 usec per loop
-> PyVarObject patch: 20.5 usec per loop

With a medium-sized string:
./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
-> Unpatched py3k: 406 usec per loop
-> Freelist patch: 353 usec per loop
-> PyVarObject patch: 314 usec per loop

With a long string:
./python -m timeit -s "s=open('Misc/HISTORY', 'r').read()" "s.split()"
-> Unpatched py3k: 22.7 msec per loop
-> Freelist patch: 24 msec per loop
-> PyVarObject patch: 20.6 msec per loop

stringbench3k:
-> Unpatched py3k: 266 seconds
-> Freelist patch: 264 seconds
-> PyVarObject patch: 249 seconds

Regarding your benchmarking suggestion, this would certainly be an
interesting thing to do, but I fear it is also much more than I'm
willing to do...

I'm going to post the updated patches.
msg64215 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-03-20 22:13
Thanks for running the tests again. The use of pymalloc for the buffer 
made a significant difference indeed. I expect that more can be had by
additionally tweaking KEEPALIVE_SIZE_LIMIT.

It is interesting to see that the free list patch only appears to
provide better timings for the "smaller" strings tests. I couldn't find
the INTBENCH file anywhere in the Python source tree, but here's the
distribution of the words of the latter two files:

Dev-Python/LICENSE:
------------------------------------------------------------------------
Length: [Count] 0----1----2----3----4----5----6----7----8----9----10 * 10%
     1: [   28] ===
     2: [  350] =================================================
     3: [  354] ==================================================
     4: [  190] ==========================
     5: [  191] ==========================
     6: [  158] ======================
     7: [  164] =======================
     8: [  132] ==================
     9: [  127] =================
    10: [  102] ==============
    11: [   39] =====
    12: [   34] ====
    13: [   21] ==
    14: [   10] =
    15: [   10] =
    16: [    0]
    17: [    0]
    18: [    1]
    19: [    0]
    20: [    0]
    21: [    1]
    22: [    0]
    23: [    0]
    24: [    0]
    25: [    1]
    26: [    1]
    27: [    1]
    28: [    0]
    29: [    1]
    30: [    0]
    31: [    0]
    32: [    0]
    33: [    0]
    34: [    0]
    35: [    0]
    36: [    2]
    37: [    0]
    38: [    0]
    39: [    1]
    40: [    0]
    41: [    0]
    42: [    0]
    43: [    1]
    44: [    1]
    45: [    0]
    46: [    0]
    47: [    0]
    48: [    0]
    49: [    0]
    50: [    1]
    51: [    0]
    52: [    0]
    53: [    0]
    54: [    0]
    55: [    0]
    56: [    0]
    57: [    0]
    58: [    0]
    59: [    0]
    60: [    0]
    61: [    0]
    62: [    0]
    63: [    1]

Dev-Python/Misc/HISTORY:
------------------------------------------------------------------------
Length: [Count] 0----1----2----3----4----5----6----7----8----9----10 * 10%
     1: [ 6853] ==================
     2: [13920] =====================================
     3: [18401] ==================================================
     4: [12626] ==================================
     5: [ 9545] =========================
     6: [ 9348] =========================
     7: [ 9625] ==========================
     8: [ 7351] ===================
     9: [ 5353] ==============
    10: [ 3266] ========
    11: [ 1947] =====
    12: [ 1336] ===
    13: [  983] ==
    14: [  638] =
    15: [  408] =
    16: [  288]
    17: [  286]
    18: [  216]
    19: [  176]
    20: [  147]
    21: [  120]
    22: [  116]
    23: [   85]
    24: [   89]
    25: [   70]
    26: [   44]
    27: [   59]
    28: [   39]
    29: [   32]
    30: [   65]
    31: [   23]
    32: [   26]
    33: [   28]
    34: [   19]
    35: [   18]
    36: [    9]
    37: [   18]
    38: [    5]
    39: [   10]
    40: [    9]
    41: [    9]
    42: [    1]
    43: [    1]
    44: [    8]
    45: [    4]
    46: [    5]
    47: [    5]
    48: [    3]
    49: [    3]
    50: [    4]
    51: [    0]
    52: [    0]
    53: [    2]
    54: [    0]
    55: [    0]
    56: [    4]
    57: [    2]
    58: [    4]
    59: [    1]
    60: [    0]
    61: [    1]
    62: [    0]
    63: [    1]
    64: [    1]
    65: [    9]
    66: [    1]
    67: [    1]
    68: [    0]
    69: [    0]
    70: [   16]
    71: [    1]
    72: [    0]
    73: [    1]

Compare that to a typical Python module source file...

Dev-Python/Lib/urllib.py:
------------------------------------------------------------------------
Length: [Count] 0----1----2----3----4----5----6----7----8----9----10 * 10%
     1: [  806] ==================================================
     2: [  672] =========================================
     3: [  675] =========================================
     4: [  570] ===================================
     5: [  531] ================================
     6: [  501] ===============================
     7: [  296] ==================
     8: [  393] ========================
     9: [  246] ===============
    10: [  147] =========
    11: [  150] =========
    12: [  102] ======
    13: [   90] =====
    14: [  116] =======
    15: [   61] ===
    16: [   51] ===
    17: [   45] ==
    18: [   38] ==
    19: [   31] =
    20: [   39] ==
    21: [   24] =
    22: [   18] =
    23: [   18] =
    24: [   23] =
    25: [   17] =
    26: [   15]
    27: [   13]
    28: [   14]
    29: [   11]
    30: [    9]
    31: [    7]
    32: [    1]
    33: [    6]
    34: [   10]
    35: [    2]
    36: [    4]
    37: [    3]
    38: [    6]
    39: [    1]
    40: [    0]
    41: [    1]
    42: [    5]
    43: [    0]
    44: [    0]
    45: [    0]
    46: [    0]
    47: [    0]
    48: [    2]
    49: [    0]
    50: [    1]
    51: [    1]
    52: [    2]

The distributions differ a lot, but they both show that typical strings
(both in English text and Python program code) have a length of < 25
characters.

Setting KEEPALIVE_SIZE_LIMIT to 32 should cover most of those cases
while still keeping the memory load within bounds. Raising the free list
limit to e.g. 2048 would cause at most 64kB to be used by the free list
- which is well within bounds of any modern CPU L2 cache.
msg64216 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-03-20 22:19
Well, of course most words in most languages are below 20 characters.
Hence most strings containing words are also below 20 chars. But strings
can also contain whole lines (e.g. decoding of various Internet
protocols), which are statistically below 80 chars. I took that into
account in the freelists patch (it keeps lots of <15 char strings in
memory, and a few <90 char strings), which as you can see still yields
rather little benefits :)
msg64320 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-03-22 12:56
I wasn't clear enough: my point was that your free list patch would
probably benefit from some tuning of the cut-off parameters. 15
characters appears to be too small (see the HISTORY file histogram). 

You'll likely get better results for 32 and 256 as cut-off values and by
increasing the max list sizes to higher values, e.g. 1024 for the 32
character slots and 146 for the 256 character slots (see the counts in
the histograms).
msg87912 - (view) Author: Daniel Diniz (ajaksu2) Date: 2009-05-16 19:37
Collin,
Can you test this patch with Unladen Swallow's benchmarks?
msg87917 - (view) Author: Collin Winter (collinwinter) * (Python committer) Date: 2009-05-16 19:47
Daniel, which patch? freelists2.patch or unialloc4.patch? If these are
targeted py3k (judging by the "Versions" selector above), none of
Unladen Swallow's benchmarks work under 3k (we're focusing on 2.x).
msg87918 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-05-16 19:53
> Daniel, which patch? freelists2.patch or unialloc4.patch? If these are
> targeted py3k (judging by the "Versions" selector above), none of
> Unladen Swallow's benchmarks work under 3k (we're focusing on 2.x).

They target py3k indeed. Also, they need updating (at least the
uniallocX patch which is the interesting part here).
msg88254 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-05-23 23:09
Updated patch against py3k. On a 64-bit system, each unicode object
takes 14 bytes less than without the patch (using sys.getsizeof()).
Two to four more bytes could be gained by folding the `state` member in
the two lower bits of `defenc`, but I'm not sure it's worth the trouble.
msg88287 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-05-24 21:30
Antoine, I think we have to make a decision here: I'm still -1 on
changing PyUnicodeObject to be a PyVarObject, but do like your
experiments with the free lists.

I also still believe that tuning the existing parameters in the Unicode
implementation and pymalloc would give a better performance gain than
what your patch achieves. 

Setting KEEPALIVE_SIZE_LIMIT to 32 would be a first start in that direction.

However, if you insist on changing the PyUnicodeObject structure, I'll
have to reject the patch.
msg88288 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-05-24 21:36
As I already showed, the freelist experiments bring very little improvement.
msg88290 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-05-24 21:48
Ok, then closing the patch as rejected.
msg88291 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-05-24 21:51
Marc-André, please don't close the issue while you're the only one
opposing it, thanks.
msg88307 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-05-25 08:17
Antoine, I have explained the reasons for rejecting the patch. In short,
it violates a design principle behind the Unicode implementation.

If you want to change such a basic aspect of the Unicode implementation,
then write a PEP which demonstrates the usefulness on a larger set of
more general tests and comes up with significant results (10% speedup in
some micro benchmarks is not significant; memory tests need to be run
without pymalloc and require extra care to work around OS malloc
optimization strategies).

Like I said: The current design of the Unicode object implementation
would benefit more from advances in pymalloc tuning, not from making it
next to impossible to extend the Unicode objects to e.g. 

 * reuse existing memory blocks for allocation, 
 * pointing straight into memory mapped files, 
 * providing highly efficient ways to tokenize Unicode data,
 * sharing of data between Unicode objects,
 etc.

The reason I chose this design was to make the above easily
implementable and it was a conscious decision to use a PyObject
rather than a PyVarObject, like the string object, since I knew 
that the Unicode object was eventually going to replace the string
object.
msg88309 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-05-25 08:45
Looking at the comments, it seems that the performance gain comes from
the removal of the double allocation which is needed by the current design.

Was the following implementation considered:
- keep the current PyUnicodeObject structure
- for small strings, allocate one chunk of memory:
sizeof(PyUnicodeObject)+2*length. Then set self->str=(Py_UNICODE*)(self+1);
- for large strings, self->str may be allocated separately.
- unicode_dealloc() must be careful and not free self->str if it is
contiguous to the object (it's probably a good idea to reuse the
self->state field for this purpose).
msg88310 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-05-25 09:21
Amaury Forgeot d'Arc wrote:
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
> 
> Looking at the comments, it seems that the performance gain comes from
> the removal of the double allocation which is needed by the current design.
> 
> Was the following implementation considered:
> - keep the current PyUnicodeObject structure
> - for small strings, allocate one chunk of memory:
> sizeof(PyUnicodeObject)+2*length. Then set self->str=(Py_UNICODE*)(self+1);
> - for large strings, self->str may be allocated separately.
> - unicode_dealloc() must be careful and not free self->str if it is
> contiguous to the object (it's probably a good idea to reuse the
> self->state field for this purpose).

AFAIK, this was not yet been investigated.

Note that in real life applications, you hardly ever have to
call malloc on small strings - these are managed by pymalloc as
pieces of larger chunks and allocation/deallocation is generally
fast. You have the same situation for PyUnicodeObject itself
(which, as noted earlier, could be optimized in pymalloc even further,
since the size of PyUnicodeObject is fixed).

The OS malloc() is only called for longer strings and then only
for the string buffer itself - the PyUnicodeObject is again completly
managed by pymalloc, even in this case.
msg88312 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-05-25 09:24
Marc-André, the problem is that all your arguments are fallacious at
best. Let me see:

> Like I said: The current design of the Unicode object implementation
> would benefit more from advances in pymalloc tuning, not from making it
> next to impossible to extend the Unicode objects to e.g. [...]

Saying that is like saying "we shouldn't try to improve ceval.c because
it makes it harder to write a JIT". You are dismissing concrete actual
improvements in favour of pie-in-the-sky improvements that nobody has
seemed to try (you're welcome to prove me wrong) in 10 years of
existence of the unicode type.

Besides, if someone wants to experiment with such improvements, it is
not difficult to switch back to the old representation (my patch is very
short if you discard the mechanic replacement of "self->length" with
"PyUnicode_GET_SIZE(self)", which doesn't have to be undone to switch
representations). So, I fail to see the relevance of that argument.

> Antoine, I have explained the reasons for rejecting the patch. In short,
> it violates a design principle behind the Unicode implementation.

You seem to be the only one thinking this while, AFAIK, you haven't been
the only one to work on that datatype.

> (10% speedup in
> some micro benchmarks is not significant; memory tests need to be run
> without pymalloc and require extra care to work around OS malloc
> optimization strategies).

Actually, running performance or resource consumption tests without
pymalloc is pointless since it makes the test completely artificial and
unrelated to real-world conditions (who runs Python without pymalloc in
real-world conditions?).

>  * reuse existing memory blocks for allocation, 
>  * pointing straight into memory mapped files, 
>  * providing highly efficient ways to tokenize Unicode data,
>  * sharing of data between Unicode objects,
>  etc.

By the way, I haven't seen your patches or experiments for those. Giving
guidance is nice, but proofs of concept, at the minimum, are more
convincing. None of the suggestions above strike me as very /easy/
(actually, they are at least an order of magnitude harder than the
present patch), or even guaranteed to give any tangible benefits.

To be clear, I don't think this proposal is more important than any
other one giving similar results (provided these exist). But your
arguments are never factual and, what's more, while I already did the
same replies as I did here in other messages, you never bothered to be
more factual. I would accept your refusal if your arguments had some
semblance of concrete support for them.
msg88313 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-05-25 09:32
> The OS malloc() is only called...
I know this. But pymalloc has its own overhead, and cache locality will
certainly be better if string data is close to the string length.

The goal is to improve the current usage of strings, and not rely on
hypothetical enhancements to the generic pymalloc.
msg88584 - (view) Author: Jim Jewett (jimjjewett) Date: 2009-05-30 22:41
There were a number of patches to support sharing of data between 
unicode objects.  (By Larry Hastings?)  They were rejected because (a)  
they were complicated, and (b)  it was possible to provoke pathological 
memory retention.
msg88585 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-05-30 22:47
> There were a number of patches to support sharing of data between 
> unicode objects.  (By Larry Hastings?)  They were rejected because (a)  
> they were complicated, and (b)  it was possible to provoke pathological 
> memory retention.

Yes, it's the "lazy strings" patches by Larry Hastings (it was for str,
not unicode, though). Issues are #1590352 and #1569040 (and perhaps
others).

In any case, as I said, it is easy to switch back to the old
representation, so I don't think it is an argument to block this patch.
msg88744 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-06-02 13:46
Jim Jewett wrote:
> Jim Jewett <jimjjewett@users.sourceforge.net> added the comment:
> 
> There were a number of patches to support sharing of data between 
> unicode objects.  (By Larry Hastings?)  They were rejected because (a)  
> they were complicated, and (b)  it was possible to provoke pathological 
> memory retention.

Right, but the patches were targeting the main Unicode type implementation.

It would certainly be possible to implement these features on a Unicode
sub-type.

Note that the Unicode type implementation on which the Python type
is based did in fact use references to other objects in order to
implement sharing.

This part was removed from the base type due to the issues with
unwillingly keeping alive large reference objects. However, the
implementation can be used as basis for writing a Unicode sub-type
which does implement data sharing.

If you're looking for application space where such data sharing
types are useful, have a look at parsing engines or routines
that split larger chunks of data in multiple smaller pieces.
Shared memory is another use case where such types would enable
sharing of Unicode data between processes... but I'm repeating
myself.
msg88745 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-06-02 13:53
Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>> There were a number of patches to support sharing of data between 
>> unicode objects.  (By Larry Hastings?)  They were rejected because (a)  
>> they were complicated, and (b)  it was possible to provoke pathological 
>> memory retention.
> 
> Yes, it's the "lazy strings" patches by Larry Hastings (it was for str,
> not unicode, though). Issues are #1590352 and #1569040 (and perhaps
> others).
> 
> In any case, as I said, it is easy to switch back to the old
> representation, so I don't think it is an argument to block this patch.

That's not the case.

The patch breaks C API + binary compatibility for an essential Python
type - that's not something you can easily undo.
msg88746 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-06-02 14:12
> The patch breaks C API + binary compatibility for an essential Python
> type - that's not something you can easily undo.

I don't see how it breaks C API compatibility. No officially documented
function has changed, and the accessor macros still work. Am I missing
something?

As for binary compatibility, yes, it does break it, which isn't an
exceptional situation in the development process. We have changed other
"essential types" too -- for example, recently, the PyLong object got
30-bit digits on some systems. Why you think it is hard to undo, I don't
understand.

As for the future ABI PEP, which has not yet been accepted, it does not
mention PyUnicodeObject as part of the structures which are guaranteed
to remain binary-compatible :
http://www.python.org/dev/peps/pep-0384/#structures
msg88751 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-06-02 16:38
Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>> The patch breaks C API + binary compatibility for an essential Python
>> type - that's not something you can easily undo.
> 
> I don't see how it breaks C API compatibility. No officially documented
> function has changed, and the accessor macros still work. Am I missing
> something?

Yes: The layout and object type of the PyUnicodeObject object.

You cannot simply recompile your code and have it working. Instead,
you have to provide different sub-typing implementations depending
on whether PyUnicodeObject is a PyVarObject or PyObject, since
these are inherently different in their structure.

Please note that all type objects documented in the header files
not explicitly declared for private use only, are in fact
public APIs. You need access to those type objects in order to
be able to subclass them.

> As for binary compatibility, yes, it does break it, which isn't an
> exceptional situation in the development process. We have changed other
> "essential types" too -- for example, recently, the PyLong object got
> 30-bit digits on some systems. Why you think it is hard to undo, I don't
> understand.

That's a different kind of change. Even though it's very hard to
sub-type longs due to their PyVarObject nature and the fact that
longs even dig into the PyObject_VAR_HEAD, you can still recompile
your code and it will continue to work. The change was to a typedef -
the name of the typedef itself has not changed.

This is similar to compiling Python as UCS2 or UCS4 version - Py_UNICODE
will stay the same typedef, but on a UCS2 system it maps to 16 bits,
whereas on a UCS4 system it is set to 32 bits.

Note that the Unicode implementation takes great care not to hide
this binary incompatibility - by remapping all APIs to include the
UCS2/UCS4 hint in the exported name. As an side: The long implementation
does not.

> As for the future ABI PEP, which has not yet been accepted, it does not
> mention PyUnicodeObject as part of the structures which are guaranteed
> to remain binary-compatible :
> http://www.python.org/dev/peps/pep-0384/#structures

That's fine, but please note that the ABI PEP only addresses
applications that wish to benefit from the binary compatibility
across Python versions.

It has no implications on applications that don't want to use
the ABI or cannot, since they are too low-level, such as extensions
wishing to sub-class built-in types.
msg88753 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-06-02 17:17
> You cannot simply recompile your code and have it working.

Who is "you"?
People doing mundane things with PyUnicodeObjects certainly can,
assuming they use the macros for any member access.

> Please note that all type objects documented in the header files
> not explicitly declared for private use only, are in fact
> public APIs.

If the datatype layout is not publicly documented in the API reference,
it certainly seems to be a non-public part of the API. That's why we
have macros for member access, instead of letting people access members
directly.

The fact that my patch doesn't touch any part of the C sources except
for the unicode implementation itself seems to support this view as
well: people have been using the macros because they understand the
actual layout shouldn't be relied upon.

> You need access to those type objects in order to
> be able to subclass them.

As is needed for every other core object whose layout is nevertheless
changed now and then... I think it should be expected that any code
relying on low-level implementation specifics can break now and then.
Changing low-level implementation specifics is often a prerequisite for
improving things and it would be foolish to make a promise that we
guarantee 100% compatibility at that level.

(we could of course strengthen the rules for unicode if it was
demonstrated that there are several popular instances of subclassing
unicode in a C extension. However, I haven't seen any such examples)

> Note that the Unicode implementation takes great care not to hide
> this binary incompatibility - by remapping all APIs to include the
> UCS2/UCS4 hint in the exported name.

That's because there are UCS2 and UCS4 builds *of the same interpreter
version*, and people are not necessarily aware of there being a
difference. Such variability is not what we are talking about here.
msg88767 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-06-02 22:06
Here's an example implementation of a Unicode sub-type that allows
referencing other Unicode objects:

http://downloads.egenix.com/python/unicoderef-0.0.1.tar.gz

As you can see, it's pretty straight-forward to write and I want to keep
it that way.
msg88768 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-06-02 22:25
Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>> You cannot simply recompile your code and have it working.
> 
> Who is "you"?
> People doing mundane things with PyUnicodeObjects certainly can,
> assuming they use the macros for any member access.

As soon as they want to do C-level sub-typing, a change from
PyObject to PyVarObject will break their code in non-trivial ways.

>> Please note that all type objects documented in the header files
>> not explicitly declared for private use only, are in fact
>> public APIs.
> 
> If the datatype layout is not publicly documented in the API reference,
> it certainly seems to be a non-public part of the API. That's why we
> have macros for member access, instead of letting people access members
> directly.

Header files *are* the API reference.

There are many instances where they include things that are only
meant to be used internally by the interpreter, but these are
carefully documented in the header files.

>> You need access to those type objects in order to
>> be able to subclass them.
> 
> As is needed for every other core object whose layout is nevertheless
> changed now and then... I think it should be expected that any code
> relying on low-level implementation specifics can break now and then.
> Changing low-level implementation specifics is often a prerequisite for
> improving things and it would be foolish to make a promise that we
> guarantee 100% compatibility at that level.

It would be foolish to break such compatibility for the sake of some
really minor performance win.

Python's main focus is flexibility, not speed. Your proposed change
makes it a lot harder to tap into the available flexibility, since
sub-typing of PyVarObjects is non-trivial.

> (we could of course strengthen the rules for unicode if it was
> demonstrated that there are several popular instances of subclassing
> unicode in a C extension. However, I haven't seen any such examples)

Well, since you don't appear to count the many attempts to get
slicing-by-reference into the base type as proof that such ideas
do have use-cases, I've posted an example implementation which
provides such a sub-type.

It's easy to extend to all the use cases I've mentioned so far.
msg88774 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-06-02 23:09
Mark, I'm inclined to agree that this would be a destabilizing change.

Guido, do you care to pronounce on whether it is okay to change the struct?
msg88775 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2009-06-02 23:25
If this is not yet in 3.1, it's clearly too late to add it (now that RC1
was already released).  If was in already (hard to tell from the long
bug), I think it should be kept in (removing it would destabilize more
than keeping it).
msg88777 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-06-02 23:32
Correction:  It is a proposal for 3.2 that changes the struct used in
3.0 and 3.1.
msg88779 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2009-06-02 23:51
That's unfortunate; it would clearly have been easier to change this in 3.1.

That said, I'm not sure anyone *should* be subclassing PyUnicode.  Maybe
Marc-Andre can explain why he is doing this (or point to the message in
this thread where he explained this before)?  If it's a viable use case,
it should be possible to have some symbol or a few symbols whose
presence can be tested in the preprocessor by code that needs to
subclass; we should design the patch with that in mind and Marc-Andre
could help testing it.

All this is assuming the speed-up is important enough to bother.  Has
anyone run a comparison benchmark using the Unladen Swallow benchmarks?
 I trust those much more than micro-benchmarks (including, I assume,
stringbench.py).  I do expect that reducing the number of allocations
for short-to-medium-size strings from 2 to 1 would be a significant
speed-up, but I can't guess how much.
msg88801 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-06-03 09:36
> That's unfortunate; it would clearly have been easier to change this in 3.1.
> 
> That said, I'm not sure anyone *should* be subclassing PyUnicode. Maybe
> Marc-Andre can explain why he is doing this (or point to the message in
> this thread where he explained this before)? 

The Unicode type was designed to be a basic useful type with as
much functionality as needed, but no more. Since it was clear that
we would get sub-typing at the time, the aim was to move special
use cases into such sub-types.

One such example was the referencing logic used in Fredrik's
implementation and often requested by the Python community.

I removed that logic from the implementation due to the issues this
would have caused by accidentally keeping alive large referenced
Unicode objects due to references pointing to them.

However, it's easy to add it back by sub-typing the Unicode type:

http://downloads.egenix.com/python/unicoderef-0.0.1.tar.gz

Other special use cases:

* sub-types which hold a reference to the original bytes string, e.g.
to implement a round-trip safe storage even of broken text data or
text that uses multiple encodings

* sub-types that get their data from a memory mapped file or a
shared memory area without copying

* sub-types that implement indexing based on glyphs (ie. the human
recognizable notion of a character) rather than code points

* sub-types that implement special extra methods or provide case
insensitive operations

* sub-types that implement special text data forms, such as URLs,
OS paths, UID strings, etc. and custom operations on those

Sub-typing is also encouraged by the documentation:

http://docs.python.org/library/userdict.html#module-UserString

... and after all: one of the main points in making all built-in
types sub-typeable was to actually do it :-)

> If it's a viable use case,
> it should be possible to have some symbol or a few symbols whose
> presence can be tested in the preprocessor by code that needs to
> subclass; we should design the patch with that in mind and Marc-Andre
> could help testing it.
> 
> All this is assuming the speed-up is important enough to bother.  Has
> anyone run a comparison benchmark using the Unladen Swallow benchmarks?
>
>  I trust those much more than micro-benchmarks (including, I assume,
> stringbench.py).  I do expect that reducing the number of allocations
> for short-to-medium-size strings from 2 to 1 would be a significant
> speed-up, but I can't guess how much.

While the Unladen Swallow aims at providing high-level benchmarks,
it's current state doesn't really implement that promise (yet).

If you look at the list of benchmarks they use, most appear to be
dealing with pickling. That doesn't strike me as particularly useful
for testing real life Python usage.

If a high level benchmark is indeed what's wanted, then they should
setup pre-configured Django and Zope instances and run those through
a series of real-life usage scenarios to cover the web application
use space. For scientific use cases, it would be good to have similar
setups using BioPython, NumPy and matplotlib. And so on. Much like
the high level benchmarks you have in the Windows world.

Depending on the use case, the results of benchmarks for this
particular change are difficult to predict or interpret.

Here's a summary message with my reasoning for rejecting the patch:

http://bugs.python.org/issue1943#msg88307

Instead of changing PyUnicodeObject from a PyObject to a PyVarObject,
making sub-typing a lot harder, I'd much rather apply a single change
for 3.1: raising the KEEPALIVE_SIZE_LIMIT to 32 as explained and
motivated here:

http://bugs.python.org/issue1943#msg64215

That's a simple non-disruptive change which makes a lot of sense
due to the advances in CPU designs in the last 9 years. I determined
the original value of 9 using benchmarks and similar statistics in
1999/2000.

It's probably also a good time to remove the warning, now that the
implementation has proven itself for so many years...

/* Limit for the Unicode object free list stay alive optimization.

   The implementation will keep allocated Unicode memory intact for
   all objects on the free list having a size less than this
   limit. This reduces malloc() overhead for small Unicode objects.

   At worst this will result in PyUnicode_MAXFREELIST *
   (sizeof(PyUnicodeObject) + KEEPALIVE_SIZE_LIMIT +
   malloc()-overhead) bytes of unused garbage.

   Setting the limit to 0 effectively turns the feature off.

   Note: This is an experimental feature ! If you get core dumps when
   using Unicode objects, turn this feature off.

*/

#define KEEPALIVE_SIZE_LIMIT       9
msg88802 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-06-03 10:15
> Instead of changing PyUnicodeObject from a PyObject to a PyVarObject,
> making sub-typing a lot harder, I'd much rather apply a single change
> for 3.1: raising the KEEPALIVE_SIZE_LIMIT to 32 as explained and
> motivated here:

You make it sound like an alternative to this patch, while it is quite
independent. You could open a separate entry about it, together with
benchmarks results etc.
msg88804 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-06-03 11:16
Antoine Pitrou wrote:
>> Instead of changing PyUnicodeObject from a PyObject to a PyVarObject,
>> making sub-typing a lot harder, I'd much rather apply a single change
>> for 3.1: raising the KEEPALIVE_SIZE_LIMIT to 32 as explained and
>> motivated here:
> 
> You make it sound like an alternative to this patch, while it is quite
> independent. You could open a separate entry about it, together with
> benchmarks results etc.

You're right, I could open a separate ticket for it, but it does fit
the title of the ticket and was also discussed at length on this
ticket, since part of your original patch included changes to the
free list management in the Unicode implementation.
msg88806 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-06-03 11:32
Let's apply simple and noncontroversial patches first, and then see if
the bigger changes are still worth it.
Please open a new ticket.
msg88816 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2009-06-03 17:06
Hm, so the extra pointer is a feature.  I guess a compromise would be to
keep the extra indirection but make it point into the same object in the
base class.  Thinking about how memory caching in modern CPUs work, this
would probably be quite fast but it would still cost 8 bytes on most
future (i.e., 64-bit) architectures.

Still, I expect that a vanishingly small number of users will actually
use that feature.  Is it worth to make everyone pay for that
flexibility, for what must be the first- or second-most commonly used
type in Python 3.x (the other being int), which is still significantly
slower than the common (8-bit) string type in 2.x?
msg88819 - (view) Author: Collin Winter (collinwinter) * (Python committer) Date: 2009-06-03 17:35
On Wed, Jun 3, 2009 at 2:36 AM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
>> All this is assuming the speed-up is important enough to bother.  Has
>> anyone run a comparison benchmark using the Unladen Swallow benchmarks?
>>
>>  I trust those much more than micro-benchmarks (including, I assume,
>> stringbench.py).  I do expect that reducing the number of allocations
>> for short-to-medium-size strings from 2 to 1 would be a significant
>> speed-up, but I can't guess how much.
>
> While the Unladen Swallow aims at providing high-level benchmarks,
> it's current state doesn't really implement that promise (yet).
>
> If you look at the list of benchmarks they use, most appear to be
> dealing with pickling. That doesn't strike me as particularly useful
> for testing real life Python usage.

I would take issue with your characterization of those benchmarks.
There are several benchmarks for cPickle, true, both macro and micro
benchmarks, but I would not describe their number as "most of [our]
benchmarks". For example, "slowpickle" and "slowunpickle" both use the
pure-Python pickle.py, and are testing how close we can get that
implementation to the tuned cPickle version.

Regardless, I don't know that any of our benchmarks really stress
unicode performance. We so far haven't cared about improving the
performance of unicode objects. 2to3 uses unicode internally, so that
might be a good benchmark to run.

> If a high level benchmark is indeed what's wanted, then they should
> setup pre-configured Django and Zope instances and run those through
> a series of real-life usage scenarios to cover the web application
> use space.

We have a benchmark for Django and Spitfire templates, both of which
are heavily used in the web application use space. We focused on
template languages because in talking to Google's web app teams, they
found their primary CPU bottlenecks to be templating systems, not ORMs
or other components.

> For scientific use cases, it would be good to have similar
> setups using BioPython, NumPy and matplotlib. And so on. Much like
> the high level benchmarks you have in the Windows world.

We have NumPy in our correctness test suite, but no benchmarks based
on it. Looking at all the packages you just named, they make heavy use
of C/C++ extensions (with BioPython and matplotpib both depending on
NumPy) or large C libraries (matplotlib can depend on Cairo, I see).
We've been focusing on pure-Python performance, so I'm skeptical that
benchmarks with such large C components would be a useful guide for
our work. I'm happy to talk about this further outside of this thread,
though.

Collin
msg88824 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-06-03 20:41
> Still, I expect that a vanishingly small number of users will actually
> use that feature. 

Apart from the example Marc-André just posted (and which is a 0.0.1
proof of concept he apparently just wrote), the number of users is,
AFAICT, zero.
Unless there's some closed source extension which happens to extend
unicode as a C subtype.

Now, as for easing the subclassing of unicode in C, there are probably
several possibilities which range from devising a clever set of macros
to abusing the ob_size field for a tagged pointer. People who really
care should do a concrete proposal (and I don't know who these people
are, apart from Marc-André).
msg88830 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2009-06-03 21:31
On Wed, Jun 3, 2009 at 1:41 PM, Antoine Pitrou <report@bugs.python.org> wrote:
> Apart from the example Marc-André just posted (and which is a 0.0.1
> proof of concept he apparently just wrote), the number of users is,
> AFAICT, zero.

IIUC Marc-Andre extracted that from a larger code base (MX) which he
owns and has been maintaining for a decade or so.

> Unless there's some closed source extension which happens to extend
> unicode as a C subtype.

I believe part of MX is closed source.

> Now, as for easing the subclassing of unicode in C, there are probably
> several possibilities which range from devising a clever set of macros
> to abusing the ob_size field for a tagged pointer. People who really
> care should do a concrete proposal (and I don't know who these people
> are, apart from Marc-André).

Not really if the core code uses a macro that depends on the layout of
the object (i.e. the data immediately following the header, like old
8-bit strings), unless you change the core (or the macro) to only use
this if the type matches exactly, and for subtypes use a more
expensive API. But that would slow down unnecessarily for subclasses
written in Python (of which there are plenty).

But I would like to point out that few people if any have ever
complained about the contiguous allocation for 8-bit strings in Python
[0-2].x. And we certainly wouldn't have given in. Now that Unicode is
no longer some fancy-schmancy advanced concept but the basis for *all*
Python string processing I think we should apply the same policy.
msg88879 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-06-04 10:43
Here's a new version of the unicode reference type, extended
to run in both Python 2.6 and 3.1:

http://downloads.egenix.com/python/unicoderef-0.0.2.tar.gz

I've also included a benchmark implemented in C which measures
Unicode/Bytes allocation performance at high resolution
and without any VM overhead.

These are the results on a 64-bit system using a regular UCS2
build of Python 2.6.1 and 3.1 on a single core AMD system:

 * Unicode allocation is always much slower than Bytes

 This depends a lot on how much data you have to handle. If the
 data fits into the pymalloc managed sizes, there's little
 difference.

 * CPU cache lines / memory boundaries obviously have some
   impact on the results

 If things are properly aligned, performance is better than
 in the unaligned case. Unfortunately, there's nothing much the
 CPython implementation can do to benefit from this, since the
 hardware layouts are too diverse.

 This explains why you see some unexpected figures in the
 results, e.g. Python 3.1, 4096 words, 32chars - Unicode
 being faster than Bytes.

 * Something must have improved in pymalloc

 The 4096 word tests perform better on 3.1 than on 2.6.

 Since such improvement have a great impact, more
 emphasis should be placed on these, e.g. by having pymalloc
 provide more space to fixed size objects such PyUnicodeObject,
 reducing the need to create new arenas.

 Dictionaries and other PyObjects such as instances objects
 would also benefit from such improvements.

Note that the benchmark scales the bytes word sizes to match
the internal allocation size of the Unicode objects, ie.
2chars maps to a length 4 bytes string on a UCS2 build.

Testing Unicode/Bytes Allocation Speed with Python 3.1rc1+
------------------------------------------------------------------------

=== 1024 words ============================================================

Variant 2chars, 1024 words, Unicode: 0.005503 seconds = 0.054 us per object
(best of 50 rounds)
Variant 2chars, 1024 words, Bytes  : 0.006838 seconds = 0.067 us per object
(best of 50 rounds)

Variant 9chars, 1024 words, Unicode: 0.008956 seconds = 0.087 us per object
(best of 50 rounds)
Variant 9chars, 1024 words, Bytes  : 0.007380 seconds = 0.072 us per object
(best of 50 rounds)

Variant 16chars, 1024 words, Unicode: 0.009257 seconds = 0.090 us per object
(best of 50 rounds)
Variant 16chars, 1024 words, Bytes  : 0.007291 seconds = 0.071 us per object
(best of 50 rounds)

Variant 32chars, 1024 words, Unicode: 0.009589 seconds = 0.094 us per object
(best of 50 rounds)
Variant 32chars, 1024 words, Bytes  : 0.007803 seconds = 0.076 us per object
(best of 50 rounds)

Variant pep100, 1024 words, Unicode: 0.010569 seconds = 0.103 us per object
(best of 50 rounds)
Variant pep100, 1024 words, Bytes  : 0.008198 seconds = 0.080 us per object
(best of 50 rounds)

=== 2048 words ============================================================

Variant 2chars, 2048 words, Unicode: 0.011624 seconds = 0.057 us per object
(best of 50 rounds)
Variant 2chars, 2048 words, Bytes  : 0.013941 seconds = 0.068 us per object
(best of 50 rounds)

Variant 9chars, 2048 words, Unicode: 0.018608 seconds = 0.091 us per object
(best of 50 rounds)
Variant 9chars, 2048 words, Bytes  : 0.014773 seconds = 0.072 us per object
(best of 50 rounds)

Variant 16chars, 2048 words, Unicode: 0.018556 seconds = 0.091 us per object
(best of 50 rounds)
Variant 16chars, 2048 words, Bytes  : 0.014550 seconds = 0.071 us per object
(best of 50 rounds)

Variant 32chars, 2048 words, Unicode: 0.018972 seconds = 0.093 us per object
(best of 50 rounds)
Variant 32chars, 2048 words, Bytes  : 0.016377 seconds = 0.080 us per object
(best of 50 rounds)

Variant pep100, 2048 words, Unicode: 0.021005 seconds = 0.103 us per object
(best of 50 rounds)
Variant pep100, 2048 words, Bytes  : 0.016636 seconds = 0.081 us per object
(best of 50 rounds)

=== 4096 words ============================================================

Variant 2chars, 4096 words, Unicode: 0.022950 seconds = 0.056 us per object
(best of 50 rounds)
Variant 2chars, 4096 words, Bytes  : 0.027813 seconds = 0.068 us per object
(best of 50 rounds)

Variant 9chars, 4096 words, Unicode: 0.037229 seconds = 0.091 us per object
(best of 50 rounds)
Variant 9chars, 4096 words, Bytes  : 0.031123 seconds = 0.076 us per object
(best of 50 rounds)

Variant 16chars, 4096 words, Unicode: 0.037118 seconds = 0.091 us per object
(best of 50 rounds)
Variant 16chars, 4096 words, Bytes  : 0.036433 seconds = 0.089 us per object
(best of 50 rounds)

Variant 32chars, 4096 words, Unicode: 0.040970 seconds = 0.100 us per object
(best of 50 rounds)
Variant 32chars, 4096 words, Bytes  : 0.051422 seconds = 0.126 us per object
(best of 50 rounds)

Variant pep100, 4096 words, Unicode: 0.049630 seconds = 0.121 us per object
(best of 50 rounds)
Variant pep100, 4096 words, Bytes  : 0.034551 seconds = 0.084 us per object
(best of 50 rounds)

Testing Unicode/Bytes Allocation Speed with Python 2.6.1
------------------------------------------------------------------------

=== 1024 words ============================================================

Variant 2chars, 1024 words, Unicode: 0.005810 seconds = 0.057 us per object
(best of 50 rounds)
Variant 2chars, 1024 words, Bytes  : 0.006918 seconds = 0.068 us per object
(best of 50 rounds)

Variant 9chars, 1024 words, Unicode: 0.008539 seconds = 0.083 us per object
(best of 50 rounds)
Variant 9chars, 1024 words, Bytes  : 0.007618 seconds = 0.074 us per object
(best of 50 rounds)

Variant 16chars, 1024 words, Unicode: 0.008478 seconds = 0.083 us per object
(best of 50 rounds)
Variant 16chars, 1024 words, Bytes  : 0.008097 seconds = 0.079 us per object
(best of 50 rounds)

Variant 32chars, 1024 words, Unicode: 0.008969 seconds = 0.088 us per object
(best of 50 rounds)
Variant 32chars, 1024 words, Bytes  : 0.008649 seconds = 0.084 us per object
(best of 50 rounds)

Variant pep100, 1024 words, Unicode: 0.009779 seconds = 0.096 us per object
(best of 50 rounds)
Variant pep100, 1024 words, Bytes  : 0.008318 seconds = 0.081 us per object
(best of 50 rounds)

=== 2048 words ============================================================

Variant 2chars, 2048 words, Unicode: 0.011745 seconds = 0.057 us per object
(best of 50 rounds)
Variant 2chars, 2048 words, Bytes  : 0.013978 seconds = 0.068 us per object
(best of 50 rounds)

Variant 9chars, 2048 words, Unicode: 0.017418 seconds = 0.085 us per object
(best of 50 rounds)
Variant 9chars, 2048 words, Bytes  : 0.014443 seconds = 0.071 us per object
(best of 50 rounds)

Variant 16chars, 2048 words, Unicode: 0.018138 seconds = 0.089 us per object
(best of 50 rounds)
Variant 16chars, 2048 words, Bytes  : 0.016200 seconds = 0.079 us per object
(best of 50 rounds)

Variant 32chars, 2048 words, Unicode: 0.021348 seconds = 0.104 us per object
(best of 50 rounds)
Variant 32chars, 2048 words, Bytes  : 0.018144 seconds = 0.089 us per object
(best of 50 rounds)

Variant pep100, 2048 words, Unicode: 0.019593 seconds = 0.096 us per object
(best of 50 rounds)
Variant pep100, 2048 words, Bytes  : 0.016692 seconds = 0.082 us per object
(best of 50 rounds)

=== 4096 words ============================================================

Variant 2chars, 4096 words, Unicode: 0.024347 seconds = 0.059 us per object
(best of 50 rounds)
Variant 2chars, 4096 words, Bytes  : 0.028411 seconds = 0.069 us per object
(best of 50 rounds)

Variant 9chars, 4096 words, Unicode: 0.041480 seconds = 0.101 us per object
(best of 50 rounds)
Variant 9chars, 4096 words, Bytes  : 0.031532 seconds = 0.077 us per object
(best of 50 rounds)

Variant 16chars, 4096 words, Unicode: 0.047986 seconds = 0.117 us per object
(best of 50 rounds)
Variant 16chars, 4096 words, Bytes  : 0.038006 seconds = 0.093 us per object
(best of 50 rounds)

Variant 32chars, 4096 words, Unicode: 0.062115 seconds = 0.152 us per object
(best of 50 rounds)
Variant 32chars, 4096 words, Bytes  : 0.049687 seconds = 0.121 us per object
(best of 50 rounds)

Variant pep100, 4096 words, Unicode: 0.044331 seconds = 0.108 us per object
(best of 50 rounds)
Variant pep100, 4096 words, Bytes  : 0.035939 seconds = 0.088 us per object
(best of 50 rounds)
msg88880 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-06-04 10:54
Guido van Rossum wrote:
> Guido van Rossum <guido@python.org> added the comment:
> 
> On Wed, Jun 3, 2009 at 1:41 PM, Antoine Pitrou <report@bugs.python.org> wrote:
>> Apart from the example Marc-André just posted (and which is a 0.0.1
>> proof of concept he apparently just wrote), the number of users is,
>> AFAICT, zero.
> 
> IIUC Marc-Andre extracted that from a larger code base (MX) which he
> owns and has been maintaining for a decade or so.

Only part of it.

I wrote the sub-type Unicode Reference sub-type implementation
just a few days ago, in order to demonstrate how easy it is
provided you have a PyObject (rather than a PyVarObject) to
build on.

We should really publicize how easy it is to write such type
extensions. I'm sure that a lot of things which often generated
heated discussions (such as the slicing patches for Unicode)
could easily be solved by just adding a few such sub-types to the
core.

>> Unless there's some closed source extension which happens to extend
>> unicode as a C subtype.
> 
> I believe part of MX is closed source.

True. A large part of the code base is not available to the wider
public.

>> Now, as for easing the subclassing of unicode in C, there are probably
>> several possibilities which range from devising a clever set of macros
>> to abusing the ob_size field for a tagged pointer. People who really
>> care should do a concrete proposal (and I don't know who these people
>> are, apart from Marc-André).
> 
> Not really if the core code uses a macro that depends on the layout of
> the object (i.e. the data immediately following the header, like old
> 8-bit strings), unless you change the core (or the macro) to only use
> this if the type matches exactly, and for subtypes use a more
> expensive API. But that would slow down unnecessarily for subclasses
> written in Python (of which there are plenty).
> 
> But I would like to point out that few people if any have ever
> complained about the contiguous allocation for 8-bit strings in Python
> [0-2].x. And we certainly wouldn't have given in. Now that Unicode is
> no longer some fancy-schmancy advanced concept but the basis for *all*
> Python string processing I think we should apply the same policy.

I've spent enough time with this discussion.

If you think it's better to make sub-typing harder and thereby
closing the door for improvements which could really speed up
e.g. template processing (by not requiring copying the same data
over and over again), go for it.

I still think that it's better to keep things the way they
are and benefit from the fact that PyUnicodeObjects have a
fixed size with the variable part being dealt with separately.

Since pymalloc is being used to manage such objects, there's
a lot of room for improvements, since the allocation scheme
is under out control. E.g. we could have pymalloc allocate
larger pools for PyUnicodeObjects.

Doing the same for variable sized objects is a lot harder,
consumes more memory and likely less efficient.
msg88881 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-06-04 11:26
> Since pymalloc is being used to manage such objects, there's
> a lot of room for improvements, since the allocation scheme
> is under out control. E.g. we could have pymalloc allocate
> larger pools for PyUnicodeObjects.

I'm not sure what "larger pools for PyUnicodeObjects" means. pymalloc
doesn't have separate pools per object type, only per object size.
OTOH, we could grow the size limit under which pymalloc is used,
especially on 64-bit systems.
msg88883 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-06-04 12:13
Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>> Since pymalloc is being used to manage such objects, there's
>> a lot of room for improvements, since the allocation scheme
>> is under out control. E.g. we could have pymalloc allocate
>> larger pools for PyUnicodeObjects.
> 
> I'm not sure what "larger pools for PyUnicodeObjects" means. pymalloc
> doesn't have separate pools per object type, only per object size.

I meant larger pools for objects of sizeof(PyUnicodeObject) bytes.
The same could be done for other often used PyObjects (and only for
those).

pymalloc is a lot faster than the OS malloc() and was designed for
Python object memory management, ie. for small blocks...

"""
/* A fast, special-purpose memory allocator for small blocks, to be used
   on top of a general-purpose malloc -- heavily based on previous art. */

/* Vladimir Marangozov -- August 2000 */
"""

> OTOH, we could grow the size limit under which pymalloc is used,
> especially on 64-bit systems.

The limit is 256 bytes. Increasing it doesn't make much sense,
since the pools are 4k each and managed in arenas of
256kb.

Anything larger than 256 bytes goes straight to the OS malloc().
msg88885 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-06-04 12:18
> Anything larger than 256 bytes goes straight to the OS malloc().

Under a 64-bit system, a plain dict is more than 256 bytes.
msg88928 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-06-05 09:50
Raymond suggested the patch be committed in 3.1, so as to minimize
disruption between 3.1 and 3.2. Benjamin, what do you think?
msg88936 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-06-05 11:06
Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
> Raymond suggested the patch be committed in 3.1, so as to minimize
> disruption between 3.1 and 3.2. Benjamin, what do you think?

Has Guido pronounced on this already ?
msg88946 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2009-06-05 14:08
On Fri, Jun 5, 2009 at 4:06 AM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
>
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
>
> Antoine Pitrou wrote:
>> Antoine Pitrou <pitrou@free.fr> added the comment:
>>
>> Raymond suggested the patch be committed in 3.1, so as to minimize
>> disruption between 3.1 and 3.2. Benjamin, what do you think?
>
> Has Guido pronounced on this already ?

I don't want it added to 3.1 unless we start the beta cycle afresh.
It's too subtle for submitting after rc1. Talk to Benjamin if you
disagree.

I think it's fine to wait for 3.2. Maybe add something to the docs
about not subclassing unicode in C.
msg88951 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-06-05 15:55
Guido van Rossum wrote:
> I think it's fine to wait for 3.2. Maybe add something to the docs
> about not subclassing unicode in C.

We should have a wider discussion about this on python-dev.

I'll publish the unicoderef extension and then we can see
whether users want this or not.

Antoine's patch makes such extensions impossible (provided you
don't want to copy over the complete unicodeobject.c
implementation in order to change the memory allocation
scheme).

Note that in Python 2.x you don't have such issues because
there, most tools for text processing will happily work on
any sort of buffer, so you don't need a string sub-type
in order to implement e.g. references into another string
(the buffer type will allow you to do this easily).
msg88953 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-06-05 16:17
> Note that in Python 2.x you don't have such issues because
> there, most tools for text processing will happily work on
> any sort of buffer, so you don't need a string sub-type
> in order to implement e.g. references into another string
> (the buffer type will allow you to do this easily).

The new buffer API has a provision for type flags, although none of them
had a chance to be implemented by the original author before he ran
away...
There could be a type flag for unicode characters and then its support
could be implemented in the memoryview object.
msg88983 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2009-06-06 00:16
In the interest of possibly improving the imminent 3.1 release,
I opened #6216
Raise Unicode KEEPALIVE_SIZE_LIMIT from 9 to 32?

I wonder if it is possible to make it generically easier to subclass
PyVarObjects (but my C knowledge to getting too faded to have any ideas).
msg89069 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-06-08 07:58
Terry J. Reedy wrote:
> In the interest of possibly improving the imminent 3.1 release,
> I opened #6216
> Raise Unicode KEEPALIVE_SIZE_LIMIT from 9 to 32?

Thanks for opening that ticket.

> I wonder if it is possible to make it generically easier to subclass
> PyVarObjects (but my C knowledge to getting too faded to have any ideas).

Even if we were to add some pointer arithmetic tricks to at least
hide away the complexities, you'd no longer be able to change the
way the data memory allocation works.

The reason is simple: subclassing is about reusing existing method
implementations and only adding/changing a few of them.

If you want to change the way the allocation works, you'd have
to replace all of them.

Furthermore, using your subclasses objects with the existing APIs
would no longer be safe, since these would still assume the
original base class memory allocation scheme.

In summary:

Implementations like the unicoderef type I posted
and most of the other use cases I mentioned are no longer
possible, ie. you will *always* have to copy the data in order
to work with the existing Unicode APIs on it.

The current implementation has no problem with working on referenced
data, since support for this was built in right from the start.

That's what I meant with closing the door on future enhancements
that would make a huge difference if used right, for a mere
10% performance increase.
msg97545 - (view) Author: Adam Olsen (Rhamphoryncus) Date: 2010-01-10 19:52
Points against the subclassing argument:

* We have a null-termination invariant.  For byte strings this was part of the public API, and I'm not sure that's changed for unicode strings; aren't you arguing that we should maximize how much of our implementation is a public API?  This prevents lazy slicing.

* UTF-16 and UTF-32 are rarely used encodings, especially for longer strings (ie files).  For shorter strings (APIs) the unicode object overhead is more significant and we'd need a way to slave to the buffer's lifetime to that of the unicode object (hard to do).  For longer strings UTF-8 would be much more useful, but that's been shot down before.

* subclassing unicode so you can change the meaning of the fields (ie allocating your own buffer) is a gross hack.  It relies far too much on fine details of the implementation and is fragile (what if you miss the dummy byte needed by fastsearch?)  Most of the possible options could be, if they function correctly, applied directly to the basetype as a patch, so it's moot.

* If you dislike PyVarObject in general (I think the API is ugly too) you should argue for a general policy discouraging future use of it, not just get in the way of the one place where it's most appropriate

Terry: PyVarObjects would be much easier to subclass if the type object stored an offset to the beginning of the variable section, so it could be automatically recalculated for subclasses based on the size of the struct.  This'd mean the PyBytesObject struct would no longer end with a char ob_sval[1].  The down side is a tiny bit more math when accessing the variable section (as the offset is no longer constant).
msg97553 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-01-10 21:59
Adam Olsen wrote:
> 
> Adam Olsen <rhamph@gmail.com> added the comment:
> 
> Points against the subclassing argument:
> 
> * We have a null-termination invariant.  For byte strings this was part of the public API, and I'm not sure that's changed for unicode strings; aren't you arguing that we should maximize how much of our implementation is a public API?  This prevents lazy slicing.

Base type Unicode buffers end with a null-Py_UNICODE termination,
but this is not used anywhere, AFAIK. We could probably remove
that overallocation at some point.

There's no such thing as a null-termination invariant for Unicode.

> * subclassing unicode so you can change the meaning of the fields (ie allocating your own buffer) is a gross hack.  It relies far too much on fine details of the implementation and is fragile (what if you miss the dummy byte needed by fastsearch?)  Most of the possible options could be, if they function correctly, applied directly to the basetype as a patch, so it's moot.

Actually, Unicode objects were designed to be subclassable right
from the start and adjusting the buffer to point e.g. into some
other already allocated string was too. I removed this feature from
Fredrik's type implementation with the intent to readd it later on as
subclass.

See the prototype implementation of such a subclass uniref that I've
written to show how easy it is to add a subclass which can be used
to slice large Unicode objects without having to reallocate new
buffers all the time.

BTW, I'm not aware of any changes to the PyUnicodeObject by some
fastsearch implementation. Could you point me to this ?
msg97554 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-01-10 22:27
> Base type Unicode buffers end with a null-Py_UNICODE termination,
> but this is not used anywhere, AFAIK
On Windows, code like 
   CreateDirectoryW(PyUnicode_AS_UNICODE(po), NULL)
is very common, at least in posixmodule.c.
msg97555 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-01-10 22:36
I find that the null termination for 8-bit strings makes low-level parsing operations (e.g., parsing a numeric string) safer and easier:  for example, it makes skipping a series of digits with something like:

while (isdigit(*s)) ++s;

safe.  I'd imagine that null terminated PyUNICODE arrays would have similar benefits.
msg97559 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-01-10 22:52
> I find that the null termination for 8-bit strings makes low-level
> parsing operations (e.g., parsing a numeric string) safer and easier:

Not to mention faster. The new IO library makes use of it (for newline
detection), on both bytestrings and unicode strings.
msg97581 - (view) Author: Adam Olsen (Rhamphoryncus) Date: 2010-01-11 08:08
On Sun, Jan 10, 2010 at 14:59, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
> BTW, I'm not aware of any changes to the PyUnicodeObject by some
> fastsearch implementation. Could you point me to this ?

    /* We allocate one more byte to make sure the string is Ux0000 terminated.
       The overallocation is also used by fastsearch, which assumes that it's
       safe to look at str[length] (without making any assumptions about what
       it contains). */
msg98656 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-02-01 10:39
Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>> I find that the null termination for 8-bit strings makes low-level
>> parsing operations (e.g., parsing a numeric string) safer and easier:
> 
> Not to mention faster. The new IO library makes use of it (for newline
> detection), on both bytestrings and unicode strings.

I'd consider that a bug. Esp. the IO lib should be 8-bit clean
in the sense that it doesn't add any special meaning to NUL
characters or code points.

Besides, using a for-loop with a counter is both safer and faster
than checking each an every character for NUL.

Just think of what can happen if you have buggy code that overwrites
the NUL byte in some corner case situation and then use the assumption
of having the NUL byte as terminator - a classical buffer overrun.

If you're lucky, you get a segfault. If not, you end up with
data corruption or manipulation of data which could lead to
unwanted code execution.

The Python Unicode API deliberately tries to always use the combination
of a Py_UNICODE* pointer and a length integer to avoid such issues.
msg98657 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-02-01 11:39
> I'd consider that a bug. Esp. the IO lib should be 8-bit clean
> in the sense that it doesn't add any special meaning to NUL
> characters or code points.

It doesn't add any special meaning to them. It just relies on a NUL
being present after the end of the string. It doesn't care about other
NULs.

> Besides, using a for-loop with a counter is both safer and faster
> than checking each an every character for NUL.

It's slower, since it has one more condition to check.
Newline detection as it is written has a fast path in the form of:
     while (*c++ >= 0x20);

> Just think of what can happen if you have buggy code that overwrites
> the NUL byte in some corner case situation and then use the assumption
> of having the NUL byte as terminator - a classical buffer overrun.

Well, buggy code leads to bugs :)
msg98662 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-02-01 13:43
Again, on Windows there are many many usages of PyUnicode_AS_UNICODE() that pass the result to various Windows API functions, expecting a nul-terminated array of WCHARs. Please don't change this!
msg98668 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-02-01 16:12
Amaury Forgeot d'Arc wrote:
> 
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
> 
>> Base type Unicode buffers end with a null-Py_UNICODE termination,
>> but this is not used anywhere, AFAIK
> On Windows, code like 
>    CreateDirectoryW(PyUnicode_AS_UNICODE(po), NULL)
> is very common, at least in posixmodule.c.

The above usage is clearly wrong. PyUnicode_AS_UNICODE() should
only be used to access Py_UNICODE data directly when
working on Python Unicode objects. The macro is not meant
to be used directly in external APIs.

For such uses, the Unicode conversion APIs need to be used,
e.g. the PyUnicode_AsWideChar() API. These will then also
apply any 0-termination as necessary.

Note that Python is free to change the meaning of Py_UNICODE
(e.g. to use UCS4 on all platforms) or Unicode implementation
details (such as e.g. the 0-termination) and this would then break
any use such as the one you quoted.
msg98669 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-02-01 17:27
Then there are many places to change, in core python as well as in third-party code. And PyArg_ParseTuple("u") would not work any more.
msg98672 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-02-01 17:54
> Note that Python is free to change the meaning of Py_UNICODE
> (e.g. to use UCS4 on all platforms)

Python-UCS4 has never worked on Windows. Most developers on Windows, taking example on core python source code, implicitly assumed that HAVE_USABLE_WCHAR_T is true, and use the Py_Unicode* api the same way they use the PyString* functions.

PyString_AsString is documented to return a nul-terminated array. If PyUnicode_AsUnicode starts to behave differently, people will have more trouble to port their modules to py3k.
This is not an implementation detail.
msg98676 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-02-01 19:21
modules to py3k.
> This is not an implementation detail.

It is, otherwise I would have documented it. The fact that some
developers are not using those APIs correctly doesn't change that.

Note that PyUnicode_AsUnicode() only returns a pointer to the
Py_UNICODE buffer. It makes no guarantees on the 0-termination.
Developers need to use PyUnicode_GetSize() to access the size
of the Unicode string.

But no worries: We're not going to change it. It's too late
after 10 years in the wild.

Still, developers will have to be aware of the fact that 0-termination
is not a guaranteed Unicode feature and should stop making that
assumption and it will not necessarily hold or be guaranteed
for Unicode subclasses.
msg98677 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-02-01 19:28
Le lundi 01 février 2010 à 19:21 +0000, Marc-Andre Lemburg a écrit :
> > This is not an implementation detail.
> 
> It is, otherwise I would have documented it.

Ok, so the current allocation scheme of unicode objects is an
implementation detail as well, right?
msg98686 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-02-01 20:30
>It is, otherwise I would have documented it. The fact that some
developers are not using those APIs correctly doesn't change that.

If, as Antoine claimed, 'it' is a documented feature of str strings, and Py3 says str = Unicode, it is a plausible inference.
msg110599 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-17 19:28
@Antoine: do you wish to try and take this forward?
msg116937 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-09-20 13:58
No reply to msg110599, I'll close this in a couple of weeks unless anyone objects.
msg116950 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2010-09-20 15:05
2010/9/20 Mark Lawrence <report@bugs.python.org>:
>
> Mark Lawrence <breamoreboy@yahoo.co.uk> added the comment:
>
> No reply to msg110599, I'll close this in a couple of weeks unless anyone objects.

Please don't. This is still a valid issue.
msg118087 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-10-06 23:21
Updated patch against current py3k.
msg134406 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2011-04-25 19:34
I just found that the extension zope.i18nmessageid: http://pypi.python.org/pypi/zope.i18nmessageid subclasses unicode at the C level:
http://svn.zope.org/zope.i18nmessageid/trunk/src/zope/i18nmessageid/_zope_i18nmessageid_message.c?rev=120914&view=markup

Notably, the Message structure is defined this way:
typedef struct {
  PyUnicodeObject base;
  PyObject *domain;
  PyObject *default_;
  PyObject *mapping;
} Message;

How would such an extension type behave after the patch? Is there a workaround we can propose?
msg144625 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-09-29 20:03
The PEP 393 is based on the idea proposed in this issue (use only one memory block, not two), but also enhanced it to reduce more the memory using other technics:

 - use a different character type depending on the maximum character,
 - use a shorter structure for ASCII only strings

The PEP 393 has been accepted and merged into Python 3.3. So I consider this issue as done.
History
Date User Action Args
2011-09-29 20:03:47hayposetstatus: open -> closed

nosy: + haypo
messages: + msg144625

resolution: fixed
2011-04-25 19:34:56amaury.forgeotdarcsetmessages: + msg134406
2010-10-06 23:21:53pitrousetfiles: + unialloc6.patch

messages: + msg118087
2010-09-20 15:05:32benjamin.petersonsetstatus: pending -> open

messages: + msg116950
2010-09-20 13:58:08BreamoreBoysetstatus: open -> pending

messages: + msg116937
2010-08-21 18:25:42gvanrossumsetassignee: gvanrossum ->
2010-07-17 19:28:53BreamoreBoysetnosy: + BreamoreBoy
messages: + msg110599
2010-02-01 20:30:07terry.reedysetmessages: + msg98686
2010-02-01 19:28:56pitrousetmessages: + msg98677
2010-02-01 19:21:41lemburgsetmessages: + msg98676
2010-02-01 17:54:59amaury.forgeotdarcsetmessages: + msg98672
2010-02-01 17:27:21amaury.forgeotdarcsetmessages: + msg98669
2010-02-01 16:12:12lemburgsetmessages: + msg98668
2010-02-01 13:43:13amaury.forgeotdarcsetmessages: + msg98662
2010-02-01 11:39:53pitrousetmessages: + msg98657
2010-02-01 10:39:17lemburgsetmessages: + msg98656
2010-02-01 02:00:14ferringbsetnosy: + ferringb
2010-01-11 08:08:29Rhamphoryncussetmessages: + msg97581
2010-01-10 22:52:25pitrousetmessages: + msg97559
2010-01-10 22:36:26mark.dickinsonsetnosy: + mark.dickinson
messages: + msg97555
2010-01-10 22:27:34amaury.forgeotdarcsetmessages: + msg97554
2010-01-10 21:59:26lemburgsetmessages: + msg97553
2010-01-10 19:52:13Rhamphoryncussetnosy: + Rhamphoryncus
messages: + msg97545
2009-06-08 07:58:45lemburgsetmessages: + msg89069
2009-06-06 00:16:16terry.reedysetnosy: + terry.reedy
messages: + msg88983
2009-06-05 16:17:26pitrousetmessages: + msg88953
2009-06-05 15:55:18lemburgsetmessages: + msg88951
2009-06-05 14:08:51gvanrossumsetmessages: + msg88946
2009-06-05 11:06:15lemburgsetmessages: + msg88936
2009-06-05 09:50:22pitrousetnosy: + benjamin.peterson
messages: + msg88928
2009-06-04 12:18:12pitrousetmessages: + msg88885
2009-06-04 12:13:33lemburgsetmessages: + msg88883
2009-06-04 11:26:14pitrousetmessages: + msg88881
2009-06-04 10:54:42lemburgsetmessages: + msg88880
2009-06-04 10:43:58lemburgsetmessages: + msg88879
2009-06-03 21:31:25gvanrossumsetmessages: + msg88830
2009-06-03 20:41:35pitrousetmessages: + msg88824
2009-06-03 17:35:14collinwintersetmessages: + msg88819
2009-06-03 17:06:43gvanrossumsetmessages: + msg88816
2009-06-03 12:00:10hayposetnosy: - haypo
2009-06-03 11:32:31amaury.forgeotdarcsetmessages: + msg88806
2009-06-03 11:16:57lemburgsetmessages: + msg88804
2009-06-03 10:15:50pitrousetmessages: + msg88802
2009-06-03 09:36:19lemburgsetmessages: + msg88801
2009-06-02 23:51:46gvanrossumsetmessages: + msg88779
2009-06-02 23:33:08rhettingersetmessages: - msg88776
2009-06-02 23:32:53rhettingersetmessages: + msg88777
2009-06-02 23:30:30rhettingersetmessages: + msg88776
2009-06-02 23:25:43gvanrossumsetmessages: + msg88775
2009-06-02 23:09:39rhettingersetassignee: gvanrossum

messages: + msg88774
nosy: + rhettinger, gvanrossum
2009-06-02 22:25:57lemburgsetmessages: + msg88768
2009-06-02 22:06:59lemburgsetmessages: + msg88767
2009-06-02 17:17:41pitrousetmessages: + msg88753
2009-06-02 16:38:54lemburgsetmessages: + msg88751
2009-06-02 14:17:14eric.smithsetnosy: + eric.smith
2009-06-02 14:12:49pitrousetmessages: + msg88746
2009-06-02 13:53:36lemburgsetmessages: + msg88745
2009-06-02 13:46:45lemburgsetmessages: + msg88744
2009-05-30 22:47:57pitrousetmessages: + msg88585
2009-05-30 22:41:10jimjjewettsetnosy: + jimjjewett
messages: + msg88584
2009-05-25 09:32:04amaury.forgeotdarcsetmessages: + msg88313
2009-05-25 09:24:24pitrousetmessages: + msg88312
2009-05-25 09:21:03lemburgsetmessages: + msg88310
2009-05-25 08:45:44amaury.forgeotdarcsetmessages: + msg88309
2009-05-25 08:17:55lemburgsetmessages: + msg88307
2009-05-24 21:51:43pitrousetstatus: closed -> open
resolution: rejected -> (no value)
messages: + msg88291
2009-05-24 21:48:38lemburgsetstatus: open -> closed
resolution: rejected
messages: + msg88290
2009-05-24 21:36:45pitrousetmessages: + msg88288
2009-05-24 21:31:00lemburgsetmessages: + msg88287
2009-05-23 23:09:36pitrousetfiles: + unialloc5.patch
assignee: lemburg -> (no value)
messages: + msg88254

stage: test needed -> patch review
2009-05-16 19:53:53pitrousetmessages: + msg87918
2009-05-16 19:47:02collinwintersetmessages: + msg87917
2009-05-16 19:37:38ajaksu2setversions: + Python 3.2, - Python 3.0
nosy: + ajaksu2, ezio.melotti, haypo, collinwinter

messages: + msg87912

components: + Unicode
stage: test needed
2008-03-22 12:56:47lemburgsetmessages: + msg64320
2008-03-20 22:19:51pitrousetmessages: + msg64216
2008-03-20 22:13:04lemburgsetmessages: + msg64215
2008-03-20 19:38:14pitrousetfiles: - unialloc3.patch
2008-03-20 19:38:10pitrousetfiles: - unialloc2.patch
2008-03-20 19:38:06pitrousetfiles: - freelists.patch
2008-03-20 19:38:00pitrousetfiles: - unialloc.patch
2008-03-20 19:37:55pitrousetfiles: + freelists2.patch
2008-03-20 19:37:36pitrousetfiles: + unialloc4.patch
2008-03-20 19:36:49pitrousetmessages: + msg64197
2008-03-20 19:10:33lemburgsetmessages: + msg64194
2008-03-20 14:50:41pitrousetmessages: + msg64170
2008-03-20 14:34:33lemburgsetmessages: + msg64169
2008-03-20 13:57:13pitrousetmessages: + msg64168
2008-03-20 12:53:13lemburgsetmessages: + msg64167
2008-03-20 11:32:38pitrousetmessages: + msg64166
2008-03-20 11:00:14lemburgsetmessages: + msg64165
2008-03-20 10:31:00amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg64163
2008-03-20 10:04:00lemburgsetmessages: + msg64161
2008-03-19 22:00:58pitrousetmessages: + msg64112
2008-03-19 21:40:56jafosetpriority: normal
assignee: lemburg
messages: + msg64108
keywords: + patch
nosy: + jafo
2008-02-16 20:32:32pitrousetfiles: + unialloc3.patch
messages: + msg62467
2008-02-12 19:58:45pitrousetfiles: + unialloc2.patch
messages: + msg62332
2008-01-30 15:56:24pitrousetfiles: + freelists.patch
messages: + msg61867
2008-01-30 11:45:14lemburgsetmessages: + msg61862
2008-01-29 21:54:43pitrousetmessages: + msg61832
2008-01-27 19:03:09pitrousetmessages: + msg61747
2008-01-27 18:05:39lemburgsetmessages: + msg61743
2008-01-27 16:46:46pitrousetmessages: + msg61741
2008-01-27 16:03:54lemburgsetmessages: + msg61740
2008-01-27 15:35:36pitrousetmessages: + msg61738
2008-01-27 13:52:04lemburgsetmessages: + msg61735
2008-01-27 00:24:00pitrousetmessages: + msg61732
2008-01-26 23:27:34lemburgsetnosy: + lemburg
messages: + msg61730
2008-01-26 23:09:24orivejsetnosy: + orivej
2008-01-26 22:45:09pitrousetfiles: + stringbench3k.py
2008-01-26 22:44:34pitroucreate