
Author lemburg
Recipients ajaksu2, amaury.forgeotdarc, collinwinter, eric.smith, ezio.melotti, gvanrossum, jafo, jimjjewett, lemburg, orivej, pitrou, rhettinger, vstinner
Date 2009-06-03.09:36:10
> That's unfortunate; it would clearly have been easier to change this in 3.1.
> That said, I'm not sure anyone *should* be subclassing PyUnicode. Maybe
> Marc-Andre can explain why he is doing this (or point to the message in
> this thread where he explained this before)? 

The Unicode type was designed to be a basic, useful type with as
much functionality as needed, but no more. Since it was clear at
the time that we would get sub-typing, the aim was to move special
use cases into such sub-types.

One such example was the referencing logic used in Fredrik's
implementation and often requested by the Python community.

I removed that logic from the implementation because of the problems
it would have caused: references from small slices can accidentally
keep large referenced Unicode objects alive.

However, it's easy to add it back by sub-typing the Unicode type:
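In Python terms, such referencing logic can be sketched roughly as
follows (the name RefStr and its attributes are illustrative, not
part of any actual implementation):

```python
class RefStr(str):
    """A str that remembers the larger string it was sliced from,
    mimicking the referencing logic of the original implementation."""

    def __new__(cls, base, start, stop):
        self = super().__new__(cls, base[start:stop])
        self._base = base          # note: this keeps the original alive
        self._span = (start, stop)
        return self

s = "some very large unicode buffer"
part = RefStr(s, 5, 9)
assert part == "very"
assert part._base is s             # the base string stays referenced
```

The last assertion also illustrates the downside mentioned above:
as long as the slice lives, the full base string cannot be freed.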

Other special use cases:

* sub-types which hold a reference to the original bytes string, e.g.
to implement a round-trip safe storage even of broken text data or
text that uses multiple encodings

* sub-types that get their data from a memory mapped file or a
shared memory area without copying

* sub-types that implement indexing based on glyphs (i.e. the human-
recognizable notion of a character) rather than code points

* sub-types that implement special extra methods or provide case
insensitive operations

* sub-types that implement special text data forms, such as URLs,
OS paths, UID strings, etc. and custom operations on those

Sub-typing is also encouraged by the documentation:

... and after all: one of the main points in making all built-in
types sub-typeable was to actually do it :-)

> If it's a viable use case,
> it should be possible to have some symbol or a few symbols whose
> presence can be tested in the preprocessor by code that needs to
> subclass; we should design the patch with that in mind and Marc-Andre
> could help testing it.
> All this is assuming the speed-up is important enough to bother.  Has
> anyone run a comparison benchmark using the Unladen Swallow benchmarks?
>  I trust those much more than micro-benchmarks (including, I assume,
>  I do expect that reducing the number of allocations
> for short-to-medium-size strings from 2 to 1 would be a significant
> speed-up, but I can't guess how much.

While Unladen Swallow aims at providing high-level benchmarks,
its current state doesn't really fulfill that promise (yet).

If you look at the list of benchmarks they use, most appear to deal
with pickling. That doesn't strike me as particularly useful
for testing real-life Python usage.

If a high-level benchmark is indeed what's wanted, then they should
set up pre-configured Django and Zope instances and run those through
a series of real-life usage scenarios to cover the web application
use space. For scientific use cases, it would be good to have similar
setups using BioPython, NumPy and matplotlib. And so on. Much like
the high-level benchmarks available in the Windows world.

Depending on the use case, the results of benchmarks for this
particular change are difficult to predict or interpret.

Here's a summary message with my reasoning for rejecting the patch:

Instead of changing PyUnicodeObject from a PyObject to a PyVarObject,
making sub-typing a lot harder, I'd much rather apply a single change
for 3.1: raising the KEEPALIVE_SIZE_LIMIT to 32 as explained and
motivated here:

That's a simple non-disruptive change which makes a lot of sense
due to the advances in CPU designs in the last 9 years. I determined
the original value of 9 using benchmarks and similar statistics in

It's probably also a good time to remove the warning, now that the
implementation has proven itself for so many years...

/* Limit for the Unicode object free list stay alive optimization.

   The implementation will keep allocated Unicode memory intact for
   all objects on the free list having a size less than this
   limit. This reduces malloc() overhead for small Unicode objects.

   At worst this will result in PyUnicode_MAXFREELIST *
   (sizeof(PyUnicodeObject) + KEEPALIVE_SIZE_LIMIT +
   malloc()-overhead) bytes of unused garbage.

   Setting the limit to 0 effectively turns the feature off.

   Note: This is an experimental feature! If you get core dumps when
   using Unicode objects, turn this feature off.

*/

#define KEEPALIVE_SIZE_LIMIT       9