
Author lemburg
Recipients ajaksu2, amaury.forgeotdarc, collinwinter, eric.smith, ezio.melotti, gvanrossum, jafo, jimjjewett, lemburg, orivej, pitrou, rhettinger
Date 2009-06-04.10:54:39
Message-id <4A27A7EE.4010309@egenix.com>
In-reply-to <ca471dc20906031431h79b8e6ia9ac8db280d2d079@mail.gmail.com>
Content
Guido van Rossum wrote:
> Guido van Rossum <guido@python.org> added the comment:
> 
> On Wed, Jun 3, 2009 at 1:41 PM, Antoine Pitrou <report@bugs.python.org> wrote:
>> Apart from the example Marc-André just posted (and which is a 0.0.1
>> proof of concept he apparently just wrote), the number of users is,
>> AFAICT, zero.
> 
> IIUC Marc-Andre extracted that from a larger code base (MX) which he
> owns and has been maintaining for a decade or so.

Only part of it.

I wrote the Unicode Reference sub-type implementation
just a few days ago, in order to demonstrate how easy it is,
provided you have a PyObject (rather than a PyVarObject) to
build on.
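
To illustrate the point, here is a minimal sketch in plain C that does not
use the real CPython headers. FixedStr, RefStr and all function names are
hypothetical stand-ins, and the sketch assumes the reference sub-type works
by pointing into another object's buffer instead of copying it. Because the
base object has a fixed size and keeps its character data in a separate
buffer, the sub-type can simply append a field at a constant offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed-size string object (NOT the real PyUnicodeObject). */
typedef struct {
    size_t refcount;   /* stands in for the PyObject header */
    size_t length;     /* number of characters */
    char *str;         /* character data lives in a separate buffer */
    int owns_buffer;   /* 1 if str is freed together with the object */
} FixedStr;

/* Sub-type: the fixed-size base, then extra fields at a constant offset. */
typedef struct {
    FixedStr base;
    FixedStr *source;  /* object whose buffer base.str points into */
} RefStr;

FixedStr *fixedstr_new(const char *text)
{
    size_t n = strlen(text);
    FixedStr *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->str = malloc(n + 1);
    if (s->str == NULL) {
        free(s);
        return NULL;
    }
    memcpy(s->str, text, n + 1);
    s->refcount = 1;
    s->length = n;
    s->owns_buffer = 1;
    return s;
}

void fixedstr_release(FixedStr *s)
{
    assert(s->refcount > 0);
    if (--s->refcount == 0) {
        if (s->owns_buffer)
            free(s->str);
        free(s);
    }
}

/* Create a slice [start, end) that shares the source buffer (no copy). */
RefStr *refstr_slice(FixedStr *src, size_t start, size_t end)
{
    if (start > end || end > src->length)
        return NULL;
    RefStr *r = malloc(sizeof *r);
    if (r == NULL)
        return NULL;
    r->base.refcount = 1;
    r->base.length = end - start;
    r->base.str = src->str + start;  /* points into the source's data */
    r->base.owns_buffer = 0;
    r->source = src;
    src->refcount++;                 /* keep the source alive */
    return r;
}

void refstr_release(RefStr *r)
{
    assert(r->base.refcount > 0);
    if (--r->base.refcount == 0) {
        fixedstr_release(r->source);
        free(r);
    }
}
```

Any code that only reads base.str and base.length works unchanged on a
RefStr. Note that the slice is not NUL-terminated, so it has to be read
through its length.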

We should really publicize how easy it is to write such type
extensions. I'm sure that a lot of things which often generated
heated discussions (such as the slicing patches for Unicode)
could easily be solved by just adding a few such sub-types to the
core.

>> Unless there's some closed source extension which happens to extend
>> unicode as a C subtype.
> 
> I believe part of MX is closed source.

True. A large part of the code base is not available to the wider
public.

>> Now, as for easing the subclassing of unicode in C, there are probably
>> several possibilities which range from devising a clever set of macros
>> to abusing the ob_size field for a tagged pointer. People who really
>> care should do a concrete proposal (and I don't know who these people
>> are, apart from Marc-André).
> 
> Not really if the core code uses a macro that depends on the layout of
> the object (i.e. the data immediately following the header, like old
> 8-bit strings), unless you change the core (or the macro) to only use
> this if the type matches exactly, and for subtypes use a more
> expensive API. But that would slow down unnecessarily for subclasses
> written in Python (of which there are plenty).
> 
> But I would like to point out that few people if any have ever
> complained about the contiguous allocation for 8-bit strings in Python
> [0-2].x. And we certainly wouldn't have given in. Now that Unicode is
> no longer some fancy-schmancy advanced concept but the basis for *all*
> Python string processing I think we should apply the same policy.

I've spent enough time with this discussion.

If you think it's better to make sub-typing harder and thereby
close the door on improvements which could really speed up
e.g. template processing (by not requiring copying the same data
over and over again), go for it.

I still think that it's better to keep things the way they
are and benefit from the fact that PyUnicodeObjects have a
fixed size with the variable part being dealt with separately.
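
The layout difference can be sketched in plain C. The struct names below
are hypothetical and deliberately simplified; they are not the actual
CPython definitions. With a fixed-size object, a field added by a sub-type
sits at the same offset in every instance. With a contiguous layout, where
the data immediately follows the header, the same field would have to be
placed after the data, so its offset depends on the string length.

```c
#include <stddef.h>

/* Fixed-size layout: the header holds a pointer to separately stored data. */
typedef struct {
    size_t refcount;
    size_t length;
    char *str;
} SeparateStr;

/* Contiguous layout: the data immediately follows the header. */
typedef struct {
    size_t refcount;
    size_t length;
    char data[];       /* flexible array member, length + 1 bytes */
} ContiguousStr;

/* A sub-type of the fixed-size layout just appends its field. */
typedef struct {
    SeparateStr base;
    int extra;
} SeparateSub;

/* Offset of the sub-type's field: a compile-time constant. */
size_t separate_extra_offset(void)
{
    return offsetof(SeparateSub, extra);
}

/* Offset an 'int extra' would need after contiguous data of the given
   length (plus terminator), rounded up to int alignment. It has to be
   recomputed for every instance. */
size_t contiguous_extra_offset(size_t length)
{
    size_t end = offsetof(ContiguousStr, data) + length + 1;
    size_t align = _Alignof(int);
    return (end + align - 1) / align * align;
}
```

A C sub-type of the contiguous layout cannot be written as a plain struct
that embeds the base, which is why it needs extra macros or a slower
accessor of the kind discussed in the quoted text.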

Since pymalloc is being used to manage such objects, there's
a lot of room for improvement, because the allocation scheme
is under our control. E.g. we could have pymalloc allocate
larger pools for PyUnicodeObjects.

Doing the same for variable-sized objects is a lot harder,
consumes more memory and is likely less efficient.
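
As a rough illustration of why fixed-size objects are easy to pool, here is
a small free-list allocator in plain C. It is a simplified sketch and not
pymalloc's actual implementation; Pool, Slot, the 48-byte payload size and
all function names are assumptions made for this example. Every slot has
the same size, so allocating and freeing are each a couple of pointer
operations, and a freed slot can be reused by any later object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One slot: either a link in the free list or storage for one object. */
typedef union Slot {
    union Slot *next_free;       /* valid only while the slot is unused */
    unsigned char payload[48];   /* hypothetical fixed object size */
} Slot;

typedef struct {
    Slot *slots;       /* one contiguous block holding all slots */
    Slot *free_list;   /* head of the list of unused slots */
    size_t capacity;
} Pool;

/* Allocate the block and thread every slot onto the free list. */
int pool_init(Pool *p, size_t capacity)
{
    if (capacity == 0)
        return -1;
    p->slots = malloc(capacity * sizeof(Slot));
    if (p->slots == NULL)
        return -1;
    p->capacity = capacity;
    for (size_t i = 0; i + 1 < capacity; i++)
        p->slots[i].next_free = &p->slots[i + 1];
    p->slots[capacity - 1].next_free = NULL;
    p->free_list = p->slots;
    return 0;
}

/* Pop a slot from the free list; returns NULL when the pool is full. */
void *pool_alloc(Pool *p)
{
    Slot *s = p->free_list;
    if (s == NULL)
        return NULL;
    p->free_list = s->next_free;
    return s;
}

/* Push a slot back onto the free list. */
void pool_free(Pool *p, void *ptr)
{
    Slot *s = ptr;
    assert(s >= p->slots && s < p->slots + p->capacity);
    s->next_free = p->free_list;
    p->free_list = s;
}

void pool_destroy(Pool *p)
{
    free(p->slots);
    p->slots = NULL;
    p->free_list = NULL;
    p->capacity = 0;
}
```

A pool for variable-sized objects cannot hand back arbitrary freed slots
this way; it needs size classes or a search for a block that fits, which is
where the extra memory and bookkeeping come from.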
History
Date                 User     Action  Args
2009-06-04 10:54:44  lemburg  set     recipients: + lemburg, gvanrossum, collinwinter, rhettinger, jafo, jimjjewett, amaury.forgeotdarc, pitrou, eric.smith, ajaksu2, orivej, ezio.melotti
2009-06-04 10:54:42  lemburg  link    issue1943 messages
2009-06-04 10:54:39  lemburg  create