classification
Title: Alternative implementation of interning
Type: Stage:
Components: None Versions:
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: gvanrossum Nosy List: gvanrossum, loewis, orenti, rhettinger
Priority: normal Keywords: patch

Created on 2002-07-01 19:23 by orenti, last changed 2002-08-19 21:44 by gvanrossum. This issue is now closed.

Files
File name Uploaded Description Edit
altintern.patch orenti, 2002-07-01 19:23
altintern2.patch orenti, 2002-07-06 16:08
altintern4.patch orenti, 2002-08-10 10:01
altintern5.patch gvanrossum, 2002-08-14 18:43 updated to CVS as of 8/14, plus API_VERSION
altintern6.patch gvanrossum, 2002-08-16 18:57 New version, mostly mortal
altintern7.patch gvanrossum, 2002-08-19 21:44
Messages (14)
msg40458 - (view) Author: Oren Tirosh (orenti) Date: 2002-07-01 19:23
An interned string has a flag set indicating that it is 
interned instead of a pointer to the interned string. This 
pointer was almost always either NULL or pointing to the 
same object. The other cases were rare and ineffective 
as an optimization.  This saves an average of 3 bytes 
per string.

Interned strings are no longer immortal.  They are 
automatically destroyed when there are no more 
references to them except the global dictionary of 
interned strings.

New function (actually a macro) PyString_CheckInterned 
to check whether a string is interned.  There are no 
more references to ob_sinterned anywhere outside 
stringobject.c.
msg40459 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2002-07-02 04:21
Logged In: YES 
user_id=80475

I like the way you consolidated all of the knowledge about 
interning into one place.

Consider adding an example to the docs of an effective use 
of interning for optimization.
msg40460 - (view) Author: Oren Tirosh (orenti) Date: 2002-07-06 14:35
Logged In: YES 
user_id=562624

This implementation supports both mortal and immortal interned 
strings.

PyString_InternInPlace creates an immortal interned string for 
backward compatibility with code that relies on this behavior.

PyString_Intern creates a mortal interned string that is 
deallocated when its refcnt reaches 0.  Note that if the string 
value has been previously interned as immortal this will not 
make it mortal.

Most places in the interpreter were changed to PyString_Intern 
except those that may be required for compatibility.

This version of the patch, like the previous one, disables 
indirect interning. Is there any evidence that it is still an 
important optimization for some packages?

Make sure you rebuild everything after applying this patch 
because it modifies the size of string object headers.

msg40461 - (view) Author: Oren Tirosh (orenti) Date: 2002-07-06 16:08
Logged In: YES 
user_id=562624

Oops, forgot to actually attach the patch. Here it is.
msg40462 - (view) Author: Oren Tirosh (orenti) Date: 2002-08-10 10:01
Logged In: YES 
user_id=562624

General cleanup. Better handling of immortal interned
strings for backward compatibility.

It passes regrtest but causes test_gc to leak 20 objects. 13
from test_finalizer_newclass and 7 from test_del_newclass,
but only if test_saveall is used. I've tried earlier
versions of this patch (which were ok at the time) and they
now create this leak too.
msg40463 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-08-14 18:43
Logged In: YES 
user_id=6380

Here's an update of the patch for current CVS
(stringobject.h failed due to changes for
PyAPI_DATA/PyAPI_FUNC).

Could you add documentation to Doc/api/concrete.tex for
PyString_Intern() and explains how PyString_InternInPlace()
differs? (AFAICT it makes the interned string immortal -- I
suppose this is a B/W compat feature?)

The variables PYTHON_API_VERSION and PYTHON_API_STRING in
modsupport.h need to be updated -- many extensions use the
PyString_AS_STRING() macro which relies on the string object
format. If an extension compiled with the old code is linked
with the new interpreter, it will miss the first three bytes
of string objects -- or even store into memory it doesn't
own! (I've already added this to the patch I am uploading.)

(The test_gc failures were unrelated; Tim has fixed this
already in CVS.)

I'm tempted to say that except for the API doc issue this is
complete. Thanks!
msg40464 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-08-14 19:32
Logged In: YES 
user_id=6380

Question for all other reviewers. Why not replace all calls
to PyString_InternInplace() [which creates immortal strings]
with PyString_Intern(), making all (core) uses of interning
yield mortal strings?

E.g. the call in PyObject_SetAttr() will immortalize all
strings that are ever used as a key on a setattr operation;
in a long-lived server like Zope this is a concern, since
setattr keys are often user-provided data: an endless stream
of user-provided data will grow the interned dict indefinitely.

And having the builtin intern() always return an immortal
string also limits the usability of intern().

Most of the uses I could find of PyString_InternFromString()
hold on to a global ref to the object, making it immortal
anyway; but why should that function itself force the string
to be immortal?  (Especially since the exceptions are things
like PyObject_GetAttrString() and PyObject_SetItemString(),
which have the same concerns as PyObject_SetItem().
msg40465 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-08-14 22:24
Logged In: YES 
user_id=21627

Some mutually unrelated comments:

- the GC_UnTrack call for interned is not need: GC won't be
able to explain the reference that stringobject.c holds.

- why does this try to "fix" the problem of dangling
interned strings? AFAICT: if there is a reference to an
interned string at the time _Py_ReleaseInternedStrings is
called, that reference is silently dropped, and a later
DECREF will result in memory corruption. IOW: it should
merely set the state of all strings to normal, and clear the
dict.

- Replacing PyString_InternInPlace with PyString_Intern
seems dangerous. AFAICT, the fragment

	PyString_InternInPlace(&name);
	Py_DECREF(name);
	return PyString_AS_STRING(name);

from getclassname would break: Intern() would return the
only reference to the interned string (assuming this is the
first usage), and getclassname drops this reference,
returning a pointer to deallocated memory. I'm not sure
though why getclassname interns the result in the first place.

Selectively replacing them might be a good idea, though. For
intern(), I think an optional argument strongref needs to be
provided (the interned dict essentially weak-references the
strings). Perhaps the default even needs to be weakref.
msg40466 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-08-15 01:26
Logged In: YES 
user_id=6380

> - why does this try to "fix" the problem of
> dangling interned strings? AFAICT: if there is a
> reference to an interned string at the time
> _Py_ReleaseInternedStrings is called, that
> reference is silently dropped, and a later
> DECREF will result in memory corruption. IOW: it
> should merely set the state of all strings to
> normal, and clear the dict.

Note that the *only* time when
_Py_ReleaseInternedStrings() can ever be called is
at program exit, just before you run a memory leak
detector.  There's no way Python can be
resurrected after _Py_ReleaseInternedStrings() has
run.

> - Replacing PyString_InternInPlace with
> PyString_Intern seems dangerous. AFAICT, the
> fragment
> 
> 	PyString_InternInPlace(&name);
> 	Py_DECREF(name);
> 	return PyString_AS_STRING(name);
> 
> from getclassname would break: Intern() would
> return the only reference to the interned string
> (assuming this is the first usage), and
> getclassname drops this reference, returning a
> pointer to deallocated memory. I'm not sure
> though why getclassname interns the result in
> the first place.

getclassname() is doing something very unsavory
here!  I expect that its API will have to be
changed to copy the name into a buffer provided by
the caller.

We'll have to scrutinize all calls for tricks like
this.

> Selectively replacing them might be a good idea,
> though. For intern(), I think an optional
> argument strongref needs to be provided (the
> interned dict essentially weak-references the
> strings). Perhaps the default even needs to be
> weakref.

So do you think there's a need for immortal
strings?  What is that need?
msg40467 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-08-15 03:03
Logged In: YES 
user_id=6380

string_dealloc() is a bit optimistic in that it
doesn't check the DelItem for an error; but I
don't know what it should do when it gets an error
at that point.  Probably call Py_FatalError(); if
it wanted to recover, it would have to call
PyErr_Fetch() / PyErr_Restore() around the
DelItem() call, because we're in a dealloc handler
here and that shouldn't change the exception
state.

_Py_ReleaseInternedStrings() should use PyDict_ methods, not
PyMapping_ methods. And it should do more careful error
checking. But maybe it's best to delete this function --
it's not needed except when you want to run Insure++, and
we're not using that any more.

I note that the whole patch needs to be scrutinized
carefully looking for missing error checking and things like
that.
msg40468 - (view) Author: Oren Tirosh (orenti) Date: 2002-08-15 03:44
Logged In: YES 
user_id=562624

Yes, PyString_InternInPlace is for backward compatibility.
How conservative do we need to be about compatibility?

My work copy has an option for making strings binary
compatible. Which is more important: binary compatibility or
saving 3 bytes?

A related patch (static names) provides a possible
alternative to most PyString_InternFromString calls.

msg40469 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-08-15 08:45
Logged In: YES 
user_id=21627

_Py_ReleaseInternedStrings: it might be that embedded
applications use it. It would not be fair to cause heap
corruption for them - it would be better to break them at
link time, by removing the function entirely. I see no need
to do either - it should just release immortal strings, as
it always did, if there are any left.

intern creates immortal strings: It might be that an
application saves the id() of an interned string and
releases the interned strings; then expects to get the same
id back later. If you ask people whether they do that they
won't tell, because they don't know that they do that. You
could explicitly decide to break such applications (which
would be reasonable), but then this needs to be documented.

binary compatibility: I'm neutral here. If the API is
bumped, people get sufficient warning.

PyString_InternInPlace: I think it needs to be preserved,
since applications may not hold explicit references
(trusting that the interned dictionary will hold the
reference). Of course, the InPlace name signals that there 
is no return value, so it is better than _Intern for new users.
msg40470 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-08-16 18:57
Logged In: YES 
user_id=6380

Here's a new version (#6) that makes all interned strings
mortal unless explicitly requested with
PyString_InternImmortal(). There are no calls to that
function in the core.

I'm very tempted to check this in and see how it goes.

- Leave all the calls to PyString_InternInPlace(), since
that is still the recommended API.

- Got rid of the macro PyString_INTERN(), it was unused.

- Fixed the issue with getclassname() through an API change
(it's static so doesn't matter).

- Rewrote _Py_ReleaseInternedStrings(); it now simply clears
the immortality status, restores the stolen refcounts, and
then clears and decrefs the interned dict.
msg40471 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-08-19 21:44
Logged In: YES 
user_id=6380

All checked in. I'm uploading the final version of the
patch, since #6 contained some garbage (Emacs does something
bizarre when you edit a patch file).
History
Date User Action Args
2002-07-01 19:23:00orenticreate