Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative implementation of interning #36836

Closed
orenti mannequin opened this issue Jul 1, 2002 · 14 comments
Closed

Alternative implementation of interning #36836

orenti mannequin opened this issue Jul 1, 2002 · 14 comments
Assignees

Comments

@orenti
Copy link
Mannequin

orenti mannequin commented Jul 1, 2002

BPO 576101
Nosy @gvanrossum, @loewis, @rhettinger
Files
  • altintern.patch
  • altintern2.patch
  • altintern4.patch
  • altintern5.patch: updated to CVS as of 8/14, plus API_VERSION
  • altintern6.patch: New version, mostly mortal
  • altintern7.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/gvanrossum'
    closed_at = <Date 2002-08-19.21:44:33.000>
    created_at = <Date 2002-07-01.19:23:00.000>
    labels = []
    title = 'Alternative implementation of interning'
    updated_at = <Date 2002-08-19.21:44:33.000>
    user = 'https://bugs.python.org/orenti'

    bugs.python.org fields:

    activity = <Date 2002-08-19.21:44:33.000>
    actor = 'gvanrossum'
    assignee = 'gvanrossum'
    closed = True
    closed_date = None
    closer = None
    components = ['None']
    creation = <Date 2002-07-01.19:23:00.000>
    creator = 'orenti'
    dependencies = []
    files = ['4392', '4393', '4394', '4395', '4396', '4397']
    hgrepos = []
    issue_num = 576101
    keywords = ['patch']
    message_count = 14.0
    messages = ['40458', '40459', '40460', '40461', '40462', '40463', '40464', '40465', '40466', '40467', '40468', '40469', '40470', '40471']
    nosy_count = 4.0
    nosy_names = ['gvanrossum', 'loewis', 'rhettinger', 'orenti']
    pr_nums = []
    priority = 'normal'
    resolution = 'accepted'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue576101'
    versions = []

    @orenti
    Copy link
    Mannequin Author

    orenti mannequin commented Jul 1, 2002

    An interned string has a flag set indicating that it is
    interned instead of a pointer to the interned string. This
    pointer was almost always either NULL or pointing to the
    same object. The other cases were rare and ineffective
    as an optimization. This saves an average of 3 bytes
    per string.

    Interned strings are no longer immortal. They are
    automatically destroyed when there are no more
    references to them except the global dictionary of
    interned strings.

    New function (actually a macro) PyString_CheckInterned
    to check whether a string is interned. There are no
    more references to ob_sinterned anywhere outside
    stringobject.c.

    @orenti orenti mannequin closed this as completed Jul 1, 2002
    @orenti orenti mannequin assigned gvanrossum Jul 1, 2002
    @orenti orenti mannequin closed this as completed Jul 1, 2002
    @orenti orenti mannequin assigned gvanrossum Jul 1, 2002
    @rhettinger
    Copy link
    Contributor

    Logged In: YES
    user_id=80475

    I like the way you consolidated all of the knowledge about
    interning into one place.

    Consider adding an example to the docs of an effective use
    of interning for optimization.

    @orenti
    Copy link
    Mannequin Author

    orenti mannequin commented Jul 6, 2002

    Logged In: YES
    user_id=562624

    This implementation supports both mortal and immortal interned
    strings.

    PyString_InternInPlace creates an immortal interned string for
    backward compatibility with code that relies on this behavior.

    PyString_Intern creates a mortal interned string that is
    deallocated when its refcnt reaches 0. Note that if the string
    value has been previously interned as immortal this will not
    make it mortal.

    Most places in the interpreter were changed to PyString_Intern
    except those that may be required for compatibility.

    This version of the patch, like the previous one, disables
    indirect interning. Is there any evidence that it is still an
    important optimization for some packages?

    Make sure you rebuild everything after applying this patch
    because it modifies the size of string object headers.

    @orenti
    Copy link
    Mannequin Author

    orenti mannequin commented Jul 6, 2002

    Logged In: YES
    user_id=562624

    Oops, forgot to actually attach the patch. Here it is.

    @orenti
    Copy link
    Mannequin Author

    orenti mannequin commented Aug 10, 2002

    Logged In: YES
    user_id=562624

    General cleanup. Better handling of immortal interned
    strings for backward compatibility.

    It passes regrtest but causes test_gc to leak 20 objects. 13
    from test_finalizer_newclass and 7 from test_del_newclass,
    but only if test_saveall is used. I've tried earlier
    versions of this patch (which were ok at the time) and they
    now create this leak too.

    @gvanrossum
    Copy link
    Member

    Logged In: YES
    user_id=6380

    Here's an update of the patch for current CVS
    (stringobject.h failed due to changes for
    PyAPI_DATA/PyAPI_FUNC).

    Could you add documentation to Doc/api/concrete.tex for
    PyString_Intern() and explains how PyString_InternInPlace()
    differs? (AFAICT it makes the interned string immortal -- I
    suppose this is a B/W compat feature?)

    The variables PYTHON_API_VERSION and PYTHON_API_STRING in
    modsupport.h need to be updated -- many extensions use the
    PyString_AS_STRING() macro which relies on the string object
    format. If an extension compiled with the old code is linked
    with the new interpreter, it will miss the first three bytes
    of string objects -- or even store into memory it doesn't
    own! (I've already added this to the patch I am uploading.)

    (The test_gc failures were unrelated; Tim has fixed this
    already in CVS.)

    I'm tempted to say that except for the API doc issue this is
    complete. Thanks!

    @gvanrossum
    Copy link
    Member

    Logged In: YES
    user_id=6380

    Question for all other reviewers. Why not replace all calls
    to PyString_InternInplace() [which creates immortal strings]
    with PyString_Intern(), making all (core) uses of interning
    yield mortal strings?

    E.g. the call in PyObject_SetAttr() will immortalize all
    strings that are ever used as a key on a setattr operation;
    in a long-lived server like Zope this is a concern, since
    setattr keys are often user-provided data: an endless stream
    of user-provided data will grow the interned dict indefinitely.

    And having the builtin intern() always return an immortal
    string also limits the usability of intern().

    Most of the uses I could find of PyString_InternFromString()
    hold on to a global ref to the object, making it immortal
    anyway; but why should that function itself force the string
    to be immortal? (Especially since the exceptions are things
    like PyObject_GetAttrString() and PyObject_SetItemString(),
    which have the same concerns as PyObject_SetItem().

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 14, 2002

    Logged In: YES
    user_id=21627

    Some mutually unrelated comments:

    • the GC_UnTrack call for interned is not need: GC won't be
      able to explain the reference that stringobject.c holds.

    • why does this try to "fix" the problem of dangling
      interned strings? AFAICT: if there is a reference to an
      interned string at the time _Py_ReleaseInternedStrings is
      called, that reference is silently dropped, and a later
      DECREF will result in memory corruption. IOW: it should
      merely set the state of all strings to normal, and clear the
      dict.

    • Replacing PyString_InternInPlace with PyString_Intern
      seems dangerous. AFAICT, the fragment

    	PyString_InternInPlace(&name);
    	Py_DECREF(name);
    	return PyString_AS_STRING(name);

    from getclassname would break: Intern() would return the
    only reference to the interned string (assuming this is the
    first usage), and getclassname drops this reference,
    returning a pointer to deallocated memory. I'm not sure
    though why getclassname interns the result in the first place.

    Selectively replacing them might be a good idea, though. For
    intern(), I think an optional argument strongref needs to be
    provided (the interned dict essentially weak-references the
    strings). Perhaps the default even needs to be weakref.

    @gvanrossum
    Copy link
    Member

    Logged In: YES
    user_id=6380

    • why does this try to "fix" the problem of
      dangling interned strings? AFAICT: if there is a
      reference to an interned string at the time
      _Py_ReleaseInternedStrings is called, that
      reference is silently dropped, and a later
      DECREF will result in memory corruption. IOW: it
      should merely set the state of all strings to
      normal, and clear the dict.

    Note that the *only* time when
    _Py_ReleaseInternedStrings() can ever be called is
    at program exit, just before you run a memory leak
    detector. There's no way Python can be
    resurrected after _Py_ReleaseInternedStrings() has
    run.

    • Replacing PyString_InternInPlace with
      PyString_Intern seems dangerous. AFAICT, the
      fragment

      PyString_InternInPlace(&name);
      Py_DECREF(name);
      return PyString_AS_STRING(name);

    from getclassname would break: Intern() would
    return the only reference to the interned string
    (assuming this is the first usage), and
    getclassname drops this reference, returning a
    pointer to deallocated memory. I'm not sure
    though why getclassname interns the result in
    the first place.

    getclassname() is doing something very unsavory
    here! I expect that its API will have to be
    changed to copy the name into a buffer provided by
    the caller.

    We'll have to scrutinize all calls for tricks like
    this.

    Selectively replacing them might be a good idea,
    though. For intern(), I think an optional
    argument strongref needs to be provided (the
    interned dict essentially weak-references the
    strings). Perhaps the default even needs to be
    weakref.

    So do you think there's a need for immortal
    strings? What is that need?

    @gvanrossum
    Copy link
    Member

    Logged In: YES
    user_id=6380

    string_dealloc() is a bit optimistic in that it
    doesn't check the DelItem for an error; but I
    don't know what it should do when it gets an error
    at that point. Probably call Py_FatalError(); if
    it wanted to recover, it would have to call
    PyErr_Fetch() / PyErr_Restore() around the
    DelItem() call, because we're in a dealloc handler
    here and that shouldn't change the exception
    state.

    _Py_ReleaseInternedStrings() should use PyDict_ methods, not
    PyMapping_ methods. And it should do more careful error
    checking. But maybe it's best to delete this function --
    it's not needed except when you want to run Insure++, and
    we're not using that any more.

    I note that the whole patch needs to be scrutinized
    carefully looking for missing error checking and things like
    that.

    @orenti
    Copy link
    Mannequin Author

    orenti mannequin commented Aug 15, 2002

    Logged In: YES
    user_id=562624

    Yes, PyString_InternInPlace is for backward compatibility.
    How conservative do we need to be about compatibility?

    My work copy has an option for making strings binary
    compatible. Which is more important: binary compatibility or
    saving 3 bytes?

    A related patch (static names) provides a possible
    alternative to most PyString_InternFromString calls.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 15, 2002

    Logged In: YES
    user_id=21627

    _Py_ReleaseInternedStrings: it might be that embedded
    applications use it. It would not be fair to cause heap
    corruption for them - it would be better to break them at
    link time, by removing the function entirely. I see no need
    to do either - it should just release immortal strings, as
    it always did, if there are any left.

    intern creates immortal strings: It might be that an
    application saves the id() of an interned string and
    releases the interned strings; then expects to get the same
    id back later. If you ask people whether they do that they
    won't tell, because they don't know that they do that. You
    could explicitly decide to break such applications (which
    would be reasonable), but then this needs to be documented.

    binary compatibility: I'm neutral here. If the API is
    bumped, people get sufficient warning.

    PyString_InternInPlace: I think it needs to be preserved,
    since applications may not hold explicit references
    (trusting that the interned dictionary will hold the
    reference). Of course, the InPlace name signals that there
    is no return value, so it is better than _Intern for new users.

    @gvanrossum
    Copy link
    Member

    Logged In: YES
    user_id=6380

    Here's a new version (#6) that makes all interned strings
    mortal unless explicitly requested with
    PyString_InternImmortal(). There are no calls to that
    function in the core.

    I'm very tempted to check this in and see how it goes.

    • Leave all the calls to PyString_InternInPlace(), since
      that is still the recommended API.

    • Got rid of the macro PyString_INTERN(), it was unused.

    • Fixed the issue with getclassname() through an API change
      (it's static so doesn't matter).

    • Rewrote _Py_ReleaseInternedStrings(); it now simply clears
      the immortality status, restores the stolen refcounts, and
      then clears and decrefs the interned dict.

    @gvanrossum
    Copy link
    Member

    Logged In: YES
    user_id=6380

    All checked in. I'm uploading the final version of the
    patch, since #6 contained some garbage (Emacs does something
    bizarre when you edit a patch file).

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    None yet
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants