This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Use a set to keep interned strings
Type: Stage:
Components: Interpreter Core Versions: Python 2.5
process
Status: closed Resolution: rejected
Dependencies: Superseder: Use a set for interned strings
View: 19187
Assigned To: Nosy List: belopolsky, illume, loewis
Priority: normal Keywords: patch

Created on 2006-06-16 01:15 by belopolsky, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
interned-set.patch belopolsky, 2006-06-16 01:15 Patch to use a set object to store interned strings
interned-set-1.patch belopolsky, 2006-06-16 03:44
Messages (11)
msg50477 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2006-06-16 01:15
This patch is a proof of concept only.  In particular, _PySet_LookString is 
*not* proposed for addition to set API.  Better performance can probably 
be achieved by implementing interning completely inside setobject.c .
msg50478 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2006-06-16 03:44
Logged In: YES 
user_id=835142

Fixed code comments and moved all interning logic to setobject.c
msg50479 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-06-16 05:12
Logged In: YES 
user_id=21627

What is the purpose of posting the patch here, then? The
Python patches tracker should only carry patches that are
meant for inclusion in Python. If inclusion is not planned,
it should be closed (it would still be available, then).
msg50480 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2006-06-16 12:48
Logged In: YES 
user_id=835142

The purporse was to allow others to comment on the work in progress.  Would it 
be more appropriate to post it on python-dev instead?  If I close a patch while it 
is not ready, can I reopen it later when it is complete?


In any case, please consider interned-set-1.patch for inclusion in Python.
msg50481 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-06-16 21:55
Logged In: YES 
user_id=21627

I did not read the python-dev discussion before writing my
first comment, so it is fine that you posted the patch here.

Still, I think some rationale should be provided for doing
this: what is the advantage?
msg50482 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2006-06-17 00:52
Logged In: YES 
user_id=835142

Rationale: The container of interned strings logically a set.  Current 
implementation that uses a dictionary with the same keys and values is 
clearly an artificial implementation of a set necessary in earlier versions of 
Python.  With a proper C implementation of sets available, it is natural to use 
it to store interned strings.

Since set and dict objects use the same lookup algorithm, this patch is not 
expected to affect performance and pybench does not show any difference. 
Since a large set is using half of the memory of the equivalent dict, this patch 
is expected to reduce interpretor memory usage.
msg50483 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-06-17 06:44
Logged In: YES 
user_id=21627

I don't think "it's natural to do it" is a convincing object
reason to change a running system. A change should address
correctness, efficiency, maintainability, functionality
(features), legal issues, etc. I can't see any of this in
this patch, so I'm -1.
msg50484 - (view) Author: Rene Dudfield (illume) Date: 2006-06-30 06:27
Logged In: YES 
user_id=2042

The author said that the patch is expected to reduce memory.
 If this is true, then the patch has a good reason to be
applied.  

Another reason is maintainability.  By removing the set like
functionality which is already implemented in the setobject.
msg50485 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-06-30 16:01
Logged In: YES 
user_id=21627

Ok, memory reduction would be a valid reason. How much
memory is being preserved (say, for an empty program, or for
an IDLE run)?

I don't see the improved maintainability. Interning is
inherently *not* a set operation, it's a dictionary
operation: You are given a string, and need to return its
interned version. That's a table lookup. In fact, the set
API had to be changed (to add _PySet_InternString) to add a
lookup operation.
msg50486 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2006-06-30 20:33
Logged In: YES 
user_id=835142

I did not measure the memory usage and I don't think such
measurement will yield a compelling reason for the change. 
I suggested that a set is half of a dict, but that was wrong
because dictentry is only 1.5 times larger than setentry
because of the cached hash and depending on the size of the
internet strings, large portion of the memory will be used
by the string object themselves and that will not change
with the patch.

My primary reason is the simplicity (you can call it
maintainability).  I find the off by two reference count in
the current dict based implementation rather unnatural and
I think the dict that maps keys to themselves is a set.

I disagree with Martin. I think interning is a set
operation and it is unfortunate that set API does not
support it directly.

I guess the current view is that set is not a container.
Although you can put things in a set and check if they
are there, there is no API to get the things back.

I would really like to see set C API to provide all
the functionality of a dict with value=key, but it
is also possible that I complitely misunderstood the
need of the set object in the first place.
 
msg50487 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2007-03-05 12:40
As nobody else is speaking in favour of this patch, I'm rejecting it.

I disagree with Alexander's last remark in several respects: set is indeed a container, and there is a way to get the elements back (namely, by enumerating them, or picking an arbitrary element for removal). There is no operation to check, on insertion of E, what the the prior element equal to E was (if any); there is only a check to determine whether E is in the set. The operation "give me the member equal but not identical to E" is conceptually a lookup operation; the mathematical set construct has no such operation, and the Python set models it closely. IOW, set is *not* a dict with key==value.

Without reviewing the patch again, I also doubt it is capable of getting rid of the reference count cheating: essentially, this cheating enables the interning dictionary to have weak references to strings, this is important to allow automatic collection of certain interned strings. This feature needs to be preserved, so the cheating in the reference count must continue.
History
Date User Action Args
2022-04-11 14:56:18adminsetgithub: 43512
2013-10-09 18:06:20belopolskysetsuperseder: Use a set for interned strings
2006-06-16 01:15:07belopolskycreate