This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Unsupported provider

classification
Title: One obvious way to do interning
Type: enhancement Stage:
Components: Interpreter Core Versions: Python 3.2, Python 2.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: belopolsky, eric.smith, ncoghlan, pitrou, rhettinger
Priority: normal Keywords: patch

Created on 2009-10-28 01:05 by belopolsky, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
set-add.diff belopolsky, 2009-10-28 01:05
Messages (11)
msg94595 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2009-10-28 01:05
Attached patch implements python-ideas proposal to return added or 
existing element from set.add().  See 
http://mail.python.org/pipermail/python-ideas/2009-October/006491.html .

In addition this patch contains a reimplementation of issue1507011 using 
new behavior of set_add.

The two changes are independent an I would submit them separately if I 
was more optimistic that any would be accepted. :-)

This submission is mostly for the benefit of users with applications 
that create huge number of interned strings.  Since the patch saves 8 
bytes for each interned string, such application can see appreciable 
memory savings.

I don't have such application and my tests show no difference between 
patched and stock python.  (For a data point, on start up stock python 
creates about 2,000 interned strings.)

My own motivation for writing this patch was the same as for issue1507011: the code that uses a set to store interned strings is more  
straightforward than code that uses a dict that stores strings twice, 
both as key and value.
msg94619 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-10-28 09:33
A simple way to try and see a difference would be to import lot of modules.

By the way, you shouldn't call it _PySet_Add(), it would cause confusion
with the existing PySet_Add(). _PySet_Intern() would be fine.
msg94632 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2009-10-28 16:06
I agree, _PySet_Add name can be improved upon, but I don't want to paint 
this particular bikeshed until it is clearer what if anything will be done 
with this idea.  If we add PySet_Intern API, then it would be natural to 
expose it as set.intern rather than changing how set.add works.  On the 
other hand, if set.add grows a return value, then it would make sense to 
eventually change PySet_Add to conform.
msg94887 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2009-11-04 13:46
If the idea is to create the "one obvious way" for interning, calling
the method and corresponding C function "intern" isn't painting a
bikeshed, it's meeting the design spec :)

Adding a new method also avoids a whole host of issues with the Set ABC.
msg94890 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2009-11-04 14:47
On Wed, Nov 4, 2009 at 9:46 AM, Nick Coghlan <report@bugs.python.org> wrote:
>
> Nick Coghlan <ncoghlan@gmail.com> added the comment:
>
> If the idea is to create the "one obvious way" for interning, calling
> the method and corresponding C function "intern" isn't painting a
> bikeshed, it's meeting the design spec :)
>

Any bikeshed needs to be painted and _PySet_Add was deliberately just
a primer. (Sorry for the cliche abuse.)  At the C API level, I was
hoping that old PySet_Add semantics could be eventually deprecated and
_PySet_Add be promoted to a public method.  Keeping two functions that
have the same effect but return different values is inelegant
especially given that zero return from one is success and from the
other is an error, so people who don't pay attention to compiler
warnings may get some nasty bugs.  Maybe _PySet_Add should be renamed
to _PySet_AddAndReturn to emphasize the similarity and difference with
PySet_Add.  I don't mind calling the method _PySet_Intern (in the
original patch I used _PySet_InternString), but that choice will only
become obvious if set grows intern(..) method in addition to add(..).
I also wonder how widespread the knowledge of the term "intern" is.  I
would not be surprised if some non-CS users would be more comfortable
with AddAndReturn.

> Adding a new method also avoids a whole host of issues with the Set ABC.

I am neutral on changing add(..) return value vs. adding a new method.
 Can you elaborate on the "whole host of issues with the Set ABC".
Interestingly, current implementation of add(..) does not match PEP
3119 [1] which calls for a boolean return value.

[1] http://www.python.org/dev/peps/pep-3119/#sets
msg94940 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-11-05 19:15
The basic problem here is that the "one obvious way" to some people
(including me and Martin v. Löwis) is to use a dictionary.  Further,
there is the problem of conflating types in a user's mind -- right now,
dictionaries are all about looking up and returning values, while sets
are not.  

The current notion of sets corresponds neatly with what folks learn in
their basic math classes.  In contrast, the notion of "canonical
representatives of an equivalence class" is a more advanced mathematical
concept, one that many people are never taught.  This is even more
problematic in Python because the operation of finding canonical values
is hidden behind well designed __hash__ and __eq__ functions (for
example, it is not at all obvious how Decimal(3.0), float(3.0), and
int(3.0) all have the same hash value).  

An additional problem is the modification of set.add() so that it
violates Python's norm of having mutating methods return None.  We
should keep the language consistent in this regard.

Also, I should comment on the "appreciable memory savings".  This saves
only a single pointer (4 bytes on a 32-bit build) out of three per
string.  But the string object itself takes up memory, so the percent
savings per string is smaller than you would might expect (typically
less than 15%).

Finally, the spirit of the language moratorium is to make it so that
other Python implementations don't have to change for a few years.

FWIW, I am already going to expand the set/frozenset documention to show
some techniques for accessing and using sets.  I can add a link to a
get_equivalent() recipe that makes it possible to retrieve canonical
values from any container including sets.  That recipe uses all the
containers as-is, no patching or API changes needed.
msg94943 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2009-11-05 19:50
On Thu, Nov 5, 2009 at 2:16 PM, Raymond Hettinger
<report@bugs.python.org> wrote:
>
> Raymond Hettinger <rhettinger@users.sourceforge.net> added the comment:
>
> The basic problem here is that the "one obvious way" to some people
> (including me and Martin v. Löwis) is to use a dictionary.  Further,
> there is the problem of conflating types in a user's mind -- right now,
> dictionaries are all about looking up and returning values, while sets
> are not.

One feature that is missing from both dict approach and get_equivalent
recipe is the ability to do interning of new values with a single
lookup.  Note that even at the C level one has to first try to
retrieve the key from the dictionary and then if that fails, do an
insert which repeats the same lookup.  Even dict.setdefault which is
designed to remove the double lookup while eliminates it when key is
present, still does the second lookup when the key is missing.
msg94944 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-11-05 20:23
That is a false optimization.  Regular python code is full of look-ups
(even set.add has a getattr(s, 'add') lookup just to find the
add-method; every call to a built-in typically goes through multiple
lookups).   Also, the cost of a second lookup is trivial because of
processor caching:  see Object/dictnotes.txt.

Martin has already rejected a similar proposal for similar reasons. 
Please drop this one.  IMO, it is harmful to the set API because it
changes set's core concept to include lookup operations instead of just
simple set operations.

Either use sys.intern() or roll your own using get_equivalent() or a
dictionary.  A single use-case doesn't warrant building-out the
language, providing more ways to do it, or messing the core concept of
the set/frozenset datatype.
msg94945 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2009-11-05 20:38
I agree with Raymond here. This use case isn't special enough, or the
performance of the current "one way to do it" bad enough to warrant
changing set. I recommend closing this issue.
msg94946 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2009-11-05 21:03
On Thu, Nov 5, 2009 at 3:23 PM, Raymond Hettinger
<report@bugs.python.org> wrote:
..
> Martin has already rejected a similar proposal for similar reasons.
> Please drop this one.

Sure.  In fact I've never proposed to apply this patch.  As I said in
my original submission, the only purpose of reviving  issue1507011 is
this way was to give an easy way for users who claimed that in their
apps saving 4-8 bytes per interned string would make a difference to
test their claim.

Just as with issue1507011 three years ago I personally did not see
noticeable improvements in my applications from the patch.
msg94947 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-11-05 21:06
Thank you.
History
Date User Action Args
2022-04-11 14:56:54adminsetgithub: 51473
2009-11-05 21:06:35rhettingersetstatus: open -> closed
resolution: rejected
messages: + msg94947
2009-11-05 21:03:30belopolskysetmessages: + msg94946
2009-11-05 20:38:31eric.smithsetnosy: + eric.smith
messages: + msg94945
2009-11-05 20:23:34rhettingersetmessages: + msg94944
2009-11-05 19:50:52belopolskysetmessages: + msg94943
2009-11-05 19:16:01rhettingersetmessages: + msg94940
2009-11-04 14:47:03belopolskysetmessages: + msg94890
2009-11-04 13:46:31ncoghlansetnosy: + ncoghlan
messages: + msg94887
2009-10-28 16:06:28belopolskysetmessages: + msg94632
2009-10-28 09:33:01pitrousetnosy: + pitrou
messages: + msg94619
2009-10-28 01:20:22rhettingersetassignee: rhettinger

nosy: + rhettinger
versions: + Python 2.7, Python 3.2
2009-10-28 01:05:27belopolskycreate