Issue 7224: One obvious way to do interning

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/51473

classification

Title:	One obvious way to do interning
Type:	enhancement	Stage:
Components:	Interpreter Core	Versions:	Python 3.2, Python 2.7

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:	rhettinger	Nosy List:	belopolsky, eric.smith, ncoghlan, pitrou, rhettinger
Priority:	normal	Keywords:	patch

Created on 2009-10-28 01:05 by belopolsky, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
set-add.diff	belopolsky, 2009-10-28 01:05

Messages (11)
msg94595 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2009-10-28 01:05
Attached patch implements python-ideas proposal to return added or existing element from set.add(). See http://mail.python.org/pipermail/python-ideas/2009-October/006491.html . In addition this patch contains a reimplementation of issue1507011 using new behavior of set_add. The two changes are independent an I would submit them separately if I was more optimistic that any would be accepted. :-) This submission is mostly for the benefit of users with applications that create huge number of interned strings. Since the patch saves 8 bytes for each interned string, such application can see appreciable memory savings. I don't have such application and my tests show no difference between patched and stock python. (For a data point, on start up stock python creates about 2,000 interned strings.) My own motivation for writing this patch was the same as for issue1507011: the code that uses a set to store interned strings is more straightforward than code that uses a dict that stores strings twice, both as key and value.
msg94619 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-10-28 09:33
A simple way to try and see a difference would be to import lot of modules. By the way, you shouldn't call it _PySet_Add(), it would cause confusion with the existing PySet_Add(). _PySet_Intern() would be fine.
msg94632 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2009-10-28 16:06
I agree, _PySet_Add name can be improved upon, but I don't want to paint this particular bikeshed until it is clearer what if anything will be done with this idea. If we add PySet_Intern API, then it would be natural to expose it as set.intern rather than changing how set.add works. On the other hand, if set.add grows a return value, then it would make sense to eventually change PySet_Add to conform.
msg94887 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2009-11-04 13:46
If the idea is to create the "one obvious way" for interning, calling the method and corresponding C function "intern" isn't painting a bikeshed, it's meeting the design spec :) Adding a new method also avoids a whole host of issues with the Set ABC.
msg94890 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2009-11-04 14:47
On Wed, Nov 4, 2009 at 9:46 AM, Nick Coghlan <report@bugs.python.org> wrote: > > Nick Coghlan <ncoghlan@gmail.com> added the comment: > > If the idea is to create the "one obvious way" for interning, calling > the method and corresponding C function "intern" isn't painting a > bikeshed, it's meeting the design spec :) > Any bikeshed needs to be painted and _PySet_Add was deliberately just a primer. (Sorry for the cliche abuse.) At the C API level, I was hoping that old PySet_Add semantics could be eventually deprecated and _PySet_Add be promoted to a public method. Keeping two functions that have the same effect but return different values is inelegant especially given that zero return from one is success and from the other is an error, so people who don't pay attention to compiler warnings may get some nasty bugs. Maybe _PySet_Add should be renamed to _PySet_AddAndReturn to emphasize the similarity and difference with PySet_Add. I don't mind calling the method _PySet_Intern (in the original patch I used _PySet_InternString), but that choice will only become obvious if set grows intern(..) method in addition to add(..). I also wonder how widespread the knowledge of the term "intern" is. I would not be surprised if some non-CS users would be more comfortable with AddAndReturn. > Adding a new method also avoids a whole host of issues with the Set ABC. I am neutral on changing add(..) return value vs. adding a new method. Can you elaborate on the "whole host of issues with the Set ABC". Interestingly, current implementation of add(..) does not match PEP 3119 [1] which calls for a boolean return value. [1] http://www.python.org/dev/peps/pep-3119/#sets
msg94940 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-11-05 19:15
The basic problem here is that the "one obvious way" to some people (including me and Martin v. Löwis) is to use a dictionary. Further, there is the problem of conflating types in a user's mind -- right now, dictionaries are all about looking up and returning values, while sets are not. The current notion of sets corresponds neatly with what folks learn in their basic math classes. In contrast, the notion of "canonical representatives of an equivalence class" is a more advanced mathematical concept, one that many people are never taught. This is even more problematic in Python because the operation of finding canonical values is hidden behind well designed __hash__ and __eq__ functions (for example, it is not at all obvious how Decimal(3.0), float(3.0), and int(3.0) all have the same hash value). An additional problem is the modification of set.add() so that it violates Python's norm of having mutating methods return None. We should keep the language consistent in this regard. Also, I should comment on the "appreciable memory savings". This saves only a single pointer (4 bytes on a 32-bit build) out of three per string. But the string object itself takes up memory, so the percent savings per string is smaller than you would might expect (typically less than 15%). Finally, the spirit of the language moratorium is to make it so that other Python implementations don't have to change for a few years. FWIW, I am already going to expand the set/frozenset documention to show some techniques for accessing and using sets. I can add a link to a get_equivalent() recipe that makes it possible to retrieve canonical values from any container including sets. That recipe uses all the containers as-is, no patching or API changes needed.
msg94943 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2009-11-05 19:50
On Thu, Nov 5, 2009 at 2:16 PM, Raymond Hettinger <report@bugs.python.org> wrote: > > Raymond Hettinger <rhettinger@users.sourceforge.net> added the comment: > > The basic problem here is that the "one obvious way" to some people > (including me and Martin v. Löwis) is to use a dictionary. Further, > there is the problem of conflating types in a user's mind -- right now, > dictionaries are all about looking up and returning values, while sets > are not. One feature that is missing from both dict approach and get_equivalent recipe is the ability to do interning of new values with a single lookup. Note that even at the C level one has to first try to retrieve the key from the dictionary and then if that fails, do an insert which repeats the same lookup. Even dict.setdefault which is designed to remove the double lookup while eliminates it when key is present, still does the second lookup when the key is missing.
msg94944 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-11-05 20:23
That is a false optimization. Regular python code is full of look-ups (even set.add has a getattr(s, 'add') lookup just to find the add-method; every call to a built-in typically goes through multiple lookups). Also, the cost of a second lookup is trivial because of processor caching: see Object/dictnotes.txt. Martin has already rejected a similar proposal for similar reasons. Please drop this one. IMO, it is harmful to the set API because it changes set's core concept to include lookup operations instead of just simple set operations. Either use sys.intern() or roll your own using get_equivalent() or a dictionary. A single use-case doesn't warrant building-out the language, providing more ways to do it, or messing the core concept of the set/frozenset datatype.
msg94945 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2009-11-05 20:38
I agree with Raymond here. This use case isn't special enough, or the performance of the current "one way to do it" bad enough to warrant changing set. I recommend closing this issue.
msg94946 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2009-11-05 21:03
On Thu, Nov 5, 2009 at 3:23 PM, Raymond Hettinger <report@bugs.python.org> wrote: .. > Martin has already rejected a similar proposal for similar reasons. > Please drop this one. Sure. In fact I've never proposed to apply this patch. As I said in my original submission, the only purpose of reviving issue1507011 is this way was to give an easy way for users who claimed that in their apps saving 4-8 bytes per interned string would make a difference to test their claim. Just as with issue1507011 three years ago I personally did not see noticeable improvements in my applications from the patch.
msg94947 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-11-05 21:06
Thank you.

History
Date	User	Action	Args
2022-04-11 14:56:54	admin	set	github: 51473
2009-11-05 21:06:35	rhettinger	set	status: open -> closed resolution: rejected messages: + msg94947
2009-11-05 21:03:30	belopolsky	set	messages: + msg94946
2009-11-05 20:38:31	eric.smith	set	nosy: + eric.smith messages: + msg94945
2009-11-05 20:23:34	rhettinger	set	messages: + msg94944
2009-11-05 19:50:52	belopolsky	set	messages: + msg94943
2009-11-05 19:16:01	rhettinger	set	messages: + msg94940
2009-11-04 14:47:03	belopolsky	set	messages: + msg94890
2009-11-04 13:46:31	ncoghlan	set	nosy: + ncoghlan messages: + msg94887
2009-10-28 16:06:28	belopolsky	set	messages: + msg94632
2009-10-28 09:33:01	pitrou	set	nosy: + pitrou messages: + msg94619
2009-10-28 01:20:22	rhettinger	set	assignee: rhettinger nosy: + rhettinger versions: + Python 2.7, Python 3.2
2009-10-28 01:05:27	belopolsky	create