This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Use a set for interned strings
Type: resource usage Stage: patch review
Components: Interpreter Core Versions: Python 3.4
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: belopolsky, loewis, pitrou, r.david.murray, rhettinger, serhiy.storchaka, vstinner
Priority: low Keywords: patch

Created on 2013-10-07 19:18 by pitrou, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
setintern.patch pitrou, 2013-10-07 19:18 review
set_intern2.diff rhettinger, 2013-10-08 08:19 Untested review
Messages (12)
msg199154 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-07 19:18
This is a simple enhancement that may reduce memory consumption by a bit (unfortunately, this is difficult to measure using standard tools).
msg199155 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-07 19:31
Dict has a special case for string keys. Is set have such optimization?
msg199156 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-07 19:32
> Dict has a special case for string keys. Is set have such optimization?

Yes (see set_lookkey_unicode).
msg199160 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-07 20:23
Note that the resizing heuristic is slightly different for sets and dicts, so a small-to-middle-size dict can sometimesbe smaller than a set of the same len().

However, for large dicts (>= 50000) it seems the corresponding set is most always 66% smaller.
msg199169 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-10-08 03:11
If sets need to be hacked-up to accommodate this, I would like to draft the patch for it that I think will be cleaner.

FWIW, I'm dubious that there will be any benefit from this at all.  The savings of one-pointer is the dictionary is likely to be insignificant compared to the size of the string object themselves.

Also, I think sets conceptually are not the right choice of data structure because they are primarily about membership testing and not about looking up values.   Incorporating this kind of hack will make my other set optimization efforts more difficult.
msg199178 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-08 08:31
I agree the benefit is likely to be small and very minor. Victor has experience measuring memory consumption of various programs, I would like to know about his measurements.
msg199318 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-10-09 18:12
This exact issue was discussed in #19187 and #7224 many years ago.
msg199319 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-10-09 18:14
The first reference above should have been to #1507011.
msg199332 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-10-09 19:17
Raymond> FWIW, I'm dubious that there will be any benefit from this at all.  The savings of one-pointer is the dictionary is likely to be insignificant compared to the size of the string object themselves.

As I wrote in python-dev, the dictionary is usually the largest memory block, at least at Python startup. The dictionary (without counting the string, just the dict) is between 192 KB and 1.5 MB on x86_64.

In the implementation of the PEP 454, issue #18874, I added a function to get the length and size of the dictionary of Unicode interned strings.

Objects/unicodeobject.c:

PyObject*
_PyUnicode_GetInterned(void)
{
    return interned;
}

tracemalloc.get_unicode_interned():

http://hg.python.org/features/tracemalloc/file/b797779940a5/Modules/_tracemalloc.c#l4606

You can use this function to see how many KB are saved. In embedded systems, every byte of memory counts :-)
msg199337 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-10-09 19:35
> In embedded systems, every byte of memory counts 

It is not just embedded systems.  The range 192 KB to 1.5 MB is where typical L2 cache sizes are these days.  I would expect that the intern dictionary is accessed very often and much more often than the actual strings.  In theory, if it fits in L2 cache, the performance will be much higher than if it is even slightly over.

This said, when I was looking at this 3-4 years ago, I did not see measurable improvements in my programs.
msg199338 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-09 19:49
I afraid that adding new parameter to the set_contains_key function will slow down other set operations. It will be better return a key as the result of the function (NULL if not found).
msg199358 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-10-10 04:29
Based on the discussion so far, I'm going to close this one.  I don't think it is an appropriate use of sets (they are about membership testing, not about looking up values). 

The upside is minimal.  The downside is hacking up the set implementation and making it more difficult to implement some of the other alterations I'm planning for sets -- I want to tune the performance to optimize set operations without worrying about slowing down all the rest of Python because of interned strings.  Martin's reason's for rejecting this previously still hold true.
History
Date User Action Args
2022-04-11 14:57:51adminsetgithub: 63386
2013-10-10 04:29:05rhettingersetstatus: open -> closed
resolution: rejected
messages: + msg199358
2013-10-09 19:49:32serhiy.storchakasetmessages: + msg199338
2013-10-09 19:35:37belopolskysetmessages: + msg199337
2013-10-09 19:17:40vstinnersetmessages: + msg199332
2013-10-09 19:05:11serhiy.storchakasetnosy: + loewis
2013-10-09 18:14:16belopolskysetmessages: + msg199319
2013-10-09 18:12:15belopolskysetnosy: + belopolsky
messages: + msg199318
2013-10-09 18:06:20belopolskylinkissue1507011 superseder
2013-10-08 15:59:27pitrousetnosy: + r.david.murray
2013-10-08 08:31:34pitrousetmessages: + msg199178
2013-10-08 08:19:42rhettingersetfiles: + set_intern2.diff
2013-10-08 08:02:01rhettingersetfiles: - set_intern1.diff
2013-10-08 07:33:40rhettingersetfiles: + set_intern1.diff
2013-10-08 03:11:39rhettingersetmessages: + msg199169
2013-10-07 22:47:57rhettingersetassignee: rhettinger

nosy: + rhettinger
2013-10-07 20:23:32pitrousetmessages: + msg199160
2013-10-07 19:32:44pitrousetmessages: + msg199156
2013-10-07 19:31:17serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg199155
2013-10-07 19:18:52pitroucreate