classification
Title: interned strings are stored in a dict, a set would use less memory
Type: enhancement Stage: needs patch
Components: Versions: Python 3.6
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: gregory.p.smith, josh.r, maciej.szulik, nnorwitz, rhettinger, serhiy.storchaka, vstinner
Priority: low Keywords: needs review, patch

Created on 2016-02-08 20:42 by gregory.p.smith, last changed 2016-03-13 15:48 by rhettinger. This issue is now closed.

Files
File name Uploaded Description Edit
interned_set_27.diff gregory.p.smith, 2016-02-08 23:52 review
interned_set_27-gps02.diff gregory.p.smith, 2016-03-11 23:19 review
Messages (12)
msg259885 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2016-02-08 20:42
The implementation of string interning uses a dict [1].  It would consume less memory and be a bit simpler if it used a set.

Identifier strings in a program are interned.  If you have a large program with a lot of code, this makes for a large dictionary.

Experimenting with changing this to use a set on 2.7 found ~22k savings on an interactive interpreter startup.  Measuring it on a huge application showed a few hundred k saved.

[1]: https://hg.python.org/cpython/file/3.5/Objects/unicodeobject.c#l1579
msg259899 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2016-02-08 23:39
Presumably it would involve using private set APIs to make this work, right? Since normally you can't look up the actual value in a set, just check for existence?
msg259900 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2016-02-08 23:52
Here's an example patch against 2.7 by nnorwitz that we're currently testing.
msg260052 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-10 22:38
Why did you remove _Py_ReleaseInternedStrings()? It looks like a big memory leak when Python is embedded.
msg260063 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2016-02-11 00:52
Because it was only called from within an "#ifdef __INSURE__" which we weren't using.  I called it an "example" patch for a reason.  Updating that function to deal with the set instead of dict seems wise.

Ironically... a few days after we did this we may have just found reason to put _Py_ReleaseInternedStrings() back and use it when compiled using clang sanitizers.  (untested)
msg261614 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2016-03-11 23:19
updated patch:

Don't Py_CLEAR the dummy sentinel value in PySet_Fini. The interned set uses it.  Otherwise it causes problems when re-initializing after a Fini as happens in processes the setup and tear down an embedded interpreter multiple times.
msg261634 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-03-12 05:26
Dicts are hard optimized for string keys due to using them for globals and attributes lookup. Sets are optimized too, but rather for accident. We can be sure that dicts will be always optimized, but set optimization can be eliminated for the sake of simplification or if it will be proven that special casing strings has enough large negative effect on non-string items.
msg261674 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-03-13 06:40
I agree with Serhiy that this should not be done.   IIRC, there were some discussions on python-dev or python-ideas about using sets for interning and the judgment was the dicts are conceptually the right type and that using sets would be hack.  The other thought was that while a set is generally two-thirds the size of dict, the bulk of the space is for the interned strings themselves.
msg261676 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2016-03-13 07:02
The space for the strings is a fixed cost, the structure used to store them for efficient lookup is the only overhead that can be trimmed and is all in one contiguous allocation.

regardless, i agree, this isn't a large savings. priority low, feel free to drop if it you want.
msg261680 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-03-13 07:21
Since interned strings table can only grow and contains exact strings, other data structure may be more appropriate (for example Modules/hashtable.c).

But for now status quo is good to me.
msg261691 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-03-13 10:03
Serhiy Storchaka added the comment:
>
> Since interned strings table can only grow and contains exact strings,
> other data structure may be more appropriate (for example
> Modules/hashtable.c).
>

FYI this module is not well optimized. I took code and then adapted it for
my needs in tracemalloc. If you want to use it outside, you should check
again that parameters like used buckets/total buckets ratio are well chosen.
msg261699 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-03-13 15:48
GPS: priority low, feel free to drop if it you want.
SS: But for now status quo is good to me.

Marking as closed.  Feel free to open a separate tracker item if you want to pursue the use of Modules/hashtable.c.
History
Date User Action Args
2016-03-13 15:48:39rhettingersetstatus: open -> closed
resolution: rejected
messages: + msg261699
2016-03-13 10:03:21vstinnersetmessages: + msg261691
2016-03-13 07:21:42serhiy.storchakasetmessages: + msg261680
2016-03-13 07:02:52gregory.p.smithsetmessages: + msg261676
2016-03-13 06:40:20rhettingersetassignee: rhettinger
messages: + msg261674
2016-03-12 05:26:28serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg261634
2016-03-11 23:19:19gregory.p.smithsetfiles: + interned_set_27-gps02.diff

messages: + msg261614
2016-02-11 00:52:13gregory.p.smithsetmessages: + msg260063
2016-02-10 22:38:46vstinnersetnosy: + vstinner
messages: + msg260052
2016-02-08 23:52:01gregory.p.smithsetkeywords: + needs review, patch
files: + interned_set_27.diff
messages: + msg259900
2016-02-08 23:39:31josh.rsetnosy: + josh.r
messages: + msg259899
2016-02-08 21:23:04maciej.szuliksetnosy: + maciej.szulik
2016-02-08 20:49:52serhiy.storchakasetnosy: + rhettinger
2016-02-08 20:42:46gregory.p.smithcreate