This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: memory used by frozenset created from set differs from that of frozenset created from other iterable
Type: resource usage Stage:
Components: Versions: Python 3.5
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: josh.r, lebedov, pitrou, rhettinger, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2014-05-14 12:32 by lebedov, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
set_length_hint.patch vstinner, 2014-05-14 22:18 review
Messages (8)
msg218524 - (view) Author: Lev Givon (lebedov) Date: 2014-05-14 12:32
Not sure if this is indicative of a bug, but I noticed that a frozenset created from a set seems to occupy a different amount of memory than a frozenset created from some other iterable. I observed this behavior with Python 2.7.5 and with Python 3.4.0 on Ubuntu 14.04 x86_64:

>>> from sys import getsizeof
>>> x = range(100)
>>> s = set(x)
>>> f0 = frozenset(x)
>>> f1 = frozenset(s)
>>> getsizeof(s)
8424
>>> getsizeof(f0)
8424
>>> getsizeof(f1)
4328
>>> f0==f1
True

Original question on StackOverflow available at https://stackoverflow.com/questions/23618259/memory-occupied-by-set-vs-frozenset-in-python-2-7
msg218582 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-05-14 22:18
frozenset constructor has different implementations depending on the input type: set (or frozenset), dict or iterator. The constructor preallocates the frozenset for set and dict, but not for generic iterator and so the set may have a suboptimal size.

Attached patch set_length_hint.patch optimizes also the 3rd case using operator.length_hint (PyObject_LengthHint in C).

Since it is an optimization, I prefer to only apply it to Python 3.5 to limit the risk of regression.
msg218589 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2014-05-15 02:08
I think the argument against using PyObject_LengthHint for the general iterable case is that for inputs other than sets or dicts, the assumption is that significant deduplication will occur. With your patch, if I run:

myset = frozenset([0] * 1000000)

it will allocate space for, if I'm reading it correctly, two million entries, and use exactly one of them. Obviously not the common case, but preallocating when you have no idea how much duplication will occur seems like a bad idea.

With a set or dict, you know it's already deduplicated, so preallocation is always a good thing, but for the general case, you'll be using more memory than necessary much of the time.
msg218593 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-05-15 07:03
I agree with Josh's arguments. Similar idea was already proposed and rejected (issue17338).
msg218595 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-05-15 07:12
"I think the argument against using PyObject_LengthHint for the general iterable case is that for inputs other than sets or dicts, the assumption is that significant deduplication will occur."

Oh... I'm dumb :) Sorry.

Another option for frozenset only: we may adjust the internal size when the frozenset is created from an iterator.
msg218638 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2014-05-16 01:01
Not sure how much that really helps. If I understand you correctly, it would be a memory optimization that would require a round of rehashing to use?

If you wanted to make a change that got "guaranteed" better performance, you might add support for dict's dict_keys and dict_items objects (since those have the same uniqueness guarantees as a set or dict, and I could easily see someone creating sets from them to remove the link to the original dict).

If you wanted to get crazy, you might add preallocation support for collections.abc abstract classes equivalent to the built-ins, specifically, Set, Mapping, KeysView and ItemsView. That's probably not a good idea though, since it ends up calling out to Python code more (slowing you down for the general case) and it's probably not as safe, since a user-defined implementation of those abstract classes might not strictly adhere to the same uniqueness or hashability contracts the built-ins provide.
msg218654 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-05-16 11:25
AFAIU, Lev merely posted this because he feared it might be indicative of a bug. Since it isn't a bug but the by-product of a feature, I propose to close this issue as "not a bug".

Regardless, thank you for posting this report! We appreciate the concern.
msg218754 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-05-18 19:47
I restore the original title, my title was a mistake.
History
Date User Action Args
2022-04-11 14:58:03adminsetgithub: 65706
2014-05-19 19:52:06rhettingersetstatus: open -> closed
2014-05-19 19:48:52rhettingersetassignee: rhettinger
2014-05-18 19:47:55vstinnersetmessages: + msg218754
title: set and frozenset constructor should use operator.length_hint to guess the size of the iterator -> memory used by frozenset created from set differs from that of frozenset created from other iterable
2014-05-16 11:25:09pitrousetresolution: not a bug
messages: + msg218654
2014-05-16 01:01:30josh.rsetmessages: + msg218638
2014-05-15 07:12:01vstinnersetmessages: + msg218595
2014-05-15 07:03:28serhiy.storchakasetmessages: + msg218593
2014-05-15 02:08:21josh.rsetnosy: + josh.r
messages: + msg218589
2014-05-14 22:19:08vstinnersetnosy: + pitrou, serhiy.storchaka
2014-05-14 22:18:50vstinnersetfiles: + set_length_hint.patch


title: memory used by frozenset created from set differs from that of frozenset created from other iterable -> set and frozenset constructor should use operator.length_hint to guess the size of the iterator
keywords: + patch
nosy: + vstinner
versions: + Python 3.5, - Python 3.1, Python 2.7, Python 3.2, Python 3.3, Python 3.4
messages: + msg218582
2014-05-14 13:14:45pitrousetnosy: + rhettinger
2014-05-14 12:32:52lebedovcreate