Title: Make set ordered
Type: enhancement Stage: resolved
Components: Interpreter Core Versions:
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: danuker, methane, rhettinger, xtreak
Priority: normal Keywords:

Created on 2020-11-16 10:42 by danuker, last changed 2020-11-17 01:01 by rhettinger. This issue is now closed.

File name Uploaded Description Edit danuker, 2020-11-16 14:06 Naive implementation of set through dict
Messages (5)
msg381088 - (view) Author: Dan Gheorghe Haiduc (danuker) Date: 2020-11-16 10:42
Now that dicts are ordered[1], would it by any chance make sense to also order sets?

A use case that I ran into is trying to reproducibly get a random.choice from a set (after calling random.seed). 

At first I did not need reproducibility, and I just called random.choice(list(my_set)). 

Later when I did need it, it was difficult to find out what was wrong. Then I realized that sets are unordered, and that order is not dependent on random.seed.

It seems there are also some other confused newbies out there.[2][3]

Thank you for the powerful language that is Python!

msg381089 - (view) Author: Dan Gheorghe Haiduc (danuker) Date: 2020-11-16 10:46
Oops. The reference number [2] in the previous message should instead be .
msg381093 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2020-11-16 12:41
See also
msg381098 - (view) Author: Dan Gheorghe Haiduc (danuker) Date: 2020-11-16 14:06
rhettinger wrote [1]:

> The existing setobject code has been finely tuned and micro-optimized over the years, giving it excellent performance on workloads we care about.

This worries me also. But I suppose the tuning should make itself visible in benchmarks. Does Python have "canonical" benchmarks or  speed tests like PyPy[2]?

I am not sure how useful this is, but I made a naive and synthetic line_profiler benchmark of a naive implementation of set through a dict (see attached file).

It resulted in roughly the following differences:

* Initialization from a 1M-sized list: 41-66% slower
* Inserting an item until doubling size: about the same (perhaps faster, due to the size I've chosen not triggering rebuilding of a dict hash table)
* Deletion: 7-11% slower
* Membership test: 2-15% slower

Running them in the opposite order (first dict, then set) gave me the ranges.

I have not considered memory usage (I have not profiled memory before). But I suspect this would be larger, since a dict would keep values in addition to keys.

Additionally, initializing smaller structures (length = 100) seems to be slower; the initialization takes 2x longer (100% slower), but the O(1) operations take about the same.

I suspect methane's implementation linked by xtreak is better (but I have not tried it).

Profiled with:
kernprof -l
python -m line_profiler


msg381198 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-11-17 01:01
Thank for the suggestion.  We've considered it a couple of times before but there are downsides that get in the way:

* The related mathematical concept of sets is unordered.

* It isn't clear what ordering should be returned
  by set operations like intersection, union, and 
  difference especially if we want the first two
  to be commutative and if we want consistent
  in-place and data removal options.

* It gets in the way of existing performance optimizations
  some of which are significant.

* Other programming languages don't seem to have 
  found compelling the need for this.

So yes, it is something we could do, but probably shouldn't.
Date User Action Args
2020-11-17 01:01:32rhettingersetstatus: open -> closed
resolution: rejected
messages: + msg381198

stage: resolved
2020-11-16 14:06:03danukersetfiles: +

messages: + msg381098
2020-11-16 12:41:20xtreaksetnosy: + rhettinger, xtreak, methane
messages: + msg381093
2020-11-16 10:46:21danukersetmessages: + msg381089
2020-11-16 10:42:46danukercreate