This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Reduce pickle size for an empty set
Type: performance
Stage: patch review
Components: Interpreter Core
Versions: Python 3.2

process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: alexandre.vassalotti, belopolsky, fdrake, loewis, mark.dickinson, pitrou, rhettinger
Priority: low
Keywords: patch

Created on 2010-06-29 19:28 by belopolsky, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
empty-set-pickle.diff (uploaded by belopolsky, 2010-06-29 19:28)
Messages (15)
msg108937 - Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-06-29 19:28
Empty sets are pickled as set([]).  The pickle contains the serialization of an empty list, which is passed to the set constructor during unpickling:

>>> dis(dumps(set()))
    0: \x80 PROTO      3
    2: c    GLOBAL     'builtins set'
   16: q    BINPUT     0
   18: ]    EMPTY_LIST
   19: q    BINPUT     1
   21: \x85 TUPLE1
   22: q    BINPUT     2
   24: R    REDUCE
   25: q    BINPUT     3
   27: .    STOP

The proposed patch pickles an empty set as set() instead:

>>> dis(dumps(set()))
    0: \x80 PROTO      3
    2: c    GLOBAL     'builtins set'
   16: q    BINPUT     0
   18: )    EMPTY_TUPLE
   19: R    REDUCE
   20: q    BINPUT     1
   22: .    STOP
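The two disassemblies differ only in how the constructor arguments are encoded: EMPTY_LIST wrapped in TUPLE1 versus a bare EMPTY_TUPLE. The effect can be reproduced with a toy class whose __reduce__ mimics each form (a sketch; EmulateOld and EmulateNew are illustrative names, not part of the patch):

```python
import pickle

class EmulateOld:
    # Mimics the pre-patch behavior: reduce to (cls, ([],)), so the
    # pickle must encode an empty-list argument.
    def __reduce__(self):
        return (EmulateOld, ([],))

class EmulateNew:
    # Mimics the patched behavior: reduce to (cls, ()), which pickles
    # as a single EMPTY_TUPLE opcode.
    def __reduce__(self):
        return (EmulateNew, ())

old = len(pickle.dumps(EmulateOld(), protocol=3))
new = len(pickle.dumps(EmulateNew(), protocol=3))
print(old - new)  # constant per-pickle saving: 5 bytes at protocol 3
```

The two class names have the same length, so the size difference is exactly the list/tuple boilerplate being discussed here.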
msg108940 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-06-29 19:49
This seems a bit like premature (space) optimization to me.  Is it really worth special-casing this?
msg108945 - Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-06-29 20:10
It is both a space and a time optimization.

Fred was a proponent of small pickle sizes in the past, so I am adding him to the nosy list.

I am not too keen on seeing this happen, though.  It just seemed more natural to me not to create an empty list only to unpickle an empty set.

By the same logic, allowing set() as an alternative to set([]) in Python code is "premature optimization".

The patched code is a bit cleaner than the original, without gotos and unnecessary XDECREFs.  (I am not sure what the latest word on redundant initializations is, so I kept the =NULLs intact.)

On the other hand, collections.deque does not optimize the empty case either.
msg108951 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-06-29 21:01
Will think about this one for a while.  Initial reaction is that the case isn't common enough to care about, and that the 5-byte savings isn't worth the extra code path.

> By the same logic, allowing set() as an alternative 
> to set([]) in python code is "premature optimization".

That wasn't an optimization -- it was needed for API
consistency with other types (e.g. list(), dict(), etc.).
The zero-argument form supports a use case for
polymorphic creation of an empty container regardless
of type.
msg108958 - Author: Fred Drake (fdrake) (Python committer) Date: 2010-06-30 00:54
> Fred was a proponent of small pickle sizes in the past,
> so I am adding him to the nosy list.

Thanks!

I am a proponent of small pickle sizes, in cases where it
matters.  I'm not convinced this is such a case.

The case for a small pickle size for datetime objects is
that they're very common in pickle-based persistent object
stores, such as ZODB.

Set objects are much less common in this environment; this
extends to most of Python's mutable data structures.  There
are three cases worth noting:

1. A separate persistent object is used to provide an API
   around the core Python data type, with an identical or
   similar API.  This is done to take care of noting
   changes, so the need to update the stored object is
   tracked.  In these cases, minor changes in the pickled
   size of the core object are swamped by the reference to
   the wrapper class.

2. An immutable object is used instead, and updates simply
   replace the value.  This allows the referring object to
   be responsible for change tracking, but means that types
   like the set, list, and dictionary aren't frequently
   used (though they may be used by other objects as part
   of their pickled representations).

3. Application objects may use mutable objects as part of
   their stored data value, but keep careful track of
   updates themselves.  This is most often the case when
   we want to avoid multiple round-trips to the storage
   server, and the size of the pickle isn't particularly
   interesting.

This isn't an argument that the change shouldn't be made,
but experiences with ZODB-like storages don't necessarily
suggest that all pickle sizes need be minimized.
msg108968 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-06-30 09:51
A further version of the pickle protocol could have a dedicated opcode for sets instead...
msg108996 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-06-30 16:48
> A further version of the pickle protocol could 
> have a dedicated opcode for sets instead...

-1 We don't have to introduce a new (and backwards incompatible) opcode for every possible container type.  The space savings is minuscule (because you still need to list out the items).  All that is saved is the enclosing list or tuple setup (five bytes in the case of sets).

People concerned about pickle size would be much better off investing time into a more generic solution (such as adding an opcode to automatically run zip/gzip/bzip/xz etc.).
msg108997 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-06-30 17:02
> > A further version of the pickle protocol could
> > have a dedicated opcode for sets instead...
>
> -1 We don't have to introduce a new (and backwards incompatible)
> opcode for every possible container type.

What I was implying is that someone interested in more efficient pickles could start the work of a new protocol version which would *include* such an improvement.

> All that is saved is the enclosing list or tuple setup (five bytes
> in the case of sets).

No, you would save the (string) reference to "builtins.set" and the whole constructor invocation boilerplate.

>>> len(pickle.dumps([1,2,3]))
14
>>> len(pickle.dumps({1,2,3}))
36

See? The difference is much more than 5 bytes.
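Those numbers can be reproduced by pinning the protocol (a sketch; note that pickle protocol 4 later added dedicated set opcodes, so the default protocol on recent Pythons no longer shows the same gap):

```python
import pickle

# Protocol 3 matches the era of this discussion; the set pickle carries
# the "builtins set" reference and REDUCE boilerplate on top of the items.
for obj in ([1, 2, 3], {1, 2, 3}):
    print(type(obj).__name__, len(pickle.dumps(obj, protocol=3)))
```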

What's more, you would avoid building a temporary list only to convert it to a set at the end.

> People concerned about pickle size would be much better off investing
> time into a more generic solution (such as adding an opcode to
> automatically run zip/gzip/bzip/xz etc).

That's at the expense of (un)pickling speed, though, while the above suggestion would improve performance in addition to reducing pickle sizes.
msg108999 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-06-30 17:45
>> People concerned about pickle size would be much better off investing
>> time into a more generic solution (such as adding an opcode to
>> automatically run zip/gzip/bzip/xz etc).

> That's at the expense of (un)pickling speed, ...

Wanted to reply here so that the suggestion doesn't get lost.

When talking about pickle speed performance, we need to take into account that most use cases for pickle are I/O bound.  IOW, the time to pickle and unpickle is dominated by the time needed to read/write the pickle to/from disk or to send it across the wire.  In those common use cases, any time spent compressing/decompressing is more than made up for by savings in transport/storage time.  This is a fairly common speed optimization, and it would be nice if pickles supported it directly (i.e. the sender specifies a compression option and the receiver automatically detects the option from an opcode in the pickle).
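That scheme can be prototyped in userspace without any new opcode: compress after pickling, and have the loader sniff the gzip magic bytes to decide whether to decompress first (a sketch; dumps_compressed and loads_auto are illustrative names):

```python
import gzip
import pickle

def dumps_compressed(obj, level=9):
    # Pickle first, then gzip; the gzip container marks itself with
    # the magic bytes b"\x1f\x8b".
    return gzip.compress(pickle.dumps(obj), compresslevel=level)

def loads_auto(blob):
    # Auto-detect: gzip streams start with b"\x1f\x8b", while recent
    # pickle protocols start with the PROTO opcode b"\x80".
    if blob[:2] == b"\x1f\x8b":
        blob = gzip.decompress(blob)
    return pickle.loads(blob)

data = list(range(1000))
raw = pickle.dumps(data)
packed = dumps_compressed(data)
print(len(raw), len(packed))  # the compressed form is smaller for this payload
```

loads_auto accepts both plain and compressed blobs, which is the receiver-side auto-detection the message describes.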
msg109001 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-06-30 17:54
> When talking about pickle speed performance, we need to take into
> account that most use cases for pickle are i/o bound.  IOW, the time
> to pickle and unpickle is dominated by the time needed to read/write
> the pickle to/from disk or to send it across the wire.  In those
> common use cases, any time spent compressing/decompressing is more
> than made for by savings in transport/storage time.

This is an oversimplification. Transport and storage performance is
mostly orthogonal to CPU performance; depending on the situation, you
might care about the former, or the latter, or both. Adding a
compression layer obviously taxes the CPU more, while optimizing
bytecode decreases CPU consumption as I've pointed out.

(for example, it is quite unlikely, IMO, that ZODB setups are I/O bound
rather than CPU bound, given that it's a database system written in
Python; and even optimized databases such as PostgreSQL can exhibit
CPU-bound behaviour, depending on the hardware configuration, the
database size and the queries issued)

> This is a fairly common speed optimization and it would be nice if
> pickles supported it directly (i.e. the sender specifies a compression
> option and the receiver automatically detects the option from an
> opcode in the pickle).

I'm not sure what would be the point. It is trivially easy for people to
add that compression layer to their own protocols if they want to. Also,
the C code for pickle doesn't have to be complicated by interfacing it
with various third-party compression libraries.

(a doc addition would be more reasonable)
msg109004 - Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-06-30 18:57
> We don't have to introduce a new (and backwards incompatible)
> opcode for every possible container type.

I would draw the line at containers that have literal syntax (and necessarily have a dedicated bytecode opcode).

This raises a question, however: why not use regular Python bytecode in pickles?

I understand that originally pickle bytecode was supposed to restrict what functionality is available to the unpickler, but this is a moot point these days.

The pickle documentation has a big red security warning (and some want to make it even bigger and redder; see issue9105).

Indeed, I've just checked that the following works:

>>> from pickle import *
>>> class evil:
...    def __reduce__(self):
...        return (exec, ("print('pwned!')",)) 
... 
>>> s = dumps(evil())
>>> loads(s)
pwned!


The advantages of using standard bytecode are numerous:

1. Python bytecode is much better understood than pickle bytecode.  Since it is exercised by every Python program, bugs related to bytecode execution tend to be found and fixed quickly, while pickle bytecode is considered quirky by many.

2. Much more effort has been spent on optimizing various aspects of Python bytecode than pickle bytecode.

3. Using Python bytecode would remove numerous arbitrary restrictions that the current pickle protocol has, such as not being able to pickle non-module-level objects or anonymous functions.

The only downside I can think of is that doing this would place an extra constraint on bytecode evolution, because pickles written by the current version of Python should remain loadable in all future versions.
msg109031 - Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2010-07-01 05:31
> This begs a question, however: why not use regular python bytecode in pickles?

Unlike pickle protocols, the bytecode is not required to be compatible across Python versions.  Furthermore, Python bytecode is not designed as a general-purpose object serialization mechanism.

For the record, I am with Raymond about adding new opcodes for sets; pickle is complicated enough as it is.  I am fine with the posted patch, however.
msg110401 - Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-07-15 23:12
Chances are we will need to add protocol support for sets.  See issue9269.
msg120096 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-10-31 22:51
Have looked at this again and think I don't care what happens to it.  The gain is minimal, but it doesn't take much extra code.  Like Fred, I'm not sure it's worth having another code path to test and maintain just to save 5 bytes.

Unassigning.  If anyone except the OP feels strongly either way, either close it or apply it.
msg160508 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-05-13 08:02
Ok, so I'm closing it.
History
Date                 User                  Action  Args
2022-04-11 14:57:03  admin                 set     github: 53366
2012-05-13 08:02:21  loewis                set     status: open -> closed; nosy: + loewis; messages: + msg160508; resolution: wont fix
2010-10-31 22:51:28  rhettinger            set     assignee: rhettinger -> ; type: enhancement -> performance; messages: + msg120096
2010-07-15 23:12:20  belopolsky            set     messages: + msg110401
2010-07-01 05:31:30  alexandre.vassalotti  set     messages: + msg109031
2010-06-30 18:57:40  belopolsky            set     messages: + msg109004
2010-06-30 17:54:43  pitrou                set     messages: + msg109001
2010-06-30 17:45:00  rhettinger            set     messages: + msg108999
2010-06-30 17:02:12  pitrou                set     messages: + msg108997
2010-06-30 16:48:14  rhettinger            set     messages: + msg108996
2010-06-30 09:51:35  pitrou                set     messages: + msg108968
2010-06-30 00:54:08  fdrake                set     messages: + msg108958
2010-06-29 21:01:51  rhettinger            set     assignee: rhettinger; messages: + msg108951; nosy: + rhettinger
2010-06-29 20:10:59  belopolsky            set     priority: normal -> low; nosy: + fdrake; messages: + msg108945
2010-06-29 19:49:04  mark.dickinson        set     nosy: + mark.dickinson; messages: + msg108940
2010-06-29 19:28:38  belopolsky            create