Issue 9120
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.
Created on 2010-06-29 19:28 by belopolsky, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Files

File name | Uploaded | Description | Edit
---|---|---|---
empty-set-pickle.diff | belopolsky, 2010-06-29 19:28 | | review
Messages (15)
msg108937 - (view) | Author: Alexander Belopolsky (belopolsky) * | Date: 2010-06-29 19:28 | |
Empty sets are pickled as set([]). The pickle contains the serialization of an empty list that is passed to the set constructor during unpickling:

    >>> dis(dumps(set()))
        0: \x80 PROTO      3
        2: c    GLOBAL     'builtins set'
       16: q    BINPUT     0
       18: ]    EMPTY_LIST
       19: q    BINPUT     1
       21: \x85 TUPLE1
       22: q    BINPUT     2
       24: R    REDUCE
       25: q    BINPUT     3
       27: .    STOP

The proposed patch pickles an empty set as set() instead:

    >>> dis(dumps(set()))
        0: \x80 PROTO       3
        2: c    GLOBAL      'builtins set'
       16: q    BINPUT      0
       18: )    EMPTY_TUPLE
       19: R    REDUCE
       20: q    BINPUT      1
       22: .    STOP
|||
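The difference described above can be reproduced directly with `pickletools` from the standard library. This sketch (not part of the original patch) compares the empty-set pickle under protocol 2, which goes through the set([]) reduce path, with protocol 4, which later gained a dedicated EMPTY_SET opcode (see msg110401 and issue9269):

```python
import pickle
import pickletools

# Protocol 2 pickles an empty set via its __reduce__: the set type
# applied to an empty list, so EMPTY_LIST (']') appears in the stream.
p2 = pickle.dumps(set(), protocol=2)

# Protocol 4 (added later by PEP 3154) has a dedicated EMPTY_SET opcode,
# avoiding the temporary list entirely.
p4 = pickle.dumps(set(), protocol=4)

pickletools.dis(p2)  # GLOBAL 'builtins set' ... EMPTY_LIST ... REDUCE
pickletools.dis(p4)  # EMPTY_SET

print(len(p2), len(p4))  # the protocol-4 pickle is several bytes shorter
assert pickle.loads(p2) == set() == pickle.loads(p4)
```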
msg108940 - (view) | Author: Mark Dickinson (mark.dickinson) * | Date: 2010-06-29 19:49 | |
This seems a bit like premature (space) optimization to me. Is it really worth special-casing this? |
|||
msg108945 - (view) | Author: Alexander Belopolsky (belopolsky) * | Date: 2010-06-29 20:10 | |
It is both a space and a time optimization. Fred was a proponent of small pickle sizes in the past, so I am adding him to the nosy list. I am not too keen on this happening, though. It just seemed more natural to me not to create an empty list in order to unpickle an empty set.

By the same logic, allowing set() as an alternative to set([]) in python code is "premature optimization". The patched code is a bit cleaner than the original, without goto's and unnecessary XDECREFs. (I am not sure what the latest word on redundant initializations is, so I kept the =NULLs intact.)

On the other hand, collections.deque does not optimize the empty case either.
|||
msg108951 - (view) | Author: Raymond Hettinger (rhettinger) * | Date: 2010-06-29 21:01 | |
Will think about this one for a while. Initial reaction is that the case isn't common enough to care about, and that the 5-byte savings isn't worth the extra code path.

> By the same logic, allowing set() as an alternative
> to set([]) in python code is "premature optimization".

That wasn't an optimization -- it was needed for API consistency with other types (i.e. list(), dict(), etc.). The zero-argument form supports a use case for polymorphic creation of an empty container regardless of type.
|||
msg108958 - (view) | Author: Fred Drake (fdrake) | Date: 2010-06-30 00:54 | |
> Fred was a proponent of small pickle sizes in the past,
> so I am adding him to the nosy list.

Thanks! I am a proponent of small pickle sizes, in cases where it matters. I'm not convinced this is such a case.

The case for a small pickle size for datetime objects is that they're very common in pickle-based persistent object stores, such as ZODB. Set objects are much less common in this environment; this extends to most of Python's mutable data structures.

There are three cases worth noting:

1. A separate persistent object is used to provide an API around the core Python data type, with an identical or similar API. This is done to take care of noting changes, so the need to update the stored object is tracked. In these cases, minor changes in the pickled size of the core object are swamped by the reference to the wrapper class.

2. An immutable object is used instead, and updates simply replace the value. This allows the referring object to be responsible for change tracking, but means that types like the set, list, and dictionary aren't frequently used (though they may be used by other objects as part of their pickled representations).

3. Application objects may use mutable objects as part of their stored data value, but keep careful track of updates themselves. This is most often the case when we want to avoid multiple round-trips to the storage server, and the size of the pickle isn't particularly interesting.

This isn't an argument that the change shouldn't be made, but experience with ZODB-like storages doesn't necessarily suggest that all pickle sizes need be minimized.
|||
msg108968 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2010-06-30 09:51 | |
A further version of the pickle protocol could have a dedicated opcode for sets instead... |
|||
msg108996 - (view) | Author: Raymond Hettinger (rhettinger) * | Date: 2010-06-30 16:48 | |
> A further version of the pickle protocol could
> have a dedicated opcode for sets instead...

-1. We don't have to introduce a new (and backwards incompatible) opcode for every possible container type. The space savings are minuscule (because you still need to list out the items). All that is saved is the enclosing list or tuple setup (five bytes in the case of sets).

People concerned about pickle size would be much better off investing time into a more generic solution (such as adding a code to automatically run zip/gzip/bzip/xz etc.).
|||
msg108997 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2010-06-30 17:02 | |
> > A further version of the pickle protocol could
> > have a dedicated opcode for sets instead...
>
> -1 We don't have to introduce a new (and backwards incompatible)
> opcode for every possible container type.

What I was implying is that someone interested in more efficient pickles could start the work on a new protocol version which would *include* such an improvement.

> All that is saved is the enclosing list or tuple setup (five bytes
> in the case of sets).

No, you would save the (string) reference to "builtins.set" and the whole constructor invocation boilerplate.

    >>> len(pickle.dumps([1,2,3]))
    14
    >>> len(pickle.dumps({1,2,3}))
    36

See? The difference is much more than 5 bytes. What's more, you would avoid building a temporary list only to convert it to a set at the end.

> People concerned about pickle size would be much better off investing
> time into a more generic solution (such as adding a code to
> automatically run zip/gzip/bzip/xz etc).

That's at the expense of (un)pickling speed, though, while the above suggestion would improve performance in addition to reducing pickle sizes.
|||
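The size gap Antoine measures is easy to check: the reduce-based scheme charges each set for the GLOBAL reference to 'builtins set' plus the REDUCE boilerplate, which a list never pays. A quick sketch (exact byte counts vary with the Python version, so none are hard-coded here); note that protocol 4 later narrowed the gap by giving sets their own opcodes:

```python
import pickle

# Compare pickled sizes of a list and a set with identical contents.
for obj in ([1, 2, 3], {1, 2, 3}):
    for proto in (2, 4):
        size = len(pickle.dumps(obj, protocol=proto))
        print(f"{type(obj).__name__:5} proto {proto}: {size} bytes")
```

Under protocol 2 the set pickle is noticeably larger than the list pickle, for the reasons given above.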
msg108999 - (view) | Author: Raymond Hettinger (rhettinger) * | Date: 2010-06-30 17:45 | |
>> People concerned about pickle size would be much better off investing
>> time into a more generic solution (such as adding a code to
>> automatically run zip/gzip/bzip/xz etc).

> That's at the expense of (un)pickling speed, ...

Wanted to reply here so that the suggestion doesn't get lost.

When talking about pickle speed performance, we need to take into account that most use cases for pickle are i/o bound. IOW, the time to pickle and unpickle is dominated by the time needed to read/write the pickle to/from disk or to send it across the wire. In those common use cases, any time spent compressing/decompressing is more than made up for by savings in transport/storage time.

This is a fairly common speed optimization and it would be nice if pickles supported it directly (i.e. the sender specifies a compression option and the receiver automatically detects the option from an opcode in the pickle).
|||
msg109001 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2010-06-30 17:54 | |
> When talking about pickle speed performance, we need to take into
> account that most use cases for pickle are i/o bound. IOW, the time
> to pickle and unpickle is dominated by the time needed to read/write
> the pickle to/from disk or to send it across the wire. In those
> common use cases, any time spent compressing/decompressing is more
> than made up for by savings in transport/storage time.

This is an oversimplification. Transport and storage performance is mostly orthogonal to CPU performance; depending on the situation, you might care about the former, or the latter, or both. Adding a compression layer obviously taxes the CPU more, while optimizing bytecode decreases CPU consumption, as I've pointed out.

(For example, it is quite unlikely, IMO, that ZODB setups are I/O bound rather than CPU bound, given that it's a database system written in Python; and even optimized databases such as PostgreSQL can exhibit CPU-bound behaviour, depending on the hardware configuration, the database size and the queries issued.)

> This is a fairly common speed optimization and it would be nice if
> pickles supported it directly (i.e. the sender specifies a compression
> option and the receiver automatically detects the option from an
> opcode in the pickle).

I'm not sure what the point would be. It is trivially easy for people to add that compression layer to their own protocols if they want to. Also, the C code for pickle doesn't have to be complicated by interfacing it with various third-party compression libraries. (A doc addition would be more reasonable.)
|||
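The compression layer being debated really is a few lines on top of the existing APIs. A minimal sketch using `zlib` from the standard library; the helper names `dumps_compressed`/`loads_compressed` are invented for illustration, not part of any proposed API:

```python
import pickle
import zlib

def dumps_compressed(obj, level=6):
    """Pickle obj, then compress the resulting byte stream."""
    return zlib.compress(pickle.dumps(obj), level)

def loads_compressed(data):
    """Decompress, then unpickle."""
    return pickle.loads(zlib.decompress(data))

payload = {"numbers": list(range(1000))}
blob = dumps_compressed(payload)
assert loads_compressed(blob) == payload

# Repetitive data compresses well, at the cost of extra CPU time --
# exactly the I/O-bound vs CPU-bound trade-off debated above.
print(len(pickle.dumps(payload)), "->", len(blob))
```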
msg109004 - (view) | Author: Alexander Belopolsky (belopolsky) * | Date: 2010-06-30 18:57 | |
> We don't have to introduce a new (and backwards incompatible)
> opcode for every possible container type.

I would draw the line at containers that have literal syntax (and necessarily have a dedicated .pyc opcode).

This begs a question, however: why not use regular python bytecode in pickles? I understand that originally pickle bytecode was supposed to restrict what functionality is available to the unpickler, but this is a moot point these days. The pickle documentation has a big red security warning (and some want to make it even bigger and redder, issue9105). Indeed, I've just checked that the following works:

    >>> from pickle import *
    >>> class evil:
    ...     def __reduce__(self):
    ...         return (exec, ("print('pwned!')",))
    ...
    >>> s = dumps(evil())
    >>> loads(s)
    pwned!

The advantages of using standard bytecode are numerous:

1. Python bytecode is much better understood than pickle bytecode. Since it is exercised by every python program, bugs related to bytecode execution tend to be found and fixed quickly, while pickle bytecode is considered to be quirky by many.

2. Much more effort has been spent on optimizing various aspects of python bytecode than pickle bytecode.

3. Using python bytecode will remove numerous arbitrary restrictions that the current pickle protocol has, such as not being able to pickle non module-level objects or anonymous functions.

The only downside I can think of is that doing this would place an extra constraint on bytecode evolution, because pickles written by the current version of python should remain loadable in all future versions.
|||
msg109031 - (view) | Author: Alexandre Vassalotti (alexandre.vassalotti) * | Date: 2010-07-01 05:31 | |
> This begs a question, however: why not use regular python bytecode in pickles?

Unlike pickle protocols, the bytecode is not required to be compatible across Python versions. Furthermore, Python bytecode was not designed as a general-purpose object serialization mechanism.

For the record, I am with Raymond about adding new opcodes for sets; pickle is complicated enough as it is. I am fine with the posted patch, however.
|||
msg110401 - (view) | Author: Alexander Belopolsky (belopolsky) * | Date: 2010-07-15 23:12 | |
Chances are we will need to add protocol support for sets. See issue9269. |
|||
msg120096 - (view) | Author: Raymond Hettinger (rhettinger) * | Date: 2010-10-31 22:51 | |
Have looked at this again and think I don't care what happens to it. The gain is minimal, but it doesn't take much extra code either. Like Fred, I'm not sure another code path to test and maintain is worth it just to save 5 bytes.

Unassigning. If anyone except the OP feels strongly either way, either close it or apply it.
|||
msg160508 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2012-05-13 08:02 | |
Ok, so I'm closing it. |
History

Date | User | Action | Args
---|---|---|---
2022-04-11 14:57:03 | admin | set | github: 53366 |
2012-05-13 08:02:21 | loewis | set | status: open -> closed nosy: + loewis messages: + msg160508 resolution: wont fix |
2010-10-31 22:51:28 | rhettinger | set | assignee: rhettinger -> type: enhancement -> performance messages: + msg120096 |
2010-07-15 23:12:20 | belopolsky | set | messages: + msg110401 |
2010-07-01 05:31:30 | alexandre.vassalotti | set | messages: + msg109031 |
2010-06-30 18:57:40 | belopolsky | set | messages: + msg109004 |
2010-06-30 17:54:43 | pitrou | set | messages: + msg109001 |
2010-06-30 17:45:00 | rhettinger | set | messages: + msg108999 |
2010-06-30 17:02:12 | pitrou | set | messages: + msg108997 |
2010-06-30 16:48:14 | rhettinger | set | messages: + msg108996 |
2010-06-30 09:51:35 | pitrou | set | messages: + msg108968 |
2010-06-30 00:54:08 | fdrake | set | messages: + msg108958 |
2010-06-29 21:01:51 | rhettinger | set | assignee: rhettinger messages: + msg108951 nosy: + rhettinger |
2010-06-29 20:10:59 | belopolsky | set | priority: normal -> low nosy: + fdrake messages: + msg108945 |
2010-06-29 19:49:04 | mark.dickinson | set | nosy: + mark.dickinson messages: + msg108940 |
2010-06-29 19:28:38 | belopolsky | create |