classification
Title: improved efficiency of bytearray pickling by using bytes type instead of str
Type: behavior Stage: resolved
Components: Interpreter Core Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: alexandre.vassalotti, benjamin.peterson, irmen, pitrou, python-dev, rhettinger, serhiy.storchaka
Priority: normal Keywords: easy, needs review, patch

Created on 2011-11-30 00:09 by irmen, last changed 2013-05-07 01:25 by rhettinger. This issue is now closed.

Files
File name Uploaded Description Edit
bytearray3x.patch irmen, 2011-11-30 00:09 patch for bytearray object, python 3.x review
bytearray27.patch irmen, 2011-11-30 00:09 patch for bytearray object, python 2.7 branch review
bytearray3x_reduceex.patch irmen, 2011-12-04 19:14 new patch for bytearray object, python 3.x, with reduce_ex review
Messages (15)
msg148627 - (view) Author: Irmen de Jong (irmen) Date: 2011-11-30 00:09
Pickling of bytearrays is quite inefficient right now, because bytearray's __reduce__ encodes the bytes of the bytearray into a str. A pickle made from a bytearray is quite a bit larger than necessary because of this, and it also takes a lot more processing to create it and to convert it back into the actual bytearray when unpickling (because it uses bytearray's str based initializer with encoding).

I've attached a patch (both for the default 3.x branch and the 2.7 branch) that changes this to use the bytes type instead. A pickle made from a bytearray with this patch applied now utilizes the BINBYTES/BINSTRING pickle opcodes which are a lot more efficient than BINUNICODE that is used now. The reconstruction of the bytearray now uses bytearray's initializer that takes a bytes object.

I don't think additional unit tests are needed because test_bytes already performs pickle tests on bytearrays.

A bit more info can be found in my recent post on comp.lang.python about this, see http://mail.python.org/pipermail/python-list/2011-November/1283668.html
msg148628 - (view) Author: Irmen de Jong (irmen) Date: 2011-11-30 00:15
btw I'm aware of PEP-3154 but I don't think this particular patch requires a pickle protocol bump.
msg148633 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-30 02:07
Since it's a performance improvement, it's Python 3.3-only.

The patch looks ok but there's an unrelated issue. If I first pickle a bytearray under 3.3 using protocol 2:

>>> b = bytearray(b'xyz')
>>> pickle.dumps(b, protocol=2)
b'\x80\x02c__builtin__\nbytearray\nq\x00c__builtin__\nbytes\nq\x01]q\x02(KxKyKze\x85q\x03Rq\x04\x85q\x05Rq\x06.'

and then unpickle it under 2.7, I get:

>>> pickle.loads(b'\x80\x02c__builtin__\nbytearray\nq\x00c__builtin__\nbytes\nq\x01]q\x02(KxKyKze\x85q\x03Rq\x04\x85q\x05Rq\x06.')
bytearray(b'[120, 121, 122]')

... which is wrong. It seems pickling bytes objects with protocol 2 and unpickling them with Python 2.x is broken.
msg148634 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-30 02:14
The irony is that with protocol < 3, bytes should piggy-back on bytearray and not the reverse (since the bytearray constructor has the same semantics under 2.x and 3.x).

It also means that the latin1 encoding solution should probably be kept for protocol < 3. Using __reduce_ex__ should allow for such combination, AFAIK. I think a separate issue should be opened for bytes.
msg148854 - (view) Author: Irmen de Jong (irmen) Date: 2011-12-04 19:14
Added new patch that only does the new reduction when protocol is 3 or higher.
msg148885 - (view) Author: Roundup Robot (python-dev) Date: 2011-12-05 19:45
New changeset e2959a6a1440 by Antoine Pitrou in branch 'default':
Issue #13503: Use a more efficient reduction format for bytearrays with
http://hg.python.org/cpython/rev/e2959a6a1440
msg148886 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-12-05 19:46
The patch wasn't entirely PEP 7-compliant (spaces around operators, space after if), but I've fixed that when committing. Thank you!
msg148902 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2011-12-06 02:17
Antoine, do we have unit tests for this code path?
msg149411 - (view) Author: Irmen de Jong (irmen) Date: 2011-12-13 22:04
Alexandre: the existing test_bytes already performs byte array pickle tests.
msg188480 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-05-06 01:15
Reopening this one because there is a size issue, not just speed.

My clients are bumping into this issue repeatedly.  There is a reasonable expectation that pickling a bytearray will result in a pickle about the same size as the bytearray (not a 50% to 100% expansion depending on the content).  Likewise, the size shouldn't double when switching from protocol 0 to the presumably more efficient protocol 2:

    >>> # Example using Python 2.7.4 on Mac OS X 10.8
    >>> from pickle import dumps
    >>> print len(dumps(bytearray([200] * 10000), 0))
    10055
    >>> print len(dumps(bytearray([200] * 10000), 2))
    20052
    >>> print len(dumps(bytearray([100] * 10000), 2))
    10052
    >>> print len(dumps(bytearray([100, 200] * 5000), 2))
    15052

An attractive feature of bytearrays are their compact representation of data.  An attractive feature of the binary pickle protocol is improved compactness and speed.  Currently, it isn't living up to expectations.
msg188491 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-05-06 06:25
Raymond, we can't just backport this without breaking compatibility with installed Python versions.
msg188494 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2013-05-06 07:26
Our hands are pretty much tied here. The pickle bytearray as unicode hack is likely the best we can do without pickling compatibility between Python 2 and 3. I can't think of a solution that could work here. For example.

1. Pickling bytearrays as a Python 2 str doesn't work because Python 2 strs are unpickled as unicode in Python 3.
2. Pickling bytearrays as an int lists makes the growth factor is much worst: 2x instead of the expected 1.5x.
3. Using a custom constructor breaks pickling compatibility with all the minor releases which doesn't implement the custom constructor.
4. Implementing special support in pickle for bytearrays requires a pickle protocol bump, which is disallowed for bugfixes releases.
5. Creating a special tag type to pickle Python 2 str as bytes in Python 3 has the same problem as #3.
msg188590 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-06 20:39
How about base64?

                self.save_reduce(base64.b64decode,
                                 (str(base64.b64encode(obj), 'ascii'),), obj=obj)

>>> len(dumps(bytes([200] * 10000), 2))
13372
msg188591 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-05-06 20:45
Serhiy Storchaka added the comment:
> 
> How about base64?
> 
>                 self.save_reduce(base64.b64decode,
>                                  (str(base64.b64encode(obj), 'ascii'),), obj=obj)
> 
> >>> len(dumps(bytes([200] * 10000), 2))
> 13372

It's statistically better (a fixed 1.33 expansion instead of 1.5 average), but ASCII bytearrays will pickle less efficiently than currently.

There's something else: we had enough regressions in the latest 2.7.x release, and we shouldn't risk adding new ones in the next release. Really, let's close this issue.
msg188615 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-05-07 01:25
It looks like this is an unfortunate dead-end.
History
Date User Action Args
2013-05-07 01:25:17rhettingersetstatus: open -> closed

messages: + msg188615
2013-05-06 20:45:17pitrousetnosy: + benjamin.peterson
messages: + msg188591
2013-05-06 20:39:08serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg188590
2013-05-06 07:26:15alexandre.vassalottisetmessages: + msg188494
2013-05-06 06:25:26pitrousetmessages: + msg188491
2013-05-06 01:15:22rhettingersetstatus: closed -> open

type: performance -> behavior
assignee: rhettinger
versions: + Python 2.7, - Python 3.3
nosy: + rhettinger

messages: + msg188480
2011-12-13 22:04:57irmensetmessages: + msg149411
2011-12-06 02:17:20alexandre.vassalottisetmessages: + msg148902
2011-12-05 19:46:46pitrousetstatus: open -> closed
resolution: fixed
messages: + msg148886

stage: patch review -> resolved
2011-12-05 19:45:49python-devsetnosy: + python-dev
messages: + msg148885
2011-12-04 19:14:26irmensetfiles: + bytearray3x_reduceex.patch

messages: + msg148854
2011-11-30 02:14:16pitrousetmessages: + msg148634
2011-11-30 02:07:08pitrousetmessages: + msg148633
versions: - Python 3.1, Python 2.7, Python 3.2, Python 3.4
2011-11-30 00:15:42irmensetmessages: + msg148628
2011-11-30 00:09:36irmensetfiles: + bytearray27.patch
2011-11-30 00:09:16irmencreate