classification
Title: byte/unicode pickle incompatibilities between python2 and python3
Type: behavior
Stage: resolved
Components: Documentation, Library (Lib)
Versions: Python 3.4
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: alexandre.vassalotti
Nosy List: Ronny.Pfannschmidt, RonnyPfannschmidt, alexandre.vassalotti, flox, georg.brandl, ggenellina, gvanrossum, jcea, jdharper, kmike, loewis, pitrou, python-dev, serhiy.storchaka, valhallasw
Priority: high
Keywords: patch

Created on 2009-08-26 11:56 by RonnyPfannschmidt, last changed 2013-12-07 09:12 by alexandre.vassalotti. This issue is now closed.

Files
File name Uploaded
bytestrpickle.diff valhallasw, 2013-12-06 20:47
bytestrpickle.diff valhallasw, 2013-12-06 23:22
pickle_python2_str_as_bytes.diff alexandre.vassalotti, 2013-12-07 02:19
Messages (32)
msg91966 - Author: (RonnyPfannschmidt) Date: 2009-08-26 11:56
I just noticed that there are some slight differences in the
bytestring/unicode-string pickles between python2 and python3 when using
protocols 0, 1 and 2.

The first things I noticed are:

A str from python2 is unpickled as unicode in python3
(this fails for byte strings that don't fit what's expected for unicode).

A bytes instance from python3 is pickled as a custom class in protocols < 3.

I'll write a script to try all combinations of protocols, string
variations and transfer directions.
msg91967 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-08-26 12:19
Why are you reporting this here? If you think there is a bug, can you
propose an alternative behavior that you would consider correct?

The changes you mentioned are all deliberate.
msg91970 - Author: (RonnyPfannschmidt) Date: 2009-08-26 12:42
The basic behavior I want to see for all protocols <= 2:

1. python 2 string maps to python3 byte-string
2. python 2 unicode maps to python3 string
3. python 3 string maps to python 2 unicode
4. python 3 bytestring maps to python 2 string

Anything else is confusing and may break.
For example, one can't unpickle '\xFF' in python3 if it was pickled in
python2.

Note that these changes seem irrelevant for protocol 3, as python2.x
doesn't support it.
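
The '\xFF' failure can be reproduced on Python 3 without a Python 2 interpreter. This is a minimal sketch: the byte string is a hand-written stand-in for what Python 2's pickle.dumps('\xff', protocol=2) produces (an assumption, not output captured for this report), and try_default_load is a helper name invented for the illustration.

```python
import pickle

# Hand-written stand-in for Python 2's pickle.dumps('\xff', protocol=2):
# PROTO 2, SHORT_BINSTRING of length 1, BINPUT 0, STOP.
PY2_PICKLE = b'\x80\x02U\x01\xffq\x00.'


def try_default_load(data):
    """Return the unpickled object, or the decoding error that was raised."""
    try:
        # Default arguments: encoding='ASCII', errors='strict'.
        return pickle.loads(data)
    except UnicodeDecodeError as exc:
        return exc


result = try_default_load(PY2_PICKLE)
print(type(result).__name__)
```

With the default arguments the Python 2 str payload is decoded as ASCII, so a single 0xFF byte is enough to make the load fail.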
msg91978 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-08-26 18:01
> the basic behavior i want to see for all protocols <= 2
> 
> 1. python 2 string maps to python3 byte-string

That would not be good. Many people create pickles in 2.x where the
string type really represents characters, more often so than they want
it to represent bytes. Giving them bytes on unpickling will likely
cause more problems than the current approach.

> 2. python 2 unicode maps to python3 string

That's the case, right?

> 3. python 3 string map to python 2 unicode 

That's also the case, AFAICT.

> 4. python 3 bytestring maps to python 2 string

Hmm. This may be indeed a mistake. Until r61467, bytes were saved
with the (BIN)STRING code; not sure why this was changed.
msg91980 - Author: (RonnyPfannschmidt) Date: 2009-08-26 18:18
Since it breaks for anything non-ASCII, it's not that helpful after all,
and since python2 strings are encoding-unaware there is no way to fix
it.

It might be preferable to supply unpicklers that are capable of
coercing if the user really wants coercing.

yup
> 
> > 3. python 3 string map to python 2 unicode 
> 
> That's also the case, AFAICT.
yup
> 
> > 4. python 3 bytestring maps to python 2 string
> 
> Hmm. This may be indeed a mistake. Until r61467, bytes were saved
> with the (BIN)STRING code; not sure why this was changed.
Python 3 is indeed evil there.

b'\x80\x02c__builtin__\nbytes\nq\x00]q\x01\x85q\x02Rq\x03.'

I'm convinced that a 1:1 mapping of python2 string from/to python3
bytestrings is the least surprising behaviour and will keep surprising
errors away when needing to communicate between different python
versions.

It has just bitten me, and I suspect it will get others, too.
An unpickle that completely fails in the face of encodings is not desirable
at all.
msg91998 - Author: (RonnyPfannschmidt) Date: 2009-08-27 08:13
It's even worse:

python3:
>>> import pickle
>>> pickle.dumps(b'', protocol=2)
b'\x80\x02c__builtin__\nbytes\nq\x00]q\x01\x85q\x02Rq\x03.'

python2.6:
>>> import pickle
>>> pickle.loads('\x80\x02c__builtin__\nbytes\nq\x00]q\x01\x85q\x02Rq\x03.')
'[]'
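
The exact bytes Python 3 emits for a bytes object in protocol 2 have changed between releases, so comparing opcode names is more robust than comparing raw pickles. The sketch below uses pickletools.genops for that; opcode_names is a helper name invented here, and the printed lists will depend on the interpreter version.

```python
import pickle
import pickletools


def opcode_names(data):
    """List the opcode names that make up a pickle byte string."""
    return [op.name for op, arg, pos in pickletools.genops(data)]


# Protocol 2 has no bytes opcode, so bytes go through GLOBAL + REDUCE
# (an object reconstruction). Protocol 3 has dedicated bytes opcodes.
proto2 = opcode_names(pickle.dumps(b'abc', protocol=2))
proto3 = opcode_names(pickle.dumps(b'abc', protocol=3))
print(proto2)
print(proto3)
```

A GLOBAL/REDUCE pair in the protocol 2 listing is the "custom class" pickling described in this report; Python 2 has to call that reconstruction rather than read a plain str.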
msg92002 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-08-27 13:53
The problem with trying to solve the following issue:
   "a bytes instance from python3 is pickled as custom class in
protocols <3"
is that if we pickle bytes from Python 3 as a 2.x str in protocol <= 2,
unpickling it using Python 3 will yield a str (unicode), not a bytes
object. Therefore the whole chain (pickling then unpickling) will not be
idempotent.
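
The round-trip concern can be seen from the reading side alone. A minimal sketch, assuming a hand-written SHORT_BINSTRING pickle stands in for what Python 3 would write if it stored bytes using the 2.x str opcodes:

```python
import pickle

# SHORT_BINSTRING pickle of the three bytes "abc", the form Python 2 uses
# for a short str in protocol 2 (hand-written for this illustration).
AS_PY2_STR = b'\x80\x02U\x03abcq\x00.'

loaded = pickle.loads(AS_PY2_STR)
print(type(loaded).__name__)
```

With default arguments the value comes back as str, so bytes written this way would not come back as bytes on Python 3.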
msg92003 - Author: (RonnyPfannschmidt) Date: 2009-08-27 14:55
Unpickling any non-ASCII string from python2 will break.
The only way out would be to ensure text strings and a single defined
encoding (at that point, storing unicode strings in any case seems more
practical).

Also, byte strings stored as python2 str would break.

And since I pass around binary strings as parts of objects, it's just
completely broken for me.
msg92012 - Author: (RonnyPfannschmidt) Date: 2009-08-27 19:15
In case the actual behavior is not supposed to change:

How about a way to declare that one wants an exact 1:1 mapping between py2<>py3,
so str<>bytes and unicode<>str will work for sure?

Something like load/dump(..., encoding=bytes) just crossed my mind.
msg92014 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-08-27 21:08
> how about a way to declare one wants exact 1:1 mapping between py2<>py3,
> so str<>bytes and unicode<>str will work for sure

In a sense, that's already possible. Inherit from _Pickler/_Unpickler,
and replace the dispatch dict with a different mapping.

I wouldn't object to supporting this with an option, though, assuming it
was properly documented and implemented for both pickle and _pickle
(probably along with pickletools).
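
A rough sketch of the subclassing approach, for illustration only. It relies on private, undocumented internals of the pure-Python pickle module (pickle._Unpickler, its dispatch table, and the read/append attributes set up during load), which may change between releases. The class and function names are invented here, and only the two binary str opcodes are overridden; the protocol 0 STRING opcode is left alone.

```python
import io
import pickle
import struct


class Py2StrAsBytesUnpickler(pickle._Unpickler):
    """Hypothetical unpickler: Python 2 str opcodes yield bytes, undecoded."""

    # Copy the parent's table so the stock _Unpickler is left untouched.
    dispatch = dict(pickle._Unpickler.dispatch)

    def load_short_binstring(self):
        # SHORT_BINSTRING: 1-byte length, then the raw payload.
        length = self.read(1)[0]
        self.append(self.read(length))
    dispatch[pickle.SHORT_BINSTRING[0]] = load_short_binstring

    def load_binstring(self):
        # BINSTRING: 4-byte little-endian signed length, then the raw payload.
        (length,) = struct.unpack('<i', self.read(4))
        self.append(self.read(length))
    dispatch[pickle.BINSTRING[0]] = load_binstring


def loads_py2_str_as_bytes(data):
    """Unpickle from a bytes object using the subclass above."""
    return Py2StrAsBytesUnpickler(io.BytesIO(data)).load()


print(loads_py2_str_as_bytes(b'\x80\x02U\x01\xffq\x00.'))
```

Because this bypasses the C accelerator and touches private names, it is better read as a demonstration of the idea than as something to deploy.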
msg92072 - Author: Gabriel Genellina (ggenellina) Date: 2009-08-29 22:04
Note that this is also a documentation issue: "The pickle 
serialization format is guaranteed to be backwards compatible across 
Python releases."
msg92592 - Author: (RonnyPfannschmidt) Date: 2009-09-14 07:04
I'll try to add some tests now.

Hopefully I can get rid of the implicit badness, like trying to coerce
bytes to unicode in unpickle and storing bytes as a list in pickle for
protocol < 3.
msg153659 - Author: Merlijn van Deen (valhallasw) * Date: 2012-02-18 23:37
Any news on this?

Just as a note, pickletools.py also does not reflect the current behaviour; pickle types STRING, BINSTRING and SHORT_BINSTRING are all defined with stack_after=[pystring]:

[1, line 992]
    I(name='STRING',
      code='S',
      arg=stringnl,
      stack_before=[],
      stack_after=[pystring],
      proto=0,
      doc=(...)
     )

although the doc=... does describe it will be decoded, the object type of pystring is still defined as bytes:

[1, line 747]
pystring = StackObject(
               name='string',
               obtype=bytes,
               doc="A Python (8-bit) string object.")


[1] http://hg.python.org/cpython/file/98df29d51e12/Lib/pickletools.py
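
What the installed pickletools declares for these opcodes can be checked directly instead of reading the source. A small sketch; the declared types differ between Python versions, so no particular output is assumed, and by_name is a name introduced only for this example.

```python
import pickletools

# Index the opcode table by name for easy lookup.
by_name = {op.name: op for op in pickletools.opcodes}

for name in ('STRING', 'BINSTRING', 'SHORT_BINSTRING'):
    after = by_name[name].stack_after
    # Each entry is a StackObject with a name and a declared Python type.
    print(name, [(obj.name, obj.obtype) for obj in after])
```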
msg153686 - Author: Ronny Pfannschmidt (Ronny.Pfannschmidt) Date: 2012-02-19 09:32
I'm unlikely to find the time to try and fix pickle/cpickle myself in the next few months.
msg153705 - Author: Merlijn van Deen (valhallasw) * Date: 2012-02-19 15:49
Last night, I hacked together a wrapper to do what loewis suggested [1]. It pickles bytes to str (for protocol <= 2), and unpickles str to bytes.

If I (ever) get the build system and tests of python itself to work, I'll try and see if I can implement a nicer solution - at least for pickle.py.

[1] https://github.com/valhallasw/py2/blob/master/bytestrpickle.py
msg153707 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-19 15:53
> If I (ever) get the build system and tests of python itself to work,

If you have any problems with that, don't hesitate to ask on python-dev
(or see http://mail.python.org/mailman/listinfo/core-mentorship )
msg153718 - Author: Merlijn van Deen (valhallasw) * Date: 2012-02-19 19:08
OK, this is the pickle.py patch. A new parameter 'bytestr' has been added to both _Pickler and _Unpickler to toggle the pickle.string<=>bytes behaviour:

_Pickler:
IF protocol <= 2 AND bytestr=True
THEN bytes are stored as STRING/SHORT_BINSTRING/BINSTRING
ELSE (the old behaviour; obj for protocol <=2, else BINARY)

_Unpickler:
IF bytestr=True
THEN STRING/SHORT_BINSTRING/BINSTRING are read as bytes
ELSE they are read as str (old behaviour)

I also extracted the decoding stuff from the three string reading functions to a single one.
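
The reading-side rule and the shared decoding helper can be sketched as one small function. This is an illustration of the logic described above, not the code from the patch; read_py2_string and its parameters are hypothetical names.

```python
def read_py2_string(raw, bytestr=False, encoding='ASCII', errors='strict'):
    """Turn the payload of STRING/SHORT_BINSTRING/BINSTRING into an object.

    bytestr=True  -> hand back the raw bytes unchanged.
    bytestr=False -> decode to str (the old behaviour).
    """
    if bytestr:
        return raw
    return raw.decode(encoding, errors)


print(read_py2_string(b'\xff', bytestr=True))
print(read_py2_string(b'abc'))
```

Keeping the decision in one place means the three opcode readers only differ in how they find the payload length.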
msg153719 - Author: Merlijn van Deen (valhallasw) * Date: 2012-02-19 19:10
P.S. (sorry for forgetting this in the original post ;-))

Both 
  ./python -m test -G -v test_pickle
and
  ./python test_bytestrpickle.py
pass, but I have not run the entire test suite, as that takes ~90 minutes on my laptop....

The test script should of course be merged with test_pickle.py at some time....
msg154282 - Author: Merlijn van Deen (valhallasw) * Date: 2012-02-25 19:36
Ok, this is my first attempt at the Pickler part of the C implementation. I'll have to adapt the python implementation to match this one.

All BytestrPicklerTests in test_bytestrpickle.py pass, and ./python -m test -G -v test_pickle passes.

Comments on style etc. are very welcome.
msg154662 - Author: Merlijn van Deen (valhallasw) * Date: 2012-02-29 20:42
Added tests in Lib/test format.

After applying pickle.py.patch and BytestrPickler_c.diff, 
    ./python -m test -v -m PyPicklerBytestrTests test_pickle
returns 12 tests, no errors, while 
    ./python -m test -v -m CPicklerBytestrTests test_pickle
only passes
test_dump_bytes_protocol_0 (test.test_pickle.CPicklerBytestrTests) ... ok
test_dump_bytes_protocol_1 (test.test_pickle.CPicklerBytestrTests) ... ok
test_dump_bytes_protocol_2 (test.test_pickle.CPicklerBytestrTests) ... ok
test_dump_bytes_protocol_3 (test.test_pickle.CPicklerBytestrTests) ... ok

and has 8 errors (as expected).
msg154795 - Author: Merlijn van Deen (valhallasw) * Date: 2012-03-02 20:35
And a complete patch that implements the tests, the python implementation and the C implementation. I'm not completely happy with the code duplication in read_string/read_binstring/read_short_binstring C implementation, so that might be an improvement (however, there is already a lot of code duplication there at the moment).

Again: comments would be very welcome...
msg154832 - Author: Merlijn van Deen (valhallasw) * Date: 2012-03-03 12:37
OK, and now a version that's not broken... I forgot to initialize self->bytestr for PicklerObject/UnpicklerObject. *puts on the you-broke-the-build-hat*

This version passes all tests except test_packaging.test_caches, which seems to fail because I ran make install for this python and installed {distribute,pip,setuptools,virtualenv}.
msg156166 - Author: Merlijn van Deen (valhallasw) * Date: 2012-03-17 16:06
Based on the discussion on python-dev [1], this is an updated implementation that uses encoding='bytes' to signal str->bytes behaviour.

[1] http://mail.python.org/pipermail/python-dev/2012-March/117536.html
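
For reference, this is how the encoding='bytes' spelling behaves in Python 3.4 and later, where the change landed. The two pickles are hand-written stand-ins for Python 2 output (a str '\xff' and a unicode u'\xe9' in protocol 2), not data taken from the patch or its tests.

```python
import pickle

PY2_STR = b'\x80\x02U\x01\xffq\x00.'                      # Python 2 str '\xff'
PY2_UNICODE = b'\x80\x02X\x02\x00\x00\x00\xc3\xa9q\x00.'  # Python 2 u'\xe9'

# encoding='bytes' leaves 2.x str payloads undecoded.
as_bytes = pickle.loads(PY2_STR, encoding='bytes')
# 2.x unicode objects are unaffected and still load as str.
as_text = pickle.loads(PY2_UNICODE, encoding='bytes')
# A real codec name decodes the 2.x str payload to text instead.
as_latin1 = pickle.loads(PY2_STR, encoding='latin1')

print(type(as_bytes).__name__, type(as_text).__name__, type(as_latin1).__name__)
```

The caller picks per load whether a 2.x str means bytes or text, which matches the two use cases argued earlier in this issue.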
msg156167 - Author: Merlijn van Deen (valhallasw) * Date: 2012-03-17 16:07
...and the tests to go with that.
msg205347 - Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2013-12-06 03:42
Could you provide a single patch with the implementation and the tests together? I will try to find some time this week to review this.
msg205401 - Author: Merlijn van Deen (valhallasw) * Date: 2013-12-06 20:47
Hi Alexandre,

Attached is a diff based on r87793:0c508d87f80b.

Merlijn
msg205412 - Author: Merlijn van Deen (valhallasw) * Date: 2013-12-06 23:22
I have fixed most of the nits in this patch, except for:

1) the intermediate bytes object being created; inlining is an option, as storchaka suggested, but I'd rather have you decide what it should become before implementing it;

2) make clinic gives me 

./python -E ./Tools/clinic/clinic.py --make
Error in file "./Modules/_pickle.c" on line 6611:
Checksum mismatch!
Expected: bed0d8bbe1c647960ccc6f997b33bf33935fa56f
Computed: 58dcccb705487695fec30980f566027bc68d9c69
make: *** [clinic] Error 255

and I have no clue how to fix that -- the clinic docs are sparse, to say the least;

3) The tests are still in their own test case; please decide between the two of you what is the best solution;

4) I have grouped the test cases: test_load_python2_str_as_bytes (which checks protocols 0, 1, and 2), test_load_python2_unicode_as_str and test_load_long_python2_str_as_bytes;

5) I have moved the commands to create the shown pickled versions from docstrings to comments. If you think they are not useful, I'll remove them, but I found them pretty useful while shortening the strings.
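
A test in the spirit of test_load_python2_str_as_bytes might look like the sketch below. It is not the test from the patch: the class name and the pickles are assumptions, written by hand to match what Python 2 would emit for 'a\x00\xa0' with the memo opcodes dropped for brevity.

```python
import pickle
import unittest


class LoadPython2StrAsBytesSketch(unittest.TestCase):
    EXPECTED = b'a\x00\xa0'

    # Stand-ins for Python 2's pickle.dumps('a\x00\xa0', protocol=N).
    PICKLES = {
        0: b"S'a\\x00\\xa0'\n.",        # STRING (escaped repr), STOP
        1: b'U\x03a\x00\xa0.',          # SHORT_BINSTRING, STOP
        2: b'\x80\x02U\x03a\x00\xa0.',  # PROTO 2, SHORT_BINSTRING, STOP
    }

    def test_load_python2_str_as_bytes(self):
        for proto, data in self.PICKLES.items():
            with self.subTest(protocol=proto):
                loaded = pickle.loads(data, encoding='bytes')
                self.assertEqual(loaded, self.EXPECTED)
```

Saved to a file, this can be run with python -m unittest; the subTest blocks report each protocol separately if one of them fails.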
msg205435 - Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2013-12-07 02:19
I cleaned up the patch. I will submit it tonight if there are no major objections.
msg205436 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-12-07 02:57
How about updating the documentation as well?
msg205440 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-12-07 07:47
And what about an issue mentioned in msg153659?
msg205443 - Author: Roundup Robot (python-dev) Date: 2013-12-07 09:09
New changeset bd71352e950f by Alexandre Vassalotti in branch 'default':
Issue #6784: Strings from Python 2 can now be unpickled as bytes objects.
http://hg.python.org/cpython/rev/bd71352e950f
msg205444 - Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2013-12-07 09:12
I fixed up the last few review comments and submitted the patch. Thank you for the help!
History
Date User Action Args
2013-12-07 09:12:55 alexandre.vassalotti set status: open -> closed
    resolution: fixed
    messages: + msg205444
    stage: patch review -> resolved
2013-12-07 09:09:38 python-dev set nosy: + python-dev
    messages: + msg205443
2013-12-07 07:47:30 serhiy.storchaka set nosy: + serhiy.storchaka
    messages: + msg205440
2013-12-07 02:57:04 pitrou set messages: + msg205436
2013-12-07 02:19:37 alexandre.vassalotti set files: + pickle_python2_str_as_bytes.diff
    messages: + msg205435
2013-12-06 23:22:12 valhallasw set files: + bytestrpickle.diff
    messages: + msg205412
2013-12-06 20:47:25 valhallasw set files: + bytestrpickle.diff
    messages: + msg205401
2013-12-06 20:45:49 valhallasw set files: - pickle_bytes_tests.diff
2013-12-06 20:45:48 valhallasw set files: - pickle_bytes_code.diff
2013-12-06 20:45:47 valhallasw set files: - pickle_bytestr.patch
2013-12-06 20:45:46 valhallasw set files: - test_pickle.diff
2013-12-06 20:45:45 valhallasw set files: - BytestrPickler_c.diff
2013-12-06 20:45:43 valhallasw set files: - pickle.py.patch
2013-12-06 03:42:25 alexandre.vassalotti set priority: normal -> high
    versions: + Python 3.4, - Python 2.7, Python 3.2, Python 3.3
    messages: + msg205347
    assignee: docs@python -> alexandre.vassalotti
    stage: patch review
2013-02-15 18:18:57 flox set nosy: + flox
2012-12-18 07:02:23 kmike set nosy: + kmike
2012-03-17 16:07:47 valhallasw set files: + pickle_bytes_tests.diff
    messages: + msg156167
2012-03-17 16:07:00 valhallasw set files: + pickle_bytes_code.diff
    messages: + msg156166
2012-03-03 12:37:53 valhallasw set files: - pickle_bytestr.patch
2012-03-03 12:37:38 valhallasw set files: + pickle_bytestr.patch
    messages: + msg154832
2012-03-03 12:06:21 valhallasw set files: - test_bytestrpickle.py
2012-03-02 20:35:17 valhallasw set files: + pickle_bytestr.patch
    messages: + msg154795
2012-02-29 20:43:00 valhallasw set files: + test_pickle.diff
    messages: + msg154662
2012-02-25 19:36:21 valhallasw set files: + BytestrPickler_c.diff
    messages: + msg154282
2012-02-19 19:10:29 valhallasw set messages: + msg153719
2012-02-19 19:08:10 valhallasw set files: + pickle.py.patch
    keywords: + patch
    messages: + msg153718
2012-02-19 19:03:21 valhallasw set files: + test_bytestrpickle.py
2012-02-19 15:53:44 pitrou set messages: + msg153707
2012-02-19 15:49:20 valhallasw set messages: + msg153705
2012-02-19 09:32:27 Ronny.Pfannschmidt set nosy: + Ronny.Pfannschmidt
    messages: + msg153686
2012-02-19 08:37:00 eric.araujo set versions: + Python 3.3, - Python 2.6, Python 3.1
2012-02-18 23:37:07 valhallasw set messages: + msg153659
2012-02-14 21:44:55 valhallasw set nosy: + valhallasw
2011-02-21 23:03:19 jcea set nosy: + jcea
2011-02-02 15:48:15 r.david.murray set nosy: + jdharper
2011-02-02 15:47:44 r.david.murray link issue11099 superseder
2010-10-29 10:07:21 admin set assignee: georg.brandl -> docs@python
2009-09-14 07:04:17 RonnyPfannschmidt set messages: + msg92592
2009-08-29 22:04:37 ggenellina set nosy: + ggenellina, georg.brandl
    messages: + msg92072
    assignee: georg.brandl
    components: + Documentation
2009-08-28 09:38:54 RonnyPfannschmidt set title: byte/unicode pickle incompatibilities between python2 and and python3 -> byte/unicode pickle incompatibilities between python2 and python3
2009-08-27 21:08:01 loewis set messages: + msg92014
2009-08-27 19:15:13 RonnyPfannschmidt set messages: + msg92012
2009-08-27 14:55:20 RonnyPfannschmidt set messages: + msg92003
2009-08-27 13:53:41 pitrou set versions: + Python 2.7, Python 3.2
    nosy: + alexandre.vassalotti, gvanrossum, pitrou
    messages: + msg92002
    components: + Library (Lib), - None
2009-08-27 08:13:27 RonnyPfannschmidt set messages: + msg91998
2009-08-26 18:18:03 RonnyPfannschmidt set messages: + msg91980
2009-08-26 18:01:17 loewis set messages: + msg91978
    title: byte/unicode pickle incompatibilities between python2 and and python3 -> byte/unicode pickle incompatibilities between python2 and and python3
2009-08-26 12:42:23 RonnyPfannschmidt set messages: + msg91970
2009-08-26 12:19:45 loewis set nosy: + loewis
    messages: + msg91967
2009-08-26 12:12:25 RonnyPfannschmidt set title: bytw/unicode string incompatibilities between python2 and and python3 -> byte/unicode pickle incompatibilities between python2 and and python3
2009-08-26 11:56:12 RonnyPfannschmidt create