msg91966 - (view) |
Author: (RonnyPfannschmidt) |
Date: 2009-08-26 11:56 |
I just noticed that there are some slight differences in the
bytestring/unicodestring pickles between python2/3 using the protocols
0, 1 and 2.
The first things I noticed are:
a str from python2 is unpickled as unicode in python3
(this fails for byte strings that cannot be decoded with the expected encoding)
a bytes instance from python3 is pickled as custom class in protocols <3
I'll write a script to try all combinations of protocols, string
variations and transfer directions.
|
msg91967 - (view) |
Author: Martin v. Löwis (loewis) * |
Date: 2009-08-26 12:19 |
Why are you reporting this here? If you think there is a bug, can you
propose an alternative behavior that you would consider correct?
The changes you mentioned are all deliberate.
|
msg91970 - (view) |
Author: (RonnyPfannschmidt) |
Date: 2009-08-26 12:42 |
the basic behavior i want to see for all protocols <= 2
1. python 2 string maps to python3 byte-string
2. python 2 unicode maps to python3 string
3. python 3 string map to python 2 unicode
4. python 3 bytestring maps to python 2 string
Anything else is confusing and may break.
For example, one can't unpickle '\xFF' in python3 if it was pickled in
python2.
Note that these changes seem irrelevant for protocol 3, as python2.x
doesn't support it.
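The '\xFF' case can be reproduced from Python 3 alone. This is a minimal sketch, assuming the hand-assembled byte string below matches what Python 2 writes for '\xff' at protocol 2 (PROTO 2, SHORT_BINSTRING, BINPUT, STOP). The helper name load_outcome is made up for illustration.

```python
import pickle

# Assumed to be the protocol 2 pickle Python 2 emits for the str '\xff':
# PROTO 2, SHORT_BINSTRING (length 1), BINPUT 0, STOP.
PY2_STR_FF = b"\x80\x02U\x01\xffq\x00."


def load_outcome(data, **kwargs):
    """Return the unpickled value, or the exception class name on failure."""
    try:
        return pickle.loads(data, **kwargs)
    except Exception as exc:
        return type(exc).__name__


# By default Python 2 str payloads are decoded as ASCII, so 0xff is rejected.
print(load_outcome(PY2_STR_FF))
# Decoding as Latin-1 cannot fail, but it yields text rather than bytes.
print(repr(load_outcome(PY2_STR_FF, encoding="latin1")))
```

The Latin-1 variant only avoids the error; the caller still has to know that the value was binary and encode it back.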
|
msg91978 - (view) |
Author: Martin v. Löwis (loewis) * |
Date: 2009-08-26 18:01 |
> the basic behavior i want to see for all protocols <= 2
>
> 1. python 2 string maps to python3 byte-string
That would not be good. Many people create pickles in 2.x where the
string type really represents characters, more often so than they want
it to represent bytes. Giving them bytes on unpickling will likely
cause more problems than the current approach.
> 2. python 2 unicode maps to python3 string
That's the case, right?
> 3. python 3 string map to python 2 unicode
That's also the case, AFAICT.
> 4. python 3 bytestring maps to python 2 string
Hmm. This may be indeed a mistake. Until r61467, bytes were saved
with the (BIN)STRING code; not sure why this was changed.
|
msg91980 - (view) |
Author: (RonnyPfannschmidt) |
Date: 2009-08-26 18:18 |
Since it breaks for anything non-ascii, it's not that helpful after all,
and since python2 strings are encoding-unaware there is no way to fix
it.
It might be preferable to supply unpicklers that are capable of
coercing if the user really wants coercing.
yup
>
> > 3. python 3 string map to python 2 unicode
>
> That's also the case, AFAICT.
yup
>
> > 4. python 3 bytestring maps to python 2 string
>
> Hmm. This may be indeed a mistake. Until r61467, bytes were saved
> with the (BIN)STRING code; not sure why this was changed.
Python 3 is indeed evil there.
b'\x80\x02c__builtin__\nbytes\nq\x00]q\x01\x85q\x02Rq\x03.'
I'm convinced that a 1:1 mapping of python2 string from/to python3
bytestrings is the least surprising behaviour and will keep surprising
errors away when needing to communicate between different python
versions.
It just has bitten me, and I suspect it will get others, too.
An unpickle that completely fails in the face of encodings is not desirable
at all.
|
msg91998 - (view) |
Author: (RonnyPfannschmidt) |
Date: 2009-08-27 08:13 |
it's even worse
python3:
>>> import pickle
>>> pickle.dumps(b'', protocol=2)
b'\x80\x02c__builtin__\nbytes\nq\x00]q\x01\x85q\x02Rq\x03.'
python2.6:
>>> import pickle
>>> pickle.loads('\x80\x02c__builtin__\nbytes\nq\x00]q\x01\x85q\x02Rq\x03.')
'[]'
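One way to see how a given interpreter stores bytes at each protocol is to list the opcodes with pickletools. This is only a sketch: the helper opcode_names is a hypothetical name, and the exact opcode sequence depends on the Python version, so a current interpreter will likely not reproduce the 2009 output above byte for byte.

```python
import pickle
import pickletools


def opcode_names(data):
    """List the opcode names that make up a pickle, in order."""
    return [opcode.name for opcode, arg, pos in pickletools.genops(data)]


# Protocols below 3 have no bytes opcode, so a callable is referenced by
# name (GLOBAL) and called on load; protocol 3 has a dedicated opcode.
for protocol in (0, 1, 2, 3):
    names = opcode_names(pickle.dumps(b"abc", protocol=protocol))
    print(protocol, names)
```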
|
msg92002 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-08-27 13:53 |
The problem with trying to solve the following issue:
"a bytes instance from python3 is pickled as custom class in
protocols <3"
is that if we pickle bytes from Python 3 as a 2.x str in protocol <= 2,
unpickling it using Python 3 will yield a str (unicode), not a bytes
object. Therefore the whole chain (pickling then unpickling) will not be
idempotent.
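The round-trip concern can be illustrated from Python 3. This is a sketch, assuming the second byte string is what a Python 2 style str pickle of 'abc' looks like at protocol 2; the variable names are illustrative only.

```python
import pickle

payload = b"\xff\x00abc"

# Python 3 to Python 3 at protocol 2: the bytes type survives the round trip.
restored = pickle.loads(pickle.dumps(payload, protocol=2))
print(type(restored).__name__, restored == payload)

# Assumed Python 2 style pickle of the str 'abc' (SHORT_BINSTRING).
# Python 3 reads it back as text by default, so bytes written in this
# form would not come back as bytes.
py2_style = b"\x80\x02U\x03abcq\x00."
as_loaded = pickle.loads(py2_style)
print(type(as_loaded).__name__)
```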
|
msg92003 - (view) |
Author: (RonnyPfannschmidt) |
Date: 2009-08-27 14:55 |
unpickle of any non-ascii string from python2 will break
the only way out would be to ensure text strings and a single defined
encoding (at that point storing unicode strings in any case seems more
practical)
also byte-strings stored as python2 str would break
and since I pass around binary strings as parts of objects, it's just
completely broken for me
|
msg92012 - (view) |
Author: (RonnyPfannschmidt) |
Date: 2009-08-27 19:15 |
in case the actual behavior is not supposed to change
how about a way to declare one wants exact 1:1 mapping between py2<>py3,
so str<>bytes and unicode<>str will work for sure
something like load/dump(..., encoding=bytes) just crossed my mind
|
msg92014 - (view) |
Author: Martin v. Löwis (loewis) * |
Date: 2009-08-27 21:08 |
> how about a way to declare one wants exact 1:1 mapping between py2<>py3,
> so str<>bytes and unicode<>str will work for sure
In a sense, that's already possible. Inherit from _Pickler/_Unpickler,
and replace the dispatch dict with a different mapping.
I wouldn't object to supporting this with an option, though, assuming it
was properly documented and implemented for both pickle and _pickle
(probably along with pickletools).
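A rough sketch of the subclassing approach follows. It relies on private details of the pure-Python pickle module (the _Unpickler class, its dispatch table keyed by opcode byte, and the read/append attributes set up by load()), which are assumptions that may not hold on every Python version. The class name BytesStringUnpickler is hypothetical, and the protocol 0 STRING opcode is left out because it also needs quote and escape handling.

```python
import io
import pickle
import struct


class BytesStringUnpickler(pickle._Unpickler):
    """Hypothetical unpickler that leaves Python 2 str payloads undecoded."""

    # Copy the table so the stock unpickler is left untouched.
    dispatch = dict(pickle._Unpickler.dispatch)

    def load_short_binstring(self):
        length = self.read(1)[0]        # one-byte length prefix
        self.append(self.read(length))  # push the raw bytes, no decoding

    def load_binstring(self):
        (length,) = struct.unpack("<i", self.read(4))  # 4-byte signed length
        if length < 0:
            raise pickle.UnpicklingError("BINSTRING with negative length")
        self.append(self.read(length))

    dispatch[pickle.SHORT_BINSTRING[0]] = load_short_binstring
    dispatch[pickle.BINSTRING[0]] = load_binstring


# Assumed Python 2 protocol 2 pickle of the str '\xff'.
data = b"\x80\x02U\x01\xffq\x00."
print(BytesStringUnpickler(io.BytesIO(data)).load())
```

Because this uses the Python implementation rather than the C accelerator, it is expected to be noticeably slower than pickle.loads.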
|
msg92072 - (view) |
Author: Gabriel Genellina (ggenellina) |
Date: 2009-08-29 22:04 |
Note that this is also a documentation issue: "The pickle
serialization format is guaranteed to be backwards compatible across
Python releases."
|
msg92592 - (view) |
Author: (RonnyPfannschmidt) |
Date: 2009-09-14 07:04 |
i'll try to add some tests now
hopefully i can get rid of the implicit badness like trying to coerce
bytes to unicode in unpickle and storing bytes as list in pickle for
protocol < 3
|
msg153659 - (view) |
Author: Merlijn van Deen (valhallasw) * |
Date: 2012-02-18 23:37 |
Any news on this?
Just as a note, pickletools.py also does not reflect the current behaviour; pickle types STRING, BINSTRING and SHORT_BINSTRING are all defined with stack_after=[pystring]:
[1, line 992]
I(name='STRING',
code='S',
arg=stringnl,
stack_before=[],
stack_after=[pystring],
proto=0,
doc=(...)
)
although the doc=... does describe it will be decoded, the object type of pystring is still defined as bytes:
[1, line 747]
pystring = StackObject(
name='string',
obtype=bytes,
doc="A Python (8-bit) string object.")
[1] http://hg.python.org/cpython/file/98df29d51e12/Lib/pickletools.py
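The declared stack effect of these opcodes can be inspected on any interpreter with a short script. This is a sketch that uses the public pickletools.opcodes list; no expected output is shown because what it prints depends on the Python version, and the variable names are illustrative.

```python
import pickletools

# Index the opcode descriptions by name for easy lookup.
by_name = {op.name: op for op in pickletools.opcodes}

string_opcodes = ("STRING", "BINSTRING", "SHORT_BINSTRING")
for name in string_opcodes:
    op = by_name[name]
    # Each StackObject records a display name and the Python type(s) it models.
    described = [(obj.name, obj.obtype) for obj in op.stack_after]
    print(name, repr(op.code), described)
```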
|
msg153686 - (view) |
Author: Ronny Pfannschmidt (Ronny.Pfannschmidt) |
Date: 2012-02-19 09:32 |
I'm unlikely to find the time to try and fix pickle/cpickle myself in the next few months
|
msg153705 - (view) |
Author: Merlijn van Deen (valhallasw) * |
Date: 2012-02-19 15:49 |
Last night, I hacked together a wrapper to do what loewis suggested [1]. It pickles bytes to str (for protocol <= 2), and unpickles str to bytes.
If I (ever) get the build system and tests of python itself to work, I'll try and see if I can implement a nicer solution - at least for pickle.py.
[1] https://github.com/valhallasw/py2/blob/master/bytestrpickle.py
|
msg153707 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-02-19 15:53 |
> If I (ever) get the build system and tests of python itself to work,
If you have any problems with that, don't hesitate to ask on python-dev
(or see http://mail.python.org/mailman/listinfo/core-mentorship )
|
msg153718 - (view) |
Author: Merlijn van Deen (valhallasw) * |
Date: 2012-02-19 19:08 |
OK, this is the pickle.py patch. A new parameter 'bytestr' has been added to both _Pickler and _Unpickler to toggle the pickle.string<=>bytes behaviour:
_Pickler:
  IF protocol <= 2 AND bytestr=True
  THEN bytes are stored as STRING/SHORT_BINSTRING/BINSTRING
  ELSE the old behaviour (an object reduction for protocol <= 2, BINBYTES/SHORT_BINBYTES otherwise)
_Unpickler:
  IF bytestr=True
  THEN STRING/SHORT_BINSTRING/BINSTRING are read as bytes
  ELSE they are read as str (old behaviour)
I also extracted the decoding logic from the three string reading functions into a single one.
|
msg153719 - (view) |
Author: Merlijn van Deen (valhallasw) * |
Date: 2012-02-19 19:10 |
P.S. (sorry for forgetting this in the original post ;-))
Both
./python -m test -G -v test_pickle
and
./python test_bytestrpickle.py
pass, but I have not run the entire test suite, as that takes ~90 minutes on my laptop....
The test script should of course be merged with test_pickle.py at some time....
|
msg154282 - (view) |
Author: Merlijn van Deen (valhallasw) * |
Date: 2012-02-25 19:36 |
Ok, this is my first attempt at the Pickler part of the C implementation. I'll have to adapt the python implementation to match this one.
All BytestrPicklerTests in test_bytestrpickle.py pass, and ./python -m test -G -v test_pickle passes.
Comments on style etc. are very welcome.
|
msg154662 - (view) |
Author: Merlijn van Deen (valhallasw) * |
Date: 2012-02-29 20:42 |
Added tests in Lib/test format.
After applying pickle.py.patch and BytestrPickler_c.diff,
./python -m test -v -m PyPicklerBytestrTests test_pickle
returns 12 tests, no errors, while
./python -m test -v -m CPicklerBytestrTests test_pickle
only passes
test_dump_bytes_protocol_0 (test.test_pickle.CPicklerBytestrTests) ... ok
test_dump_bytes_protocol_1 (test.test_pickle.CPicklerBytestrTests) ... ok
test_dump_bytes_protocol_2 (test.test_pickle.CPicklerBytestrTests) ... ok
test_dump_bytes_protocol_3 (test.test_pickle.CPicklerBytestrTests) ... ok
and has 8 errors (as expected).
|
msg154795 - (view) |
Author: Merlijn van Deen (valhallasw) * |
Date: 2012-03-02 20:35 |
And a complete patch that implements the tests, the python implementation and the C implementation. I'm not completely happy with the code duplication in read_string/read_binstring/read_short_binstring C implementation, so that might be an improvement (however, there is already a lot of code duplication there at the moment).
Again: comments would be very welcome...
|
msg154832 - (view) |
Author: Merlijn van Deen (valhallasw) * |
Date: 2012-03-03 12:37 |
OK, and now a version that's not broken... I forgot to initialize self->bytestr for PicklerObject/UnpicklerObject. *puts on the you-broke-the-build-hat*
This version passes all tests except test_packaging.test_caches, which seems to fail because I ran make install for python and then installed {distribute,pip,setuptools,virtualenv}.
|
msg156166 - (view) |
Author: Merlijn van Deen (valhallasw) * |
Date: 2012-03-17 16:06 |
Based on the discussion on python-dev [1], this is an updated implementation that uses encoding='bytes' to signal str->bytes behaviour.
[1] http://mail.python.org/pipermail/python-dev/2012-March/117536.html
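From the caller's side, the encoding='bytes' switch can be sketched as below. This assumes a Python 3 version that includes the change (3.4 or later, per the versions field of this issue) and that the hand-assembled byte strings match what Python 2 writes at protocol 2 for a str and a unicode object.

```python
import pickle

# Assumed Python 2 protocol 2 pickles, written out by hand:
PY2_STR = b"\x80\x02U\x02\xff\x00q\x00."                  # str '\xff\x00'
PY2_UNICODE = b"\x80\x02X\x02\x00\x00\x00\xc3\xa9q\x00."  # unicode u'\xe9'

# With encoding='bytes', Python 2 str payloads come back undecoded.
as_bytes = pickle.loads(PY2_STR, encoding="bytes")
# Python 2 unicode payloads are unaffected and still load as str.
as_text = pickle.loads(PY2_UNICODE, encoding="bytes")

print(type(as_bytes).__name__, as_bytes)
print(type(as_text).__name__, as_text)
```

This covers the unpickling direction only: bytes pickled from Python 3 with protocol <= 2 are not written as a Python 2 str by this option.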
|
msg156167 - (view) |
Author: Merlijn van Deen (valhallasw) * |
Date: 2012-03-17 16:07 |
...and the tests to go with that.
|
msg205347 - (view) |
Author: Alexandre Vassalotti (alexandre.vassalotti) * |
Date: 2013-12-06 03:42 |
Could you provide a single patch with the implementation and the tests together? I will try to find some time this week to review this.
|
msg205401 - (view) |
Author: Merlijn van Deen (valhallasw) * |
Date: 2013-12-06 20:47 |
Hi Alexandre,
Attached is a diff based on r87793:0c508d87f80b.
Merlijn
|
msg205412 - (view) |
Author: Merlijn van Deen (valhallasw) * |
Date: 2013-12-06 23:22 |
I have fixed most of the nits in this patch, except for:
1) the intermediate bytes object being created; inlining is an option, as storchaka suggested, but I'd rather have you decide what it should become before implementing it;
2) make clinic gives me
./python -E ./Tools/clinic/clinic.py --make
Error in file "./Modules/_pickle.c" on line 6611:
Checksum mismatch!
Expected: bed0d8bbe1c647960ccc6f997b33bf33935fa56f
Computed: 58dcccb705487695fec30980f566027bc68d9c69
make: *** [clinic] Error 255
and I have no clue how to fix that -- the clinic docs are sparse, to say the least;
3) The tests are still in their own test case; please decide between the two of you what is the best solution;
4) I have grouped the test cases: test_load_python2_str_as_bytes (which checks protocols 0, 1, and 2), test_load_python2_unicode_as_str and test_load_long_python2_str_as_bytes;
5) I have moved the commands to create the shown pickled versions from docstrings to comments. If you think they are not useful, I'll remove them, but I found them pretty useful while shortening the strings.
|
msg205435 - (view) |
Author: Alexandre Vassalotti (alexandre.vassalotti) * |
Date: 2013-12-07 02:19 |
I cleaned up the patch. I will submit it tonight if there are no major objections.
|
msg205436 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2013-12-07 02:57 |
How about updating the documentation as well?
|
msg205440 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-12-07 07:47 |
And what about an issue mentioned in msg153659?
|
msg205443 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2013-12-07 09:09 |
New changeset bd71352e950f by Alexandre Vassalotti in branch 'default':
Issue #6784: Strings from Python 2 can now be unpickled as bytes objects.
http://hg.python.org/cpython/rev/bd71352e950f
|
msg205444 - (view) |
Author: Alexandre Vassalotti (alexandre.vassalotti) * |
Date: 2013-12-07 09:12 |
I fixed up the last few review comments and submitted the patch. Thank you for the help!
|
|
Date | User | Action | Args |
2022-04-11 14:56:52 | admin | set | github: 51033 |
2013-12-07 09:12:55 | alexandre.vassalotti | set | status: open -> closed resolution: fixed messages:
+ msg205444
stage: patch review -> resolved |
2013-12-07 09:09:38 | python-dev | set | nosy:
+ python-dev messages:
+ msg205443
|
2013-12-07 07:47:30 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka messages:
+ msg205440
|
2013-12-07 02:57:04 | pitrou | set | messages:
+ msg205436 |
2013-12-07 02:19:37 | alexandre.vassalotti | set | files:
+ pickle_python2_str_as_bytes.diff
messages:
+ msg205435 |
2013-12-06 23:22:12 | valhallasw | set | files:
+ bytestrpickle.diff
messages:
+ msg205412 |
2013-12-06 20:47:25 | valhallasw | set | files:
+ bytestrpickle.diff
messages:
+ msg205401 |
2013-12-06 20:45:49 | valhallasw | set | files:
- pickle_bytes_tests.diff |
2013-12-06 20:45:48 | valhallasw | set | files:
- pickle_bytes_code.diff |
2013-12-06 20:45:47 | valhallasw | set | files:
- pickle_bytestr.patch |
2013-12-06 20:45:46 | valhallasw | set | files:
- test_pickle.diff |
2013-12-06 20:45:45 | valhallasw | set | files:
- BytestrPickler_c.diff |
2013-12-06 20:45:43 | valhallasw | set | files:
- pickle.py.patch |
2013-12-06 03:42:25 | alexandre.vassalotti | set | priority: normal -> high versions:
+ Python 3.4, - Python 2.7, Python 3.2, Python 3.3 messages:
+ msg205347
assignee: docs@python -> alexandre.vassalotti stage: patch review |
2013-02-15 18:18:57 | flox | set | nosy:
+ flox
|
2012-12-18 07:02:23 | kmike | set | nosy:
+ kmike
|
2012-03-17 16:07:47 | valhallasw | set | files:
+ pickle_bytes_tests.diff
messages:
+ msg156167 |
2012-03-17 16:07:00 | valhallasw | set | files:
+ pickle_bytes_code.diff
messages:
+ msg156166 |
2012-03-03 12:37:53 | valhallasw | set | files:
- pickle_bytestr.patch |
2012-03-03 12:37:38 | valhallasw | set | files:
+ pickle_bytestr.patch
messages:
+ msg154832 |
2012-03-03 12:06:21 | valhallasw | set | files:
- test_bytestrpickle.py |
2012-03-02 20:35:17 | valhallasw | set | files:
+ pickle_bytestr.patch
messages:
+ msg154795 |
2012-02-29 20:43:00 | valhallasw | set | files:
+ test_pickle.diff
messages:
+ msg154662 |
2012-02-25 19:36:21 | valhallasw | set | files:
+ BytestrPickler_c.diff
messages:
+ msg154282 |
2012-02-19 19:10:29 | valhallasw | set | messages:
+ msg153719 |
2012-02-19 19:08:10 | valhallasw | set | files:
+ pickle.py.patch keywords:
+ patch messages:
+ msg153718
|
2012-02-19 19:03:21 | valhallasw | set | files:
+ test_bytestrpickle.py |
2012-02-19 15:53:44 | pitrou | set | messages:
+ msg153707 |
2012-02-19 15:49:20 | valhallasw | set | messages:
+ msg153705 |
2012-02-19 09:32:27 | Ronny.Pfannschmidt | set | nosy:
+ Ronny.Pfannschmidt messages:
+ msg153686
|
2012-02-19 08:37:00 | eric.araujo | set | versions:
+ Python 3.3, - Python 2.6, Python 3.1 |
2012-02-18 23:37:07 | valhallasw | set | messages:
+ msg153659 |
2012-02-14 21:44:55 | valhallasw | set | nosy:
+ valhallasw
|
2011-02-21 23:03:19 | jcea | set | nosy:
+ jcea
|
2011-02-02 15:48:15 | r.david.murray | set | nosy:
+ jdharper
|
2011-02-02 15:47:44 | r.david.murray | link | issue11099 superseder |
2010-10-29 10:07:21 | admin | set | assignee: georg.brandl -> docs@python |
2009-09-14 07:04:17 | RonnyPfannschmidt | set | messages:
+ msg92592 |
2009-08-29 22:04:37 | ggenellina | set | nosy:
+ ggenellina, georg.brandl messages:
+ msg92072
assignee: georg.brandl components:
+ Documentation |
2009-08-28 09:38:54 | RonnyPfannschmidt | set | title: byte/unicode pickle incompatibilities between python2 and and python3 -> byte/unicode pickle incompatibilities between python2 and python3 |
2009-08-27 21:08:01 | loewis | set | messages:
+ msg92014 |
2009-08-27 19:15:13 | RonnyPfannschmidt | set | messages:
+ msg92012 |
2009-08-27 14:55:20 | RonnyPfannschmidt | set | messages:
+ msg92003 |
2009-08-27 13:53:41 | pitrou | set | versions:
+ Python 2.7, Python 3.2 nosy:
+ alexandre.vassalotti, gvanrossum, pitrou
messages:
+ msg92002
components:
+ Library (Lib), - None |
2009-08-27 08:13:27 | RonnyPfannschmidt | set | messages:
+ msg91998 |
2009-08-26 18:18:03 | RonnyPfannschmidt | set | messages:
+ msg91980 |
2009-08-26 18:01:17 | loewis | set | messages:
+ msg91978 title: byte/unicode pickle incompatibilities between python2 and and python3 -> byte/unicode pickle incompatibilities between python2 and and python3 |
2009-08-26 12:42:23 | RonnyPfannschmidt | set | messages:
+ msg91970 |
2009-08-26 12:19:45 | loewis | set | nosy:
+ loewis messages:
+ msg91967
|
2009-08-26 12:12:25 | RonnyPfannschmidt | set | title: bytw/unicode string incompatibilities between python2 and and python3 -> byte/unicode pickle incompatibilities between python2 and and python3 |
2009-08-26 11:56:12 | RonnyPfannschmidt | create | |