
classification
Title: cPickle corrupts high-unicode strings
Type:
Stage:
Components: Library (Lib)
Versions: Python 2.6, Python 2.5
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: alexandre.vassalotti, njs, pitrou
Priority: normal
Keywords:

Created on 2008-12-23 12:30 by njs, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (2)
msg78230 - Author: Nathaniel Smith (njs) (Python committer) Date: 2008-12-23 12:30
cPickle.dump with the default protocol (0) does not properly encode
unicode characters outside the BMP -- it throws away the high bits:

>>> cPickle.loads(cPickle.dumps(u"\U00012345"))
u'\u2345'

The problem is in dump, not load:
>>> pickle.dumps(u"\U00012345")   # works
'V\\U00012345\np0\n.'
>>> cPickle.dumps(u"\U00012345")  # no!
'V\\u2345\n.'

It does work correctly when using a more modern pickling protocol:

>>> cPickle.loads(cPickle.dumps(u"\U00012345", 1))
u'\U00012345'
>>> cPickle.loads(cPickle.dumps(u"\U00012345", 2))
u'\U00012345'
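
The per-protocol checks above can be folded into one round-trip helper. This is only a sketch: the name `roundtrip_failures` is hypothetical, and the import fallback assumes the code may also be run on Python 3, where `cPickle` no longer exists.

```python
try:
    import cPickle as pickle_impl  # Python 2: the C implementation discussed here
except ImportError:
    import pickle as pickle_impl   # Python 3: cPickle was merged into pickle


def roundtrip_failures(value):
    """Return the protocol numbers for which value does not survive dumps/loads."""
    failures = []
    for protocol in range(pickle_impl.HIGHEST_PROTOCOL + 1):
        restored = pickle_impl.loads(pickle_impl.dumps(value, protocol))
        if restored != value:
            failures.append(protocol)
    return failures


# An empty list means every protocol round-tripped the string intact.
print(roundtrip_failures(u"\U00012345"))
```

On an affected 2.5 or 2.6 interpreter this would be expected to report protocol 0, in line with the session above; on a fixed interpreter the list should be empty.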

But this is not much comfort for users whose data has been corrupted
because they went with the defaults.  (Fortunately in my application I
knew that all my characters were in the supplementary plane, so I could
repair the data after the fact, but...)
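
For illustration, a repair of that kind might look like the sketch below. It is not the code actually used: the function name `restore_plane1` is hypothetical, it is written for Python 3, and it assumes that every character in the damaged string originally lived in plane 1 (U+10000 to U+1FFFF), so that the lost high bits are always 0x10000.

```python
def restore_plane1(truncated):
    """Re-add the dropped plane bits to each character of a damaged string.

    Assumption: every original character was in plane 1, so the code point
    lost exactly 0x10000 when it was cut down to 16 bits.
    """
    repaired = []
    for ch in truncated:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            # Already outside the BMP, so it cannot be a truncated value.
            raise ValueError("unexpected non-BMP character: %r" % ch)
        repaired.append(chr(code_point + 0x10000))
    return "".join(repaired)


print(ascii(restore_plane1("\u2345")))
```

If the data mixes BMP and non-BMP text, the dropped bits cannot be recovered this way, because a truncated character is indistinguishable from a genuine BMP one.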

Above tests are with 2.5.2, but from checking the source, the bug is
obviously still present in 2.6.1:
cPickle.c:modified_EncodeRawUnicodeEscape has no code to handle 32-bit
unicode values.

OTOH, it does look like someone noticed the problem and fixed it for
3.0; _pickle.c:raw_unicode_escape handles such characters fine.  Guess
they just forgot to backport the fixes... but the code is there, and can
probably just be copy-pasted back to 2.6.
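
The difference between the two behaviours can be modelled in a few lines of Python 3. This is a simplified sketch, not a transcription of either C function: both helper names are hypothetical, and the truncating variant only imitates the symptom shown earlier (the code point being cut to 16 bits).

```python
def escape_truncating(text):
    """Model of the faulty output: every escape is a 4-digit \\uXXXX."""
    out = []
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFF or ch in "\\\n":
            # Only the low 16 bits are written, so higher planes are lost.
            out.append("\\u%04x" % (code_point & 0xFFFF))
        else:
            out.append(ch)
    return "".join(out)


def escape_full(text):
    """Model of the intended output: non-BMP code points use \\UXXXXXXXX."""
    out = []
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            out.append("\\U%08x" % code_point)
        elif code_point > 0xFF or ch in "\\\n":
            out.append("\\u%04x" % code_point)
        else:
            out.append(ch)
    return "".join(out)


for escape in (escape_truncating, escape_full):
    encoded = escape("\U00012345")
    # The stdlib raw_unicode_escape codec plays the role of the loader here.
    decoded = encoded.encode("latin-1").decode("raw_unicode_escape")
    print(escape.__name__, encoded, ascii(decoded))
```

Decoding the truncated form yields U+2345 rather than U+12345, which matches the corruption reported at the top of this message.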
msg78346 - Author: Alexandre Vassalotti (alexandre.vassalotti) (Python committer) Date: 2008-12-27 09:21
Fixed in r67934. Backported to 2.6 in r67936. Thanks!
History
Date User Action Args
2022-04-11 14:56:43 admin set github: 48980
2008-12-27 09:21:39 alexandre.vassalotti set status: open -> closed; resolution: fixed; messages: + msg78346; nosy: + alexandre.vassalotti
2008-12-23 14:33:26 pitrou set nosy: + pitrou
2008-12-23 12:30:08 njs create