Message 78230 - Python tracker

Author	njs
Recipients	njs
Date	2008-12-23.12:30:06
Message-id	<1230035446.04.0.56766649246.issue4730@psf.upfronthosting.co.za>

Content
cPickle.dump by default (protocol 0) does not properly encode unicode
characters outside the BMP -- it keeps only the low 16 bits of the code
point and throws away the rest:

>>> cPickle.loads(cPickle.dumps(u"\U00012345"))
u'\u2345'
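
The lost bits can be seen with plain integer arithmetic. This is only a sketch of the effect, not of the cPickle code itself, and it is written for Python 3 (where `chr` accepts any code point); the variable names are illustrative.

```python
# Code point of the character used in the report.
original = 0x12345

# Keeping only the low 16 bits reproduces the corrupted value.
truncated = original & 0xFFFF

# The bits that were discarded identify the plane (here plane 1).
lost_plane = original >> 16

print(hex(truncated))   # 0x2345
print(lost_plane)       # 1
```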

The problem is in dump, not load:

>>> pickle.dumps(u"\U00012345")   # works
'V\\U00012345\np0\n.'
>>> cPickle.dumps(u"\U00012345")  # no!
'V\\u2345\n.'

It does work correctly when using a more modern pickling protocol:

>>> cPickle.loads(cPickle.dumps(u"\U00012345", 1))
u'\U00012345'
>>> cPickle.loads(cPickle.dumps(u"\U00012345", 2))
u'\U00012345'
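
A quick way to check every protocol at once is a round-trip test. The sketch below assumes Python 3, where `cPickle` no longer exists and `pickle` is used instead; `roundtrips` is a hypothetical helper name. On an affected 2.x build, the equivalent check with `cPickle` would be expected to fail for protocol 0, per the session above.

```python
import pickle

def roundtrips(text, protocol):
    """Return True if text survives dumps/loads under the given protocol."""
    return pickle.loads(pickle.dumps(text, protocol)) == text

sample = "\U00012345"  # a character outside the BMP

# Map each available protocol number to whether the round trip preserved the text.
results = {p: roundtrips(sample, p) for p in range(pickle.HIGHEST_PROTOCOL + 1)}
print(results)
```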

But this is not much comfort for users whose data has been corrupted
because they went with the defaults.  (Fortunately in my application I
knew that all my characters were in the supplementary plane, so I could
repair the data after the fact, but...)
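
A repair of that kind might look roughly like the sketch below. It assumes Python 3 and that every character in the damaged string originally lived in the same known plane (plane 1 is used as the default here, which is an assumption); `restore_plane` is a hypothetical name, not anything from the pickle module. It is not safe for strings that mix BMP and non-BMP characters, because the truncation is not reversible in general.

```python
def restore_plane(text, plane=1):
    """Add the lost plane bits back to every character in text.

    Only valid if *all* characters were originally in the given plane
    and were truncated to their low 16 bits.
    """
    offset = plane << 16
    return "".join(chr(ord(ch) + offset) for ch in text)

damaged = "\u2345"                 # what came back from the bad pickle
repaired = restore_plane(damaged)  # "\U00012345" under the stated assumption
```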

Above tests are with 2.5.2, but from checking the source, the bug is
obviously still present in 2.6.1:
cPickle.c:modified_EncodeRawUnicodeEscape has no code to handle 32-bit
unicode values.

OTOH, it does look like someone noticed the problem and fixed it for
3.0; _pickle.c:raw_unicode_escape handles such characters fine.  Guess
they just forgot to backport the fixes... but the code is there, and can
probably just be copy-pasted back to 2.6.
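
For illustration, the missing case amounts to emitting an eight-digit `\U` escape for code points above 0xFFFF. The following is a minimal Python 3 sketch of that idea, not a transcription of either C function; the function name is hypothetical and the set of characters escaped with `\u` (backslash and newline) is a simplifying assumption.

```python
def raw_unicode_escape_sketch(text):
    """Encode text as raw-unicode-escape bytes, including non-BMP characters."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # The 32-bit case the report says is not handled in cPickle.c.
            out.append("\\U%08x" % cp)
        elif cp > 0xFF or ch in "\\\n":
            # BMP characters beyond Latin-1, plus backslash and newline,
            # so the line-based protocol-0 format stays unambiguous.
            out.append("\\u%04x" % cp)
        else:
            out.append(ch)
    return "".join(out).encode("latin-1")

encoded = raw_unicode_escape_sketch("a\u2345\U00012345")
decoded = encoded.decode("raw-unicode-escape")
```

Decoding with the standard `raw-unicode-escape` codec gives back the original string, which is the property the protocol-0 loader relies on.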

History
Date	User	Action	Args
2008-12-23 12:30:46	njs	set	recipients: + njs
2008-12-23 12:30:46	njs	set	messageid: <1230035446.04.0.56766649246.issue4730@psf.upfronthosting.co.za>
2008-12-23 12:30:08	njs	link	issue4730 messages
2008-12-23 12:30:06	njs	create