classification
Title: pickle is unable to encode unicode surrogates
Type:
Stage:
Components: Library (Lib)
Versions: Python 3.1, Python 3.2
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: lemburg, loewis, vstinner
Priority: normal
Keywords: patch

Created on 2010-04-13 00:39 by vstinner, last changed 2010-04-13 11:10 by vstinner. This issue is now closed.

Files
File name Uploaded Description
pickle_surrogates.patch vstinner, 2010-04-13 00:44
Messages (6)
msg102996 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-04-13 00:39
Python 3 uses Unicode surrogates to store undecodable filenames. E.g. the filename b"abc\xff.py" is decoded as "abc\udcff.py" if the file system encoding is ASCII. Pickle is unable to store them:

./python -c 'import pickle; pickle.dumps("abc\udcff")'
(...)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 20: surrogates not allowed

This is a limitation of pickle (in binary mode): a Python string can hold any Unicode code point, including lone surrogates, but pickle cannot serialize them.

Using the "surrogatepass" error handler should be enough to fix this issue.

Related issue: #3672 (Reject surrogates in utf-8 codec) -> r72208 creates "surrogatepass" error handler.
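
A minimal sketch of the behaviour described above, using only the codecs and no pickle. The variable names are illustrative and do not come from the patch; the snippet just shows how "surrogateescape" produces the lone surrogate and how "surrogatepass" lets the UTF-8 codec encode it.

```python
# Illustration only: these names are hypothetical, not taken from the patch.
raw_name = b"abc\xff.py"

# "surrogateescape" maps the undecodable byte 0xFF to the lone surrogate U+DCFF.
decoded = raw_name.decode("ascii", "surrogateescape")

# The strict UTF-8 codec rejects lone surrogates.
try:
    decoded.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "surrogatepass" encodes the surrogate like any other code point,
# and decodes it back the same way.
encoded = decoded.encode("utf-8", "surrogatepass")
restored = encoded.decode("utf-8", "surrogatepass")
```

The same error handler has to be used on both sides: decoding these bytes with the strict UTF-8 codec fails in the same way as encoding does.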
msg102997 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-04-13 00:51
I found this bug indirectly: test_logging failed on a SocketHandler if LogRecord.pathname contains a surrogate character. SocketHandler uses pickle to serialize the record.
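
A rough sketch of the code path involved, assuming an interpreter that already includes the fix. The record contents are made up for illustration; SocketHandler.makePickle() is the documented method that builds the length-prefixed pickle sent over the socket, and no connection is opened here.

```python
import logging
import logging.handlers
import pickle
import struct

# Hypothetical record whose pathname contains a lone surrogate.
record = logging.makeLogRecord({"msg": "hello", "pathname": "abc\udcff.py"})

# The handler only connects when a record is emitted, so nothing is sent here.
handler = logging.handlers.SocketHandler(
    "localhost", logging.handlers.DEFAULT_TCP_LOGGING_PORT
)

# makePickle() returns a 4-byte big-endian length followed by the pickled
# attribute dictionary of the record.
payload = handler.makePickle(record)
(length,) = struct.unpack(">L", payload[:4])
attrs = pickle.loads(payload[4:])
handler.close()
```

Before the fix, the makePickle() call is where the UnicodeEncodeError would surface.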
msg103022 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2010-04-13 09:01
Both pickle and marshal will need to use the new error handler in order to stay compatible with Python 3.0 (and 2.x) and also to enable creating Unicode literals that include lone surrogates.
msg103029 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-04-13 09:28
> Both pickle and marshal will need to use the new error handler 
> in order to stay compatible with Python 3.0 (and 2.x) 
> and also to enable creating Unicode literals that include 
> lone surrogates.

The attached patch fixes pickle. Marshal has already used surrogatepass since Martin's commit r72208 (Issue #3672).
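
As a quick check of both claims, here is a small round-trip sketch, assuming an interpreter that includes the pickle fix. The variable names are illustrative only.

```python
import marshal
import pickle

text = "abc\udcff"  # string ending in a lone surrogate

# Round-trip through every pickle protocol the interpreter supports.
pickle_ok = all(
    pickle.loads(pickle.dumps(text, protocol)) == text
    for protocol in range(pickle.HIGHEST_PROTOCOL + 1)
)

# marshal should round-trip the same string.
marshal_ok = marshal.loads(marshal.dumps(text)) == text
```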
msg103030 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2010-04-13 09:44
Looks good!

Thanks.
msg103034 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-04-13 11:10
Committed: r80031 (py3k) and r80032 (3.1). The commits also fix pickletools.
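
For the pickletools part, a short sketch of what the fix allows, again assuming a fixed interpreter. Protocol 2 is chosen arbitrarily for the illustration; pickletools.genops() walks the opcodes and decodes their arguments.

```python
import pickle
import pickletools

text = "abc\udcff"  # string ending in a lone surrogate
data = pickle.dumps(text, protocol=2)

# genops() yields (opcode, argument, position) for each opcode in the pickle;
# decoding the string argument requires the same surrogate handling as pickle.
decoded_args = [arg for opcode, arg, pos in pickletools.genops(data)]
```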
History
Date User Action Args
2010-04-16 01:14:54 vstinner link issue8242 dependencies
2010-04-13 11:10:22 vstinner set status: open -> closed
resolution: fixed
messages: + msg103034
2010-04-13 09:44:29 lemburg set messages: + msg103030
2010-04-13 09:28:21 vstinner set messages: + msg103029
2010-04-13 09:01:49 lemburg set messages: + msg103022
2010-04-13 00:51:33 vstinner set messages: + msg102997
2010-04-13 00:44:40 vstinner set files: + pickle_surrogates.patch
keywords: + patch
2010-04-13 00:39:55 vstinner create