Title: Portability issues with pickle
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.7, Python 3.6, Python 2.7
Status: open Resolution:
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: alexandre.vassalotti, benjamin.peterson, pitrou, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2017-10-20 17:41 by serhiy.storchaka, last changed 2017-11-12 10:05 by serhiy.storchaka.

Pull Requests
URL Status Linked Edit
PR 4067 open serhiy.storchaka, 2017-10-21 09:37
Messages (4)
msg304667 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-10-20 17:41
After reading numerous pickle-related issues on GitHub, I have found that the most common issue with pickle in Python 2 is using it with files opened in text mode.

    with open(file_name, "w") as f:
        pickle.dump(data, f)

Initially pickle was a text protocol, but since more efficient binary opcodes were implemented it has been a binary protocol. Even the default protocol 0 is not completely text-safe. If you save and load data containing Unicode strings with the "text" protocol 0 using different text/binary modes, or using text mode on different platforms, you can get an error or incorrect data.
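
The failure mode can be reproduced without Windows by imitating what text mode does there. This is a minimal sketch in Python 3 syntax, where pickle.dumps returns bytes; the "\n" to "\r\n" replacement is an assumed stand-in for the newline translation a Windows text-mode file performs on write, and survives() is a hypothetical helper, not part of pickle:

```python
import pickle

data = "abc"

# Protocol 0 is line-oriented: opcode arguments are terminated by "\n".
payload = pickle.dumps(data, protocol=0)

# Stand-in for a Windows text-mode write, which turns "\n" into "\r\n".
translated = payload.replace(b"\n", b"\r\n")


def survives(raw):
    """Return True if raw unpickles back to the original data."""
    try:
        return pickle.loads(raw) == data
    except Exception:
        # A corrupted stream may also fail outright instead of
        # returning wrong data.
        return False


print(survives(payload))     # bytes written and read unchanged
print(survives(translated))  # bytes after newline translation
```

Opening the file with "wb" for dumping and "rb" for loading avoids the translation altogether, which is why the checks proposed below target text-mode files.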

I propose to add more defensive checks for pickle.

1. When saving a pickle with protocol 0 (the default) to a file opened in text mode (the default), emit a Py3k warning.

2. When saving a pickle with a binary protocol (which must be specified explicitly) to a file opened in text mode, raise a ValueError on Windows and Mac Classic (the resulting data is likely corrupted) and emit a warning on Unix and Linux. Which type of warning is most appropriate: DeprecationWarning, DeprecationWarning in py3k mode, RuntimeWarning, or UnicodeWarning?

3. Escape \r and \x1a (end-of-file in MS-DOS) when pickling Unicode strings with protocol 0.

4. Detect the most common errors (e.g. a module name ending with \r when loading on Linux a pickle saved in text mode on Windows) and raise a more informative error message.

5. Emit a warning when loading a Unicode string ending with \r. This is likely an error (if the pickle was saved in text mode on Windows), but it can be correct data if the saved Unicode string actually did end with \r. This is the most dubious proposition. On one hand, it is better to warn than to silently return an incorrect result. On the other hand, a correct result shouldn't provoke a warning.
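
Proposals 1 and 2 could be sketched roughly as follows. This is only an illustration in Python 3 syntax, not the code from any patch: check_pickle_file_mode() and its crlf_platform parameter are hypothetical names, text mode is guessed from the file object's type and mode attribute, and RuntimeWarning is a placeholder because the warning category is still an open question above.

```python
import io
import os
import warnings


def check_pickle_file_mode(f, protocol, crlf_platform=(os.linesep != "\n")):
    """Hypothetical pre-dump check modelled on proposals 1 and 2.

    Returns True if f looks like a text-mode file, False otherwise.
    """
    mode = getattr(f, "mode", None)
    # Some file-like objects expose a non-string mode, so only trust strings.
    is_text = isinstance(f, io.TextIOBase) or (
        isinstance(mode, str) and "b" not in mode
    )
    if not is_text:
        return False
    if protocol >= 1 and crlf_platform:
        # Newline translation would corrupt binary opcodes.
        raise ValueError("binary pickle protocol requires a binary-mode file")
    # Placeholder category: the issue leaves the warning type undecided.
    warnings.warn("pickle written to a text-mode file",
                  RuntimeWarning, stacklevel=2)
    return True
```

A caller would run the check before pickle.dump(data, f, protocol); a binary-mode file passes silently.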
msg304700 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-10-21 09:55
PR 4067 fixes the following issues when unpickling, on Unix or in binary mode, files written with protocol 0 in text mode on Windows:

* ints were unpickled as longs by cPickle.
* bools were unpickled as longs by cPickle and as ints by pickle.
* floats couldn't be unpickled by cPickle.
* strings couldn't be unpickled by pickle.
* instances and globals couldn't be unpickled, and the error messages were confusing due to the invisible \r.
* pickles with protocol 0 containing a Unicode string with \x1a couldn't be unpickled on Windows in text mode.
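
The escaping behind the last item can be illustrated with the raw-unicode-escape codec that the UNICODE opcode uses for its argument. A minimal sketch, assuming a hypothetical helper name (escape_unicode_protocol0) and not reproducing the code in the PR:

```python
def escape_unicode_protocol0(text):
    """Hypothetical helper: encode text as a text-safe UNICODE argument."""
    # Escape the backslash first so the escapes added below stay unambiguous.
    text = text.replace("\\", "\\u005c")
    text = text.replace("\n", "\\u000a")
    text = text.replace("\r", "\\u000d")
    text = text.replace("\x1a", "\\u001a")  # end-of-file in MS-DOS
    return text.encode("raw-unicode-escape")


sample = "line\r\nend\x1a back\\slash"
escaped = escape_unicode_protocol0(sample)
print(escaped)
# Decoding with the same codec restores the original string.
print(escaped.decode("raw-unicode-escape") == sample)
```

Because the escaped argument contains no raw \r, \n or \x1a bytes, neither newline translation nor the DOS end-of-file character can alter it.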
msg306091 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-11 18:28
It is possible to resolve the issue with Unicode strings ending with \r. We can add a special mark in the stream (a combination of opcodes that is a no-op) before writing the first Unicode string ending with \r. If this mark is encountered in an input stream, the pickle was saved with a new Python version, and a trailing \r can be removed from loaded Unicode strings.
msg306104 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-12 10:05
The updated PR correctly loads Unicode strings saved in text mode. Some opcodes followed by a newline are used as the mark. If any of the previous newlines is \r\n, the file was written in text mode and is being read in binary mode. If no opcode with a newline was saved before the UNICODE opcode, the special no-op sequence STRING + "''\n" + POP is saved. This minimizes the overhead in the common case.
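
That the STRING + "''\n" + POP sequence is a no-op can be checked directly. This is a sketch in Python 3 syntax, using the opcode constants exported by the pickle module; it is not the PR's code (the PR targets Python 2.7), and newline_was_translated() is a hypothetical helper showing only the idea of the \r\n check:

```python
import pickle

# The no-op sequence named above: push an empty string, then pop it.
MARK = pickle.STRING + b"''\n" + pickle.POP

data = "abc\r"
plain = pickle.dumps(data, protocol=0)
marked = MARK + plain

# The prefix leaves the unpickled result unchanged.
print(pickle.loads(marked) == pickle.loads(plain))


def newline_was_translated(line):
    """Hypothetical check: did a newline-terminated opcode line gain a \\r?"""
    return line.endswith(b"\r\n")
```

The mark costs five bytes and is only needed when no earlier opcode in the stream already ends with a newline.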

I'm going to merge this PR and port some changes to Python 3. Could anybody please review the documentation changes?
Date User Action Args
2017-11-12 10:05:48 serhiy.storchaka set assignee: serhiy.storchaka
messages: + msg306104
versions: + Python 3.6, Python 3.7
2017-11-11 18:28:45 serhiy.storchaka set messages: + msg306091
2017-10-21 09:55:03 serhiy.storchaka set messages: + msg304700
2017-10-21 09:37:20 serhiy.storchaka set keywords: + patch
stage: patch review
pull_requests: + pull_request4038
2017-10-20 17:41:29 serhiy.storchaka create