Title: Portability issues with pickle
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.8, Python 3.7, Python 2.7
Status: open Resolution:
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: alexandre.vassalotti, benjamin.peterson, pitrou, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2017-10-20 17:41 by serhiy.storchaka, last changed 2019-02-18 14:07 by serhiy.storchaka.

Pull Requests
URL Status Linked Edit
PR 4067 open serhiy.storchaka, 2017-10-21 09:37
PR 11859 open serhiy.storchaka, 2019-02-14 20:20
Messages (8)
msg304667 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-10-20 17:41
After reading numerous pickle-related issues on GitHab, I have found that the most common issue with pickle in Python 2 is using it with files opened with text mode.

    with open(file_name, "w") as f:
        pickle.dump(data, f)

Initially pickle was a text protocol. But since implementing more efficient binary opcodes it is a binary protocol. Even the default protocol 0 is not completely text-safe. If save and load data containing Unicode strings with "text" protocol 0 using different text/binary modes or using text mode on different platforms, you can get an error or incorrect data.

I propose to add more defensive checks for pickle.

1. When save a pickle with protocol 0 (default) to a file opened in text mode (default) emit a Py3k warning.

2. When save a pickle with binary protocols (must be specified explicitly) to a file opened in text mode raise a ValueError on Windows and Mac Classic (resulting data is likely corrupted) and emit a warning on Unix and Linux. What the type of of warnings is more appropriate? DeprecationWarning, DeprecationWarning in py3k mode, RuntimeWarning, or UnicodeWarning?

3. Escape \r and \x1a (end-of-file in MS DOS) when pickle Unicode strings with protocol 0.

4. Detect the most common errors (e.g. module name ending with \r when load on Linux a pickle saved with text mode on Windows) and raise more informative error message.

5. Emit a warning when load an Unicode string ending with \r. This is likely an error (if the pickle was saved with text mode on Windows), but  this can a correct data if the saved Unicode string actually did end with \r. This is the most dubious proposition. On one hand, it is better to warn than silently return an incorrect result. On other hand, the correct result shouldn't provoke a warning.
msg304700 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-10-21 09:55
PR 4067 fixes following issues when unpickle on Unix or in binary mode files written with protocol 0 in text mode on Windows:

* ints were unpickled as longs by cPickle.
* bools were unpickled as longs by cPickle and as ints by pickle.
* floats couldn't be unpickled by cPickle.
* strings couldn't be unpickled by pickle.
* instances and globals couldn't be unpickled. And error messages were confusing due to invisible \r.
* pickles with protocol 0 containing Unicode string with \x1a couldn't be unpickled on Windows in text mode.
msg306091 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-11 18:28
It is possible to resolve issue with Unicode strings ending with \r. We can add a special mark in the stream (a combination of opcodes which is no-op) before writing the first Unicode strings ending with \r. If this mark is encountered in an input stream, therefore it was saved with new Python version, and ending \r can be removed from loaded Unicode strings.
msg306104 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-12 10:05
Updated PR correctly loads Unicode strings saved in text mode. As a mark used some corrected opcodes followed by newline. If any of previous newlines is \r\n, thus the file was written in text mode and is read in binary mode. If no opcodes with newlines was saved before the UNICODE opcode, the special no-op sequence STRING + "''\n" + POP is saved. This minimize overhead in common case.

I'm going to merge this PR and port some changes to Python 3. Could anybody please make a review of the documentation changes?
msg335586 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-02-15 07:27
The proposed PR looks big. Are these actual bug fixes or features? "Portability improvements" sounds like a feature.
msg335588 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-02-15 08:07
I have simplified the PR. Removed the complex code for detecting pickles written to files in text mode on Windows and for adding optional marks for correct detecting. Currently it does only two things:

* Escapes \r, \0 and \x1a (end-of-file on Windows) when pickle unicode. This allows pickles dumped to a file in binary mode (or on non-Windows, or in Python 3) be correctly loaded from a file opened from a file in text mode on Windows.

Currently, dumping to a file in text mode works most time, except on Windows, when the unicode string ends with \r or contains \x1a (not sure about \0, it was added just for the case). Since the data is only corrupted in special cases, this likely is not tested, and the user code can open files in text (default) mode without noticing a bug, until once a malicious user will provide a bad Unicode string.

* Produces a deprecation warning or even a ValueError in cases when writing to a file in text mode can cause problems. This will help to notice the problem earlier.
msg335834 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2019-02-18 13:58
I am not particularly interested in making Python 2 or ancient pickle protocols easier to use, sorry ;-)
msg335835 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-02-18 14:07
This can help with migrating to Python 3.

Python 2 programs often open files in text (default) mode for pickling and unpickling. With these changes you will get a warning when run the interpreter with the -3 option. You can also make the producer opening a file in binary mode for compatibility with Python 3, and be sure that the Python 2 consumer will read it correctly even from a file opened in text mode on Windows.
Date User Action Args
2019-02-18 14:07:30serhiy.storchakasetmessages: + msg335835
2019-02-18 13:58:01pitrousetmessages: + msg335834
2019-02-15 08:09:06serhiy.storchakasetversions: + Python 3.8, - Python 3.6
2019-02-15 08:07:56serhiy.storchakasetmessages: + msg335588
2019-02-15 07:27:14benjamin.petersonsetmessages: + msg335586
2019-02-14 20:20:24serhiy.storchakasetpull_requests: + pull_request11892
2017-11-12 10:05:48serhiy.storchakasetassignee: serhiy.storchaka
messages: + msg306104
versions: + Python 3.6, Python 3.7
2017-11-11 18:28:45serhiy.storchakasetmessages: + msg306091
2017-10-21 09:55:03serhiy.storchakasetmessages: + msg304700
2017-10-21 09:37:20serhiy.storchakasetkeywords: + patch
stage: patch review
pull_requests: + pull_request4038
2017-10-20 17:41:29serhiy.storchakacreate