Portability issues with pickle #76010
After reading numerous pickle-related issues on GitHub, I have found that the most common issue with pickle in Python 2 is using it with files opened in text mode.

    with open(file_name, "w") as f:
        pickle.dump(data, f)

Initially pickle was a text protocol. But since the introduction of more efficient binary opcodes it is a binary protocol. Even the default protocol 0 is not completely text-safe. If you save and load data containing Unicode strings with the "text" protocol 0 using different text/binary modes, or using text mode on different platforms, you can get an error or incorrect data. I propose to add more defensive checks for pickle.
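The failure mode described above can be reproduced without a Windows machine. The following is a minimal sketch under stated assumptions: the byte stream is hand-written to mimic what a protocol 0 pickler that leaves `\r` unescaped would emit for a Unicode string (simplified, with no memo opcode), and the two helper functions are hypothetical stand-ins for the newline translation that Windows text mode performs.

```python
import pickle

def windows_text_write(data: bytes) -> bytes:
    # Stand-in for a text-mode write on Windows: every LF becomes CRLF.
    return data.replace(b"\n", b"\r\n")

def windows_text_read(data: bytes) -> bytes:
    # Stand-in for a text-mode read on Windows: every CRLF becomes LF.
    return data.replace(b"\r\n", b"\n")

# Hand-written protocol 0 stream for the string "abc\r":
# UNICODE opcode, raw text, terminating LF, STOP opcode.
stream = pickle.UNICODE + b"abc\r" + b"\n" + pickle.STOP

# Same mode on both sides: the value round-trips.
same_mode = pickle.loads(stream)

# Written in text mode on Windows, read in binary mode: an extra \r appears.
written_text_read_binary = pickle.loads(windows_text_write(stream))

# Written in binary mode, read in text mode on Windows: the trailing \r is lost.
written_binary_read_text = pickle.loads(windows_text_read(stream))

print(repr(same_mode))
print(repr(written_text_read_binary))
print(repr(written_binary_read_text))
```

Neither mismatched case raises an error here; the string is silently changed, which is why this kind of bug is easy to miss.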
|
PR 4067 fixes the following issues that occur when unpickling, on Unix or in binary mode, files written with protocol 0 in text mode on Windows:
|
It is possible to resolve the issue with Unicode strings ending with \r. We can add a special mark to the stream (a combination of opcodes that is a no-op) before writing the first Unicode string ending with \r. If this mark is encountered in an input stream, the stream was saved with a new Python version, and the trailing \r can be removed from loaded Unicode strings. |
The updated PR correctly loads Unicode strings saved in text mode. Opcodes followed by a newline serve as the mark: if any of the preceding newlines is \r\n, the file was written in text mode and is being read in binary mode. If no opcode with a newline was saved before the UNICODE opcode, the special no-op sequence STRING + "''\n" + POP is saved. This minimizes overhead in the common case. I'm going to merge this PR and port some changes to Python 3. Could anybody please review the documentation changes? |
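To illustrate the mark idea, here is a rough sketch, not the PR's actual code. It builds the no-op sequence STRING + "''\n" + POP from the opcode constants in the `pickle` module, shows that prepending it does not change the loaded value, and uses a hypothetical helper, `looks_text_mode_written`, that only checks whether the leading mark's newline was expanded to CRLF. Real detection would have to walk the opcodes; the simple prefix check is an assumption made to keep the example short.

```python
import pickle

# The no-op mark: push an empty quoted string, then pop it again.
MARK = pickle.STRING + b"''\n" + pickle.POP

def to_windows_text(data: bytes) -> bytes:
    # Stand-in for writing through a Windows text-mode file (LF -> CRLF).
    return data.replace(b"\n", b"\r\n")

def looks_text_mode_written(stream: bytes) -> bool:
    """Hypothetical check: did the mark's newline get expanded to CRLF?"""
    return stream.startswith(to_windows_text(MARK))

# Hand-written protocol 0 payload for the string "abc".
payload = pickle.UNICODE + b"abc\n" + pickle.STOP
marked = MARK + payload

# The mark is a no-op: both streams load to the same value.
plain_value = pickle.loads(payload)
marked_value = pickle.loads(marked)

# Only the stream that went through the text-mode translation is flagged.
flag_binary = looks_text_mode_written(marked)
flag_text = looks_text_mode_written(to_windows_text(marked))
```

Because the mark is only needed when no earlier opcode already carries a newline, a writer following this scheme would emit it rarely, which matches the low-overhead goal stated in the comment.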
The proposed PR looks big. Are these actual bug fixes or features? "Portability improvements" sounds like a feature. |
I have simplified the PR. I removed the complex code for detecting pickles written to files in text mode on Windows and for adding optional marks for correct detection. Currently it does only two things:
Currently, dumping to a file in text mode works most of the time, except on Windows when the Unicode string ends with \r or contains \x1a (I am not sure about \0; it was added just in case). Since the data is only corrupted in special cases, this is likely not tested, and user code can open files in text (default) mode without anyone noticing the bug, until a malicious user provides a bad Unicode string.
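The special cases named above are easy to state as a predicate. The helper below, `risky_in_text_mode`, is a hypothetical name used only for illustration; it flags exactly the three conditions mentioned in this comment and is not part of pickle.

```python
def risky_in_text_mode(text: str) -> bool:
    """Return True if `text` matches a case named in this thread.

    Flags a trailing CR, an embedded \\x1a (treated as end-of-file by
    Windows text mode), or an embedded NUL (included only as a precaution).
    """
    return text.endswith("\r") or "\x1a" in text or "\0" in text

samples = ["plain", "ends with CR\r", "has\x1aeof", "has\0nul", "inner\rcr"]

# Only the strings that hit one of the three conditions are kept.
flagged = [s for s in samples if risky_in_text_mode(s)]
print(flagged)
```

A `\r` in the middle of a string is not flagged, since the problem arises only when it sits directly before the newline that terminates the opcode argument.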
|
I am not particularly interested in making Python 2 or ancient pickle protocols easier to use, sorry ;-) |
This can help with migrating to Python 3. Python 2 programs often open files in text (default) mode for pickling and unpickling. With these changes you will get a warning when you run the interpreter with the -3 option. You can also make the producer open a file in binary mode for compatibility with Python 3, and be sure that the Python 2 consumer will read it correctly even from a file opened in text mode on Windows. |
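For the producer side, a minimal sketch of the binary-mode pattern looks like this. The sample data and temporary file are illustrative assumptions; protocol 2 is chosen because it is the highest protocol Python 2 can read.

```python
import os
import pickle
import tempfile

# Illustrative data, including a string with a trailing carriage return.
data = {"name": "caf\u00e9", "tail": "line\r"}

# Create a temporary file path for the round trip.
fd, path = tempfile.mkstemp(suffix=".pickle")
os.close(fd)
try:
    # "wb" and "rb": no newline translation happens on any platform.
    with open(path, "wb") as f:
        pickle.dump(data, f, protocol=2)
    with open(path, "rb") as f:
        restored = pickle.load(f)
finally:
    os.remove(path)

print(restored == data)
```

Opening the file in binary mode on both sides sidesteps the newline translation entirely, so the trailing `\r` survives the round trip.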
Bumping this discussion in case this should be merged for 3.8b1. Thanks! |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.