This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: pickle/cPickle saves invalid/incomplete data
Type: behavior Stage: resolved
Components: Versions: Python 2.7
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, Philipp.Lies, loewis, serhiy.storchaka, tshepang
Priority: normal Keywords:

Created on 2012-07-30 15:43 by Philipp.Lies, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (3)
msg166906 - (view) Author: Philipp Lies (Philipp.Lies) Date: 2012-07-30 15:43
I just stumbled upon a very serious bug in cPickle: it stores only part of the data passed to it, without any warning or error:

# create a >8 GB random data string
import os
import cPickle
random_string = os.urandom(int(1.1*2**33))
print len(random_string)
fout = open('test.pickle', 'wb')
cPickle.dump(random_string, fout, 2)
fout.close()
fin = open('test.pickle', 'rb')
random_string2 = cPickle.load(fin)
print len(random_string2)
print random_string == random_string2

The loaded string is significantly shorter, meaning that some of the data was lost while storing the string. This is a serious issue. However, when I use pickle, writing fails with
error: 'i' format requires -2147483648 <= number <= 2147483647
so I guess pickle is not able to handle large data; therefore cPickle should either throw an error as well, or pickle/cPickle should be patched to handle larger data.
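The 'i' format error above can be reproduced directly with the struct module. A minimal sketch (Python 3 stdlib only; the 4-byte limit is the same on both Python versions) showing why any length above 2**31 - 1 cannot be encoded:

```python
# Sketch: the length field behind the reported error is a 4-byte signed
# int (struct format 'i'), which caps encodable lengths at 2**31 - 1.
import struct

ok_len = 2**31 - 1          # largest length 'i' can represent
too_big = int(1.1 * 2**33)  # the ~9.4 GB length from the repro above

struct.pack('i', ok_len)    # packs fine into 4 bytes
try:
    struct.pack('i', too_big)
except struct.error as exc:
    print('overflow:', exc)
```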

Code to reproduce error using numpy (that's how I stumbled upon it):
import numpy as np
import cPickle as pickle
A = np.random.randn(1080,1920,553)
fout = open('test.pickle', 'wb')
pickle.dump(A, fout, 2)
fout.close()
fin = open('test.pickle', 'rb')
B = pickle.load(fin)
Here, numpy detects that the amount of data is wrong and throws an error. However, this is still serious, because saving does not raise an error, so the user expects that the data has been safely stored.
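A quick back-of-the-envelope check (a sketch, assuming 8-byte float64 elements, which is what np.random.randn produces) shows why this array trips the same 4-byte length limit:

```python
# Sketch: total byte size of the np.random.randn(1080, 1920, 553) array
# from the repro, assuming 8-byte float64 elements.
nbytes = 1080 * 1920 * 553 * 8
print(nbytes)                # 9173606400 bytes, roughly 8.5 GiB
print(nbytes > 2**31 - 1)    # True: too large for a 4-byte signed length
```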

I guess this might be related to http://bugs.python.org/issue13555, which is still open.

Python 2.7.3 on latest Ubuntu with numpy 1.6.2, 64bit architecture, 128GB RAM
msg166911 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-07-30 16:44
People can probably debate endlessly about the seriousness of an issue. Keep in mind that two factors affect seriousness: the impact when it happens (here it is "quite bad"), and the chance that it happens (here it is "quite low", since it requires you to pickle very long string objects, which only few people ever attempt). So these two cancel each other out, in some form.

That said, I certainly agree that it needs to be fixed. AFAICT, the issue is that save_string uses "int" for size and len, when it should use Py_ssize_t. In addition, it shouldn't check against INT_MAX but against 0x7fffffff, since INT_MAX might be 2**63-1 on systems where int is a 64-bit type - but that should not be a problem on your system. I believe the bug exists in more places, e.g. when saving BINUNICODE.

Also, AFAICT, this shouldn't be a problem for 3.x, which already checks for overflow.

Then, AFAICT, there is a glitch in the BINUNICODE handling of 3.x, which rejects strings longer than 0xffffffff, when the maximum supported length really is 0x7fffffff.
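The length field in question can be seen directly in a pickle stream. A small sketch (Python 3 stdlib only) that locates the protocol-2 BINUNICODE opcode ('X') and reads its 4-byte unsigned length field, the field whose 0xffffffff vs 0x7fffffff bounds are discussed above:

```python
# Sketch: inspect the BINUNICODE opcode in a protocol-2 pickle. Its
# length is stored as a 4-byte little-endian *unsigned* int ('<I').
import pickle
import pickletools
import struct

data = pickle.dumps("hello", protocol=2)
ops = [op.name for op, arg, pos in pickletools.genops(data)]
print(ops)  # BINUNICODE appears among the opcodes

idx = data.index(b'X')                          # 'X' == BINUNICODE
(length,) = struct.unpack('<I', data[idx + 1:idx + 5])
print(length)  # 5: the UTF-8 byte length of "hello"
```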
msg254674 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-14 23:41
Since issue13555 is fixed, I think this issue is fixed too.
History
Date User Action Args
2022-04-11 14:57:33  admin             set     github: 59709
2015-11-26 17:20:08  serhiy.storchaka  set     status: pending -> closed
                                               resolution: out of date
                                               stage: resolved
2015-11-14 23:41:17  serhiy.storchaka  set     status: open -> pending
                                               nosy: + serhiy.storchaka
                                               messages: + msg254674
                                               type: crash -> behavior
2012-08-06 08:43:11  tshepang          set     nosy: + tshepang
2012-07-30 18:27:55  Arfrever          set     nosy: + Arfrever
2012-07-30 16:44:12  loewis            set     nosy: + loewis
                                               messages: + msg166911
2012-07-30 15:43:47  Philipp.Lies      create