classification
Title: Update pickle docs to describe format of persistent IDs
Type: enhancement Stage: needs patch
Components: Documentation Versions: Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: alexandre.vassalotti Nosy List: alexandre.vassalotti, amc1, loewis, smst, tim.peters
Priority: normal Keywords: easy

Created on 2004-05-18 22:45 by amc1, last changed 2009-05-31 00:15 by alexandre.vassalotti. This issue is now closed.

Files
File name Uploaded Description
pickle_test.py amc1, 2004-05-18 22:45
Messages (8)
msg20838 - (view) Author: Allan Crooks (amc1) Date: 2004-05-18 22:45
There is a bug in save_pers in both the pickle and
cPickle modules in Python.

It occurs when a Pickler instance uses an ASCII protocol
and has persistent_id defined in such a way that it can
return a persistent ID containing newline characters.

The current implementation of save_pers in the pickle
module is as follows:

----

    def save_pers(self, pid):
        # Save a persistent id reference
        if self.bin:
            self.save(pid)
            self.write(BINPERSID)
        else:
            self.write(PERSID + str(pid) + '\n')

----

The else clause assumes that 'pid' will not be a
string which contains one or more newline characters.

If the pickler pickles a persistent ID which has a
newline in it, then an unpickler with a corresponding
persistent_load method will incorrectly unpickle the
data - usually interpreting the character after the
newline as a marker indicating what type of data should
be expected (usually resulting in an exception being
raised when the remaining data is not in the format
expected).
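
A minimal sketch of the failure, assuming Python 3 (where pickle streams are bytes). The names `Ref`, `NewlinePickler`, `RecordingUnpickler` and `demo` are hypothetical and are not taken from the attached file. The exception type is deliberately not named, since it may differ between the C and pure-Python implementations.

```python
import io
import pickle


class Ref:
    """Stand-in for an object stored outside the pickle (hypothetical)."""

    def __init__(self, pid):
        self.pid = pid


class NewlinePickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Return the raw ID for Ref objects; None means "pickle normally".
        return obj.pid if isinstance(obj, Ref) else None


class RecordingUnpickler(pickle.Unpickler):
    def __init__(self, file):
        self.seen = []  # persistent IDs handed to persistent_load
        super().__init__(file)

    def persistent_load(self, pid):
        self.seen.append(pid)
        return Ref(pid)


def demo(pid="first\nsecond"):
    """Pickle a Ref with protocol 0 and try to load it back."""
    buf = io.BytesIO()
    NewlinePickler(buf, protocol=0).dump(Ref(pid))
    stream = buf.getvalue()

    unpickler = RecordingUnpickler(io.BytesIO(stream))
    try:
        unpickler.load()
        failed = False
    except Exception:
        # The text after the embedded newline was parsed as opcodes.
        failed = True
    return stream, unpickler.seen, failed
```

With the default argument, the persistent ID is cut at the embedded newline: `persistent_load` only ever sees `'first'`, and the remainder of the ID is read as pickle opcodes.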

I have attached an example file which illustrates in
what circumstances the error occurs.

Workarounds for this bug are:
  1) Use binary mode for picklers.
  2) Modify subclass implementations of persistent_id to
ensure that they never return persistent IDs containing newlines.
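
The second workaround might look like the sketch below, assuming Python 3 and raw IDs that are byte strings. The ID is hex-encoded before it is handed to the pickler, so only the characters 0-9 and a-f reach the stream. `Ref`, `EncodingPickler`, `DecodingUnpickler` and `round_trip` are hypothetical names used for illustration.

```python
import io
import pickle


class Ref:
    """Stand-in for an externally stored object (hypothetical)."""

    def __init__(self, raw_id):
        self.raw_id = raw_id  # arbitrary bytes, may contain b"\n"


class EncodingPickler(pickle.Pickler):
    def persistent_id(self, obj):
        if isinstance(obj, Ref):
            # Hex digits only: no newline can end up in the protocol 0 line.
            return obj.raw_id.hex()
        return None


class DecodingUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        # Undo the hex encoding applied in persistent_id.
        return Ref(bytes.fromhex(pid))


def round_trip(raw_id, protocol=0):
    """Pickle a Ref and return the raw ID recovered on load."""
    buf = io.BytesIO()
    EncodingPickler(buf, protocol=protocol).dump(Ref(raw_id))
    buf.seek(0)
    return DecodingUnpickler(buf).load().raw_id
```

The first workaround needs no extra code beyond passing a binary protocol number; the same pair of classes also round-trips with `protocol=2`.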

Although you may assume in general that this bug would
only occur on rare occasions (due to the unlikely
situation where someone would implement persistent_id
so that it would return a string with a newline
character embedded), it may occur more frequently if
the subclass implementation of persistent_id uses a
string which has been constructed using the marshal module.

This bug was discovered when our code implemented the
persistent_id method, which was returning the
marshalled format of a tuple which contained strings.
It occurred when one or more of the strings had a
length of ten characters - the marshalled format of
that string contains the string's length, where the
byte used to represent the number 10 is the same as the
one which represents the newline character:

>>> import marshal
>>> marshal.dumps('a' * 10)
's\n\x00\x00\x00aaaaaaaaaa'
>>> chr(10)
'\n'
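
The arithmetic behind this can be checked with a short sketch, assuming Python 3 (the session above is Python 2, and the exact marshal output differs between versions). A length of 10 packed as a little-endian integer starts with the byte 0x0a, which is the newline character.

```python
import marshal
import struct

# A length of 10 as a 4-byte little-endian integer: the first byte is 0x0a.
length_field = struct.pack("<i", 10)

# The marshalled form of a 10-byte string therefore contains a newline byte,
# even though the string itself has none.
has_newline = b"\n" in marshal.dumps(b"a" * 10)
```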

I have replicated this bug on Python 1.5.2 and Python
2.3b1, and I believe it is present on all 2.x versions
of Python.

Many thanks to SourceForge user (and colleague) SMST,
who diagnosed the bug and provided the attached test
cases.
msg20839 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-05-19 03:02
The only documentation is the "Pickling and unpickling 
external objects" section of the Library Reference Manual, 
which says:

"""
Such objects are referenced by a ``persistent id'', which is 
just an arbitrary string of printable ASCII characters.
"""

A newline is universally considered to be a control character, 
not a printable character (e.g., try isprint('\n') under your 
local C compiler).  So this is functioning as designed and as 
documented.  If you don't find the docs clear, we should call 
this a documentation bug.  If you think the semantics should 
change to allow more than printable characters, then this 
should become a feature request, and more is needed to 
define exactly which characters should be allowed.  The 
current implementation is correct for persistent ids that meet 
the documented requirement.
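
A check along the lines of C's isprint() could be sketched as follows. The helper name `is_printable_ascii` is hypothetical. Note that Python's `string.printable` is not a suitable substitute here, because it includes whitespace such as '\n'.

```python
import string


def is_printable_ascii(pid):
    """True if every character lies in 0x20..0x7e, the range for which
    C's isprint() returns true in the default "C" locale."""
    return all(0x20 <= ord(ch) <= 0x7E for ch in pid)


# string.printable counts '\n' as printable, unlike isprint().
NEWLINE_IN_STRING_PRINTABLE = "\n" in string.printable
```

An empty string passes this check trivially; whether an empty persistent ID is acceptable is a separate question that this sketch does not address.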
msg20840 - (view) Author: Steve Tregidgo (smst) Date: 2004-05-19 10:31
I'd overlooked that note in the documentation before, and in
fact developed the opposite view on what was allowed by
seeing that the binary pickle format happens to allow
persistent IDs containing non-printable ASCII characters.

Given that the non-binary format can represent strings
(containing any character, printable or not) by escaping
them, it seems unfortunate that the same escaping was not
applied to the saving of persistent IDs.
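
The escaping of ordinary strings can be observed with a small sketch, assuming Python 3 and protocol 0. The variable names are illustrative only.

```python
import pickle

text = "line one\nline two"
data = pickle.dumps(text, protocol=0)

# The newline inside the string is written in escaped form, so the raw
# two-line text never appears in the stream, yet it round-trips intact.
escaped = b"line one\nline two" not in data
restored = pickle.loads(data)
```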

It might be helpful if the documentation indicated that the
acceptance by the binary pickle format of strings without
restriction is not to be relied upon, underlining the fact
that printable ASCII is all that's allowed by the format.

Personally I'd like to see the restriction on persistent IDs
lifted in a future version of the pickle module, but I don't
have a compelling reason for it (other than it seeming to be
unnecessary). On the other hand, it seems to be a limitation
which hasn't caused much grief (if any) over the years...
perhaps such a change (albeit a minor one) in the
specifications should be left until another protocol is
introduced.
msg20841 - (view) Author: Allan Crooks (amc1) Date: 2004-05-19 15:30
I would at least like the documentation modified to make it
clearer that certain characters are not permitted in
persistent IDs. I think the text which states the
requirement of printable ASCII characters is too subtle.
Something should make the requirement more obvious: a
"must" or "should" would help get the point across, as would
a note after the statement saying that characters such as
'\b', '\n' and '\r' are not permitted.

Perhaps save_pers could do some argument checking before
storing the persistent ID, for example using an assertion
statement to verify that the ID doesn't contain
non-permitted characters (or at least checking for a '\n'
character embedded in the string).

I think it is preferable to have safeguards implemented in
Pickler to prevent potentially dodgy data being stored - I
would rather have an error raised when I'm trying to pickle
something than have the data stored and corrupted, only to
notice it when it is unpickled (when it is too late).
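
Such a safeguard can be approximated in a subclass without touching save_pers. The sketch below assumes Python 3; `Ref`, `CheckedPickler` and `dumps_checked` are hypothetical names, and the check is applied in persistent_id rather than inside the pickle module itself.

```python
import io
import pickle


class Ref:
    """Stand-in for an externally stored object (hypothetical)."""

    def __init__(self, pid):
        self.pid = pid


class CheckedPickler(pickle.Pickler):
    def __init__(self, file, protocol=0):
        # Remember whether the text protocol is in use before pickling starts.
        self._text_protocol = protocol == 0
        super().__init__(file, protocol)

    def persistent_id(self, obj):
        if not isinstance(obj, Ref):
            return None  # pickle everything else normally
        pid = obj.pid
        if self._text_protocol:
            ok = isinstance(pid, str) and all(
                0x20 <= ord(ch) <= 0x7E for ch in pid
            )
            if not ok:
                # Fail at pickling time instead of writing a corrupt stream.
                raise pickle.PicklingError(
                    "persistent ID %r is not printable ASCII" % (pid,)
                )
        return pid


def dumps_checked(obj, protocol=0):
    """Pickle obj with CheckedPickler and return the resulting bytes."""
    buf = io.BytesIO()
    CheckedPickler(buf, protocol).dump(obj)
    return buf.getvalue()
```

With this in place, `dumps_checked(Ref("a\nb"))` raises `pickle.PicklingError` under protocol 0, while the same ID is accepted under a binary protocol.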

Confusingly, the code in save_pers in the pickle module
seems to indicate that it would happily accept persistent
IDs which are not strings:

----
else:
    self.write(PERSID + str(pid) + '\n')
----

I don't understand why we are using the str function if we
are expecting pid to be a string in the first place. I would
rather that this method would raise an error if it tried to
perform string concatenation on something which isn't a string.

I agree with SMST: I would like the restriction on which
persistent IDs we can use removed. It seems somewhat
arbitrary - there's no reason, for example, why we can't
use any simple data type which can be marshalled as an ID.

Apart from the fact that it wouldn't be backward
compatible, which is probably a good enough reason. :)
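
For comparison, a binary protocol already accepts non-string IDs. The sketch below assumes Python 3 and uses a tuple as the persistent ID; `Ref`, `TuplePickler`, `TupleUnpickler` and `round_trip` are hypothetical names.

```python
import io
import pickle


class Ref:
    """Stand-in for an externally stored object (hypothetical)."""

    def __init__(self, key):
        self.key = key  # any picklable value, here a tuple


class TuplePickler(pickle.Pickler):
    def persistent_id(self, obj):
        return obj.key if isinstance(obj, Ref) else None


class TupleUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        return Ref(pid)


def round_trip(key, protocol=2):
    """Pickle a Ref and return the persistent ID seen on load."""
    buf = io.BytesIO()
    TuplePickler(buf, protocol=protocol).dump(Ref(key))
    buf.seek(0)
    return TupleUnpickler(buf).load().key
```

Under protocol 2 the tuple comes back unchanged, embedded newline included. Under protocol 0 the same ID is passed through str() and comes back as a string, which is the behaviour questioned above.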
msg20842 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-11-07 22:40
Unassigned myself (I don't have time for it), but changed the 
Category to Documentation.  (Changing what a persistent ID 
can be would need to be a new feature request.)
msg20843 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-07-03 12:41
Also lowering the priority. amc1, if you are still
interested, are you willing to provide a documentation patch?
msg58064 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2007-12-01 19:07
This should be fixed along with issue1536.
msg88590 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2009-05-31 00:15
The updated documentation for pickle for Python 3 describes the
requirement that persistent IDs should be alphanumeric strings when
protocol 0 is used.

http://docs.python.org/3.0/library/pickle.html#persistence-of-external-objects

Closing as fixed.
History
Date User Action Args
2009-05-31 00:15:54 alexandre.vassalotti set status: open -> closed
resolution: fixed
messages: + msg88590
2009-04-22 17:16:35 ajaksu2 set keywords: + easy
2009-02-14 13:55:07 ajaksu2 set stage: needs patch
type: enhancement
versions: + Python 2.6
2007-12-01 19:33:45 alexandre.vassalotti set dependencies: - pickle's documentation is severely outdated
2007-12-01 19:33:02 alexandre.vassalotti link issue1536 dependencies
2007-12-01 19:07:23 alexandre.vassalotti set assignee: alexandre.vassalotti
dependencies: + pickle's documentation is severely outdated
messages: + msg58064
nosy: + alexandre.vassalotti
2004-05-18 22:45:53 amc1 create