classification
Title: provide a documented serialization func
Type: enhancement
Stage:
Components: None
Versions:
process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To:
Nosy List: gpolo, gvanrossum, loewis, phr, skip.montanaro, tim.peters
Priority: normal
Keywords:

Created on 2001-10-03 02:25 by phr, last changed 2008-03-21 21:53 by gvanrossum. This issue is now closed.

Messages (29)
msg53262 - (view) Author: paul rubin (phr) Date: 2001-10-03 02:25
It would be nice if there was a documented library
function for serializing Python basic objects
(numbers, strings, dictionaries, and lists).
By documented I mean the protocol is specified in
the documentation, precisely enough to write
interoperating implementations in other languages.

Code-wise, the marshal.dumps and loads functions do
what I want, but their data format is (according to the
documentation) intentionally not specified, because
the format might change in future Python versions.
Maybe that doc was written long enough ago that it's
ok to freeze the marshal format now, and document it?
I just mean for the basic types listed above.  Stuff
like code objects don't have to be specified.  In
fact it would be nice if there was a flag to the
loads and dumps functions to refuse to marshal/
unmarshal those objects.

Pickle/cpickle aren't really appropriate for what I'm
asking, since they're complicated (they try to handle
class instances, circular structure, etc.) and anyway
they're not documented either.

The XDR library is sort of ok, but it's written in
Python (i.e. slow) and it doesn't automatically
handle compound objects.

Thanks

msg53263 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2001-10-06 00:10
Logged In: YES 
user_id=21627

So what's wrong with xmlrpclib?
msg53264 - (view) Author: paul rubin (phr) Date: 2001-10-12 05:12
Logged In: YES 
user_id=72053

I haven't looked at xmlrpclib, but I'm looking for
a simple, compact, binary representation, not something
that needs a complicated parser and expands the data by
an order of magnitude.
msg53265 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2001-10-12 09:39
Logged In: YES 
user_id=21627

Well, then I guess you need to specify your requirements
more clearly. XML-RPC was precisely developed to be
something simple for primitive types and structures that is
sufficiently  well-specified to allow interoperation between
various languages.

I don't see why extending the data 'by an order of
magnitude' would be a problem per se, nor do I see why
'requiring a complicated parser' is a problem if the
implementation already does all the unpacking for you under
the hood.

Furthermore, I believe it is simply not true that XML-RPC
expands the representation by an order of magnitude. For
example, the Python Integer object 1 takes 12 bytes in its
internal representation (plus the overhead that malloc
requires); the XML-RPC representation '<int>1</int>' also
uses 12 bytes.
In short, you need to say as precisely as possible what it is
that you want, or you won't get it. Also, it may be that you
have conflicting requirements (e.g. 'compact, binary', and
'simple, easily processible in different languages'); then
you won't get it either. For a marshalling format that is
accessible from different languages, you had better specify it
first, and implement it afterwards.
msg53266 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2001-10-12 14:33
Logged In: YES 
user_id=6380

Paul, I don't understand the application that you are
envisioning. If you think that the marshal format is what
you want, why don't you write a PEP that specifies the
format? That would solve the documentation problem.
msg53267 - (view) Author: paul rubin (phr) Date: 2001-10-12 19:29
Logged In: YES 
user_id=72053

I just want to be able to do convenient transfers of
python data to other programs including over the network.
XMLRPC is excessive bloat in my opinion.  Sending a number
like 12345678 should take at most 5 bytes (a type byte and
a 4-byte int) instead of <int>12345678</int>.  For long
ints (300 digits) it's even worse.
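
The size difference described above can be checked with a small sketch. It is written for Python 3 and assumes a hypothetical 5-byte layout (one type byte, then a 4-byte big-endian int); that layout is an illustration only, not the marshal format or any agreed protocol.

```python
import struct

value = 12345678

# Hypothetical layout (an assumption for illustration): one type byte
# followed by a 4-byte big-endian signed integer.
binary = struct.pack(">ci", b"i", value)

# The XML-RPC element text for the same integer, as quoted in the message.
xml = "<int>%d</int>" % value

binary_size = len(binary)            # bytes used by the packed form
xml_size = len(xml.encode("ascii"))  # bytes used by the XML-RPC element
```

Under these assumptions the packed form takes 5 bytes and the XML element 19, so the expansion for this value is roughly a factor of four.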

The marshal format is fine, and writing a PEP would solve
the doc problem, but the current marshal doc says the
non-specification is intentional.  Writing it in a PEP
means not just documenting--it means asking the language
maintainers to freeze the marshal format of certain types,
instead of reserving the right to change the format in
future versions.  Writing the PEP only makes sense if
you're willing to freeze the format for those types (the
other types can stay undocumented).  Is that ok with you?

Thanks
Paul
msg53268 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2001-10-12 19:37
Logged In: YES 
user_id=6380

If the PEP makes a reasonable case for freezing the spec,
yes.

I wonder why you can't use decimal? Are you talking really
large volumes? The PEP needs to motivate this with an
example, preferably plucked from real life!
msg53269 - (view) Author: paul rubin (phr) Date: 2001-10-12 20:16
Logged In: YES 
user_id=72053

Decimal is bad not just because of the data expansion but
because the arithmetic to convert a decimal string to binary
can be expensive (all that multiplication).  I'd rather use
hex than decimal for that reason.  One envisioned
application is communicating with a cryptography coprocessor: an
8-bit microcontroller (with a public key accelerator)
connected to the host computer through a slow serial port.
Most of the ints involved would be around 300 decimal
digits.
A simple binary format is a lot easier to deal with
in that environment than something like xmlrpc.  Also,
the format would be used for data persistence, so again,
unnecessary data expansion isn't desirable.

I looked at XMLRPC and it's not designed for this purpose.
It's intended as an RPC protocol over HTTP and isn't
well suited for object persistence.  Also, it doesn't
support integers over 32 bits, and binary strings must be
base64 encoded (more bloat).  Finally, it's not included
with Python, so I'd have to bundle an implementation
written in Python (i.e. slow) with my application (I don't
know whether Fred's implementation is Python or C).  I
think the marshal format hasn't changed since before
Python 1.5, so basing serialization on marshal would mean
applications could interoperate with older versions of
Python as well as newer ones, which helps Python's maturity.
(Maturity of a program means, among other things, that
users rarely need to be told they need the latest version
in order to use some feature).

Really, the marshal functions are written the way they're
written because that's the simplest and most natural way
of doing this kind of thing.  So the proposal is mainly
to make them available for user applications, rather than
only for system internals.
msg53270 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2001-10-12 20:24
Logged In: YES 
user_id=6380

This helps tremendously.

I think that marshal is probably overkill. Rather, you need
helper routines to convert longs to and from binary. You can
do everything else using the struct module, and it's
probably easier to write your own protocol using that and
these helpers. I suggest that the best place to add these
helpers is the binascii module, which already has a bunch of
similar things (e.g. hexlify and crc32).

Note that xmlrpclib is bundled with Python 2.2.

Looking forward to your patch (much simpler to get accepted
than a PEP :-).
msg53271 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-10-12 21:09
Logged In: YES 
user_id=31435

I'm not sure this is making progress.  Paul, if you want to 
use marshal, you already can:  the pack and unpack routines 
are exposed in Python via the marshal module.  Freezing the 
representation isn't a particularly appealing idea; e.g., 
if anyone is likely to complain about the speed of Python's 
longs, it's you <wink>, and the current marshal format for 
longs is just a raw dump of Python's internal long 
representation -- but the most obvious everything-benefits 
way to speed Python longs is to increase the size of 
the "digits" used in its internal representation.  If 
that's ever done, the marshal format would want to change 
too.

It's easy enough to code your own storage format for longs, 
e.g.

>>> def tobin(i):
...     import binascii
...     ashex = hex(long(i))[2:-1] # chop '0x' and trailing 'L'
...     if len(ashex) & 1:
...         ashex = '0' + ashex
...     return binascii.unhexlify(ashex)

implements "base 256" for unsigned longs, and the runtime 
cannot be improved by rewriting in C except by a constant 
factor (the Python spelling has the right O() behavior).
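
For readers on Python 3, where `long` and the trailing 'L' no longer exist, the same "base 256" idea can be sketched with the built-in `int.to_bytes` and `int.from_bytes`. This is a minimal sketch rather than the code from the message; `frombin` is a hypothetical inverse added for illustration.

```python
def tobin(i):
    """Return the unsigned big-endian base-256 encoding of a non-negative int."""
    if i < 0:
        raise ValueError("only non-negative integers are handled")
    # Always emit at least one byte so that 0 encodes as b"\x00".
    length = max(1, (i.bit_length() + 7) // 8)
    return i.to_bytes(length, "big")


def frombin(data):
    """Hypothetical inverse of tobin: decode unsigned big-endian bytes."""
    return int.from_bytes(data, "big")
```

For example, `frombin(tobin(10 ** 300))` gives back `10 ** 300`. Neither function records a length or a sign, so a real wire format would still need framing around the bytes.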
msg53272 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2001-10-12 21:41
Logged In: YES 
user_id=44345

If you head in the direction of documenting marshal with the aim of potentially interoperating with other languages, I think it would be a good idea to create a Python-independent marshal library. This would facilitate incorporation into other languages.  Such a library probably wouldn't be able to do everything marshal can (there isn't an obvious C equivalent of Python's dictionary object, for example), but would still help nail down compatibility issues for the basic scalar types.

msg53273 - (view) Author: paul rubin (phr) Date: 2001-10-13 03:08
Logged In: YES 
user_id=72053

Skip - C has struct objects which are sort of like Python
dictionaries.  XMLRPC represents structs as name-value
pairs, for example.  And "other languages" doesn't
necessarily mean C.  The marshaller should be able to
represent the non-Python-specific serializable objects,
not just scalars.
Basically this means strings, integers (of any length),
dictionaries, lists, and floats (hmm--unicode?), but
not necessarily stuff like code objects.

Having an independent marshal library is ok, I guess,
though I don't feel it's necessary to create more
implementation work.  And one benefit of using
the existing marshaller is that it's already available in
most versions of Python that people are running
(Red Hat 7.1 still comes with Python 1.5 for example).

Tim - yes, I originally used a binascii.hexlify hack
similar to yours and it worked ok, but it was ugly.  I
also had to handle strings (generate a length count
followed by the contents) and then dictionaries (name-value
pairs) and finally felt I shouldn't need to rewrite the
marshaller like that.  There's already a built-in library
function that does everything I need, very efficiently in
native code, in one call, and being able to use it is in
the "batteries included" spirit.  

Also, the current long int marshalling format
is just a digit count (16-bit digits) followed by the digits
in binary.  If the digit width changes, the marshalling
format doesn't have to change--the marshalling code should
still be able to use the same external representation 
without excessive contortions and without slowing down.
(You'll see that it's already not a simple memory dump,
but a structure read and written one byte at a time through
layers of subroutines).  Changing widths while keeping the
old format means putting a minor kludge in the marshalling
code, but no user will ever notice it.

As for the speed of Python longs,
my stuff's runtime is dominated by modular exponentiations
<wink> and I'm already using gmpy for those when it's 
available (but I don't depend on it).  The speedup with
gmpy is substantial, but the speed with ordinary Python
longs is quite acceptable on my PIII (the StrongARM is
another story--probably the C compiler's fault).

Examining Python/marshal.c, I don't see any objects of
the types I've mentioned that are likely to need to change
representations--do you?  

Btw I notice that the pickle module represents long ints
as decimal strings even in "binary" mode, but I'll resist
opening another bug for that, for now.
msg53274 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-10-13 03:43
Logged In: YES 
user_id=31435

The marshal long format actually uses 15-bit digits, each 
*stored* in 16 bits (the high bit of the high byte of which 
is always 0).  That would be a PITA to preserve even if 
Python just moved to 16-bit digits.  marshal's purpose is 
for efficient loading of .pyc files, where that odd format 
makes good sense; since it wasn't designed to be a general-
purpose data transmission format (and has many shortcomings 
for such use), I don't want to see a tail wagging the dog 
here.
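
The "15-bit digits stored in 16 bits" layout can be illustrated with a short Python 3 sketch. The helper names are hypothetical, and the little-endian digit and byte order is an assumption made for the illustration; the sketch does not claim to reproduce marshal's output byte for byte.

```python
import struct

SHIFT = 15                 # bits of payload per digit
MASK = (1 << SHIFT) - 1


def to_15bit_digits(n):
    """Split a non-negative int into 15-bit digits, least significant first."""
    if n < 0:
        raise ValueError("only non-negative integers are handled")
    digits = []
    while n:
        digits.append(n & MASK)
        n >>= SHIFT
    return digits


def pack_digits(digits):
    """Store each 15-bit digit in a 16-bit little-endian slot (assumed order)."""
    return struct.pack("<%dH" % len(digits), *digits)


def from_15bit_digits(digits):
    """Rebuild the integer from its 15-bit digits."""
    n = 0
    for d in reversed(digits):
        n = (n << SHIFT) | d
    return n
```

Because every digit is below 2**15, the top bit of each 16-bit slot is always zero, which is the wasted bit the message refers to.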

Cross-release compatibility is taken seriously in pickle, 
and pickle handles many more cases than marshal, although 
pickle's author (as you've discovered) didn't give a hoot 
about efficient storage of longs.  I'd rather add an 
efficient long format to pickle than hobble marshal 
(although because pickle does take x-release compatibility 
seriously, it has to continue accepting the "longs as 
decimal strings" format forever).
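
Later pickle protocols did gain a binary encoding for long integers (protocol 2, introduced with Python 2.3), while the text protocol keeps the decimal form. A minimal Python 3 sketch comparing the two sizes, with an arbitrary example value:

```python
import pickle

n = 3 ** 600  # an arbitrary integer of a few hundred decimal digits

# Protocol 0 is the original text protocol and stores the int as decimal text.
text_pickle = pickle.dumps(n, protocol=0)

# Protocol 2 stores large ints as length-prefixed binary.
binary_pickle = pickle.dumps(n, protocol=2)

text_size = len(text_pickle)
binary_size = len(binary_pickle)
```

Both forms load back to the same value, and the binary one is well under half the size for an integer of this magnitude.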
msg53275 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2001-10-13 08:14
Logged In: YES 
user_id=21627

I would have never guessed that arbitrarily long ints are 
a requirement in your application...
For that application, I'd recommend to use ASN.1 BER as a 
well-documented, efficient, binary marshalling format. I 
don't think any other format marshals arbitrarily large 
integers in a more compact form. You can find an 
implementation of that in

http://www.cnri.reston.va.us/software/pisces/

or

http://www.enteract.com/~asl2/software/PyZ3950/asn1.py

or

http://sourceforge.net/projects/pysnmp (ber.py)

I'd be in favour of having a BER support library in the 
Python core, but somebody would have to contribute such a 
thing, of course.
msg53276 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-10-13 08:39
Logged In: YES 
user_id=31435

Martin, Paul suggested BER previously in 465045.  I suspect 
he's going to suggest this for every module one by one, 
until somebody bites <wink>.  I doubt he wants genuine 
ASN.1 BER, though, as that's a complicated beast, and he 
only cares about ints with a measly few hundred bits; 
regardless, a Python long can't have more digits than can 
be counted in a C int.
msg53277 - (view) Author: paul rubin (phr) Date: 2001-10-13 10:24
Logged In: YES 
user_id=72053

1) if Python longs are currently implemented as vectors
of 15-bit digits (yikes--why on earth would anyone do that)
and marshalled like that, then I agree that THAT much
weirdness doesn't need to be propagated to future versions.
Wow!  I never looked at the long int code closely, but
the marshal code certainly didn't reflect that. It's still
possible to freeze the current marshal format and let
future versions define a new mechanism for loading .pyc's.
From my own self-interest (of wanting to distribute apps
that work across versions) that idea attracts me, but it's
probably not the right thing in the long run.  Better may
be to fix the long int format right away and THEN document/
freeze it.  (Use a new format byte so the 2.2 demarshaller
can still read 2.1 .pyc files).  By "fix" I mean use a
simple packed binary format, no 15 bit digits, no BER, and the
length prefix should be a byte or bit count, not multibyte
"digits".  

2) Unfortunately it's not easy in portable C with 32 bit
longs to use digits wider than 16 bits--multiplication
becomes too complicated.  If the compiler supports wide
ints (long long int) then conditionalized code to use them
might or might not be deemed worthwhile.  Python's long int
arithmetic (unlike Perl's Math::BigInt class) is fast enough
to be useable for real applications and I don't expect it
to go to the extremes that gmpy does (highly tuned algorithms
for everything, asm code for many cpu's, etc).  So currently
I use gmpy when it's available and fall back on longs if
gmpy won't import--this works pretty well so far.

3) I like the idea of a BER/DER library for Python but I
don't feel like being the guy who writes it.  I'd probably
use it if it was available, though maybe not for this 
purpose.  (I'd like to handle X509 certificates in Python).
BER really isn't the most efficient way to store long ints,
by the way, since it puts just 7 useful bits in a byte.

4) My suggestion of BER in 465045 was motivated slightly
differently, which was to add a feature from Perl's pack/
unpack function that's missing from Python's struct.pack/
unpack.  I understand a little better now what the struct
module is for, so binascii may be a better place for such
a thing.  However, I believe Python really needs a 
pack/unpack module that does all the stuff that Perl's
does.   Data conversion like that is an area where
Perl is still beating Python pretty badly.  (Again, I
don't feel like being the one who writes the module).

5) Sorry I didn't notice Guido's post of 20:24 earlier
(several arrived at once).  I guess I'm willing to submit
a patch for binascii to read and write longs in binary.
It's slightly humorous to put it in binascii since it's
a binary-binary conversion with no ascii involved, but
the function fits well there in any case.  I'd still rather
use marshal, since I want to write out more kinds of data
than longs, and with a long->binary conversion function
I'd still need to supply Python code to traverse
dictionaries and lists, and encode strings.  Btw, the
struct module doesn't have any way to encode strings
with a length, except the Pascal format which is limited
to 256 bytes and is therefore useless for many things.
msg53278 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-10-13 19:07
Logged In: YES 
user_id=31435

I don't buy the argument that pickle is "complicated", as 
you weren't going to document the parts of the marshal 
format you didn't care about either.  A subset of pickle is 
just as easy to document and implement across languages as 
a subset of marshal, but with the key benefit that the 
pickle format is stable across releases.  So if you want a 
structure packer, pickle is the obvious choice; it just 
lacks an efficient (in time and space) scheme for storing 
longs now.  And unlike marshal, it isn't a dead end when 
you decide your app needs something fancier -- pickle 
already handles just about everything that *can* be 
pickled, and is designed to be extensible to user-defined 
types too, so you can painlessly expand your view of what 
the "interesting" subset is as your ambitions grow.

I don't really know what you mean by "BER".  The ASN.1 std

<http://www.itu.int/ITU-T/studygroups/com17/languages/X.690_1297.pdf>

section 8.3 is quite clear that all 8 bits are used in each 
byte for integer representations -- it's a giant 2's-comp 
integer, with a variable-length length prefix, redundant 
sign bytes are forbidden, and there's nothing special about 
the last byte.  I agree with Martin that ASN.1 BER is as 
compact a standardized bigint representation as there is.
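
The INTEGER rules summarized above (minimal two's-complement contents, all 8 bits of each byte used, variable-length length prefix) can be sketched in Python 3 as follows. The function name is hypothetical and only the primitive INTEGER encoding is covered, not the rest of BER.

```python
def ber_encode_integer(value):
    """Encode an int as a BER INTEGER: tag 0x02, length, two's-complement bytes."""
    # Smallest number of bytes that holds the value plus a sign bit,
    # so no redundant leading 0x00 or 0xFF byte is produced.
    magnitude = value if value >= 0 else ~value
    length = magnitude.bit_length() // 8 + 1
    content = value.to_bytes(length, "big", signed=True)

    if length < 0x80:
        # Short form: a single length byte.
        header = bytes([0x02, length])
    else:
        # Long form: 0x80 | count of length bytes, then the length itself.
        length_bytes = length.to_bytes((length.bit_length() + 7) // 8, "big")
        header = bytes([0x02, 0x80 | len(length_bytes)]) + length_bytes
    return header + content
```

For instance, 128 needs a leading zero byte to stay positive and encodes as `02 02 00 80`, while -128 fits in one content byte and encodes as `02 01 80`.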
msg53279 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2001-10-13 21:04
Logged In: YES 
user_id=21627

7-bit vs. 8-bit: You were confusing tag encoding and 
INTEGER value encoding.

no way to encode a string with a length: suppose you want 
a 32 bit length, what's wrong with
struct.pack("l",len(s))+s
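
A Python 3 sketch of that length-prefix idea, with a matching reader. The helper names and the choice of a little-endian unsigned 32-bit length are assumptions for illustration; the one-liner above uses native "l" instead, whose size and byte order depend on the platform.

```python
import struct


def pack_string(s):
    """Prefix a bytes object with its length as a little-endian unsigned 32-bit int."""
    return struct.pack("<I", len(s)) + s


def unpack_string(buf, offset=0):
    """Read one length-prefixed string from buf; return (string, next_offset)."""
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    end = start + length
    if end > len(buf):
        raise ValueError("buffer is shorter than the declared length")
    return buf[start:end], end
```

Several strings can be concatenated and read back in sequence by feeding each returned offset into the next `unpack_string` call.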

15-bit representation: I believe the add and sub 
implementations make use of the guarantee that a short 
won't overflow if the input fits into 15 bits.

Tim: BER = Basic Encoding Rules (as in the subtitle of 
X.690)

Even after all this discussion, I still cannot see why the 
existing libraries (including those offered for free by 
third parties) are not sufficient. It appears that Paul 
wants, among other things, that marshal becomes 
documented; it also appears that this won't happen. 

What the other things are that Paul wants, I cannot tell, 
so I recommend to close this report with "won't fix". 

Paul, if you have a specific change that you want to be 
made, or a specific problem that you want to point out, 
please submit a new report. This issue "provide a 
documented serialization func" really ought to be closed 
as "Fixed"; xmlrpclib is already part of standard library 
and fits the original problem description:
- it is a library for serializing Python basic objects
- it is documented in the sense that the protocol is 
specified precisely enough to write interoperating 
implementations in other languages.
msg53280 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-10-13 22:25
Logged In: YES 
user_id=31435

Martin, you're right about add and sub, but it's a shallow 
assumption easy to relax (basically just declare 
carry/borrow as twodigits instead of digit).  I'd be more 
worried about the stwodigits type, but since nothing is 
actually broken here I'm not keen to fritter away time 
proving bounds on the temps in bigint division.  How we 
implement bigints internally is off topic anyway (provided 
we're not trying to hijack internal implementation formats 
for unintended purposes).

About BER, yes, and the URL I included is to a freely 
downloadable copy of the X.690 std; section 8.3 spells out 
the INTEGER rules.  They aren't at all the rules Paul 
sketched, hence "I don't really know what you [Paul] mean 
by 'BER'".

For the rest, while xmlrpclib may meet the letter of what 
Paul asked for at first, it's clear to me that it doesn't 
meet what he really wants.  My suggestion remains to add a 
new, efficient bigint format to pickle, which would meet 
everything except Paul's desire to have a special gimmick 
limited to his specific application and without having to 
write one himself.  The internal API functions 
_PyLong_AsByteArray and _PyLong_FromByteArray already do 
the heavy lifting in both directions (to or from base 256, 
unsigned or complemented, big- or little-endian).
msg53281 - (view) Author: paul rubin (phr) Date: 2001-10-13 23:29
Logged In: YES 
user_id=72053

A pickle subset ("gherkin"?) could possibly also fill this
need, if it was documented, even though pickle format is
considerably more complicated than marshal format (it uses
marshal.dumps for binary output, actually taking apart the
marshalled strings).  It was obvious in seconds how
marshal.c works but after 30 minutes of looking at pickle.py
I'm still not sure I understand it.  It looks like
the unpickler can construct arbitrary class instances and
import arbitrary modules, which makes a security hole
if the pickled strings are potentially hostile, but
I might not be reading it right.  Also, the unpickler
must implement constant folding (the memo scheme), which
complicates it somewhat, though it's not that bad.

The idea of leaving the marshal formats of some Python-
specific objects undocumented isn't to get out of
documenting stuff, but to leave those formats open to later
change.  

Re BER/DER, Burt Kaliski's "Layman's Guide" is pretty
readable (http://borg.isc.ucsb.edu/aka/Auth/ASN1layman.htm).
You're right about using all 8 bits in BER integers--it
looks like the 7 bit representation is only used for OID
components (I didn't realize that til checking on it just
now).  BER might be ok for what I'm doing--I'm not sure
right now since I don't understand ASN1 that well.  It looks
not in the spirit of marshal/pickle though: to encode a
compound object it looks like you need an ASN1 spec of
EXACTLY what you expect to find in the object.
msg53282 - (view) Author: paul rubin (phr) Date: 2001-10-14 00:49
Logged In: YES 
user_id=72053

I agree with Tim that the internal implementation of long
arithmetic isn't relevant to this--it was just surprising,
and means the current marshal format isn't all that natural
for external use.

I don't have a particular agenda to get marshal documented,
beyond that it would happen to solve my immediate problem.
Alternatives are fine too.  The ones suggested so far
just don't seem to do the job, viz.:

xmlrpc does NOT serialize basic Python objects--in
particular it doesn't serialize integers longer than 32
bits.  I can't consider using pickle until I've convinced
myself that it doesn't make security holes, and so far it
looks like the opposite.  (Can someone tell me I'm not
reading it right?).
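
The 32-bit limit can be observed with Python 3's `xmlrpc.client`, the successor of xmlrpclib. This is a sketch of the behavior of that module, not of the 2001 library discussed in the thread.

```python
import xmlrpc.client

# A small integer is marshalled as an <int> element.
small = xmlrpc.client.dumps((1,))

# An integer outside the signed 32-bit range is refused by the marshaller.
try:
    xmlrpc.client.dumps((2 ** 40,))
    big_int_rejected = False
except OverflowError:
    big_int_rejected = True
```

A common workaround is to send such values as strings or base64 data, which is the extra encoding step being objected to here.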

Yes, of course, it's not that difficult to write Python
code to do everything I want.  It's just surprising that
I should need to do that.  I mean, imagine if there was
no integer addition function (no "+" operator) and the
maintainers said "that's ok, to add a and b, just use
'a - (-b)'".  It's not a showstopping obstacle, but I'm
surprised to get so much grief for suggesting making the
operation more convenient, since it's an obvious thing
to want to do (as evidenced by there already being so many
overlapping serialization functions: marshal, pickle, 
rpclib, the Serialization class from Vaults of Parnassus,
three different ASN1 implementations you mentioned, etc).

I can't see anywhere where I've requested a "special
gimmick".  Yes, an efficient bigint representation in pickle
is nice and ought to be added, but I can live without it.
I can NOT live with security holes, but wanting security
shouldn't be considered a special gimmick!  With binary
bigints, a documented format, and a way to 100% stop the
unpickler from ever calling eval or apply on untrusted
data, I wouldn't mind using pickle despite its additional
complexity compared to marshal. 
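
In Python 3 the documented hook for limiting what an unpickler may construct is to override `Unpickler.find_class`. The sketch below refuses every global lookup, so only built-in containers and scalars can be loaded; the class and function names are hypothetical. This narrows the attack surface but should not be read as a security guarantee, since the pickle documentation still warns against loading untrusted data.

```python
import io
import pickle


class NoGlobalsUnpickler(pickle.Unpickler):
    """Unpickler that refuses to import any module or look up any class."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(
            "global %s.%s is forbidden" % (module, name)
        )


def restricted_loads(data):
    """Load a pickle made only of basic types; reject anything needing a global."""
    return NoGlobalsUnpickler(io.BytesIO(data)).load()
```

With this in place, `restricted_loads(pickle.dumps({"a": [1, 2 ** 100, "x"]}))` returns the dictionary, while a pickle that references any class raises `pickle.UnpicklingError` before an instance is created.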

I don't want to depend on third party modules unless I 
bundle them with my application (again not a showstopper,
but it's not in Python's "batteries included" spirit to need
them at all).  Telling a user "to run this app, first
download modules vreeble from <url1> and frob from <url2>"
where url1 and url2 usually turn out to be broken links
by the time the user sees them is not the right way to
distribute an app.  (It happens I'm going to sometimes
tell the user "to run this app, first destroy your
handheld computer's OS by reflashing the firmware..."
but it's the principle of the thing, you know).

Anyway, the 15 bit bigint representation is reason enough
to not want to freeze the current marshal format.
Maybe a future marshaller can use a cleaner bigint format
and at that point perhaps the issue can be revisited.
msg53283 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-10-14 20:36
Logged In: YES 
user_id=31435

Ack -- Paul, you add a new hitherto secret <wink> 
requirement with each reply.

marshal isn't secure at all:  because its *purpose* is to 
load .pyc files, marshal creates Python code objects out of 
any bytes you happen to feed it following a "code object" 
tag.  That's a hole big enough to swallow the solar 
system.  In 2.2, marshal refuses to unpack code objects in 
restricted execution mode, but not before 2.2, and it never 
refuses in unrestricted mode.

In contrast, pickle doesn't know anything about code 
objects, so doesn't have this hole.  The pickle docs are 
clear about this, too, spelling out that marshal's code-
object abilities create "the possibility of smuggling 
Trojan horses into a program".

When wondering about security, you should be looking at 
(and using) cPickle.c instead of pickle.py; cPickle doesn't 
use marshal at all, nor does it do eval()s etc.  Yes, it 
can reconstruct pickled instances of classes that already 
exist, but it cannot create new classes.  I haven't heard 
that characterized as an insecurity before, but to each his 
own level of discomfort.

I want to go back to the start:  if the question is whether 
Python is interested in documenting another data 
transmission format, my answer is no.  There are many 
already (don't forget the ones from the CORBA and ILU 
worlds either) available from Python, and there's no reason 
to believe encoders/decoders for a Python-specific format 
would get implemented in any other language.

pickle is Python's generic answer to the Python-specific 
serialization question.  I'd be happy to see patches to 
improve it (whether for efficient longs, or some stricter 
notion of security, or even just docs).  But I expect any 
additional Python-specific serialization scheme has an 
audience of one (if you disagree, fine, write a PEP and get 
some community consensus).
msg53284 - (view) Author: paul rubin (phr) Date: 2001-10-14 22:27
Logged In: YES 
user_id=72053

My understanding of marshal (I better check it, but I did
mention the issue in the original request) is that it can
create code objects but it doesn't actually execute the code
in them.  My implementation currently uses marshal but
checks that the stuff marshal returns doesn't contain
anything unexpected.  Unpickle is different--it looks like
it can execute hostile code before the loads call ever
returns.  By the time you have a chance to check the result,
it's too late.  cPickle.c appears to work exactly the same
way (using eval and creating arbitrary instances, but maybe
not calling marshal) as pickle.py.
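
The "check what marshal returns" approach described above might look like the following Python 3 sketch. The function names and the exact set of allowed types are assumptions; the real implementation is not shown in this thread. Note that the marshal documentation warns the module is not hardened against malformed or malicious input, so a check after loading does not by itself make `marshal.loads` safe.

```python
import marshal

ALLOWED_SCALARS = (int, float, str, bytes)


def check_basic(obj):
    """Return obj if it contains only basic types, otherwise raise TypeError."""
    kind = type(obj)
    if kind in ALLOWED_SCALARS:
        return obj
    if kind is list:
        for item in obj:
            check_basic(item)
        return obj
    if kind is dict:
        for key, value in obj.items():
            check_basic(key)
            check_basic(value)
        return obj
    raise TypeError("unexpected type in marshalled data: %s" % kind.__name__)


def safe_loads(data):
    """Unmarshal data, then reject code objects and other unexpected types."""
    return check_basic(marshal.loads(data))
```

A marshalled code object is still constructed by `marshal.loads` here; it is rejected afterwards without ever being executed.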

It never would have occurred to me that the unpickler would
work that way (and I'm still not convinced I understand
it--I better try putting together a test to see if it's
really like that).  That's why I didn't notice the security
issue til we started discussing pickle and I actually looked
at the code.  I'm sorry if that sounds like I'm adding
requirements.  I'd have thought it would go without saying
that an important utility shouldn't have security holes.

I'm ok with using pickle if the doc and security concerns
are taken care of.  More efficient longs would be helpful
but they would break interoperability with old versions
and I can probably live without them.  It's really sad that
longs were shrugged off when the pickle binary format was
designed.  Now in order to have efficient longs, yet another
flag will have to be added to the constructor.

Btw, if the unpickle security issue is real (I'm still not
convinced!), I feel it should be treated as a major bug
and that an announcement should be sent out.  Unpickle 
already anticipates hostile pickled strings in the
non-binary format and checks for them (see
_is_secure_string) though I'd want to spend an hour
or two checking both the `...` code and the evaluator
before believing that _is_secure_string is really safe--
and even if it is, it's brittle.  But it looks like
object creation security is an area they didn't think about.

Basically I have nothing against pickle in principle, but
it has these (fixable) problems, and while marshal is 
straightforwardly written, both pickle implementations
are excessively clever and make me queasy.

Anyway, I can go along with the idea that the right solution
is to fix pickle--but at present, pickle looks like it's in
worse shape than marshal.
msg53285 - (view) Author: paul rubin (phr) Date: 2001-10-16 23:28
Logged In: YES 
user_id=72053

Tim has opened a doc bug for pickle/marshal security
issues as #471893.
msg53286 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2003-05-17 00:25
Logged In: YES 
user_id=357491

Is this still an issue?  If so, shouldn't this be made an RFE?
msg53287 - (view) Author: paul rubin (phr) Date: 2003-05-17 01:17
Logged In: YES 
user_id=72053

Yes, it's still an issue, even more than before since pickle
is now explicitly documented to NOT be ok to use with
untrusted data.

This is already classified as a feature request.  I don't
know if an RFE is something different than that.
msg53288 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2003-05-17 01:30
Logged In: YES 
user_id=357491

RFE is specifically a feature request.  I have gone ahead and reclassified this 
as such.
msg64277 - (view) Author: Guilherme Polo (gpolo) * (Python committer) Date: 2008-03-21 20:45
Sorry, but is the feature request related to constructing a safe
unpickler? If yes, then I suppose this issue should be closed and an
appropriate one be created.

Nevertheless, reading the following comment at pickletools.py (trunk)
makes me think this feature request won't be done, not in the pickle
module at least:

"Another independent change with Python 2.3 is the abandonment of any
pretense that it might be safe to load pickles received from untrusted
parties -- no sufficient security analysis has been done to guarantee
this and there isn't a use case that warrants the expense of such an
analysis."
msg64285 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-03-21 21:53
There isn't anything actionable in this bug request.  It makes much more
sense to start a discussion about requirements etc. on python-ideas.
History
Date                 User          Action  Args
2008-03-21 21:53:57  gvanrossum    set     status: open -> closed; resolution: out of date; messages: + msg64285
2008-03-21 20:45:48  gpolo         set     nosy: + gpolo; messages: + msg64277
2007-10-07 23:19:04  brett.cannon  set     nosy: - brett.cannon
2001-10-03 02:25:40  phr           create