classification
Title: Pickle stream for unicode object may contain non-ASCII characters.
Type: enhancement
Stage:
Components: Documentation, Library (Lib)
Versions: Python 3.2
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: georg.brandl
Nosy List: alexandre.vassalotti, bronger, dddibagh, georg.brandl, joelpitt, lemburg, loewis, mawbid, pitrou, terry.reedy, wdoekes
Priority: normal
Keywords:

Created on 2008-05-27 15:38 by mawbid, last changed 2012-12-12 04:55 by joelpitt. This issue is now closed.

Messages (26)
msg67410 - Author: Haukur Hreinsson (mawbid) Date: 2008-05-27 15:38
I'm not sure if this is a functionality or documentation bug.

The docs say in section 13.1.2, Data stream format
(http://docs.python.org/lib/node315.html):
"By default, the pickle data format uses a printable ASCII representation."

I took that to mean that only ASCII characters ever appear in the pickle
output, but that's not true.

>>> print [ord(c) for c in pickle.dumps(u'á')]
[86, 225, 10, 112, 48, 10, 46]
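
A rough Python 3 counterpart of the session above, offered as a sketch rather than as part of the original report. It assumes protocol 0 is requested explicitly, since Python 3 defaults to a binary protocol; the variable names are illustrative.

```python
import pickle

text = '\u00e1'  # LATIN SMALL LETTER A WITH ACUTE

# Protocol 0 must be requested explicitly on Python 3.
data = pickle.dumps(text, protocol=0)

# The UNICODE opcode 'V' (86) is followed by the raw Latin-1 byte 225.
print(list(data))

# True when any byte falls outside the 7-bit ASCII range.
has_non_ascii = any(byte > 127 for byte in data)
print(has_non_ascii)
```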
msg67421 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-05-27 18:10
Only pickle protocol 0 is ASCII. The other two are binary protocols.

Protocol 2 is default in Python 2.5.

This should probably be made clear in the documentation, so I'd consider
this a documentation bug.
msg67422 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-05-27 18:20
Actually, I was wrong: protocol 0 is the default if you don't specify
the protocol.

This sets the binary flag to false, which should result in ASCII-only data.

The Unicode save routine uses the raw-unicode-escape codec, but this
only escapes non-Latin-1 characters and allows non-ASCII Latin-1
characters to pass through.
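
The codec behaviour described here can be checked directly. The following is a minimal Python 3 sketch; the sample characters are chosen for illustration only.

```python
# U+00E1 is in Latin-1: raw-unicode-escape passes it through as byte 0xE1.
latin1_encoded = '\u00e1'.encode('raw-unicode-escape')

# U+20AC (euro sign) is outside Latin-1: it is written as a \uXXXX escape.
euro_encoded = '\u20ac'.encode('raw-unicode-escape')

print(latin1_encoded)  # a single non-ASCII byte
print(euro_encoded)    # pure ASCII
```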

Not sure what to do about this: we can't change the protocol anymore and
all higher protocol levels are binary already.

Perhaps we just need to remove the ASCII note from the documentation
altogether and only leave the "human readable form" comment?
msg67425 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-05-27 19:12
I think the documentation is fine as it stands. The format is ASCII -
even though the payload might not be.
msg67432 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-05-27 21:13
I can't follow you, Martin. 

How can a data format be printable ASCII and at the same time use
non-ASCII characters ?
msg67434 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-05-27 22:21
> How can a data format be printable ASCII and at the same time use
> non-ASCII characters ?

The "format" is the frame defining the structure. In the binary
formatter, it's a binary format. In the standard pickle format,
it's ASCII (I for int, S for string, and so on, line-separated).
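
The distinction between the framing and the payload can be made visible with `pickletools.genops`, which yields one (opcode, argument, position) triple per instruction. A sketch for Python 3, with protocol 0 requested explicitly and illustrative variable names:

```python
import pickle
import pickletools

data = pickle.dumps('\u00e1', protocol=0)

# Collect (opcode name, opcode character, argument) for each instruction.
ops = [(op.name, op.code, arg) for op, arg, _pos in pickletools.genops(data)]
for name, code, arg in ops:
    print(name, repr(code), repr(arg))

# The opcode characters (the "frame") are all ASCII ...
frame_is_ascii = all(ord(code) < 128 for _name, code, _arg in ops)
# ... while the argument of the first opcode (the payload) is not.
payload_is_ascii = all(ord(ch) < 128 for ch in ops[0][2])
```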
msg67436 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-05-27 22:41
On 2008-05-28 00:21, Martin v. Löwis wrote:
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
>> How can a data format be printable ASCII and at the same time use
>> non-ASCII characters ?
> 
> The "format" is the frame defining the structure. In the binary
> formatter, it's a binary format. In the standard pickle format,
> it's ASCII (I for int, S for string, and so on, line-separated).

I think there's a misunderstanding there. The pickle version 0
output used to be 7-bit only for both type code and content.

While adding the Unicode support I must have forgotten about the
fact that raw-unicode-escape does not escape range(128, 256) code
points. Unfortunately, there's no way to fix this now, since the
bug has been around since Python 1.6.

That's why I think we should update the docs.
msg67437 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-05-27 22:56
> Unfortunately, there's no way to fix this now, since the
> bug has been around since Python 1.6.

Actually, there is a way to fix that: pickle could start
emitting \u escapes for characters in the range 128..256.
Older pickle implementations would be able to read that
in just fine.
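
That existing readers already accept such escapes can be checked with a hand-written protocol 0 stream. The byte string below is constructed for illustration and is not produced by any pickler; the check is run on Python 3.

```python
import pickle

# 'V' opens a UNICODE value, the payload uses a \u escape, '.' is STOP.
escaped_stream = b'V\\u00e1\n.'

# Every byte in this hand-written stream is 7-bit ASCII.
stream_is_ascii = all(byte < 128 for byte in escaped_stream)

# The reader decodes the escape back to the Latin-1 character.
value = pickle.loads(escaped_stream)
print(stream_is_ascii, value == '\u00e1')
```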
msg67631 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-06-02 08:40
We could add an extra step to also escape range(128, 256) code points,
but I don't think it's worth the performance loss this would cause.

Note that this was the first time anyone has ever noticed the fact that
the pickle protocol 0 is not pure ASCII - in 8 years. I think it's
better to just adapt the documentation and remove the "ASCII". The
important feature of protocol 0 is being human readable (to some
extent), not that it's pure ASCII.
msg75021 - Author: Dan Dibagh (dddibagh) Date: 2008-10-21 09:22
Your reasoning shows a lack of understanding of how Python is actually used
from a programmer's point of view.

Why do you think that "noticing" a problem is the same thing as entering
it as a Python bug report? In practice there are several steps between
noticing a problem in a Python program and entering it as a bug report
in the Python development system. It is very difficult to see why any of
these steps would happen automatically. Believe me, people have had real
problems due to this bug. They have just selected other solutions than
reporting it.

You are yourself reluctant to seek out the roots of this problem and fix
it. Why should other people behave differently and report it? A not so
uncommon "fix" to pickle problems out there is to not use pickle at
all. There are Python programmers who give the advice to avoid pickle
since "it's too shaky". It is a solution, but is it the solution you
desire?

The capability to serialize stuff into ASCII strings isn't just an
implementation detail that happens to be nice for human readability. It
is a feature people need for technical reasons. If the data is ASCII, it
can be dealt with in any ASCII-compatible context which might be network
protocols, file formats and database interfaces. There is the real use.
Programs depend on it to work properly.

The solution to change the documentation is in practice breaking
compatibility (which programming language designers normally try to
avoid or do in a very controlled manner). How is a documentation fix
going to help all the code out there written with the assumption that
pickle protocol 0 is always ASCII? Is there a better solution around
than changing pickle to meet actual expectations?

Well, nobody has reported it as a bug in 8 years. How long do you think
that code will stay around based on the ASCII assumption? 8 years? 16
years? 24 years? Maybe all the time in the world for this to become an
issue again and again and again?

It is difficult to grasp why there is "no way to fix it now". From a
programmer's point of view an obvious "fix" is to ditch pickle and use
something that delivers a consistent result rather than debugging hours.
When I try to see it from the Python library developers' point of view I
see code implemented in C which produces a result with reasonable
performance. It is perfectly possible to write the code which implements
the expected result within reasonable performance. What is the problem?

Perhaps it is the raw-unicode-escape encoding that should be fixed? I
failed to find exact information about what raw-unicode-escape means. In
particular, where is the information which states that
raw-unicode-escape is always an 8-bit format? The closest I've come is
PEP 100 and PEP 263 (which I notice is written by you guys), which
describes how to decode raw unicode escape strings from Python source
and how to define encoding formats for python source code. The sole
original purpose of both unicode-escape and raw-unicode-escape appears
to be representing unicode strings in Python source code as u' and ur'
strings respectively. It is clear that the decoding of a raw unicode
escaped or unicode escaped string depends on the actual encoding of the
Python source, but what is the logic by which, when something is _encoded_
into a raw unicode string, the target source must be in some 8-bit
encoding? Especially considering that the default Python source encoding
is ASCII. For unicode-escape this makes sense:

>>> f = file("test.py", "wb")
>>> f.write('s = u"%s"\n' % u"\u0080".encode("unicode-escape"))
>>> f.close()
>>> ^Z

python test.py (executes silently without errors)

But for raw-unicode-escape the outcome is a different thing:

>>> f = file("test.py", "wb")
>>> f.write('s = ur"%s"\n' % u"\u0080".encode("raw-unicode-escape"))
>>> f.close()
>>> ^Z

python test.py

  File "test.py", line 1
SyntaxError: Non-ASCII character '\x80' in file test.py on line 1, but
no encoding declared; see http://www.python.org/peps/pep-0263.html for
details

Huh? For someone who trusts the Standard Encodings section of the Python
Library Reference, this isn't what one would expect. If the documentation
states "Produce a string that is suitable as raw Unicode literal in
Python source code" then why isn't it suitable?
msg75022 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-10-21 09:46
On 2008-10-21 11:22, Dan Dibagh wrote:
> Your reasoning shows a lack of understanding of how Python is actually used
> from a programmer's point of view.

Hmm, I've been using Python for almost 15 years now and do believe
that I have an idea of how Python is being used.

Note that we cannot change the pickle format retroactively, since
this would break pickle data exchange between different Python versions
relying on the same format (but using different pickle implementations).

What we could do is add a new pickle format which then escapes all
non-ASCII data. However, people have been more keen on getting
compact and fast loading pickles than pickles in ASCII which is why
all new versions of the pickle format are binary formats, so I don't
think it's worth the effort.

Note that the common way of dealing with binary data in ASCII streams
is to use a base64 encoding and possibly also apply compression. The
pickle 0 format is really only useful for debugging purposes.

> Perhaps it is the raw-unicode-escape encoding that should be fixed? I
> failed to find exact information about what raw-unicode-escape means. In
> particular, where is the information which states that
> raw-unicode-escape is always an 8-bit format? The closest I've come is
> PEP 100 and PEP 263 (which I notice is written by you guys), which
> describes how to decode raw unicode escape strings from Python source
> and how to define encoding formats for python source code. The sole
> original purpose of both unicode-escape and raw-unicode-escape appears
> to be representing unicode strings in Python source code as u' and ur'
> strings respectively. 

Right.

> It is clear that the decoding of a raw unicode
> escaped or unicode escaped string depends on the actual encoding of the
> Python source, but what is the logic by which, when something is _encoded_
> into a raw unicode string, the target source must be in some 8-bit
> encoding? Especially considering that the default Python source encoding
> is ASCII. For unicode-escape this makes sense:
> 
>>>> f = file("test.py", "wb")
>>>> f.write('s = u"%s"\n' % u"\u0080".encode("unicode-escape"))
>>>> f.close()
>>>> ^Z
> 
> python test.py (executes silently without errors)
> 
> But for raw-unicode-escape the outcome is a different thing:
> 
>>>> f = file("test.py", "wb")
>>>> f.write('s = ur"%s"\n' % u"\u0080".encode("raw-unicode-escape"))
>>>> f.close()
>>>> ^Z
> 
> python test.py
> 
>   File "test.py", line 1
> SyntaxError: Non-ASCII character '\x80' in file test.py on line 1, but
> no encoding declared; see http://www.python.org/peps/pep-0263.html for
> details
> 
> Huh? For someone who trusts the Standard Encodings section of the Python
> Library Reference, this isn't what one would expect. If the documentation
> states "Produce a string that is suitable as raw Unicode literal in
> Python source code" then why isn't it suitable?

Because the raw-unicode-escape codec won't escape the \x80 character,
hence the name. As a result, the generated source code is not ASCII,
which is why you see the exception.
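
The difference between the two codecs for U+0080 can be seen side by side. A small Python 3 sketch with illustrative names:

```python
char = '\u0080'

# unicode-escape produces an ASCII-only escape sequence.
escaped = char.encode('unicode-escape')

# raw-unicode-escape leaves Latin-1 code points as single raw bytes.
raw = char.encode('raw-unicode-escape')

print(escaped, raw)
```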

But this is off-topic with respect to the issue in question.
msg75055 - Author: Dan Dibagh (dddibagh) Date: 2008-10-21 22:56
I am well aware why my example produces an error from a technical
standpoint. What I'm getting at is the decision to implement
PyUnicode_EncodeRawUnicodeEscape the way it is. Probably there is
nothing wrong with it, but how am I supposed to know? I read the PEP,
which serves as a specification of raw unicode escape (at least for the
decoding bit) and the reference documentation. Then I read the source
trying to map between specified behavior in the documentation and the
implementation in the source code. When it comes to the part which
causes the problem with non-ASCII characters, it is difficult to follow.

Or in other words: what is the high level reason why the codec won't
escape \x80 in my test program?

To use a real-world term; an interface specification, in this case the
pickle documentation, is the contract between the consumer of the
library and the provider of the library. If it states "ASCII", ASCII is
expected. If it doesn't state "for debugging only" it will be used for
non-debugging purposes. There isn't much you can do about it without
breaking the contract. 

What makes you think that the problem cannot be fixed without changing
the existing pickle format 0?
 
Note that base64 is "a common" way to deal with binary data in ascii
streams rather than "the common". (But why should I care when my data is
already ascii?)
msg75058 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-10-21 23:34
> I read the PEP,
> which serves as a specification of raw unicode escape (at least for the
> decoding bit) and the reference documentation.

Which PEP specifically? PEP 263 only mentions the unicode-escape
encoding in its problem statement, i.e. as a pre-existing thing.
It doesn't specify it, nor does it give a rationale for why it behaves
the way it does.

> Then I read the source
> trying to map between specified behavior in the documentation and the
> implementation in the source code. When it comes to the part which
> causes the problem with non-ASCII characters, it is difficult to follow.

What code are you looking at, and where do you find it difficult to
follow it? Maybe you get confused between the "unicode-escape" codec,
and the "raw-unicode-escape" codec, also.

> Or in other words: what is the high level reason why the codec won't
> escape \x80 in my test program?

The raw-unicode-escape codec? It was designed to support parsing of
Python 2.0 source code, and of "raw" unicode strings (ur"") in
particular. In Python 2.0, you only needed to escape characters above
U+0100; Latin-1 characters didn't need escaping. Python, itself, only
relied on the decoding direction. That the codec chooses not to escape
Latin-1 characters on encoding is an arbitrary choice (I guess); it's
still symmetric with decoding.

Even though the choice was arbitrary, you shouldn't change it now,
because people may rely on how this codec works.

> What makes you think that the problem cannot be fixed without changing
> the existing pickle format 0?

Applications might rely on what was implemented rather than what was
specified. If they had implemented their own pickle readers, such
readers might break if the pickle format is changed. In principle, even
the old pickle readers of Python 2.0..2.6 might break if the format
changes in 2.7 - we would have to go back and check that they don't
break (although I do believe that they would work fine).

So I personally don't see a problem with fixing this, but it appears
MAL does (for whatever reasons - I can't quite buy the performance
argument). OTOH, I don't feel that this issue deserves as much of
my time to actually implement anything.

So contributions are welcome. If you find that the patch meets
resistance, you also need to write a PEP, and ask for BDFL
pronouncement.
msg75070 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-10-22 08:24
On 2008-10-22 01:34, Martin v. Löwis wrote:
>> What makes you think that the problem cannot be fixed without changing
>> the existing pickle format 0?
> 
> Applications might rely on what was implemented rather than what was
> specified. If they had implemented their own pickle readers, such
> readers might break if the pickle format is changed. In principle, even
> the old pickle readers of Python 2.0..2.6 might break if the format
> changes in 2.7 - we would have to go back and check that they don't
> break (although I do believe that they would work fine).
> 
> So I personally don't see a problem with fixing this, but it appears
> MAL does (for whatever reasons - I can't quite buy the performance
> argument). OTOH, I don't feel that this issue deserves as much of
> my time to actually implement anything.

I've had a look at the implementations used in both pickle.py
and cPickle.c: both apply some extra escaping to the encoded
version of raw-unicode-escape in order to handle newlines
correctly, so I guess adding a few more escapes won't hurt.

So +0 on adding the extra escapes for range(128,256) code
points.

Still, IMHO, all of this is not worth the effort, since protocol
versions 1 and 2 are more efficient and there are better ways to
deal with the problem of sending binary data in some ASCII format,
e.g. using base64.
msg75161 - Author: Dan Dibagh (dddibagh) Date: 2008-10-24 11:48
> Which PEP specifically? PEP 263 only mentions the unicode-escape
> encoding in its problem statement, i.e. as a pre-existing thing.
> It doesn't specify it, nor does it give a rationale for why it behaves
> the way it does.

PEP 100 and PEP 263. What I looked for was a description of the
functional intention and a technical definition of raw unicode escape.
The term "raw" tends to have different meanings depending on the context
in which it appears. PEP 263 is of interest in the overall understanding
of the intention of raw unicode escape. If raw unicode escape is to
convert from python source into unicode strings then the decoding of raw
unicode escape strings depends on the source code encoding. Then perhaps
it would give an idea what the encoding part is supposed to do... PEP
100 is of interest for the technical description. It describes the
section "unicode constructors" as the definition.

> What code are you looking at, and where do you find it difficult to
> follow it? Maybe you get confused between the "unicode-escape" codec,
> and the "raw-unicode-escape" codec, also.

Since I am looking at the issue with non-ASCII characters in pickle output,
it is raw-unicode-escape that is in focus. For the decoding bit the
distinction between unicode-escape and raw-unicode-escape is very clear.

I look at the function PyUnicode_EncodeRawUnicodeEscape in
Objects/unicodeobject.c. At the point of the comment "/* Copy everything
else as-is */", given the perceived intentions of the encoding type, I
try to figure out why there isn't a "/* Map non-printable US ASCII to
'\xhh' */" section like in the unicodeescape_string function. The
background in older pythons you explained is essentially what I guessed.

> The raw-unicode-escape codec? It was designed to support parsing of
> Python 2.0 source code, and of "raw" unicode strings (ur"") in
> particular. In Python 2.0, you only needed to escape characters above
> U+0100; Latin-1 characters didn't need escaping. Python, itself, only
> relied on the decoding direction. That the codec chooses not to escape
> Latin-1 characters on encoding is an arbitrary choice (I guess); it's
> still symmetric with decoding.

I suppose you mean symmetric with decoding as long as you stick to the
latin-1 character set, as raw unicode escaping isn't a one-to-one mapping.

When PEP 263 came into the picture, wouldn't it have made sense to
change PyUnicode_EncodeRawUnicodeEscape to produce ASCII-only output, or
perhaps output conforming to the current default encoding? Given the
intention of the raw unicode escape, encoding something with it means
producing Python source code. But it is in Latin-1 while the rest of
Python has moved on to use ASCII by default or whatever is configured
in the source. I tried to shed light on that problem in my previous example.

> Even though the choice was arbitrary, you shouldn't change it now,
> because people may rely on how this codec works.

> Applications might rely on what was implemented rather than what was
> specified. If they had implemented their own pickle readers, such
> readers might break if the pickle format is changed. In principle, 
> even the old pickle readers of Python 2.0..2.6 might break if the
> format changes in 2.7 - we would have to go back and check that they don't
> break (although I do believe that they would work fine).

Then let me ask: How far reaching is the aim to maintain compatibility
with programs which depend on Python internals? Even if the internal
thing is a bug and the thing which depends on the bug is also a bug?
Maybe it is a provocative question, so let me explain. The question(s)
applies to some extent to the workings of the codec but it is really the
pickle problem I think of. In the case of older Python releases, it is
just a matter of testing, just as you say. It is boring and perhaps
tedious but there is nothing special which prevents it from being done.
If there are many versions there ought to be a way to write a program
which does it automatically. 

In the case of those who have implemented their own pickle readers, the
source and the comments in pickletools.py clearly state that unicode
strings are raw unicode escaped in format 0. Now raw unicode escape
isn't a canonical format. The letter A can be represented either as
\u0041 or as itself as A. If a hypothetical implementor gets the idea
that characters in the range 0-255 cannot be represented by \u00xx
sequences then the fact that pickle replaces \ with \u005c and \n with
\u000a should give a hint that he is wrong. So if characters in the
range 128-255 get escaped with \u00xx any pickle reader should handle
it. I've tried to come up with some sensible way to write a pickle
implementation which fails to understand \u00xx characters without
calling it a bug. I cannot. Can you? So it seems that the worry about
changing protocol 0 is buggy programs depending on a pickle bug.
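
Both observations in this paragraph can be checked on Python 3 with protocol 0: the decoder accepts an escaped `A`, and the pickler already escapes backslash and newline. A sketch, with illustrative variable names:

```python
import pickle

# raw-unicode-escape is not canonical: \u0041 decodes to the same 'A'.
decoded = b'\\u0041'.decode('raw-unicode-escape')

# A string containing a backslash and a newline, pickled with protocol 0.
original = 'a\\b\nc'
data = pickle.dumps(original, protocol=0)
print(data)

# Readers therefore already have to understand \u00xx escapes.
restored = pickle.loads(data)
```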

At the other end of the spectrum there are correct programs which depend
on Python externals, i.e. programs depending on ASCII-conformant pickle
output (even if there are some base64 ...ehm... fundamentalists who
think it is the wrong way to do it -- I can think of at least one good
reason to do it).

> So contributions are welcome. If you find that the patch meets
> resistance, you also need to write a PEP, and ask for BDFL
> pronouncement.

I am considering writing a patch. I also understand that in order for the patch
to get acceptance it must fit into the Python framework. That's why I
ask all these questions.
msg80330 - Author: Torsten Bronger (bronger) Date: 2009-01-21 15:43
I ran into this problem today when writing Python data structures into a
database.  Only ASCII is safe in this situation.  I understood the
Python docs to say that protocol 0 was ASCII-only.

I use pickle+base64 now, however, this makes debugging more difficult.

Anyway, I think that the docs should clearly say that protocol 8 is not
ASCII-only because this is important in the Python world.  For example,
I saw this issue because Django makes an implicit unicode() conversion
with my input which fails with non-ASCII.
msg80331 - Author: Torsten Bronger (bronger) Date: 2009-01-21 15:44
"protocol 8" --> "protocol 0" of course.
msg80334 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-01-21 17:57
On 2009-01-21 16:43, Torsten Bronger wrote:
> Torsten Bronger <bronger@physik.rwth-aachen.de> added the comment:
> 
> I ran into this problem today when writing Python data structures into a
> database.  Only ASCII is safe in this situation.  I understood the
> Python docs to say that protocol 0 was ASCII-only.
> 
> I use pickle+base64 now, however, this makes debugging more difficult.

Databases can handle binary data just fine, so pickle protocol 2
should be better in your situation.

If you require ASCII-only data, you can also use pickle protocol 2,
zlib and base64 to get a compact version of a serialized Python object.
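
The combination suggested here could look like the following on Python 3. This is a sketch; the helper names `dumps_ascii` and `loads_ascii` are made up for the example and are not part of any library.

```python
import base64
import pickle
import zlib


def dumps_ascii(obj):
    """Pickle with protocol 2, compress, and base64-encode to ASCII bytes."""
    return base64.b64encode(zlib.compress(pickle.dumps(obj, protocol=2)))


def loads_ascii(blob):
    """Reverse dumps_ascii: base64-decode, decompress, unpickle."""
    return pickle.loads(zlib.decompress(base64.b64decode(blob)))


blob = dumps_ascii({'name': '\u00e1'})
print(blob.decode('ascii'))
```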

> Anyway, I think that the docs should clearly say that protocol 8 is not
> ASCII-only because this is important in the Python world.  For example,
> I saw this issue because Django makes an implicit unicode() conversion
> with my input which fails with non-ASCII.

That sounds like an issue with Django - it shouldn't try to convert
binary data to Unicode (which is reserved for text data).
msg80337 - Author: Torsten Bronger (bronger) Date: 2009-01-21 18:34
Well, Django doesn't store binary data at all but wants you to store
image files etc in the file system.  Whether this was a good design
decision is beyond the scope of this issue.  My points actually are
only these:

a) the docs strongly suggest that protocol 0 is ASCII-only and this
should be clarified (one sentence would be fully sufficient I think)

b) currently, there is no way in the standard lib to serialise data in a
debuggable, ASCII-only format

Probably b) is not important.  *I* want to have it currently but this
doesn't mean much.
msg86294 - Author: Walter Doekes (wdoekes) Date: 2009-04-22 12:37
Same issue with Django here ;-)

I wouldn't mind a protocol 3 that does <128 ascii only. If only because
debugging base64'd zlib'd protocol-2 data is not particularly convenient.
msg86329 - Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2009-04-22 20:28
> I wouldn't mind a protocol 3 that does <128 ascii only. If only because
> debugging base64'd zlib'd protocol-2 data is not particularly convenient.

Is there any reason that prevents you from debugging your pickle using the
pickle disassembler tool, i.e., pickletools.dis()?
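
For a base64'd, zlib'd protocol 2 blob, the disassembler can be applied after undoing the two encodings. A Python 3 sketch with illustrative names; the sample blob is built inline only so the example is self-contained.

```python
import base64
import io
import pickle
import pickletools
import zlib

# Build a sample blob the way it might be stored.
blob = base64.b64encode(zlib.compress(pickle.dumps({'name': '\u00e1'}, protocol=2)))

# Undo base64 and zlib, then disassemble the raw pickle into text.
raw = zlib.decompress(base64.b64decode(blob))
listing = io.StringIO()
pickletools.dis(raw, out=listing)
print(listing.getvalue())
```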
msg86331 - Author: Torsten Bronger (bronger) Date: 2009-04-22 20:48
The "problem" is the pickle result.  It's not about debugging the
pickler itself.
msg86334 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-04-22 21:04
If your data is simple enough, you can use JSON. It has an
`ensure_ascii` flag when dumping data.
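
A minimal sketch of the JSON route on Python 3, assuming the data consists only of JSON-compatible types; the sample dictionary is illustrative.

```python
import json

# ensure_ascii=True is the default; it is spelled out here for clarity.
encoded = json.dumps({'name': '\u00e1'}, ensure_ascii=True)
print(encoded)

# The escape is turned back into the original character on load.
decoded = json.loads(encoded)
```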
msg109671 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-07-09 03:18
This can no longer be a 2.5 issue but I am not sure how to update it.

OP apparently opened it as a feature request, so I did update it to 3.2.
 
But OP then says "I'm not sure if this is a functionality or documentation bug." and indeed subsequent messages debate this issue. This would mean it could apply to earlier versions, if re-typed.

On the other hand, there seems to be some opinion that there is no bug, or if there is/was, it cannot be fixed, which would mean this should be closed.

Also, the docs seem to have already been changed, so if that were the issue, this is fixed and should be closed:
 "By default, the pickle data format uses a printable ASCII representation."
is now
"Protocol version 0 is the original human-readable protocol and is backwards compatible with earlier versions of Python. "
msg109688 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-07-09 08:23
Terry J. Reedy wrote:
> 
> Terry J. Reedy <tjreedy@udel.edu> added the comment:
> 
> This can no longer be a 2.5 issue but I am not sure how to update it.
> 
> OP apparently opened it as a feature request, so I did update it to 3.2.
>  
> But OP then says "I'm not sure if this is a functionality or documentation bug." and indeed subsequent messages debate this issue. This would mean it could apply to earlier versions, if re-typed.
> 
> On the other hand, there seems to be some opinion that there is no bug, or if there is/was, it cannot be fixed, which would mean this should be closed.
> 
> Also, the docs seem to have already been changed, so if that were the issue, this is fixed and should be closed:
>  "By default, the pickle data format uses a printable ASCII representation."
> is now
> "Protocol version 0 is the original human-readable protocol and is backwards compatible with earlier versions of Python. "

I'd suggest closing the ticket.

The main idea behind version 0 was to have a readable format. The
occasional UTF-8 in the stream should be readable enough nowadays,
even if it's not ASCII.
msg177364 - Author: (joelpitt) Date: 2012-12-12 04:55
Just ran into this problem using Python 2.7.3 and the issue others mention in conjunction with Django.

Note the 2.7 docs still imply it's ASCII: http://docs.python.org/2/library/pickle.html#data-stream-format

It has a weak caveat "(and of some other characteristics of pickle's representation)", but if you only skim read the bullet points below you'll miss that.

Yes I will use base64 to get around this, but the point is the documentation is still unclear and should probably completely remove the reference to ASCII in favour of "human-readable"... or even better, explicitly mention what will happen with unicode.

History
Date                 User                  Action  Args
2012-12-12 04:55:11  joelpitt              set     nosy: + joelpitt; messages: + msg177364
2010-07-11 01:00:12  terry.reedy           set     status: open -> closed; resolution: fixed
2010-07-09 08:23:35  lemburg               set     messages: + msg109688
2010-07-09 03:18:06  terry.reedy           set     nosy: + terry.reedy; messages: + msg109671; versions: + Python 3.2, - Python 2.5
2009-04-22 21:04:34  pitrou                set     nosy: + pitrou; messages: + msg86334
2009-04-22 20:48:24  bronger               set     messages: + msg86331
2009-04-22 20:28:14  alexandre.vassalotti  set     messages: + msg86329
2009-04-22 12:37:13  wdoekes               set     nosy: + wdoekes; messages: + msg86294
2009-01-21 18:34:44  bronger               set     messages: + msg80337
2009-01-21 17:57:37  lemburg               set     messages: + msg80334
2009-01-21 15:44:59  bronger               set     messages: + msg80331
2009-01-21 15:43:28  bronger               set     nosy: + bronger; messages: + msg80330
2008-10-24 11:48:44  dddibagh              set     messages: + msg75161
2008-10-22 08:24:50  lemburg               set     messages: + msg75070
2008-10-21 23:34:45  loewis                set     messages: + msg75058
2008-10-21 22:56:15  dddibagh              set     messages: + msg75055
2008-10-21 09:46:35  lemburg               set     messages: + msg75022
2008-10-21 09:22:22  dddibagh              set     nosy: + dddibagh; messages: + msg75021
2008-06-02 08:40:46  lemburg               set     messages: + msg67631
2008-06-01 21:38:23  alexandre.vassalotti  set     nosy: + alexandre.vassalotti
2008-05-27 22:56:18  loewis                set     messages: + msg67437
2008-05-27 22:41:31  lemburg               set     messages: + msg67436
2008-05-27 22:21:31  loewis                set     messages: + msg67434
2008-05-27 21:13:27  lemburg               set     messages: + msg67432
2008-05-27 19:12:41  loewis                set     nosy: + loewis; messages: + msg67425
2008-05-27 18:20:36  lemburg               set     messages: + msg67422
2008-05-27 18:10:40  lemburg               set     nosy: + lemburg; messages: + msg67421
2008-05-27 15:38:13  mawbid                create