Message 75022 - Python tracker

This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Author	lemburg
Recipients	alexandre.vassalotti, dddibagh, georg.brandl, lemburg, loewis, mawbid
Date	2008-10-21.09:46:33
Message-id	<48FDA4F7.7080307@egenix.com>
In-reply-to	<1224580944.13.0.420901995574.issue2980@psf.upfronthosting.co.za>

Content
On 2008-10-21 11:22, Dan Dibagh wrote:
> Your reasoning shows a lack of understanding of how Python is actually
> used from a programmer's point of view.

Hmm, I've been using Python for almost 15 years now and do believe
that I have an idea of how Python is being used.

Note that we cannot change the pickle format retroactively, since
this would break pickle data exchange between different Python versions
relying on the same format (but using different pickle implementations).

What we could do is add a new pickle format which escapes all
non-ASCII data. However, people have been more keen on getting
compact and fast-loading pickles than on pickles in ASCII, which is why
all new versions of the pickle format are binary formats, so I don't
think it's worth the effort.

Note that the common way of dealing with binary data in ASCII streams
is to use a base64 encoding, possibly after compressing the data. The
pickle 0 format is really only useful for debugging purposes.
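
As a rough illustration of the base64 approach mentioned above, here is
a minimal sketch in Python 3 syntax (the thread itself predates it and
uses Python 2). The helper names dumps_ascii and loads_ascii are
hypothetical and not part of the pickle module; the sketch assumes zlib
for the optional compression step.

```python
import base64
import pickle
import zlib


def dumps_ascii(obj):
    """Pickle obj with a binary protocol, compress, then base64-encode.

    Hypothetical helper: the result is a bytes object containing only
    ASCII characters, so it can travel through an ASCII-only stream.
    """
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return base64.b64encode(zlib.compress(raw))


def loads_ascii(data):
    """Reverse dumps_ascii(): base64-decode, decompress, unpickle."""
    return pickle.loads(zlib.decompress(base64.b64decode(data)))


# Protocol 0 is text-like, but it passes U+0080 through as a raw byte.
proto0 = pickle.dumps("\u0080", protocol=0)
proto0_is_ascii = all(byte < 128 for byte in proto0)

# The base64 wrapper stays within ASCII for the same kind of data.
value = {"name": "caf\u00e9", "char": "\u0080"}
encoded = dumps_ascii(value)
restored = loads_ascii(encoded)
```

The wrapper leaves the pickle format itself untouched, so existing
pickles stay readable; only the transport representation changes.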

> Perhaps it is the raw-unicode-escape encoding that should be fixed? I
> failed to find exact information about what raw-unicode-escape means. In
> particular, where is the information which states that
> raw-unicode-escape is always an 8-bit format? The closest I've come is
> PEP 100 and PEP 263 (which I notice were written by you guys), which
> describe how to decode raw unicode escape strings from Python source
> and how to define encoding formats for Python source code. The sole
> original purpose of both unicode-escape and raw-unicode-escape appears
> to be representing unicode strings in Python source code as u' and ur'
> strings respectively.

Right.

> It is clear that the decoding of a raw unicode
> escaped or unicode escaped string depends on the actual encoding of the
> Python source, but what is the logic by which, when something is
> _encoded_ into a raw unicode string, the target source must be of some
> 8-bit encoding? Especially considering that the default Python source
> encoding is ASCII. For unicode-escape this makes sense:
> 
> >>> f = file("test.py", "wb")
> >>> f.write('s = u"%s"\n' % u"\u0080".encode("unicode-escape"))
> >>> f.close()
> >>> ^Z
> 
> python test.py (executes silently without errors)
> 
> But for raw-unicode-escape the outcome is a different thing:
> 
> >>> f = file("test.py", "wb")
> >>> f.write('s = ur"%s"\n' % u"\u0080".encode("raw-unicode-escape"))
> >>> f.close()
> >>> ^Z
> 
> python test.py
> 
>   File "test.py", line 1
> SyntaxError: Non-ASCII character '\x80' in file test.py on line 1, but
> no encoding declared; see http://www.python.org/peps/pep-0263.html for
> details
> 
> Huh? For someone who trusts the Standard Encodings section of the
> Python Library Reference this isn't what one would expect. If the
> documentation states "Produce a string that is suitable as raw Unicode
> literal in Python source code" then why isn't it suitable?

Because the raw-unicode-escape codec won't escape the \x80 character
(it writes characters in the Latin-1 range as raw bytes), hence the
name. As a result, the generated source code is not ASCII, which is why
you see the exception.
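
The difference between the two codecs can be checked directly. The
following is a small sketch in Python 3 syntax (the session quoted
above is Python 2); the variable names are only illustrative.

```python
ch = "\u0080"

# unicode-escape turns U+0080 into the four ASCII characters \x80.
escaped = ch.encode("unicode-escape")

# raw-unicode-escape emits code points up to U+00FF as single raw
# bytes, so U+0080 becomes the one non-ASCII byte 0x80.
raw = ch.encode("raw-unicode-escape")

# Only code points above U+00FF get a \uXXXX escape from the raw codec.
euro = "\u20ac".encode("raw-unicode-escape")

escaped_is_ascii = all(byte < 128 for byte in escaped)
raw_is_ascii = all(byte < 128 for byte in raw)

print(escaped, escaped_is_ascii)
print(raw, raw_is_ascii)
print(euro)
```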

But this is off-topic with respect to the issue in question.

History
Date	User	Action	Args
2008-10-21 09:46:38	lemburg	set	recipients: + lemburg, loewis, georg.brandl, alexandre.vassalotti, mawbid, dddibagh
2008-10-21 09:46:35	lemburg	link	issue2980 messages
2008-10-21 09:46:33	lemburg	create