Message 104889 - Python tracker


Author	lemburg
Recipients	Arfrever, ezio.melotti, gregory.p.smith, lemburg, loewis, pitrou, vstinner
Date	2010-05-03.22:02:05
Message-id	<4BDF47DB.8020105@egenix.com>
In-reply-to	<4BDF3BCF.6070109@v.loewis.de>

Content

Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
>> Set your CODESET to ASCII and watch the surrogate escaping
>> begin... seriously, Martin, if you've ever worked with CGI
>> or WSGI or FastCGI or SCGI or any of the many other protocols
>> that use the OS environment for passing data between processes,
>> it doesn't take much imagination to come up with examples
>> that fail left and right.
>>
>> Here's one (RFC 3875, sections 4.1.7 and 4.1.5):
>>
>> LANG = 'en_US.utf8'
>> CONTENT_TYPE = 'application/x-www-form-urlencoded'
>> QUERY_STRING = 'type=example&name=Löwis'
>> PATH_INFO = '/home/lÃ¶wis/bin/mycgi.py'
> 
> I still don't see a *failure* here. AFAICT, it all works correctly.

Your name will end up being partially escaped as surrogate:

'L\udcf6wis'

Further processing will fail, since the application would
correctly assume that the data is Latin-1 only (see the RFC):

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\udcf6' in position 1: ordinal not in range(256)
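
A minimal sketch of how that traceback can be reproduced, assuming the
query string arrives as Latin-1 bytes and is decoded as UTF-8 with the
'surrogateescape' handler (which is what a UTF-8 locale would lead to).
The variable names are illustrative only:

```python
# Hypothetical CGI value: the query string is Latin-1 encoded bytes,
# while LANG=en_US.utf8 causes the environment to be decoded as UTF-8.
raw_query = b"type=example&name=L\xf6wis"

# 0xf6 is not valid UTF-8, so the handler maps it to the lone surrogate U+DCF6.
decoded = raw_query.decode("utf-8", errors="surrogateescape")
name = decoded.split("name=", 1)[1]
print(ascii(name))  # 'L\udcf6wis'

# An application following the RFC would now treat the value as Latin-1.
try:
    name.encode("latin-1")
    error = None
except UnicodeEncodeError as exc:
    error = exc
    print(exc)
```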

> In particular, I fail to see the advantage of using bytes over using
> escaped strings, in terms of correctness. I'm even skeptical that there
> is an advantage in terms of usability (and if there is, I'd like to see
> a demonstration of that, as well).

The use of the 'surrogateescape' error handler modifies the
encoding used for the decoding of the bytes data and does this
implicitly.

This works fine as long as the data is only used *as reference* to
some entity (e.g. as in a file name) and manipulation of that
data is limited to concatenation and slicing. Things that you do
with file names and paths.
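
As a rough illustration of the "reference" case, the sketch below assumes
a path that contains a Latin-1 byte and a UTF-8 decoding with
'surrogateescape'. The path and variable names are made up for the example:

```python
# Hypothetical file name containing a byte that is not valid UTF-8.
raw_path = b"/home/l\xf6wis/bin/mycgi.py"

text = raw_path.decode("utf-8", errors="surrogateescape")

# Concatenation and slicing leave the escaped surrogate untouched ...
backup = text + ".bak"
parent = text[: text.rfind("/")]

# ... so encoding with the same handler restores the original bytes.
backup_bytes = backup.encode("utf-8", errors="surrogateescape")
parent_bytes = parent.encode("utf-8", errors="surrogateescape")
print(backup_bytes)
print(parent_bytes)
```

The round trip only holds because both directions use the same codec and
the same error handler.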

It doesn't work if an application tries to work *with* the data,
e.g. tries to convert it, parse it, decode it, etc. The reason is
that information included by the use of the 'surrogateescape'
error handler is lost along the way and this then causes data
corruption.
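
A small sketch of what this can look like in practice, again assuming
Latin-1 input decoded as UTF-8 with 'surrogateescape'. The names are
illustrative, and the 'replace' step merely stands in for any downstream
code that is unaware of the escaping:

```python
raw_name = b"L\xf6wis"  # Latin-1 bytes for 'Löwis'

escaped = raw_name.decode("utf-8", errors="surrogateescape")
intended = raw_name.decode("latin-1")

# The escaped string is not the text the sender meant.
same_text = escaped == intended
print(same_text)  # False

# A strict encoder refuses the lone surrogate outright.
try:
    escaped.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# A lenient encoder silently drops the original byte value.
lossy = escaped.encode("utf-8", errors="replace")
print(lossy)  # b'L?wis'
```

Once the value has passed through the lenient encoder, the original byte
0xf6 cannot be recovered.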

History
Date	User	Action	Args
2010-05-03 22:02:07	lemburg	set	recipients: + lemburg, loewis, gregory.p.smith, pitrou, vstinner, ezio.melotti, Arfrever
2010-05-03 22:02:06	lemburg	link	issue8603 messages
2010-05-03 22:02:05	lemburg	create