Author lemburg
Recipients Arfrever, ezio.melotti, gregory.p.smith, lemburg, loewis, pitrou, vstinner
Date 2010-05-03.22:02:05
Message-id <4BDF47DB.8020105@egenix.com>
In-reply-to <4BDF3BCF.6070109@v.loewis.de>
Content
Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
>> Set your CODESET to ASCII and watch the surrogate escaping
>> begin... seriously, Martin, if you've ever worked with CGI
>> or WSGI or FastCGI or SCGI or any of the many other protocols
>> that use the OS environment for passing data between processes,
>> it doesn't take much imagination to come up with examples
>> that fail left and right.
>>
>> Here's one (RFC 3875, sections 4.1.7 and 4.1.5):
>>
>> LANG = 'en_US.utf8'
>> CONTENT_TYPE = 'application/x-www-form-urlencoded'
>> QUERY_STRING = 'type=example&name=Löwis'
>> PATH_INFO = '/home/löwis/bin/mycgi.py'
> 
> I still don't see a *failure* here. AFAICT, it all works correctly.

Your name will end up partially escaped, with the non-ASCII byte turned into a lone surrogate:

'L\udcf6wis'

Further processing will fail as soon as the application re-encodes the
value, since it would correctly assume that the data is Latin-1 only
(see the RFC):

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\udcf6' in position 1: ordinal not in range(256)
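
The failure above can be reproduced with a minimal sketch. It assumes the
CGI value arrives as Latin-1 bytes and is decoded as UTF-8 with the
'surrogateescape' handler, as happens for the environment under a UTF-8
locale; the variable names are illustrative only:

```python
# Assumption: the web server passes the name as Latin-1 encoded bytes.
raw = "Löwis".encode("latin-1")  # b'L\xf6wis'

# Decode the way the environment is decoded under a UTF-8 locale:
# the invalid byte 0xF6 becomes the lone surrogate U+DCF6.
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))  # 'L\udcf6wis'

# The application, following the RFC, treats the value as Latin-1
# and tries to get the bytes back. This is where it breaks.
try:
    text.encode("latin-1")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

print(encode_error)
```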

> In particular, I fail to see the advantage of using bytes over using
> escaped strings, in terms of correctness. I'm even skeptical that there
> is an advantage in terms of usability (and if there is, I'd like to see
> a demonstration of that, as well).

Using the 'surrogateescape' error handler implicitly changes the
encoding that is used to decode the bytes data.

This works fine as long as the data is only used *as reference* to
some entity (e.g. as in a file name) and manipulation of that
data is limited to concatenation and slicing, i.e. the things you do
with file names and paths.

It doesn't work if an application tries to work *with* the data,
e.g. tries to convert it, parse it, decode it, etc. The reason is
that the information added by the 'surrogateescape' error handler
is lost along the way, which then causes data corruption.
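
A rough sketch of the difference, assuming a Latin-1 encoded path decoded
as UTF-8 with 'surrogateescape' (names and the ".bak" suffix are
illustrative only):

```python
# Assumption: the path arrives as Latin-1 bytes under a UTF-8 locale.
raw = b"/home/l\xf6wis/bin/mycgi.py"
path = raw.decode("utf-8", "surrogateescape")

# Used as a reference: concatenation plus a round trip through the
# same error handler restores the original bytes exactly.
joined = path + ".bak"
round_trip = joined.encode("utf-8", "surrogateescape")
print(round_trip == raw + b".bak")  # True

# Worked with as data: code that does not know about the escaping
# either fails outright ...
try:
    path.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# ... or silently drops the original byte.
lossy = path.encode("utf-8", "replace")
print(strict_failed, lossy)
```

The round trip only works when every step uses the same codec and error
handler; any component that encodes differently either raises or loses
the escaped byte.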
History
Date                 User     Action  Args
2010-05-03 22:02:07  lemburg  set     recipients: + lemburg, loewis, gregory.p.smith, pitrou, vstinner, ezio.melotti, Arfrever
2010-05-03 22:02:06  lemburg  link    issue8603 messages
2010-05-03 22:02:05  lemburg  create