Author lemburg
Recipients Arfrever, ezio.melotti, gregory.p.smith, lemburg, loewis, pitrou, vstinner
Date 2010-05-03.22:38:15
Message-id <4BDF5054.4050102@egenix.com>
In-reply-to <4BDF4091.5000702@v.loewis.de>
Content
Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
>> Here's one (RFC 3875, sections 4.1.7 and 4.1.5):
>>
>> LANG = 'en_US.utf8'
>> CONTENT_TYPE = 'application/x-www-form-urlencoded'
>> QUERY_STRING = 'type=example&name=Löwis'
>> PATH_INFO = '/home/löwis/bin/mycgi.py'
>>
>> (HTML uses Latin-1 as default encoding and so do many of the
>>  protocols invented for it !)
> 
> BTW, I think you are misinterpreting the RFC. It doesn't actually say
> that QUERY_STRING is Latin-1 encoded, but instead, it says
> 
> "the details of the parsing, reserved characters and support for non
> US-ASCII characters depends on the context"

Please read on:

"""
   For example, form submission from an HTML
   document [18] uses application/x-www-form-urlencoded encoding, in
   which the characters "+", "&" and "=" are reserved, and the ISO
   8859-1 encoding may be used for non US-ASCII characters.
"""

I could also have given you an example using 'multipart/form-data',
in which each part uses a different encoding or even sends binary
data by means of 'Content-Transfer-Encoding: binary'.

These are not made-up examples; they occur in the real world for
which we are coding.

> Latin-1 is only given as a possible example. Apache passes the URL from
> the HTTP request unescaped; browsers will likely CGI-escape it. So most
> likely, it will be
> 
> QUERY_STRING = 'type=example&name=L%F6wis'
> or
> QUERY_STRING = 'type=example&name=L%C3%B6wis'
>
> IMO, applications are much better off to consider QUERY_STRING as a
> character string.

Believe me, I've been working with HTML, forms, web apps, etc.
for almost 20 years now. In the real world, your application has
to cope with any kind of data in QUERY_STRING.
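
A minimal sketch of what "coping" can look like for the two escaped
forms quoted above, assuming the application is willing to guess:
try UTF-8 first and fall back to Latin-1. The helper name
decode_query_value is hypothetical, and the fallback order is an
assumption, not something RFC 3875 prescribes.

```python
from urllib.parse import unquote_to_bytes


def decode_query_value(raw):
    """Decode one percent-encoded form value (hypothetical helper).

    Works on bytes first, because the encoding behind the %XX escapes
    is not known in advance.
    """
    # In application/x-www-form-urlencoded data, "+" stands for a space.
    data = unquote_to_bytes(raw.replace("+", " "))
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail,
        # but the result is only a guess at what the sender meant.
        return data.decode("latin-1")


# Both escaped forms from the quoted message decode to the same name.
assert decode_query_value("L%F6wis") == "L\xf6wis"
assert decode_query_value("L%C3%B6wis") == "L\xf6wis"
```

The point of the sketch is that the value has to be handled as bytes
before any decoding decision is made.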

And this is just one example of how the OS environment can
be used, e.g. to provide user meta-data, license data or
company names.

Even if these all use UTF-8, a user might still want to stick
to ASCII as her CODESET, and then all her Python applications would
start to fail at the first sight of a French accent or German
umlaut.

PEP 383 is nice for file names and paths, but it's unfortunately
not going to save the world...
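
To illustrate the ASCII CODESET case, here is a small sketch that
decodes UTF-8 bytes as ASCII, once strictly and once with the
surrogateescape error handler from PEP 383. The variable names are
made up for the example, and the bytes stand in for a value read
from the environment.

```python
# UTF-8 bytes for "Löwis", as they might arrive in an environment variable.
raw = b"L\xc3\xb6wis"

# Strict ASCII decoding fails on the first non-ASCII byte.
try:
    raw.decode("ascii")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogateescape, each undecodable byte becomes a lone surrogate
# (U+DC80..U+DCFF), and the original bytes can be recovered.
text = raw.decode("ascii", errors="surrogateescape")
round_trip = text.encode("ascii", errors="surrogateescape")

assert strict_ok is False
assert text == "L\udcc3\udcb6wis"
assert round_trip == raw
```

The bytes survive the round trip, but the decoded string contains
surrogates rather than the actual character, so code that treats it
as ordinary text can still run into trouble.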
History
Date User Action Args
2010-05-03 22:38:18 lemburg set recipients: + lemburg, loewis, gregory.p.smith, pitrou, vstinner, ezio.melotti, Arfrever
2010-05-03 22:38:16 lemburg link issue8603 messages
2010-05-03 22:38:15 lemburg create