Author mgiuca
Recipients loewis, mgiuca, orsenthil, thomaspinckney3
Date 2008-07-11.07:03:19
Content
> 3.0b1 has been released, so no new features can be added to 3.0.

While my proposal is no doubt going to cause a lot of code breakage, I
hardly consider it a "new feature". This is very definitely a bug. As I
understand it, the point of a code freeze is to stop the addition of
features which could be added to a later version. Realistically, there
is no way this issue can be fixed after 3.0 is released, as it
necessarily involves changing the behaviour of this function.

Perhaps I should explain further why this is a regression from Python
2.x and not a feature request. In Python 2.x, with byte strings, the
encoding is not an issue. quote and unquote simply encode bytes, and if
you want to use Unicode you have complete control. In Python 3.0, with
Unicode strings, if functions manipulate string objects, you don't have
control over the encoding unless the functions give you explicit
control. So Python 3.0's native Unicode strings have broken the library.

I give two examples.

Firstly, I believe that unquote(quote(x)) == x should hold for all
strings x. In Python 2.x, this is trivially true (for non-Unicode
strings), because the two functions simply encode and decode the octets.
In Python 3.0, the two functions are inconsistent, and the round trip
breaks for characters outside range(0, 256).

>>> urllib.parse.unquote(urllib.parse.quote('ÿ')) # '\u00ff'
'ÿ'
# Works, because both functions work with ISO-8859-1 in this range.

>>> urllib.parse.unquote(urllib.parse.quote('Ā')) # '\u0100'
'Ä\x80'
# Fails, because quote uses UTF-8 and unquote uses ISO-8859-1.

My patch succeeds for all characters.
>>> urllib.parse.unquote(urllib.parse.quote('Ā')) # '\u0100'
'Ā'
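
For anyone reproducing the mismatch on a later Python 3 release, it can be shown by naming the encodings explicitly. This is only a minimal sketch: it assumes the `encoding` keyword arguments that `urllib.parse.quote` and `urllib.parse.unquote` accept in later Python 3 versions, which is not necessarily the interface of the patch discussed here.

```python
from urllib.parse import quote, unquote

ch = '\u0100'  # 'Ā', the first code point outside range(0, 256)

# Percent-encode the UTF-8 octets of the character.
quoted = quote(ch, encoding='utf-8')

# Decoding with a different charset reproduces the failure shown above.
mismatched = unquote(quoted, encoding='latin-1')

# Decoding with the same charset restores the original string.
round_trip = unquote(quoted, encoding='utf-8')

print(quoted)             # %C4%80
print(ascii(mismatched))  # '\xc4\x80'
print(round_trip == ch)   # True
```

The round trip only holds when both directions agree on the encoding, which is the whole point of the complaint.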

Secondly, a bigger example, which demonstrates how this bug affects web
applications, even very simple ones.

Consider this simple (beginnings of a) wiki system in Python 2.5, as a
CGI app:

#---
import cgi

fields = cgi.FieldStorage()
title = fields.getfirst('title')

print("Content-Type: text/html; charset=utf-8")
print("")

print('<p>Debug: %s</p>' % repr(title))
if title is None:
    print("No article selected")
else:
    print('<p>Information about %s.</p>' % cgi.escape(title))
#---

(Place this in cgi-bin, navigate to it, and add the query string
"?title=Page Title"). I'll use the page titled "Mátt" as a test case.

If you navigate to "?title=Mátt", it displays the text "Debug:
'M\xc3\xa1tt'. Information about Mátt.". The browser (at least Firefox,
Safari and IE, which I have tested) encodes this as "?title=M%C3%A1tt". So this
is trivial, as it's just being unquoted into a raw byte string
'M\xc3\xa1tt', then written out again as a byte string.
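
The two steps described here (the browser percent-encoding the UTF-8 octets, and the server undoing only the percent-encoding) can be sketched as follows. It is written for a current Python 3, where `urllib.parse.unquote_to_bytes` exists and `quote` defaults to UTF-8; it is an illustration of the byte-level behaviour, not the Python 2.5 code path itself.

```python
from urllib.parse import quote, unquote_to_bytes

# What the browsers send for "Mátt": the UTF-8 octets, percent-encoded.
sent = quote('M\u00e1tt')

# Undo the percent-encoding only; no character decoding takes place.
raw = unquote_to_bytes(sent)

print(sent)  # M%C3%A1tt
print(raw)   # b'M\xc3\xa1tt'
```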

Now consider that you want to manipulate it as a Unicode string, still
in Python 2.5. You could augment the program to decode it as UTF-8 and
then re-encode it. (I wrote a simple UTF-8 printing function which takes
Unicode strings as input).

#---
import sys
import cgi

def printu8(*args):
    """Prints to stdout encoding as utf-8, rather than the current terminal
    encoding. (Not a fully-featured print function)."""
    sys.stdout.write(' '.join([x.encode('utf-8') for x in args]))
    sys.stdout.write('\n')

fields = cgi.FieldStorage()
title = fields.getfirst('title')
if title is not None:
    title = str(title).decode("utf-8", "replace")

print("Content-Type: text/html; charset=utf-8")
print("")

print('<p>Debug: %s.</p>' % repr(title))
if title is None:
    print("No article selected.")
else:
    printu8('<p>Information about %s.</p>' % cgi.escape(title))
#---

Now given the same input ("?title=Mátt"), it displays "Debug:
u'M\xe1tt'. Information about Mátt." Still working fine, and I can
manipulate it as Unicode because in Python 2.x I have direct control
over encoding/decoding.
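
The decoding step that the caller controls can be looked at in isolation. The sketch below uses Python 3 bytes syntax for illustration, and `raw_title` is a hypothetical stand-in for the byte string that comes out of FieldStorage in the Python 2.5 script.

```python
# Hypothetical stand-in for the raw octets of the unquoted query value.
raw_title = b'M\xc3\xa1tt'

# The caller chooses the encoding and the error policy explicitly.
title = raw_title.decode('utf-8', 'replace')

# With 'replace', octets that are not valid UTF-8 become U+FFFD
# instead of raising an exception.
malformed = b'M\xe1tt'.decode('utf-8', 'replace')

print(ascii(title))      # 'M\xe1tt'
print(ascii(malformed))  # 'M\ufffdtt'
```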

Now let us upgrade this program to Python 3.0. (Note that I still can't
print Unicode characters directly out, because running through Apache
the stdout encoding is not UTF-8, so I use my printu8 function).

#---
import sys
import cgi

def printu8(*args):
    """Prints to stdout encoding as utf-8, rather than the current terminal
    encoding. (Not a fully-featured print function)."""
    sys.stdout.buffer.write(b' '.join([x.encode('utf-8') for x in args]))
    sys.stdout.buffer.write(b'\n')

fields = cgi.FieldStorage()
title = fields.getfirst('title')
# Note: No call to decode. I have no opportunity to specify the encoding
# since it comes straight out of FieldStorage as a Unicode string.

print("Content-Type: text/html; charset=utf-8")
print("")

print('<p>Debug: %s.</p>' % ascii(title))
if title is None:
    print("No article selected.")
else:
    printu8('<p>Information about %s.</p>' % cgi.escape(title))
#---

Now given the same input ("?title=Mátt"), it displays "Debug:
'M\xc3\xa1tt'. Information about MÃ¡tt." Once again, it is erroneously
(and implicitly) decoded as ISO-8859-1, so I end up with a meaningless
Unicode string. The only possible thing I can do about this as a web
developer is call title.encode('latin-1').decode('utf-8') - a dreadful hack.
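
To make the workaround concrete, here is a minimal sketch of it. The helper name `redecode` is hypothetical and exists only for this illustration.

```python
def redecode(text, wrong='latin-1', right='utf-8'):
    """Undo an implicit decode with the wrong charset (hypothetical helper).

    Re-encoding with the wrong charset recovers the original octets,
    which are then decoded with the charset that was actually intended.
    """
    return text.encode(wrong).decode(right)

broken = 'M\xc3\xa1tt'    # what the library hands back for "?title=Mátt"
fixed = redecode(broken)

print(ascii(broken))  # 'M\xc3\xa1tt'
print(ascii(fixed))   # 'M\xe1tt'
```

Note that this only works on strings that really were mis-decoded. A string containing a character above U+00FF cannot be re-encoded as Latin-1 at all, so the same call raises UnicodeEncodeError there.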

With my patch applied, the input ("?title=Mátt") produces the output
"Debug: 'M\xe1tt'. Information about Mátt."

Basically, this bug is going to affect all web developers as soon as
someone types a non-ASCII character. You could argue that supporting
UTF-8 by default is no better than supporting Latin-1 by default, but it
is. UTF-8 can encode every Unicode character, whereas Latin-1 cannot;
UTF-8 is the URI encoding recommended by both the URI Syntax RFC[1] and
the W3C HTML 4.01 specification[2]; and all major browsers use it to
encode non-ASCII characters in URIs.
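
The difference between the two defaults can be checked directly. This is a sketch under stated assumptions: `encodable` is a hypothetical helper, and the `encoding` argument to `urllib.parse.parse_qs` is the one available in later Python 3 releases, not necessarily what the patch here provides.

```python
from urllib.parse import parse_qs

def encodable(text, encoding):
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# 'á' fits in Latin-1; 'Ā' (U+0100) and '中' (U+4E2D) do not.
samples = ['\u00e1', '\u0100', '\u4e2d']
latin1_ok = [encodable(s, 'latin-1') for s in samples]
utf8_ok = [encodable(s, 'utf-8') for s in samples]

# The query string browsers send for "?title=Mátt", decoded under each default.
query = 'title=M%C3%A1tt'
as_utf8 = parse_qs(query, encoding='utf-8')['title'][0]
as_latin1 = parse_qs(query, encoding='latin-1')['title'][0]

print(latin1_ok)         # [True, False, False]
print(utf8_ok)           # [True, True, True]
print(ascii(as_utf8))    # 'M\xe1tt'
print(ascii(as_latin1))  # 'M\xc3\xa1tt'
```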

My patch may not be the best, or most conservative, solution to this
problem. I'm happy to see other proposals. But it's clearly an important
bug to fix, if I can't even write the simplest web app I can think of
without having to use a kludgey hack to get the string decoded
correctly. What is the point of having nice clean Unicode strings in the
language if the library spits out the wrong characters and it requires
more work to fix them than it used to with byte strings?

[1] http://tools.ietf.org/html/rfc3986#section-2.5
[2] http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1