
Author mgiuca
Recipients loewis, mgiuca, orsenthil, thomaspinckney3
Date 2008-07-11.07:03:19
Message-id <1215759805.77.0.557528898975.issue3300@psf.upfronthosting.co.za>
In-reply-to
Content
> 3.0b1 has been released, so no new features can be added to 3.0.

While my proposal is no doubt going to cause a lot of code breakage, I
hardly consider it a "new feature". This is very definitely a bug. As I
understand it, the point of a code freeze is to stop the addition of
features which could be added to a later version. Realistically, there
is no way this issue can be fixed after 3.0 is released, as it
necessarily involves changing the behaviour of this function.

Perhaps I should explain further why this is a regression from Python
2.x and not a feature request. In Python 2.x, with byte strings, the
encoding is not an issue. quote and unquote simply encode bytes, and if
you want to use Unicode you have complete control. In Python 3.0, with
Unicode strings, if functions manipulate string objects, you don't have
control over the encoding unless the functions give you explicit
control. So Python 3.0's native Unicode strings have broken the library.
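As an aside, the byte-level behaviour described above can be made concrete. In the urllib.parse API that Python 3 ships, quote_from_bytes and unquote_to_bytes operate purely on octets and leave the codec choice to the caller; a minimal sketch (not the 2.x code itself, just an illustration of octet-level round-tripping):

```python
# Sketch: byte-level quoting, where no character encoding is involved.
# quote_from_bytes/unquote_to_bytes work on octets only; the application
# decides when, and with which codec, to decode.
from urllib.parse import quote_from_bytes, unquote_to_bytes

raw = 'café'.encode('utf-8')            # the application chose UTF-8 explicitly
quoted = quote_from_bytes(raw)          # percent-encodes the raw octets
assert unquote_to_bytes(quoted) == raw  # the octets round-trip exactly
print(quoted)
```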

I give two examples.

Firstly, I believe that unquote(quote(x)) == x should hold for all
strings x. In Python 2.x, this is trivially true (for non-Unicode
strings), because the two functions simply encode and decode the same
octets. In Python 3.0, the two functions are inconsistent, and the round
trip breaks for characters outside the range(0, 256).

>>> urllib.parse.unquote(urllib.parse.quote('ÿ')) # '\u00ff'
'ÿ'
# Works, because both functions work with ISO-8859-1 in this range.

>>> urllib.parse.unquote(urllib.parse.quote('Ā')) # '\u0100'
'Ä\x80'
# Fails, because quote uses UTF-8 and unquote uses ISO-8859-1.

My patch succeeds for all characters.
>>> urllib.parse.unquote(urllib.parse.quote('Ā')) # '\u0100'
'Ā'
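For reference, the interface that eventually shipped in released Python 3 exposes exactly this kind of control: both quote and unquote accept encoding and errors arguments and default to UTF-8, so the round trip holds. A sketch against that released API:

```python
from urllib.parse import quote, unquote

# The default is UTF-8 on both sides, so the round trip holds for any character.
assert unquote(quote('Ā')) == 'Ā'

# The encoding is also under explicit caller control when needed.
assert unquote('%FF', encoding='latin-1') == 'ÿ'
print('round trip ok')
```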

Secondly, a bigger example, but I want to demonstrate how this bug
affects web applications, even very simple ones.

Consider this simple (beginnings of a) wiki system in Python 2.5, as a
CGI app:

#---
import cgi

fields = cgi.FieldStorage()
title = fields.getfirst('title')

print("Content-Type: text/html; charset=utf-8")
print("")

print('<p>Debug: %s</p>' % repr(title))
if title is None:
    print("No article selected")
else:
    print('<p>Information about %s.</p>' % cgi.escape(title))
#---

(Place this in cgi-bin, navigate to it, and add the query string
"?title=Page Title"). I'll use the page titled "Mátt" as a test case.

If you navigate to "?title=Mátt", it displays the text "Debug:
'M\xc3\xa1tt'. Information about Mátt.". The browser (at least Firefox,
Safari and IE I have tested) encodes this as "?title=M%C3%A1tt". So this
is trivial, as it's just being unquoted into a raw byte string
'M\xc3\xa1tt', then written out again as a byte string.

Now consider that you want to manipulate it as a Unicode string, still
in Python 2.5. You could augment the program to decode it as UTF-8 and
then re-encode it. (I wrote a simple UTF-8 printing function which takes
Unicode strings as input).

#---
import sys
import cgi

def printu8(*args):
    """Prints to stdout encoding as utf-8, rather than the current terminal
    encoding. (Not a fully-featured print function)."""
    sys.stdout.write(' '.join([x.encode('utf-8') for x in args]))
    sys.stdout.write('\n')

fields = cgi.FieldStorage()
title = fields.getfirst('title')
if title is not None:
    title = str(title).decode("utf-8", "replace")

print("Content-Type: text/html; charset=utf-8")
print("")

print('<p>Debug: %s.</p>' % repr(title))
if title is None:
    print("No article selected.")
else:
    printu8('<p>Information about %s.</p>' % cgi.escape(title))
#---

Now given the same input ("?title=Mátt"), it displays "Debug:
u'M\xe1tt'. Information about Mátt." Still working fine, and I can
manipulate it as Unicode because in Python 2.x I have direct control
over encoding/decoding.

Now let us upgrade this program to Python 3.0. (Note that I still can't
print Unicode characters directly out, because running through Apache
the stdout encoding is not UTF-8, so I use my printu8 function).

#---
import sys
import cgi

def printu8(*args):
    """Prints to stdout encoding as utf-8, rather than the current terminal
    encoding. (Not a fully-featured print function)."""
    sys.stdout.buffer.write(b' '.join([x.encode('utf-8') for x in args]))
    sys.stdout.buffer.write(b'\n')

fields = cgi.FieldStorage()
title = fields.getfirst('title')
# Note: No call to decode. I have no opportunity to specify the
# encoding, since it comes straight out of FieldStorage as a Unicode
# string.

print("Content-Type: text/html; charset=utf-8")
print("")

print('<p>Debug: %s.</p>' % ascii(title))
if title is None:
    print("No article selected.")
else:
    printu8('<p>Information about %s.</p>' % cgi.escape(title))
#---

Now given the same input ("?title=Mátt"), it displays "Debug:
'M\xc3\xa1tt'. Information about Mátt." Once again, it is erroneously
(and implicitly) decoded as ISO-8859-1, so I end up with a meaningless
Unicode string. The only thing I can do about this as a web developer
is to call title.encode('latin-1').decode('utf-8'), which is a dreadful
hack.
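The hack works only because Latin-1 maps code points 0 through 255 one-to-one onto byte values, so encoding to Latin-1 recovers the original octets; a minimal demonstration:

```python
# The mis-decoded string: UTF-8 octets read back as if they were Latin-1.
garbled = 'M\xc3\xa1tt'

# Encoding to Latin-1 recovers the original byte sequence, which then
# decodes correctly as UTF-8.
fixed = garbled.encode('latin-1').decode('utf-8')
assert fixed == 'Mátt'   # i.e. 'M\xe1tt'
print(fixed)
```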

With my patch applied, the input ("?title=Mátt") produces the output
"Debug: 'M\xe1tt'. Information about Mátt."

Basically, this bug is going to affect all web developers as soon as
someone types a non-ASCII character. You could argue that supporting
UTF-8 by default is no better than supporting Latin-1 by default, but it
is: UTF-8 can encode all characters, where Latin-1 cannot; UTF-8 is the
URI encoding recommended by both the URI Syntax RFC [1] and the W3C HTML
4.01 specification [2]; and all major browsers use it to encode
non-ASCII characters in URIs.
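The browser behaviour described above is easy to check against the quoting that released Python 3 performs; assuming the shipped UTF-8 default, quote() produces exactly the form the browsers send:

```python
from urllib.parse import quote

# Major browsers percent-encode non-ASCII URI characters as UTF-8;
# with a UTF-8 default, quote() agrees with what they put on the wire.
assert quote('Mátt') == 'M%C3%A1tt'
print(quote('Mátt'))
```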

My patch may not be the best, or most conservative, solution to this
problem. I'm happy to see other proposals. But it's clearly an important
bug to fix, if I can't even write the simplest web app I can think of
without having to use a kludgey hack to get the string decoded
correctly. What is the point of having nice clean Unicode strings in the
language if the library spits out the wrong characters and it requires
more work to fix them than it used to with byte strings?

[1] http://tools.ietf.org/html/rfc3986#section-2.5
[2] http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1