urllib.quote and unquote - Unicode issues #47550

mgiuca · 2008-07-06T14:52:10Z

BPO	3300
Nosy	@gvanrossum, @loewis, @orsenthil, @pitrou
Files	parse.py.patch: (obsolete) Patch fixing all three issues; commit log in comment parse.py.patch2: (obsolete) Second patch (supersedes parse.py.patch); commit log in comment parse.py.patch3: (obsolete) Third patch (supersedes parse.py.patch2); commit log in comment parse.py.patch4: (obsolete) Fourth patch (supersedes parse.py.patch3); commit log in comment parse.py.patch5: (obsolete) Fifth patch (supersedes parse.py.patch4); commit log in comment parse.py.patch6: (obsolete) Sixth patch (supersedes parse.py.patch5); commit log in comment parse.py.patch7: (obsolete) Seventh patch (supersedes parse.py.patch6); commit log in comment parse.py.patch8: (obsolete) Eighth patch (supersedes parse.py.patch7); commit log in comment parse.py.metapatch8: Diff between patch7 and patch8 (result of review). patch parse.py.patch8+allsafe: Patch8, and quote allows all characters in 'safe' parse.py.patch9: Ninth patch (supersedes parse.py.patch8); commit log in comment for patch 8 parse.py.patch10: Tenth patch (supersedes parse.py.patch9); commit log in comment

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/gvanrossum'
closed_at = <Date 2008-08-18.21:45:16.480>
created_at = <Date 2008-07-06.14:52:09.661>
labels = ['type-bug', 'library', 'release-blocker']
title = 'urllib.quote and unquote - Unicode issues'
updated_at = <Date 2008-08-20.10:03:34.782>
user = 'https://bugs.python.org/mgiuca'

bugs.python.org fields:

activity = <Date 2008-08-20.10:03:34.782>
actor = 'mgiuca'
assignee = 'gvanrossum'
closed = True
closed_date = <Date 2008-08-18.21:45:16.480>
closer = 'gvanrossum'
components = ['Library (Lib)']
creation = <Date 2008-07-06.14:52:09.661>
creator = 'mgiuca'
dependencies = []
files = ['10829', '10870', '10873', '10883', '10888', '11009', '11015', '11069', '11070', '11089', '11092', '11093', '11111']
hgrepos = []
issue_num = 3300
keywords = ['patch']
message_count = 80.0
messages = ['69333', '69339', '69366', '69472', '69473', '69485', '69493', '69508', '69519', '69535', '69537', '69583', '69591', '70497', '70512', '70771', '70788', '70791', '70793', '70800', '70804', '70806', '70807', '70818', '70824', '70828', '70830', '70833', '70834', '70840', '70855', '70858', '70861', '70862', '70868', '70869', '70872', '70878', '70879', '70880', '70913', '70949', '70955', '70958', '70959', '70962', '70965', '70969', '70970', '71042', '71043', '71054', '71055', '71057', '71064', '71065', '71069', '71072', '71073', '71082', '71083', '71084', '71085', '71086', '71088', '71089', '71090', '71091', '71092', '71096', '71121', '71124', '71126', '71130', '71131', '71332', '71356', '71387', '71530', '71533']
nosy_count = 8.0
nosy_names = ['gvanrossum', 'loewis', 'jimjjewett', 'janssen', 'orsenthil', 'pitrou', 'thomaspinckney3', 'mgiuca']
pr_nums = []
priority = 'release blocker'
resolution = 'accepted'
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue3300'
versions = ['Python 3.0']

mgiuca · 2008-07-06T14:52:07Z

Three Unicode-related problems with urllib.parse.quote and
urllib.parse.unquote in Python 3.0. (Patch attached).

Firstly, unquote appears not to have been modified from Python 2, where
it is designed to output a byte string. In Python 3, it outputs a
unicode string, implicitly decoded as ISO-8859-1 (the code points are
the same as the bytes). RFC 3986 states that the percent-encoded byte
values should be decoded as UTF-8.

http://tools.ietf.org/html/rfc3986 section 2.5.

Current behaviour:
>>> urllib.parse.unquote("%CE%A3")
'Î£'
(or '\u00ce\u00a3')

Desired behaviour:
>>> urllib.parse.unquote("%CE%A3")
'Σ'
(or '\u03a3')

Secondly, while quote *has* been modified to encode to UTF-8 before
percent-encoding, it does not work correctly for characters in
range(128, 256), due to a special case in the code which again treats
the code point values as byte values.

Current behaviour:
>>> urllib.parse.quote('\u00e9')
'%E9'

Desired behaviour:
>>> urllib.parse.quote('\u00e9')
'%C3%A9'

Note that currently, quoting characters less than 256 will use
ISO-8859-1, while quoting characters 256 or higher will use UTF-8!

Thirdly, the "safe" argument to quote does not work for characters above
256, since these are excluded from the special case. I thought I would
fix this at the same time, but it's really a separate issue.

Current behaviour:
>>> urllib.parse.quote('Σϰ', safe='Σ')
'%CE%A3%CF%B0'

Desired behaviour:
>>> urllib.parse.quote('Σϰ', safe='Σ')
'Σ%CF%B0'

A patch which fixes all three issues is attached. Note that unquote now
needs to handle the case where the UTF-8 sequence is invalid. This is
currently handled by "replace" (invalid sequences are replaced by
'\ufffd'). I would like to add an optional "errors" argument to unquote,
defaulting to "replace", to allow the user to override this behaviour,
but I didn't put that in because it would change the interface.

Note I also changed one of the test cases, which had the wrong expected
output. (String literal was manually UTF-8 encoded, designed for Python
2; nonsensical when viewed as a Python 3 Unicode string).

All urllib test cases pass.

Patch is for branch /branches/py3k, revision 64752.

Note: The above unquote issue also manifests itself in Python 2 for
Unicode strings, but it's hazy as to what the behaviour should be, and
would break existing programs, so I'm just patching the Py3k branch.

Commit log:

urllib.parse.unquote: Fixed percent-encoded octets being implicitly
decoded as ISO-8859-1; now decode as UTF-8, as per RFC 3986.

urllib.parse.quote: Fixed characters in range(128, 256) being implicitly
encoded as ISO-8859-1; now encode as UTF-8. Also fixed characters
greater than 256 not responding to "safe", and also not being cached.

Lib/test/test_urllib.py: Updated one test case for unquote which
expected the wrong output. The new version of unquote passes the new
test case.

loewis · 2008-07-06T16:54:35Z

RFC 3986 states that the percent-encoded byte
values should be decoded as UTF-8.

Where precisely do you read such a SHOULD requirement?
Section 2.5 elaborates that the local encoding (of the
resource) is typically used, ignoring cases where URIs
are constructed on the client system (such scenario is
simply ignored in the RFC).

The last paragraph in section 2.5 is the only place that
seems to imply a SHOULD requirement (although it doesn't
use the keyword); this paragraph only talks about new URI
schemes. Unfortunately, for http, the encoding is of
characters is unspecified (this is somewhat solved by the
introduction of IRIs).

mgiuca · 2008-07-07T01:45:24Z

Point taken. But the RFC certainly doesn't say that ISO-8859-1 should be
used. Since we're outputting a Unicode string in Python 3, we need to
decode with some encoding, and UTF-8 seems the most sensible and
standardised.
(Even the existing test case in test_urllib.py:466 uses a UTF-8-encoded
URL, and I had to fix it so it decodes into a meaningful string).

Having said that, it's possible that you may wish to use another
encoding, and legal to do so. Therefore, I'd suggest we add an
"encoding" argument to both quote and unquote, which defaults to "utf-8".

Note that in the current implementation, unquote is not an inverse of
quote, because quote uses UTF-8 to encode characters with code points >=
256, while unquote decodes them as ISO-8859-1. I think it's important
these two functions are inverses of each other.

thomaspinckney3 · 2008-07-09T15:20:47Z

I mentioned this is in a brief python-dev discussion earlier this
spring, but many popular websites such as Wikipedia and Facebook do use
UTF-8 as their character encoding scheme for the path and argument
portion of URLs.

I know there's no RFC that says this is what should be done, but in
order to make urllib work out-of-the-box on as many common websites as
possible, I think defaulting to UTF-8 decoding makes a lot of sense.

Possibly allow an option charset argument to be passed into quote and
unquote, but default to UTF-8 in the absence of an explicit character
set being passed in?

mgiuca · 2008-07-09T15:51:48Z

OK I've gone back over the patch and decided to add the "encoding" and
"errors" arguments from the str.encode/decode methods as optional
arguments to quote and unquote. This is a much bigger change than I
originally intended, but I think it makes things much better because
we'll get UTF-8 by default (which as far as I can tell is by far the
most common encoding).

(Tom Pinckney just made the same suggestion right as I'm typing this up!)

So my new patch is a bit more extensive, and changes the interface (in a
backwards-compatible way). Both quote and unquote now support "encoding"
and "errors" arguments, defaulting to "utf-8" and "replace", respectively.

Implementation detail: This changes the Quoter class a lot; it now
hashes four fields to ensure it doesn't use the wrong cache.

Also fixed an issue with the previous patch where non-ASCII-compatible
encodings broke for code points < 128.

I then ran the full test suite and discovered two other modules test
cases broke. I've fixed them so the full suite passes, but I'm
suspicious there may be more issues (see below).

Lib/test/test_http_cookiejar.py: A test case was written explicitly
expecting Latin-1 encoding. I've changed this test case to expect UTF-8.
Lib/email/utils.py: I extensively analysed this code and discovered
that it kind of "cheats" - it uses the Latin-1 encoding and treats it as
octets, then applies its own encoding scheme. So to fix this, I changed
the email module to call quote and unquote with encoding="latin-1".
Hence it has the same behaviour as before.

Some potential issues:

I have not updated the documentation yet. If this idea is to go ahead,
the docs will need to show these new optional arguments. (I'll do that
myself but haven't yet).
While the full test suite passes, I'm sure there will be many more
issues since I've changed the interface. Therefore I don't recommend
this patch is accepted just yet. I plan to do an investigation into all
uses (within the standard lib) of quote and unquote to see if there are
any other compatibility issues, particularly within urllib. Hence I'll
respond to this again in a few days.
The new patch to "safe" argument of quote allows non-ASCII characters
to be made safe. This correspondingly allows the construction of URIs
with non-ASCII characters. Is it better to allow users to do this if
they really want, or just mysteriously fail to let those characters through?

I would also like to have a separate pair of functions, unquote_raw and
quote_raw, which work on bytes objects instead of strings. (unquote_raw
would take a str and produce a bytes, while quote_raw would take a bytes
and produce a str). As URI encoding is fundamentally an octet encoding,
not a character encoding, this is the only way to do URI encoding
without choosing a Unicode character encoding. (I see some modules such
as "email" treating the implicit Latin-1 encoding as byte encoding,
which is a bit dodgy - they could benefit from raw functions). But as
that requires further changes to the interface, I'll save it for another
day.

Patch (parse.py.patch2) is for branch /branches/py3k, revision 64820.

Commit log:

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets
(previously implicitly decoded as ISO-8859-1). As per RFC 3986, default
is "utf-8".

urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded (previously characters in range(128, 256)
were encoded as ISO-8859-1, and characters above that as UTF-8). Also
fixed characters greater than 256 not responding to "safe", and also not
being cached.

Lib/test/test_urllib.py, Lib/test/test_http_cookiejar.py: Updated test
cases which expected output in ISO-8859-1, now expects UTF-8.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).

loewis · 2008-07-09T20:28:32Z

Assuming the patch is acceptable in the first place (which I personally
have not made my mind up), then it lacks documentation and test suite
changes.

mgiuca · 2008-07-10T01:54:58Z

OK well here are the necessary changes to the documentation (RST docs
and docstrings in the code).

As I said above, I plan to to extensive testing and add new cases, and I
don't recommend this patch is accepted until that's done.

Patch (parse.py.patch3) is for branch /branches/py3k, revision 64834.

Commit log:

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets
(previously implicitly decoded as ISO-8859-1). As per RFC 3986, default
is "utf-8".

urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded (previously characters in range(128, 256)
were encoded as ISO-8859-1, and characters above that as UTF-8). Also
fixed characters greater than 256 not responding to "safe", and also not
being cached.

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface.

Lib/test/test_urllib.py, Lib/test/test_http_cookiejar.py: Updated test
cases which expected output in ISO-8859-1, now expects UTF-8.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).

mgiuca · 2008-07-10T16:13:03Z

Setting Version back to Python 3.0. Is there a reason it was set to
Python 3.1? This proposal will certainly break a lot of code. It's *far*
better to do it in the big backwards-incompatible Python 3.0 release
than a later release.

loewis · 2008-07-10T20:55:42Z

Setting Version back to Python 3.0. Is there a reason it was set to
Python 3.1?

3.0b1 has been released, so no new features can be added to 3.0.

mgiuca · 2008-07-11T07:03:20Z

3.0b1 has been released, so no new features can be added to 3.0.

While my proposal is no doubt going to cause a lot of code breakage, I
hardly consider it a "new feature". This is very definitely a bug. As I
understand it, the point of a code freeze is to stop the addition of
features which could be added to a later version. Realistically, there
is no way this issue can be fixed after 3.0 is released, as it
necessarily involves changing the behaviour of this function.

Perhaps I should explain further why this is a regression from Python
2.x and not a feature request. In Python 2.x, with byte strings, the
encoding is not an issue. quote and unquote simply encode bytes, and if
you want to use Unicode you have complete control. In Python 3.0, with
Unicode strings, if functions manipulate string objects, you don't have
control over the encoding unless the functions give you explicit
control. So Python 3.0's native Unicode strings have broken the library.

I give two examples.

Firstly, I believe that unquote(quote(x)) should always be true for all
strings x. In Python 2.x, this is always trivially true (for non-Unicode
strings), because they simply encode and decode the octets. In Python
3.0, the two functions are inconsistent, and break out of the range(0, 256).

>>> urllib.parse.unquote(urllib.parse.quote('ÿ')) # '\u00ff'
'ÿ'
# Works, because both functions work with ISO-8859-1 in this range.

>>> urllib.parse.unquote(urllib.parse.quote('Ā')) # '\u0100'
'Ä\x80'
# Fails, because quote uses UTF-8 and unquote uses ISO-8859-1.

My patch succeeds for all characters.
>>> urllib.parse.unquote(urllib.parse.quote('Ā')) # '\u0100'
'Ā'

Secondly, a bigger example, but I want to demonstrate how this bug
affects web applications, even very simple ones.

Consider this simple (beginnings of a) wiki system in Python 2.5, as a
CGI app:

#---

import cgi

fields = cgi.FieldStorage()
title = fields.getfirst('title')

print("Content-Type: text/html; charset=utf-8")
print("")

print('<p>Debug: %s</p>' % repr(title))
if title is None:
    print("No article selected")
else:
    print('<p>Information about %s.</p>' % cgi.escape(title))
#

(Place this in cgi-bin, navigate to it, and add the query string
"?title=Page Title"). I'll use the page titled "Mátt" as a test case.

If you navigate to "?title=Mátt", it displays the text "Debug:
'M\xc3\xa1tt'. Information about Mátt.". The browser (at least Firefox,
Safari and IE I have tested) encodes this as "?title=M%C3%A1tt". So this
is trivial, as it's just being unquoted into a raw byte string
'M\xc3\xa1tt', then written out again as a byte string.

Now consider that you want to manipulate it as a Unicode string, still
in Python 2.5. You could augment the program to decode it as UTF-8 and
then re-encode it. (I wrote a simple UTF-8 printing function which takes
Unicode strings as input).

#---

import sys
import cgi

def printu8(*args):
    """Prints to stdout encoding as utf-8, rather than the current terminal
    encoding. (Not a fully-featured print function)."""
    sys.stdout.write(' '.join([x.encode('utf-8') for x in args]))
    sys.stdout.write('\n')

fields = cgi.FieldStorage()
title = fields.getfirst('title')
if title is not None:
    title = str(title).decode("utf-8", "replace")

print("Content-Type: text/html; charset=utf-8")
print("")

print('<p>Debug: %s.</p>' % repr(title))
if title is None:
    print("No article selected.")
else:
    printu8('<p>Information about %s.</p>' % cgi.escape(title))
#

Now given the same input ("?title=Mátt"), it displays "Debug:
u'M\xe1tt'. Information about Mátt." Still working fine, and I can
manipulate it as Unicode because in Python 2.x I have direct control
over encoding/decoding.

Now let us upgrade this program to Python 3.0. (Note that I still can't
print Unicode characters directly out, because running through Apache
the stdout encoding is not UTF-8, so I use my printu8 function).

#---

import sys
import cgi

def printu8(*args):
    """Prints to stdout encoding as utf-8, rather than the current terminal
    encoding. (Not a fully-featured print function)."""
    sys.stdout.buffer.write(b' '.join([x.encode('utf-8') for x in args]))
    sys.stdout.buffer.write(b'\n')

fields = cgi.FieldStorage()
title = fields.getfirst('title')
# Note: No call to decode. I have no opportunity to specify the encoding
since
# it comes straight out of FieldStorage as a Unicode string.

print("Content-Type: text/html; charset=utf-8")
print("")

print('<p>Debug: %s.</p>' % ascii(title))
if title is None:
    print("No article selected.")
else:
    printu8('<p>Information about %s.</p>' % cgi.escape(title))
#

Now given the same input ("?title=Mátt"), it displays "Debug:
'M\xc3\xa1tt'. Information about MÃ¡tt." Once again, it is erroneously
(and implicitly) decoded as ISO-8859-1, so I end up with a meaningless
Unicode string. The only possible thing I can do about this as a web
developer is call title.encode('latin-1').decode('utf-8') - a dreadful hack.

With my patch applied, the input ("?title=Mátt") produces the output
"Debug: 'M\xe1tt'. Information about Mátt."

Basically, this bug is going to affect all web developers as soon as
someone types a non-ASCII character. You could argue that supporting
UTF-8 by default is no better than supporting Latin-1 by default, but it
is. UTF-8 supports encoding of all characters where Latin-1 does not,
UTF-8 is the recommended URI encoding by both the URI Syntax RFC[1] and
the W3C HTML 4.01 specification[2], and all major browsers use it to
encode non-ASCII characters in URIs.

My patch may not be the best, or most conservative, solution to this
problem. I'm happy to see other proposals. But it's clearly an important
bug to fix, if I can't even write the simplest web app I can think of
without having to use a kludgey hack to get the string decoded
correctly. What is the point of having nice clean Unicode strings in the
language if the library spits out the wrong characters and it requires
more work to fix them than it used to with byte strings?

[1] http://tools.ietf.org/html/rfc3986#section-2.5
[2] http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1

mgiuca · 2008-07-11T07:18:47Z

Since I got a complaint that my last reply was too long, I'll summarize it.

It's a bug report, not a feature request.

I can't get a simple web app to be properly Unicode-aware in Python 3,
which worked fine in Python 2. This cannot be put off until 3.1, as any
viable solution will break existing code.

mgiuca · 2008-07-12T11:32:50Z

OK I spent awhile writing test cases for quote and unquote, encoding and
decoding various Unicode strings with different encodings. As a result,
I found a bunch of issues in my previous patch, so I've rewritten the
patches to both quote and unquote. They're both actually more similar to
the original version now.

I'd be interested in hearing if anyone disagrees with my expected output
for these test cases.

I'm now confident I have good test coverage directly on the quote and
unquote functions. However, I haven't tested the other library functions
which depend upon them (though the entire test suite passes). Though as
I showed in that big post I made yesterday, other modules such as cgi
seem to be working fine (their behaviour has changed; they use UTF-8
now; but that's the whole point of this patch).

I still haven't figured out what the behaviour of "safe" should be in
quote. Should it only allow ASCII characters (thereby limiting the
output to an ASCII string, as specified by RFC 3986)? Should it also
allow Latin-1 characters, or all Unicode characters as well (perhaps
allowing you to create IRIs -- admittedly I don't know much about IRIs).
The new implementation of quote makes it rather difficult to allow
non-Latin-1 characters to be made "safe", as it encodes the string into
bytes before any processing.

Patch (parse.py.patch4) is for branch /branches/py3k, revision 64891.

Commit log:

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets.
As per RFC 3986, default is "utf-8" (previously implicitly decoded as
ISO-8859-1).

urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded. Default is "utf-8" (previously characters
in range(128, 256) were encoded as ISO-8859-1, and characters above that
as UTF-8). Also characters above 128 are no longer allowed to be "safe".

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface.

Lib/test/test_urllib.py: Added several new test cases testing encoding
and decoding Unicode strings with various encodings. This includes
updating one test case to now expect UTF-8 by default.

Lib/test/test_http_cookiejar.py: Updated test case which expected output
in ISO-8859-1, now expects UTF-8.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).

mgiuca · 2008-07-12T17:05:53Z

So today I grepped for "urllib" in the entire library in an effort to
track down every dependency on quote and unquote to see exactly how my
patch breaks other code. I've now investigated every module in the
library which uses quote, unquote or urlencode, and my findings are
documented below in detail.

So far I have found no code "breakage" except for the original
email.util issue I fixed in patch 2. Of course that doesn't mean the
behaviour hasn't changed. Nearly all modules in the report below have
changed their behaviour so they used to deal with Latin-1-encoded URLs
and now deal with UTF-8-encoded URLs. As discussed at length above, I
see this as a positive change, since nearly everybody encodes URLs in
UTF-8, and of course it allows for all characters.

I also point out that the http.server module (unpatched) is internally
broken when dealing with filenames with characters outside range(0,256);
my patch fixes it.

I'm attaching patch 5, which adds a bunch of new test cases to various
modules which demonstrate those modules correctly handling UTF-8-encoded
URLs. It also fixes a bug in email.utils which I introduced in patch 2.

Note that I haven't yet fully investigated urllib.request.

Aside from that, the only remaining matter is whether or not it's better
to encode URLs as UTF-8 or Latin-1 by default, and I'm pretty sure that
question doesn't need debate.

So basically I think if there's support for it, this patch is just about
ready to be accepted. I'm hoping it can be included in the 3.0b2 release
next week.

I'd be glad to hear any feedback about this proposal.

Not Yet Investigated
--------------------

./urllib/request.py
By far the biggest user of quote and unquote.
username, password, hostname and paths are now all converted
to/from UTF-8 percent-encodings.
Other concerns are:
* Data in the form application/x-www-form-urlencoded
* FTP access
I think this needs to be tested further.

Looks fine, not tested
----------------------

./xmlrpc/client.py
Just used to decode URI auth string (user:pass). This will change
to UTF-8, but is probably OK.
./logging/handlers.py
Just uses it in the HTTP handler to encode a dictionary. Probably
preferable to use UTF-8 to encode an arbitrary string.
./macurl2path.py
Calls to urllib look broken. Not tested.

Tested manually, fine
---------------------

./wsgiref/simple_server.py
Just used to set PATH_INFO, fine if URLs are UTF-8 encoded.
./http/server.py
All uses are for translating between actual file-system paths to
URLs. This works fine for UTF-8 URLs. Note that since it uses
quote to create URLs in a dir listing, and unquote to handle
them, it breaks when unquote is not the inverse of quote.

Consider the following simple script:

    import http.server
    s = http.server.HTTPServer(('',8000),
            http.server.SimpleHTTPRequestHandler)
    s.serve_forever()

This will "kind of" work in the unpatched version, using
Latin-1 URLs, but filenames with characters above 256 will
break (give a 404 error).
The patch fixes this.

./urllib/robotparser.py
No test cases. Manually tested, URLs properly match when
percent-encoded in UTF-8.
./nturl2path.py
No test cases available. Manually tested, fine if URLs are
UTF-8 encoded.

Test cases either exist or added, fine
--------------------------------------

./test/test_urllib.py
I wrote a large wad of test cases for all the new functionality.
./wsgiref/util.py
Added test cases expecting UTF-8.
./http/cookiejar.py
I changed a test case to expect UTF-8.
./email/utils.py
I changed this file to behave as it used to, to satisfy its
existing test cases.
./cgi.py
Added test cases for UTF-8-encoded query strings.

Commit log:

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets.
As per RFC 3986, default is "utf-8" (previously implicitly decoded as
ISO-8859-1).

urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded. Default is "utf-8" (previously characters
in range(128, 256) were encoded as ISO-8859-1, and characters above that
as UTF-8). Also characters above 128 are no longer allowed to be "safe".

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface.

Lib/test/test_urllib.py: Added several new test cases testing encoding
and decoding Unicode strings with various encodings. This includes
updating one test case to now expect UTF-8 by default.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).

mgiuca · 2008-07-31T11:27:32Z

OK after a long discussion on the mailing list, Guido gave this the OK,
with the provision that there are str->bytes and bytes->str versions of
these functions as well. So I've written those.

http://mail.python.org/pipermail/python-dev/2008-July/081601.html

quote itself now accepts either a str or a bytes. quote_from_bytes is a
new function which is just an alias for quote. (Is this acceptable?)

unquote is still str->str. I've added a totally separate function
unquote_to_bytes which is str->bytes.

Note there is a slight issue here: I didn't quite know what to do with
unescaped non-ASCII characters in the input to unquote_to_bytes - they
need to somehow be converted to bytes. I chose to encode them using
UTF-8, on the basis that they technically shouldn't be in a URI anyway.

Note that my new unquote doesn't have this problem; it's carefully
written to preserve the Unicode characters, even if they aren't
expressible in the given encoding (which explains some of the code bloat).

This makes unquote(s, encoding=e) necessarily more robust than
unquote_to_bytes(s).decode(e) in terms of unescaped non-ASCII characters
in the input.

I've also added new test cases and documentation for these two new
functions (included in patch6).

On an entirely personal note, can whoever checks this in please mention
my name in the commit log - I've put in at least 30 hours researching
and writing this patch, and I'd like for this not to go uncredited :)

Commit log for patch6:

Fix for bpo-3300.

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets.
As per RFC 3986, default is "utf-8" (previously implicitly decoded as
ISO-8859-1).

urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded. Default is "utf-8" (previously characters
in range(128, 256) were encoded as ISO-8859-1, and characters above that
as UTF-8). Also characters/bytes above 128 are no longer allowed to be
"safe". Also now allows either bytes or strings.

Added functions urllib.parse.quote_from_bytes,
urllib.parse.unquote_to_bytes.

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface, added quote_from_bytes and unquote_to_bytes.

Lib/test/test_urllib.py: Added several new test cases testing encoding
and decoding Unicode strings with various encodings, as well as testing
the new functions.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).

mgiuca · 2008-07-31T14:49:50Z

Hmm ... seems patch 6 I just checked in fails a test case! Sorry! (It's
minor, gives a harmless BytesWarning if you run with -b, which "make
test" does, so I only picked it up after submitting).

I've slightly changed the code in quote so it doesn't do that any more
(it normalises all "safe" arguments to bytes).

Please review patch 7, not 6. Same commit log as above.

(Also .. someone let me know if I'm not submitting patches properly,
like perhaps I should be deleting the old ones not keeping them around?)

janssen · 2008-08-06T05:59:42Z

Here's my version of how quote and unquote should be implemented in
Python 3.0. I haven't looked at the uses of it in the library, but I'd
expect improper uses (and there are lots of them) will break, and thus
can be fixed.

Basically, percent-quoting is about creating an ASCII string that can be
safely used in URI from an arbitrary sequence of octets. So, my version
of quote() takes either a byte sequence or a string, and percent-quotes
the unsafe ones, and then returns a str. If a str is supplied on input,
it is first converted to UTF-8, then the octets of that encoding are
percent-quoted.

For unquote, there's no way to tell what the octets of the quoted
sequence may mean, so this takes the percent-quoted ASCII string, and
returns a byte sequence with the unquoted bytes. For convenience, since
the unquoted bytes are often a string in some particular character set
encoding, I've also supplied unquote_as_string(), which takes an
optional character set, and first unquotes the bytes, then converts them
to a str, using that character set encoding, and returns the resulting
string.

janssen · 2008-08-06T16:51:25Z

Here's a patch to parse.py (and test/test_urllib.py) that makes the
various tests (cgi, urllib, httplib) pass. It basically adds
"unquote_as_string", "unquote_as_bytes", "quote_as_string",
"quote_as_bytes", and then define the existing "quote" and "unquote" in
terms of them.

jimjjewett · 2008-08-06T17:29:29Z

Is there still disagreement over anything except:

(1) The type signature of quote and unquote (as opposed to the
explicit "quote_as_bytes" or "quote_as string").

(2) The default encoding (latin-1 vs UTF8), and (if UTF-8) what to do
with invalid byte sequences?

(3) Would waiting for 3.1 cause too many compatibility problems?

pitrou · 2008-08-06T18:39:42Z

Bill, I haven't studied your patch in detail but a few comments:

it would be nice to have more unit tests, especially for the various
bytes/unicode possibilities, and perhaps also roundtripping (Matt's
patch has a lot of tests)
quote_as_bytes() should return a bytes object, not a bytearray
using the "%02X" format looks clearer to me than going through the
_hextable lookup table...
when the argument is of the wrong type, quote_as_bytes() should raise
a TypeError rather than a ValueError
why is quote_as_string() hardwired to utf8 while unquote_as_string()
provides a charset parameter? wouldn't it be better for them to be
consistent with each other?

gvanrossum · 2008-08-06T19:51:23Z

Bill Janssen's "patch" breaks two unittests: test_email and
test_http_cookiejar. Details for test_email:

======================================================================
ERROR: test_rfc2231_bad_character_in_filename
(email.test.test_email.TestRFC2231)
.
.
.
File "/usr/local/google/home/guido/python/py3k/Lib/urllib/parse.py",
line 267, in unquote_as_string
return str(unquote_as_bytes(s, plus=plus), charset, 'strict')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 13:
unexpected end of data

======================================================================
FAIL: test_rfc2231_bad_character_in_charset
(email.test.test_email.TestRFC2231)
----------------------------------------------------------------------

Traceback (most recent call last):
  File
"/usr/local/google/home/guido/python/py3k/Lib/email/test/test_email.py",
line 3279, in test_rfc2231_bad_character_in_charset
    self.assertEqual(msg.get_content_charset(), None)
AssertionError: 'utf-8\\u201d' != None

Details for test_http_cookiejar:

======================================================================
FAIL: test_url_encoding (main.LWPCookieTests)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "Lib/test/test_http_cookiejar.py", line 1454, in test_url_encoding
    self.assert_("foo=bar" in cookie and version_re.search(cookie))
AssertionError: None

jimjjewett · 2008-08-06T21:03:10Z

Matt pointed out that the email package assumes Latin-1 rather than UTF-8; I
assume Bill could patch his patch the same way Matt did, and this would
resolve the email tests. (Unless you pronounce to stick with Latin-1)

The cookiejar failure probably has the same root cause; that test is
encoding (non-ASCII) Latin-1 characters, and urllib.parse.py/Quoter assumes
Latin-1.

So I see some evidence (probably not enough) for sticking with Latin-1
instead of UTF-8. But I don't see any evidence that fixing the semantics
(encoded results should be bytes) at the same time made the conversion any
more painful.

On the other hand, Matt shows that some of those extra str->byte code
changes might never need to be done at all, except for purity.

gvanrossum · 2008-08-06T21:39:36Z

Dear GvR,

New code review comments by GvR have been published.
Please go to http://codereview.appspot.com/2827 to read them.

Message:
Hi Matt,

Here's a code review of your patch.

I'm leaning more and more towards wanting this for 3.0, but I have some API
design issues and also some style nits.

I'm cross-linking this with the Python tracker issue, through the subject.

Details:

http://codereview.appspot.com/2827/diff/1/2
File Doc/library/urllib.parse.rst (right):

http://codereview.appspot.com/2827/diff/1/2#newcode198
Line 198: replaced by a placeholder character.
I don't think that's a good default. I'd rather see it default to strict --
that's what encoding translates to everywhere else. I believe that lenient
error handling by default can cause subtle security problems too, by hiding
problem characters from validation code.

http://codereview.appspot.com/2827/diff/1/2#newcode215
Line 215: An alias for :func:`quote`, intended for use with a :class:`bytes`
object
I'd rather see this as a wrapper that raises TypeError if the argument
isn't a
bytes or bytearray instance. Otherwise it's needless redundancy.

http://codereview.appspot.com/2827/diff/1/2#newcode223
Line 223: Replace %xx escapes by their single-character equivalent.
Should add what the argument type is -- I vote for str or bytes/bytearray.

http://codereview.appspot.com/2827/diff/1/2#newcode242
Line 242: .. function:: unquote_to_bytes(string)
Again, add what the argument type is.

http://codereview.appspot.com/2827/diff/1/4
File Lib/email/utils.py (right):

http://codereview.appspot.com/2827/diff/1/4#newcode224
Line 224: except:
An unqualified except clause is unacceptable here. Why do you need this
anyway?

http://codereview.appspot.com/2827/diff/1/5
File Lib/test/test_http_cookiejar.py (right):

http://codereview.appspot.com/2827/diff/1/5#newcode1450
Line 1450: "%3c%3c%0Anew%C3%A5/%C3%A5",
I'm guessing this test broke otherwise? Given that this references an RFC,
is
it correct to just fix it this way?

http://codereview.appspot.com/2827/diff/1/3
File Lib/urllib/parse.py (right):

http://codereview.appspot.com/2827/diff/1/3#newcode10
Line 10: "urlsplit", "urlunsplit"]
Please add all the quote/unquote versions here too.
(They were there in 2.5, but somehow got dropped from 3.0.

http://codereview.appspot.com/2827/diff/1/3#newcode265
Line 265: # Maps lowercase and uppercase variants (but not mixed case).
That sounds like a disaster. Why would %aa and %AA be correct but not %aA
and
%Aa? (Even though the old code had the same problem.)

http://codereview.appspot.com/2827/diff/1/3#newcode283
Line 283: def unquote(s, encoding = "utf-8", errors = "replace"):
Please no spaces around the '=' when used for an argument default (or for a
keyword arg).

Also see my comment about defaulting to 'replace' in the doc file.

Finally -- let's be consistent about quotes. It seems most of this file
uses
single quotes, so lets stick to that (except docstrings always use double
quotes).

And more: what should a None value for encoding or errors mean? IMO it
should
mean "use the default".

http://codereview.appspot.com/2827/diff/1/3#newcode382
Line 382: safe = safe.encode('ascii', 'ignore')
Using errors='ignore' seems like a mistake -- it will hide errors.

I also wonder why safe should be limited to ASCII though.

http://codereview.appspot.com/2827/diff/1/3#newcode399
Line 399: if ' ' in s:
This test means that it won't work if the input is bytes. E.g.

urllib.parse.quote_plus(b"abc def")

raises a TypeError.

Sincerely,

Your friendly code review daemon (http://codereview.appspot.com/).

jimjjewett · 2008-08-06T22:25:36Z

http://codereview.appspot.com/2827/diff/1/5#newcode1450
Line 1450: "%3c%3c%0Anew%C3%A5/%C3%A5",
I'm guessing this test broke otherwise?

Yes; that is one of the breakages you found in Bill's patch. (He didn't
modify the test.)

Given that this references an RFC,
is it correct to just fix it this way?

Probably. Looking at http://www.faqs.org/rfcs/rfc2965.html

(1) That is not among the exact tests in the RFC.
(2) The RFC does not specify charset for the cookie in general, but the
Comment field MUST be in UTF-8, and the only other reference I could find to
a specific charset was "possibly in a server-selected printable ASCII
encoding."

Whether we have to use Latin-1 (or document charset) in practice for
compatibility reasons, I don't know.

mgiuca · 2008-08-07T12:11:15Z

Dear GvR,

New code review comments by mgiuca have been published.
Please go to http://codereview.appspot.com/2827 to read them.

Message:
Hi Guido,

Thanks very much for this very detailed review. I've replied to the
comments. I will make the changes as described below and send a new
patch to the tracker.

mgiuca · 2008-08-07T13:42:46Z

A reply to a point on GvR's review, I'd like to open for discussion.
This relates to whether or not quote's "safe" argument should allow
non-ASCII characters.

Using errors='ignore' seems like a mistake -- it will hide errors. I >
also wonder why safe should be limited to ASCII though.

The reasoning is this: if we allow non-ASCII characters to be escaped,
then we allow quote to generate invalid URIs (URIs are only allowed to
have ASCII characters). It's one thing for unquote to accept such URIs,
but I think we shouldn't be producing them. Albeit, it only produces an
invalid URI if you explicitly request it. So I'm happy to make the
change to allow any character to be safe, but I'll let it go to
discussion first.

pitrou · 2008-08-07T14:20:05Z

Le jeudi 07 août 2008 à 13:42 +0000, Matt Giuca a écrit :

The reasoning is this: if we allow non-ASCII characters to be escaped,
then we allow quote to generate invalid URIs (URIs are only allowed to
have ASCII characters). It's one thing for unquote to accept such URIs,
but I think we shouldn't be producing them. Albeit, it only produces an
invalid URI if you explicitly request it. So I'm happy to make the
change to allow any character to be safe, but I'll let it go to
discussion first.

The important is that the defaults are safe. If users want to override
the defaults and produce potentially invalid URIs, there is no reason to
discourage them.

mgiuca · 2008-08-07T14:35:18Z

The important is that the defaults are safe. If users want to override
the defaults and produce potentially invalid URIs, there is no reason to
discourage them.

OK I think that's a fairly valid argument. I'm about to head off so I'll
post the patch I have now, which fixes most of the other concerns. That
change will cause havoc to quote I think ;)

mgiuca · 2008-08-07T14:59:56Z

Following Guido and Antoine's reviews, I've written a new patch which
fixes *most* of the issues raised. The ones I didn't fix I have noted
below, and commented on the review site
(http://codereview.appspot.com/2827/). Note: I intend to address all of
these issues after some discussion.

Outstanding issues raised by the reviews:

Doc/library/urllib.parse.rst:
Should unquote accept a bytes/bytearray as well as a str?

Lib/email/utils.py:
Should encode_rfc2231 with charset=None accept strings with non-ASCII
characters, and just encode them to UTF-8?

Lib/test/test_http_cookiejar.py:
Does RFC 2965 let me get away with changing the test case to expect
UTF-8? (I'm pretty sure it doesn't care what encoding is used).

Lib/test/test_urllib.py:
Should quote raise a TypeError if given a bytes with encoding/errors
arguments? (Motivation: TypeError is what you usually raise if you
supply too many args to a function).

Lib/urllib/parse.py:
(As discussed above) Should quote accept safe characters outside the
ASCII range (thereby potentially producing invalid URIs)?

------

Commit log for patch8:

Fix for bpo-3300.

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets.
As per RFC 3986, default is "utf-8" (previously implicitly decoded as
ISO-8859-1). Also fixed a bug in which mixed-case hex digits (such as
"%aF") weren't being decoded at all.

urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded. Default is "utf-8" (previously characters
in range(128, 256) were encoded as ISO-8859-1, and characters above that
as UTF-8). Also characters/bytes above 128 are no longer allowed to be
"safe". Also now allows either bytes or strings.

Added functions urllib.parse.quote_from_bytes,
urllib.parse.unquote_to_bytes. All quote/unquote functions now exported
from the module.

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface, added quote_from_bytes and unquote_to_bytes.

Lib/test/test_urllib.py: Added many new test cases testing encoding
and decoding Unicode strings with various encodings, as well as testing
the new functions.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).

janssen · 2008-08-12T17:38:39Z

Larry Masinter is off on vacation, but I did get a brief message saying
that he will dig up similar discussions that he was involved in when he
gets back.

Out of curiosity, I sent a note off to the www-international mailing
list, and received this:

``For the authority (server name) portion of a URI, RFC 3986 is pretty
clear that UTF-8 must be used for non-ASCII values (assuming, for a
moment, that IDNA addresses are not Punycode encoded already). For the
path portion of URIs, a large-ish proportion of them are, indeed, UTF-8
encoded because that has been the de facto standard in Web browsers for
a number of years now. For the query and fragment parts, however, the
encoding is determined by context and often depends on the encoding of
some page that contains the form from which the data is taken. Thus, a
large number of URIs contain non-UTF-8 percent-encoded octets.''

http://lists.w3.org/Archives/Public/www-international/2008JulSep/0041.html

janssen · 2008-08-12T17:47:34Z

For Antoine:

I think the problem that Barry is facing with the email package is that
Unicode strings are an ambiguous representation of a sequence of bytes;
that is, there are a number of different byte sequences a Unicode string
may have come from. His ingenious use of raw-unicode-escape is an
attempt to conform to the requirement of having to produce a string, but
without losing any data, so that an application program can, if it needs
to, still reprocess that string and retrieve the original data. Naive
application programs that sort of expected the result to be an ASCII
string will be unaffected. Not sure it's the best idea; this is all
about just where to force unexpected runtime failures.

janssen · 2008-08-12T19:37:40Z

Here's another thought:

Let's put string_to_bytes and string_from_bytes into the binascii
module, as a2b_percent and b2a_percent, respectively.

Then parse.py would import them as

  from binascii import a2b_percent as percent_decode_as_bytes
  from binascii import b2a_percent as percent_encode_from_bytes

and add two more functions:

  def percent_encode(<string>, encoding="UTF-8", error="strict", plus=False)
  def percent_decode(<string>, encoding="UTF-8", error="strict", plus=False)

and would add backwards-compatible but deprecated functions for quote
and unquote:

  def quote(s):
      warnings.warn("urllib.parse.quote should be replaced by
percent_encode or percent_encode_from_bytes", FutureDeprecationWarning)
      if isinstance(s, str):
          return percent_encode(s)
      else:
          return percent_encode_from_bytes(s)

  def unquote(s):
      warnings.warn("urllib.parse.unquote should be replaced by
percent_decode or percent_decode_to_bytes", FutureDeprecationWarning)
      if isinstance(s, str):
          return percent_decode(s)
      else:
          return percent_decode(str(s, "ASCII", "strict"))

pitrou · 2008-08-12T22:12:50Z

Le mardi 12 août 2008 à 19:37 +0000, Bill Janssen a écrit :

Let's put string_to_bytes and string_from_bytes into the binascii
module, as a2b_percent and b2a_percent, respectively.

Well, it's my personal opinion, but I think we should focus on a simple
and straightforward solution for the present issue before beta3 is
released (which is in 8 days now). It has already been difficult to find
a (quasi-)consensus for a simple patch to adapt quote()/unquote() to the
realities of bytes/unicode separation in py3k: witness the length of the
present discussion.

(perhaps a sophisticated solution could still be adopted for 3.1,
especially if it has backwards compatibility in mind)

gvanrossum · 2008-08-12T23:43:45Z

Matt Giuca <matt.giuca@gmail.com> added the comment:
By the way, what is the current status of this bug? Is anybody waiting
on me to do anything? (Re: Patch 9)

I'll be reviewing it today or tomorrow. From looking at it briefly I
worry that the implementation is pretty slow -- a method call for each
character and a map() call sounds pretty bad.

To recap my previous list of outstanding issues raised by the review:

> Should unquote accept a bytes/bytearray as well as a str?
Currently, does not. I think it's meaningless to do so (and how to
handle >127 bytes, if so?)

The bytes > 127 would be translated as themselves; this follows
logically from how stuff is parsed -- %% and %FF are translated,
everything else is not. But I don't really care, I doubt there's a
need.

> Lib/email/utils.py:
> Should encode_rfc2231 with charset=None accept strings with non-ASCII
> characters, and just encode them to UTF-8?
Currently does. Suggestion to restrict to ASCII on the review tracker;
simple fix.

I think I agree with that comment; it seems wrong to return UTF8
without setting that in the header. The alternative would be to
default charset to utf8 if there are any non-ASCII chars in the input.
I'd be okay with that too.

> Should quote raise a TypeError if given a bytes with encoding/errors
> arguments? (Motivation: TypeError is what you usually raise if you
> supply too many args to a function).
Resolved. Raises TypeError.

> Lib/urllib/parse.py:
> (As discussed above) Should quote accept safe characters outside the
> ASCII range (thereby potentially producing invalid URIs)?
Resolved? Implemented, but too messy and not worth it just to produce
invalid URIs, so NOT in patch.

Agreed, safe should be ASCII chars only.

That's only two very minor yes/no issues remaining. Please comment.

I believe patch 9 still has errors defaulting to strict for quote().
Weren't you going to change that?

Regarding using UTF-8 as the default encoding, I still think this the
right thing to do -- while the tables shown by Bill indicate that
there's still a lot of Latin-1 out there, UTF-8 is definitely gaining
on it, and I expect that Python apps, especially Py3k apps, are much
more likely to follow (and hopefully reinforce! :-) this trend than to
lag behind.

mgiuca · 2008-08-13T14:25:19Z

I have no strong opinion on the very remaining points you listed,
except that IMHO encode_rfc2231 with charset=None should not try to
use UTF8 by default. But someone with more mail protocol skills
should comment :)

OK I've come to the realization that DEMANDING ascii (and erroring on
non-ASCII chars) is better for the short term anyway, because we can
always decide later to relax the restrictions, but it's a lot worse to
add restrictions later. So I agree now, should be ASCII. And no, I don't
have mail protocol skills.

The same goes for unquote accepting bytes. We can decide to make it
accept bytes later, but can't remove that feature later, so it's best
(IMHO) to let it NOT accept bytes (which is the current behaviour).

The bytes > 127 would be translated as themselves; this follows
logically from how stuff is parsed -- %% and %FF are translated,
everything else is not. But I don't really care, I doubt there's a
need.

Ah but what about unquote (to string)? If it accepted bytes then it
would be a bytes->str operation, and then you need a policy on DEcoding
those bytes. It makes things too complex I think.

I believe patch 9 still has errors defaulting to strict for quote().
Weren't you going to change that?

I raised it as a concern, but I thought you overruled on that, so I left
it as errors='strict'. What do you want it to be? 'replace'? Now that
this issue has been fully discussed, I'm happy with whatever you decide.

From looking at it briefly I
worry that the implementation is pretty slow -- a method call for each
character and a map() call sounds pretty bad.

Yes, it does sound pretty bad. However, that's the current way of doing
things in both 2.x and 3.x; I didn't change it (though it looks like I
changed a LOT, I really did try to change as little as possible!)
Assuming it wasn't made _slower_ than before, can we ignore existing
performance issues and treat them as a separate matter (and can be dealt
with after 3.0)?

I'm not putting up a new patch now. The only fix I'd make is to add
Antoine's "or 'ascii'" to email/utils.py, as suggested on the review
tracker. I'll make this change along with any other recommendations
after your review.

(That is Lib/email/utils.py line 222 becomes:
s = urllib.parse.quote(s, safe='', encoding=charset or 'ascii')
)

btw this Rietveld is amazing. I'm assuming I don't have permission to
upload patches there (can't find any button to do so) which is why I
keep posting them here and letting you upload to Rietveld ...

gvanrossum · 2008-08-13T14:50:28Z

On Wed, Aug 13, 2008 at 7:25 AM, Matt Giuca <report@bugs.python.org> wrote:

> I have no strong opinion on the very remaining points you listed,
> except that IMHO encode_rfc2231 with charset=None should not try to
> use UTF8 by default. But someone with more mail protocol skills
> should comment :)

OK I've come to the realization that DEMANDING ascii (and erroring on
non-ASCII chars) is better for the short term anyway, because we can
always decide later to relax the restrictions, but it's a lot worse to
add restrictions later. So I agree now, should be ASCII. And no, I don't
have mail protocol skills.

OK.

The same goes for unquote accepting bytes. We can decide to make it
accept bytes later, but can't remove that feature later, so it's best
(IMHO) to let it NOT accept bytes (which is the current behaviour).

OK.

> The bytes > 127 would be translated as themselves; this follows
> logically from how stuff is parsed -- %% and %FF are translated,
> everything else is not. But I don't really care, I doubt there's a
> need.

Ah but what about unquote (to string)? If it accepted bytes then it
would be a bytes->str operation, and then you need a policy on DEcoding
those bytes. It makes things too complex I think.

OK.

> I believe patch 9 still has errors defaulting to strict for quote().
> Weren't you going to change that?

I raised it as a concern, but I thought you overruled on that, so I left
it as errors='strict'. What do you want it to be? 'replace'? Now that
this issue has been fully discussed, I'm happy with whatever you decide.

I'm OK with replace for unquote(), your point that bogus data is
better than an exception is well taken, especially since there are
calls that the app can't control (like in cgi.py).

For quote() I think strict is better -- it can't fail anyway with
UTF8, and if an app passes an explicit conversion it'd be pretty
stupid to pass a string that can't be converted with that encoding
(since it's presumably the app that generates both the string and the
encoding) so it's better to complain there, just like if they made the
encode() call themselves with only an encoding specified. This means
we have a useful analogy: quote(s, e) == quote(s.encode(e)).

> From looking at it briefly I
> worry that the implementation is pretty slow -- a method call for each
> character and a map() call sounds pretty bad.

Yes, it does sound pretty bad. However, that's the current way of doing
things in both 2.x and 3.x; I didn't change it (though it looks like I
changed a LOT, I really did try to change as little as possible!)

Actually, while the Quoter class (the immediat subject of my scorn)
was there before your patch in 3.0, it isn't there in 2.x; somebody
must have added it in 3.0 as part of the conversion to Unicode or
perhaps as part of the restructuring of urllib.py. The 2.x code maps
the __getitem__ of a dict over the string, which is much faster. I
think we can do much better than mapping a method call.

Assuming it wasn't made _slower_ than before, can we ignore existing
performance issues and treat them as a separate matter (and can be dealt
with after 3.0)?

Now that you've spent so much time with this patch, can't you think
of a faster way of doing this? I wonder if mapping a defaultdict
wouldn't work.

I'm not putting up a new patch now. The only fix I'd make is to add
Antoine's "or 'ascii'" to email/utils.py, as suggested on the review
tracker. I'll make this change along with any other recommendations
after your review.

(That is Lib/email/utils.py line 222 becomes:
s = urllib.parse.quote(s, safe='', encoding=charset or 'ascii')
)

btw this Rietveld is amazing. I'm assuming I don't have permission to
upload patches there (can't find any button to do so) which is why I
keep posting them here and letting you upload to Rietveld ...

Thanks! You can't upload patches to the issue that *I* created, but a
better way would be to create a new issue and assign it to me for
review. That will work as long as you have a gmail account or a Google
Account. I highly recommend using the upload.py script, which you can
download from codereview.appspot.com/static/upload.py. (There's also a
link to it on the Create Issue page, at the bottom.)

I am hoping that in general we will be able to use Rietveld to review
patches instead of the bug tracker.

mgiuca · 2008-08-13T15:09:00Z

I'm OK with replace for unquote() ...
For quote() I think strict is better

There's just an odd inconsistency there, but it's only a tiny "gotcha";
and I agree with all your other arguments. I'll change unquote back to
errors='replace'.

This means we have a useful analogy:
quote(s, e) == quote(s.encode(e)).

That's exactly true, yes.

Now that you've spent so much time with this patch, can't you think
of a faster way of doing this?

Well firstly, you could replace Quoter (the class) with a "quoter"
function, which is nested inside quote. Would calling a nested function
be faster than a method call?

I wonder if mapping a defaultdict wouldn't work.

That is a good idea. Then, the "function" (as I describe above) would be
just the inside of what currently is the except block, and that would be
the default_factory of the defaultdict. I think that should speed things up.

I'm very hazy about what is faster in the bytecode world of Python, and
wary of making a change and proclaiming "this is faster!" without doing
proper speed tests (which is why I think this optimisation could be
delayed until at least after the core interface changes are made). But
I'll have a go at that change tomorrow.

(I won't be able to work on this for up to 24 hours).

pitrou · 2008-08-13T16:02:06Z

Selon Matt Giuca <report@bugs.python.org>:

> Now that you've spent so much time with this patch, can't you think
> of a faster way of doing this?

Well firstly, you could replace Quoter (the class) with a "quoter"
function, which is nested inside quote. Would calling a nested function
be faster than a method call?

The obvious speedup is to remove the map() call and do the loop inside
Quoter.__call__ instead. That way you don't have any function or method call in
the critical path.

(also, defining a class with a single __call__ method is not a common Python
idiom; usually you'd just have a function returning another (nested) function)

As for the defaultdict, here is how it can look like (this is on 2.5):

...  def __missing__(self, key):
...   print "__missing__", key
...   value = "%%%02X" % key
...   self[key] = value
...   return value
...
>>> d = D()
>>> d[66] = 'B'
>>> d[66]
'B'
>>> d[67]
__missing__ 67
'%43'
>>> d[67]
'%43'

pitrou · 2008-08-13T16:03:27Z

Selon Antoine Pitrou <report@bugs.python.org>:

As for the defaultdict, here is how it can look like (this is on 2.5):

(there should be a line here saying "class D(defaultdict)" :-))

... def __missing__(self, key):
... print "__missing__", key
... value = "%%%02X" % key
... self[key] = value
... return value

cheers

Antoine.

gvanrossum · 2008-08-13T16:41:41Z

> Now that you've spent so much time with this patch, can't you think
> of a faster way of doing this?

Well firstly, you could replace Quoter (the class) with a "quoter"
function, which is nested inside quote. Would calling a nested function
be faster than a method call?

Yes, but barely.

> I wonder if mapping a defaultdict wouldn't work.

That is a good idea. Then, the "function" (as I describe above) would be
just the inside of what currently is the except block, and that would be
the default_factory of the defaultdict. I think that should speed things up.

Yes, it would be tremendously faster, since the method would be called
only once per byte value (for each value of 'safe'), and if that byte
is repeated in the input, further occurrences will use the __getitem__
function of the defaultdict, which is implemented in C.

I'm very hazy about what is faster in the bytecode world of Python, and
wary of making a change and proclaiming "this is faster!" without doing
proper speed tests (which is why I think this optimisation could be
delayed until at least after the core interface changes are made).

That's very wise. But a first-order approximation of the speed of
something is often "how many functions/methods implemented in Python
(i.e. with def or lambda) does it call?"

But I'll have a go at that change tomorrow.

(I won't be able to work on this for up to 24 hours).

That's fine, as long as we have closure before beta3, which is next Wednesday.

janssen · 2008-08-13T16:56:12Z

Feel free to take the function implementation from my patch, if it speeds
things up (and it should).

Bill

On Wed, Aug 13, 2008 at 9:41 AM, Guido van Rossum <report@bugs.python.org>wrote:

Guido van Rossum <guido@python.org> added the comment:

>> Now that you've spent so much time with this patch, can't you think
>> of a faster way of doing this?
>
> Well firstly, you could replace Quoter (the class) with a "quoter"
> function, which is nested inside quote. Would calling a nested function
> be faster than a method call?

Yes, but barely.

>> I wonder if mapping a defaultdict wouldn't work.
>
> That is a good idea. Then, the "function" (as I describe above) would be
> just the inside of what currently is the except block, and that would be
> the default_factory of the defaultdict. I think that should speed things
up.

Yes, it would be tremendously faster, since the method would be called
only once per byte value (for each value of 'safe'), and if that byte
is repeated in the input, further occurrences will use the __getitem__
function of the defaultdict, which is implemented in C.

> I'm very hazy about what is faster in the bytecode world of Python, and
> wary of making a change and proclaiming "this is faster!" without doing
> proper speed tests (which is why I think this optimisation could be
> delayed until at least after the core interface changes are made).

That's very wise. But a first-order approximation of the speed of
something is often "how many functions/methods implemented in Python
(i.e. with def or lambda) does it call?"

> But I'll have a go at that change tomorrow.
>
> (I won't be able to work on this for up to 24 hours).

That's fine, as long as we have closure before beta3, which is next
Wednesday.

Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue3300\>

janssen · 2008-08-13T17:05:19Z

Erik van der Poel at Google has now chimed in with stats on current URL
usage:

``...the bottom line is that escaped non-utf-8 is still quite prevalent,
enough (in my opinion) to require an implementation in Python, possibly
even allowing for different encodings in the path and query parts (e.g.
utf-8 path and gb2312 query).''

http://lists.w3.org/Archives/Public/www-international/2008JulSep/0042.html

I think it's worth remembering that a very large proportion of the use
of Python's urllib.unquote() is in implementations of Web server
frameworks of one sort or another. We can't control what the browsers
that talk to such frameworks produce; the IETF doesn't control that,
either. In this case, "practicality beats purity" is the clarion call
of the browser designers, and we'd better be able to support them.

gvanrossum · 2008-08-13T17:17:20Z

Bill Janssen bill.janssen@gmail.com added the comment:

Erik van der Poel at Google has now chimed in with stats on current URL
usage:

``...the bottom line is that escaped non-utf-8 is still quite prevalent,
enough (in my opinion) to require an implementation in Python, possibly
even allowing for different encodings in the path and query parts (e.g.
utf-8 path and gb2312 query).''

http://lists.w3.org/Archives/Public/www-international/2008JulSep/0042.html

I think it's worth remembering that a very large proportion of the use
of Python's urllib.unquote() is in implementations of Web server
frameworks of one sort or another. We can't control what the browsers
that talk to such frameworks produce; the IETF doesn't control that,
either. In this case, "practicality beats purity" is the clarion call
of the browser designers, and we'd better be able to support them.

I think we're supporting these sufficiently by allowing developers to
override the encoding and errors value. I see no argument here against
having a default encoding of UTF-8.

pitrou · 2008-08-13T17:51:34Z

Le mercredi 13 août 2008 à 17:05 +0000, Bill Janssen a écrit :

I think it's worth remembering that a very large proportion of the use
of Python's urllib.unquote() is in implementations of Web server
frameworks of one sort or another. We can't control what the browsers
that talk to such frameworks produce;

Yes, we do. Browsers will use whatever charset is specified in the HTML
for the query part; and, as for the path part, they should't produce it
themselves, they just follow a link which should already be
percent-quoted in the HTML.

(URL rewriting at the HTTP server level can make this more complicated,
since it can turn a query fragment into a path fragment or vice-versa;
however, most modern frameworks alleviate the need for such rewriting,
since they allow to specify flexible mapping rules at the framework
level)

The situation in which we can't control the encoding is when getting the
URLs from third-part content (e.g. some Web page which we didn't produce
ourselves, or some link in an e-email). But in those cases there's less
use cases for unquoting the URL rather than use it as-is. The only time
I've wanted to unquote such an URL was to do some processing of HTTP
referrers in order to extract which search queries had led people to
visit a Web site.

janssen · 2008-08-13T19:49:52Z

On Wed, Aug 13, 2008 at 10:51 AM, Antoine Pitrou <report@bugs.python.org>wrote:

Antoine Pitrou <pitrou@free.fr> added the comment:

Le mercredi 13 août 2008 à 17:05 +0000, Bill Janssen a écrit :
> I think it's worth remembering that a very large proportion of the use
> of Python's urllib.unquote() is in implementations of Web server
> frameworks of one sort or another. We can't control what the browsers
> that talk to such frameworks produce;

Yes, we do. Browsers will use whatever charset is specified in the HTML
for the query part; and, as for the path part, they should't produce it
themselves, they just follow a link which should already be
percent-quoted in the HTML.

Sure. What I meant was that we don't control what the browsers do, we just
go along with what they do, that is, we try to play with the default
understanding that's developed between the "consenting pairs" of
Apache/Firefox or ASP/IE.

mgiuca · 2008-08-14T11:35:13Z

Ah cheers Antoine, for the tip on using defaultdict (I was confused as
to how I could access the key just by passing defaultfactory, as the
manual suggests).

mgiuca · 2008-08-14T12:18:48Z

OK I implemented the defaultdict solution. I got curious so ran some
rough speed tests, using the following code.

import random, urllib.parse
for i in range(0, 100000):
    str = ''.join(chr(random.randint(0, 0x10ffff)) for _ in range(50))
    quoted = urllib.parse.quote(str)

Time to quote 100,000 random strings of 50 characters.
(Ran each test twice, worst case printed)

HEAD, chars in range(0,0x110000): 1m44.80
HEAD, chars in range(0,256): 25.0s
patch9, chars in range(0,0x110000): 35.3s
patch9, chars in range(0,256): 27.4s
New, chars in range(0,0x110000): 31.4s
New, chars in range(0,256): 25.3s

Head is the current Py3k head. Patch 9 is my previous patch (before
implementing defaultdict), and New is after implementing defaultdict.

Interesting. Defaultdict didn't really make much of an improvement. You
can see the big help the cache itself makes, though (my code caches all
chars, whereas the HEAD just caches ASCII chars, which is why HEAD is so
slow on the full repertoire test). Other than that, differences are
fairly negligible.

However, I'll keep the defaultdict code, I quite like it, speedy or not
(it is slightly faster).

pitrou · 2008-08-14T13:07:23Z

Hello Matt,

OK I implemented the defaultdict solution. I got curious so ran some
rough speed tests, using the following code.

import random, urllib.parse
for i in range(0, 100000):
str = ''.join(chr(random.randint(0, 0x10ffff)) for _ in range(50))
quoted = urllib.parse.quote(str)

I think if you move the line defining "str" out of the loop, relative timings
should change quite a bit. Chances are that the random functions are not very
fast, since they are written in pure Python.
Or you can create an inner loop around the call to quote(), for example to
repeat it 100 times.

cheers

Antoine.

mgiuca · 2008-08-14T15:27:24Z

New patch (patch10). Details on Rietveld review tracker
(http://codereview.appspot.com/2827).

Another update on the remaining "outstanding issues":

Resolved issues since last time:

Should unquote accept a bytes/bytearray as well as a str?
No. But see below.

Lib/email/utils.py:
Should encode_rfc2231 with charset=None accept strings with non-ASCII
characters, and just encode them to UTF-8?
Implemented Antoine's fix ("or 'ascii'").

Should quote accept safe characters outside the
ASCII range (thereby potentially producing invalid URIs)?
No.

New issues:

unquote_to_bytes doesn't cope well with non-ASCII characters (currently
encodes as UTF-8 - not a lot we can do since this is a str->bytes
operation). However, we can allow it to accept a bytes as input (while
unquote does not), and it preserves the bytes precisely.
Discussion at http://codereview.appspot.com/2827/diff/82/84, line 265.

I have *implemented* that suggestion - so unquote_to_bytes now accepts
either a bytes or str, while unquote accepts only a str. No changes need
to be made unless there is disagreement on that decision.

I also emailed Barry Warsaw about the email/utils.py patch (because we
weren't sure exactly what that code was doing). However, I'm sure that
this patch isn't breaking anything there, because I call unquote with
encoding="latin-1", which is the same behaviour as the current head.

That's all the issues I have left over in this patch.

Attaching patch 10 (for revision 65675).

Commit log for patch 10:

Fix for bpo-3300.

urllib.parse.unquote:
Added "encoding" and "errors" optional arguments, allowing the caller
to determine the decoding of percent-encoded octets.
As per RFC 3986, default is "utf-8" (previously implicitly decoded
as ISO-8859-1).
Fixed a bug in which mixed-case hex digits (such as "%aF") weren't
being decoded at all.

urllib.parse.quote:
Added "encoding" and "errors" optional arguments, allowing the
caller to determine the encoding of non-ASCII characters
before being percent-encoded.
Default is "utf-8" (previously characters in range(128, 256)
were encoded as ISO-8859-1, and characters above that as UTF-8).
Characters/bytes above 128 are no longer allowed to be "safe".
Now allows either bytes or strings.
Optimised "Quoter"; now inherits defaultdict.

Added functions urllib.parse.quote_from_bytes,
urllib.parse.unquote_to_bytes.
All quote/unquote functions now exported from the module.

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface, added quote_from_bytes and unquote_to_bytes.

Lib/test/test_urllib.py: Added many new test cases testing encoding
and decoding Unicode strings with various encodings, as well as testing
the new functions.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the email
module is dependent upon).

mgiuca · 2008-08-14T15:31:40Z

Antoine:

I think if you move the line defining "str" out of the loop, relative
timings should change quite a bit. Chances are that the random
functions are not very fast, since they are written in pure Python.

Well I wanted to test throwing lots of different URIs to test the
caching behaviour. You're right though, probably a small % of the time
is spent on calling quote.

Oh well, the defaultdict implementation is in patch10 anyway :) It
cleans Quoter up somewhat, so it's a good thing anyway. Thanks for your
help.

mgiuca · 2008-08-18T14:27:49Z

Hi,

Sorry to bump this, but you (Guido) said you wanted this closed by
Wednesday. Is this patch committable yet? (There are no more unresolved
issues that I am aware of).

gvanrossum · 2008-08-18T18:32:38Z

Looking into this now. Will make sure it's included in beta3.

gvanrossum · 2008-08-18T21:45:16Z

Checked in patch 10 with minor style changes as r65838.

Thanks Matt for persevering! Thanks everyone else for contributing;
this has been quite educational.

pitrou · 2008-08-20T09:56:19Z

There's an unquote()-related failure in bpo-3613.

mgiuca · 2008-08-20T10:03:35Z

Thanks for pointing that out, Antoine. I just commented on that bug.

mgiuca mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jul 6, 2008

gvanrossum added the release-blocker label Aug 18, 2008

gvanrossum self-assigned this Aug 18, 2008

gvanrossum closed this as completed Aug 18, 2008

ezio-melotti transferred this issue from another repository Apr 10, 2022

urllib.quote and unquote - Unicode issues #47550

urllib.quote and unquote - Unicode issues #47550

Comments

mgiuca mannequin commented Jul 6, 2008

mgiuca mannequin commented Jul 6, 2008

loewis mannequin commented Jul 6, 2008

mgiuca mannequin commented Jul 7, 2008

thomaspinckney3 mannequin commented Jul 9, 2008

mgiuca mannequin commented Jul 9, 2008

loewis mannequin commented Jul 9, 2008

mgiuca mannequin commented Jul 10, 2008

mgiuca mannequin commented Jul 10, 2008

loewis mannequin commented Jul 10, 2008

mgiuca mannequin commented Jul 11, 2008

mgiuca mannequin commented Jul 11, 2008

mgiuca mannequin commented Jul 12, 2008

mgiuca mannequin commented Jul 12, 2008

mgiuca mannequin commented Jul 31, 2008

mgiuca mannequin commented Jul 31, 2008

janssen mannequin commented Aug 6, 2008

janssen mannequin commented Aug 6, 2008

jimjjewett mannequin commented Aug 6, 2008

pitrou commented Aug 6, 2008

gvanrossum commented Aug 6, 2008

jimjjewett mannequin commented Aug 6, 2008

gvanrossum commented Aug 6, 2008

jimjjewett mannequin commented Aug 6, 2008

mgiuca mannequin commented Aug 7, 2008

mgiuca mannequin commented Aug 7, 2008

pitrou commented Aug 7, 2008

mgiuca mannequin commented Aug 7, 2008

mgiuca mannequin commented Aug 7, 2008

janssen mannequin commented Aug 12, 2008

janssen mannequin commented Aug 12, 2008

janssen mannequin commented Aug 12, 2008

pitrou commented Aug 12, 2008

gvanrossum commented Aug 12, 2008

mgiuca mannequin commented Aug 13, 2008

gvanrossum commented Aug 13, 2008

mgiuca mannequin commented Aug 13, 2008

pitrou commented Aug 13, 2008

pitrou commented Aug 13, 2008

gvanrossum commented Aug 13, 2008

janssen mannequin commented Aug 13, 2008

janssen mannequin commented Aug 13, 2008

gvanrossum commented Aug 13, 2008

pitrou commented Aug 13, 2008

janssen mannequin commented Aug 13, 2008

mgiuca mannequin commented Aug 14, 2008

mgiuca mannequin commented Aug 14, 2008

pitrou commented Aug 14, 2008

mgiuca mannequin commented Aug 14, 2008

mgiuca mannequin commented Aug 14, 2008

mgiuca mannequin commented Aug 18, 2008

gvanrossum commented Aug 18, 2008

gvanrossum commented Aug 18, 2008

pitrou commented Aug 20, 2008

mgiuca mannequin commented Aug 20, 2008