This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: urllib.quote and unquote - Unicode issues
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.0
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: gvanrossum Nosy List: gvanrossum, janssen, jimjjewett, loewis, mgiuca, orsenthil, pitrou, thomaspinckney3
Priority: release blocker Keywords: patch

Created on 2008-07-06 14:52 by mgiuca, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
parse.py.patch mgiuca, 2008-07-06 14:52 (obsolete) Patch fixing all three issues; commit log in comment
parse.py.patch2 mgiuca, 2008-07-09 15:51 (obsolete) Second patch (supersedes parse.py.patch); commit log in comment
parse.py.patch3 mgiuca, 2008-07-10 01:54 (obsolete) Third patch (supersedes parse.py.patch2); commit log in comment
parse.py.patch4 mgiuca, 2008-07-12 11:32 (obsolete) Fourth patch (supersedes parse.py.patch3); commit log in comment
parse.py.patch5 mgiuca, 2008-07-12 17:05 (obsolete) Fifth patch (supersedes parse.py.patch4); commit log in comment
parse.py.patch6 mgiuca, 2008-07-31 11:27 (obsolete) Sixth patch (supersedes parse.py.patch5); commit log in comment
parse.py.patch7 mgiuca, 2008-07-31 14:49 (obsolete) Seventh patch (supersedes parse.py.patch6); commit log in comment
parse.py.patch8 mgiuca, 2008-08-07 14:59 (obsolete) Eighth patch (supersedes parse.py.patch7); commit log in comment
parse.py.metapatch8 mgiuca, 2008-08-07 15:02 Diff between patch7 and patch8 (result of review).
patch janssen, 2008-08-08 21:28
parse.py.patch8+allsafe mgiuca, 2008-08-10 07:05 Patch8, and quote allows all characters in 'safe'
parse.py.patch9 mgiuca, 2008-08-10 07:45 Ninth patch (supersedes parse.py.patch8); commit log in comment for patch 8
parse.py.patch10 mgiuca, 2008-08-14 15:27 Tenth patch (supersedes parse.py.patch9); commit log in comment
Messages (80)
msg69333 - (view) Author: Matt Giuca (mgiuca) Date: 2008-07-06 14:52
Three Unicode-related problems with urllib.parse.quote and
urllib.parse.unquote in Python 3.0. (Patch attached).

Firstly, unquote appears not to have been modified from Python 2, where
it is designed to output a byte string. In Python 3, it outputs a
unicode string, implicitly decoded as ISO-8859-1 (the code points are
the same as the bytes). RFC 3986 states that the percent-encoded byte
values should be decoded as UTF-8.

http://tools.ietf.org/html/rfc3986 section 2.5.

Current behaviour:
>>> urllib.parse.unquote("%CE%A3")
'Î£'
(or '\u00ce\u00a3')

Desired behaviour:
>>> urllib.parse.unquote("%CE%A3")
'Σ'
(or '\u03a3')
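
For reference, a minimal standalone sketch of the intended decoding
(illustrative only; the real change is in Lib/urllib/parse.py and
handles more edge cases):

def unquote_sketch(s):
    # Collect percent-encoded octets into a bytes object first, then
    # decode them as UTF-8 (not ISO-8859-1); invalid sequences are
    # replaced with '\ufffd', as described below.
    out = bytearray()
    i = 0
    while i < len(s):
        if s[i] == '%' and i + 2 < len(s):
            out.append(int(s[i+1:i+3], 16))
            i += 3
        else:
            out.extend(s[i].encode('utf-8'))
            i += 1
    return out.decode('utf-8', 'replace')

>>> unquote_sketch('%CE%A3')
'Σ'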

Secondly, while quote *has* been modified to encode to UTF-8 before
percent-encoding, it does not work correctly for characters in
range(128, 256), due to a special case in the code which again treats
the code point values as byte values.

Current behaviour:
>>> urllib.parse.quote('\u00e9')
'%E9'

Desired behaviour:
>>> urllib.parse.quote('\u00e9')
'%C3%A9'

Note that currently, quoting characters with code points below 256
uses ISO-8859-1, while quoting characters 256 or higher uses UTF-8!
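
A corresponding sketch of the intended quote behaviour (again purely
illustrative; the real patch also keeps the existing cache):

def quote_sketch(s, safe='/'):
    # Encode the whole string to UTF-8 first, then percent-encode
    # every byte not in the safe set.
    always_safe = ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                   'abcdefghijklmnopqrstuvwxyz'
                   '0123456789_.-~')
    safe_bytes = (always_safe + safe).encode('ascii')
    return ''.join(chr(b) if b in safe_bytes else '%%%02X' % b
                   for b in s.encode('utf-8'))

>>> quote_sketch('\u00e9')
'%C3%A9'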

Thirdly, the "safe" argument to quote does not work for characters above
256, since these are excluded from the special case. I thought I would
fix this at the same time, but it's really a separate issue.

Current behaviour:
>>> urllib.parse.quote('Σϰ', safe='Σ')
'%CE%A3%CF%B0'

Desired behaviour:
>>> urllib.parse.quote('Σϰ', safe='Σ')
'Σ%CF%B0'

A patch which fixes all three issues is attached. Note that unquote now
needs to handle the case where the UTF-8 sequence is invalid. This is
currently handled by "replace" (invalid sequences are replaced by
'\ufffd'). I would like to add an optional "errors" argument to unquote,
defaulting to "replace", to allow the user to override this behaviour,
but I didn't put that in because it would change the interface.
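
For instance, with the patch applied, a lone UTF-8 lead byte is
replaced rather than raising an error:

>>> urllib.parse.unquote('%CE')
'\ufffd'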

Note I also changed one of the test cases, which had the wrong expected
output. (String literal was manually UTF-8 encoded, designed for Python
2; nonsensical when viewed as a Python 3 Unicode string).

All urllib test cases pass.

Patch is for branch /branches/py3k, revision 64752.

Note: The above unquote issue also manifests itself in Python 2 for
Unicode strings, but it's hazy as to what the behaviour should be, and
would break existing programs, so I'm just patching the Py3k branch.

Commit log:

urllib.parse.unquote: Fixed percent-encoded octets being implicitly
decoded as ISO-8859-1; now decode as UTF-8, as per RFC 3986.

urllib.parse.quote: Fixed characters in range(128, 256) being implicitly
encoded as ISO-8859-1; now encode as UTF-8. Also fixed characters
greater than 256 not responding to "safe", and also not being cached.

Lib/test/test_urllib.py: Updated one test case for unquote which
expected the wrong output. The new version of unquote passes the new
test case.
msg69339 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-07-06 16:54
> RFC 3986 states that the percent-encoded byte
> values should be decoded as UTF-8.

Where precisely do you read such a SHOULD requirement?
Section 2.5 elaborates that the local encoding (of the
resource) is typically used, ignoring cases where URIs
are constructed on the client system (such a scenario is
simply ignored in the RFC).

The last paragraph in section 2.5 is the only place that
seems to imply a SHOULD requirement (although it doesn't
use the keyword); this paragraph only talks about new URI
schemes. Unfortunately, for http, the encoding of
characters is unspecified (this is somewhat solved by the
introduction of IRIs).
msg69366 - (view) Author: Matt Giuca (mgiuca) Date: 2008-07-07 01:45
Point taken. But the RFC certainly doesn't say that ISO-8859-1 should be
used. Since we're outputting a Unicode string in Python 3, we need to
decode with some encoding, and UTF-8 seems the most sensible and
standardised.
(Even the existing test case in test_urllib.py:466 uses a UTF-8-encoded
URL, and I had to fix it so it decodes into a meaningful string).

Having said that, it's possible that you may wish to use another
encoding, and legal to do so. Therefore, I'd suggest we add an
"encoding" argument to both quote and unquote, which defaults to "utf-8".

Note that in the current implementation, unquote is not an inverse of
quote, because quote uses UTF-8 to encode characters with code points >=
256, while unquote decodes them as ISO-8859-1. I think it's important
these two functions are inverses of each other.
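
With an explicit encoding argument, the inverse property would hold by
construction (proposed interface, not current behaviour):

>>> urllib.parse.unquote(urllib.parse.quote('Σ', encoding='utf-8'),
...                      encoding='utf-8')
'Σ'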
msg69472 - (view) Author: Tom Pinckney (thomaspinckney3) * Date: 2008-07-09 15:20
I mentioned this in a brief python-dev discussion earlier this 
spring, but many popular websites such as Wikipedia and Facebook do use 
UTF-8 as their character encoding scheme for the path and argument 
portion of URLs.

I know there's no RFC that says this is what should be done, but in 
order to make urllib work out-of-the-box on as many common websites as 
possible, I think defaulting to UTF-8 decoding makes a lot of sense. 

Possibly allow an option charset argument to be passed into quote and 
unquote, but default to UTF-8 in the absence of an explicit character 
set being passed in?
msg69473 - (view) Author: Matt Giuca (mgiuca) Date: 2008-07-09 15:51
OK I've gone back over the patch and decided to add the "encoding" and
"errors" arguments from the str.encode/decode methods as optional
arguments to quote and unquote. This is a much bigger change than I
originally intended, but I think it makes things much better because
we'll get UTF-8 by default (which as far as I can tell is by far the
most common encoding).

(Tom Pinckney just made the same suggestion right as I'm typing this up!)

So my new patch is a bit more extensive, and changes the interface (in a
backwards-compatible way). Both quote and unquote now support "encoding"
and "errors" arguments, defaulting to "utf-8" and "replace", respectively.

Implementation detail: This changes the Quoter class a lot; it now
hashes four fields to ensure it doesn't use the wrong cache.

Also fixed an issue with the previous patch where non-ASCII-compatible
encodings broke for code points < 128.

I then ran the full test suite and discovered two other modules' test
cases broke. I've fixed them so the full suite passes, but I'm
suspicious there may be more issues (see below).

* Lib/test/test_http_cookiejar.py: A test case was written explicitly
expecting Latin-1 encoding. I've changed this test case to expect UTF-8.
* Lib/email/utils.py: I extensively analysed this code and discovered
that it kind of "cheats" - it uses the Latin-1 encoding and treats it as
octets, then applies its own encoding scheme. So to fix this, I changed
the email module to call quote and unquote with encoding="latin-1".
Hence it has the same behaviour as before.

Some potential issues:

* I have not updated the documentation yet. If this idea is to go ahead,
the docs will need to show these new optional arguments. (I'll do that
myself but haven't yet).
* While the full test suite passes, I'm sure there will be many more
issues since I've changed the interface. Therefore I don't recommend
this patch is accepted just yet. I plan to do an investigation into all
uses (within the standard lib) of quote and unquote to see if there are
any other compatibility issues, particularly within urllib. Hence I'll
respond to this again in a few days.
* The new patch allows non-ASCII characters in the "safe" argument of
quote. This correspondingly allows the construction of URIs
with non-ASCII characters. Is it better to allow users to do this if
they really want, or just mysteriously fail to let those characters through?

I would also like to have a separate pair of functions, unquote_raw and
quote_raw, which work on bytes objects instead of strings. (unquote_raw
would take a str and produce a bytes, while quote_raw would take a bytes
and produce a str). As URI encoding is fundamentally an octet encoding,
not a character encoding, this is the only way to do URI encoding
without choosing a Unicode character encoding. (I see some modules such
as "email" treating the implicit Latin-1 encoding as byte encoding,
which is a bit dodgy - they could benefit from raw functions). But as
that requires further changes to the interface, I'll save it for another
day.
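
To illustrate the shape I have in mind for the raw pair (not part of
this patch; over-quoting every byte is legal, if ugly):

def quote_raw(b):
    # bytes -> str: percent-encode every octet unconditionally.
    return ''.join('%%%02X' % c for c in b)

def unquote_raw(s):
    # str -> bytes: assumes a pure-ASCII URI string, per RFC 3986.
    out = bytearray()
    i = 0
    while i < len(s):
        if s[i] == '%':
            out.append(int(s[i+1:i+3], 16))
            i += 3
        else:
            out.append(ord(s[i]))
            i += 1
    return bytes(out)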

Patch (parse.py.patch2) is for branch /branches/py3k, revision 64820.

Commit log:

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets
(previously implicitly decoded as ISO-8859-1). As per RFC 3986, default
is "utf-8".

urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded (previously characters in range(128, 256)
were encoded as ISO-8859-1, and characters above that as UTF-8). Also
fixed characters greater than 256 not responding to "safe", and also not
being cached.

Lib/test/test_urllib.py, Lib/test/test_http_cookiejar.py: Updated test
cases which expected output in ISO-8859-1, now expects UTF-8.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).
msg69485 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-07-09 20:28
Assuming the patch is acceptable in the first place (which I personally
have not made my mind up), then it lacks documentation and test suite
changes.
msg69493 - (view) Author: Matt Giuca (mgiuca) Date: 2008-07-10 01:54
OK well here are the necessary changes to the documentation (RST docs
and docstrings in the code).

As I said above, I plan to do extensive testing and add new cases, and I
don't recommend this patch is accepted until that's done.

Patch (parse.py.patch3) is for branch /branches/py3k, revision 64834.

Commit log:

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets
(previously implicitly decoded as ISO-8859-1). As per RFC 3986, default
is "utf-8".

urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded (previously characters in range(128, 256)
were encoded as ISO-8859-1, and characters above that as UTF-8). Also
fixed characters greater than 256 not responding to "safe", and also not
being cached.

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface.

Lib/test/test_urllib.py, Lib/test/test_http_cookiejar.py: Updated test
cases which expected output in ISO-8859-1, now expects UTF-8.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).
msg69508 - (view) Author: Matt Giuca (mgiuca) Date: 2008-07-10 16:13
Setting Version back to Python 3.0. Is there a reason it was set to
Python 3.1? This proposal will certainly break a lot of code. It's *far*
better to do it in the big backwards-incompatible Python 3.0 release
than a later release.
msg69519 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-07-10 20:55
> Setting Version back to Python 3.0. Is there a reason it was set to
> Python 3.1? 

3.0b1 has been released, so no new features can be added to 3.0.
msg69535 - (view) Author: Matt Giuca (mgiuca) Date: 2008-07-11 07:03
> 3.0b1 has been released, so no new features can be added to 3.0.

While my proposal is no doubt going to cause a lot of code breakage, I
hardly consider it a "new feature". This is very definitely a bug. As I
understand it, the point of a code freeze is to stop the addition of
features which could be added to a later version. Realistically, there
is no way this issue can be fixed after 3.0 is released, as it
necessarily involves changing the behaviour of this function.

Perhaps I should explain further why this is a regression from Python
2.x and not a feature request. In Python 2.x, with byte strings, the
encoding is not an issue. quote and unquote simply encode bytes, and if
you want to use Unicode you have complete control. In Python 3.0, with
Unicode strings, if functions manipulate string objects, you don't have
control over the encoding unless the functions give you explicit
control. So Python 3.0's native Unicode strings have broken the library.

I give two examples.

Firstly, I believe that unquote(quote(x)) == x should hold for all
strings x. In Python 2.x, this is trivially true for non-Unicode
strings, because they simply encode and decode the octets. In Python
3.0, the two functions are inconsistent, and break for characters
outside range(0, 256).

>>> urllib.parse.unquote(urllib.parse.quote('ÿ')) # '\u00ff'
'ÿ'
# Works, because both functions work with ISO-8859-1 in this range.

>>> urllib.parse.unquote(urllib.parse.quote('Ā')) # '\u0100'
'Ä\x80'
# Fails, because quote uses UTF-8 and unquote uses ISO-8859-1.

My patch succeeds for all characters.
>>> urllib.parse.unquote(urllib.parse.quote('Ā')) # '\u0100'
'Ā'

Secondly, a bigger example, but I want to demonstrate how this bug
affects web applications, even very simple ones.

Consider this simple (beginnings of a) wiki system in Python 2.5, as a
CGI app:

#---
import cgi

fields = cgi.FieldStorage()
title = fields.getfirst('title')

print("Content-Type: text/html; charset=utf-8")
print("")

print('<p>Debug: %s</p>' % repr(title))
if title is None:
    print("No article selected")
else:
    print('<p>Information about %s.</p>' % cgi.escape(title))
#---

(Place this in cgi-bin, navigate to it, and add the query string
"?title=Page Title"). I'll use the page titled "Mátt" as a test case.

If you navigate to "?title=Mátt", it displays the text "Debug:
'M\xc3\xa1tt'. Information about Mátt.". The browser (at least Firefox,
Safari and IE I have tested) encodes this as "?title=M%C3%A1tt". So this
is trivial, as it's just being unquoted into a raw byte string
'M\xc3\xa1tt', then written out again as a byte string.

Now consider that you want to manipulate it as a Unicode string, still
in Python 2.5. You could augment the program to decode it as UTF-8 and
then re-encode it. (I wrote a simple UTF-8 printing function which takes
Unicode strings as input).

#---
import sys
import cgi

def printu8(*args):
    """Prints to stdout encoding as utf-8, rather than the current terminal
    encoding. (Not a fully-featured print function)."""
    sys.stdout.write(' '.join([x.encode('utf-8') for x in args]))
    sys.stdout.write('\n')

fields = cgi.FieldStorage()
title = fields.getfirst('title')
if title is not None:
    title = str(title).decode("utf-8", "replace")

print("Content-Type: text/html; charset=utf-8")
print("")

print('<p>Debug: %s.</p>' % repr(title))
if title is None:
    print("No article selected.")
else:
    printu8('<p>Information about %s.</p>' % cgi.escape(title))
#---

Now given the same input ("?title=Mátt"), it displays "Debug:
u'M\xe1tt'. Information about Mátt." Still working fine, and I can
manipulate it as Unicode because in Python 2.x I have direct control
over encoding/decoding.

Now let us upgrade this program to Python 3.0. (Note that I still can't
print Unicode characters directly out, because running through Apache
the stdout encoding is not UTF-8, so I use my printu8 function).

#---
import sys
import cgi

def printu8(*args):
    """Prints to stdout encoding as utf-8, rather than the current terminal
    encoding. (Not a fully-featured print function)."""
    sys.stdout.buffer.write(b' '.join([x.encode('utf-8') for x in args]))
    sys.stdout.buffer.write(b'\n')

fields = cgi.FieldStorage()
title = fields.getfirst('title')
# Note: No call to decode. I have no opportunity to specify the
# encoding since it comes straight out of FieldStorage as a Unicode
# string.

print("Content-Type: text/html; charset=utf-8")
print("")

print('<p>Debug: %s.</p>' % ascii(title))
if title is None:
    print("No article selected.")
else:
    printu8('<p>Information about %s.</p>' % cgi.escape(title))
#---

Now given the same input ("?title=Mátt"), it displays "Debug:
'M\xc3\xa1tt'. Information about Mátt." Once again, it is erroneously
(and implicitly) decoded as ISO-8859-1, so I end up with a meaningless
Unicode string. The only possible thing I can do about this as a web
developer is call title.encode('latin-1').decode('utf-8') - a dreadful hack.

With my patch applied, the input ("?title=Mátt") produces the output
"Debug: 'M\xe1tt'. Information about Mátt."

Basically, this bug is going to affect all web developers as soon as
someone types a non-ASCII character. You could argue that supporting
UTF-8 by default is no better than supporting Latin-1 by default, but it
is. UTF-8 supports encoding of all characters where Latin-1 does not,
UTF-8 is the recommended URI encoding by both the URI Syntax RFC[1] and
the W3C HTML 4.01 specification[2], and all major browsers use it to
encode non-ASCII characters in URIs.

My patch may not be the best, or most conservative, solution to this
problem. I'm happy to see other proposals. But it's clearly an important
bug to fix, if I can't even write the simplest web app I can think of
without having to use a kludgey hack to get the string decoded
correctly. What is the point of having nice clean Unicode strings in the
language if the library spits out the wrong characters and it requires
more work to fix them than it used to with byte strings?

[1] http://tools.ietf.org/html/rfc3986#section-2.5
[2] http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1
msg69537 - (view) Author: Matt Giuca (mgiuca) Date: 2008-07-11 07:18
Since I got a complaint that my last reply was too long, I'll summarize it.

It's a bug report, not a feature request.

I can't get a simple web app to be properly Unicode-aware in Python 3,
which worked fine in Python 2. This cannot be put off until 3.1, as any
viable solution will break existing code.
msg69583 - (view) Author: Matt Giuca (mgiuca) Date: 2008-07-12 11:32
OK I spent a while writing test cases for quote and unquote, encoding and
decoding various Unicode strings with different encodings. As a result,
I found a bunch of issues in my previous patch, so I've rewritten the
patches to both quote and unquote. They're both actually more similar to
the original version now.

I'd be interested in hearing if anyone disagrees with my expected output
for these test cases.

I'm now confident I have good test coverage directly on the quote and
unquote functions. However, I haven't tested the other library functions
which depend upon them (though the entire test suite passes). Though as
I showed in that big post I made yesterday, other modules such as cgi
seem to be working fine (their behaviour has changed; they use UTF-8
now; but that's the whole point of this patch).

I still haven't figured out what the behaviour of "safe" should be in
quote. Should it only allow ASCII characters (thereby limiting the
output to an ASCII string, as specified by RFC 3986)? Should it also
allow Latin-1 characters, or all Unicode characters as well (perhaps
allowing you to create IRIs -- admittedly I don't know much about IRIs).
The new implementation of quote makes it rather difficult to allow
non-Latin-1 characters to be made "safe", as it encodes the string into
bytes before any processing.

Patch (parse.py.patch4) is for branch /branches/py3k, revision 64891.

Commit log:

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets.
As per RFC 3986, default is "utf-8" (previously implicitly decoded as
ISO-8859-1).

urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded. Default is "utf-8" (previously characters
in range(128, 256) were encoded as ISO-8859-1, and characters above that
as UTF-8). Also characters above 128 are no longer allowed to be "safe".

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface.

Lib/test/test_urllib.py: Added several new test cases testing encoding
and decoding Unicode strings with various encodings. This includes
updating one test case to now expect UTF-8 by default.

Lib/test/test_http_cookiejar.py: Updated test case which expected output
in ISO-8859-1, now expects UTF-8.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).
msg69591 - (view) Author: Matt Giuca (mgiuca) Date: 2008-07-12 17:05
So today I grepped for "urllib" in the entire library in an effort to
track down every dependency on quote and unquote to see exactly how my
patch breaks other code. I've now investigated every module in the
library which uses quote, unquote or urlencode, and my findings are
documented below in detail.

So far I have found no code "breakage" except for the original
email.utils issue I fixed in patch 2. Of course that doesn't mean the
behaviour hasn't changed. Nearly all modules in the report below have
changed their behaviour so they used to deal with Latin-1-encoded URLs
and now deal with UTF-8-encoded URLs. As discussed at length above, I
see this as a positive change, since nearly everybody encodes URLs in
UTF-8, and of course it allows for all characters.

I also point out that the http.server module (unpatched) is internally
broken when dealing with filenames with characters outside range(0,256);
my patch fixes it.

I'm attaching patch 5, which adds a bunch of new test cases to various
modules which demonstrate those modules correctly handling UTF-8-encoded
URLs. It also fixes a bug in email.utils which I introduced in patch 2.

Note that I haven't yet fully investigated urllib.request.

Aside from that, the only remaining matter is whether or not it's better
to encode URLs as UTF-8 or Latin-1 by default, and I'm pretty sure that
question doesn't need debate.

So basically I think if there's support for it, this patch is just about
ready to be accepted. I'm hoping it can be included in the 3.0b2 release
next week.

I'd be glad to hear any feedback about this proposal.

Not Yet Investigated
--------------------

./urllib/request.py
    By far the biggest user of quote and unquote.
    username, password, hostname and paths are now all converted
    to/from UTF-8 percent-encodings.
    Other concerns are:
        * Data in the form application/x-www-form-urlencoded
        * FTP access
    I think this needs to be tested further.

Looks fine, not tested
----------------------

./xmlrpc/client.py
    Just used to decode URI auth string (user:pass). This will change
    to UTF-8, but is probably OK.
./logging/handlers.py
    Just uses it in the HTTP handler to encode a dictionary. Probably
    preferable to use UTF-8 to encode an arbitrary string.
./macurl2path.py
    Calls to urllib look broken. Not tested.

Tested manually, fine
---------------------

./wsgiref/simple_server.py
    Just used to set PATH_INFO, fine if URLs are UTF-8 encoded.
./http/server.py
    All uses are for translating between actual file-system paths to
    URLs. This works fine for UTF-8 URLs. Note that since it uses
    quote to create URLs in a dir listing, and unquote to handle
    them, it breaks when unquote is not the inverse of quote.

    Consider the following simple script:

    import http.server
    s = http.server.HTTPServer(('',8000),
            http.server.SimpleHTTPRequestHandler)
    s.serve_forever()

    This will "kind of" work in the unpatched version, using
    Latin-1 URLs, but filenames with characters above 256 will
    break (give a 404 error).
    The patch fixes this.
./urllib/robotparser.py
    No test cases. Manually tested, URLs properly match when
    percent-encoded in UTF-8.
./nturl2path.py
    No test cases available. Manually tested, fine if URLs are
    UTF-8 encoded.

Test cases either exist or added, fine
--------------------------------------

./test/test_urllib.py
    I wrote a large wad of test cases for all the new functionality.
./wsgiref/util.py
    Added test cases expecting UTF-8.
./http/cookiejar.py
    I changed a test case to expect UTF-8.
./email/utils.py
    I changed this file to behave as it used to, to satisfy its
    existing test cases.
./cgi.py
    Added test cases for UTF-8-encoded query strings.

Commit log:

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets.
As per RFC 3986, default is "utf-8" (previously implicitly decoded as
ISO-8859-1).

urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded. Default is "utf-8" (previously characters
in range(128, 256) were encoded as ISO-8859-1, and characters above that
as UTF-8). Also characters above 128 are no longer allowed to be "safe".

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface.

Lib/test/test_urllib.py: Added several new test cases testing encoding
and decoding Unicode strings with various encodings. This includes
updating one test case to now expect UTF-8 by default.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).
msg70497 - (view) Author: Matt Giuca (mgiuca) Date: 2008-07-31 11:27
OK after a long discussion on the mailing list, Guido gave this the OK,
with the provision that there are str->bytes and bytes->str versions of
these functions as well. So I've written those.

http://mail.python.org/pipermail/python-dev/2008-July/081601.html

quote itself now accepts either a str or a bytes. quote_from_bytes is a
new function which is just an alias for quote. (Is this acceptable?)

unquote is still str->str. I've added a totally separate function
unquote_to_bytes which is str->bytes.

Note there is a slight issue here: I didn't quite know what to do with
unescaped non-ASCII characters in the input to unquote_to_bytes - they
need to somehow be converted to bytes. I chose to encode them using
UTF-8, on the basis that they technically shouldn't be in a URI anyway.

Note that my new unquote doesn't have this problem; it's carefully
written to preserve the Unicode characters, even if they aren't
expressible in the given encoding (which explains some of the code bloat).

This makes unquote(s, encoding=e) necessarily more robust than
unquote_to_bytes(s).decode(e) in terms of unescaped non-ASCII characters
in the input.
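
To illustrate the difference, with an unescaped non-ASCII character in
the input (assuming the patched functions):

>>> unquote('Σ', encoding='ascii')
'Σ'
>>> unquote_to_bytes('Σ').decode('ascii')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0:
ordinal not in range(128)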

I've also added new test cases and documentation for these two new
functions (included in patch6).

On an entirely personal note, can whoever checks this in please mention
my name in the commit log - I've put in at least 30 hours researching
and writing this patch, and I'd like for this not to go uncredited :)

Commit log for patch6:

Fix for issue 3300.

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets.
As per RFC 3986, default is "utf-8" (previously implicitly decoded as
ISO-8859-1).

urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded. Default is "utf-8" (previously characters
in range(128, 256) were encoded as ISO-8859-1, and characters above that
as UTF-8). Also characters/bytes above 128 are no longer allowed to be
"safe". Also now allows either bytes or strings.

Added functions urllib.parse.quote_from_bytes,
urllib.parse.unquote_to_bytes.

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface, added quote_from_bytes and unquote_to_bytes.

Lib/test/test_urllib.py: Added several new test cases testing encoding
and decoding Unicode strings with various encodings, as well as testing
the new functions.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).
msg70512 - (view) Author: Matt Giuca (mgiuca) Date: 2008-07-31 14:49
Hmm ... seems patch 6 I just checked in fails a test case! Sorry! (It's
minor, gives a harmless BytesWarning if you run with -b, which "make
test" does, so I only picked it up after submitting).

I've slightly changed the code in quote so it doesn't do that any more
(it normalises all "safe" arguments to bytes).
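
Concretely, the normalisation amounts to something like this (the
exact line appears in Guido's review below):

if isinstance(safe, str):
    safe = safe.encode('ascii', 'ignore')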

Please review patch 7, not 6. Same commit log as above.

(Also .. someone let me know if I'm not submitting patches properly,
like perhaps I should be deleting the old ones not keeping them around?)
msg70771 - (view) Author: Bill Janssen (janssen) * (Python committer) Date: 2008-08-06 05:59
Here's my version of how quote and unquote should be implemented in
Python 3.0.  I haven't looked at the uses of it in the library, but I'd
expect improper uses (and there are lots of them) will break, and thus
can be fixed.

Basically, percent-quoting is about creating an ASCII string that can be
safely used in URI from an arbitrary sequence of octets.  So, my version
of quote() takes either a byte sequence or a string, and percent-quotes
the unsafe ones, and then returns a str.  If a str is supplied on input,
it is first converted to UTF-8, then the octets of that encoding are
percent-quoted.

For unquote, there's no way to tell what the octets of the quoted
sequence may mean, so this takes the percent-quoted ASCII string, and
returns a byte sequence with the unquoted bytes.  For convenience, since
the unquoted bytes are often a string in some particular character set
encoding, I've also supplied unquote_as_string(), which takes an
optional character set, and first unquotes the bytes, then converts them
to a str, using that character set encoding, and returns the resulting
string.
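
A minimal sketch of that layering (the unquote_as_string signature
matches the traceback quoted in msg70800 below; the bodies are my
reading of the design, not the patch itself):

def unquote_as_bytes(s, plus=False):
    # Percent-decode an ASCII string into raw octets; optionally map
    # '+' to space, x-www-form-urlencoded style.
    if plus:
        s = s.replace('+', ' ')
    out = bytearray()
    i = 0
    while i < len(s):
        if s[i] == '%':
            out.append(int(s[i+1:i+3], 16))
            i += 3
        else:
            out.append(ord(s[i]))
            i += 1
    return bytes(out)

def unquote_as_string(s, charset='utf-8', plus=False):
    # Unquote to bytes first, then decode with the given charset.
    return str(unquote_as_bytes(s, plus=plus), charset, 'strict')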
msg70788 - (view) Author: Bill Janssen (janssen) * (Python committer) Date: 2008-08-06 16:51
Here's a patch to parse.py (and test/test_urllib.py) that makes the
various tests (cgi, urllib, httplib) pass.  It basically adds
"unquote_as_string", "unquote_as_bytes", "quote_as_string",
"quote_as_bytes", and then define the existing "quote" and "unquote" in
terms of them.
msg70791 - (view) Author: Jim Jewett (jimjjewett) Date: 2008-08-06 17:29
Is there still disagreement over anything except:

(1)  The type signature of quote and unquote (as opposed to the 
explicit "quote_as_bytes" or "quote_as string").

(2)  The default encoding (Latin-1 vs UTF-8), and (if UTF-8) what to do 
with invalid byte sequences?

(3)  Would waiting for 3.1 cause too many compatibility problems?
msg70793 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-06 18:39
Bill, I haven't studied your patch in detail but a few comments:
- it would be nice to have more unit tests, especially for the various
bytes/unicode possibilities, and perhaps also roundtripping (Matt's
patch has a lot of tests)
- quote_as_bytes() should return a bytes object, not a bytearray
- using the "%02X" format looks clearer to me than going through the
_hextable lookup table...
- when the argument is of the wrong type, quote_as_bytes() should raise
a TypeError rather than a ValueError
- why is quote_as_string() hardwired to utf8 while unquote_as_string()
provides a charset parameter? wouldn't it be better for them to be
consistent with each other?
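
For reference, the "%02X" alternative mentioned above is plain
%-formatting of each byte value:

>>> '%%%02X' % 0xE9
'%E9'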
msg70800 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-06 19:51
Bill Janssen's "patch" breaks two unittests: test_email and
test_http_cookiejar.  Details for test_email:

======================================================================
ERROR: test_rfc2231_bad_character_in_filename
(email.test.test_email.TestRFC2231)
.
.
.
  File "/usr/local/google/home/guido/python/py3k/Lib/urllib/parse.py",
line 267, in unquote_as_string
    return str(unquote_as_bytes(s, plus=plus), charset, 'strict')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 13:
unexpected end of data

======================================================================
FAIL: test_rfc2231_bad_character_in_charset
(email.test.test_email.TestRFC2231)
----------------------------------------------------------------------
Traceback (most recent call last):
  File
"/usr/local/google/home/guido/python/py3k/Lib/email/test/test_email.py",
line 3279, in test_rfc2231_bad_character_in_charset
    self.assertEqual(msg.get_content_charset(), None)
AssertionError: 'utf-8\\u201d' != None

Details for test_http_cookiejar:

======================================================================
FAIL: test_url_encoding (__main__.LWPCookieTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "Lib/test/test_http_cookiejar.py", line 1454, in test_url_encoding
    self.assert_("foo=bar" in cookie and version_re.search(cookie))
AssertionError: None
msg70804 - (view) Author: Jim Jewett (jimjjewett) Date: 2008-08-06 21:03
Matt pointed out that the email package assumes Latin-1 rather than UTF-8; I 
assume Bill could patch his patch the same way Matt did, and this would 
resolve the email tests.  (Unless you pronounce to stick with Latin-1)

The cookiejar failure probably has the same root cause; that test is 
encoding (non-ASCII) Latin-1 characters, and urllib.parse.py/Quoter assumes 
Latin-1.

So I see some evidence (probably not enough) for sticking with Latin-1 
instead of UTF-8.  But I don't see any evidence that fixing the semantics 
(encoded results should be bytes) at the same time made the conversion any 
more painful.  

On the other hand, Matt shows that some of those extra str->byte code 
changes might never need to be done at all, except for purity.
msg70806 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-06 21:39
Dear GvR,

New code review comments by GvR have been published.
Please go to http://codereview.appspot.com/2827 to read them.

Message:
Hi Matt,

Here's a code review of your patch.

I'm leaning more and more towards wanting this for 3.0, but I have some API
design issues and also some style nits.

I'm cross-linking this with the Python tracker issue, through the subject.

Details:

http://codereview.appspot.com/2827/diff/1/2
File Doc/library/urllib.parse.rst (right):

http://codereview.appspot.com/2827/diff/1/2#newcode198
Line 198: replaced by a placeholder character.
I don't think that's a good default.  I'd rather see it default to strict --
that's what encoding translates to everywhere else.  I believe that lenient
error handling by default can cause subtle security problems too, by hiding
problem characters from validation code.

http://codereview.appspot.com/2827/diff/1/2#newcode215
Line 215: An alias for :func:`quote`, intended for use with a :class:`bytes`
object
I'd rather see this as a wrapper that raises TypeError if the argument
isn't a bytes or bytearray instance.  Otherwise it's needless redundancy.

http://codereview.appspot.com/2827/diff/1/2#newcode223
Line 223: Replace ``%xx`` escapes by their single-character equivalent.
Should add what the argument type is -- I vote for str or bytes/bytearray.

http://codereview.appspot.com/2827/diff/1/2#newcode242
Line 242: .. function:: unquote_to_bytes(string)
Again, add what the argument type is.

http://codereview.appspot.com/2827/diff/1/4
File Lib/email/utils.py (right):

http://codereview.appspot.com/2827/diff/1/4#newcode224
Line 224: except:
An unqualified except clause is unacceptable here.  Why do you need this  
anyway?

http://codereview.appspot.com/2827/diff/1/5
File Lib/test/test_http_cookiejar.py (right):

http://codereview.appspot.com/2827/diff/1/5#newcode1450
Line 1450: "%3c%3c%0Anew%C3%A5/%C3%A5",
I'm guessing this test broke otherwise?  Given that this references an
RFC, is it correct to just fix it this way?

http://codereview.appspot.com/2827/diff/1/3
File Lib/urllib/parse.py (right):

http://codereview.appspot.com/2827/diff/1/3#newcode10
Line 10: "urlsplit", "urlunsplit"]
Please add all the quote/unquote versions here too.
(They were there in 2.5, but somehow got dropped from 3.0.)

http://codereview.appspot.com/2827/diff/1/3#newcode265
Line 265: # Maps lowercase and uppercase variants (but not mixed case).
That sounds like a disaster.  Why would %aa and %AA be correct but not
%aA and %Aa?  (Even though the old code had the same problem.)

http://codereview.appspot.com/2827/diff/1/3#newcode283
Line 283: def unquote(s, encoding = "utf-8", errors = "replace"):
Please no spaces around the '=' when used for an argument default (or for a
keyword arg).

Also see my comment about defaulting to 'replace' in the doc file.

Finally -- let's be consistent about quotes.  It seems most of this
file uses single quotes, so let's stick to that (except docstrings
always use double quotes).

And more: what should a None value for encoding or errors mean?  IMO it
should mean "use the default".

http://codereview.appspot.com/2827/diff/1/3#newcode382
Line 382: safe = safe.encode('ascii', 'ignore')
Using errors='ignore' seems like a mistake -- it will hide errors.

I also wonder why safe should be limited to ASCII though.

http://codereview.appspot.com/2827/diff/1/3#newcode399
Line 399: if ' ' in s:
This test means that it won't work if the input is bytes.  E.g.

urllib.parse.quote_plus(b"abc def")

raises a TypeError.

Sincerely,

   Your friendly code review daemon (http://codereview.appspot.com/).
msg70807 - (view) Author: Jim Jewett (jimjjewett) Date: 2008-08-06 22:25
> http://codereview.appspot.com/2827/diff/1/5#newcode1450
> Line 1450: "%3c%3c%0Anew%C3%A5/%C3%A5",
> I'm guessing this test broke otherwise?  

Yes; that is one of the breakages you found in Bill's patch.  (He didn't 
modify the test.)

> Given that this references an RFC,
> is it correct to just fix it this way?

Probably.  Looking at http://www.faqs.org/rfcs/rfc2965.html

(1)  That is not among the exact tests in the RFC.
(2)  The RFC does not specify charset for the cookie in general, but the 
Comment field MUST be in UTF-8, and the only other reference I could find to 
a specific charset was "possibly in a server-selected printable ASCII 
encoding."

Whether we have to use Latin-1 (or document charset) in practice for 
compatibility reasons, I don't know.
msg70818 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-07 12:11
Dear GvR,

New code review comments by mgiuca have been published.
Please go to http://codereview.appspot.com/2827 to read them.

Message:
Hi Guido,

Thanks very much for this very detailed review. I've replied to the
comments. I will make the changes as described below and send a new
patch to the tracker.
msg70824 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-07 13:42
A reply to a point on GvR's review, I'd like to open for discussion.
This relates to whether or not quote's "safe" argument should allow
non-ASCII characters.

> Using errors='ignore' seems like a mistake -- it will hide errors.
> I also wonder why safe should be limited to ASCII though.

The reasoning is this: if we allow non-ASCII characters to be marked
safe (passed through unescaped), then we allow quote to generate
invalid URIs (URIs are only allowed to
have ASCII characters). It's one thing for unquote to accept such URIs,
but I think we shouldn't be producing them. Albeit, it only produces an
invalid URI if you explicitly request it. So I'm happy to make the
change to allow any character to be safe, but I'll let it go to
discussion first.
msg70828 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-07 14:20
On Thursday 7 August 2008 at 13:42 +0000, Matt Giuca wrote:
> The reasoning is this: if we allow non-ASCII characters to be marked
> safe (passed through unescaped), then we allow quote to generate
> invalid URIs (URIs are only allowed to
> have ASCII characters). It's one thing for unquote to accept such URIs,
> but I think we shouldn't be producing them. Albeit, it only produces an
> invalid URI if you explicitly request it. So I'm happy to make the
> change to allow any character to be safe, but I'll let it go to
> discussion first.

The important thing is that the defaults are safe. If users want to override
the defaults and produce potentially invalid URIs, there is no reason to
discourage them.
msg70830 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-07 14:35
> The important is that the defaults are safe. If users want to override
> the defaults and produce potentially invalid URIs, there is no reason to
> discourage them.

OK I think that's a fairly valid argument. I'm about to head off so I'll
post the patch I have now, which fixes most of the other concerns. That
change will cause havoc to quote I think ;)
msg70833 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-07 14:59
Following Guido and Antoine's reviews, I've written a new patch which
fixes *most* of the issues raised. The ones I didn't fix I have noted
below, and commented on the review site
(http://codereview.appspot.com/2827/). Note: I intend to address all of
these issues after some discussion.

Outstanding issues raised by the reviews:

Doc/library/urllib.parse.rst:
Should unquote accept a bytes/bytearray as well as a str?

Lib/email/utils.py:
Should encode_rfc2231 with charset=None accept strings with non-ASCII
characters, and just encode them to UTF-8?

Lib/test/test_http_cookiejar.py:
Does RFC 2965 let me get away with changing the test case to expect
UTF-8? (I'm pretty sure it doesn't care what encoding is used).

Lib/test/test_urllib.py:
Should quote raise a TypeError if given a bytes with encoding/errors
arguments? (Motivation: TypeError is what you usually raise if you
supply too many args to a function).

Lib/urllib/parse.py:
(As discussed above) Should quote accept safe characters outside the
ASCII range (thereby potentially producing invalid URIs)?

------

Commit log for patch8:

Fix for issue 3300.

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets.
As per RFC 3986, default is "utf-8" (previously implicitly decoded as
ISO-8859-1). Also fixed a bug in which mixed-case hex digits (such as
"%aF") weren't being decoded at all.

urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded. Default is "utf-8" (previously characters
in range(128, 256) were encoded as ISO-8859-1, and characters above that
as UTF-8). Also characters/bytes above 128 are no longer allowed to be
"safe". Also now allows either bytes or strings.

Added functions urllib.parse.quote_from_bytes,
urllib.parse.unquote_to_bytes. All quote/unquote functions now exported
from the module.

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface, added quote_from_bytes and unquote_to_bytes.

Lib/test/test_urllib.py: Added many new test cases testing encoding
and decoding Unicode strings with various encodings, as well as testing
the new functions.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).
msg70834 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-07 15:02
I'm also attaching a "metapatch" - diff from patch 7 to patch 8. This is
to give a rough idea of what I changed since the review.

(Sorry - This is actually a diff between the two patches, so it's pretty
hard to read. It would have been nicer to diff the files themselves but
I'm not doing local commits so that's hard. Can one use the Bazaar
mirror for development, or is it too out-of-date?)
msg70840 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-07 16:59
On Thu, Aug 7, 2008 at 8:03 AM, Matt Giuca <report@bugs.python.org> wrote:
>
> Matt Giuca <matt.giuca@gmail.com> added the comment:
>
> I'm also attaching a "metapatch" - diff from patch 7 to patch 8. This is
> to give a rough idea of what I changed since the review.
>
> (Sorry - This is actually a diff between the two patches, so it's pretty
> hard to read. It would have been nicer to diff the files themselves but
> I'm not doing local commits so that's hard. Can one use the Bazaar
> mirror for development, or is it too out-of-date?)

A much better way to do this would be to use Rietveld; it can show
deltas between patch sets that are actually readable, and it keeps the
inline comments.  I've uploaded your patch #8 to the issue I created
there, and you can see the delta from patch7 starting here:

http://codereview.appspot.com/2827/diff2/1:17/27
msg70855 - (view) Author: Bill Janssen (janssen) * (Python committer) Date: 2008-08-07 20:49
Just to reply to Antoine's comments on my patch:

- it would be nice to have more unit tests, especially for the various
bytes/unicode possibilities, and perhaps also roundtripping (Matt's
patch has a lot of tests)

Yes, I completely agree.

- quote_as_bytes() should return a bytes object, not a bytearray

Good point.

- using the "%02X" format looks clearer to me than going through the
_hextable lookup table...

Really?  I see it the other way.

- when the argument is of the wrong type, quote_as_bytes() should raise
a TypeError rather than a ValueError

Good point.

- why is quote_as_string() hardwired to utf8 while unquote_as_string()
provides a charset parameter? wouldn't it be better for them to be
consistent with each other?

To encourage the use of UTF-8.  The caller can always explicitly encode
to some other character set, then pass in the bytes that result from
that encoding.  Remember that the RFC for percent-encoding really takes
bytes in, and produces bytes out.  The string-in and string-out versions
are to support naive programming (what a nice way of putting it!).
msg70858 - (view) Author: Bill Janssen (janssen) * (Python committer) Date: 2008-08-07 21:17
My main fear with this patch is that "unquote" will become seen as
unreliable, because naive software trying to parse URLs will encounter
uses of percent-encoding where the encoded octets are not in fact UTF-8
bytes.  They're just some set of bytes.  A secondary concern is that it
will invisibly produce invalid data, because it decodes some
non-UTF-8-encoded string that happens to only use UTF-8-valid sequences
as the wrong string value.

Now, I have to confess that I don't know how common these use cases are
in actual URL usage.  It would be nice if there was some organization
that had a large collection of URLs, and could provide a test set we
could run a scanner over :-).

As a workaround, though, I've sent a message off to Larry Masinter to
ask about this case.  He's one of the authors of the URI spec.
msg70861 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-08-07 21:43
On 2008-08-07 23:17, Bill Janssen wrote:
> Bill Janssen <bill.janssen@gmail.com> added the comment:
> 
> My main fear with this patch is that "unquote" will become seen as
> unreliable, because naive software trying to parse URLs will encounter
> uses of percent-encoding where the encoded octets are not in fact UTF-8
> bytes.  They're just some set of bytes. 

unquote will have to be able to deal with old-style URLs that
use the Latin-1 encoding. HTML uses (or used to use) the Latin-1
encoding as default and that's how URLs ended up using it as well:

http://www.w3schools.com/TAGS/ref_urlencode.asp

I'd suggest to have it first try UTF-8 decoding and then fall back
to Latin-1 decoding.
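
A sketch of that fallback (illustrative only, reusing the
unquote_to_bytes function proposed earlier in this issue):

def unquote_with_fallback(s):
    raw = unquote_to_bytes(s)
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('latin-1')  # Latin-1 accepts any byte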

> A secondary concern is that it
> will invisibly produce invalid data, because it decodes some
> non-UTF-8-encoded string that happens to only use UTF-8-valid sequences
> as the wrong string value.

It's rather unlikely that someone will have used a Latin-1 encoded
URL which happens to decode as valid UTF-8: The valid UTF-8 combinations
don't really make any sense when used as text, e.g.

Ã¤Ã¶Ã¼

> Now, I have to confess that I don't know how common these use cases are
> in actual URL usage.  It would be nice if there was some organization
> that had a large collection of URLs, and could provide a test set we
> could run a scanner over :-).
> 
> As a workaround, though, I've sent a message off to Larry Masinter to
> ask about this case.  He's one of the authors of the URI spec.
msg70862 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-07 21:46
On Thu, Aug 7, 2008 at 2:17 PM, Bill Janssen <report@bugs.python.org> wrote:
>
> Bill Janssen <bill.janssen@gmail.com> added the comment:
>
> My main fear with this patch is that "unquote" will become seen as
> unreliable, because naive software trying to parse URLs will encounter
> uses of percent-encoding where the encoded octets are not in fact UTF-8
> bytes.  They're just some set of bytes.

Apps that want to handle these correctly have no choice but to use
unquote_to_bytes(), or setting error='ignore' or error='replace'.

Your original proposal was to make unquote() behave like
unquote_to_bytes(), which would require changes to virtually every app
using unquote(), since almost all apps assume the result is a (text)
string.

Setting the default encoding to Latin-1 would prevent these errors,
but would commit the sin of mojibake (the Japanese word for Perl code
:-). I don't like that much either.

A middle ground might be to set the default encoding to ASCII --
that's closer to Martin's claim that URLs are supposed to be ASCII
only. It will fail as soon as an app receives encoded non-ASCII text.
This will require many apps to be changed, but at least it forces the
developers to think about which encoding to assume (perhaps there's
one handy in the request headers if it's a web app) or about error
handling or perhaps using unquote_to_bytes().

However I fear that this middle ground will in practice cause:

(a) more in-the-field failures, since devs are notorious for testing
with ASCII only; and

(b) the creation of a recipe for "fixing" unquote() calls that fail by
setting the encoding to UTF-8 without thinking about the alternatives,
thereby effectively recreating the UTF-8 default with much more pain.

Therefore I think that the UTF-8 default is probably the most pragmatic choice.

In the code review, I have asked Matt to change the default error
handling from errors='replace' to errors='strict'. I suppose we could
reduce outright crashes in the field by setting this to 'replace'
(even though for quote() I think it should remain 'strict'). But this
may cause more subtle failures, where apps simply receive garbage
data. At least when you're serving pages with error 500 the developers
tend to get called in. When the users merely get failing results such
bugs may remain lingering much longer.

> A secondary concern is that it
> will invisibly produce invalid data, because it decodes some
> non-UTF-8-encoded string that happens to only use UTF-8-valid sequences
> as the wrong string value.

In my experience this is very unlikely. UTF-8 looks like total junk in
Latin-1, so it's unlikely to occur naturally. If you see something
that matches a UTF-8 sequence in Latin-1 text, it's most likely that
in fact it was incorrectly decoded earlier...

> Now, I have to confess that I don't know how common these use cases are
> in actual URL usage.  It would be nice if there was some organization
> that had a large collection of URLs, and could provide a test set we
> could run a scanner over :-).
>
> As a workaround, though, I've sent a message off to Larry Masinter to
> ask about this case.  He's one of the authors of the URI spec.

Looking forward to his response.
msg70868 - (view) Author: Bill Janssen (janssen) * (Python committer) Date: 2008-08-07 22:58
> Your original proposal was to make unquote() behave like
> unquote_to_bytes(), which would require changes to virtually every app
> using unquote(), since almost all apps assume the result is a (text)
> string.

Actually, careful apps realize that the result of "unquote" in Python 2 is a
sequence of bytes, and do something careful with that.  So only careless
apps would break, and they'd break in such a way that their maintainers
would have to look at the situation again, and think about it.  Seems like a
'good thing', to me.  And since this is Python 3, fully allowed.  I really
don't understand your position here, I'm afraid.

> Setting the default encoding to Latin-1 would prevent these errors,
> but would commit the sin of mojibake (the Japanese word for Perl code
> :-). I don't like that much either.

No, that would be wrong.  Returning a string just for the sake of returning
a string.  Remember, the data percent-encoded is not necessarily a string,
and not necessarily in any known encoding.

>
> A middle ground might be to set the default encoding to ASCII --
> that's closer to Martin's claim that URLs are supposed to be ASCII
> only.

URLs *are* supposed to be ASCII only -- but the percent-encoded byte
sequences in various parts of the path aren't.

> This will require many apps to be changed, but at least it forces the
> developers to think about which encoding to assume (perhaps there's
> one handy in the request headers if it's a web app) or about error
> handling or perhaps using unquote_to_bytes().

Yes, this is closer to my line of reasoning.

> However I fear that this middle ground will in practice cause:
>
> (a) more in-the-field failures, since devs are notorious for testing
> with ASCII only; and

Returning bytes deals with this problem.

> (b) the creation of a recipe for "fixing" unquote() calls that fail by
> setting the encoding to UTF-8 without thinking about the alternatives,
> thereby effectively recreating the UTF-8 default with much more pain.

Could be, but at least they will have had to think about.  There's lots of
bad code out there, and maybe by making them think about it, some of it will
improve.

> > A secondary concern is that it
> > will invisibly produce invalid data, because it decodes some
> > non-UTF-8-encoded string that happens to only use UTF-8-valid sequences
> > as the wrong string value.
>
> In my experience this is very unlikely. UTF-8 looks like total junk in
> Latin-1, so it's unlikely to occur naturally. If you see something
> that matches a UTF-8 sequence in Latin-1 text, it's most likely that
> in fact it was incorrectly decoded earlier...
>

Latin-1 isn't the only alternate encoding in the world, and not all
percent-encoded byte sequences in URLs are encoded strings.  I'd feel better
if we were being guided by more than your just experience (vast though it
may rightly be said to be!).  Say, by looking at all the URLs that Google
knows about :-).  I'd particularly feel better if some expert in Asian use
of the Web spoke up here...
msg70869 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-07 23:23
On Thu, Aug 7, 2008 at 3:58 PM, Bill Janssen <report@bugs.python.org> wrote:
> Bill Janssen <bill.janssen@gmail.com> added the comment:
>> Your original proposal was to make unquote() behave like
>> unquote_to_bytes(), which would require changes to virtually every app
>> using unquote(), since almost all apps assume the result is a (text)
>> string.
>
> Actually, careful apps realize that the result of "unquote" in Python 2 is a
> sequence of bytes, and do something careful with that.

Given that in 2.x it's a string, and knowing my users, I expect that
careful apps are a tiny minority.

> So only careless
> apps would break, and they'd break in such a way that their maintainers
> would have to look at the situation again, and think about it.  Seems like a
> 'good thing', to me.  And since this is Python 3, fully allowed.  I really
> don't understand your position here, I'm afraid.

My position is that although 3.0 supports Unicode much better than 2.x
(I won't ever use pretentious and meaningless phrases like "fully
supports"), that doesn't mean that you *have* to use it. I expect lots
of simple web apps don't need Unicode but do need quote/unquote
functionality to get around "forbidden" characters in query strings
like &. I don't want to force such apps to become more aware of
Unicode than absolutely necessary.

>> A middle ground might be to set the default encoding to ASCII --
>> that's closer to Martin's claim that URLs are supposed to be ASCII
>> only.
>
> URLs *are* supposed to be ASCII only -- but the percent-encoded byte
> sequences in various parts of the path aren't.
>
>> This will require many apps to be changed, but at least it forces the
>> developers to think about which encoding to assume (perhaps there's
>> one handy in the request headers if it's a web app) or about error
>> handling or perhaps using unquote_to_bytes().
>
> Yes, this is closer to my line of reasoning.
>
>> However I fear that this middle ground will in practice cause:
>>
>> (a) more in-the-field failures, since devs are notorious for testing
>> with ASCII only; and
>
> Returning bytes deals with this problem.

In an unpleasant way. We might as well consider changing all APIs that
deal with URLs to insist on bytes.

>> (b) the creation of a recipe for "fixing" unquote() calls that fail by
>> setting the encoding to UTF-8 without thinking about the alternatives,
>> thereby effectively recreating the UTF-8 default with much more pain.
>
> Could be, but at least they will have had to think about it.  There's lots of
> bad code out there, and maybe by making them think about it, some of it will
> improve.

I'd rather use a carrot than a stick. IOW I'd rather write aggressive
docs than break people's code.

>> > A secondary concern is that it
>> > will invisibly produce invalid data, because it decodes some
>> > non-UTF-8-encoded string that happens to only use UTF-8-valid sequences
>> > as the wrong string value.
>>
>> In my experience this is very unlikely. UTF-8 looks like total junk in
>> Latin-1, so it's unlikely to occur naturally. If you see something
>> that matches a UTF-8 sequence in Latin-1 text, it's most likely that
>> in fact it was incorrectly decoded earlier...

> Latin-1 isn't the only alternate encoding in the world, and not all
> percent-encoded byte sequences in URLs are encoded strings.  I'd feel better
> if we were being guided by more than your just experience (vast though it
> may rightly be said to be!).  Say, by looking at all the URLs that Google
> knows about :-).  I'd particularly feel better if some expert in Asian use
> of the Web spoke up here...

OK, let's wait and see if one bites.
msg70872 - (view) Author: Bill Janssen (janssen) * (Python committer) Date: 2008-08-08 00:18
On Thu, Aug 7, 2008 at 4:23 PM, Guido van Rossum <report@bugs.python.org>wrote:

>
> >> However I fear that this middle ground will in practice cause:
> >>
> >> (a) more in-the-field failures, since devs are notorious for testing
> >> with ASCII only; and
> >
> > Returning bytes deals with this problem.
>
> In an unpleasant way. We might as well consider changing all APIs that
> deal with URLs to insist on bytes.
>

That seems a bit over-the-top.  Most URL operations *are* about strings, and
most of the APIs should deal with strings; we're talking about the return
result of an operation specifically designed to extract binary data from the
one place where it's allowed to occur.  Vastly smaller than "changing all
APIs that deal with URLs".

By the way, I see that the email package dodges this by encoding the bytes
to strings using the codec "raw-unicode-escape".  In other words, byte
sequences in the outward form of a string.  I'd be OK with that.  That is,
make the default codec for "unquote" be "raw-unicode-escape".  All the bytes
will come through unscathed, and people who are naively expecting ASCII
strings will still receive them, so the code won't break.  This actually
seems to be closest to the current usage, so I'm going to change my patch to
do that.
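The property this relies on, in a minimal sketch: raw-unicode-escape maps
each byte to the code point of the same value, so the original octets stay
recoverable from the string.

>>> b'\xe5\xe6\xf8'.decode('raw-unicode-escape')
'åæø'
>>> 'åæø'.encode('raw-unicode-escape')
b'\xe5\xe6\xf8'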
msg70878 - (view) Author: Bill Janssen (janssen) * (Python committer) Date: 2008-08-08 01:34
Now I'm looking at the failing test_http_cookiejar test, which fails
because it encodes a non-UTF-8 byte, 0xE5, in a path segment of a URI.
The question is, does the "http" URI scheme allow non-ASCII (say,
Latin-1) octets in path segments?  IANA says that the "http" scheme
is defined in RFC 2616, and that says:

   This specification adopts the
   definitions of "URI-reference", "absoluteURI", "relativeURI", "port",
   "host","abs_path", "rel_path", and "authority" from [RFC 2396].

But RFC 2396 says:

    An individual URI scheme may require a single charset, define a
    default charset, or provide a way to indicate the charset used.

And doesn't say anything about the "http" scheme.  Nor does it indicate
any default encoding or character set for URIs.  The update, 3986,
doesn't say anything new about this, though it does implore URI scheme
designers to represent characters in a textual segment with ASCII codes
where they exist, and to use UTF-8 when designing *new* URI schemes.

Barring any other information, I think that the "segments" in the path
of an "http" URL must also be assumed to be binary; that is, any octet
is allowed, and no character set can be presumed.
msg70879 - (view) Author: Bill Janssen (janssen) * (Python committer) Date: 2008-08-08 02:04
Looks like the failing test in test_http_cookiejar is just a bad test;
it attempts to build an HTTP request object from an invalid URL, yet
still seem to expect to be able to extract a cookie from the response
headers for that request.  I'd expect some exception to be thrown.

So this probably indicates a bug in urllib.parse:

urllib.parse.urlparse("http://www.acme.com/foo%2f%25/<<%0anew\345/\346\370\345")


should throw an exception somewhere.
msg70880 - (view) Author: Bill Janssen (janssen) * (Python committer) Date: 2008-08-08 02:07
Just to be clear:  any octet would seem to be allowed in the "path" of
an "http" URL, but any non-ASCII octet must be percent-encoded.  So the
URL itself is still an ASCII string, considered opaquely.
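For example (a hand-rolled sketch of that rule; it ignores reserved
characters such as '%' for brevity):

>>> path = b'/new\xe5'
>>> ''.join(chr(b) if 0x20 < b < 0x7f else '%%%02X' % b for b in path)
'/new%E5'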
msg70913 - (view) Author: Bill Janssen (janssen) * (Python committer) Date: 2008-08-08 21:28
Here's the updated version of my patch.  It returns a string, but
doesn't clobber bytes that are contained in the string.  Naive code
(basically code that expects ASCII strings from unquote) should continue
to work as well as it ever did.  I fixed the problem with the email
library, and removed the bogus test from the http_cookiejar test suite.
msg70949 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-09 18:34
Bill, I had a look at your patch. I see you've decided to make
quote_as_string the default? In that case, I don't know why you had to
rewrite everything to implement the same basic behaviour as my patch.
(My latest few patches support bytes both ways). Anyway, I have a lot of
issues with your implementation.

* Why did you replace all of the existing machinery? Particularly the
way quote creates Quoter objects and stores them in a cache. I haven't
done any speed tests, but I assume that was all there for performance
reasons.

* The idea of quote_as_bytes is malformed. quote_as_bytes takes a str or
bytes, and outputs a URI as a bytes, while quote_as_string outputs a URI
as a str. This is the first time in the whole discussion we've
represented a URI as bytes, not a str. URIs are not byte sequences, they
are character sequences (see discussion below). I think only
quote_as_string is valid.

* The names unquote_as_* and quote_as_* are confusing. Use unquote_to_*
and quote_from_* to avoid ambiguity.

* Are unquote_as_string and unquote both part of your public interface?
That seems like unnecessary duplication.

* As Antoine pointed out above, it's too limiting for quote to force
UTF-8. Add a 'charset' parameter.

* Add an 'errors' parameter too, to give the caller control over how
strict to be.

* unquote and unquote_plus are missing 'charset' param, which should be
passed along to unquote_as_string.

* I do like the addition of a "plus" argument, as opposed to the
separate unquote_plus and quote_plus functions. I'd swap the arguments
to unquote around so charset is first and then plus, so you can write
unquote(mystring, 'utf-8') without using a keyword argument.

* In unquote: The "raw_unicode_escape" encoding makes no sense. It does
exactly the same thing as Latin-1, except it also looks for b"\\uxxxx"
in the string and converts that into a Unicode character. So your code
behaves like this:

>>> urllib.parse.unquote('%5Cu00fc')
'ü'
(Should output "\u00fc")
>>> urllib.parse.unquote('%5Cu')
UnicodeDecodeError: 'rawunicodeescape' codec can't decode bytes in
position 11-12: truncated \uXXXX
(Should output "\u")

I suspect the email package (where you got the inspiration to use
'rawunicodeescape') has this same crazy problem, but that isn't my
concern today!

Aside from this weirdness, you're essentially defaulting unquote to
Latin-1. As I've said countless times, unquote needs to be the inverse
of quote, or you get this behaviour:

>>> urllib.parse.unquote(urllib.parse.quote('ü'))
'ü'

Once again, I refer you to my favourite web server example.

import http.server
s = http.server.HTTPServer(('',8000),
        http.server.SimpleHTTPRequestHandler)
s.serve_forever()

Run this in a directory with a non-Latin-1 filename (eg. "漢字"), and
you will get a 404 when you click on the file.

* One issue I worked very hard to solve is how to deal with unescaped
non-ASCII characters in unquote. Technically this is an invalid URI, so
I'm not sure how important it is, but it's nice to be able to assume the
unquote function won't mess with them. For example,
unquote_as_string("\u6f22%C3%BC", charset="latin-1") should give
"\u6f22\u00fc" (or at least it would be nice). Yours raises
"UnicodeEncodeError: 'ascii' codec can't encode character". (I assume
this is a wanted property, given that the existing test suite tests that
unquote can handle ALL unescaped ASCII characters (that's what
escape_string in test_unquoting is for) - I merely extend this concept
to be able to handle all unescaped Unicode characters). Note that it's
impossible to allow such lenience if you implement unquote_as_string as
calling unquote_as_bytes and then decoding.

* Question: How does unquote_bytes deal with unescaped characters?
(Since this is a str->bytes transform, you need to encode them somehow).
I don't have a good answer for you here, which is one reason I think
it's wrong to treat a URI as an octet encoding. I treat them as UTF-8.
You treat them as ASCII. Since URIs are supposed to only contain ASCII,
the answers "ASCII", "Latin-1" and "UTF-8" are all as good as each
other, but as I said above, I prefer to be lenient and allow non-ASCII
URIs as input.

* Needs a lot more test cases, and documentation for your changes. I
suggest you plug my new test cases for urllib in and see if you can make
your code pass all the things I test for (and if not, have a good reason).

In addition, a major problem I have is with this dangerous assumption
that RFC 3986 specifies a byte->str encoding. You keep stating
assumptions like this:

> Remember that the RFC for percent-encoding really takes
> bytes in, and produces bytes out.  The string-in and string-out
> versions are to support naive programming (what a nice way of
> putting it!).

You assume that my patch, the string version of quote/unquote, is a
"hack" in order to satisfy the naive souls who only want to deal with
strings, while your method is the "pure and correct" solution. This is
in no way certain.

It is clear from reading RFC 3986 that URIs are characters, not octets
(that's not debatable - as we see URIs written on billboards, they're
definitely characters). What *is* debatable is whether the source data
for a URI is a character or a byte. I'm arguing it can be either.

The process of percent-encoding is taking a character, converting it
into an octet, and percent-encoding that octet into a 3-character
sequence. I quote:

> Percent-encoded octets may be used within a URI to represent
> characters outside the range of the US-ASCII coded character set
> if this representation is allowed by the scheme or by the protocol
> element in which the URI is referenced.  Such a definition should
> specify the character encoding used to map those characters to
> octets prior to being percent-encoded for the URI.

It could not be clearer that the RFC allows us to view percent encoding
as a str->str transform.

I also see later on (in section 2.1) the RFC talking about "data octets"
being percent-encoded, so I assume it's also valid to speak about a
bytes->str transform.

Therefore, I believe my quote and unquote are as correct as can be with
respect to the RFC, and I've also supplied unquote_to_bytes and
quote_from_bytes to satisfy the other (IMHO less clear, and certainly
less useful) reading of the RFC.
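Both readings side by side, as a sketch using the two functions my patch
provides (default encoding UTF-8):

>>> urllib.parse.quote('ü')                  # str -> str reading
'%C3%BC'
>>> urllib.parse.quote_from_bytes(b'\xfc')   # bytes -> str reading
'%FC'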

You seem to have arrived at the same outcome, but from a very different
standpoint that your unquote_as_bytes is "correct" and unquote_as_string
is a nasty hack.

Bill:
> Barring any other information, I think that the "segments" in the
> path of an "http" URL must also be assumed to be binary; that is,
> any octet is allowed, and no character set can be presumed.

Once again, practicality beats purity. Segments in HTTP URLs usually
refer to file paths. Files (in all modern operating systems) have Unicode
filenames, not byte filenames. So you HAVE to assume the character set
at some point. The best assumption we can make (as shown by all the
major browsers' defaults) is UTF-8.

> So this probably indicates a bug in urllib.parse:
> urllib.parse.urlparse(<snip>)
> should throw an exception somewhere.

No, this is just assuming the behaviour of unquote (which is called by
urlparse) to preserve unescaped characters as-is. So this "bug" is
actually in your own implementation of unquote (and mine too). But I
dispute that that's desirable behaviour. I think you SHOULD NOT have
removed that test case. It is indicating a bug in your code (which is
what it's there for).

Look, I'm sorry to be so harsh, but I don't see what this alternative
patch accomplishes. My analysis probably comes off as self-righteous,
but then, I've spent the past month reading documentation and testing
every aspect of this issue rigorously, and all I see is you ignoring all
my work and rewriting it in a less-robust way.

Your point here was to make the str->bytes and bytes->str versions of
unquote and quote, respectively, work, right? Well I've already got
those working too. Please tell me what you think of my implementation so
we can get it approved before the deadline.

Matt
msg70955 - (view) Author: Jim Jewett (jimjjewett) Date: 2008-08-10 00:35
Matt,

Bill's main concern is with a policy decision; I doubt he would object to 
using your code once that is resolved.

The purpose of the quoting functions is to turn a string (representing the 
human-readable version) into bytes (that go over the wire).  If everything 
is ASCII, there isn't any disagreement -- but it also isn't obvious that 
they're bytes instead of characters.  So people started (well, continued, 
since it dates to pre-unicode C) treating them as though they were strings.

The fact that ASCII (and therefore most wire protocols) looks the same as 
bytes or as characters was one of the strongest arguments against splitting 
the bytes and string types.  Now that this has been done, Bill feels we 
should be consistent.  (You feel wire-protocol bytes should be treated as 
strings, if only as bytestrings, because the libraries use them that way -- 
but this is a policy decision.)

To quote the final paragraph of 1.2.1
"""
 In local or regional contexts and with improving technology, users
   might benefit from being able to use a wider range of characters;
   such use is not defined by this specification.  Percent-encoded
   octets (Section 2.1) may be used within a URI to represent characters
   outside the range of the US-ASCII coded character set if this
   representation is allowed by the scheme or by the protocol element in
   which the URI is referenced.  Such a definition should specify the
   character encoding used to map those characters to octets prior to
   being percent-encoded for the URI.
"""

So the mapping to bytes (or "octets") for non-ASCII isn't defined (here), 
and if you want to use it, you need to specify charset.  But in practice, 
people do use it without specifying a charset.  Which charset should be 
assumed?  The old code (and test cases) assumed Latin-1.  You want to 
assume UTF-8 (though you took the document charset when available -- which 
might also make sense).
msg70958 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-10 03:09
> Bill's main concern is with a policy decision; I doubt he would
> object to using your code once that is resolved.

But his patch does the same basic operations as mine, just implemented
differently and with the heap of issues I outlined above. So it doesn't
have anything to do with the policy decision.

> The purpose of the quoting functions is to turn a string
> (representing the human-readable version) into bytes (that go
> over the wire).

Ah hang on, that's a misunderstanding. There is a two-step process involved.

Step 1. Translate <character/byte> string into an ASCII character string
by percent-encoding the <characters/bytes>. (If percent-encoding
characters, use an unspecified encoding).
Step 2. Serialize the ASCII character string into an octet sequence to
send it over the wire, using some unspecified encoding.

Step 1 is explained in detail throughout the RFC, particularly in
Section 1.2.1 Transcription ("Percent-encoded octets may be used within
a URI to represent characters outside the range of the US-ASCII coded
character set") and 2.1 Percent Encoding.

Step 2 is not actually part of the spec (because the spec outlines URIs
as character sequences, not how to send them over a network). It is
briefly described in Section 2 ("This specification does not mandate any
particular character encoding for mapping between URI characters and the
octets used to store or transmit those characters.  When a URI appears
in a protocol element, the character encoding is defined by that protocol").
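Concretely, as a sketch with the patched signature (the encoding parameter
drives Step 1; Step 2 is trivial because the result is pure ASCII):

>>> uri = urllib.parse.quote('漢字', encoding='utf-8')   # Step 1
>>> uri
'%E6%BC%A2%E5%AD%97'
>>> uri.encode('ascii')                                  # Step 2
b'%E6%BC%A2%E5%AD%97'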

Section 1.2.1:

> A URI may be represented in a variety of ways; e.g., ink on
> paper, pixels on a screen, or a sequence of character
> encoding octets.  The interpretation of a URI depends only on
> the characters used and not on how those characters are
> represented in a network protocol.

The RFC then goes on to describe a scenario of writing a URI down on a
napkin, before stating:

> A URI is a sequence of characters that is not always represented
> as a sequence of octets.

Right, so there is no debate that a URI (after percent-encoding) is a
character string, not a byte string. The debate is only whether it's a
character or byte string before percent-encoding.

Therefore, the concept of "quote_as_bytes" is flawed.

> You feel wire-protocol bytes should be treated as
> strings, if only as bytestrings, because the libraries use them
> that way.

No I do not. URIs post-encoding are character strings, in the Unicode
sense of the term "character". This entire topic has nothing to do with
the wire.

Note that the "charset" or "encoding" parameter in Bill/My patch
respectively isn't the mapping from URI strings to octets (that's
trivially ASCII). It's the charset used to encode character information
into octets which then get percent-encoded.

> The old code (and test cases) assumed Latin-1.

No, the old code and test cases were written for Python 2.x. They
assumed a byte string was being emitted (back when a byte string was a
string, so that was an acceptable output type). So they weren't assuming
an encoding. In fact the *ONLY* test case for Unicode in test_urllib
used a UTF-8-encoded string.

> r = urllib.parse.unquote('br%C3%BCckner_sapporo_20050930.doc')
> self.assertEqual(r, 'br\xc3\xbcckner_sapporo_20050930.doc')

In Python 2.x, this test case says "unquote('%C3%BC') should give me the
byte sequence '\xc3\xbc'", which is a valid case. In Python 3.0, the
code didn't change but the meaning subtly did. Now it says
"unquote('%C3%BC') should give the string 'ü'". The name is clearly
supposed to be "brückner", not "brückner", which means in Python 3.0 we
should EITHER be expecting the BYTE string b'\xc3\xbc' or the character
string 'ü'.

So the old code and test cases didn't assume any encoding, then they
were accidentally made to assume Latin-1 by the fact that the language
changed underneath them.
msg70959 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-10 05:05
I've been thinking more about the errors="strict" default. I think this
was Guido's suggestion. I've decided I'd rather stick with errors="replace".

I changed errors="replace" to errors="strict" in patch 8, but now I'm
worried that will cause problems, specifically for unquote. Once again,
all the code in the stdlib which calls unquote doesn't provide an errors
option, so the default will be the only choice when using these other
services.

I'm concerned that there'll be lots of unhandled exceptions flying
around for URLs which aren't encoded with UTF-8, and a conscientious
programmer will not be able to protect against user errors.

Take the cgi module as an example. Typical usage is to write:
> fields = cgi.FieldStorage()
> foo = fields.getFirst("foo")

If the QUERY_STRING is "foo=w%FCt" (Latin-1), with errors='strict', you
get a UnicodeDecodeError when you call cgi.FieldStorage(). With
errors='replace', the variable foo will be "w�t". I think in general I'd
rather have '�'s in my program (representing invalid user input) than
exceptions, since this is usually a user input error, not a programming
error.
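To sketch the two policies (using the patch's signature, where encoding
defaults to UTF-8):

>>> urllib.parse.unquote('w%FCt', errors='replace')
'w�t'
>>> urllib.parse.unquote('w%FCt', errors='strict')    # raises UnicodeDecodeError
>>> urllib.parse.unquote('w%FCt', encoding='latin-1')
'wüt'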

(One problem is that all I can do to handle this is catch a
UnicodeDecodeError on the call to FieldStorage; then I can't access any
of the data).

Now maybe something we can think about is propagating the "encoding" and
"errors" argument through to a few other major functions (such as
cgi.parse_qsl, cgi.FieldStorage and urllib.parse.urlencode), but that
should be separately to this patch.
msg70962 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-10 07:05
Guido suggested that quote's "safe" parameter should allow any
character, not just ASCII range. I've implemented this now. It was a lot
messier than I imagined.

The problem is that in my older patches, both 's' and 'safe' are encoded
to bytes right away, and the rest of the process is just octet encoding
(matching each byte against the safe set to see whether or not to quote it).

The new implementation requires that you delay encoding both of these
till the iteration over the string, so you match each *character*
against the safe set, then encode it if it's not in 'safe'. Now the
problem is some encodings/errors produce bytes which are in the safe
range. For instance quote('\u6f22', encoding='latin-1',
errors='xmlcharrefreplace') should give "%26%2328450%3B" (which is
"&#28450;" encoded). To preserve this behaviour, you then have to check
each *byte* of the encoded character against a 'safe bytes' set. I
believe that will slow down the implementation considerably.

In summary, it requires two levels of encoding: first characters, then
bytes. You can see how messy it made my quote implementation - I've
attached the patch (parse.py.patch8+allsafe).
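Spelled out as a hand-rolled sketch (not the patch code itself):

>>> encoded = '\u6f22'.encode('latin-1', 'xmlcharrefreplace')   # character level
>>> encoded
b'&#28450;'
>>> ''.join(c if c.isalnum() else '%%%02X' % ord(c) for c in encoded.decode('ascii'))
'%26%2328450%3B'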

I don't think it's worth the extra code bloat and performance hit just
to implement a feature whose only use is producing invalid URIs (since
URIs are supposed to only have ASCII characters). Does anyone disagree,
and want this feature in?
msg70965 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-10 07:45
Made a bunch of requested changes (I've reverted the "all safe" patch
for now since it caused so much grief; see above).

* quote: Fixed encoding illegal % sequences (and lots of new test cases
to prove it).
* quote now throws a type error if s is bytes, and encoding or errors
supplied.
* A few minor documentation fixes.

Patch 9.
Commit log for patch8 should suffice.
msg70969 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-10 10:49
On Sunday 10 August 2008 at 07:05 +0000, Matt Giuca wrote:
> I don't think it's worth the extra code bloat and performance hit just
> to implement a feature whose only use is producing invalid URIs (since
> URIs are supposed to only have ASCII characters).

Agreed.

> If the QUERY_STRING is "foo=w%FCt" (Latin-1), with errors='strict', you
> get a UnicodeDecodeError when you call cgi.FieldStorage(). With
> errors='replace', the variable foo will be "w�t". I think in general I'd
> rather have '�'s in my program (representing invalid user input) than
> exceptions, since this is usually a user input error, not a programming
> error.

Invalid user input? What if the query string comes from filling a form?
For example if I search the word "numéro" in a latin1 Web site, I get
the following URL:
http://www.le-tigre.net/spip.php?page=recherche&recherche=num%E9ro
msg70970 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-10 11:25
> Invalid user input? What if the query string comes from filling
> a form?
> For example if I search the word "numéro" in a latin1 Web site,
> I get the following URL:
> http://www.le-tigre.net/spip.php?page=recherche&recherche=num%E9ro

Yes, that is a concern. I suppose the idea should be that as the
programmer _you_ write the website, so you make it UTF-8 and you use our
defaults. Or you make it Latin-1, and you override our defaults (which
is tricky if you use cgi.FieldStorage, for example).

But anyway, how do you propose to handle that (other than the programmer
setting the correct default). With errors='replace', the above query
will result in "num�ro", but with errors='strict', it will result in a
UnicodeDecodeError (which you could handle, if you remembered). As a
programmer I don't really want to handle that error every time I use
unquote or anything that calls unquote. I'd rather accept the
possibility of '�'s in my input.

I'm not going to dig in my heels though, this time :) I just want to
make sure the consequences of this decision are known before we commit.
msg71042 - (view) Author: Bill Janssen (janssen) * (Python committer) Date: 2008-08-12 02:30
On Sat, Aug 9, 2008 at 11:34 AM, Matt Giuca <report@bugs.python.org> wrote:

>
> Matt Giuca <matt.giuca@gmail.com> added the comment:
>
> Bill, I had a look at your patch. I see you've decided to make
> quote_as_string the default? In that case, I don't know why you had to
> rewrite everything to implement the same basic behaviour as my patch.
>

Just to get things clear in my head.  I don't care what code is used, so
long as it doesn't break the Web.

> (My latest few patches support bytes both ways). Anyway, I have a lot of
> issues with your implementation.
>
> * Why did you replace all of the existing machinery? Particularly the
> way quote creates Quoter objects and stores them in a cache. I haven't
> done any speed tests, but I assume that was all there for performance
> reasons.

I see no real advantage there, except that it has a built-in memory leak.
Just use a function.

> * The idea of quote_as_bytes is malformed. quote_as_bytes takes a str or
> bytes, and outputs a URI as a bytes, while quote_as_string outputs a URI
> as a str. This is the first time in the whole discussion we've
> represented a URI as bytes, not a str. URIs are not byte sequences, they
> are character sequences (see discussion below).

Just there for consistency.  Of course,  URIs are actually strings of
*ASCII* characters, which sort of specifies an encoding right there, so
using ASCII octets isn't all that far-fetched...

> * The names unquote_as_* and quote_as_* are confusing. Use unquote_to_*
> and quote_from_* to avoid ambiguity.

Good point.

> * Are unquote_as_string and unquote both part of your public interface?

Yes.  Good programmers will switch to unquote_as_string or unquote_as_bytes
(or, with your amendment, unquote_to_string and unquote_to_bytes), while
unquote is just there for backwards compatibility for naifs.

> * As Antoine pointed out above, it's too limiting for quote to force
> UTF-8. Add a 'charset' parameter.

That's what quote_as_string is for.

> * Add an 'errors' parameter too, to give the caller control over how
> strict to be.

No.  Errors should be seen.  This is data, not whimsy.

>
> * unquote and unquote_plus are missing 'charset' param, which should be
> passed along to unquote_as_string.

No, they're only there for backwards compatibility, and the 2.x versions
don't have a charset param.

> * I do like the addition of a "plus" argument, as opposed to the
> separate unquote_plus and quote_plus functions. I'd swap the arguments
> to unquote around so charset is first and then plus, so you can write
> unquote(mystring, 'utf-8') without using a keyword argument.

Unquote doesn't take a charset (in my design).

>
> * In unquote: The "raw_unicode_escape" encoding makes no sense. It does
> exactly the same thing as Latin-1, except it also looks for b"\\uxxxx"
> in the string and converts that into a Unicode character. So your code
> behaves like this:
>
> >>> urllib.parse.unquote('%5Cu00fc')
> 'ü'
> (Should output "\u00fc")
> >>> urllib.parse.unquote('%5Cu')
> UnicodeDecodeError: 'rawunicodeescape' codec can't decode bytes in
> position 11-12: truncated \uXXXX
> (Should output "\u")
>
> I suspect the email package (where you got the inspiration to use
> 'rawunicodeescape') has this same crazy problem, but that isn't my
> concern today!
>
> Aside from this weirdness, you're essentially defaulting unquote to
> Latin-1. As I've said countless times, unquote needs to be the inverse
> of quote, or you get this behaviour:

> >>> urllib.parse.unquote(urllib.parse.quote('ü'))
> 'ü'
>
> Once again, I refer you to my favourite web server example.
>
> import http.server
> s = http.server.HTTPServer(('',8000),
>        http.server.SimpleHTTPRequestHandler)
> s.serve_forever()
>
> Run this in a directory with a non-Latin-1 filename (eg. "漢字"), and
> you will get a 404 when you click on the file.

Using unquote() and expecting a string is basically broken already.  So what
I'm trying to do here is to avoid losing data in the translation.

> * One issue I worked very hard to solve is how to deal with unescaped
> non-ASCII characters in unquote. Technically this is an invalid URI, so
> I'm not sure how important it is, but it's nice to be able to assume the
> unquote function won't mess with them. For example,
> unquote_as_string("\u6f22%C3%BC", charset="latin-1") should give
> "\u6f22\u00fc" (or at least it would be nice). Yours raises
> "UnicodeEncodeError: 'ascii' codec can't encode character". (I assume
> this is a wanted property, given that the existing test suite tests that
> unquote can handle ALL unescaped ASCII characters (that's what
> escape_string in test_unquoting is for) - I merely extend this concept
> to be able to handle all unescaped Unicode characters). Note that it's
> impossible to allow such lenience if you implement unquote_as_string as
> calling unquote_as_bytes and then decoding.

Good thing we don't need to; URIs consist of ASCII characters.

>
> * Question: How does unquote_bytes deal with unescaped characters?

Not sure I understand this question...

>
> * Needs a lot more test cases, and documentation for your changes. I
> suggest you plug my new test cases for urllib in and see if you can make
> your code pass all the things I test for (and if not, have a good reason).

Your test cases probably aren't testing things I feel it's necessary to
test. I'm interested in having the old test cases for urllib pass, as well
as providing the ability to unquote_to_bytes().

> In addition, a major problem I have is with this dangerous assumption
> that RFC 3986 specifies a byte->str encoding. You keep stating
> assumptions like this:
>

> > Remember that the RFC for percent-encoding really takes
> > bytes in, and produces bytes out.  The string-in and string-out
> > versions are to support naive programming (what a nice way of
> > putting it!).
>

Those aren't assumptions, they're simple statements of fact.

>
> You assume that my patch, the string version of quote/unquote, is a
> "hack" in order to satisfy the naive souls who only want to deal with
> strings, while your method is the "pure and correct" solution.

Yes.

>
> It is clear from reading RFC 3986 that URIs are characters, not octets
> (that's not debatable - as we see URIs written on billboards, they're
> definitely characters). What *is* debatable is whether the source data
> for a URI is a character or a byte. I'm arguing it can be either.
>

I wasn't talking about URIs (you're right, they're clearly sequences of
characters), I was talking about the process of percent-encoding, which
takes a sequence of octets, and produces an ASCII string, and the process of
percent-decoding, which takes that ASCII string, and produces that same
original sequence of octets.  Don't confuse the two issues.

> You seem to have arrived at the same outcome, but from a very different
> standpoint that your unquote_as_bytes is "correct" and unquote_as_string
> is a nasty hack

Nasty?  I suppose that's in the eye of the beholder.  No, I think that this
whole situation has come about because Python 1 and Python 2 had this
unfortunate ambiguity about what a "str" was.  urllib.unquote() has always
produced a sequence of bytes, but unfortunately too many careless
programmers have regarded it as a sequence of characters.  Unquote_as_string
simply (and, IMO, usefully) conflates two operations: percent-decode, and
string decode.  Non-careless programmers have always written something like:

   s = urllib.unquote(frag).decode("UTF-8", "strict")

That is, they understood what was going on, and handled it correctly.  I'd
suggest removing "unquote" completely, and leaving everyone porting to
Python 3 to make their own choice between unquote_to_bytes and
unquote_to_string.

> Bill:
> > Barring any other information, I think that the "segments" in the
> > path of an "http" URL must also be assumed to be binary; that is,
> > any octet is allowed, and no character set can be presumed.
>
> Once again, practicality beats purity. Segments in HTTP URLs usually
> refer to file paths. Files (in all modern operating systems) have Unicode
> filenames, not byte filenames. So you HAVE to assume the character set
> at some point. The best assumption we can make (as shown by all the
> major browsers' defaults) is UTF-8.
>

You're not suggesting we build something that works for 80% of the cases,
and breaks on the other 20%?  When we don't need to?

> > So this probably indicates a bug in urllib.parse:
> > urllib.parse.urlparse(<snip>)
> > should throw an exception somewhere.
>
> No, this is just assuming the behaviour of unquote (which is called by
> urlparse) to preserve unescaped characters as-is. So this "bug" is
> actually in your own implementation of unquote (and mine too). But I
> dispute that that's desirable behaviour. I think you SHOULD NOT have
> removed that test case. It is indicating a bug in your code (which is
> what it's there for).

This has nothing to do with unquote().  The octets in question are not
percent-encoded.

> Look, I'm sorry to be so harsh, but I don't see what this alternative
> patch accomplishes. My analysis probably comes off as self-righteous,
> but then, I've spent the past month reading documentation and testing
> every aspect of this issue rigorously, and all I see is you ignoring all
> my work and rewriting it in a less-robust way.

Sorry to be so harsh, but... wow, take some lessons in humility!  It's not
about you or your patch or how long you spend reading stuff, it's about
Python. And, without referring explicitly to your patch, it really doesn't
matter how long someone spends on something, if they get it wrong.  I think
you are conflating things in a way which will break stuff (and not break
other stuff, that's true).  We're arguing about which way of breaking things
is better, I think.

> Your point here was to make the str->bytes and bytes->str versions of
> unquote and quote, respectively, work, right? Well I've already got
> those working too. Please tell me what you think of my implementation so
> we can get it approved before the deadline.

Matt, your patch is not some God-given thing here.  There are lots of
patches waiting to be looked at, and this is a fairly trivial concern.  I'm
guessing there are lots of URIs out there that contain non-UTF-8 octets,
particularly in Japan, which has used ISO-2022-JP for years (though Antoine
advances an example of Latin-1 in URLs right here).  Be nice if someone who
worked at Google (hint!, hint!) looked through their massive collection of
URLs and validated this hypothesis, or not.
msg71043 - (view) Author: Bill Janssen (janssen) * (Python committer) Date: 2008-08-12 03:43
Some interesting notes here (from Erik van der Poel at Google; Guido,
you might want to stroll over to his location and talk with him):

http://lists.w3.org/Archives/Public/www-international/2007JanMar/0004.html

and more particularly

http://lists.w3.org/Archives/Public/www-international/2008AprJun/0092.html,
which says, in part,

``Within the context of HTML and HTTP, queries
[that is, the query part of a URL] don't have to say which
charset they are using, because there is already an agreement in
place: the major browsers and servers use the charset of the HTML.''

So, there's still a sizable number of Latin-1 pages out there, and
queries against these pages will use that encoding in the URL's they send.

And then there's this:

http://lists.w3.org/Archives/Public/www-international/2008AprJun/0014.html
msg71054 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-12 15:20
Bill, this debate is getting snippy, and going nowhere. We could argue
about what is the "pure" and "correct" thing to do, but we have a
limited time frame here, so I suggest we just look at the important facts.

1. There is an overwhelming consensus (including from me) that a
str->bytes version is acceptable to have in the library (whether or not
it's the "correct solution").
2. There is an overwhelming consensus (including from you) that a
str->str version is acceptable to have in the library (whether or not
it's the "correct solution").
3. By default, the str->str version breaks much less code, so both of us
decided to use it by default.

To this end, both of our patches:

1. Have a str->bytes version available.
2. Have a str->str version available.
3. Have "quote" and "unquote" functions call the str->str version.

So it seems we have agreed on that. Therefore, there should be no more
arguing about which is "more right".

So all your arguments seem to be essentially saying "the str->bytes
methods work perfectly; I don't care about if the str->str methods are
correct or not". The fact that your string versions quote UTF-8 and
unquote Latin-1 shows just how un-seriously you take the str->str methods.

Well the fact is that a) a great many users do NOT SHARE your ideals and
will default to using "quote" and "unquote" rather than the bytes
functions, and b) all of the rest of the library uses "quote" and
"unquote". So from a practical sense, how these methods behave is of the
utmost importance - they are more important than any new functions we
introduce at this point.

For example, the cgi.FieldStorage and the http.server modules will
implicitly call unquote and quote.

That means whether you, or I, or Guido, or The King Of The Internet
likes it or not, we have to have a "most reasonable" solution to the
problem of quoting and unquoting strings.

> Good thing we don't need to [handle unescaped non-ASCII characters in
> unquote]; URIs consist of ASCII characters.

Once again, practicality beats purity. I'd argue that it's a *good* (not
strictly required) idea to not mangle input unless we have to.

> > * Question: How does unquote_bytes deal with unescaped characters?

> Not sure I understand this question...

I meant unescaped non-ASCII characters, as discussed above (eg.
unquote_bytes('\u0123')).

> Your test cases probably aren't testing things I feel it's necessary
> to test. I'm interested in having the old test cases for urllib
> pass, as well as providing the ability to unquote_to_bytes().

I'm sorry, but you're missing the point of test-driven development. If
you think there is a bug, you don't just fix it and say "look, the old
test cases still pass!" You write new FAILING test cases to demonstrate
the bug. Then you change the code to make the test cases pass. All your
test suite proves is that you're happy with things the way they are.

> Matt, your patch is not some God-given thing here.

No, I am merely suggesting that it's had a great deal more thought put
into it -- not just my thought, but all the other people in the past
month who've suggested different approaches and brought up discussion
points. Including yourself -- it was your suggestion in the first place
to have the str->bytes functions, which I agree are important.

> > <snip> - Quote uses cache

> I see no real advantage there, except that it has a built-in
> memory leak. Just use a function.

Good point. Well the merits of using a cache are completely independent
from the behavioural aspects. I simply changed the existing code as
little as possible. Hence this patch will have the same performance
strengths/weaknesses as all previous versions, and the performance can
be tuned after 3.0 if necessary. (Not urgent).

On statistics about UTF-8 versus other encodings. Yes, I agree, there
are lots of URIs floating around out there, in many different encodings.
Unfortunately, we can't implicitly handle them all (and I'm talking once
more explicitly about the str->str transform here). We need to pick one
as the default. Whether Latin-1 is more popular than UTF-8 *for the time
being* is no good reason to pick Latin-1. It is called a "legacy
encoding" for a reason. It is being phased out and should NOT be
supported from here on in as the default encoding in a major web
programming language.

(Also there is no point in claiming to be "Unicode compliant" then
turning around and supporting a charset with 256 symbols by default).

Because Python's urllib will mostly be used in the context of building
web apps, it is up to the programmer to decide what encoding to use for
h(is|er) web app. For future apps, this should almost certainly be UTF-8
(if it isn't, the website won't be able to accept form input across all
characters, so isn't Unicode compliant anyway).

The problem you mention of browsers submitting URIs encoded based on the
charset is simply something we have to live with. A server will never be
able to deal with that unless the URIs are coming from pages which *it
served*. As this is very often the case, then as I said above, the app
should serve UTF-8 and there'll be no problems. Also note that ALL the
browsers I tested (FF/Saf/IE) use UTF-8 no matter what, if you directly
type Unicode characters into the address bar.
msg71055 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-12 15:26
By the way, what is the current status of this bug? Is anybody waiting
on me to do anything? (Re: Patch 9)

To recap my previous list of outstanding issues raised by the review:

> Should unquote accept a bytes/bytearray as well as a str?
Currently, does not. I think it's meaningless to do so (and how to
handle >127 bytes, if so?)

> Lib/email/utils.py:
> Should encode_rfc2231 with charset=None accept strings with non-ASCII
> characters, and just encode them to UTF-8?
Currently does. Suggestion to restrict to ASCII on the review tracker;
simple fix.

> Should quote raise a TypeError if given a bytes with encoding/errors
> arguments? (Motivation: TypeError is what you usually raise if you
> supply too many args to a function).
Resolved. Raises TypeError.

> Lib/urllib/parse.py:
> (As discussed above) Should quote accept safe characters outside the
> ASCII range (thereby potentially producing invalid URIs)?
Resolved? Implemented, but too messy and not worth it just to produce
invalid URIs, so NOT in patch.

That's only two very minor yes/no issues remaining. Please comment.
msg71057 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-12 16:04
I agree that given two similar patches, the one with more tests earns
some bonus points. Also, it seems to me that round-trippability of
quote()/unquote() is a logical and semantic requirement: in particular,
if there is a default encoding, it should be the same for both.

> For future apps, this should almost certainly be UTF-8
> (if it isn't, the website won't be able to accept form input across all
> characters, so isn't Unicode compliant anyway).

Actually, it will be able to accept such form input, as characters not
supported by the charset should be entity-encoded by the browser (e.g.
"&#123;").

I have no strong opinion on the very remaining points you listed, except
that IMHO encode_rfc2231 with charset=None should not try to use UTF8
by default. But someone with more mail protocol skills should comment :)
msg71064 - (view) Author: Bill Janssen (janssen) * (Python committer) Date: 2008-08-12 17:38
Larry Masinter is off on vacation, but I did get a brief message saying
that he will dig up similar discussions that he was involved in when he
gets back.

Out of curiosity, I sent a note off to the www-international mailing
list, and received this:

``For the authority (server name) portion of a URI, RFC 3986 is pretty
clear that UTF-8 must be used for non-ASCII values (assuming, for a
moment, that IDNA addresses are not Punycode encoded already). For the
path portion of URIs, a large-ish proportion of them are, indeed, UTF-8
encoded because that has been the de facto standard in Web browsers for
a number of years now. For the query and fragment parts, however, the
encoding is determined by context and often depends on the encoding of
some page that contains the form from which the data is taken. Thus, a
large number of URIs contain non-UTF-8 percent-encoded octets.''

http://lists.w3.org/Archives/Public/www-international/2008JulSep/0041.html
msg71065 - (view) Author: Bill Janssen (janssen) * (Python committer) Date: 2008-08-12 17:47
For Antoine:

I think the problem that Barry is facing with the email package is that
Unicode strings are an ambiguous representation of a sequence of bytes;
that is, there are a number of different byte sequences a Unicode string
may have come from.  His ingenious use of raw-unicode-escape is an
attempt to conform to the requirement of having to produce a string, but
without losing any data, so that an application program can, if it needs
to, still reprocess that string and retrieve the original data.  Naive
application programs that sort of expected the result to be an ASCII
string will be unaffected.  Not sure it's the best idea; this is all
about just where to force unexpected runtime failures.
msg71069 - (view) Author: Bill Janssen (janssen) * (Python committer) Date: 2008-08-12 19:37
Here's another thought:

Let's put string_to_bytes and string_from_bytes into the binascii
module, as a2b_percent and b2a_percent, respectively.

Then parse.py would import them as

  from binascii import a2b_percent as percent_decode_as_bytes
  from binascii import b2a_percent as percent_encode_from_bytes

and add two more functions:

  def percent_encode(<string>, encoding="UTF-8", error="strict", plus=False)
  def percent_decode(<string>, encoding="UTF-8", error="strict", plus=False)

and would add backwards-compatible but deprecated functions for quote
and unquote:

  import warnings

  def quote(s):
      warnings.warn("urllib.parse.quote should be replaced by "
                    "percent_encode or percent_encode_from_bytes",
                    PendingDeprecationWarning)
      if isinstance(s, str):
          return percent_encode(s)
      else:
          return percent_encode_from_bytes(s)

  def unquote(s):
      warnings.warn("urllib.parse.unquote should be replaced by "
                    "percent_decode or percent_decode_to_bytes",
                    PendingDeprecationWarning)
      if isinstance(s, str):
          return percent_decode(s)
      else:
          return percent_decode(str(s, "ASCII", "strict"))
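Usage under those names would look something like this (hypothetical, just
following the signatures sketched above):

  percent_encode('numéro', encoding='latin-1')     # -> 'num%E9ro'
  percent_decode('num%E9ro', encoding='latin-1')   # -> 'numéro'
  percent_decode('a+b', plus=True)                 # -> 'a b'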
msg71072 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-12 22:12
On Tuesday 12 August 2008 at 19:37 +0000, Bill Janssen wrote:
> Let's put string_to_bytes and string_from_bytes into the binascii
> module, as a2b_percent and b2a_percent, respectively.

Well, it's my personal opinion, but I think we should focus on a simple
and straightforward solution for the present issue before beta3 is
released (which is in 8 days now). It has already been difficult to find
a (quasi-)consensus for a simple patch to adapt quote()/unquote() to the
realities of bytes/unicode separation in py3k: witness the length of the
present discussion.

(perhaps a sophisticated solution could still be adopted for 3.1,
especially if it has backwards compatibility in mind)
msg71073 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-12 23:43
> Matt Giuca <matt.giuca@gmail.com> added the comment:
> By the way, what is the current status of this bug? Is anybody waiting
> on me to do anything? (Re: Patch 9)

I'll be reviewing it today or tomorrow. From looking at it briefly I
worry that the implementation is pretty slow -- a method call for each
character and a map() call sounds pretty bad.

> To recap my previous list of outstanding issues raised by the review:
>
>> Should unquote accept a bytes/bytearray as well as a str?
> Currently, does not. I think it's meaningless to do so (and how to
> handle >127 bytes, if so?)

The bytes > 127 would be translated as themselves; this follows
logically from how stuff is parsed -- %% and %FF are translated,
everything else is not. But I don't really care, I doubt there's a
need.

>> Lib/email/utils.py:
>> Should encode_rfc2231 with charset=None accept strings with non-ASCII
>> characters, and just encode them to UTF-8?
> Currently does. Suggestion to restrict to ASCII on the review tracker;
> simple fix.

I think I agree with that comment; it seems wrong to return UTF8
without setting that in the header. The alternative would be to
default charset to utf8 if there are any non-ASCII chars in the input.
I'd be okay with that too.

>> Should quote raise a TypeError if given a bytes with encoding/errors
>> arguments? (Motivation: TypeError is what you usually raise if you
>> supply too many args to a function).
> Resolved. Raises TypeError.
>
>> Lib/urllib/parse.py:
>> (As discussed above) Should quote accept safe characters outside the
>> ASCII range (thereby potentially producing invalid URIs)?
> Resolved? Implemented, but too messy and not worth it just to produce
> invalid URIs, so NOT in patch.

Agreed, safe should be ASCII chars only.

> That's only two very minor yes/no issues remaining. Please comment.

I believe patch 9 still has errors defaulting to strict for quote().
Weren't you going to change that?

Regarding using UTF-8 as the default encoding, I still think this the
right thing to do -- while the tables shown by Bill indicate that
there's still a lot of Latin-1 out there, UTF-8 is definitely gaining
on it, and I expect that Python apps, especially Py3k apps, are much
more likely to follow (and hopefully reinforce! :-) this trend than to
lag behind.
msg71082 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-13 14:25
> I have no strong opinion on the very remaining points you listed,
> except that IMHO encode_rfc2231 with charset=None should not try to
> use UTF8 by default. But someone with more mail protocol skills
> should comment :)

OK I've come to the realization that DEMANDING ascii (and erroring on
non-ASCII chars) is better for the short term anyway, because we can
always decide later to relax the restrictions, but it's a lot worse to
add restrictions later. So I agree now, should be ASCII. And no, I don't
have mail protocol skills.

The same goes for unquote accepting bytes. We can decide to make it
accept bytes later, but can't remove that feature later, so it's best
(IMHO) to let it NOT accept bytes (which is the current behaviour).

> The bytes > 127 would be translated as themselves; this follows
> logically from how stuff is parsed -- %% and %FF are translated,
> everything else is not. But I don't really care, I doubt there's a
> need.

Ah but what about unquote (to string)? If it accepted bytes then it
would be a bytes->str operation, and then you need a policy on DEcoding
those bytes. It makes things too complex I think.

> I believe patch 9 still has errors defaulting to strict for quote().
> Weren't you going to change that?

I raised it as a concern, but I thought you overruled on that, so I left
it as errors='strict'. What do you want it to be? 'replace'? Now that
this issue has been fully discussed, I'm happy with whatever you decide.

> From looking at it briefly I
> worry that the implementation is pretty slow -- a method call for each
> character and a map() call sounds pretty bad.

Yes, it does sound pretty bad. However, that's the current way of doing
things in both 2.x and 3.x; I didn't change it (though it looks like I
changed a LOT, I really did try to change as little as possible!)
Assuming it wasn't made _slower_ than before, can we ignore existing
performance issues and treat them as a separate matter (and can be dealt
with after 3.0)?

I'm not putting up a new patch now. The only fix I'd make is to add
Antoine's "or 'ascii'" to email/utils.py, as suggested on the review
tracker. I'll make this change along with any other recommendations
after your review.

(That is Lib/email/utils.py line 222 becomes:
s = urllib.parse.quote(s, safe='', encoding=charset or 'ascii')
)

btw this Rietveld is amazing. I'm assuming I don't have permission to
upload patches there (can't find any button to do so) which is why I
keep posting them here and letting you upload to Rietveld ...
msg71083 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-13 14:50
On Wed, Aug 13, 2008 at 7:25 AM, Matt Giuca <report@bugs.python.org> wrote:
>> I have no strong opinion on the very remaining points you listed,
>> except that IMHO encode_rfc2231 with charset=None should not try to
>> use UTF8 by default. But someone with more mail protocol skills
>> should comment :)
>
> OK I've come to the realization that DEMANDING ascii (and erroring on
> non-ASCII chars) is better for the short term anyway, because we can
> always decide later to relax the restrictions, but it's a lot worse to
> add restrictions later. So I agree now, should be ASCII. And no, I don't
> have mail protocol skills.

OK.

> The same goes for unquote accepting bytes. We can decide to make it
> accept bytes later, but can't remove that feature later, so it's best
> (IMHO) to let it NOT accept bytes (which is the current behaviour).

OK.

>> The bytes > 127 would be translated as themselves; this follows
>> logically from how stuff is parsed -- %% and %FF are translated,
>> everything else is not. But I don't really care, I doubt there's a
>> need.
>
> Ah but what about unquote (to string)? If it accepted bytes then it
> would be a bytes->str operation, and then you need a policy on DEcoding
> those bytes. It makes things too complex I think.

OK.

>> I believe patch 9 still has errors defaulting to strict for quote().
>> Weren't you going to change that?
>
> I raised it as a concern, but I thought you overruled me on that, so I
> left it as errors='strict'. What do you want it to be? 'replace'? Now that
> this issue has been fully discussed, I'm happy with whatever you decide.

I'm OK with replace for unquote(); your point that bogus data is
better than an exception is well taken, especially since there are
calls that the app can't control (like in cgi.py).

For quote() I think strict is better -- it can't fail anyway with
UTF-8, and if an app passes an explicit encoding it'd be pretty
stupid to pass a string that can't be converted with that encoding
(since it's presumably the app that generates both the string and the
encoding), so it's better to complain there, just like if they made the
encode() call themselves with only an encoding specified. This means
we have a useful analogy: quote(s, e) == quote(s.encode(e)).

>> From looking at it briefly I
>> worry that the implementation is pretty slow -- a method call for each
>> character and a map() call sounds pretty bad.
>
> Yes, it does sound pretty bad. However, that's the current way of doing
> things in both 2.x and 3.x; I didn't change it (though it looks like I
> changed a LOT, I really did try to change as little as possible!)

Actually, while the Quoter class (the immediate subject of my scorn)
was there before your patch in 3.0, it isn't there in 2.x; somebody
must have added it in 3.0 as part of the conversion to Unicode or
perhaps as part of the restructuring of urllib.py. The 2.x code maps
the __getitem__ of a dict over the string, which is much faster. I
think we can do much better than mapping a method call.

> Assuming it wasn't made _slower_ than before, can we ignore existing
> performance issues and treat them as a separate matter (which can be
> dealt with after 3.0)?

Now that you've spent so much time with this patch, can't you think
of a faster way of doing this? I wonder if mapping a defaultdict
wouldn't work.

> I'm not putting up a new patch now. The only fix I'd make is to add
> Antoine's "or 'ascii'" to email/utils.py, as suggested on the review
> tracker. I'll make this change along with any other recommendations
> after your review.
>
> (That is Lib/email/utils.py line 222 becomes:
> s = urllib.parse.quote(s, safe='', encoding=charset or 'ascii')
> )
>
> btw this Rietveld is amazing. I'm assuming I don't have permission to
> upload patches there (can't find any button to do so), which is why I
> keep posting them here and letting you upload to Rietveld ...

Thanks! You can't upload patches to the issue that *I* created, but a
better way would be to create a new issue and assign it to me for
review. That will work as long as you have a gmail account or a Google
Account. I highly recommend using the upload.py script, which you can
download from codereview.appspot.com/static/upload.py. (There's also a
link to it on the Create Issue page, at the bottom.)

I am hoping that in general we will be able to use Rietveld to review
patches instead of the bug tracker.
msg71084 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-13 15:08
> I'm OK with replace for unquote() ...
> For quote() I think strict is better 

There's just an odd inconsistency there, but it's only a tiny "gotcha",
and I agree with all your other arguments. I'll change unquote back to
errors='replace'.

> This means we have a useful analogy:
> quote(s, e) == quote(s.encode(e)).

That's exactly true, yes.

> Now that you've spent so much time with this patch, can't you think
> of a faster way of doing this?

Well firstly, you could replace Quoter (the class) with a "quoter"
function, which is nested inside quote. Would calling a nested function
be faster than a method call?

> I wonder if mapping a defaultdict wouldn't work.

That is a good idea. Then, the "function" (as I describe above) would be
just the inside of what currently is the except block, and that would be
the default_factory of the defaultdict. I think that should speed things up.

I'm very hazy about what is faster in the bytecode world of Python, and
wary of making a change and proclaiming "this is faster!" without doing
proper speed tests (which is why I think this optimisation could be
delayed until at least after the core interface changes are made). But
I'll have a go at that change tomorrow.

(I won't be able to work on this for up to 24 hours).
msg71085 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-13 16:02
Quoting Matt Giuca <report@bugs.python.org>:
>
> > Now that you've spent so  much time with this patch, can't you think
> > of a faster way of doing this?
>
> Well firstly, you could replace Quoter (the class) with a "quoter"
> function, which is nested inside quote. Would calling a nested function
> be faster than a method call?

The obvious speedup is to remove the map() call and do the loop inside
Quoter.__call__ instead. That way you don't have any function or method call in
the critical path.

(also, defining a class with a single __call__ method is not a common Python
idiom; usually you'd just have a function returning another (nested) function)

As for the defaultdict, here is what it can look like (this is on 2.5):

...  def __missing__(self, key):
...   print "__missing__", key
...   value = "%%%02X" % key
...   self[key] = value
...   return value
...
>>> d = D()
>>> d[66] = 'B'
>>> d[66]
'B'
>>> d[67]
__missing__ 67
'%43'
>>> d[67]
'%43'
msg71086 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-13 16:03
Quoting Antoine Pitrou <report@bugs.python.org>:
> As for the defaultdict, here is how it can look like (this is on 2.5):
>

(there should be a line here saying "class D(defaultdict)" :-))

> ...  def __missing__(self, key):
> ...   print "__missing__", key
> ...   value = "%%%02X" % key
> ...   self[key] = value
> ...   return value

cheers

Antoine.
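
Putting those two messages together, a complete, runnable version of the
sketch (Python 3 syntax; the print call is dropped):

from collections import defaultdict

class D(defaultdict):
    def __missing__(self, key):
        # Computed once per missing key; the result is cached in the
        # dict, so later lookups are plain C-speed __getitem__ calls.
        value = '%%%02X' % key
        self[key] = value
        return value

d = D()
d[66] = 'B'
assert d[66] == 'B'    # pre-seeded entry
assert d[67] == '%43'  # computed and cached on first access
assert d[67] == '%43'  # second lookup hits the cache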
msg71088 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-13 16:41
>> Now that you've spent so much time with this patch, can't you think
>> of a faster way of doing this?
>
> Well firstly, you could replace Quoter (the class) with a "quoter"
> function, which is nested inside quote. Would calling a nested function
> be faster than a method call?

Yes, but barely.

>> I wonder if mapping a defaultdict wouldn't work.
>
> That is a good idea. Then, the "function" (as I describe above) would be
> just the inside of what currently is the except block, and that would be
> the default_factory of the defaultdict. I think that should speed things up.

Yes, it would be tremendously faster, since the method would be called
only once per byte value (for each value of 'safe'), and if that byte
is repeated in the input, further occurrences will use the __getitem__
function of the defaultdict, which is implemented in C.

> I'm very hazy about what is faster in the bytecode world of Python, and
> wary of making a change and proclaiming "this is faster!" without doing
> proper speed tests (which is why I think this optimisation could be
> delayed until at least after the core interface changes are made).

That's very wise. But a first-order approximation of the speed of
something is often "how many functions/methods implemented in Python
(i.e. with def or lambda) does it call?"

> But I'll have a go at that change tomorrow.
>
> (I won't be able to work on this for up to 24 hours).

That's fine, as long as we have closure before beta3, which is next Wednesday.
msg71089 - (view) Author: Bill Janssen (janssen) * (Python committer) Date: 2008-08-13 16:56
Feel free to take the function implementation from my patch, if it speeds
things up (and it should).

Bill

msg71090 - (view) Author: Bill Janssen (janssen) * (Python committer) Date: 2008-08-13 17:05
Erik van der Poel at Google has now chimed in with stats on current URL
usage:

``...the bottom line is that escaped non-utf-8 is still quite prevalent,
enough (in my opinion) to require an implementation in Python, possibly
even allowing for different encodings in the path and query parts (e.g.
utf-8 path and gb2312 query).''

http://lists.w3.org/Archives/Public/www-international/2008JulSep/0042.html

I think it's worth remembering that a very large proportion of the use
of Python's urllib.unquote() is in implementations of Web server
frameworks of one sort or another.  We can't control what the browsers
that talk to such frameworks produce; the IETF doesn't control that,
either.  In this case, "practicality beats purity" is the clarion call
of the browser designers, and we'd better be able to support them.
msg71091 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-13 17:17
> In this case, "practicality beats purity" is the clarion call
> of the browser designers, and we'd better be able to support them.

I think we're supporting these sufficiently by allowing developers to
override the encoding and errors value. I see no argument here against
having a default encoding of UTF-8.
msg71092 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-13 17:51
On Wednesday 13 August 2008 at 17:05 +0000, Bill Janssen wrote:
> I think it's worth remembering that a very large proportion of the use
> of Python's urllib.unquote() is in implementations of Web server
> frameworks of one sort or another.  We can't control what the browsers
> that talk to such frameworks produce;

Yes, we do. Browsers will use whatever charset is specified in the HTML
for the query part; and, as for the path part, they shouldn't produce it
themselves; they just follow a link which should already be
percent-quoted in the HTML.

(URL rewriting at the HTTP server level can make this more complicated,
since it can turn a query fragment into a path fragment or vice-versa;
however, most modern frameworks alleviate the need for such rewriting,
since they allow specifying flexible mapping rules at the framework
level)

The situation in which we can't control the encoding is when getting the
URLs from third-party content (e.g. some Web page which we didn't produce
ourselves, or some link in an e-mail). But in those cases there are fewer
use cases for unquoting the URL rather than using it as-is. The only time
I've wanted to unquote such a URL was to do some processing of HTTP
referrers in order to extract which search queries had led people to
visit a Web site.
msg71096 - (view) Author: Bill Janssen (janssen) * (Python committer) Date: 2008-08-13 19:49
> Yes, we do. Browsers will use whatever charset is specified in the HTML
> for the query part; and, as for the path part, they shouldn't produce it
> themselves; they just follow a link which should already be
> percent-quoted in the HTML.

Sure. What I meant was that we don't control what the browsers do; we
just go along with what they do. That is, we try to play along with the
default understanding that's developed between the "consenting pairs" of
Apache/Firefox or ASP/IE.
msg71121 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-14 11:35
Ah, cheers Antoine, for the tip on using defaultdict (I was confused as
to how I could access the key just by passing default_factory, as the
manual suggests).
msg71124 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-14 12:18
OK, I implemented the defaultdict solution. I got curious, so I ran some
rough speed tests using the following code.

import random, urllib.parse
for i in range(100000):
    s = ''.join(chr(random.randint(0, 0x10ffff)) for _ in range(50))
    quoted = urllib.parse.quote(s)

Time to quote 100,000 random strings of 50 characters.
(Ran each test twice, worst case printed)

HEAD, chars in range(0,0x110000): 1m44.80s
HEAD, chars in range(0,256): 25.0s
patch9, chars in range(0,0x110000): 35.3s
patch9, chars in range(0,256): 27.4s
New, chars in range(0,0x110000): 31.4s
New, chars in range(0,256): 25.3s

Head is the current Py3k head. Patch 9 is my previous patch (before
implementing defaultdict), and New is after implementing defaultdict.

Interesting. Defaultdict didn't really make much of an improvement. You
can see the big help the cache itself makes, though (my code caches all
chars, whereas the HEAD just caches ASCII chars, which is why HEAD is so
slow on the full repertoire test). Other than that, differences are
fairly negligible.

However, I'll keep the defaultdict code; I quite like it, speedy or not
(it is slightly faster).
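
For reference, a minimal sketch of what a defaultdict-based Quoter can
look like (the name quote_bytes and the exact safe set here are
illustrative, not the committed code):

from collections import defaultdict
import string

_ALWAYS_SAFE = frozenset(
    (string.ascii_letters + string.digits + '_.-~').encode('ascii'))

class Quoter(defaultdict):
    # Maps a byte value (an int) to its quoted form, caching each result.
    def __init__(self, safe):
        self.safe = _ALWAYS_SAFE | frozenset(safe)

    def __missing__(self, b):
        # Safe bytes map to themselves; everything else is percent-encoded.
        res = chr(b) if b in self.safe else '%{:02X}'.format(b)
        self[b] = res
        return res

def quote_bytes(bs, safe=b''):
    quoter = Quoter(safe)
    return ''.join(map(quoter.__getitem__, bs))

assert quote_bytes('a\u00e9'.encode('utf-8')) == 'a%C3%A9'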
msg71126 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-14 13:07
Hello Matt,

> OK, I implemented the defaultdict solution. I got curious, so I ran some
> rough speed tests using the following code.
>
> import random, urllib.parse
> for i in range(100000):
>     s = ''.join(chr(random.randint(0, 0x10ffff)) for _ in range(50))
>     quoted = urllib.parse.quote(s)

I think if you move the line defining "s" out of the loop, relative timings
should change quite a bit. Chances are that the random functions are not very
fast, since they are written in pure Python.
Or you can create an inner loop around the call to quote(), for example to
repeat it 100 times.

cheers

Antoine.
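
Following that advice, a sketch of a fairer benchmark (hypothetical code,
not from the patch; the 0xD7FF cap simply avoids lone surrogates, which
UTF-8 cannot encode):

import random, timeit, urllib.parse

# Generate the test data once, outside the timed code, so the timing
# measures quote() itself rather than the pure-Python random generator.
random.seed(0)
samples = [''.join(chr(random.randint(0, 0xD7FF)) for _ in range(50))
           for _ in range(1000)]

def bench():
    for s in samples:
        urllib.parse.quote(s)

print(timeit.timeit(bench, number=100))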
msg71130 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-14 15:27
New patch (patch10). Details on Rietveld review tracker
(http://codereview.appspot.com/2827).

Another update on the remaining "outstanding issues":

Resolved issues since last time:

> Should unquote accept a bytes/bytearray as well as a str?
No. But see below.

> Lib/email/utils.py:
> Should encode_rfc2231 with charset=None accept strings with non-ASCII
> characters, and just encode them to UTF-8?
Implemented Antoine's fix ("or 'ascii'").

> Should quote accept safe characters outside the
> ASCII range (thereby potentially producing invalid URIs)?
No.

New issues:

unquote_to_bytes doesn't cope well with non-ASCII characters (it
currently encodes them as UTF-8; not a lot we can do, since this is a
str->bytes operation). However, we can allow it to accept a bytes object
as input (while unquote does not), and then it preserves the bytes
precisely.
Discussion at http://codereview.appspot.com/2827/diff/82/84, line 265.

I have *implemented* that suggestion - so unquote_to_bytes now accepts
either a bytes or str, while unquote accepts only a str. No changes need
to be made unless there is disagreement on that decision.

I also emailed Barry Warsaw about the email/utils.py patch (because we
weren't sure exactly what that code was doing). However, I'm sure that
this patch isn't breaking anything there, because I call unquote with
encoding="latin-1", which is the same behaviour as the current head.

That's all the issues I have left over in this patch.

Attaching patch 10 (for revision 65675).

Commit log for patch 10:

Fix for issue 3300.

urllib.parse.unquote:
  Added "encoding" and "errors" optional arguments, allowing the caller
  to determine the decoding of percent-encoded octets.
  As per RFC 3986, default is "utf-8" (previously implicitly decoded
  as ISO-8859-1).
  Fixed a bug in which mixed-case hex digits (such as "%aF") weren't
  being decoded at all.

urllib.parse.quote:
  Added "encoding" and "errors" optional arguments, allowing the
  caller to determine the encoding of non-ASCII characters
  before being percent-encoded.
  Default is "utf-8" (previously characters in range(128, 256)
  were encoded as ISO-8859-1, and characters above that as UTF-8).
  Characters/bytes above 128 are no longer allowed to be "safe".
  Now allows either bytes or strings.
  Optimised "Quoter"; now inherits defaultdict.

Added functions urllib.parse.quote_from_bytes,
urllib.parse.unquote_to_bytes.
All quote/unquote functions now exported from the module.

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface, added quote_from_bytes and unquote_to_bytes.

Lib/test/test_urllib.py: Added many new test cases testing encoding
and decoding Unicode strings with various encodings, as well as testing
the new functions.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the email
module is dependent upon).
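
To make the new interface concrete, a few doctest-style examples of the
behaviour this commit log describes (a sketch against the committed API):

import urllib.parse

# str -> str; non-ASCII characters are encoded as UTF-8 by default
assert urllib.parse.quote('/el ni\u00f1o/', safe='/') == '/el%20ni%C3%B1o/'
assert urllib.parse.unquote('%CE%A3') == '\u03a3'

# explicit encodings are honoured on both sides
assert urllib.parse.quote('\u00e9', encoding='latin-1') == '%E9'
assert urllib.parse.unquote('%E9', encoding='latin-1') == '\u00e9'

# the bytes-oriented variants skip character encoding entirely
assert urllib.parse.quote_from_bytes(b'a&\xef', safe='&') == 'a&%EF'
assert urllib.parse.unquote_to_bytes('%CE%A3') == b'\xce\xa3'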
msg71131 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-14 15:31
Antoine:
> I think if you move the line defining "str" out of the loop, relative
> timings should change quite a bit. Chances are that the random
> functions are not very fast, since they are written in pure Python.

Well, I wanted to throw lots of different URIs at it, to test the
caching behaviour. You're right though; probably only a small % of the
time is spent actually calling quote.

Oh well, the defaultdict implementation is in patch10 :) It cleans
Quoter up somewhat, so it's a good thing anyway. Thanks for your
help.
msg71332 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-18 14:27
Hi,

Sorry to bump this, but you (Guido) said you wanted this closed by
Wednesday. Is this patch committable yet? (There are no more unresolved
issues that I am aware of).
msg71356 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-18 18:32
Looking into this now.  Will make sure it's included in beta3.
msg71387 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-18 21:45
Checked in patch 10 with minor style changes as r65838.

Thanks Matt for persevering!  Thanks everyone else for contributing;
this has been quite educational.
msg71530 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-20 09:56
There's an unquote()-related failure in #3613.
msg71533 - (view) Author: Matt Giuca (mgiuca) Date: 2008-08-20 10:03
Thanks for pointing that out, Antoine. I just commented on that bug.
History
Date | User | Action | Args
2022-04-11 14:56:36 | admin | set | github: 47550
2008-08-20 10:03:34 | mgiuca | set | messages: + msg71533
2008-08-20 09:56:19 | pitrou | set | messages: + msg71530
2008-08-18 21:45:16 | gvanrossum | set | status: open -> closed; resolution: accepted; messages: + msg71387
2008-08-18 18:32:38 | gvanrossum | set | priority: release blocker; assignee: gvanrossum; messages: + msg71356
2008-08-18 14:27:50 | mgiuca | set | messages: + msg71332
2008-08-14 15:31:39 | mgiuca | set | messages: + msg71131
2008-08-14 15:27:27 | mgiuca | set | files: + parse.py.patch10; messages: + msg71130
2008-08-14 13:07:23 | pitrou | set | messages: + msg71126
2008-08-14 12:18:49 | mgiuca | set | messages: + msg71124
2008-08-14 11:35:12 | mgiuca | set | messages: + msg71121
2008-08-13 20:12:00 | gvanrossum | set | files: - unnamed
2008-08-13 20:11:53 | gvanrossum | set | files: - unnamed
2008-08-13 19:49:51 | janssen | set | files: + unnamed; messages: + msg71096
2008-08-13 17:51:36 | pitrou | set | messages: + msg71092
2008-08-13 17:17:20 | gvanrossum | set | messages: + msg71091
2008-08-13 17:05:20 | janssen | set | messages: + msg71090
2008-08-13 16:56:12 | janssen | set | files: + unnamed; messages: + msg71089
2008-08-13 16:41:43 | gvanrossum | set | messages: + msg71088
2008-08-13 16:03:26 | pitrou | set | messages: + msg71086
2008-08-13 16:02:06 | pitrou | set | messages: + msg71085
2008-08-13 15:09:00 | mgiuca | set | messages: + msg71084
2008-08-13 14:50:30 | gvanrossum | set | messages: + msg71083
2008-08-13 14:28:56 | lemburg | set | nosy: - lemburg
2008-08-13 14:25:23 | mgiuca | set | messages: + msg71082
2008-08-12 23:43:48 | gvanrossum | set | messages: + msg71073
2008-08-12 22:12:51 | pitrou | set | messages: + msg71072
2008-08-12 19:37:40 | janssen | set | messages: + msg71069
2008-08-12 17:47:34 | janssen | set | messages: + msg71065
2008-08-12 17:38:39 | janssen | set | messages: + msg71064
2008-08-12 16:04:38 | pitrou | set | messages: + msg71057
2008-08-12 15:26:41 | mgiuca | set | messages: + msg71055
2008-08-12 15:20:13 | mgiuca | set | messages: + msg71054
2008-08-12 03:43:37 | janssen | set | messages: + msg71043
2008-08-12 03:36:29 | janssen | set | files: - unnamed
2008-08-12 02:30:54 | janssen | set | files: + unnamed; messages: + msg71042
2008-08-10 11:25:34 | mgiuca | set | messages: + msg70970
2008-08-10 10:49:52 | pitrou | set | messages: + msg70969
2008-08-10 07:45:17 | mgiuca | set | files: + parse.py.patch9; messages: + msg70965
2008-08-10 07:05:47 | mgiuca | set | files: + parse.py.patch8+allsafe; messages: + msg70962
2008-08-10 05:05:08 | mgiuca | set | messages: + msg70959
2008-08-10 03:09:53 | mgiuca | set | messages: + msg70958
2008-08-10 00:35:10 | jimjjewett | set | messages: + msg70955
2008-08-09 18:34:23 | mgiuca | set | messages: + msg70949
2008-08-08 21:28:26 | janssen | set | files: - patch
2008-08-08 21:28:11 | janssen | set | files: + patch; messages: + msg70913
2008-08-08 02:07:02 | janssen | set | messages: + msg70880
2008-08-08 02:04:32 | janssen | set | messages: + msg70879
2008-08-08 01:34:39 | janssen | set | files: - unnamed
2008-08-08 01:34:31 | janssen | set | files: - unnamed
2008-08-08 01:34:04 | janssen | set | messages: + msg70878
2008-08-08 00:18:54 | janssen | set | files: + unnamed; messages: + msg70872
2008-08-07 23:23:48 | gvanrossum | set | messages: + msg70869
2008-08-07 22:58:21 | janssen | set | files: + unnamed; messages: + msg70868
2008-08-07 21:46:23 | gvanrossum | set | messages: + msg70862
2008-08-07 21:43:25 | lemburg | set | nosy: + lemburg; messages: + msg70861
2008-08-07 21:17:08 | janssen | set | messages: + msg70858
2008-08-07 20:49:25 | janssen | set | messages: + msg70855
2008-08-07 16:59:53 | gvanrossum | set | messages: + msg70840
2008-08-07 15:03:00 | mgiuca | set | files: + parse.py.metapatch8; messages: + msg70834
2008-08-07 14:59:58 | mgiuca | set | files: + parse.py.patch8; messages: + msg70833
2008-08-07 14:35:18 | mgiuca | set | messages: + msg70830
2008-08-07 14:20:05 | pitrou | set | messages: + msg70828
2008-08-07 13:42:47 | mgiuca | set | messages: + msg70824
2008-08-07 12:11:16 | mgiuca | set | messages: + msg70818
2008-08-06 22:25:37 | jimjjewett | set | messages: + msg70807
2008-08-06 21:39:38 | gvanrossum | set | messages: + msg70806
2008-08-06 21:03:11 | jimjjewett | set | messages: + msg70804
2008-08-06 19:51:24 | gvanrossum | set | nosy: + gvanrossum; messages: + msg70800
2008-08-06 18:39:42 | pitrou | set | nosy: + pitrou; messages: + msg70793
2008-08-06 17:29:30 | jimjjewett | set | nosy: + jimjjewett; messages: + msg70791
2008-08-06 16:51:26 | janssen | set | files: + patch; messages: + msg70788
2008-08-06 16:49:39 | janssen | set | files: - myunquote.py
2008-08-06 05:59:43 | janssen | set | files: + myunquote.py; nosy: + janssen; messages: + msg70771
2008-07-31 14:49:51 | mgiuca | set | files: + parse.py.patch7; messages: + msg70512
2008-07-31 11:27:36 | mgiuca | set | files: + parse.py.patch6; messages: + msg70497
2008-07-12 17:05:56 | mgiuca | set | files: + parse.py.patch5; messages: + msg69591
2008-07-12 11:32:53 | mgiuca | set | files: + parse.py.patch4; messages: + msg69583
2008-07-11 07:18:47 | mgiuca | set | messages: + msg69537
2008-07-11 07:03:25 | mgiuca | set | messages: + msg69535
2008-07-10 20:55:42 | loewis | set | messages: + msg69519
2008-07-10 16:13:03 | mgiuca | set | messages: + msg69508; versions: + Python 3.0, - Python 3.1
2008-07-10 01:55:00 | mgiuca | set | files: + parse.py.patch3; messages: + msg69493
2008-07-09 20:28:32 | loewis | set | messages: + msg69485
2008-07-09 20:26:41 | loewis | set | versions: + Python 3.1, - Python 3.0
2008-07-09 15:51:51 | mgiuca | set | files: + parse.py.patch2; messages: + msg69473
2008-07-09 15:20:49 | thomaspinckney3 | set | nosy: + thomaspinckney3; messages: + msg69472
2008-07-09 03:03:37 | orsenthil | set | nosy: + orsenthil
2008-07-07 01:45:25 | mgiuca | set | messages: + msg69366
2008-07-06 16:54:35 | loewis | set | nosy: + loewis; messages: + msg69339
2008-07-06 14:52:09 | mgiuca | create