Issue 3300: urllib.quote and unquote - Unicode issues

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/47550

classification

Title:	urllib.quote and unquote - Unicode issues
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 3.0

process

Status:	closed	Resolution:	accepted
Dependencies:		Superseder:
Assigned To:	gvanrossum	Nosy List:	gvanrossum, janssen, jimjjewett, loewis, mgiuca, orsenthil, pitrou, thomaspinckney3
Priority:	release blocker	Keywords:	patch

Created on 2008-07-06 14:52 by mgiuca, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
parse.py.patch	mgiuca, 2008-07-06 14:52	(obsolete) Patch fixing all three issues; commit log in comment
parse.py.patch2	mgiuca, 2008-07-09 15:51	(obsolete) Second patch (supersedes parse.py.patch); commit log in comment
parse.py.patch3	mgiuca, 2008-07-10 01:54	(obsolete) Third patch (supersedes parse.py.patch2); commit log in comment
parse.py.patch4	mgiuca, 2008-07-12 11:32	(obsolete) Fourth patch (supersedes parse.py.patch3); commit log in comment
parse.py.patch5	mgiuca, 2008-07-12 17:05	(obsolete) Fifth patch (supersedes parse.py.patch4); commit log in comment
parse.py.patch6	mgiuca, 2008-07-31 11:27	(obsolete) Sixth patch (supersedes parse.py.patch5); commit log in comment
parse.py.patch7	mgiuca, 2008-07-31 14:49	(obsolete) Seventh patch (supersedes parse.py.patch6); commit log in comment
parse.py.patch8	mgiuca, 2008-08-07 14:59	(obsolete) Eighth patch (supersedes parse.py.patch7); commit log in comment
parse.py.metapatch8	mgiuca, 2008-08-07 15:02	Diff between patch7 and patch8 (result of review).
patch	janssen, 2008-08-08 21:28
parse.py.patch8+allsafe	mgiuca, 2008-08-10 07:05	Patch8, and quote allows all characters in 'safe'
parse.py.patch9	mgiuca, 2008-08-10 07:45	Ninth patch (supersedes parse.py.patch8); commit log in comment for patch 8
parse.py.patch10	mgiuca, 2008-08-14 15:27	Tenth patch (supersedes parse.py.patch9); commit log in comment

Messages (80)
msg69333 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-07-06 14:52
Three Unicode-related problems with urllib.parse.quote and urllib.parse.unquote in Python 3.0. (Patch attached). Firstly, unquote appears not to have been modified from Python 2, where it is designed to output a byte string. In Python 3, it outputs a unicode string, implicitly decoded as ISO-8859-1 (the code points are the same as the bytes). RFC 3986 states that the percent-encoded byte values should be decoded as UTF-8. http://tools.ietf.org/html/rfc3986 section 2.5. Current behaviour: >>> urllib.parse.unquote("%CE%A3") 'Î£' (or '\u00ce\u00a3') Desired behaviour: >>> urllib.parse.unquote("%CE%A3") 'Σ' (or '\u03a3') Secondly, while quote has been modified to encode to UTF-8 before percent-encoding, it does not work correctly for characters in range(128, 256), due to a special case in the code which again treats the code point values as byte values. Current behaviour: >>> urllib.parse.quote('\u00e9') '%E9' Desired behaviour: >>> urllib.parse.quote('\u00e9') '%C3%A9' Note that currently, quoting characters less than 256 will use ISO-8859-1, while quoting characters 256 or higher will use UTF-8! Thirdly, the "safe" argument to quote does not work for characters above 256, since these are excluded from the special case. I thought I would fix this at the same time, but it's really a separate issue. Current behaviour: >>> urllib.parse.quote('Σϰ', safe='Σ') '%CE%A3%CF%B0' Desired behaviour: >>> urllib.parse.quote('Σϰ', safe='Σ') 'Σ%CF%B0' A patch which fixes all three issues is attached. Note that unquote now needs to handle the case where the UTF-8 sequence is invalid. This is currently handled by "replace" (invalid sequences are replaced by '\ufffd'). I would like to add an optional "errors" argument to unquote, defaulting to "replace", to allow the user to override this behaviour, but I didn't put that in because it would change the interface. Note I also changed one of the test cases, which had the wrong expected output. (String literal was manually UTF-8 encoded, designed for Python 2; nonsensical when viewed as a Python 3 Unicode string). All urllib test cases pass. Patch is for branch /branches/py3k, revision 64752. Note: The above unquote issue also manifests itself in Python 2 for Unicode strings, but it's hazy as to what the behaviour should be, and would break existing programs, so I'm just patching the Py3k branch. Commit log: urllib.parse.unquote: Fixed percent-encoded octets being implicitly decoded as ISO-8859-1; now decode as UTF-8, as per RFC 3986. urllib.parse.quote: Fixed characters in range(128, 256) being implicitly encoded as ISO-8859-1; now encode as UTF-8. Also fixed characters greater than 256 not responding to "safe", and also not being cached. Lib/test/test_urllib.py: Updated one test case for unquote which expected the wrong output. The new version of unquote passes the new test case.
msg69339 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-07-06 16:54
> RFC 3986 states that the percent-encoded byte > values should be decoded as UTF-8. Where precisely do you read such a SHOULD requirement? Section 2.5 elaborates that the local encoding (of the resource) is typically used, ignoring cases where URIs are constructed on the client system (such scenario is simply ignored in the RFC). The last paragraph in section 2.5 is the only place that seems to imply a SHOULD requirement (although it doesn't use the keyword); this paragraph only talks about new URI schemes. Unfortunately, for http, the encoding is of characters is unspecified (this is somewhat solved by the introduction of IRIs).
msg69366 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-07-07 01:45
Point taken. But the RFC certainly doesn't say that ISO-8859-1 should be used. Since we're outputting a Unicode string in Python 3, we need to decode with some encoding, and UTF-8 seems the most sensible and standardised. (Even the existing test case in test_urllib.py:466 uses a UTF-8-encoded URL, and I had to fix it so it decodes into a meaningful string). Having said that, it's possible that you may wish to use another encoding, and legal to do so. Therefore, I'd suggest we add an "encoding" argument to both quote and unquote, which defaults to "utf-8". Note that in the current implementation, unquote is not an inverse of quote, because quote uses UTF-8 to encode characters with code points >= 256, while unquote decodes them as ISO-8859-1. I think it's important these two functions are inverses of each other.
msg69472 - (view)	Author: Tom Pinckney (thomaspinckney3) *	Date: 2008-07-09 15:20
I mentioned this is in a brief python-dev discussion earlier this spring, but many popular websites such as Wikipedia and Facebook do use UTF-8 as their character encoding scheme for the path and argument portion of URLs. I know there's no RFC that says this is what should be done, but in order to make urllib work out-of-the-box on as many common websites as possible, I think defaulting to UTF-8 decoding makes a lot of sense. Possibly allow an option charset argument to be passed into quote and unquote, but default to UTF-8 in the absence of an explicit character set being passed in?
msg69473 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-07-09 15:51
OK I've gone back over the patch and decided to add the "encoding" and "errors" arguments from the str.encode/decode methods as optional arguments to quote and unquote. This is a much bigger change than I originally intended, but I think it makes things much better because we'll get UTF-8 by default (which as far as I can tell is by far the most common encoding). (Tom Pinckney just made the same suggestion right as I'm typing this up!) So my new patch is a bit more extensive, and changes the interface (in a backwards-compatible way). Both quote and unquote now support "encoding" and "errors" arguments, defaulting to "utf-8" and "replace", respectively. Implementation detail: This changes the Quoter class a lot; it now hashes four fields to ensure it doesn't use the wrong cache. Also fixed an issue with the previous patch where non-ASCII-compatible encodings broke for code points < 128. I then ran the full test suite and discovered two other modules test cases broke. I've fixed them so the full suite passes, but I'm suspicious there may be more issues (see below). * Lib/test/test_http_cookiejar.py: A test case was written explicitly expecting Latin-1 encoding. I've changed this test case to expect UTF-8. * Lib/email/utils.py: I extensively analysed this code and discovered that it kind of "cheats" - it uses the Latin-1 encoding and treats it as octets, then applies its own encoding scheme. So to fix this, I changed the email module to call quote and unquote with encoding="latin-1". Hence it has the same behaviour as before. Some potential issues: * I have not updated the documentation yet. If this idea is to go ahead, the docs will need to show these new optional arguments. (I'll do that myself but haven't yet). * While the full test suite passes, I'm sure there will be many more issues since I've changed the interface. Therefore I don't recommend this patch is accepted just yet. I plan to do an investigation into all uses (within the standard lib) of quote and unquote to see if there are any other compatibility issues, particularly within urllib. Hence I'll respond to this again in a few days. * The new patch to "safe" argument of quote allows non-ASCII characters to be made safe. This correspondingly allows the construction of URIs with non-ASCII characters. Is it better to allow users to do this if they really want, or just mysteriously fail to let those characters through? I would also like to have a separate pair of functions, unquote_raw and quote_raw, which work on bytes objects instead of strings. (unquote_raw would take a str and produce a bytes, while quote_raw would take a bytes and produce a str). As URI encoding is fundamentally an octet encoding, not a character encoding, this is the only way to do URI encoding without choosing a Unicode character encoding. (I see some modules such as "email" treating the implicit Latin-1 encoding as byte encoding, which is a bit dodgy - they could benefit from raw functions). But as that requires further changes to the interface, I'll save it for another day. Patch (parse.py.patch2) is for branch /branches/py3k, revision 64820. Commit log: urllib.parse.unquote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the decoding of percent-encoded octets (previously implicitly decoded as ISO-8859-1). As per RFC 3986, default is "utf-8". urllib.parse.quote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the encoding of non-ASCII characters before being percent-encoded (previously characters in range(128, 256) were encoded as ISO-8859-1, and characters above that as UTF-8). Also fixed characters greater than 256 not responding to "safe", and also not being cached. Lib/test/test_urllib.py, Lib/test/test_http_cookiejar.py: Updated test cases which expected output in ISO-8859-1, now expects UTF-8. Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote with encoding="latin-1", to preserve existing behaviour (which the whole email module is dependent upon).
msg69485 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-07-09 20:28
Assuming the patch is acceptable in the first place (which I personally have not made my mind up), then it lacks documentation and test suite changes.
msg69493 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-07-10 01:54
OK well here are the necessary changes to the documentation (RST docs and docstrings in the code). As I said above, I plan to to extensive testing and add new cases, and I don't recommend this patch is accepted until that's done. Patch (parse.py.patch3) is for branch /branches/py3k, revision 64834. Commit log: urllib.parse.unquote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the decoding of percent-encoded octets (previously implicitly decoded as ISO-8859-1). As per RFC 3986, default is "utf-8". urllib.parse.quote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the encoding of non-ASCII characters before being percent-encoded (previously characters in range(128, 256) were encoded as ISO-8859-1, and characters above that as UTF-8). Also fixed characters greater than 256 not responding to "safe", and also not being cached. Doc/library/urllib.parse.rst: Updated docs on quote and unquote to reflect new interface. Lib/test/test_urllib.py, Lib/test/test_http_cookiejar.py: Updated test cases which expected output in ISO-8859-1, now expects UTF-8. Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote with encoding="latin-1", to preserve existing behaviour (which the whole email module is dependent upon).
msg69508 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-07-10 16:13
Setting Version back to Python 3.0. Is there a reason it was set to Python 3.1? This proposal will certainly break a lot of code. It's far better to do it in the big backwards-incompatible Python 3.0 release than a later release.
msg69519 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-07-10 20:55
> Setting Version back to Python 3.0. Is there a reason it was set to > Python 3.1? 3.0b1 has been released, so no new features can be added to 3.0.
msg69535 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-07-11 07:03
> 3.0b1 has been released, so no new features can be added to 3.0. While my proposal is no doubt going to cause a lot of code breakage, I hardly consider it a "new feature". This is very definitely a bug. As I understand it, the point of a code freeze is to stop the addition of features which could be added to a later version. Realistically, there is no way this issue can be fixed after 3.0 is released, as it necessarily involves changing the behaviour of this function. Perhaps I should explain further why this is a regression from Python 2.x and not a feature request. In Python 2.x, with byte strings, the encoding is not an issue. quote and unquote simply encode bytes, and if you want to use Unicode you have complete control. In Python 3.0, with Unicode strings, if functions manipulate string objects, you don't have control over the encoding unless the functions give you explicit control. So Python 3.0's native Unicode strings have broken the library. I give two examples. Firstly, I believe that unquote(quote(x)) should always be true for all strings x. In Python 2.x, this is always trivially true (for non-Unicode strings), because they simply encode and decode the octets. In Python 3.0, the two functions are inconsistent, and break out of the range(0, 256). >>> urllib.parse.unquote(urllib.parse.quote('ÿ')) # '\u00ff' 'ÿ' # Works, because both functions work with ISO-8859-1 in this range. >>> urllib.parse.unquote(urllib.parse.quote('Ā')) # '\u0100' 'Ä\x80' # Fails, because quote uses UTF-8 and unquote uses ISO-8859-1. My patch succeeds for all characters. >>> urllib.parse.unquote(urllib.parse.quote('Ā')) # '\u0100' 'Ā' Secondly, a bigger example, but I want to demonstrate how this bug affects web applications, even very simple ones. Consider this simple (beginnings of a) wiki system in Python 2.5, as a CGI app: #--- import cgi fields = cgi.FieldStorage() title = fields.getfirst('title') print("Content-Type: text/html; charset=utf-8") print("") print('<p>Debug: %s</p>' % repr(title)) if title is None: print("No article selected") else: print('<p>Information about %s.</p>' % cgi.escape(title)) #--- (Place this in cgi-bin, navigate to it, and add the query string "?title=Page Title"). I'll use the page titled "Mátt" as a test case. If you navigate to "?title=Mátt", it displays the text "Debug: 'M\xc3\xa1tt'. Information about Mátt.". The browser (at least Firefox, Safari and IE I have tested) encodes this as "?title=M%C3%A1tt". So this is trivial, as it's just being unquoted into a raw byte string 'M\xc3\xa1tt', then written out again as a byte string. Now consider that you want to manipulate it as a Unicode string, still in Python 2.5. You could augment the program to decode it as UTF-8 and then re-encode it. (I wrote a simple UTF-8 printing function which takes Unicode strings as input). #--- import sys import cgi def printu8(args): """Prints to stdout encoding as utf-8, rather than the current terminal encoding. (Not a fully-featured print function).""" sys.stdout.write(' '.join([x.encode('utf-8') for x in args])) sys.stdout.write('\n') fields = cgi.FieldStorage() title = fields.getfirst('title') if title is not None: title = str(title).decode("utf-8", "replace") print("Content-Type: text/html; charset=utf-8") print("") print('<p>Debug: %s.</p>' % repr(title)) if title is None: print("No article selected.") else: printu8('<p>Information about %s.</p>' % cgi.escape(title)) #--- Now given the same input ("?title=Mátt"), it displays "Debug: u'M\xe1tt'. Information about Mátt." Still working fine, and I can manipulate it as Unicode because in Python 2.x I have direct control over encoding/decoding. Now let us upgrade this program to Python 3.0. (Note that I still can't print Unicode characters directly out, because running through Apache the stdout encoding is not UTF-8, so I use my printu8 function). #--- import sys import cgi def printu8(args): """Prints to stdout encoding as utf-8, rather than the current terminal encoding. (Not a fully-featured print function).""" sys.stdout.buffer.write(b' '.join([x.encode('utf-8') for x in args])) sys.stdout.buffer.write(b'\n') fields = cgi.FieldStorage() title = fields.getfirst('title') # Note: No call to decode. I have no opportunity to specify the encoding since # it comes straight out of FieldStorage as a Unicode string. print("Content-Type: text/html; charset=utf-8") print("") print('<p>Debug: %s.</p>' % ascii(title)) if title is None: print("No article selected.") else: printu8('<p>Information about %s.</p>' % cgi.escape(title)) #--- Now given the same input ("?title=Mátt"), it displays "Debug: 'M\xc3\xa1tt'. Information about MÃ¡tt." Once again, it is erroneously (and implicitly) decoded as ISO-8859-1, so I end up with a meaningless Unicode string. The only possible thing I can do about this as a web developer is call title.encode('latin-1').decode('utf-8') - a dreadful hack. With my patch applied, the input ("?title=Mátt") produces the output "Debug: 'M\xe1tt'. Information about Mátt." Basically, this bug is going to affect all web developers as soon as someone types a non-ASCII character. You could argue that supporting UTF-8 by default is no better than supporting Latin-1 by default, but it is. UTF-8 supports encoding of all characters where Latin-1 does not, UTF-8 is the recommended URI encoding by both the URI Syntax RFC[1] and the W3C HTML 4.01 specification[2], and all major browsers use it to encode non-ASCII characters in URIs. My patch may not be the best, or most conservative, solution to this problem. I'm happy to see other proposals. But it's clearly an important bug to fix, if I can't even write the simplest web app I can think of without having to use a kludgey hack to get the string decoded correctly. What is the point of having nice clean Unicode strings in the language if the library spits out the wrong characters and it requires more work to fix them than it used to with byte strings? [1] http://tools.ietf.org/html/rfc3986#section-2.5 [2] http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1
msg69537 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-07-11 07:18
Since I got a complaint that my last reply was too long, I'll summarize it. It's a bug report, not a feature request. I can't get a simple web app to be properly Unicode-aware in Python 3, which worked fine in Python 2. This cannot be put off until 3.1, as any viable solution will break existing code.
msg69583 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-07-12 11:32
OK I spent awhile writing test cases for quote and unquote, encoding and decoding various Unicode strings with different encodings. As a result, I found a bunch of issues in my previous patch, so I've rewritten the patches to both quote and unquote. They're both actually more similar to the original version now. I'd be interested in hearing if anyone disagrees with my expected output for these test cases. I'm now confident I have good test coverage directly on the quote and unquote functions. However, I haven't tested the other library functions which depend upon them (though the entire test suite passes). Though as I showed in that big post I made yesterday, other modules such as cgi seem to be working fine (their behaviour has changed; they use UTF-8 now; but that's the whole point of this patch). I still haven't figured out what the behaviour of "safe" should be in quote. Should it only allow ASCII characters (thereby limiting the output to an ASCII string, as specified by RFC 3986)? Should it also allow Latin-1 characters, or all Unicode characters as well (perhaps allowing you to create IRIs -- admittedly I don't know much about IRIs). The new implementation of quote makes it rather difficult to allow non-Latin-1 characters to be made "safe", as it encodes the string into bytes before any processing. Patch (parse.py.patch4) is for branch /branches/py3k, revision 64891. Commit log: urllib.parse.unquote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the decoding of percent-encoded octets. As per RFC 3986, default is "utf-8" (previously implicitly decoded as ISO-8859-1). urllib.parse.quote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the encoding of non-ASCII characters before being percent-encoded. Default is "utf-8" (previously characters in range(128, 256) were encoded as ISO-8859-1, and characters above that as UTF-8). Also characters above 128 are no longer allowed to be "safe". Doc/library/urllib.parse.rst: Updated docs on quote and unquote to reflect new interface. Lib/test/test_urllib.py: Added several new test cases testing encoding and decoding Unicode strings with various encodings. This includes updating one test case to now expect UTF-8 by default. Lib/test/test_http_cookiejar.py: Updated test case which expected output in ISO-8859-1, now expects UTF-8. Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote with encoding="latin-1", to preserve existing behaviour (which the whole email module is dependent upon).
msg69591 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-07-12 17:05
So today I grepped for "urllib" in the entire library in an effort to track down every dependency on quote and unquote to see exactly how my patch breaks other code. I've now investigated every module in the library which uses quote, unquote or urlencode, and my findings are documented below in detail. So far I have found no code "breakage" except for the original email.util issue I fixed in patch 2. Of course that doesn't mean the behaviour hasn't changed. Nearly all modules in the report below have changed their behaviour so they used to deal with Latin-1-encoded URLs and now deal with UTF-8-encoded URLs. As discussed at length above, I see this as a positive change, since nearly everybody encodes URLs in UTF-8, and of course it allows for all characters. I also point out that the http.server module (unpatched) is internally broken when dealing with filenames with characters outside range(0,256); my patch fixes it. I'm attaching patch 5, which adds a bunch of new test cases to various modules which demonstrate those modules correctly handling UTF-8-encoded URLs. It also fixes a bug in email.utils which I introduced in patch 2. Note that I haven't yet fully investigated urllib.request. Aside from that, the only remaining matter is whether or not it's better to encode URLs as UTF-8 or Latin-1 by default, and I'm pretty sure that question doesn't need debate. So basically I think if there's support for it, this patch is just about ready to be accepted. I'm hoping it can be included in the 3.0b2 release next week. I'd be glad to hear any feedback about this proposal. Not Yet Investigated -------------------- ./urllib/request.py By far the biggest user of quote and unquote. username, password, hostname and paths are now all converted to/from UTF-8 percent-encodings. Other concerns are: * Data in the form application/x-www-form-urlencoded * FTP access I think this needs to be tested further. Looks fine, not tested ---------------------- ./xmlrpc/client.py Just used to decode URI auth string (user:pass). This will change to UTF-8, but is probably OK. ./logging/handlers.py Just uses it in the HTTP handler to encode a dictionary. Probably preferable to use UTF-8 to encode an arbitrary string. ./macurl2path.py Calls to urllib look broken. Not tested. Tested manually, fine --------------------- ./wsgiref/simple_server.py Just used to set PATH_INFO, fine if URLs are UTF-8 encoded. ./http/server.py All uses are for translating between actual file-system paths to URLs. This works fine for UTF-8 URLs. Note that since it uses quote to create URLs in a dir listing, and unquote to handle them, it breaks when unquote is not the inverse of quote. Consider the following simple script: import http.server s = http.server.HTTPServer(('',8000), http.server.SimpleHTTPRequestHandler) s.serve_forever() This will "kind of" work in the unpatched version, using Latin-1 URLs, but filenames with characters above 256 will break (give a 404 error). The patch fixes this. ./urllib/robotparser.py No test cases. Manually tested, URLs properly match when percent-encoded in UTF-8. ./nturl2path.py No test cases available. Manually tested, fine if URLs are UTF-8 encoded. Test cases either exist or added, fine -------------------------------------- ./test/test_urllib.py I wrote a large wad of test cases for all the new functionality. ./wsgiref/util.py Added test cases expecting UTF-8. ./http/cookiejar.py I changed a test case to expect UTF-8. ./email/utils.py I changed this file to behave as it used to, to satisfy its existing test cases. ./cgi.py Added test cases for UTF-8-encoded query strings. Commit log: urllib.parse.unquote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the decoding of percent-encoded octets. As per RFC 3986, default is "utf-8" (previously implicitly decoded as ISO-8859-1). urllib.parse.quote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the encoding of non-ASCII characters before being percent-encoded. Default is "utf-8" (previously characters in range(128, 256) were encoded as ISO-8859-1, and characters above that as UTF-8). Also characters above 128 are no longer allowed to be "safe". Doc/library/urllib.parse.rst: Updated docs on quote and unquote to reflect new interface. Lib/test/test_urllib.py: Added several new test cases testing encoding and decoding Unicode strings with various encodings. This includes updating one test case to now expect UTF-8 by default. Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py, Lib/test/test_wsgiref.py: Updated and added test cases to deal with UTF-8-encoded URIs. Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote with encoding="latin-1", to preserve existing behaviour (which the whole email module is dependent upon).
msg70497 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-07-31 11:27
OK after a long discussion on the mailing list, Guido gave this the OK, with the provision that there are str->bytes and bytes->str versions of these functions as well. So I've written those. http://mail.python.org/pipermail/python-dev/2008-July/081601.html quote itself now accepts either a str or a bytes. quote_from_bytes is a new function which is just an alias for quote. (Is this acceptable?) unquote is still str->str. I've added a totally separate function unquote_to_bytes which is str->bytes. Note there is a slight issue here: I didn't quite know what to do with unescaped non-ASCII characters in the input to unquote_to_bytes - they need to somehow be converted to bytes. I chose to encode them using UTF-8, on the basis that they technically shouldn't be in a URI anyway. Note that my new unquote doesn't have this problem; it's carefully written to preserve the Unicode characters, even if they aren't expressible in the given encoding (which explains some of the code bloat). This makes unquote(s, encoding=e) necessarily more robust than unquote_to_bytes(s).decode(e) in terms of unescaped non-ASCII characters in the input. I've also added new test cases and documentation for these two new functions (included in patch6). On an entirely personal note, can whoever checks this in please mention my name in the commit log - I've put in at least 30 hours researching and writing this patch, and I'd like for this not to go uncredited :) Commit log for patch6: Fix for issue 3300. urllib.parse.unquote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the decoding of percent-encoded octets. As per RFC 3986, default is "utf-8" (previously implicitly decoded as ISO-8859-1). urllib.parse.quote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the encoding of non-ASCII characters before being percent-encoded. Default is "utf-8" (previously characters in range(128, 256) were encoded as ISO-8859-1, and characters above that as UTF-8). Also characters/bytes above 128 are no longer allowed to be "safe". Also now allows either bytes or strings. Added functions urllib.parse.quote_from_bytes, urllib.parse.unquote_to_bytes. Doc/library/urllib.parse.rst: Updated docs on quote and unquote to reflect new interface, added quote_from_bytes and unquote_to_bytes. Lib/test/test_urllib.py: Added several new test cases testing encoding and decoding Unicode strings with various encodings, as well as testing the new functions. Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py, Lib/test/test_wsgiref.py: Updated and added test cases to deal with UTF-8-encoded URIs. Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote with encoding="latin-1", to preserve existing behaviour (which the whole email module is dependent upon).
msg70512 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-07-31 14:49
Hmm ... seems patch 6 I just checked in fails a test case! Sorry! (It's minor, gives a harmless BytesWarning if you run with -b, which "make test" does, so I only picked it up after submitting). I've slightly changed the code in quote so it doesn't do that any more (it normalises all "safe" arguments to bytes). Please review patch 7, not 6. Same commit log as above. (Also .. someone let me know if I'm not submitting patches properly, like perhaps I should be deleting the old ones not keeping them around?)
msg70771 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-08-06 05:59
Here's my version of how quote and unquote should be implemented in Python 3.0. I haven't looked at the uses of it in the library, but I'd expect improper uses (and there are lots of them) will break, and thus can be fixed. Basically, percent-quoting is about creating an ASCII string that can be safely used in URI from an arbitrary sequence of octets. So, my version of quote() takes either a byte sequence or a string, and percent-quotes the unsafe ones, and then returns a str. If a str is supplied on input, it is first converted to UTF-8, then the octets of that encoding are percent-quoted. For unquote, there's no way to tell what the octets of the quoted sequence may mean, so this takes the percent-quoted ASCII string, and returns a byte sequence with the unquoted bytes. For convenience, since the unquoted bytes are often a string in some particular character set encoding, I've also supplied unquote_as_string(), which takes an optional character set, and first unquotes the bytes, then converts them to a str, using that character set encoding, and returns the resulting string.
msg70788 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-08-06 16:51
Here's a patch to parse.py (and test/test_urllib.py) that makes the various tests (cgi, urllib, httplib) pass. It basically adds "unquote_as_string", "unquote_as_bytes", "quote_as_string", "quote_as_bytes", and then define the existing "quote" and "unquote" in terms of them.
msg70791 - (view)	Author: Jim Jewett (jimjjewett)	Date: 2008-08-06 17:29
Is there still disagreement over anything except: (1) The type signature of quote and unquote (as opposed to the explicit "quote_as_bytes" or "quote_as string"). (2) The default encoding (latin-1 vs UTF8), and (if UTF-8) what to do with invalid byte sequences? (3) Would waiting for 3.1 cause too many compatibility problems?
msg70793 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-08-06 18:39
Bill, I haven't studied your patch in detail but a few comments: - it would be nice to have more unit tests, especially for the various bytes/unicode possibilities, and perhaps also roundtripping (Matt's patch has a lot of tests) - quote_as_bytes() should return a bytes object, not a bytearray - using the "%02X" format looks clearer to me than going through the _hextable lookup table... - when the argument is of the wrong type, quote_as_bytes() should raise a TypeError rather than a ValueError - why is quote_as_string() hardwired to utf8 while unquote_as_string() provides a charset parameter? wouldn't it be better for them to be consistent with each other?
msg70800 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2008-08-06 19:51
Bill Janssen's "patch" breaks two unittests: test_email and test_http_cookiejar. Details for test_email: ====================================================================== ERROR: test_rfc2231_bad_character_in_filename (email.test.test_email.TestRFC2231) . . . File "/usr/local/google/home/guido/python/py3k/Lib/urllib/parse.py", line 267, in unquote_as_string return str(unquote_as_bytes(s, plus=plus), charset, 'strict') UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 13: unexpected end of data ====================================================================== FAIL: test_rfc2231_bad_character_in_charset (email.test.test_email.TestRFC2231) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/local/google/home/guido/python/py3k/Lib/email/test/test_email.py", line 3279, in test_rfc2231_bad_character_in_charset self.assertEqual(msg.get_content_charset(), None) AssertionError: 'utf-8\\u201d' != None Details for test_http_cookiejar: ====================================================================== FAIL: test_url_encoding (__main__.LWPCookieTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "Lib/test/test_http_cookiejar.py", line 1454, in test_url_encoding self.assert_("foo=bar" in cookie and version_re.search(cookie)) AssertionError: None
msg70804 - (view)	Author: Jim Jewett (jimjjewett)	Date: 2008-08-06 21:03
Matt pointed out that the email package assumes Latin-1 rather than UTF-8; I assume Bill could patch his patch the same way Matt did, and this would resolve the email tests. (Unless you pronounce to stick with Latin-1) The cookiejar failure probably has the same root cause; that test is encoding (non-ASCII) Latin-1 characters, and urllib.parse.py/Quoter assumes Latin-1. So I see some evidence (probably not enough) for sticking with Latin-1 instead of UTF-8. But I don't see any evidence that fixing the semantics (encoded results should be bytes) at the same time made the conversion any more painful. On the other hand, Matt shows that some of those extra str->byte code changes might never need to be done at all, except for purity.
msg70806 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2008-08-06 21:39
Dear GvR, New code review comments by GvR have been published. Please go to http://codereview.appspot.com/2827 to read them. Message: Hi Matt, Here's a code review of your patch. I'm leaning more and more towards wanting this for 3.0, but I have some API design issues and also some style nits. I'm cross-linking this with the Python tracker issue, through the subject. Details: http://codereview.appspot.com/2827/diff/1/2 File Doc/library/urllib.parse.rst (right): http://codereview.appspot.com/2827/diff/1/2#newcode198 Line 198: replaced by a placeholder character. I don't think that's a good default. I'd rather see it default to strict -- that's what encoding translates to everywhere else. I believe that lenient error handling by default can cause subtle security problems too, by hiding problem characters from validation code. http://codereview.appspot.com/2827/diff/1/2#newcode215 Line 215: An alias for :func:`quote`, intended for use with a :class:`bytes` object I'd rather see this as a wrapper that raises TypeError if the argument isn't a bytes or bytearray instance. Otherwise it's needless redundancy. http://codereview.appspot.com/2827/diff/1/2#newcode223 Line 223: Replace ``%xx`` escapes by their single-character equivalent. Should add what the argument type is -- I vote for str or bytes/bytearray. http://codereview.appspot.com/2827/diff/1/2#newcode242 Line 242: .. function:: unquote_to_bytes(string) Again, add what the argument type is. http://codereview.appspot.com/2827/diff/1/4 File Lib/email/utils.py (right): http://codereview.appspot.com/2827/diff/1/4#newcode224 Line 224: except: An unqualified except clause is unacceptable here. Why do you need this anyway? http://codereview.appspot.com/2827/diff/1/5 File Lib/test/test_http_cookiejar.py (right): http://codereview.appspot.com/2827/diff/1/5#newcode1450 Line 1450: "%3c%3c%0Anew%C3%A5/%C3%A5", I'm guessing this test broke otherwise? Given that this references an RFC, is it correct to just fix it this way? http://codereview.appspot.com/2827/diff/1/3 File Lib/urllib/parse.py (right): http://codereview.appspot.com/2827/diff/1/3#newcode10 Line 10: "urlsplit", "urlunsplit"] Please add all the quote/unquote versions here too. (They were there in 2.5, but somehow got dropped from 3.0. http://codereview.appspot.com/2827/diff/1/3#newcode265 Line 265: # Maps lowercase and uppercase variants (but not mixed case). That sounds like a disaster. Why would %aa and %AA be correct but not %aA and %Aa? (Even though the old code had the same problem.) http://codereview.appspot.com/2827/diff/1/3#newcode283 Line 283: def unquote(s, encoding = "utf-8", errors = "replace"): Please no spaces around the '=' when used for an argument default (or for a keyword arg). Also see my comment about defaulting to 'replace' in the doc file. Finally -- let's be consistent about quotes. It seems most of this file uses single quotes, so lets stick to that (except docstrings always use double quotes). And more: what should a None value for encoding or errors mean? IMO it should mean "use the default". http://codereview.appspot.com/2827/diff/1/3#newcode382 Line 382: safe = safe.encode('ascii', 'ignore') Using errors='ignore' seems like a mistake -- it will hide errors. I also wonder why safe should be limited to ASCII though. http://codereview.appspot.com/2827/diff/1/3#newcode399 Line 399: if ' ' in s: This test means that it won't work if the input is bytes. E.g. urllib.parse.quote_plus(b"abc def") raises a TypeError. Sincerely, Your friendly code review daemon (http://codereview.appspot.com/).
msg70807 - (view)	Author: Jim Jewett (jimjjewett)	Date: 2008-08-06 22:25
> http://codereview.appspot.com/2827/diff/1/5#newcode1450 > Line 1450: "%3c%3c%0Anew%C3%A5/%C3%A5", > I'm guessing this test broke otherwise? Yes; that is one of the breakages you found in Bill's patch. (He didn't modify the test.) > Given that this references an RFC, > is it correct to just fix it this way? Probably. Looking at http://www.faqs.org/rfcs/rfc2965.html (1) That is not among the exact tests in the RFC. (2) The RFC does not specify charset for the cookie in general, but the Comment field MUST be in UTF-8, and the only other reference I could find to a specific charset was "possibly in a server-selected printable ASCII encoding." Whether we have to use Latin-1 (or document charset) in practice for compatibility reasons, I don't know.
msg70818 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-07 12:11
Dear GvR, New code review comments by mgiuca have been published. Please go to http://codereview.appspot.com/2827 to read them. Message: Hi Guido, Thanks very much for this very detailed review. I've replied to the comments. I will make the changes as described below and send a new patch to the tracker.
msg70824 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-07 13:42
A reply to a point on GvR's review, I'd like to open for discussion. This relates to whether or not quote's "safe" argument should allow non-ASCII characters. > Using errors='ignore' seems like a mistake -- it will hide errors. I > also wonder why safe should be limited to ASCII though. The reasoning is this: if we allow non-ASCII characters to be escaped, then we allow quote to generate invalid URIs (URIs are only allowed to have ASCII characters). It's one thing for unquote to accept such URIs, but I think we shouldn't be producing them. Albeit, it only produces an invalid URI if you explicitly request it. So I'm happy to make the change to allow any character to be safe, but I'll let it go to discussion first.
msg70828 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-08-07 14:20
Le jeudi 07 août 2008 à 13:42 +0000, Matt Giuca a écrit : > The reasoning is this: if we allow non-ASCII characters to be escaped, > then we allow quote to generate invalid URIs (URIs are only allowed to > have ASCII characters). It's one thing for unquote to accept such URIs, > but I think we shouldn't be producing them. Albeit, it only produces an > invalid URI if you explicitly request it. So I'm happy to make the > change to allow any character to be safe, but I'll let it go to > discussion first. The important is that the defaults are safe. If users want to override the defaults and produce potentially invalid URIs, there is no reason to discourage them.
msg70830 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-07 14:35
> The important is that the defaults are safe. If users want to override > the defaults and produce potentially invalid URIs, there is no reason to > discourage them. OK I think that's a fairly valid argument. I'm about to head off so I'll post the patch I have now, which fixes most of the other concerns. That change will cause havoc to quote I think ;)
msg70833 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-07 14:59
Following Guido and Antoine's reviews, I've written a new patch which fixes most of the issues raised. The ones I didn't fix I have noted below, and commented on the review site (http://codereview.appspot.com/2827/). Note: I intend to address all of these issues after some discussion. Outstanding issues raised by the reviews: Doc/library/urllib.parse.rst: Should unquote accept a bytes/bytearray as well as a str? Lib/email/utils.py: Should encode_rfc2231 with charset=None accept strings with non-ASCII characters, and just encode them to UTF-8? Lib/test/test_http_cookiejar.py: Does RFC 2965 let me get away with changing the test case to expect UTF-8? (I'm pretty sure it doesn't care what encoding is used). Lib/test/test_urllib.py: Should quote raise a TypeError if given a bytes with encoding/errors arguments? (Motivation: TypeError is what you usually raise if you supply too many args to a function). Lib/urllib/parse.py: (As discussed above) Should quote accept safe characters outside the ASCII range (thereby potentially producing invalid URIs)? ------ Commit log for patch8: Fix for issue 3300. urllib.parse.unquote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the decoding of percent-encoded octets. As per RFC 3986, default is "utf-8" (previously implicitly decoded as ISO-8859-1). Also fixed a bug in which mixed-case hex digits (such as "%aF") weren't being decoded at all. urllib.parse.quote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the encoding of non-ASCII characters before being percent-encoded. Default is "utf-8" (previously characters in range(128, 256) were encoded as ISO-8859-1, and characters above that as UTF-8). Also characters/bytes above 128 are no longer allowed to be "safe". Also now allows either bytes or strings. Added functions urllib.parse.quote_from_bytes, urllib.parse.unquote_to_bytes. All quote/unquote functions now exported from the module. Doc/library/urllib.parse.rst: Updated docs on quote and unquote to reflect new interface, added quote_from_bytes and unquote_to_bytes. Lib/test/test_urllib.py: Added many new test cases testing encoding and decoding Unicode strings with various encodings, as well as testing the new functions. Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py, Lib/test/test_wsgiref.py: Updated and added test cases to deal with UTF-8-encoded URIs. Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote with encoding="latin-1", to preserve existing behaviour (which the whole email module is dependent upon).
msg70834 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-07 15:02
I'm also attaching a "metapatch" - diff from patch 7 to patch 8. This is to give a rough idea of what I changed since the review. (Sorry - This is actually a diff between the two patches, so it's pretty hard to read. It would have been nicer to diff the files themselves but I'm not doing local commits so that's hard. Can one use the Bazaar mirror for development, or is it too out-of-date?)
msg70840 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2008-08-07 16:59
On Thu, Aug 7, 2008 at 8:03 AM, Matt Giuca <report@bugs.python.org> wrote: > > Matt Giuca <matt.giuca@gmail.com> added the comment: > > I'm also attaching a "metapatch" - diff from patch 7 to patch 8. This is > to give a rough idea of what I changed since the review. > > (Sorry - This is actually a diff between the two patches, so it's pretty > hard to read. It would have been nicer to diff the files themselves but > I'm not doing local commits so that's hard. Can one use the Bazaar > mirror for development, or is it too out-of-date?) A much better way to do this would be to use Rietveld; it can show deltas between patch sets that are actually readable, and it keeps the inline comments. I've uploaded your patch #8 to the issue I created there, and you can see the delta from patch7 starting here: http://codereview.appspot.com/2827/diff2/1:17/27
msg70855 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-08-07 20:49
Just to reply to Antoine's comments on my patch: - it would be nice to have more unit tests, especially for the various bytes/unicode possibilities, and perhaps also roundtripping (Matt's patch has a lot of tests) Yes, I completely agree. - quote_as_bytes() should return a bytes object, not a bytearray Good point. - using the "%02X" format looks clearer to me than going through the _hextable lookup table... Really? I see it the other way. - when the argument is of the wrong type, quote_as_bytes() should raise a TypeError rather than a ValueError Good point. - why is quote_as_string() hardwired to utf8 while unquote_as_string() provides a charset parameter? wouldn't it be better for them to be consistent with each other? To encourage the use of UTF-8. The caller can alway explicitly encode to some other character set, then pass in the bytes that result from that encoding. Remember that the RFC for percent-encoding really takes bytes in, and produces bytes out. The string-in and string-out versions are to support naive programming (what a nice way of putting it!).
msg70858 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-08-07 21:17
My main fear with this patch is that "unquote" will become seen as unreliable, because naive software trying to parse URLs will encounter uses of percent-encoding where the encoded octets are not in fact UTF-8 bytes. They're just some set of bytes. A secondary concern is that it will invisibly produce invalid data, because it decodes some non-UTF-8-encoded string that happens to only use UTF-8-valid sequences as the wrong string value. Now, I have to confess that I don't know how common these use cases are in actual URL usage. It would be nice if there was some organization that had a large collection of URLs, and could provide a test set we could run a scanner over :-). As a workaround, though, I've sent a message off to Larry Masinter to ask about this case. He's one of the authors of the URI spec.
msg70861 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2008-08-07 21:43
On 2008-08-07 23:17, Bill Janssen wrote: > Bill Janssen <bill.janssen@gmail.com> added the comment: > > My main fear with this patch is that "unquote" will become seen as > unreliable, because naive software trying to parse URLs will encounter > uses of percent-encoding where the encoded octets are not in fact UTF-8 > bytes. They're just some set of bytes. unquote will have to be able to deal with old-style URLs that use the Latin-1 encoding. HTML uses (or used to use) the Latin-1 encoding as default and that's how URLs ended up using it as well: http://www.w3schools.com/TAGS/ref_urlencode.asp I'd suggest to have it first try UTF-8 decoding and then fall back to Latin-1 decoding. > A secondary concern is that it > will invisibly produce invalid data, because it decodes some > non-UTF-8-encoded string that happens to only use UTF-8-valid sequences > as the wrong string value. It's rather unlikely that someone will have used a Latin-1 encoded URL which happens to decode as valid UTF-8: The valid UTF-8 combinations don't really make any sense when used as text, e.g. Ã?Ã¶Ã1/4 > Now, I have to confess that I don't know how common these use cases are > in actual URL usage. It would be nice if there was some organization > that had a large collection of URLs, and could provide a test set we > could run a scanner over :-). > > As a workaround, though, I've sent a message off to Larry Masinter to > ask about this case. He's one of the authors of the URI spec.
msg70862 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2008-08-07 21:46
On Thu, Aug 7, 2008 at 2:17 PM, Bill Janssen <report@bugs.python.org> wrote: > > Bill Janssen <bill.janssen@gmail.com> added the comment: > > My main fear with this patch is that "unquote" will become seen as > unreliable, because naive software trying to parse URLs will encounter > uses of percent-encoding where the encoded octets are not in fact UTF-8 > bytes. They're just some set of bytes. Apps that want to handle these correctly have no choice but to use unquote_to_bytes(), or setting error='ignore' or error='replace'. Your original proposal was to make unquote() behave like unquote_to_bytes(), which would require changes to virtually every app using unqote(), since almost all apps assume the result is a (text) string. Setting the default encoding to Latin-1 would prevent these errors, but would commit the sin of mojibake (the Japanese word for Perl code :-). I don't like that much either. A middle ground might be to set the default encoding to ASCII -- that's closer to Martin's claim that URLs are supposed to be ASCII only. It will fail as soon as an app receives encoded non-ASCII text. This will require many apps to be changed, but at least it forces the developers to think about which encoding to assume (perhaps there's one handy in the request headers if it's a web app) or about error handling or perhaps using unquote_to_bytes(). However I fear that this middle ground will in practice cause: (a) more in-the-field failures, since devs are notorious for testing with ASCII only; and (b) the creation of a recipe for "fixing" unquote() calls that fail by setting the encoding to UTF-8 without thinking about the alternatives, thereby effectively recreating the UTF-8 default with much more pain. Therefore I think that the UTF-8 default is probably the most pragmatic choice. In the code review, I have asked Matt to change the default error handling from errors='replace' to errors='strict'. I suppose we could reduce outright crashes in the field by setting this to 'replace' (even though for quote() I think it should remain 'strict'). But this may cause more subtle failures, where apps simply receive garbage data. At least when you're serving pages with error 500 the developers tend to get called in. When the users merely get failing results such bugs may remain lingering much longer. > A secondary concern is that it > will invisibly produce invalid data, because it decodes some > non-UTF-8-encoded string that happens to only use UTF-8-valid sequences > as the wrong string value. In my experience this is very unlikely. UTF-8 looks like total junk in Latin-1, so it's unlikely to occur naturally. If you see something that matches a UTF-8 sequence in Latin-1 text, it's most likely that in fact it was incorrectly decoded earlier... > Now, I have to confess that I don't know how common these use cases are > in actual URL usage. It would be nice if there was some organization > that had a large collection of URLs, and could provide a test set we > could run a scanner over :-). > > As a workaround, though, I've sent a message off to Larry Masinter to > ask about this case. He's one of the authors of the URI spec. Looking forward to his response.
msg70868 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-08-07 22:58
> Your original proposal was to make unquote() behave like > unquote_to_bytes(), which would require changes to virtually every app > using unqote(), since almost all apps assume the result is a (text) > string. Actually, careful apps realize that the result of "unquote" in Python 2 is a sequence of bytes, and do something careful with that. So only careless apps would break, and they'd break in such a way that their maintainers would have to look at the situation again, and think about it. Seems like a 'good thing', to me. And since this is Python 3, fully allowed. I really don't understand your position here, I'm afraid. Setting the default encoding to Latin-1 would prevent these errors, > but would commit the sin of mojibake (the Japanese word for Perl code > :-). I don't like that much either. No, that would be wrong. Returning a string just for the sake of returning a string. Remember, the data percent-encoded is not necessarily a string, and not necessarily in any known encoding. > > A middle ground might be to set the default encoding to ASCII -- > that's closer to Martin's claim that URLs are supposed to be ASCII > only. URLs are supposed to be ASCII only -- but the percent-encoded byte sequences in various parts of the path aren't. This will require many apps to be changed, but at least it forces the > developers to think about which encoding to assume (perhaps there's > one handy in the request headers if it's a web app) or about error > handling or perhaps using unquote_to_bytes(). Yes, this is closer to my line of reasoning. > However I fear that this middle ground will in practice cause: > > (a) more in-the-field failures, since devs are notorious for testing > with ASCII only; and Returning bytes deals with this problem. > (b) the creation of a recipe for "fixing" unquote() calls that fail by > setting the encoding to UTF-8 without thinking about the alternatives, > thereby effectively recreating the UTF-8 default with much more pain. Could be, but at least they will have had to think about. There's lots of bad code out there, and maybe by making them think about it, some of it will improve. > A secondary concern is that it > > will invisibly produce invalid data, because it decodes some > > non-UTF-8-encoded string that happens to only use UTF-8-valid sequences > > as the wrong string value. > > In my experience this is very unlikely. UTF-8 looks like total junk in > Latin-1, so it's unlikely to occur naturally. If you see something > that matches a UTF-8 sequence in Latin-1 text, it's most likely that > in fact it was incorrectly decoded earlier... > Latin-1 isn't the only alternate encoding in the world, and not all percent-encoded byte sequences in URLs are encoded strings. I'd feel better if we were being guided by more than your just experience (vast though it may rightly be said to be!). Say, by looking at all the URLs that Google knows about :-). I'd particularly feel better if some expert in Asian use of the Web spoke up here...
msg70869 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2008-08-07 23:23
On Thu, Aug 7, 2008 at 3:58 PM, Bill Janssen <report@bugs.python.org> wrote: > Bill Janssen <bill.janssen@gmail.com> added the comment: >> Your original proposal was to make unquote() behave like >> unquote_to_bytes(), which would require changes to virtually every app >> using unqote(), since almost all apps assume the result is a (text) >> string. > > Actually, careful apps realize that the result of "unquote" in Python 2 is a > sequence of bytes, and do something careful with that. Given that in 2.x it's a string, and knowing my users, I expect that careful apps are a tiny minority. > So only careless > apps would break, and they'd break in such a way that their maintainers > would have to look at the situation again, and think about it. Seems like a > 'good thing', to me. And since this is Python 3, fully allowed. I really > don't understand your position here, I'm afraid. My position is that although 3.0 supports Unicode much better than 2.x (I won't ever use pretentious and meaningless phrases like "fully supports"), that doesn't mean that you have to use it. I expect lots of simple web apps don't need Unicode but do need quote/unquote functionality to get around "forbidden" characters in query strings like &. I don't want to force such apps to become more aware of Unicode than absolutely necessary. >> A middle ground might be to set the default encoding to ASCII -- >> that's closer to Martin's claim that URLs are supposed to be ASCII >> only. > > URLs are supposed to be ASCII only -- but the percent-encoded byte > sequences in various parts of the path aren't. > >> This will require many apps to be changed, but at least it forces the >> developers to think about which encoding to assume (perhaps there's >> one handy in the request headers if it's a web app) or about error >> handling or perhaps using unquote_to_bytes(). > > Yes, this is closer to my line of reasoning. > >> However I fear that this middle ground will in practice cause: >> >> (a) more in-the-field failures, since devs are notorious for testing >> with ASCII only; and > > Returning bytes deals with this problem. In an unpleasant way. We might as well consider changing all APIs that deal with URLs to insist on bytes. >> (b) the creation of a recipe for "fixing" unquote() calls that fail by >> setting the encoding to UTF-8 without thinking about the alternatives, >> thereby effectively recreating the UTF-8 default with much more pain. > > Could be, but at least they will have had to think about. There's lots of > bad code out there, and maybe by making them think about it, some of it will > improve. I'd rather use a carrot than a stick. IOW I'd rather write aggressive docs than break people's code. >> A secondary concern is that it >> > will invisibly produce invalid data, because it decodes some >> > non-UTF-8-encoded string that happens to only use UTF-8-valid sequences >> > as the wrong string value. >> >> In my experience this is very unlikely. UTF-8 looks like total junk in >> Latin-1, so it's unlikely to occur naturally. If you see something >> that matches a UTF-8 sequence in Latin-1 text, it's most likely that >> in fact it was incorrectly decoded earlier... > Latin-1 isn't the only alternate encoding in the world, and not all > percent-encoded byte sequences in URLs are encoded strings. I'd feel better > if we were being guided by more than your just experience (vast though it > may rightly be said to be!). Say, by looking at all the URLs that Google > knows about :-). I'd particularly feel better if some expert in Asian use > of the Web spoke up here... OK, let's wait and see if one bites.
msg70872 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-08-08 00:18
On Thu, Aug 7, 2008 at 4:23 PM, Guido van Rossum <report@bugs.python.org>wrote: > > >> However I fear that this middle ground will in practice cause: > >> > >> (a) more in-the-field failures, since devs are notorious for testing > >> with ASCII only; and > > > > Returning bytes deals with this problem. > > In an unpleasant way. We might as well consider changing all APIs that > deal with URLs to insist on bytes. > That seems a bit over-the-top. Most URL operations are about strings, and most of the APIs should deal with strings; we're talking about the return result of an operation specifically designed to extract binary data from the one place where it's allowed to occur. Vastly smaller than "changing all APIs that deal with URLs". By the way, I see that the email package dodges this by encoding the bytes to strings using the codec "raw-unicode-escape". In other words, byte sequences in the outward form of a string. I'd be OK with that. That is, make the default codec for "unquote" be "raw-unicode-escape". All the bytes will come through unscathed, and people who are naively expecting ASCII strings will still receive them, so the code won't break. This actually seems to be closest to the current usage, so I'm going to change my patch to do that.
msg70878 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-08-08 01:34
Now I'm looking at the failing test_http_cookiejar test, which fails because it encodes a non-UTF-8 byte, 0xE5, in a path segment of a URI. The question is, does the "http" URI scheme allow non-ASCII (say, Latin-1) octets in path segments? IANA says that the "http" scheme is defined in RFC 2616, and that says: This specification adopts the definitions of "URI-reference", "absoluteURI", "relativeURI", "port", "host","abs_path", "rel_path", and "authority" from [RFC 2396]. But RFC 2396 says: An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used. And doesn't say anything about the "http" scheme. Nor does it indicate any default encoding or character set for URIs. The update, 3986, doesn't say anything new about this, though it does implore URI scheme designers to represent characters in a textual segment with ASCII codes where they exist, and to use UTF-8 when designing new URI schemes. Barring any other information, I think that the "segments" in the path of an "http" URL must also be assumed to be binary; that is, any octet is allowed, and no character set can be presumed.
msg70879 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-08-08 02:04
Looks like the failing test in test_http_cookiejar is just a bad test; it attempts to build an HTTP request object from an invalid URL, yet still seem to expect to be able to extract a cookie from the response headers for that request. I'd expect some exception to be thrown. So this probably indicates a bug in urllib.parse: urllib.parse.urlparse("http://www.acme.com/foo%2f%25/<<%0anew\345/\346\370\345") should throw an exception somewhere.
msg70880 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-08-08 02:07
Just to be clear: any octet would seem to be allowed in the "path" of an "http" URL, but any non-ASCII octet must be percent-encoded. So the URL itself is still an ASCII string, considered opaquely.
msg70913 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-08-08 21:28
Here's the updated version of my patch. It returns a string, but doesn't clobber bytes that are contained in the string. Naive code (basically code that expects ASCII strings from unquote) should continue to work as well as it ever did. I fixed the problem with the email library, and removed the bogus test from the http_cookiejar test suite.
msg70949 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-09 18:34
Bill, I had a look at your patch. I see you've decided to make quote_as_string the default? In that case, I don't know why you had to rewrite everything to implement the same basic behaviour as my patch. (My latest few patches support bytes both ways). Anyway, I have a lot of issues with your implementation. * Why did you replace all of the existing machinery? Particularly the way quote creates Quoter objects and stores them in a cache. I haven't done any speed tests, but I assume that was all there for performance reasons. * The idea of quote_as_bytes is malformed. quote_as_bytes takes a str or bytes, and outputs a URI as a bytes, while quote_as_string outputs a URI as a str. This is the first time in the whole discussion we've represented a URI as bytes, not a str. URIs are not byte sequences, they are character sequences (see discussion below). I think only quote_as_string is valid. * The names unquote_as_* and quote_as_* are confusing. Use unquote_to_* and quote_from_* to avoid ambiguity. * Are unquote_as_string and unquote both part of your public interface? That seems like unnecessary duplication. * As Antoine pointed out above, it's too limiting for quote to force UTF-8. Add a 'charset' parameter. * Add an 'errors' parameter too, to give the caller control over how strict to be. * unquote and unquote_plus are missing 'charset' param, which should be passed along to unquote_as_string. * I do like the addition of a "plus" argument, as opposed to the separate unquote_plus and quote_plus functions. I'd swap the arguments to unquote around so charset is first and then plus, so you can write unquote(mystring, 'utf-8') without using a keyword argument. * In unquote: The "raw_unicode_escape" encoding makes no sense. It does exactly the same thing as Latin-1, except it also looks for b"\\uxxxx" in the string and converts that into a Unicode character. So your code behaves like this: >>> urllib.parse.unquote('%5Cu00fc') 'ü' (Should output "\u00fc") >>> urllib.parse.unquote('%5Cu') UnicodeDecodeError: 'rawunicodeescape' codec can't decode bytes in position 11-12: truncated \uXXXX (Should output "\u") I suspect the email package (where you got the inspiration to use 'rawunicodeescape') has this same crazy problem, but that isn't my concern today! Aside from this weirdness, you're essentially defaulting unquote to Latin-1. As I've said countless times, unquote needs to be the inverse of quote, or you get this behaviour: >>> urllib.parse.unquote(urllib.parse.quote('ü')) 'Ã¼' Once again, I refer you to my favourite web server example. import http.server s = http.server.HTTPServer(('',8000), http.server.SimpleHTTPRequestHandler) s.serve_forever() Run this in a directory with a non-Latin-1 filename (eg. "漢字"), and you will get a 404 when you click on the file. * One issue I worked very hard to solve is how to deal with unescaped non-ASCII characters in unquote. Technically this is an invalid URI, so I'm not sure how important it is, but it's nice to be able to assume the unquote function won't mess with them. For example, unquote_as_string("\u6f22%C3%BC", charset="latin-1") should give "\u6f22\u00fc" (or at least it would be nice). Yours raises "UnicodeEncodeError: 'ascii' codec can't encode character". (I assume this is a wanted property, given that the existing test suite tests that unquote can handle ALL unescaped ASCII characters (that's what escape_string in test_unquoting is for) - I merely extend this concept to be able to handle all unescaped Unicode characters). Note that it's impossible to allow such lenience if you implement unquote_as_string as calling unquote_as_bytes and then decoding. * Question: How does unquote_bytes deal with unescaped characters? (Since this is a str->bytes transform, you need to encode them somehow). I don't have a good answer for you here, which is one reason I think it's wrong to treat a URI as an octet encoding. I treat them as UTF-8. You treat them as ASCII. Since URIs are supposed to only contain ASCII, the answers "ASCII", "Latin-1" and "UTF-8" are all as good as each other, but as I said above, I prefer to be lenient and allow non-ASCII URIs as input. * Needs a lot more test cases, and documentation for your changes. I suggest you plug my new test cases for urllib in and see if you can make your code pass all the things I test for (and if not, have a good reason). In addition, a major problem I have is with this dangerous assumption that RFC 3986 specifies a byte->str encoding. You keep stating assumptions like this: > Remember that the RFC for percent-encoding really takes > bytes in, and produces bytes out. The string-in and string-out > versions are to support naive programming (what a nice way of > putting it!). You assume that my patch, the string version of quote/unquote, is a "hack" in order to satisfy the naive souls who only want to deal with strings, while your method is the "pure and correct" solution. This is in no way certain. It is clear from reading RFC 3986 that URIs are characters, not octets (that's not debatable - as we see URIs written on billboards, they're definately characters). What is debatable is whether the source data for a URI is a character or a byte. I'm arguing it can be either. The process of percent-encoding is taking a character, converting it into an octet, and percent-encoding that octet into a 3-character sequence. I quote: > Percent-encoded octets may be used within a URI to represent > characters outside the range of the US-ASCII coded character set > if this representation is allowed by the scheme or by the protocol > element in which the URI is referenced. Such a definition should > specify the character encoding used to map those characters to > octets prior to being percent-encoded for the URI. It could not be clearer that the RFC allows us to view percent encoding as a str->str transform. I also see later on (in section 2.1) the RFC talking about "data octets" being percent-encoded, so I assume it's also valid to speak about a bytes->str transform. Therefore, I believe my quote and unquote are as correct as can be with respect to the RFC, and I've also supplied unquote_to_bytes and quote_from_bytes to satisfy the other (IMHO less clear, and certainly less useful) reading of the RFC. You seem to have arrived at the same outcome, but from a very different standpoint that your unquote_as_bytes is "correct" and unquote_as_string is a nasty hack. Bill: > Barring any other information, I think that the "segments" in the > path of an "http" URL must also be assumed to be binary; that is, > any octet is allowed, and no character set can be presumed. Once again, practicality beats purity. Segments in HTTP URLs usually refer to file paths. Files (in all modern operating systems) are Unicode filenames, not byte filenames. So you HAVE to assume the character set at some point. The best assumption we can make (as shown by all the major browsers' defaults) is UTF-8. > So this probably indicates a bug in urllib.parse: > urllib.parse.urlparse(<snip>) > should throw an exception somewhere. No, this is just assuming the behaviour of unquote (which is called by urlparse) to preserve unescaped characters as-is. So this "bug" is actually in your own implementation of unquote (and mine too). But I refute that that's desirable behaviour. I think you SHOULD NOT have removed that test case. It is indicating a bug in your code (which is what it's there for). Look, I'm sorry to be so harsh, but I don't see what this alternative patch accomplishes. My analysis probably comes off as self-righteous, but then, I've spent the past month reading documentation and testing every aspect of this issue rigorously, and all I see is you ignoring all my work and rewriting it in a less-robust way. Your point here was to make the str->bytes and bytes->str versions of unquote and quote, respectively, work, right? Well I've already got those working too. Please tell me what you think of my implementation so we can get it approved before the deadline. Matt
msg70955 - (view)	Author: Jim Jewett (jimjjewett)	Date: 2008-08-10 00:35
Matt, Bill's main concern is with a policy decision; I doubt he would object to using your code once that is resolved. The purpose of the quoting functions is to turn a string (representing the human-readable version) into bytes (that go over the wire). If everything is ASCII, there isn't any disagreement -- but it also isn't obvious that they're bytes instead of characters. So people started (well, continued, since it dates to pre-unicode C) treating them as though they were strings. The fact that ASCII (and therefore most wire protocols) looks the same as bytes or as characters was one of the strongest arguments against splitting the bytes and string types. Now that this has been done, Bill feels we should be consistent. (You feel wire-protocol bytes should be treated as strings, if only as bytestrings, because the libraries use them that way -- but this is a policy decision.) To quote the final paragraph of 1.2.1 """ In local or regional contexts and with improving technology, users might benefit from being able to use a wider range of characters; such use is not defined by this specification. Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced. Such a definition should specify the character encoding used to map those characters to octets prior to being percent-encoded for the URI. """ So the mapping to bytes (or "octets") for non-ASCII isn't defined (here), and if you want to use it, you need to specify charset. But in practice, people do use it without specifying a charset. Which charset should be assumed? The old code (and test cases) assumed Latin-1. You want to assume UTF-8 (though you took the document charset when available -- which might also make sense).
msg70958 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-10 03:09
> Bill's main concern is with a policy decision; I doubt he would > object to using your code once that is resolved. But his patch does the same basic operations as mine, just implemented differently and with the heap of issues I outlined above. So it doesn't have anything to do with the policy decision. > The purpose of the quoting functions is to turn a string > (representing the human-readable version) into bytes (that go > over the wire). Ah hang on, that's a misunderstanding. There is a two-step process involved. Step 1. Translate <character/byte> string into an ASCII character string by percent-encoding the <characters/bytes>. (If percent-encoding characters, use an unspecified encoding). Step 2. Serialize the ASCII character string into an octet sequence to send it over the wire, using some unspecified encoding. Step 1 is explained in detail throughout the RFC, particularly in Section 1.2.1 Transcription ("Percent-encoded octets may be used within a URI to represent characters outside the range of the US-ASCII coded character set") and 2.1 Percent Encoding. Step 2 is not actually part of the spec (because the spec outlines URIs as character sequences, not how to send them over a network). It is briefly described in Section 2 ("This specification does not mandate any particular character encoding for mapping between URI characters and the octets used to store or transmit those characters. When a URI appears in a protocol element, the character encoding is defined by that protocol"). Section 1.2.1: > A URI may be represented in a variety of ways; e.g., ink on > paper, pixels on a screen, or a sequence of character > encoding octets. The interpretation of a URI depends only on > the characters used and not on how those characters are > represented in a network protocol. The RFC then goes on to describe a scenario of writing a URI down on a napkin, before stating: > A URI is a sequence of characters that is not always represented > as a sequence of octets. Right, so there is no debate that a URI (after percent-encoding) is a character string, not a byte string. The debate is only whether it's a character or byte string before percent-encoding. Therefore, the concept of "quote_as_bytes" is flawed. > You feel wire-protocol bytes should be treated as > strings, if only as bytestrings, because the libraries use them > that way. No I do not. URIs post-encoding are character strings, in the Unicode sense of the term "character". This entire topic has nothing to do with the wire. Note that the "charset" or "encoding" parameter in Bill/My patch respectively isn't the mapping from URI strings to octets (that's trivially ASCII). It's the charset used to encode character information into octets which then get percent-encoded. > The old code (and test cases) assumed Latin-1. No, the old code and test cases were written for Python 2.x. They assumed a byte string was being emitted (back when a byte string was a string, so that was an acceptable output type). So they weren't assuming an encoding. In fact the ONLY test case for Unicode in test_urllib used a UTF-8-encoded string. > r = urllib.parse.unquote('br%C3%BCckner_sapporo_20050930.doc') > self.assertEqual(r, 'br\xc3\xbcckner_sapporo_20050930.doc') In Python 2.x, this test case says "unquote('%C3%BC') should give me the byte sequence '\xc3\xbc'", which is a valid case. In Python 3.0, the code didn't change but the meaning subtly did. Now it says "unquote('%C3%BC') should give the string 'Ã¼'". The name is clearly supposed to be "brückner", not "brÃ¼ckner", which means in Python 3.0 we should EITHER be expecting the BYTE string b'\xc3\xbc' or the character string 'ü'. So the old code and test cases didn't assume any encoding, then they were accidentally made to assume Latin-1 by the fact that the language changed underneath them.
msg70959 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-10 05:05
I've been thinking more about the errors="strict" default. I think this was Guido's suggestion. I've decided I'd rather stick with errors="replace". I changed errors="replace" to errors="strict" in patch 8, but now I'm worried that will cause problems, specifically for unquote. Once again, all the code in the stdlib which calls unquote doesn't provide an errors option, so the default will be the only choice when using these other services. I'm concerned that there'll be lots of unhandled exceptions flying around for URLs which aren't encoded with UTF-8, and a conscientious programmer will not be able to protect against user errors. Take the cgi module as an example. Typical usage is to write: > fields = cgi.FieldStorage() > foo = fields.getFirst("foo") If the QUERY_STRING is "foo=w%FCt" (Latin-1), with errors='strict', you get a UnicodeDecodeError when you call cgi.FieldStorage(). With errors='replace', the variable foo will be "w�t". I think in general I'd rather have '�'s in my program (representing invalid user input) than exceptions, since this is usually a user input error, not a programming error. (One problem is that all I can do to handle this is catch a UnicodeDecodeError on the call to FieldStorage; then I can't access any of the data). Now maybe something we can think about is propagating the "encoding" and "errors" argument through to a few other major functions (such as cgi.parse_qsl, cgi.FieldStorage and urllib.parse.urlencode), but that should be separately to this patch.
msg70962 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-10 07:05
Guido suggested that quote's "safe" parameter should allow any character, not just ASCII range. I've implemented this now. It was a lot messier than I imagined. The problem is that in my older patches, both 's' and 'safe' are encoded to bytes right away, and the rest of the process is just octet encoding (matching each byte against the safe set to see whether or not to quote it). The new implementation requires that you delay encoding both of these till the iteration over the string, so you match each character against the safe set, then encode it if it's not in 'safe'. Now the problem is some encodings/errors produce bytes which are in the safe range. For instance quote('\u6f22', encoding='latin-1', errors='xmlcharrefreplace') should give "%26%2328450%3B" (which is "漢" encoded). To preserve this behaviour, you then have to check each byte of the encoded character against a 'safe bytes' set. I believe that will slow down the implementation considerably. In summary, it requires two levels of encoding: first characters, then bytes. You can see how messy it made my quote implementation - I've attached the patch (parse.py.patch8+allsafe). I don't think it's worth the extra code bloat and performance hit just to implement a feature whose only use is producing invalid URIs (since URIs are supposed to only have ASCII characters). Does anyone disagree, and want this feature in?
msg70965 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-10 07:45
Made a bunch of requested changes (I've reverted the "all safe" patch for now since it caused so much grief; see above). * quote: Fixed encoding illegal % sequences (and lots of new test cases to prove it). * quote now throws a type error if s is bytes, and encoding or errors supplied. * A few minor documentation fixes. Patch 9. Commit log for patch8 should suffice.
msg70969 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-08-10 10:49
Le dimanche 10 août 2008 à 07:05 +0000, Matt Giuca a écrit : > I don't think it's worth the extra code bloat and performance hit just > to implement a feature whose only use is producing invalid URIs (since > URIs are supposed to only have ASCII characters). Agreed. > If the QUERY_STRING is "foo=w%FCt" (Latin-1), with errors='strict', you > get a UnicodeDecodeError when you call cgi.FieldStorage(). With > errors='replace', the variable foo will be "w�t". I think in general I'd > rather have '�'s in my program (representing invalid user input) than > exceptions, since this is usually a user input error, not a programming > error. Invalid user input? What if the query string comes from filling a form? For example if I search the word "numéro" in a latin1 Web site, I get the following URL: http://www.le-tigre.net/spip.php?page=recherche&recherche=num%E9ro
msg70970 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-10 11:25
> Invalid user input? What if the query string comes from filling > a form? > For example if I search the word "numéro" in a latin1 Web site, > I get the following URL: > http://www.le-tigre.net/spip.php?page=recherche&recherche=num%E9ro Yes, that is a concern. I suppose the idea should be that as the programmer _you_ write the website, so you make it UTF-8 and you use our defaults. Or you make it Latin-1, and you override our defaults (which is tricky if you use cgi.FieldStorage, for example). But anyway, how do you propose to handle that (other than the programmer setting the correct default). With errors='replace', the above query will result in "num�ro", but with errors='strict', it will result in a UnicodeDecodeError (which you could handle, if you remembered). As a programmer I don't really want to handle that error every time I use unquote or anything that calls unquote. I'd rather accept the possibility of '�'s in my input. I'm not going to dig in my heels though, this time :) I just want to make sure the consequences of this decision are known before we commit.
msg71042 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-08-12 02:30
On Sat, Aug 9, 2008 at 11:34 AM, Matt Giuca <report@bugs.python.org> wrote: > > Matt Giuca <matt.giuca@gmail.com> added the comment: > > Bill, I had a look at your patch. I see you've decided to make > quote_as_string the default? In that case, I don't know why you had to > rewrite everything to implement the same basic behaviour as my patch. > Just to get things clear in my head. I don't care what code is used, so long as it doesn't break the Web. > (My latest few patches support bytes both ways). Anyway, I have a lot of > issues with your implementation. > > * Why did you replace all of the existing machinery? Particularly the > way quote creates Quoter objects and stores them in a cache. I haven't > done any speed tests, but I assume that was all there for performance > reasons. I see no real advantage there, except that it has a built-in memory leak. Just use a function. > * The idea of quote_as_bytes is malformed. quote_as_bytes takes a str or > bytes, and outputs a URI as a bytes, while quote_as_string outputs a URI > as a str. This is the first time in the whole discussion we've > represented a URI as bytes, not a str. URIs are not byte sequences, they > are character sequences (see discussion below). Just there for consistency. Of course, URIs are actually strings of ASCII characters, which sort of specifies an encoding right there, so using ASCII octets isn't all that far-fetched... > * The names unquote_as_* and quote_as_* are confusing. Use unquote_to_* > and quote_from_* to avoid ambiguity. Good point. > * Are unquote_as_string and unquote both part of your public interface? Yes. Good programmers will switch to unquote_as_string or unquote_as_bytes (or, with your amendment, unquote_to_string and unquote_to_bytes, while unquote is just there for backwards compatibility for naifs. * As Antoine pointed out above, it's too limiting for quote to force > UTF-8. Add a 'charset' parameter. That's what quote_as_string is for. > * Add an 'errors' parameter too, to give the caller control over how > strict to be. No. Errors should be seen. This is data, not whimsey. > > * unquote and unquote_plus are missing 'charset' param, which should be > passed along to unquote_as_string. No, they're only there for backwards compatibility, and the 2.x versions don't have a charset param. * >I do like the addition of a "plus" argument, as opposed to the > separate unquote_plus and quote_plus functions. I'd swap the arguments > to unquote around so charset is first and then plus, so you can write > unquote(mystring, 'utf-8') without using a keyword argument. Unquote doesn't take a charset (in my design). > > * In unquote: The "raw_unicode_escape" encoding makes no sense. It does > exactly the same thing as Latin-1, except it also looks for b"\\uxxxx" > in the string and converts that into a Unicode character. So your code > behaves like this: > > >>> urllib.parse.unquote('%5Cu00fc') > 'ü' > (Should output "\u00fc") > >>> urllib.parse.unquote('%5Cu') > UnicodeDecodeError: 'rawunicodeescape' codec can't decode bytes in > position 11-12: truncated \uXXXX > (Should output "\u") > > I suspect the email package (where you got the inspiration to use > 'rawunicodeescape') has this same crazy problem, but that isn't my > concern today! > > Aside from this weirdness, you're essentially defaulting unquote to > Latin-1. As I've said countless times, unquote needs to be the inverse > of quote, or you get this behaviour: > >>> urllib.parse.unquote(urllib.parse.quote('ü')) > 'Ã¼' > > Once again, I refer you to my favourite web server example. > > import http.server > s = http.server.HTTPServer(('',8000), > http.server.SimpleHTTPRequestHandler) > s.serve_forever() > > Run this in a directory with a non-Latin-1 filename (eg. "漢字"), and > you will get a 404 when you click on the file. Using unquote() and expecting a string is basically broken already. So what I'm trying to do here is to avoid losing data in the translation. > * One issue I worked very hard to solve is how to deal with unescaped > non-ASCII characters in unquote. Technically this is an invalid URI, so > I'm not sure how important it is, but it's nice to be able to assume the > unquote function won't mess with them. For example, > unquote_as_string("\u6f22%C3%BC", charset="latin-1") should give > "\u6f22\u00fc" (or at least it would be nice). Yours raises > "UnicodeEncodeError: 'ascii' codec can't encode character". (I assume > this is a wanted property, given that the existing test suite tests that > unquote can handle ALL unescaped ASCII characters (that's what > escape_string in test_unquoting is for) - I merely extend this concept > to be able to handle all unescaped Unicode characters). Note that it's > impossible to allow such lenience if you implement unquote_as_string as > calling unquote_as_bytes and then decoding. Good thing we don't need to; URIs consist of ASCII characters. > > * Question: How does unquote_bytes deal with unescaped characters? Not sure I understand this question... > > * Needs a lot more test cases, and documentation for your changes. I > suggest you plug my new test cases for urllib in and see if you can make > your code pass all the things I test for (and if not, have a good reason). Your test cases probably aren't testing things I feel it's necessary to test. I'm interested in having the old test cases for urllib pass, as well as providing the ability to unquote_to_bytes(). > In addition, a major problem I have is with this dangerous assumption > that RFC 3986 specifies a byte->str encoding. You keep stating > assumptions like this: > > Remember that the RFC for percent-encoding really takes > > bytes in, and produces bytes out. The string-in and string-out > > versions are to support naive programming (what a nice way of > > putting it!). > Those aren't assumptions, they're simple statements of fact. > > You assume that my patch, the string version of quote/unquote, is a > "hack" in order to satisfy the naive souls who only want to deal with > strings, while your method is the "pure and correct" solution. Yes. > > It is clear from reading RFC 3986 that URIs are characters, not octets > (that's not debatable - as we see URIs written on billboards, they're > definately characters). What is debatable is whether the source data > for a URI is a character or a byte. I'm arguing it can be either. > I wasn't talking about URIs (you're right, they're clearly sequences of characters), I was talking about the process of percent-encoding, which takes a sequence of octets, and produces an ASCII string, and the process of percent-decoding, which takes that ASCII string, and produces that same original sequence of octets. Don't confuse the two issues. You seem to have arrived at the same outcome, but from a very different > standpoint that your unquote_as_bytes is "correct" and unquote_as_string > is a nasty hack Nasty? I suppose that's in the eye of the beholder. No, I think that this whole situation has come about because Python 1 and Python 2 had this unfortunate ambiguity about what a "str" was. urllib.unquote() has always produced a sequence of bytes, but unfortunately too many careless programmers have regarded it as a sequence of characters. Unquote_as_string simply (and, IMO, usefully), conflates two operations: percent-decode, and string decode. Non-careless programmers have always written something like: s = urllib.unquote(frag).decode("UTF-8", "strict") That is, they understood what was going on, and handled it correctly. I'd suggest removing "unquote" completely, and leaving everyone porting to Python 3 to make their own choice between unquote_to_bytes and unquote_to_string. > Bill: > > Barring any other information, I think that the "segments" in the > > path of an "http" URL must also be assumed to be binary; that is, > > any octet is allowed, and no character set can be presumed. > > Once again, practicality beats purity. Segments in HTTP URLs usually > refer to file paths. Files (in all modern operating systems) are Unicode > filenames, not byte filenames. So you HAVE to assume the character set > at some point. The best assumption we can make (as shown by all the > major browsers' defaults) is UTF-8. > You're not suggesting we build something that works for 80% of the cases, and breaks on the other 20%? When we don't need to? > So this probably indicates a bug in urllib.parse: > > urllib.parse.urlparse(<snip>) > > should throw an exception somewhere. > > No, this is just assuming the behaviour of unquote (which is called by > urlparse) to preserve unescaped characters as-is. So this "bug" is > actually in your own implementation of unquote (and mine too). But I > refute that that's desirable behaviour. I think you SHOULD NOT have > removed that test case. It is indicating a bug in your code (which is > what it's there for). This has nothing to do with unquote(). The octets in question are not percent-encoded. > Look, I'm sorry to be so harsh, but I don't see what this alternative > patch accomplishes. My analysis probably comes off as self-righteous, > but then, I've spent the past month reading documentation and testing > every aspect of this issue rigorously, and all I see is you ignoring all > my work and rewriting it in a less-robust way. Sorry to be so harsh, but... wow, take some lessons in humility! It's not about you or your patch or how long you spend reading stuff, it's about Python. And, without referring explicitly to your patch, it really doesn't matter how long someone spends on something, if they get it wrong. I think you are conflating things in a way which will break stuff (and not break other stuff, that's true). We're arguing about which way of breaking things is better, I think. > Your point here was to make the str->bytes and bytes->str versions of > unquote and quote, respectively, work, right? Well I've already got > those working too. Please tell me what you think of my implementation so > we can get it approved before the deadline. Matt, your patch is not some God-given thing here. There are lots of patches waiting to be looked at, and this is a fairly trivial concern. I'm guessing there are lots of URIs out there that contain non-UTF-8 octets, particularly in Japan, which has used ISO-2022-JP for years (though Antoine advances an example of Latin-1 in URLs right here). Be nice if someone who worked at Google (hint!, hint!) looked through their massive collection of URLs and validated this hypothesis, or not.
msg71043 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-08-12 03:43
Some interesting notes here (from Erik van der Poel at Google; Guido, you might want to stroll over to his location and talk with him): http://lists.w3.org/Archives/Public/www-international/2007JanMar/0004.html and more particularly http://lists.w3.org/Archives/Public/www-international/2008AprJun/0092.html, which says, in part, ``Within the context of HTML and HTTP, queries [that is, the query part of a URL] don't have to say which charset they are using, because there is already an agreement in place: the major browsers and servers use the charset of the HTML.'' So, there's still a sizable number of Latin-1 pages out there, and queries against these pages will use that encoding in the URL's they send. And then there's this: http://lists.w3.org/Archives/Public/www-international/2008AprJun/0014.html
msg71054 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-12 15:20
Bill, this debate is getting snipy, and going nowhere. We could argue about what is the "pure" and "correct" thing to do, but we have a limited time frame here, so I suggest we just look at the important facts. 1. There is an overwhelming consensus (including from me) that a str->bytes version is acceptable to have in the library (whether or not it's the "correct solution"). 2. There is an overwhelming consensus (including from you) that a str->str version is acceptable to have in the library (whether or not it's the "correct solution"). 3. By default, the str->str version breaks much less code, so both of us decided to use it by default. To this end, both of our patches: 1. Have a str->bytes version available. 2. Have a str->str version available. 3. Have "quote" and "unquote" functions call the str->str version. So it seems we have agreed on that. Therefore, there should be no more arguing about which is "more right". So all your arguments seem to be essentially saying "the str->bytes methods work perfectly; I don't care about if the str->str methods are correct or not". The fact that your string versions quote UTF-8 and unquote Latin-1 shows just how un-seriously you take the str->str methods. Well the fact is that a) a great many users do NOT SHARE your ideals and will default to using "quote" and "unquote" rather than the bytes functions, and b) all of the rest of the library uses "quote" and "unquote". So from a practical sense, how these methods behave is of the utmost importance - they are more important than any new functions we introduce at this point. For example, the cgi.FieldStorage and the http.server modules will implicitly call unquote and quote. That means whether you, or I, or Guido, or The King Of The Internet likes it or not, we have to have a "most reasonable" solution to the problem of quoting and unquoting strings. > Good thing we don't need to [handle unescaped non-ASCII characters in > unquote]; URIs consist of ASCII characters. Once again, practicality beats purity. I'd argue that it's a good (not strictly required) idea to not mangle input unless we have to. > > * Question: How does unquote_bytes deal with unescaped characters? > Not sure I understand this question... I meant unescaped non-ASCII characters, as discussed above (eg. unquote_bytes('\u0123')). > Your test cases probably aren't testing things I feel it's necessary > to test. I'm interested in having the old test cases for urllib > pass, as well as providing the ability to unquote_to_bytes(). I'm sorry, but you're missing the point of test-driven development. If you think there is a bug, you don't just fix it and say "look, the old test cases still pass!" You write new FAILING test cases to demonstrate the bug. Then you change the code to make the test cases pass. All your test suite proves is that you're happy with things the way they are. > Matt, your patch is not some God-given thing here. No, I am merely suggesting that it's had a great deal more thought put into it -- not just my thought, but all the other people in the past month who've suggested different approaches and brought up discussion points. Including yourself -- it was your suggestion in the first place to have the str->bytes functions, which I agree are important. > > <snip> - Quote uses cache > I see no real advantage there, except that it has a built-in > memory leak. Just use a function. Good point. Well the merits of using a cache are completely independent from the behavioural aspects. I simply changed the existing code as little as possible. Hence this patch will have the same performance strengths/weaknesses as all previous versions, and the performance can be tuned after 3.0 if necessary. (Not urgent). On statistics about UTF-8 versus other encodings. Yes, I agree, there are lots of URIs floating around out there, in many different encodings. Unfortunately, we can't implicitly handle them all (and I'm talking once more explicitly about the str->str transform here). We need to pick one as the default. Whether Latin-1 is more popular than UTF-8 for the time being is no good reason to pick Latin-1. It is called a "legacy encoding" for a reason. It is being phased out and should NOT be supported from here on in as the default encoding in a major web programming language. (Also there is no point in claiming to be "Unicode compliant" then turning around and supporting a charset with 256 symbols by default). Because Python's urllib will mostly be used in the context of building web apps, it is up to the programmer to decide what encoding to use for h(is\|er) web app. For future apps, this should almost certainly be UTF-8 (if it isn't, the website won't be able to accept form input across all characters, so isn't Unicode compliant anyway). The problem you mention of browsers submitting URIs encoded based on the charset is simply something we have to live with. A server will never be able to deal with that unless the URIs are coming from pages which it served. As this is very often the case, then as I said above, the app should serve UTF-8 and there'll be no problems. Also note that ALL the browsers I tested (FF/Saf/IE) use UTF-8 no matter what, if you directly type Unicode characters into the address bar.
msg71055 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-12 15:26
By the way, what is the current status of this bug? Is anybody waiting on me to do anything? (Re: Patch 9) To recap my previous list of outstanding issues raised by the review: > Should unquote accept a bytes/bytearray as well as a str? Currently, does not. I think it's meaningless to do so (and how to handle >127 bytes, if so?) > Lib/email/utils.py: > Should encode_rfc2231 with charset=None accept strings with non-ASCII > characters, and just encode them to UTF-8? Currently does. Suggestion to restrict to ASCII on the review tracker; simple fix. > Should quote raise a TypeError if given a bytes with encoding/errors > arguments? (Motivation: TypeError is what you usually raise if you > supply too many args to a function). Resolved. Raises TypeError. > Lib/urllib/parse.py: > (As discussed above) Should quote accept safe characters outside the > ASCII range (thereby potentially producing invalid URIs)? Resolved? Implemented, but too messy and not worth it just to produce invalid URIs, so NOT in patch. That's only two very minor yes/no issues remaining. Please comment.
msg71057 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-08-12 16:04
I agree that given two similar patches, the one with more tests earns some bonus points. Also, it seems to me that round-trippability of quote()/unquote() is a logical and semantic requirement: in particular, if there is a default encoding, it should be the same for both. > For future apps, this should almost certainly be UTF-8 > (if it isn't, the website won't be able to accept form input across all > characters, so isn't Unicode compliant anyway). Actually, it will be able to accept such form input, as characters not supported by the charset should be entity-encoded by the browser (e.g. "{"). I have no strong opinion on the very remaining points you listed, except that IMHO encode_rfc2231 with charset=None should not try to use UTF8 by default. But someone with more mail protocol skills should comment :)
msg71064 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-08-12 17:38
Larry Masinter is off on vacation, but I did get a brief message saying that he will dig up similar discussions that he was involved in when he gets back. Out of curiosity, I sent a note off to the www-international mailing list, and received this: ``For the authority (server name) portion of a URI, RFC 3986 is pretty clear that UTF-8 must be used for non-ASCII values (assuming, for a moment, that IDNA addresses are not Punycode encoded already). For the path portion of URIs, a large-ish proportion of them are, indeed, UTF-8 encoded because that has been the de facto standard in Web browsers for a number of years now. For the query and fragment parts, however, the encoding is determined by context and often depends on the encoding of some page that contains the form from which the data is taken. Thus, a large number of URIs contain non-UTF-8 percent-encoded octets.'' http://lists.w3.org/Archives/Public/www-international/2008JulSep/0041.html
msg71065 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-08-12 17:47
For Antoine: I think the problem that Barry is facing with the email package is that Unicode strings are an ambiguous representation of a sequence of bytes; that is, there are a number of different byte sequences a Unicode string may have come from. His ingenious use of raw-unicode-escape is an attempt to conform to the requirement of having to produce a string, but without losing any data, so that an application program can, if it needs to, still reprocess that string and retrieve the original data. Naive application programs that sort of expected the result to be an ASCII string will be unaffected. Not sure it's the best idea; this is all about just where to force unexpected runtime failures.
msg71069 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-08-12 19:37
Here's another thought: Let's put string_to_bytes and string_from_bytes into the binascii module, as a2b_percent and b2a_percent, respectively. Then parse.py would import them as from binascii import a2b_percent as percent_decode_as_bytes from binascii import b2a_percent as percent_encode_from_bytes and add two more functions: def percent_encode(<string>, encoding="UTF-8", error="strict", plus=False) def percent_decode(<string>, encoding="UTF-8", error="strict", plus=False) and would add backwards-compatible but deprecated functions for quote and unquote: def quote(s): warnings.warn("urllib.parse.quote should be replaced by percent_encode or percent_encode_from_bytes", FutureDeprecationWarning) if isinstance(s, str): return percent_encode(s) else: return percent_encode_from_bytes(s) def unquote(s): warnings.warn("urllib.parse.unquote should be replaced by percent_decode or percent_decode_to_bytes", FutureDeprecationWarning) if isinstance(s, str): return percent_decode(s) else: return percent_decode(str(s, "ASCII", "strict"))
msg71072 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-08-12 22:12
Le mardi 12 août 2008 à 19:37 +0000, Bill Janssen a écrit : > Let's put string_to_bytes and string_from_bytes into the binascii > module, as a2b_percent and b2a_percent, respectively. Well, it's my personal opinion, but I think we should focus on a simple and straightforward solution for the present issue before beta3 is released (which is in 8 days now). It has already been difficult to find a (quasi-)consensus for a simple patch to adapt quote()/unquote() to the realities of bytes/unicode separation in py3k: witness the length of the present discussion. (perhaps a sophisticated solution could still be adopted for 3.1, especially if it has backwards compatibility in mind)
msg71073 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2008-08-12 23:43
> Matt Giuca <matt.giuca@gmail.com> added the comment: > By the way, what is the current status of this bug? Is anybody waiting > on me to do anything? (Re: Patch 9) I'll be reviewing it today or tomorrow. From looking at it briefly I worry that the implementation is pretty slow -- a method call for each character and a map() call sounds pretty bad. > To recap my previous list of outstanding issues raised by the review: > >> Should unquote accept a bytes/bytearray as well as a str? > Currently, does not. I think it's meaningless to do so (and how to > handle >127 bytes, if so?) The bytes > 127 would be translated as themselves; this follows logically from how stuff is parsed -- %% and %FF are translated, everything else is not. But I don't really care, I doubt there's a need. >> Lib/email/utils.py: >> Should encode_rfc2231 with charset=None accept strings with non-ASCII >> characters, and just encode them to UTF-8? > Currently does. Suggestion to restrict to ASCII on the review tracker; > simple fix. I think I agree with that comment; it seems wrong to return UTF8 without setting that in the header. The alternative would be to default charset to utf8 if there are any non-ASCII chars in the input. I'd be okay with that too. >> Should quote raise a TypeError if given a bytes with encoding/errors >> arguments? (Motivation: TypeError is what you usually raise if you >> supply too many args to a function). > Resolved. Raises TypeError. > >> Lib/urllib/parse.py: >> (As discussed above) Should quote accept safe characters outside the >> ASCII range (thereby potentially producing invalid URIs)? > Resolved? Implemented, but too messy and not worth it just to produce > invalid URIs, so NOT in patch. Agreed, safe should be ASCII chars only. > That's only two very minor yes/no issues remaining. Please comment. I believe patch 9 still has errors defaulting to strict for quote(). Weren't you going to change that? Regarding using UTF-8 as the default encoding, I still think this the right thing to do -- while the tables shown by Bill indicate that there's still a lot of Latin-1 out there, UTF-8 is definitely gaining on it, and I expect that Python apps, especially Py3k apps, are much more likely to follow (and hopefully reinforce! :-) this trend than to lag behind.
msg71082 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-13 14:25
> I have no strong opinion on the very remaining points you listed, > except that IMHO encode_rfc2231 with charset=None should not try to > use UTF8 by default. But someone with more mail protocol skills > should comment :) OK I've come to the realization that DEMANDING ascii (and erroring on non-ASCII chars) is better for the short term anyway, because we can always decide later to relax the restrictions, but it's a lot worse to add restrictions later. So I agree now, should be ASCII. And no, I don't have mail protocol skills. The same goes for unquote accepting bytes. We can decide to make it accept bytes later, but can't remove that feature later, so it's best (IMHO) to let it NOT accept bytes (which is the current behaviour). > The bytes > 127 would be translated as themselves; this follows > logically from how stuff is parsed -- %% and %FF are translated, > everything else is not. But I don't really care, I doubt there's a > need. Ah but what about unquote (to string)? If it accepted bytes then it would be a bytes->str operation, and then you need a policy on DEcoding those bytes. It makes things too complex I think. > I believe patch 9 still has errors defaulting to strict for quote(). > Weren't you going to change that? I raised it as a concern, but I thought you overruled on that, so I left it as errors='strict'. What do you want it to be? 'replace'? Now that this issue has been fully discussed, I'm happy with whatever you decide. > From looking at it briefly I > worry that the implementation is pretty slow -- a method call for each > character and a map() call sounds pretty bad. Yes, it does sound pretty bad. However, that's the current way of doing things in both 2.x and 3.x; I didn't change it (though it looks like I changed a LOT, I really did try to change as little as possible!) Assuming it wasn't made _slower_ than before, can we ignore existing performance issues and treat them as a separate matter (and can be dealt with after 3.0)? I'm not putting up a new patch now. The only fix I'd make is to add Antoine's "or 'ascii'" to email/utils.py, as suggested on the review tracker. I'll make this change along with any other recommendations after your review. (That is Lib/email/utils.py line 222 becomes: s = urllib.parse.quote(s, safe='', encoding=charset or 'ascii') ) btw this Rietveld is amazing. I'm assuming I don't have permission to upload patches there (can't find any button to do so) which is why I keep posting them here and letting you upload to Rietveld ...
msg71083 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2008-08-13 14:50
On Wed, Aug 13, 2008 at 7:25 AM, Matt Giuca <report@bugs.python.org> wrote: >> I have no strong opinion on the very remaining points you listed, >> except that IMHO encode_rfc2231 with charset=None should not try to >> use UTF8 by default. But someone with more mail protocol skills >> should comment :) > > OK I've come to the realization that DEMANDING ascii (and erroring on > non-ASCII chars) is better for the short term anyway, because we can > always decide later to relax the restrictions, but it's a lot worse to > add restrictions later. So I agree now, should be ASCII. And no, I don't > have mail protocol skills. OK. > The same goes for unquote accepting bytes. We can decide to make it > accept bytes later, but can't remove that feature later, so it's best > (IMHO) to let it NOT accept bytes (which is the current behaviour). OK. >> The bytes > 127 would be translated as themselves; this follows >> logically from how stuff is parsed -- %% and %FF are translated, >> everything else is not. But I don't really care, I doubt there's a >> need. > > Ah but what about unquote (to string)? If it accepted bytes then it > would be a bytes->str operation, and then you need a policy on DEcoding > those bytes. It makes things too complex I think. OK. >> I believe patch 9 still has errors defaulting to strict for quote(). >> Weren't you going to change that? > > I raised it as a concern, but I thought you overruled on that, so I left > it as errors='strict'. What do you want it to be? 'replace'? Now that > this issue has been fully discussed, I'm happy with whatever you decide. I'm OK with replace for unquote(), your point that bogus data is better than an exception is well taken, especially since there are calls that the app can't control (like in cgi.py). For quote() I think strict is better -- it can't fail anyway with UTF8, and if an app passes an explicit conversion it'd be pretty stupid to pass a string that can't be converted with that encoding (since it's presumably the app that generates both the string and the encoding) so it's better to complain there, just like if they made the encode() call themselves with only an encoding specified. This means we have a useful analogy: quote(s, e) == quote(s.encode(e)). >> From looking at it briefly I >> worry that the implementation is pretty slow -- a method call for each >> character and a map() call sounds pretty bad. > > Yes, it does sound pretty bad. However, that's the current way of doing > things in both 2.x and 3.x; I didn't change it (though it looks like I > changed a LOT, I really did try to change as little as possible!) Actually, while the Quoter class (the immediat subject of my scorn) was there before your patch in 3.0, it isn't there in 2.x; somebody must have added it in 3.0 as part of the conversion to Unicode or perhaps as part of the restructuring of urllib.py. The 2.x code maps the __getitem__ of a dict over the string, which is much faster. I think we can do much better than mapping a method call. > Assuming it wasn't made _slower_ than before, can we ignore existing > performance issues and treat them as a separate matter (and can be dealt > with after 3.0)? Now that you've spent so much time with this patch, can't you think of a faster way of doing this? I wonder if mapping a defaultdict wouldn't work. > I'm not putting up a new patch now. The only fix I'd make is to add > Antoine's "or 'ascii'" to email/utils.py, as suggested on the review > tracker. I'll make this change along with any other recommendations > after your review. > > (That is Lib/email/utils.py line 222 becomes: > s = urllib.parse.quote(s, safe='', encoding=charset or 'ascii') > ) > > btw this Rietveld is amazing. I'm assuming I don't have permission to > upload patches there (can't find any button to do so) which is why I > keep posting them here and letting you upload to Rietveld ... Thanks! You can't upload patches to the issue that I created, but a better way would be to create a new issue and assign it to me for review. That will work as long as you have a gmail account or a Google Account. I highly recommend using the upload.py script, which you can download from codereview.appspot.com/static/upload.py. (There's also a link to it on the Create Issue page, at the bottom.) I am hoping that in general we will be able to use Rietveld to review patches instead of the bug tracker.
msg71084 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-13 15:08
> I'm OK with replace for unquote() ... > For quote() I think strict is better There's just an odd inconsistency there, but it's only a tiny "gotcha"; and I agree with all your other arguments. I'll change unquote back to errors='replace'. > This means we have a useful analogy: > quote(s, e) == quote(s.encode(e)). That's exactly true, yes. > Now that you've spent so much time with this patch, can't you think > of a faster way of doing this? Well firstly, you could replace Quoter (the class) with a "quoter" function, which is nested inside quote. Would calling a nested function be faster than a method call? > I wonder if mapping a defaultdict wouldn't work. That is a good idea. Then, the "function" (as I describe above) would be just the inside of what currently is the except block, and that would be the default_factory of the defaultdict. I think that should speed things up. I'm very hazy about what is faster in the bytecode world of Python, and wary of making a change and proclaiming "this is faster!" without doing proper speed tests (which is why I think this optimisation could be delayed until at least after the core interface changes are made). But I'll have a go at that change tomorrow. (I won't be able to work on this for up to 24 hours).
msg71085 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-08-13 16:02
Selon Matt Giuca <report@bugs.python.org>: > > > Now that you've spent so much time with this patch, can't you think > > of a faster way of doing this? > > Well firstly, you could replace Quoter (the class) with a "quoter" > function, which is nested inside quote. Would calling a nested function > be faster than a method call? The obvious speedup is to remove the map() call and do the loop inside Quoter.__call__ instead. That way you don't have any function or method call in the critical path. (also, defining a class with a single __call__ method is not a common Python idiom; usually you'd just have a function returning another (nested) function) As for the defaultdict, here is how it can look like (this is on 2.5): ... def __missing__(self, key): ... print "__missing__", key ... value = "%%%02X" % key ... self[key] = value ... return value ... >>> d = D() >>> d[66] = 'B' >>> d[66] 'B' >>> d[67] __missing__ 67 '%43' >>> d[67] '%43'
msg71086 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-08-13 16:03
Selon Antoine Pitrou <report@bugs.python.org>: > As for the defaultdict, here is how it can look like (this is on 2.5): > (there should be a line here saying "class D(defaultdict)" :-)) > ... def __missing__(self, key): > ... print "__missing__", key > ... value = "%%%02X" % key > ... self[key] = value > ... return value cheers Antoine.
msg71088 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2008-08-13 16:41
>> Now that you've spent so much time with this patch, can't you think >> of a faster way of doing this? > > Well firstly, you could replace Quoter (the class) with a "quoter" > function, which is nested inside quote. Would calling a nested function > be faster than a method call? Yes, but barely. >> I wonder if mapping a defaultdict wouldn't work. > > That is a good idea. Then, the "function" (as I describe above) would be > just the inside of what currently is the except block, and that would be > the default_factory of the defaultdict. I think that should speed things up. Yes, it would be tremendously faster, since the method would be called only once per byte value (for each value of 'safe'), and if that byte is repeated in the input, further occurrences will use the __getitem__ function of the defaultdict, which is implemented in C. > I'm very hazy about what is faster in the bytecode world of Python, and > wary of making a change and proclaiming "this is faster!" without doing > proper speed tests (which is why I think this optimisation could be > delayed until at least after the core interface changes are made). That's very wise. But a first-order approximation of the speed of something is often "how many functions/methods implemented in Python (i.e. with def or lambda) does it call?" > But I'll have a go at that change tomorrow. > > (I won't be able to work on this for up to 24 hours). That's fine, as long as we have closure before beta3, which is next Wednesday.
msg71089 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-08-13 16:56
Feel free to take the function implementation from my patch, if it speeds things up (and it should). Bill On Wed, Aug 13, 2008 at 9:41 AM, Guido van Rossum <report@bugs.python.org>wrote: > > Guido van Rossum <guido@python.org> added the comment: > > >> Now that you've spent so much time with this patch, can't you think > >> of a faster way of doing this? > > > > Well firstly, you could replace Quoter (the class) with a "quoter" > > function, which is nested inside quote. Would calling a nested function > > be faster than a method call? > > Yes, but barely. > > >> I wonder if mapping a defaultdict wouldn't work. > > > > That is a good idea. Then, the "function" (as I describe above) would be > > just the inside of what currently is the except block, and that would be > > the default_factory of the defaultdict. I think that should speed things > up. > > Yes, it would be tremendously faster, since the method would be called > only once per byte value (for each value of 'safe'), and if that byte > is repeated in the input, further occurrences will use the __getitem__ > function of the defaultdict, which is implemented in C. > > > I'm very hazy about what is faster in the bytecode world of Python, and > > wary of making a change and proclaiming "this is faster!" without doing > > proper speed tests (which is why I think this optimisation could be > > delayed until at least after the core interface changes are made). > > That's very wise. But a first-order approximation of the speed of > something is often "how many functions/methods implemented in Python > (i.e. with def or lambda) does it call?" > > > But I'll have a go at that change tomorrow. > > > > (I won't be able to work on this for up to 24 hours). > > That's fine, as long as we have closure before beta3, which is next > Wednesday. > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue3300> > _______________________________________ >
msg71090 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-08-13 17:05
Erik van der Poel at Google has now chimed in with stats on current URL usage: ``...the bottom line is that escaped non-utf-8 is still quite prevalent, enough (in my opinion) to require an implementation in Python, possibly even allowing for different encodings in the path and query parts (e.g. utf-8 path and gb2312 query).'' http://lists.w3.org/Archives/Public/www-international/2008JulSep/0042.html I think it's worth remembering that a very large proportion of the use of Python's urllib.unquote() is in implementations of Web server frameworks of one sort or another. We can't control what the browsers that talk to such frameworks produce; the IETF doesn't control that, either. In this case, "practicality beats purity" is the clarion call of the browser designers, and we'd better be able to support them.
msg71091 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2008-08-13 17:17
> Bill Janssen <bill.janssen@gmail.com> added the comment: > > Erik van der Poel at Google has now chimed in with stats on current URL > usage: > > ``...the bottom line is that escaped non-utf-8 is still quite prevalent, > enough (in my opinion) to require an implementation in Python, possibly > even allowing for different encodings in the path and query parts (e.g. > utf-8 path and gb2312 query).'' > > http://lists.w3.org/Archives/Public/www-international/2008JulSep/0042.html > > I think it's worth remembering that a very large proportion of the use > of Python's urllib.unquote() is in implementations of Web server > frameworks of one sort or another. We can't control what the browsers > that talk to such frameworks produce; the IETF doesn't control that, > either. In this case, "practicality beats purity" is the clarion call > of the browser designers, and we'd better be able to support them. I think we're supporting these sufficiently by allowing developers to override the encoding and errors value. I see no argument here against having a default encoding of UTF-8.
msg71092 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-08-13 17:51
Le mercredi 13 août 2008 à 17:05 +0000, Bill Janssen a écrit : > I think it's worth remembering that a very large proportion of the use > of Python's urllib.unquote() is in implementations of Web server > frameworks of one sort or another. We can't control what the browsers > that talk to such frameworks produce; Yes, we do. Browsers will use whatever charset is specified in the HTML for the query part; and, as for the path part, they should't produce it themselves, they just follow a link which should already be percent-quoted in the HTML. (URL rewriting at the HTTP server level can make this more complicated, since it can turn a query fragment into a path fragment or vice-versa; however, most modern frameworks alleviate the need for such rewriting, since they allow to specify flexible mapping rules at the framework level) The situation in which we can't control the encoding is when getting the URLs from third-part content (e.g. some Web page which we didn't produce ourselves, or some link in an e-email). But in those cases there's less use cases for unquoting the URL rather than use it as-is. The only time I've wanted to unquote such an URL was to do some processing of HTTP referrers in order to extract which search queries had led people to visit a Web site.
msg71096 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-08-13 19:49
On Wed, Aug 13, 2008 at 10:51 AM, Antoine Pitrou <report@bugs.python.org>wrote: > > Antoine Pitrou <pitrou@free.fr> added the comment: > > Le mercredi 13 août 2008 à 17:05 +0000, Bill Janssen a écrit : > > I think it's worth remembering that a very large proportion of the use > > of Python's urllib.unquote() is in implementations of Web server > > frameworks of one sort or another. We can't control what the browsers > > that talk to such frameworks produce; > > Yes, we do. Browsers will use whatever charset is specified in the HTML > for the query part; and, as for the path part, they should't produce it > themselves, they just follow a link which should already be > percent-quoted in the HTML. Sure. What I meant was that we don't control what the browsers do, we just go along with what they do, that is, we try to play with the default understanding that's developed between the "consenting pairs" of Apache/Firefox or ASP/IE.
msg71121 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-14 11:35
Ah cheers Antoine, for the tip on using defaultdict (I was confused as to how I could access the key just by passing defaultfactory, as the manual suggests).
msg71124 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-14 12:18
OK I implemented the defaultdict solution. I got curious so ran some rough speed tests, using the following code. import random, urllib.parse for i in range(0, 100000): str = ''.join(chr(random.randint(0, 0x10ffff)) for _ in range(50)) quoted = urllib.parse.quote(str) Time to quote 100,000 random strings of 50 characters. (Ran each test twice, worst case printed) HEAD, chars in range(0,0x110000): 1m44.80 HEAD, chars in range(0,256): 25.0s patch9, chars in range(0,0x110000): 35.3s patch9, chars in range(0,256): 27.4s New, chars in range(0,0x110000): 31.4s New, chars in range(0,256): 25.3s Head is the current Py3k head. Patch 9 is my previous patch (before implementing defaultdict), and New is after implementing defaultdict. Interesting. Defaultdict didn't really make much of an improvement. You can see the big help the cache itself makes, though (my code caches all chars, whereas the HEAD just caches ASCII chars, which is why HEAD is so slow on the full repertoire test). Other than that, differences are fairly negligible. However, I'll keep the defaultdict code, I quite like it, speedy or not (it is slightly faster).
msg71126 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-08-14 13:07
Hello Matt, > OK I implemented the defaultdict solution. I got curious so ran some > rough speed tests, using the following code. > > import random, urllib.parse > for i in range(0, 100000): > str = ''.join(chr(random.randint(0, 0x10ffff)) for _ in range(50)) > quoted = urllib.parse.quote(str) I think if you move the line defining "str" out of the loop, relative timings should change quite a bit. Chances are that the random functions are not very fast, since they are written in pure Python. Or you can create an inner loop around the call to quote(), for example to repeat it 100 times. cheers Antoine.
msg71130 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-14 15:27
New patch (patch10). Details on Rietveld review tracker (http://codereview.appspot.com/2827). Another update on the remaining "outstanding issues": Resolved issues since last time: > Should unquote accept a bytes/bytearray as well as a str? No. But see below. > Lib/email/utils.py: > Should encode_rfc2231 with charset=None accept strings with non-ASCII > characters, and just encode them to UTF-8? Implemented Antoine's fix ("or 'ascii'"). > Should quote accept safe characters outside the > ASCII range (thereby potentially producing invalid URIs)? No. New issues: unquote_to_bytes doesn't cope well with non-ASCII characters (currently encodes as UTF-8 - not a lot we can do since this is a str->bytes operation). However, we can allow it to accept a bytes as input (while unquote does not), and it preserves the bytes precisely. Discussion at http://codereview.appspot.com/2827/diff/82/84, line 265. I have implemented that suggestion - so unquote_to_bytes now accepts either a bytes or str, while unquote accepts only a str. No changes need to be made unless there is disagreement on that decision. I also emailed Barry Warsaw about the email/utils.py patch (because we weren't sure exactly what that code was doing). However, I'm sure that this patch isn't breaking anything there, because I call unquote with encoding="latin-1", which is the same behaviour as the current head. That's all the issues I have left over in this patch. Attaching patch 10 (for revision 65675). Commit log for patch 10: Fix for issue 3300. urllib.parse.unquote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the decoding of percent-encoded octets. As per RFC 3986, default is "utf-8" (previously implicitly decoded as ISO-8859-1). Fixed a bug in which mixed-case hex digits (such as "%aF") weren't being decoded at all. urllib.parse.quote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the encoding of non-ASCII characters before being percent-encoded. Default is "utf-8" (previously characters in range(128, 256) were encoded as ISO-8859-1, and characters above that as UTF-8). Characters/bytes above 128 are no longer allowed to be "safe". Now allows either bytes or strings. Optimised "Quoter"; now inherits defaultdict. Added functions urllib.parse.quote_from_bytes, urllib.parse.unquote_to_bytes. All quote/unquote functions now exported from the module. Doc/library/urllib.parse.rst: Updated docs on quote and unquote to reflect new interface, added quote_from_bytes and unquote_to_bytes. Lib/test/test_urllib.py: Added many new test cases testing encoding and decoding Unicode strings with various encodings, as well as testing the new functions. Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py, Lib/test/test_wsgiref.py: Updated and added test cases to deal with UTF-8-encoded URIs. Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote with encoding="latin-1", to preserve existing behaviour (which the email module is dependent upon).
msg71131 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-14 15:31
Antoine: > I think if you move the line defining "str" out of the loop, relative > timings should change quite a bit. Chances are that the random > functions are not very fast, since they are written in pure Python. Well I wanted to test throwing lots of different URIs to test the caching behaviour. You're right though, probably a small % of the time is spent on calling quote. Oh well, the defaultdict implementation is in patch10 anyway :) It cleans Quoter up somewhat, so it's a good thing anyway. Thanks for your help.
msg71332 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-18 14:27
Hi, Sorry to bump this, but you (Guido) said you wanted this closed by Wednesday. Is this patch committable yet? (There are no more unresolved issues that I am aware of).
msg71356 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2008-08-18 18:32
Looking into this now. Will make sure it's included in beta3.
msg71387 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2008-08-18 21:45
Checked in patch 10 with minor style changes as r65838. Thanks Matt for persevering! Thanks everyone else for contributing; this has been quite educational.
msg71530 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-08-20 09:56
There's an unquote()-related failure in #3613.
msg71533 - (view)	Author: Matt Giuca (mgiuca)	Date: 2008-08-20 10:03
Thanks for pointing that out, Antoine. I just commented on that bug.

History
Date	User	Action	Args
2022-04-11 14:56:36	admin	set	github: 47550
2008-08-20 10:03:34	mgiuca	set	messages: + msg71533
2008-08-20 09:56:19	pitrou	set	messages: + msg71530
2008-08-18 21:45:16	gvanrossum	set	status: open -> closed resolution: accepted messages: + msg71387
2008-08-18 18:32:38	gvanrossum	set	priority: release blocker assignee: gvanrossum messages: + msg71356
2008-08-18 14:27:50	mgiuca	set	messages: + msg71332
2008-08-14 15:31:39	mgiuca	set	messages: + msg71131
2008-08-14 15:27:27	mgiuca	set	files: + parse.py.patch10 messages: + msg71130
2008-08-14 13:07:23	pitrou	set	messages: + msg71126
2008-08-14 12:18:49	mgiuca	set	messages: + msg71124
2008-08-14 11:35:12	mgiuca	set	messages: + msg71121
2008-08-13 20:12:00	gvanrossum	set	files: - unnamed
2008-08-13 20:11:53	gvanrossum	set	files: - unnamed
2008-08-13 19:49:51	janssen	set	files: + unnamed messages: + msg71096
2008-08-13 17:51:36	pitrou	set	messages: + msg71092
2008-08-13 17:17:20	gvanrossum	set	messages: + msg71091
2008-08-13 17:05:20	janssen	set	messages: + msg71090
2008-08-13 16:56:12	janssen	set	files: + unnamed messages: + msg71089
2008-08-13 16:41:43	gvanrossum	set	messages: + msg71088
2008-08-13 16:03:26	pitrou	set	messages: + msg71086
2008-08-13 16:02:06	pitrou	set	messages: + msg71085
2008-08-13 15:09:00	mgiuca	set	messages: + msg71084
2008-08-13 14:50:30	gvanrossum	set	messages: + msg71083
2008-08-13 14:28:56	lemburg	set	nosy: - lemburg
2008-08-13 14:25:23	mgiuca	set	messages: + msg71082
2008-08-12 23:43:48	gvanrossum	set	messages: + msg71073
2008-08-12 22:12:51	pitrou	set	messages: + msg71072
2008-08-12 19:37:40	janssen	set	messages: + msg71069
2008-08-12 17:47:34	janssen	set	messages: + msg71065
2008-08-12 17:38:39	janssen	set	messages: + msg71064
2008-08-12 16:04:38	pitrou	set	messages: + msg71057
2008-08-12 15:26:41	mgiuca	set	messages: + msg71055
2008-08-12 15:20:13	mgiuca	set	messages: + msg71054
2008-08-12 03:43:37	janssen	set	messages: + msg71043
2008-08-12 03:36:29	janssen	set	files: - unnamed
2008-08-12 02:30:54	janssen	set	files: + unnamed messages: + msg71042
2008-08-10 11:25:34	mgiuca	set	messages: + msg70970
2008-08-10 10:49:52	pitrou	set	messages: + msg70969
2008-08-10 07:45:17	mgiuca	set	files: + parse.py.patch9 messages: + msg70965
2008-08-10 07:05:47	mgiuca	set	files: + parse.py.patch8+allsafe messages: + msg70962
2008-08-10 05:05:08	mgiuca	set	messages: + msg70959
2008-08-10 03:09:53	mgiuca	set	messages: + msg70958
2008-08-10 00:35:10	jimjjewett	set	messages: + msg70955
2008-08-09 18:34:23	mgiuca	set	messages: + msg70949
2008-08-08 21:28:26	janssen	set	files: - patch
2008-08-08 21:28:11	janssen	set	files: + patch messages: + msg70913
2008-08-08 02:07:02	janssen	set	messages: + msg70880
2008-08-08 02:04:32	janssen	set	messages: + msg70879
2008-08-08 01:34:39	janssen	set	files: - unnamed
2008-08-08 01:34:31	janssen	set	files: - unnamed
2008-08-08 01:34:04	janssen	set	messages: + msg70878
2008-08-08 00:18:54	janssen	set	files: + unnamed messages: + msg70872
2008-08-07 23:23:48	gvanrossum	set	messages: + msg70869
2008-08-07 22:58:21	janssen	set	files: + unnamed messages: + msg70868
2008-08-07 21:46:23	gvanrossum	set	messages: + msg70862
2008-08-07 21:43:25	lemburg	set	nosy: + lemburg messages: + msg70861
2008-08-07 21:17:08	janssen	set	messages: + msg70858
2008-08-07 20:49:25	janssen	set	messages: + msg70855
2008-08-07 16:59:53	gvanrossum	set	messages: + msg70840
2008-08-07 15:03:00	mgiuca	set	files: + parse.py.metapatch8 messages: + msg70834
2008-08-07 14:59:58	mgiuca	set	files: + parse.py.patch8 messages: + msg70833
2008-08-07 14:35:18	mgiuca	set	messages: + msg70830
2008-08-07 14:20:05	pitrou	set	messages: + msg70828
2008-08-07 13:42:47	mgiuca	set	messages: + msg70824
2008-08-07 12:11:16	mgiuca	set	messages: + msg70818
2008-08-06 22:25:37	jimjjewett	set	messages: + msg70807
2008-08-06 21:39:38	gvanrossum	set	messages: + msg70806
2008-08-06 21:03:11	jimjjewett	set	messages: + msg70804
2008-08-06 19:51:24	gvanrossum	set	nosy: + gvanrossum messages: + msg70800
2008-08-06 18:39:42	pitrou	set	nosy: + pitrou messages: + msg70793
2008-08-06 17:29:30	jimjjewett	set	nosy: + jimjjewett messages: + msg70791
2008-08-06 16:51:26	janssen	set	files: + patch messages: + msg70788
2008-08-06 16:49:39	janssen	set	files: - myunquote.py
2008-08-06 05:59:43	janssen	set	files: + myunquote.py nosy: + janssen messages: + msg70771
2008-07-31 14:49:51	mgiuca	set	files: + parse.py.patch7 messages: + msg70512
2008-07-31 11:27:36	mgiuca	set	files: + parse.py.patch6 messages: + msg70497
2008-07-12 17:05:56	mgiuca	set	files: + parse.py.patch5 messages: + msg69591
2008-07-12 11:32:53	mgiuca	set	files: + parse.py.patch4 messages: + msg69583
2008-07-11 07:18:47	mgiuca	set	messages: + msg69537
2008-07-11 07:03:25	mgiuca	set	messages: + msg69535
2008-07-10 20:55:42	loewis	set	messages: + msg69519
2008-07-10 16:13:03	mgiuca	set	messages: + msg69508 versions: + Python 3.0, - Python 3.1
2008-07-10 01:55:00	mgiuca	set	files: + parse.py.patch3 messages: + msg69493
2008-07-09 20:28:32	loewis	set	messages: + msg69485
2008-07-09 20:26:41	loewis	set	versions: + Python 3.1, - Python 3.0
2008-07-09 15:51:51	mgiuca	set	files: + parse.py.patch2 messages: + msg69473
2008-07-09 15:20:49	thomaspinckney3	set	nosy: + thomaspinckney3 messages: + msg69472
2008-07-09 03:03:37	orsenthil	set	nosy: + orsenthil
2008-07-07 01:45:25	mgiuca	set	messages: + msg69366
2008-07-06 16:54:35	loewis	set	nosy: + loewis messages: + msg69339
2008-07-06 14:52:09	mgiuca	create