urllib.quote and unquote - Unicode issues #47550
Comments
Three Unicode-related problems with urllib.parse.quote and urllib.parse.unquote.

Firstly, unquote appears not to have been modified from Python 2: it implicitly decodes percent-encoded octets as ISO-8859-1 rather than as UTF-8, contrary to http://tools.ietf.org/html/rfc3986 section 2.5. Current behaviour:
>>> urllib.parse.unquote("%CE%A3")
'Î£'
(i.e. '\u00ce\u00a3')
Desired behaviour:
>>> urllib.parse.unquote("%CE%A3")
'Σ'
(i.e. '\u03a3')

Secondly, while quote *has* been modified to encode to UTF-8 before percent-encoding, characters in range(128, 256) are still implicitly encoded as ISO-8859-1. Current behaviour:
>>> urllib.parse.quote('\u00e9')
'%E9'
Desired behaviour:
>>> urllib.parse.quote('\u00e9')
'%C3%A9'

Note that currently, quoting characters less than 256 will use ISO-8859-1, while characters 256 and above use UTF-8.

Thirdly, the "safe" argument to quote does not work for characters above the ASCII range. Current behaviour:
>>> urllib.parse.quote('Σϰ', safe='Σ')
'%CE%A3%CF%B0'
Desired behaviour:
>>> urllib.parse.quote('Σϰ', safe='Σ')
'Σ%CF%B0'

A patch which fixes all three issues is attached. Note I also changed one of the test cases, which had the wrong expected output. All urllib test cases pass. Patch is for branch /branches/py3k, revision 64752.

Note: The above unquote issue also manifests itself in Python 2 for Unicode strings.

Commit log:
urllib.parse.unquote: Fixed percent-encoded octets being implicitly decoded as ISO-8859-1; they are now decoded as UTF-8.
urllib.parse.quote: Fixed characters in range(128, 256) being implicitly encoded as ISO-8859-1; all characters are now encoded as UTF-8.
Lib/test/test_urllib.py: Updated one test case for unquote which had the wrong expected output. |
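For readers following along: the first two behaviours requested above are exactly what eventually landed in Python 3's urllib.parse, so they can be checked directly on a modern interpreter:

```python
import urllib.parse

# unquote decodes percent-encoded octets as UTF-8 by default,
# so "%CE%A3" round-trips to GREEK CAPITAL LETTER SIGMA (U+03A3).
assert urllib.parse.unquote("%CE%A3") == "\u03a3"

# quote encodes the string to UTF-8 before percent-encoding,
# so U+00E9 becomes the two-octet sequence %C3%A9.
assert urllib.parse.quote("\u00e9") == "%C3%A9"

print("both fixed behaviours verified")
```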
Where precisely do you read such a SHOULD requirement? The last paragraph in section 2.5 is the only place that mentions a character encoding at all. |
Point taken. But the RFC certainly doesn't say that ISO-8859-1 should be used either. Having said that, it's possible that you may wish to use another encoding. Note that in the current implementation, unquote is not an inverse of quote. |
I mentioned this in a brief python-dev discussion earlier this year. I know there's no RFC that says this is what should be done, but in practice UTF-8 is what user agents produce. Possibly allow an optional charset argument to be passed into quote and unquote. |
OK I've gone back over the patch and decided to add the "encoding" and "errors" optional arguments. (Tom Pinckney just made the same suggestion right as I'm typing this up!) So my new patch is a bit more extensive, and changes the interface somewhat. Implementation detail: This changes the Quoter class a lot. Also fixed an issue with the previous patch where non-ASCII-compatible encodings were handled incorrectly. I then ran the full test suite and discovered two other modules' tests were affected.
Some potential issues:
I would also like to have a separate pair of functions, unquote_raw and quote_raw, which operate on bytes. Patch (parse.py.patch2) is for branch /branches/py3k, revision 64820.

Commit log:
urllib.parse.unquote: Added "encoding" and "errors" optional arguments.
urllib.parse.quote: Added "encoding" and "errors" optional arguments.
Lib/test/test_urllib.py, Lib/test/test_http_cookiejar.py: Updated test cases.
Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote with explicit encoding arguments. |
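A sketch of how the proposed "encoding" and "errors" arguments behave in the urllib.parse that eventually shipped (UTF-8 default, overridable per call):

```python
import urllib.parse

# Default encoding is UTF-8 ...
assert urllib.parse.quote("\u00e9") == "%C3%A9"

# ... but Latin-1 can be requested explicitly, reproducing the old
# single-octet Python 2-style behaviour.
assert urllib.parse.quote("\u00e9", encoding="latin-1") == "%E9"
assert urllib.parse.unquote("%E9", encoding="latin-1") == "\u00e9"

# unquote defaults to errors='replace', so an octet that is invalid
# UTF-8 becomes U+FFFD instead of raising.
assert urllib.parse.unquote("%FF") == "\ufffd"
```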
Assuming the patch is acceptable in the first place (which I personally hope it will be). |
OK well here are the necessary changes to the documentation (RST docs). As I said above, I plan to do extensive testing and add new test cases, and I will update the patch accordingly. Patch (parse.py.patch3) is for branch /branches/py3k, revision 64834.

Commit log:
urllib.parse.unquote: Added "encoding" and "errors" optional arguments.
urllib.parse.quote: Added "encoding" and "errors" optional arguments.
Doc/library/urllib.parse.rst: Updated docs on quote and unquote to describe the new arguments.
Lib/test/test_urllib.py, Lib/test/test_http_cookiejar.py: Updated test cases.
Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote with explicit encoding arguments. |
Setting Version back to Python 3.0. Is there a reason it was set to 3.1? |
3.0b1 has been released, so no new features can be added to 3.0. |
While my proposal is no doubt going to cause a lot of code breakage, I believe it is necessary. Perhaps I should explain further why this is a regression from Python 2. I give two examples.

Firstly, I believe that unquote(quote(x)) == x should always hold for all strings x:

>>> urllib.parse.unquote(urllib.parse.quote('ÿ')) # '\u00ff'
'ÿ'
# Works, because both functions work with ISO-8859-1 in this range.
>>> urllib.parse.unquote(urllib.parse.quote('Ā')) # '\u0100'
'Ä\x80'
# Fails, because quote uses UTF-8 and unquote uses ISO-8859-1.
My patch succeeds for all characters.
>>> urllib.parse.unquote(urllib.parse.quote('Ā')) # '\u0100'
'Ā'

Secondly, a bigger example, but I want to demonstrate how this bug plays out in a real web application. Consider this simple (beginnings of a) wiki system in Python 2.5, as a CGI script:

#---
import cgi
fields = cgi.FieldStorage()
title = fields.getfirst('title')
print("Content-Type: text/html; charset=utf-8")
print("")
print('<p>Debug: %s</p>' % repr(title))
if title is None:
    print("No article selected")
else:
    print('<p>Information about %s.</p>' % cgi.escape(title))
# (Place this in cgi-bin, navigate to it, and add a query string.)
#---

If you navigate to "?title=Mátt", it displays the text "Debug: 'M\xc3\xa1tt'" - a byte string. Now consider that you want to manipulate it as a Unicode string, still in Python 2.5:

#---
import sys
import cgi
def printu8(*args):
    """Prints to stdout encoding as utf-8, rather than the current terminal
    encoding. (Not a fully-featured print function.)"""
    sys.stdout.write(' '.join([x.encode('utf-8') for x in args]))
    sys.stdout.write('\n')
fields = cgi.FieldStorage()
title = fields.getfirst('title')
if title is not None:
    title = str(title).decode("utf-8", "replace")
print("Content-Type: text/html; charset=utf-8")
print("")
print('<p>Debug: %s.</p>' % repr(title))
if title is None:
    print("No article selected.")
else:
    printu8('<p>Information about %s.</p>' % cgi.escape(title))
# Now given the same input ("?title=Mátt"), it displays "Debug: u'M\xe1tt'" - a Unicode string.

Now let us upgrade this program to Python 3.0. (Note that I still can't print non-ASCII text reliably, hence printu8 remains.)

#---
import sys
import cgi
def printu8(*args):
    """Prints to stdout encoding as utf-8, rather than the current terminal
    encoding. (Not a fully-featured print function.)"""
    sys.stdout.buffer.write(b' '.join([x.encode('utf-8') for x in args]))
    sys.stdout.buffer.write(b'\n')
fields = cgi.FieldStorage()
title = fields.getfirst('title')
# Note: No call to decode. I have no opportunity to specify the encoding,
# since it comes straight out of FieldStorage as a Unicode string.
print("Content-Type: text/html; charset=utf-8")
print("")
print('<p>Debug: %s.</p>' % ascii(title))
if title is None:
    print("No article selected.")
else:
    printu8('<p>Information about %s.</p>' % cgi.escape(title))
# Now given the same input ("?title=Mátt"), it displays "Debug: 'M\xc3\xa1tt'" - the UTF-8 octets have been decoded as ISO-8859-1.

With my patch applied, the input ("?title=Mátt") produces the output "Debug: 'M\xe1tt'", as desired. Basically, this bug is going to affect all web developers as soon as they move to Python 3. My patch may not be the best, or most conservative, solution to this problem, but something needs to change here.

[1] http://tools.ietf.org/html/rfc3986#section-2.5 |
Since I got a complaint that my last reply was too long, I'll summarize it: It's a bug report, not a feature request. I can't get a simple web app to be properly Unicode-aware in Python 3, no matter what I do. |
OK I spent awhile writing test cases for quote and unquote, covering the encoding and errors arguments. I'd be interested in hearing if anyone disagrees with my expected output. I'm now confident I have good test coverage directly on the quote and unquote functions. I still haven't figured out what the behaviour of "safe" should be in the non-ASCII case. Patch (parse.py.patch4) is for branch /branches/py3k, revision 64891.

Commit log:
urllib.parse.unquote: Added "encoding" and "errors" optional arguments.
urllib.parse.quote: Added "encoding" and "errors" optional arguments.
Doc/library/urllib.parse.rst: Updated docs on quote and unquote to describe the new arguments.
Lib/test/test_urllib.py: Added several new test cases testing encoding and errors arguments.
Lib/test/test_http_cookiejar.py: Updated test case which expected output in ISO-8859-1.
Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote with explicit encoding arguments. |
So today I grepped for "urllib" in the entire library in an effort to find code affected by this change. So far I have found no code "breakage" except for the cases already mentioned. I also point out that the http.server module (unpatched) is internally inconsistent about encodings. I'm attaching patch 5, which adds a bunch of new test cases to various modules. Note that I haven't yet fully investigated urllib.request.

Aside from that, the only remaining matter is whether or not it's better to default to UTF-8 or Latin-1. So basically I think if there's support for it, this patch is just about ready. I'd be glad to hear any feedback about this proposal.

Not yet investigated: ./urllib/request.py
Looks fine, not tested: ./xmlrpc/client.py
Tested manually, fine: ./wsgiref/simple_server.py

import http.server
s = http.server.HTTPServer(('', 8000),
        http.server.SimpleHTTPRequestHandler)
s.serve_forever()

Test cases either exist or added, fine: ./urllib/robotparser.py, ./test/test_urllib.py

Commit log:
urllib.parse.unquote: Added "encoding" and "errors" optional arguments.
urllib.parse.quote: Added "encoding" and "errors" optional arguments.
Doc/library/urllib.parse.rst: Updated docs on quote and unquote to describe the new arguments.
Lib/test/test_urllib.py: Added several new test cases testing encoding and errors arguments.
Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote with explicit encoding arguments. |
OK after a long discussion on the mailing list, Guido gave this the OK:
http://mail.python.org/pipermail/python-dev/2008-July/081601.html

quote itself now accepts either a str or a bytes. quote_from_bytes is a new function accepting bytes only. unquote is still str->str. I've added a totally separate function, unquote_to_bytes, which returns the raw percent-decoded octets.

Note there is a slight issue here: I didn't quite know what to do with unquote_to_bytes when the input contains non-ASCII characters. Note that my new unquote doesn't have this problem; it's carefully written to handle such input. This makes unquote(s, encoding=e) necessarily more robust than unquote_to_bytes(s).decode(e).

I've also added new test cases and documentation for these two new functions. On an entirely personal note, can whoever checks this in please mention my name in the commit message :)

Commit log for patch6: Fix for bpo-3300.
urllib.parse.unquote: Added "encoding" and "errors" optional arguments.
urllib.parse.quote: Added "encoding" and "errors" optional arguments.
Added functions urllib.parse.quote_from_bytes, urllib.parse.unquote_to_bytes.
Doc/library/urllib.parse.rst: Updated docs on quote and unquote to describe the new arguments and functions.
Lib/test/test_urllib.py: Added several new test cases testing encoding and errors arguments.
Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote with explicit encoding arguments. |
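The str/bytes split described here is visible in the API that shipped; a quick demonstration of the two bytes-oriented functions:

```python
import urllib.parse

# quote_from_bytes percent-encodes raw octets with no encoding step.
assert urllib.parse.quote_from_bytes(b"\xce\xa3") == "%CE%A3"

# unquote_to_bytes returns the raw octets, leaving any decoding
# (and the choice of charset) to the caller.
assert urllib.parse.unquote_to_bytes("%CE%A3") == b"\xce\xa3"

# So unquote(s, encoding=e) is roughly unquote_to_bytes(s).decode(e),
# except that unquote is more forgiving of invalid input.
assert urllib.parse.unquote_to_bytes("%CE%A3").decode("utf-8") == "\u03a3"
```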
Hmm ... seems patch 6 I just uploaded fails a test case! Sorry! I've slightly changed the code in quote so it doesn't do that any more. Please review patch 7, not 6. Same commit log as above. (Also ... someone let me know if I'm not submitting patches properly.) |
Here's my version of how quote and unquote should be implemented in Python 3. Basically, percent-quoting is about creating an ASCII string that can safely be transmitted in a URI. For unquote, there's no way to tell what the octets of the quoted string represent without out-of-band information. |
Here's a patch to parse.py (and test/test_urllib.py) that makes the changes I described above. |
Is there still disagreement over anything except:

(1) The type signature of quote and unquote (as opposed to the bytes variants)?
(2) The default encoding (Latin-1 vs UTF-8), and (if UTF-8) what to do with undecodable input?
(3) Would waiting for 3.1 cause too many compatibility problems? |
Bill, I haven't studied your patch in detail but a few comments:
|
Bill Janssen's "patch" breaks two unittests: test_email and test_http_cookiejar.

Details for test_email:
======================================================================
Traceback (most recent call last):
  File "/usr/local/google/home/guido/python/py3k/Lib/email/test/test_email.py",
    line 3279, in test_rfc2231_bad_character_in_charset
    self.assertEqual(msg.get_content_charset(), None)
AssertionError: 'utf-8\\u201d' != None

Details for test_http_cookiejar:
======================================================================
Traceback (most recent call last):
  File "Lib/test/test_http_cookiejar.py", line 1454, in test_url_encoding
    self.assert_("foo=bar" in cookie and version_re.search(cookie))
AssertionError: None |
Matt pointed out that the email package assumes Latin-1 rather than UTF-8; I hadn't realized that. The cookiejar failure probably has the same root cause; that test is sensitive to the exact percent-encoding produced. So I see some evidence (probably not enough) for sticking with Latin-1 as the default. On the other hand, Matt shows that some of those extra str->bytes code paths could be removed. |
Dear GvR,

New code review comments by GvR have been published.

Message:
Here's a code review of your patch. I'm leaning more and more towards wanting this for 3.0, but I have some API concerns. I'm cross-linking this with the Python tracker issue, through the subject.

Details:
http://codereview.appspot.com/2827/diff/1/2
http://codereview.appspot.com/2827/diff/1/2#newcode198
http://codereview.appspot.com/2827/diff/1/2#newcode215
http://codereview.appspot.com/2827/diff/1/2#newcode223
http://codereview.appspot.com/2827/diff/1/2#newcode242
http://codereview.appspot.com/2827/diff/1/4
http://codereview.appspot.com/2827/diff/1/4#newcode224
http://codereview.appspot.com/2827/diff/1/5
http://codereview.appspot.com/2827/diff/1/5#newcode1450
http://codereview.appspot.com/2827/diff/1/3
http://codereview.appspot.com/2827/diff/1/3#newcode10
http://codereview.appspot.com/2827/diff/1/3#newcode265
http://codereview.appspot.com/2827/diff/1/3#newcode283

Also see my comment about defaulting to 'replace' in the doc file. Finally -- let's be consistent about quotes. It seems most of this file uses single quotes. And more: what should a None value for encoding or errors mean? IMO it should mean "use the default".

http://codereview.appspot.com/2827/diff/1/3#newcode382
I also wonder why safe should be limited to ASCII though.

http://codereview.appspot.com/2827/diff/1/3#newcode399
urllib.parse.quote_plus(b"abc def") raises a TypeError.

Sincerely,

Your friendly code review daemon (http://codereview.appspot.com/). |
Yes; that is one of the breakages you found in Bill's patch. (He didn't run the full test suite.)

Probably. Looking at http://www.faqs.org/rfcs/rfc2965.html

(1) That is not among the exact tests in the RFC. Whether we have to use Latin-1 (or the document charset) in practice for cookies remains unclear. |
Dear GvR,

New code review comments by mgiuca have been published.

Message:
Thanks very much for this very detailed review. I've replied to the comments inline on Rietveld. |
A reply to a point on GvR's review, I'd like to open for discussion.
The reasoning is this: if we allow non-ASCII characters to be "safe", then quote can return a string containing non-ASCII characters, which is not a valid URI. |
On Thursday 07 August 2008 at 13:42 +0000, Matt Giuca wrote:

The important thing is that the defaults are safe. If users want to override them, they can do so explicitly. |
OK I think that's a fairly valid argument. I'm about to head off so I'll respond in full later. |
Following Guido and Antoine's reviews, I've written a new patch which addresses the points raised.

Outstanding issues raised by the reviews:
Doc/library/urllib.parse.rst:
Lib/email/utils.py:
Lib/test/test_http_cookiejar.py:
Lib/test/test_urllib.py:
Lib/urllib/parse.py:

------

Commit log for patch8: Fix for bpo-3300.
urllib.parse.unquote: Added "encoding" and "errors" optional arguments.
urllib.parse.quote: Added "encoding" and "errors" optional arguments.
Added functions urllib.parse.quote_from_bytes, urllib.parse.unquote_to_bytes.
Doc/library/urllib.parse.rst: Updated docs on quote and unquote to describe the new arguments and functions.
Lib/test/test_urllib.py: Added many new test cases testing encoding and errors arguments.
Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote with explicit encoding arguments. |
Larry Masinter is off on vacation, but I did get a brief message saying he would respond when he returns. Out of curiosity, I sent a note off to the www-international mailing list:

``For the authority (server name) portion of a URI, RFC 3986 is pretty clear that UTF-8 must be used for non-ASCII values...''

http://lists.w3.org/Archives/Public/www-international/2008JulSep/0041.html |
For Antoine: I think the problem that Barry is facing with the email package is that it cannot assume any particular charset. |
Here's another thought:

Let's put string_to_bytes and string_from_bytes into the binascii module. Then parse.py would import them as:

from binascii import a2b_percent as percent_decode_as_bytes
from binascii import b2a_percent as percent_encode_from_bytes

and add two more functions:

def percent_encode(<string>, encoding="UTF-8", error="strict", plus=False)
def percent_decode(<string>, encoding="UTF-8", error="strict", plus=False)

and would add backwards-compatible but deprecated functions for quote and unquote:

def quote(s):
    warnings.warn("urllib.parse.quote should be replaced by "
                  "percent_encode or percent_encode_from_bytes",
                  FutureDeprecationWarning)
    if isinstance(s, str):
        return percent_encode(s)
    else:
        return percent_encode_from_bytes(s)

def unquote(s):
    warnings.warn("urllib.parse.unquote should be replaced by "
                  "percent_decode or percent_decode_to_bytes",
                  FutureDeprecationWarning)
    if isinstance(s, str):
        return percent_decode(s)
    else:
        return percent_decode(str(s, "ASCII", "strict")) |
On Tuesday 12 August 2008 at 19:37 +0000, Bill Janssen wrote:

Well, it's my personal opinion, but I think we should focus on a simple solution for 3.0 (perhaps a sophisticated solution could still be adopted for 3.1, if someone works it out). |
I'll be reviewing it today or tomorrow. From looking at it briefly I have a few replies to points raised earlier:

- The bytes > 127 would be translated as themselves; this follows the Latin-1 convention.
- I think I agree with that comment; it seems wrong to return UTF-8 here by default.
- Agreed, safe should be ASCII chars only.
- I believe patch 9 still has errors defaulting to strict for quote().

Regarding using UTF-8 as the default encoding, I still think this is the right choice. |
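The "safe must be ASCII" decision is observable in the implementation that shipped: non-ASCII characters passed via safe are simply ignored, so they are still percent-encoded.

```python
import urllib.parse

# 'Σ' in safe is dropped (safe is encoded with 'ascii'/'ignore'),
# so both characters are still percent-encoded as UTF-8.
assert urllib.parse.quote("\u03a3\u03f0", safe="\u03a3") == "%CE%A3%CF%B0"

# ASCII characters in safe work as expected.
assert urllib.parse.quote("a/b c", safe="/") == "a/b%20c"
```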
OK I've come to the realization that DEMANDING ASCII (and erroring on anything else) is too strict. The same goes for unquote accepting bytes. We can decide to make it more liberal later.

Ah, but what about unquote (to string)? If it accepted bytes then it would also need to know how to interpret them.

I raised it as a concern, but I thought you overruled on that, so I left it as it was.

Yes, it does sound pretty bad. However, that's the current way of doing things. I'm not putting up a new patch now. The only fix I'd make is to add an explicit encoding to the call in Lib/email/utils.py (that is, the call at line 222 changes accordingly). By the way, this Rietveld is amazing. I'm assuming I don't have permission to upload patches to the review. |
On Wed, Aug 13, 2008 at 7:25 AM, Matt Giuca <report@bugs.python.org> wrote:
OK.
OK.
OK.
I'm OK with replace for unquote(); your point that bogus data is better than an exception is persuasive. For quote() I think strict is better -- it can't fail anyway with the default encoding.

Actually, while the Quoter class (the immediate subject of my scorn) works, I suspect it can be made faster.

Now that you've spent so much time with this patch, can't you think of a way to speed it up?

Thanks! You can't upload patches to the issue that *I* created, but you can upload your own patch set. I am hoping that in general we will be able to use Rietveld to review patches. |
There's just an odd inconsistency there, but it's only a tiny "gotcha".

That's exactly true, yes.

Well firstly, you could replace Quoter (the class) with a "quoter" function.

That is a good idea. Then, the "function" (as I describe above) would be unnecessary. I'm very hazy about what is faster in the bytecode world of Python, and would have to measure. (I won't be able to work on this for up to 24 hours.) |
Matt Giuca <report@bugs.python.org> wrote:

The obvious speedup is to remove the map() call and do the loop inside quote itself (also, defining a class with a single __call__ method is not a common Python idiom). As for the defaultdict, here is how it can look (this is on 2.5):

>>> class D(defaultdict):
...     def __missing__(self, key):
...         print "__missing__", key
...         value = "%%%02X" % key
...         self[key] = value
...         return value
...
>>> d = D()
>>> d[66] = 'B'
>>> d[66]
'B'
>>> d[67]
__missing__ 67
'%43'
>>> d[67]
'%43' |
Antoine Pitrou <report@bugs.python.org> wrote:

(there should be a line here saying "class D(defaultdict)" :-))

cheers Antoine. |
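Putting Antoine's two snippets together, a complete runnable version of the cached-escape defaultdict (Python 3 syntax; the class name is illustrative, not the one used in the patch):

```python
from collections import defaultdict

class EscapeMap(defaultdict):
    """Maps a byte value (0-255) to its percent-escape string,
    computing and caching each escape on first lookup."""
    def __missing__(self, key):
        value = "%%%02X" % key
        self[key] = value
        return value

m = EscapeMap()
m[0x42] = "B"              # pre-seed a "safe" byte with itself
assert m[0x42] == "B"      # the seeded entry wins
assert m[0x43] == "%43"    # computed by __missing__ ...
assert 0x43 in m           # ... and cached for next time
```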
Yes, but barely.

Yes, it would be tremendously faster, since the method would be called only for bytes not already in the cache.

That's very wise. But a first-order approximation of the speed of each approach is easy enough to measure.

That's fine, as long as we have closure before beta3, which is next Wednesday. |
Feel free to take the function implementation from my patch, if it speeds things up.

Bill

On Wed, Aug 13, 2008 at 9:41 AM, Guido van Rossum <report@bugs.python.org> wrote:
|
Erik van der Poel at Google has now chimed in with stats on current URL usage:

``...the bottom line is that escaped non-utf-8 is still quite prevalent, so it cannot be ignored...''

http://lists.w3.org/Archives/Public/www-international/2008JulSep/0042.html

I think it's worth remembering that a very large proportion of the use of these functions is against existing URLs found in the wild. |
I think we're supporting these sufficiently by allowing developers to specify the encoding and errors explicitly. |
On Wednesday 13 August 2008 at 17:05 +0000, Bill Janssen wrote:

Yes, we do. Browsers will use whatever charset is specified in the HTML page containing the form. (URL rewriting at the HTTP server level can make this more complicated, admittedly.) The situation in which we can't control the encoding is when getting the URL from somewhere else entirely. |
On Wed, Aug 13, 2008 at 10:51 AM, Antoine Pitrou <report@bugs.python.org> wrote:

Sure. What I meant was that we don't control what the browsers do, we just have to deal with the results. |
Ah cheers Antoine, for the tip on using defaultdict (I was confused as to how __missing__ worked). |
OK I implemented the defaultdict solution. I got curious so ran some timings:

import random, urllib.parse
for i in range(0, 100000):
    str = ''.join(chr(random.randint(0, 0x10ffff)) for _ in range(50))
    quoted = urllib.parse.quote(str)

Time to quote 100,000 random strings of 50 characters:
HEAD, chars in range(0, 0x110000): 1m44.80

HEAD is the current Py3k head. Patch 9 is my previous patch (before the defaultdict change). Interesting. Defaultdict didn't really make much of an improvement. However, I'll keep the defaultdict code, I quite like it, speedy or not. |
Hello Matt,

I think if you move the line defining "str" out of the loop, the relative timings of the different versions should become more meaningful.

cheers Antoine. |
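Antoine's point can be folded into a small timeit-based harness (a hypothetical sketch, not the script that produced the figures above). The test data is built once, outside the timed region, and surrogate code points are skipped because they cannot be UTF-8-encoded:

```python
import random
import timeit
import urllib.parse

def rand_char():
    # Skip the surrogate range U+D800-U+DFFF, which quote() cannot
    # encode to UTF-8; substitute a plain ASCII character instead.
    c = random.randint(0, 0x10FFFF)
    return chr(c) if not (0xD800 <= c <= 0xDFFF) else "a"

# Build the strings once, so setup cost stays out of the measurement.
samples = ["".join(rand_char() for _ in range(50)) for _ in range(1000)]

elapsed = timeit.timeit(
    lambda: [urllib.parse.quote(s) for s in samples], number=10)
print("%.3fs to quote %d strings 10 times" % (elapsed, len(samples)))
```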
New patch (patch10). Details on the Rietveld review tracker.

Another update on the remaining "outstanding issues":

Resolved issues since last time:

New issues: unquote_to_bytes doesn't cope well with non-ASCII characters in its input. I have *implemented* that suggestion - so unquote_to_bytes now accepts either a str or a bytes. I also emailed Barry Warsaw about the email/utils.py patch (because it changes his module).

That's all the issues I have left over in this patch. Attaching patch 10 (for revision 65675).

Commit log for patch 10: Fix for bpo-3300.
urllib.parse.unquote:
urllib.parse.quote:
Added functions urllib.parse.quote_from_bytes, urllib.parse.unquote_to_bytes.
Doc/library/urllib.parse.rst: Updated docs on quote and unquote to describe the new arguments and functions.
Lib/test/test_urllib.py: Added many new test cases testing encoding and errors arguments.
Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote with explicit encoding arguments. |
Antoine:

Well, I wanted to test throwing lots of different URIs at it, to exercise the cache behaviour. Oh well, the defaultdict implementation is in patch10 anyway :) It doesn't hurt. |
Hi. Sorry to bump this, but you (Guido) said you wanted this closed by beta3. |
Looking into this now. Will make sure it's included in beta3. |
Checked in patch 10 with minor style changes as r65838.

Thanks Matt for persevering! Thanks everyone else for contributing. |
There's an unquote()-related failure in bpo-3613. |
Thanks for pointing that out, Antoine. I just commented on that bug. |