The 'unquote' function has some very strange behaviour on Unicode input. My proposed fix will, I am sure, be contentious, because it could change existing behaviour (only on unicode strings), but I think it's worth it for a sane unquote function.
Some historical context: I already reported this bug in Python 3 as http://bugs.python.org/issue3300 (or part of it). We argued for two months. I rewrote the function, and Guido accepted it. The same bugs are present in Python 2, but less urgent since they only affect 'unicode' strings (whereas in Python 3 they affected all strings).
PROBLEM DESCRIPTION
The basic problem is this:
Current behaviour:
>>> urllib.unquote(u"%CE%A3")
u'\xce\xa3'
(or u'Σ')
Desired behaviour:
>>> urllib.unquote(u"%CE%A3")
'\xce\xa3'
(Which decodes with UTF-8 to u'Σ')
Basically, if you give unquote a unicode string, it will give you back a unicode string, with all of the percent-escapes decoded with Latin-1. This behaviour was added in r39728. The following line was added:
res[i] = unichr(int(item[:2], 16)) + item[2:]
It takes a percent-escape (e.g., "CE"), converts it to an int (e.g., 0xCE), then calls unichr to form a Unicode character with that codepoint (e.g., u'\u00ce'). That's totally wrong. A URI percent-escape "is used to represent a data octet" [RFC 3986], not a Unicode code point.
I would argue that the result of unquote should always be a str, no matter the input. Since a URI represents a byte sequence, not a character sequence, unquote of a unicode should return a byte string, which the user can then decode as desired.
Note that in Python 3 we didn't have a choice, since all strings are unicode, we used a much more complicated solution. But we also added a function unquote_to_bytes. Python 2's unquote should behave like Python 3's unquote_to_bytes.
PROPOSED SOLUTION
To that end, my proposed solution is simply to encode the input unicode string with UTF-8, which is exactly what Python 3's unquote_to_bytes function does. I have attached a patch which does this. It is thoroughly tested and documented. However, see the discussion of potential problems later.
WHY THE CURRENT BEHAVIOUR IS BAD
I'll also point out that the patch in r39728 which created this problem also added a test case, still there, which demonstrates just how confusing this behaviour is:
r = urllib.unquote(u'br%C3%BCckner_sapporo_20050930.doc')
self.assertEqual(r, u'br\xc3\xbcckner_sapporo_20050930.doc')
This takes a string, clearly meant to be a UTF-8-encoded percent-escaped string for u'brückner_sapporo_20050930.doc', and unquotes it. Because of this bug, it is decoded with Latin-1 to the string 'brückner_sapporo_20050930.doc'. And this garbled string is *actually the expected output of the test case*!!
Furthermore, this behaviour is very confusing because it breaks equality of ASCII str and unicode objects. Consider:
>>> "%C3%BC" == u"%C3%BC"
True
>>> urllib.unquote("%C3%BC")
'\xc3\xbc'
>>> urllib.unquote(u"%C3%BC")
u'\xc3\xbc'
>>> urllib.unquote("%C3%BC") == urllib.unquote(u"%C3%BC")
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False
Why should the ASCII str object "%C3%BC" encode to one value, while the ASCII unicode object u"%C3%BC" encode to another? The two inputs represent the same string, so they should produce the same output.
POTENTIAL PROBLEMS
The attached patch will not, to my knowledge, affect any calls to unquote with str input. It only changes unicode input. Since this was buggy anyway, I think it's a legitimate fix.
I am, however, concerned about code breakage for existing code which uses unicode strings and depends upon this behaviour. Some use cases:
1. Unquoting a unicode string which is pure ASCII, with pure ASCII percent-escapes. This previously would produce a pure ASCII unicode string, now produces a pure ASCII str. This shouldn't break anything unless some code specifically checks that strings are of type 'unicode' (e.g., the Storm database library).
2. Unquoting a unicode string with pure ASCII percent-escapes, but non-ASCII characters. This previously would preserve all the unescaped characters; they will now be encoded to UTF-8. Technically this should never happen, as URIs are not allowed to contain non-ASCII characters [RFC 3986].
3. Unquoting a unicode string which is pure ASCII, with non-ASCII percent escapes. Some code may rely on the implicit decoding as Latin-1. However, I think it is more likely that existing code would just break, since most URIs are UTF-8 encoded.
TWO SOLUTIONS
Having gone through the problems, I imagine that we may reach the conclusion that it is too dangerous to "fix" this bug. Therefore, I am proposing an alternate solution (which I will attach in a follow-up comment), which is not to change the code at all. Instead, just fix the broken test case and add lots more test cases, and also document this odd behaviour thoroughly, and recommend that the input to unquote be passed as a string, then decoded as desired.
I will call the patches "urllib-unquote-fix" (which fixes the bug, adding lots of test cases and updating the documentation to describe the new behaviour), and "urllib-unquote-explain" (which adds lots of test cases and updates the documentation to describe the existing behaviour in detail).
Please discuss. :)
Note: I've just simultaneously filed another bug (issue #8135) on unquote relating to mixed case percent-escapes. Between them, these two patches include all relevant 'unquote' test cases from Python 3. I've also backported the 'quote' test cases in a patch for issue #1712522.
|