classification
Title: urllib.unquote doesn't decode mixed-case percent escapes
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 2.7, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: orsenthil Nosy List: barry, ezio.melotti, mgiuca, orsenthil
Priority: normal Keywords: patch

Created on 2010-03-14 08:46 by mgiuca, last changed 2010-03-29 19:31 by orsenthil. This issue is now closed.

Files
File name Uploaded Description Edit
urllib-unquote-mixcase.patch mgiuca, 2010-03-14 08:46 Patch for urllib.unquote (obsolete)
urllib-unquote-mixcase.patch2 mgiuca, 2010-03-15 01:58 Patch for urllib.unquote
Messages (9)
msg101044 - (view) Author: Matt Giuca (mgiuca) Date: 2010-03-14 08:46
urllib.unquote fails to decode a percent-escape with mixed case. To demonstrate:

>>> unquote("%fc")
'\xfc'
>>> unquote("%FC")
'\xfc'
>>> unquote("%Fc")
'%Fc'
>>> unquote("%fC")
'%fC'

Expected behaviour:

>>> unquote("%Fc")
'\xfc'
>>> unquote("%fC")
'\xfc'

I actually fixed this bug in Python 3, at Guido's request as part of the huge fix to issue 3300. To quote Guido:

> # Maps lowercase and uppercase variants (but not mixed case).
> That sounds like a disaster.  Why would %aa and %AA be correct but
> not %aA and %Aa?  (Even though the old code had the same problem.)

(Indeed, the RFC 3986 allows mixed-case percent escapes.)

I have attached a patch which fixes it simply by removing the dict mapping all lower and uppercase variants to characters, and simply calling int(item[:2], 16). It's slower, but correct. This is the same solution we used in Python 3.

I've also backported a number of test cases from Python 3 which cover this issue, and also legitimate bad percent encoding.

Note: I've also backported the remainder of the 'unquote' test cases from Python 3 but I found another bug, so I will report that separately, with a patch.
msg101056 - (view) Author: Matt Giuca (mgiuca) Date: 2010-03-14 12:38
> Note: I've also backported the remainder of the 'unquote' test cases
> from Python 3 but I found another bug, so I will report that separately,
> with a patch.

Filed under issue #8136.
msg101076 - (view) Author: Matt Giuca (mgiuca) Date: 2010-03-15 01:08
Oh, I just discovered that urlparse contains a copy of unquote, which will also need to be patched. I've submitted a patch to remove the duplicate (#8143) -- if that is accepted first then there's no need to worry about it.
msg101079 - (view) Author: Matt Giuca (mgiuca) Date: 2010-03-15 01:54
I thought more about it, and wrote a different patch which doesn't remove the dictionary. I just replaced the dictionary creation code -- now it includes keys for all combinations of upper and lower case (for two-letter hex codes). This dictionary isn't much bigger -- 484 entries where is previously had 412.

Therefore, here is a replacement patch (urllib-unquote-mixcase.patch2).
msg101080 - (view) Author: Matt Giuca (mgiuca) Date: 2010-03-15 01:58
Tiny fix to patch2 -- replaced list comprehension with generator expression in dictionary construction.
msg101089 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-03-15 08:56
I reviewed the patch:

+_hexdig = '0123456789ABCDEFabcdef'
+_hextochr = dict((a+b, chr(int(a+b,16))) for a in _hexdig for b in _hexdig)

is really a neat way to generate the dict of mixed-case percent escape to use with to unquote. I shall commit the patch to trunk code.

yes, following the other bug on unquote and we should be able to fair conclusion on it and include this logic in there.

Thanks.
msg101091 - (view) Author: Matt Giuca (mgiuca) Date: 2010-03-15 09:06
Thanks very much. Importantly, note that unquote is currently duplicated between urllib and urlparse. I have a bug on it (#8143) but in the meantime, you will have to commit this fix to both modules.
msg101261 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-03-18 12:29
Fixed this in r79047. If we are to backport this to release26-maint, we need barry's approval. Barry, any thoughts? The change is a minor improvement, we have lived with normal case percent escape for long, mixed case would be bonus in release26.
msg101902 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-03-29 19:31
merged into release26-maint as r79492. This issue can be closed.
History
Date User Action Args
2010-03-29 19:31:32orsenthilsetstatus: open -> closed

messages: + msg101902
2010-03-18 12:29:09orsenthilsetnosy: + barry
resolution: accepted -> fixed
messages: + msg101261
2010-03-15 09:06:56mgiucasetmessages: + msg101091
2010-03-15 08:56:51orsenthilsetnosy: + orsenthil
messages: + msg101089

assignee: orsenthil
resolution: accepted
2010-03-15 01:58:27mgiucasetfiles: + urllib-unquote-mixcase.patch2

messages: + msg101080
2010-03-15 01:56:47mgiucasetfiles: - urllib-unquote-mixcase.patch2
2010-03-15 01:54:07mgiucasetfiles: + urllib-unquote-mixcase.patch2

messages: + msg101079
2010-03-15 01:08:30mgiucasetmessages: + msg101076
2010-03-14 12:38:18mgiucasetmessages: + msg101056
2010-03-14 12:30:41ezio.melottisetpriority: normal
nosy: + ezio.melotti

stage: patch review
2010-03-14 08:46:54mgiucacreate