Message 101044 - Python tracker

Author	mgiuca
Recipients	mgiuca
Date	2010-03-14.08:46:53
Message-id	<1268556415.98.0.66516345593.issue8135@psf.upfronthosting.co.za>
In-reply-to

Content
urllib.unquote fails to decode a percent-escape with mixed case. To demonstrate:

>>> unquote("%fc")
'\xfc'
>>> unquote("%FC")
'\xfc'
>>> unquote("%Fc")
'%Fc'
>>> unquote("%fC")
'%fC'

Expected behaviour:

>>> unquote("%Fc")
'\xfc'
>>> unquote("%fC")
'\xfc'

I actually fixed this bug in Python 3, at Guido's request as part of the huge fix to issue 3300. To quote Guido:

> # Maps lowercase and uppercase variants (but not mixed case).
> That sounds like a disaster.  Why would %aa and %AA be correct but
> not %aA and %Aa?  (Even though the old code had the same problem.)

(Indeed, RFC 3986 allows mixed-case percent escapes: it treats the uppercase and lowercase hex digits as equivalent.)

I have attached a patch which fixes it by removing the dict that maps the all-lowercase and all-uppercase variants to characters, and calling int(item[:2], 16) instead. It's slower, but correct. This is the same solution we used in Python 3.
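
For illustration, here is a minimal sketch of that approach. It is not the attached patch: the name unquote_sketch is hypothetical, it is written for Python 3, and it uses chr() to map each byte value to the matching code point so the results line up with the '\xfc' shown above. The explicit hex-digit check is an assumption added here, because int() on its own also accepts signs and surrounding whitespace.

```python
import string

# All characters that may appear in a percent escape, in either case.
_HEX_DIGITS = frozenset(string.hexdigits)


def unquote_sketch(s):
    """Decode %XX escapes, accepting any mix of upper and lower case."""
    parts = s.split('%')
    # parts[0] is the text before the first '%', so it never holds an escape.
    for i in range(1, len(parts)):
        item = parts[i]
        code = item[:2]
        # Assumption: require exactly two hex digits before calling int(),
        # since int("+f", 16) and int(" f", 16) would otherwise succeed.
        if len(code) == 2 and all(c in _HEX_DIGITS for c in code):
            parts[i] = chr(int(code, 16)) + item[2:]
        else:
            # Not a valid escape: put the '%' back and leave the text alone.
            parts[i] = '%' + item
    return ''.join(parts)
```

With this sketch, "%fc", "%FC", "%Fc" and "%fC" all decode to the same character, while something like "%xy" passes through unchanged.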

I've also backported a number of test cases from Python 3 which cover this issue, as well as cases of genuinely invalid percent encoding.

Note: I've also backported the remainder of the 'unquote' test cases from Python 3, but in doing so I found another bug, which I will report separately, with a patch.

History
Date	User	Action	Args
2010-03-14 08:46:56	mgiuca	set	recipients: + mgiuca
2010-03-14 08:46:55	mgiuca	set	messageid: <1268556415.98.0.66516345593.issue8135@psf.upfronthosting.co.za>
2010-03-14 08:46:54	mgiuca	link	issue8135 messages
2010-03-14 08:46:53	mgiuca	create