Message 101043 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	mgiuca
Recipients	ajaksu2, collinwinter, eric.araujo, mgiuca, nagle, orsenthil, vak, varmaa, vstinner
Date	2010-03-14.08:46:31
SpamBayes Score	0.0
Marked as misclassified	No
Message-id	<1268556397.6.0.438276976101.issue1712522@psf.upfronthosting.co.za>
In-reply-to

Content
I've finally gotten around to a complete analysis of this code. I have a code/test/documentation patch which fixes the issue without any code breakage. There is another bug in quote which I've found and fixed with this patch: If the 'safe' parameter is unicode, it raises a UnicodeDecodeError. I have backported all of the 'quote' test cases from Python 3 (which I wrote) to Python 2. This exposed the reported bug as well as the above one. It's good to have a much larger set of test cases to work with. It tests things like all combinations of str/unicode, as well as non-ASCII byte string input and all manner of unicode inputs. The bugfix itself comes from Python 3 (this has already been approved, over many months, by Guido, so I am hoping a similar change can get pushed through into Python 2 fairly easily). The solution is to add "encoding" and "errors" arguments to 'quote', and have quote encode the unicode string before anything else. 'encoding' defaults to 'utf-8'. So: >>> quote(u'/El Niño/') '/El%20Ni%C3%B1o/' which is typically the desired behaviour. (Note that URI syntax does not cover Unicode strings; it merely says to encode them with some encoding, recommended but not required UTF-8, and then percent-encode those.) With this patch, quote always returns a str, even on unicode input. I think that makes sense, because a URI is, by definition, an ASCII string. It could easily be made to return a unicode instead. The other fix is for 'safe'. If 'safe' is a byte string we don't touch it. But if it is a Unicode string, we throw away all non-ASCII bytes. This means you can't make characters safe, only bytes, since URIs deal with bytes. In Python 3, we go further and throw away all non-ASCII bytes from 'safe' as well, so you can only make ASCII bytes safe. For this patch, I didn't go that far, for backwards compatibility reasons. Also updated documentation. In summary, this patch makes 'quote' fully Unicode compliant. It does not change any existing behaviour which wouldn't previously have thrown an exception, so it can't possibly break any existing code (unless it's relying on the exception being thrown). (A minor change I made was replacing the line "cachekey = (safe, always_safe)" with "cachekey = safe". This avoids unnecessary work of hashing always_safe and the tuple, since always_safe doesn't change. It doesn't affect the behaviour.) Note: I've also backported the 'unquote' test cases from Python 3 and found a few more bugs. I'm going to report them separately, with patches.

I've finally gotten around to a complete analysis of this code. I have a code/test/documentation patch which fixes the issue without any code breakage.

There is another bug in quote which I've found and fixed with this patch: If the 'safe' parameter is unicode, it raises a UnicodeDecodeError.

I have backported all of the 'quote' test cases from Python 3 (which I wrote) to Python 2. This exposed the reported bug as well as the above one. It's good to have a much larger set of test cases to work with. It tests things like all combinations of str/unicode, as well as non-ASCII byte string input and all manner of unicode inputs.

The bugfix itself comes from Python 3 (this has already been approved, over many months, by Guido, so I am hoping a similar change can get pushed through into Python 2 fairly easily). The solution is to add "encoding" and "errors" arguments to 'quote', and have quote encode the unicode string before anything else. 'encoding' defaults to 'utf-8'. So:

>>> quote(u'/El Niño/')
'/El%20Ni%C3%B1o/'

which is typically the desired behaviour. (Note that URI syntax does not cover Unicode strings; it merely says to encode them with some encoding, recommended but not required UTF-8, and then percent-encode those.)

With this patch, quote *always* returns a str, even on unicode input. I think that makes sense, because a URI is, by definition, an ASCII string. It could easily be made to return a unicode instead.

The other fix is for 'safe'. If 'safe' is a byte string we don't touch it. But if it is a Unicode string, we throw away all non-ASCII bytes. This means you can't make *characters* safe, only *bytes*, since URIs deal with bytes. In Python 3, we go further and throw away all non-ASCII bytes from 'safe' as well, so you can only make ASCII bytes safe. For this patch, I didn't go that far, for backwards compatibility reasons.

Also updated documentation.

In summary, this patch makes 'quote' fully Unicode compliant. It does not change any existing behaviour which wouldn't previously have thrown an exception, so it can't possibly break any existing code (unless it's relying on the exception being thrown).

(A minor change I made was replacing the line "cachekey = (safe, always_safe)" with "cachekey = safe". This avoids unnecessary work of hashing always_safe and the tuple, since always_safe doesn't change. It doesn't affect the behaviour.)

Note: I've also backported the 'unquote' test cases from Python 3 and found a few more bugs. I'm going to report them separately, with patches.

History
Date	User	Action	Args
2010-03-14 08:46:38	mgiuca	set	recipients: + mgiuca, collinwinter, varmaa, nagle, orsenthil, vstinner, ajaksu2, eric.araujo, vak
2010-03-14 08:46:37	mgiuca	set	messageid: <1268556397.6.0.438276976101.issue1712522@psf.upfronthosting.co.za>
2010-03-14 08:46:35	mgiuca	link	issue1712522 messages
2010-03-14 08:46:34	mgiuca	create