Message 70962 - Python tracker


Author	mgiuca
Recipients	gvanrossum, janssen, jimjjewett, lemburg, loewis, mgiuca, orsenthil, pitrou, thomaspinckney3
Date	2008-08-10.07:05:44
SpamBayes Score	1.5637607e-07
Marked as misclassified	No
Message-id	<1218351947.75.0.785102512508.issue3300@psf.upfronthosting.co.za>
In-reply-to

Content

Guido suggested that quote's "safe" parameter should allow any
character, not just those in the ASCII range. I've implemented this now.
It was a lot messier than I imagined.

The problem is that in my older patches, both 's' and 'safe' are encoded
to bytes right away, and the rest of the process is just octet encoding
(matching each byte against the safe set to see whether or not to quote it).
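To illustrate the older approach, here is a minimal sketch. It is not the
code from the attached patches: quote_bytes_first and ALWAYS_SAFE are
hypothetical names, and the contents of the always-safe set are an
assumption.

```python
# Assumed always-safe set (ASCII letters, digits and a few punctuation
# marks); the set used in the actual patches may differ.
ALWAYS_SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789_.-~"
)


def quote_bytes_first(s, safe="/", encoding="utf-8", errors="strict"):
    """Hypothetical sketch: encode 's' and 'safe' up front, then quote per octet."""
    data = s.encode(encoding, errors)
    # 'safe' is restricted to ASCII here, so a non-ASCII safe character
    # raises UnicodeEncodeError instead of being accepted.
    safe_bytes = ALWAYS_SAFE | frozenset(safe.encode("ascii"))
    # From here on it is pure octet encoding: one membership test per byte.
    return "".join(
        chr(b) if b in safe_bytes else "%{:02X}".format(b)
        for b in data
    )


print(quote_bytes_first("a b/c"))
print(quote_bytes_first("\u6f22", encoding="latin-1", errors="xmlcharrefreplace"))
```

Because the whole string is encoded before any matching happens, the
output of an error handler such as xmlcharrefreplace is quoted like any
other bytes.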

The new implementation requires that you delay encoding both of these
until the iteration over the string, so that you match each *character*
against the safe set, then encode it if it's not in 'safe'. The problem
now is that some encoding/error-handler combinations produce bytes which
are in the safe range. For instance quote('\u6f22', encoding='latin-1',
errors='xmlcharrefreplace') should give "%26%2328450%3B" (which is
"&#28450;" encoded). To preserve this behaviour, you then have to check
each *byte* of the encoded character against a 'safe bytes' set. I
believe that will slow down the implementation considerably.
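The two-level matching described above can be sketched roughly as
follows. This is an illustration only, not the attached patch:
quote_any_safe and ALWAYS_SAFE are hypothetical names, the always-safe
set is an assumption, and encoding one character at a time is assumed to
be acceptable (it would not be for stateful encodings).

```python
# Assumed always-safe set; the set used in the actual patch may differ.
ALWAYS_SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789_.-~"
)


def quote_any_safe(s, safe="/", encoding="utf-8", errors="strict"):
    """Hypothetical sketch: match characters first, then bytes."""
    safe_chars = frozenset(safe)
    # Bytes allowed through unquoted after encoding: the always-safe set
    # plus the ASCII members of 'safe'.
    safe_bytes = ALWAYS_SAFE | frozenset(ord(c) for c in safe if ord(c) < 128)
    out = []
    for ch in s:
        # Level 1: a character listed in 'safe' is emitted untouched,
        # even if it is outside the ASCII range.
        if ch in safe_chars:
            out.append(ch)
            continue
        # Level 2: encode this one character, then test every resulting
        # byte, since the error handler may emit bytes in the safe range.
        for b in ch.encode(encoding, errors):
            out.append(chr(b) if b in safe_bytes else "%{:02X}".format(b))
    return "".join(out)


print(quote_any_safe("\u6f22", encoding="latin-1", errors="xmlcharrefreplace"))
print(quote_any_safe("\u6f22/\u00e9", safe="/\u6f22"))
```

The inner loop is where the extra cost comes from: every character that
is not in 'safe' needs its own encode call followed by a per-byte check.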

In summary, it requires two levels of encoding: first characters, then
bytes. You can see how messy it made my quote implementation - I've
attached the patch (parse.py.patch8+allsafe).

I don't think it's worth the extra code bloat and performance hit just
to implement a feature whose only use is producing invalid URIs (since
URIs are supposed to contain only ASCII characters). Does anyone
disagree, and want this feature in?

History
Date	User	Action	Args
2008-08-10 07:05:48	mgiuca	set	recipients: + mgiuca, lemburg, gvanrossum, loewis, jimjjewett, janssen, orsenthil, pitrou, thomaspinckney3
2008-08-10 07:05:47	mgiuca	set	messageid: <1218351947.75.0.785102512508.issue3300@psf.upfronthosting.co.za>
2008-08-10 07:05:47	mgiuca	link	issue3300 messages
2008-08-10 07:05:45	mgiuca	create