Message 71130 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	mgiuca
Recipients	gvanrossum, janssen, jimjjewett, loewis, mgiuca, orsenthil, pitrou, thomaspinckney3
Date	2008-08-14.15:27:24
SpamBayes Score	1.4432899e-15
Marked as misclassified	No
Message-id	<1218727648.28.0.988658298824.issue3300@psf.upfronthosting.co.za>
In-reply-to

Content
New patch (patch10). Details on Rietveld review tracker (http://codereview.appspot.com/2827). Another update on the remaining "outstanding issues": Resolved issues since last time: > Should unquote accept a bytes/bytearray as well as a str? No. But see below. > Lib/email/utils.py: > Should encode_rfc2231 with charset=None accept strings with non-ASCII > characters, and just encode them to UTF-8? Implemented Antoine's fix ("or 'ascii'"). > Should quote accept safe characters outside the > ASCII range (thereby potentially producing invalid URIs)? No. New issues: unquote_to_bytes doesn't cope well with non-ASCII characters (currently encodes as UTF-8 - not a lot we can do since this is a str->bytes operation). However, we can allow it to accept a bytes as input (while unquote does not), and it preserves the bytes precisely. Discussion at http://codereview.appspot.com/2827/diff/82/84, line 265. I have implemented that suggestion - so unquote_to_bytes now accepts either a bytes or str, while unquote accepts only a str. No changes need to be made unless there is disagreement on that decision. I also emailed Barry Warsaw about the email/utils.py patch (because we weren't sure exactly what that code was doing). However, I'm sure that this patch isn't breaking anything there, because I call unquote with encoding="latin-1", which is the same behaviour as the current head. That's all the issues I have left over in this patch. Attaching patch 10 (for revision 65675). Commit log for patch 10: Fix for issue 3300. urllib.parse.unquote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the decoding of percent-encoded octets. As per RFC 3986, default is "utf-8" (previously implicitly decoded as ISO-8859-1). Fixed a bug in which mixed-case hex digits (such as "%aF") weren't being decoded at all. urllib.parse.quote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the encoding of non-ASCII characters before being percent-encoded. Default is "utf-8" (previously characters in range(128, 256) were encoded as ISO-8859-1, and characters above that as UTF-8). Characters/bytes above 128 are no longer allowed to be "safe". Now allows either bytes or strings. Optimised "Quoter"; now inherits defaultdict. Added functions urllib.parse.quote_from_bytes, urllib.parse.unquote_to_bytes. All quote/unquote functions now exported from the module. Doc/library/urllib.parse.rst: Updated docs on quote and unquote to reflect new interface, added quote_from_bytes and unquote_to_bytes. Lib/test/test_urllib.py: Added many new test cases testing encoding and decoding Unicode strings with various encodings, as well as testing the new functions. Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py, Lib/test/test_wsgiref.py: Updated and added test cases to deal with UTF-8-encoded URIs. Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote with encoding="latin-1", to preserve existing behaviour (which the email module is dependent upon).

New patch (patch10). Details on Rietveld review tracker
(http://codereview.appspot.com/2827).

Another update on the remaining "outstanding issues":

Resolved issues since last time:

> Should unquote accept a bytes/bytearray as well as a str?
No. But see below.

> Lib/email/utils.py:
> Should encode_rfc2231 with charset=None accept strings with non-ASCII
> characters, and just encode them to UTF-8?
Implemented Antoine's fix ("or 'ascii'").

> Should quote accept safe characters outside the
> ASCII range (thereby potentially producing invalid URIs)?
No.

New issues:

unquote_to_bytes doesn't cope well with non-ASCII characters (currently
encodes as UTF-8 - not a lot we can do since this is a str->bytes
operation). However, we can allow it to accept a bytes as input (while
unquote does not), and it preserves the bytes precisely.
Discussion at http://codereview.appspot.com/2827/diff/82/84, line 265.

I have *implemented* that suggestion - so unquote_to_bytes now accepts
either a bytes or str, while unquote accepts only a str. No changes need
to be made unless there is disagreement on that decision.

I also emailed Barry Warsaw about the email/utils.py patch (because we
weren't sure exactly what that code was doing). However, I'm sure that
this patch isn't breaking anything there, because I call unquote with
encoding="latin-1", which is the same behaviour as the current head.

That's all the issues I have left over in this patch.

Attaching patch 10 (for revision 65675).

Commit log for patch 10:

Fix for issue 3300.

urllib.parse.unquote:
  Added "encoding" and "errors" optional arguments, allowing the caller
  to determine the decoding of percent-encoded octets.
  As per RFC 3986, default is "utf-8" (previously implicitly decoded
  as ISO-8859-1).
  Fixed a bug in which mixed-case hex digits (such as "%aF") weren't
  being decoded at all.

urllib.parse.quote:
  Added "encoding" and "errors" optional arguments, allowing the
  caller to determine the encoding of non-ASCII characters
  before being percent-encoded.
  Default is "utf-8" (previously characters in range(128, 256)
  were encoded as ISO-8859-1, and characters above that as UTF-8).
  Characters/bytes above 128 are no longer allowed to be "safe".
  Now allows either bytes or strings.
  Optimised "Quoter"; now inherits defaultdict.

Added functions urllib.parse.quote_from_bytes,
urllib.parse.unquote_to_bytes.
All quote/unquote functions now exported from the module.

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface, added quote_from_bytes and unquote_to_bytes.

Lib/test/test_urllib.py: Added many new test cases testing encoding
and decoding Unicode strings with various encodings, as well as testing
the new functions.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the email
module is dependent upon).

History
Date	User	Action	Args
2008-08-14 15:27:28	mgiuca	set	recipients: + mgiuca, gvanrossum, loewis, jimjjewett, janssen, orsenthil, pitrou, thomaspinckney3
2008-08-14 15:27:28	mgiuca	set	messageid: <1218727648.28.0.988658298824.issue3300@psf.upfronthosting.co.za>
2008-08-14 15:27:27	mgiuca	link	issue3300 messages
2008-08-14 15:27:25	mgiuca	create