Message 69583 - Python tracker


Author	mgiuca
Recipients	loewis, mgiuca, orsenthil, thomaspinckney3
Date	2008-07-12.11:32:50
Message-id	<1215862374.12.0.421131286848.issue3300@psf.upfronthosting.co.za>

Content

OK, I spent a while writing test cases for quote and unquote, encoding and
decoding various Unicode strings with different encodings. As a result,
I found a bunch of issues in my previous patch, so I've rewritten the
patches to both quote and unquote. They're both actually more similar to
the original version now.

I'd be interested in hearing if anyone disagrees with my expected output
for these test cases.

I'm now confident I have good test coverage directly on the quote and
unquote functions. However, I haven't tested the other library functions
which depend upon them (though the entire test suite passes). As I showed
in the long post I made yesterday, though, other modules such as cgi seem
to be working fine (their behaviour has changed, since they use UTF-8
now, but that's the whole point of this patch).

I still haven't figured out what the behaviour of "safe" should be in
quote. Should it only allow ASCII characters (thereby limiting the
output to an ASCII string, as specified by RFC 3986)? Should it also
allow Latin-1 characters, or all Unicode characters as well (perhaps
allowing you to create IRIs)? Admittedly, I don't know much about IRIs.
The new implementation of quote makes it rather difficult to allow
non-Latin-1 characters to be made "safe", as it encodes the string into
bytes before any processing.
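
To illustrate the difficulty, here is a minimal sketch. It assumes the urllib.parse.quote that ships with current Python 3 releases, which may differ in detail from parse.py.patch4; the variable names are only illustrative.

```python
from urllib.parse import quote

# A character outside Latin-1 turns into several bytes once encoded,
# so no single byte of the result stands for the original character.
han = "\u6f22"
han_bytes = han.encode("utf-8")  # three bytes: E6 BC A2

# Assumption: released versions drop non-ASCII entries from "safe",
# so listing the character there has no effect on the output.
quoted_with_safe = quote(han, safe=han)
quoted_without_safe = quote(han)
```

Both calls should give the same percent-encoded string, which is why a non-ASCII "safe" character would have to be handled before the encode step rather than after it.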

Patch (parse.py.patch4) is for branch /branches/py3k, revision 64891.

Commit log:

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets.
As per RFC 3986, default is "utf-8" (previously implicitly decoded as
ISO-8859-1).
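
A short sketch of what the new arguments mean for callers, assuming the unquote signature in current Python 3 releases (where errors defaults to "replace"); the variable names are illustrative.

```python
from urllib.parse import unquote

# Default: the percent-encoded octets are decoded as UTF-8.
utf8_default = unquote("%C3%A9")

# The same character, sent by a client that used Latin-1.
latin1_explicit = unquote("%E9", encoding="latin-1")

# A lone 0xE9 octet is not valid UTF-8; with errors="replace"
# it becomes U+FFFD instead of raising an exception.
invalid_as_utf8 = unquote("%E9")
```

The first two calls both yield "é"; the third shows why a caller that knows the sender's encoding should pass it explicitly.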

urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded. Default is "utf-8" (previously characters
in range(128, 256) were encoded as ISO-8859-1, and characters above that
as UTF-8). Also, characters 128 and above (that is, non-ASCII characters)
are no longer allowed to be "safe".
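
The quote side can be sketched the same way, again assuming the signature in current Python 3 releases (where errors defaults to "strict") rather than the exact contents of this patch.

```python
from urllib.parse import quote

# Default: encode as UTF-8, then percent-encode each byte.
utf8_default = quote("\u00e9")

# Explicit Latin-1 gives a single octet for the same character.
latin1 = quote("\u00e9", encoding="latin-1")

# A character the chosen encoding cannot represent raises under
# the default errors="strict" ...
try:
    quote("\u6f22", encoding="latin-1")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

# ... and becomes "?" (percent-encoded as %3F) with errors="replace".
replaced = quote("\u6f22", encoding="latin-1", errors="replace")
```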

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface.

Lib/test/test_urllib.py: Added several new test cases testing encoding
and decoding Unicode strings with various encodings. This includes
updating one test case to now expect UTF-8 by default.

Lib/test/test_http_cookiejar.py: Updated the test case which expected
output in ISO-8859-1; it now expects UTF-8.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).
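
As a sketch of why passing encoding="latin-1" preserves the old behaviour: Latin-1 maps every byte value to the code point with the same number, so any octet sequence survives a round trip unchanged. This uses the urllib.parse functions from current Python 3 releases directly, not the actual calls in Lib/email/utils.py, and the variable names are illustrative.

```python
from urllib.parse import quote, unquote

# One character for every possible byte value, 0x00 through 0xFF.
all_bytes_text = "".join(chr(i) for i in range(256))

# Percent-encode with Latin-1, then decode with Latin-1 again.
escaped = quote(all_bytes_text, safe="", encoding="latin-1")
round_tripped = unquote(escaped, encoding="latin-1")
```

The escaped form is pure ASCII, and decoding it with the same encoding returns the original string.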

History
Date	User	Action	Args
2008-07-12 11:32:54	mgiuca	set	spambayes_score: 1.66348e-07 -> 1.6634793e-07 recipients: + mgiuca, loewis, orsenthil, thomaspinckney3
2008-07-12 11:32:54	mgiuca	set	spambayes_score: 1.66348e-07 -> 1.66348e-07 messageid: <1215862374.12.0.421131286848.issue3300@psf.upfronthosting.co.za>
2008-07-12 11:32:53	mgiuca	link	issue3300 messages
2008-07-12 11:32:51	mgiuca	create