Author mgiuca
Recipients mgiuca
Date 2008-07-06.14:52:06
Message-id <1215355930.42.0.79499861143.issue3300@psf.upfronthosting.co.za>
In-reply-to
Content
Three Unicode-related problems with urllib.parse.quote and
urllib.parse.unquote in Python 3.0. (Patch attached).

Firstly, unquote appears not to have been modified from Python 2, where
it is designed to output a byte string. In Python 3, it outputs a
Unicode string, implicitly decoded as ISO-8859-1 (the code points are
the same as the bytes). RFC 3986 states that the percent-encoded byte
values should be decoded as UTF-8.

http://tools.ietf.org/html/rfc3986 section 2.5.

Current behaviour:
>>> urllib.parse.unquote("%CE%A3")
'Î£'
(or '\u00ce\u00a3')

Desired behaviour:
>>> urllib.parse.unquote("%CE%A3")
'Σ'
(or '\u03a3')
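
The difference between the two results comes down to which codec is
applied to the percent-decoded octets. A minimal sketch of that using
only bytes.decode (this is an illustration, not the urllib code itself):

```python
# The two octets that "%CE%A3" stands for.
raw = bytes.fromhex("CEA3")

# ISO-8859-1 maps each byte to the code point with the same value,
# so two bytes become two characters.
as_latin1 = raw.decode("iso-8859-1")

# UTF-8 treats the same two bytes as one encoded character.
as_utf8 = raw.decode("utf-8")

print(ascii(as_latin1))  # '\xce\xa3'
print(ascii(as_utf8))    # '\u03a3'
```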

Secondly, while quote *has* been modified to encode to UTF-8 before
percent-encoding, it does not work correctly for characters in
range(128, 256), due to a special case in the code which again treats
the code point values as byte values.

Current behaviour:
>>> urllib.parse.quote('\u00e9')
'%E9'

Desired behaviour:
>>> urllib.parse.quote('\u00e9')
'%C3%A9'

Note that currently, quoting characters with code points below 256 will
use ISO-8859-1, while quoting characters with code points of 256 or
higher will use UTF-8!
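
To make the two encodings concrete, here is a rough sketch that
percent-encodes the bytes produced by each codec. The helper name
percent_encode is hypothetical and only exists for this illustration; it
is not part of urllib.

```python
def percent_encode(data):
    """Hypothetical helper: percent-encode every byte of a bytes object."""
    return "".join("%{:02X}".format(byte) for byte in data)


ch = "\u00e9"

# Treating the code point as a byte value (ISO-8859-1) gives one octet.
print(percent_encode(ch.encode("iso-8859-1")))  # %E9

# Encoding to UTF-8 first gives two octets.
print(percent_encode(ch.encode("utf-8")))       # %C3%A9
```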

Thirdly, the "safe" argument to quote has no effect for characters with
code points of 256 or above, since these are excluded from the special
case. I thought I would fix this at the same time, but it's really a
separate issue.

Current behaviour:
>>> urllib.parse.quote('Σϰ', safe='Σ')
'%CE%A3%CF%B0'

Desired behaviour:
>>> urllib.parse.quote('Σϰ', safe='Σ')
'Σ%CF%B0'
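
One way the desired behaviour could be expressed is to check membership
in "safe" before any encoding happens, and to UTF-8 encode everything
else. The sketch below is an assumption about the approach, not the
attached patch; the names quote_sketch and UNRESERVED are hypothetical.

```python
import string

# Characters this sketch never quotes (letters, digits and "_.-").
UNRESERVED = frozenset(string.ascii_letters + string.digits + "_.-")


def quote_sketch(text, safe="/"):
    """Hypothetical sketch: keep safe characters, UTF-8 percent-encode the rest."""
    keep = UNRESERVED | set(safe)
    out = []
    for ch in text:
        if ch in keep:
            # Safe characters pass through, whatever their code point.
            out.append(ch)
        else:
            # Everything else is encoded to UTF-8, one %XX per octet.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


print(ascii(quote_sketch("\u03a3\u03f0", safe="\u03a3")))  # '\u03a3%CF%B0'
print(quote_sketch("\u03a3\u03f0", safe=""))               # %CE%A3%CF%B0
```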

A patch which fixes all three issues is attached. Note that unquote now
needs to handle the case where the UTF-8 sequence is invalid. This is
currently handled by "replace" (invalid sequences are replaced by
'\ufffd'). I would like to add an optional "errors" argument to unquote,
defaulting to "replace", to allow the user to override this behaviour,
but I didn't put that in because it would change the interface.
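
For reference, this is how the standard error handlers of bytes.decode
treat an invalid UTF-8 sequence. It only demonstrates the codec
behaviour that an "errors" argument would select between; it does not
show the patched unquote.

```python
# A valid two-byte sequence (U+03A3) followed by a byte that is never
# valid in UTF-8.
raw = bytes.fromhex("CEA3FF")

# "replace" substitutes U+FFFD for the invalid byte.
replaced = raw.decode("utf-8", errors="replace")
print(ascii(replaced))  # '\u03a3\ufffd'

# "ignore" silently drops the invalid byte.
ignored = raw.decode("utf-8", errors="ignore")
print(ascii(ignored))   # '\u03a3'

# "strict" raises instead of producing output.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict decoding failed:", exc.reason)
```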

Note I also changed one of the test cases, which had the wrong expected
output. (String literal was manually UTF-8 encoded, designed for Python
2; nonsensical when viewed as a Python 3 Unicode string).

All urllib test cases pass.

Patch is for branch /branches/py3k, revision 64752.

Note: The above unquote issue also manifests itself in Python 2 for
Unicode strings, but it's hazy as to what the behaviour should be, and
would break existing programs, so I'm just patching the Py3k branch.

Commit log:

urllib.parse.unquote: Fixed percent-encoded octets being implicitly
decoded as ISO-8859-1; now decode as UTF-8, as per RFC 3986.

urllib.parse.quote: Fixed characters in range(128, 256) being implicitly
encoded as ISO-8859-1; now encode as UTF-8. Also fixed characters with
code points of 256 or greater not responding to "safe", and also not
being cached.

Lib/test/test_urllib.py: Updated one test case for unquote which
expected the wrong output. The new version of unquote passes the new
test case.