Message 233332 - Python tracker


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	demian.brecht
Recipients	Bob.Chen, demian.brecht, orsenthil, vstinner
Date	2015-01-02.23:23:07
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1420240988.93.0.105941351132.issue22231@psf.upfronthosting.co.za>
In-reply-to

Content

A few notes:

1. Unicode hosts are not automatically IDNA-encoded (which they /could/ be, rather than relying on the programmer to be aware of this), but this really has no bearing on this specific issue.
2. Unicode paths are not automatically IRI-encoded (see https://tools.ietf.org/html/rfc3987#section-3). This should likely also be handled automatically when a unicode object is encountered as the path.
3. When a list contains even a single unicode element, string_join will defer to PyUnicode_Join.

The problem here is that your pre-joined request elements look like this:

[u'POST http://bugs.python.org/any_url HTTP/1.1',
 'Host: bugs.python.org',
 'Accept-Encoding: identity',
 'Content-Length: 44',
 'notes: \xe5\x91\xb5\xe5\x91\xb5',
 'Content-type: application/x-www-form-urlencoded',
 'Accept: text/plain',
 '',
 '']

Because there's a unicode object in the list at index 0, every element of the list is converted to unicode during the join. The byte strings are decoded with the default ascii codec, which raises UnicodeDecodeError when \xe5 is encountered.
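Python 3 no longer performs this implicit coercion, so the failure can't be reproduced there verbatim. As a rough sketch only, the decoding step that the Python 2 join performs can be spelled out explicitly. The helper name coerce_like_py2 is hypothetical and is not part of httplib or the interpreter; the shortened list stands in for the request elements shown above.

```python
# Shortened version of the request elements: one text element, the rest bytes.
parts = [
    'POST http://bugs.python.org/any_url HTTP/1.1',  # the unicode element
    b'Host: bugs.python.org',
    b'notes: \xe5\x91\xb5\xe5\x91\xb5',              # UTF-8 bytes, not ASCII
]


def coerce_like_py2(item):
    """Hypothetical helper: mimic Python 2's implicit str -> unicode step."""
    if isinstance(item, bytes):
        # Python 2 decoded byte strings with the default ascii codec here.
        return item.decode('ascii')
    return item


try:
    joined = '\r\n'.join(coerce_like_py2(p) for p in parts)
    error = None
except UnicodeDecodeError as exc:
    joined = None
    error = exc

print(type(error).__name__, error)
```

The pure-ASCII Host header decodes without trouble; only the element that carries the raw UTF-8 bytes trips the ascii decoder.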

The proposed solution won't work: unicode characters are legal in IRIs (see RFC 3987), and the proposed fix will fail as soon as anything outside of the ascii character set is present.

I think that the correct way to solve this issue is to automatically encode unicode paths (or IRIs) using urllib.quote, passing the reserved characters defined in RFC 3987 as the safe parameter:

>>> import urllib
>>> urllib.quote(u'/foo/呵/bar'.encode('utf-8'), ':/?#[]@!$&\'()*+,;=')
'/foo/%E5%91%B5/bar'
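For comparison, here is a minimal sketch of the same idea on Python 3, where the function lives at urllib.parse.quote and encodes str input as UTF-8 by default. The helper name encode_path and the SAFE constant are assumptions made for illustration; this is not the patch that was applied to the standard library.

```python
from urllib.parse import quote

# Reserved characters (gen-delims and sub-delims) that are left unescaped,
# matching the safe string used in the urllib.quote call above.
SAFE = ":/?#[]@!$&'()*+,;="


def encode_path(path):
    """Hypothetical helper: percent-encode non-ASCII characters as UTF-8."""
    return quote(path, safe=SAFE)


encoded = encode_path('/foo/呵/bar')
print(encoded)  # /foo/%E5%91%B5/bar
```

Note that '%' is not in the safe set, so a path that is already percent-encoded would be encoded a second time. Any automatic handling would need to decide how to treat such input.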

History
Date	User	Action	Args
2015-01-02 23:23:09	demian.brecht	set	recipients: + demian.brecht, orsenthil, vstinner, Bob.Chen
2015-01-02 23:23:08	demian.brecht	set	messageid: <1420240988.93.0.105941351132.issue22231@psf.upfronthosting.co.za>
2015-01-02 23:23:08	demian.brecht	link	issue22231 messages
2015-01-02 23:23:07	demian.brecht	create