urlencode does not handle "bytes" and could easily handle alternate encodings #49718

dmahn · 2009-03-10T14:45:15Z

BPO	5468
Nosy	@terryjreedy, @orsenthil, @ezio-melotti
Files	new_urlencode.py: Updated urlencode() function with doctest new_urlencode_tests.py: Additional urlencode() tests for use in test_urllib.py

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/orsenthil'
closed_at = <Date 2010-07-03.17:59:30.311>
created_at = <Date 2009-03-10.14:45:14.864>
labels = ['type-bug', 'library']
title = 'urlencode does not handle "bytes" and could easily handle alternate encodings'
updated_at = <Date 2010-07-03.17:59:30.310>
user = 'https://bugs.python.org/dmahn'

bugs.python.org fields:

activity = <Date 2010-07-03.17:59:30.310>
actor = 'orsenthil'
assignee = 'orsenthil'
closed = True
closed_date = <Date 2010-07-03.17:59:30.311>
closer = 'orsenthil'
components = ['Library (Lib)']
creation = <Date 2009-03-10.14:45:14.864>
creator = 'dmahn'
dependencies = []
files = ['13294', '13298']
hgrepos = []
issue_num = 5468
keywords = ['patch']
message_count = 10.0
messages = ['83434', '83448', '84216', '84228', '84260', '89416', '92029', '108290', '109101', '109187']
nosy_count = 6.0
nosy_names = ['jhylton', 'terry.reedy', 'orsenthil', 'ezio.melotti', 'dmahn', 'milesck']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue5468'
versions = ['Python 3.2']

dmahn · 2009-03-10T14:45:12Z

urllib.parse.urlencode() uses quote_plus() extensively to create a
complete query string, but doesn't effectively/properly take advantage
of the flexibility built into quote_plus(). Namely:

Instances of type "bytes" are not properly encoded, as str() is used
prior to passing to quote_plus(). This creates a nonsensical string
such as b'1234', while quote_plus() can handle these types properly if
passed intact. The ability to encode this type is particularly useful
for putting binary data into the query string, or for pre-encoded text
which you may want to encode in a non-standard character encoding.
Sometimes it would be desirable to encode query strings entirely in
"latin-1" or possibly "ascii" instead of "utf-8". Adding the extra
parameters now present on quote_plus() can easily give that extra
functionality.

I have attached a new version of urlencode() that provides both of the
above fixes/enhancements. Additionally, an unused codepath in the
existing function has been eliminated/cleaned up. Some doctests are
included as well.

dmahn · 2009-03-10T22:20:10Z

I also made some tests for the new code that could be added to the unit
tests in test_urllib.py

jhylton · 2009-03-26T20:57:50Z

I'm not sure I understand the part of the code that deals with binary
strings. I agree the current behavior is odd. RFC 2396 says that
non-ascii characters must be encoded as utf-8 and then percent escaped.
In the test case you started with, you encoded b'\xa0\x24'. It doesn't
seem like this should be allowed, since it is not valid utf-8.

dmahn · 2009-03-26T22:27:20Z

Hello. Thanks for the feedback.

With regards to RFC 2396, I see this:

http://www.ietf.org/rfc/rfc2396.txt

====
There is a second translation for some resources: the sequence of
octets defined by a component of the URI is subsequently used to
represent a sequence of characters. A 'charset' defines this mapping.
There are many charsets in use in Internet protocols. For example,
UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences
of characters in the repertoire of ISO 10646.
====

To me, that text does not indicate that URLs are always encoded in
UTF-8. It indicates that URL information may be encoded in character
sets ('charset') other than ASCII, and when it is, the values must be
sent as escaped values. Here, I note the specific words "many charsets
in use" and "For example", before the reference to UTF-8.

I have also done a few tests, and have found that in practice, browsers
do not always encode URLs as UTF-8. This actually seems to differ as to
what part of the URL is being encoded. For instance, my Firefox will
encode the path portion of a URL as UTF-8, but encode the query string
as Latin-1.

I think that the general idea is ... URL data must be encoded into
ASCII, but as to what the data is that is being encoded ... That may be
of some "charset" which may be application-defined. And in the most
general sense, I would argue that the data could simply be binary data.
(Actually, Latin-1 pretty much uses all the codes from 0 to 255, so
it's very much like plain binary data anyway.)

I hope that clarifies what I am reading in RFC 2396.

In addition, quote_plus() already handles all the cases I placed into
urlencode(). I suppose the actual test cases may be debatable, but I
did specifically choose tests with data which would be recognized as
something other then UTF-8.

Jeremy Hylton wrote:

Jeremy Hylton <jeremy@alum.mit.edu> added the comment:

I'm not sure I understand the part of the code that deals with binary
strings. I agree the current behavior is odd. RFC 2396 says that
non-ascii characters must be encoded as utf-8 and then percent escaped.
In the test case you started with, you encoded b'\xa0\x24'. It doesn't
seem like this should be allowed, since it is not valid utf-8.

----------
nosy: +jhylton

Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue5468\>

jhylton · 2009-03-27T14:50:55Z

Indeed, I think I confused some other character encoding issues related
to HTTP with the URI issue. The discussion in RFC 3986 is length and
only occasionally clarifying for this issue. That is, it doesn't say
anything definitive like applications are free to use any character
encoding when decoding a URI. But I think it agrees with your
assessment that an application is free to interpret the binary data
however it wants, e.g. http://tools.ietf.org/html/rfc3986#section-2.1

milesck · 2009-06-15T21:50:51Z

parse_qs and parse_qsl should also grow encoding and errors parameters to
pass to the underlying unquote().

milesck · 2009-08-28T09:38:18Z

I've attached a patch that provides similar functionality to Dan Mahn's
urlencode(), as well as providing encoding and errors parameters to
parse_qs and parse_qsl, updating the documentation to reflect the added
parameters, and adding test cases. The implementation of urlencode() is
not the same as dmahn's, and has a more straightforward control flow and
less code duplication than the current implementation.

(For the tests, I tried to match the style of the file I was adding to
with regard to (expect, result) order, which is why it's inconsistent.)

terryjreedy · 2010-06-21T17:23:23Z

The question of whether % escape should be limited to utf-8 or not was discussed and decided in favor of 'not' in bpo-3300, quote and unquote.

Last December, a websig post (referenced yesterday on pydev) reported a 'problem' that would be solved by Miles' suggestion to include parse_qs and parse_qsl.

orsenthil · 2010-07-02T11:45:25Z

I see no problem in going ahead with the suggestion proposed and the patch.

I checked with RFC3986 Section 2.5
http://labs.apache.org/webarch/uri/rfc/rfc3986.html#identifying-data

Relevant line:
When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded.

This is done already in quote and quote_plus.
It just boils down to urlencode also providing the same facility for query strings and that was the point of this bug report.

Jeremy, I shall go ahead with this and do the modifications, if required.

orsenthil · 2010-07-03T17:59:29Z

Fixed and Committed revision 82510 (py3k) and revision 82511 (release31-maint).

This fixes urlencode issue. parse_qs and parse_qsl can have the same capabilities. It will be done subsequently (in another commit or issue)

Thanks Dan for the bug report and patch.

dmahn mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Mar 10, 2009

dmahn mannequin changed the title ~~urlencode does not handle "bytes", and could easily handle alternate encodings~~ urlencode does not handle "bytes", and could easily handle alternate encodings Mar 26, 2009

jhylton mannequin self-assigned this Mar 27, 2009

orsenthil assigned orsenthil and unassigned jhylton Jul 2, 2010

orsenthil closed this as completed Jul 3, 2010

orsenthil changed the title ~~urlencode does not handle "bytes", and could easily handle alternate encodings~~ urlencode does not handle "bytes" and could easily handle alternate encodings Jul 3, 2010

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

urlencode does not handle "bytes" and could easily handle alternate encodings #49718

urlencode does not handle "bytes" and could easily handle alternate encodings #49718

dmahn mannequin commented Mar 10, 2009

dmahn mannequin commented Mar 10, 2009

dmahn mannequin commented Mar 10, 2009

jhylton mannequin commented Mar 26, 2009

dmahn mannequin commented Mar 26, 2009

jhylton mannequin commented Mar 27, 2009

milesck mannequin commented Jun 15, 2009

milesck mannequin commented Aug 28, 2009

terryjreedy commented Jun 21, 2010

orsenthil commented Jul 2, 2010

orsenthil commented Jul 3, 2010

urlencode does not handle "bytes" and could easily handle alternate encodings #49718

urlencode does not handle "bytes" and could easily handle alternate encodings #49718

Comments

dmahn mannequin commented Mar 10, 2009

dmahn mannequin commented Mar 10, 2009

dmahn mannequin commented Mar 10, 2009

jhylton mannequin commented Mar 26, 2009

dmahn mannequin commented Mar 26, 2009

jhylton mannequin commented Mar 27, 2009

milesck mannequin commented Jun 15, 2009

milesck mannequin commented Aug 28, 2009

terryjreedy commented Jun 21, 2010

orsenthil commented Jul 2, 2010

orsenthil commented Jul 3, 2010