classification
Title: urlencode does not handle "bytes" and could easily handle alternate encodings
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: orsenthil Nosy List: dmahn, ezio.melotti, jhylton, milesck, orsenthil, terry.reedy
Priority: normal Keywords: patch

Created on 2009-03-10 14:45 by dmahn, last changed 2010-07-03 17:59 by orsenthil. This issue is now closed.

Files
File name Uploaded Description
new_urlencode.py dmahn, 2009-03-10 14:45 Updated urlencode() function with doctest
new_urlencode_tests.py dmahn, 2009-03-10 22:20 Additional urlencode() tests for use in test_urllib.py
Messages (10)
msg83434 - Author: Dan Mahn (dmahn) Date: 2009-03-10 14:45
urllib.parse.urlencode() uses quote_plus() extensively to create a
complete query string, but doesn't effectively/properly take advantage
of the flexibility built into quote_plus().  Namely:

1) Instances of type "bytes" are not properly encoded, as str() is used
prior to passing to quote_plus().  This creates a nonsensical string
such as b'1234', while quote_plus() can handle these types properly if
passed intact.  The ability to encode this type is particularly useful
for putting binary data into the query string, or for pre-encoded text
which you may want to encode in a non-standard character encoding.

2) Sometimes it would be desirable to encode query strings entirely in
"latin-1" or possibly "ascii" instead of "utf-8".  Adding the extra
parameters now present on quote_plus() can easily give that extra
functionality.

I have attached a new version of urlencode() that provides both of the
above fixes/enhancements.  Additionally, an unused codepath in the
existing function has been eliminated/cleaned up.  Some doctests are
included as well.
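
A minimal sketch of the two points above, assuming a Python 3 release that already includes the fix for this issue (3.2 or later). The key names and sample values are made up for illustration and are not taken from the attached file. It shows what str() does to a bytes value before quoting, and how the encoding parameter changes the resulting escapes.

```python
from urllib.parse import quote_plus, urlencode

raw = b'\xa0\x24'  # hypothetical binary value; not valid UTF-8

# Point 1: str() turns bytes into their repr, which is then quoted literally.
mangled = quote_plus(str(b'1234'))

# quote_plus() percent-escapes bytes directly when they are passed intact.
direct = quote_plus(raw)

# With the fix applied, urlencode() hands bytes values to quote_plus() as-is.
query_bytes = urlencode({'data': raw})

# Point 2: the same text encoded as UTF-8 (the default) and as Latin-1.
query_utf8 = urlencode({'name': 'caf\u00e9'})
query_latin1 = urlencode({'name': 'caf\u00e9'}, encoding='latin-1')

for label, value in [
    ('str() then quote_plus', mangled),
    ('bytes to quote_plus', direct),
    ('urlencode with bytes', query_bytes),
    ('urlencode utf-8', query_utf8),
    ('urlencode latin-1', query_latin1),
]:
    print(label, '->', value)
```

The encoding argument only affects str values; bytes values are escaped octet by octet regardless of it.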
msg83448 - Author: Dan Mahn (dmahn) Date: 2009-03-10 22:20
I also made some tests for the new code that could be added to the unit
tests in test_urllib.py
msg84216 - Author: Jeremy Hylton (jhylton) Date: 2009-03-26 20:57
I'm not sure I understand the part of the code that deals with binary
strings.  I agree the current behavior is odd.  RFC 2396 says that
non-ascii characters must be encoded as utf-8 and then percent escaped.
 In the test case you started with, you encoded b'\xa0\x24'.  It doesn't
seem like this should be allowed, since it is not valid utf-8.
msg84228 - Author: Dan Mahn (dmahn) Date: 2009-03-26 22:27
Hello.  Thanks for the feedback.

With regards to RFC 2396, I see this:

http://www.ietf.org/rfc/rfc2396.txt

====
There is a second translation for some resources: the sequence of
    octets defined by a component of the URI is subsequently used to
    represent a sequence of characters. A 'charset' defines this mapping.
    There are many charsets in use in Internet protocols. For example,
    UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences
    of characters in the repertoire of ISO 10646.
====

To me, that text does not indicate that URLs are always encoded in 
UTF-8.  It indicates that URL information may be encoded in character 
sets ('charset') other than ASCII, and when it is, the values must be 
sent as escaped values.  Here, I note the specific words "many charsets 
in use" and "For example", before the reference to UTF-8.

I have also done a few tests, and have found that in practice, browsers 
do not always encode URLs as UTF-8.  This actually seems to differ as to 
what part of the URL is being encoded.  For instance, my Firefox will 
encode the path portion of a URL as UTF-8, but encode the query string 
as Latin-1.

I think that the general idea is ... URL data must be encoded into 
ASCII, but as to what the data is that is being encoded ... That may be 
of some "charset" which may be application-defined.  And in the most 
general sense, I would argue that the data could simply be binary data. 
  (Actually, Latin-1 pretty much uses all the codes from 0 to 255, so 
it's very much like plain binary data anyway.)

I hope that clarifies what I am reading in RFC 2396.

In addition, quote_plus() already handles all the cases I placed into 
urlencode().  I suppose the actual test cases may be debatable, but I 
did specifically choose tests with data which would be recognized as 
something other than UTF-8.
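
To illustrate the charset argument made in this message, here is a small sketch using the existing quote(), quote_from_bytes() and unquote_to_bytes() functions from urllib.parse. The sample text is hypothetical. It shows that the same character yields different escapes under UTF-8 and Latin-1, that Latin-1 maps every byte value from 0 to 255, and that arbitrary binary data survives a percent-escape round trip.

```python
from urllib.parse import quote, quote_from_bytes, unquote_to_bytes

text = 'caf\u00e9'  # hypothetical sample containing one non-ASCII character

# The percent-escapes depend on which charset turns the text into octets.
as_utf8 = quote(text)                        # default encoding is UTF-8
as_latin1 = quote(text, encoding='latin-1')

# Latin-1 assigns a character to every byte value, so it behaves much like
# plain binary data: decoding and re-encoding loses nothing.
every_byte = bytes(range(256))
latin1_round_trip = every_byte.decode('latin-1').encode('latin-1')

# Arbitrary octets can be escaped and recovered without choosing a charset.
escaped = quote_from_bytes(every_byte, safe='')
restored = unquote_to_bytes(escaped)

print(as_utf8)
print(as_latin1)
print(latin1_round_trip == every_byte, restored == every_byte)
```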

msg84260 - Author: Jeremy Hylton (jhylton) Date: 2009-03-27 14:50
Indeed, I think I confused some other character encoding issues related
to HTTP with the URI issue.  The discussion in RFC 3986 is lengthy and
only occasionally clarifying for this issue.  That is, it doesn't say
anything definitive like applications are free to use any character
encoding when decoding a URI.  But I think it agrees with your
assessment that an application is free to interpret the binary data
however it wants, e.g. http://tools.ietf.org/html/rfc3986#section-2.1
msg89416 - Author: Miles Kaufmann (milesck) Date: 2009-06-15 21:50
parse_qs and parse_qsl should also grow encoding and errors parameters to 
pass to the underlying unquote().
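
A rough sketch of what this suggestion looks like in use, assuming a Python 3 release in which parse_qs() and parse_qsl() accept encoding and errors (3.2 or later, with defaults 'utf-8' and 'replace'). The query string below is a made-up example whose escapes are Latin-1 octets.

```python
from urllib.parse import parse_qs, parse_qsl

# Hypothetical query string whose escapes are Latin-1, not UTF-8.
query = 'name=caf%E9&city=K%F6ln'

# Decoding with the charset the sender actually used recovers the text.
decoded_latin1 = parse_qs(query, encoding='latin-1')
pairs_latin1 = parse_qsl(query, encoding='latin-1')

# With the defaults, the invalid octets become U+FFFD replacement characters.
decoded_default = parse_qs(query)

print(decoded_latin1)
print(pairs_latin1)
print(decoded_default)
```

Passing errors='strict' instead would raise UnicodeDecodeError for the default decoding, which may be preferable when silent replacement is not acceptable.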
msg92029 - Author: Miles Kaufmann (milesck) Date: 2009-08-28 09:38
I've attached a patch that provides similar functionality to Dan Mahn's 
urlencode(), as well as providing encoding and errors parameters to 
parse_qs and parse_qsl, updating the documentation to reflect the added 
parameters, and adding test cases.  The implementation of urlencode() is 
not the same as dmahn's, and has a more straightforward control flow and 
less code duplication than the current implementation.

(For the tests, I tried to match the style of the file I was adding to 
with regard to (expect, result) order, which is why it's inconsistent.)
msg108290 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-06-21 17:23
The question of whether % escape should be limited to utf-8 or not was discussed and decided in favor of 'not' in #3300, quote and unquote.

Last December, a websig post (referenced yesterday on pydev) reported a 'problem' that would be solved by Miles' suggestion to include parse_qs and parse_qsl.
msg109101 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-07-02 11:45
I see no problem in going ahead with the suggestion proposed and the patch.

- I checked with RFC3986 Section 2.5
http://labs.apache.org/webarch/uri/rfc/rfc3986.html#identifying-data

Relevant line:
When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded.

- This is done already in quote and quote_plus. 
- It just boils down to urlencode also providing the same facility for query strings and that was the point of this bug report.
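
The rule quoted from RFC 3986 can be sketched by hand as follows. The helper name percent_encode_utf8 and the sample string are hypothetical, and this is only an illustration of the two steps (encode as UTF-8, then escape every octet outside the unreserved set), not the code used by quote(). The result is compared against quote() with safe='' for a string that contains no '~', since older Python releases escaped that character.

```python
import string
from urllib.parse import quote

# Unreserved characters per RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(
    (string.ascii_letters + string.digits + '-._~').encode('ascii')
)


def percent_encode_utf8(text):
    """Encode text as UTF-8, then percent-encode each non-unreserved octet."""
    pieces = []
    for octet in text.encode('utf-8'):
        if octet in UNRESERVED:
            pieces.append(chr(octet))
        else:
            pieces.append('%{:02X}'.format(octet))
    return ''.join(pieces)


sample = 'caf\u00e9 a/b'  # hypothetical input
by_hand = percent_encode_utf8(sample)
by_quote = quote(sample, safe='')

print(by_hand)
print(by_hand == by_quote)
```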

Jeremy, I shall go ahead with this and do the modifications, if required.
msg109187 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-07-03 17:59
Fixed and committed in revision 82510 (py3k) and revision 82511 (release31-maint).

This fixes the urlencode issue. parse_qs and parse_qsl can have the same capabilities; that will be done subsequently (in another commit or issue).

Thanks Dan for the bug report and patch.
History
Date User Action Args
2010-07-03 17:59:30 orsenthil set status: open -> closed
title: urlencode does not handle "bytes", and could easily handle alternate encodings -> urlencode does not handle "bytes" and could easily handle alternate encodings
messages: + msg109187
resolution: accepted -> fixed
stage: patch review -> resolved
2010-07-02 11:45:25 orsenthil set assignee: jhylton -> orsenthil
messages: + msg109101
nosy: + orsenthil
2010-06-21 17:23:23 terry.reedy set versions: - Python 3.0, Python 3.1
nosy: + terry.reedy
messages: + msg108290
stage: patch review
2009-09-01 03:00:12 milesck set files: - withdrawn
2009-08-28 09:38:20 milesck set files: + withdrawn
keywords: + patch
messages: + msg92029
versions: + Python 3.2
2009-06-15 21:50:51 milesck set nosy: + milesck
messages: + msg89416
2009-03-27 21:29:10 ezio.melotti set nosy: + ezio.melotti
2009-03-27 14:50:56 jhylton set assignee: jhylton
resolution: accepted
messages: + msg84260
2009-03-26 22:27:22 dmahn set messages: + msg84228
title: urlencode does not handle "bytes", and could easily handle alternate encodings -> urlencode does not handle "bytes", and could easily handle alternate encodings
2009-03-26 20:57:50 jhylton set nosy: + jhylton
messages: + msg84216
2009-03-10 22:20:12 dmahn set files: + new_urlencode_tests.py
messages: + msg83448
2009-03-10 14:45:14 dmahn create