Message 153299 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	Stephen.Day
Recipients	Stephen.Day, eric.araujo, orsenthil
Date	2012-02-13.20:46:45
SpamBayes Score	5.551115e-16
Marked as misclassified	No
Message-id	<1329166006.99.0.645855905419.issue13866@psf.upfronthosting.co.za>
In-reply-to

Content
I apologize for reopening this bug, but I find your interpretation to be inaccurate. While technically valid, the combination of the documentation, the function name and the main use cases yields pathological invocations of urlencode. My bug report is to help mitigate these problems. The main use case for "url encoding" of mapping types is not for posting form data; the main use case is appending url parameters to a url: >>> from urllib import urlencode >>> from urlparse import urlunparse >>> urlunparse(('http', 'example.com', '/', None, urlencode({'a': 'some string'}), None)) 'http://example.com/?a=some+string' Any sane person would naturally gravitate to a function called "urlencode" to url encode a mapping type. If the urllib.urlencode function is indeed intended for form-encoding, as I agree is hinted in the documentation, it should indicate that its result is 'application/x-www-form-urlencoded' or it should be called "formencode". The quote or quote_plus is not at all "what I am looking for"; I am quite familiar with these library functions. These functions are for encoding component strings; they don't meet the use case described at all: >>> quote({'a': 1}) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1248, in quote if not s.rstrip(safe): AttributeError: 'dict' object has no attribute 'rstrip' In addition, Java's URLEncoder implementation is hardly a good example of standards compliant URL manipulation. Python is not Java. The Python community needs to make its own, independent, mature language decisions. In general, the use of '+' to encode spaces in content, even if it is compliant against an arbitrary standard, is pathological, especially when used in urls. Even though python's quote_plus function works symmetrically on its own, when pluses are used in a multi-language environment it can become impossible to tell whether a plus is a literal '+' or an encoded space. In addition, the usage of '%20' for spaces will work in almost all cases. RFC3986, Section 2 [1] describes the use of percent-encoding as a solution to representing reserved characters. In practice, percent-encoding is used on the value component of 'key=value' productions and this works in nearly all cases. The referenced standard [2], while relevant to the "implied" use case, is not applicable to url assembly. Given your interpretation, it seems that there is no function in the python standard library to meet the use case of correctly assembling url parameter values, leaving application developers to come up with something like this: >>> '&'.join(['='.join((quote(k), quote(v))) for k,v in {'a': '1', 'b': 'with spaces'}.iteritems()]) 'a=1&b=with%20spaces' In most cases, people will just use urlencode, which uses pluses for spaces, yielding pathological, noncompliant urls. In deference to this bug closure, there are a few options: 1. Close this issue and keep polluting the world's urls with pluses for spaces. 2. Make urlencode target path/query parameter encoding and then create a new function, formencode, for use in encoding form data, breaking backwards compatibility. 3. Simply add a keyword argument to urlencode to allow the caller to specify the encoding function and separator, retaining compatibility and satisfying all of the above use cases. Naturally, 3 seems to be a very reasonable solution to this bug. [1] http://tools.ietf.org/html/rfc3986#section-2 explicitly covers [2] http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1

I apologize for reopening this bug, but I find your interpretation to be inaccurate. While technically valid, the combination of the documentation, the function name and the main use cases yields pathological invocations of urlencode. My bug report is to help mitigate these problems.

The main use case for "url encoding" of mapping types is not for posting form data; the main use case is appending url parameters to a url:

>>> from urllib import urlencode
>>> from urlparse import urlunparse
>>> urlunparse(('http', 'example.com', '/', None, urlencode({'a': 'some string'}), None))
'http://example.com/?a=some+string'

Any sane person would naturally gravitate to a function called "urlencode" to url encode a mapping type. If the urllib.urlencode function is indeed intended for form-encoding, as I agree is hinted in the documentation, it should indicate that its result is 'application/x-www-form-urlencoded' or it should be called "formencode".

The quote or quote_plus is not at all "what I am looking for"; I am quite familiar with these library functions. These functions are for encoding component strings; they don't meet the use case described at all:

>>> quote({'a': 1})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1248, in quote
    if not s.rstrip(safe):
AttributeError: 'dict' object has no attribute 'rstrip'

In addition, Java's URLEncoder implementation is hardly a good example of standards compliant URL manipulation. Python is not Java. The Python community needs to make its own, independent, mature language decisions. In general, the use of '+' to encode spaces in content, even if it is compliant against an arbitrary standard, is pathological, especially when used in urls. Even though python's quote_plus function works symmetrically on its own, when pluses are used in a multi-language environment it can become impossible to tell whether a plus is a literal '+' or an encoded space. In addition, the usage of '%20' for spaces will work in almost all cases.

RFC3986, Section 2 [1] describes the use of percent-encoding as a solution to representing reserved characters. In practice, percent-encoding is used on the value component of 'key=value' productions and this works in nearly all cases. The referenced standard [2], while relevant to the "implied" use case, is not applicable to url assembly.

Given your interpretation, it seems that there is no function in the python standard library to meet the use case of correctly assembling url parameter values, leaving application developers to come up with something like this:

>>> '&'.join(['='.join((quote(k), quote(v))) for k,v in {'a': '1', 'b': 'with spaces'}.iteritems()])
'a=1&b=with%20spaces'

In most cases, people will just use urlencode, which uses pluses for spaces, yielding pathological, noncompliant urls.

In deference to this bug closure, there are a few options:

1. Close this issue and keep polluting the world's urls with pluses for spaces.

2. Make urlencode target path/query parameter encoding and then create a new function, formencode, for use in encoding form data, breaking backwards compatibility.

3. Simply add a keyword argument to urlencode to allow the caller to specify the encoding function and separator, retaining compatibility and satisfying all of the above use cases.

Naturally, 3 seems to be a very reasonable solution to this bug.

[1] http://tools.ietf.org/html/rfc3986#section-2 explicitly covers 
[2] http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1

History
Date	User	Action	Args
2012-02-13 20:46:47	Stephen.Day	set	recipients: + Stephen.Day, orsenthil, eric.araujo
2012-02-13 20:46:46	Stephen.Day	set	messageid: <1329166006.99.0.645855905419.issue13866@psf.upfronthosting.co.za>
2012-02-13 20:46:46	Stephen.Day	link	issue13866 messages
2012-02-13 20:46:45	Stephen.Day	create