classification
Title: urllib.quote quotes too many chars, e.g., '()'
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: orsenthil Nosy List: csabella, ezio.melotti, joern, merwok, orsenthil
Priority: normal Keywords:

Created on 2011-09-06 10:26 by joern, last changed 2017-07-21 15:08 by louielu.

Pull Requests
URL Status Linked Edit
PR 2568 open joern, 2017-07-04 15:11
Messages (4)
msg143592 - (view) Author: Jörn Hees (joern) * Date: 2011-09-06 10:26
urllib.quote('()')
returns '%28%29'

Looking at its code, it tries to follow RFC 2396 (which is good, even though it should follow RFC 3986 nowadays), but it doesn't:

http://tools.ietf.org/html/rfc2396 (see Appendix A, p.27): "(" and ")" are in mark and therefore unreserved, so why are they quoted?
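For reference, the same behavior can be reproduced on Python 3, where the function lives at urllib.parse.quote. The snippet below is only a minimal sketch; passing the extra characters through the `safe` argument is one possible workaround on the caller's side, not a fix proposed in this issue.

```python
from urllib.parse import quote

# Parentheses are percent-encoded by default, as reported above.
default_quoted = quote('()')

# Workaround sketch: explicitly mark some RFC 2396 "mark" characters as safe.
# '/' is repeated here because passing `safe` replaces the default of '/'.
mark_safe = quote("a(b)/c!*'", safe="/!*'()")

print(default_quoted)
print(mark_safe)
```

With the explicit `safe` string, the parentheses and the other listed characters pass through unchanged.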
msg143596 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2011-09-06 11:54
It can aggressively put these chars !~*\'() in the safe list. I will look at the history to see whether they were originally present and were removed for some reason, or whether they never made it into the list in the first place.

If we do add them, it should be in 3.3 only (someone could be relying on the old behavior).
msg297621 - (view) Author: Cheryl Sabella (csabella) * Date: 2017-07-04 00:01
Issue 16285 updated the urllib.parse.quote() reserved list to add '~'.

From the docstring:
def quote(string, safe='/', encoding=None, errors=None):
    """quote('abc def') -> 'abc%20def'

    Each part of a URL, e.g. the path info, the query, etc., has a
    different set of reserved characters that must be quoted.

    RFC 3986 Uniform Resource Identifiers (URI): Generic Syntax lists
    the following reserved characters.

    reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                  "$" | "," | "~"

    Each of these characters is reserved in some component of a URL,
    but not necessarily in all of them.

    Python 3.7 updates from using RFC 2396 to RFC 3986 to quote URL strings.
    Now, "~" is included in the set of reserved characters.

--------------------------------------------
However, looking at RFC3986 (https://tools.ietf.org/html/rfc3986), appendix A has the following:

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   reserved      = gen-delims / sub-delims
   gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

----------------------------------------------------
Should the missing ones be added or should this issue be closed if they aren't going to be added?

Thanks.
msg297675 - (view) Author: Jörn Hees (joern) * Date: 2017-07-04 15:16
It's been a while... nowadays I would mostly change the documentation of the quote function to point out that it is likely to quote more characters than strictly required by the spec. The function has been in place for so long (even in py3) that people will rely on its behavior.

I made an attempt to update the docstring accordingly in https://github.com/python/cpython/pull/2568


What I think is most confusing is that the current docs mention the reserved chars (which are, by the way, definitely wrong with respect to RFC 3986). As one can see in the code, the reserved chars don't play any role in quote; what matters are the unreserved chars (called _ALWAYS_SAFE https://github.com/python/cpython/blob/master/Lib/urllib/parse.py#L716 ).

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"

The current quote function's approach is to simply quote everything that is not in unreserved + safe (per arg).
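To illustrate that approach, here is a simplified sketch. It is not the CPython implementation; the names `UNRESERVED` and `simple_quote` are made up for this example, and UTF-8 encoding is assumed.

```python
import string
from urllib.parse import quote

# Hypothetical constant mirroring the RFC 3986 "unreserved" set quoted above.
UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")

def simple_quote(text, safe="/"):
    """Percent-encode every byte that is not in unreserved + safe."""
    allowed = UNRESERVED | set(safe)
    out = []
    for byte in text.encode("utf-8"):  # assumption: UTF-8, like quote()'s default
        ch = chr(byte)
        if byte < 128 and ch in allowed:
            out.append(ch)  # unreserved or caller-declared safe: keep as is
        else:
            out.append("%{:02X}".format(byte))  # everything else is quoted
    return "".join(out)

sample = "a b(c)/d"
print(simple_quote(sample))
print(quote(sample))
```

Note that the reserved set never appears in the sketch: a character such as "(" is quoted simply because it is neither unreserved nor listed in `safe`.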

In that respect it is quite close to the old JavaScript escape() function: https://www.w3schools.com/jsref/jsref_escape.asp


quick links
py2.7: https://github.com/python/cpython/blob/2.7/Lib/urllib.py#L1261
py3: https://github.com/python/cpython/blob/master/Lib/urllib/parse.py#L745
RFC3986: https://tools.ietf.org/html/rfc3986#appendix-A
History
Date User Action Args
2017-07-21 15:08:36 louielu set title: urrlib.quote quotes too many chars, e.g., '()' -> urllib.quote quotes too many chars, e.g., '()'
2017-07-10 23:35:24 Mariatta set stage: patch review
versions: - Python 2.7, Python 3.3, Python 3.4, Python 3.5, Python 3.6
2017-07-04 15:22:42 joern set versions: + Python 2.7, Python 3.4, Python 3.5, Python 3.6, Python 3.7
2017-07-04 15:16:51 joern set messages: + msg297675
2017-07-04 15:11:44 joern set pull_requests: + pull_request2638
2017-07-04 00:01:51 csabella set nosy: + csabella
messages: + msg297621
2011-09-08 10:20:57 ezio.melotti set nosy: + ezio.melotti
2011-09-06 15:40:16 merwok set nosy: + merwok
2011-09-06 11:54:32 orsenthil set versions: + Python 3.3, - Python 2.6, Python 2.7
nosy: + orsenthil
messages: + msg143596
assignee: orsenthil
2011-09-06 10:26:38 joern create