classification
Title: urllib.quote quotes too many chars, e.g., '()'
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: orsenthil Nosy List: cheryl.sabella, eric.araujo, ezio.melotti, joern, miss-islington, orsenthil
Priority: normal Keywords: patch

Created on 2011-09-06 10:26 by joern, last changed 2019-04-10 01:04 by orsenthil. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 2568 merged joern, 2017-07-04 15:11
PR 12754 merged miss-islington, 2019-04-10 00:31
Messages (6)
msg143592 - (view) Author: Jörn Hees (joern) * Date: 2011-09-06 10:26
urllib.quote('()')
returns '%28%29'

Looking into its code it tries to follow RFC 2396 (which is good even though it should follow rfc3986 nowadays), but it doesn't:

http://tools.ietf.org/html/rfc2396 (see Appendix A, p.27): "(" and ")" are in mark and therefore unreserved, so why are they quoted?
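For reference, the report can be reproduced on Python 3, where the function is spelled urllib.parse.quote (urllib.quote is the Python 2 name used above). A minimal check, assuming a Python 3 interpreter:

```python
from urllib.parse import quote

# Parentheses are percent-encoded by a default call.
encoded = quote('()')
print(encoded)  # %28%29
```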
msg143596 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2011-09-06 11:54
It can aggressively put these chars !~*\'() in the safe list. I will look at the history to see whether they were originally present and were removed for some reason, or whether they never made it into the list in the first place.

If we do add, then it should be only 3.3 (Someone could be relying on the old behavior).
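Independently of what the default does, a caller can already opt in per call through the existing safe parameter. A small sketch, assuming Python 3's urllib.parse.quote; the names MARK_CHARS and quote_keep_marks are hypothetical and exist only for this illustration:

```python
from functools import partial
from urllib.parse import quote

# The characters discussed above, appended to the default safe value '/'.
MARK_CHARS = "!~*'()"
quote_keep_marks = partial(quote, safe='/' + MARK_CHARS)

default_result = quote("a(b)!*'")
relaxed_result = quote_keep_marks("a(b)!*'")
print(default_result)  # a%28b%29%21%2A%27
print(relaxed_result)  # a(b)!*'
```

This leaves existing callers untouched, which sidesteps the backward-compatibility concern for code that relies on the old output.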
msg297621 - (view) Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2017-07-04 00:01
Issue 16285 updated the urllib.parse.quote() reserved list to add '~'.

From the docstring:
def quote(string, safe='/', encoding=None, errors=None):
    """quote('abc def') -> 'abc%20def'

    Each part of a URL, e.g. the path info, the query, etc., has a
    different set of reserved characters that must be quoted.

    RFC 3986 Uniform Resource Identifiers (URI): Generic Syntax lists
    the following reserved characters.

    reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                  "$" | "," | "~"

    Each of these characters is reserved in some component of a URL,
    but not necessarily in all of them.

    Python 3.7 updates from using RFC 2396 to RFC 3986 to quote URL strings.
    Now, "~" is included in the set of reserved characters.

--------------------------------------------
However, looking at RFC3986 (https://tools.ietf.org/html/rfc3986), appendix A has the following:

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   reserved      = gen-delims / sub-delims
   gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

----------------------------------------------------
Should the missing ones be added or should this issue be closed if they aren't going to be added?

Thanks.
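To make the comparison concrete, the RFC 3986 reserved set quoted above can be checked against a default quote() call. This is only a sketch, assuming Python 3's urllib.parse.quote; the constant names are made up for the example:

```python
from urllib.parse import quote

# RFC 3986 appendix A character sets, copied from the grammar above.
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
RESERVED = GEN_DELIMS + SUB_DELIMS

# Reserved characters that a default call leaves unencoded.
left_alone = [ch for ch in RESERVED if quote(ch) == ch]
print(left_alone)  # ['/']
```

Only '/' survives, and only because it is the default value of safe; every other reserved character is percent-encoded.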
msg297675 - (view) Author: Jörn Hees (joern) * Date: 2017-07-04 15:16
It's been a while... Nowadays I would mostly change the documentation of the quote function to point out that it is likely to quote more characters than the spec strictly requires. The function has been in place for so long (even in py3) that people will rely on its behavior.

I made an attempt to update the docstring accordingly in https://github.com/python/cpython/pull/2568


What I think is most confusing is that the current docs mention the reserved chars (which are, by the way, definitely wrong with respect to RFC 3986). Actually, as one can see in the code, the reserved chars don't play any role in quote; what matters is the unreserved chars (called _ALWAYS_SAFE https://github.com/python/cpython/blob/master/Lib/urllib/parse.py#L716 ).

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"

The current quote function's approach is to simply quote everything that is not in unreserved + safe (per arg).
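The approach described above can be sketched in a few lines. This is a simplified illustration, not the CPython implementation: it assumes str input, UTF-8 encoding, and an ASCII-only safe argument, and it omits the caching, bytes input, and encoding/errors handling of the real function. The names UNRESERVED and sketch_quote are hypothetical.

```python
import string
from urllib.parse import quote

# RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset((string.ascii_letters + string.digits + "-._~").encode("ascii"))


def sketch_quote(text, safe="/"):
    """Percent-encode every UTF-8 byte not in unreserved + safe (illustration only)."""
    # Assumption: safe contains only ASCII characters.
    keep = UNRESERVED | frozenset(safe.encode("ascii"))
    return "".join(
        chr(byte) if byte in keep else "%{:02X}".format(byte)
        for byte in text.encode("utf-8")
    )


# Compare the sketch with the standard library on a few inputs.
samples = ["()", "abc def", "a/b?c=d", "é"]
for sample in samples:
    print(repr(sample), sketch_quote(sample), quote(sample))
```

Note that the reserved set never appears in the sketch; only unreserved and safe decide what is left alone.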

In that respect it is quite close to the old JavaScript escape() function: https://www.w3schools.com/jsref/jsref_escape.asp


quick links
py2.7: https://github.com/python/cpython/blob/2.7/Lib/urllib.py#L1261
py3: https://github.com/python/cpython/blob/master/Lib/urllib/parse.py#L745
RFC3986: https://tools.ietf.org/html/rfc3986#appendix-A
msg339820 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2019-04-10 00:31
New changeset 750d74fac5c510e39958b3f79641fe54096ee54f by Senthil Kumaran (Jörn Hees) in branch 'master':
bpo-12910: update and correct quote docstring (#2568)
https://github.com/python/cpython/commit/750d74fac5c510e39958b3f79641fe54096ee54f
msg339821 - (view) Author: miss-islington (miss-islington) Date: 2019-04-10 00:53
New changeset 796698adf558f2255474945082856538b1effb0b by Miss Islington (bot) in branch '3.7':
bpo-12910: update and correct quote docstring (GH-2568)
https://github.com/python/cpython/commit/796698adf558f2255474945082856538b1effb0b
History
Date User Action Args
2019-04-10 01:04:35 orsenthil set status: open -> closed
resolution: fixed
stage: patch review -> resolved
2019-04-10 00:53:06 miss-islington set nosy: + miss-islington
messages: + msg339821
2019-04-10 00:31:48 miss-islington set keywords: + patch
pull_requests: + pull_request12681
2019-04-10 00:31:21 orsenthil set messages: + msg339820
2017-07-21 15:08:36 louielu set title: urrlib.quote quotes too many chars, e.g., '()' -> urllib.quote quotes too many chars, e.g., '()'
2017-07-10 23:35:24 Mariatta set stage: patch review
versions: - Python 2.7, Python 3.3, Python 3.4, Python 3.5, Python 3.6
2017-07-04 15:22:42 joern set versions: + Python 2.7, Python 3.4, Python 3.5, Python 3.6, Python 3.7
2017-07-04 15:16:51 joern set messages: + msg297675
2017-07-04 15:11:44 joern set pull_requests: + pull_request2638
2017-07-04 00:01:51 cheryl.sabella set nosy: + cheryl.sabella
messages: + msg297621
2011-09-08 10:20:57 ezio.melotti set nosy: + ezio.melotti
2011-09-06 15:40:16 eric.araujo set nosy: + eric.araujo
2011-09-06 11:54:32 orsenthil set versions: + Python 3.3, - Python 2.6, Python 2.7
nosy: + orsenthil
messages: + msg143596
assignee: orsenthil
2011-09-06 10:26:38 joern create