Issue 2637: urllib.quote() escapes characters unnecessarily and contrary to docs

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/46889

classification

Title:	urllib.quote() escapes characters unnecessarily and contrary to docs
Type:	behavior	Stage:	test needed
Components:	Documentation, Library (Lib)	Versions:	Python 2.6

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	orsenthil	Nosy List:	ajaksu2, ezio.melotti, georg.brandl, nirs, orsenthil, rwanderley, thomaspinckney3, tlesher
Priority:	normal	Keywords:	easy, patch

Created on 2008-04-15 15:09 by tlesher, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
issue2637.diff	orsenthil, 2009-08-20 15:26

Messages (11)
msg65518 - (view)	Author: Tim Lesher (tlesher) *	Date: 2008-04-15 15:09
The urllib.quote docstring implies that it quotes only characters in RFC 2396's "reserved" set. However, urllib.quote currently escapes all characters except those in an "always_safe" list, which consists of alphanumerics and three punctuation characters, "_.-". This behavior is contrary to the RFC, which defines "unreserved" characters as alphanumerics plus "mark" characters, or "-_.!~*'()". The RFC also says: Unreserved characters can be escaped without changing the semantics of the URI, but this should not be done unless the URI is being used in a context that does not allow the unescaped character to appear. This seems to imply that "always_safe" should correspond to the RFC's "unreserved" set of "alphanum" \| "mark".
msg66339 - (view)	Author: Tom Pinckney (thomaspinckney3) *	Date: 2008-05-06 22:54
It also looks like urllib.quote (and quote_plus) do not properly handle unicode strings. urllib.urlencode() properly converts unicode strings to utf-8 encoded ascii strings before then calling urllib.quote() on them.
msg81803 - (view)	Author: Daniel Diniz (ajaksu2) *	Date: 2009-02-12 18:59
@Tom: issue 1712522 tracks Unicode support.
msg91381 - (view)	Author: Rodrigo Steinmuller Wanderley (rwanderley)	Date: 2009-08-06 19:51
> Unreserved characters can be escaped without changing the semantics > of the URI, but this should not be done unless the URI is being used > in a context that does not allow the unescaped character to appear. How can we identify "a context that does not allow the unescaped character to appear"?
msg91417 - (view)	Author: Nir Soffer (nirs) *	Date: 2009-08-08 11:37
You can control what is safe in your particular context using the safe keyword argument. How do you want to support unicode? you must decide which character encoding you like, which depends on the server side decoding the url. Just document the fact that this function does not accept unicode.
msg91425 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2009-08-09 03:49
The way to handle this issue would be add these characters '%/:=&?~#+!$,;'@()[]' to always_safe list. There has been a similar issue in the past Issue918368, tough in a different context. And if you see, urllib.urlopen function always passes these values as the safe parameter to the quote function, fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'@()[]") Handling of unicode is a different issue, but for the current report, this change should suffice.
msg91429 - (view)	Author: Nir Soffer (nirs) *	Date: 2009-08-09 15:40
Senthil said: > The way to handle this issue would be add these characters > '%/:=&?~#+!$,;'@()*[]' to always_safe list. This is wrong - for example, '&=?' are NOT safe when quoting parameters for query string. This will break exiting code that assume the default safe parameters. Other characters may be unsafe in other parts of the url - I did not check which - and I don't have time to check. The current default (safe='/') is the best option - it will work correctly in most case, and in the worst is escaping some characters which are safe in particular use case. Since only the user know the context, the user should add safe characters to the function. If you don't specify anything, the function should be safe as possible for the worst use case. If you want to add characters to the default safe list, you have to make sure that the function will not break for common use cases.
msg91430 - (view)	Author: Nir Soffer (nirs) *	Date: 2009-08-09 16:04
Here is one example of code that would break if the safe parameter is changed in a careless way mentioned here (look for url_encode): http://dev.pocoo.org/projects/werkzeug/browser/werkzeug/urls.py#L112 I'm sure we can find similar code in every web application.
msg91431 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2009-08-09 18:24
On Sun, Aug 09, 2009 at 03:40:47PM +0000, Nir Soffer wrote: > for query string. This will break exiting code that assume the default > safe parameters. > > Other characters may be unsafe in other parts of the url - I did not I agree with your comments and I had similar thoughts too. The RFC spec says that different components in URL have have different characters that needs to be quoted. The quote function is documented that it is intended for path component and Python Documention provides a usage overview of quote assuming that the developer will know what he/she is doing. It does not deal with the specifics of quote w.r.t to URL components. My comment was biased from the changes made to urllib.urlopen function where we explicitly passed on reserved characters to the safe parameter of quote and we got expected results. this change has been there for few months now without any breakage reports. And that change was not according to any RFC but more based on the practical issues encountered. Yes, I agree that changes to quote function is bound to break the code which relied on the earlier behaviour. I see at least 3 tests in stdlib breaking, which can be modified without any loss in meaning, if we want go with the change. But, I feel it is okay to heed to your objection and leave the function as it is. The need to change it does not have a strong backing in RFC. It is a not a bug, considering the documentation. Only thing to live with will be urlopen's passing of safe characters.
msg91778 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2009-08-20 15:26
I see adding this information to the docs, might clarify a bit. "By default, this function is intended for quoting the path section of the URL." This is already present in the function docstring. If there is no objection, I shall commit the attached patch and close this issue.
msg92111 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2009-08-31 16:45
Fixed and Committed revision 74608 - trunk Committed revision 74609 - py3k

History
Date	User	Action	Args
2022-04-11 14:56:33	admin	set	github: 46889
2010-02-20 00:40:37	flox	set	status: open -> closed dependencies: - urllib.quote throws exception on Unicode URL
2009-08-31 16:45:44	orsenthil	set	resolution: fixed messages: + msg92111
2009-08-20 15:26:46	orsenthil	set	files: + issue2637.diff keywords: + patch messages: + msg91778
2009-08-09 18:24:31	orsenthil	set	messages: + msg91431
2009-08-09 16:04:52	nirs	set	messages: + msg91430
2009-08-09 15:40:45	nirs	set	messages: + msg91429
2009-08-09 03:49:14	orsenthil	set	assignee: georg.brandl -> orsenthil messages: + msg91425
2009-08-08 11:37:01	nirs	set	nosy: + nirs messages: + msg91417
2009-08-06 19:51:48	rwanderley	set	nosy: + rwanderley messages: + msg91381
2009-02-12 21:58:27	vstinner	set	nosy: - vstinner
2009-02-12 18:59:00	ajaksu2	set	assignee: georg.brandl dependencies: + urllib.quote throws exception on Unicode URL components: + Documentation versions: + Python 2.6, - Python 2.5 keywords: + easy nosy: + ajaksu2, ezio.melotti, georg.brandl, vstinner, orsenthil messages: + msg81803 stage: test needed
2008-05-06 22:54:56	thomaspinckney3	set	nosy: + thomaspinckney3 messages: + msg66339
2008-04-15 15:09:10	tlesher	create