classification
Title: urllib.quote() escapes characters unnecessarily and contrary to docs
Type: behavior Stage: test needed
Components: Documentation, Library (Lib) Versions: Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: orsenthil Nosy List: ajaksu2, ezio.melotti, georg.brandl, nirs, orsenthil, rwanderley, thomaspinckney3, tlesher
Priority: normal Keywords: easy, patch

Created on 2008-04-15 15:09 by tlesher, last changed 2010-02-20 00:40 by flox. This issue is now closed.

Files
File name Uploaded Description Edit
issue2637.diff orsenthil, 2009-08-20 15:26
Messages (11)
msg65518 - (view) Author: Tim Lesher (tlesher) * Date: 2008-04-15 15:09
The urllib.quote docstring implies that it quotes only characters in RFC
2396's "reserved" set.

However, urllib.quote currently escapes all characters except those in
an "always_safe" list, which consists of alphanumerics and three
punctuation characters, "_.-".

This behavior is contrary to the RFC, which defines "unreserved"
characters as alphanumerics plus "mark" characters, or "-_.!~*'()".  

The RFC also says:

  Unreserved characters can be escaped without changing the semantics
  of the URI, but this should not be done unless the URI is being used
  in a context that does not allow the unescaped character to appear.

This seems to imply that "always_safe" should correspond to the RFC's
"unreserved" set of "alphanum" | "mark".
msg66339 - (view) Author: Tom Pinckney (thomaspinckney3) * Date: 2008-05-06 22:54
It also looks like urllib.quote (and quote_plus) do not properly handle 
unicode strings. urllib.urlencode() properly converts unicode strings to 
utf-8 encoded ascii strings before then calling urllib.quote() on them.
msg81803 - (view) Author: Daniel Diniz (ajaksu2) Date: 2009-02-12 18:59
@Tom: issue 1712522 tracks Unicode support.
msg91381 - (view) Author: Rodrigo Steinmuller Wanderley (rwanderley) Date: 2009-08-06 19:51
> Unreserved characters can be escaped without changing the semantics
> of the URI, but this should not be done unless the URI is being used
> in a context that does not allow the unescaped character to appear.

How can we identify "a context that does not allow the unescaped
character to appear"?
msg91417 - (view) Author: Nir Soffer (nirs) Date: 2009-08-08 11:37
You can control what is safe in your particular context using the safe 
keyword argument.

How do you want to support unicode? you must decide which character 
encoding you like, which depends on the server side decoding the url.

Just document the fact that this function does not accept unicode.
msg91425 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2009-08-09 03:49
The way to handle this issue would be add these characters 
'%/:=&?~#+!$,;'@()*[]' to always_safe list.

There has been a similar issue in the past Issue918368, tough in a
different context.

And if you see, urllib.urlopen function always passes these values as 
the safe parameter to the quote function,

fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'@()*[]")

Handling of unicode is a different issue, but for the current report,
this change should suffice.
msg91429 - (view) Author: Nir Soffer (nirs) Date: 2009-08-09 15:40
Senthil said:
> The way to handle this issue would be add these characters 
> '%/:=&?~#+!$,;'@()*[]' to always_safe list.

This is wrong - for example, '&=?' are NOT safe when quoting parameters
for query string. This will break exiting code that assume the default
safe parameters.

Other characters may be unsafe in other parts of the url - I did not
check which - and I don't have time to check. The current default
(safe='/') is the best option - it will work correctly in most case, and
in the worst is escaping some characters which are safe in particular
use case.

Since only the user know the context, the user should add safe
characters to the function. If you don't specify anything, the function
should be safe as possible for the worst use case.

If you want to add characters to the default safe list, you have to make
sure that the function will not break for common use cases.
msg91430 - (view) Author: Nir Soffer (nirs) Date: 2009-08-09 16:04
Here is one example of code that would break if the safe parameter is
changed in a careless way mentioned here (look for url_encode):

http://dev.pocoo.org/projects/werkzeug/browser/werkzeug/urls.py#L112

I'm sure we can find similar code in every web application.
msg91431 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2009-08-09 18:24
On Sun, Aug 09, 2009 at 03:40:47PM +0000, Nir Soffer wrote:
> for query string. This will break exiting code that assume the default
> safe parameters.
> 
> Other characters may be unsafe in other parts of the url - I did not

I agree with your comments and I had similar thoughts too.

The RFC spec says that different components in URL  have have
different characters that needs to be quoted.

The quote function is documented that it is *intended for path
component* and Python Documention provides a usage overview of quote
assuming that the developer will know what he/she is doing. It does
not deal with the specifics of quote w.r.t to URL components.

My comment was biased from the changes made to urllib.urlopen function
where we explicitly passed on reserved characters to the safe
parameter of quote and we got expected results. this change has been
there for few months now without any breakage reports.  And that
change was not according to any RFC but more based on the practical
issues encountered.

Yes, I agree that changes to quote function is bound to break the
code which relied on the earlier behaviour. I see at least 3 tests in
stdlib breaking, which can be modified without any loss in meaning, if
we want go with the change.

But, I feel it is okay to heed to your objection and leave the
function as it is. 
The need to change it does not have a strong backing in RFC.  It is a
not a bug, considering the documentation.

Only thing to live with will be urlopen's passing of safe characters.
msg91778 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2009-08-20 15:26
I see adding this information to the docs, might clarify a bit.

"By default, this   function is intended for quoting the path section of
the URL."

This is already present in the function docstring.

If there is no objection, I shall commit the attached patch and close
this issue.
msg92111 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2009-08-31 16:45
Fixed and 
Committed revision 74608 - trunk
Committed revision 74609 - py3k
History
Date User Action Args
2010-02-20 00:40:37floxsetstatus: open -> closed
dependencies: - urllib.quote throws exception on Unicode URL
2009-08-31 16:45:44orsenthilsetresolution: fixed
messages: + msg92111
2009-08-20 15:26:46orsenthilsetfiles: + issue2637.diff
keywords: + patch
messages: + msg91778
2009-08-09 18:24:31orsenthilsetmessages: + msg91431
2009-08-09 16:04:52nirssetmessages: + msg91430
2009-08-09 15:40:45nirssetmessages: + msg91429
2009-08-09 03:49:14orsenthilsetassignee: georg.brandl -> orsenthil
messages: + msg91425
2009-08-08 11:37:01nirssetnosy: + nirs
messages: + msg91417
2009-08-06 19:51:48rwanderleysetnosy: + rwanderley
messages: + msg91381
2009-02-12 21:58:27hayposetnosy: - haypo
2009-02-12 18:59:00ajaksu2setassignee: georg.brandl
dependencies: + urllib.quote throws exception on Unicode URL
components: + Documentation
versions: + Python 2.6, - Python 2.5
keywords: + easy
nosy: + ajaksu2, ezio.melotti, georg.brandl, haypo, orsenthil
messages: + msg81803
stage: test needed
2008-05-06 22:54:56thomaspinckney3setnosy: + thomaspinckney3
messages: + msg66339
2008-04-15 15:09:10tleshercreate