classification
Title: urllib.quote() escapes characters unnecessarily and contrary to docs
Type: behavior
Components: Library (Lib)
Versions: Python 2.5
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: thomaspinckney3, tlesher
Priority:
Keywords:

Created on 2008-04-15 15:09 by tlesher, last changed 2008-05-06 22:54 by thomaspinckney3.

Messages
msg65518 Author: Tim Lesher (tlesher) Date: 2008-04-15 15:09
The urllib.quote docstring implies that it quotes only characters in RFC
2396's "reserved" set.

However, urllib.quote currently escapes all characters except those in
an "always_safe" list, which consists of alphanumerics and three
punctuation characters, "_.-".

This behavior is contrary to the RFC, which defines "unreserved"
characters as alphanumerics plus "mark" characters, or "-_.!~*'()".  

The RFC also says:

  Unreserved characters can be escaped without changing the semantics
  of the URI, but this should not be done unless the URI is being used
  in a context that does not allow the unescaped character to appear.

This seems to imply that "always_safe" should correspond to the RFC's
"unreserved" set of "alphanum" | "mark".
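The difference between the two character sets can be sketched as follows. This is only an illustration, not urllib's actual implementation: the names ALWAYS_SAFE, RFC2396_UNRESERVED, and quote_with are hypothetical, the helper handles ASCII input only, and it ignores the default safe='/' argument of urllib.quote.

```python
import string

# Hypothetical names for illustration; none of these are part of urllib.
ALNUM = string.ascii_letters + string.digits

# The set described above: alphanumerics plus "_.-".
ALWAYS_SAFE = ALNUM + "_.-"

# RFC 2396 "unreserved" = alphanum | mark, where mark is -_.!~*'()
RFC2396_MARK = "-_.!~*'()"
RFC2396_UNRESERVED = ALNUM + RFC2396_MARK


def quote_with(text, safe_chars):
    """Percent-encode each character of an ASCII string not in safe_chars."""
    return "".join(
        ch if ch in safe_chars else "%%%02X" % ord(ch)
        for ch in text
    )


sample = "it's(ok)!~*"

# With the narrower set, every "mark" character other than "_.-" is escaped.
print(quote_with(sample, ALWAYS_SAFE))         # it%27s%28ok%29%21%7E%2A

# With the RFC 2396 unreserved set, the string passes through unchanged.
print(quote_with(sample, RFC2396_UNRESERVED))  # it's(ok)!~*
```

Characters outside both sets, such as a space, are escaped either way, so the proposed change would only affect the six mark characters "!~*'()".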
msg66339 Author: Tom Pinckney (thomaspinckney3) Date: 2008-05-06 22:54
It also looks like urllib.quote() (and quote_plus()) do not properly handle
unicode strings. urllib.urlencode() properly converts unicode strings to
UTF-8 encoded byte strings before calling urllib.quote() on them.
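A possible workaround for callers, sketched under the assumption that UTF-8 is the desired encoding, is to encode the unicode string explicitly before quoting it. The try/except import is only there so the sketch runs on both Python 2 (urllib.quote) and Python 3 (urllib.parse.quote); the sample string is made up for illustration.

```python
try:
    from urllib.parse import quote  # Python 3 location
except ImportError:
    from urllib import quote        # Python 2 location

# A unicode string containing one non-ASCII character (e with acute accent).
text = u"caf\u00e9"

# Encode to UTF-8 bytes first, so quote() only ever sees byte values.
utf8_bytes = text.encode("utf-8")
quoted = quote(utf8_bytes)

# The accented character becomes two escaped UTF-8 bytes.
print(quoted)  # caf%C3%A9
```

Encoding at the call site makes the choice of encoding explicit instead of relying on whatever quote() does with a unicode argument.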
History
Date                 User             Action  Args
2008-05-06 22:54:56  thomaspinckney3  set     nosy: + thomaspinckney3; messages: + msg66339
2008-04-15 15:09:10  tlesher          create