classification
Title: urllib.quote horribly mishandles unicode as second parameter
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 2.7
process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To:
Nosy List: Michael Sander, ZackerySpytz, ezio.melotti, koriakin, orsenthil, r.david.murray
Priority: normal
Keywords:

Created on 2015-04-07 21:10 by koriakin, last changed 2020-07-06 08:41 by terry.reedy. This issue is now closed.

Messages (4)
msg240230 - Author: Marcin Kościelnicki (koriakin) Date: 2015-04-07 21:10
All hell breaks loose when unicode is passed as the second argument to urllib.quote in Python 2:

>>> import urllib
>>> urllib.quote('\xce\x91', u'')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/urllib.py", line 1292, in quote
    if not s.rstrip(safe):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)

This on its own wouldn't be that bad - just another Python 2 unicode wonkiness.  However, coupled with caching done by the quote function (quoters are cached based on the second parameter, and u'' == ''), it means that a random preceding call to quote from an entirely different place in the application can break your code:

$ python2
Python 2.7.9 (default, Dec 11 2014, 04:42:00)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> urllib.quote('\xce\x91', '')
'%CE%91'
>>>


$ python2
Python 2.7.9 (default, Dec 11 2014, 04:42:00)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> urllib.quote('a', u'')
'a'
>>> urllib.quote('\xce\x91', '')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/urllib.py", line 1292, in quote
    if not s.rstrip(safe):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)

Good luck debugging that.
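
The mechanism can be reproduced outside Python 2 with a small model. The sketch below is a Python 3 simulation, not the urllib source: `FakeUnicode`, `model_quote`, `_model_cache` and `demo` are hypothetical names, and the explicit `s.decode('ascii')` stands in for the implicit ASCII decoding that Python 2 performs when a str meets a unicode object. It assumes only what is described above: the cache is keyed on the second parameter, and the two kinds of empty string compare and hash equal.

```python
# Python 3 model of the Python 2.7 cache behaviour described above.
# All names here (FakeUnicode, model_quote, _model_cache, demo) are hypothetical.

ALWAYS_SAFE = ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'
               'abcdefghijklmnopqrstuvwxyz'
               '0123456789' '_.-')

_model_cache = {}


class FakeUnicode(str):
    """Stand-in for Python 2 `unicode`: equal to, and hashing like, a plain str."""


def model_quote(s, safe='/'):
    """Percent-quote the bytes `s`, caching the safe set keyed on `safe`."""
    cachekey = (safe, ALWAYS_SAFE)
    try:
        safe_chars = _model_cache[cachekey]
    except KeyError:
        safe_chars = ALWAYS_SAFE + safe
        if isinstance(safe, FakeUnicode):
            # Models Python 2, where str + unicode yields unicode,
            # so the cached value keeps the "unicode" type.
            safe_chars = FakeUnicode(safe_chars)
        _model_cache[cachekey] = safe_chars
    if isinstance(safe_chars, FakeUnicode):
        # Models Python 2's implicit ASCII decode when str meets unicode.
        s.decode('ascii')
    return ''.join(chr(b) if chr(b) in safe_chars else '%{:02X}'.format(b)
                   for b in s)


def demo():
    """Replay the two interpreter sessions from the report against the model."""
    results = []
    _model_cache.clear()
    results.append(model_quote(b'\xce\x91', ''))        # clean cache: works
    _model_cache.clear()
    results.append(model_quote(b'a', FakeUnicode('')))  # poisons the '' entry
    try:
        model_quote(b'\xce\x91', '')                    # plain '' now hits it
    except UnicodeDecodeError:
        results.append('UnicodeDecodeError')
    return results
```

In this model the third call fails even though it passes a plain `''`, because the lookup for `('', ALWAYS_SAFE)` finds the entry stored under `(FakeUnicode(''), ALWAYS_SAFE)`.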

So, one of two things needs to happen:

- a TypeError when unicode is passed as the second parameter, or
- a cast of the second parameter to str
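
Until something like the second option exists in the library, a caller can apply the cast itself. The following is a minimal sketch of such a wrapper, not a proposed patch: `quote_ascii_safe` is a hypothetical name, and the wrapper assumes that the safe characters are meant to be ASCII. It is written to import on both Python 2 and Python 3; on Python 3 it simply delegates to `urllib.parse.quote`.

```python
try:  # Python 2
    from urllib import quote as _quote
    _text_type = unicode  # noqa: F821 (this name exists only on Python 2)
except ImportError:  # Python 3
    from urllib.parse import quote as _quote
    _text_type = None


def quote_ascii_safe(s, safe='/'):
    """Hypothetical wrapper: keep a Python 2 unicode `safe` away from quote()."""
    if _text_type is not None and isinstance(safe, _text_type):
        # On Python 2, convert unicode to a byte str before it can reach the
        # quoter cache. A non-ASCII safe set raises UnicodeEncodeError here,
        # at the call site that passed it.
        safe = safe.encode('ascii')
    return _quote(s, safe)
```

This only protects call sites that go through the wrapper; any other code in the same process that calls urllib.quote directly with a unicode second parameter can still poison the cache.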
msg240242 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-04-07 23:56
The TypeError isn't going to happen, for backward compatibility reasons.  A fix isn't likely to happen either, because Python 2 doesn't really support unicode in urllib, to my understanding (if I'm wrong about that, the answer changes).  I'm not sure whether casting to string would have backward compatibility issues or not (I suspect it would; someone would have to investigate that question as a first step).
msg349663 - Author: Michael Sander (Michael Sander) Date: 2019-08-14 08:45
Couldn't this be fixed in a backwards compatible way by clearing the cache when this type of error occurs? We can do this by wrapping the offending line with a try/except, then checking to see if the cache is corrupted. If it is, then we clear the cache and try again.

try:
  if not s.rstrip(safe):
    return s
except UnicodeDecodeError:
  # Make sure the cache is okay; if not, try again.
  if any(not isinstance(s2, str) for q2, s2 in _safe_quoters.values()):
    # Cache is corrupted: clear it in place (rebinding the name would make it local to quote()).
    _safe_quoters.clear()
    # Recursive call to try again
    return quote(s, safe)
  raise
msg370493 - Author: Zackery Spytz (ZackerySpytz) * (Python triager) Date: 2020-05-31 18:42
Python 2 is EOL, so I think this issue should be closed.
History
Date                 User            Action  Args
2020-07-06 08:41:30  terry.reedy     set     status: open -> closed; resolution: out of date; stage: resolved
2020-05-31 18:42:05  ZackerySpytz    set     nosy: + ZackerySpytz; messages: + msg370493
2019-08-14 08:45:03  Michael Sander  set     nosy: + Michael Sander; messages: + msg349663
2015-04-07 23:56:56  r.david.murray  set     nosy: + r.david.murray; messages: + msg240242
2015-04-07 21:12:54  ezio.melotti    set     nosy: + orsenthil, ezio.melotti; type: behavior
2015-04-07 21:10:15  koriakin        create