This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: urllib and urllib2 decode userinfo multiple times
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.1, Python 3.2, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: orsenthil Nosy List: carljm, dstanek, fdrake, jafo, jesstess, jjlee, orsenthil, ted.turocy, ysj.ray
Priority: normal Keywords: easy, patch

Created on 2008-03-06 16:10 by carljm, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
urllib-issue.patch carljm, 2008-03-06 16:09 patch
urllib_issue_updated.patch ysj.ray, 2010-07-20 08:24 patch against the trunk. review
urllib_issue_updated.diff ysj.ray, 2010-07-20 08:43 patch against py3k review
urllib_ftptests_doubleencode.patch ted.turocy, 2010-08-01 01:23 patch against py3k to add test cases exhibiting original bug. review
Messages (10)
msg63324 - (view) Author: Carl Meyer (carljm) * Date: 2008-03-06 16:09
Both urllib and urllib2 call urllib.unquote() multiple times on data in
the userinfo section of an FTP URL.  One call occurs at the end of the
urllib.splituser() function.  In urllib, the other call appears in
URLOpener.open_ftp().  In urllib2, the other two occur in
FTPHandler.ftp_open() and Request.get_host().

The effect of this is that if the userinfo section of an FTP url should
need to contain a literal % sign followed by two digits, the % sign must
be double-encoded as %2525 (for urllib) or triple-encoded as %252525
(for urllib2) in order for the URL to be accessed.

The proper behavior would be to only ever unquote a given data segment
once.  The W3's URI: Generic Syntax RFC
(http://gbiv.com/protocols/uri/rfc/rfc3986.html) addresses this very
issue in section 2.4 (When to Encode or Decode): "Implementations must
not percent-encode or decode the same string more than once, as decoding
an already decoded string might lead to misinterpreting a percent data
octet as the beginning of a percent-encoding, or vice versa in the case
of percent-encoding an already percent-encoded string."

The solution would be to standardize where in urllib and urllib2 the
unquoting happens, and then make sure it happens nowhere else.  I'm not
familiar enough with the libraries to know where it should be removed
without possibly breaking other behavior.  It seems that just removing
the map/unquote call in urllib.splituser() would fix the problem in
urllib.  I would guess the call in urllib2 Request.get_host() should
also be removed, as the RFC referenced above says clearly that only
individual data segments of the URL should be decoded, not larger
portions that might contain delimiters (: and @).

I've attached a patchset for these suggested changes.  Very superficial
testing suggests that the patch doesn't break anything obvious, but I
make no guarantees.
msg64201 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2008-03-20 20:23
Fred: You most recently touched the code impacted by this test, does
this sound reasonable?
msg110858 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-20 03:39
Re-assigned as per maintainers list.
msg110872 - (view) Author: ysj.ray (ysj.ray) Date: 2010-07-20 07:39
By confirming the "%2525" case, I guess this is really a bug. But the patch cause test_urllib2.py failed. I modified the patch to fix it.
msg110875 - (view) Author: ysj.ray (ysj.ray) Date: 2010-07-20 08:43
This bug exists on py3k also, so I worked out a patch for py3k, too.
msg110876 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-07-20 08:46
Thanks Ray Allen. Usually the patch against the py3k branch is enough, it will be ported to other branches.
msg112248 - (view) Author: Theodore Turocy (ted.turocy) Date: 2010-08-01 01:23
I am attaching a patch against py3k which adds some tests to demonstrate the original bug when issuing a Request() to a ftp:// URL.

To do this, the tests add checks for user and passwd.  The previous version of checks asserted that the .user and .passwd of the returned request should be "".  Checking .user is necessary to verify the original bug.  

I was confused by a comment in the fixture, "ftp authentication not yet implemented by FTPHandler", which appeared to justify the assumption that .user and .passwd must be "".  This may be true, but .user and .passwd are being set by the Request.  The test includes the minimal augmentation to extend on the original behavior of the test.  Someone with greater knowledge than I might look at that comment to see if it is out-of-date, or simply too vague.
msg121257 - (view) Author: Jessica McKellar (jesstess) * (Python triager) Date: 2010-11-16 01:17
I can confirm that the combination of urllib_issue_updated.diff and urllib_ftptests_doubleencode.patch apply cleanly against py3k, that the added tests exercise the described bug, and that the full test suite passes after applying the patches.

The patches look clean and conforming to PEP 8.
msg121476 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-11-18 16:47
Fixed in r86520 (py3k) and r86522 (release31-maint). 
unquote happens at the first level inside Request class, _parse method.

Shall port to py2.7.
msg121611 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-11-20 11:24
back-ported to release27-maint in r86554.
History
Date User Action Args
2022-04-11 14:56:31adminsetgithub: 46497
2010-11-20 11:24:35orsenthilsetstatus: open -> closed

messages: + msg121611
2010-11-18 16:47:46orsenthilsetnosy: - BreamoreBoy
messages: + msg121476

resolution: fixed
stage: patch review -> resolved
2010-11-16 01:17:32jesstesssetnosy: + jesstess
messages: + msg121257
2010-08-01 01:28:07r.david.murraysettype: resource usage -> behavior
stage: test needed -> patch review
2010-08-01 01:23:56ted.turocysetfiles: + urllib_ftptests_doubleencode.patch
nosy: + ted.turocy
messages: + msg112248

2010-07-31 17:39:57dstaneksetnosy: + dstanek
2010-07-20 08:46:11orsenthilsetmessages: + msg110876
2010-07-20 08:43:55ysj.raysetfiles: + urllib_issue_updated.diff

messages: + msg110875
2010-07-20 08:24:15ysj.raysetfiles: + urllib_issue_updated.patch
2010-07-20 08:23:55ysj.raysetfiles: - urllib_issue_updated.patch
2010-07-20 07:39:34ysj.raysetfiles: + urllib_issue_updated.patch
nosy: + ysj.ray
messages: + msg110872

2010-07-20 03:39:47BreamoreBoysetversions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6, Python 3.0
nosy: + BreamoreBoy

messages: + msg110858

assignee: fdrake -> orsenthil
2009-04-22 17:22:30ajaksu2setkeywords: + easy
2009-02-13 02:04:12ajaksu2setnosy: + jjlee
2009-02-12 18:00:36ajaksu2setnosy: + orsenthil
stage: test needed
versions: + Python 3.0, - Python 2.5
2008-03-20 20:23:47jafosetversions: + Python 2.6
nosy: + fdrake, jafo
messages: + msg64201
priority: normal
assignee: fdrake
type: behavior -> resource usage
2008-03-06 16:10:03carljmcreate