Author carljm
Recipients carljm
Date 2008-03-06.16:09:59
Message-id <1204819810.33.0.413464262118.issue2244@psf.upfronthosting.co.za>
In-reply-to
Content
Both urllib and urllib2 call urllib.unquote() multiple times on data in
the userinfo section of an FTP URL.  One call occurs at the end of the
urllib.splituser() function.  In urllib, the other call appears in
URLOpener.open_ftp().  In urllib2, the other two occur in
FTPHandler.ftp_open() and Request.get_host().

The effect of this is that if the userinfo section of an FTP URL needs
to contain a literal % sign followed by two hexadecimal digits, the %
sign must be double-encoded as %2525 (for urllib) or triple-encoded as
%252525 (for urllib2) in order for the URL to be accessed.
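
To make the encoding arithmetic concrete, here is a minimal sketch. It
assumes Python 3's urllib.parse.quote() and unquote() as stand-ins for
the Python 2 urllib functions discussed in this report; the helper
unquote_n() and the sample password are hypothetical and only
illustrate repeated decoding, not the actual library code paths.

```python
from urllib.parse import quote, unquote


def unquote_n(text, times):
    """Apply unquote() to text the given number of times (hypothetical helper)."""
    for _ in range(times):
        text = unquote(text)
    return text


# A password that contains a literal % followed by two hex digits.
literal = "p%41ss"

# Encode it one, two, and three times.
once = quote(literal, safe="")    # "p%2541ss"
twice = quote(once, safe="")      # "p%252541ss"
thrice = quote(twice, safe="")    # "p%25252541ss"

# One decode of a once-encoded value is what RFC 3986 expects.
print(unquote_n(once, 1))    # p%41ss

# Two decodes (the urllib case) corrupt a correctly encoded value...
print(unquote_n(once, 2))    # pAss

# ...so the caller has to double-encode to compensate.
print(unquote_n(twice, 2))   # p%41ss

# Three decodes (the urllib2 case) require triple-encoding.
print(unquote_n(thrice, 3))  # p%41ss
```

Each extra unquote() call strips one more layer of %25, which is why
the number of encodings needed tracks the number of decode calls.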

The proper behavior would be to unquote a given data segment only
once.  RFC 3986, URI: Generic Syntax
(http://gbiv.com/protocols/uri/rfc/rfc3986.html), addresses this very
issue in section 2.4 (When to Encode or Decode): "Implementations must
not percent-encode or decode the same string more than once, as decoding
an already decoded string might lead to misinterpreting a percent data
octet as the beginning of a percent-encoding, or vice versa in the case
of percent-encoding an already percent-encoded string."

The solution would be to standardize where in urllib and urllib2 the
unquoting happens, and then make sure it happens nowhere else.  I'm not
familiar enough with the libraries to know where it should be removed
without possibly breaking other behavior.  It seems that just removing
the map/unquote call in urllib.splituser() would fix the problem in
urllib.  I would guess the call in urllib2 Request.get_host() should
also be removed, as the RFC referenced above says clearly that only
individual data segments of the URL should be decoded, not larger
portions that might contain delimiters (: and @).
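
The split-first, decode-once ordering can be sketched as follows. This
is only an illustration under assumptions: it uses Python 3's
urllib.parse.unquote(), and the functions split_userinfo_once() and
decode_then_split() are hypothetical names, not the code in urllib,
urllib2, or the attached patch.

```python
from urllib.parse import unquote


def split_userinfo_once(netloc):
    """Split on the raw delimiters first, then decode each piece exactly once."""
    userinfo, at, host = netloc.rpartition("@")
    if not at:
        # No userinfo section present.
        return None, None, netloc
    user, colon, password = userinfo.partition(":")
    return unquote(user), (unquote(password) if colon else None), host


def decode_then_split(netloc):
    """Counter-example: decoding the whole string before splitting."""
    userinfo, at, host = unquote(netloc).rpartition("@")
    user, colon, password = userinfo.partition(":")
    return user, (password if colon else None), host


# User name "us:er" (the colon is encoded as %3A), password "p%41ss".
netloc = "us%3Aer:p%2541ss@ftp.example.org"

print(split_userinfo_once(netloc))
# ('us:er', 'p%41ss', 'ftp.example.org')

print(decode_then_split(netloc))
# ('us', 'er:p%41ss', 'ftp.example.org')
```

Decoding before splitting turns the encoded %3A into a real delimiter,
so the user name and password boundaries move; splitting first keeps
them intact.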

I've attached a patchset for these suggested changes.  Very superficial
testing suggests that the patch doesn't break anything obvious, but I
make no guarantees.
History
Date                 User    Action  Args
2008-03-06 16:10:10  carljm  set     spambayes_score: 0.0698425 -> 0.0698425; recipients: + carljm
2008-03-06 16:10:10  carljm  set     spambayes_score: 0.0698425 -> 0.0698425; messageid: <1204819810.33.0.413464262118.issue2244@psf.upfronthosting.co.za>
2008-03-06 16:10:03  carljm  link    issue2244 messages
2008-03-06 16:09:59  carljm  create