classification
Title: urllib doesn't correct server returned urls
Type: behavior Stage: test needed
Components: Library (Lib) Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: 1153027 Superseder:
Assigned To: orsenthil Nosy List: AdamN, ajaksu2, jhylton, jjlee, mike_j_brown, orsenthil, robzed
Priority: normal Keywords: patch

Created on 2004-03-17 22:41 by robzed, last changed 2009-10-07 13:22 by AdamN. This issue is now closed.

Files
File name Uploaded Description Edit
redirect_failure.py robzed, 2004-03-17 22:41 Program demonstrating urllib non-correction behaviour and client/server transaction text.
issue918368-py27.diff orsenthil, 2009-04-19 10:25
Messages (16)
msg20248 - (view) Author: Rob Probin (robzed) Date: 2004-03-17 22:41
On a URL request where the server returns a URL with spaces in, 
urllib doesn't correct it before requesting the new page.

I think this is technically a server error, however, it does work 
from web browsers (Mozilla, Safari) but not from Python urllib.

I would suggest that when urllib is following "moved temporarily" 
links (or similar) from a server it translates spaces to %20.

See example program file for more including detailed server/client 
transactions text.


msg20249 - (view) Author: Rob Probin (robzed) Date: 2004-03-18 22:38
Logged In: YES 
user_id=1000470

I've tested a change to "redirect_internal(self, url, fp, errcode, errmsg, 
headers, data)" in "urllib.py" that adds a single line newurl = 
newurl.replace(" ","%20") after the basejoin() function call that appears 
to fix the problem.

This information is placed in the public domain.
msg20250 - (view) Author: Mike Brown (mike_j_brown) Date: 2004-03-31 04:45
Logged In: YES 
user_id=371366

I agree that it is a server error to put something that doesn't 
meet the syntactic definition of a URI in the Location header 
of a response. I don't see any harm in correcting obvious 
errors, though, in the interest of usability.

As for your proposed fix, instead of just correcting spaces, I 
would do

    newurl = quote(newurl, safe="/:=&?#+!$,;'@()*[]")

quote() is a urllib function that does percent-encoding. It is 
way out of date and does poorly with Unicode strings, but if 
called with the above arguments, it should safely clean up 
most mistakes. The set of additional "safe" characters I am 
passing in is the complete set of "reserved" characters 
according to the latest draft of RFC 2396bis; these are 
characters that definitely do or might possibly have special 
meaning in a URL and thus should not be percent-encoded 
blindly.

I am currently working on a urllib.quote() replacement for 
4Suite's Ft.Lib.Uri library, and will then see about getting the 
improvements folded back into urllib.
msg20251 - (view) Author: Mike Brown (mike_j_brown) Date: 2004-03-31 05:44
Logged In: YES 
user_id=371366

In my previous example, I forgot to add a tilde to the "safe" 
characters. Tilde is an "unreserved" character, a class of 
characters allowed in a URI but for which percent-encoding is 
discouraged.

    newurl = quote(newurl, safe="/:=&?~#+!$,;'@()*[]")
msg81431 - (view) Author: Daniel Diniz (ajaksu2) Date: 2009-02-09 01:51
Superseded by issue 1153027 ?
msg81472 - (view) Author: Rob Probin (robzed) Date: 2009-02-09 19:07
I agree - this appears to be the same as issue 1153027 ?
msg81490 - (view) Author: John J Lee (jjlee) Date: 2009-02-09 21:12
This bug refers to urllib.  Issue 1153027 refers to urllib2.  It's the
same problem in each case, though.
msg81696 - (view) Author: Daniel Diniz (ajaksu2) Date: 2009-02-12 00:46
Any idea about how to test this one reliably?
msg81816 - (view) Author: John J Lee (jjlee) Date: 2009-02-12 20:48
Mike's list is missing one more character, "%" itself.  So, the
replacement should be:

quote(newurl, safe="%/:=&?~#+!$,;'@()*[]")

The replacement should be done in the .open() method, not in the code
that handles redirects.
msg81817 - (view) Author: John J Lee (jjlee) Date: 2009-02-12 20:51
A suitable test would be to derive a urllib.URLOpener subclass that has
an open_spam(url) method which records the URL.  Then call
.open("spam://") in the test and verify that the URL has been quoted.
msg81818 - (view) Author: John J Lee (jjlee) Date: 2009-02-12 21:09
Uh, that should have beeen something like
.open("spam://example.com/%2E/") of course.
msg86160 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2009-04-19 10:25
The attached patch fixes this issue and adds the test too. Followed John
J Lee's suggestion for the patch.Can someone review this?

For the other referenced issue1153027, bug is not reproducible in the
trunk and I see that fix has been made in revision 43132. It follows the
approach of percent encoding space at redirect_request. I shall verify
if we need to change that to a different fix or present one would be
sufficient.
msg86220 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2009-04-21 03:27
Fixed this in the revision 71780 for Python 2.7.
msg93658 - (view) Author: Adam Nelson (AdamN) Date: 2009-10-06 20:07
This seems a bit serious for inclusion in 2.7 IMHO.  urllib is used in all 
sorts of hackish ways in the wild and I really wonder if this is going to 
cause more problems for people than it's worth.  The 3.x series alone 
seems like the best place for this which is addressed by issue 1153027.
msg93671 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2009-10-07 01:34
AdamN, did you specifically come across a scenario which broke due to
this change? I can understand your concern, in general. The
'non-breaking' existing tests is the one of confidence factor we have in
introducing the changes.
msg93697 - (view) Author: Adam Nelson (AdamN) Date: 2009-10-07 13:22
I can't think of too many specific scenarios.  It just seems like a non-
trivial behavior change (or rather, it is trivial but with possibly far 
reaching ramifications).

One issue I see is that the ticket morphed from just dealing with space 
characters to many special characters.  quote_plus() probably would have 
handled the space issue cleanly.

The other issue I'm concerned about is that there is no testing of non-
ascii URLs (or at least I don't see any).

Finally, a netloc like 'www.g oogle.com' would be converted invalidly to 
'www.g%20oogle.com' which would then be forwarded to an actual HTTP 
request.

I can't point to anything too specific - more of just a feeling that 
this ticket is already 5 1/2 years old and maybe things should be left 
as-is and enhancement made only to Python 3.x.  Of course, I'm really 
just a new Python convert so I don't have massive experience here by any 
means.
History
Date User Action Args
2009-10-07 13:22:20AdamNsetmessages: + msg93697
2009-10-07 01:34:51orsenthilsetmessages: + msg93671
2009-10-06 20:07:47AdamNsetnosy: + AdamN
messages: + msg93658
2009-05-05 18:47:05orsenthilsetstatus: open -> closed
2009-05-05 17:55:22orsenthilsetresolution: fixed
2009-04-28 01:48:43ocean-citylinkissue5861 dependencies
2009-04-21 03:27:08orsenthilsetmessages: + msg86220
versions: - Python 2.6
2009-04-19 10:25:58orsenthilsetfiles: + issue918368-py27.diff
assignee: orsenthil
messages: + msg86160

keywords: + patch
2009-03-28 12:25:13jhyltonsetnosy: + jhylton
2009-02-12 21:09:04jjleesetmessages: + msg81818
2009-02-12 20:51:30jjleesetmessages: + msg81817
2009-02-12 20:48:09jjleesetmessages: + msg81816
2009-02-12 18:28:48ajaksu2setnosy: + orsenthil
dependencies: + http_error_302() crashes with 'HTTP/1.1 400 Bad Request
2009-02-12 00:46:23ajaksu2settype: behavior
stage: test needed
messages: + msg81696
versions: + Python 2.6, Python 2.7, - Python 2.3
2009-02-09 21:12:13jjleesetnosy: + jjlee
messages: + msg81490
2009-02-09 19:07:34robzedsetmessages: + msg81472
2009-02-09 01:51:22ajaksu2setnosy: + ajaksu2
messages: + msg81431
2004-03-17 22:41:36robzedcreate