classification
Title: urllib.FancyURLopener.redirect_internal looses data on POST!
Type: behavior Stage: test needed
Components: Library (Lib) Versions: Python 2.6, Python 2.5
process
Status: closed Resolution: not a bug
Dependencies: 549151 Superseder:
Assigned To: orsenthil Nosy List: ajaksu2, akuchling, crocowhile, jimjjewett, jjlee, kxroberto, orsenthil
Priority: low Keywords: easy

Created on 2006-02-04 17:35 by kxroberto, last changed 2009-08-20 14:50 by orsenthil. This issue is now closed.

Messages (16)
msg27426 - (view) Author: kxroberto (kxroberto) Date: 2006-02-04 17:35
    def redirect_internal(self, url, fp, errcode,
                          errmsg, headers, data):
        if 'location' in headers:
            newurl = headers['location']
        elif 'uri' in headers:
            newurl = headers['uri']
        else:
            return
        void = fp.read()
        fp.close()
        # In case the server sent a relative URL, join with original:
        newurl = basejoin(self.type + ":" + url, newurl)
        return self.open(newurl)


... has to become ...


    def redirect_internal(self, url, fp, errcode,
                          errmsg, headers, data):
        if 'location' in headers:
            newurl = headers['location']
        elif 'uri' in headers:
            newurl = headers['uri']
        else:
            return
        void = fp.read()
        fp.close()
        # In case the server sent a relative URL, join with original:
        newurl = basejoin(self.type + ":" + url, newurl)
        return self.open(newurl, data)



... I guess? (", data" added)

Robert
msg27427 - (view) Author: kxroberto (kxroberto) Date: 2006-02-04 20:10
Found http://www.faqs.org/rfcs/rfc2616.html (quoted below).
But the behaviour is still strange, and the bug even more
serious: silently redirecting a POST as a GET without its data
is obscure for a language like Python and leads to
unpredictable results. The half-completed operation cannot be
stopped, and everything is left to a sensible reaction by the
server and a complex reinterpretation by the client. By
default, Python's urllibX should return the 30X code for a POST
redirection and provide the initial HTML: usually a
redirection HTML stub with
<a href=...
That would be consistent with the RFC: the user
(= the application, not Python!) can redirect under full control
without a wrong call being generated. In my application, a bug
went unnoticed for a long time because of this wrong behaviour.
With a 30X stub it would have been easy to discover and
understand ...

urllib2 has the same bug with POST redirection.

=======
10.3.2 301 Moved Permanently

   The requested resource has been assigned a new permanent URI and any
   future references to this resource SHOULD use one of the returned
   URIs.  Clients with link editing capabilities ought to automatically
   re-link references to the Request-URI to one or more of the new
   references returned by the server, where possible. This response is
   cacheable unless indicated otherwise.

   The new permanent URI SHOULD be given by the Location field in the
   response. Unless the request method was HEAD, the entity of the
   response SHOULD contain a short hypertext note with a hyperlink to
   the new URI(s).

   If the 301 status code is received in response to a request other
   than GET or HEAD, the user agent MUST NOT automatically redirect the
   request unless it can be confirmed by the user, since this might
   change the conditions under which the request was issued.

      Note: When automatically redirecting a POST request after
      receiving a 301 status code, some existing HTTP/1.0 user agents
      will erroneously change it into a GET request.
msg27428 - (view) Author: John J Lee (jjlee) Date: 2006-02-06 00:54
This is not a bug.
See the long discussion here:
http://python.org/sf/549151
msg27429 - (view) Author: kxroberto (kxroberto) Date: 2006-02-06 10:29
> http://python.org/sf/549151

The analysis of the browsers is right. lynx does best by
asking.
But urllibX is a library, not a browser (application): as of
now, with the standard urllibX error handling, you cannot write a lynx.

gvr's initial suggestion to raise a clear error (with the
redirection link as an attribute of the exception value) is
the best option. Another option would be to simply return the
unredirected stub HTML and keep the 30X code (and the redirection
Location in the headers).

To redirect POST as GET _while_ simply loosing (!) the data
(and not appending it to the GET-URL) is most bad for a lib.
Smartly transcribing a short form-like POST into a GET with a
query string would be so-so.
Don't know if the MS & netscape's also transpose to GET with
long data? ...

The current behaviour is the worst of all four. Any of the other
methods would at least have raised an early hint/error in
my case.
msg27430 - (view) Author: Jim Jewett (jimjjewett) Date: 2006-02-06 17:57
In theory, a GET may be automatic, but a POST requires user 
interaction, so the user can be held accountable for the 
results of a POST, but not of a GET.

Often, the page will respond to either; not sending the 
queries protects privacy in case of problems, and works more 
often than not.  (That said, I too would prefer a raised 
error or a transparent repost, at least as options.)
msg27431 - (view) Author: John J Lee (jjlee) Date: 2006-02-06 20:24
First, anyone replying to this, *please* read this page (and
the whole of this tracker note!) first:

http://ppewww.ph.gla.ac.uk/~flavell/www/post-redirect.html


kxroberto: you say that with standard urllibX error handling
you cannot get an exception on redirected 301/302/307 POST.
 That's not true of urllib2, since you may override
HTTPRedirectHandler.redirect_request(), which method was
designed and documented for precisely that purpose.  It
seems sensible to have a default that does what virtually
all browsers do (speaking as a long-time lynx user!).  I
don't know about the urllib case.

It's perfectly reasonable to extend urllib (if necessary) to
allow the option of raising an exception.  Note that (IIRC!)
 urllib's exceptions do not contain the response body data,
however (urllib2's HTTPErrors do contain the response body
data).

It would of course break backwards compatibility to start
raising exceptions by default here.  I don't think it's
reasonable to break old code on the basis of a notional
security issue when the de-facto standard web client
behaviour is to do the redirect.  In reality, the only
"security" value of the original prescriptive rule was as a
convention to be followed by white-hat web programmers and
web client implementors to help users avoid unintentionally
re-submitting non-idempotent requests.  Since that
convention is NOT followed in the real world (lynx doesn't
count as the real world ;-), I see no value in sticking
rigidly to the original RFC spec -- especially when 2616
even provides 307 precisely in response to this problem. 
Other web client libraries, for example libwww-perl and Java
HTTPClient, do the same as Python here IIRC.  RFC 2616
section 10.3.4 even suggests web programmers use 302 to get
the behaviour you complain about!

The only doubtful case here is 301.  A decision was made on
the default behaviour in that case back when the tracker
item I pointed you to was resolved.  I think it's a mistake
to change our minds again on that default behaviour.


kxroberto.seek(nrBytes)
assert kxroberto.readline() == """\
To redirect POST as GET _while_ simply loosing (!) the data
(and not appending it to the GET-URL) is most bad for a lib."""

No.  There is no value in supporting behaviour which is
simply contrary to both de-facto and prescriptive standards
(see final paragraph of RFC 2616 section 10.3.3: if we
accept the "GET on POST redirect" rule, we must accept that
the Location header is exactly the URL that should be
followed).  FYI, many servers return a redirect URL
containing the urlencoded POST data from the original request.


kxroberto: """Don't know if the MS & netscape's also
transpose to GET with long data? ..."""

urllib2's behaviour (and urllib's, I believe) on these
issues is identical to that of IE and Firefox.
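
The default behaviour described here can be checked without any network traffic. What follows is only a minimal sketch, assuming Python 3, where urllib2's HTTPRedirectHandler lives in urllib.request. The URLs and the helper name follow_up_for are made up for illustration. It calls redirect_request() directly and reports what the follow-up request would look like after a POST has been answered with a redirect.

```python
import urllib.error
import urllib.request

# Hypothetical URLs; nothing is sent over the network.
ORIGINAL_URL = "http://example.com/form"
REDIRECT_URL = "http://example.com/moved"


def follow_up_for(code):
    """Describe the request the stock handler builds after a redirected POST.

    Returns a (method, data) tuple, or None if the handler refuses to
    redirect and raises HTTPError instead.
    """
    post = urllib.request.Request(ORIGINAL_URL, data=b"field=value")
    handler = urllib.request.HTTPRedirectHandler()
    try:
        # fp and headers are only passed on to HTTPError, so placeholders do.
        new = handler.redirect_request(
            post, None, code, "redirect", {}, REDIRECT_URL)
    except urllib.error.HTTPError:
        return None
    return new.get_method(), new.data


if __name__ == "__main__":
    for code in (301, 302, 303, 307):
        print(code, follow_up_for(code))
```

With the stock CPython handler, a POST answered by 301, 302 or 303 should come back as a GET without a body, while a 307 is refused with an HTTPError.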


jimjewett: """In theory, a GET may be automatic, but a POST
requires user interaction, so the user can be held
accountable for the results of a POST, but not of a GET."""

That theory has been experimentally falsified ;-)
msg27432 - (view) Author: Jim Jewett (jimjjewett) Date: 2006-02-06 20:52
Sorry, I was trying to provide a quick explanation of why we 
couldn't just "do the obvious thing" and repost with data.

Yes, I realize that in practice, GET is used for non-
idempotent actions, and POST is (though less often) done 
automatically.

But since that is the official policy, I wouldn't want to 
bet too heavily against it in a courtroom -- so python 
defaults should be at least as conservative as both the spec 
and the common practice.  
msg27433 - (view) Author: John J Lee (jjlee) Date: 2006-02-06 21:19
Conservative or not, I see no utility in changing the
default, and several major harmful effects: old code breaks,
and people have to pore over the specs to figure out why
"urlopen() doesn't work".
msg62579 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2008-02-20 00:13
Can this item be closed, given jjlee's argument against changing the
behaviour?
msg86314 - (view) Author: Daniel Diniz (ajaksu2) (Python triager) Date: 2009-04-22 18:48
I agree that changing the default isn't an option.

However, IMHO, having to override HTTPRedirectHandler.redirect_request
or FancyURLopener.redirect_internal to get RFC-compliant behaviour
(albeit not useful in 99.99% of use cases) is a bit weird.

Maybe the docs should contain an example of how to be compliant?
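
One possible shape for such an example, offered as a sketch rather than an official recipe: it assumes Python 3's urllib.request (the successor of urllib2), and the class name StrictRedirectHandler is invented here. It refuses to follow 301, 302 and 307 automatically for anything other than GET or HEAD, and raises HTTPError so that the calling application can decide what to do.

```python
import urllib.error
import urllib.request


class StrictRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Hypothetical handler: never auto-redirect a non-GET/HEAD request
    on 301, 302 or 307; hand the decision back to the caller instead."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        if code in (301, 302, 307) and req.get_method() not in ("GET", "HEAD"):
            # The caller can inspect err.code, err.headers and, when a
            # response body is available, err.read().
            raise urllib.error.HTTPError(req.full_url, code, msg, headers, fp)
        # Everything else (GET/HEAD, or 303 after POST) keeps the default.
        return super().redirect_request(req, fp, code, msg, headers, newurl)


# An instance of a subclass replaces the default redirect handler.
opener = urllib.request.build_opener(StrictRedirectHandler())
```

Calling opener.open(url, data) then raises urllib.error.HTTPError for a redirected POST instead of silently issuing a GET. A 303 is still followed with a GET, which is what that status code asks for.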
msg91560 - (view) Author: Giorgio (crocowhile) Date: 2009-08-14 16:41
I am not sure where we stand with this issue. It seems to be an old one.
urllib2 still claim (as of python 2.6) the following;

# Strictly (according to RFC 2616), 301 or 302 in response
# to a POST MUST NOT cause a redirection without confirmation
# from the user (of urllib2, in this case).  In practice,
# essentially all clients do redirect in this case, so we
# do the same.
# be conciliant with URIs containing a space

This is just not true, we don't do the same at all. redirect_request
does not pass data along and it even changes the headers to reflect
content-size, thus behaving perfectly in accordance with RFC.

For those who stumbled upon this page looking for a workaround, this is
how to do: create a new class inheriting from HTTPRedirectHandler and
use this one instead:


import urllib2


class AutomaticHTTPRedirectHandler(urllib2.HTTPRedirectHandler):

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        """Return a Request or None in response to a redirect.

        The default response in redirect_request claims not to
        follow directives in RFC 2616, but in fact it does.
        This class does not, and makes handling 302 with POST
        possible.
        """
        m = req.get_method()
        if (code in (301, 302, 303, 307) and m in ("GET", "HEAD")
            or code in (301, 302, 303) and m == "POST"):
            # Be lenient with URIs containing a space.
            newurl = newurl.replace(' ', '%20')
            # Unlike the default handler, carry the original data along.
            return urllib2.Request(newurl,
                           data=req.get_data(),
                           headers=req.headers,
                           origin_req_host=req.get_origin_req_host(),
                           unverifiable=True)
        else:
            raise urllib2.HTTPError(req.get_full_url(), code, msg,
                                    headers, fp)
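
For readers on Python 3, roughly the same idea can be sketched with urllib.request. This is an assumption-laden translation, not code from this thread: the class name RepostRedirectHandler is invented, and req.data, req.full_url and req.origin_req_host stand in for the Python 2 accessor methods. Note that it deliberately re-sends the POST body to the redirect target, which is the behaviour the RFC text quoted earlier warns about.

```python
import urllib.error
import urllib.request


class RepostRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Hypothetical handler: follow the redirect and keep the request body."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        method = req.get_method()
        if (code in (301, 302, 303, 307) and method in ("GET", "HEAD")
                or code in (301, 302, 303) and method == "POST"):
            # Passing data through keeps the follow-up request a POST.
            return urllib.request.Request(
                newurl.replace(" ", "%20"),
                data=req.data,
                headers=req.headers,
                origin_req_host=req.origin_req_host,
                unverifiable=True,
            )
        raise urllib.error.HTTPError(req.full_url, code, msg, headers, fp)


opener = urllib.request.build_opener(RepostRedirectHandler())
```

Whether re-posting is appropriate depends on the server: a 303 in particular asks for the follow-up to be a GET, so this sketch is only suitable where the application knows the target expects the data again.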
msg91570 - (view) Author: John J Lee (jjlee) Date: 2009-08-14 20:21
This issue is not a bug, and should be closed.  It was discussed at 
length many years ago (different bug tracker ticket), and resolved.  
Since then the same issue seems to come up every year or so, 
apparently raised by people who haven't checked the issue tracker for 
previous discussion.  Please, somebody close this issue!


> It seems to be an old one.
> urllib2 still claim (as of python 2.6) the following;
> 
> # Strictly (according to RFC 2616), 301 or 302 in response
> # to a POST MUST NOT cause a redirection without confirmation
> # from the user (of urllib2, in this case).  In practice,
> # essentially all clients do redirect in this case, so we
> # do the same.

Note that this is NOT a statement about whether the request sent as a 
result of the redirect response contains the original POST data.


> This is just not true, we don't do the same at all. redirect_request
> does not pass data along and it even changes the headers to reflect
> content-size, thus behaving perfectly in accordance with RFC.

This appears to be a statement about (amongst other things) whether 
the request sent as a result of the redirect response contains the 
original POST data.

So where's the connection between the comment you quote and your 
response to it, Giorgio?  Actually, I hope you don't mind if I ask you 
not to answer that question, but instead to go and read, very 
carefully, the tracker discussion for the original fix that introduced 
the comment you posted (you should be able to find it by svn 
annotating the source, finding the appropriate commit, then looking 
for a reference in the commit message to a bug tracker issue ID).  
Once you've done that, please stop posting on this issue <0.2 wink>

Sorry, I'm not normally this grumpy, but this issue just seems to keep 
coming back forever, because people haven't spent the time to test 
browser behaviour, carefully read the RFC, tracker discussion, commit 
messages, etc.  If you have done all that and thought carefully and 
still think there's a bug, by all means come back, but please make 
sure you're extremely clear about *exactly* what you think is wrong, 
and why.  Write a test case, and cite specific RFC wording.  If what 
you think is wrong is not the same as the original issue described in 
the opening comment of this bug tracker ticket, please raise a new 
ticket rather than commenting on this one.


> For those who stumbled upon this page looking for a workaround, this is
> how to do: create a new class inheriting from HTTPRedirectHandler and
> use this one instead:

I don't know what this is a workaround *for*.
msg91571 - (view) Author: Giorgio (crocowhile) Date: 2009-08-14 20:47
>I don't know what this is a workaround *for*.

As you can see yourself, that code does a complete redirection, taking
along the post_data too which is simply not possible by default (and
that is obviously a pain in the neck).

I never said it was "bug" nor that the code had to be changed. I am just
saying this is "a lack of a feature" that obviously many would like to
see implemented - and this is probably why it "seems to come back forever".
msg91605 - (view) Author: John J Lee (jjlee) Date: 2009-08-15 11:58
If you have a feature request, please open a separate ticket.  This one 
is about an alleged bug.
msg91655 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2009-08-17 02:26
I am assigning this to myself. I shall do some research on this issue,
plus the current behaviour of other clients/libraries, and come up with a
summary.
msg91777 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2009-08-20 14:50
I agree with John on this ticket. To begin with, this is not a bug,
and reading through the referenced ticket explains the design decision
behind the behavior.
In summary:
<quote>
This suggests to me that *no* automatic repeat of POST
requests should ever be done, and that in the case of a 302
or 303 response, a POST should be replaced by a GET; this
may also be done for a 301 response -- even though the
standard calls that an error, it admits that it is done by
old clients.
</quote>
That was Guido's point at that time.

The least that could be done is to make a decision on the 301 response,
but changing it would break the other clients which rely on 'earlier
standard behavior though not compliant with RFC'.

At the moment, this won't be necessary, as it would just break clients
using urllib.

Giorgio's point in rekindling this issue is not related to the urllib
module; it concerns urllib2's redirect_request implementation
specifically. So, if an alternate behavior is desired for urllib2's
redirects (if they are observed by existing clients), it could be handled
by another request.

So, effectively closing this request.
History
Date                 User        Action  Args
2009-08-20 14:50:49  orsenthil   set     status: open -> closed; resolution: not a bug; messages: + msg91777
2009-08-17 02:26:06  orsenthil   set     assignee: orsenthil; messages: + msg91655
2009-08-15 11:58:14  jjlee       set     messages: + msg91605
2009-08-14 20:47:54  crocowhile  set     messages: + msg91571
2009-08-14 20:21:30  jjlee       set     messages: + msg91570
2009-08-14 16:41:57  crocowhile  set     nosy: + crocowhile; messages: + msg91560; versions: + Python 2.5, - Python 3.0
2009-04-22 18:48:22  ajaksu2     set     priority: normal -> low; nosy: + ajaksu2; messages: + msg86314; keywords: + easy
2009-02-12 18:14:47  ajaksu2     set     nosy: + orsenthil; dependencies: + urllib2 POSTs on redirect; type: behavior; stage: test needed; versions: + Python 2.6, Python 3.0, - Python 2.4
2008-02-20 00:13:44  akuchling   set     nosy: + akuchling; messages: + msg62579
2006-02-04 17:35:20  kxroberto   create