This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Patch for [ 735515 ] urllib2 should cache 301 redir
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.2
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: orsenthil Nosy List: jjlee, orsenthil, pitrou, r.david.murray, vila
Priority: normal Keywords: patch

Created on 2007-07-18 02:32 by orsenthil, last changed 2022-04-11 14:56 by admin.

Files
File name Uploaded Description Edit
urllib2-301-patch.diff orsenthil, 2010-01-06 02:22 review
Messages (22)
msg52902 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2007-07-18 02:32
Tracker item python.org/sf/735515 mentions about urllib2 to cache 301 redirections.
Attached patch tries to implement the same.

Comments on Version 1 of patch:
a) Initializes a dictionary to store the redirection.
b) If req already in cache, return the previous Request object.
c) Otherwise handle the same as 302 and store the Request object.
d) Checks for loop errors in 301.

Just noticed, that it I missed max-redirect checks.
Please comment on this patch, with next version I shall add the max redirect check.

msg52903 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2007-07-24 03:56
I have made some changes to patch. Made it use just the self.cache to cache and follow the same logic for redirect and loop as 302 would follow. I thought this is ok. Did a basic testing to verify if it caught 301 redirect, yes it does.

File Added: urllib2-cache-301.patch
msg52904 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2007-07-26 14:38
Looks reasonable to me, however I wonder if you shouldn't purge cached information periodically.  Does RFC 2616 provide any guidance?

Skip
msg52905 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2007-08-02 04:17
Hi Skip,
I have implemented Cache redirection for 301 errors only which will be Permanant redirection. Expiry header is not expected for 301 direction, so we might not purge it.
RFC 2616 states that if cache is implemented for 302 redirects ( temporary redirects) than Expiry header should be handled and cache purged.

For our change it is not required.

Thanks,
Senthil
msg52906 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2007-08-02 10:55
How much extra work would it be to handle 302 redirects and the Expiry header?  (I don't want to hold you up from the rest of your SoC tasks.  Time's running out.)

Note also that you should also include a patch to the urllib2 documentation and, if possible, one or more test cases to exercise the new code.

Skip
msg52907 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2007-08-05 20:41
File Added: urllib2-301-cache.patch
msg52908 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2007-08-05 20:42
File Added: test_urllib2-cache.patch
msg52909 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2007-08-05 20:43
File Added: liburllib2.patch
msg52910 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2007-08-05 20:59
Hi Skip,
Added the updated patch and also added patches to test and documentation also.

Did some investigation on Cache redirection and expires for 302. IMHO, it should be handled as a separate feature request (based on the need), as we will have to handle the Cache-Control header values in addition to expires header. This would amount for good amount of change w.r.t the HTTPRedirectHandler itself.
[http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html section 14.9]

As for the current Tracker item, please verify and if found suitable checkin to the Python trunk.

Thanks,
Senthil


msg96899 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2009-12-27 06:54
I am attaching an updated patch for caching the 301 redirects.
As per RFC 2616:

10.3.2 301 Moved Permanently
   ...
   ...references returned by the server, where possible. This response
is  cacheable unless indicated otherwise.

So, I have included an additional argument to the 301 method called
cacheable=True. Which can be used to turn off the cache if required.

I would like to seek some comments on the patch, specifically on adding
this cacheable=True keyword argument.  If its fine, I would go ahead and
check it in.
msg96937 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-28 06:29
I have trouble understanding what the patch does. I would expect it to
cache the <original URL> -> <redirected URL> mapping, but it seems to
cache the final HTTP response instead.

Also, it's not obvious in which situations the default for `cacheable`
would be overriden. Aren't http_error_301 and friends for internal use?
msg96952 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2009-12-28 14:56
> Antoine Pitrou added the Comment:
>                                                                      
                                               
> I have trouble understanding what the patch does. I would expect it to
> cache the <original URL> -> <redirected URL> mapping, but it seems   
> cache the final HTTP response instead.

Oops. My mistake. I got carried away by my misunderstanding of the RFC
section on 301. The patch is wrong. I coded it to cache the response
from the redirection instead of just the redirected URL.
I shall write the correct one to cache just the redirection.           
                                               
                                                                       
                                               
> Aren't http_error_301 and friends for internal use?
                                                                       
      
Not really. They are exposed methods and I believe are being used by
clients. There have been bug reports related those redirection         
methods. Even the related Issue735515, explains pretty clearly about
the redirection required. (My bad again and my comment in that issue
is irrelevant.) In this issue, John says that there is no obvious need
to change the interface.  

RFC had a statement along the lines that "301 redirection is cached by
default, unless indicated otherwise". This is where I thought, an option
to turn-off the cacheable behavior might be needed.
                                                   
Thanks for looking at this quickly.
msg96959 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-12-28 16:29
I haven't reviewed the patch, but I would like the caching behavior to
be settable.  I can easily imagine a use case where I would not want the
URLs cached: when using urllib in a test suite or test tool.  (I just
ran into this problem trying to use firefox as a test tool...at the time
I was doing it I wasn't aware that firefox cached the 301 responses).
msg97155 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-01-03 04:46
Here is the corrected patch for caching the 301 redirections.

* It caches only the redirection not the response.
* It retains cacheable=True kwarg for http_error_301 method. ( I feel, it should be useful)
* Have made the cached dict as private.

I have updated the tests. The existing tests for 301 see no changes too.

If you have any review comments, please pitch in. (I shall add the docs and news entry before commit)

Thanks!
msg97161 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-01-03 12:45
> Here is the corrected patch for caching the 301 redirections.
> 
> * It caches only the redirection not the response.
> * It retains cacheable=True kwarg for http_error_301 method. ( I feel, it should be useful)
> * Have made the cached dict as private.

I'm still not sure what this patch is trying to do. It seems you are using the cached URL *after* getting the 301 response. But the whole point of caching redirections is to avoid emitting the initial request at all.
msg97262 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-01-05 14:32
Antoine: I got your point. Yes, I was missing the purpose of the redirection itself and the patch was wrong.

If the 301 is to be cached, the cache map should be maintained at the higher level in order for the further requests to refer to.
I have created a redirect_map at OpenerDirector level, and this will be populated by the RedirectHandler and will be referred to at time of creation of Request. This is along the correct lines, I could verify it with simple scripts and check the fetches from cache map for repeated redirects.

Your comments please.
msg97264 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-01-05 15:54
Well, first it would be better with some tests.

Second, what does it do for chained redirects? E.g. let's say that there's a chain of 301 redirects: A --> B --> C. Does it cache the whole A --> C mapping, or only A --> B? If the latter, will the chaining occur when looking up the redirected url from the cache (it doesn't seem to)?

Third, it seems to use a global OpenerDirector object. Are there situations where it should rather use a request-specific object?

Fourth, you shouldn't need to define a separate http_error_301 method. Just add the `cacheable` argument to `http_error_302`. Also, the `_cache_301_redirect` attribute seems useless.
msg97265 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-01-05 16:57
Thanks for the comments.
Shall come with the tests. 

Yes,it currently does not handle chained redirects via cache. I dont know RFC's stance on it. RFC does not say anything about 301 chained redirects and there are tricker issues of caching anything other than 301. Basically, 302 and other require the client to check and comply with the Cache-Control and Expires header. The feature request was reasonably for caching only 301 redirs and I also feel a good one to have. This is the reason for the separate http_error_301 method.

The global opener seems to be a straight forward way to for this
activity and not a harmful too.  I can't think of request-specific
object for this one.
msg97282 - (view) Author: John J Lee (jjlee) Date: 2010-01-05 23:29
To make sure I understood something Antoine said:  By "per-request", I assume you mean the same kind of thing as the current use of .redirect_dict -- the multiple urllib2.Request instances that may result from a single request passed by the user to .open()/urlopen all sharing the same cache state.

In addition to what Antoine said:

 0. patch reports that your latest patch is malformed (see below).
 1. I'm afraid I think any 301 caching that's not per-request should be off by default.  Defaulting to on would be a significant change in behaviour, because urllib2.urlopen (and OpenerDirector.open) currently retains no state between calls (unless you add a handler that keeps state, such as HTTPCookieProcessor, but no such handlers are added by default).
 2. I imagine the code changes should be entirely (or almost entirely) confined to RedirectHandler and/or AbstractHTTPHandler.  Is there any justification for changing OpenerDirector?  Certainly no need to add any globals!
 3. http_error_30x is a documented interface, but it's not frequently used.  The argument(s) used to control caching should be somewhere else (see questions below).
 4. Please do post the doc changes for review once the implementation is decided on.


Some questions to consider:

 a. How should POST requests be handled when there is a cached permanent redirect URI?
 b. How useful is per-request caching, in the sense defined above (as opposed to per-user agent caching -- i.e. per-handler in our case)?  Best answered with data from the web.
 c. Should URIs be normalised before being used as a cache key?
 d. Might the cache get big?

For all #a, #b, #c, #d: What do existing implementations (e.g. Firefox) do?


$ patch -p0 < urllib2-301-redirection-proper.diff
patching file Lib/urllib2.py
Hunk #3 succeeded at 549 with fuzz 2 (offset 18 lines).
Hunk #4 succeeded at 562 (offset 18 lines).
patch: **** malformed patch at line 55: @@ -604,8 +618,12 @@

$ patch --version
patch 2.5.9
Copyright (C) 1988 Larry Wall
Copyright (C) 2003 Free Software Foundation, Inc.

This program comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of this program
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING.

written by Larry Wall and Paul Eggert
msg97283 - (view) Author: John J Lee (jjlee) Date: 2010-01-05 23:54
> Yes,it currently does not handle chained redirects via cache. I dont know RFC's stance on it. RFC does not say anything about 301 chained redirects

I don't see anything in the RFC that prevents us caching chained 301 redirections.  Caching the chained redirections is better, because it reduces the number of requests, which is the purpose of the cache.

> there are tricker issues of caching anything other than 301.

Antoine wasn't suggesting caching URLs from non-301 responses, just that you don't need to add a new method named "http_redirect_301".  Just delete it and test "if code == 301" in http_error_302.

> Basically, 302 and other require the client to check and comply with the Cache-Control and Expires header.

I don't think references to caching in the descriptions of the 30* response codes are relevant, because urllib2 doesn't implement response caching.  I think the part that's relevant is this: "The requested resource has been assigned a new permanent URI and any future references to this resource SHOULD use one of the returned URIs."

> The global opener seems to be a straight forward way to for this activity and not a harmful too.  I can't think of request-specific object for this one.

No.  The global OpenerDirector instance is a convenience to allow having a global urlopen() function.  Having a handler pick a random opener object (the global one) would be insane.
msg97293 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-01-06 02:22
Attaching a non-malformed patch. I had incomplete tests in previous ones and I removed it by-hand before submitting for review. Something went wrong, I see.

Okay, I get the points you are making. Specifically a request specific object and then maintaining a map at AbstractHTTPHandler/HTTPRedirectHandler. Shall come up with such a method.
msg110588 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-17 18:03
Just a prod in case it has gone under the radar.
History
Date User Action Args
2022-04-11 14:56:25adminsetgithub: 45217
2014-02-03 19:20:04BreamoreBoysetnosy: - BreamoreBoy
2010-07-17 18:03:40BreamoreBoysetnosy: + BreamoreBoy

messages: + msg110588
versions: + Python 3.2, - Python 2.7
2010-01-06 02:22:54orsenthilsetfiles: - urllib2-301-cache.patch
2010-01-06 02:22:42orsenthilsetfiles: - test_urllib2-cache.patch
2010-01-06 02:22:36orsenthilsetfiles: - liburllib2.patch
2010-01-06 02:22:25orsenthilsetfiles: - urllib2-301-redirection-CORRECTED.diff
2010-01-06 02:22:17orsenthilsetfiles: - urllib2-301-redirection-proper.diff
2010-01-06 02:22:04orsenthilsetfiles: + urllib2-301-patch.diff

messages: + msg97293
2010-01-05 23:54:08jjleesetmessages: + msg97283
2010-01-05 23:29:10jjleesetmessages: + msg97282
2010-01-05 16:57:51orsenthilsetmessages: + msg97265
2010-01-05 15:54:21pitrousetmessages: + msg97264
2010-01-05 14:32:07orsenthilsetfiles: + urllib2-301-redirection-proper.diff

messages: + msg97262
2010-01-03 12:45:46pitrousetmessages: + msg97161
2010-01-03 04:46:56orsenthilsetfiles: - urllib2-301-redirection.diff
2010-01-03 04:46:32orsenthilsetfiles: + urllib2-301-redirection-CORRECTED.diff

messages: + msg97155
2009-12-28 16:29:20r.david.murraysetnosy: + r.david.murray
messages: + msg96959
2009-12-28 15:38:15pitroulinkissue735515 superseder
2009-12-28 15:38:15pitrouunlinkissue735515 dependencies
2009-12-28 14:56:18orsenthilsetmessages: + msg96952
2009-12-28 06:29:40pitrousetnosy: + pitrou
messages: + msg96937
2009-12-27 06:54:34orsenthilsetfiles: + urllib2-301-redirection.diff

messages: + msg96899
2009-08-21 01:18:58orsenthilsetassignee: orsenthil
2009-04-28 00:49:41skip.montanarosetnosy: - skip.montanaro
2009-04-27 22:41:38ajaksu2linkissue735515 dependencies
2009-04-27 22:41:20ajaksu2setnosy: + jjlee
versions: + Python 2.7, - Python 2.6

type: enhancement
stage: patch review
2008-01-05 13:07:03vilasetnosy: + vila
2007-07-18 02:32:18orsenthilcreate