Title: Make urllib2 more extensible (patch)
Type: enhancement Stage:
Components: Library (Lib) Versions:
Status: closed Resolution: duplicate
Dependencies: Superseder:
Assigned To: Nosy List: brett.cannon, jhylton, jjlee
Priority: normal Keywords:

Created on 2003-06-24 13:16 by jjlee, last changed 2022-04-10 16:09 by admin. This issue is now closed.

File name Uploaded Description Edit
processors_patch jjlee, 2003-06-24 13:16
Messages (10)
msg53919 - (view) Author: John J Lee (jjlee) Date: 2003-06-24 13:16
Problem with urllib2 as it stands: many things would be 
nice to implement as a handler rather than by overriding 
methods (and inevitably duplicating code and increasing 
fragility), but it's not always possible to do so.  For 
example (all from HTTP), automatically adding Referer 
headers, handling 200 responses that should have been 
non-2xx errors if the server were sane, handling cookies, 
handling HTTP-EQUIV headers as if they were normal 
HTTP headers, automatically making responses 
seekable, and following Refresh headers.  I've done all 
these things, but I had to duplicate code to do it, 
because I don't see how to do it with handlers.  I've now 
rewritten this code by adding a 'processor' scheme to 
urllib2 (I'm *not* using 'scheme' in the technical URL 
sense here!). 
Processors work quite similarly to handlers, except that 
1. They always *all* get run (unlike handlers, where only the 
first to handle a request or response gets run). 
2. The methods that get called on processors are of the 
form <proto>_request and <proto>_response, and are 
called, respectively, immediately before and immediately 
after the normal machinery.  
For example, http_request is called on every processor 
before the request is sent, and pre-processes HTTP requests; 
http_response post-processes HTTP responses. 
3. <proto>_request methods return request objects, and 
<proto>_response methods return response objects. 
4. Even 200 responses get processed. 
You use it like this: 
# just pass processors to build_opener as if they were handlers 
opener = build_opener(FooHandler, BarProcessor, 
response ="") 
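The processor scheme described above can be sketched with the method names from this message. This is a minimal sketch, assuming Python 3's urllib.request, where handler objects may define <proto>_request and <proto>_response methods; the patch as posted may differ in detail. CannedHTTPHandler and RefererProcessor are hypothetical names, and the canned handler stands in for the network so the example runs offline.

```python
import email
import io
import urllib.request
import urllib.response


class CannedHTTPHandler(urllib.request.BaseHandler):
    """Hypothetical stand-in for the network: answers every http request locally."""

    def __init__(self):
        self.seen = []  # requests as they looked after all processors ran

    def http_open(self, req):
        self.seen.append(req)
        headers = email.message_from_string("Content-Type: text/plain\n\n")
        resp = urllib.response.addinfourl(
            io.BytesIO(b"hello"), headers, req.get_full_url(), 200)
        resp.msg = "OK"  # addinfourl does not set .msg itself
        return resp


class RefererProcessor(urllib.request.BaseHandler):
    """Hypothetical processor: sends the previously fetched URL as Referer."""

    def __init__(self):
        self.previous = None

    def http_request(self, req):
        # Pre-process: return the (possibly modified) request object.
        if self.previous is not None and not req.has_header("Referer"):
            req.add_unredirected_header("Referer", self.previous)
        return req

    def http_response(self, req, response):
        # Post-process: remember the URL, return the response object.
        self.previous = response.url
        return response


canned = CannedHTTPHandler()
opener = urllib.request.OpenerDirector()
opener.add_handler(RefererProcessor())
opener.add_handler(canned)

opener.open("http://example.com/a")
opener.open("http://example.com/b")
first_referer = canned.seen[0].get_header("Referer")
second_referer = canned.seen[1].get_header("Referer")
```

OpenerDirector is used directly here only to keep the default handlers (and any proxy settings from the environment) out of the way; passing the same objects to build_opener should work as well.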
I've reimplemented all my stuff (the features listed in the 
first paragraph, above) in terms of this scheme, and it all 
seems to be working fine (but no unit tests yet).  So, the 
scheme does achieve the extensibility it aims for.  The 
patch I've attached here doesn't include most of those 
features -- the only new functionality it adds is an 
HTTPRefererProcessor.  If this gets accepted, I intend to 
submit patches adding new processors for cookie 
handling etc. later. 
Two things I'd like to know: 1. will my solution break 
people's code 2. is there a better way? 
For 1., I *think* it shouldn't break code. 
For 2., the obvious problem with my solution (below) is 
that handlers are pretty similar to my processors already.  
The thing is, I can't see how to reimplement these things 
in terms of handlers.  First, I need to *see* all requests 
(responses) -- I can do that using handlers by giving 
them low (high) .handler_order in Python 2.3 and 
returning None from http_open (http_error_xxx).  
However, 200 responses never get seen by 
http_error_xxx, so that doesn't work (and changing that 
would break old code).  Second, I need to actually 
modify the requests and responses.  Sometimes I'd much 
rather do that by making a new request or response than 
modifying the old one in-place (redirections, for 
example) -- and in general, even if I *am* just modifying 
in-place, I'd still prefer to explicitly return the object than 
rely on side-effects.  Perhaps just adding a couple of 
hooks to AbstractHTTPHandler might get these jobs 
done, but I think the increased simplicity of 
AbstractHTTPHandler.do_open and the various 
processors makes my approach worthwhile (assuming it 
actually works & is backwards-compat., of course...). 
A few notes: 
Some headers (Content-Length, Referer, ...)  mustn't be 
copied to requests for a redirected URL.  This requires 
the addition of a new dict to Request.  I've added an 
add_unredirected_headers method, too.  The old 
implementation just sends these headers directly, but 
that's not possible if you want to use processors to 
implement these things. 
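The separate dict for headers that must not follow a redirect can be illustrated as below. This is a sketch assuming Python 3's urllib.request, where the method is spelled add_unredirected_header (singular) and HTTPRedirectHandler.redirect_request builds the follow-up request; the URLs and header values are made up for illustration.

```python
import urllib.request

# One ordinary header and one header that should not survive a redirect.
req = urllib.request.Request(
    "http://example.com/old", headers={"User-Agent": "demo"})
req.add_unredirected_header("Referer", "http://example.com/start")

# Ask the stock redirect handler to build the request for the new URL.
redirector = urllib.request.HTTPRedirectHandler()
new_req = redirector.redirect_request(
    req, None, 302, "Found", {}, "http://example.com/new")

original_has_referer = req.has_header("Referer")
redirected_has_referer = new_req.has_header("Referer")
redirected_user_agent = new_req.get_header("User-agent")
```

Only the ordinary headers are copied to the new request, so the Referer set on the original is not sent to the redirected URL.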
The current response object (httplib.HTTPResponse, 
wrapped with urllib.addinfourl) doesn't include response 
code or message (because code is always 200).  The 
patch just assigns .code and .msg attributes (maybe they 
should be methods, for uniformity with the rest of the 
response interface). 
Backwards-compatibility notes: 
People who override AbstractHTTPHandler.do_open will 
do non-200 response handling there, which will break 
processors, but that's a forwards-compat. issue.  I don't 
think the existence of overridden do_open methods in old 
code should be a problem for backwards-compatibility. 
Note that, though processors see all responses, the end 
user still only gets 200 responses returned.  
ErrorProcessor ensures that by passing non-200 
responses on to the existing urllib2 error machinery. 
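The division of labour described here (processors see every response, the caller only gets successful ones) can be sketched as follows. It assumes Python 3's urllib.request, where the corresponding class is HTTPErrorProcessor and any 2xx code counts as success rather than only 200. NotFoundHandler and StatusLogger are hypothetical names, and the 404 is produced locally instead of by a server.

```python
import email
import io
import urllib.error
import urllib.request
import urllib.response


class NotFoundHandler(urllib.request.BaseHandler):
    """Hypothetical handler that answers every http request with a 404."""

    def http_open(self, req):
        headers = email.message_from_string("Content-Type: text/plain\n\n")
        resp = urllib.response.addinfourl(
            io.BytesIO(b"missing"), headers, req.get_full_url(), 404)
        resp.msg = "Not Found"
        return resp


class StatusLogger(urllib.request.BaseHandler):
    """Hypothetical processor that records the status of every response."""

    # The default handler_order (500) sorts before HTTPErrorProcessor (1000),
    # so this processor sees the response before it is turned into an error.
    def __init__(self):
        self.codes = []

    def http_response(self, req, response):
        self.codes.append(response.code)
        return response


logger = StatusLogger()
opener = urllib.request.OpenerDirector()
for handler in (NotFoundHandler(), logger,
                urllib.request.HTTPErrorProcessor(),
                urllib.request.HTTPDefaultErrorHandler()):
    opener.add_handler(handler)

try:
    opener.open("http://example.com/missing")
    raised_code = None
except urllib.error.HTTPError as err:
    raised_code = err.code
```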
msg53920 - (view) Author: John J Lee (jjlee) Date: 2003-07-08 15:13
Logged In: YES 

I just noticed the patch breaks on https.  Trivially fixed by 
adding lines like https_request = http_request to the various 
processor classes. 
Also, another use case: gzip Content-encoding. 
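Both points in this message (the https aliases and the gzip use case) can be combined in one sketch. It assumes Python 3's urllib.request; GzipProcessor and CannedGzipHandler are hypothetical names, and the compressed body is generated locally rather than fetched.

```python
import email
import gzip
import io
import urllib.request
import urllib.response


class CannedGzipHandler(urllib.request.BaseHandler):
    """Hypothetical stand-in for a server that always replies with gzip."""

    def __init__(self):
        self.seen = []

    def http_open(self, req):
        self.seen.append(req)
        headers = email.message_from_string("Content-Encoding: gzip\n\n")
        resp = urllib.response.addinfourl(
            io.BytesIO(gzip.compress(b"payload")), headers,
            req.get_full_url(), 200)
        resp.msg = "OK"
        return resp

    https_open = http_open  # answer https URLs the same way


class GzipProcessor(urllib.request.BaseHandler):
    """Hypothetical processor: asks for gzip and decodes it transparently."""

    def http_request(self, req):
        if not req.has_header("Accept-encoding"):
            req.add_header("Accept-encoding", "gzip")
        return req

    def http_response(self, req, response):
        if response.headers.get("Content-Encoding", "").lower() != "gzip":
            return response
        body = gzip.decompress(response.read())
        del response.headers["Content-Encoding"]  # body is plain now
        decoded = urllib.response.addinfourl(
            io.BytesIO(body), response.headers, response.url, response.code)
        decoded.msg = response.msg
        return decoded

    # Without these aliases the processor would be skipped for https URLs.
    https_request = http_request
    https_response = http_response


canned = CannedGzipHandler()
opener = urllib.request.OpenerDirector()
opener.add_handler(GzipProcessor())
opener.add_handler(canned)

body = opener.open("https://example.com/data").read()
accept_encoding = canned.seen[0].get_header("Accept-encoding")
```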
msg53921 - (view) Author: Jeremy Hylton (jhylton) (Python triager) Date: 2003-07-31 22:15
In principle, I'm in favor of this.  I'd like to take some 
time to review the design and code.
msg53922 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2003-08-12 05:46
Sounds kind of like Apache's filters.  The idea seems fine, but 
perhaps this could instead be worked in with Guido's sio package 
in the CVS sandbox?  Seems to be a similar idea.  Perhaps there 
could be some way of hooking that code into urllib2?
msg53923 - (view) Author: John J Lee (jjlee) Date: 2003-08-12 11:39
Possibly similar to Apache filters, but sio's filters seem to be for 
filtering stream data rather than request / response objects -- 
has no concept of headers, for example. 
msg53924 - (view) Author: John J Lee (jjlee) Date: 2003-08-12 11:58
If anybody wants to see some concrete examples of use of this 
patch, ask me & I'll mail them to you (actually, they use my 
urllib2 extension module rather than the patch, but the 
differences are trivial). 
BTW Jeremy, any guess about when in your Copious Free 
Time you're likely to get to this?  I'm wondering whether to just 
release my code as-is, or wait for your comments. 
msg53925 - (view) Author: Jeremy Hylton (jhylton) (Python triager) Date: 2003-08-12 15:19
Don't wait more than a couple of weeks for me.
msg53926 - (view) Author: John J Lee (jjlee) Date: 2003-10-29 23:10
Just to note a minor change that should happen if this gets 
accepted: processors shouldn't be separate objects, but rather 
just a new interface that handler objects can support.  That 
way, a single object can implement both interfaces.  This 
makes implementing response cache handlers easier, for 
example. 
Not uploading a new patch, since it's a trivial code change. 
It might also be useful to have default_request and 
default_response methods (by analogy with default_open), for 
example for any response caching that's independent of 
protocol scheme. 
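A single object implementing both interfaces might look like the response cache below. This is a sketch assuming Python 3's urllib.request, where one handler object can define both http_open and http_response; ResponseCache, CountingOrigin and make_response are hypothetical names, and the cache ignores expiry, headers and request methods entirely. It does not use default_request or default_response, which are only suggested in this message.

```python
import email
import io
import urllib.request
import urllib.response


def make_response(url, body):
    """Hypothetical helper: wrap bytes in a minimal 200 response object."""
    headers = email.message_from_string("Content-Type: text/plain\n\n")
    resp = urllib.response.addinfourl(io.BytesIO(body), headers, url, 200)
    resp.msg = "OK"
    return resp


class CountingOrigin(urllib.request.BaseHandler):
    """Hypothetical stand-in for the network that counts how often it is hit."""

    def __init__(self):
        self.hits = 0

    def http_open(self, req):
        self.hits += 1
        return make_response(req.get_full_url(), b"hello")


class ResponseCache(urllib.request.BaseHandler):
    """Handler interface serves cached bodies; processor interface fills the cache."""

    handler_order = 100  # run before the origin handler (default order 500)

    def __init__(self):
        self.bodies = {}

    def http_open(self, req):
        body = self.bodies.get(req.get_full_url())
        if body is None:
            return None  # not cached: let the next handler fetch it
        return make_response(req.get_full_url(), body)

    def http_response(self, req, response):
        url = req.get_full_url()
        if url in self.bodies:
            return response  # already served from the cache
        body = response.read()
        self.bodies[url] = body
        return make_response(url, body)  # fresh, unread copy for the caller


origin = CountingOrigin()
opener = urllib.request.OpenerDirector()
opener.add_handler(ResponseCache())
opener.add_handler(origin)

first_body = opener.open("http://example.com/page").read()
second_body = opener.open("http://example.com/page").read()
```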
msg53927 - (view) Author: John J Lee (jjlee) Date: 2003-12-03 18:26
I've uploaded a revised patch, plus some urllib2 tests and a 
doc patch, in patch 852995. 
msg53928 - (view) Author: John J Lee (jjlee) Date: 2003-12-14 15:18
Patch 852995 applied, so closing this one. 
Date User Action Args
2022-04-10 16:09:24 admin set github: 38705
2003-06-24 13:16:38 jjlee create