Issue 759792: Make urllib2 more extensible (patch)

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/38705

classification

Title:	Make urllib2 more extensible (patch)
Type:	enhancement	Stage:
Components:	Library (Lib)	Versions:

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:
Assigned To:		Nosy List:	brett.cannon, jhylton, jjlee
Priority:	normal	Keywords:

Created on 2003-06-24 13:16 by jjlee, last changed 2022-04-10 16:09 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
processors_patch	jjlee, 2003-06-24 13:16

Messages (10)
msg53919 - (view)	Author: John J Lee (jjlee)	Date: 2003-06-24 13:16
Problem with urllib2 as it stands: many things would be nice to implement as a handler rather than by overriding methods (and inevitably duplicating code and increasing fragility), but it's not always possible to do so. For example (all from HTTP), automatically adding Referer headers, handling 200 responses that should have been non-2xx errors if the server were sane, handling cookies, handling HTTP-EQUIV headers as if they were normal HTTP headers, automatically making responses seekable, and following Refresh headers. I've done all these things, but I had to duplicate code to do it, because I don't see how to do it with handlers. I've now rewritten this code by adding a 'processor' scheme to urllib2 (I'm not using 'scheme' in the technical URL sense here!). Processors work quite similarly to handlers, except that 1. They always all get run (rather than just the first to handle a request or response -- unlike handlers). 2. The methods that get called on processors are of the form <proto>_request and <proto>_response, and are called, respectively, immediately before and immediately after the normal OpenerDirector.open machinery. http_request, for example, gets called on all processors before, and pre-processes HTTP requests; http_response post-processes HTTP responses. 3. <proto>_request methods return request objects, and <proto>_response methods return response objects. 4. Even 200 responses get processed. You use it like this: # just pass processors to build_opener as if they were handlers opener = build_opener(FooHandler, BarProcessor, BazHandler) response = opener.open("http://www.example.com/") I've reimplemented all my stuff (the features listed in the first paragraph, above) in terms of this scheme, and it all seems to be working fine (but no unit tests yet). So, the scheme does achieve the extensibility it aims for. The patch I've attached here doesn't include most of those features -- the only new functionality it adds is an HTTPRefererProcessor. If this gets accepted, I intend to submit patches adding new processors for cookie handling etc. later. Two things I'd like to know: 1. will my solution break people's code 2. is there a better way? For 1., I think it shouldn't break code. For 2., the obvious problem with my solution (below) is that handlers are pretty similar to my processors already. The thing is, I can't see how to reimplement these things in terms of handlers. First, I need to see all requests (responses) -- I can do that using handlers by giving them low (high) .handler_order in Python 2.3 and returning None from http_open (http_error_xxx). However, 200 responses never get seen by http_error_xxx, so that doesn't work (and changing that would break old code). Second, I need to actually modify the requests and responses. Sometimes I'd much rather do that by making a new request or response than modifying the old one in-place (redirections, for example) -- and in general, even if I am just modifying in-place, I'd still prefer to explictly return the object than rely on side-effects. Perhaps just adding a couple of hooks to AbstractHTTPHandler might get these jobs done, but I think the increased simplicity of AbstractHTTPHandler.do_open and the various processors makes my approach worthwhile (assuming it actually works & is backwards-compat., of course...). Comments? A few notes: Some headers (Content-Length, Referer, ...) mustn't be copied to requests for a redirected URL. This requires the addition of a new dict to Request. I've added an add_unredirected_headers method, too. The old implementation just sends these headers directly, but that's not possible if you want to use procesors to implement these things. The current response object (httplib.HTTPResponse, wrapped with urllib.addinfourl) doesn't include response code or message (because code is always 200). The patch just assigns .code and .msg attributes (maybe they should be methods, for uniformity with the rest of the response interface). Backwards-compatibility notes: People who override AbstractHTTPHandler.do_open will do non-200 response handling there, which will break processors, but that's a forwards-compat. issue. I don't think the existence of overridden do_open methods in old code should be a problem for backwards-compatibility. Note that, though processors see all responses, the end user still only gets 200 responses returned. ErrorProcessor ensures that by passing non-200 responses on to the existing urllib2 error machinery. John
msg53920 - (view)	Author: John J Lee (jjlee)	Date: 2003-07-08 15:13
Logged In: YES user_id=261020 I just noticed the patch breaks on https. Trivially fixed by adding lines like https_request = http_request to the various processor classes. Also, another use case: gzip Content-encoding.
msg53921 - (view)	Author: Jeremy Hylton (jhylton)	Date: 2003-07-31 22:15
Logged In: YES user_id=31392 In principle, I'm in favor of this. I'd like to take some time to review the design and code.
msg53922 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2003-08-12 05:46
Logged In: YES user_id=357491 Sounds kind of like Apache's filters. The idea seems fine, but perhaps this could instead be worked in with Guido's sio package in the CVS sandbox? Seems to be a similar idea. Perhaps there could some way of hooking that code into urllib2?
msg53923 - (view)	Author: John J Lee (jjlee)	Date: 2003-08-12 11:39
Logged In: YES user_id=261020 Possibly similar to Apache filters, but sio's filters seem to be for filtering stream data rather than request / response objects -- has no concept of headers, for example.
msg53924 - (view)	Author: John J Lee (jjlee)	Date: 2003-08-12 11:58
Logged In: YES user_id=261020 If anybody wants to see some concrete examples of use of this patch, ask me & I'll mail them to you (actually, they use my urllib2 extension module rather than the patch, but the differences are trivial). BTW Jeremy, any guess about when in your Copious Free Time you're likely to get to this? I'm wondering whether to just release my code as-is, or wait for your comments.
msg53925 - (view)	Author: Jeremy Hylton (jhylton)	Date: 2003-08-12 15:19
Logged In: YES user_id=31392 Dont't wait more than a couple of weeks for me.
msg53926 - (view)	Author: John J Lee (jjlee)	Date: 2003-10-29 23:10
Logged In: YES user_id=261020 Just to note a minor change that should happen if this gets accepted: processors shouldn't be separate objects, but rather just a new interface that handler objects can support. That way, a single object can implement both interfaces. This makes implementing response cache handlers easier, for example. Not uploading a new patch, since it's a trivial code change. It might also be useful to have default_request and default_response methods (by analogy with default_open), for example for any response caching that's independent of protocol scheme.
msg53927 - (view)	Author: John J Lee (jjlee)	Date: 2003-12-03 18:26
Logged In: YES user_id=261020 I've uploaded a revised patch, plus some urllib2 tests and a doc patch, in patch 852995.
msg53928 - (view)	Author: John J Lee (jjlee)	Date: 2003-12-14 15:18
Logged In: YES user_id=261020 Patch 852995 applied, so closing this one.

History
Date	User	Action	Args
2022-04-10 16:09:24	admin	set	github: 38705
2003-06-24 13:16:38	jjlee	create