Title: Missing enumeration of HTTPResponse Objects methods of urllib.request.urlopen's http.client.HTTPResponse?
Type: enhancement Stage: resolved
Components: Documentation Versions: Python 3.6, Python 3.4, Python 3.5
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: EvensF, berker.peksag, docs@python, martin.panter, orsenthil, python-dev, r.david.murray
Priority: normal Keywords: patch

Created on 2014-04-15 00:08 by EvensF, last changed 2015-11-26 11:20 by martin.panter. This issue is now closed.

File name Uploaded Description Edit
urllib.request.rst EvensF, 2014-04-16 05:08 Document asked(?)
issue21228.patch EvensF, 2014-09-04 22:25 Patch submitted for this issue review
addinfourl-first.patch martin.panter, 2014-09-11 08:31 put 3 addinfourl methods first + msg attribute review
Messages (20)
msg216254 - (view) Author: Evens Fortuné (EvensF) * Date: 2014-04-15 00:08
In the Python Library documentation, in section "21.6. urllib.request — Extensible library for opening URLs", in the description of the urllib.request.urlopen() function it is writen:


For http and https urls, this function returns a http.client.HTTPResponse object which has the following HTTPResponse Objects methods.

For ftp, file, and data urls and requests explicity handled by legacy […]


The first sentence seemed to imply that something is supposed to be specified if I understand correctly. Is it me missing something ?
msg216258 - (view) Author: Evens Fortuné (EvensF) * Date: 2014-04-15 00:39
I forgot to tell that it is in section "20.5. urllib.request — Extensible library for opening URLs" for Python 3.2.5
msg216262 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2014-04-15 01:27
Hello Evens,
If you can, then please attach a doc to this and we and fix this soon.
msg216268 - (view) Author: Evens Fortuné (EvensF) * Date: 2014-04-15 01:49
I don't quite understand what you are asking me. You need a copy of the document ? 

You can find an example at this link:
msg216972 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-04-21 22:40
That section of the docs is indeed rather confusing.  Probably it just needs to be changed to say "for the methods supported by this object, see HTTPResponse Objects.

I'd like to see the docs reorganized so that the '.. class' declaration in the docs is immediately followed by the class's methods, but that's a much bigger project.
msg216982 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2014-04-22 03:31
I interpreted it more along the lines of “. . . returns a http.client.HTTPResponse object [with] the following [additional] methods.” Indeed, a HTTP urlopen() response I just tried does have info(), geturl() and getcode() methods, and I know the info() method is used in the real world.

Also, it would be good to document that the HTTP response’s “msg” attribute does not actually hold the header, despite the HTTPResponse documentation. Further, the return value of BaseHandler.default_open() is defined to be the same as urlopen(), but when a HTTP error occurs I have found the “msg” attribute is meant to be the HTTP status text phrase (e.g. “Not Found”).

Perhaps it would be good to add something like these two points wherever they belong:

* The “msg” attribute returned by urlopen() does not hold the HTTP header, despite the “HTTPResponse” documentation
* The “msg” attribute should be set to the HTTP status text phrase (HTTPResponse.reason)
msg226385 - (view) Author: Evens Fortuné (EvensF) * Date: 2014-09-04 22:25

From my limited experience reporting documentation issues, I see that it's better to submit a patch than to only report an issue. So, I've tried to look into the source code to figure out what was going on. I have attached a patch that I'm submitting to you for review hoping I doing everything right.

What was reported as ambiguous in this issue is the description of the return value of the function urllib.request.urlopen() for http and https URLs. It was mentionned that it should be an http.client.HTTPResponse object but it implied that something may have been different about this object. 

To understand why I'm may now be able to assert what's being said in that patch, follow me in the source code. It's based on revision c499cc2c4a06. If you don't care about all the walkthrough you can skip to step 9.

   1. We want to describe the return value of the urllib.request.urlopen() for http and https URLs
   2. The urlopen() function is defined in Lib/urllib/ starting on line 138. Its return value is the return value function of the method (line 153)
      * This opener object is defined in one of these locations:
         * On line 150 as the return value of the module function build_opener() (this return value is cached in the _opener module variable)
         * On line 152 as the value cached in the _opener module variable
         * On line 148 still as the return value of the module function build_opener() in the case if you want to access an HTTPS URL and you provide a cafile, capath or cadefault argument
      * So either way, the opener object come from the build_opener directly or indirectly.
   3. The build_opener() function is defined starting on line 505. Its return value (line 539) is an instance of the OpenerDirector class (line 514). The OpenerDirector class is defined starting on line 363.
      a. Before returning its return value, after some checks (lines 522-530, 535-536), build_opener() calls the OpenerDirector().add_handler() with an instance of some of the classes defined in the default_classes list (line 515-518). What matters to us is the HTTPHandler class and the HTTPSHandler class (line 520).
      b. The OpenerDirector().add_handler() method (line 375) takes the HTTPHandler class (line 1196) and:
         * Insert the HTTPHandler.http_open() method in the list stored as the value of OpenerDirector().handle_open['http'].
         * Insert the HTTPHandler.http_request() method in the list stored as the value of OpenerDirector().process_request['http'].
      c. For HTTPSHandler (line 1203) is the same thing but :
         * HTTPSHandler.https_open() for OpenerDirector().handle_open['https']
         * HTTPSHandler.https_request() for OpenerDirector().process_request['https']
   4. I remind you that we are looking for the return value of the method open() of an instance of the OpenerDirector class (see point number 2). This method is defined starting on line 437.
   5. The method's return value is the response variable (line 463)
   6. This variable is defined on lines 461 and 455.
      a. The loop on lines 458-461 tries to find in his handlers (the OpenerDirector().process_response dictionary) a response processor (a XXX.http_response() method) which isn't defined in HTTPHandler or HTTPSHandler. (a http_response() method is defined in HTTPErrorProcessor [line 564] and in HTTPCookieProcessor [line 1231] but in each of these cases, these classes don't modify the response value)
      b. So response variable's value is the return value of OpenerDirector()._open(req, data) on line 455. 
         * The req argument is a Request instance (line 440) or something that has the same interface, I guess (line 442). The Request class is defined on line 253.
         * The data argument is included in the constructor of the Request instance (line 440 and then on line 262) or added to the object provided (line 444). Afterwards, it won't be used directly (OpenerDirector()._open() receives it as an argument but won't use it in its body)
   7. OpenerDirector()._open() is defined on line 465. It will call OpenerDirector()._call_chain() up to three times depending on whether a result has been found (lines 468-469, 474-475). 
      * OpenerDirector()._call_chain() is defined on line 426. All it does is calling the handlers registered in the dictionnary provided (the chain argument) until one returns something else than None and returns it.
      * In our case (retrieving http and http resources):
         a. The first call (line 466) will return None since HTTPHandler or HTTPSHandler don't have a default_open() method (in fact, no handler defined in this file has a default_open() method)
         b. The second call will work since HTTPHandler.http_open() (line 1198) and HTTPSHandler.https_open() (line 1212) exists. Their return values will be enventually what we are looking for.
   8. HTTPHandler.http_open() and HTTPSHandler.https_open() returns the return value of do_open() method defined (on line 1134) in their mutual superclass AbstractHTTPHandler (line 1086). They will call it with http.client.HTTPConnection and req in the case of HTTPHandler and http.client.HTTPSConnection and req in the case of HTTPSHandler with a few other arguments.
   9. OpenerDirector().do_open() creates a http.client.HTTPSConnection object (line 1144) and calls its request() method (line 1173) and if it works, calls its getreponse() method (line 1178). This return value is the HTTPResponse object we are looking for.
   10. Finally we get our answer: 
      * On line 1186, an url attribute is added to this HTTPResponse object
      * On line 1192, the msg attribute is replaced by the reason attribute

I hope this is what was needed to close this issue. Otherwise, just tell me what is missing.

Oh and there seems that there are be many things that could be refactored. Can I do it and open issues about them ?
msg226402 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2014-09-05 06:56
With this patch, there is no longer any implication that the returned object implements the “addinfourl” interface. Perhaps that should be added back. Or maybe add it to the HTTPResponse class documentation itself? There is a comment that says the methods are there “for compatibility with old-style urllib responses”, although it seems to me they also make the class compatible with Python 3’s new “urllib”:

It is good to document the “msg” attribute and its inconsistency, since I have found this is required to implement your own BaseHandler.default_open(). However I’m not so sure if it is necessary to document the “url” attribute. Why not encourage using geturl() instead, since it is already documented, at least for non-HTTP responses?

I also saw a comment against the “msg” attribute which mentions deprecating something, but it is not clear what:
msg226711 - (view) Author: Evens Fortuné (EvensF) * Date: 2014-09-10 19:51
Well, there wasn't any indication before that the returned object was implementing the "addinfourl" interface. So I don't think we have lost anything. In what situation this interface is useful ? The following comment (that you had highlighted in your comment) gives the impression that theses methods are there only to provide compatibility with clients using old-styles responses.

That would imply that newer clients would usually not use these methods. If you want to document this, I think the "addinfourl" interface should then be better described somewhere else where it would include the fact that the HTTPResponse class implements it.

Anyway, I don't see the advantage of using a getter method (like geturl()) instead of accessing directly the attribute. For me, this is less pythonic. If you ever have to attach a behaviour to the access of this attribute, a property could then be defined.
msg226716 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-09-10 21:01
Indeed, geturl is deprecated.  I'm not sure where you see it documented, I don't see it.
msg226728 - (view) Author: Evens Fortuné (EvensF) * Date: 2014-09-10 22:03
To be honest, it may be inspired by what's written a few lines lower, for ftp, files dans data urls even though the return object is not the same as the http(s) urls
msg226729 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-09-10 22:15
Ah, I was looking at the http docs.  I wonder if we just missed the urllib docs when we made the changes.  Either that, or I'm misremembering things.
msg226736 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2014-09-11 00:01
By removing the “addinfourl” methods for HTTP responses, you are making it unnecessarily hard to handle header fields and metadata from the response. I do not know of any other documented way of getting the eventual redirect target, other than geturl(). And code to parse the Content-Type, which I understand is also returned for file: URLs, would start to get ugly (untested):

info = getattr(response, "info", None)
if info:
    type = info().get_content_type()  # Easy!
    import cgi
    type = cgi.parse_header(response.getheader("Content-Type))[0]

Since a HTTP status code only seems to be returned for HTTP responses, deprecating or removing getcode() in favour of “HTTPResponse.status” might be okay, but I think info() and geturl() should stay.

Maybe a “url” attribute is more Pythonic, but as far as I am aware that has never been documented for Python 2 or 3 for any URL type. I would not expect much existing code to be using it, and deprecating geturl() seems like a needless annoyance.
msg226741 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-09-11 04:24
I think I'm thinking of the Request API, and not the Response API.  So ignore my comments about deprecation, I'm not sure what the status is.
msg226742 - (view) Author: Evens Fortuné (EvensF) * Date: 2014-09-11 05:20
Look, the subject of this issue is to clarify the methods of the urllib.request.urlopen()'s return value for http(s) URLs. Nobody seemed to work on this for 4 months. That's why I tried to submit a patch after looking into the code to try to do my part to help.

If you think that my patch is not clear enough and would need to be more precise or maybe even that it is plain wrong, please submit a new patch with your ideas so that we can close eventually this issue.

I really didn't think that it would get this complicated…
msg226755 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2014-09-11 08:31
Fair enough, challenge accepted! Here is my attempt. I have explicitly made the info(), geturl() and getcode() methods available for all cases, and used Evens’s wording for the modified “msg” attribute, but dropped mentioning the “url” attribute.
msg226770 - (view) Author: Evens Fortuné (EvensF) * Date: 2014-09-11 12:45
I'm satisfied with this new patch.
msg229635 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2014-10-18 07:46
Is there anything else that needs doing to get this committed and resolved? My patch is just a trivial reordering of Evens’s, with about one sentence deleted.
msg234944 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-01-29 04:32
Related: Issue 12707, about deprecating some methods in favour of attributes
msg255407 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-11-26 11:07
New changeset fa3c9faabfb0 by Martin Panter in branch '3.4':
Issues #22989, #21228: Document HTTP response object for urlopen()

New changeset b55c006b79bc by Martin Panter in branch '3.5':
Issue #22989, #21228: Merge urlopen() doc from 3.4 into 3.5

New changeset c6930661599b by Martin Panter in branch 'default':
Issue #22989, #21228: Merge urlopen() doc from 3.5
Date User Action Args
2015-11-26 11:20:40martin.pantersetversions: + Python 3.6
2015-11-26 11:12:49martin.pantersetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2015-11-26 11:07:15python-devsetnosy: + python-dev
messages: + msg255407
2015-01-29 04:32:18martin.pantersetmessages: + msg234944
2014-12-05 23:32:59berker.peksagsetnosy: + berker.peksag
stage: patch review

versions: - Python 3.2, Python 3.3
2014-10-18 07:46:29martin.pantersetmessages: + msg229635
2014-09-11 12:45:24EvensFsetmessages: + msg226770
2014-09-11 08:32:00martin.pantersetfiles: + addinfourl-first.patch

messages: + msg226755
2014-09-11 05:20:47EvensFsetmessages: + msg226742
2014-09-11 04:24:12r.david.murraysetmessages: + msg226741
2014-09-11 00:01:20martin.pantersetmessages: + msg226736
2014-09-10 22:15:00r.david.murraysetmessages: + msg226729
2014-09-10 22:03:37EvensFsetmessages: + msg226728
2014-09-10 21:01:30r.david.murraysetmessages: + msg226716
2014-09-10 19:51:11EvensFsetmessages: + msg226711
2014-09-05 06:56:38martin.pantersetmessages: + msg226402
2014-09-04 22:25:42EvensFsetfiles: + issue21228.patch
keywords: + patch
messages: + msg226385
2014-04-22 03:31:54martin.pantersetnosy: + martin.panter
messages: + msg216982
2014-04-21 22:40:07r.david.murraysetnosy: + r.david.murray
messages: + msg216972
2014-04-16 05:08:55EvensFsetfiles: + urllib.request.rst
2014-04-15 01:49:00EvensFsetmessages: + msg216268
2014-04-15 01:27:22orsenthilsetnosy: + orsenthil
messages: + msg216262
2014-04-15 00:39:01EvensFsetmessages: + msg216258
2014-04-15 00:08:03EvensFcreate