classification
Title: IncompleteRead error with urllib2 or urllib.request -- fine with urllib, wget, or curl
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.6, Python 3.3, Python 3.4, Python 2.7
process
Status: closed Resolution: third party
Dependencies: Superseder:
Assigned To: Nosy List: Alex Quinn, apocalyptech, laurento.frittella, martin.panter, msornay, orsenthil, pitrou, raylu, serhiy.storchaka
Priority: normal Keywords:

Created on 2012-02-17 17:36 by Alex Quinn, last changed 2017-02-12 21:40 by apocalyptech. This issue is now closed.

Messages (10)
msg153581 - Author: Alex Quinn (Alex Quinn) Date: 2012-02-17 17:36
When accessing this URL, both urllib2 (Py2) and urllib.request (Py3) raise an IncompleteRead error.
http://info.kingcounty.gov/health/ehs/foodsafety/inspections/XmlRest.aspx?Zip_Code=98199

Previous discussions about similar errors suggest that this may be due to a problem with the server and chunked data transfer.  (See links below.)  I can't understand what that means.  However, this works fine with urllib (Py2), curl, wget, and all regular web browsers I've tried it with.  Thus, I would have expected urllib2 (Py2) and urllib.request (Py3) to cope with it similarly.

Versions I've tested with:
- Fails with urllib2 + Python 2.5.4, 2.6.1, 2.7.2  (Error messages vary.)
- Fails with urllib.request + Python 3.1.2, 3.2.2
- Succeeds with urllib + Python 2.5.4, 2.6.1, 2.7.2
- Succeeds with wget 1.11.1
- Succeeds with curl 7.15.5

___________________________________________________________
TEST CASES

# FAILS - Python 2.7, 2.6, 2.5
import urllib2
url = "http://info.kingcounty.gov/health/ehs/foodsafety/inspections/XmlRest.aspx?Zip_Code=98199"
xml_str = urllib2.urlopen(url).read() # Raises httplib.IncompleteRead

# FAILS - Python 3.2, 3.1
import urllib.request
url = "http://info.kingcounty.gov/health/ehs/foodsafety/inspections/XmlRest.aspx?Zip_Code=98199"
xml_str = urllib.request.urlopen(url).read() # Raises http.client.IncompleteRead

# SUCCEEDS - Python 2.7, 2.6, 2.5
import urllib
import xml.dom.minidom  # needed for parseString below
url = "http://info.kingcounty.gov/health/ehs/foodsafety/inspections/XmlRest.aspx?Zip_Code=98199"
xml_str = urllib.urlopen(url).read()
dom = xml.dom.minidom.parseString(xml_str) # Verify XML is complete
print("urllib:  %d bytes received and parsed successfully"%len(xml_str))

# SUCCEEDS - wget
wget -O- "http://info.kingcounty.gov/health/ehs/foodsafety/inspections/XmlRest.aspx?Zip_Code=98199" | wc

# SUCCEEDS - curl - prints an error, but returns the full data anyway
curl "http://info.kingcounty.gov/health/ehs/foodsafety/inspections/XmlRest.aspx?Zip_Code=98199" | wc

___________________________________________________________
RELATED DISCUSSIONS

http://www.gossamer-threads.com/lists/python/python/847985
http://bugs.python.org/issue11463  (closed)
http://bugs.python.org/issue6785   (closed)
http://bugs.python.org/issue6312   (closed)
msg171263 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-09-25 13:02
The example URL doesn't seem to work anymore. Do you have another example to test with?
msg191087 - Author: raylu (raylu) Date: 2013-06-13 19:20
The URL works for me.

While wget does download it successfully, I get the following output:

$ wget http://info.kingcounty.gov/health/ehs/foodsafety/inspections/XmlRest.aspx\?Zip_Code\=98199
--2013-06-13 12:15:21--  http://info.kingcounty.gov/health/ehs/foodsafety/inspections/XmlRest.aspx?Zip_Code=98199
Resolving info.kingcounty.gov (info.kingcounty.gov)... 146.129.240.75
Connecting to info.kingcounty.gov (info.kingcounty.gov)|146.129.240.75|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/xml]
Saving to: ‘XmlRest.aspx?Zip_Code=98199’

    [      <=>                                               ] 515,315      448KB/s   in 1.1s   

2013-06-13 12:15:23 (448 KB/s) - Read error at byte 515315 (Success).Retrying.

--2013-06-13 12:15:24--  (try: 2)  http://info.kingcounty.gov/health/ehs/foodsafety/inspections/XmlRest.aspx?Zip_Code=98199
Connecting to info.kingcounty.gov (info.kingcounty.gov)|146.129.240.75|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/xml]
Saving to: ‘XmlRest.aspx?Zip_Code=98199’

    [ <=>                                                    ] 0           --.-K/s   in 0s      


Cannot write to ‘XmlRest.aspx?Zip_Code=98199’ (Success).

Similarly, curl gives

$ curl http://info.kingcounty.gov/health/ehs/foodsafety/inspections/XmlRest.aspx\?Zip_Code\=98199 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  503k    0  503k    0     0   222k      0 --:--:--  0:00:02 --:--:--  229k
curl: (18) transfer closed with outstanding read data remaining

$ wget --version
GNU Wget 1.14 built on linux-gnu.

$ curl --version
curl 7.30.0 (x86_64-pc-linux-gnu) libcurl/7.30.0 OpenSSL/1.0.1e zlib/1.2.8 libidn/1.25 libssh2/1.4.2 librtmp/2.3
msg208169 - Author: Laurento Frittella (laurento.frittella) Date: 2014-01-15 15:24
I had the same problem using urllib2 and the following trick worked for me

import httplib
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'

Source: http://stackoverflow.com/a/20645845
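The snippet above targets Python 2's httplib. A rough Python 3 equivalent is sketched below, on the assumption that http.client.HTTPConnection still carries the same private _http_vsn and _http_vsn_str class attributes. They are implementation details rather than public API, and the force_http_10 helper name is hypothetical. The override presumably helps because a server is not supposed to send a chunked body to an HTTP/1.0 client.

```python
import contextlib
import http.client


@contextlib.contextmanager
def force_http_10():
    """Temporarily make http.client speak HTTP/1.0 (hypothetical helper).

    Assumption: relies on the private class attributes _http_vsn and
    _http_vsn_str, which are not part of the documented API.
    """
    cls = http.client.HTTPConnection
    saved = (cls._http_vsn, cls._http_vsn_str)
    cls._http_vsn, cls._http_vsn_str = 10, "HTTP/1.0"
    try:
        yield
    finally:
        # Restore the previous protocol version even if the request failed.
        cls._http_vsn, cls._http_vsn_str = saved
```

Typical use would be "with force_http_10(): data = urllib.request.urlopen(url).read()". The change is process-wide while the block is active, so it is not safe when other threads make requests at the same time.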
msg210813 - Author: Martin Panter (martin.panter) * (Python committer) Date: 2014-02-10 09:07
The server in question sends a chunked response but appears to close the connection when it is done without first sending the terminating zero-length chunk, which, as I understand the HTTP protocol, it is required to send.

My Firefox shows the XML without any indication of error. But then if I manually truncate a chunked response to Firefox it doesn’t indicate an error either, which I would probably want to know about.
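The server behaviour described here can be imitated locally without touching the original URL. The following is a minimal sketch, assuming a throwaway socket server on 127.0.0.1 that sends one chunk and then closes the connection without the final "0\r\n\r\n". The function names and the "hello" payload are invented for illustration.

```python
import http.client
import socket
import threading


def _serve_truncated_chunked(listener):
    """Answer one request with a chunked body that lacks the zero-length chunk."""
    conn, _ = listener.accept()
    with conn:
        request = b""
        # Read the request headers so the client is not cut off mid-send.
        while b"\r\n\r\n" not in request:
            data = conn.recv(4096)
            if not data:
                break
            request += data
        conn.sendall(
            b"HTTP/1.1 200 OK\r\n"
            b"Content-Type: text/plain\r\n"
            b"Transfer-Encoding: chunked\r\n"
            b"\r\n"
            b"5\r\nhello\r\n"  # one complete 5-byte chunk
        )
    # Leaving the "with" block closes the socket; "0\r\n\r\n" is never sent.


def reproduce_incomplete_read():
    """Return the partial body carried by IncompleteRead, or None if none was raised."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=_serve_truncated_chunked, args=(listener,))
    server.start()
    client = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        client.request("GET", "/")
        response = client.getresponse()
        try:
            response.read()
        except http.client.IncompleteRead as exc:
            return exc.partial
        return None
    finally:
        client.close()
        server.join()
        listener.close()
```

Calling reproduce_incomplete_read() should hand back the bytes that arrived before the connection was closed, which shows that the data itself was received and only the end-of-body marker was missing.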
msg231406 - Author: Martin Panter (martin.panter) * (Python committer) Date: 2014-11-20 02:37
I suggest this is the same situation as Issue 6785, and is not a bug in Python. However it might be reasonable to allow forcing a HTTP client connection to version 1.0, which could be used as a workaround.
msg231434 - Author: Laurento Frittella (laurento.frittella) Date: 2014-11-20 14:00
Even if the HTTP/1.0 workaround works, it can cause weird issues, especially when used in anything larger than a small script. I tried to describe one such issue in this report[1] for the "requests" Python library.

[1] https://github.com/kennethreitz/requests/issues/2341
msg255396 - Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-11-26 01:12
Closing this as being a bug in the web server, rather than Python.

If someone wants to add a way to force a HTTP 1.0 response, or a way to get all valid data before raising the exception, I suggest opening a new report.
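Until such a feature exists, the data received before the exception can already be recovered from the exception object: http.client.IncompleteRead exposes it as the partial attribute. A minimal sketch, with a hypothetical helper name, that returns whatever arrived plus a flag saying whether the read completed cleanly:

```python
import http.client


def read_allowing_truncation(response):
    """Return (body, complete) for an already opened HTTP response.

    'complete' is False when the server ended the body improperly and
    only the data received so far could be returned.
    """
    try:
        return response.read(), True
    except http.client.IncompleteRead as exc:
        # exc.partial holds the bytes read before the failure.
        return exc.partial, False
```

For example, "body, complete = read_allowing_truncation(urllib.request.urlopen(url))". The caller still has to decide whether a truncated body is acceptable, since a False flag may also mean that data is genuinely missing.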
msg287644 - Author: CJ Kucera (apocalyptech) * Date: 2017-02-12 18:42
I've just encountered this problem on Python 3.6, on a different URL.  The difference being that it's not encountered with EVERY page load, though I'd say it happens with at least half:

import urllib.request
html = urllib.request.urlopen('http://www.basicinstructions.net/').read()
print('Succeeded!')

I realize that the root problem here may be an HTTP server doing something improper, but I've got no way of fixing someone else's webserver.  It'd be really nice if there were a reasonable way of handling this in Python itself.  As mentioned in the original report, other methods of retrieving this URL work without fail (curl/wget/etc).  As it is, the only way for me to be sure of retrieving the entire page contents is by looping until I don't get an IncompleteRead, which is hardly ideal.
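The looping approach mentioned here could look roughly like the sketch below. The function name, the attempts limit, and the opener parameter are assumptions added for illustration. The limit keeps the loop from running forever against a server that always truncates.

```python
import http.client
import urllib.request


def read_with_retries(url, attempts=5, opener=urllib.request.urlopen):
    """Fetch url, retrying while the read fails with IncompleteRead.

    'opener' is any callable returning a context manager with a read()
    method; it defaults to urllib.request.urlopen.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            with opener(url) as response:
                return response.read()
        except http.client.IncompleteRead as exc:
            last_error = exc  # remember the failure and try again
    # Every attempt was truncated: surface the last error to the caller.
    raise last_error
```

With the URL from this message that would be read_with_retries('http://www.basicinstructions.net/'). Retrying only helps when the failure is intermittent, as described above.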
msg287653 - Author: CJ Kucera (apocalyptech) * Date: 2017-02-12 21:40
Ah, well, actually I suppose I'll rescind that a bit - other pages about this bug around the internet had been claiming that the 'requests' module uses urllib in the backend and was subject to this bug as well, but after experimenting myself, it seems like if that IS the case, they're working around it somehow, because using requests makes this succeed 100% of the time.  I probably should've tried that first!

So anyway, there's a reasonable workaround, at least.  Sorry for the bugspam!
History
Date  User  Action  Args
2017-02-12 21:40:30  apocalyptech  set  messages: + msg287653
2017-02-12 18:42:21  apocalyptech  set  nosy: + apocalyptech; messages: + msg287644; versions: + Python 3.6
2015-11-26 01:12:52  martin.panter  set  status: open -> closed; resolution: third party; messages: + msg255396
2015-02-13 01:25:29  demian.brecht  set  nosy: - demian.brecht
2014-11-20 14:00:28  laurento.frittella  set  messages: + msg231434
2014-11-20 02:37:37  martin.panter  set  messages: + msg231406
2014-07-24 00:32:11  demian.brecht  set  nosy: + demian.brecht
2014-02-10 09:07:16  martin.panter  set  nosy: + martin.panter; messages: + msg210813
2014-01-15 15:28:34  serhiy.storchaka  set  nosy: + serhiy.storchaka; type: behavior; versions: + Python 3.3, Python 3.4, - Python 2.6, Python 3.1, Python 3.2
2014-01-15 15:24:43  laurento.frittella  set  nosy: + laurento.frittella; messages: + msg208169
2013-09-29 13:57:43  msornay  set  nosy: + msornay
2013-06-13 19:20:43  raylu  set  nosy: + raylu; messages: + msg191087
2012-09-25 13:02:00  pitrou  set  nosy: + pitrou; messages: + msg171263
2012-02-18 01:27:56  pitrou  set  nosy: + orsenthil
2012-02-17 17:36:16  Alex Quinn  create