classification
Title: urllib fail to read URL contents, urllib2 crash Python
Type: crash
Stage:
Components: None
Versions: Python 2.5
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: cosoleto, ggenellina, gvanrossum, jjlee, josm, orsenthil, torriem
Priority: normal
Keywords:

Created on 2007-09-26 07:55 by cosoleto, last changed 2008-12-02 14:13 by jjlee. This issue is now closed.

Files
File name Uploaded Description
httplib.diff josm, 2007-09-28 03:18
httplib.py.diff josm, 2007-12-01 12:16
Messages (13)
msg56143 - (view) Author: Francesco Cosoleto (cosoleto) Date: 2007-09-26 07:55
urllib fail to read URL contents, urllib2 crash Python

Python version:
-------------------------
Python 2.5.1 (r251:54863, May 18 2007, 16:56:43) 
[GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)]

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit 
(Intel)] on
win32

Python 2.4.4 (#2, Aug 16 2007, 00:34:54) 
[GCC 4.1.3 20070812 (prerelease) (Debian 4.1.2-15)] on linux2

-------------------------

Working with GNU wget:
-------------------------
$ wget -S http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud
--08:42:21--  http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud
           => `Thomas-Robert_Bugeaud'
Resolving www.recherche.fr... 88.191.11.214
Connecting to www.recherche.fr|88.191.11.214:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Date: Wed, 26 Sep 2007 06:42:53 GMT
  Server: Apache/2.2.3 (Debian) PHP/5.2.3-0.dotdeb.1 with Suhosin-Patch
  X-Powered-By: PHP/5.2.3-0.dotdeb.1
  Keep-Alive: timeout=15, max=100
  Connection: Keep-Alive
  Transfer-Encoding: chunked
  Content-Type: text/html; charset=UTF-8
Length: unspecified [text/html]

    [                             <=>                         ] 267,080       --.--K/s

08:42:42 (14.11 KB/s) - "Thomas-Robert_Bugeaud" saved [267080]
-------------------------

Python:
-------------------------
>>> import urllib
>>> a = urllib.urlopen('http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud')
>>> c = a.read(1024*1024*2)
>>> len(c)       
1035220

>>> c[63000:64000]
'he.fr en page d\'accueil</a><br>\n      <span>Partenaires :</span> <a 
href="http://www.cartes.fr/" target="_blank">Cartes\n      
postales</a>&nbsp; <a href="http://www.deux.fr/script/" 
target="_blank">Rencontres\n      gratuites\n      </a>&nbsp; <a 
href="http://www.new.fr/" target="_blank">Noms\n      de domaine 
gratuits</a>&nbsp; <a href="http://www.netencyclo.com/" 
target="_blank">Encyclopedia</a>&nbsp;</p>\n      <p style="text-
align:center;"><a href="http://www.futureobject.com/" 
target="_blank"><img src="http://www.recherche.fr/images/logo_fo.gif" 
border="0" height="25" width="96"></a></p>\n\n  </p>\n </div>\n 
</div><!-- site -->\n</body>\n</html>\n\r\n\x00\x00\x00\x00\x00\x00\x00
\x00\x00[...omission...]\x00\x00\x00\x00'
-------------------------

As above, but with urllib2 module instead of urllib:

-------------------------
  File "/usr/lib/python2.5/socket.py", line 291, in read
    data = self._sock.recv(recv_size)
  File "/usr/lib/python2.5/httplib.py", line 509, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.5/httplib.py", line 548, in _read_chunked
    chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: '\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00[...omission...]\x00\x00\x00\x00\x00\x00\x00
\
-------------------------

As above, but with Python 2.4:
-------------------------
>>> import urllib2
>>> a = urllib2.urlopen('http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud')

>>> 
>>> c = a.read(1024*1024*2)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/socket.py", line 295, in read
    data = self._sock.recv(recv_size)
  File "/usr/lib/python2.4/httplib.py", line 460, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.4/httplib.py", line 499, in _read_chunked
    chunk_left = int(line, 16)
ValueError: invalid literal for int(): 
-------------------------

Regards,
Francesco Cosoleto
msg56144 - (view) Author: Gabriel Genellina (ggenellina) Date: 2007-09-26 14:07
This is a server bug. Internet Explorer 6 can't show the page either. 
The response is malformed; it uses chunked transfer, and RFC2616 
section 3.6.1 says "The chunk-size field is a string of hex digits 
indicating the size of the chunk. The chunked encoding is ended by any 
chunk whose size is zero[...]"

After the (first and only) chunk of around 63K, should come a 0-length 
chunk: a line with one or more digits "0" followed by CR+LF. But the 
server is not sending that last chunk, instead it sends lots of nul 
bytes, until eventually a CR,LF sequence arrives.
Neither IE nor Python can handle that (IE keeps requesting the page 
again and again). wget is apparently a lot more relaxed and decides 
that the first chunk is good enough. Perhaps urllib/urllib2 could 
handle the error and raise a more meaningful exception in this case, 
but just ignoring the error doesn't appear to be the right thing IMHO.
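
To make the failure concrete, here is a minimal sketch of a strict chunked decoder, written for Python 3 and fed from an in-memory stream. The function name `read_chunked_strict` and the sample byte strings are made up for illustration; this is not the httplib code, only the same idea: each chunk starts with a hex size line, and a size of zero ends the body. A size line made of NUL bytes cannot be parsed as hex, which is the ValueError seen in the report.

```python
import io


def read_chunked_strict(stream):
    """Decode a chunked body; raise ValueError on a malformed size line."""
    body = []
    while True:
        line = stream.readline()
        # Drop any chunk extension after ';' before parsing the hex size.
        size = int(line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            break  # the zero-size chunk terminates the body
        body.append(stream.read(size))
        stream.read(2)  # discard the CRLF that follows the chunk data
    return b"".join(body)


# A well-formed body: one 5-byte chunk, then the terminating zero chunk.
well_formed = io.BytesIO(b"5\r\nhello\r\n0\r\n\r\n")
print(read_chunked_strict(well_formed))

# A body like the one described above: NUL bytes where the last chunk
# size should be (illustrative data, not a capture from the server).
malformed = io.BytesIO(b"5\r\nhello\r\n" + b"\x00" * 8 + b"\r\n")
try:
    read_chunked_strict(malformed)
except ValueError as exc:
    print("malformed chunk size:", exc)
```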
msg56147 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-09-26 16:55
Maybe the French internet is incompatible with the rest of the world? :-)
msg56151 - (view) Author: John Smith (josm) Date: 2007-09-27 03:21
Firefox 2.0.0.7 and Safari 2.0.4 can show this page.

In my opinion, Python urllib should be more practical and
provide a way to read this kind of page.

"In general, an implementation must be conservative
in its sending behavior, and liberal in its receiving behavior."
[RFC 791 3.2]
msg56162 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-09-27 14:31
> In my opinion, Python urllib "should" be more practical and
> provide a way to read this kind of page.  [quotes mine]

Totally agreed.  Someone "should" submit a patch.
msg56183 - (view) Author: John Smith (josm) Date: 2007-09-28 03:18
Attached a patch for this problem.
This one just ignores the buggy chunk-size and closes the connection.
As ggenellina said earlier, this might not be a good way
to fix this, but I could not come up with a better solution.
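
The attached diff is not reproduced in this thread, so the following is only a sketch of the general approach described above, not the patch itself. It is written for Python 3, and `read_chunked_lenient` is a hypothetical name: when the chunk-size line cannot be parsed, the decoder stops and returns the data read so far instead of raising.

```python
import io


def read_chunked_lenient(stream):
    """Decode a chunked body, treating an unparsable size line as the end."""
    body = []
    while True:
        line = stream.readline()
        try:
            size = int(line.split(b";", 1)[0].strip(), 16)
        except ValueError:
            # Buggy chunk size: give up on the rest of the stream and
            # return what was decoded before it.
            break
        if size == 0:
            break  # normal termination
        body.append(stream.read(size))
        stream.read(2)  # discard the CRLF that follows the chunk data
    return b"".join(body)


# Illustrative data: one good chunk followed by NUL bytes.
malformed = io.BytesIO(b"5\r\nhello\r\n" + b"\x00" * 8 + b"\r\n")
print(read_chunked_lenient(malformed))
```

The trade-off is the one raised earlier in the thread: the caller can no longer tell a truncated body from a complete one.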
msg56506 - (view) Author: Michael Torrie (torriem) Date: 2007-10-16 19:11
I had a situation where I was talking to a Sharp MFD printer.  Their web
server apparently does not serve chunked data properly.  However the
patch posted here put it in an infinite loop.

Somewhere around line 525 in the Python 2.4 version of httplib.py, I had
to make it look like this:

        while True:
            line = self.fp.readline()
            if line == '\r\n' or not line:
                break

I added "or not line" to the if statement.  The blank line in the
chunked http was confusing the _last_chunk thing, but even when it was
set to zero, since there was no more data, this loop to eat up crlfs was
never ending.

Is this really a proper fix?  

I'm in favor of changing urllib2 to be less strict because, despite the
RFCs, we're stuck talking to all kinds of web servers (embedded ones in
particular) that simply can't easily be changed.
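
The reason the extra "or not line" test matters can be shown in isolation. This is a small Python 3 sketch with a hypothetical helper name (`skip_trailer`) and in-memory streams: readline() returns an empty string at end of file, which never equals '\r\n', so without the extra test the loop would spin forever once the server stops sending.

```python
import io


def skip_trailer(fp):
    """Consume lines up to the blank line after the last chunk, or to EOF."""
    skipped = 0
    while True:
        line = fp.readline()
        # readline() returns b"" at EOF, so "not line" ends the loop
        # even when the terminating blank line never arrives.
        if line == b"\r\n" or not line:
            break
        skipped += 1
    return skipped


# Well-behaved server: trailer line, then the blank line.
print(skip_trailer(io.BytesIO(b"X-Trailer: 1\r\n\r\n")))

# Server that just stops sending: the loop still terminates at EOF.
print(skip_trailer(io.BytesIO(b"X-Trailer: 1\r\n")))
```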
msg58042 - (view) Author: John Smith (josm) Date: 2007-12-01 12:16
Included torriem's fix.

IMHO, there is no clear solution for this,
because it is caused by an HTTP server "bug",
and a bug is something that you can't predict accurately...
msg58999 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2007-12-26 16:46
Irrespective of the patch, this issue is reproducible with the code in the
trunk for Python 2.6. Should we close this then?

Python 2.6a0 (trunk:59600M, Dec 25 2007, 13:54:34)
[GCC 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> import urllib
>>> url = "http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud"
>>> a = urllib.urlopen(url)
>>> b = urllib2.urlopen(url)
>>> c = a.read(1024 * 1024 * 2)
>>> c[63000:64000]
'UA-321207-2";\nurchinTracker();\n</script>\n <div id="introFin">\n  <p>\nLe
contenu de cette page (Thomas-Robert Bugeaud) est un minuscule extrait de
l\'encyclopi\xc3\xa9die gratuite en ligne <a
href="http://fr.wikipedia.org">WIKIPEDIA</a>\nle webmaster de ce site n\'est
pas l\'auteur de cet article (Thomas-Robert Bugeaud). Vous pouvez retrouver
l\'original de cet article (Thomas-Robert Bugeaud) &agrave; <a
href="http://fr.wikipedia.org/wiki/Thomas-Robert_Bugeaud">cette adresse</a> et
la liste des auteurs <a
href="http://fr.wikipedia.org/w/index.php?title=Thomas-Robert_Bugeaud&amp;action=history">ici</a>\nVous
pouvez <a
href="http://fr.wikipedia.org/w/index.php?title=Thomas-Robert_Bugeaud&amp;action=edit">modifier
ou compl\xc3\xa9ter</a> cet article mais \xc3\xa9galement <a
href="http://fr.wikipedia.org/w/index.php?title=Discuter:Thomas-Robert_Bugeaud&amp;action=edit">discuter</a>
de son contenu (Thomas-Robert Bugeaud) sur le site de <a
href="http://fr.wikipedia.org">WIKIPEDIA France</a> - Contenu (Thomas-Robert B'
>>> c = b.read(1024 * 1024 * 2)
>>> c[63000:64000]
'acct = "UA-321207-2";\nurchinTracker();\n</script>\n <div id="introFin">\n
<p>\nLe contenu de cette page (Thomas-Robert Bugeaud) est un minuscule extrait
de l\'encyclopi\xc3\xa9die gratuite en ligne <a
href="http://fr.wikipedia.org">WIKIPEDIA</a>\nle webmaster de ce site n\'est
pas l\'auteur de cet article (Thomas-Robert Bugeaud). Vous pouvez retrouver
l\'original de cet article (Thomas-Robert Bugeaud) &agrave; <a
href="http://fr.wikipedia.org/wiki/Thomas-Robert_Bugeaud">cette adresse</a> et
la liste des auteurs <a
href="http://fr.wikipedia.org/w/index.php?title=Thomas-Robert_Bugeaud&amp;action=history">ici</a>\nVous
pouvez <a
href="http://fr.wikipedia.org/w/index.php?title=Thomas-Robert_Bugeaud&amp;action=edit">modifier
ou compl\xc3\xa9ter</a> cet article mais \xc3\xa9galement <a
href="http://fr.wikipedia.org/w/index.php?title=Discuter:Thomas-Robert_Bugeaud&amp;action=edit">discuter</a>
de son contenu (Thomas-Robert Bugeaud) sur le site de <a
href="http://fr.wikipedia.org">WIKIPEDIA France</a> - Contenu (Thomas-'
>>>
msg59000 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2007-12-26 16:49
> Irrespective of the patch, this issue is reproducible with the code in the
> trunk for Python 2.6. Should we close this then?

Sorry, I meant to say "NOT reproducible".
msg59117 - (view) Author: Francesco Cosoleto (cosoleto) Date: 2008-01-03 01:15
Sorry, but I don't understand the reason for closing this issue with 
resolution "wont fix". The problem was reproducible and its cause was 
explained by several developers. If the problem has been resolved, then 
please change the "resolution" field to "fixed"; otherwise a patch request 
is pending (see msg56162). No? :-( Of course, as was predictable, the 
bug is no longer reproducible, even with the previous Python version: 

$ wget -c http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud
[...omission...]
02:08:34 (4.28 KB/s) - "Thomas-Robert_Bugeaud" saved [65107] 

----

Python 2.5.1 (r251:54863, May 18 2007, 16:56:43) 
>>> url = "http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud"
>>> a = urllib.urlopen(url) ; c = a.read(1024 * 1024 * 2)
>>> len(c)
65169
msg59118 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-01-03 01:50
I'm just following the last post's suggestion "Should we close this then?"

My message (somebody "should" submit a patch) was sarcastic --- it was
in reference to the comment that Python "should" be more practical.

Since no patch was applied, I don't know why "won't fix" isn't a
perfectly adequate description of the reason for closure.

If you want me to reopen this, please submit a patch.
msg76743 - (view) Author: John J Lee (jjlee) Date: 2008-12-02 14:13
This is fixed in trunk r61034 by issue #900744.  Please use that issue
for any discussion re whether this should be fixed in 2.5.
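
This thread does not show what r61034 changed. As a rough illustration only, the sketch below feeds a similarly malformed stream to Python 3's http.client (the successor of httplib) and assumes its current behaviour: an unparsable chunk-size line raises http.client.IncompleteRead, and the data decoded so far is available as the exception's `partial` attribute. `FakeSocket` and the response bytes are made up so the example runs without a network.

```python
import http.client
import io


class FakeSocket:
    """Stand-in for a socket: http.client only needs makefile() here."""

    def __init__(self, data):
        self._data = data

    def makefile(self, mode, *args, **kwargs):
        return io.BytesIO(self._data)


# One good chunk, then NUL bytes where the final chunk size should be.
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    b"5\r\nhello\r\n"
    + b"\x00" * 16 + b"\r\n"
)

resp = http.client.HTTPResponse(FakeSocket(raw))
resp.begin()  # parse the status line and headers
try:
    body = resp.read()
except http.client.IncompleteRead as exc:
    # Keep whatever was decoded before the malformed chunk size.
    body = exc.partial

print(body)
```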
History
Date User Action Args
2008-12-02 14:13:43 jjlee set nosy: + jjlee; messages: + msg76743
2008-01-03 01:50:08 gvanrossum set messages: + msg59118
2008-01-03 01:15:02 cosoleto set messages: + msg59117
2008-01-02 23:16:53 gvanrossum set status: open -> closed; resolution: wont fix
2007-12-26 16:49:23 orsenthil set messages: + msg59000
2007-12-26 16:46:36 orsenthil set nosy: + orsenthil; messages: + msg58999
2007-12-01 12:16:34 josm set files: + httplib.py.diff; messages: + msg58042
2007-10-16 19:11:29 torriem set nosy: + torriem; messages: + msg56506
2007-09-28 03:18:04 josm set files: + httplib.diff; messages: + msg56183
2007-09-27 14:31:23 gvanrossum set messages: + msg56162
2007-09-27 03:21:30 josm set nosy: + josm; messages: + msg56151
2007-09-26 16:55:10 gvanrossum set nosy: + gvanrossum; messages: + msg56147
2007-09-26 14:07:54 ggenellina set nosy: + ggenellina; messages: + msg56144
2007-09-26 07:55:32 cosoleto create