Issue1205
Created on 2007-09-26 07:55 by cosoleto, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Files

File name | Uploaded |
---|---|
httplib.diff | josm, 2007-09-28 03:18 |
httplib.py.diff | josm, 2007-12-01 12:16 |
Messages (13) | |||
---|---|---|---|
msg56143 - (view) | Author: Francesco Cosoleto (cosoleto) | Date: 2007-09-26 07:55 | |
urllib fail to read URL contents, urllib2 crash Python Python version: ------------------------- Python 2.5.1 (r251:54863, May 18 2007, 16:56:43) [GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)] Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32 Python 2.4.4 (#2, Aug 16 2007, 00:34:54) [GCC 4.1.3 20070812 (prerelease) (Debian 4.1.2-15)] on linux2 ------------------------- Working with GNU wget: ------------------------- $ wget -S http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud --08:42:21-- http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud => `Thomas-Robert_Bugeaud' Risoluzione di www.recherche.fr in corso... 88.191.11.214 Connessione a www.recherche.fr|88.191.11.214:80... connesso. HTTP richiesta inviata, aspetto la risposta... HTTP/1.1 200 OK Date: Wed, 26 Sep 2007 06:42:53 GMT Server: Apache/2.2.3 (Debian) PHP/5.2.3-0.dotdeb.1 with Suhosin-Patch X-Powered-By: PHP/5.2.3-0.dotdeb.1 Keep-Alive: timeout=15, max=100 Connection: Keep-Alive Transfer-Encoding: chunked Content-Type: text/html; charset=UTF-8 Lunghezza: non specificato [text/html] [ <=> ] 267,080 --.--K/s 08:42:42 (14.11 KB/s) - "Thomas-Robert_Bugeaud" salvato [267080] ------------------------- Python: ------------------------- >>> import urllib >>> a = urllib.urlopen('http://www.recherche.fr/encyclopedie/Thomas- Robert_Bugeaud') >>> c = a.read(1024*1024*2) >>> len(c) 1035220 >>> c[63000:64000] 'he.fr en page d\'accueil</a><br>\n <span>Partenaires :</span> <a href="http://www.cartes.fr/" target="_blank">Cartes\n postales</a> <a href="http://www.deux.fr/script/" target="_blank">Rencontres\n gratuites\n </a> <a href="http://www.new.fr/" target="_blank">Noms\n de domaine gratuits</a> <a href="http://www.netencyclo.com/" target="_blank">Encyclopedia</a> </p>\n <p style="text- align:center;"><a href="http://www.futureobject.com/" target="_blank"><img src="http://www.recherche.fr/images/logo_fo.gif" border="0" height="25" width="96"></a></p>\n\n 
</p>\n </div>\n </div><!-- site -->\n</body>\n</html>\n\r\n\x00\x00\x00\x00\x00\x00\x00 \x00\x00[...omission...]\x00\x00\x00\x00' ------------------------- As above, but with urllib2 module instead of urllib: ------------------------- File "/usr/lib/python2.5/socket.py", line 291, in read data = self._sock.recv(recv_size) File "/usr/lib/python2.5/httplib.py", line 509, in read return self._read_chunked(amt) File "/usr/lib/python2.5/httplib.py", line 548, in _read_chunked chunk_left = int(line, 16) ValueError: invalid literal for int() with base 16: '\x00\x00\x00\x00 \x00\x00\x00\x00\x00\x00\x00[...omission...]\x00\x00\x00\x00\x00\x00\x00 \ ------------------------- As above, but with Python 2.4: ------------------------- >>> import urllib2 >>> a = urllib2.urlopen('http://www.recherche.fr/encyclopedie/Thomas- Robert_Bugeaud') >>> >>> c = a.read(1024*1024*2) Traceback (most recent call last): File "<stdin>", line 1, in ? File "/usr/lib/python2.4/socket.py", line 295, in read data = self._sock.recv(recv_size) File "/usr/lib/python2.4/httplib.py", line 460, in read return self._read_chunked(amt) File "/usr/lib/python2.4/httplib.py", line 499, in _read_chunked chunk_left = int(line, 16) ValueError: invalid literal for int(): ------------------------- Regards, Francesco Cosoleto |
|||
msg56144 - (view) | Author: Gabriel Genellina (ggenellina) | Date: 2007-09-26 14:07 | |
This is a server bug. Internet Explorer 6 can't show the page either. The response is malformed: it uses chunked transfer encoding, and RFC 2616 section 3.6.1 says "The chunk-size field is a string of hex digits indicating the size of the chunk. The chunked encoding is ended by any chunk whose size is zero[...]" After the (first and only) chunk of around 63K, a zero-length chunk should follow: a line with one or more "0" digits followed by CR+LF. But the server does not send that last chunk; instead it sends lots of NUL bytes until a CR+LF sequence eventually arrives. Neither IE nor Python can handle that (IE keeps requesting the page again and again). wget is apparently a lot more relaxed and decides that the first chunk is good enough. Perhaps urllib/urllib2 could handle the error and raise a more meaningful exception in this case, but just ignoring the error doesn't appear to be the right thing IMHO. |
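To make the framing described above concrete, here is a minimal sketch of a strict chunked-body decoder, written for Python 3 rather than the Python 2 httplib discussed in this issue. It is not httplib's code: `read_chunked` and `MalformedChunkError` are hypothetical names, and trailers after the final chunk are simply ignored. It shows how a run of NUL bytes where a chunk-size line should be can be turned into a more meaningful exception.

```python
import io


class MalformedChunkError(ValueError):
    """Hypothetical exception: the chunk-size line is not a valid hex number."""


def read_chunked(fp):
    """Decode a chunked HTTP body from a binary file-like object (sketch)."""
    body = bytearray()
    while True:
        line = fp.readline()
        # Chunk extensions (anything after ';') are permitted and ignored here.
        size_field = line.split(b";", 1)[0].strip()
        try:
            size = int(size_field, 16)
            if size < 0:
                raise ValueError("negative chunk size")
        except ValueError:
            raise MalformedChunkError(
                "bad chunk-size line: %r" % line[:40]
            ) from None
        if size == 0:
            break  # a zero-length chunk ends the body (trailers not parsed)
        body += fp.read(size)
        fp.readline()  # consume the CRLF that follows the chunk data
    return bytes(body)
```

A well-formed stream such as `5\r\nhello\r\n0\r\n\r\n` decodes to `hello`, while a stream that follows the first chunk with NUL bytes raises `MalformedChunkError` instead of a bare `ValueError` from `int()`.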
|||
msg56147 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2007-09-26 16:55 | |
Maybe the French internet is incompatible with the rest of the world? :-) |
|||
msg56151 - (view) | Author: John Smith (josm) | Date: 2007-09-27 03:21 | |
Firefox 2.0.0.7 and Safari 2.0.4 can show this page. In my opinion, Python urllib should be more practical and provide a way to read this kind of page. "In general, an implementation must be conservative in its sending behavior, and liberal in its receiving behavior." [RFC 791 3.2] |
|||
msg56162 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2007-09-27 14:31 | |
> In my opinion, Python urllib "should" be more practical and > provide a way to read this kind of page. [quotes mine] Totally agreed. Someone "should" submit a patch. |
|||
msg56183 - (view) | Author: John Smith (josm) | Date: 2007-09-28 03:18 | |
Attached a patch for this problem. This one just ignores the buggy chunk-size and closes the connection. As ggenellina said earlier, this might not be a good way to fix this, but I could not come up with a better solution. |
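The attached diff is not reproduced on this page, so the following is only a sketch of the general idea ("stop at the first unparsable chunk-size and keep what was read"), not the patch itself. It is written for Python 3, and `read_chunked_lenient` is a hypothetical name. Returning a flag alongside the data lets the caller see that the body was cut short instead of silently hiding the server error.

```python
import io


def read_chunked_lenient(fp):
    """Return (body, complete); complete is False if the stream was malformed."""
    body = bytearray()
    while True:
        line = fp.readline()
        try:
            size = int(line.split(b";", 1)[0].strip(), 16)
            if size < 0:
                raise ValueError("negative chunk size")
        except ValueError:
            # Garbage (or EOF) where a chunk-size was expected: give up
            # and hand back whatever was decoded so far.
            return bytes(body), False
        if size == 0:
            return bytes(body), True  # proper zero-length terminating chunk
        body += fp.read(size)
        fp.readline()  # consume the CRLF after the chunk data
```

With this shape, a caller that wants wget-like behaviour can use the partial body, and a caller that wants strictness can treat `complete == False` as an error.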
|||
msg56506 - (view) | Author: Michael Torrie (torriem) | Date: 2007-10-16 19:11 | |
I had a situation where I was talking to a Sharp MFD printer. Its web server apparently does not serve chunked data properly. However, the patch posted here put it in an infinite loop. Somewhere around line 525 in the Python 2.4 version of httplib.py, I had to make it look like this:
    while True:
        line = self.fp.readline()
        if line == '\r\n' or not line:
            break
I added "or not line" to the if statement. The blank line in the chunked HTTP response was confusing the _last_chunk thing, but even when it was set to zero, since there was no more data, this loop to eat up CRLFs never ended. Is this really a proper fix? I'm in favor of changing urllib2 to be less strict because, despite the RFCs, we're stuck talking to all kinds of web servers (embedded ones in particular) that simply can't easily be changed. |
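The reason the extra `or not line` test matters is that `readline()` on a file-like object returns an empty string at end of input, and keeps doing so on every further call, so a loop that only waits for a blank `\r\n` line never exits once the data runs out. A small Python 3 sketch of the same loop, using `io.BytesIO` as a stand-in for the socket file (`skip_to_blank_line` is a hypothetical helper name, not part of httplib):

```python
import io


def skip_to_blank_line(fp):
    """Consume lines up to the first blank line or EOF; return lines skipped."""
    skipped = 0
    while True:
        line = fp.readline()
        # At EOF readline() returns b"" on every call, so without the
        # "not line" test this loop would spin forever.
        if line == b"\r\n" or not line:
            break
        skipped += 1
    return skipped
```

Both a stream that ends with a blank line and one that simply stops after the last header line terminate the loop.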
|||
msg58042 - (view) | Author: John Smith (josm) | Date: 2007-12-01 12:16 | |
Included torriem's fix. IMHO, there is no clear solution for this, because it is caused by an HTTP server "bug", and a bug is something you can't predict accurately... |
|||
msg58999 - (view) | Author: Senthil Kumaran (orsenthil) * | Date: 2007-12-26 16:46 | |
Irrespective of the patch, this issue is reproducible with the code in the trunk for Python 2.6. Should we close this then? Python 2.6a0 (trunk:59600M, Dec 25 2007, 13:54:34) [GCC 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import urllib2 >>> import urllib >>> url = "http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud" >>> a = urllib.urlopen(url) >>> b = urllib2.urlopen(url) >>> c = a.read(1024 * 1024 * 2) >>> c[63000:64000] 'UA-321207-2";\nurchinTracker();\n</script>\n <div id="introFin">\n <p>\nLe contenu de cette page (Thomas-Robert Bugeaud) est un minuscule extrait de l\'encyclopi\xc3\xa9die gratuite en ligne <a href="http://fr.wikipedia.org">WIKIPEDIA</a>\nle webmaster de ce site n\'est pas l\'auteur de cet article (Thomas-Robert Bugeaud). Vous pouvez retrouver l\'original de cet article (Thomas-Robert Bugeaud) à <a href="http://fr.wikipedia.org/wiki/Thomas-Robert_Bugeaud">cette adresse</a> et la liste des auteurs <a href="http://fr.wikipedia.org/w/index.php?title=Thomas-Robert_Bugeaud&action=history">ici</a>\nVous pouvez <a href="http://fr.wikipedia.org/w/index.php?title=Thomas-Robert_Bugeaud&action=edit">modifier ou compl\xc3\xa9ter</a> cet article mais \xc3\xa9galement <a href="http://fr.wikipedia.org/w/index.php?title=Discuter:Thomas-Robert_Bugeaud&action=edit">discuter</a> de son contenu (Thomas-Robert Bugeaud) sur le site de <a href="http://fr.wikipedia.org">WIKIPEDIA France</a> - Contenu (Thomas-Robert B' >>> c = b.read(1024 * 1024 * 2) >>> c[63000:64000] 'acct = "UA-321207-2";\nurchinTracker();\n</script>\n <div id="introFin">\n <p>\nLe contenu de cette page (Thomas-Robert Bugeaud) est un minuscule extrait de l\'encyclopi\xc3\xa9die gratuite en ligne <a href="http://fr.wikipedia.org">WIKIPEDIA</a>\nle webmaster de ce site n\'est pas l\'auteur de cet article (Thomas-Robert Bugeaud). 
Vous pouvez retrouver l\'original de cet article (Thomas-Robert Bugeaud) à <a href="http://fr.wikipedia.org/wiki/Thomas-Robert_Bugeaud">cette adresse</a> et la liste des auteurs <a href="http://fr.wikipedia.org/w/index.php?title=Thomas-Robert_Bugeaud&action=history">ici</a>\nVous pouvez <a href="http://fr.wikipedia.org/w/index.php?title=Thomas-Robert_Bugeaud&action=edit">modifier ou compl\xc3\xa9ter</a> cet article mais \xc3\xa9galement <a href="http://fr.wikipedia.org/w/index.php?title=Discuter:Thomas-Robert_Bugeaud&action=edit">discuter</a> de son contenu (Thomas-Robert Bugeaud) sur le site de <a href="http://fr.wikipedia.org">WIKIPEDIA France</a> - Contenu (Thomas-' >>> |
|||
msg59000 - (view) | Author: Senthil Kumaran (orsenthil) * | Date: 2007-12-26 16:49 | |
> Senthil added the comment: > Irrespective of the patch, this issue is reproducible with the code in the > trunk for Python 2.6. Should we close this then? Sorry, I meant to say "NOT reproducible". |
|||
msg59117 - (view) | Author: Francesco Cosoleto (cosoleto) | Date: 2008-01-03 01:15 | |
Sorry, but I don't understand the reason for closing this issue with resolution "wont fix". The problem was reproducible and its logic was explained by several developers. If the problem has been resolved, then please change the "resolution" field to "fixed"; otherwise a patch request is pending (see msg56162). No? :-( Of course, as was predictable, the bug is no longer reproducible, even with the previous Python version: $ wget -c http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud [...omission...] 02:08:34 (4.28 KB/s) - "Thomas-Robert_Bugeaud" salvato [65107] ---- Python 2.5.1 (r251:54863, May 18 2007, 16:56:43) >>> url = "http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud" >>> a = urllib.urlopen(url) ; c = a.read(1024 * 1024 * 2) >>> len(c) 65169 |
|||
msg59118 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2008-01-03 01:50 | |
I'm just following the last post's suggestion "Should we close this then?" My message (somebody "should" submit a patch) was sarcastic --- it was in reference to the comment that Python "should" be more practical. Since no patch was applied, I don't know why "won't fix" isn't a perfectly adequate description of the reason for closure. If you want me to reopen this, please submit a patch. |
|||
msg76743 - (view) | Author: John J Lee (jjlee) | Date: 2008-12-02 14:13 | |
This is fixed in trunk r61034 by issue #900744. Please use that issue for any discussion of whether this should be fixed in 2.5. |
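For readers on current Python 3, the standard library's `http.client` reports a chunked body that cannot be read to completion by raising `http.client.IncompleteRead`, whose `partial` attribute carries the bytes received so far. Whether that matches the r61034 change exactly is not stated in this thread, so treat the following as a sketch of how a caller might stay "liberal in its receiving behavior", not as a description of that commit. `read_tolerantly` is a hypothetical helper name.

```python
import http.client


def read_tolerantly(response):
    """Return (data, complete) from any object with a read() method."""
    try:
        return response.read(), True
    except http.client.IncompleteRead as exc:
        # exc.partial holds whatever was received before the stream broke.
        return exc.partial, False
```

The `response` argument would typically be the object returned by `urllib.request.urlopen(...)`; the caller decides whether a truncated body is acceptable.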
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:56:27 | admin | set | github: 45546 |
2008-12-02 14:13:43 | jjlee | set | nosy: + jjlee; messages: + msg76743 |
2008-01-03 01:50:08 | gvanrossum | set | messages: + msg59118 |
2008-01-03 01:15:02 | cosoleto | set | messages: + msg59117 |
2008-01-02 23:16:53 | gvanrossum | set | status: open -> closed; resolution: wont fix |
2007-12-26 16:49:23 | orsenthil | set | messages: + msg59000 |
2007-12-26 16:46:36 | orsenthil | set | nosy: + orsenthil; messages: + msg58999 |
2007-12-01 12:16:34 | josm | set | files: + httplib.py.diff; messages: + msg58042 |
2007-10-16 19:11:29 | torriem | set | nosy: + torriem; messages: + msg56506 |
2007-09-28 03:18:04 | josm | set | files: + httplib.diff; messages: + msg56183 |
2007-09-27 14:31:23 | gvanrossum | set | messages: + msg56162 |
2007-09-27 03:21:30 | josm | set | nosy: + josm; messages: + msg56151 |
2007-09-26 16:55:10 | gvanrossum | set | nosy: + gvanrossum; messages: + msg56147 |
2007-09-26 14:07:54 | ggenellina | set | nosy: + ggenellina; messages: + msg56144 |
2007-09-26 07:55:32 | cosoleto | create |