Message56143
urllib fails to read URL contents, urllib2 crashes Python
Python version:
-------------------------
Python 2.5.1 (r251:54863, May 18 2007, 16:56:43)
[GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)]
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
Python 2.4.4 (#2, Aug 16 2007, 00:34:54)
[GCC 4.1.3 20070812 (prerelease) (Debian 4.1.2-15)] on linux2
-------------------------
GNU wget retrieves the page correctly:
-------------------------
$ wget -S http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud
--08:42:21-- http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud
=> `Thomas-Robert_Bugeaud'
Resolving www.recherche.fr... 88.191.11.214
Connecting to www.recherche.fr|88.191.11.214:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Wed, 26 Sep 2007 06:42:53 GMT
Server: Apache/2.2.3 (Debian) PHP/5.2.3-0.dotdeb.1 with Suhosin-Patch
X-Powered-By: PHP/5.2.3-0.dotdeb.1
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
Length: unspecified [text/html]
[ <=> ]
267,080 --.--K/s
08:42:42 (14.11 KB/s) - "Thomas-Robert_Bugeaud" saved [267080]
-------------------------
Python:
-------------------------
>>> import urllib
>>> a = urllib.urlopen('http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud')
>>> c = a.read(1024*1024*2)
>>> len(c)
1035220
>>> c[63000:64000]
'he.fr en page d\'accueil</a><br>\n <span>Partenaires :</span> <a
href="http://www.cartes.fr/" target="_blank">Cartes\n
postales</a> <a href="http://www.deux.fr/script/"
target="_blank">Rencontres\n gratuites\n </a> <a
href="http://www.new.fr/" target="_blank">Noms\n de domaine
gratuits</a> <a href="http://www.netencyclo.com/"
target="_blank">Encyclopedia</a> </p>\n <p style="text-
align:center;"><a href="http://www.futureobject.com/"
target="_blank"><img src="http://www.recherche.fr/images/logo_fo.gif"
border="0" height="25" width="96"></a></p>\n\n </p>\n </div>\n
</div><!-- site -->\n</body>\n</html>\n\r\n\x00\x00\x00\x00\x00\x00\x00
\x00\x00[...omission...]\x00\x00\x00\x00'
-------------------------
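To quantify the padding seen in the urllib result, a small helper along these lines could be used. This is only a sketch: the name describe_nul_padding is hypothetical, the sample payload is made up for illustration, and it is written for Python 3 bytes objects (under Python 2.5 the same string methods exist on str).

```python
def describe_nul_padding(data):
    """Return (offset of the first NUL byte, total number of NUL bytes).

    The offset is -1 when the payload contains no NUL byte at all.
    """
    first_nul = data.find(b"\x00")
    return first_nul, data.count(b"\x00")


# Made-up payload shaped like the tail of the response shown above.
sample = b"</body>\n</html>\n\r\n" + b"\x00" * 32
first_nul, nul_count = describe_nul_padding(sample)
print("first NUL at offset", first_nul, "total NUL bytes:", nul_count)
```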
The same session with the urllib2 module instead of urllib:
-------------------------
File "/usr/lib/python2.5/socket.py", line 291, in read
data = self._sock.recv(recv_size)
File "/usr/lib/python2.5/httplib.py", line 509, in read
return self._read_chunked(amt)
File "/usr/lib/python2.5/httplib.py", line 548, in _read_chunked
chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: '\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00[...omission...]\x00\x00\x00\x00\x00\x00\x00
\
-------------------------
The same urllib2 session under Python 2.4:
-------------------------
>>> import urllib2
>>> a = urllib2.urlopen('http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud')
>>>
>>> c = a.read(1024*1024*2)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/lib/python2.4/socket.py", line 295, in read
data = self._sock.recv(recv_size)
File "/usr/lib/python2.4/httplib.py", line 460, in read
return self._read_chunked(amt)
File "/usr/lib/python2.4/httplib.py", line 499, in _read_chunked
chunk_left = int(line, 16)
ValueError: invalid literal for int():
-------------------------
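The ValueError in both tracebacks comes from parsing a chunk-size line as a hexadecimal number. The following sketch reproduces that step without network access. It is not httplib's code: read_chunked is a hypothetical, simplified decoder written for Python 3, and the two byte strings are invented test inputs (one well-formed body, one where NUL bytes appear where a chunk-size line is expected).

```python
import io


def read_chunked(stream):
    """Decode an HTTP/1.1 chunked body from a binary file-like object."""
    parts = []
    while True:
        line = stream.readline()
        # Ignore any chunk extension after ';' and surrounding whitespace.
        size_field = line.split(b";", 1)[0].strip()
        try:
            chunk_size = int(size_field, 16)
        except ValueError:
            # Same failure as in the tracebacks, with a clearer message.
            raise ValueError("malformed chunk-size line: %r" % line[:40])
        if chunk_size == 0:
            break  # last-chunk marker; trailers are not handled in this sketch
        parts.append(stream.read(chunk_size))
        stream.readline()  # consume the CRLF that ends the chunk data
    return b"".join(parts)


well_formed = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
nul_padded = b"5\r\nhello\r\n" + b"\x00" * 16

print(read_chunked(io.BytesIO(well_formed)))
try:
    read_chunked(io.BytesIO(nul_padded))
except ValueError as exc:
    print("decoder rejected the stream:", exc)
```

Whether a client should raise here or stop and return the data read so far is a design choice; the wget run above suggests wget tolerated this particular response.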
Regards,
Francesco Cosoleto

Date | User | Action | Args
2007-09-26 07:55:35 | cosoleto | set | spambayes_score: 0.0263763 -> 0.02637627; recipients: + cosoleto
2007-09-26 07:55:33 | cosoleto | set | spambayes_score: 0.0263763 -> 0.0263763; messageid: <1190793332.9.0.29899287923.issue1205@psf.upfronthosting.co.za>
2007-09-26 07:55:32 | cosoleto | link | issue1205 messages
2007-09-26 07:55:13 | cosoleto | create |