
Author karlcow
Recipients BreamoreBoy, dualbus, ezio.melotti, karlcow, orsenthil, rhettinger, terry.reedy, tshepang
Date 2014-06-23.00:00:27
Message-id <1403481628.06.0.397133787225.issue15851@psf.upfronthosting.co.za>
Content
$ python
Python 2.7.5 (default, Mar  9 2014, 22:15:05) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import robotparser
>>> rp = robotparser.RobotFileParser('http://somesite.test.site/robots.txt')
>>> rp.read()
>>> 


Let's check the server logs:

127.0.0.1 - - [23/Jun/2014:08:44:37 +0900] "GET /robots.txt HTTP/1.0" 200 92 "-" "Python-urllib/1.17"

In 2.*, robotparser by default uses the Python-urllib/1.17 user agent, which is traditionally blocked by many sysadmins. A solution has already been proposed above.
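
For 2.*, a possible workaround (a minimal sketch: robotparser.py fetches through a FancyURLopener subclass whose class attribute `version` becomes the User-Agent header; 'MyUa/0.1' is just a placeholder string):

import robotparser

# Override the User-Agent sent by robotparser's internal opener.
robotparser.URLopener.version = 'MyUa/0.1'

rp = robotparser.RobotFileParser('http://somesite.test.site/robots.txt')
rp.read()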

This is the proposed test for 3.4:

import urllib.robotparser
import urllib.request

# Build an opener that sends a custom User-agent and install it
# globally, so that RobotFileParser.read() (which goes through
# urllib.request) picks it up.
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'MyUa/0.1')]
urllib.request.install_opener(opener)

rp = urllib.robotparser.RobotFileParser('http://localhost:9999')
rp.read()
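
An alternative that avoids installing a process-wide opener is to subclass RobotFileParser and fetch with an explicit Request. This is only a sketch (the class name and UA string are placeholders); the error handling mirrors what the stdlib read() does:

import urllib.error
import urllib.request
import urllib.robotparser

class UARobotFileParser(urllib.robotparser.RobotFileParser):
    user_agent = 'MyUa/0.1'  # placeholder UA

    def read(self):
        # Same logic as RobotFileParser.read(), but with an explicit
        # User-Agent header on the request.
        request = urllib.request.Request(
            self.url, headers={'User-Agent': self.user_agent})
        try:
            f = urllib.request.urlopen(request)
        except urllib.error.HTTPError as err:
            if err.code in (401, 403):
                self.disallow_all = True
            elif 400 <= err.code < 500:
                self.allow_all = True
        else:
            self.parse(f.read().decode('utf-8').splitlines())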


The issue is no longer about changing the lib, but just about documenting how to change the RobotFileParser default UA. We can change the title of this issue if it's confusing, or close it and open a new one about the documentation, whichever is easier :)

Currently robotparser.py inherits the default user agent from urllib:
http://hg.python.org/cpython/file/7dc94337ef67/Lib/urllib/request.py#l364
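
That line is in OpenerDirector.__init__, which is (approximately, quoted here for context) where the default header comes from:

class OpenerDirector:
    def __init__(self):
        client_version = "Python-urllib/%s" % __version__
        self.addheaders = [('User-agent', client_version)]
        ...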

It's a common failure encountered when using urllib in general, robotparser included.
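
For plain urllib code, the usual fix is to set the header per request instead of relying on the default (a sketch; the URL and UA string are placeholders):

import urllib.request

req = urllib.request.Request('http://somesite.test.site/',
                             headers={'User-Agent': 'MyUa/0.1'})
with urllib.request.urlopen(req) as f:
    body = f.read()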


As for Wikipedia, they have fixed their server-side user-agent sniffing and no longer filter python-urllib:

GET /robots.txt HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate, compress
Host: en.wikipedia.org
User-Agent: Python-urllib/1.17

HTTP/1.1 200 OK
Accept-Ranges: bytes
Age: 3161
Cache-control: s-maxage=3600, must-revalidate, max-age=0
Connection: keep-alive
Content-Encoding: gzip
Content-Length: 5208
Content-Type: text/plain; charset=utf-8
Date: Sun, 22 Jun 2014 23:59:16 GMT
Last-modified: Tue, 26 Nov 2013 17:39:43 GMT
Server: Apache
Set-Cookie: GeoIP=JP:Tokyo:35.6850:139.7514:v4; Path=/; Domain=.wikipedia.org
Vary: X-Subdomain
Via: 1.1 varnish, 1.1 varnish, 1.1 varnish
X-Article-ID: 19292575
X-Cache: cp1065 miss (0), cp4016 hit (1), cp4009 frontend hit (215)
X-Content-Type-Options: nosniff
X-Language: en
X-Site: wikipedia
X-Varnish: 2529666795, 2948866481 2948865637, 4134826198 4130750894


Many other sites still do. :)