
Author karlcow
Recipients BreamoreBoy, dualbus, ezio.melotti, karlcow, orsenthil, rhettinger, terry.reedy, tshepang
Date 2014-06-23.00:00:27
Message-id <1403481628.06.0.397133787225.issue15851@psf.upfronthosting.co.za>
Content
$ python
Python 2.7.5 (default, Mar  9 2014, 22:15:05) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import robotparser
>>> rp = robotparser.RobotFileParser('http://somesite.test.site/robots.txt')
>>> rp.read()
>>> 


Let's check the server logs:

127.0.0.1 - - [23/Jun/2014:08:44:37 +0900] "GET /robots.txt HTTP/1.0" 200 92 "-" "Python-urllib/1.17"

In 2.*, robotparser by default uses the Python-urllib/1.17 user agent, which is traditionally blocked by many sysadmins. A solution has already been proposed above.
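
For 2.*, a possible workaround (a minimal sketch: robotparser.py fetches through a FancyURLopener subclass whose class attribute `version` becomes the User-Agent header; 'MyUa/0.1' is just a placeholder string):

import robotparser

# Override the User-Agent sent by robotparser's internal opener.
robotparser.URLopener.version = 'MyUa/0.1'

rp = robotparser.RobotFileParser('http://somesite.test.site/robots.txt')
rp.read()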

This is the proposed test for 3.4:

import urllib.robotparser
import urllib.request

# Build an opener that sends a custom User-agent and install it
# globally, so that RobotFileParser.read() (which goes through
# urllib.request) picks it up.
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'MyUa/0.1')]
urllib.request.install_opener(opener)

rp = urllib.robotparser.RobotFileParser('http://localhost:9999')
rp.read()
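
An alternative that avoids installing a process-wide opener is to subclass RobotFileParser and fetch with an explicit Request. This is only a sketch (the class name and UA string are placeholders); the error handling mirrors what the stdlib read() does:

import urllib.error
import urllib.request
import urllib.robotparser

class UARobotFileParser(urllib.robotparser.RobotFileParser):
    user_agent = 'MyUa/0.1'  # placeholder UA

    def read(self):
        # Same logic as RobotFileParser.read(), but with an explicit
        # User-Agent header on the request.
        request = urllib.request.Request(
            self.url, headers={'User-Agent': self.user_agent})
        try:
            f = urllib.request.urlopen(request)
        except urllib.error.HTTPError as err:
            if err.code in (401, 403):
                self.disallow_all = True
            elif 400 <= err.code < 500:
                self.allow_all = True
        else:
            self.parse(f.read().decode('utf-8').splitlines())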


The issue is no longer about changing the lib, but just about documenting how to change the RobotFileParser default UA. We can change the title of this issue if it's confusing, or close it and open a new one about the documentation, whichever is easier :)

Currently robotparser.py inherits the default user agent from urllib:
http://hg.python.org/cpython/file/7dc94337ef67/Lib/urllib/request.py#l364
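
That line is in OpenerDirector.__init__, which is (approximately, quoted here for context) where the default header comes from:

class OpenerDirector:
    def __init__(self):
        client_version = "Python-urllib/%s" % __version__
        self.addheaders = [('User-agent', client_version)]
        ...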

It's a common failure encountered when using urllib in general, robotparser included.
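
For plain urllib code, the usual fix is to set the header per request instead of relying on the default (a sketch; the URL and UA string are placeholders):

import urllib.request

req = urllib.request.Request('http://somesite.test.site/',
                             headers={'User-Agent': 'MyUa/0.1'})
with urllib.request.urlopen(req) as f:
    body = f.read()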


As for Wikipedia, they have fixed their server-side user-agent sniffing and no longer filter python-urllib:

GET /robots.txt HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate, compress
Host: en.wikipedia.org
User-Agent: Python-urllib/1.17

HTTP/1.1 200 OK
Accept-Ranges: bytes
Age: 3161
Cache-control: s-maxage=3600, must-revalidate, max-age=0
Connection: keep-alive
Content-Encoding: gzip
Content-Length: 5208
Content-Type: text/plain; charset=utf-8
Date: Sun, 22 Jun 2014 23:59:16 GMT
Last-modified: Tue, 26 Nov 2013 17:39:43 GMT
Server: Apache
Set-Cookie: GeoIP=JP:Tokyo:35.6850:139.7514:v4; Path=/; Domain=.wikipedia.org
Vary: X-Subdomain
Via: 1.1 varnish, 1.1 varnish, 1.1 varnish
X-Article-ID: 19292575
X-Cache: cp1065 miss (0), cp4016 hit (1), cp4009 frontend hit (215)
X-Content-Type-Options: nosniff
X-Language: en
X-Site: wikipedia
X-Varnish: 2529666795, 2948866481 2948865637, 4134826198 4130750894


Many other sites still do. :)