classification
Title: robotparser reads empty robots.txt file as "all denied"
Type: enhancement
Stage: test needed
Components: Library (Lib)
Versions: Python 3.8
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: larsfuse, terry.reedy
Priority: normal
Keywords:

Created on 2018-12-11 09:30 by larsfuse, last changed 2018-12-17 10:02 by larsfuse.

Messages (3)
msg331595 - Author: larsfuse (larsfuse) Date: 2018-12-11 09:30
The standard (http://www.robotstxt.org/robotstxt.html) says:

> To allow all robots complete access:
> User-agent: *
> Disallow:
> (or just create an empty "/robots.txt" file, or don't use one at all)

Here I give Python an empty file:
$ curl http://10.223.68.186/robots.txt
$

Code:

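import robotparser  # Python 2 module; Python 3 uses urllib.robotparser

robotsurl = "http://10.223.68.186/robots.txt"  # URL shown in the output below
rp = robotparser.RobotFileParser()
print(robotsurl)
rp.set_url(robotsurl)
rp.read()
# Under Python 2, print(a, b) prints a tuple, hence the output format below.
print("fetch /", rp.can_fetch(useragent="*", url="/"))
print("fetch /admin", rp.can_fetch(useragent="*", url="/admin"))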

Result:

$ ./test.py
http://10.223.68.186/robots.txt
('fetch /', False)
('fetch /admin', False)

As a result, robotparser treats the entire site as blocked.
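For anyone who wants to check this without a web server, here is a minimal sketch that feeds robots.txt lines to the parser in memory. It assumes Python 3's urllib.robotparser (the report above ran under Python 2's robotparser), and the helper name can_fetch_root is made up for illustration. The outcome for the empty case may differ between Python versions, so the script prints it rather than assuming it.

```python
from urllib.robotparser import RobotFileParser


def can_fetch_root(robots_lines, useragent="*", url="/"):
    """Parse robots.txt lines in memory and report whether url is allowed.

    Hypothetical helper, not part of the standard library.
    """
    rp = RobotFileParser()
    rp.parse(robots_lines)  # no network access; lines are supplied directly
    return rp.can_fetch(useragent, url)


# An empty robots.txt, i.e. no lines at all.
empty_result = can_fetch_root([])
# The explicit "allow everything" record quoted from the standard.
explicit_allow = can_fetch_root(["User-agent: *", "Disallow:"])
# For contrast, an explicit "deny everything" record.
explicit_deny = can_fetch_root(["User-agent: *", "Disallow: /"])

print("empty file     :", empty_result)
print("explicit allow :", explicit_allow)
print("explicit deny  :", explicit_deny)
```

If the first line prints a different value than the second, the installed version treats an empty file differently from the explicit allow-all record.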
msg331870 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2018-12-14 21:08
https://docs.python.org/2.7/library/robotparser.html#module-robotparser
and
https://docs.python.org/3/library/urllib.robotparser.html#module-urllib.robotparser
refer users, for file structure, to http://www.robotstxt.org/orig.html.
This says nothing about the effect of an empty file, so I don't see this as a bug. Even if it were, I would be dubious about reversing the effect without a deprecation notice first, and definitely not in 2.7.

I would propose instead that the doc be changed to refer to the new file, with more and better examples, but add a note that robotparser interprets empty files as 'block all' rather than 'allow all'.

Try bringing this up on python-ideas.
msg331963 - Author: larsfuse (larsfuse) Date: 2018-12-17 10:02
> (...) refers users, for file structure, to http://www.robotstxt.org/orig.html. This says nothing about the effect of an empty file, so I don't see this as a bug.

That is incorrect. The page at that URL states:
> The presence of an empty "/robots.txt" file has no explicit associated semantics, it will be treated as if it was not present, i.e. all robots will consider themselves welcome.

So this is definitely a bug.
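Until the behavior is settled, one possible workaround is to apply the quoted rule in caller code. The sketch below assumes Python 3's urllib.robotparser and that the robots.txt body has already been fetched as text. The names ALLOW_ALL and parser_for_body are made up for illustration, and treating a whitespace-only body as empty is my own assumption, not something the quoted text states.

```python
from urllib.robotparser import RobotFileParser

# The explicit "allow all robots" record from the robots.txt documentation.
ALLOW_ALL = ["User-agent: *", "Disallow:"]


def parser_for_body(body):
    """Build a parser from robots.txt text, mapping an empty body to allow-all.

    Hypothetical helper: an empty (or whitespace-only) body is replaced by
    the explicit allow-all record before parsing.
    """
    rp = RobotFileParser()
    lines = body.splitlines()
    if not body.strip():
        lines = ALLOW_ALL
    rp.parse(lines)
    return rp


if __name__ == "__main__":
    empty = parser_for_body("")
    print("empty, /admin:", empty.can_fetch("*", "/admin"))

    restricted = parser_for_body("User-agent: *\nDisallow: /admin\n")
    print("restricted, /:", restricted.can_fetch("*", "/"))
    print("restricted, /admin:", restricted.can_fetch("*", "/admin"))
```

This only uses parse() and can_fetch(), so it does not depend on how read() handles an empty response.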
History
Date                 User         Action  Args
2018-12-17 10:02:06  larsfuse     set     messages: + msg331963
2018-12-14 21:08:55  terry.reedy  set     versions: + Python 3.8, - Python 2.7; nosy: + terry.reedy; messages: + msg331870; type: behavior -> enhancement; stage: test needed
2018-12-11 09:30:47  larsfuse     create