classification
Title: robotparser reads empty robots.txt file as "all denied"
Type: enhancement
Stage: test needed
Components: Library (Lib)
Versions: Python 3.8
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: berker.peksag, gallicrooster, larsfuse, terry.reedy, xtreak
Priority: normal
Keywords:

Created on 2018-12-11 09:30 by larsfuse, last changed 2020-01-02 15:45 by gallicrooster.

Messages (6)
msg331595 - Author: larsfuse (larsfuse) Date: 2018-12-11 09:30
The standard (http://www.robotstxt.org/robotstxt.html) says:

> To allow all robots complete access:
> User-agent: *
> Disallow:
> (or just create an empty "/robots.txt" file, or don't use one at all)

Here I give Python an empty file:
$ curl http://10.223.68.186/robots.txt
$

Code:

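import robotparser  # Python 2 module name (urllib.robotparser in Python 3)

robotsurl = "http://10.223.68.186/robots.txt"
rp = robotparser.RobotFileParser()
print(robotsurl)
rp.set_url(robotsurl)
rp.read()
# Under Python 2 these print statements show tuples, as in the result below.
print("fetch /", rp.can_fetch(useragent="*", url="/"))
print("fetch /admin", rp.can_fetch(useragent="*", url="/admin"))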

Result:

$ ./test.py
http://10.223.68.186/robots.txt
('fetch /', False)
('fetch /admin', False)

As a result, robotparser treats the whole site as blocked.
msg331870 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-12-14 21:08
https://docs.python.org/2.7/library/robotparser.html#module-robotparser
and
https://docs.python.org/3/library/urllib.robotparser.html#module-urllib.robotparser
refer users, for file structure, to http://www.robotstxt.org/orig.html.
This says nothing about the effect of an empty file, so I don't see this as a bug.  Even if it was, I would be dubious about reversing the effect without a deprecation notice first, and definitely not in 2.7.

I would propose instead that the doc be changed to refer to the new file, with more and better examples, but add a note that robotparser interprets empty files as 'block all' rather than 'allow all'.

Try bringing this up on python-ideas.
msg331963 - Author: larsfuse (larsfuse) Date: 2018-12-17 10:02
> (...) refers users, for file structure, to http://www.robotstxt.org/orig.html. This says nothing about the effect of an empty file, so I don't see this as a bug.

That is incorrect. From that url you can find:
> The presence of an empty "/robots.txt" file has no explicit associated semantics, it will be treated as if it was not present, i.e. all robots will consider themselves welcome.

So this is definitely a bug.
msg359180 - Author: Andre Burgaud (gallicrooster) * Date: 2020-01-02 03:41
Hi,

Is this ticket still relevant for Python 3.8?

While running some tests with an empty robots.txt file, I realized that it was returning "ALLOWED" for any path (as per the current draft of the Robots Exclusion Protocol: https://tools.ietf.org/html/draft-koster-rep-00#section-2.2.1).

Code:

from urllib import robotparser

robots_url = "file:///tmp/empty.txt"

rp = robotparser.RobotFileParser()
print(robots_url)
rp.set_url(robots_url)
rp.read()
print( "fetch /", rp.can_fetch(useragent = "*", url = "/"))
print( "fetch /admin", rp.can_fetch(useragent = "*", url = "/admin"))

Output:

$ cat /tmp/empty.txt
$ python -V
Python 3.8.1
$ python test_robot3.py
file:///tmp/empty.txt
fetch / True
fetch /admin True
msg359185 - Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2020-01-02 08:36
There is a behavior change. parse() sets the modified time, and unless the modified time is set, can_fetch() returns False. In Python 2, parse() was called only when the file was non-empty [0], but in Python 3 it is always called, even when the file is empty [1]. Commit 1afc1696167547a5fa101c53e5a3ab4717f8852c changed the code to always call parse(), and commit 122541beceeccce4ef8a9bf739c727ccdcbf2f28 made parse() always call modified(), which sets the modified time so that can_fetch() ends up returning True.

I think robotparser's behavior for an empty file was undefined, which allowed these changes, and it would be good to have a test for this behavior.

[0] https://github.com/python/cpython/blob/f82e59ac4020a64c262a925230a8eb190b652e87/Lib/robotparser.py#L66-L67
[1] https://github.com/python/cpython/blob/149175c6dfc8455023e4335575f3fe3d606729f9/Lib/urllib/robotparser.py#L69-L70
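
The effect of parse() on can_fetch() described above can be reproduced without any network or file access by passing an empty list of lines to parse() directly. This is only a minimal sketch, assuming a Python 3 version where parse() calls modified() (3.8 here); the variable names are illustrative.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()

# Nothing has been parsed yet, so the modified time is unset
# and can_fetch() refuses every URL.
before = rp.can_fetch("*", "/")

# Parsing an empty robots.txt (zero lines) still records the
# modified time, even though no rules are added.
rp.parse([])
after = rp.can_fetch("*", "/")

print("before parse:", before)
print("after parse:", after)
print("modified time set:", rp.mtime() > 0)
```

The same parser object gives a different answer before and after parse(), which matches the difference between the Python 2 and Python 3 results reported in this issue.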
msg359202 - Author: Andre Burgaud (gallicrooster) * Date: 2020-01-02 15:45
Thanks @xtreak for providing some clarification on this behavior! I can write some tests to cover this behavior, assuming that we agree that an empty file means "unlimited access". This was worded as such in the old internet draft from 1996 (section 3.2.1 in https://www.robotstxt.org/norobots-rfc.txt). The current draft is more ambiguous with "If no group satisfies either condition, or no groups are present at all, no rules apply." https://tools.ietf.org/html/draft-koster-rep-00#section-2.2.1

https://www.robotstxt.org/robotstxt.html clearly states that an empty file gives full access, but I'm getting lost in figuring out which is the official spec at the moment :-)
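
As a starting point for such tests, here is a rough sketch, assuming we do agree that an empty file means "unlimited access". The helper parser_for() and the class name EmptyRobotsTxtTest are hypothetical and not part of the existing test suite; only RobotFileParser.parse() and can_fetch() come from the standard library.

```python
import unittest
from urllib.robotparser import RobotFileParser


def parser_for(text):
    """Hypothetical helper: build a parser from robots.txt content."""
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp


class EmptyRobotsTxtTest(unittest.TestCase):
    def test_empty_file_allows_all(self):
        # An empty file has no groups at all, so no rules apply.
        rp = parser_for("")
        self.assertTrue(rp.can_fetch("*", "/"))
        self.assertTrue(rp.can_fetch("*", "/admin"))

    def test_empty_disallow_allows_all(self):
        # The explicit "allow everything" form from robotstxt.org.
        rp = parser_for("User-agent: *\nDisallow:\n")
        self.assertTrue(rp.can_fetch("*", "/admin"))

    def test_disallow_root_blocks_all(self):
        # Contrast case: a real "block all" file must still block.
        rp = parser_for("User-agent: *\nDisallow: /\n")
        self.assertFalse(rp.can_fetch("*", "/admin"))
```

Saved to a file, this can be run with "python -m unittest" followed by the file name. The third case is there so that a fix for the empty file does not accidentally turn "Disallow: /" into "allow all".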
History
Date User Action Args
2020-01-02 15:45:55  gallicrooster  set  messages: + msg359202
2020-01-02 08:36:56  xtreak  set  nosy: + berker.peksag, xtreak; messages: + msg359185
2020-01-02 03:41:13  gallicrooster  set  nosy: + gallicrooster; messages: + msg359180
2018-12-17 10:02:06  larsfuse  set  messages: + msg331963
2018-12-14 21:08:55  terry.reedy  set  versions: + Python 3.8, - Python 2.7; nosy: + terry.reedy; messages: + msg331870; type: behavior -> enhancement; stage: test needed
2018-12-11 09:30:47  larsfuse  create