classification
Title: False positive hazards in robotparser
Type: behavior
Stage: test needed
Components: Library (Lib)
Versions: Python 3.5, Python 3.4, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: rhettinger
Nosy List: ethan.furman, python-dev, rhettinger, skip.montanaro, taleinat
Priority: normal
Keywords: patch

Created on 2014-05-10 16:55 by rhettinger, last changed 2014-05-13 05:58 by rhettinger. This issue is now closed.

Files
File name            Uploaded                      Description
fix_false_pos.diff   rhettinger, 2014-05-11 18:21  Draft patch -- needs tests
fix_false_pos2.diff  rhettinger, 2014-05-11 18:50  Move modified() to parse()
Messages (10)
msg218226 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-05-10 16:55
* The can_fetch() method is not checking to see if read() has been called, so it returns false positives if read() has not been called.

* When read() is called, it fails to call modified() so that mtime() returns an incorrect result.  The user has to manually call modified() to update the mtime().

>>> from urllib.robotparser import RobotFileParser
>>> rp = RobotFileParser('http://en.wikipedia.org/robots.txt')
>>> rp.can_fetch('UbiCrawler', 'http://en.wikipedia.org/index.html')
True
>>> rp.read()
>>> rp.can_fetch('UbiCrawler', 'http://en.wikipedia.org/index.html')
False
>>> rp.mtime()
0
>>> rp.modified()
>>> rp.mtime()
1399740268.628497

Suggested improvements:

1) Trigger internal calls to modified() every time the parser is modified using read() or add_entry().  That would ensure that mtime() actually reflects the modification time.

2) Raise an exception or return False whenever can_fetch() is called and the mtime() is zero (meaning that the parser has not been initialized with any rules).
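
Suggestion 2 can also be approximated from the caller's side without changing the module. The following is only a sketch: safe_can_fetch() is a hypothetical helper (not part of urllib.robotparser), and the rule lines are made up for illustration. It treats an mtime() of zero as "no rules loaded yet" and answers False in that case.

```python
from urllib.robotparser import RobotFileParser

URL = 'http://en.wikipedia.org/index.html'


def safe_can_fetch(rp, useragent, url):
    """Hypothetical helper: answer False until rules have been loaded."""
    # mtime() stays 0 until modified() has been called, so a zero value
    # means no robots.txt has been fetched or parsed yet.
    if not rp.mtime():
        return False
    return rp.can_fetch(useragent, url)


rp = RobotFileParser()
before = safe_can_fetch(rp, 'UbiCrawler', URL)   # no rules loaded yet

# Load illustrative rules by hand and stamp the time explicitly, which is
# what a caller has to do when parse() does not call modified() itself.
rp.parse(['User-agent: UbiCrawler', 'Disallow: /'])
rp.modified()
blocked = safe_can_fetch(rp, 'UbiCrawler', URL)  # agent is disallowed
other = safe_can_fetch(rp, 'OtherBot', URL)      # no rule applies
```

Here only the last call answers True, because it is the only one made after rules were loaded and for an agent that no rule excludes.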
msg218284 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-05-11 18:21
Attaching a draft patch:

* Repair the broken link to norobots-rfc.txt.

* HTTP response codes >= 500 are treated as a failed read rather than as "not found".  A "not found" response means that we can assume the entire site is allowed.  A 5xx server error tells us nothing.

* A successful read() updates the mtime (which is defined to be "the time the robots.txt file was last fetched").

* The can_fetch() method returns False unless we've had a read() with a 2xx or 4xx response.  This avoids false positives in the case where a user calls can_fetch() before calling read().
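
The status-code handling described in these bullets can be summarized as a small decision function. This is a sketch rather than the patch itself: classify_robots_status() and its return labels are hypothetical names. The 401/403 branch reflects how the module's read() treats access-denied responses and is not stated in this message.

```python
def classify_robots_status(status):
    """Hypothetical helper: map an HTTP status to a parser decision."""
    if 200 <= status < 300:
        return 'parse'         # a robots.txt body was returned; parse it
    if status in (401, 403):
        return 'disallow_all'  # access denied: assume nothing may be fetched
    if 400 <= status < 500:
        return 'allow_all'     # no robots.txt: the whole site is allowed
    return 'unknown'           # 5xx and others: failed read, no information


decisions = {code: classify_robots_status(code) for code in (200, 403, 404, 503)}
```

Only the 'unknown' outcome leaves the parser without a successful read, which is the case where can_fetch() should keep returning False.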
msg218285 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-05-11 18:49
Updated the patch to move the modified() call to parse().  That lets the mtime update whenever the rules change (either through a read() or by directly parsing text).
msg218325 - Author: Tal Einat (taleinat) * (Python committer) Date: 2014-05-12 16:07
Changes LGTM.

This module could certainly use some cleanup and updates. For example, last_changed should be a property and always accessed one way (instead of either .mtime() or .last_changed) and should be initialized to None instead of zero to avoid ambiguity, and the and/or trick should be replaced with if/else. Would anyone review such a patch if I created one?
msg218356 - Author: Skip Montanaro (skip.montanaro) * Date: 2014-05-12 19:39
Can this change be (easily) tested? If so, a test case akin to your original example would be nice.
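
One possible shape for such a test, avoiding network access by feeding rules straight to parse(). This is a sketch under assumptions: the class name, test names, and rule lines are made up, and the assertions only pass on an interpreter that includes the fix (can_fetch() returning False before any rules are loaded, and parse() updating the mtime).

```python
import unittest
from urllib.robotparser import RobotFileParser


class FalsePositiveTest(unittest.TestCase):
    # Hypothetical test case; the names and rules are illustrative only.
    URL = 'http://en.wikipedia.org/index.html'

    def test_can_fetch_before_rules_are_loaded(self):
        rp = RobotFileParser()
        # Nothing has been read or parsed, so mtime() is still zero
        # and no URL should be reported as fetchable.
        self.assertEqual(rp.mtime(), 0)
        self.assertFalse(rp.can_fetch('UbiCrawler', self.URL))

    def test_parse_updates_mtime(self):
        rp = RobotFileParser()
        rp.parse(['User-agent: UbiCrawler', 'Disallow: /'])
        # parse() is expected to call modified(), so mtime() is non-zero.
        self.assertGreater(rp.mtime(), 0)
        self.assertFalse(rp.can_fetch('UbiCrawler', self.URL))
        self.assertTrue(rp.can_fetch('OtherBot', self.URL))
```

The cases can be run with the standard unittest runner, for example python -m unittest on the file that contains them.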
msg218399 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-05-13 04:24
> Changes LGTM.

Thanks for the review :-)

> This module could certainly use some cleanup and updates.

Yes, the API is a mess, but I would like to be very conservative with API modifications (preferably none at all) so we don't break the code of the very few people who ever cared enough to use this module.  My goal here was just to fix the risk of false positives.

> For example, last_changed should be a property and always 
> accessed one way (instead of either .mtime() or .last_changed)
> and should be initialized to None instead of zero to avoid ambiguity,

It's too late for fixing the published API.  The time for that was when the module was introduced.

> and the and/or trick should be replaced with if/else.

Yes, would be a reasonable minor clean-up that wouldn't affect the API.
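
For readers unfamiliar with the idiom, here is a sketch of why the and/or trick is worth replacing. The function names are illustrative and are not taken from the module. The trick only works while the "true" branch value is itself truthy. A conditional expression has no such restriction.

```python
def label_and_or(allowance):
    # Old idiom: correct here only because 'Allow' is a truthy string.
    return allowance and 'Allow' or 'Disallow'


def label_if_else(allowance):
    # Conditional expression: correct whatever the branch values are.
    return 'Allow' if allowance else 'Disallow'


def pick_and_or(flag, value, default):
    # The pitfall: a falsy `value` falls through to `default`
    # even when `flag` is true.
    return flag and value or default


def pick_if_else(flag, value, default):
    return value if flag else default
```

With flag true and value 0, pick_and_or() returns the default while pick_if_else() returns 0. The two label functions agree for both True and False.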

>  Would anyone review such a patch if I created one?

Yes.  Just add the one-line patch to this tracker item and I'll incorporate it with the rest.

FWIW, it is perfectly reasonable to add new well-designed API extensions.  You can post patches to the open tracker items for Bug 16099 and Bug 21475.
msg218402 - Author: Roundup Robot (python-dev) Date: 2014-05-13 04:57
New changeset 4ea86cd87f95 by Raymond Hettinger in branch '3.4':
Issue 21469:  Mitigate risk of false positives with robotparser.
http://hg.python.org/cpython/rev/4ea86cd87f95
msg218403 - Author: Roundup Robot (python-dev) Date: 2014-05-13 05:05
New changeset f67cf5747a26 by Raymond Hettinger in branch '3.4':
Issue 21469:  Add missing news item
http://hg.python.org/cpython/rev/f67cf5747a26
msg218404 - Author: Roundup Robot (python-dev) Date: 2014-05-13 05:19
New changeset d4fd55278cec by Raymond Hettinger in branch '2.7':
Issue 21469:  Mitigate risk of false positives with robotparser.
http://hg.python.org/cpython/rev/d4fd55278cec
msg218405 - Author: Roundup Robot (python-dev) Date: 2014-05-13 05:22
New changeset 560320c10564 by Raymond Hettinger in branch 'default':
Issue 21469:  Minor code modernization (convert and/or expression to an if/else expression).
http://hg.python.org/cpython/rev/560320c10564
History
Date                 User            Action  Args
2014-05-13 05:58:20  rhettinger      set     status: open -> closed; resolution: fixed
2014-05-13 05:22:57  python-dev      set     messages: + msg218405
2014-05-13 05:19:00  python-dev      set     messages: + msg218404
2014-05-13 05:05:32  python-dev      set     messages: + msg218403
2014-05-13 04:57:26  python-dev      set     nosy: + python-dev; messages: + msg218402
2014-05-13 04:24:53  rhettinger      set     messages: + msg218399
2014-05-12 19:39:22  skip.montanaro  set     nosy: + skip.montanaro; messages: + msg218356
2014-05-12 16:07:12  taleinat        set     nosy: + taleinat; messages: + msg218325
2014-05-12 07:26:43  ethan.furman    set     nosy: + ethan.furman
2014-05-11 22:10:49  rhettinger      set     title: Hazards in robots.txt parser -> False positive hazards in robotparser; stage: test needed
2014-05-11 18:50:13  rhettinger      set     files: + fix_false_pos2.diff
2014-05-11 18:49:04  rhettinger      set     messages: + msg218285
2014-05-11 18:21:45  rhettinger      set     files: + fix_false_pos.diff; assignee: rhettinger; messages: + msg218284; keywords: + patch
2014-05-10 16:55:10  rhettinger      create