This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: robotparser doesn't return crawl delay for default entry
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.7, Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: berker.peksag, pwirtz, python-dev
Priority: normal Keywords: patch

Created on 2015-10-14 01:21 by pwirtz, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
robotparser_crawl_delay.patch pwirtz, 2015-10-14 01:21 patch review
robotparser_crawl_delay_v2.patch pwirtz, 2015-10-14 18:35 review
issue25400_v2.diff berker.peksag, 2016-09-18 15:36 review
issue25400_v3.diff berker.peksag, 2016-09-18 16:01 review
Pull Requests
URL Status Linked Edit
PR 552 closed dstufft, 2017-03-31 16:36
Messages (8)
msg252971 - (view) Author: Peter Wirtz (pwirtz) * Date: 2015-10-14 01:21
After changeset http://hg.python.org/lookup/dbed7cacfb7e, calling the crawl_delay method for a robots.txt file that has a crawl-delay for the * user agent always returns None.

Ex:

Python 3.6.0a0 (default:1aae9b6a6929+, Oct  9 2015, 22:08:05)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.robotparser
>>> parser = urllib.robotparser.RobotFileParser()
>>> parser.set_url('https://www.carthage.edu/robots.txt')
>>> parser.read()
>>> parser.crawl_delay('test_robotparser')
>>> parser.crawl_delay('*')
>>> print(parser.default_entry.delay)
120
>>>

Excerpt from https://www.carthage.edu/robots.txt:

User-agent: *
Crawl-Delay: 120
Disallow: /cgi-bin

I have written a patch that solves this. With the patch applied, the output is:

Python 3.6.0a0 (default:1aae9b6a6929+, Oct  9 2015, 22:08:05)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.robotparser
>>> parser = urllib.robotparser.RobotFileParser()
>>> parser.set_url('https://www.carthage.edu/robots.txt')
>>> parser.read()
>>> parser.crawl_delay('test_robotparser')
120
>>> parser.crawl_delay('*')
120
>>> print(parser.default_entry.delay)
120
>>>

This also applies to the request_rate method.
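The failure can also be reproduced without network access by feeding the same rules to the parser directly (a minimal sketch; `parser.modified()` is called here so `crawl_delay` does not bail out on a file that was never fetched):

```python
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: *
Crawl-Delay: 120
Disallow: /cgi-bin
"""

parser = urllib.robotparser.RobotFileParser()
parser.modified()  # mark the file as fetched so crawl_delay() doesn't return early
parser.parse(ROBOTS_TXT.splitlines())

# The * entry is stored as parser.default_entry, not in parser.entries,
# so the lookup loop finds no match. Unpatched versions then return None;
# with the fix, both calls fall back to the default entry's delay.
print(parser.crawl_delay('test_robotparser'))
print(parser.default_entry.delay)  # 120 either way
```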
msg252972 - (view) Author: Peter Wirtz (pwirtz) * Date: 2015-10-14 01:25
This fix breaks the unit tests, though. I am not sure how to go about fixing those, as this would be my first contribution to Python, and to an open source project in general.
msg253015 - (view) Author: Peter Wirtz (pwirtz) * Date: 2015-10-14 18:16
On further inspection, it appears that the way the tests are written, a test case can only be checked for one user agent at a time. I will attempt to rework the tests so they run correctly. Any advice would be much appreciated.
msg253016 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2015-10-14 18:22
Thanks for the patch, Peter (and welcome to Python and open source development). I have a WIP patch that rewrites test_robotparser in a less magical way, so we can ignore the test failures for now. I'll take a closer look at your patch.
msg253017 - (view) Author: Peter Wirtz (pwirtz) * Date: 2015-10-14 18:35
OK, in the meantime I reworked the test so that it checks the right thing and the tests pass. There does seem to be some magic involved, so I hope I did not overlook anything. Here is the new patch.
msg275776 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2016-09-11 11:55
I've now updated Lib/test/test_robotparser.py (issue 25497). Peter, do you have time to update your patch? Thanks!
msg276897 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2016-09-18 15:36
Here's an updated patch.
msg276900 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-09-18 17:17
New changeset d5d910cfd288 by Berker Peksag in branch '3.6':
Issue #25400: RobotFileParser now correctly returns default values for crawl_delay and request_rate
https://hg.python.org/cpython/rev/d5d910cfd288

New changeset 911070065e38 by Berker Peksag in branch 'default':
Issue #25400: Merge from 3.6
https://hg.python.org/cpython/rev/911070065e38
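The change amounts to a fallback in crawl_delay() (and likewise request_rate()): after looking for an entry matching the given user agent, consult default_entry before giving up. A minimal sketch as a subclass (hypothetical name, for illustration; the actual fix edits RobotFileParser in Lib/urllib/robotparser.py directly):

```python
import urllib.robotparser

class PatchedRobotFileParser(urllib.robotparser.RobotFileParser):
    """Sketch of the fallback behaviour the changeset introduces."""

    def crawl_delay(self, useragent):
        if not self.mtime():
            return None  # robots.txt was never fetched
        for entry in self.entries:
            if entry.applies_to(useragent):
                return entry.delay
        # New: fall back to the default (*) entry instead of
        # returning None when no specific entry matches.
        # request_rate() gets the same treatment.
        if self.default_entry:
            return self.default_entry.delay
        return None
```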
History
Date User Action Args
2022-04-11 14:58:22  admin          set  github: 69586
2017-03-31 16:36:30  dstufft        set  pull_requests: + pull_request1034
2016-09-18 17:18:17  berker.peksag  set  status: open -> closed
                                         resolution: fixed
                                         stage: patch review -> resolved
2016-09-18 17:17:29  python-dev     set  nosy: + python-dev
                                         messages: + msg276900
2016-09-18 16:01:20  berker.peksag  set  files: + issue25400_v3.diff
2016-09-18 15:36:23  berker.peksag  set  files: + issue25400_v2.diff
                                         messages: + msg276897
                                         versions: + Python 3.7
2016-09-11 11:55:50  berker.peksag  set  messages: + msg275776
2015-10-14 18:35:14  pwirtz         set  files: + robotparser_crawl_delay_v2.patch
                                         messages: + msg253017
2015-10-14 18:22:35  berker.peksag  set  messages: + msg253016
                                         stage: patch review
2015-10-14 18:16:25  pwirtz         set  messages: + msg253015
2015-10-14 09:10:17  berker.peksag  set  nosy: + berker.peksag
2015-10-14 01:25:01  pwirtz         set  messages: + msg252972
2015-10-14 01:21:42  pwirtz         create