classification
Title: robotparser doesn't support request rate and crawl delay parameters
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: berker.peksag Nosy List: XapaJIaMnu, berker.peksag, christian.heimes, hynek, orsenthil, python-dev, rhettinger
Priority: normal Keywords: easy, needs review, patch

Created on 2012-10-01 12:58 by XapaJIaMnu, last changed 2015-10-08 09:34 by berker.peksag. This issue is now closed.

Files
File name Uploaded Description Edit
robotparser.patch XapaJIaMnu, 2012-10-01 12:58 patch for robotparser.py
robotparser.patch XapaJIaMnu, 2012-10-01 13:37 same patch for python3X
robotparser.patch XapaJIaMnu, 2012-10-07 18:20 Changes + test cases + documentation review
robotparser_reformatted.patch XapaJIaMnu, 2012-10-07 19:56 Changes, test cases, documentation, reformatted review
robotparser_v2.patch XapaJIaMnu, 2013-12-10 00:22 V2 with fixes review
robotparser_v3.patch XapaJIaMnu, 2014-05-27 09:29 V3 crawl delay and request rate patch review
Messages (17)
msg171711 - (view) Author: Nikolay Bogoychev (XapaJIaMnu) Date: 2012-10-01 12:58
Robotparser doesn't support two quite important optional parameters from the robots.txt file. I have implemented those in the following way:
(Robotparser should be initialized in the usual way:
rp = robotparser.RobotFileParser()
rp.set_url(..)
rp.read()
)

crawl_delay(useragent) - Returns the time in seconds that you need to wait between crawls;
if none is specified, or it doesn't apply to this user agent, returns -1.
request_rate(useragent) - Returns a list in the form [requests, seconds];
if none is specified, or it doesn't apply to this user agent, returns -1.
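As committed for Python 3.6 (see msg252521 below), these lookups return None rather than -1 when no value applies, and request_rate returns a named tuple. A minimal sketch of the final urllib.robotparser API, using parse() on literal lines so no network access is needed:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Crawl-delay: 5
Request-rate: 3/20
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.crawl_delay("*"))          # 5 (seconds between requests)
rate = rp.request_rate("*")         # named tuple (requests, seconds)
print(rate.requests, rate.seconds)  # 3 20
print(rp.can_fetch("*", "http://example.com/private/page"))  # False
```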
msg171712 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-10-01 13:16
Thanks for the patch. New features must be implemented in Python 3.4. Python 2.7 is in feature freeze mode and therefore doesn't get new features.
msg171715 - (view) Author: Nikolay Bogoychev (XapaJIaMnu) Date: 2012-10-01 13:37
Okay, sorry, I didn't know that (:
Here's the same patch (same functionality) for Python 3.

Feedback is welcome, as always (:
msg171719 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-10-01 13:52
We have a team that mentors new contributors. If you are interested in getting your patch into Python 3.4, please read http://pythonmentors.com/ . The people there are really friendly and will help you with every step of the process.
msg172327 - (view) Author: Nikolay Bogoychev (XapaJIaMnu) Date: 2012-10-07 18:20
Okay, here's a proper patch with a documentation entry and test cases.
Please review and comment.
msg172338 - (view) Author: Nikolay Bogoychev (XapaJIaMnu) Date: 2012-10-07 19:56
Reformatted patch
msg205567 - (view) Author: Nikolay Bogoychev (XapaJIaMnu) Date: 2013-12-08 14:41
Hey,
it has been more than a year since the last activity.
Is there anything else I should do in order for someone on the Python dev team to review my changes and perhaps give some feedback?

Nick
msg205641 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2013-12-09 02:31
I left a few comments on Rietveld.
msg205755 - (view) Author: Nikolay Bogoychev (XapaJIaMnu) Date: 2013-12-10 00:22
Thank you for the review!
I have addressed your comments and released a v2 of the patch.
Highlights:
 - No longer crashes when given a malformed Crawl-delay parameter or a malformed robots.txt.
 - Returns None when a parameter is missing or its syntax is invalid.
 - Simplified several functions.
 - Extended tests.
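The None-on-malformed behavior can be demonstrated against the API as it later landed in the stdlib (a sketch; parse() takes the file's lines directly, so no network is needed):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# "fast" is not a number, so the value is discarded during parsing
rp.parse(["User-agent: *", "Crawl-delay: fast"])
print(rp.crawl_delay("*"))   # None

rp2 = RobotFileParser()
rp2.parse(["User-agent: *", "Disallow: /tmp/"])  # no Crawl-delay at all
print(rp2.crawl_delay("*"))  # None
```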

http://bugs.python.org/review/16099/diff/6206/Doc/library/urllib.robotparser.rst
File Doc/library/urllib.robotparser.rst (right):

http://bugs.python.org/review/16099/diff/6206/Doc/library/urllib.robotparser....
Doc/library/urllib.robotparser.rst:56: .. method:: crawl_delay(useragent)
On 2013/12/09 03:30:54, berkerpeksag wrote:
> Is crawl_delay used for search engines? Google recommends you to set crawl speed
> via Google Webmaster Tools instead.
> 
> See https://support.google.com/webmasters/answer/48620?hl=en.
 
The crawl delay and request rate parameters are aimed at the custom crawlers that many people and companies write for specific tasks. Google Webmaster Tools is targeted only at Google's crawler, and web admins typically set different rates for Google/Yahoo/Bing than for all other user agents.

http://bugs.python.org/review/16099/diff/6206/Lib/urllib/robotparser.py
File Lib/urllib/robotparser.py (right):

http://bugs.python.org/review/16099/diff/6206/Lib/urllib/robotparser.py#newco...
Lib/urllib/robotparser.py:168: for entry in self.entries:
On 2013/12/09 03:30:54, berkerpeksag wrote:
> Is there a better way to calculate this? (perhaps O(1)?)

I have followed the model of what was written beforehand. An O(1) implementation (probably based on dictionaries) would require a complete rewrite of this library, as all previously implemented functions employ the

for entry in self.entries:
    if entry.applies_to(useragent):

logic. I don't think it matters much here, as these two functions only need to be called once per domain, and a robots.txt seldom contains more than 3 entries. That is why I have followed the design laid out by the original developer.
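For illustration only, the dictionary-based O(1) alternative discussed above might look like this hypothetical sketch (Entry and crawl_delay here are simplified stand-ins for robotparser's internals, not the committed code):

```python
class Entry:
    # Simplified stand-in for robotparser's Entry.
    def __init__(self, useragents, delay=None):
        self.useragents = useragents
        self.delay = delay

def build_index(entries):
    # One O(n) pass at parse time; every later lookup is a dict hit.
    index = {}
    for entry in entries:
        for agent in entry.useragents:
            index.setdefault(agent.lower(), entry)
    return index

def crawl_delay(index, useragent):
    # Exact-match lookup, falling back to the wildcard entry.
    entry = index.get(useragent.lower()) or index.get("*")
    return entry.delay if entry else None

entries = [Entry(["GoogleBot"], delay=2), Entry(["*"], delay=10)]
index = build_index(entries)
print(crawl_delay(index, "googlebot"))  # 2
print(crawl_delay(index, "mybot"))      # 10
```

Note that the real applies_to does substring matching on the user-agent string (so "googlebot" matches "googlebot/2.1"), which a plain dict lookup cannot replicate directly; that is one more reason the linear scan was kept.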

Thanks

Nick
msg205761 - (view) Author: Nikolay Bogoychev (XapaJIaMnu) Date: 2013-12-10 00:41
Oh... Sorry for the spam. Could you please verify my documentation link syntax? I'm not entirely sure I got it right.
msg208721 - (view) Author: Nikolay Bogoychev (XapaJIaMnu) Date: 2014-01-21 23:30
Hey,

Just a friendly reminder that there hasn't been any activity for a month and that I have released a v2 pending review (:
msg219212 - (view) Author: Nikolay Bogoychev (XapaJIaMnu) Date: 2014-05-27 09:29
Updated patch, all comments addressed; sorry for the six-month delay. Please review.
msg223099 - (view) Author: Nikolay Bogoychev (XapaJIaMnu) Date: 2014-07-15 10:38
Hey,

Just a friendly reminder that there has been no activity for a month and a half and that v3 is pending review (:
msg225916 - (view) Author: Nikolay Bogoychev (XapaJIaMnu) Date: 2014-08-26 13:15
Hey,

Just a friendly reminder that the patch is pending review and there has been no activity for 3 months (:
msg252483 - (view) Author: Nikolay Bogoychev (XapaJIaMnu) Date: 2015-10-07 20:01
Hey,

Friendly reminder that there has been no activity on this issue for more than a year.

Cheers,

Nick
msg252521 - (view) Author: Roundup Robot (python-dev) Date: 2015-10-08 09:27
New changeset dbed7cacfb7e by Berker Peksag in branch 'default':
Issue #16099: RobotFileParser now supports Crawl-delay and Request-rate
https://hg.python.org/cpython/rev/dbed7cacfb7e
msg252525 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2015-10-08 09:34
I've finally committed your patch to default. Thank you for not giving up, Nikolay :)

Note that currently the link in the example section doesn't work. I will open a new issue for that.
History
Date User Action Args
2015-10-08 09:34:21 berker.peksag set status: open -> closed
versions: + Python 3.6, - Python 3.5
messages: + msg252525

resolution: fixed
stage: patch review -> resolved
2015-10-08 09:27:17 python-dev set nosy: + python-dev
messages: + msg252521
2015-10-07 20:01:21 XapaJIaMnu set messages: + msg252483
2014-08-26 13:15:36 XapaJIaMnu set messages: + msg225916
2014-07-15 10:38:02 XapaJIaMnu set messages: + msg223099
2014-07-07 10:35:56 berker.peksag set assignee: berker.peksag
2014-05-27 09:29:06 XapaJIaMnu set files: + robotparser_v3.patch

messages: + msg219212
2014-05-13 04:21:46 rhettinger set assignee: rhettinger -> (no value)
2014-05-12 15:01:15 rhettinger set assignee: rhettinger

nosy: + rhettinger
2014-01-21 23:30:03 XapaJIaMnu set messages: + msg208721
2013-12-10 00:41:57 XapaJIaMnu set messages: + msg205761
2013-12-10 00:22:51 XapaJIaMnu set files: + robotparser_v2.patch

messages: + msg205755
2013-12-09 02:31:39 berker.peksag set nosy: + berker.peksag

messages: + msg205641
versions: + Python 3.5, - Python 3.4
2013-12-08 14:41:56 XapaJIaMnu set messages: + msg205567
2012-11-02 07:34:19 hynek set nosy: + orsenthil
2012-10-08 06:43:39 hynek set nosy: + hynek
2012-10-07 20:03:39 christian.heimes set keywords: + needs review
stage: test needed -> patch review
2012-10-07 19:56:11 XapaJIaMnu set files: + robotparser_reformatted.patch

messages: + msg172338
2012-10-07 18:20:53 XapaJIaMnu set files: + robotparser.patch

messages: + msg172327
2012-10-01 13:52:20 christian.heimes set keywords: + easy, - gsoc

messages: + msg171719
2012-10-01 13:37:51 XapaJIaMnu set files: + robotparser.patch

messages: + msg171715
2012-10-01 13:16:01 christian.heimes set versions: + Python 3.4, - Python 2.7
nosy: + christian.heimes

messages: + msg171712

keywords: + gsoc
stage: test needed
2012-10-01 12:58:25 XapaJIaMnu create