Author XapaJIaMnu
Recipients XapaJIaMnu, berker.peksag, christian.heimes, hynek, orsenthil
Date 2013-12-10.00:22:50
Message-id <1386634972.01.0.759679679276.issue16099@psf.upfronthosting.co.za>
Content
Thank you for the review!
I have addressed your comments and released a v2 of the patch.
Highlights:
 No longer crashes when provided with a malformed crawl-delay/robots.txt parameter.
 Returns None when the parameter is missing or its syntax is invalid.
 Simplified several functions.
 Extended tests.
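
A minimal sketch of the behaviour described in the highlights, assuming the crawl_delay(useragent) method as it later shipped in urllib.robotparser (Python 3.6 and newer); the patch revision under review here may differ in detail. The helper delay_for and the agent name "examplebot" are hypothetical and used only for illustration.

```python
from urllib.robotparser import RobotFileParser


def delay_for(robots_txt, useragent):
    """Parse robots.txt text and return the crawl delay for useragent, or None."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(useragent)


# "examplebot" is a made-up user agent for this sketch.
valid = "User-agent: examplebot\nCrawl-delay: 5\nDisallow: /private/\n"
malformed = "User-agent: examplebot\nCrawl-delay: soon\nDisallow: /private/\n"
missing = "User-agent: examplebot\nDisallow: /private/\n"

for label, text in (("valid", valid), ("malformed", malformed), ("missing", missing)):
    # A malformed or absent Crawl-delay yields None instead of raising.
    print(label, delay_for(text, "examplebot"))
```

A caller can then treat None as "no delay requested" and fall back to its own default rate.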

http://bugs.python.org/review/16099/diff/6206/Doc/library/urllib.robotparser.rst
File Doc/library/urllib.robotparser.rst (right):

http://bugs.python.org/review/16099/diff/6206/Doc/library/urllib.robotparser....
Doc/library/urllib.robotparser.rst:56: .. method:: crawl_delay(useragent)
On 2013/12/09 03:30:54, berkerpeksag wrote:
> Is crawl_delay used for search engines? Google recommends you to set crawl speed
> via Google Webmaster Tools instead.
> 
> See https://support.google.com/webmasters/answer/48620?hl=en.
 
The crawl-delay and request-rate parameters are aimed at the custom crawlers that many people and companies write for specific tasks. Google Webmaster Tools applies only to Google's crawler, and web admins typically set different rates for Google, Yahoo, Bing, and all other user agents.

http://bugs.python.org/review/16099/diff/6206/Lib/urllib/robotparser.py
File Lib/urllib/robotparser.py (right):

http://bugs.python.org/review/16099/diff/6206/Lib/urllib/robotparser.py#newco...
Lib/urllib/robotparser.py:168: for entry in self.entries:
On 2013/12/09 03:30:54, berkerpeksag wrote:
> Is there a better way to calculate this? (perhaps O(1)?)

I have followed the model of what was written beforehand. An O(1) implementation (probably based on dictionaries) would require a complete rewrite of this library, as all previously implemented functions employ the:
for entry in self.entries:
    if entry.applies_to(useragent):

logic. I don't think this matters particularly here, as those two functions in particular need only be called once per domain and robots.txt seldom contains more than 3 entries. This is why I have just followed the design laid out by the original developer.
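
To illustrate the linear-scan pattern discussed above, here is a simplified, self-contained sketch. The Entry class and the free function crawl_delay are stand-ins written for this example, not the library's actual code; the substring-style agent matching is an assumption modelled loosely on how the module matches user agents, and it is one reason a plain dictionary lookup would not be a drop-in replacement.

```python
class Entry:
    """Hypothetical, simplified stand-in for one robots.txt record."""

    def __init__(self, useragents, delay=None):
        self.useragents = useragents  # agent names listed in the record
        self.delay = delay            # crawl delay in seconds, or None

    def applies_to(self, useragent):
        # Compare only the product name, case-insensitively ("Bot/1.0" -> "bot").
        name = useragent.split("/")[0].lower()
        return any(agent == "*" or agent.lower() in name
                   for agent in self.useragents)


def crawl_delay(entries, useragent):
    """Return the delay of the first matching entry: O(n) in the entry count."""
    for entry in entries:
        if entry.applies_to(useragent):
            return entry.delay
    return None


entries = [Entry(["googlebot"], delay=1), Entry(["*"], delay=10)]
print(crawl_delay(entries, "Googlebot/2.1"))
print(crawl_delay(entries, "examplebot"))
```

With only a handful of entries per robots.txt and one lookup per domain, the scan cost is negligible in practice.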

Thanks

Nick