
Author pwirtz
Recipients pwirtz
Date 2015-10-14.01:21:41
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1444785702.73.0.849541373226.issue25400@psf.upfronthosting.co.za>
In-reply-to
Content
After changeset http://hg.python.org/lookup/dbed7cacfb7e, calling the crawl_delay method on a robots.txt file that has a Crawl-delay rule for the * user agent always returns None.

Example:

Python 3.6.0a0 (default:1aae9b6a6929+, Oct  9 2015, 22:08:05)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.robotparser
>>> parser = urllib.robotparser.RobotFileParser()
>>> parser.set_url('https://www.carthage.edu/robots.txt')
>>> parser.read()
>>> parser.crawl_delay('test_robotparser')
>>> parser.crawl_delay('*')
>>> print(parser.default_entry.delay)
120
>>>

Excerpt from https://www.carthage.edu/robots.txt:

User-agent: *
Crawl-Delay: 120
Disallow: /cgi-bin
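
As far as I can tell, the cause is that parse() stores this "User-agent: *" group in RobotFileParser.default_entry rather than in RobotFileParser.entries, and crawl_delay only scans entries, so the delay is never found. The split is easy to see by feeding the excerpt straight to parse() (a minimal, self-contained illustration, not taken from the session above):

import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
# Parse the excerpt directly instead of fetching the live file.
parser.parse("""\
User-agent: *
Crawl-Delay: 120
Disallow: /cgi-bin
""".splitlines())

# The "*" group lands in default_entry, not in the per-agent list that
# crawl_delay iterates over, which is why the lookup comes back empty.
print(parser.entries)              # []
print(parser.default_entry.delay)  # 120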

I have written a patch that solves this. With the patch applied, the output is:

Python 3.6.0a0 (default:1aae9b6a6929+, Oct  9 2015, 22:08:05)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.robotparser
>>> parser = urllib.robotparser.RobotFileParser()
>>> parser.set_url('https://www.carthage.edu/robots.txt')
>>> parser.read()
>>> parser.crawl_delay('test_robotparser')
120
>>> parser.crawl_delay('*')
120
>>> print(parser.default_entry.delay)
120
>>>

This also applies to the request_rate method.
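
Roughly, the effect of the patch is that both lookups fall back to default_entry when no per-agent entry matches the given user agent. The same behaviour can be sketched from outside the module with a small subclass (illustration only, with a made-up class name; the actual patch edits Lib/urllib/robotparser.py directly):

import urllib.robotparser

class DefaultAwareRobotFileParser(urllib.robotparser.RobotFileParser):
    # Illustrative fallback to the "*" entry; not the patch itself.

    def crawl_delay(self, useragent):
        delay = super().crawl_delay(useragent)
        if delay is None and self.default_entry is not None:
            delay = self.default_entry.delay
        return delay

    def request_rate(self, useragent):
        rate = super().request_rate(useragent)
        if rate is None and self.default_entry is not None:
            rate = self.default_entry.req_rate
        return rate

Used in place of RobotFileParser in the session above, this gives the same 120 results shown with the patch.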
History
Date                 User    Action  Args
2015-10-14 01:21:42  pwirtz  set     recipients: + pwirtz
2015-10-14 01:21:42  pwirtz  set     messageid: <1444785702.73.0.849541373226.issue25400@psf.upfronthosting.co.za>
2015-10-14 01:21:42  pwirtz  link    issue25400 messages
2015-10-14 01:21:42  pwirtz  create