robotparser doesn't return crawl delay for default entry #69586
Comments
After changeset http://hg.python.org/lookup/dbed7cacfb7e, calling the crawl_delay method on a robots.txt file that has a crawl-delay for the * user agent always returns None. Example:
Python 3.6.0a0 (default:1aae9b6a6929+, Oct 9 2015, 22:08:05)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.robotparser
>>> parser = urllib.robotparser.RobotFileParser()
>>> parser.set_url('https://www.carthage.edu/robots.txt')
>>> parser.read()
>>> parser.crawl_delay('test_robotparser')
>>> parser.crawl_delay('*')
>>> print(parser.default_entry.delay)
120
>>>
Excerpt from https://www.carthage.edu/robots.txt:
User-agent: *
I have written a patch that solves this. With patch, output is:
Python 3.6.0a0 (default:1aae9b6a6929+, Oct 9 2015, 22:08:05)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.robotparser
>>> parser = urllib.robotparser.RobotFileParser()
>>> parser.set_url('https://www.carthage.edu/robots.txt')
>>> parser.read()
>>> parser.crawl_delay('test_robotparser')
120
>>> parser.crawl_delay('*')
120
>>> print(parser.default_entry.delay)
120
>>>
This also applies to the request_rate method.
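The fallback described above can be sketched without network access. This is only an illustration, not the attached patch: the helper name `crawl_delay_with_default` and the inline robots.txt contents are hypothetical, and `entries`, `default_entry`, and `Entry.applies_to` are undocumented internals of `urllib.robotparser` that may change.

```python
import urllib.robotparser

# Hypothetical robots.txt contents, modeled on the excerpt above
# (not the actual file served by carthage.edu).
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 120
"""


def crawl_delay_with_default(parser, useragent):
    """Return the crawl delay for useragent, falling back to the '*' entry."""
    # Entries that name a specific user agent take precedence.
    for entry in parser.entries:
        if entry.applies_to(useragent):
            return entry.delay
    # Nothing specific matched: use the default ('*') entry if there is one.
    if parser.default_entry is not None:
        return parser.default_entry.delay
    return None


parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse() accepts an iterable of lines
print(crawl_delay_with_default(parser, 'test_robotparser'))
print(crawl_delay_with_default(parser, '*'))
```

Because the `*` entry is stored in `default_entry` rather than in `entries`, a lookup that only walks `entries` never sees it, which would explain the None results in the first session.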
This fix breaks the unit tests, though. I am not sure how to go about checking those, as this would be my first contribution to Python and to an open source project in general.
On further inspection, it appears that, the way the tests are written, a test case can only be tested for one user agent at a time. I will attempt to rework the tests so they work correctly. Any advice would be much appreciated.
Thanks for the patch, Peter (and welcome to Python and open source development). I have a WIP patch to rewrite test_robotparser in a less magic way, so we can ignore the test failures for now. I'll take a closer look at your patch.
OK, in the meantime, I reworked the test so that it appears to test correctly and the tests pass. There does seem to be some magic, so I hope I did not overlook anything. Here is the new patch.
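For illustration, a test that covers more than one user agent against the same robots.txt might look like the sketch below. It is not the patch attached here: the class name, the agent name `figtree`, and the robots.txt contents are all made up, and the assertions assume an interpreter that already includes the default-entry fallback.

```python
import unittest
import urllib.robotparser


class CrawlDelayDefaultEntryTest(unittest.TestCase):
    # Hypothetical robots.txt with one specific entry and one default entry.
    robots_txt = """\
User-agent: figtree
Crawl-delay: 3
Disallow: /tmp

User-agent: *
Crawl-delay: 120
Disallow: /private
"""

    def setUp(self):
        self.parser = urllib.robotparser.RobotFileParser()
        self.parser.parse(self.robots_txt.splitlines())

    def test_specific_agent_uses_own_delay(self):
        self.assertEqual(self.parser.crawl_delay('figtree'), 3)

    def test_other_agents_fall_back_to_default(self):
        # subTest lets a single test method check several user agents.
        for agent in ('test_robotparser', '*'):
            with self.subTest(agent=agent):
                self.assertEqual(self.parser.crawl_delay(agent), 120)
```

Saved as a module, this can be run with `python -m unittest`.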
I've now updated Lib/test/test_robotparser.py (bpo-25497). Peter, do you have time to update your patch? Thanks!
Here's an updated patch.
New changeset d5d910cfd288 by Berker Peksag in branch '3.6':
New changeset 911070065e38 by Berker Peksag in branch 'default':