classification
Title: robotparser user agent considered hostile by mod_security rules.
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.6, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 2.7
process
Status: closed Resolution: duplicate
Dependencies: Superseder: Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.
View: 15851
Assigned To: Nosy List: berker.peksag, nagle
Priority: normal Keywords:

Created on 2016-05-19 23:21 by nagle, last changed 2016-05-20 06:55 by berker.peksag. This issue is now closed.

Messages (2)
msg265900 - (view) Author: John Nagle (nagle) Date: 2016-05-19 23:21
"robotparser" uses the default Python user agent when reading the "robots.txt" file, and there's no parameter for changing that.

Unfortunately, the "mod_security" add-on for the Apache web server, when used with the standard OWASP rule set, blacklists the default Python USER-AGENT in Rule 990002, "User Agent Identification". The rule rejects certain HTTP USER-AGENT values; one of them is "python-httplib2". So any Python program that accesses the web site with the default user agent will trigger this rule and be blocked from access.

For regular HTTP accesses, it's possible to set a user agent string on the Request object and work around this. But "robotparser" has no such option.
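For illustration, a minimal sketch of that workaround for a regular request (the crawler name "MyCrawler/1.0" is a hypothetical example, not anything from the report):

```python
import urllib.request

# Supply a non-default User-Agent so Rule 990002 does not match.
# "MyCrawler/1.0" is a placeholder name for illustration only.
req = urllib.request.Request(
    "http://example.com/",
    headers={"User-Agent": "MyCrawler/1.0"},
)

# The request would then be issued as usual, e.g.:
# with urllib.request.urlopen(req) as resp:
#     body = resp.read()
```

robotparser offers no equivalent hook, which is the gap this issue (and its superseder, issue 15851) describes.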

Worse, if "robotparser" has its read of "robots.txt" rejected, it interprets that as a "deny all" robots.txt file and returns False for every "can_fetch()" request.
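Until the stdlib grows such an option, one possible workaround is to subclass RobotFileParser and override read() to send a custom User-Agent. This is a sketch, not the library's API: the class and the "MyCrawler/1.0" name are hypothetical, and the error handling mirrors what the Python 3 stdlib read() does (401/403 becomes "deny all", other 4xx becomes "allow all"):

```python
import urllib.error
import urllib.request
import urllib.robotparser


class UserAgentRobotFileParser(urllib.robotparser.RobotFileParser):
    """Hypothetical workaround: fetch robots.txt with a custom User-Agent."""

    def __init__(self, url="", user_agent="MyCrawler/1.0"):
        super().__init__(url)
        self.user_agent = user_agent

    def read(self):
        req = urllib.request.Request(
            self.url, headers={"User-Agent": self.user_agent}
        )
        try:
            f = urllib.request.urlopen(req)
        except urllib.error.HTTPError as err:
            # Same policy as the stdlib: a 401/403 means "deny all",
            # any other 4xx means "allow all".
            if err.code in (401, 403):
                self.disallow_all = True
            elif 400 <= err.code < 500:
                self.allow_all = True
        else:
            self.parse(f.read().decode("utf-8").splitlines())
```

With this, the robots.txt fetch no longer presents the default Python user agent to mod_security, so the "deny all" fallback described above is not triggered by Rule 990002.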
msg265909 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2016-05-20 06:55
Thanks for the report. This is a duplicate of issue 15851.
History
Date User Action Args
2016-05-20 06:55:28 berker.peksag set status: open -> closed

superseder: Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

nosy: + berker.peksag
messages: + msg265909
resolution: duplicate
stage: resolved
2016-05-19 23:21:48 nagle create