This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Author bernie9998
Recipients bernie9998
Date 2011-10-27.20:30:43
SpamBayes Score 0.0003157582
Marked as misclassified No
Message-id <1319747444.2.0.491041533825.issue13281@psf.upfronthosting.co.za>
In-reply-to
Content
When attempting to parse a robots.txt file which has a blank line between allow/disallow rules, all rules after the blank line are ignored.

If a blank line occurs between the user-agent and its rules, all of the rules for that user-agent are ignored.

I am not sure if having a blank line between rules is allowed in the spec, but I am seeing this behavior in a number of sites, for instance:

http://www.whitehouse.gov/robots.txt has a blank line between the Disallow rules and all other lines, including the associated User-agent line, causing the Python RobotFileParser to ignore all of the rules.

http://www.last.fm/robots.txt appears to separate its rules with arbitrary blank lines.  The Python RobotFileParser only sees the first two rules, between the User-agent line and the next blank line.

If the parser is changed to simply ignore all blank lines, would it have any adverse effect on parsing robots.txt files?

I am including a simple patch which ignores all blank lines and appears to find all rules from these robots.txt files.
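The behavior described above can be reproduced without fetching a live site by feeding RobotFileParser.parse() a robots.txt that has a blank line between the User-agent line and its rules. The sketch below also shows the "ignore blank lines" idea as a pre-filter applied before parse(); this is an illustration of the approach, not the attached patch itself. The module path used is the Python 3 urllib.robotparser (on Python 2 the module is robotparser), and example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt with a blank line between the User-agent line and its
# rules, mirroring the whitehouse.gov example from the report.
robots_txt = """\
User-agent: *

Disallow: /private/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
# On affected versions, the rules after the blank line are dropped, so
# this wrongly reports the disallowed URL as fetchable:
print(parser.can_fetch("*", "http://example.com/private/page"))

# Sketch of the "ignore blank lines" idea: strip blank lines before
# handing the text to parse(), so each rule stays attached to its
# User-agent line.
parser = RobotFileParser()
parser.parse([line for line in robots_txt.splitlines() if line.strip()])
print(parser.can_fetch("*", "http://example.com/private/page"))  # False
print(parser.can_fetch("*", "http://example.com/public/page"))   # True
```

With the blank lines filtered out, both Disallow rules are associated with the wildcard user-agent, and can_fetch() correctly denies /private/ while still allowing unrelated paths.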
History
Date User Action Args
2011-10-27 20:30:44bernie9998setrecipients: + bernie9998
2011-10-27 20:30:44bernie9998setmessageid: <1319747444.2.0.491041533825.issue13281@psf.upfronthosting.co.za>
2011-10-27 20:30:43bernie9998linkissue13281 messages
2011-10-27 20:30:43bernie9998create