Author terry.reedy
Recipients bernie9998, ezio.melotti, orsenthil, osvenskan, petri.lehtinen, terry.reedy
Date 2011-10-31.19:00:39
Message-id <1320087644.42.0.438011120206.issue13281@psf.upfronthosting.co.za>
In-reply-to
Content
The robotparser is currently doing exactly what it is documented as doing. Section 20.9 of the docs, urllib.robotparser (parser for robots.txt), says "For more details on the structure of robots.txt files, see http://www.robotstxt.org/orig.html." (Since the section gives no previous details, 'more' should be deleted.) That page, in turn, says

'''The file consists of one or more records separated by one or more blank lines (terminated by CR,CR/NL, or NL). Each record contains lines of the form "<field>:<optionalspace><value><optionalspace>".'''

The formal grammar says the same thing. The page goes on with

'''Comments ... are discarded completely, and therefore do not indicate a record boundary.'''

followed by

'''The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored.'''

Not allowing blank lines within records is obviously, to me, intentional and not an accidental oversight: it aids error detection. Consider:

User-agent: A ...
Disallow: ...

User-aget: B ...
Disallow: ...

Currently, the blank line signals a new record, the misspelled 'User-aget' line is ignored as an unrecognised header, and the new record, starting with 'Disallow' instead of 'User-agent', is correctly seen as an error and ignored. The same would be true if the User-agent line were accidentally omitted. When humans edit files, perhaps from someone else's notes, such things happen.

With this change, the second Disallow line will be incorrectly attributed to A. We can justify that on the hypothesis that intentional blank lines within records, in violation of the standard, are now more common than missing or misspelled User-agent lines. Or we can decide that misattributing Disallow lines is a lesser sin than ignoring them. But the change is pretty plainly a feature change and not a bug fix.
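
To make the two behaviours concrete, here is a minimal sketch of a record splitter with a switch for blank-line handling. It is not the urllib.robotparser implementation; the function name parse_records, the flag allow_blank_lines_in_records, and the paths /a-only and /b-only are hypothetical, chosen only to fill in the elided values of the example above.

```python
def parse_records(text, allow_blank_lines_in_records=False):
    """Split robots.txt text into (user_agents, disallow_paths) records.

    Illustrative sketch only: handles just User-agent and Disallow lines.
    """
    records = []
    agents, rules = [], []

    def flush():
        # Keep a record only if it has both User-agent and Disallow lines.
        nonlocal agents, rules
        if agents and rules:
            records.append((agents, rules))
        agents, rules = [], []

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            # Comment-only lines never mark a record boundary;
            # truly blank lines do, unless the lenient mode is on.
            if not raw.strip() and not allow_blank_lines_in_records:
                flush()
            continue
        field, sep, value = line.partition(":")
        if not sep:
            continue
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if rules:
                # A User-agent line after Disallow lines starts a new record.
                flush()
            agents.append(value)
        elif field == "disallow":
            if agents:
                rules.append(value)
            # else: Disallow with no User-agent is an error and is ignored.
        # Unrecognised headers such as 'User-aget' fall through and are ignored.
    flush()
    return records


SAMPLE = """\
User-agent: A
Disallow: /a-only

User-aget: B
Disallow: /b-only
"""

strict = parse_records(SAMPLE)
lenient = parse_records(SAMPLE, allow_blank_lines_in_records=True)
print(strict)   # [(['A'], ['/a-only'])]
print(lenient)  # [(['A'], ['/a-only', '/b-only'])]
```

In the strict mode the orphaned Disallow line is dropped; in the lenient mode it is attached to A, which is the misattribution described above.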

My current suggested doc change is to replace the sentence quoted at the top with
"Such files are parsed according to the rules given at http://www.robotstxt.org/orig.html , with the exception that blank lines are allowed within records.
Versionchanged 3.3: allow blank lines within records"

Side note: The example in the doc uses musi-cal.com. We need a replacement as it was closed last June, as noted in
http://www.wolfgangsvault.com/blog/index.php/2011/06/closing-mojam-com-and-musi-cal-com/
History
Date                 User         Action  Args
2011-10-31 19:00:44  terry.reedy  set     recipients: + terry.reedy, orsenthil, osvenskan, ezio.melotti, bernie9998, petri.lehtinen
2011-10-31 19:00:44  terry.reedy  set     messageid: <1320087644.42.0.438011120206.issue13281@psf.upfronthosting.co.za>
2011-10-31 19:00:40  terry.reedy  link    issue13281 messages
2011-10-31 19:00:39  terry.reedy  create