Check the robots.txt file from mathworld.
--> http://mathworld.wolfram.com/robots.txt
It contains two "User-agent: *" records.
From http://www.robotstxt.org/norobots-rfc.txt
"These name tokens are used in User-agent lines in /robots.txt to
identify to which specific robots the record applies. The robot
must obey the first record in /robots.txt that contains a User-
Agent line whose value contains the name token of the robot as a
substring. The name comparisons are case-insensitive. If no such
record exists, it should obey the first record with a User-agent
line with a "*" value, if present. If no record satisfied either
condition, or no records are present at all, access is unlimited."
But it seems that our robotparser obeys the second one. The problem
occurs because robotparser assumes that no robots.txt will contain two
"User-agent: *" records. A file should not have two such records, but in
reality many sites do.
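The mismatch is easy to reproduce by feeding the parser a robots.txt with two "*" records directly; the host and paths below are made up for illustration, and the output shown assumes an interpreter where the fix is in place (on a broken one the second record wins and the results are swapped):

```python
from urllib import robotparser

# A robots.txt with two "User-agent: *" records; per the RFC only
# the first record should apply to any robot.
lines = """\
User-agent: *
Disallow: /private/

User-agent: *
Disallow: /tmp/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(lines)

# With the first record in force, /private/ is disallowed and the
# second record's /tmp/ rule is ignored.
print(rp.can_fetch("MyBot", "http://example.com/private/page"))
print(rp.can_fetch("MyBot", "http://example.com/tmp/page"))
```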
So I have changed robotparser.py as follows:
    def _add_entry(self, entry):
        if "*" in entry.useragents:
            # the default entry is considered last
            if self.default_entry is None:  # this check is added
                self.default_entry = entry
        else:
            self.entries.append(entry)
And at the end of the parse(self, lines) method:

        if state == 2:
            # self.entries.append(entry)
            self._add_entry(entry)  # necessary if there is no newline
                                    # at the end and the last User-agent is *
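The second hunk matters when the file ends without a trailing blank line and its last record is a "*" record: the final entry is only flushed into default_entry at the end of parse(). A quick check (host and path are again placeholders, output as on an interpreter with the fix applied):

```python
from urllib import robotparser

# The only record is "User-agent: *" and there is no blank line
# after it, so the entry must be flushed at the end of parse().
lines = ["User-agent: *", "Disallow: /secret/"]

rp = robotparser.RobotFileParser()
rp.parse(lines)

# If the trailing entry were dropped, everything would be allowed;
# with the fix the Disallow rule is honoured.
print(rp.can_fetch("AnyBot", "http://example.com/secret/page"))
```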