classification
Title: robotparser.py fail when more than one User-Agent: * is present
Type:
Stage:
Components: Library (Lib)
Versions: Python 2.5
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: georg.brandl, mbloore, thsajid
Priority: normal
Keywords:

Created on 2008-10-12 13:41 by thsajid, last changed 2010-07-29 17:55 by georg.brandl. This issue is now closed.

Messages (3)
msg74665 - Author: taskinoor hasan sajid (thsajid) Date: 2008-10-12 13:41
Check the robots.txt file from mathworld.

--> http://mathworld.wolfram.com/robots.txt

It contains 2 User-Agent: * lines.

From http://www.robotstxt.org/norobots-rfc.txt

"These name tokens are used in User-agent lines in /robots.txt to
identify to which specific robots the record applies. The robot
must obey the first record in /robots.txt that contains a User-
Agent line whose value contains the name token of the robot as a 
substring. The name comparisons are case-insensitive. If no such
record exists, it should obey the first record with a User-agent
line with a "*" value, if present. If no record satisfied either
condition, or no records are present at all, access is unlimited."
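
The selection rule quoted above can be sketched in a few lines. This is only an illustration of the quoted text, not the code in robotparser.py: the function name select_record, the (user_agents, rules) tuple layout and the sample records are all hypothetical.

```python
def select_record(records, robot_name):
    """Pick the rules a robot should obey, following the quoted rule.

    records is a list of (user_agents, rules) tuples in file order
    (a hypothetical layout chosen for this sketch).
    Returns None when no record applies, meaning access is unlimited.
    """
    name = robot_name.lower()
    default = None
    for agents, rules in records:
        for agent in agents:
            if agent == "*":
                # Only the FIRST "*" record counts; later ones are ignored.
                if default is None:
                    default = rules
            elif name in agent.lower():
                # First record whose User-agent value contains the
                # robot's name token, compared case-insensitively.
                return rules
    return default


# Hypothetical records: two "*" records around a specific one.
records = [
    (["*"], ["/private/"]),
    (["FooBot"], ["/foo/"]),
    (["*"], ["/"]),
]
foo_rules = select_record(records, "foobot")
other_rules = select_record(records, "OtherBot")
no_rules = select_record([], "OtherBot")
```

A specific match takes priority even when a "*" record comes earlier in the file, and a second "*" record never replaces the first.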

But our robotparser seems to obey the second one. The problem occurs
because robotparser assumes that no robots.txt will contain two
"User-Agent: *" records. A file should not have two such records, but in
reality many sites do.

So I have changed robotparser.py as follows:

    def _add_entry(self, entry):
        if "*" in entry.useragents:
            # the default entry is considered last
            if self.default_entry is None:   # this check is added
                self.default_entry = entry
        else:
            self.entries.append(entry)

And at the end of the parse(self, lines) method:

        if state == 2:
            # old code: self.entries.append(entry)
            # _add_entry is necessary if there is no newline at the end
            # and the last User-Agent is *
            self._add_entry(entry)
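
A quick way to check the patched behaviour is sketched below. It assumes Python 3, where the module lives at urllib.robotparser and already includes this fix. The robots.txt content, the ExampleBot name and the example.com URLs are made up for illustration and are not the MathWorld file.

```python
from urllib.robotparser import RobotFileParser

# Two "User-agent: *" records; the file deliberately has no blank line
# after the last record, so the end-of-parse branch is exercised too.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# If the first "*" record is obeyed, only /private/ is off limits.
# If the second one were obeyed, everything would be disallowed.
index_allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
private_allowed = parser.can_fetch("ExampleBot", "http://example.com/private/x")
print(index_allowed, private_allowed)
```

On an unpatched parser the second record would win and both URLs would be reported as disallowed.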
msg88887 - Author: mARK (mbloore) Date: 2009-06-04 16:14
This looks like a good fix. I've put it into my own copy.
msg111981 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-07-29 17:55
Thanks for the patch, fixed in r83238.
History
Date                 User          Action  Args
2010-07-29 17:55:11  georg.brandl  set     status: open -> closed
                                           nosy: + georg.brandl
                                           messages: + msg111981
                                           resolution: fixed
2009-06-04 16:14:59  mbloore       set     nosy: + mbloore
                                           messages: + msg88887
2008-10-12 13:41:31  thsajid       create