
Classification
Title: a better robotparser.py module
Components: Library (Lib)

Process
Status: closed
Assigned To: skip.montanaro
Nosy List: calvin, gvanrossum, skip.montanaro
Priority: normal
Keywords: patch

Created on 2000-11-02 17:36 by calvin, last changed 2022-04-10 16:03 by admin. This issue is now closed.

Files
File name       Uploaded                    Description
(not recorded)  calvin, 2000-11-02 17:36    (none)
Messages (10)
msg34748 - (view) Author: Bastian Kleineidam (calvin) Date: 2000-11-02 17:36
 
msg34749 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2000-11-02 19:14
Skip, can you comment on this?  
msg34750 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2001-01-05 02:31
Skip, back to you.  Please work with the author on an acceptable version.  You can check it in once you two agree.
msg34751 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2001-01-19 22:57
Skip, if this is ok with you, can you check it in?  (Unless you feel you don't want to check it in because you still feel your module is better -- in that case we should probably drop it or reassign it...)
msg34752 - (view) Author: Bastian Kleineidam (calvin) Date: 2000-11-02 17:40
I have written a new RobotParser module 'robotparser2.py'.

This module:

o is backward compatible with the old one

o matches the user agent correctly (this is buggy in
  robotparser.py)

o strips comments correctly (this is buggy in robotparser.py)

o uses httplib instead of urllib.urlopen() to catch HTTP
  connection errors correctly (this is buggy in robotparser.py)

o implements not only the draft at
  http://info.webcrawler.com/mak/projects/robots/norobots.html
  but also the newer one at
  http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html


Bastian Kleineidam
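
A minimal usage sketch of the backward compatible interface described above, assuming the module keeps the Python 2-era RobotFileParser interface of the standard robotparser module (set_url(), read(), can_fetch()); the URL is the Musi-Cal file discussed later in this thread, and the printed value depends on that file's contents:

    import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url('http://musi-cal.mojam.com/robots.txt')
    rp.read()    # fetch and parse the robots.txt file
    # can_fetch() returns true if the given user agent may fetch the URL
    print rp.can_fetch('ExtractorPro', 'http://musi-cal.mojam.com/')
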
msg34753 - (view) Author: Bastian Kleineidam (calvin) Date: 2001-01-04 23:51
Changes:
- the debug variable in the test function is now declared global
- redirection now works
- I accidentally printed "Allow" when I meant "Disallow"; this has been fixed.

It now parses the Musi-Cal robots.txt file correctly, but that robots.txt file itself has syntax errors:
before each User-agent: line there must be one or more empty lines.
msg34754 - (view) Author: Bastian Kleineidam (calvin) Date: 2001-01-06 12:31
Ok, some new changes:
- allow parsing of User-agent: lines without a preceding blank
  line
- the module is available under two licenses: the Python 2.0
  license or the GPL
- added docstrings for the classes

Bastian
msg34755 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2001-01-04 21:05
I apologize for taking so long to take a look at this.
I was reminded of it when I saw the switch from me to Guido.

I spent a little time fiddling with this module today.  I'm
not satisfied that it works as advertised.  Here are a
number of problems I found:

  * in the test function, the debug variable is not 
    declared global, so setting it to 1 has no effect

  * it never seemed to properly handle redirections, so it
    never got from

    http://www.musi-cal.com/robots.txt

    to

    http://musi-cal.mojam.com/robots.txt

  * once I worked around the redirection problem it seemed
    to parse the Musi-Cal robots.txt file incorrectly.

I replaced httplib with urllib in the read method and
got erroneous results.  If you look at the above robots.txt
file you'll see that a bunch of email address harvesters
are explicitly forbidden (not that they pay attention to 
robots.txt!).  The following should print 0, but prints 1:

    print rp.can_fetch('ExtractorPro',     
                       'http://musi-cal.mojam.com/')

This is (at least in part) due to the fact that the
redirection never works.  In the version I modified to
use urllib, it displays incorrect permissions for things like ExtractorPro:

  User-agent: ExtractorPro
  Allow: /

Note that the lines in the robots.txt file for ExtractorPro
are actually

  User-agent: ExtractorPro
  Disallow: /

Skip
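
To make the expected behaviour concrete: with the record Skip quotes, can_fetch() should return 0 (false) for ExtractorPro. A small sketch, assuming the parser exposes the usual parse() method that accepts a list of lines:

    import robotparser

    lines = [
        'User-agent: ExtractorPro',
        'Disallow: /',
    ]
    rp = robotparser.RobotFileParser()
    rp.parse(lines)
    # should print 0: ExtractorPro is disallowed everywhere
    print rp.can_fetch('ExtractorPro', 'http://musi-cal.mojam.com/')
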
msg34756 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2001-01-05 01:43
I fixed the robots.txt file, but I think you should still parse
files that lack the requisite blank lines (be lenient in what
you accept and strict in what you generate).  The
User-agent: line can serve as an implicit separator between
one record and the next.

Skip
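
What "lenient in what you accept" would mean here, sketched with the same assumed parse()/can_fetch() interface as above: two records with no blank line between them, where the second User-agent: line acts as the implicit record separator:

    import robotparser

    lines = [
        'User-agent: ExtractorPro',
        'Disallow: /',
        'User-agent: *',          # no blank line before this record
        'Disallow: /private',
    ]
    rp = robotparser.RobotFileParser()
    rp.parse(lines)
    # a lenient parser should print 0, then 1
    print rp.can_fetch('ExtractorPro', 'http://musi-cal.mojam.com/')
    print rp.can_fetch('SomeOtherBot', 'http://musi-cal.mojam.com/public')
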
msg34757 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2001-01-20 16:03
Checked in and closed.  Thanks Bastian!
History
Date                 User    Action  Args
2022-04-10 16:03:28  admin   set     github: 33441
2000-11-02 17:36:26  calvin  create