This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Author skip.montanaro
Date 2001-01-04.21:05:55
Content
I apologize for taking so long to take a look at this.
I was reminded of it when I saw the switch from me to Guido.

I spent a little time fiddling with this module today.  I'm
not satisfied that it works as advertised.  Here are a
number of problems I found:

  * in the test function, the debug variable is not 
    declared global, so setting it to 1 has no effect

  * it never seemed to properly handle redirections, so it
    never got from

    http://www.musi-cal.com/robots.txt

    to

    http://musi-cal.mojam.com/robots.txt

  * once I worked around the redirection problem it seemed
    to parse the Musi-Cal robots.txt file incorrectly.
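
The first item is ordinary Python scoping: assigning to a name inside a function binds a new local unless the name is declared global, so the module-level flag never changes. A minimal standalone sketch of the effect (not the module's actual test function):

```python
debug = 0

def test():
    # Without a "global debug" declaration, this assignment only
    # creates a function-local variable named debug; the
    # module-level debug above is untouched.
    debug = 1

test()
print(debug)  # prints 0
```

Adding `global debug` as the first statement of the function makes the assignment affect the module-level variable instead.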

I replaced httplib with urllib in the read method and
got erroneous results.  If you look at the above robots.txt
file you'll see that a bunch of email address harvesters
are explicitly forbidden (not that they pay attention to 
robots.txt!).  The following should print 0, but prints 1:

    print rp.can_fetch('ExtractorPro',
                       'http://musi-cal.mojam.com/')

This is (at least in part) due to the fact that the
redirection never works.  In the version I modified to
use urllib, it displays incorrect permissions for things like ExtractorPro:

  User-agent: ExtractorPro
  Allow: /

Note that the lines in the robots.txt file for ExtractorPro
are actually

  User-agent: ExtractorPro
  Disallow: /
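
For reference, a correct parser should treat that record as a blanket ban for the agent. A quick sketch of the expected behavior, feeding just those two lines to the parser; this uses the Python 3 module path `urllib.robotparser` (the module was plain `robotparser` in the Python 2 era this report dates from):

```python
from urllib.robotparser import RobotFileParser  # "robotparser" in Python 2

rp = RobotFileParser()
# parse() accepts the robots.txt content as a list of lines.
rp.parse([
    "User-agent: ExtractorPro",
    "Disallow: /",
])

# "Disallow: /" forbids every path for that agent, so this
# should be False -- the 1 reported above is the bug.
print(rp.can_fetch("ExtractorPro", "http://musi-cal.mojam.com/"))
```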

Skip
History
Date                 User   Action  Args
2007-08-23 15:02:26  admin  link    issue402229 messages
2007-08-23 15:02:26  admin  create