Message34755
I apologize for taking so long to take a look at this.
I was reminded of it when I saw the switch from me to Guido.
I spent a little time fiddling with this module today. I'm
not satisfied that it works as advertised. Here are a
number of problems I found:
* in the test function, the debug variable is not
  declared global, so setting it to 1 has no effect
* it never seemed to properly handle redirections, so it
  never got from
      http://www.musi-cal.com/robots.txt
  to
      http://musi-cal.mojam.com/robots.txt
* once I worked around the redirection problem, it seemed
  to parse the Musi-Cal robots.txt file incorrectly
I replaced httplib with urllib in the read method and
got erroneous results. If you look at the above robots.txt
file you'll see that a bunch of email address harvesters
are explicitly forbidden (not that they pay attention to
robots.txt!). The following should print 0, but prints 1:
    print rp.can_fetch('ExtractorPro',
                       'http://musi-cal.mojam.com/')
This is (at least in part) because the redirection never
works. In the version I modified to use urllib, it displays
incorrect permissions for agents like ExtractorPro:
    User-agent: ExtractorPro
    Allow: /
Note that the lines in the robots.txt file for ExtractorPro
are actually:
    User-agent: ExtractorPro
    Disallow: /
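For what it's worth, feeding those two lines to urllib.robotparser (the Python 3 descendant of this module) gives the answer the report expects. A small sketch: the agent name and URL are taken from the report; parsing from a list instead of fetching is my own shortcut, which sidesteps the redirection issue entirely:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse the quoted rules directly instead of fetching over the
# network, so no redirection handling is involved.
rp.parse([
    "User-agent: ExtractorPro",
    "Disallow: /",
])

# "Disallow: /" bars the harvester from the whole site, so
# can_fetch should be False (the "0" the report expects).
print(rp.can_fetch('ExtractorPro', 'http://musi-cal.mojam.com/'))
# prints False
```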
History:
Date                | User  | Action | Args
--------------------+-------+--------+---------------------
2007-08-23 15:02:26 | admin | link   | issue402229 messages
2007-08-23 15:02:26 | admin | create |