Message34755
I apologize for taking so long to take a look at this.
I was reminded of it when I saw the switch from me to Guido.
I spent a little time fiddling with this module today. I'm
not satisfied that it works as advertised. Here are a
number of problems I found:
* in the test function, the debug variable is not
  declared global, so setting it to 1 has no effect
* it never seemed to properly handle redirections, so it
  never got from
      http://www.musi-cal.com/robots.txt
  to
      http://musi-cal.mojam.com/robots.txt
* once I worked around the redirection problem, it seemed
  to parse the Musi-Cal robots.txt file incorrectly
I replaced httplib with urllib in the read method and
got erroneous results. If you look at the above robots.txt
file you'll see that a bunch of email address harvesters
are explicitly forbidden (not that they pay attention to
robots.txt!). The following should print 0, but prints 1:
    print rp.can_fetch('ExtractorPro',
                       'http://musi-cal.mojam.com/')
This is (at least in part) because the redirection never
works. In the version I modified to use urllib, it displays
incorrect permissions for agents like ExtractorPro:
    User-agent: ExtractorPro
    Allow: /
Note that the lines in the robots.txt file for ExtractorPro
are actually:
    User-agent: ExtractorPro
    Disallow: /
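For what it's worth, feeding those two lines to urllib.robotparser (the Python 3 descendant of this module) gives the answer the report expects. A small sketch: the agent name and URL are taken from the report; parsing from a list instead of fetching is my own shortcut, which sidesteps the redirection issue entirely:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse the quoted rules directly instead of fetching over the
# network, so no redirection handling is involved.
rp.parse([
    "User-agent: ExtractorPro",
    "Disallow: /",
])

# "Disallow: /" bars the harvester from the whole site, so
# can_fetch should be False (the "0" the report expects).
print(rp.can_fetch('ExtractorPro', 'http://musi-cal.mojam.com/'))
# prints False
```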
History:
Date                | User  | Action | Args
--------------------+-------+--------+---------------------
2007-08-23 15:02:26 | admin | link   | issue402229 messages
2007-08-23 15:02:26 | admin | create |