Message169718
I found that http://en.wikipedia.org/robots.txt returns HTTP 403 if the request's User-Agent is on a specific blacklist.
And since robotparser doesn't provide a mechanism to change the default user agent used by the opener, it becomes unusable for that site (and for sites with a similar policy).
I think the user should be able to set a specific user-agent string, to better identify their bot.
I'm attaching a patch that allows the user to change the opener used by RobotFileParser, in case some specific behavior is needed.
I'm also attaching a simple example of how it solves the issue, at least with Wikipedia.
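The attached patch is not included in this message, but the workaround it enables can be sketched as follows: skip RobotFileParser.read() (which fetches robots.txt with urllib's default user agent) and instead fetch the file with a custom User-Agent header, then hand the lines to parse(). The helper name read_robots and the agent string "MyBot/1.0" are illustrative, not part of the patch; the offline demo at the end exercises parse() without any network access.

```python
import urllib.request
import urllib.robotparser

def read_robots(url, user_agent):
    # Fetch robots.txt ourselves so we control the User-Agent header,
    # then feed the text to RobotFileParser.parse() instead of read().
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        lines = resp.read().decode("utf-8", "replace").splitlines()
    rp = urllib.robotparser.RobotFileParser(url)
    rp.parse(lines)
    return rp

# Offline demonstration of parse() itself, with an in-memory robots.txt:
rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])
print(rp.can_fetch("MyBot/1.0", "http://example.com/private/x"))   # False
print(rp.can_fetch("MyBot/1.0", "http://example.com/index.html"))  # True
```

Because the robots.txt text is obtained outside the parser, the same approach works for any site that rejects the default urllib user agent.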
Date                | User    | Action | Args
2012-09-02 18:36:04 | dualbus | set    | recipients: + dualbus
2012-09-02 18:36:04 | dualbus | set    | messageid: <1346610964.7.0.836759738208.issue15851@psf.upfronthosting.co.za>
2012-09-02 18:36:04 | dualbus | link   | issue15851 messages
2012-09-02 18:36:03 | dualbus | create |