
Author dualbus
Recipients dualbus
Date 2012-09-02.18:36:03
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1346610964.7.0.836759738208.issue15851@psf.upfronthosting.co.za>
In-reply-to
Content
I found that http://en.wikipedia.org/robots.txt returns a 403 if the provided user agent is on a specific blacklist.

Since robotparser doesn't provide a mechanism to change the default user agent used by the opener, it is unusable for that site (and for sites with a similar policy).
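Roughly, the behavior can be reproduced as follows (a sketch using the Python 3 urllib names; "ExampleBot/1.0" is just a placeholder agent string, and Wikipedia's exact blacklist may differ):

    import urllib.error
    import urllib.request

    url = "http://en.wikipedia.org/robots.txt"

    # The default Python-urllib/x.y user agent gets rejected.
    try:
        urllib.request.urlopen(url)
        print("default user agent: OK")
    except urllib.error.HTTPError as err:
        print("default user agent:", err.code)  # e.g. 403

    # The same request with an explicit User-Agent header goes through.
    req = urllib.request.Request(url, headers={"User-Agent": "ExampleBot/1.0"})
    print("custom user agent:", urllib.request.urlopen(req).getcode())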

I think users should be able to set a specific user agent string, to better identify their bot.

I'm attaching a patch that allows the user to change the opener used by RobotFileParser, in case some specific behavior is needed.

I'm also attaching a simple example of how it solves the issue, at least with Wikipedia.
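For reference, a workaround along these lines is already possible by subclassing RobotFileParser and re-implementing read() so the request carries an explicit User-Agent header. This is only a rough sketch using the Python 3 urllib.robotparser interface, not the attached patch; the class name and agent string are made up:

    import urllib.error
    import urllib.request
    from urllib.robotparser import RobotFileParser

    class UserAgentRobotFileParser(RobotFileParser):
        """RobotFileParser that fetches robots.txt with a custom User-Agent."""

        def __init__(self, url="", user_agent="ExampleBot/1.0"):
            super().__init__(url)
            self.user_agent = user_agent

        def read(self):
            # Same logic as the stock read(), but the request carries our
            # User-Agent header instead of the default Python-urllib string.
            try:
                request = urllib.request.Request(
                    self.url, headers={"User-Agent": self.user_agent})
                f = urllib.request.urlopen(request)
            except urllib.error.HTTPError as err:
                if err.code in (401, 403):
                    self.disallow_all = True
                elif 400 <= err.code < 500:
                    self.allow_all = True
            else:
                self.parse(f.read().decode("utf-8").splitlines())

    rp = UserAgentRobotFileParser("http://en.wikipedia.org/robots.txt")
    rp.read()
    print(rp.can_fetch("ExampleBot", "http://en.wikipedia.org/wiki/Python"))

A proper fix in the module itself would avoid every user having to duplicate the read() logic like this.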
History
Date                 User     Action  Args
2012-09-02 18:36:04  dualbus  set     recipients: + dualbus
2012-09-02 18:36:04  dualbus  set     messageid: <1346610964.7.0.836759738208.issue15851@psf.upfronthosting.co.za>
2012-09-02 18:36:04  dualbus  link    issue15851 messages
2012-09-02 18:36:03  dualbus  create