Author orsenthil
Recipients dualbus, ezio.melotti, orsenthil, terry.reedy
Date 2012-09-11.04:44:56
Message-id <CAPOVWOQYL783JFzF4e2SWf0BgTU-R=OAGfOmQMirffWw+cwWow@mail.gmail.com>
In-reply-to <20120909214501.GC5726@claret.lan>
Content
Hello Eduardo,

I fail to see the bug here. The robotparser module is for reading and
parsing the robots.txt file; the module responsible for fetching it
could be urllib. robots.txt is always available from the web server,
and you can download it by any means, even by using
RobotFileParser.read() after giving it the full URL to robots.txt. You
do not need to set a User-Agent to read or fetch the robots.txt file.
Once it is fetched, when you crawl the site with your custom-written
crawler or with urllib, you can honor the User-Agent requirement by
sending the proper headers with your request. That can be done with
the urllib module itself, and I believe there is documentation on
adding headers.
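
A minimal sketch of the workflow described above, assuming the Python 3
module names (urllib.robotparser and urllib.request). The crawler name,
the sample rules, and the helper functions load_rules and build_request
are hypothetical and only there for illustration. The rules are parsed
from a string so the example runs without network access.

```python
import urllib.request
import urllib.robotparser

# Hypothetical crawler name, used both for rule checks and request headers.
USER_AGENT = "ExampleCrawler/0.1"

# Hypothetical robots.txt content; a real crawler would download this first.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def load_rules(robots_txt_text):
    """Parse robots.txt text that has already been fetched by any means."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt_text.splitlines())
    return parser


def build_request(url, user_agent=USER_AGENT):
    """Build a urllib request that sends a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


rules = load_rules(SAMPLE_ROBOTS)

# Ask the parsed rules whether this crawler may fetch each URL.
allowed = rules.can_fetch(USER_AGENT, "http://example.com/index.html")
blocked = rules.can_fetch(USER_AGENT, "http://example.com/private/page.html")

# Only build (and later open) requests for URLs the rules allow.
request = build_request("http://example.com/index.html")
```

Passing the resulting request object to urllib.request.urlopen() would
perform the actual fetch with the custom header attached.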

I think this is the way most folks would be (or, I believe, are) using
it. Am I missing something? If my explanation above is okay, then we
can close this bug as invalid.

Thanks,
Senthil
History
Date User Action Args
2012-09-11 04:44:57 orsenthil set recipients: + orsenthil, terry.reedy, ezio.melotti, dualbus
2012-09-11 04:44:57 orsenthil link issue15851 messages
2012-09-11 04:44:56 orsenthil create