Message 170265 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	dualbus
Recipients	dualbus, ezio.melotti, orsenthil, terry.reedy
Date	2012-09-11.05:45:26
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<20120911054504.GA1146@claret.lan>
In-reply-to	<CAPOVWOQYL783JFzF4e2SWf0BgTU-R=OAGfOmQMirffWw+cwWow@mail.gmail.com>

Content
Hi Senthil, > I fail to see the bug in here. Robotparser module is for reading and > parsing the robot.txt file, the module responsible for fetching it > could urllib. You're right, but robotparser's read() does a call to urllib.request.urlopen to fetch the robots.txt file. If robotparser took a file object, or something like that instead of a Url, I wouldn't think of this as a bug, but it doesn't. The default behaviour is for it to fetch the file itself, using urlopen. Also, I'm aware that you shouldn't normally worry about setting a specific user-agent to fetch the file. But that's not the case of Wikipedia. In my case, Wikipedia returned 403 for the urllib user-agent. And since there's no documented way of specifying a particular user-agent in robotparser, or to feed a file object to robotparser, I decided to report this. Only after reading the source of 2.7.x and 3.x, one can find work-arounds for that problem, since it's not really clear how these make the requests for the robots.txt in the documentation.

Hi Senthil,

> I fail to see the bug in here. Robotparser module is for reading and
> parsing the robot.txt file, the module responsible for fetching it
> could urllib.

You're right, but robotparser's read() does a call to urllib.request.urlopen to
fetch the robots.txt file. If robotparser took a file object, or something like
that instead of a Url, I wouldn't think of this as a bug, but it doesn't. The
default behaviour is for it to fetch the file itself, using urlopen.

Also, I'm aware that you shouldn't normally worry about setting a specific
user-agent to fetch the file. But that's not the case of Wikipedia. In my case,
Wikipedia returned 403 for the urllib user-agent. And since there's no
documented way of specifying a particular user-agent in robotparser, or to feed
a file object to robotparser, I decided to report this.

Only after reading the source of 2.7.x and 3.x, one can find work-arounds for
that problem, since it's not really clear how these make the requests for the
robots.txt in the documentation.

History
Date	User	Action	Args
2012-09-11 05:45:27	dualbus	set	recipients: + dualbus, terry.reedy, orsenthil, ezio.melotti
2012-09-11 05:45:26	dualbus	link	issue15851 messages
2012-09-11 05:45:26	dualbus	create