
Author: edemaine
Date: 2003-09-28.13:06:03
Content:
This is a rare occurrence, but if a /robots.txt file is
password-protected on an http server, robotparser
interactively prompts (via raw_input) for a username
and password, because that is urllib's default
behavior.  One example of such a URL, at least at the
time of this writing, is

http://www.cosc.canterbury.ac.nz/robots.txt

Given that robotparser and robots.txt are all about
*robots* (not interactive users), I don't think this
interactive behavior is terribly appropriate.  Attached
is a simple patch to robotparser.py to fix this
behavior, forcing urllib to return the 401 error that
it ought to.

Another issue is whether a 401 (Authorization Required)
URL means that everything should be allowed or
everything should be disallowed.  I'm not sure what's
"right".  Reading the spec, it says 'This file must be
accessible via HTTP on the local URL "/robots.txt"'
which I would read to mean it should be accessible
without username/password.  On the other hand, the
current robotparser.py code says "if self.errcode ==
401 or self.errcode == 403: self.disallow_all = 1"
which has the opposite effect.  I'll leave deciding
which is most appropriate to the powers that be.