classification
Title: robotparser interactively prompts for username and password
Type: behavior Stage:
Components: Library (Lib) Versions: Python 2.3
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: skip.montanaro Nosy List: calvin, edemaine, loewis, nagle, skip.montanaro
Priority: normal Keywords:

Created on 2003-09-28 13:06 by edemaine, last changed 2007-08-28 23:28 by skip.montanaro. This issue is now closed.

Files
File name Uploaded Description
robotparser.diff edemaine, 2003-09-28 13:06 Patch for robotparser.py to prevent interactive behavior and return 401
Messages (7)
msg18407 - Author: Erik Demaine (edemaine) Date: 2003-09-28 13:06
This is a rare occurrence, but if a /robots.txt file is
password-protected on an http server, robotparser
interactively prompts (via raw_input) for a username
and password, because that is urllib's default
behavior.  One example of such a URL, at least at the
time of this writing, is

http://www.cosc.canterbury.ac.nz/robots.txt

Given that robotparser and robots.txt are all about
*robots* (not interactive users), I don't think this
interactive behavior is terribly appropriate.  Attached
is a simple patch to robotparser.py to fix this
behavior, forcing urllib to return the 401 error that
it ought to.

Another issue is whether a 401 (Authorization Required)
URL means that everything should be allowed or
everything should be disallowed.  I'm not sure what's
"right".  Reading the spec, it says 'This file must be
accessible via HTTP on the local URL "/robots.txt"'
which I would read to mean it should be accessible
without username/password.  On the other hand, the
current robotparser.py code says "if self.errcode ==
401 or self.errcode == 403: self.disallow_all = 1"
which has the opposite effect.  I'll leave deciding
which is most appropriate to the powers that be.
msg18408 - Author: Bastian Kleineidam (calvin) Date: 2003-09-29 13:24
http://www.robotstxt.org/wc/norobots-rfc.html specifies the
401 and 403 return code consequences as restricting the
whole site (i.e. disallow_all = 1).
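
The rule above can be sketched as a small status-to-policy mapping. This is only an illustration, not the robotparser.py source: the function name and the returned labels are hypothetical, and only the 401/403 branch is confirmed by this thread. The handling of other status codes is an assumption.

```python
def robots_policy_for_status(status):
    """Map the HTTP status of a /robots.txt fetch to a crawl policy label.

    Hypothetical helper for illustration; the labels are not robotparser's API.
    """
    if status in (401, 403):
        # Confirmed by the thread: access-restricted robots.txt
        # means the whole site is off limits.
        return "disallow_all"
    if 400 <= status < 500:
        # Assumption: other client errors (e.g. 404) mean no restrictions.
        return "allow_all"
    if status == 200:
        # Assumption: a successful fetch means the rules should be parsed.
        return "parse"
    # Assumption: anything else is left undecided in this sketch.
    return "undecided"


if __name__ == "__main__":
    for code in (200, 401, 403, 404):
        print(code, robots_policy_for_status(code))
```

With this mapping, a password-protected robots.txt is treated the same way whether or not credentials could have been supplied, which is why suppressing the prompt does not change the outcome for 401 responses.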

For the password input, the patch looks good to me. In the
long term, robotparser.py should switch to urllib2.py
anyway, and it should handle Transfer-Encoding: gzip.
msg18409 - Author: John Nagle (nagle) Date: 2007-04-21 16:53
The attached patch was never integrated into the distribution.  This is still broken in Python 2.4 (Win32), Python 2.5 (Win32), and Python 2.5 (Linux).  

This stalled an overnight batch job for us.  Very annoying.

Reproduce with:

import robotparser
url = 'http://mueblesmoraleda.com'  # whole site is password-protected
parser = robotparser.RobotFileParser()
parser.set_url(url)
parser.read()  # prompts for a username and password
msg18410 - Author: John Nagle (nagle) Date: 2007-04-22 21:12
I tried the patch by doing this:

import robotparser

def prompt_user_passwd(self, host, realm):
    # Never prompt; report that no credentials are available.
    return None, None

robotparser.URLopener.prompt_user_passwd = prompt_user_passwd  # temp patch

This dealt with the problem effectively; robots.txt files are being processed normally, and if reading one causes an authentication request, it's handled as if no password was input, without any interaction.  

So this could probably go in.
msg55386 - Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2007-08-28 22:38
I'll take this one.  Looks like an easy fix.
msg55387 - Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2007-08-28 23:27
Checked in as r57626 on trunk and r57627 on 2.5.
msg55388 - Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2007-08-28 23:28
For those monitoring this, note that I am no
unittest whiz, especially with oddball test
files like test_robotparser.py.  Please feel
free to rewrite that code.
History
Date User Action Args
2007-08-28 23:28:26 skip.montanaro set messages: + msg55388
2007-08-28 23:27:18 skip.montanaro set status: open -> closed
resolution: accepted
messages: + msg55387
2007-08-28 22:38:16 skip.montanaro set priority: high -> normal
assignee: loewis -> skip.montanaro
type: behavior
messages: + msg55386
nosy: + skip.montanaro
2003-09-28 13:06:03 edemaine create