
Author orsenthil
Recipients dualbus, ezio.melotti, orsenthil, terry.reedy
Date 2012-09-11.07:54:40
Message-id <CAPOVWOR4RL26ZOXztZKB1k3PUQW5gHAfRDoNDsi0-ajUByPVMQ@mail.gmail.com>
In-reply-to <20120911054504.GA1146@claret.lan>
Content
Hi Eduardo,

I tested further and do observe some very strange oddities.

On Mon, Sep 10, 2012 at 10:45 PM, Eduardo A. Bustamante López
<report@bugs.python.org> wrote:

> Also, I'm aware that you shouldn't normally worry about setting a specific
> user-agent to fetch the file. But that's not the case of Wikipedia. In my case,
> Wikipedia returned 403 for the urllib user-agent.

Yeah, this really surprised me. I would normally assume robots.txt to
be readable by any agent, but I think something odd is happening.

In 2.7, I do not see the problem, because the implementation is effectively:

import urllib

class URLOpener(urllib.FancyURLopener):
    def __init__(self, *args):
        urllib.FancyURLopener.__init__(self, *args)
        self.errcode = 200  # assume success; an error handler would overwrite this

opener = URLOpener()
fobj = opener.open('http://en.wikipedia.org/robots.txt')
print opener.errcode

This will print 200 and everything is fine; it also shows that
robots.txt is accessible.

In 3.3, the implementation is:

import urllib.request

try:
    fobj = urllib.request.urlopen('http://en.wikipedia.org/robots.txt')
except urllib.error.HTTPError as err:
    print(err.code)

This gives 403.  I would normally expect this to work without any issues.
But according to my analysis, what is happening is that when the
User-Agent is set to something containing a '-', the server rejects the
request with 403.
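(To check what urllib sends by default, one can inspect the default
headers that build_opener() installs — a quick sanity check, assuming
CPython's urllib.request:)

```python
import urllib.request

# build_opener() installs a default User-agent header of the form
# 'Python-urllib/<major>.<minor>', e.g. 'Python-urllib/3.3'
opener = urllib.request.build_opener()
print(opener.addheaders)
```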

What the above code is doing underneath is this:

import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Python-urllib/3.3')]
fobj = opener.open('http://en.wikipedia.org/robots.txt')
print(fobj.getcode())

This would give 403. In order to see it work, change the addheaders line to any of:

opener.addheaders = [('', '')]
opener.addheaders = [('User-agent', 'Pythonurllib/3.3')]
opener.addheaders = [('User-agent', 'KillerSpamBot')]

All should work (as expected).

So, the thing which surprises me is whether sending "Python-urllib/3.3"
is a mistake for THAT server.
Is this a server oddity on Wikipedia's part? (Because I referred to the
hg log to see since when we have been sending Python-urllib/<version>,
and it seems it has been sent for a long time.)

I can't see how this should be fixed in urllib.
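(As a user-level workaround — a sketch, not a fix in urllib itself, and
the bot name 'MyRobotChecker/1.0' is just an illustrative placeholder —
the User-Agent can be overridden per request via a Request object:)

```python
import urllib.request

# Hypothetical User-Agent string with no '-' in it; anything other than
# the rejected default should do, per the analysis above
req = urllib.request.Request(
    'http://en.wikipedia.org/robots.txt',
    headers={'User-Agent': 'MyRobotChecker/1.0'},
)
# Request stores header keys capitalized, so check it like this:
print(req.get_header('User-agent'))
```

Passing req to urllib.request.urlopen() would then send the custom
header instead of the default one.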
History
Date User Action Args
2012-09-11 07:54:41orsenthilsetrecipients: + orsenthil, terry.reedy, ezio.melotti, dualbus
2012-09-11 07:54:40orsenthillinkissue15851 messages
2012-09-11 07:54:40orsenthilcreate