Hi Eduardo,

I tested further and do see some strange behavior.

On Mon, Sep 10, 2012 at 10:45 PM, Eduardo A. Bustamante López
<> wrote:

> Also, I'm aware that you shouldn't normally worry about setting a specific
> user-agent to fetch the file. But that's not the case of Wikipedia. In my case,
> Wikipedia returned 403 for the urllib user-agent.

Yeah, this really surprised me. I would normally assume robots.txt to
be readable by any agent, but I think something odd is happening.

In 2.7, I do not see the problem because the implementation is:

import urllib

class URLOpener(urllib.FancyURLopener):
    def __init__(self, *args):
        urllib.FancyURLopener.__init__(self, *args)
        self.errcode = 200

opener = URLOpener()
fobj = opener.open('')
print opener.errcode

This prints 200 and everything is fine. Also, note that robots.txt
is accessible.

In 3.3, the implementation is:

import urllib.request
import urllib.error

try:
    fobj = urllib.request.urlopen('')
except urllib.error.HTTPError as err:
    print(err.code)

This gives 403. I would normally expect this to work without any issues.
But according to my analysis, when the User-agent is set to something
that contains a '-', the server rejects the request with 403.

Under the hood, the above code does this:

import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Python-urllib/3.3')]
fobj = opener.open('')

This gives 403. To see it work, change the addheaders line to any of these:

opener.addheaders = [('', '')]
opener.addheaders = [('User-agent', 'Pythonurllib/3.3')]
opener.addheaders = [('User-agent', 'KillerSpamBot')]

All should work (as expected).
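
To check which User-agent a given opener would send without hitting the
network, something like the following should do. It is only a minimal
sketch: default_user_agent and make_opener are helper names I made up
for illustration, not part of urllib.

```python
import urllib.request


def default_user_agent():
    """Return the User-agent value a freshly built opener sends by default."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) tuples.
    return dict(opener.addheaders)['User-agent']


def make_opener(user_agent):
    """Hypothetical helper: build an opener that sends the given User-agent."""
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-agent', user_agent)]
    return opener


# Prints the default value, of the form Python-urllib/<major>.<minor>
print(default_user_agent())

# An opener that sends the variant without the '-'
opener = make_opener('Pythonurllib/3.3')
print(opener.addheaders)
```

Calling opener.open(url) on the second opener then sends the modified
header, which is what the addheaders experiment above relies on.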

So what surprises me is whether sending "Python-urllib/3.3" is a
mistake for "THAT server".
Is this a server oddity on Wikipedia's side? (I checked hg log to see
since when we have been sending Python-urllib/version, and it seems
that it has been sent for a long time.)

I can't see how this should be fixed in urllib.
