Message170274
Hi Eduardo,
I tested further and observed some strange behavior.
On Mon, Sep 10, 2012 at 10:45 PM, Eduardo A. Bustamante López
<report@bugs.python.org> wrote:
> Also, I'm aware that you shouldn't normally worry about setting a specific
> user-agent to fetch the file. But that's not the case of Wikipedia. In my case,
> Wikipedia returned 403 for the urllib user-agent.
Yeah, this really surprised me. I would normally assume robots.txt to
be readable by any agent, but I think something odd is happening.
In 2.7, I do not see the problem because the implementation is:
import urllib

class URLOpener(urllib.FancyURLopener):
    def __init__(self, *args):
        urllib.FancyURLopener.__init__(self, *args)
        self.errcode = 200

opener = URLOpener()
fobj = opener.open('http://en.wikipedia.org/robots.txt')
print opener.errcode
This prints 200 and everything is fine. Note also that robots.txt
itself is accessible.
In 3.3, the implementation is:
import urllib.error
import urllib.request

try:
    fobj = urllib.request.urlopen('http://en.wikipedia.org/robots.txt')
except urllib.error.HTTPError as err:
    print(err.code)
This gives 403. I would normally expect this to work without any issues.
But according to my analysis, when the User-agent is set to something
that contains a '-', the server rejects the request with 403.
What the above code does under the hood is this:
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Python-urllib/3.3')]
fobj = opener.open('http://en.wikipedia.org/robots.txt')
print(fobj.getcode())
This would give 403. To see it work, change the addheaders line to any one of:
opener.addheaders = [('', '')]
opener.addheaders = [('User-agent', 'Pythonurllib/3.3')]
opener.addheaders = [('User-agent', 'KillerSpamBot')]
All should work (as expected).
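The comparison above can be repeated in one go with a small helper. This is only a sketch: the names make_opener, status_for and compare_agents are made up for illustration and are not part of urllib, and what the server answers for each User-agent may differ from what is reported in this message.

```python
import urllib.error
import urllib.request


def make_opener(user_agent):
    """Return an opener that sends only the given User-agent header."""
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-agent', user_agent)]
    return opener


def status_for(opener, url):
    """Return the HTTP status code for url, including error codes such as 403."""
    try:
        with opener.open(url) as response:
            return response.getcode()
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, so the code is read from the exception.
        return err.code


def compare_agents(url, agents):
    """Map each User-agent string to the status code the server returned."""
    return {agent: status_for(make_opener(agent), url) for agent in agents}
```

Calling compare_agents with the robots.txt URL and the three User-agent strings listed above returns a dict of status codes, which makes it easy to see which strings the server rejects.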
So what surprises me is whether sending "Python-urllib/3.3" is a
mistake for "THAT server".
Is this a server oddity on Wikipedia's side? (I ask because I referred
to hg log to see since when we have been sending Python-urllib/version,
and it seems that it has been sent for a long time.)
I can't see how this should be fixed in urllib.
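Until that is settled, a caller can work around it on their own side by fetching robots.txt with an explicit User-agent and handing the text to the parser. The following is a minimal sketch under that assumption, not a proposed fix: fetch_text, parse_robots and the EXAMPLE_AGENT string are hypothetical names, and only Request, urlopen and RobotFileParser.parse come from the standard library.

```python
import urllib.request
import urllib.robotparser

# Hypothetical User-agent string, chosen only for illustration.
EXAMPLE_AGENT = 'ExampleRobotsFetcher/0.1'


def fetch_text(url, user_agent=EXAMPLE_AGENT):
    """Fetch url with an explicit User-agent and return the body as text."""
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read().decode('utf-8', errors='replace')


def parse_robots(text):
    """Build a RobotFileParser from robots.txt content that is already fetched."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(text.splitlines())
    return parser
```

With these, parse_robots(fetch_text('http://en.wikipedia.org/robots.txt')) gives a parser whose can_fetch() can be queried as usual, without relying on the default User-agent.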
Date | User | Action | Args
2012-09-11 07:54:41 | orsenthil | set | recipients: + orsenthil, terry.reedy, ezio.melotti, dualbus
2012-09-11 07:54:40 | orsenthil | link | issue15851 messages
2012-09-11 07:54:40 | orsenthil | create |