classification
Title: testPythonOrg() of test_robotparser fails with the new www.python.org website
Versions: Python 3.4
process
Status: closed
Resolution: out of date
Nosy List: jcea, ned.deily, python-dev, vstinner
Priority: normal

Created on 2014-02-21 09:20 by vstinner, last changed 2015-03-18 13:27 by vstinner. This issue is now closed.

Messages (9)
msg211836 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-02-21 09:20
I read somewhere that python.org has a new website with gzip compression enabled, which makes some urllib tests fail. This failure is probably related.

http://buildbot.python.org/all/builders/SPARC%20Solaris%2010%20%28cc%2C%2064b%29%20%5BSB%5D%203.x/builds/1750/steps/test/logs/stdio

======================================================================
ERROR: testPythonOrg (test.test_robotparser.NetworkTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/cpython/buildslave/cc-64/3.x.snakebite-sol10-sparc-cc-64/build/Lib/test/test_robotparser.py", line 283, in testPythonOrg
    parser.read()
  File "/home/cpython/buildslave/cc-64/3.x.snakebite-sol10-sparc-cc-64/build/Lib/urllib/robotparser.py", line 64, in read
    self.parse(raw.decode("utf-8").splitlines())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
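
The byte 0x8b at position 1 is consistent with a gzip stream: the gzip header starts with the magic bytes 0x1f 0x8b, and 0x1f is a valid one-byte UTF-8 character, so decoding only fails on the second byte. A minimal sketch that reproduces the same error offline (the robots.txt body below is made up for illustration, it is not the real file):

```python
import gzip

# Hypothetical robots.txt body; the real file's content is not shown in this issue.
body = b"User-agent: *\nDisallow: /\n"
raw = gzip.compress(body)

# Every gzip stream starts with the two magic bytes 0x1f 0x8b.
magic = raw[:2]

# Decoding the compressed bytes as UTF-8 fails on the second magic byte.
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```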
msg211840 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-02-21 09:52
http://buildbot.python.org/all/builders/x86%20FreeBSD%206.4%203.x/builds/4531/steps/test/logs/stdio

======================================================================
ERROR: testPythonOrg (test.test_robotparser.NetworkTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/db3l/buildarea/3.x.bolen-freebsd/build/Lib/test/test_robotparser.py", line 283, in testPythonOrg
    parser.read()
  File "/usr/home/db3l/buildarea/3.x.bolen-freebsd/build/Lib/urllib/robotparser.py", line 64, in read
    self.parse(raw.decode("utf-8").splitlines())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
msg211841 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-02-21 09:52
http://buildbot.python.org/all/builders/x86%20Windows%20Server%202003%20%5BSB%5D%203.x/builds/2166/steps/test/logs/stdio

======================================================================
ERROR: testPythonOrg (test.test_robotparser.NetworkTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "E:\Data\buildslave\cpython\3.x.snakebite-win2k3r2sp2-x86\build\lib\test\test_robotparser.py", line 283, in testPythonOrg
    parser.read()
  File "E:\Data\buildslave\cpython\3.x.snakebite-win2k3r2sp2-x86\build\lib\urllib\robotparser.py", line 64, in read
    self.parse(raw.decode("utf-8").splitlines())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
msg211852 - Author: Ned Deily (ned.deily) * (Python committer) Date: 2014-02-21 12:48
It looks like the new python.org web server configuration was just changed to no longer gzip robots.txt so the test is no longer failing for me.
msg211853 - Author: Ned Deily (ned.deily) * (Python committer) Date: 2014-02-21 13:11
... or, more likely, that a robots.txt file is now in place.
msg211889 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-02-21 23:19
> It looks like the new python.org web server configuration was just changed to no longer gzip robots.txt so the test is no longer failing for me.

If I check the HTTP headers of http://www.python.org/robots.txt using a small Python script that sends "GET /robots.txt HTTP/1.0" and "Host: www.python.org" (but no Accept-Encoding header), I still see "Content-Encoding: gzip".

It looks like a bug in the HTTP server serving www.python.org, because my client didn't send "Accept-Encoding: gzip, deflate".
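
A check of this kind could be sketched as follows. This is not the script used for this message, only an illustration under assumptions: the helper names (build_request, parse_headers, fetch_headers) are hypothetical, and a raw socket is used so that no Accept-Encoding header is added behind the scenes.

```python
import socket


def build_request(host, path="/robots.txt"):
    """Build a bare HTTP/1.0 request with a Host header and no Accept-Encoding."""
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")


def parse_headers(response):
    """Return the response headers as a dict with lower-cased names."""
    head, _, _ = response.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in lines[1:]:  # lines[0] is the status line
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def fetch_headers(host, path="/robots.txt", port=80, timeout=10):
    """Send the request over plain HTTP and return the parsed response headers."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:  # an HTTP/1.0 server closes the connection when done
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return parse_headers(b"".join(chunks))
```

Calling fetch_headers("www.python.org").get("content-encoding") would then show whether the server compressed the response even though the client never advertised gzip support. The result depends on the server's configuration at the time of the request.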

RFC 2616 (HTTP/1.1) says: "If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding."
http://www.w3.org/Protocols/rfc2616/rfc2616.html

See also:

"HTTP/1.1 (unlike HTTP/1.0) carefully specifies the Accept-Encoding header, used by a client to indicate what content-codings it can handle, and which ones it prefers."
http://www8.org/w8-papers/5c-protocols/key/key.html

The best solution would be to implement #1508475: support gzip in urllib.
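
As a rough illustration of what gzip support could look like on the client side (this is not the patch from #1508475, and decode_body is a hypothetical helper, not an existing urllib function), the body can be decompressed according to the Content-Encoding header before it is decoded as text:

```python
import gzip


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Undo the Content-Encoding of a response body, then decode it to text.

    Only "gzip" (and its alias "x-gzip") and "identity" are handled in this sketch.
    """
    encoding = (content_encoding or "identity").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        raw = gzip.decompress(raw)
    elif encoding != "identity":
        raise ValueError(f"unsupported Content-Encoding: {encoding!r}")
    return raw.decode(charset)
```

With a helper like this, the read() step that currently calls raw.decode("utf-8") directly would decompress first whenever the server reports "Content-Encoding: gzip".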
msg211898 - Author: Ned Deily (ned.deily) * (Python committer) Date: 2014-02-22 00:12
Interesting. As of last night, I'm no longer seeing 'gzip' encoding and the test passes for me. But I see some of the buildbots failing intermittently. Looking at the headers for www.python.org/robots.txt, it appears that the file is being served from a Varnish cache and from a CDN, so the response may differ depending on which server answers.

>>> r1.getheaders()
[('Server', 'nginx'), ('Content-Type', 'text/plain'), ('X-Frame-Options', 'SAMEORIGIN'), ('Content-Length', '690'), ('Accept-Ranges', 'bytes'), ('Date', 'Fri, 21 Feb 2014 23:53:23 GMT'), ('Via', '1.1 varnish'), ('Age', '2858'), ('Connection', 'keep-alive'), ('X-Served-By', 'cache-sv62-SJC3'), ('X-Cache', 'HIT'), ('X-Cache-Hits', '1')]

In any case, supporting gzip would be a good idea, but the tests will need a URL whose behavior is more repeatable.
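
One way to get a repeatable URL is to serve robots.txt from a local HTTP server inside the test itself. The sketch below is an assumption about how such a test could look, not the change that was eventually committed; RobotsHandler, ROBOTS_TXT and check_local_robots are hypothetical names.

```python
import threading
import urllib.robotparser
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical robots.txt content served by the local test server.
ROBOTS_TXT = b"User-agent: *\nDisallow: /private/\n"


class RobotsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Always answer with the same uncompressed robots.txt body.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(ROBOTS_TXT)))
        self.end_headers()
        self.wfile.write(ROBOTS_TXT)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def check_local_robots():
    """Read robots.txt from a throwaway local server and query the parser."""
    server = HTTPServer(("127.0.0.1", 0), RobotsHandler)  # port 0: pick a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        base = f"http://127.0.0.1:{server.server_port}"
        parser = urllib.robotparser.RobotFileParser(base + "/robots.txt")
        parser.read()
        return (
            parser.can_fetch("*", base + "/public/"),
            parser.can_fetch("*", base + "/private/x"),
        )
    finally:
        server.shutdown()
        server.server_close()
```

Because the server runs in the test process, the response no longer depends on which cache or CDN node happens to answer.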
msg212538 - Author: Roundup Robot (python-dev) (Python triager) Date: 2014-03-02 08:22
New changeset 540ce9bb19e8 by Georg Brandl in branch '3.3':
#20719: Disable the robotparser python.org test until the gzip encoding issue can be sorted.
http://hg.python.org/cpython/rev/540ce9bb19e8
msg238433 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-03-18 13:27
This issue has been worked around.
History
Date                 User        Action  Args
2015-03-18 13:27:54  vstinner    set     status: open -> closed
                                         resolution: out of date
                                         messages: + msg238433
2014-03-02 08:22:51  python-dev  set     nosy: + python-dev
                                         messages: + msg212538
2014-02-22 00:12:07  ned.deily   set     messages: + msg211898
2014-02-22 00:00:45  jcea        set     nosy: + jcea
                                         title: testPythonOrg() of test_robotparser fails with the new ww.python.org website -> testPythonOrg() of test_robotparser fails with the new www.python.org website
2014-02-21 23:20:15  vstinner    set     title: test_robotparser failure on "SPARC Solaris 10 (cc%2C 64b) [SB] 3.x" buildbot -> testPythonOrg() of test_robotparser fails with the new ww.python.org website
2014-02-21 23:19:41  vstinner    set     messages: + msg211889
2014-02-21 13:11:01  ned.deily   set     messages: + msg211853
2014-02-21 12:48:52  ned.deily   set     nosy: + ned.deily
                                         messages: + msg211852
2014-02-21 09:52:57  vstinner    set     messages: + msg211841
2014-02-21 09:52:03  vstinner    set     messages: + msg211840
2014-02-21 09:20:19  vstinner    create