
classification
Title: urllib2 urlopen truncates https pages after 32768 characters
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 2.7
process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To:
Nosy List: iritkatriel, jhp7e, orsenthil
Priority: normal
Keywords:

Created on 2013-03-29 00:15 by jhp7e, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (4)
msg185476 - Author: J Porter (jhp7e) Date: 2013-03-29 00:15
When using urllib2 to fetch page data from an https server, I found that only the first 32768 characters of the download were retrieved. Web browsers returned the full document, so it does not appear to be a server issue. If http rather than https is used on the same server, the full document is retrieved. Shorter documents (<32768 characters) were not truncated.
msg185477 - Author: Senthil Kumaran (orsenthil) (Python committer) Date: 2013-03-29 00:40
Do you have the sample server URL or test script?
msg185498 - Author: J Porter (jhp7e) Date: 2013-03-29 14:07
Here is the code (security info removed) and the output. The problem behaves slightly differently on 2.6.5 and 2.7.3 (on one of them, the result depends on whether authentication is used), so I've included the output for both:

# Python 2 script (urllib2 does not exist in Python 3).
import urllib2

# Authorization header value (credentials removed).
userData = "Basic XXXX KEY GOES HERE"

# 1. HTTPS, with the Authorization header.
emlUrl = "https://pasta.lternet.edu/package/metadata/eml/knb-lter-vcr/25/27"
emlReq = urllib2.Request(emlUrl)
emlReq.add_header('Authorization', userData)
emlSock = urllib2.urlopen(emlReq, timeout=60)
emlString = emlSock.read()
print "Https,authenticated: " + str(len(emlString))

# 2. HTTPS, without the Authorization header.
emlReq = urllib2.Request(emlUrl)
emlSock = urllib2.urlopen(emlReq, timeout=60)
emlString = emlSock.read()
print "Https,Not authenticated: " + str(len(emlString))

# 3. HTTP, with the Authorization header.
emlUrl = "http://pasta.lternet.edu/package/metadata/eml/knb-lter-vcr/25/27"
emlReq = urllib2.Request(emlUrl)
emlReq.add_header('Authorization', userData)
emlSock = urllib2.urlopen(emlReq, timeout=60)
emlString = emlSock.read()
print "Http,authenticated: " + str(len(emlString))

# 4. HTTP, without the Authorization header.
# Note: the label below says "authenticated" although no header is sent;
# it is left unchanged so that it matches the recorded output.
emlReq = urllib2.Request(emlUrl)
emlSock = urllib2.urlopen(emlReq, timeout=60)
emlString = emlSock.read()
print "Http,authenticated: " + str(len(emlString))

OUTPUT when run on PC using Python 2.6.5:
Https,authenticated: 32768
Https,Not authenticated: 32768
Http,authenticated: 40898
Http,authenticated: 40898

OUTPUT when run on Ubuntu Linux (12.04 LTS):
Https,authenticated: 32768
Https,Not authenticated: 40898
Http,authenticated: 40898
Http,authenticated: 40898
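
One way to tell a truncated read from a genuinely short document is to compare the number of bytes received with the Content-Length the server declared. The following is a minimal Python 3 sketch of that check and is not part of the original report. The helper names `read_all` and `is_truncated` are hypothetical, the 40898-byte body is simulated with an in-memory stream, and the check assumes the server sends a Content-Length header (chunked responses do not).

```python
import io


def read_all(stream, chunk_size=32768):
    """Read a file-like object until EOF, one chunk at a time."""
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read signals EOF
            break
        chunks.append(chunk)
    return b"".join(chunks)


def is_truncated(body, content_length):
    """Return True if fewer bytes arrived than the server declared.

    content_length is the raw header value (a string), or None when the
    header is absent, in which case this check cannot detect truncation.
    """
    if content_length is None:
        return False
    return len(body) < int(content_length)


# Simulated response body with the size seen in the report (assumption:
# a real run would pass the object returned by urlopen instead).
body = read_all(io.BytesIO(b"x" * 40898))
print(len(body), is_truncated(body, "40898"))
print(is_truncated(body[:32768], "40898"))
```

With a real response, the stream would be the object returned by `urlopen` and the declared length would come from its headers, for example `resp.headers.get("Content-Length")`.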
msg405005 - Author: Irit Katriel (iritkatriel) (Python committer) Date: 2021-10-25 21:57
Python 2.7 is no longer maintained. Please create a new issue if you are seeing this problem on 3.9+.
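
To check whether the behavior still occurs on Python 3, the reproduction script would need to be ported from `urllib2` to `urllib.request`. The sketch below is one possible port, not code from this issue. The helper name `fetch_length` is hypothetical, and the demonstration uses an offline `data:` URL of the same size as the document in the report, so it does not exercise HTTPS by itself.

```python
import urllib.request


def fetch_length(url, authorization=None, timeout=60):
    """Fetch url and return the number of bytes read from the response."""
    req = urllib.request.Request(url)
    if authorization is not None:
        # Same header the original script set with add_header().
        req.add_header("Authorization", authorization)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return len(resp.read())


# Offline stand-in for the 40898-byte document (assumption: replace this
# with the https and http URLs from the report to test a real server).
demo_url = "data:text/plain," + "a" * 40898
print("Length read:", fetch_length(demo_url))
```

Calling `fetch_length` once with and once without the `authorization` argument, for both the https and http forms of the URL, would reproduce the four measurements in the original script.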
History
Date                 User         Action  Args
2022-04-11 14:57:43  admin        set     github: 61769
2021-10-25 21:57:18  iritkatriel  set     status: open -> closed
                                          nosy: + iritkatriel
                                          messages: + msg405005
                                          resolution: out of date
                                          stage: resolved
2013-03-29 14:07:38  jhp7e        set     messages: + msg185498
2013-03-29 00:40:01  orsenthil    set     nosy: + orsenthil
                                          messages: + msg185477
2013-03-29 00:15:50  jhp7e        create