Issue 5419: urllib.request.open(someURL).read() returns a bytes object so writing it requires binary mode

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/49669

classification

Title:	urllib.request.open(someURL).read() returns a bytes object so writing it requires binary mode
Type:	behavior	Stage:	needs patch
Components:	Documentation	Versions:	Python 3.0, Python 3.1, Python 3.2, Python 3.3

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	orsenthil	Nosy List:	Danh, MLModel, georg.brandl, orsenthil
Priority:	normal	Keywords:	easy

Created on 2009-03-05 01:21 by MLModel, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (4)
msg83179	Author: Mitchell Model (MLModel)	Date: 2009-03-05 01:21
There needs to be something somewhere in the documentation that makes the simple point that data coming in from the web is bytes not strings, which is a big change from Python 2, and that it needs to be manipulated as such, including writing in binary mode. I am not sure what documentation should be changed, but I do think something is missing, because I just ran around in circles on this one for quite some time. Perhaps the Unicode HOWTO needs more information; possibly urllib.request does; maybe a combination of things have to be added to several documentation files.

Here's what happened: I wanted to read from a web page, make some string replacements, and save to a file, so I wrote code that boils down to something like:

    with open('url.html', 'w') as fil:
        fil.write(urllib.request.open(aURL).read()).replace(str1, str2)

The first thing that happened was an error telling me that I can't write bytes to a text stream, so I realized that read() was returning a bytes object, which makes sense. So I converted it to a string, but that put a b' at the beginning of the file and a ' at the end! Bad. Instead of str(thebytes) I did the proper thing: thebytes.decode(), and wrote that to the file. But then I found that non-ASCII characters created problems -- they were saved in the file as \xNN\xNN or even three \x's, then displayed as garbage when the page was opened in a browser. So I tried decoding using different codecs but couldn't find one that worked for the é and the emdash that were in the response.

Finally I realized that the whole thing was a delusion: obviously urlopen responses have to return bytes objects, and adding 'b' to the 'w' when opening the output file fixed everything. (I also had to change my replacement strings to bytes.)
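Editorial illustration, not part of the original message: a minimal sketch of the three outcomes described above. The `raw` value is a hypothetical stand-in for what `read()` returns, and `io.StringIO` stands in for a file opened in text mode; the exact error text for a real file may differ.

```python
import io

# Hypothetical stand-in for the bytes that read() returns from a response.
raw = "caf\u00e9 \u2014 menu".encode("utf-8")

# 1. A text-mode stream rejects bytes outright (StringIO stands in for open(..., 'w')).
try:
    io.StringIO().write(raw)
    text_write_error = None
except TypeError as exc:
    text_write_error = exc

# 2. str() on bytes gives the repr: a b'...' wrapper with \x escapes inside.
as_repr = str(raw)

# 3. decode() with the matching codec recovers the actual text.
as_text = raw.decode("utf-8")
```

The `\x` sequences the message mentions are what option 2 produces; they are escapes in the repr, not characters that any codec will turn back into text.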
I went back to the relevant documentation multiple times, including after I figured everything out, and I can't convince myself that it makes the connection anywhere between bytes coming in, manipulating the bytes as bytes, and writing out in binary. Yes, in retrospect this all makes sense and perhaps even should have been obvious, but I am quite sure I won't be the only experienced Python 2 programmer to trip over this when making the transition to Python 3. I apologize in advance if the requested documentation exists and I didn't find it, in which case I would appreciate a pointer to where it lies.
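Editorial illustration, not part of the original message: a minimal sketch of the bytes-in, bytes-out connection the report asks the documentation to make. Note that the module-level function is spelled `urllib.request.urlopen`. The helper name `fetch_replace_save`, the sample page, and the `data:` URL (used so the sketch runs without network access, assuming Python 3.4 or later) are all assumptions made for the example.

```python
import base64
import os
import tempfile
import urllib.request


def fetch_replace_save(url, old, new, path):
    """Fetch url, replace old with new (both bytes), and save the raw bytes."""
    with urllib.request.urlopen(url) as response:
        body = response.read()        # a bytes object in Python 3
    body = body.replace(old, new)     # bytes.replace() needs bytes arguments
    with open(path, "wb") as out:     # "wb": write the bytes without re-encoding
        out.write(body)
    return body


# Hypothetical page content; a data: URL stands in for a real web address.
page = "<p>caf\u00e9 \u2014 menu</p>".encode("utf-8")
demo_url = ("data:text/html;charset=utf-8;base64,"
            + base64.b64encode(page).decode("ascii"))

out_path = os.path.join(tempfile.mkdtemp(), "url.html")
saved = fetch_replace_save(demo_url, b"menu", b"carte", out_path)
```

Because nothing is decoded or re-encoded along the way, the é and the em dash reach the file as exactly the bytes the server sent.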
msg103200	Author: Daniel Haertle (Danh)	Date: 2010-04-15 12:17
I got struck by the same feature. In addition, currently the docs are wrong in the examples (at http://docs.python.org/dev/py3k/library/urllib.request.html#examples the output of f.read() is a string instead of bytes). There I propose the change from

    >>> import urllib.request
    >>> f = urllib.request.urlopen('http://www.python.org/')
    >>> print(f.read(100))
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <?xml-stylesheet href="./css/ht2html

to

    >>> import urllib.request
    >>> f = urllib.request.urlopen('http://www.python.org/')
    >>> print(f.read(100).decode('utf-8'))
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtm

The other examples need to be corrected in a similar way. Even more importantly, the "HOWTO Fetch Internet Resources Using The urllib Package" needs to be corrected too.

In the documentation of urllib.request.urlopen I propose to add a sentence (after the paragraph "This function returns a file-like object...") explaining that reading the object returns bytes that need to be decoded to a string: "Note that the method read() returns bytes that need to be decoded to a string using decode()."
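Editorial illustration, not part of the original message: the proposed examples hard-code 'utf-8', which only works when the server actually sends UTF-8. A minimal sketch of one alternative is to decode with the charset declared in the Content-Type header and fall back to a default when none is given. The helper name `read_text`, the fallback choice, and the `data:` URL (an offline stand-in for a web page, assuming Python 3.4 or later) are assumptions made for the example.

```python
import base64
import urllib.request


def read_text(url, fallback="utf-8"):
    """Return the response body as str, decoded with the declared charset."""
    with urllib.request.urlopen(url) as response:
        # get_content_charset() returns None when the header names no charset.
        charset = response.headers.get_content_charset() or fallback
        return response.read().decode(charset)


# Hypothetical Latin-1 page delivered through a data: URL that declares its charset.
payload = base64.b64encode("<title>caf\u00e9</title>".encode("latin-1")).decode("ascii")
text = read_text("data:text/html;charset=iso-8859-1;base64," + payload)
```

This relies only on the header; pages that declare their encoding solely in a meta tag would still need separate handling.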
msg103202	Author: Senthil Kumaran (orsenthil) *	Date: 2010-04-15 12:37
Yeah, there is an example in the tutorial that was recently changed along the lines suggested (http://docs.python.org/dev/py3k/tutorial/stdlib.html#internet-access). The other examples have to be changed too.
msg103237	Author: Senthil Kumaran (orsenthil) *	Date: 2010-04-15 17:23
Fixed in revision 80092 and merged into release31-maint in revision 80093. I am marking this as fixed and closed. If there are any similar issues at other places, we will address them as separate bugs.

History
Date	User	Action	Args
2022-04-11 14:56:46	admin	set	github: 49669
2010-04-15 17:23:05	orsenthil	set	status: open -> closed resolution: accepted -> fixed messages: + msg103237
2010-04-15 12:37:21	orsenthil	set	messages: + msg103202
2010-04-15 12:34:01	orsenthil	set	assignee: georg.brandl -> orsenthil resolution: accepted nosy: + orsenthil
2010-04-15 12:17:45	Danh	set	nosy: + Danh messages: + msg103200 versions: + Python 3.2, Python 3.3
2009-04-22 14:39:01	ajaksu2	set	priority: normal keywords: + easy type: behavior stage: needs patch
2009-03-05 01:21:23	MLModel	create