urllib.request.open(someURL).read() returns a bytes object so writing it requires binary mode #49669

Closed
MLModel mannequin opened this issue Mar 5, 2009 · 4 comments
Labels
docs Documentation in the Doc dir easy type-bug An unexpected behavior, bug, or error

Comments

@MLModel
Mannequin

MLModel mannequin commented Mar 5, 2009

BPO 5419
Nosy @birkenfeld, @orsenthil, @MLModel

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = 'https://github.com/orsenthil'
closed_at = <Date 2010-04-15.17:23:05.743>
created_at = <Date 2009-03-05.01:21:23.402>
labels = ['easy', 'type-bug', 'docs']
title = 'urllib.request.open(someURL).read() returns a  bytes object so writing it requires binary mode'
updated_at = <Date 2010-04-15.17:23:05.742>
user = 'https://github.com/MLModel'

bugs.python.org fields:

activity = <Date 2010-04-15.17:23:05.742>
actor = 'orsenthil'
assignee = 'orsenthil'
closed = True
closed_date = <Date 2010-04-15.17:23:05.743>
closer = 'orsenthil'
components = ['Documentation']
creation = <Date 2009-03-05.01:21:23.402>
creator = 'MLModel'
dependencies = []
files = []
hgrepos = []
issue_num = 5419
keywords = ['easy']
message_count = 4.0
messages = ['83179', '103200', '103202', '103237']
nosy_count = 4.0
nosy_names = ['georg.brandl', 'orsenthil', 'MLModel', 'Danh']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'needs patch'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue5419'
versions = ['Python 3.0', 'Python 3.1', 'Python 3.2', 'Python 3.3']

@MLModel
Mannequin Author

MLModel mannequin commented Mar 5, 2009

There needs to be something somewhere in the documentation that makes
the simple point that data coming in from the web is bytes not strings,
which is a big change from Python 2, and that it needs to be manipulated
as such, including writing in binary mode.

I am not sure what documentation should be changed, but I do think
something is missing, because I just ran around in circles on this one
for quite some time. Perhaps the Unicode HOWTO needs more information;
possibly urllib.request does; maybe a combination of things have to be
added to several documentation files. Here's what happened:

I wanted to read from a web page, make some string replacements, and
save to a file, so I wrote code that boils down to something like:

    with open('url.html', 'w') as fil:
        fil.write(urllib.request.urlopen(aURL).read()).replace(str1, str2)

The first thing that happened was an error telling me that I can't write
bytes to a text stream, so I realized that read() was returning a bytes
object, which makes sense.

So I converted it to a string, but that put a b' at the beginning of the
file and a ' at the end! Bad.

Instead of str(thebytes) I did the proper thing: thebytes.decode(), and
wrote that to the file.
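
To make the difference between the two conversions concrete, here is a minimal sketch. The sample bytes are an assumption standing in for a real response body; nothing here comes from the original page.

```python
# Hypothetical payload standing in for what read() returns.
thebytes = b"<p>hello</p>"

# str() on a bytes object gives its repr, including the b'...' wrapper.
as_repr = str(thebytes)

# decode() turns the bytes into text; with no argument it assumes UTF-8.
as_text = thebytes.decode()

print(as_repr)
print(as_text)
```

Writing `as_repr` to a text file is what produces the stray b' and ' around the content.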

But then I found that non-ASCII characters created problems -- they were
saved in the file as \xNN\xNN or even three \x's, then displayed as
garbage when the page was opened in a browser.

So I tried decoding using different codecs but couldn't find one that
worked for the é and the em dash that were in the response.
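
A small sketch of why guessing codecs goes wrong. It assumes, purely for illustration, that the page was served as UTF-8; the report does not say which encoding the server actually used.

```python
# Sample text with an é and an em dash, written with escapes for clarity.
text = "caf\u00e9 \u2014 menu"

# Assumption: the server sends UTF-8. These are the bytes read() would return.
raw = text.encode("utf-8")

right = raw.decode("utf-8")      # matching codec: original text comes back
wrong = raw.decode("latin-1")    # never raises, but yields the wrong characters

# ASCII cannot represent either character, so decoding fails outright.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

Only the codec that matches the one used to encode the bytes restores the text; a mismatched codec either fails or silently produces garbage.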

Finally I realized that the whole thing was a delusion: obviously
urlopen responses have to return bytes objects, and adding 'b' to the
'w' when opening the output file fixed everything. (I also had to change
my replacement strings to bytes.)
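
The fix described above can be sketched as follows. The response body is a hypothetical stand-in so the example runs without a network connection, and the temporary file path is likewise only for illustration.

```python
import os
import tempfile

# Hypothetical response body; in the report this came from urlopen(...).read().
body = "<p>caf\u00e9 \u2014 old</p>".encode("utf-8")

# The replacement arguments must be bytes too, not str.
updated = body.replace(b"old", b"new")

# 'wb' writes the bytes exactly as they are, with no encoding step.
path = os.path.join(tempfile.mkdtemp(), "url.html")
with open(path, "wb") as fil:
    fil.write(updated)

# Read the file back in binary mode to confirm nothing was altered.
with open(path, "rb") as fil:
    saved = fil.read()
```

Because the data stays bytes from start to finish, the non-ASCII characters reach the file in the same encoding the server sent them in.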

I went back to the relevant documentation multiple times, including
after I figured everything out, and I can't convince myself that it
makes the connection anywhere between bytes coming in, manipulating the
bytes as bytes, and writing out in binary. Yes, in retrospect this all
makes sense and perhaps even should have been obvious, but I am quite
sure I won't be the only experienced Python 2 programmer to trip over
this when making the transition to Python 3.

I apologize in advance if the requested documentation exists and I
didn't find it, in which case I would appreciate a pointer to where it
lies.

@MLModel MLModel mannequin assigned birkenfeld Mar 5, 2009
@MLModel MLModel mannequin added the docs Documentation in the Doc dir label Mar 5, 2009
@devdanzin devdanzin mannequin added easy type-bug An unexpected behavior, bug, or error labels Apr 22, 2009
@Danh
Mannequin

Danh mannequin commented Apr 15, 2010

I was bitten by the same feature. In addition, the examples in the docs are currently wrong: at http://docs.python.org/dev/py3k/library/urllib.request.html#examples the output of f.read() is shown as a string instead of bytes. I propose changing the example from

>>> import urllib.request
>>> f = urllib.request.urlopen('http://www.python.org/')
>>> print(f.read(100))
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<?xml-stylesheet href="./css/ht2html

to

>>> import urllib.request
>>> f = urllib.request.urlopen('http://www.python.org/')
>>> print(f.read(100).decode('utf-8'))
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtm

The other examples need to be corrected in a similar way.
Even more importantly, the "HOWTO Fetch Internet Resources Using The urllib Package" needs to be corrected too.

In the documentation of urllib.request.urlopen I propose to add a sentence (after the paragraph "This function returns a file-like object...") explaining that reading the object returns bytes that need to be decoded to a string:
"Note that the method read() returns bytes that need to be decoded to a string using decode()."
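
A minimal sketch of what such a note implies in practice, assuming a recent Python 3. The helper name `read_text` and its `fallback` parameter are hypothetical, not part of urllib. The charset is taken from the Content-Type header when the server declares one; a charset declared only inside the HTML itself is not handled here.

```python
import urllib.request

def read_text(url, fallback="utf-8"):
    """Fetch url and decode the body with the declared charset, if any."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()                              # always bytes
        charset = response.headers.get_content_charset()   # None if undeclared
    return raw.decode(charset or fallback)

# A data: URL keeps this demo offline; an http(s) URL is used the same way.
sample = read_text("data:text/plain;charset=utf-8,caf%C3%A9")
print(sample)
```

The fallback of UTF-8 is a guess, so callers who know the encoding of a particular site may prefer to pass it explicitly.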

@orsenthil orsenthil assigned orsenthil and unassigned birkenfeld Apr 15, 2010
@orsenthil
Member

Yeah, there is an example in the tutorial that was changed recently along the lines suggested. (http://docs.python.org/dev/py3k/tutorial/stdlib.html#internet-access)
The other examples have to be changed too.

@orsenthil
Member

Fixed in revision 80092 and merged into release31-maint in revision 80093. I am marking this as fixed and closed. If there are any similar issues at other places, we will address them as separate bugs.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022