Message 78250 - Python tracker


Author	ajaksu2
Recipients	ajaksu2
Date	2008-12-23.21:44:05
Message-id	<1230068654.82.0.881542673607.issue4733@psf.upfronthosting.co.za>

Content

This patch adds a version of urlopen that uses available encoding
information to return strings instead of bytes.

The main goal is to provide a shortcut for users who don't want to
handle the decoding in the easy cases[1]. One added benefit is that the
failures of such a function would make it clear why the 2.x-style "str is
either bytes or text" model is flawed for network IO.

Currently, charset detection simply uses addinfourl.get_charset(), but
optionally checking for HTTP headers might be more robust.
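
For illustration only, here is a minimal sketch of the idea, not the
attached patch. It assumes a current Python 3, where the response's
headers object is an email.message.Message and offers
get_content_charset(); the names decode_body and urlopen_text and the
utf-8 fallback are hypothetical choices made for this example.

```python
from email.message import Message
from urllib.request import urlopen


def decode_body(body, headers, fallback="utf-8"):
    """Decode bytes using the charset declared in the Content-Type header.

    Falls back to `fallback` (an assumption of this sketch) when the
    header carries no charset parameter.
    """
    charset = headers.get_content_charset() or fallback
    return body.decode(charset)


def urlopen_text(url, fallback="utf-8"):
    """Hypothetical text-returning wrapper around urlopen."""
    with urlopen(url) as response:
        return decode_body(response.read(), response.headers, fallback)


# Offline usage: build the headers by hand instead of hitting the network.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
text = decode_body("café".encode("latin-1"), headers)
```

A wrong or missing charset declaration still surfaces as a
UnicodeDecodeError or as mojibake, which is the kind of failure that
shows why bytes remain the safe default.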

[1] http://groups.google.com/group/comp.lang.python/browse_thread/thread/b88239182f368505
[Executive summary]
Glenn G. Chappell wrote:
    "2to3 doesn't catch it, and, in any case, why should read() return
    bytes, not string?"

Carl Banks wrote:
    It returns bytes because it doesn't know what encoding to use.
    [...]
    HOWEVER... [...] It's reasonable that IF a url request's
    "Content-type" is text, and/or the "Content-encoding" is given, for
    urllib to have an option to automatically decode and return a string
    instead of bytes.

Christian Heimes wrote:
    There is no generic and simple way to detect the encoding of a
    remote site. Sometimes the encoding is mentioned in the HTTP header,
    sometimes it's embedded in the <head> section of the HTML document.

Daniel Diniz wrote:
    [... A] "decode to declared HTTP header encoding" version of urlopen
    could be useful to give some users the output they want (text from
    network io) or to make it clear why bytes is the safe way.
[/Executive summary]
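
The two places mentioned in the summary (the HTTP header and the <head>
section) could be checked in order, roughly as sketched below. This is
an assumption-laden illustration, not part of the patch: sniff_charset
is a hypothetical helper, the regular expression is a crude
approximation rather than a real HTML parser, and the 2048-byte window
is an arbitrary choice.

```python
import re

# Rough pattern for <meta charset="..."> and
# <meta http-equiv="Content-Type" content="...; charset=...">.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_charset(body, header_charset=None):
    """Return a lowercased charset name, or None if nothing is declared.

    The charset from the HTTP header wins; otherwise the first bytes of
    the document are searched for a <meta> declaration.
    """
    if header_charset:
        return header_charset.lower()
    match = _META_CHARSET.search(body[:2048])
    if match:
        return match.group(1).decode("ascii").lower()
    return None
```

When this returns None there is no declared encoding at all, and the
caller has to either guess or keep working with bytes.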
