This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Add a "decode to declared encoding" version of urlopen to urllib
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.4
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: orsenthil Nosy List: ajaksu2, christian.heimes, eric.araujo, ezio.melotti, martin.panter, mastrodomenico, orsenthil, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2008-12-23 21:44 by ajaksu2, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
urlopen_text.diff ajaksu2, 2008-12-23 21:44 Adds a urlopen_text function, docs and tests review
Messages (13)
msg78250 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2008-12-23 21:44
This patch adds a version of urlopen that uses available encoding
information to return strings instead of bytes.

The main goal is to provide a shortcut for users who don't want to
handle the decoding themselves in the easy cases[1]. One added benefit
is that the failures of such a function would make it clear why the
2.x-style "str is either bytes or text" model is flawed for network I/O.

Currently, charset detection simply uses addinfourl.get_charset(), but
optionally checking for HTTP headers might be more robust.

[1]
http://groups.google.com/group/comp.lang.python/browse_thread/thread/b88239182f368505
[Executive summary]
Glenn G. Chappell wrote:
    "2to3 doesn't catch it, and, in any case, why should read() return 
bytes, not string?"

Carl Banks wrote:
    It returns bytes because it doesn't know what encoding to use.
    [...]
    HOWEVER... [...] It's reasonable that IF a url request's
"Content-type" is text, and/or the "Content-encoding"  is given, for
urllib to have an option to automatically decode and return a string
instead of bytes.

Christian Heimes wrote:
    There is no generic and simple way to detect the encoding of a
remote site. Sometimes the encoding is mentioned in the HTTP header,
sometimes it's embedded in the <head> section of the HTML document.

Daniel Diniz wrote:
    [... A] "decode to declared HTTP header encoding" version of urlopen
could be useful to give some users the output they want (text from
network io) or to make it clear why bytes is the safe way.
[/Executive summary]
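The idea in the patch can be sketched roughly as follows (the helper name and fallback are illustrative, not the patch's actual code; get_content_charset() is how the charset declared in the Content-Type header is read in Python 3):

```python
from email.message import Message
from io import BytesIO
from urllib.response import addinfourl

def read_text(response, fallback="latin-1"):
    # Decode the body using the charset declared in the HTTP
    # Content-Type header, falling back when none is declared.
    charset = response.headers.get_content_charset() or fallback
    return response.read().decode(charset)

# Simulated response standing in for what urlopen() returns:
headers = Message()
headers["Content-Type"] = "text/html; charset=utf-8"
resp = addinfourl(BytesIO("héllo".encode("utf-8")), headers, "http://example.com/")
print(read_text(resp))  # héllo
```

The same call works unchanged on a real urlopen() result, since addinfourl is exactly what urlopen() returns.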
msg78255 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-12-24 02:25
Thx, I'll review the patch after Christmas.
msg110832 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-19 23:07
Christian, Daniel, I take it that you're both still interested in this?
msg116187 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-09-12 13:01
Senthil: could you review the attached patch please?
msg121425 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-18 02:33
I think the patch should be updated to benefit from new facilities in the io module instead of monkey-patching methods. The doc and tests are still good.
msg146002 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-10-20 02:58
> Christian Heimes wrote:
>   There is no generic and simple way to detect the encoding of a
>   remote site. Sometimes the encoding is mentioned in the HTTP header,
>   sometimes it's embedded in the <head> section of the HTML document.

FWIW for HTML pages the encoding can be specified in at least 3 places:
* the HTTP headers: e.g. "content-type: text/html; charset=utf-8";
* the XML declaration: e.g. "<?xml version="1.0" encoding="utf-8" ?>";
* the &lt;meta&gt; tag: e.g. "<meta http-equiv="Content-Type" content="text/html; charset=utf-8">".

Browsers usually follow this order when searching for the encoding, meaning that HTTP headers have the highest priority.  The XML declaration is sometimes (mis)used in (X)HTML pages.

Anyway, since urlopen() is a generic function that can download anything, it shouldn't look at XML declarations and meta tags -- that's something parsers should take care of.

Regarding the implementation, wouldn't having a new method on the file-like object returned by urlopen be better?
Maybe something like:
>>> page = urlopen(some_url)
>>> page.encoding  # get the encoding from the HTTP headers
'utf-8'
>>> page.decode()  # same as page.read().decode(page.encoding)
'...'

The advantage of having these as new methods/attribute is that you can pass the 'page' around and other functions can get back the decoded content if/when they need to.  OTOH other file-like objects don't have similar methods, so it might get a bit confusing.
msg146003 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-10-20 03:40
page.decode_content() might be a better name, and would avoid confusion with the bytes.decode() method.
msg146005 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2011-10-20 03:52
- page.encoding is a good idea.

- page.decode_content definitely sounds better than page.decode, which can be confusing since page is not a bytes object but a file-like object.

I am wondering whether an argument to urlopen would be better. Not exactly like the mode argument of the builtin open, but something like decoded=False.

The downside is that the argument would expose an implementation detail of the method in py3k; the upside is that it gives users an idea of what return value they can/should expect.
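Put together, the attribute-plus-method interface discussed in the messages above might look roughly like this (all names are hypothetical, and a plain wrapper class stands in for the real response object):

```python
import io

class DecodingResponse:
    # Hypothetical wrapper illustrating the proposed interface;
    # not the actual urllib response object.
    def __init__(self, fp, charset):
        self._fp = fp
        self.encoding = charset  # would come from the HTTP headers

    def read(self):
        return self._fp.read()

    def decode_content(self):
        # Equivalent to self.read().decode(self.encoding)
        return self.read().decode(self.encoding)

page = DecodingResponse(io.BytesIO("héllo".encode("utf-8")), "utf-8")
print(page.encoding)  # utf-8
print(page.decode_content())
```

The point of the attribute is that 'page' can be passed around and other code can decode the content if and when it needs to.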
msg146092 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-10-21 14:57
I’m not sure real HTML (i.e. sent as text/html) should have an XML prolog honored.  For XML, there’s http://tools.ietf.org/html/rfc3023
msg161768 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-28 11:16
If you add the encoding parameter, you should also add at least errors and newline parameters. And why not just use io.TextIOWrapper?

The problem with page.decode_content() is that it forces you to read and decode all of the data at once, while io.TextIOWrapper returns a file-like object and allows you to read line by line or in other chunks.
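Serhiy's alternative can be sketched like this (an in-memory stream stands in for the live response here, but wrapping the object returned by urlopen works the same way):

```python
import io

# Wrap a binary stream in io.TextIOWrapper to get incremental,
# line-by-line decoding instead of reading everything at once.
raw = io.BytesIO("line one\nline two\n".encode("utf-8"))
charset = "utf-8"  # in practice: response.headers.get_content_charset()
text = io.TextIOWrapper(raw, encoding=charset)
for line in text:
    print(line, end="")
```

TextIOWrapper also takes the errors and newline parameters Serhiy mentions, so nothing new has to be invented for them.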
msg171356 - (view) Author: Lino Mastrodomenico (mastrodomenico) Date: 2012-09-26 19:25
FYI, the exact algorithm for determining the encoding of HTML documents is http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding

There are lots of different algorithms documented all over the intertubes for determining HTML encoding; the one above is the one used by browsers.

But that algorithm should only be used as part of a full HTML parsing library (e.g. https://code.google.com/p/html5lib/); urlopen should not attempt to do encoding sniffing on the data transferred.
msg348624 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-07-29 11:37
This feature request seems to be controversial: there is no clear consensus on which encoding should be used. I suggest simply closing the issue.

In the meanwhile, since this issue is far from being "newcomer friendly", I remove the "Easy" label.
msg408246 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2021-12-10 19:52
As Victor notes, this is a controversial issue. And I'll add that the need for this feature seems not to have been brought up in over a decade. So I'm closing this.
History
Date User Action Args
2022-04-11 14:56:43 | admin | set | github: 48983
2021-12-10 19:52:51 | ajaksu2 | set | status: open -> closed
                                      resolution: rejected
                                      messages: + msg408246
                                      stage: patch review -> resolved
2019-07-29 11:37:45 | vstinner | set | keywords: - easy
                                      nosy: + vstinner
                                      messages: + msg348624
2014-09-01 01:34:25 | martin.panter | set | nosy: + martin.panter
2012-09-26 19:25:05 | mastrodomenico | set | messages: + msg171356
2012-09-26 18:49:26 | ezio.melotti | set | versions: + Python 3.4, - Python 3.3
2012-05-28 11:16:36 | serhiy.storchaka | set | nosy: + serhiy.storchaka
                                      messages: + msg161768
2011-10-21 14:57:54 | eric.araujo | set | messages: + msg146092
2011-10-20 03:52:04 | orsenthil | set | messages: + msg146005
2011-10-20 03:40:26 | ezio.melotti | set | messages: + msg146003
2011-10-20 02:58:31 | ezio.melotti | set | messages: + msg146002
                                      versions: + Python 3.3, - Python 3.2
2011-10-19 23:56:33 | ezio.melotti | set | nosy: + ezio.melotti, - BreamoreBoy
2010-11-18 02:33:31 | eric.araujo | set | dependencies: - urllib(2) should allow automatic decoding by charset
2010-11-18 02:33:10 | eric.araujo | set | nosy: + eric.araujo
                                      messages: + msg121425
2010-11-18 02:26:19 | eric.araujo | link | issue1599329 superseder
2010-11-18 02:26:19 | eric.araujo | unlink | issue1599329 dependencies
2010-10-18 10:36:21 | orsenthil | set | assignee: orsenthil
2010-09-12 13:01:01 | BreamoreBoy | set | messages: + msg116187
2010-08-09 04:22:46 | terry.reedy | set | versions: + Python 3.2, - Python 3.1
2010-07-19 23:07:06 | BreamoreBoy | set | nosy: + BreamoreBoy
                                      messages: + msg110832
2010-01-27 23:51:03 | mastrodomenico | set | nosy: + mastrodomenico
2009-04-22 18:48:13 | ajaksu2 | set | keywords: + easy
2009-02-12 18:24:23 | ajaksu2 | set | dependencies: + urllib(2) should allow automatic decoding by charset
2009-02-12 18:23:53 | ajaksu2 | link | issue1599329 dependencies
2009-02-12 18:21:40 | ajaksu2 | set | nosy: + orsenthil
2008-12-24 02:25:57 | christian.heimes | set | priority: normal
                                      nosy: + christian.heimes
                                      messages: + msg78255
                                      stage: patch review
2008-12-23 21:44:13 | ajaksu2 | create