
Add a "decode to declared encoding" version of urlopen to urllib #48983

Closed
devdanzin mannequin opened this issue Dec 23, 2008 · 13 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments


devdanzin mannequin commented Dec 23, 2008

BPO 4733
Nosy @orsenthil, @vstinner, @tiran, @devdanzin, @ezio-melotti, @merwok, @vadmium, @serhiy-storchaka
Files
  • urlopen_text.diff: Adds a urlopen_text function, docs and tests
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/orsenthil'
    closed_at = <Date 2021-12-10.19:52:51.555>
    created_at = <Date 2008-12-23.21:44:13.573>
    labels = ['type-feature', 'library']
    title = 'Add a "decode to declared encoding" version of urlopen to urllib'
    updated_at = <Date 2021-12-10.19:52:51.555>
    user = 'https://github.com/devdanzin'

    bugs.python.org fields:

    activity = <Date 2021-12-10.19:52:51.555>
    actor = 'ajaksu2'
    assignee = 'orsenthil'
    closed = True
    closed_date = <Date 2021-12-10.19:52:51.555>
    closer = 'ajaksu2'
    components = ['Library (Lib)']
    creation = <Date 2008-12-23.21:44:13.573>
    creator = 'ajaksu2'
    dependencies = []
    files = ['12437']
    hgrepos = []
    issue_num = 4733
    keywords = ['patch']
    message_count = 13.0
    messages = ['78250', '78255', '110832', '116187', '121425', '146002', '146003', '146005', '146092', '161768', '171356', '348624', '408246']
    nosy_count = 9.0
    nosy_names = ['orsenthil', 'vstinner', 'christian.heimes', 'ajaksu2', 'ezio.melotti', 'eric.araujo', 'mastrodomenico', 'martin.panter', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue4733'
    versions = ['Python 3.4']


    devdanzin mannequin commented Dec 23, 2008

    This patch adds a version of urlopen that uses available encoding
    information to return strings instead of bytes.

    The main goal is to provide a shortcut for users who don't want to
    handle the decoding themselves in the easy cases[1]. An added benefit
    is that the failures of such a function would make it clear why the
    2.x-style "str is either bytes or text" model is flawed for network IO.

    Currently, charset detection simply uses addinfourl.get_charset(), but
    optionally checking for HTTP headers might be more robust.

    [1]
    http://groups.google.com/group/comp.lang.python/browse_thread/thread/b88239182f368505
    [Executive summary]
    Glenn G. Chappell wrote:
    "2to3 doesn't catch it, and, in any case, why should read() return
    bytes, not string?"

    Carl Banks wrote:
    It returns bytes because it doesn't know what encoding to use.
    [...]
    HOWEVER... [...] It's reasonable that IF a url request's
    "Content-type" is text, and/or the "Content-encoding" is given, for
    urllib to have an option to automatically decode and return a string
    instead of bytes.

    Christian Heimes wrote:
    There is no generic and simple way to detect the encoding of a
    remote site. Sometimes the encoding is mentioned in the HTTP header,
    sometimes it's embedded in the <head> section of the HTML document.

    Daniel Diniz wrote:
    [... A] "decode to declared HTTP header encoding" version of urlopen
    could be useful to give some users the output they want (text from
    network io) or to make it clear why bytes is the safe way.
    [/Executive summary]
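    The approach described in the summary above can be sketched in a few lines. This is an illustrative approximation, not the attached patch: the function name `urlopen_text` and the `fallback` parameter are assumptions, while `headers.get_content_charset()` is the real stdlib call that reads the charset from the Content-Type header.

    ```python
    from urllib.request import urlopen

    def urlopen_text(url, fallback="utf-8"):
        """Open *url* and decode the body using the declared charset.

        Sketch only: names and fallback behaviour are assumptions,
        not taken from the actual patch.
        """
        with urlopen(url) as response:
            # get_content_charset() extracts the charset parameter from the
            # Content-Type header, e.g. "text/html; charset=iso-8859-1".
            charset = response.headers.get_content_charset() or fallback
            return response.read().decode(charset)
    ```

    When the server declares no charset, the fallback kicks in; as the thread notes, that is exactly the case where any such shortcut becomes a guess.
    
    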

    @devdanzin devdanzin mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Dec 23, 2008

    tiran commented Dec 24, 2008

    Thx, I'll review the patch after Christmas.

    @devdanzin devdanzin mannequin added the easy label Apr 22, 2009

    BreamoreBoy mannequin commented Jul 19, 2010

    Christian, Daniel, I take it that you're both still interested in this?


    BreamoreBoy mannequin commented Sep 12, 2010

    Senthil: could you review the attached patch please?

    @orsenthil orsenthil self-assigned this Oct 18, 2010

    merwok commented Nov 18, 2010

    I think the patch should be updated to benefit from new facilities in the io module instead of monkey-patching methods. The doc and tests are still good.

    ezio-melotti commented:

    Christian Heimes wrote:
    There is no generic and simple way to detect the encoding of a
    remote site. Sometimes the encoding is mentioned in the HTTP header,
    sometimes it's embedded in the <head> section of the HTML document.

    FWIW for HTML pages the encoding can be specified in at least 3 places:

    • the HTTP headers: e.g. "content-type: text/html; charset=utf-8";
    • the XML declaration: e.g. "<?xml version="1.0" encoding="utf-8" ?>";
    • the <meta> tag: e.g. "<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

    Browsers usually follow this order when searching for the encoding, meaning that the HTTP headers have the highest priority. The XML declaration is sometimes (mis)used in (X)HTML pages.

    Anyway, since urlopen() is a generic function that can download anything, it shouldn't look at XML declarations and meta tags -- that's something parsers should take care of.

    Regarding the implementation, wouldn't it be better to have a new method on the file-like object returned by urlopen?
    Maybe something like:
    >>> page = urlopen(some_url)
    >>> page.encoding  # get the encoding from the HTTP headers
    'utf-8'
    >>> page.decode()  # same as page.read().decode(page.encoding)
    '...'

    The advantage of having these as new methods/attribute is that you can pass the 'page' around and other functions can get back the decoded content if/when they need to. OTOH other file-like objects don't have similar methods, so it might get a bit confusing.
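    The API sketched in this comment can be approximated with a thin wrapper around the object returned by urlopen. The class name `DecodingResponse` is hypothetical; `headers.get_content_charset()` is the actual stdlib method for reading the declared charset.

    ```python
    class DecodingResponse:
        """Hypothetical wrapper adding the .encoding attribute and
        .decode_content() method proposed above; not stdlib API."""

        def __init__(self, response, fallback="utf-8"):
            self._response = response
            # Charset declared in the HTTP Content-Type header, if any.
            self.encoding = response.headers.get_content_charset() or fallback

        def decode_content(self):
            # Same as response.read().decode(self.encoding).
            return self._response.read().decode(self.encoding)
    ```

    Such a wrapper can be passed around like the original response, and any consumer can call `decode_content()` if and when it needs the text.
    
    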

    ezio-melotti commented:

    page.decode_content() might be a better name, and would avoid confusion with the bytes.decode() method.

    orsenthil commented:

    • page.encoding is a good idea.

    • page.decode_content sounds definitely better than page.decode which can be confusing as page is not a bytes object, but a file-like object.

    I am wondering whether an argument to urlopen would be better? Not exactly like the mode argument of the builtin open, but something like decoded=False.

    The downside is that the argument exposes an implementation detail of the method in py3k; the upside is that it tells users what return value they can/should expect.


    merwok commented Oct 21, 2011

    I’m not sure an XML prolog should be honored for real HTML (i.e. content sent as text/html). For XML, there’s http://tools.ietf.org/html/rfc3023

    serhiy-storchaka commented:

    If you add the encoding parameter, you should also add at least errors and newline parameters. And why not just use io.TextIOWrapper?

    The problem with page.decode_content() is that it forces you to read and decode all of the data at once, while io.TextIOWrapper returns a file-like object and lets you read line by line or in other pieces.
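    The io.TextIOWrapper approach suggested here might look roughly as follows. The helper name `open_text` and the `fallback` default are assumptions, not stdlib API; the wrapping itself works because the HTTP response object is a binary file-like object.

    ```python
    import io
    from urllib.request import urlopen

    def open_text(url, fallback="utf-8", errors="strict", newline=None):
        # Return a lazily-decoding text stream over the HTTP response.
        # The response is a BufferedIOBase, so TextIOWrapper can wrap it
        # directly; nothing is read or decoded up front.
        response = urlopen(url)
        charset = response.headers.get_content_charset() or fallback
        return io.TextIOWrapper(response, encoding=charset,
                                errors=errors, newline=newline)
    ```

    A caller can then iterate `for line in open_text(url): ...` and the body is fetched and decoded incrementally, which is the advantage over a one-shot decode_content().
    
    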


    mastrodomenico mannequin commented Sep 26, 2012

    FYI, the exact algorithm for determining the encoding of HTML documents is http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding

    There are lots of different algorithms documented all over the intertubes for determining HTML encoding; the one above is the one used by browsers.

    But that should only be used as part of a full HTML parsing library (e.g. https://code.google.com/p/html5lib/), urlopen should not attempt to do encoding sniffing from the data transferred.

    vstinner commented:

    This feature request seems to be controversial: there is no clear consensus on which encoding should be used. I suggest to simply close the issue.

    In the meantime, since this issue is far from being "newcomer friendly", I am removing the "Easy" label.

    @vstinner vstinner removed the easy label Jul 29, 2019

    devdanzin mannequin commented Dec 10, 2021

    As Victor notes, this is a controversial issue. And I'll add that the need for this feature seems not to have been brought up in over a decade. So I'm closing this.

    @devdanzin devdanzin mannequin closed this as completed Dec 10, 2021
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022