Add a "decode to declared encoding" version of urlopen to urllib #48983
Comments
This patch adds a version of urlopen that uses available encoding The main goal is to provide a shortcut for users that don't want to Currently, charset detection simply uses addinfourl.get_charset(), but [1] Carl Banks wrote: Christian Heimes wrote: Daniel Diniz wrote: |
Thx, I'll review the patch after Christmas. |
Christian, Daniel, I take it that you're both still interested in this? |
Senthil: could you review the attached patch please? |
I think the patch should be updated to benefit from new facilities in the io module instead of monkey-patching methods. The doc and tests are still good. |
FWIW for HTML pages the encoding can be specified in at least 3 places: the HTTP headers, the XML declaration, and the <meta> tag.
Browsers usually follow this order while searching the encoding, meaning that HTTP headers have the highest priority. The XML declaration is sometimes (mis)used in (X)HTML pages. Anyway, since urlopen() is a generic function that can download anything, it shouldn't look at XML declarations and meta tags -- that's something parsers should take care of.
Regarding the implementation, wouldn't having a new method on the file-like object returned by urlopen be better?
Maybe something like:
>>> page = urlopen(some_url)
>>> page.encoding # get the encoding from the HTTP headers
'utf-8'
>>> page.decode() # same as page.read().decode(page.encoding)
'...'
The advantage of having these as new methods/attribute is that you can pass the 'page' around and other functions can get back the decoded content if/when they need to. OTOH other file-like objects don't have similar methods, so it might get a bit confusing. |
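For reference, the page.read().decode(page.encoding) idea above can be approximated with the headers object alone. This is a minimal sketch, assuming the response headers are an email.message.Message (as they are for urlopen responses in Python 3); decode_page and the UTF-8 fallback are illustrative choices, not part of the attached patch:

```python
from email.message import Message


def decode_page(body: bytes, headers: Message, fallback: str = "utf-8") -> str:
    """Decode body using the charset declared in the Content-Type header.

    Hypothetical helper: only the HTTP header is consulted; no sniffing of
    XML declarations or <meta> tags is attempted.
    """
    charset = headers.get_content_charset() or fallback
    return body.decode(charset)


# Stand-in for the headers of a real response (page.headers).
headers = Message()
headers["Content-Type"] = "text/html; charset=iso-8859-1"
text = decode_page("café".encode("iso-8859-1"), headers)
```

When the header declares no charset, the sketch falls back to UTF-8; a real implementation would have to decide whether guessing is acceptable at all.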
page.decode_content() might be a better name, and would avoid confusion with the bytes.decode() method. |
I am wondering whether an argument to urlopen would be better. Not exactly like the mode argument of the builtin open, but something like decoded=False. The downside is that the argument exposes an implementation detail of the function in py3k; the upside is that it gives users an idea of what return value they can/should expect. |
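A rough sketch of how such a flag could behave, written as a wrapper rather than a change to urlopen itself. The name fetch, the decoded parameter, and the UTF-8 fallback are all hypothetical; the data: URL is used only so the example runs without network access:

```python
from urllib.request import urlopen


def fetch(url: str, decoded: bool = False):
    """Hypothetical wrapper: returns bytes by default, str if decoded=True."""
    with urlopen(url) as page:
        body = page.read()
        if not decoded:
            return body
        # Charset taken from the Content-Type header only; fallback is an assumption.
        charset = page.headers.get_content_charset() or "utf-8"
        return body.decode(charset)


url = "data:text/plain;charset=utf-8,caf%C3%A9"
raw = fetch(url)                  # bytes, as urlopen().read() returns today
text = fetch(url, decoded=True)   # str, decoded with the declared charset
```

The return type depending on a flag is exactly the trade-off described above: convenient to call, but callers have to know which mode they asked for.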
I’m not sure real HTML (i.e. sent as text/html) should have an XML prolog honored. For XML, there’s http://tools.ietf.org/html/rfc3023 |
If you add the encoding parameter, you should also add at least the errors and newline parameters. And why not just use io.TextIOWrapper? page.decode_content() is bad in that it forces you to read and decode all of the data at once, while io.TextIOWrapper returns a file-like object and lets you read line by line or in other chunks. |
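The io.TextIOWrapper approach could look roughly like this. The sketch uses an io.BytesIO as a stand-in for the object returned by urlopen(), and hard-codes the encoding that would normally come from the HTTP headers; both are assumptions made to keep the example self-contained:

```python
import io

# Stand-in for the binary file-like object returned by urlopen(); a real
# HTTPResponse is an io.BufferedIOBase and can be wrapped the same way.
raw = io.BytesIO("première ligne\r\nseconde ligne\r\n".encode("utf-8"))

# encoding would normally be read from the response headers; errors and
# newline are the extra parameters mentioned above.
page = io.TextIOWrapper(raw, encoding="utf-8", errors="replace", newline=None)

# Lines are decoded incrementally as they are read, not all at once.
lines = [line.rstrip("\n") for line in page]
```

With newline=None the wrapper applies universal newlines, so the "\r\n" line endings arrive as "\n" before being stripped here.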
FYI, the exact algorithm for determining the encoding of HTML documents is at http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding
There are lots of different algorithms documented all over the intertubes for determining HTML encoding; the one above is the one used by browsers. But that should only be used as part of a full HTML parsing library (e.g. https://code.google.com/p/html5lib/); urlopen should not attempt to do encoding sniffing from the data transferred. |
This feature request seems to be controversial: there is no clear consensus on which encoding should be used. I suggest simply closing the issue. In the meantime, since this issue is far from being "newcomer friendly", I am removing the "Easy" label. |
As Victor notes, this is a controversial issue. And I'll add that the need for this feature seems not to have been brought up in over a decade. So I'm closing this. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.