Automatically convert character references in HTMLParser #57842

ezio-melotti · 2011-12-19T06:55:57Z

BPO	13633
Nosy	@ezio-melotti, @merwok, @bitdancer, @serhiy-storchaka
Dependencies	bpo-2927: expose html.parser.unescape bpo-11113: html.entities mapping dicts need updating?
Files	issue13633.diff issue13633-2.diff

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = 'https://github.com/ezio-melotti'
closed_at = <Date 2013-11-23.18:17:29.997>
created_at = <Date 2011-12-19.06:55:56.952>
labels = ['type-feature', 'library']
title = 'Automatically convert character references in HTMLParser'
updated_at = <Date 2013-11-23.18:17:29.995>
user = 'https://github.com/ezio-melotti'

bugs.python.org fields:

activity = <Date 2013-11-23.18:17:29.995>
actor = 'ezio.melotti'
assignee = 'ezio.melotti'
closed = True
closed_date = <Date 2013-11-23.18:17:29.997>
closer = 'ezio.melotti'
components = ['Library (Lib)']
creation = <Date 2011-12-19.06:55:56.952>
creator = 'ezio.melotti'
dependencies = ['2927', '11113']
files = ['32729', '32803']
hgrepos = []
issue_num = 13633
keywords = ['patch']
message_count = 8.0
messages = ['149822', '154036', '188223', '203520', '203836', '204041', '204065', '204068']
nosy_count = 5.0
nosy_names = ['ezio.melotti', 'eric.araujo', 'r.david.murray', 'python-dev', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue13633'
versions = ['Python 3.4']

ezio-melotti · 2011-12-19T06:55:56Z

The docs for handle_charref and handle_entityref say:
"""
HTMLParser.handle_charref(name)
    This method is called to process a character reference of the form "&#ref;". It is intended to be overridden by a derived class; the base class implementation does nothing.

HTMLParser.handle_entityref(name)
    This method is called to process a general entity reference of the form "&name;" where name is an general entity reference. It is intended to be overridden by a derived class; the base class implementation does nothing.
"""

The doc doesn't mention hex references, like "&#x3E;", and apparently they are passed to handle_charref without the '&#' but with the leading 'x':

>>> from HTMLParser import HTMLParser
>>> class MyParser(HTMLParser):
...   def handle_charref(self, data):
...     print data
... 
>>> MyParser().feed('&gt; &#62; &#x3E;')
62
x3E

I've seen code in the wild doing unichr(int(data)) in handle_charref (once the authors figured out that '62' is passed), which then fails when a hex reference is found. Passing 'x3E' doesn't seem too useful, because the user first has to check whether there's a leading 'x'; if there is, remove it, convert the hex string to int, and finally use unichr() to get the char; otherwise just convert to int and use unichr().
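For illustration, the manual conversion described above could look roughly like this. It is a minimal sketch in Python 3 (so chr() stands in for unichr()), and charref_to_char is a hypothetical helper name, not part of HTMLParser:

```python
def charref_to_char(name):
    """Convert the string passed to handle_charref into a character.

    Hypothetical helper: `name` is what the parser passes, e.g. '62'
    for a decimal reference or 'x3E' for a hex one.
    """
    if name[:1] in ('x', 'X'):
        # Hex reference: drop the leading 'x' and parse as base 16.
        codepoint = int(name[1:], 16)
    else:
        # Decimal reference.
        codepoint = int(name)
    return chr(codepoint)


print(charref_to_char('62'))
print(charref_to_char('x3E'))
```

Both calls yield the same character, which is the check every handle_charref override currently has to reimplement.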

There are 3 possible solutions:

1. just document the behavior;
2. normalize the hex values before passing them to handle_charref and document it;
3. add a new handle_entity method that is called with the character represented by the entity (named, decimal, or hex).

The first solution alone doesn't solve much, but the doc should be clearer regardless of the decision we take.
The second one is better, but if it's implemented there won't be any way to know if the entity had a decimal or hex value anymore (does anyone care?). The normalization should also convert the hex string to int and then convert it back to str to be consistent with decimal entities.
The third one might be better, but doesn't solve the issue on 2.7/3.2. People don't care about entities and just want the equivalent char, so having a method that converts them already sounds like a useful feature to me.

ezio-melotti · 2012-02-23T02:38:49Z

This behavior is now documented, but the situation could still be improved. Adding a new method that receives the converted entity seems a good way to handle this. The parser can call both, and users can pick either one.
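As a rough sketch of that idea, a subclass can already route both existing callbacks to a single method that receives the converted character. The names ConvertingParser and handle_converted_ref are made up for this example, and the conversion is delegated to html.unescape (Python 3.4+), which is an assumption about how such a method could be implemented, not what any patch here does:

```python
import html
from html.parser import HTMLParser


class ConvertingParser(HTMLParser):
    """Forward every character reference to one hook, already converted."""

    def __init__(self):
        # Keep the old callbacks active so they can be forwarded.
        super().__init__(convert_charrefs=False)
        self.converted = []

    def handle_charref(self, name):
        # name is e.g. '62' or 'x3E'; rebuild the reference and convert it.
        self.handle_converted_ref(html.unescape('&#%s;' % name))

    def handle_entityref(self, name):
        # name is e.g. 'gt'.
        self.handle_converted_ref(html.unescape('&%s;' % name))

    def handle_converted_ref(self, char):
        # Hypothetical hook: receives the character, not the raw reference.
        self.converted.append(char)


parser = ConvertingParser()
parser.feed('&gt; &#62; &#x3E;')
parser.close()
print(parser.converted)
```

Code that still needs the raw reference can keep overriding the old methods; code that only wants the character overrides the single hook.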

One problem with the current methods (handle_charref and handle_entityref) is that they don't do any processing on the entity and let invalid character references like &#x1000000000; or &#iamnotanentity; go through.

There are at least 3 changes that should be made in order to follow the HTML5 standard [0]:

1. the parser should look at html.entities while parsing named character references (see also bpo-11113). This will allow the parser to parse &notit; as "¬it;" and &notin; as "∉" (see note at the very end of [0]);
2. invalid character references (e.g. &#x1000000000;, &#iamnotanentity;) should not go through;
3. the table at [0] with the replacement characters should be used by the parser to "correct" those invalid character references (e.g. 0x80 -> U+20AC).

Now, 1) can be done for both the old and new method, but for 2) and 3) the situation is a bit more complicated. The best thing is probably to keep sending them unchanged to the old methods, and implement the correct behavior for the new method only.
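To make the three points concrete, here is what html.unescape (the function exposed through bpo-2927, available from Python 3.4) returns for the examples above. This only illustrates the HTML5 rules as implemented by that function; it is not a statement about what HTMLParser itself did at the time of this message:

```python
import html

samples = [
    '&notit;',          # longest known name is "not", the rest stays text
    '&notin;',          # a complete named reference
    '&#x80;',           # remapped through the replacement table
    '&#x1000000000;',   # out of range for Unicode
    '&#iamnotanentity;' # not a character reference at all
]

results = {ref: html.unescape(ref) for ref in samples}
for ref, value in results.items():
    # ascii() keeps the output readable regardless of terminal encoding.
    print(ref, '->', ascii(value))
```

The out-of-range reference comes back as U+FFFD and the malformed one is left untouched, so neither reaches the caller as a bogus code point.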

ezio-melotti · 2013-05-01T13:24:38Z

Another option is to add a new "convert_entities" option that, when True, automatically converts character references and doesn't call handle_charref and handle_entityref. (See also bpo-17802.)

ezio-melotti · 2013-11-20T18:46:21Z

Here is a patch.
It might also be a good idea to add a warning when the option is not explicitly set to False, and change the default to True in 3.5/3.6.

serhiy-storchaka · 2013-11-22T19:01:39Z

I have added a couple of nitpicks on Rietveld. You can ignore most of them. ;)

ezio-melotti · 2013-11-23T15:44:24Z

New patch attached.

python-dev · 2013-11-23T17:52:20Z

New changeset 1575f2dd08c4 by Ezio Melotti in branch 'default':
bpo-13633: Added a new convert_charrefs keyword arg to HTMLParser that, when True, automatically converts all character references.
http://hg.python.org/cpython/rev/1575f2dd08c4
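
A small sketch of how the new keyword argument changes what a subclass sees, using the same input as the original report. Recorder and parse are illustrative names only; the convert_charrefs argument and the handler methods are the real HTMLParser API:

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Record which callbacks fire and with which argument."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_charref(self, name):
        self.events.append(('charref', name))

    def handle_entityref(self, name):
        self.events.append(('entityref', name))


def parse(text, convert_charrefs):
    parser = Recorder(convert_charrefs)
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return parser.events


text = '&gt; &#62; &#x3E;'
old_style = parse(text, convert_charrefs=False)
new_style = parse(text, convert_charrefs=True)
print(old_style)
print(new_style)
```

With convert_charrefs=False the references arrive through handle_entityref and handle_charref as before; with True they are converted and delivered as ordinary text through handle_data.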

ezio-melotti · 2013-11-23T18:17:30Z

Fixed, thanks for the reviews!

ezio-melotti self-assigned this Dec 19, 2011

ezio-melotti added the stdlib (Python modules in the Lib dir) and type-feature (A feature request or enhancement) labels Dec 19, 2011

ezio-melotti changed the title ~~Handling of hex character references in HTMLParser.handle_charref~~ Automatically convert character references in HTMLParser Nov 20, 2013

ezio-melotti closed this as completed Nov 23, 2013

ezio-melotti transferred this issue from another repository Apr 10, 2022
