html.entities mapping dicts need updating? #55322

Closed
BrianJones mannequin opened this issue Feb 4, 2011 · 22 comments
Labels
stdlib Python modules in the Lib dir topic-unicode topic-XML type-feature A feature request or enhancement

Comments

@BrianJones
Mannequin

BrianJones mannequin commented Feb 4, 2011

BPO 11113
Nosy @loewis, @ericvsmith, @ezio-melotti, @merwok
Files
  • entities_dict.py: dict with the HTML5 entities
  • entities.py: dict ('name;': 'str';) with the 2231 HTML5 entities
  • issue11113.diff
  • issue11113-2.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = <Date 2012-06-24.02:40:13.535>
    created_at = <Date 2011-02-04.03:43:54.241>
    labels = ['expert-XML', 'type-feature', 'library', 'expert-unicode']
    title = 'html.entities mapping dicts need updating?'
    updated_at = <Date 2012-06-24.03:32:54.554>
    user = 'https://bugs.python.org/BrianJones'

    bugs.python.org fields:

    activity = <Date 2012-06-24.03:32:54.554>
    actor = 'eric.araujo'
    assignee = 'ezio.melotti'
    closed = True
    closed_date = <Date 2012-06-24.02:40:13.535>
    closer = 'ezio.melotti'
    components = ['Library (Lib)', 'Unicode', 'XML']
    creation = <Date 2011-02-04.03:43:54.241>
    creator = 'Brian.Jones'
    dependencies = []
    files = ['23803', '26107', '26110', '26113']
    hgrepos = []
    issue_num = 11113
    keywords = ['patch']
    message_count = 22.0
    messages = ['127865', '127873', '127911', '128080', '128081', '128082', '138318', '138349', '138351', '138366', '140783', '148549', '148615', '163634', '163641', '163654', '163656', '163701', '163704', '163705', '163706', '163707']
    nosy_count = 7.0
    nosy_names = ['loewis', 'eric.smith', 'ezio.melotti', 'eric.araujo', 'Brian.Jones', 'python-dev', 'hp.dekoning']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue11113'
    versions = ['Python 3.3']

    @BrianJones
    Mannequin Author

    BrianJones mannequin commented Feb 4, 2011

    In Python 3.2b2, html.entities.codepoint2name and name2codepoint only support the 252 HTML entity names defined in the HTML 4 spec from 1997. I'm wondering if there's a reason not to support the W3C Recommendation 'XML Entity Definitions for Characters':

    http://www.w3.org/TR/xml-entity-names/

    This standard contains significantly more characters, and it is noted in that spec that the HTML 5 drafts use that spec's entities. You can see the current HTML 5 'Named character references' here:

    http://www.w3.org/TR/html5/named-character-references.html#named-character-references

    If this is just a matter of somebody going in to do the grunt work, let me know.

    If startup costs associated with importing a huge dictionary are a concern, perhaps a more efficient type that enables the same lookup interface can be defined.

    If other reasons exist to not move in this direction, please do let me know!

    @BrianJones BrianJones mannequin added stdlib Python modules in the Lib dir topic-unicode topic-XML type-feature A feature request or enhancement labels Feb 4, 2011
    @loewis
    Mannequin

    loewis mannequin commented Feb 4, 2011

    Supporting the ones in HTML 5 would be fine with me. Supporting those of xml-entity-names would be inappropriate - it's not clear (to me, at least) that all of them are really meant for use in HTML.

    @merwok
    Member

    merwok commented Feb 4, 2011

    Agreed with Martin. I wonder if we should provide a means to use only HTML 4.01 entity references (say with a function parameter html5 defaulting to True) or we should just update the mapping.

    @ericvsmith
    Member

    I don't see the need for a parameter to support different sets of entities. Just supporting the ones from HTML 5 seems like the right thing.

    @merwok
    Member

    merwok commented Feb 6, 2011

    To make my intent explicit: an updated mapping could generate references invalid for 4.01.

    @ericvsmith
    Member

    Ah. I hadn't thought of generating them, only parsing them. In that case, then yes, it's an issue for generation.
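To make the generation concern concrete, here is a rough sketch of an encoder that only emits HTML 4.01 names. The helper name `to_html4_refs` is hypothetical and not part of the stdlib; it assumes the existing `html.entities.codepoint2name` dict and falls back to numeric references for everything else.

```python
from html.entities import codepoint2name


def to_html4_refs(text):
    """Replace non-ASCII characters with HTML 4.01 named or numeric references."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # Plain ASCII is passed through untouched in this sketch.
            parts.append(ch)
        elif cp in codepoint2name:
            # Named reference that HTML 4.01 defines.
            parts.append('&%s;' % codepoint2name[cp])
        else:
            # No HTML 4.01 name: use a decimal numeric reference.
            parts.append('&#%d;' % cp)
    return ''.join(parts)


print(to_html4_refs('caf\xe9 \u0103'))   # caf&eacute; &#259;
```

If the mapping were simply swapped for a larger one, the middle branch would start emitting names that an HTML 4.01 consumer does not know.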

    @merwok
    Member

    merwok commented Jun 14, 2011

    I just closed bpo-12329 as a duplicate of this bug. It requested the addition of the apos named entity reference.

    TTBOMK, the html module (or htmlentitydefs in 2.x) doesn’t claim to support XHTML; an XML parser should be used for XHTML. In HTML 4.01, apos is not defined, but it is in HTML5.

    @hpdekoning
    Mannequin

    hpdekoning mannequin commented Jun 14, 2011

    The reason I raised bpo-12329 was that the v2.7.1 documentation in
    http://docs.python.org/library/htmllib.html#module-htmlentitydefs
    says:
    "... The definition provided here contains all the entities defined by XHTML 1.0 ..."
    The only diff between the 252 HTML 4.01 and 253 XHTML 1.0 entities is "apos". See http://www.w3.org/TR/html401/sgml/entities.html and http://www.w3.org/TR/xhtml1/dtds.html .

    @hpdekoning
    Mannequin

    hpdekoning mannequin commented Jun 14, 2011

    BTW, the HTMLParser module (as well as html.parser in 3.x) does claim to parse both HTML and XHTML, see http://docs.python.org/library/htmlparser.html#module-HTMLParser .

    @merwok
    Member

    merwok commented Jun 15, 2011

    Ah, this changes the situation. I suppose it’s too late to stop pretending that HTML and XHTML are nearly the same thing (IOW change the doc), so apos needs to be defined for XHTML.

    IMO, we need a way to have the right entity references for HTML 4.01, XHTML 1.0 and HTML5, not put them all in one mapping.

    @ezio-melotti
    Member

    Having them in different mappings would be good, but I expect that for most real-world applications a single mapping that includes them all is the way to go. If I'm parsing a supposedly HTML page that contains an &apos; I'd rather have it converted even if it's not an HTML entity.
    If the set of entities supported by HTML5 is a superset of the HTML4 and XHTML ones, then we might just use that (I haven't checked though).
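The superset question can be checked with a few lines. This is only a sketch and assumes the `html5` dict that this issue eventually added to `html.entities` (Python 3.3+), whose keys carry a trailing ';' while the HTML 4 names do not.

```python
from html.entities import html5, name2codepoint

# HTML 4 names are stored without ';', html5 keys with it.
missing = [name for name in name2codepoint if name + ';' not in html5]
print(missing)

# 'apos' is absent from the HTML 4 mapping but is an HTML5 name.
print('apos' in name2codepoint, 'apos;' in html5)   # False True

# Same name does not guarantee the same character, so compare the values too.
changed = sorted(name for name, cp in name2codepoint.items()
                 if html5.get(name + ';') != chr(cp))
print(changed)
```

An empty `missing` list means the names are a superset; anything in `changed` is a name whose HTML5 expansion differs from the HTML 4 code point.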

    @ezio-melotti ezio-melotti self-assigned this Nov 29, 2011
    @ezio-melotti
    Member

    http://www.w3.org/TR/html5/named-character-references.html lists 2152 HTML 5 entities (see also attached file for a dict generated from that table).
    Currently html.entities only has 252 entities, organized in 3 dicts:

    1. name -> intvalue (e.g. 'amp': 0x0026);
    2. intvalue -> name (e.g. 0x0026: 'amp');
    3. name -> char (e.g. 'amp': '&');
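For reference, a minimal sketch of those three mappings as they appear in the Python 3 `html.entities` module, where they are named `name2codepoint`, `codepoint2name` and `entitydefs`:

```python
from html.entities import codepoint2name, entitydefs, name2codepoint

# 1. name -> intvalue
print(hex(name2codepoint['amp']))   # 0x26
# 2. intvalue -> name
print(codepoint2name[0x0026])       # amp
# 3. name -> char
print(entitydefs['amp'])            # &

# The name-keyed dicts cover the same HTML 4 entity set.
print(len(name2codepoint), len(entitydefs))
```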

    In HTML 5, some of the entities map to a sequence of 2 characters, for example ≂̸ corresponds to [U+2242, U+0338] (i.e. MINUS TILDE + COMBINING LONG SOLIDUS OVERLAY).

    This means that:

    1. the current approach of having a dict with name -> intvalue doesn't work anymore, and a name -> valuelist should be used instead;
    2. the reverse dict for this would have to use tuples as keys, but I'm not sure how useful that would be (producing entities is not a common case, especially "unusual" ones like these);
    3. the name -> char dict might still be useful, and can easily become a name -> str dict in order to deal with the multichar entities.

    Since 1) is not backward-compatible the HTML5 entities should probably go in a separate dict.

    Also note that the entities are case-sensitive and some of them include different spellings (e.g. both 'amp' and 'AMP' map to '&'), so the reverse dict won't work too well. Having '&' -> 'amp' seems better than '&' -> 'AMP', but this might not be obvious for all the entities and requires some extra logic in the code to get it right.
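Both points (multi-character values and case variants) are easy to see in the `html5` dict that was later committed for this issue. The snippet below is just an illustration and assumes Python 3.3+; `NotEqualTilde;` is used as one name that expands to U+2242 U+0338.

```python
from html.entities import html5

# One reference can expand to two code points.
value = html5['NotEqualTilde;']
print(['U+%04X' % ord(ch) for ch in value])    # ['U+2242', 'U+0338']

# Names are case-sensitive, and different spellings can share a value.
print(html5['amp;'] == html5['AMP;'] == '&')   # True

# Every name that maps to '&', which is why a plain reverse dict is ambiguous.
print(sorted(name for name, char in html5.items() if char == '&'))
```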

    @loewis
    Mannequin

    loewis mannequin commented Nov 29, 2011

    1. the current approach of having a dict with name -> intvalue doesn't work anymore, and a name -> valuelist should be used instead;
    2. the reverse dict for this would have to use tuples as keys, but I'm not sure how useful that would be (producing entities is not a common case, especially "unusual" ones like these);
    3. the name -> char dict might still be useful, and can easily become a name -> str dict in order to deal with the multichar entities.

    Since 1) is not backward-compatible the HTML5 entities should probably go in a separate dict.

    +1 for a separate dict; -1 for a value list. The right value type is
    'str'; name2codepoint ought to be deprecated (it's a left-over from
    when the str type wasn't unicode in 2.x).

    As for the reverse mapping: I'd add a dictionary that is reverse to
    entitydefs (i.e. with str keys). That some keys then have two characters
    is no real issue: applications that want to use this dictionary can
    either ignore them, or follow the approach of always checking
    Unicode combining characters - I'd expect that all "second" characters
    are indeed combining.

    OTOH, it's easy enough to create an inverted dictionary yourself
    when you need it, and not every three-line function needs to be
    in the standard library. It might actually be more useful to compile
    the values into a regular expression which you can then use to
    find out whether characters can be escaped using entity references.
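A sketch of what that could look like, inverting the dict and compiling the values into a regular expression. It assumes the `html5` dict eventually added by this issue; the names `preferred`, `char2name` and `escape_named` and the tie-break rule are hypothetical choices for illustration, and ASCII-only values are skipped to keep the example small.

```python
import re
from html.entities import html5


def preferred(name):
    # Hypothetical tie-break: lowercase spellings first, then shorter names,
    # then alphabetical order.
    return (name != name.lower(), len(name), name)


# Consider only ';'-terminated names whose value contains a non-ASCII character.
names = [n for n in html5
         if n.endswith(';') and any(ord(ch) > 127 for ch in html5[n])]

# Inverted dictionary with str keys; the most preferred name wins.
char2name = {}
for name in sorted(names, key=preferred):
    char2name.setdefault(html5[name], name)

# Longest values first, so two-character sequences match before their prefix.
pattern = re.compile('|'.join(
    re.escape(value) for value in sorted(char2name, key=len, reverse=True)))


def escape_named(text):
    """Replace every character sequence that has a name with '&name;'."""
    return pattern.sub(lambda m: '&' + char2name[m.group()], text)


print(escape_named('caf\xe9'))   # caf&eacute;
```

`pattern.search(text)` alone answers the narrower question of whether anything in a string can be written as a named reference.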

    @ezio-melotti
    Member

    Attached another file with a dict that contains the 2231 HTML5 entities listed at http://www.w3.org/TR/html5/named-character-references.html
    The dict is like:

    html5namedcharref = {
        'Aacute;': '\xc1',
        'Aacute': '\xc1',
        'aacute;': '\xe1',
        'aacute': '\xe1',
        'Abreve;': '\u0102',
        'abreve;': '\u0103',
        ...
    }

    A better name could be found for the dict if you have better ideas (maybe html.entities.html5 only?). The dict will be added to html.entities.

    @ezio-melotti
    Member

    Here is a proper patch, still using the html5namedcharref name.
    HTMLParser should also be updated to use this dict.

    @loewis
    Mannequin

    loewis mannequin commented Jun 23, 2012

    How about calling it just "html5", or "HTML5"? That it is about entities already follows from the module name.

    @ezio-melotti
    Member

    Here's a new patch that uses the "html5" name for the dict. If there aren't other comments I'll commit it.

    @python-dev
    Mannequin

    python-dev mannequin commented Jun 24, 2012

    New changeset 2b54e25d6ecb by Ezio Melotti in branch 'default':
    bpo-11113: add a new "html5" dictionary containing the named character references defined by the HTML5 standard and the equivalent Unicode character(s) to the html.entities module.
    http://hg.python.org/cpython/rev/2b54e25d6ecb

    @merwok
    Member

    merwok commented Jun 24, 2012

    The ';' is not part of the entity name but an SGML delimiter, like '&'; the strings in the dict should not include it (the keys in the other dicts don’t).

    @merwok
    Member

    merwok commented Jun 24, 2012

    BTW in the doc you may point to collections.ChainMap to explain to people how to make one dict with HTML 4 and HTML 5 entities. (Note that I assume there are two dicts, but I only skimmed the diff.)
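A possible shape for such a doc example, as a sketch only. It assumes the two dicts are `html.entities.entitydefs` (HTML 4, keys without ';') and `html.entities.html5` (keys with ';'), so the HTML 4 names are normalised first; the names `html4` and `combined` are made up for the example.

```python
from collections import ChainMap
from html.entities import entitydefs, html5

# Assumption: append ';' to the HTML 4 names so both dicts share a key style.
html4 = {name + ';': char for name, char in entitydefs.items()}

# Lookups try html5 first, then fall back to the HTML 4 mapping.
combined = ChainMap(html5, html4)

print(combined['amp;'])    # &
print(combined['apos;'])   # '
```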

    @ezio-melotti
    Member

    The problem is that the standard allows some charref to end without a ';', but not all of them.

    So both "&Eacuteric" and "&Eacute;ric" will be parsed as "Éric", but only "&alpha;centauri" will result in "αcentauri" -- "&alphacentauri" will be returned unchanged.

    I'm now working on bpo-15156 to use this dict in HTMLParser, and detecting the ';'-less entities is not easy. A possible solution is to keep the names that are accepted without ';' in a separate (private) dict and expose a function like HTMLParser.unescape that implements all the necessary logic.
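The behaviour described above can be observed with `html.unescape`, which postdates this comment, so the snippet assumes Python 3.4 or later. It is only a demonstration of the ';' rule, not the patch discussed here.

```python
import html
from html.entities import html5

# 'Eacute' is listed both with and without ';', 'alpha' only with it.
print('Eacute' in html5, 'Eacute;' in html5)   # True True
print('alpha' in html5, 'alpha;' in html5)     # False True

for source in ('&Eacuteric', '&Eacute;ric', '&alpha;centauri', '&alphacentauri'):
    print(repr(source), '->', repr(html.unescape(source)))
```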

    Regarding ChainMap, the html5 dict should be a superset of the html4 one.

    @merwok
    Member

    merwok commented Jun 24, 2012

    The explanations make sense, don’t change anything.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022