Author ezio.melotti
Recipients Brian.Jones, eric.araujo, eric.smith, ezio.melotti, hp.dekoning, loewis
Date 2011-11-29.08:42:56
Message-id <1322556184.55.0.779951108031.issue11113@psf.upfronthosting.co.za>
Content
http://www.w3.org/TR/html5/named-character-references.html lists 2152 HTML 5 entities (see also attached file for a dict generated from that table).
Currently html.entities only has 252 entities, organized in 3 dicts:
  1) name -> intvalue (e.g. 'amp': 0x0026), in name2codepoint;
  2) intvalue -> name (e.g. 0x0026: 'amp'), in codepoint2name;
  3) name -> char (e.g. 'amp': '&'), in entitydefs.
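
For reference, the three existing dicts can be inspected as in the short sketch below. It only uses names that html.entities already exports in Python 3, and the lookups mirror the 'amp' examples above.

```python
from html.entities import name2codepoint, codepoint2name, entitydefs

# 1) name -> intvalue
amp_codepoint = name2codepoint['amp']
print(hex(amp_codepoint))        # 0x26

# 2) intvalue -> name
print(codepoint2name[0x0026])    # amp

# 3) name -> char
print(entitydefs['amp'])         # &
```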

In HTML 5, some of the entities map to a sequence of 2 characters, for example &NotEqualTilde; corresponds to [U+2242, U+0338] (i.e. MINUS TILDE + COMBINING LONG SOLIDUS OVERLAY).
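
A small sketch of why a single int cannot represent such an entity, while a str can. The code point list is copied from the example above; the variable names are only illustrative.

```python
import unicodedata

# Code points given above for &NotEqualTilde;
not_equal_tilde = [0x2242, 0x0338]

# A str holds the whole sequence; a single int cannot.
text = ''.join(chr(cp) for cp in not_equal_tilde)
print(len(text))  # 2

# Show the Unicode name of each code point in the sequence.
for ch in text:
    print('U+%04X' % ord(ch), unicodedata.name(ch))
```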

This means that:
  1) the current approach of having a dict with name -> intvalue doesn't work anymore, and a name -> valuelist dict should be used instead;
  2) the reverse dict for this would have to use tuples as keys, but I'm not sure how useful that would be (producing entities is not a common case, especially "unusual" ones like these);
  3) the name -> char dict might still be useful, and can easily become a name -> str dict in order to deal with the multichar entities.
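
A rough sketch of what the three points could look like in practice. The dict names (html5_name2codepoints, html5_name2str, html5_codepoints2name) are hypothetical, not an existing API, and only two sample entries are included rather than the full table.

```python
# Hypothetical name -> valuelist dict; tuples are used so the values
# can also serve as keys of the reverse dict.
html5_name2codepoints = {
    'amp': (0x0026,),
    'NotEqualTilde': (0x2242, 0x0338),
}

# name -> str, derived from the dict above; handles multichar entities.
html5_name2str = {
    name: ''.join(chr(cp) for cp in codepoints)
    for name, codepoints in html5_name2codepoints.items()
}

# Reverse dict, keyed by tuples of code points.
html5_codepoints2name = {
    codepoints: name
    for name, codepoints in html5_name2codepoints.items()
}

print(len(html5_name2str['NotEqualTilde']))     # 2
print(html5_codepoints2name[(0x2242, 0x0338)])  # NotEqualTilde
```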

Since 1) is not backward-compatible, the HTML5 entities should probably go in a separate dict.

Also note that the entities are case-sensitive and some of them include different spellings (e.g. both 'amp' and 'AMP' map to '&'), so the reverse dict won't work too well.  Having '&' -> 'amp' seems better than '&' -> 'AMP', but this might not be obvious for all the entities and requires some extra logic in the code to get it right.
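
One possible shape for that extra logic is sketched below. The tie-break rule (prefer an all-lowercase name, then the shorter one, then alphabetical order) is purely an assumption made for illustration, and build_reverse is a hypothetical helper; the right rule for every entity would still need to be worked out.

```python
def build_reverse(name2str):
    """Map each value to a single preferred entity name.

    Assumed preference: all-lowercase names first, then shorter
    names, then alphabetical order.
    """
    def rank(name):
        return (not name.islower(), len(name), name)

    reverse = {}
    for name, value in name2str.items():
        current = reverse.get(value)
        if current is None or rank(name) < rank(current):
            reverse[value] = name
    return reverse


# Sample with duplicate spellings, as described above.
sample = {'AMP': '&', 'amp': '&', 'LT': '<', 'lt': '<'}
reverse = build_reverse(sample)
print(reverse['&'])  # amp
print(reverse['<'])  # lt
```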
History
Date User Action Args
2011-11-29 08:43:05  ezio.melotti  set     recipients: + ezio.melotti, loewis, eric.smith, eric.araujo, Brian.Jones, hp.dekoning
2011-11-29 08:43:04  ezio.melotti  set     messageid: <1322556184.55.0.779951108031.issue11113@psf.upfronthosting.co.za>
2011-11-29 08:43:03  ezio.melotti  link    issue11113 messages
2011-11-29 08:43:03  ezio.melotti  create