Message148549
http://www.w3.org/TR/html5/named-character-references.html lists 2152 HTML 5 entities (see also attached file for a dict generated from that table).
Currently html.entities only has 252 entities, organized in 3 dicts:
1) name -> intvalue (e.g. 'amp': 0x0026);
2) intvalue -> name (e.g. 0x0026: 'amp');
3) name -> char (e.g. 'amp': '&');
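For reference, the three existing dicts can be inspected as sketched below. This assumes Python 3, where they are exposed as `name2codepoint`, `codepoint2name`, and `entitydefs`; those names are not spelled out in the message above.

```python
from html.entities import name2codepoint, codepoint2name, entitydefs

# 1) name -> intvalue
amp_codepoint = name2codepoint['amp']

# 2) intvalue -> name
amp_name = codepoint2name[0x0026]

# 3) name -> char
amp_char = entitydefs['amp']

print(hex(amp_codepoint), amp_name, amp_char)
```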
In HTML 5, some of the entities map to a sequence of 2 characters, for example ≂̸ corresponds to [U+2242, U+0338] (i.e. MINUS TILDE + COMBINING LONG SOLIDUS OVERLAY).
This means that:
1) the current approach of having a dict with name -> intvalue doesn't work anymore, and a name -> valuelist should be used instead;
2) the reverse dict for this would have to use tuples as keys, but I'm not sure how useful that would be (producing entities is not a common case, especially "unusual" ones like these);
3) the name -> char dict might still be useful, and can easily become a name -> str dict in order to deal with the multichar entities.
Since 1) is not backward-compatible, the HTML5 entities should probably go in a separate dict.
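A rough sketch of what such a separate dict could look like, assuming a name -> str mapping from which the code point tuples are derived. The names `html5_sample`, `to_codepoints`, and `name2codepoints` are hypothetical, and the three entries are only a hand-picked sample, not the full table.

```python
# Hypothetical sample; a real table would be generated from the W3C list.
html5_sample = {
    'amp': '\u0026',
    'AMP': '\u0026',
    'NotEqualTilde': '\u2242\u0338',  # maps to a sequence of 2 characters
}


def to_codepoints(value):
    """Return the code points of an entity value as a tuple of ints."""
    return tuple(ord(ch) for ch in value)


# name -> tuple of code points (the "valuelist" variant of the old name -> intvalue dict)
name2codepoints = {name: to_codepoints(value) for name, value in html5_sample.items()}

print(name2codepoints['NotEqualTilde'])
```

Keeping the str form as the primary data means single-character and multi-character entities are handled the same way, and the tuple form can be computed only when it is needed.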
Also note that the entities are case-sensitive and some of them include different spellings (e.g. both 'amp' and 'AMP' map to '&'), so the reverse dict won't work too well. Having '&' -> 'amp' seems better than '&' -> 'AMP', but this might not be obvious for all the entities and requires some extra logic in the code to get it right.
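One possible shape for that extra logic, as a sketch only: group the names by value and pick one with a simple heuristic. The rule used here (prefer all-lowercase names, then shorter ones, then alphabetical order) is an assumption for illustration, and `preferred` and `build_reverse` are hypothetical helpers.

```python
def preferred(names):
    """Pick one name among several spellings of the same entity.

    Hypothetical heuristic: all-lowercase first, then shortest, then alphabetical.
    """
    return min(names, key=lambda n: (not n.islower(), len(n), n))


def build_reverse(name2str):
    """Build a str -> name dict, resolving duplicate spellings with preferred()."""
    grouped = {}
    for name, value in name2str.items():
        grouped.setdefault(value, []).append(name)
    return {value: preferred(names) for value, names in grouped.items()}


# Hand-picked sample with duplicate spellings.
sample = {'amp': '&', 'AMP': '&', 'lt': '<', 'LT': '<'}
reverse = build_reverse(sample)
print(reverse)
```

A fixed rule like this will not give the obvious answer for every entity, so a small table of explicit overrides might still be needed.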
Date | User | Action | Args
2011-11-29 08:43:05 | ezio.melotti | set | recipients: + ezio.melotti, loewis, eric.smith, eric.araujo, Brian.Jones, hp.dekoning
2011-11-29 08:43:04 | ezio.melotti | set | messageid: <1322556184.55.0.779951108031.issue11113@psf.upfronthosting.co.za>
2011-11-29 08:43:03 | ezio.melotti | link | issue11113 messages
2011-11-29 08:43:03 | ezio.melotti | create |