Message154036
This behavior is now documented, but the situation could still be improved. Adding a new method that receives the converted entity seems a good way to handle this. The parser can call both, and users can pick either one.
One problem with the current methods (handle_charref and handle_entityref) is that they don't do any processing on the entity and let invalid character references like � or &#iamnotanentity; go through.
There are at least three changes that should be made in order to follow the HTML5 standard [0]:
1) the parser should look at html.entities while parsing named character references (see also #11113). This will allow the parser to parse &notit; as "¬it;" and &notin; as "∉" (see note at the very end of [0]);
2) invalid character references (e.g. �, &#iamnotanentity;) should not go through;
3) the table of replacement characters at [0] should be used by the parser to "correct" those invalid character references (e.g. 0x80 -> U+20AC);
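The three changes above could be sketched roughly as follows. This is only an illustration, not a proposed patch: resolve_named, resolve_numeric and REPLACEMENTS are hypothetical names, the sketch assumes the html5 mapping in html.entities is available, and the replacement table is a two-row excerpt of the one at [0].

```python
import re
from html.entities import html5

# Excerpt only: the table at [0] has more rows than these two.
REPLACEMENTS = {0x00: "\ufffd", 0x80: "\u20ac"}

def resolve_named(name):
    """Change 1: longest-prefix match of the text following '&'."""
    for end in range(len(name), 0, -1):
        if name[:end] in html5:
            # Keep whatever follows the matched entity name as plain text.
            return html5[name[:end]] + name[end:]
    return None  # not a known named reference

def resolve_numeric(body):
    """Changes 2 and 3: body is the text between '&#' and ';'."""
    if re.fullmatch(r"[0-9]+", body):
        num = int(body)
    elif re.fullmatch(r"[xX][0-9a-fA-F]+", body):
        num = int(body[1:], 16)
    else:
        return None  # malformed, e.g. "iamnotanentity"
    if num in REPLACEMENTS:
        return REPLACEMENTS[num]
    if num > 0x10FFFF or 0xD800 <= num <= 0xDFFF:
        return "\ufffd"  # out of range or surrogate
    return chr(num)
```

With these helpers, resolve_named("notit;") gives "¬it;", resolve_named("notin;") gives "∉", and resolve_numeric("x80") gives U+20AC, while resolve_numeric("iamnotanentity") returns None so the caller can decide what to do with it.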
Now, 1) can be done for both the old and new methods, but for 2) and 3) the situation is a bit more complicated. The best approach is probably to keep passing invalid references unchanged to the old methods, and implement the correct behavior only in the new method.
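A minimal sketch of that split, emulated in a subclass rather than in the parser itself: the old methods still see the reference unchanged, and a new method receives the converted text. The names DualCallbackParser and handle_converted_ref are hypothetical, and the conversion is delegated to html.unescape, which was added to the standard library after this message was written, so treat it as a stand-in for whatever the parser would do internally.

```python
from html import unescape
from html.parser import HTMLParser

class DualCallbackParser(HTMLParser):
    def __init__(self):
        # Ask the parser to report references instead of converting them.
        super().__init__(convert_charrefs=False)
        self.raw = []        # what the old methods receive
        self.converted = []  # what the hypothetical new method receives

    def handle_charref(self, name):
        # Old behavior: the reference is passed through unprocessed.
        self.raw.append("&#%s;" % name)
        self.handle_converted_ref("&#%s;" % name)

    def handle_entityref(self, name):
        self.raw.append("&%s;" % name)
        self.handle_converted_ref("&%s;" % name)

    def handle_converted_ref(self, text):
        # Hypothetical new method: receives the already converted text.
        self.converted.append(unescape(text))
```

Existing subclasses that override only handle_charref and handle_entityref would keep working unchanged, while new code could override handle_converted_ref alone.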
[0]: http://www.w3.org/TR/html5/tokenization.html#tokenizing-character-references

Date | User | Action | Args
2012-02-23 02:38:49 | ezio.melotti | set | recipients: + ezio.melotti, eric.araujo
2012-02-23 02:38:49 | ezio.melotti | set | messageid: <1329964729.93.0.382093206807.issue13633@psf.upfronthosting.co.za>
2012-02-23 02:38:49 | ezio.melotti | link | issue13633 messages
2012-02-23 02:38:48 | ezio.melotti | create |