Author ezio.melotti
Recipients eric.araujo, ezio.melotti
Date 2012-02-23.02:38:48
This behavior is now documented, but the situation could still be improved.  Adding a new method that receives the converted entity seems like a good way to handle this.  The parser can call both the old and the new method, and users can override whichever they prefer.

One problem with the current methods (handle_charref and handle_entityref) is that they do no processing on the reference, so invalid character references like &#x1000000000; or &#iamnotanentity; are passed through unchanged.
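
To make the pass-through concrete, here is a small sketch that records what handle_charref receives.  RawRefLogger is a hypothetical name, and convert_charrefs=False is an assumption needed on Python 3.4 and later, where the parser otherwise converts references itself and never calls handle_charref.

```python
from html.parser import HTMLParser


class RawRefLogger(HTMLParser):
    """Hypothetical helper: records the raw names passed to handle_charref."""

    def __init__(self):
        # Assumption: on Python 3.4+ the old hooks are only called
        # when convert_charrefs is disabled.
        super().__init__(convert_charrefs=False)
        self.seen = []

    def handle_charref(self, name):
        # The parser hands over the text between "&#" and ";" as-is,
        # without checking that it denotes a valid code point.
        self.seen.append(name)


logger = RawRefLogger()
logger.feed("a &#62; b &#x1000000000; c")
logger.close()
print(logger.seen)
```

The out-of-range reference arrives as plain text just like the valid one, so every handler has to do its own validation before converting it.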

There are at least 3 changes that should be made in order to follow the HTML5 standard [0]:
 1) the parser should look at html.entities while parsing named character references (see also #11113).  This will allow the parser to parse &notit; as "¬it;" and &notin; as "∉" (see note at the very end of [0]);
 2) invalid character references (e.g. &#x1000000000;, &#iamnotanentity;) should not go through;
 3) the table at [0] with the replacement character should be used by the parser to "correct" those invalid character references (e.g. 0x80 -> U+20AC);
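
The three points above could be sketched roughly as follows.  This is only an illustration, not a proposed patch: convert_charref, convert_entityref and REPLACEMENTS are hypothetical names, REPLACEMENTS holds just a few entries of the table at [0], and the sketch assumes the html5 mapping that html.entities provides in Python 3.3 and later.

```python
import re
from html.entities import html5

# Partial excerpt of the replacement table (assumption: only a few
# entries are listed here; the real table is longer).
REPLACEMENTS = {
    0x00: "\ufffd",  # REPLACEMENT CHARACTER
    0x80: "\u20ac",  # EURO SIGN
    0x82: "\u201a",  # SINGLE LOW-9 QUOTATION MARK
    0x83: "\u0192",  # LATIN SMALL LETTER F WITH HOOK
}

_NUMERIC = re.compile(r"[0-9]+|[xX][0-9a-fA-F]+")


def convert_charref(name):
    """Convert the text between '&#' and ';' to a character.

    Returns None when the text is not a number at all (point 2).
    """
    if not _NUMERIC.fullmatch(name):
        return None  # e.g. "iamnotanentity"
    if name[0] in "xX":
        num = int(name[1:], 16)
    else:
        num = int(name, 10)
    if num in REPLACEMENTS:  # point 3: "correct" known bad values
        return REPLACEMENTS[num]
    if num > 0x10FFFF or 0xD800 <= num <= 0xDFFF:
        return "\ufffd"  # out of range or surrogate
    return chr(num)


def convert_entityref(name):
    """Convert the text after '&' (including any ';') using html5 (point 1).

    Tries the full name first, then progressively shorter prefixes,
    keeping the unmatched tail as literal text.
    """
    if name in html5:
        return html5[name]
    for end in range(len(name) - 1, 1, -1):
        if name[:end] in html5:
            return html5[name[:end]] + name[end:]
    return None  # no known entity is a prefix of this name


print(convert_charref("x80"))       # the euro sign, per the table
print(convert_entityref("notit;"))  # "not" matches, "it;" stays literal
print(convert_entityref("notin;"))  # the full name matches
```

The prefix search is what lets &notit; and &notin; come out differently even though both start with "not".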

Now, 1) can be done for both the old and new method, but for 2) and 3) the situation is a bit more complicated.  The best thing is probably to keep sending them unchanged to the old methods, and implement the correct behavior for the new method only.
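
A rough sketch of that split, under several assumptions: handle_converted_ref is a hypothetical name for the new method, forwarding from the old hooks stands in for the parser calling both methods itself, and html.unescape (available in Python 3.4 and later) is used as a stand-in converter because it applies the HTML5 rules.

```python
from html import unescape
from html.parser import HTMLParser


class DualDispatchParser(HTMLParser):
    """Hypothetical sketch: old hooks see raw names, the new hook sees text."""

    def __init__(self):
        # Assumption: convert_charrefs=False keeps the old hooks active
        # on Python 3.4+.
        super().__init__(convert_charrefs=False)
        self.raw = []        # what the old methods receive, unchanged
        self.converted = []  # what the new method receives

    def handle_charref(self, name):
        self.raw.append("&#" + name)
        # Stand-in for the parser calling the new method as well.
        self.handle_converted_ref(unescape("&#%s;" % name))

    def handle_entityref(self, name):
        self.raw.append("&" + name)
        self.handle_converted_ref(unescape("&%s;" % name))

    def handle_converted_ref(self, text):
        # Hypothetical new method: receives the already-converted text.
        self.converted.append(text)


parser = DualDispatchParser()
parser.feed("5 &#x80; &notin; S &#x1000000000; end")
parser.close()
print(parser.raw)
print(parser.converted)
```

Existing subclasses that override only the old methods would keep seeing the same raw names, while code written against the new method gets corrected characters.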

Date                 User          Action  Args
2012-02-23 02:38:49  ezio.melotti  set     recipients: + ezio.melotti, eric.araujo
2012-02-23 02:38:49  ezio.melotti  set     messageid: <>
2012-02-23 02:38:49  ezio.melotti  link    issue13633 messages
2012-02-23 02:38:48  ezio.melotti  create