Author ezio.melotti
Recipients eric.araujo, ezio.melotti
Date 2012-02-23.02:38:48
Message-id <1329964729.93.0.382093206807.issue13633@psf.upfronthosting.co.za>
Content
This behavior is now documented, but the situation could still be improved.  Adding a new method that receives the converted entity seems like a good way to handle this.  The parser can call both the old and the new methods, and users can override whichever they prefer.

One problem with the current methods (handle_charref and handle_entityref) is that they don't do any processing on the entity, so invalid character references like &#x1000000000; or &#iamnotanentity; are passed through unchanged.
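To make the problem concrete, here is a minimal sketch that records what handle_charref receives.  The class name RawRefRecorder is made up for illustration, and convert_charrefs=False is assumed to be needed on newer Python versions so that handle_charref is called at all.  The out-of-range reference should arrive as the raw string, with no validation applied:

```python
from html.parser import HTMLParser


class RawRefRecorder(HTMLParser):
    """Hypothetical helper: records the raw names passed to handle_charref."""

    def __init__(self):
        # Assumption: on Python 3.4+ charrefs are only reported to
        # handle_charref when convert_charrefs is False.
        super().__init__(convert_charrefs=False)
        self.charrefs = []

    def handle_charref(self, name):
        # The parser hands over the text between "&#" and ";" as-is.
        self.charrefs.append(name)


recorder = RawRefRecorder()
recorder.feed("&#x1000000000;")
recorder.close()

# The recorded value is far beyond the Unicode range (max 0x10FFFF).
out_of_range = [int(n[1:], 16) > 0x10FFFF for n in recorder.charrefs]
print(recorder.charrefs, out_of_range)
```

Every subclass that wants a real character therefore has to do its own range checking.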

There are at least three changes that should be made in order to follow the HTML5 standard [0]:
 1) the parser should look at html.entities while parsing named character references (see also #11113).  This will allow the parser to parse &notit; as "¬it;" and &notin; as "∉" (see the note at the very end of [0]);
 2) invalid character references (e.g. &#x1000000000;, &#iamnotanentity;) should not be passed through unchanged;
 3) the table at [0] with the replacement characters should be used by the parser to "correct" those invalid character references (e.g. 0x80 -> U+20AC).
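
The three points above can be sketched roughly as follows.  This is only an illustration, not a proposed patch: resolve_named, resolve_numeric and REPLACEMENTS are hypothetical names, the replacement table is cut down to the single entry mentioned above, and returning U+FFFD for out-of-range values is one reading of [0].  It assumes the html.entities.html5 mapping (Python 3.3+), whose keys include names both with and without the trailing ";":

```python
from html.entities import html5

# Excerpt only: the full replacement table is the one at [0].
REPLACEMENTS = {0x80: "\u20ac"}


def resolve_named(name):
    """Resolve the text that follows '&' (including ';' if present)."""
    if name in html5:
        return html5[name]
    # Otherwise try the longest prefix that is a known name.
    for end in range(len(name) - 1, 1, -1):
        if name[:end] in html5:
            return html5[name[:end]] + name[end:]
    return "&" + name  # not a character reference at all


def resolve_numeric(digits, base=10):
    """Resolve the digits of a numeric reference, or return None if malformed."""
    try:
        num = int(digits, base)
    except ValueError:
        return None  # e.g. "iamnotanentity"
    if num in REPLACEMENTS:
        return REPLACEMENTS[num]
    if num > 0x10FFFF or 0xD800 <= num <= 0xDFFF:
        return "\ufffd"  # out of range or surrogate
    return chr(num)


print(resolve_named("notit;"), resolve_named("notin;"))
print(repr(resolve_numeric("80", 16)), repr(resolve_numeric("1000000000", 16)))
```

Here resolve_named("notit;") falls back to the prefix "not" and keeps "it;" as plain text, while resolve_named("notin;") matches the full name directly.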

Now, 1) can be done for both the old and the new methods, but for 2) and 3) the situation is a bit more complicated.  The best thing is probably to keep sending invalid references unchanged to the old methods, and implement the correct behavior for the new method only.
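
A rough sketch of that split, written as a subclass rather than as a change to the parser itself.  The method name handle_converted_ref and the class DualRefParser are hypothetical, and the conversion is delegated to html.unescape (available from Python 3.4), which is documented to follow the HTML5 rules for both valid and invalid references:

```python
from html import unescape
from html.parser import HTMLParser


class DualRefParser(HTMLParser):
    """Hypothetical sketch: old callbacks see raw refs, a new one sees text."""

    def __init__(self):
        # Assumption: convert_charrefs=False keeps the old callbacks active.
        super().__init__(convert_charrefs=False)
        self.raw = []        # what the old methods received
        self.converted = []  # what the new method received

    def handle_charref(self, name):
        ref = "&#%s;" % name
        self.raw.append(ref)                      # old behavior: unchanged
        self.handle_converted_ref(unescape(ref))  # new behavior: corrected

    def handle_entityref(self, name):
        # The ";" is re-added here even if the source omitted it.
        ref = "&%s;" % name
        self.raw.append(ref)
        self.handle_converted_ref(unescape(ref))

    def handle_converted_ref(self, text):
        # Hypothetical new method: receives the already-converted text.
        self.converted.append(text)


parser = DualRefParser()
parser.feed("&#x80; &notin; &#x1000000000;")
parser.close()
print(parser.raw)
print([ascii(text) for text in parser.converted])
```

Existing code that overrides only the old methods keeps seeing the raw references, so the stricter handling would not break it.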

[0]: http://www.w3.org/TR/html5/tokenization.html#tokenizing-character-references
History
Date                 User          Action  Args
2012-02-23 02:38:49  ezio.melotti  set     recipients: + ezio.melotti, eric.araujo
2012-02-23 02:38:49  ezio.melotti  set     messageid: <1329964729.93.0.382093206807.issue13633@psf.upfronthosting.co.za>
2012-02-23 02:38:49  ezio.melotti  link    issue13633 messages
2012-02-23 02:38:48  ezio.melotti  create