Message 154036 - Python tracker


Author	ezio.melotti
Recipients	eric.araujo, ezio.melotti
Date	2012-02-23.02:38:48
Message-id	<1329964729.93.0.382093206807.issue13633@psf.upfronthosting.co.za>
In-reply-to

Content

This behavior is now documented, but the situation could still be improved.  Adding a new method that receives the converted entity seems like a good way to handle this.  The parser can call both the old and the new method, and users can override whichever they prefer.

One problem with the current methods (handle_charref and handle_entityref) is that they don't do any processing on the entity, so invalid character references like &#x1000000000; or &#iamnotanentity; are passed through unchanged.
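To make the problem concrete, here is a minimal sketch of what the existing handlers receive. It assumes a current Python 3, where convert_charrefs=False is needed to keep handle_charref and handle_entityref active; RecordingParser is a hypothetical name used only for illustration.

```python
from html.parser import HTMLParser


class RecordingParser(HTMLParser):
    """Hypothetical subclass that records the raw names passed to the handlers."""

    def __init__(self):
        # convert_charrefs=False keeps the per-reference handlers in use.
        super().__init__(convert_charrefs=False)
        self.calls = []

    def handle_charref(self, name):
        # Receives the text between '&#' and ';' with no validation.
        self.calls.append(("charref", name))

    def handle_entityref(self, name):
        # Receives the text between '&' and ';' with no lookup.
        self.calls.append(("entityref", name))


parser = RecordingParser()
parser.feed("&#x1000000000; &notit;")
parser.close()
print(parser.calls)
```

The out-of-range number and the unknown name both reach user code as plain strings, so every subclass has to do its own conversion and validation.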

There are at least three changes that should be made in order to follow the HTML5 standard [0]:
 1) the parser should look at html.entities while parsing named character references (see also #11113).  This will allow the parser to parse &notit; as "¬it;" and &notin; as "∉" (see the note at the very end of [0]);
 2) invalid character references (e.g. &#x1000000000;, &#iamnotanentity;) should not go through;
 3) the table of replacement characters at [0] should be used by the parser to "correct" those invalid character references (e.g. 0x80 -> U+20AC).

Now, 1) can be done for both the old and new methods, but for 2) and 3) the situation is a bit more complicated.  The best approach is probably to keep sending invalid references unchanged to the old methods, and to implement the correct behavior for the new method only.
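The conversion that such a new method could receive might look roughly like the sketch below. This is not the patch for this issue: convert_named, convert_numeric and REPLACEMENTS are hypothetical names, the replacement table holds only the one row quoted above, and the sketch assumes html.entities.html5 is available.

```python
import re
from html.entities import html5

# Excerpt only: the table at [0] has more rows. 0x80 -> U+20AC is the
# example quoted in the message.
REPLACEMENTS = {0x80: "\u20ac"}

_DECIMAL = re.compile(r"[0-9]+")
_HEX = re.compile(r"[xX]([0-9a-fA-F]+)")


def convert_named(ref):
    """Convert the text after '&' (trailing ';' included, if any).

    Returns the converted text, or None if no known name matches.
    """
    if ref in html5:
        return html5[ref]
    # Fall back to the longest known prefix, so 'notit;' becomes '¬' + 'it;'.
    for end in range(len(ref) - 1, 1, -1):
        if ref[:end] in html5:
            return html5[ref[:end]] + ref[end:]
    return None


def convert_numeric(ref):
    """Convert the text between '&#' and ';'.

    Returns a character, or None if ref is not a number at all.
    """
    hex_match = _HEX.fullmatch(ref)
    if hex_match:
        num = int(hex_match.group(1), 16)
    elif _DECIMAL.fullmatch(ref):
        num = int(ref, 10)
    else:
        return None  # e.g. 'iamnotanentity'
    if num in REPLACEMENTS:
        return REPLACEMENTS[num]
    if num > 0x10FFFF or 0xD800 <= num <= 0xDFFF:
        return "\ufffd"  # out of range or surrogate: replacement character
    return chr(num)


print(convert_named("notit;"), convert_named("notin;"))
print(repr(convert_numeric("x80")), repr(convert_numeric("x1000000000")))
```

Returning None for input that is not a reference at all leaves the caller free to pass the original text through, which matches the idea of keeping the old methods unchanged.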

[0]: http://www.w3.org/TR/html5/tokenization.html#tokenizing-character-references

History
Date	User	Action	Args
2012-02-23 02:38:49	ezio.melotti	set	recipients: + ezio.melotti, eric.araujo
2012-02-23 02:38:49	ezio.melotti	set	messageid: <1329964729.93.0.382093206807.issue13633@psf.upfronthosting.co.za>
2012-02-23 02:38:49	ezio.melotti	link	issue13633 messages
2012-02-23 02:38:48	ezio.melotti	create