Message 149822 - Python tracker



Author	ezio.melotti
Recipients	eric.araujo, ezio.melotti
Date	2011-12-19.06:55:56
Message-id	<1324277757.56.0.0733657691883.issue13633@psf.upfronthosting.co.za>
In-reply-to

Content

The docs for handle_charref and handle_entityref say:
"""
HTMLParser.handle_charref(name)
    This method is called to process a character reference of the form "&#ref;". It is intended to be overridden by a derived class; the base class implementation does nothing.

HTMLParser.handle_entityref(name)
    This method is called to process a general entity reference of the form "&name;" where name is an general entity reference. It is intended to be overridden by a derived class; the base class implementation does nothing.
"""

The doc doesn't mention hex references, like "&#x3E;", and apparently they are passed to handle_charref without the '&#' but with the leading 'x':

>>> from HTMLParser import HTMLParser
>>> class MyParser(HTMLParser):
...   def handle_charref(self, data):
...     print data
... 
>>> MyParser().feed('&gt; &#62; &#x3E;')
62
x3E

I've seen code in the wild that calls unichr(int(data)) in handle_charref (once the authors figured out that '62' is passed) and then fails when a hex reference is found.  Passing 'x3E' doesn't seem very useful, because the user first has to check for a leading 'x'; if there is one, remove it, convert the hex string to an int, and finally use unichr() to get the char; otherwise, just convert to int and use unichr().
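
For illustration, the steps a user currently has to write by hand might look like the sketch below. It is written for Python 3, so it uses html.parser and chr() in place of HTMLParser and unichr(), and it passes convert_charrefs=False (a Python 3 parameter) so that handle_charref is still called. The names charref_to_char and CharrefCollector are made up for this example.

```python
from html.parser import HTMLParser


def charref_to_char(name):
    """Convert the string passed to handle_charref ('62' or 'x3E') to a char.

    Hypothetical helper, not part of the standard library.
    """
    if name[:1] in ('x', 'X'):
        # Hex reference: drop the leading 'x' and parse as base 16.
        codepoint = int(name[1:], 16)
    else:
        # Decimal reference.
        codepoint = int(name)
    return chr(codepoint)


class CharrefCollector(HTMLParser):
    """Collects the characters for every numeric reference it sees."""

    def __init__(self):
        # Without convert_charrefs=False, recent Python 3 versions convert
        # references themselves and never call handle_charref.
        super().__init__(convert_charrefs=False)
        self.chars = []

    def handle_charref(self, name):
        self.chars.append(charref_to_char(name))


parser = CharrefCollector()
parser.feed('&gt; &#62; &#x3E;')
parser.close()
print(parser.chars)
```

The named reference &gt; goes to handle_entityref instead, so only the two numeric references end up in the list.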

There are 3 possible solutions:
1) just document the behavior;
2) normalize hex values before passing them to handle_charref and document it;
3) add a new handle_entity method that is called with the character represented by the entity (named, decimal, or hex).

The first solution alone doesn't solve much, but the doc should be clearer regardless of the decision we take.
The second one is better, but if it's implemented there won't be any way to know if the entity had a decimal or hex value anymore (does anyone care?).  The normalization should also convert the hex string to int and then convert it back to str to be consistent with decimal entities.
The third one might be better, but doesn't solve the issue on 2.7/3.2.  People don't care about entities and just want the equivalent char, so having a method that converts them already sounds like a useful feature to me.
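
A rough sketch of what the third option could look like, done as a subclass rather than as a change to HTMLParser itself. This is only an assumption about the shape of the feature: handle_entity does not exist in the standard library, and EntityParser and Collector are names made up here. It is written for Python 3 (html.parser, html.entities, chr(), and convert_charrefs=False so the two reference handlers are still called).

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class EntityParser(HTMLParser):
    """Routes named, decimal, and hex references to one handle_entity hook."""

    def __init__(self):
        super().__init__(convert_charrefs=False)

    def handle_charref(self, name):
        # 'x3E' -> base 16 after dropping the 'x'; '62' -> base 10.
        if name[:1] in ('x', 'X'):
            codepoint = int(name[1:], 16)
        else:
            codepoint = int(name)
        self.handle_entity(chr(codepoint))

    def handle_entityref(self, name):
        # Unknown names are silently skipped in this sketch.
        codepoint = name2codepoint.get(name)
        if codepoint is not None:
            self.handle_entity(chr(codepoint))

    def handle_entity(self, char):
        """Hypothetical hook: called with the character itself."""
        pass


class Collector(EntityParser):
    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_entity(self, char):
        self.seen.append(char)


collector = Collector()
collector.feed('&gt; &#62; &#x3E;')
collector.close()
print(collector.seen)
```

With this shape, user code only overrides handle_entity and never needs to know whether the source used a named, decimal, or hex form.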

History
Date	User	Action	Args
2011-12-19 06:55:57	ezio.melotti	set	recipients: + ezio.melotti, eric.araujo
2011-12-19 06:55:57	ezio.melotti	set	messageid: <1324277757.56.0.0733657691883.issue13633@psf.upfronthosting.co.za>
2011-12-19 06:55:56	ezio.melotti	link	issue13633 messages
2011-12-19 06:55:56	ezio.melotti	create