
classification
Title: Entity references without semicolon in HTMLParser
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 2.6
process
Status: closed Resolution: duplicate
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, flox, r.david.murray, stefan.schweizer
Priority: normal Keywords:

Created on 2010-01-03 20:13 by stefan.schweizer, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (5)
msg97177 - (view) Author: Stefan Schweizer (stefan.schweizer) Date: 2010-01-03 20:13
HTMLParser should only handle entity references that are terminated with a semicolon. I know that the semicolon can be omitted in some cases (http://www.w3.org/TR/html4/charset.html#h-5.3) and that some browsers are more tolerant, but the following example causes some odd output:

>>> import HTMLParser
>>> class EntityrefParser(HTMLParser.HTMLParser):
...     def handle_data(self, data):
...         print "handle_data '%s'" % data
...     def handle_entityref(self, name):
...         print "handle_entityref '%s'" % name
... 
>>> p = EntityrefParser()
>>> p.feed("<p>spam&eggs are delicious</p>")

Expected Result:
handle_data 'spam&eggs are delicious'

Actual Result:
handle_data 'spam'
handle_entityref 'eggs'
handle_data ' are delicious'
msg97263 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-01-05 15:42
It is a documented behavior.
http://bip.cnrs-mrs.fr/bip10/scowl.htm#semi

Quoted from issue500073:
"If you want to process such a document in a specific way, I
recommend to subclass HTMLParser, overriding unknown_entityref."
msg97266 - (view) Author: Stefan Schweizer (stefan.schweizer) Date: 2010-01-05 17:16
I do not think that the semicolon can be omitted here, because it is not at a line break or immediately before a tag; it is in the middle of a paragraph. Anyway, I guess I have to live with the decision in issue500073.

Also, I could not find an 'unknown_entityref' method in the HTMLParser module; only the now-deprecated parser in htmllib had one.
msg97270 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-01-05 18:02
For the record, this is valid HTML 4.01 Strict:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<title>Sample</title>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
</head>
<body>
<p>La cl&eacute</p>
<p>La cl&eacute des champs</p>
<p>La cl&eacute; des champs</p>
</body>
</html>


Tested with http://validator.w3.org/check and Mozilla Firefox 3.5.6
Reference: http://www.is-thought.co.uk/book/sgml-6.htm#General

But HTML5 should prohibit such ambiguous syntax:
http://dev.w3.org/html5/spec/Overview.html#syntax-ambiguous-ampersand
msg97278 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-01-05 20:13
w3m (a text-mode browser) does not treat &eacute without the ; as an entity ref (it puts &eacute literally into the display), while Firefox does turn it into an é with or without the ;. I'm sure somebody somewhere has a table listing which browsers have what behavior.

Firefox does render, e.g., &test without a trailing semicolon as the literal text &test. If you want to mirror that result in code using HTMLParser, you can implement the behavior in your entityref handler.
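A minimal sketch of such an entityref handler follows. It is not from the original thread: it assumes Python 3's html.parser.HTMLParser (the successor of the Python 2 HTMLParser module discussed here) with convert_charrefs=False so that handle_entityref is still called, and the names LenientEntityParser and extract_text are made up for illustration. Known entity names are converted, unknown ones are put back as literal text.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class LenientEntityParser(HTMLParser):
    """Hypothetical helper: collects text and keeps unknown entity names literal."""

    def __init__(self):
        # convert_charrefs=False keeps handle_entityref() calls enabled in Python 3.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        if name in name2codepoint:
            # Known HTML 4 entity such as "eacute": emit the character itself.
            self.parts.append(chr(name2codepoint[name]))
        else:
            # Unknown name such as "eggs": restore the text as written.
            # The handler is not told whether a ";" followed, so none is added.
            self.parts.append("&" + name)


def extract_text(markup):
    """Return the text content of markup with lenient entity handling."""
    parser = LenientEntityParser()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(extract_text("<p>spam&eggs are delicious</p>"))
print(extract_text("<p>La cl&eacute des champs</p>"))
```

Because the handler receives only the entity name, an unknown reference written with a semicolon comes back without it in this sketch.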

However, this brings up an interesting issue.  Firefox also renders "&test;" literally.  You can't implement that full behavior using HTMLParser, as far as I can see, since you lose the information as to whether the entity ref was terminated by a semicolon or not. So there may be a legitimate feature request with respect to that issue.
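The loss of the semicolon information can be illustrated with a small comparison. This is a sketch under assumptions, not part of the original report: it uses Python 3's html.parser.HTMLParser with convert_charrefs=False, and EventRecorder and record are hypothetical names. It records the data and entityref callbacks for the same text with and without the terminating semicolon and compares them.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records data and entityref callbacks in order."""

    def __init__(self):
        # convert_charrefs=False keeps handle_entityref() calls enabled in Python 3.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # Only the name is passed in; the terminator is not part of the callback.
        self.events.append(("entityref", name))


def record(markup):
    """Return the list of recorded callbacks for markup."""
    recorder = EventRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.events


with_semicolon = record("<p>a &test; b</p>")
without_semicolon = record("<p>a &test b</p>")

# The two callback sequences can be compared directly.
print(with_semicolon == without_semicolon)
```

If the two sequences compare equal, a handler built only on these callbacks has no way to reproduce "&test;" and "&test" differently.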
History
Date | User | Action | Args
2022-04-11 14:56:56 | admin | set | github: 51875
2010-01-05 20:13:30 | r.david.murray | set | nosy: + r.david.murray; messages: + msg97278
2010-01-05 18:02:42 | flox | set | messages: + msg97270
2010-01-05 17:16:15 | stefan.schweizer | set | messages: + msg97266
2010-01-05 15:42:07 | flox | set | status: open -> closed; nosy: + flox; messages: + msg97263; resolution: duplicate; stage: resolved
2010-01-03 21:51:59 | ezio.melotti | set | nosy: + ezio.melotti
2010-01-03 20:13:28 | stefan.schweizer | create |