classification
Title: HTMLParser.HTMLParser doesn't handle malformed charrefs
Type: behavior
Stage:
Components: Library (Lib)
Versions: Python 2.5, Python 2.4
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: dayveday, ezio.melotti (2)
Priority: high
Keywords:

Created on 2009-08-07 01:25 by dayveday, last changed 2009-08-25 11:37 by ezio.melotti.

Messages (2)
msg91392 - Author: Dave Day (dayveday) Date: 2009-08-07 01:25
When HTMLParser.HTMLParser encounters a malformed charref (for example
&#bad;), it no longer parses the HTML that follows it correctly.

For example, given:
  <p>&#bad;</p>
the parser recognises the start tag "p" but treats everything after it,
including the closing </p>, as data.

To reproduce:
import HTMLParser  # Python 2 module name

# Print every callback so the sequence of parser events is visible.
class MyParser(HTMLParser.HTMLParser):
  def handle_starttag(self, tag, attrs):
    print 'Start "%s"' % tag
  def handle_endtag(self, tag):
    print 'End "%s"' % tag
  def handle_charref(self, ref):
    print 'Charref "%s"' % ref
  def handle_data(self, data):
    print 'Data "%s"' % data

parser = MyParser()
parser.feed('<p>&#bad;</p>')
parser.close()  # flush any buffered input

Expected output:
Start "p"
Data "&#bad;"
End "p"

Actual output:
Start "p"
Data "&#bad;</p>"

msg91950 - Author: Ezio Melotti (ezio.melotti) Date: 2009-08-25 11:37
Confirmed on Python 3.1 too.
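
For checking the behaviour on a Python 3 interpreter, the reproduction above can be adapted as in the sketch below. It is not part of the original report: the names RecordingParser and parse_events are hypothetical, the module path html.parser is the Python 3 location of the parser, and the constructor is called with its defaults. Events are collected in a list rather than printed so they can be compared against the expected output. On an interpreter affected by this issue, the end tag would be missing from the recorded events and the trailing "</p>" would appear as data.

```python
from html.parser import HTMLParser  # Python 3 home of the HTMLParser class


class RecordingParser(HTMLParser):
    """Hypothetical helper: records callbacks as (kind, value) tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_events(markup):
    """Feed the markup in one chunk, close the parser, return the events."""
    parser = RecordingParser()
    parser.feed(markup)
    parser.close()  # flush any buffered input
    return parser.events


if __name__ == "__main__":
    # Same input as the original report.
    for kind, value in parse_events("<p>&#bad;</p>"):
        print('%s "%s"' % (kind.capitalize(), value))
```

Data may be delivered in more than one handle_data call, so a comparison against the expected output is more robust if adjacent data events are joined first.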
History
Date                 User          Action  Args
2009-08-25 11:37:36  ezio.melotti  set     priority: high
                                           nosy: + ezio.melotti
                                           messages: + msg91950
2009-08-07 01:25:08  dayveday      create