classification
Title: HTMLParser.HTMLParser doesn't handle malformed charrefs
Type: behavior Stage:
Components: Library (Lib) Versions: Python 2.4, Python 2.7, Python 2.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: dayveday, eric.araujo, ezio.melotti, fredrik.haard, vstinner
Priority: high Keywords: patch

Created on 2009-08-07 01:25 by dayveday, last changed 2010-05-24 21:48 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
Issue6662.patch fredrik.haard, 2010-04-28 13:19
Messages (5)
msg91392 - (view) Author: Dave Day (dayveday) Date: 2009-08-07 01:25
When HTMLParser.HTMLParser encounters a malformed charref (for example 
&#bad;) it no longer parses the following HTML correctly.

For example:
  <p>&#bad;</p>
Recognises the starttag "p" but considers the rest to be data.

To reproduce:
import HTMLParser

class MyParser(HTMLParser.HTMLParser):
  def handle_starttag(self, tag, attrs):
    print 'Start "%s"' % tag
  def handle_endtag(self, tag):
    print 'End "%s"' % tag
  def handle_charref(self, ref):
    print 'Charref "%s"' % ref
  def handle_data(self, data):
    print 'Data "%s"' % data
parser = MyParser()
parser.feed('<p>&#bad;</p>')
parser.close()

Expected output:
Start "p"
Data "&#bad;"
End "p"

Actual output:
Start "p"
Data "&#bad;</p>"
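
For anyone re-checking this on Python 3, here is a rough port of the reproduction above. It is only a sketch: the names EventCollector and parse_events are made up for illustration, and it assumes the default convert_charrefs=True of Python 3.5+, under which handle_charref is not called at all. It records callbacks in a list instead of printing, and merges adjacent data chunks because the parser may deliver text in pieces.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record parser callbacks as (kind, value) tuples (hypothetical helper)."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True on Python 3.5+
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_charref(self, name):
        # Only reached when the parser is built with convert_charrefs=False.
        self.events.append(("charref", name))

    def handle_data(self, data):
        # Merge adjacent text chunks so the result does not depend on
        # how the parser happens to split the data.
        if self.events and self.events[-1][0] == "data":
            self.events[-1] = ("data", self.events[-1][1] + data)
        else:
            self.events.append(("data", data))


def parse_events(html):
    """Feed one complete document and return the recorded events."""
    parser = EventCollector()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == "__main__":
    for event in parse_events("<p>&#bad;</p>"):
        print(event)
```

On an interpreter that includes the fix, this should report the start tag, the text "&#bad;" and the end tag, matching the expected output listed above.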
msg91950 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-08-25 11:37
Confirmed on Python3.1 too.
msg104367 - (view) Author: Fredrik Håård (fredrik.haard) Date: 2010-04-27 21:38
Is there a reason for HTMLParser to treat anything that does not match the regex '&#\d+;' as a charref?
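
To make the question concrete, a small sketch of what that regex accepts. The helper name is_decimal_charref is hypothetical, and this is not the pattern HTMLParser itself uses. Note that the regex as quoted covers decimal references only; hexadecimal references such as &#x41; are also valid numeric charrefs and would need their own alternative.

```python
import re

# The pattern quoted in the question: decimal numeric charrefs only.
DECIMAL_CHARREF = re.compile(r"&#\d+;")


def is_decimal_charref(text):
    """Return True if the whole string is a decimal charref such as '&#65;'."""
    return DECIMAL_CHARREF.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("&#65;", "&#bad;", "&#x41;"):
        print(sample, is_decimal_charref(sample))
```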
msg104428 - (view) Author: Fredrik Håård (fredrik.haard) Date: 2010-04-28 13:19
Confirmed on trunk.
Attached what I think is a minimal patch to fix this, together with a tweak to an existing unit test case to verify it.
msg106401 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-24 21:48
Committed: 2.7 (r81500, r81501), 2.6 (r81503), 3.2 (r81504), 3.1 (r81505).
History
Date User Action Args
2010-05-24 21:48:50 vstinner set status: open -> closed; nosy: + vstinner; messages: + msg106401; resolution: fixed
2010-05-24 21:21:11 eric.araujo set nosy: + eric.araujo
2010-04-28 13:19:09 fredrik.haard set files: + Issue6662.patch; keywords: + patch; messages: + msg104428; versions: + Python 2.7
2010-04-27 21:38:24 fredrik.haard set messages: + msg104367
2010-04-27 21:31:39 fredrik.haard set nosy: + fredrik.haard
2009-08-25 11:37:36 ezio.melotti set priority: high; nosy: + ezio.melotti; messages: + msg91950
2009-08-07 01:25:08 dayveday create