
Author iko
Recipients iko
Date 2014-01-17.14:06:13
Content
Python 2.7 HTMLParser.py lines 185-199 (similar lines still exist in Python 3.4):
                match = charref.match(rawdata, i)
                if match:
                    ...
                else:
                    if ";" in rawdata[i:]: #bail by consuming &#
                        self.handle_data(rawdata[0:2])
                        i = self.updatepos(i, 2)
                    break

If you feed a broken (non-numeric) charref, the parser passes whatever happens to be at the start of rawdata to handle_data(), instead of the text at the current position. E.g.:

import sys
from HTMLParser import HTMLParser  # Python 2.7 module name

p = HTMLParser()
p.handle_data = lambda x: sys.stdout.write(x)
p.feed('<p>&#foo;</p>')

will print '<p', which is clearly wrong. I think the intention of the code is to pass '&#' (that is, rawdata[i:i+2] rather than rawdata[0:2]), which seems saner.
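
To make the difference concrete without depending on a particular HTMLParser version, here is a minimal standalone sketch of just the slicing step. The helper name bail_slices is hypothetical, and the charref pattern is assumed to mirror the one the parser uses; this is an illustration of the suspected off-by-position slice, not the library's actual code or its fix.

```python
import re

# Assumption: same shape as the parser's numeric character reference pattern.
charref = re.compile(r'&#(?:[0-9]+|[xX][0-9a-fA-F]+)[^0-9a-fA-F]')


def bail_slices(rawdata):
    """Hypothetical helper: return (buggy, intended) text for a broken charref.

    Returns None when there is no '&#', when the charref is well formed,
    or when no ';' follows (the branch in question is not taken).
    """
    i = rawdata.find('&#')
    if i < 0 or charref.match(rawdata, i) or ';' not in rawdata[i:]:
        return None
    buggy = rawdata[0:2]          # anchored at the start of the buffer
    intended = rawdata[i:i + 2]   # anchored at the current position i
    return buggy, intended


print(bail_slices('<p>&#foo;</p>'))
```

For the input from the report this yields the pair ('<p', '&#'): the first element is what the quoted code hands to handle_data(), the second is what it presumably meant to hand over. The position update in the quoted code (updatepos(i, 2)) looks like it would need the same offset treatment, though that is an inference from the excerpt rather than something tested here.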