classification
Title: HTMLParser handling of non-numeric charrefs broken
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.4, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: ezio.melotti, iko, python-dev, r.david.murray
Priority: normal Keywords: patch

Created on 2014-01-17 14:06 by iko, last changed 2014-02-14 05:31 by ezio.melotti. This issue is now closed.

Files
File name Uploaded Description
issue20288.diff ezio.melotti, 2014-02-01 19:13
Messages (5)
msg208336 - (view) Author: Anders Hammarquist (iko) Date: 2014-01-17 14:06
Python 2.7 HTMLParser.py, lines 185-199 (similar lines still exist in Python 3.4):
                match = charref.match(rawdata, i)
                if match:
                    ...
                else:
                    if ";" in rawdata[i:]: #bail by consuming &#
                        self.handle_data(rawdata[0:2])
                        i = self.updatepos(i, 2)
                    break

If you feed the parser a broken (that is, non-numeric) charref, it passes whatever string happens to be at the start of rawdata to handle_data(). E.g.:

import sys
from HTMLParser import HTMLParser  # Python 2.7; use html.parser on Python 3

p = HTMLParser()
p.handle_data = lambda x: sys.stdout.write(x)
p.feed('<p>&#foo;</p>')

will print '<p', which is clearly wrong. I think the intention of the code is to pass '&#', which seems saner.
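The difference between the two slices can be shown in isolation. This is only a minimal sketch of the slicing, not the parser code or the eventual patch: the function names are hypothetical, and the "intended" variant assumes the goal is to emit the two characters at the current position.

```python
def bail_chunk_reported(rawdata, i):
    # Slice used by the quoted code: always the first two
    # characters of the buffer, regardless of position i.
    return rawdata[0:2]


def bail_chunk_intended(rawdata, i):
    # Assumed intent: the two characters at the current position.
    return rawdata[i:i + 2]


rawdata = '<p>&#foo;</p>'
i = rawdata.index('&#')  # position of the broken charref (3)

print(repr(bail_chunk_reported(rawdata, i)))  # '<p'
print(repr(bail_chunk_intended(rawdata, i)))  # '&#'

# When the charref sits at the very start of the buffer, both agree.
print(bail_chunk_reported('&#foo;', 0) == bail_chunk_intended('&#foo;', 0))  # True
```

The two slices only differ when the charref is not at index 0 of the buffer.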
msg208350 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-01-17 18:35
Thanks for the report, this is indeed a bug.
This behavior was covered by a test (see Lib/test/test_htmlparser.py:164), but _run_check feeds the chars one by one to the parser, and in that case it works correctly. When I fed the parser the whole chunk at once, I was able to reproduce the bug. This should be fixed, and the behavior of _run_check should probably be changed too; maybe it could test both the char-by-char and the regular feeding.
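One way such a dual check could look, as a rough sketch only: it targets Python 3's html.parser, and EventCollector, collect_events and check_both_modes are hypothetical names, not the helpers in test_htmlparser.py. Adjacent data chunks are merged so that the comparison does not depend on how the parser happens to split text.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record parser events, merging adjacent data chunks."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, data):
        # Merge with the previous event if it was also data.
        if self.events and self.events[-1][0] == "data":
            self.events[-1] = ("data", self.events[-1][1] + data)
        else:
            self.events.append(("data", data))


def collect_events(source, char_by_char):
    parser = EventCollector()
    # Iterating over a str yields one character at a time.
    chunks = source if char_by_char else [source]
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.events


def check_both_modes(source):
    # Both feeding styles should produce the same event stream.
    whole = collect_events(source, char_by_char=False)
    split = collect_events(source, char_by_char=True)
    assert whole == split, (whole, split)
    return whole


events = check_both_modes('<p>&#foo;</p>')
print(events)
```

A check of this shape would likely have caught the report above. With char-by-char feeding the broken charref presumably ends up at the start of the buffer, where the faulty slice happens to be correct.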
msg209911 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-02-01 19:13
Here's a patch against 2.7.
msg209914 - (view) Author: Roundup Robot (python-dev) Date: 2014-02-01 19:23
New changeset 0d50b5851f38 by Ezio Melotti in branch '2.7':
#20288: fix handling of invalid numeric charrefs in HTMLParser.
http://hg.python.org/cpython/rev/0d50b5851f38

New changeset 32097f193892 by Ezio Melotti in branch '3.3':
#20288: fix handling of invalid numeric charrefs in HTMLParser.
http://hg.python.org/cpython/rev/32097f193892

New changeset 92b3928bfde1 by Ezio Melotti in branch 'default':
#20288: merge with 3.3.
http://hg.python.org/cpython/rev/92b3928bfde1
msg211202 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-02-14 05:31
This is now fixed, thanks for the report!

> This should be fixed, and the behavior of _run_check should probably be
> changed too -- maybe it could test both the char-by-char and the
> regular feeding.

I created #20623 to track this.
History
Date | User | Action | Args
2014-02-14 05:31:06 | ezio.melotti | set | status: open -> closed; resolution: fixed; messages: + msg211202; stage: needs patch -> resolved
2014-02-01 19:23:11 | python-dev | set | nosy: + python-dev; messages: + msg209914
2014-02-01 19:13:40 | ezio.melotti | set | files: + issue20288.diff; keywords: + patch; messages: + msg209911
2014-01-17 18:35:24 | ezio.melotti | set | versions: + Python 2.7, Python 3.3, Python 3.4; nosy: + r.david.murray; messages: + msg208350; stage: needs patch
2014-01-17 14:18:40 | ezio.melotti | set | assignee: ezio.melotti; nosy: + ezio.melotti
2014-01-17 14:06:13 | iko | create |