classification
Title: html.HTMLParser raises UnboundLocalError:
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 3.4, Python 3.3
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: ezio.melotti
Nosy List: Thomas.Barlow, bmispelon, ezio.melotti, python-dev, r.david.murray
Priority: normal
Keywords: easy, patch

Created on 2013-04-20 10:58 by bmispelon, last changed 2013-05-01 13:25 by ezio.melotti. This issue is now closed.

Files
File name Uploaded Description
issue17802-unittest.patch Thomas.Barlow, 2013-04-22 19:26 Patch for unit tests to reproduce issue 17802
issue17802.diff ezio.melotti, 2013-04-23 05:32
Messages (6)
msg187414 - Author: Baptiste Mispelon (bmispelon) Date: 2013-04-20 10:58
When trying to parse the string `a&b`, the parser raises an UnboundLocalError:

{{{
>>> from html.parser import HTMLParser
>>> p = HTMLParser()
>>> p.feed('a&b')
>>> p.close()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/html/parser.py", line 149, in close
    self.goahead(1)
  File "/usr/lib/python3.3/html/parser.py", line 252, in goahead
    if k <= i:
UnboundLocalError: local variable 'k' referenced before assignment
}}}

Granted, the HTML is invalid, but this error looks like it might have been an oversight.
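
For readers unfamiliar with this failure mode, here is a minimal sketch of how a local variable that is assigned on only some code paths can produce the error in the traceback above. `scan_unpatched` and `scan_patched` are hypothetical stand-ins, not the real `goahead()`. The sketch assumes that the failing branch reads `k` before any assignment to it has run, and the "patched" variant shows only one plausible shape of a fix.

```python
def scan_unpatched(rawdata, end):
    """Hypothetical stand-in for the failing branch; not the real goahead()."""
    i = rawdata.find("&")
    if i >= 0 and end and rawdata[i + 1:].isalnum():
        # 'k' is a local name (it is assigned below), but nothing has bound it
        # yet on this path, so the comparison raises UnboundLocalError.
        if k <= i:
            k = len(rawdata)
        return k
    return i


def scan_patched(rawdata, end):
    """Same sketch with 'k' bound before it is read (an assumed fix)."""
    i = rawdata.find("&")
    if i >= 0 and end and rawdata[i + 1:].isalnum():
        k = len(rawdata)  # assumption: the partial entity runs to end of input
        if k <= i:
            k = len(rawdata)
        return k
    return i


try:
    scan_unpatched("a&b", True)
except UnboundLocalError as exc:
    print("unpatched:", type(exc).__name__)

print("patched:", scan_patched("a&b", True))
```

In this sketch the error only appears when the characters after `&` run to the end of the string, so an input such as `"a&b "` never reaches the unbound read.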
msg187416 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2013-04-20 11:43
Thanks for the report.  Yes, that's in a complicated bit of error recovery code, and clearly you found a path through it that doesn't have a corresponding test :)
msg187582 - Author: Thomas Barlow (Thomas.Barlow) Date: 2013-04-22 19:26
I'm adding a patch with a few unit tests that demonstrate the issue; comments are welcome.  This is my first patch, and I believe I have put the tests in the correct place.

The problem appears to occur only when the input ends in an incomplete XML entity, that is, when the `&` is followed by a sequence of characters that are valid in an XML entity's name and that sequence runs to the end-of-file.

The test case for "a&b " passes, because the parser detects the space as an illegal character for the entity name.
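
The inputs discussed above can be exercised with a small event recorder. This is a sketch rather than the attached patch: `EventCollector` and `collect` are hypothetical names, and `convert_charrefs=False` is assumed to be available (it exists on Python 3.4 and later, not on 3.3) so that `handle_entityref` stays in use. The exact events reported for incomplete entities can differ between Python versions, so the sketch prints them instead of asserting a fixed result.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper that records callbacks as (kind, value) pairs."""

    def __init__(self):
        # convert_charrefs exists on Python 3.4+; False keeps entity
        # references flowing through handle_entityref.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))


def collect(source):
    """Feed `source`, close the parser, and return the recorded events."""
    parser = EventCollector()
    parser.feed(source)
    parser.close()  # the call that raised for 'a&b' in the report
    return parser.events


for sample in ("a&", "a&b", "a&b ", "a&b;"):
    print(repr(sample), collect(sample))
```

On an interpreter that includes the fix, none of the four inputs should raise from `close()`.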
msg187608 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2013-04-23 05:32
Thanks for the patch, Thomas!
Starting from your work I made an updated patch that fixes the bug, but at the same time the tests revealed another possible issue.
For invalid character references, HTMLParser still calls handle_entityref instead of reporting them as 'data'.  I'm not sure which behavior is preferable, but in any case this is a separate issue.
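
Until that separate question is settled, a caller can choose the 'data' behavior on its own side. The sketch below is one possible approach, not part of the patch: `TextExtractor` and `extract_text` are hypothetical names, and `convert_charrefs=False` is assumed to be available (Python 3.4 and later). Names found in `html.entities.name2codepoint` are decoded; anything else is kept as literal text.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical subclass: unknown entity names are kept as literal text."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # parameter exists on 3.4+
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        if name in name2codepoint:
            # Known entity: replace it with the character it names.
            self.parts.append(chr(name2codepoint[name]))
        else:
            # The callback receives only the name, so whether the source had
            # a trailing ';' is unknown; this sketch emits '&name' without one.
            self.parts.append("&" + name)


def extract_text(source):
    """Return the text of `source` with known entities decoded."""
    parser = TextExtractor()
    parser.feed(source)
    parser.close()
    return "".join(parser.parts)


print(extract_text("x &bogus y &amp; z"))
```

The trade-off is that the original spelling of an unknown reference cannot be reproduced exactly, because `handle_entityref` is not told how the reference was terminated.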
msg188222 - Author: Roundup Robot (python-dev) Date: 2013-05-01 13:20
New changeset 9cb90c1a1a46 by Ezio Melotti in branch '3.3':
#17802: Fix an UnboundLocalError in html.parser.  Initial tests by Thomas Barlow.
http://hg.python.org/cpython/rev/9cb90c1a1a46

New changeset 20be90a3a714 by Ezio Melotti in branch 'default':
#17802: merge with 3.3.
http://hg.python.org/cpython/rev/20be90a3a714
msg188224 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2013-05-01 13:25
Fixed, thanks for the report!
History
Date                 User            Action  Args
2013-05-01 13:25:05  ezio.melotti    set     status: open -> closed; resolution: fixed; messages: + msg188224; stage: patch review -> resolved
2013-05-01 13:20:15  python-dev      set     nosy: + python-dev; messages: + msg188222
2013-04-23 05:33:00  ezio.melotti    set     files: + issue17802.diff; messages: + msg187608; stage: needs patch -> patch review
2013-04-22 19:26:41  Thomas.Barlow   set     files: + issue17802-unittest.patch; nosy: + Thomas.Barlow; messages: + msg187582; keywords: + patch
2013-04-20 11:48:08  ezio.melotti    set     assignee: ezio.melotti
2013-04-20 11:43:20  r.david.murray  set     type: crash -> behavior; versions: + Python 3.4; keywords: + easy; nosy: + r.david.murray, ezio.melotti; messages: + msg187416; stage: needs patch
2013-04-20 10:58:16  bmispelon       create