Title: html.HTMLParser raises UnboundLocalError:
Components: Library (Lib) Versions: Python 3.3, Python 3.4
Created on 2013-04-20 10:58 by bmispelon, last changed 2022-04-11 14:57 by admin.

issue17802-unittest.patch Thomas.Barlow, 2013-04-22 19:26 Patch for unit tests to reproduce issue 17802
issue17802.diff ezio.melotti, 2013-04-23 05:32
msg187414 - (view) Author: Baptiste Mispelon (bmispelon) * Date: 2013-04-20 10:58
When trying to parse the string `a&b`, the parser raises an UnboundLocalError:

>>> from html.parser import HTMLParser
>>> p = HTMLParser()
>>> p.feed('a&b')
>>> p.close()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/html/", line 149, in close
  File "/usr/lib/python3.3/html/", line 252, in goahead
    if k <= i:
UnboundLocalError: local variable 'k' referenced before assignment

Granted, the HTML is invalid, but this error looks like it might have been an oversight.
msg187416 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-04-20 11:43
Thanks for the report.  Yes, that's in a complicated bit of error recovery code, and clearly you found a path through it that doesn't have a corresponding test :)
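
For readers unfamiliar with this error class, here is a minimal sketch of how an untested branch can produce it. The function `advance` and its `at_end` parameter are hypothetical stand-ins, not the actual html.parser code: a local variable is bound on only one path, then read on all of them.

```python
def advance(text, at_end):
    # Hypothetical stand-in for an error-recovery branch; NOT the real parser code.
    if not at_end:
        k = len(text)  # k is bound only on this path
    if k <= 0:         # reading k raises UnboundLocalError when at_end is True
        k = 1
    return k


normal_result = advance("a&b", at_end=False)

try:
    advance("a&b", at_end=True)
except UnboundLocalError as exc:
    caught = exc
    print("caught:", type(exc).__name__)
```

A test suite that only ever calls the function with `at_end=False` would never hit the failing path, which is the situation described above.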
msg187582 - (view) Author: Thomas Barlow (Thomas.Barlow) * Date: 2013-04-22 19:26
Just adding a patch here with a few unit tests to demonstrate the issue; comments are welcome.  This is my first patch, but I believe I have put the tests in the correct place.

It appears the problem only occurs when the input ends in an incomplete XML entity, that is, when a sequence of characters that are valid in an XML entity's name runs all the way to the end-of-file.

The test case for "a&b " passes, because the parser detects the space as an illegal character for the entity name.
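
The two inputs discussed above can be compared with a small script. This is only a sketch: `CollectingParser` and `parse_text` are hypothetical helper names, and the exact text handed to `handle_data` may differ between Python versions. The point is that `close()` should return without raising for both inputs on an interpreter that includes the fix.

```python
from html.parser import HTMLParser


class CollectingParser(HTMLParser):
    """Hypothetical helper that records the text passed to handle_data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_text(source):
    parser = CollectingParser()
    parser.feed(source)
    parser.close()  # the call that raised UnboundLocalError in the report
    return "".join(parser.chunks)


# "a&b" ends inside a possible entity name; "a&b " ends it with a space.
results = {source: parse_text(source) for source in ("a&b", "a&b ")}
for source, text in results.items():
    print(repr(source), "->", repr(text))
```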
msg187608 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-04-23 05:32
Thanks for the patch Thomas!
Starting from your work I made an updated patch that fixes the bug, but at the same time the tests revealed another possible issue.
In case of invalid character references, HTMLParser still calls handle_entityref instead of reporting them as 'data'.  I'm not sure which behavior is preferable, but in any case this is a separate issue.
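
The handle_entityref behavior mentioned above can be observed with a sketch like the following. `EventRecorder` is a hypothetical helper, and the `convert_charrefs=False` argument is an assumption needed on newer interpreters: that parameter was added in Python 3.4, after this report, and with its later default of True handle_entityref is not called at all.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper that logs which handler each piece of input reaches."""

    def __init__(self):
        # convert_charrefs=False keeps handle_entityref in play (assumption:
        # running on Python 3.4 or later, where this parameter exists).
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))


recorder = EventRecorder()
recorder.feed("a&b ")  # "b" is not a defined entity name
recorder.close()
events = recorder.events
print(events)
```

If the behavior described in the message still holds, the printed list contains an ("entityref", "b") entry rather than treating "&b" as plain data.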
msg188222 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-05-01 13:20
New changeset 9cb90c1a1a46 by Ezio Melotti in branch '3.3':
#17802: Fix an UnboundLocalError in html.parser.  Initial tests by Thomas Barlow.

New changeset 20be90a3a714 by Ezio Melotti in branch 'default':
#17802: merge with 3.3.
msg188224 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-05-01 13:25
Fixed, thanks for the report!
