HTMLParser raises exception on some inputs
Created on 2018-02-19

Author: Hanno Boeck, Date: 2018-02-19
I noticed that the HTMLParser will raise an exception on some inputs.
I'm not sure what the expectations here are, but given that real-world HTML often contains all kinds of broken content I would assume an HTMLParser to always try to parse a document and not be interrupted by an exception if an error occurs.

Here's a minified example:
#!/usr/bin/env python3
import html.parser

parser = html.parser.HTMLParser()
parser.feed("<![\n")

However, I actually stumbled upon HTMLParser failing on a real webpage:

Exception of minified example:

Traceback (most recent call last):
  File "./", line 5, in <module>
  File "/usr/lib64/python3.6/html/", line 111, in feed
  File "/usr/lib64/python3.6/html/", line 179, in goahead
    k = self.parse_html_declaration(i)
  File "/usr/lib64/python3.6/html/", line 264, in parse_html_declaration
    return self.parse_marked_section(i)
  File "/usr/lib64/python3.6/", line 149, in parse_marked_section
    sectName, j = self._scan_name( i+3, i )
  File "/usr/lib64/python3.6/", line 391, in _scan_name
    % rawdata[declstartpos:declstartpos+20])
  File "/usr/lib64/python3.6/", line 34, in error
    "subclasses of ParserBase must override error()")
NotImplementedError: subclasses of ParserBase must override error()
Author: Steven D'Aprano, Date: 2018-02-19
The stdlib HTML parser requires correct HTML.

To parse broken HTML, as you find in the real world, you need a third-party library like BeautifulSoup. BeautifulSoup is much more complex (about 7-8 times as many LOC) but can handle nearly anything a browser can.

I doubt the stdlib will ever compete with BeautifulSoup.
Author: Hanno Boeck, Date: 2018-02-19
Actually BeautifulSoup also uses the python html parser in the backend, so it has the same problem. (It can use alternative backends, but the python parser is the default and they also describe it as "lenient", which I would interpret as "it can handle that".)
Author: Ezio Melotti, Date: 2018-02-19
The HTMLParser has been updated to handle HTML5 and should never fail parsing a document, so if it raises an error it's probably a bug.
Author: Berker Peksag, Date: 2018-08-23
Issue 34480 is another relevant issue. The HTMLParser class doesn't have an error() method and it doesn't raise any exceptions, but its base class still does. I think there is a compatibility problem between the html.parser.HTMLParser and _markupbase.ParserBase classes. See for more details about this.
Author: Ezio Melotti, Date: 2018-09-14
There are at least a couple of issues here.

The first one is the way the parser handles '<![...'. The linked page contains markup like '<![STAT]-[USER-ACTIVE]!>', and since the parser currently checks for '<![' only, parse_marked_section() (see the traceback above) gets called and an error is incorrectly raised.
However, "Markup declaration open state" [0] states that after consuming '<!' there are only 4 valid paths forward:
1) if we have '<!--', it's a comment;
2) if we have '<!doctype', it's a doctype declaration;
3) if we have '<![CDATA[', it's a CDATA section;
4) if it's something else, it's a bogus comment;

The above example should therefore fall into 4), and be treated like a bogus comment.
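
For illustration, the four paths listed above can be sketched as a small standalone classifier. This is not the code from html.parser or from the PR; the function name classify_markup_declaration is hypothetical, and the sketch only mirrors the simplified four-way split described in this comment.

```python
def classify_markup_declaration(rawdata: str, i: int = 0) -> str:
    """Classify the construct starting at rawdata[i], which must begin with '<!'.

    Hypothetical helper: follows the four paths listed in the comment above,
    not the actual html.parser implementation.
    """
    if not rawdata.startswith("<!", i):
        raise ValueError("expected '<!' at position %d" % i)
    rest = rawdata[i + 2:]
    if rest.startswith("--"):
        return "comment"            # path 1: '<!--'
    if rest[:7].lower() == "doctype":
        return "doctype"            # path 2: '<!doctype', case-insensitive
    if rest.startswith("[CDATA["):
        return "cdata"              # path 3: '<![CDATA[', case-sensitive
    return "bogus comment"          # path 4: anything else


for sample in ("<!-- x -->", "<!DOCTYPE html>", "<![CDATA[x]]>",
               "<![STAT]-[USER-ACTIVE]!>", "<![\n"):
    print(repr(sample), "->", classify_markup_declaration(sample))
```

Under this split, both the markup from the linked page and the minified example end up in the last branch.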

PR-9295 changes parse_html_declaration() to align to the specs and implement path 3), resulting in the webpage being parsed without errors (the invalid markup is considered as a bogus comment).

The second issue is about an EOF in the middle of a bogus markup declaration, like in the minified example provided by OP ("<![\n").  In this case the comment should still be emitted ('[\n'), but currently nothing gets emitted.  I'll look more into it either tomorrow or later this month and update the PR accordingly (or perhaps I'll open a separate issue).

Author: Irit Katriel, Date: 2021-09-09
I get a different error now:

>>> import html.parser
>>> html.parser.HTMLParser().feed("<![\n")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/iritkatriel/src/cpython-1/Lib/html/", line 110, in feed
  File "/Users/iritkatriel/src/cpython-1/Lib/html/", line 178, in goahead
    k = self.parse_html_declaration(i)
  File "/Users/iritkatriel/src/cpython-1/Lib/html/", line 263, in parse_html_declaration
    return self.parse_marked_section(i)
  File "/Users/iritkatriel/src/cpython-1/Lib/", line 144, in parse_marked_section
    sectName, j = self._scan_name( i+3, i )
  File "/Users/iritkatriel/src/cpython-1/Lib/", line 390, in _scan_name
    raise AssertionError(
AssertionError: expected name token at '<![\n'
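
Since the exception type differs between the two tracebacks in this thread (NotImplementedError earlier, AssertionError here), a reproduction script that works on either kind of interpreter may be handy. The following is only a sketch: the helper name try_feed is made up for this example, and it assumes those two exception types are the only ones of interest.

```python
import html.parser


def try_feed(data):
    """Feed data to a fresh HTMLParser; return the exception raised, or None.

    Hypothetical helper: only catches the two exception types seen in the
    tracebacks in this thread.
    """
    parser = html.parser.HTMLParser()
    try:
        parser.feed(data)
    except (NotImplementedError, AssertionError) as exc:
        return exc
    return None


for sample in ("<![\n", "<![STAT]-[USER-ACTIVE]!>", "<p>ok</p>"):
    exc = try_feed(sample)
    outcome = type(exc).__name__ if exc is not None else "parsed without error"
    print(repr(sample), "->", outcome)
```

The output depends on the Python version in use, so no expected result is listed here.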
