This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: HTMLParser raises exception on some inputs
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 3.11
process
Status: open
Resolution:
Dependencies:
Superseder: HTMLParser: undocumented not implemented method (View: 31844)
Assigned To: ezio.melotti
Nosy List: berker.peksag, ezio.melotti, hanno, iritkatriel, steven.daprano
Priority: normal
Keywords: patch

Created on 2018-02-19 19:52 by hanno, last changed 2022-04-11 14:58 by admin.

Pull Requests
URL  Status  Linked
PR 9295  open  ezio.melotti, 2018-09-14 07:17
Messages (10)
msg312363 - (view) Author: Hanno Boeck (hanno) * Date: 2018-02-19 19:52
I noticed that the HTMLParser will raise an exception on some inputs.
I'm not sure what the expectations are here, but given that real-world HTML often contains all kinds of broken content, I would expect HTMLParser to always try to parse a document and not be interrupted by an exception when it encounters an error.

Here's a minified example:
#!/usr/bin/env python3
import html.parser
html.parser.HTMLParser().feed("<![\n")

However, I actually stumbled upon this failure on a real webpage:
https://kafanews.com/

Exception of minified example:

Traceback (most recent call last):
  File "./foo.py", line 5, in <module>
    html.parser.HTMLParser().feed("<![\n")
  File "/usr/lib64/python3.6/html/parser.py", line 111, in feed
    self.goahead(0)
  File "/usr/lib64/python3.6/html/parser.py", line 179, in goahead
    k = self.parse_html_declaration(i)
  File "/usr/lib64/python3.6/html/parser.py", line 264, in parse_html_declaration
    return self.parse_marked_section(i)
  File "/usr/lib64/python3.6/_markupbase.py", line 149, in parse_marked_section
    sectName, j = self._scan_name( i+3, i )
  File "/usr/lib64/python3.6/_markupbase.py", line 391, in _scan_name
    % rawdata[declstartpos:declstartpos+20])
  File "/usr/lib64/python3.6/_markupbase.py", line 34, in error
    "subclasses of ParserBase must override error()")
NotImplementedError: subclasses of ParserBase must override error()
msg312379 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-02-19 23:02
The stdlib HTML parser requires correct HTML.

To parse broken HTML, as you find in the real world, you need a third-party library like BeautifulSoup. BeautifulSoup is much more complex (about 7-8 times as many LOC) but can handle nearly anything a browser can.

I doubt the stdlib will ever compete with BeautifulSoup.
msg312380 - (view) Author: Hanno Boeck (hanno) * Date: 2018-02-19 23:05
Actually, BeautifulSoup also uses the Python HTML parser as its backend, so it has the same problem. (It can use alternative backends, but the Python parser is the default, and they also describe it as "lenient", which I would interpret as "it can handle that".)
msg312381 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2018-02-19 23:07
The HTMLParser has been updated to handle HTML5 and should never fail parsing a document, so if it raises an error it's probably a bug.
msg323971 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2018-08-23 18:31
Issue 34480 is another relevant issue. The HTMLParser class doesn't have an error() method and it doesn't raise any exceptions, but its base class still does. I think there is a compatibility problem between the html.parser.HTMLParser and _markupbase.ParserBase classes. See https://bugs.python.org/msg323966 for more details about this.
msg325330 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2018-09-14 07:28
There are at least a couple of issues here.

The first one is the way the parser handles '<![...'. The linked page contains markup like '<![STAT]-[USER-ACTIVE]!>', and since the parser currently checks only for '<![', _markupbase.py:parse_marked_section gets called and an error is incorrectly raised.
However, "8.2.4.42. Markup declaration open state"[0] states that after consuming '<!' there are only 4 valid paths forward:
1) if we have '<!--', it's a comment;
2) if we have '<!doctype', it's a doctype declaration;
3) if we have '<![CDATA[', it's a CDATA section;
4) if it's something else, it's a bogus comment.

The above example should therefore fall into 4), and be treated like a bogus comment.
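
The four paths listed above can be sketched as a small standalone classifier. This is only an illustration of the summary in this message, not the code in html/parser.py or in the PR; the function name classify_markup_declaration is hypothetical, and the sketch ignores the spec's extra condition that CDATA sections only apply in foreign content.

```python
def classify_markup_declaration(rawdata: str, i: int = 0) -> str:
    """Classify the construct at rawdata[i:], which must start with '<!'.

    Hypothetical helper mirroring the four paths summarized above.
    """
    if not rawdata.startswith("<!", i):
        raise ValueError("expected '<!' at position %d" % i)
    rest = rawdata[i + 2:]
    # Path 1: '<!--' opens a comment.
    if rest.startswith("--"):
        return "comment"
    # Path 2: 'doctype' is matched case-insensitively.
    if rest[:7].lower() == "doctype":
        return "doctype"
    # Path 3: '[CDATA[' is matched case-sensitively.
    if rest.startswith("[CDATA["):
        return "cdata"
    # Path 4: anything else is a bogus comment.
    return "bogus comment"


for sample in ("<!-- x -->", "<!DOCTYPE html>", "<![CDATA[x]]>",
               "<![STAT]-[USER-ACTIVE]!>", "<![\n"):
    print(repr(sample), "->", classify_markup_declaration(sample))
```

Under this reading, both the markup from the linked page and the minified example end up on path 4.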

PR 9295 changes parse_html_declaration() to align with the spec and implement path 3), so the webpage is parsed without errors (the invalid markup is treated as a bogus comment).


The second issue is about an EOF in the middle of a bogus markup declaration, like in the minified example provided by OP ("<![\n").  In this case the comment should still be emitted ('[\n'), but currently nothing gets emitted.  I'll look more into it either tomorrow or later this month and update the PR accordingly (or perhaps I'll open a separate issue).
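
One way to observe what the parser emits for such inputs is a subclass that records every handle_comment() call. This is a minimal sketch; the names CommentCollector and collect_comments are hypothetical. What it returns for "<![\n" depends on the Python version, and on the versions discussed in this issue the call raises instead of returning.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical helper that records the data of every emitted comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # Called by HTMLParser for regular and bogus comments alike.
        self.comments.append(data)


def collect_comments(markup: str) -> list:
    parser = CommentCollector()
    parser.feed(markup)
    # close() signals EOF so that buffered data gets processed.
    parser.close()
    return parser.comments


print(collect_comments("<p>a</p><!-- hi -->"))
```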


[0]: https://www.w3.org/TR/html52/syntax.html#tokenizer-markup-declaration-open-state
msg401507 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-09-09 18:08
I get a different error now:

>>> import html.parser
>>> html.parser.HTMLParser().feed("<![\n")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/iritkatriel/src/cpython-1/Lib/html/parser.py", line 110, in feed
    self.goahead(0)
    ^^^^^^^^^^^^^^^
  File "/Users/iritkatriel/src/cpython-1/Lib/html/parser.py", line 178, in goahead
    k = self.parse_html_declaration(i)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iritkatriel/src/cpython-1/Lib/html/parser.py", line 263, in parse_html_declaration
    return self.parse_marked_section(i)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iritkatriel/src/cpython-1/Lib/_markupbase.py", line 144, in parse_marked_section
    sectName, j = self._scan_name( i+3, i )
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iritkatriel/src/cpython-1/Lib/_markupbase.py", line 390, in _scan_name
    raise AssertionError(
    ^^^^^^^^^^^^^^^^^^^^^
AssertionError: expected name token at '<![\n'
msg410559 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2022-01-14 14:02
The error() method was removed in issue31844.
msg410561 - (view) Author: Hanno Boeck (hanno) * Date: 2022-01-14 14:29
Now the example code raises an AssertionError(). Is that intended? I don't think that's any better.

I usually wouldn't expect an HTML parser to raise any error if you pass it a string, but instead to do fault-tolerant parsing. And if some inputs are expected to raise exceptions, I think this should at least be properly documented.
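
Until the behaviour is settled, callers that need to survive such inputs can guard the call themselves. The sketch below is a workaround under assumptions, not a documented API: the helper name feed_leniently is hypothetical, and the two exception types are simply the ones reported in this issue (NotImplementedError in the first traceback, AssertionError in the later one).

```python
from html.parser import HTMLParser


def feed_leniently(parser: HTMLParser, data: str) -> bool:
    """Feed data to parser; return False if the parser raised.

    Hypothetical helper. After a failure the parser's internal buffer
    may be left in an inconsistent state, so callers may prefer to
    discard the parser or call reset() on it.
    """
    try:
        parser.feed(data)
    except (AssertionError, NotImplementedError):
        return False
    return True


print(feed_leniently(HTMLParser(), "<p>ok</p>"))
print(feed_leniently(HTMLParser(), "<![\n"))
```

Whether the second call reports a failure depends on the Python version in use.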
msg410563 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2022-01-14 14:32
Reopening to discuss what the correct behaviour should be.
History
Date  User  Action  Args
2022-04-11 14:58:57  admin  set  github: 77057
2022-01-14 14:32:47  iritkatriel  set  status: closed -> open
    resolution: out of date ->
    messages: + msg410563
    versions: + Python 3.11, - Python 2.7, Python 3.6, Python 3.7, Python 3.8
2022-01-14 14:29:29  hanno  set  messages: + msg410561
2022-01-14 14:02:16  iritkatriel  set  status: open -> closed
    superseder: HTMLParser: undocumented not implemented method
    messages: + msg410559
    resolution: out of date
    stage: patch review -> resolved
2021-09-09 18:08:59  iritkatriel  set  nosy: + iritkatriel
    messages: + msg401507
2018-09-14 07:28:22  ezio.melotti  set  messages: + msg325330
    versions: + Python 2.7, Python 3.7, Python 3.8
2018-09-14 07:17:01  ezio.melotti  set  keywords: + patch
    stage: patch review
    pull_requests: + pull_request8724
2018-08-23 18:31:53  berker.peksag  set  nosy: + berker.peksag
    messages: + msg323971
2018-02-26 04:10:09  ezio.melotti  set  assignee: ezio.melotti
2018-02-19 23:07:20  ezio.melotti  set  messages: + msg312381
2018-02-19 23:05:00  hanno  set  messages: + msg312380
2018-02-19 23:02:09  steven.daprano  set  nosy: + steven.daprano
    messages: + msg312379
2018-02-19 20:09:13  serhiy.storchaka  set  nosy: + ezio.melotti
2018-02-19 19:52:16  hanno  create