classification
Title: HTMLParser raises exception on some inputs
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.8, Python 3.7, Python 3.6, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: berker.peksag, ezio.melotti, hanno, steven.daprano
Priority: normal Keywords: patch

Created on 2018-02-19 19:52 by hanno, last changed 2018-09-14 07:28 by ezio.melotti.

Pull Requests
URL Status Linked Edit
PR 9295 open ezio.melotti, 2018-09-14 07:17
Messages (6)
msg312363 - (view) Author: Hanno Boeck (hanno) * Date: 2018-02-19 19:52
I noticed that the HTMLParser will raise an exception on some inputs.
I'm not sure what the expectations here are, but given that real-world HTML often contains all kinds of broken content I would assume an HTMLParser to always try to parse a document and not be interrupted by an exception if an error occurs.

Here's a minified example:
#!/usr/bin/env python3
import html.parser
html.parser.HTMLParser().feed("<![\n")

However, I actually stumbled upon HTMLParser failing on a real webpage:
https://kafanews.com/

Exception of minified example:

Traceback (most recent call last):
  File "./foo.py", line 5, in <module>
    html.parser.HTMLParser().feed("<![\n")
  File "/usr/lib64/python3.6/html/parser.py", line 111, in feed
    self.goahead(0)
  File "/usr/lib64/python3.6/html/parser.py", line 179, in goahead
    k = self.parse_html_declaration(i)
  File "/usr/lib64/python3.6/html/parser.py", line 264, in parse_html_declaration
    return self.parse_marked_section(i)
  File "/usr/lib64/python3.6/_markupbase.py", line 149, in parse_marked_section
    sectName, j = self._scan_name( i+3, i )
  File "/usr/lib64/python3.6/_markupbase.py", line 391, in _scan_name
    % rawdata[declstartpos:declstartpos+20])
  File "/usr/lib64/python3.6/_markupbase.py", line 34, in error
    "subclasses of ParserBase must override error()")
NotImplementedError: subclasses of ParserBase must override error()
msg312379 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-02-19 23:02
The stdlib HTML parser requires correct HTML.

To parse broken HTML, as you find in the real world, you need a third-party library like BeautifulSoup. BeautifulSoup is much more complex (about 7-8 times as many LOC) but can handle nearly anything a browser can.

I doubt the stdlib will ever compete with BeautifulSoup.
msg312380 - (view) Author: Hanno Boeck (hanno) * Date: 2018-02-19 23:05
Actually, BeautifulSoup also uses the Python HTML parser in the backend, so it has the same problem. (It can use alternative backends, but the Python parser is the default, and they also describe it as "lenient", which I would interpret as "it can handle that".)
msg312381 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2018-02-19 23:07
The HTMLParser has been updated to handle HTML5 and should never fail parsing a document, so if it raises an error it's probably a bug.
msg323971 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2018-08-23 18:31
Issue 34480 is another relevant issue. The HTMLParser class doesn't have an error() method and it doesn't raise any exceptions, but its base class still does. I think there is a compatibility problem between the html.parser.HTMLParser and _markupbase.ParserBase classes. See https://bugs.python.org/msg323966 for more details about this.
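
A quick way to check this mismatch on a given interpreter is to look for an error() definition along the class hierarchy. This is only a minimal diagnostic sketch: error_method_owner is a hypothetical helper name, and the result depends on the Python version in use, so no expected output is shown.

```python
from html.parser import HTMLParser


def error_method_owner(cls=HTMLParser):
    """Return the name of the first class in the MRO that defines error().

    Hypothetical helper for illustration only; returns None when no class
    in the hierarchy defines an error() method.
    """
    for klass in cls.__mro__:
        if "error" in vars(klass):
            return klass.__name__
    return None


if __name__ == "__main__":
    # The answer varies between Python versions.
    print(error_method_owner())
```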
msg325330 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2018-09-14 07:28
There are at least a couple of issues here.

The first one is the way the parser handles '<![...'.  The linked page contains markup like '<![STAT]-[USER-ACTIVE]!>', and since the parser currently checks for '<![' only, _markupbase.py:parse_marked_section gets called and an error gets incorrectly raised.
However, "8.2.4.42. Markup declaration open state"[0] states that after consuming '<!' there are only 4 valid paths forward:
1) if we have '<!--', it's a comment;
2) if we have '<!doctype', it's a doctype declaration;
3) if we have '<![CDATA[', it's a CDATA section;
4) if it's something else, it's a bogus comment.

The above example should therefore fall into 4), and be treated like a bogus comment.
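
The four paths above can be sketched as a small classifier. This is an illustration of the decision order only, not the code in html.parser or in the PR: classify_markup_declaration is a hypothetical name, and it follows the summary in this message rather than the full tokenizer state machine. It assumes 'doctype' is matched case-insensitively and '[CDATA[' case-sensitively.

```python
def classify_markup_declaration(text):
    """Classify markup starting with '<!' into one of the four paths.

    Hypothetical helper for illustration; it only inspects the prefix
    and does not parse the rest of the construct.
    """
    if not text.startswith("<!"):
        raise ValueError("expected text starting with '<!'")
    rest = text[2:]
    if rest.startswith("--"):
        return "comment"          # path 1: '<!--'
    if rest[:7].lower() == "doctype":
        return "doctype"          # path 2: '<!doctype'
    if rest.startswith("[CDATA["):
        return "cdata"            # path 3: '<![CDATA['
    return "bogus comment"        # path 4: anything else


if __name__ == "__main__":
    for sample in ("<!-- x -->", "<!DOCTYPE html>", "<![CDATA[x]]>",
                   "<![STAT]-[USER-ACTIVE]!>"):
        print(repr(sample), "->", classify_markup_declaration(sample))
```

With this ordering, both '<![STAT]-[USER-ACTIVE]!>' and the minified '<![\n' input fall through to the last branch.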

PR-9295 changes parse_html_declaration() to align with the specs and implement path 4), resulting in the webpage being parsed without errors (the invalid markup is considered as a bogus comment).


The second issue is about an EOF in the middle of a bogus markup declaration, like in the minified example provided by OP ("<![\n").  In this case the comment should still be emitted ('[\n'), but currently nothing gets emitted.  I'll look more into it either tomorrow or later this month and update the PR accordingly (or perhaps I'll open a separate issue).
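
To observe what a given interpreter does with such input, a small subclass can record the comments it receives. This is a minimal sketch under stated assumptions: CommentCollector and collect_comments are hypothetical names, and the exception types caught (NotImplementedError as in the traceback above, AssertionError as an assumption about later versions) may not cover every release. The result for "<![\n" depends on the Python version, so no expected output is given for it.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Record every comment the parser reports (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


def collect_comments(markup):
    """Feed markup, signal EOF, and return (comments, exception or None)."""
    parser = CommentCollector()
    try:
        parser.feed(markup)
        parser.close()  # close() tells the parser that no more data follows
    except (NotImplementedError, AssertionError) as exc:
        # Assumption: these are the exception types the base class may raise.
        return parser.comments, exc
    return parser.comments, None


if __name__ == "__main__":
    print(collect_comments("<![\n"))
```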


[0]: https://www.w3.org/TR/html52/syntax.html#tokenizer-markup-declaration-open-state
History
Date User Action Args
2018-09-14 07:28:22 ezio.melotti set messages: + msg325330; versions: + Python 2.7, Python 3.7, Python 3.8
2018-09-14 07:17:01 ezio.melotti set keywords: + patch; stage: patch review; pull_requests: + pull_request8724
2018-08-23 18:31:53 berker.peksag set nosy: + berker.peksag; messages: + msg323971
2018-02-26 04:10:09 ezio.melotti set assignee: ezio.melotti
2018-02-19 23:07:20 ezio.melotti set messages: + msg312381
2018-02-19 23:05:00 hanno set messages: + msg312380
2018-02-19 23:02:09 steven.daprano set nosy: + steven.daprano; messages: + msg312379
2018-02-19 20:09:13 serhiy.storchaka set nosy: + ezio.melotti
2018-02-19 19:52:16 hanno create