Issue 32876: HTMLParser raises exception on some inputs

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/77057

classification

Title:	HTMLParser raises exception on some inputs
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.11

process

Status:	open	Resolution:
Dependencies:		Superseder:	HTMLParser: undocumented not implemented method View: 31844
Assigned To:	ezio.melotti	Nosy List:	berker.peksag, ezio.melotti, hanno, iritkatriel, steven.daprano
Priority:	normal	Keywords:	patch

Created on 2018-02-19 19:52 by hanno, last changed 2022-04-11 14:58 by admin.

Pull Requests
URL	Status	Linked
PR 9295	open	ezio.melotti, 2018-09-14 07:17

Messages (10)
msg312363 - (view)	Author: Hanno Boeck (hanno) *	Date: 2018-02-19 19:52
I noticed that the HTMLParser will raise an exception on some inputs. I'm not sure what the expectations here are, but given that real-world HTML often contains all kinds of broken content I would assume an HTMLParser to always try to parse a document and not be interrupted by an exception if an error occurs.

Here's a minified example:

    #!/usr/bin/env python3

    import html.parser

    html.parser.HTMLParser().feed("<![\n")

However I actually stepped upon HTML failing on a real webpage:
https://kafanews.com/

Exception of minified example:

    Traceback (most recent call last):
      File "./foo.py", line 5, in <module>
        html.parser.HTMLParser().feed("<![\n")
      File "/usr/lib64/python3.6/html/parser.py", line 111, in feed
        self.goahead(0)
      File "/usr/lib64/python3.6/html/parser.py", line 179, in goahead
        k = self.parse_html_declaration(i)
      File "/usr/lib64/python3.6/html/parser.py", line 264, in parse_html_declaration
        return self.parse_marked_section(i)
      File "/usr/lib64/python3.6/_markupbase.py", line 149, in parse_marked_section
        sectName, j = self._scan_name( i+3, i )
      File "/usr/lib64/python3.6/_markupbase.py", line 391, in _scan_name
        % rawdata[declstartpos:declstartpos+20])
      File "/usr/lib64/python3.6/_markupbase.py", line 34, in error
        "subclasses of ParserBase must override error()")
    NotImplementedError: subclasses of ParserBase must override error()
msg312379 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2018-02-19 23:02
The stdlib HTML parser requires correct HTML. To parse broken HTML, as you find in the real world, you need a third-party library like BeautifulSoup. BeautifulSoup is much more complex (about 7-8 times as many LOC) but can handle nearly anything a browser can. I doubt the stdlib will ever compete with BeautifulSoup.
msg312380 - (view)	Author: Hanno Boeck (hanno) *	Date: 2018-02-19 23:05
Actually BeautifulSoup also uses the python html parser in the backend, so it has the same problem. (It can use alternative backends, but the python parser is the default and they also describe it as "lenient", which I would interpret as "it can handle that".)
msg312381 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2018-02-19 23:07
The HTMLParser has been updated to handle HTML5 and should never fail parsing a document, so if it raises an error it's probably a bug.
msg323971 - (view)	Author: Berker Peksag (berker.peksag) *	Date: 2018-08-23 18:31
Issue 34480 is another relevant issue. The HTMLParser class doesn't have an error() method and it doesn't raise any exceptions itself, but its base class still does. I think there is a compatibility problem between the html.parser.HTMLParser and _markupbase.ParserBase classes. See https://bugs.python.org/msg323966 for more details about this.
msg325330 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2018-09-14 07:28
There are at least a couple of issues here.

The first one is the way the parser handles '<![...'. The linked page contains markup like '<![STAT]-[USER-ACTIVE]!>' and since the parser currently checks for '<![' only, _markupbase.py:parse_marked_section gets called and an error gets incorrectly raised. However "8.2.4.42. Markup declaration open state"[0], states that after consuming '<!', there are only 4 valid paths forward:

1) if we have '<!--', it's a comment;
2) if we have '<!doctype', it's a doctype declaration;
3) if we have '<![CDATA[', it's a CDATA section;
4) if it's something else, it's a bogus comment.

The above example should therefore fall into 4), and be treated like a bogus comment. PR-9295 changes parse_html_declaration() to align to the specs and implement path 3), resulting in the webpage being parsed without errors (the invalid markup is considered as a bogus comment).

The second issue is about an EOF in the middle of a bogus markup declaration, like in the minified example provided by OP ("<![\n"). In this case the comment should still be emitted ('[\n'), but currently nothing gets emitted. I'll look more into it either tomorrow or later this month and update the PR accordingly (or perhaps I'll open a separate issue).

[0]: https://www.w3.org/TR/html52/syntax.html#tokenizer-markup-declaration-open-state
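The four paths listed in the message above can be sketched as a small standalone function. This is only an illustration of the decision the spec describes, not the code from PR 9295: the name classify_markup_declaration and its return strings are hypothetical, and the sketch assumes that a '<![CDATA[' match is always treated as a CDATA section, ignoring the spec's additional condition about the current node's namespace.

```python
def classify_markup_declaration(rawdata: str, i: int = 0) -> str:
    """Classify the construct that starts with '<!' at rawdata[i].

    Hypothetical helper: mirrors the four paths of the
    "markup declaration open state", nothing more.
    """
    if not rawdata.startswith("<!", i):
        raise ValueError("expected '<!' at the given position")
    rest = rawdata[i + 2:]
    if rest.startswith("--"):
        return "comment"            # path 1: '<!--'
    if rest[:7].lower() == "doctype":
        return "doctype"            # path 2: matched case-insensitively
    if rest.startswith("[CDATA["):
        return "cdata"              # path 3: matched case-sensitively
    return "bogus comment"          # path 4: everything else


# Both inputs discussed in this issue take path 4.
print(classify_markup_declaration("<![STAT]-[USER-ACTIVE]!>"))
print(classify_markup_declaration("<![\n"))
```

Checking for the full '[CDATA[' prefix rather than '<![' alone is what keeps markup such as '<![STAT]-[USER-ACTIVE]!>' away from the marked-section code path.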
msg401507 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2021-09-09 18:08
I get a different error now:

    >>> import html.parser
    >>> html.parser.HTMLParser().feed("<![\n")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/iritkatriel/src/cpython-1/Lib/html/parser.py", line 110, in feed
        self.goahead(0)
        ^^^^^^^^^^^^^^^
      File "/Users/iritkatriel/src/cpython-1/Lib/html/parser.py", line 178, in goahead
        k = self.parse_html_declaration(i)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/iritkatriel/src/cpython-1/Lib/html/parser.py", line 263, in parse_html_declaration
        return self.parse_marked_section(i)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/iritkatriel/src/cpython-1/Lib/_markupbase.py", line 144, in parse_marked_section
        sectName, j = self._scan_name( i+3, i )
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/iritkatriel/src/cpython-1/Lib/_markupbase.py", line 390, in _scan_name
        raise AssertionError(
        ^^^^^^^^^^^^^^^^^^^^^
    AssertionError: expected name token at '<![\n'
msg410559 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2022-01-14 14:02
The error() method was removed in issue31844.
msg410561 - (view)	Author: Hanno Boeck (hanno) *	Date: 2022-01-14 14:29
Now the example code raises an AssertionError(). Is that intended? I don't think that's any better. I usually wouldn't expect an HTML parser to raise any error if you pass it a string, but instead to do fault tolerant parsing. And if it's expected that some inputs can generate exceptions, at least I think this should be properly documented.
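Until the behaviour is settled, callers who need to survive such inputs could guard the call themselves. The following is a minimal sketch and not something proposed in this thread: feed_tolerantly is a hypothetical helper, and it catches only the two exception types reported above (NotImplementedError on older versions, AssertionError on newer ones). Whether a given input raises at all depends on the Python version in use.

```python
from html.parser import HTMLParser


def feed_tolerantly(parser: HTMLParser, data: str) -> bool:
    """Feed data to the parser; return False if it raised one of the
    errors reported in this issue, True otherwise.

    Hypothetical helper for illustration only.
    """
    try:
        parser.feed(data)
    except (AssertionError, NotImplementedError):
        # The parser gave up on this input; the caller decides what to do.
        return False
    return True


ok = feed_tolerantly(HTMLParser(), "<![\n")
print("parsed without error" if ok else "parser raised on this input")
```

A guard like this only hides the failure; anything the parser had not yet processed when it raised is lost, so it is no substitute for fault-tolerant parsing in the library itself.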
msg410563 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2022-01-14 14:32
Reopening to discuss what the correct behaviour should be.

History
Date	User	Action	Args
2022-04-11 14:58:57	admin	set	github: 77057
2022-01-14 14:32:47	iritkatriel	set	status: closed -> open resolution: out of date -> messages: + msg410563 versions: + Python 3.11, - Python 2.7, Python 3.6, Python 3.7, Python 3.8
2022-01-14 14:29:29	hanno	set	messages: + msg410561
2022-01-14 14:02:16	iritkatriel	set	status: open -> closed superseder: HTMLParser: undocumented not implemented method messages: + msg410559 resolution: out of date stage: patch review -> resolved
2021-09-09 18:08:59	iritkatriel	set	nosy: + iritkatriel messages: + msg401507
2018-09-14 07:28:22	ezio.melotti	set	messages: + msg325330 versions: + Python 2.7, Python 3.7, Python 3.8
2018-09-14 07:17:01	ezio.melotti	set	keywords: + patch stage: patch review pull_requests: + pull_request8724
2018-08-23 18:31:53	berker.peksag	set	nosy: + berker.peksag messages: + msg323971
2018-02-26 04:10:09	ezio.melotti	set	assignee: ezio.melotti
2018-02-19 23:07:20	ezio.melotti	set	messages: + msg312381
2018-02-19 23:05:00	hanno	set	messages: + msg312380
2018-02-19 23:02:09	steven.daprano	set	nosy: + steven.daprano messages: + msg312379
2018-02-19 20:09:13	serhiy.storchaka	set	nosy: + ezio.melotti
2018-02-19 19:52:16	hanno	create