Message 312363 - Python tracker

Author	hanno
Recipients	hanno
Date	2018-02-19.19:52:16
Message-id	<1519069936.36.0.467229070634.issue32876@psf.upfronthosting.co.za>

Content

I noticed that HTMLParser raises an exception on some inputs.
I'm not sure what the expectations are here, but given that real-world HTML often contains all kinds of broken content, I would expect HTMLParser to always try to parse a document and not abort with an exception when it hits malformed markup.

Here's a minimal example:
#!/usr/bin/env python3
import html.parser
html.parser.HTMLParser().feed("<![\n")

However, I actually stumbled upon HTMLParser failing on a real webpage:
https://kafanews.com/

Exception from the minimal example:

Traceback (most recent call last):
  File "./foo.py", line 5, in <module>
    html.parser.HTMLParser().feed("<![\n")
  File "/usr/lib64/python3.6/html/parser.py", line 111, in feed
    self.goahead(0)
  File "/usr/lib64/python3.6/html/parser.py", line 179, in goahead
    k = self.parse_html_declaration(i)
  File "/usr/lib64/python3.6/html/parser.py", line 264, in parse_html_declaration
    return self.parse_marked_section(i)
  File "/usr/lib64/python3.6/_markupbase.py", line 149, in parse_marked_section
    sectName, j = self._scan_name( i+3, i )
  File "/usr/lib64/python3.6/_markupbase.py", line 391, in _scan_name
    % rawdata[declstartpos:declstartpos+20])
  File "/usr/lib64/python3.6/_markupbase.py", line 34, in error
    "subclasses of ParserBase must override error()")
NotImplementedError: subclasses of ParserBase must override error()
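
For anyone who wants to check how their own Python version handles this input, here is a small sketch. The helper name try_feed is made up for illustration and is not part of html.parser. It catches Exception broadly on purpose, on the assumption that the exception type (or whether anything is raised at all) may differ between Python versions.

```python
import html.parser


def try_feed(markup):
    """Feed markup to a fresh HTMLParser.

    Returns the exception object if feed() raised, otherwise None.
    (try_feed is a hypothetical helper, not a standard-library function.)
    """
    parser = html.parser.HTMLParser()
    try:
        parser.feed(markup)
    except Exception as exc:  # broad by design: the type may vary by version
        return exc
    return None


if __name__ == "__main__":
    # The input from this report; the result depends on the interpreter version.
    result = try_feed("<![\n")
    if result is None:
        print("parsed without raising")
    else:
        print("raised:", type(result).__name__, result)
```

A caller who needs to survive broken pages could wrap feed() the same way and skip or log the offending document, though that only hides the problem rather than making the parser tolerant.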

History
Date	User	Action	Args
2018-02-19 19:52:16	hanno	set	recipients: + hanno
2018-02-19 19:52:16	hanno	set	messageid: <1519069936.36.0.467229070634.issue32876@psf.upfronthosting.co.za>
2018-02-19 19:52:16	hanno	link	issue32876 messages
2018-02-19 19:52:16	hanno	create