classification
Title: _markupbase.py fails with UnboundLocalError on invalid keyword in marked section
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: Pikamander2, berker.peksag, corona10, ezio.melotti, jkamdjou, kodial, leonardr, xtreak
Priority: normal Keywords: patch

Created on 2018-08-23 17:30 by kodial, last changed 2020-06-14 15:35 by corona10.

Files
File name Uploaded Description Edit
weird-bug-3.py Pikamander2, 2020-04-17 09:17
Pull Requests
URL Status Linked Edit
PR 17643 closed jkamdjou, 2020-01-04 18:09
Messages (11)
msg323962 - (view) Author: Conrad (kodial) Date: 2018-08-23 17:30
$ pip freeze | grep beautifulsoup4
beautifulsoup4==4.6.3

$ python
Python 3.7.0 (default, Jul 23 2018, 20:24:19)
[Clang 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> text = "<![hi world]>"
>>> BeautifulSoup(text, 'html.parser')
/Users/conradroche/solvvy/ml-pipeline/venv/lib/python3.7/site-packages/bs4/builder/_htmlparser.py:78: UserWarning: unknown status keyword 'hi ' in marked section
  warnings.warn(msg)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/conradroche/solvvy/ml-pipeline/venv/lib/python3.7/site-packages/bs4/__init__.py", line 282, in __init__
    self._feed()
  File "/Users/conradroche/solvvy/ml-pipeline/venv/lib/python3.7/site-packages/bs4/__init__.py", line 343, in _feed
    self.builder.feed(self.markup)
  File "/Users/conradroche/solvvy/ml-pipeline/venv/lib/python3.7/site-packages/bs4/builder/_htmlparser.py", line 247, in feed
    parser.feed(markup)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/html/parser.py", line 111, in feed
    self.goahead(0)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/html/parser.py", line 179, in goahead
    k = self.parse_html_declaration(i)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/html/parser.py", line 264, in parse_html_declaration
    return self.parse_marked_section(i)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/_markupbase.py", line 160, in parse_marked_section
    if not match:
UnboundLocalError: local variable 'match' referenced before assignment
>>>
msg323966 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2018-08-23 18:12
Thanks for the report.

HTMLParser.error() was supposed to raise an exception, but the BeautifulSoup project just prints a warning here:

    def error(self, msg):
        warnings.warn(msg)

https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/builder/_htmlparser.py#L69

As a result of this, the code doesn't stop executing in the following branch:

    else:
        self.error('unknown status keyword %r in marked section' % rawdata[i+3:j])
    if not match:
        return -1

https://github.com/python/cpython/blob/3.7/Lib/_markupbase.py#L159

Note that HTMLParser.error() was removed in Python 3.5 (https://github.com/python/cpython/commit/73a4359eb0eb624c588c5d52083ea4944f9787ea#diff-1a7486df8279dbac7f20abd487947845L171) and there is an open issue about the status of _markupbase.ParserBase.error(): Issue 31844.

I also think that https://github.com/python/cpython/commit/73a4359eb0eb624c588c5d52083ea4944f9787ea#diff-1a7486df8279dbac7f20abd487947845L171 may have caused a minor regression when it was removed the error() method and its uses from the HTMLParser class. It still calls the parse_marked_section() method of _markupbase.ParserBase() which it then calls the error() method of _markupbase.ParserBase():

    elif rawdata[i:i+3] == '<![':
        return self.parse_marked_section(i)

https://github.com/python/cpython/blob/3.7/Lib/html/parser.py#L264
msg358616 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2019-12-18 09:07
Attaching a test case for this issue since the self.error code path was not covered in the current test suite. A PR is open proposing match variable initialisation with None https://github.com/python/cpython/pull/17643 .

diff --git Lib/test/test_htmlparser.py Lib/test/test_htmlparser.py
index a2bfb39d16..a9aff11706 100644
--- Lib/test/test_htmlparser.py
+++ Lib/test/test_htmlparser.py
@@ -766,6 +766,18 @@ class AttributesTestCase(TestCaseBase):
                           [("href", "http://www.example.org/\">;")]),
                          ("data", "spam"), ("endtag", "a")])
 
+    def test_invalid_keyword_error_exception(self):
+        class InvalidMarkupException(Exception):
+            pass
+
+        class MyHTMLParser(html.parser.HTMLParser):
+
+            def error(self, message):
+                raise InvalidMarkupException(message)
+
+        parser = MyHTMLParser()
+        with self.assertRaises(InvalidMarkupException):
+            parser.feed('<![invalid>')
 
 if __name__ == "__main__":
     unittest.main()
msg359301 - (view) Author: Josh Kamdjou (jkamdjou) * Date: 2020-01-04 18:28
(Author of PR https://github.com/python/cpython/pull/17643)

Since the behavior of self.error() is determined by the subclass implementation, an Exception is not guaranteed. How should this be handled? It seems the options are:

- continue execution, in which case 'match' needs to be defined (I proposed initialization to None, which results in returning -1 on the next line)
- return a value
- raise an Exception

Happy to update the PR with @xtreak's test cases.
msg359356 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2020-01-05 17:07
HTMLParser is supposed to follow the HTML5 standard, and never raise an error.

For the example in the first comment ("<![hi world]>"), the steps should be:

* https://html.spec.whatwg.org/multipage/parsing.html#data-state:tag-open-state
* https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state:markup-declaration-open-state
* https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state:bogus-comment-state
* https://html.spec.whatwg.org/multipage/parsing.html#bogus-comment-state

I agree that the error should be fixed by setting `match` to None, and a test case that triggers the UnboundLocalError (before the fix) should be added as well (what provided by Karthikeyan looks good).

However, it also seems wrong that HTMLParser ends up calling self.error() through  Lib/_markupbase.py ParserBase after HTMLParser.error() and all the calls to it have been removed.  _markupbase.py is internal, so it should be safe to remove ParserBase.error() and the code that calls it as suggested in #31844 (and possibly to merge _markupbase into html.parser too).  Even if this is done and the call to self.error() is removed from ParserBase.parse_marked_section(), `match` still needs to be set to None (either in the `else` branch or before the `if/elif` block).
msg359362 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2020-01-05 17:23
> _markupbase.py is internal, so it should be safe to remove
> ParserBase.error() and the code that calls it as suggested in #31844

Should I reopen https://github.com/python/cpython/pull/8562 then?
msg359428 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2020-01-06 14:44
I think so.
It might be worth double-checking if BeautifulSoup (and possibly other 3rd party libs) use _markupbase.py and/or ParserBase.error().  If they do, giving them a heads up and/or going through a regular deprecation process might be good, otherwise PR 8562 can be reopened and merged after removing the call to self.error() in ParserBase.parse_marked_section().
msg366648 - (view) Author: Pikamander2 (Pikamander2) Date: 2020-04-17 09:17
Here's a simplified real world example that I found if it helps anyone:

import bs4

html = '''<html><head><!fill in here--> <![end--><!-- start: SSI y-xtra-topspace.shtml --><!-- --></head><body></body></html>'''

soup = bs4.BeautifulSoup(html, 'html.parser')
msg366649 - (view) Author: Pikamander2 (Pikamander2) Date: 2020-04-17 09:21
For reference, my Python version is 3.8.2 and my bs4 version is 4.8.1
msg366650 - (view) Author: Pikamander2 (Pikamander2) Date: 2020-04-17 09:35
I updated to bs4 version 4.9.0 and the same issue occurs.
msg371418 - (view) Author: Leonard Richardson (leonardr) * Date: 2020-06-12 20:56
I'm the maintainer of Beautiful Soup. I learned about this issue when one of my users filed a bug for it against Beautiful Soup (https://bugs.launchpad.net/beautifulsoup/+bug/1883264).

BeautifulSoupHTMLParser (my subclass of html.parser.HTMLParser) implements error() only to prevent the NotImplementedError raised by ParserBase.error(). The docstring for my error() implementation says: "this method is called only on very strange markup and our best strategy is to pretend it didn't happen and keep going."

It shouldn't affect Beautiful Soup if, in a future version of Python, HTMLParser.error is never called and ParserBase is removed.
History
Date User Action Args
2020-06-14 15:35:25corona10setnosy: + corona10
2020-06-12 20:56:04leonardrsetnosy: + leonardr
messages: + msg371418
2020-04-17 09:35:20Pikamander2setmessages: + msg366650
2020-04-17 09:21:19Pikamander2setmessages: + msg366649
2020-04-17 09:17:05Pikamander2setfiles: + weird-bug-3.py
nosy: + Pikamander2
messages: + msg366648

2020-01-06 14:44:47ezio.melottisetmessages: + msg359428
2020-01-05 17:23:57berker.peksagsetmessages: + msg359362
2020-01-05 17:07:33ezio.melottisetmessages: + msg359356
2020-01-04 18:28:48jkamdjousetnosy: + jkamdjou
messages: + msg359301
2020-01-04 18:09:23jkamdjousetkeywords: + patch
stage: test needed -> patch review
pull_requests: + pull_request17250
2019-12-18 09:07:20xtreaksetnosy: + xtreak
messages: + msg358616
2018-09-13 16:39:29ezio.melottisetassignee: ezio.melotti
2018-08-23 18:13:11berker.peksagsetmessages: - msg323967
2018-08-23 18:12:56berker.peksagsetstatus: open
2018-08-23 18:12:27berker.peksagsetnosy: ezio.melotti, berker.peksag, kodial
messages: + msg323967
2018-08-23 18:12:19berker.peksagsetstatus: open -> (no value)

nosy: + ezio.melotti, berker.peksag
messages: + msg323966

stage: test needed
2018-08-23 17:30:37kodialcreate