Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_markupbase.py fails with UnboundLocalError on invalid keyword in marked section #78661

Closed
kodial mannequin opened this issue Aug 23, 2018 · 16 comments
Closed

_markupbase.py fails with UnboundLocalError on invalid keyword in marked section #78661

kodial mannequin opened this issue Aug 23, 2018 · 16 comments
Assignees
Labels
3.7 (EOL) end of life stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@kodial
Copy link
Mannequin

kodial mannequin commented Aug 23, 2018

BPO 34480
Nosy @ezio-melotti, @ambv, @berkerpeksag, @corona10, @tirkarthi, @jkamdjou, @leonardr, @mareksuscak
PRs
  • bpo-34480: fix bug where match variable is used prior to being defined #17643
  • bpo-31844: Remove _markupbase.ParserBase.error() #8562
  • [3.9] bpo-34480: fix bug where match variable is used prior to being defined (GH-17643) #32256
  • Files
  • weird-bug-3.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = None
    created_at = <Date 2018-08-23.17:30:37.685>
    labels = ['3.7', 'type-bug', 'library']
    title = '_markupbase.py fails with UnboundLocalError on invalid keyword in marked section'
    updated_at = <Date 2022-04-02.19:50:03.485>
    user = 'https://bugs.python.org/kodial'

    bugs.python.org fields:

    activity = <Date 2022-04-02.19:50:03.485>
    actor = 'mareksuscak'
    assignee = 'ezio.melotti'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2018-08-23.17:30:37.685>
    creator = 'kodial'
    dependencies = []
    files = ['49070']
    hgrepos = []
    issue_num = 34480
    keywords = ['patch']
    message_count = 14.0
    messages = ['323962', '323966', '358616', '359301', '359356', '359362', '359428', '366648', '366649', '366650', '371418', '416130', '416546', '416582']
    nosy_count = 10.0
    nosy_names = ['ezio.melotti', 'lukasz.langa', 'berker.peksag', 'corona10', 'xtreak', 'kodial', 'jkamdjou', 'leonardr', 'Pikamander2', 'mareksuscak']
    pr_nums = ['17643', '8562', '32256']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue34480'
    versions = ['Python 3.7']

    @kodial
    Copy link
    Mannequin Author

    kodial mannequin commented Aug 23, 2018

    $ pip freeze | grep beautifulsoup4
    beautifulsoup4==4.6.3
    $ python
    Python 3.7.0 (default, Jul 23 2018, 20:24:19)
    [Clang 9.0.0 (clang-900.0.39.2)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from bs4 import BeautifulSoup
    >>> text = "<![hi world]>"
    >>> BeautifulSoup(text, 'html.parser')
    /Users/conradroche/solvvy/ml-pipeline/venv/lib/python3.7/site-packages/bs4/builder/_htmlparser.py:78: UserWarning: unknown status keyword 'hi ' in marked section
      warnings.warn(msg)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/conradroche/solvvy/ml-pipeline/venv/lib/python3.7/site-packages/bs4/__init__.py", line 282, in __init__
        self._feed()
      File "/Users/conradroche/solvvy/ml-pipeline/venv/lib/python3.7/site-packages/bs4/__init__.py", line 343, in _feed
        self.builder.feed(self.markup)
      File "/Users/conradroche/solvvy/ml-pipeline/venv/lib/python3.7/site-packages/bs4/builder/_htmlparser.py", line 247, in feed
        parser.feed(markup)
      File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/html/parser.py", line 111, in feed
        self.goahead(0)
      File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/html/parser.py", line 179, in goahead
        k = self.parse_html_declaration(i)
      File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/html/parser.py", line 264, in parse_html_declaration
        return self.parse_marked_section(i)
      File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/_markupbase.py", line 160, in parse_marked_section
        if not match:
    UnboundLocalError: local variable 'match' referenced before assignment
    >>>

    @kodial kodial mannequin added 3.7 (EOL) end of life stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Aug 23, 2018
    @berkerpeksag
    Copy link
    Member

    Thanks for the report.

    HTMLParser.error() was supposed to raise an exception, but the BeautifulSoup project just prints a warning here:

        def error(self, msg):
            warnings.warn(msg)

    https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/builder/_htmlparser.py#L69

    As a result of this, the code doesn't stop executing in the following branch:

    else:
        self.error('unknown status keyword %r in marked section' % rawdata[i+3:j])
    if not match:
        return -1
    

    https://github.com/python/cpython/blob/3.7/Lib/_markupbase.py#L159

    Note that HTMLParser.error() was removed in Python 3.5 (73a4359#diff-1a7486df8279dbac7f20abd487947845L171) and there is an open issue about the status of _markupbase.ParserBase.error(): bpo-31844.

    I also think that 73a4359#diff-1a7486df8279dbac7f20abd487947845L171 may have caused a minor regression when it was removed the error() method and its uses from the HTMLParser class. It still calls the parse_marked_section() method of _markupbase.ParserBase() which it then calls the error() method of _markupbase.ParserBase():

    elif rawdata[i:i+3] == '<![':
        return self.parse_marked_section(i)
    

    https://github.com/python/cpython/blob/3.7/Lib/html/parser.py#L264

    @ezio-melotti ezio-melotti self-assigned this Sep 13, 2018
    @tirkarthi
    Copy link
    Member

    Attaching a test case for this issue since the self.error code path was not covered in the current test suite. A PR is open proposing match variable initialisation with None #17643 .

    diff --git Lib/test/test_htmlparser.py Lib/test/test_htmlparser.py
    index a2bfb39d16..a9aff11706 100644
    --- Lib/test/test_htmlparser.py
    +++ Lib/test/test_htmlparser.py
    @@ -766,6 +766,18 @@ class AttributesTestCase(TestCaseBase):
                               [("href", "http://www.example.org/\">;")]),
                              ("data", "spam"), ("endtag", "a")])
     
    +    def test_invalid_keyword_error_exception(self):
    +        class InvalidMarkupException(Exception):
    +            pass
    +
    +        class MyHTMLParser(html.parser.HTMLParser):
    +
    +            def error(self, message):
    +                raise InvalidMarkupException(message)
    +
    +        parser = MyHTMLParser()
    +        with self.assertRaises(InvalidMarkupException):
    +            parser.feed('<![invalid>')
     
     if __name__ == "__main__":
         unittest.main()

    @jkamdjou
    Copy link
    Mannequin

    jkamdjou mannequin commented Jan 4, 2020

    (Author of PR #17643)

    Since the behavior of self.error() is determined by the subclass implementation, an Exception is not guaranteed. How should this be handled? It seems the options are:

    • continue execution, in which case 'match' needs to be defined (I proposed initialization to None, which results in returning -1 on the next line)
    • return a value
    • raise an Exception

    Happy to update the PR with @XTreak's test cases.

    @ezio-melotti
    Copy link
    Member

    HTMLParser is supposed to follow the HTML5 standard, and never raise an error.

    For the example in the first comment ("<![hi world]>"), the steps should be:

    I agree that the error should be fixed by setting match to None, and a test case that triggers the UnboundLocalError (before the fix) should be added as well (what provided by Karthikeyan looks good).

    However, it also seems wrong that HTMLParser ends up calling self.error() through Lib/_markupbase.py ParserBase after HTMLParser.error() and all the calls to it have been removed. _markupbase.py is internal, so it should be safe to remove ParserBase.error() and the code that calls it as suggested in bpo-31844 (and possibly to merge _markupbase into html.parser too). Even if this is done and the call to self.error() is removed from ParserBase.parse_marked_section(), match still needs to be set to None (either in the else branch or before the if/elif block).

    @berkerpeksag
    Copy link
    Member

    _markupbase.py is internal, so it should be safe to remove
    ParserBase.error() and the code that calls it as suggested in bpo-31844

    Should I reopen #8562 then?

    @ezio-melotti
    Copy link
    Member

    I think so.
    It might be worth double-checking if BeautifulSoup (and possibly other 3rd party libs) use _markupbase.py and/or ParserBase.error(). If they do, giving them a heads up and/or going through a regular deprecation process might be good, otherwise PR 8562 can be reopened and merged after removing the call to self.error() in ParserBase.parse_marked_section().

    @Pikamander2
    Copy link
    Mannequin

    Pikamander2 mannequin commented Apr 17, 2020

    Here's a simplified real world example that I found if it helps anyone:

    import bs4
    
    html = '''<html><head><!fill in here--> <![end--><!-- start: SSI y-xtra-topspace.shtml --><!-- --></head><body></body></html>'''
    
    soup = bs4.BeautifulSoup(html, 'html.parser')

    @Pikamander2
    Copy link
    Mannequin

    Pikamander2 mannequin commented Apr 17, 2020

    For reference, my Python version is 3.8.2 and my bs4 version is 4.8.1

    @Pikamander2
    Copy link
    Mannequin

    Pikamander2 mannequin commented Apr 17, 2020

    I updated to bs4 version 4.9.0 and the same issue occurs.

    @leonardr
    Copy link
    Mannequin

    leonardr mannequin commented Jun 12, 2020

    I'm the maintainer of Beautiful Soup. I learned about this issue when one of my users filed a bug for it against Beautiful Soup (https://bugs.launchpad.net/beautifulsoup/+bug/1883264).

    BeautifulSoupHTMLParser (my subclass of html.parser.HTMLParser) implements error() only to prevent the NotImplementedError raised by ParserBase.error(). The docstring for my error() implementation says: "this method is called only on very strange markup and our best strategy is to pretend it didn't happen and keep going."

    It shouldn't affect Beautiful Soup if, in a future version of Python, HTMLParser.error is never called and ParserBase is removed.

    @mareksuscak
    Copy link
    Mannequin

    mareksuscak mannequin commented Mar 27, 2022

    Based on the changelog, it seems that the fix for this issue has only landed in Python 3.10 but it's still affecting versions 3.7, 3.8 and 3.9. There are a few projects such as Apache Beam that have only recently added support for Python 3.9 with no support for 3.10. Beautiful Soup is a popular tool that's affected by this bug and so my question is if backporting this bugfix to at least Python 3.9 which is still an officially supported release is on the table?

    @ambv
    Copy link
    Contributor

    ambv commented Apr 2, 2022

    Marek, I can merge a fix for 3.9 for this. I don't think we should be removing _markupbase.ParserBase.error() in 3.9 as #52808 is doing. So we'd need a new patch, like #61843. However, that will need a test as well.

    Would you be interested in creating the patch?

    @mareksuscak
    Copy link
    Mannequin

    mareksuscak mannequin commented Apr 2, 2022

    Alright, the PR is up. This is my first contribution to Python and I had to sign the Contributor Agreement so it may take 1 business day before you're able to accept it. In the meantime, can you please review and let me know if anything needs to change? I wasn't sure where / how to add an entry in the NEWS.d folder. Can you please help me? Thanks!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @zhuofeng6
    Copy link

    any question about this issue? if not, i think it is better to close it

    @ezio-melotti
    Copy link
    Member

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    Status: Done
    Development

    No branches or pull requests

    5 participants