Message 346210 - Python tracker


Author	cheryl.sabella
Recipients	cheryl.sabella, ezio.melotti, htran, terry.reedy
Date	2019-06-21.13:14:57
Message-id	<1561122898.03.0.0623370000836.issue37071@roundup.psfhosted.org>

Content
Thank you for the report.

Looking at the BeautifulSoup source, there is a comment about this scenario:
            # Unlike other parsers, html.parser doesn't send separate end tag
            # events for empty-element tags. (It's handled in
            # handle_startendtag, but only if the original markup looked like
            # <tag/>.)
            #
            # So we need to call handle_endtag() ourselves. Since we
            # know the start event is identical to the end event, we
            # don't want handle_endtag() to cross off any previous end
            # events for tags of this name.


HTMLParser itself produces output such as:
>>> from html.parser import HTMLParser
>>> class MyParser(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         print(f'start: {tag}')
...     def handle_endtag(self, tag):
...         print(f'end: {tag}')
...     def handle_data(self, data):
...         print(f'data: {data}')
...
>>> parser = MyParser()
>>> parser.feed('<p><test></p>')
start: p
start: test
end: p
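
To make the contrast described in the BeautifulSoup comment concrete, here is a minimal sketch that feeds both spellings of the tag to html.parser and records the events instead of printing them. EventCollector and collect_events are hypothetical names used only for this illustration; the sketch relies on the default handle_startendtag, which calls handle_starttag followed by handle_endtag.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records start/end events in a list."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def collect_events(markup):
    """Parse markup with a fresh parser and return the recorded events."""
    parser = EventCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


# <test> without a trailing slash: no end event is generated for it.
unclosed = collect_events("<p><test></p>")

# <test/> goes through handle_startendtag, so both events are recorded.
self_closed = collect_events("<p><test/></p>")

print(unclosed)
print(self_closed)
```

Only the <test/> spelling yields an end event for test, which is why BeautifulSoup has to synthesize one itself when it knows the tag is an empty element.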

My suggestion would be to try a different parser in BeautifulSoup [1] to handle this. Even if we wanted to modify HTMLParser, any such change would probably be backward incompatible.

[1] https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
