Issue 25258: HtmlParser doesn't handle void element tags correctly

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/69445

classification

Title:	HtmlParser doesn't handle void element tags correctly
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 3.6, Python 3.4, Python 3.5, Python 2.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Chenyun Yang, ezio.melotti, josh.r, karlcow, martin.panter, r.david.murray, xiang.zhang
Priority:	normal	Keywords:

Created on 2015-09-28 19:26 by Chenyun Yang, last changed 2022-04-11 14:58 by admin.

Messages (15)
msg251792 - (view)	Author: Chenyun Yang (Chenyun Yang)	Date: 2015-09-28 19:26
For void elements such as <link> and <img>, an XHTML-style empty end tag is not required. HTMLParser, which relies on the XHTML empty-tag syntax, fails to handle this case.

    from HTMLParser import HTMLParser

    # create a subclass and override the handler methods
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print "Encountered a start tag:", tag
        def handle_endtag(self, tag):
            print "Encountered an end tag :", tag
        def handle_data(self, data):
            print "Encountered some data :", data

    >>> parser = MyHTMLParser()
    >>> parser.feed('<link rel="import"><img src="som">')
    Encountered a start tag: link
    Encountered a start tag: img
    >>> parser.feed('<link rel="import"/><img src="som"/>')
    Encountered a start tag: link
    Encountered an end tag : link
    Encountered a start tag: img
    Encountered an end tag : img

Reference:
https://github.com/python/cpython/blob/bdfb14c688b873567d179881fc5bb67363a6074c/Lib/html/parser.py
http://www.w3.org/TR/html5/syntax.html#void-elements
msg251813 - (view)	Author: Josh Rosenberg (josh.r) *	Date: 2015-09-29 02:44
The example for "Parsing an element with a few attributes and a title:" in https://docs.python.org/2/library/htmlparser.html#examples demonstrates this as expected behavior, so I'm not sure it can be changed:

    >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
    Start tag: img
         attr: ('src', 'python-logo.png')
         attr: ('alt', 'The Python logo')
    >>>
    >>> parser.feed('<h1>Python</h1>')
    Start tag: h1
    Data     : Python
    End tag  : h1
msg251821 - (view)	Author: Xiang Zhang (xiang.zhang) *	Date: 2015-09-29 05:22
From the specification, a void element has no end tag, so I don't think this behaviour can be called incorrect. For a void element, only handle_starttag is called. For a start tag that ends with '/>', HTMLParser actually calls handle_startendtag, which invokes handle_starttag and handle_endtag. I think there are two solutions: filter void elements in the library and then invoke handle_startendtag, or filter void elements in the application in handle_starttag and then invoke handle_endtag.
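The second, application-side option could look roughly like the sketch below. It is written for Python 3 and is only an illustration: VoidAwareParser, VOID_ELEMENTS and the events list are hypothetical names, and the tag list is the one from the W3C HTML5 void-elements section cited in the report.

```python
from html.parser import HTMLParser

# Void elements as listed in the W3C HTML5 section cited in the report.
VOID_ELEMENTS = frozenset([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
    'keygen', 'link', 'meta', 'param', 'source', 'track', 'wbr',
])


class VoidAwareParser(HTMLParser):
    """Hypothetical subclass that reports an end event for every void element."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))
        if tag in VOID_ELEMENTS:
            # Synthesize the end event that a bare <img> never produces.
            self.events.append(('end', tag))

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            # Already reported from handle_starttag; this also avoids a
            # duplicate when <img/> goes through handle_startendtag.
            return
        self.events.append(('end', tag))


parser = VoidAwareParser()
parser.feed('<link rel="import"><img src="som"/>')
parser.close()
print(parser.events)
```

Because the inherited handle_startendtag calls handle_starttag and then handle_endtag, suppressing void tags in handle_endtag keeps <img> and <img/> from producing different event sequences.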
msg251883 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-09-29 20:27
Also applies to Python 3, though I’m not sure I would consider it a bug.
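A minimal Python 3 sketch of the original report, assuming only the standard-library html.parser module. EventRecorder and tag_events are hypothetical helper names introduced here for illustration.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper that records start and end tag events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


def tag_events(markup):
    """Parse markup with a fresh parser and return the recorded tag events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


# HTML-style void tags: only start events are reported.
print(tag_events('<link rel="import"><img src="som">'))
# XHTML-style empty tags: each start event is followed by an end event.
print(tag_events('<link rel="import"/><img src="som"/>'))
```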
msg251891 - (view)	Author: Chenyun Yang (Chenyun Yang)	Date: 2015-09-29 21:35
I think the bug is mostly about inconsistent behavior: <img> and <img/> shouldn't be parsed differently. As it stands, client code cannot know consistently whether the parser has finished visiting the <img> tag. I propose one fix: in the `parse_internal' method call, check for void elements and call `handle_startendtag'.
msg251987 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-10-01 02:05
My thinking is that the knowledge that <img> does not have a closing tag is at a higher level than the current HTMLParser class. It is similar to knowing where the following HTML implicitly closes the <li> elements:

    <ul><li>Item A<li>Item B</ul>

In both cases I would not expect the HTMLParser to report “virtual” empty or closing tags. I don’t think it should report an empty <img/> or closing </img> tag just because that is easy to do, because it would be inconsistent with other implied HTML tags. But maybe see what other people say.

I don’t know your particular use case, but I would suggest if you need to parse non-XML HTML <img> tags, use the handle_starttag() method and don’t rely on the end tag :)
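To make the comparison concrete, here is a small Python 3 sketch that records only the tag events for the list markup above. ListEventRecorder is a hypothetical helper, not part of the library; text content is deliberately ignored.

```python
from html.parser import HTMLParser


class ListEventRecorder(HTMLParser):
    """Hypothetical helper that records tag events and ignores text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


recorder = ListEventRecorder()
recorder.feed('<ul><li>Item A<li>Item B</ul>')
recorder.close()
# Only tags that are literally present in the markup are reported;
# no ('end', 'li') event is synthesized for the implied </li>.
print(recorder.events)
```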
msg252150 - (view)	Author: Chenyun Yang (Chenyun Yang)	Date: 2015-10-02 19:18
The example you give for <li> is a different case. <img> and <link> are void elements, which are allowed to have no closing tag; <li> without </li> is a browser implementation detail: most browsers autocomplete </li>.

Unless the parser calls handle_endtag(), client code that uses HTMLParser cannot know whether a traversal is finished.

Do you have a strong reason why we should not include the knowledge of void elements in HTMLParser at this line?

https://github.com/python/cpython/blob/bdfb14c688b873567d179881fc5bb67363a6074c/Lib/html/parser.py#L341

    if end.endswith('/>') or (end.endswith('>') and tag in VOID_ELEMENTS)
msg252152 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2015-10-02 19:42
Note that HTMLParser tries to follow the HTML5 specs, and for this case they say [0]:

    "Set the self-closing flag of the current tag token. Switch to the data state. Emit the current tag token."

So it seems that for <img />, only the <img> (and not the closing </img>) should be emitted. HTMLParser has no way to set the self-closing flag, so calling handle_startendtag seems the most reasonable thing to do, since it allows tree-builders to set the flag themselves. That said, the default implementation of handle_startendtag should indeed just call handle_starttag; however, this would be a backward-incompatible change.

[0]: http://www.w3.org/TR/html5/syntax.html#self-closing-start-tag-state
msg252154 - (view)	Author: Chenyun Yang (Chenyun Yang)	Date: 2015-10-02 20:16
I am fine with either handle_startendtag or handle_starttag. The issue is that the behavior is consistent for the two equally valid syntaxes (<img> and <img/> are handled differently); this inconsistency cannot be fixed from the inherited class, as the handle_* calls are dispatched in the internal method of HTMLParser.
msg252156 - (view)	Author: Chenyun Yang (Chenyun Yang)	Date: 2015-10-02 20:21
Correction to the previous comment: consistent -> not consistent
msg252168 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2015-10-02 22:13
> this inconsistent cannot be fixed from the inherited class as (handle_*
> calls are dispatched in the internal method of HTMLParser)

You can override handle_startendtag() like this:

    >>> class MyHTMLParser(HTMLParser):
    ...     def handle_starttag(self, tag, attrs):
    ...         print('start', tag)
    ...     def handle_endtag(self, tag):
    ...         print('end', tag)
    ...     def handle_startendtag(self, tag, attrs):
    ...         self.handle_starttag(tag, attrs)
    ...
    >>> parser = MyHTMLParser()
    >>> parser.feed('<link rel="import"/><img src="som"/>')
    start link
    start img

(P.S. please don't quote the whole message in your reply)
msg252214 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-10-03 14:47
I suspect that calling startendtag is also backward incompatible, in that there may be parsers out there that are depending on starttag getting called for <img>, and endtag not getting called (that is, endtag getting called for it will cause them to break). I would hope that this would not be the case, but I'm worried about it.
msg252234 - (view)	Author: Chenyun Yang (Chenyun Yang)	Date: 2015-10-03 19:43
handle_startendtag is also called for non-void elements, such as <a/>, so the override example will break in those situations.

The compatible patch I propose right now is just a one-line check:

    # http://www.w3.org/TR/html5/syntax.html#void-elements
    _VOID_ELEMENT_TAGS = frozenset([
        'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
        'keygen', 'link', 'meta', 'param', 'source', 'track', 'wbr'])

    # In class HTMLParser.HTMLParser:
        # Internal -- handle starttag, return end or -1 if not terminated
        def parse_starttag(self, i):
            #...
            if end.endswith('/>'):
                # XHTML-style empty tag: <span attr="value" />
                self.handle_startendtag(tag, attrs)
            ############# PATCH #################
            elif end.endswith('>') and tag in _VOID_ELEMENT_TAGS:
                self.handle_startendtag(tag, attrs)
            ############# PATCH #################
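The objection to the override from msg252168 can be sketched in Python 3 as follows. StartOnlyParser mirrors that override and DefaultParser keeps the inherited behaviour; both class names, the shared _Recorder base and the events list are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class _Recorder(HTMLParser):
    """Hypothetical base class that records tag events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


class DefaultParser(_Recorder):
    """Keeps the inherited handle_startendtag (start followed by end)."""


class StartOnlyParser(_Recorder):
    """Mirrors the override from msg252168: '/>' yields only a start event."""

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)


for cls in (DefaultParser, StartOnlyParser):
    parser = cls()
    parser.feed('<a/><img src="som"/>')
    parser.close()
    print(cls.__name__, parser.events)
```

With the override, the non-void <a/> also loses its end event, which is the case the message above is concerned about.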
msg384362 - (view)	Author: karl (karlcow) *	Date: 2021-01-05 01:29
The parsing rules for tokenization of HTML are at
https://html.spec.whatwg.org/multipage/parsing.html#tokenization

In the stack of open elements, there are specific rules for certain elements.
https://html.spec.whatwg.org/multipage/parsing.html#special

From a DOM point of view, there is indeed no difference between

    <img src="somewhere"><img src="somewhere"/>

https://software.hixie.ch/utilities/js/live-dom-viewer/?%3C!DOCTYPE%20html%3E%0A%3Cimg%20src%3D%22somewhere%22%3E%3Cimg%20src%3D%22somewhere%22%2F%3E
msg384363 - (view)	Author: karl (karlcow) *	Date: 2021-01-05 01:34
I wonder if the confusion comes from the name. The HTMLParser is kind of a tokenizer more than a full HTML parser, but that's probably a detail. It doesn't create a DOM Tree which you can access, but could help you to build a DOM Tree (!= DOM Document object).

https://html.spec.whatwg.org/multipage/parsing.html#overview-of-the-parsing-model

> Implementations that do not support scripting do not have to actually create a DOM Document object, but the DOM tree in such cases is still used as the model for the rest of the specification.
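A rough Python 3 sketch of that division of labour, assuming a hypothetical tree builder layered on top of HTMLParser: the tokenizer events are used as they are, and the knowledge of void elements lives in the tree-building layer. Node, TreeBuilder and VOID_ELEMENTS are illustrative names; the tag list is the one from the HTML5 void-elements section cited earlier in this thread. Text nodes and implied end tags are ignored for brevity.

```python
from html.parser import HTMLParser

# Void elements as listed in the HTML5 section cited earlier in the thread.
VOID_ELEMENTS = frozenset([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
    'keygen', 'link', 'meta', 'param', 'source', 'track', 'wbr',
])


class Node:
    """Hypothetical tree node holding a tag name, attributes and children."""

    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = dict(attrs)
        self.children = []


class TreeBuilder(HTMLParser):
    """Hypothetical tree builder; void elements are never left open."""

    def __init__(self):
        super().__init__()
        self.root = Node('#root')
        self.stack = [self.root]  # stack of open elements

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_ELEMENTS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Treat <img/> exactly like <img>; the trailing slash adds nothing.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return  # void elements were never pushed on the stack
        # Close the nearest open element with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


builder = TreeBuilder()
builder.feed('<p><img src="a"><img src="b"/></p><hr>')
builder.close()
print([child.tag for child in builder.root.children])
```

In this sketch both spellings of <img> end up as childless nodes under <p>, so the tree is the same regardless of which syntax the markup used.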

History
Date	User	Action	Args
2022-04-11 14:58:21	admin	set	github: 69445
2021-01-05 01:34:11	karlcow	set	messages: + msg384363
2021-01-05 01:29:11	karlcow	set	nosy: + karlcow messages: + msg384362
2015-10-03 19:43:29	Chenyun Yang	set	messages: + msg252234
2015-10-03 14:47:47	r.david.murray	set	nosy: + r.david.murray messages: + msg252214
2015-10-02 22:13:55	ezio.melotti	set	messages: + msg252168
2015-10-02 20:21:29	Chenyun Yang	set	messages: + msg252156
2015-10-02 20:16:02	Chenyun Yang	set	messages: + msg252154
2015-10-02 19:42:35	ezio.melotti	set	type: behavior messages: + msg252152
2015-10-02 19:18:01	Chenyun Yang	set	messages: + msg252150
2015-10-01 02:05:15	martin.panter	set	messages: + msg251987
2015-09-29 21:35:43	Chenyun Yang	set	messages: + msg251891
2015-09-29 20:27:04	martin.panter	set	nosy: + martin.panter messages: + msg251883 versions: + Python 3.4, Python 3.5, Python 3.6
2015-09-29 05:22:58	xiang.zhang	set	nosy: + xiang.zhang messages: + msg251821
2015-09-29 02:44:39	josh.r	set	nosy: + josh.r messages: + msg251813
2015-09-28 19:29:41	r.david.murray	set	nosy: + ezio.melotti
2015-09-28 19:26:34	Chenyun Yang	create