This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: HtmlParser doesn't handle void element tags correctly
Type: behavior
Stage:
Components: Library (Lib)
Versions: Python 3.6, Python 3.4, Python 3.5, Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Chenyun Yang, ezio.melotti, josh.r, karlcow, martin.panter, r.david.murray, xiang.zhang
Priority: normal
Keywords:

Created on 2015-09-28 19:26 by Chenyun Yang, last changed 2022-04-11 14:58 by admin.

Messages (15)
msg251792 - (view) Author: Chenyun Yang (Chenyun Yang) Date: 2015-09-28 19:26
Void elements such as <link> and <img> do not need the XHTML-style empty-element syntax (a trailing "/>"). HTMLParser, which relies on that syntax to detect an empty element, fails to handle this situation:

from HTMLParser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag
    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag
    def handle_data(self, data):
        print "Encountered some data  :", data

>>> parser = MyHTMLParser()
>>> parser.feed('<link rel="import"><img src="som">')
Encountered a start tag: link
Encountered a start tag: img
>>> parser.feed('<link rel="import"/><img src="som"/>')
Encountered a start tag: link
Encountered an end tag : link
Encountered a start tag: img
Encountered an end tag : img


Reference:
https://github.com/python/cpython/blob/bdfb14c688b873567d179881fc5bb67363a6074c/Lib/html/parser.py
http://www.w3.org/TR/html5/syntax.html#void-elements
msg251813 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2015-09-29 02:44
The example for "Parsing an element with a few attributes and a title:" in https://docs.python.org/2/library/htmlparser.html#examples demonstrates this as expected behavior, so I'm not sure it can be changed:

    >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
    Start tag: img
         attr: ('src', 'python-logo.png')
         attr: ('alt', 'The Python logo')
    >>>
    >>> parser.feed('<h1>Python</h1>')
    Start tag: h1
    Data     : Python
    End tag  : h1
msg251821 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2015-09-29 05:22
According to the specification, a void element has no end tag, so I don't
think this behaviour can be called incorrect. For a void element, only
handle_starttag is called.

For a start tag that ends with '/>', HTMLParser actually calls
handle_startendtag, which invokes handle_starttag and then
handle_endtag.

I think there are two solutions: filter void elements in the library
and invoke handle_startendtag for them, or filter void elements in the
application's handle_starttag and invoke handle_endtag from there.
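
A minimal sketch of the second (application-level) option, written for Python 3. The class name VoidAwareParser, the helper parse_events and the events list are hypothetical names introduced for illustration; the void element list is the one quoted later in this issue. Because the default handle_startendtag calls both handlers, it is overridden here so that void elements are not closed twice.

```python
from html.parser import HTMLParser

# Void element names as listed in the patch proposed later in this issue.
VOID_ELEMENTS = frozenset([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
    'keygen', 'link', 'meta', 'param', 'source', 'track', 'wbr'])


class VoidAwareParser(HTMLParser):  # hypothetical name, not part of the stdlib
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))
        if tag in VOID_ELEMENTS:
            # Application-level choice: report void elements as closed at once.
            self.handle_endtag(tag)

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)
        if tag not in VOID_ELEMENTS:
            # For void elements handle_starttag already reported the end.
            self.handle_endtag(tag)

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


def parse_events(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = VoidAwareParser()
    parser.feed(markup)
    parser.close()
    return parser.events


# Both spellings of a void element now yield the same event sequence.
print(parse_events('<img src="som">'))
print(parse_events('<img src="som"/>'))
```

With this subclass the parser itself is unchanged; only the application decides that void elements are reported as start/end pairs.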
msg251883 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-09-29 20:27
Also applies to Python 3, though I’m not sure I would consider it a bug.
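
For reference, a Python 3 version of the original reproduction might look like the sketch below. EventLogger and log_events are hypothetical helper names; only html.parser.HTMLParser comes from the standard library.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):  # hypothetical helper for this reproduction
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append('start ' + tag)

    def handle_endtag(self, tag):
        self.events.append('end ' + tag)


def log_events(markup):
    """Return the handler calls triggered by the given markup."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


# No end events are reported for the plain HTML spelling...
print(log_events('<link rel="import"><img src="som">'))
# ...but they are reported for the XHTML-style spelling.
print(log_events('<link rel="import"/><img src="som"/>'))
```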
msg251891 - (view) Author: Chenyun Yang (Chenyun Yang) Date: 2015-09-29 21:35
I think the bug is mostly about inconsistent behavior: <img> and <img/>
shouldn't be parsed differently.

This causes a problem because client code cannot tell
consistently whether the parser has finished visiting an <img> tag.

I propose a fix: in the internal `parse_starttag` method, check
for void elements and call `handle_startendtag` for them.

msg251987 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-01 02:05
My thinking is that the knowledge that <img> does not have a closing tag is at a higher level than the current HTMLParser class. It is similar to knowing where the following HTML implicitly closes the <li> elements:

<ul><li>Item A<li>Item B</ul>

In both cases I would not expect the HTMLParser to report “virtual” empty or closing tags. I don’t think it should report an empty <img/> or closing </img> tag just because that is easy to do, because it would be inconsistent with other implied HTML tags. But maybe see what other people say.

I don’t know your particular use case, but I would suggest if you need to parse non-XML HTML <img> tags, use the handle_starttag() method and don’t rely on the end tag :)
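
The <li> case above can be checked with a short sketch (Python 3). TagLogger is a hypothetical name; the point is only that no "virtual" </li> events appear in the recorded list.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):  # hypothetical helper name
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))


logger = TagLogger()
logger.feed('<ul><li>Item A<li>Item B</ul>')
logger.close()

# Only tags that literally appear in the input are reported.
for event in logger.events:
    print(event)
```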
msg252150 - (view) Author: Chenyun Yang (Chenyun Yang) Date: 2015-10-02 19:18
The example you give for <li> is a different case.

<img> and <link> are void elements, which are allowed to have no closing tag;
<li> without </li> is a browser implementation detail: most browsers
auto-complete </li>.

Unless the parser calls handle_endtag(), client code that uses
HTMLParser cannot know whether the traversal of an element has finished.

Do you have a strong reason why we should not include knowledge of void
elements in HTMLParser at this line?

https://github.com/python/cpython/blob/bdfb14c688b873567d179881fc5bb67363a6074c/Lib/html/parser.py#L341

if end.endswith('/>') or (end.endswith('>') and tag in VOID_ELEMENTS):

msg252152 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2015-10-02 19:42
Note that HTMLParser tries to follow the HTML5 specs, and for this case they say [0]:
"Set the self-closing flag of the current tag token. Switch to the data state. Emit the current tag token."

So it seems that for <img />, only the <img> (and not the closing </img>) should be emitted.  HTMLParser has no way to set the self-closing flag, so calling handle_startendtag seems the most reasonable thing to do, since it allows tree-builders to set the flag themselves.  That said, the default implementation of handle_startendtag should indeed just call handle_starttag; however, this would be a backward-incompatible change.

[0]: http://www.w3.org/TR/html5/syntax.html#self-closing-start-tag-state
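
As an illustration of how a tree-builder could set the flag itself, here is a minimal Python 3 sketch. TokenCollector and the 'self_closing' key are hypothetical; HTMLParser does not define such a flag.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):  # hypothetical name
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Plain start tag: the self-closing flag stays unset.
        self.tokens.append(
            {'tag': tag, 'attrs': dict(attrs), 'self_closing': False})

    def handle_startendtag(self, tag, attrs):
        # Tag written with a trailing "/>": record one token with the flag
        # set instead of emitting a separate end event.
        self.tokens.append(
            {'tag': tag, 'attrs': dict(attrs), 'self_closing': True})


collector = TokenCollector()
collector.feed('<img src="a"><img src="b"/>')
collector.close()

for token in collector.tokens:
    print(token['tag'], token['self_closing'])
```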
msg252154 - (view) Author: Chenyun Yang (Chenyun Yang) Date: 2015-10-02 20:16
I am fine with either handle_startendtag or handle_starttag.

The issue is that the behavior is consistent for the two equally valid
syntaxes (<img> and <img/> are handled differently); this inconsistency cannot
be fixed in a subclass, because the handle_* calls are dispatched in an
internal method of HTMLParser.

msg252156 - (view) Author: Chenyun Yang (Chenyun Yang) Date: 2015-10-02 20:21
Correction to my previous comment: consistent -> not consistent

msg252168 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2015-10-02 22:13
> this inconsistency cannot be fixed in a subclass, because the handle_*
> calls are dispatched in an internal method of HTMLParser

You can override handle_startendtag() like this:

>>> class MyHTMLParser(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         print('start', tag)
...     def handle_endtag(self, tag):
...         print('end', tag)
...     def handle_startendtag(self, tag, attrs):
...         self.handle_starttag(tag, attrs)
... 
>>> parser = MyHTMLParser()
>>> parser.feed('<link rel="import"/><img src="som"/>')
start link
start img


(P.S. please don't quote the whole message in your reply)
msg252214 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-10-03 14:47
I suspect that calling startendtag is also backward incompatible, in that there may be parsers out there that are depending on starttag getting called for <img>, and endtag not getting called (that is, endtag getting called for it will cause them to break).  I would hope that this would not be the case, but I'm worried about it.
msg252234 - (view) Author: Chenyun Yang (Chenyun Yang) Date: 2015-10-03 19:43
handle_startendtag is also called for non-void elements, such as <a/>, so
the override example will break in those situations.

The backward-compatible patch I am proposing is just a one-line check:

# http://www.w3.org/TR/html5/syntax.html#void-elements
_VOID_ELEMENT_TAGS = frozenset([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen',
    'link', 'meta', 'param', 'source', 'track', 'wbr'])

class HTMLParser.HTMLParser:
  # Internal -- handle starttag, return end or -1 if not terminated
  def parse_starttag(self, i):
    #...
    if end.endswith('/>'):
      # XHTML-style empty tag: <span attr="value" />
      self.handle_startendtag(tag, attrs)
    #############    PATCH    #################
    elif end.endswith('>') and tag in _VOID_ELEMENT_TAGS:
      self.handle_startendtag(tag, attrs)
    #############    PATCH    #################
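
The concern about non-void elements such as <a/> can be checked with a small Python 3 sketch. DefaultLogger, StartOnlyLogger and events_for are hypothetical names; StartOnlyLogger applies the handle_startendtag override suggested earlier in this thread.

```python
from html.parser import HTMLParser


class DefaultLogger(HTMLParser):  # hypothetical: keeps the default dispatch
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append('start ' + tag)

    def handle_endtag(self, tag):
        self.events.append('end ' + tag)


class StartOnlyLogger(DefaultLogger):  # hypothetical: applies the override
    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)


def events_for(parser_class, markup):
    """Return the events recorded by parser_class for the given markup."""
    parser = parser_class()
    parser.feed(markup)
    parser.close()
    return parser.events


# Default dispatch reports a start and an end for "<a/>".
print(events_for(DefaultLogger, '<a/>'))
# With the override, "<a/>" is reported as a start tag with no end event.
print(events_for(StartOnlyLogger, '<a/>'))
```

Whether the missing end event for <a/> counts as breakage depends on what the client code expects; the sketch only shows that the override changes the events for every "/>" tag, not just for void elements.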
msg384362 - (view) Author: karl (karlcow) * Date: 2021-01-05 01:29
The parsing rules for tokenization of HTML are at 
https://html.spec.whatwg.org/multipage/parsing.html#tokenization

In the stack of open elements, there are specific rules for certain elements. 
https://html.spec.whatwg.org/multipage/parsing.html#special

From a DOM point of view, there is indeed no difference between 
<img src="somewhere"><img src="somewhere"/>

https://software.hixie.ch/utilities/js/live-dom-viewer/?%3C!DOCTYPE%20html%3E%0A%3Cimg%20src%3D%22somewhere%22%3E%3Cimg%20src%3D%22somewhere%22%2F%3E
msg384363 - (view) Author: karl (karlcow) * Date: 2021-01-05 01:34
I wonder if the confusion comes from the name. The HTMLParser is kind of a tokenizer more than a full HTML parser, but that's probably a detail. It doesn't create a DOM Tree which you can access, but could help you to build a DOM Tree (!= DOM Document object)

https://html.spec.whatwg.org/multipage/parsing.html#overview-of-the-parsing-model

> Implementations that do not support scripting do not have to actually create a DOM Document object, but the DOM tree in such cases is still used as the model for the rest of the specification.
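
To illustrate the "tokenizer that helps you build a tree" view, here is a rough Python 3 sketch of a tree-builder layered on top of HTMLParser. TreeBuilder, build_tree and the dict-based node layout are hypothetical, text nodes are ignored, and treating every "/>" tag like a plain start tag is a simplifying assumption rather than a full implementation of the HTML tree construction rules.

```python
from html.parser import HTMLParser

# Void element names as listed in the patch proposed in this issue.
VOID_ELEMENTS = frozenset([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
    'keygen', 'link', 'meta', 'param', 'source', 'track', 'wbr'])


class TreeBuilder(HTMLParser):  # hypothetical name
    def __init__(self):
        super().__init__()
        self.root = {'tag': '#root', 'attrs': {}, 'children': []}
        self.stack = [self.root]  # stack of open elements

    def handle_starttag(self, tag, attrs):
        node = {'tag': tag, 'attrs': dict(attrs), 'children': []}
        self.stack[-1]['children'].append(node)
        if tag not in VOID_ELEMENTS:
            # Only non-void elements stay open and can receive children.
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Assumption for this sketch: the trailing slash is ignored.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index]['tag'] == tag:
                del self.stack[index:]
                break


def build_tree(markup):
    """Parse markup and return the root node of the resulting tree."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


# Both spellings of <img> produce the same tree.
print(build_tree('<img src="somewhere">') == build_tree('<img src="somewhere"/>'))
```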
History
Date                 User            Action  Args
2022-04-11 14:58:21  admin           set     github: 69445
2021-01-05 01:34:11  karlcow         set     messages: + msg384363
2021-01-05 01:29:11  karlcow         set     nosy: + karlcow; messages: + msg384362
2015-10-03 19:43:29  Chenyun Yang    set     messages: + msg252234
2015-10-03 14:47:47  r.david.murray  set     nosy: + r.david.murray; messages: + msg252214
2015-10-02 22:13:55  ezio.melotti    set     messages: + msg252168
2015-10-02 20:21:29  Chenyun Yang    set     messages: + msg252156
2015-10-02 20:16:02  Chenyun Yang    set     messages: + msg252154
2015-10-02 19:42:35  ezio.melotti    set     type: behavior; messages: + msg252152
2015-10-02 19:18:01  Chenyun Yang    set     messages: + msg252150
2015-10-01 02:05:15  martin.panter   set     messages: + msg251987
2015-09-29 21:35:43  Chenyun Yang    set     messages: + msg251891
2015-09-29 20:27:04  martin.panter   set     nosy: + martin.panter; messages: + msg251883; versions: + Python 3.4, Python 3.5, Python 3.6
2015-09-29 05:22:58  xiang.zhang     set     nosy: + xiang.zhang; messages: + msg251821
2015-09-29 02:44:39  josh.r          set     nosy: + josh.r; messages: + msg251813
2015-09-28 19:29:41  r.david.murray  set     nosy: + ezio.melotti
2015-09-28 19:26:34  Chenyun Yang    create