Issue25258
This issue tracker has been migrated to GitHub, and is currently read-only. For more information, see the GitHub FAQs in Python's Developer Guide.
Created on 2015-09-28 19:26 by Chenyun Yang, last changed 2022-04-11 14:58 by admin.
Messages (15) | |||
---|---|---|---|
msg251792 - (view) | Author: Chenyun Yang (Chenyun Yang) | Date: 2015-09-28 19:26 | |
Void elements such as <link> and <img> do not need the XHTML-style empty-tag syntax (a trailing "/>"). HTMLParser, which relies on that syntax to detect an empty element, does not handle this case:

    from HTMLParser import HTMLParser

    # create a subclass and override the handler methods
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print "Encountered a start tag:", tag
        def handle_endtag(self, tag):
            print "Encountered an end tag :", tag
        def handle_data(self, data):
            print "Encountered some data :", data

    >>> parser = MyHTMLParser()
    >>> parser.feed('<link rel="import"><img src="som">')
    Encountered a start tag: link
    Encountered a start tag: img
    >>> parser.feed('<link rel="import"/><img src="som"/>')
    Encountered a start tag: link
    Encountered an end tag : link
    Encountered a start tag: img
    Encountered an end tag : img

Reference:
https://github.com/python/cpython/blob/bdfb14c688b873567d179881fc5bb67363a6074c/Lib/html/parser.py
http://www.w3.org/TR/html5/syntax.html#void-elements |
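For readers on Python 3, the report above can be reproduced with a small sketch along the following lines. The names `EventRecorder` and `record` are illustrative only and do not appear in the thread; the handlers collect events in a list instead of printing them.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record handler calls as (event, tag) tuples instead of printing them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def record(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events
```

Calling `record('<link rel="import"><img src="som">')` yields only start events, while the `/>` form of the same markup yields a start and an end event for each tag, which is the difference this issue is about.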
|||
msg251813 - (view) | Author: Josh Rosenberg (josh.r) * | Date: 2015-09-29 02:44 | |
The example "Parsing an element with a few attributes and a title" in https://docs.python.org/2/library/htmlparser.html#examples demonstrates this as expected behavior, so I'm not sure it can be changed:

    >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
    Start tag: img
    attr: ('src', 'python-logo.png')
    attr: ('alt', 'The Python logo')
    >>>
    >>> parser.feed('<h1>Python</h1>')
    Start tag: h1
    Data : Python
    End tag : h1 |
|||
msg251821 - (view) | Author: Xiang Zhang (xiang.zhang) * | Date: 2015-09-29 05:22 | |
According to the specification, a void element has no end tag, so I don't think this behaviour can be called incorrect. For a void element, only handle_starttag is called. For a start tag that ends with '/>', HTMLParser actually calls handle_startendtag, which invokes handle_starttag and handle_endtag. I think there are two solutions: filter void elements in the library and invoke handle_startendtag for them, or filter void elements in the application's handle_starttag and invoke handle_endtag from there. |
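The second option (filtering in the application) might look roughly like the sketch below. It is an assumption-laden illustration, not code from the thread: `VoidAwareParser`, `events_for`, and the `VOID_ELEMENTS` name are hypothetical, and the tag list is the HTML5 void-element list as cited in this issue, which newer revisions of the specification may have changed.

```python
from html.parser import HTMLParser

# Assumption: void-element list as cited in this issue (HTML5 era).
VOID_ELEMENTS = frozenset([
    "area", "base", "br", "col", "embed", "hr", "img", "input", "keygen",
    "link", "meta", "param", "source", "track", "wbr",
])


class VoidAwareParser(HTMLParser):
    """Report start+end for void elements regardless of the syntax used."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
        if tag in VOID_ELEMENTS:
            # A void element is complete as soon as its start tag is seen.
            self.handle_endtag(tag)

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        # handle_starttag already closes void elements, so only close
        # non-void tags here to avoid reporting the end tag twice.
        self.handle_starttag(tag, attrs)
        if tag not in VOID_ELEMENTS:
            self.handle_endtag(tag)


def events_for(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = VoidAwareParser()
    parser.feed(markup)
    parser.close()
    return parser.events
```

With this sketch `<img src="a">` and `<img src="a"/>` produce the same two events. A stray `</img>` in the input would still be reported as an extra end tag, so a real application may want to ignore end tags for void elements.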
|||
msg251883 - (view) | Author: Martin Panter (martin.panter) * | Date: 2015-09-29 20:27 | |
Also applies to Python 3, though I’m not sure I would consider it a bug. |
|||
msg251891 - (view) | Author: Chenyun Yang (Chenyun Yang) | Date: 2015-09-29 21:35 | |
I think the bug is mostly about inconsistent behavior: <img> and <img/> shouldn't be parsed differently. As it stands, the parser cannot know consistently whether it has finished visiting an <img> tag. I propose one fix: in the `parse_internal' method, check for void elements and call `handle_startendtag'. |
|||
msg251987 - (view) | Author: Martin Panter (martin.panter) * | Date: 2015-10-01 02:05 | |
My thinking is that the knowledge that <img> does not have a closing tag is at a higher level than the current HTMLParser class. It is similar to knowing where the following HTML implicitly closes the <li> elements:

    <ul><li>Item A<li>Item B</ul>

In both cases I would not expect the HTMLParser to report “virtual” empty or closing tags. I don’t think it should report an empty <img/> or closing </img> tag just because that is easy to do, because it would be inconsistent with other implied HTML tags. But maybe see what other people say.

I don’t know your particular use case, but I would suggest if you need to parse non-XML HTML <img> tags, use the handle_starttag() method and don’t rely on the end tag :) |
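The <li> comparison can be checked with a short sketch such as the one below. `TagLogger` and `tokenize` are illustrative names, not part of the thread; the sketch only records what the standard-library parser reports.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Log start tags, end tags and text exactly as HTMLParser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def tokenize(markup):
    """Feed markup to a fresh parser and return the logged events."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events
```

For `<ul><li>Item A<li>Item B</ul>` the log contains two `li` start events and no `li` end event: the implied closing tags are not synthesized, just as no end tag is synthesized for `<img>`.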
|||
msg252150 - (view) | Author: Chenyun Yang (Chenyun Yang) | Date: 2015-10-02 19:18 | |
The example you give for <li> is a different case. <img> and <link> are void elements, which are allowed to have no closing tag; <li> without </li> is a browser implementation detail: most browsers autocomplete </li>. Unless the parser calls handle_endtag(), client code that uses HTMLParser cannot know whether the traversal of an element has finished. Do you have a strong reason why we should not include the knowledge of void elements in HTMLParser at this line? https://github.com/python/cpython/blob/bdfb14c688b873567d179881fc5bb67363a6074c/Lib/html/parser.py#L341

    if end.endswith('/>') or (end.endswith('>') and tag in VOID_ELEMENTS) |
|||
msg252152 - (view) | Author: Ezio Melotti (ezio.melotti) * | Date: 2015-10-02 19:42 | |
Note that HTMLParser tries to follow the HTML5 specs, and for this case they say [0]: "Set the self-closing flag of the current tag token. Switch to the data state. Emit the current tag token."

So it seems that for <img />, only the <img> (and not the closing </img>) should be emitted. HTMLParser has no way to set the self-closing flag, so calling handle_startendtag seems the most reasonable thing to do, since it allows tree-builders to set the flag themselves. That said, the default implementation of handle_startendtag should indeed just call handle_starttag; however, this would be a backward-incompatible change.

[0]: http://www.w3.org/TR/html5/syntax.html#self-closing-start-tag-state |
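One way a tree-builder could record the self-closing flag through handle_startendtag is sketched below. This is a minimal illustration under assumptions: `FlaggingParser` and `tokens_for` are hypothetical names, and the token is reduced to a `(tag, self_closing)` pair.

```python
from html.parser import HTMLParser


class FlaggingParser(HTMLParser):
    """Emit one token per start tag, carrying a self-closing flag."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Called for tags written as <tag ...>
        self.tokens.append((tag, False))

    def handle_startendtag(self, tag, attrs):
        # Called for tags written as <tag ... />; no end tag is emitted here.
        self.tokens.append((tag, True))


def tokens_for(markup):
    """Feed markup to a fresh parser and return the recorded tokens."""
    parser = FlaggingParser()
    parser.feed(markup)
    parser.close()
    return parser.tokens
```

Because handle_startendtag is overridden, the default start-plus-end dispatch does not happen, and the consumer decides what the flag means for each element.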
|||
msg252154 - (view) | Author: Chenyun Yang (Chenyun Yang) | Date: 2015-10-02 20:16 | |
I am fine with either handle_startendtag or handle_starttag. The issue is that the behavior is consistent for the two equally valid syntaxes (<img> and <img/> are handled differently); this inconsistency cannot be fixed from the inherited class, as the handle_* calls are dispatched in the internal method of HTMLParser. |
|||
msg252156 - (view) | Author: Chenyun Yang (Chenyun Yang) | Date: 2015-10-02 20:21 | |
Correction to my previous comment: consistent -> not consistent. |
|||
msg252168 - (view) | Author: Ezio Melotti (ezio.melotti) * | Date: 2015-10-02 22:13 | |
> this inconsistent cannot be fixed from the inherited class as (handle_*
> calls are dispatched in the internal method of HTMLParser)

You can override handle_startendtag() like this:

    >>> class MyHTMLParser(HTMLParser):
    ...     def handle_starttag(self, tag, attrs):
    ...         print('start', tag)
    ...     def handle_endtag(self, tag):
    ...         print('end', tag)
    ...     def handle_startendtag(self, tag, attrs):
    ...         self.handle_starttag(tag, attrs)
    ...
    >>> parser = MyHTMLParser()
    >>> parser.feed('<link rel="import"/><img src="som"/>')
    start link
    start img

(P.S. please don't quote the whole message in your reply) |
|||
msg252214 - (view) | Author: R. David Murray (r.david.murray) * | Date: 2015-10-03 14:47 | |
I suspect that calling startendtag is also backward incompatible, in that there may be parsers out there that are depending on starttag getting called for <img>, and endtag not getting called (that is, endtag getting called for it will cause them to break). I would hope that this would not be the case, but I'm worried about it. |
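The kind of breakage described here can be illustrated with a hypothetical client parser that tracks nesting and assumes void elements never produce an end tag. Everything in this sketch is an assumption made for illustration: `NestingChecker`, `SimulatedChange` (a stand-in that mimics the proposed library behaviour from a subclass), `unexpected_end_tags`, and the void-element list, which follows the HTML5 list cited in this issue.

```python
from html.parser import HTMLParser

# Assumption: void-element list as cited in this issue (HTML5 era).
VOID_ELEMENTS = frozenset([
    "area", "base", "br", "col", "embed", "hr", "img", "input", "keygen",
    "link", "meta", "param", "source", "track", "wbr",
])


class NestingChecker(HTMLParser):
    """Track open elements, assuming void elements never get an end tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.unexpected = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            # An end tag that does not match the innermost open element.
            self.unexpected.append(tag)


class SimulatedChange(NestingChecker):
    """Hypothetical stand-in for the proposed change: end void tags too."""

    def handle_starttag(self, tag, attrs):
        super().handle_starttag(tag, attrs)
        if tag in VOID_ELEMENTS:
            self.handle_endtag(tag)


def unexpected_end_tags(parser_class, markup):
    """Return the end tags the given parser class did not expect."""
    parser = parser_class()
    parser.feed(markup)
    parser.close()
    return parser.unexpected
```

For `<p><img src="x"></p>` the unmodified checker sees nothing unexpected, while the simulated change makes it flag `img`, which is the sort of regression a behaviour change could cause in existing code.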
|||
msg252234 - (view) | Author: Chenyun Yang (Chenyun Yang) | Date: 2015-10-03 19:43 | |
handle_startendtag is also called for non-void elements, such as <a/>, so the override example will break in those situations. The compatible patch I am proposing is just a one-line check:

    # http://www.w3.org/TR/html5/syntax.html#void-elements
    _VOID_ELEMENT_TAGS = frozenset([
        'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen',
        'link', 'meta', 'param', 'source', 'track', 'wbr'])

    # In class HTMLParser.HTMLParser:
        # Internal -- handle starttag, return end or -1 if not terminated
        def parse_starttag(self, i):
            #...
            if end.endswith('/>'):
                # XHTML-style empty tag: <span attr="value" />
                self.handle_startendtag(tag, attrs)
            ############# PATCH #################
            elif end.endswith('>') and tag in _VOID_ELEMENT_TAGS:
                self.handle_startendtag(tag, attrs)
            ############# PATCH ################# |
|||
msg384362 - (view) | Author: karl (karlcow) * | Date: 2021-01-05 01:29 | |
The parsing rules for tokenization of HTML are at https://html.spec.whatwg.org/multipage/parsing.html#tokenization

In the stack of open elements, there are specific rules for certain elements: https://html.spec.whatwg.org/multipage/parsing.html#special

From a DOM point of view, there is indeed no difference between

    <img src="somewhere"><img src="somewhere"/>

https://software.hixie.ch/utilities/js/live-dom-viewer/?%3C!DOCTYPE%20html%3E%0A%3Cimg%20src%3D%22somewhere%22%3E%3Cimg%20src%3D%22somewhere%22%2F%3E |
|||
msg384363 - (view) | Author: karl (karlcow) * | Date: 2021-01-05 01:34 | |
I wonder if the confusion comes from the name. The HTMLParser is kind of a tokenizer more than a full HTML parser, but that's probably a detail. It doesn't create a DOM Tree which you can access, but could help you to build a DOM Tree (!= DOM Document object) https://html.spec.whatwg.org/multipage/parsing.html#overview-of-the-parsing-model > Implementations that do not support scripting do not have to actually create a DOM Document object, but the DOM tree in such cases is still used as the model for the rest of the specification. |
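Treating HTMLParser as a tokenizer, the tree-building step could be layered on top roughly as in the sketch below. It is a simplified illustration, not the parsing algorithm from the specification: `TreeBuilder`, `build_tree`, and the dict-based node shape are hypothetical, text nodes are ignored, and the void-element list is the HTML5 list cited in this issue.

```python
from html.parser import HTMLParser

# Assumption: void-element list as cited in this issue (HTML5 era).
VOID_ELEMENTS = frozenset([
    "area", "base", "br", "col", "embed", "hr", "img", "input", "keygen",
    "link", "meta", "param", "source", "track", "wbr",
])


class TreeBuilder(HTMLParser):
    """Build a nested dict tree from HTMLParser's tag events."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "attrs": {}, "children": []}
        self._open = [self.root]  # stack of currently open elements

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self._open[-1]["children"].append(node)
        if tag not in VOID_ELEMENTS:
            # Only non-void elements stay open and can receive children.
            self._open.append(node)

    def handle_startendtag(self, tag, attrs):
        # Treat <tag/> exactly like <tag>; the void-element check decides
        # whether the element stays open.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self._open) - 1, 0, -1):
            if self._open[i]["tag"] == tag:
                del self._open[i:]
                break


def build_tree(markup):
    """Parse markup and return the root node of the resulting tree."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root
```

With the void-element knowledge kept in the tree builder, `<img src="somewhere">` and `<img src="somewhere"/>` produce identical nodes, which matches the DOM-level observation made in this thread.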
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:58:21 | admin | set | github: 69445 |
2021-01-05 01:34:11 | karlcow | set | messages: + msg384363 |
2021-01-05 01:29:11 | karlcow | set | nosy: + karlcow messages: + msg384362 |
2015-10-03 19:43:29 | Chenyun Yang | set | messages: + msg252234 |
2015-10-03 14:47:47 | r.david.murray | set | nosy: + r.david.murray messages: + msg252214 |
2015-10-02 22:13:55 | ezio.melotti | set | messages: + msg252168 |
2015-10-02 20:21:29 | Chenyun Yang | set | messages: + msg252156 |
2015-10-02 20:16:02 | Chenyun Yang | set | messages: + msg252154 |
2015-10-02 19:42:35 | ezio.melotti | set | type: behavior messages: + msg252152 |
2015-10-02 19:18:01 | Chenyun Yang | set | messages: + msg252150 |
2015-10-01 02:05:15 | martin.panter | set | messages: + msg251987 |
2015-09-29 21:35:43 | Chenyun Yang | set | messages: + msg251891 |
2015-09-29 20:27:04 | martin.panter | set | nosy: + martin.panter messages: + msg251883 versions: + Python 3.4, Python 3.5, Python 3.6 |
2015-09-29 05:22:58 | xiang.zhang | set | nosy: + xiang.zhang messages: + msg251821 |
2015-09-29 02:44:39 | josh.r | set | nosy: + josh.r messages: + msg251813 |
2015-09-28 19:29:41 | r.david.murray | set | nosy: + ezio.melotti |
2015-09-28 19:26:34 | Chenyun Yang | create |