classification
Title: HTMLParser mishandles last attribute in self-closing tag
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 3.6, Python 3.5, Python 2.7

process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To: ezio.melotti
Nosy List: Tom Anderl, ezio.melotti, xiang.zhang
Priority: normal
Keywords:

Created on 2016-01-11 20:48 by Tom Anderl, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (4)
msg258013 - Author: Tom Anderl (Tom Anderl) Date: 2016-01-11 20:48
When the HTMLParser encounters a start tag element that includes:
  1. an unquoted attribute as the final attribute 
  2. an optional '/' character marking the start tag as self-closing
  3. no space between the final attribute and the '/' character

the '/' character gets attached to the attribute value and the element is interpreted as not self-closing.  This can be illustrated with the following:

===============================================================================

import HTMLParser

# Begin Monkeypatch
#import re
#HTMLParser.attrfind = re.compile(
#    r'((?<=[\'"\s/])[^\s/>][^\s/=>]*)(\s*=+\s*'
#    r'(\'[^\']*\'|"[^"]*"|(?![\'"])[^/>\s]*))?(?:\s|/(?!>))*')
# End Monkeypatch

class MyHTMLParser(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        print('got starttag: {0} with attributes {1}'.format(tag, attrs))

    def handle_endtag(self, tag):
        print('got endtag: {0}'.format(tag))

MyHTMLParser().feed('<img height=1.0 width=2.0/>')

==============================================================================

Running the above code yields the output:

    got starttag: img with attributes [('height', '1.0'), ('width', '2.0/')]

Note the trailing '/' on the 'width' attribute.  If I uncomment the monkey patch, the script then yields:

    got starttag: img with attributes [('height', '1.0'), ('width', '2.0')]
    got endtag: img

Note that the trailing '/' is gone, and an endtag event was generated.
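
For readers on Python 3, where the module is html.parser, the same observation can be reproduced with a minimal sketch along these lines. The names EventCollector and parse_events are illustrative and not part of the original report; the handlers record events instead of printing them so the result can be inspected.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record start/end tag events in order (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))


def parse_events(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = EventCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


# No space between the unquoted value and '/': the slash ends up in the
# attribute value and no endtag event is recorded.
print(parse_events('<img height=1.0 width=2.0/>'))
```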
msg258285 - Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-01-15 09:46
I don't think this is a bug. The HTML5 syntax spec says:

    If an attribute using the unquoted attribute syntax is to be followed by another attribute or by the optional "/" (U+002F) character allowed in step 6 of the start tag syntax above, then there must be a space character separating the two.

So I think HTMLParser's behaviour is right.

The link is https://www.w3.org/TR/html5/syntax.html#attributes-0.
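
As a rough illustration of that rule, the sketch below compares the non-conforming markup with two conforming spellings (a space before the '/', or a quoted value). It assumes Python 3's html.parser; SelfClosingDetector and classify are hypothetical helper names. Overriding handle_startendtag makes it visible whether the parser treated the tag as self-closing.

```python
from html.parser import HTMLParser


class SelfClosingDetector(HTMLParser):
    """Distinguish plain start tags from self-closing ones (illustrative)."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append(("start", tag, dict(attrs)))

    def handle_startendtag(self, tag, attrs):
        # Called instead of handle_starttag for tags the parser sees as <.../>
        self.seen.append(("startend", tag, dict(attrs)))


def classify(markup):
    """Parse markup with a fresh detector and return what it saw."""
    parser = SelfClosingDetector()
    parser.feed(markup)
    parser.close()
    return parser.seen


for markup in ('<img width=2.0/>', '<img width=2.0 />', '<img width="2.0"/>'):
    print(markup, '->', classify(markup))
```

Only the first spelling, which omits the separating space after an unquoted value, keeps the '/' inside the attribute value.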
msg258286 - Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-01-15 09:51
Hmm, I can't say the behaviour is right. But since the HTML doesn't follow the official rule, HTMLParser's behaviour is understandable and cannot be called incorrect.
msg258297 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2016-01-15 14:40
This is not a bug. As described in the HTML5 standard[0], if an unquoted attribute value is followed by a /, the / is included in the value (the "anything else" branch of that list).
This is also what browsers do: try to create an HTML document that includes <img title=test/> and open it in a browser, then use the inspector to examine the result -- you will see <img title="test/"></img> (at least on Firefox).
HTMLParser follows the HTML5 standard, so I'm closing this as "not a bug".
Thanks anyway for the report and to Xiang for pointing out that it's not a bug.

[0]: https://www.w3.org/TR/html5/syntax.html#attribute-value-%28unquoted%29-state
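
The "anything else" branch can be sketched as a tiny scanner. This is a simplified paraphrase of the spec's unquoted attribute value state, not HTMLParser's actual implementation (which is regex-based); scan_unquoted_value is a hypothetical name, and character references, NULL handling and parse-error reporting are deliberately left out.

```python
WHITESPACE = "\t\n\f "  # tab, LF, FF and space end an unquoted value


def scan_unquoted_value(text, pos=0):
    """Return (value, end_index) for an unquoted value starting at pos.

    Simplified sketch: only the whitespace, '>' and "anything else"
    branches of the state are modelled.
    """
    chars = []
    while pos < len(text):
        ch = text[pos]
        if ch in WHITESPACE or ch == ">":
            break            # the value ends here; '>' also ends the tag
        chars.append(ch)     # "anything else", which includes '/'
        pos += 1
    return "".join(chars), pos


print(scan_unquoted_value("test/>"))   # ('test/', 5)
print(scan_unquoted_value("2.0 />"))   # ('2.0', 3)
```

Because '/' is not one of the terminating characters, it is consumed as part of the value unless whitespace comes first.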
History
Date                 User          Action  Args
2022-04-11 14:58:26  admin         set     github: 70272
2016-01-15 14:40:19  ezio.melotti  set     status: open -> closed; versions: + Python 3.5, Python 3.6; messages: + msg258297; resolution: not a bug; stage: test needed -> resolved
2016-01-15 09:51:42  xiang.zhang   set     messages: + msg258286
2016-01-15 09:46:03  xiang.zhang   set     nosy: + xiang.zhang; messages: + msg258285
2016-01-11 20:53:33  ezio.melotti  set     assignee: ezio.melotti; nosy: + ezio.melotti; stage: test needed
2016-01-11 20:48:45  Tom Anderl    create