classification
Title: html.parser.HTMLParser doesn't parse tags in comments in scripts correctly
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.2
process
Status: closed Resolution: duplicate
Dependencies: Superseder: HTMLParser.py - more robust SCRIPT tag parsing
View: 670664
Assigned To: ezio.melotti Nosy List: ezio.melotti, r.david.murray, turion
Priority: normal Keywords:

Created on 2012-01-04 13:26 by turion, last changed 2012-01-04 16:19 by turion. This issue is now closed.

Files
File name Uploaded Description
htmlparserbug.py turion, 2012-01-04 13:26 Script demonstrating the bug
Messages (8)
msg150603 - (view) Author: Manuel Bärenz (turion) Date: 2012-01-04 13:26
I've attached a script which demonstrates the bug.

When feeding the parser a <script> element whose body is wrapped in a comment tag and itself contains tags (e.g. a 'document.write(<td></td>)'), the parser calls neither handle_comment nor handle_starttag.
msg150604 - (view) Author: Manuel Bärenz (turion) Date: 2012-01-04 13:38
I forgot to say, I'm using python version 3.2.2.
msg150605 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-01-04 13:55
The content of a script tag is CDATA.  Why would you expect it to be parsed?
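A minimal sketch of what CDATA handling means in practice, assuming an interpreter that already includes the script-parsing fix. The class name ScriptProbe and the sample markup are made up for illustration and are not taken from the attached script. Everything between <script> and </script>, including the comment markers and the inner tags, should arrive through handle_data rather than handle_comment or handle_starttag:

```python
from html.parser import HTMLParser


class ScriptProbe(HTMLParser):
    """Hypothetical helper that records which handlers fire."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.comments = []
        self.data_chunks = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_data(self, data):
        # Data may be delivered in several chunks, so collect all of them.
        self.data_chunks.append(data)


# Illustrative script body: a comment wrapper around code that contains tags.
script_body = '<!-- document.write("<td></td>") -->'

probe = ScriptProbe()
probe.feed("<script>" + script_body + "</script>")
probe.close()

print(probe.start_tags)            # only the script tag is expected here
print(probe.comments)              # expected to stay empty
print("".join(probe.data_chunks))  # the script body, passed through as data
```

The chunks are joined before printing because the parser is allowed to deliver one run of text through several handle_data calls.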
msg150606 - (view) Author: Manuel Bärenz (turion) Date: 2012-01-04 14:25
Oh, I wasn't aware of that.
Then the actual bug is that handle_endtag gets called for tags inside the script.
msg150607 - (view) Author: Manuel Bärenz (turion) Date: 2012-01-04 14:28
To clarify this even further: Consider
parser_instance.feed("<script><td></td></script>")

It should call:
parser_instance.handle_starttag("script", [])
parser_instance.handle_data("<td></td>")
parser_instance.handle_endtag("script")

Instead, it calls:
parser_instance.handle_starttag("script", [])
parser_instance.handle_data("<td>")
parser_instance.handle_endtag("td")
parser_instance.handle_endtag("script")
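
The expected sequence above can be checked with a small recording subclass. This is only a sketch: the name EventRecorder and the tuple format are invented for illustration, and the result shown in the comments assumes a Python version that treats script content as CDATA up to the closing </script>. Note that handle_endtag receives only the tag name, with no attribute list.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper that logs handler calls as (kind, value) tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Merge consecutive data chunks so the log does not depend on
        # how the parser happens to split the text.
        if self.events and self.events[-1][0] == "data":
            self.events[-1] = ("data", self.events[-1][1] + data)
        else:
            self.events.append(("data", data))


recorder = EventRecorder()
recorder.feed("<script><td></td></script>")
recorder.close()

for event in recorder.events:
    print(event)
# On a fixed interpreter this is expected to show a start event for
# "script", one data event holding "<td></td>", and an end event for
# "script", with no separate end event for "td".
```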
msg150608 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-01-04 14:42
I believe this was fixed recently as part of issue 670664.  Ezio will know for sure.
msg150611 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-01-04 15:02
Yep, this was fixed in #670664.
With the development version of Python (AFAIK the fix has not been released yet) and the example parser found in the doc[0] I get this:

>>> parser = MyHTMLParser()
>>> parser.feed('<script><td></td></script>')
Encountered a start tag: script
Encountered   some data: <td></td>
Encountered  an end tag: script


[0]: http://docs.python.org/dev/library/html.parser.html#example-html-parser-application
msg150614 - (view) Author: Manuel Bärenz (turion) Date: 2012-01-04 16:19
Great! Thank you!
History
Date User Action Args
2012-01-04 16:19:16 turion set messages: + msg150614
2012-01-04 15:02:17 ezio.melotti set status: open -> closed
superseder: HTMLParser.py - more robust SCRIPT tag parsing
messages: + msg150611
assignee: ezio.melotti
resolution: duplicate
stage: resolved
2012-01-04 14:42:30 r.david.murray set messages: + msg150608
2012-01-04 14:28:47 turion set messages: + msg150607
2012-01-04 14:25:35 turion set messages: + msg150606
2012-01-04 13:55:44 r.david.murray set nosy: + ezio.melotti, r.david.murray
messages: + msg150605
2012-01-04 13:38:27 turion set messages: + msg150604
2012-01-04 13:26:46 turion create