Author yotam
Recipients Hunanyan, cpalmer, ezio.melotti, fantoozler, fdrake, georg.brandl, gsf, momat, yotam
Date 2010-09-30.21:50:03
Message-id <1285883406.35.0.460129064114.issue670664@psf.upfronthosting.co.za>
In-reply-to
Content
HTMLParser.py fails when, inside
  <script> ... </script>
it is fooled by JavaScript that uses less-than '<' in conditional expressions.
In the attached example:

 $ tar tvzf lt-in-script-example.tgz | cut -c24-
     796 2010-09-30 16:52 h2t.py
   23678 2010-09-30 16:39 t.html

here's what happens:

 $ python h2t.py t.html /tmp/t.txt
 HTMLParser: /home/yotam/src/wog/HTMLParser.bug/HTMLParser.py
 Traceback (most recent call last):
   File "h2t.py", line 31, in <module>
     text = html2text(f_html.read())
   File "h2t.py", line 23, in html2text
     te = TextExtractor(html)
   File "h2t.py", line 15, in __init__
     self.feed(html)
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 108, in feed
     self.goahead(0)
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 148, in goahead
     k = self.parse_starttag(i)
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 229, in parse_starttag
     endpos = self.check_for_whole_start_tag(i)
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 304, in check_for_whole_start_tag
     self.error("malformed start tag")
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 115, in error
     raise HTMLParseError(message, self.getpos())
 HTMLParser.HTMLParseError: malformed start tag, at line 396, column 332
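
For readers without the attachment, here is a minimal sketch of the kind of input involved. It is not the attached h2t.py (which is not shown in this message): the TextExtractor and html2text bodies and the SAMPLE string are assumptions made up for illustration, and the code targets Python 3's html.parser rather than the Python 2 HTMLParser.py from the traceback.

```python
# Python 3 module name; the report itself is against Python 2's HTMLParser.py.
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical stand-in for the TextExtractor in h2t.py."""

    def __init__(self, html):
        super().__init__()
        self.chunks = []        # text collected outside <script>
        self.in_script = False  # True while between <script> and </script>
        self.feed(html)
        self.close()

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Keep only text that is not part of a script body.
        if not self.in_script:
            self.chunks.append(data)


def html2text(html):
    return "".join(TextExtractor(html).chunks)


# Assumed sample: JavaScript with '<' comparisons inside a script element.
SAMPLE = (
    "<html><body><p>before</p>"
    "<script>for (var i = 0; i<n; i++) { if (a<b) { f(); } }</script>"
    "<p>after</p></body></html>"
)

text = html2text(SAMPLE)
```

A parser that treats the script body as raw character data should leave text as "beforeafter". The failure reported here corresponds to the parser instead trying to read "<n" or "<b" as the start of a tag.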


I have a suggested patch 
   HTMLParser.diff
fixing this problem, soon to be attached.

-- yotam
History
Date                 User   Action  Args
2010-09-30 21:50:07  yotam  set     recipients: + yotam, fdrake, georg.brandl, fantoozler, gsf, cpalmer, ezio.melotti, momat, Hunanyan
2010-09-30 21:50:06  yotam  set     messageid: <1285883406.35.0.460129064114.issue670664@psf.upfronthosting.co.za>
2010-09-30 21:50:04  yotam  link    issue670664 messages
2010-09-30 21:50:03  yotam  create