Message117762
HTMLParser.py fails inside <script> ... </script> blocks: it can be fooled by JavaScript containing less-than ('<') conditional expressions, which it mistakes for the start of a tag.
In the attached example:
$ tar tvzf lt-in-script-example.tgz | cut -c24-
796 2010-09-30 16:52 h2t.py
23678 2010-09-30 16:39 t.html
here's what happens:
$ python h2t.py t.html /tmp/t.txt
HTMLParser: /home/yotam/src/wog/HTMLParser.bug/HTMLParser.py
Traceback (most recent call last):
  File "h2t.py", line 31, in <module>
    text = html2text(f_html.read())
  File "h2t.py", line 23, in html2text
    te = TextExtractor(html)
  File "h2t.py", line 15, in __init__
    self.feed(html)
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 229, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 304, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 396, column 332
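The failure mode can be illustrated with a much smaller input than t.html. The sketch below is not the attached h2t.py, whose contents are not shown here. It borrows the TextExtractor name from the traceback, and everything else (the attributes text_parts, script_parts, tags and the sample HTML) is a hypothetical stand-in. It also assumes a current Python 3, where the module is html.parser and <script> content is expected to arrive in handle_data as raw text, so a '<' in a JavaScript condition should not be parsed as a start tag.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical stand-in for the class in h2t.py (not the original code)."""

    def __init__(self):
        super().__init__()
        self.text_parts = []    # text found outside <script>/<style>
        self.script_parts = []  # raw content found inside <script>/<style>
        self.tags = []          # every start tag the parser reports
        self._raw_elem = None   # name of the raw-text element we are inside

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag in ("script", "style"):
            self._raw_elem = tag

    def handle_endtag(self, tag):
        if tag == self._raw_elem:
            self._raw_elem = None

    def handle_data(self, data):
        # Data may arrive in several chunks, so collect and join later.
        if self._raw_elem:
            self.script_parts.append(data)
        else:
            self.text_parts.append(data)


# Minimal input with a less-than comparison inside a script element.
html = """<html><body><p>Hello</p>
<script>if (a<b && b>c) { run(); }</script>
<p>World</p></body></html>"""

te = TextExtractor()
te.feed(html)
te.close()

script_source = "".join(te.script_parts)
visible_text = " ".join("".join(te.text_parts).split())
print(te.tags)
print(repr(script_source))
print(repr(visible_text))
```

If the parser were fooled in the way this report describes, the "<b && b>" fragment would be handed to parse_starttag instead of being delivered as script data.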
I have a suggested patch, HTMLParser.diff, that fixes this problem; it will be attached soon.
-- yotam

Date | User | Action | Args
2010-09-30 21:50:07 | yotam | set | recipients: + yotam, fdrake, georg.brandl, fantoozler, gsf, cpalmer, ezio.melotti, momat, Hunanyan
2010-09-30 21:50:06 | yotam | set | messageid: <1285883406.35.0.460129064114.issue670664@psf.upfronthosting.co.za>
2010-09-30 21:50:04 | yotam | link | issue670664 messages
2010-09-30 21:50:03 | yotam | create |