Issue1144533
Created on 2005-02-19 21:02 by ahoeltje, last changed 2010-07-31 21:07 by georg.brandl. This issue is now closed.
| Messages (3) | |||
|---|---|---|---|
| msg60671 - (view) | Author: Allan Hoeltje (ahoeltje) | Date: 2005-02-19 21:02 | |
I am using htmllib to extract plain-text content from web pages. I
came across a page containing a script construct similar to the
example below. Note that the script itself writes a script.
htmllib appears to be confused by the mix of single and double
quotes between the real <script> and </script> tags.
I am using "Python 2.3 (#1, Sep 13 2003, 00:49:11) [GCC
3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin" on a
PowerBook G4 running OSX 10.3.8.
<html>
<body>
<h1> This is a test </h1>
<br>
<blockquote>
<script language="JavaScript">
rnum = Math.round( Math.random() * 100000 );
document.write( '<scr' + 'ipt src="http://www.a.org/' +
rnum + '/"></scr' + 'ipt>' );
</script>
</blockquote>
</body>
</html>
Here is the Python trace:
Traceback (most recent call last):
  File "cleanFeed.py", line 26, in ?
    clean = stripHtml.strip( feed )
  File "/Users/allan/Desktop/Mood for Today/stripHtml.py", line 144, in strip
    parser.feed(s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 327, in parse_endtag
    self.error("bad end tag: %s" % `rawdata[i:j]`)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</scr' + 'ipt>", at line 1, column 309
|
|||
| msg60672 - (view) | Author: Richard Brodie (leogah) | Date: 2005-03-09 00:51 | |
Generally speaking, you are better off conditioning random junk pulled off the web (with uTidylib or similar) before feeding it to HTMLParser, which tends to report errors when it finds them. See: http://www.w3.org/TR/html4/appendix/notes.html#h-B.3.2.1 for an explanation of why the error message is strictly correct. Someone may step in with a patch to make HTMLParser more tolerant in this case; there will always be something else though. |
|||
| msg112202 - (view) | Author: Georg Brandl (georg.brandl) | Date: 2010-07-31 21:07 | |
Now that htmllib has been removed in Python 3, I don't think this is worth working on. As Richard notes, it is much more useful to use a dedicated parser insensitive to all kinds of wrong markup anyway. |
|||
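For readers hitting the same traceback today, here is a minimal sketch of plain-text extraction under Python 3, assuming the standard-library `html.parser` module is an acceptable replacement for the removed htmllib. The names `TextExtractor`, `strip_html`, and `SAMPLE` are hypothetical and are not taken from the reporter's `stripHtml.py`. The sketch relies on `html.parser` passing the body of a `<script>` element to `handle_data` up to the real `</script>` tag, so the partial `</scr' + 'ipt>` string is treated as script text rather than as an end tag.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>.

    Hypothetical helper for illustration; not part of the original report.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments seen outside skipped elements

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Script and style bodies arrive here too, so filter them out.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse all runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def strip_html(markup):
    """Return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


# The sample page from the original report.
SAMPLE = """<html>
<body>
<h1> This is a test </h1>
<br>
<blockquote>
<script language="JavaScript">
rnum = Math.round( Math.random() * 100000 );
document.write( '<scr' + 'ipt src="http://www.a.org/' +
rnum + '/"></scr' + 'ipt>' );
</script>
</blockquote>
</body>
</html>"""

print(strip_html(SAMPLE))  # prints: This is a test
```

This only sidesteps the particular construct reported here. For arbitrary markup from the web, cleaning the input first or using a dedicated lenient parser, as suggested in the messages above, remains the more robust choice.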
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2010-07-31 21:07:03 | georg.brandl | set | status: open -> closed nosy: + georg.brandl messages: + msg112202 resolution: out of date |
| 2009-04-22 14:44:13 | ajaksu2 | set | keywords: + easy |
| 2009-02-16 01:01:17 | ajaksu2 | set | stage: test needed type: behavior versions: + Python 2.6, - Python 2.3 |
| 2005-02-19 21:02:08 | ahoeltje | create | |
