Issue1144533
Created on 2005-02-19 21:02 by ahoeltje, last changed 2009-04-22 14:44 by ajaksu2.
| Messages (2) | |||
|---|---|---|---|
| msg60671 | Author: Allan Hoeltje (ahoeltje) | Date: 2005-02-19 21:02 | |
I am using the htmllib to parse web pages for plain text content. I came across a web page that contained a script construct similar to the example below. Note that the script is itself writing a script. The htmllib appears to be confused by the use of single and double quotes used within the real <script> and </script> tags.

I am using "Python 2.3 (#1, Sep 13 2003, 00:49:11) [GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin" on a PowerBook G4 running OSX 10.3.8.

    <html>
    <body>
    <h1> This is a test </h1>
    <br>
    <blockquote>
    <script language="JavaScript">
    rnum = Math.round( Math.random() * 100000 );
    document.write( '<scr' + 'ipt src="http://www.a.org/' + rnum + '/"></scr' + 'ipt>' );
    </script>
    </blockquote>
    </body>
    </html>

Here is the Python trace:

    Traceback (most recent call last):
      File "cleanFeed.py", line 26, in ?
        clean = stripHtml.strip( feed )
      File "/Users/allan/Desktop/Mood for Today/stripHtml.py", line 144, in strip
        parser.feed(s)
      File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 108, in feed
        self.goahead(0)
      File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 150, in goahead
        k = self.parse_endtag(i)
      File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 327, in parse_endtag
        self.error("bad end tag: %s" % `rawdata[i:j]`)
      File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 115, in error
        raise HTMLParseError(message, self.getpos())
    HTMLParser.HTMLParseError: bad end tag: "</scr' + 'ipt>", at line 1, column 309 |
|||
| msg60672 | Author: Richard Brodie (leogah) | Date: 2005-03-09 00:51 | |
Generally speaking, you are better off conditioning random junk pulled off the web (with uTidylib or similar) before feeding it to HTMLParser, which tends to report errors when it finds them. See: http://www.w3.org/TR/html4/appendix/notes.html#h-B.3.2.1 for an explanation of why the error message is strictly correct. Someone may step in with a patch to make HTMLParser more tolerant in this case; there will always be something else though. |
|||
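
The HTML 4 note linked in msg60672 says that script data ends at the first `</` followed by a name start character (`[a-zA-Z]`), which need not be the real `</script>` tag. The sketch below only illustrates that rule against the statement from the report; it is not HTMLParser's actual code, and the names `ETAGO` and `script_data_end` are made up for this example.

```python
import re

# HTML 4 rule from the linked note: script data ends at the first "</"
# that is followed by a name start character [a-zA-Z].
# ETAGO and script_data_end are hypothetical names for this illustration.
ETAGO = re.compile(r"</[A-Za-z]")


def script_data_end(data):
    """Return the index where HTML 4 script data would end, or -1 if none."""
    match = ETAGO.search(data)
    return match.start() if match else -1


# The document.write() statement from the report.
reported = """document.write( '<scr' + 'ipt src="http://www.a.org/' + rnum + '/"></scr' + 'ipt>' );"""

# The same statement with "</" escaped as "<\/", as the note recommends
# for JavaScript content.
escaped = reported.replace("</", "<\\/")

end = script_data_end(reported)
print(repr(reported[end:end + 5]))   # where a strict parser stops
print(script_data_end(escaped))      # no early terminator once escaped
```

Under that rule the script data stops at `</scr`, so a strict parser reads `</scr' + 'ipt>` as a malformed end tag, which matches the reported message. Escaping the delimiter in the page source avoids the early match.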
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2009-04-22 14:44:13 | ajaksu2 | set | keywords: + easy |
| 2009-02-16 01:01:17 | ajaksu2 | set | stage: test needed type: behavior versions: + Python 2.6, - Python 2.3 |
| 2005-02-19 21:02:08 | ahoeltje | create | |
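
The history marks this issue as "test needed". A minimal sketch of such a check follows, assuming Python 3's `html.parser` module rather than the Python 2.3 `HTMLParser` module from the traceback. `TextExtractor` and `strip_html` are hypothetical names; the reporter's `stripHtml.py` is not shown in this thread.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip = None   # name of the raw-text element being skipped
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = tag

    def handle_endtag(self, tag):
        if tag == self._skip:
            self._skip = None

    def handle_data(self, data):
        if self._skip is None:
            self.parts.append(data)


def strip_html(markup):
    """Return the visible text of markup with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())


# The page from the original report.
SAMPLE = """<html> <body> <h1> This is a test </h1> <br> <blockquote>
<script language="JavaScript">
rnum = Math.round( Math.random() * 100000 );
document.write( '<scr' + 'ipt src="http://www.a.org/' + rnum + '/"></scr' + 'ipt>' );
</script> </blockquote> </body> </html>"""

print(strip_html(SAMPLE))
```

On a parser that treats the script body as raw text up to the real `</script>` tag, this should print only the heading text and raise no parse error.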