Message95162
Hi all,
I'm using BeautifulSoup to parse an HTML page and found that it refuses to
parse the page. Looking at the backtrace, I found that the problem is
in Python's built-in HTMLParser.py. The web page I'm
parsing contains some Chinese characters; there is a tag like <img
src=/foo/bar.png alt=中文>. Note that this is a legacy HTML page where the
attribute values are not quoted. However, the regexp defined in
HTMLParser.py is:
attrfind = re.compile(
r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')
Note that Chinese characters (and any other non-ASCII
characters) are not in the character class for unquoted values, so the
parser raises an error on this tag. I'm not sure whether
the HTML standard allows unquoted non-ASCII characters in
attribute values. If it does, this seems to be a bug, and the character
class for unquoted values would better be [^>\s] IMHO.
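The mismatch can be reproduced with the `re` module alone, without going through HTMLParser. The following is a minimal sketch: `attrfind_old` is the pattern quoted above, while `attrfind_new`, the `scan` helper, and the sample string are hypothetical names introduced here for illustration, with `[^>\s]*` standing in for the suggested character class. It is not the library's actual parsing loop.

```python
import re

# Pattern as quoted in this report (from HTMLParser.py at the time).
attrfind_old = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

# Hypothetical variant: an unquoted value is any run of characters
# that are neither '>' nor whitespace.
attrfind_new = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')


def scan(pattern, text):
    """Collect (name, value) pairs until the pattern stops making progress.

    Returns the pairs found and whatever text was left unconsumed.
    """
    pos, found = 0, []
    while pos < len(text):
        m = pattern.match(text, pos)
        if not m or m.end() == pos:
            break
        found.append((m.group(1), m.group(3)))
        pos = m.end()
    return found, text[pos:]


# Attribute portion of the tag from the report.
attrs = ' src=/foo/bar.png alt=中文'

old_found, old_rest = scan(attrfind_old, attrs)
new_found, new_rest = scan(attrfind_new, attrs)

print(old_found, repr(old_rest))
print(new_found, repr(new_rest))
```

With the old pattern the value of `alt` comes back empty and the Chinese text is left over as unconsumed input, which is the kind of leftover a strict parser would reject. With the variant the whole value is consumed.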
BTW: It seems that something like this:
<script>
var st = "<a></";
</script>
cannot be parsed either. :-/
Date | User | Action | Args
2009-11-12 16:25:50 | pluskid | set | recipients: + pluskid
2009-11-12 16:25:50 | pluskid | set | messageid: <1258043150.41.0.372876851796.issue7311@psf.upfronthosting.co.za>
2009-11-12 16:25:42 | pluskid | link | issue7311 messages
2009-11-12 16:25:42 | pluskid | create |