Message 95162 - Python tracker


Author	pluskid
Recipients	pluskid
Date	2009-11-12.16:25:41
Message-id	<1258043150.41.0.372876851796.issue7311@psf.upfronthosting.co.za>
In-reply-to

Content

Hi all,

I'm using BeautifulSoup to parse an HTML page, and it refuses to parse
the page. Looking at the backtrace, I found that the problem is in
Python's built-in HTMLParser.py. The web page I'm parsing contains some
Chinese characters. There is a tag like <img src=/foo/bar.png alt=中文>.
Note that this is a legacy HTML page where the attributes are not
quoted. However, the regexp defined in HTMLParser.py is:

 attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

Note that the character class for unquoted values does not include
Chinese characters (or any other non-ASCII characters), so the parser
raises an error on this tag. I'm not sure whether the HTML standard
allows unquoted non-ASCII characters in attribute values. If it does,
this seems to be a bug, and the character class in the regexp would be
better as [^>\s], IMHO.
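
To make the mismatch concrete, here is a rough sketch that applies the
regexp quoted above to the attribute part of the example tag, next to a
variant where only the unquoted-value class is swapped for [^>\s]. The
names attrfind_relaxed and parse_attrs are made up for this
illustration and are not part of HTMLParser.py; the loop only imitates
how a parser might consume attributes one after another.

```python
import re

# Pattern as quoted in this report (from HTMLParser.py at the time).
attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

# Hypothetical variant: unquoted values may be anything except '>' and
# whitespace. This is the suggestion from the report, not a shipped fix.
attrfind_relaxed = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')


def parse_attrs(pattern, text):
    """Match attributes one after another; return (pairs, unparsed rest)."""
    attrs = []
    pos = 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if m is None or m.end() == pos:
            break  # nothing more that looks like an attribute
        attrs.append((m.group(1), m.group(3)))
        pos = m.end()
    return attrs, text[pos:]


if __name__ == "__main__":
    attr_text = " src=/foo/bar.png alt=中文"
    print(parse_attrs(attrfind, attr_text))
    print(parse_attrs(attrfind_relaxed, attr_text))
```

With the quoted pattern, alt gets an empty value and the Chinese text
is left over as unparsed input, which is presumably what the parser
then rejects. With the relaxed class the whole value is captured.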

BTW: it seems that something like:

<script>
var st = "<a></";
</script>

cannot be parsed either. :-/
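
For anyone who wants to check whether the script case still
reproduces on their interpreter, a small script along these lines might
help. It is only a sketch and assumes Python 3's html.parser module
(the report itself is about the Python 2 HTMLParser module); the
Recorder class and the record helper are hypothetical names used for
this check.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Record the start tags and the text data the parser reports."""

    def __init__(self):
        super().__init__()
        self.starts = []
        self.data = []

    def handle_starttag(self, tag, attrs):
        self.starts.append(tag)

    def handle_data(self, data):
        self.data.append(data)


def record(html):
    """Feed html to a fresh Recorder; return (start tags, joined data)."""
    parser = Recorder()
    parser.feed(html)
    parser.close()
    return parser.starts, "".join(parser.data)


if __name__ == "__main__":
    snippet = '<script>\nvar st = "<a></";\n</script>'
    # If the parser chokes on the "</" inside the string literal, this
    # call raises or reports unexpected tags instead of script data.
    print(record(snippet))
```

If the script body comes back as plain data and script is the only
start tag reported, the parser is treating the element content as raw
text rather than markup.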

History
Date	User	Action	Args
2009-11-12 16:25:50	pluskid	set	recipients: + pluskid
2009-11-12 16:25:50	pluskid	set	messageid: <1258043150.41.0.372876851796.issue7311@psf.upfronthosting.co.za>
2009-11-12 16:25:42	pluskid	link	issue7311 messages
2009-11-12 16:25:42	pluskid	create