This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

Author pluskid
Recipients pluskid
Date 2009-11-12.16:25:41
Hi all,

I'm using BeautifulSoup to parse an HTML page and found that it refused
to parse the page. Looking at the backtrace, I found that the problem is
in one of Python's built-in modules. The web page I'm parsing contains
some Chinese characters: there is a tag like
<img src=/foo/bar.png alt=中文>. Note that this is a legacy HTML page
where the attributes are not quoted. However, the regexp defined there is:

 attrfind = re.compile(

Note that this does not match Chinese characters (or any other
non-English characters), so parsing the tag raises an error. I'm not sure
whether the HTML standard allows unquoted non-ASCII characters in
attribute values. If it does, this seems to be a bug, and the regexp
would be better as [^>\s] IMHO.
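
To illustrate the difference, here is a minimal sketch in Python 3. The two patterns and their names (ASCII_ONLY_VALUE, PERMISSIVE_VALUE) are hypothetical simplifications made up for this example, not the actual regexps from the standard library. They only show how an ASCII-only character class stops short of a non-ASCII unquoted value, while [^>\s] captures it.

```python
import re

# Hypothetical, simplified patterns for illustration only; these are NOT
# the real regexps used by the standard library parser.
ASCII_ONLY_VALUE = re.compile(r'alt=([-a-zA-Z0-9./:_]*)')  # ASCII-only class
PERMISSIVE_VALUE = re.compile(r'alt=([^>\s]*)')            # anything but '>' or whitespace

tag = '<img src=/foo/bar.png alt=中文>'

ascii_match = ASCII_ONLY_VALUE.search(tag)
permissive_match = PERMISSIVE_VALUE.search(tag)

# The ASCII-only class matches zero characters of the value, so the
# Chinese text is left unconsumed in front of the closing '>'.
print(repr(ascii_match.group(1)))

# The permissive class consumes the whole unquoted value.
print(repr(permissive_match.group(1)))
```

With the ASCII-only class the captured value is empty, so a parser built on such a pattern would be left with unexpected characters before the closing '>'.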

BTW, it seems that something like:

var st = "<a></";

cannot be parsed. :-/
Date                 User     Action  Args
2009-11-12 16:25:50  pluskid  set     recipients: + pluskid
2009-11-12 16:25:50  pluskid  set     messageid: <>
2009-11-12 16:25:42  pluskid  link    issue7311 messages
2009-11-12 16:25:42  pluskid  create