Message113688
When HTMLParser reaches a CDATA element, it enters CDATA mode by calling set_cdata_mode (file html/parser.py, line 270). This method assigns the new value r'<(/|\Z)' to the self.interesting member. This is not correct. Consider the following case:
<script language="javascript">
<!--
if (window.adgroupid == undefined) {
window.adgroupid = Math.round(Math.random() * 1000);
}
document.write('<scr'+'ipt language="javascript1.1" src="http://adserver.adtech.de/addyn|3.0|876|2378574|0|225|ADTECH;loc=100;target=_blank;key=;grp='+window.adgroupid+';misc='+new Date().getTime()+'"></scri'+'pt>');
//-->
</script>
The </scri'+'pt> fragment inside the JavaScript string matches r'<(/|\Z)', so the parser gets confused and produces wrong results. Real pages containing such HTML can be found at:
www.ahram.org.eg
www.chefkoch.de
www.chemieonline.de
www.eip.gov.eg
www.rezepte.li
www.scienceworld.com
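The false match described above can be reproduced with the regexp alone, without running the parser. The following is only a minimal sketch: the script body is a shortened copy of the example (the long src URL is replaced by "..."), and the variable names are made up for illustration.

```python
import re

# The pattern that set_cdata_mode assigns, as quoted in this report.
interesting_cdata = re.compile(r'<(/|\Z)')

# Shortened copy of the script body from the example; the src URL
# is replaced by "..." purely to keep the line short.
script_body = (
    "document.write('<scr'+'ipt language=\"javascript1.1\" src=\"...\">"
    "</scri'+'pt>');\n"
    "//-->\n"
    "</script>"
)

match = interesting_cdata.search(script_body)

# The first hit is the "</" inside the JavaScript string literal,
# which comes before the real closing tag.
false_hit = script_body[match.start():match.start() + 12]
real_close = script_body.index("</script>")

print(repr(false_hit), match.start(), real_close)
```

The first match starts at the string literal, well before the position of the real </script> tag, which is the point where the parser would leave CDATA mode too early.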
One possible solution is to keep
interesting_cdata_script = re.compile(r'<(/|\Z)script')
interesting_cdata_style = re.compile(r'<(/|\Z)style')
instead of
interesting_cdata = re.compile(r'<(/|\Z)')
so that, depending on which tag begins the CDATA section (script or style), set_cdata_mode can assign the matching regexp to the self.interesting member.
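A rough sketch of the per-tag selection might look like the following. It is not a patch against html/parser.py: select_interesting is a hypothetical helper standing in for the choice set_cdata_mode would make, and the script body is again a shortened copy of the example.

```python
import re

# Patterns proposed in this report, written with \Z as in the
# existing interesting_cdata pattern.
interesting_cdata_script = re.compile(r'<(/|\Z)script')
interesting_cdata_style = re.compile(r'<(/|\Z)style')


def select_interesting(tag):
    """Hypothetical helper: return the end-of-CDATA pattern for `tag`."""
    if tag == 'script':
        return interesting_cdata_script
    if tag == 'style':
        return interesting_cdata_style
    raise ValueError('not a CDATA element: %r' % (tag,))


# Shortened copy of the problematic script body, followed by the
# real closing tag.
body = "document.write('<scr'+'ipt src=\"...\"></scri'+'pt>');\n</script>"

match = select_interesting('script').search(body)
found = body[match.start():match.start() + len('</script>')]

print(repr(found))
```

With these patterns the split </scri'+'pt> string is skipped and the first match is the real closing tag. One thing to keep in mind: once a tag name follows the group, the \Z alternative can no longer match, so a bare "<" at the end of the buffer is not detected by these patterns.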
Please contact me by email if you need more details.
arman.hunanyan@gmail.com
Date | User | Action | Args
2010-08-12 15:53:05 | Hunanyan | set | recipients: + Hunanyan
2010-08-12 15:53:05 | Hunanyan | set | messageid: <1281628385.29.0.0117705176523.issue9577@psf.upfronthosting.co.za>
2010-08-12 15:53:02 | Hunanyan | link | issue9577 messages
2010-08-12 15:53:01 | Hunanyan | create |