
Author Hunanyan
Recipients Hunanyan
Date 2010-08-12.15:53:01
When HTMLParser reaches a CDATA element, it enters CDATA mode by calling set_cdata_mode (file html/ line 270). This method assigns the new value r'<(/|\Z)' to the self.interesting member. This is not correct. Consider the following case:

<script language="javascript">
if (window.adgroupid == undefined) {
	window.adgroupid = Math.round(Math.random() * 1000);
document.write('<scr'+'ipt language="javascript1.1" src="|3.0|876|2378574|0|225|ADTECH;loc=100;target=_blank;key=;grp='+window.adgroupid+';misc='+new Date().getTime()+'"></scri'+'pt>');

Here </scri'+'pt> matches r'<(/|\Z)', so the parser gets confused and produces wrong results. You can see such real HTML pages in 
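
The premature match can be reproduced outside the parser with the re module alone. This is a minimal sketch, not the parser's own code: the script_body string is a shortened stand-in for the snippet above, and "x.js" is a placeholder source URL.

```python
import re

# The pattern that set_cdata_mode assigns to self.interesting, per this report.
interesting_cdata = re.compile(r'<(/|\Z)')

# Shortened stand-in for the script content above ("x.js" is a placeholder),
# followed by the real closing tag.
script_body = (
    "document.write('<scr'+'ipt src=\"x.js\"></scri'+'pt>');\n"
    "</script>"
)

match = interesting_cdata.search(script_body)
real_end = script_body.index('</script>')

# The first hit is the "</" inside the JavaScript string literal,
# which comes before the real closing tag.
print(match.start(), real_end)
print(script_body[match.start():match.start() + 12])
```

The search stops at the "</" inside the string literal, well before the real </script> tag.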

One possible solution is to keep

interesting_cdata_script = re.compile(r'<(/|\Z)script')
interesting_cdata_style = re.compile(r'<(/|\Z)style')

instead of 

interesting_cdata = re.compile(r'<(/|\Z)')

and, depending on which tag begins the CDATA section (script or style), set_cdata_mode can assign the correct regexp to the self.interesting member.
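
The tag-specific selection could look roughly like the sketch below. It is only an illustration of the idea and not a patch against HTMLParser: the CDATA_PATTERNS table and the pattern_for_tag helper are hypothetical names, the patterns are written with \Z as in the existing interesting_cdata pattern, and "x.js" is a placeholder URL.

```python
import re

# Tag-specific patterns as proposed in this report (written with \Z).
interesting_cdata_script = re.compile(r'<(/|\Z)script')
interesting_cdata_style = re.compile(r'<(/|\Z)style')

# Hypothetical lookup table; not part of HTMLParser.
CDATA_PATTERNS = {
    'script': interesting_cdata_script,
    'style': interesting_cdata_style,
}

def pattern_for_tag(tag):
    """Hypothetical helper: pick the pattern for the tag that opened CDATA mode."""
    return CDATA_PATTERNS[tag.lower()]

# Shortened stand-in for the problematic script content ("x.js" is a placeholder).
script_body = (
    "document.write('<scr'+'ipt src=\"x.js\"></scri'+'pt>');\n"
    "</script>"
)

match = pattern_for_tag('script').search(script_body)
end_of_cdata = match.start()

# The split-up "</scri'+'pt>" is skipped; only the real closing tag matches.
print(script_body[end_of_cdata:])
```

As written, the patterns are case-sensitive, so a closing tag such as </SCRIPT> would presumably also need the re.IGNORECASE flag.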

Please contact me via email if you need more details.
Date                 User      Action  Args
2010-08-12 15:53:05  Hunanyan  set     recipients: + Hunanyan
2010-08-12 15:53:05  Hunanyan  set     messageid: <>
2010-08-12 15:53:02  Hunanyan  link    issue9577 messages
2010-08-12 15:53:01  Hunanyan  create