classification
Title: html parser bug related with CDATA sections
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.1
process
Status: closed Resolution: duplicate
Dependencies: Superseder: HTMLParser.py - more robust SCRIPT tag parsing
View: 670664
Assigned To: Nosy List: Hunanyan, r.david.murray
Priority: normal Keywords:

Created on 2010-08-12 15:53 by Hunanyan, last changed 2010-08-13 11:51 by r.david.murray. This issue is now closed.

Messages (3)
msg113688 - (view) Author: Arman (Hunanyan) Date: 2010-08-12 15:53
When HTMLParser reaches a CDATA element, it enters CDATA mode by calling set_cdata_mode (file html/parser.py, line 270). This method assigns the new value r'<(/|\Z)' to the self.interesting member, which is not correct. Consider the following case:

<script language="javascript">
<!--
if (window.adgroupid == undefined) {
	window.adgroupid = Math.round(Math.random() * 1000);
}
document.write('<scr'+'ipt language="javascript1.1" src="http://adserver.adtech.de/addyn|3.0|876|2378574|0|225|ADTECH;loc=100;target=_blank;key=;grp='+window.adgroupid+';misc='+new Date().getTime()+'"></scri'+'pt>');
//-->
</script>

The fragment </scri'+'pt> inside the string literal matches r'<(/|\Z)', so the parser gets confused and produces wrong results. Real pages containing such HTML can be found at:

www.ahram.org.eg
www.chefkoch.de
www.chemieonline.de
www.eip.gov.eg
www.rezepte.li
www.scienceworld.com 
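
The mismatch described above can be reproduced outside the parser with a small sketch. It assumes only the pattern quoted in this report; the script body is a shortened stand-in for the example above (the `x.js` URL is a placeholder, not the original ad-server URL), and the variable names are illustrative.

```python
import re

# Pattern quoted in the report: "<" followed by "/" or by the end of the input.
interesting_cdata = re.compile(r'<(/|\Z)')

# Shortened stand-in for the script body in the report (placeholder URL).
script_body = (
    "document.write('<scr'+'ipt src=\"x.js\"></scri'+'pt>');\n"
    "//-->\n"
    "</script>"
)

match = interesting_cdata.search(script_body)

# Text at the position where the pattern first matches.
first_hit = script_body[match.start():match.start() + 12]

# Position of the real closing tag, for comparison.
real_end = script_body.index("</script>")

print(repr(first_hit))
print(match.start() < real_end)
```

The first match lands on the `</scri'+'pt>` fragment inside the JavaScript string literal, well before the real `</script>` tag.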

The solution can be to keep

interesting_cdata_script = re.compile(r'<(/|\Z)script')
interesting_cdata_style = re.compile(r'<(/|\Z)style')

instead of 

interesting_cdata = re.compile(r'<(/|\Z)')

Then, depending on which tag has just begun (script or style), set_cdata_mode can assign the correct regexp to the self.interesting member.
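
A minimal sketch of the per-tag idea follows. It is not the fix that was applied in HTMLParser; `cdata_end_pattern` is a hypothetical helper, and the script body is the same shortened stand-in with a placeholder URL. Two deliberate departures from the patterns proposed above are assumptions of this sketch: the `\Z` alternative is dropped, because "end of input" followed by a literal tag name can never match, and matching is case-insensitive so that `</SCRIPT>` is also recognized.

```python
import re

def cdata_end_pattern(tag):
    """Hypothetical helper: regex matching only the closing tag of `tag`."""
    # re.escape guards against regex metacharacters in the tag name.
    return re.compile(r'</' + re.escape(tag), re.IGNORECASE)

# Shortened stand-in for the script body in the report (placeholder URL).
script_body = (
    "document.write('<scr'+'ipt src=\"x.js\"></scri'+'pt>');\n"
    "//-->\n"
    "</script>"
)

match = cdata_end_pattern("script").search(script_body)
real_end = script_body.index("</script>")

# The split "</scri'+'pt>" fragment no longer matches; only the real tag does.
print(match.start() == real_end)
```

With a tag-specific pattern, the search skips the split `</scri'+'pt>` fragment and stops at the real closing tag.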


Please contact me via email if you need more details.

arman.hunanyan@gmail.com
msg113713 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-08-12 22:32
I believe this is a duplicate of Issue670664.  If you disagree please reopen with additional information.
msg113743 - (view) Author: Arman (Hunanyan) Date: 2010-08-13 05:21
Yes I agree. This is the same issue.

On Fri, Aug 13, 2010 at 3:32 AM, R. David Murray <report@bugs.python.org> wrote:

>
> R. David Murray <rdmurray@bitdance.com> added the comment:
>
> I believe this is a duplicate of Issue670664.  If you disagree please
> reopen with additional information.
>
> ----------
> nosy: +r.david.murray
> resolution:  -> duplicate
> stage:  -> committed/rejected
> status: open -> closed
> superseder:  -> HTMLParser.py - more robust SCRIPT tag parsing
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue9577>
> _______________________________________
>
History
Date User Action Args
2010-08-13 11:51:30  r.david.murray  set  files: - unnamed
2010-08-13 05:21:50  Hunanyan  set  files: + unnamed; messages: + msg113743
2010-08-12 22:32:32  r.david.murray  set  status: open -> closed; superseder: HTMLParser.py - more robust SCRIPT tag parsing; nosy: + r.david.murray; messages: + msg113713; resolution: duplicate; stage: resolved
2010-08-12 15:53:02  Hunanyan  create