HTMLParser incorrectly handles cdata elements. #57567

MichaelBrooks · 2011-11-06T19:48:25Z

BPO	13358
Nosy	@ezio-melotti
Files	issue13358.diff: patch against 2.7

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/ezio-melotti'
closed_at = <Date 2011-11-18.16:05:18.868>
created_at = <Date 2011-11-06.19:48:24.759>
labels = ['type-bug', 'library']
title = 'HTMLParser incorrectly handles cdata elements.'
updated_at = <Date 2011-11-18.16:05:18.867>
user = 'https://bugs.python.org/MichaelBrooks'

bugs.python.org fields:

activity = <Date 2011-11-18.16:05:18.867>
actor = 'ezio.melotti'
assignee = 'ezio.melotti'
closed = True
closed_date = <Date 2011-11-18.16:05:18.868>
closer = 'ezio.melotti'
components = ['Library (Lib)']
creation = <Date 2011-11-06.19:48:24.759>
creator = 'Michael.Brooks'
dependencies = []
files = ['23721']
hgrepos = []
issue_num = 13358
keywords = ['patch']
message_count = 13.0
messages = ['147175', '147176', '147178', '147223', '147806', '147807', '147809', '147816', '147818', '147819', '147846', '147883', '147884']
nosy_count = 3.0
nosy_names = ['ezio.melotti', 'Michael.Brooks', 'python-dev']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue13358'
versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

MichaelBrooks · 2011-11-06T19:48:23Z

The HTML tag at the bottom of this page correctly identified has having cdata like properties and trigger set_cdata_mode(). Due to the cdata properties of this tag, the only way to end the data segment is with a closing </script> tag, NO OTHER tag can close this data segment. Currently in cdata mode the HTMLParser will use this regular expression to close this script tag: re.compile(r'<(/|\Z)'), however this script tag is setting a variable with data that contains "</b>" which will terminate this script tag prematurely.

I have written and tested the following patch on my system:
#used to terminate cdata elements
endtagfind_script = re.compile('(?i)</\s*script\s*>')
endtagfind_style = re.compile('(?i)</\s*style\s*>')

class html_patch(HTMLParser.HTMLParser):
    # Internal -- sets the proper tag terminator based on cdata element type
    def set_cdata_mode(self, tag):
        #We check if the script is either a style or a script
        #based on self.CDATA_CONTENT_ELEMENTS
        if tag=="style":
            self.interesting = endtagfind_style
        elif tag=="script":
            self.interesting = endtagfind_script
        else:
            self.error("Unknown cdata type:"+tag) # should never happen
        self.cdata_tag = tag

This cdata tag isn't parsed properly by HTMLParser, but it works fine in a browser:
<script>
pwa.setup(
pwa.searchview,
'lhid_searchheader',
'lhid_content',
'lhid_trayhandle',
'lhid_tray',
{'query': 'test',
'tagQuery': '',
'searchScope': '',
'owner': '',
'doCrowding': false,
'isOwner': false,
'albumId': ''
,'experimentalsearchquality': true},
'firealwaysworks'
,
{feedUrl: 'https://picasaweb.google.com/data/feed/tiny/all?alt=jsonm&kind=photo&access=public&filter=1&q=test',
feedPreload: null},
{NEW_HOMEPAGE:1,NEW_ONE_BAR:1,fr:1,tags:1,search:1,globalsearch:1,globalsearchpromo:1,newfeatureslink:1,cart:1,contentcaching:1,developerlink:1,payments:1,newStrings:1,cccquota:1,signups:1,flashSlideshow:1,URL_SHORTENER_VISIBILITY:1,emailupload:1,photopickeralbumview:1,PWA_NEWUI:1,WILDCARD_QUERY_FEED:1,recentphotos:1,editinpicasa:1,imagesearch:1,froptin:1,FR_CONTINUOUS_CLUSTERING:1,asyncUploads:1,PERFORMANCE_EXPERIMENTS:1,BAKED_PRELOAD_FEEDS:1,albumviewlimit:1,HQ_VIDEOS:1,VIDEO_INFO_DISPLAY:1,CSI:1,EXPERIMENTAL_SEARCH_QUALITY:1,COMMENT_TRANSLATION:1,NEW_COMMENT_STYLE:1,ENABLE_NEW_FLAG_ABUSE_FORM:1,QRCODE:1,CHINA:1,GWS_URL_REDIRECTION:1,FEATURED_PHOTOS:1,COMMENT_SUBSCRIPTION:1,COMMENT_SUBSCRIPTION_SETTING:1,PICASA_MAC:1,AD_ON_SEARCHPAGE:1,API_AUTO_ACCOUNTS:1,FOCUS_GROUP_ACL:1,PHOTOSTREAM:1,BACKEND_ACL:1,ADVANCED_SEARCH:1,FACE_SEARCH:1,CAMERA_SEARCH:1,NOTIFICATION:1,PIXELATED_PREVIEW:1,TRANSPARENT_PIXELATED_PREVIEW:1,NEW_SETTINGS_PAGE:1,VIEW_STARRERS:1,FR_FOCUS_MERGE:1,AD_ON_SEARCH_ONEUP:1,GALLERY_COMMENTS:1,COMMENT_ABUSE_BLOCKING:1,FAVORITE_NOTIFICATION:1,IMAGE_ONLY_LINK:1,RECENT_PHOTOS_SLIDESHOW:1,HEART:1,SMALLER_IMAGE:1,FAST_SLIDESHOW:1,VIEW_CONTACTS:1,COLLABORATIVE_ALBUMS:1,PRINT_MARKETPLACE:1,PRINT_MARKETPLACE_REPLACEMENT:1,VIEW_COUNT:1,POST_TO:1,GAPLUS:1,PICASA_PROMO:1,DOUBLECLICK_PREMIUM_ADS:1,DOUBLECLICK_EXPLORE_MAIN:1,DOUBLECLICK_MYPHOTOS:1,DOUBLECLICK_PUBLIC_GALLERY:1,DOUBLECLICK_USER_ALBUM:1,DOUBLECLICK_USER_PHOTO:1,DOUBLECLICK_VISITOR_ADS:1,PRODUCTION:1,NOSCRIPT:1,UNLISTED_GALLERY:1,GA_TRACKING:1,UNLIMITED_GALLERY:1,PICNIK_EDIT:1,MICROSCOPE_ZOOM:1,FR_V2:1,FAVORITE_SUGGESTION:1,FAVORITE_UPDATE:1,MERGED_PROFILES_SOFTLAUNCH:1,MERGED_PROFILES:1,MERGED_PROFILES_ASYNC:1,NEW_FR_UI:1,GAPLUS_UNMERGED_SOCIALIZATION:1,OPTOUT_ACL_NOTIFICATION:1,HTTPS_VISIBILITY:1,DEFAULT_HTTPS:1,EXTENDED_EXIF:1,DOUBLECLICK_MULTISLOT:1,ONEPICK:1,PER_ALBUM_GEO_VISIBILITY:1,FOCUS_MERGE_LINK_DIALOG_VISIBILITY:1,SHAREBOX_VISIBILITY:1,AUTO_DOWNSIZE:1,BULK_ALBUM_EDITOR_VISIBILITY:1,PROFILE_NAME_CHECK:1,COLLABORATIVE_NAMETAGS:1,NOT_FOUND_404:1,REDIRECT_TO_PLUS:1},
{
'gdataVersion': '4.0',
'updateCartPath': '\x2Flh\x2FupdateCart?rtok=b8S9ibYqrTMF',
'editCaptionsPath': '',
'albumMapPath': '',
'albumKmlUrl': '',
'selectedPhotosPath': '\x2Flh\x2FselectedPhotos?tok=QUI1UGxRYk9fNmw1Q2tVeS1DWnY3UlFoTTY1RzRNNWphdzoxMzIwNjAyMzA3NDYx',
'setLicensePath': '',
'setStarPath': '\x2Flh\x2FsetStar?tok=QUI1UGxRWW4zY1ZKb3U0TzROZU5tUHhIV3hhRW9HcUYwQToxMzIwNjAyMzA3NDYx',
'peopleManagerPath': '',
'peopleSearchPath': '',
'clusterViewPath': '',
'frOptStatus': 'OptedIn',
'isNameTagsVisible': '','authUserIsPhotosUser': true,
'authUserNickname': 'Some Nickname',
'authUserPortraitUrl': 'https:\x2F\x2Flh4.googleusercontent.com\x2F-UI9ZfIFfyQI\x2FAAAAAAAAAAI\x2FAAAAAAAAAAA\x2Fm0enLvZXYbI\x2Fs32-c\x2Ffirealwaysworks.jpg',
'authUserProfileUrl':'https:\x2F\x2Fprofiles.google.com\x2F115162402406836485912', 'authUser':{name:'firealwaysworks',isProfileUser:1,isLoggedIn:1,user:1,isOwner:1
,'showGeo': 0
},
'foreignNickname': '',
'subjects': [
]
,
'owner': {name:'firealwaysworks',nickname:'Michael Brooks',portrait:'https:\x2F\x2Flh4.googleusercontent.com\x2F-UI9ZfIFfyQI\x2FAAAAAAAAAAI\x2FAAAAAAAAAAA\x2Fm0enLvZXYbI\x2Fs32-c\x2Ffirealwaysworks.jpg',link:'https:\x2F\x2Fpicasaweb.google.com\x2F115162402406836485912',defaultalbumnametagsvisibility:'2'
,isnewstorageuser:false},
'album': {id:'',link:'',name:'',title:'',titleshort:'',kmlLink:'',mapLink:'',showGeo:0},
'picnikHandlerUrl': 'https:\x2F\x2Fpicasaweb.google.com\x2Flh\x2FpicnikHandler?tok=QUI1UGxRYWJYZVVueTE2SjJJLTBjRUliWVlaaU42S0NRQToxMzIwNjAyMzA3NDYx',
'picnikCloseUrl': 'https:\x2F\x2Fpicasaweb.google.com\x2Flh\x2FpicnikClose',
'picnikLocale': 'en_US',
'authUserIsProfileUser': true,
'ownerIsEsUser': false,
'profileServerUrl': '',
'profileSignupUrl': 'https:\x2F\x2Fplus.google.com\x2Fup\x2F?type=st',
'profileName': 'Google+'
});
function google_afs_request_done(google_ads) {
var google_num_ads = google_ads.length;
if (google_num_ads <= 0)
return;

    var wideAds = '<table width="100%" cellspacing="0" cellpadding="8"' +
        'border="0" class="lhcl_search_ad_tbl"><tbody><tr>';
for(var i = 0; i < google_num_ads; i++){
      if (google_ads[i].type=="text/wide"){
        wideAds += '<td>';
wideAds += '<table width="100%" cellspacing="0" ' +
          'cellpadding="0" border="0"><tbody><tr><td>';
wideAds+='<a href="' + google_ads[i].url + '" ' +
        'onmouseout="window.status=\'\';return true" ' +
        'onmouseover="window.status=\'go to ' +
        google_ads[i].visible_url + '\';return true" ' +
        'style="text-decoration:none">' +
'<span style="text-decoration:underline">' +
'<b>' + google_ads[i].line1 + '</b><br></span>' +
'<span style="color:#000000">' +
google_ads[i].line2 + '<br></span>' +
'<span style="color:#008000">' +
google_ads[i].visible_url + '</span></a><BR>';
wideAds += '</td>';
if (i == google_num_ads - 1) {
wideAds += '<td valign="top">' +
'<div id="lhid_search_ad_spl">Sponsored Links</div></td>';
}
wideAds += '</tr></tbody></table>';
wideAds += '</td>';
}
}
wideAds += '</tr></tbody></table>';
document.getElementById("lhid_search_ad_unit").innerHTML = wideAds;
}
google_afs_query = 'test';
google_afs_ad = 'w3'; // specify the number of ads you are requesting
google_afs_client = 'google-picasa_js'; // substitute your client ID
// google_afs_channel = ''; // enter your comma-separated channel IDs
google_afs_ie = 'utf8'; // select input encoding scheme
google_afs_oe = 'utf8'; // select output encoding scheme
google_afs_adsafe = 'high'; // specify level for filtering non-family-safe ads
google_afs_adtest = 'off'; // ** set parameter to off before launch to production
document.write('<' +
    'script src="https://www.google.com/afsonline/show_afs_ads.js"></' +
    'script>');
</script>

ezio-melotti · 2011-11-06T19:52:01Z

Have you tried with the latest 2.7? (see msg147170)

MichaelBrooks · 2011-11-06T19:54:37Z

Yes I am running python 2.7.2.

On Sun, Nov 6, 2011 at 12:52 PM, Ezio Melotti <report@bugs.python.org>wrote:

Ezio Melotti <ezio.melotti@gmail.com> added the comment:

Have you tried with the latest 2.7? (see msg147170)

----------
nosy: +ezio.melotti
stage: -> test needed

Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue13358\>

MichaelBrooks · 2011-11-07T14:46:27Z

This one should also have a priority change. Tested python 2.7.3

--MIke

On Sun, Nov 6, 2011 at 12:54 PM, Michael Brooks <report@bugs.python.org>wrote:

Michael Brooks <firealwaysworks@gmail.com> added the comment:

Yes I am running python 2.7.2.

On Sun, Nov 6, 2011 at 12:52 PM, Ezio Melotti <report@bugs.python.org
>wrote:

>
> Ezio Melotti <ezio.melotti@gmail.com> added the comment:
>
> Have you tried with the latest 2.7? (see msg147170)
>
> ----------
> nosy: +ezio.melotti
> stage: -> test needed
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue13358\>
> _______________________________________
>

----------

Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue13358\>

MichaelBrooks · 2011-11-17T15:36:57Z

Has anyone else been able to verify this?

On Mon, Nov 7, 2011 at 7:46 AM, Michael Brooks <report@bugs.python.org>wrote:

Michael Brooks <firealwaysworks@gmail.com> added the comment:

This one should also have a priority change. Tested python 2.7.3

--MIke

On Sun, Nov 6, 2011 at 12:54 PM, Michael Brooks <report@bugs.python.org
>wrote:

>
> Michael Brooks <firealwaysworks@gmail.com> added the comment:
>
> Yes I am running python 2.7.2.
>
> On Sun, Nov 6, 2011 at 12:52 PM, Ezio Melotti <report@bugs.python.org
> >wrote:
>
> >
> > Ezio Melotti <ezio.melotti@gmail.com> added the comment:
> >
> > Have you tried with the latest 2.7? (see msg147170)
> >
> > ----------
> > nosy: +ezio.melotti
> > stage: -> test needed
> >
> > _______________________________________
> > Python tracker <report@bugs.python.org>
> > <http://bugs.python.org/issue13358\>
> > _______________________________________
> >
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue13358\>
> _______________________________________
>

----------

Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue13358\>

ezio-melotti · 2011-11-17T15:41:10Z

I'm working on it, but a minimal example seems to work fine.

(P.S. there's no need to quote the previous message(s) while replying)

ezio-melotti · 2011-11-17T16:24:24Z

It seems to me that the arguments are parsed correctly, but handle_data is called multiple time between handle_starttag and handle_endtag.
This might happen, e.g. in case the source lines are fed one by one to the parser, but in this case seems to happen whenever </ is found.
(The tests didn't detect this because they join the data to avoid buffer artifacts.)
I'm not sure if this can be considered a bug, but the situation can indeed be improved.

MichaelBrooks · 2011-11-17T18:01:06Z

Ok so until you fix this bug, i'll be overriding HTMLParser with my fix,
becuase this is a blocking issue for my project. My HTMLParser must behave
like a browser, period end of story.

Thanks.

On Thu, Nov 17, 2011 at 9:24 AM, Ezio Melotti <report@bugs.python.org>wrote:

Ezio Melotti <ezio.melotti@gmail.com> added the comment:

It seems to me that the arguments are parsed correctly, but handle_data is
called multiple time between handle_starttag and handle_endtag.
This might happen, e.g. in case the source lines are fed one by one to the
parser, but in this case seems to happen whenever </ is found.
(The tests didn't detect this because they join the data to avoid buffer
artifacts.)
I'm not sure if this can be considered a bug, but the situation can indeed
be improved.

----------

Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue13358\>

ezio-melotti · 2011-11-17T18:17:55Z

It already behaves like a browser, it just gives you data in chunks instead of calling handle_data() only once at the end. The documentation is not clear about this though. It says that feed() can be called several times, but it doesn't say that handle_data() (and possibly other methods) might get called more than once. This seems to always be the case while calling feed() several times.

MichaelBrooks · 2011-11-17T18:28:35Z

Oah, then there is a misunderstanding. No browser will parse the html
that is declared within a javascript variable, it must be treated as a
continues data segment (with cdata properties) until the exit
</\s*script\s*> is encountered (and if this tag found anywhere, even in a
quoted string it will still terminate this data segment, because its a
cdata element). The snip of html provided must only be a single data
segment. </ alone is not a proper terminator.

Thu, Nov 17, 2011 at 11:17 AM, Ezio Melotti <report@bugs.python.org> wrote:

Ezio Melotti <ezio.melotti@gmail.com> added the comment:

It already behaves like a browser, it just gives you data in chunks
instead of calling handle_data() only once at the end. The documentation
is not clear about this though. It says that feed() can be called several
times, but it doesn't say that handle_data() (and possibly other methods)
might get called more than once. This seems to always be the case while
calling feed() several times.

----------

Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue13358\>

ezio-melotti · 2011-11-18T10:19:13Z

Attached patch should solve the issue.

python-dev · 2011-11-18T16:03:23Z

New changeset 91163aa3d5b4 by Ezio Melotti in branch '2.7':
bpo-13358: HTMLParser now calls handle_data only once for each CDATA.
http://hg.python.org/cpython/rev/91163aa3d5b4

New changeset 0a32e7e3aa1f by Ezio Melotti in branch '3.2':
bpo-13358: HTMLParser now calls handle_data only once for each CDATA.
http://hg.python.org/cpython/rev/0a32e7e3aa1f

New changeset e12d2b9c88ef by Ezio Melotti in branch 'default':
bpo-13358: merge with 3.2.
http://hg.python.org/cpython/rev/e12d2b9c88ef

ezio-melotti · 2011-11-18T16:05:19Z

This should be fixed now, let me know if you find other problems with the parser.

MichaelBrooks mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Nov 6, 2011

ezio-melotti self-assigned this Nov 18, 2011

ezio-melotti closed this as completed Nov 18, 2011

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HTMLParser incorrectly handles cdata elements. #57567

HTMLParser incorrectly handles cdata elements. #57567

MichaelBrooks mannequin commented Nov 6, 2011

MichaelBrooks mannequin commented Nov 6, 2011

ezio-melotti commented Nov 6, 2011

MichaelBrooks mannequin commented Nov 6, 2011

MichaelBrooks mannequin commented Nov 7, 2011

MichaelBrooks mannequin commented Nov 17, 2011

ezio-melotti commented Nov 17, 2011

ezio-melotti commented Nov 17, 2011

MichaelBrooks mannequin commented Nov 17, 2011

ezio-melotti commented Nov 17, 2011

MichaelBrooks mannequin commented Nov 17, 2011

ezio-melotti commented Nov 18, 2011

python-dev mannequin commented Nov 18, 2011

ezio-melotti commented Nov 18, 2011

HTMLParser incorrectly handles cdata elements. #57567

HTMLParser incorrectly handles cdata elements. #57567

Comments

MichaelBrooks mannequin commented Nov 6, 2011

MichaelBrooks mannequin commented Nov 6, 2011

ezio-melotti commented Nov 6, 2011

MichaelBrooks mannequin commented Nov 6, 2011

MichaelBrooks mannequin commented Nov 7, 2011

MichaelBrooks mannequin commented Nov 17, 2011

ezio-melotti commented Nov 17, 2011

ezio-melotti commented Nov 17, 2011

MichaelBrooks mannequin commented Nov 17, 2011

ezio-melotti commented Nov 17, 2011

MichaelBrooks mannequin commented Nov 17, 2011

ezio-melotti commented Nov 18, 2011

python-dev mannequin commented Nov 18, 2011

ezio-melotti commented Nov 18, 2011