Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTMLParser incorrectly handles cdata elements. #57567

Closed
MichaelBrooks mannequin opened this issue Nov 6, 2011 · 13 comments
Closed

HTMLParser incorrectly handles cdata elements. #57567

MichaelBrooks mannequin opened this issue Nov 6, 2011 · 13 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@MichaelBrooks
Copy link
Mannequin

MichaelBrooks mannequin commented Nov 6, 2011

BPO 13358
Nosy @ezio-melotti
Files
  • issue13358.diff: patch against 2.7
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = <Date 2011-11-18.16:05:18.868>
    created_at = <Date 2011-11-06.19:48:24.759>
    labels = ['type-bug', 'library']
    title = 'HTMLParser incorrectly handles cdata elements.'
    updated_at = <Date 2011-11-18.16:05:18.867>
    user = 'https://bugs.python.org/MichaelBrooks'

    bugs.python.org fields:

    activity = <Date 2011-11-18.16:05:18.867>
    actor = 'ezio.melotti'
    assignee = 'ezio.melotti'
    closed = True
    closed_date = <Date 2011-11-18.16:05:18.868>
    closer = 'ezio.melotti'
    components = ['Library (Lib)']
    creation = <Date 2011-11-06.19:48:24.759>
    creator = 'Michael.Brooks'
    dependencies = []
    files = ['23721']
    hgrepos = []
    issue_num = 13358
    keywords = ['patch']
    message_count = 13.0
    messages = ['147175', '147176', '147178', '147223', '147806', '147807', '147809', '147816', '147818', '147819', '147846', '147883', '147884']
    nosy_count = 3.0
    nosy_names = ['ezio.melotti', 'Michael.Brooks', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue13358'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

    @MichaelBrooks
    Copy link
    Mannequin Author

    MichaelBrooks mannequin commented Nov 6, 2011

    The HTML tag at the bottom of this page correctly identified has having cdata like properties and trigger set_cdata_mode(). Due to the cdata properties of this tag, the only way to end the data segment is with a closing </script> tag, NO OTHER tag can close this data segment. Currently in cdata mode the HTMLParser will use this regular expression to close this script tag: re.compile(r'<(/|\Z)'), however this script tag is setting a variable with data that contains "</b>" which will terminate this script tag prematurely.

    I have written and tested the following patch on my system:
    #used to terminate cdata elements
    endtagfind_script = re.compile('(?i)</\s*script\s*>')
    endtagfind_style = re.compile('(?i)</\s*style\s*>')

    class html_patch(HTMLParser.HTMLParser):
        # Internal -- sets the proper tag terminator based on cdata element type
        def set_cdata_mode(self, tag):
            #We check if the script is either a style or a script
            #based on self.CDATA_CONTENT_ELEMENTS
            if tag=="style":
                self.interesting = endtagfind_style
            elif tag=="script":
                self.interesting = endtagfind_script
            else:
                self.error("Unknown cdata type:"+tag) # should never happen
            self.cdata_tag = tag 

    This cdata tag isn't parsed properly by HTMLParser, but it works fine in a browser:
    <script>
    pwa.setup(
    pwa.searchview,
    'lhid_searchheader',
    'lhid_content',
    'lhid_trayhandle',
    'lhid_tray',
    {'query': 'test',
    'tagQuery': '',
    'searchScope': '',
    'owner': '',
    'doCrowding': false,
    'isOwner': false,
    'albumId': ''
    ,'experimentalsearchquality': true},
    'firealwaysworks'
    ,
    {feedUrl: 'https://picasaweb.google.com/data/feed/tiny/all?alt=jsonm&amp;kind=photo&amp;access=public&amp;filter=1&amp;q=test',
    feedPreload: null},
    {NEW_HOMEPAGE:1,NEW_ONE_BAR:1,fr:1,tags:1,search:1,globalsearch:1,globalsearchpromo:1,newfeatureslink:1,cart:1,contentcaching:1,developerlink:1,payments:1,newStrings:1,cccquota:1,signups:1,flashSlideshow:1,URL_SHORTENER_VISIBILITY:1,emailupload:1,photopickeralbumview:1,PWA_NEWUI:1,WILDCARD_QUERY_FEED:1,recentphotos:1,editinpicasa:1,imagesearch:1,froptin:1,FR_CONTINUOUS_CLUSTERING:1,asyncUploads:1,PERFORMANCE_EXPERIMENTS:1,BAKED_PRELOAD_FEEDS:1,albumviewlimit:1,HQ_VIDEOS:1,VIDEO_INFO_DISPLAY:1,CSI:1,EXPERIMENTAL_SEARCH_QUALITY:1,COMMENT_TRANSLATION:1,NEW_COMMENT_STYLE:1,ENABLE_NEW_FLAG_ABUSE_FORM:1,QRCODE:1,CHINA:1,GWS_URL_REDIRECTION:1,FEATURED_PHOTOS:1,COMMENT_SUBSCRIPTION:1,COMMENT_SUBSCRIPTION_SETTING:1,PICASA_MAC:1,AD_ON_SEARCHPAGE:1,API_AUTO_ACCOUNTS:1,FOCUS_GROUP_ACL:1,PHOTOSTREAM:1,BACKEND_ACL:1,ADVANCED_SEARCH:1,FACE_SEARCH:1,CAMERA_SEARCH:1,NOTIFICATION:1,PIXELATED_PREVIEW:1,TRANSPARENT_PIXELATED_PREVIEW:1,NEW_SETTINGS_PAGE:1,VIEW_STARRERS:1,FR_FOCUS_MERGE:1,AD_ON_SEARCH_ONEUP:1,GALLERY_COMMENTS:1,COMMENT_ABUSE_BLOCKING:1,FAVORITE_NOTIFICATION:1,IMAGE_ONLY_LINK:1,RECENT_PHOTOS_SLIDESHOW:1,HEART:1,SMALLER_IMAGE:1,FAST_SLIDESHOW:1,VIEW_CONTACTS:1,COLLABORATIVE_ALBUMS:1,PRINT_MARKETPLACE:1,PRINT_MARKETPLACE_REPLACEMENT:1,VIEW_COUNT:1,POST_TO:1,GAPLUS:1,PICASA_PROMO:1,DOUBLECLICK_PREMIUM_ADS:1,DOUBLECLICK_EXPLORE_MAIN:1,DOUBLECLICK_MYPHOTOS:1,DOUBLECLICK_PUBLIC_GALLERY:1,DOUBLECLICK_USER_ALBUM:1,DOUBLECLICK_USER_PHOTO:1,DOUBLECLICK_VISITOR_ADS:1,PRODUCTION:1,NOSCRIPT:1,UNLISTED_GALLERY:1,GA_TRACKING:1,UNLIMITED_GALLERY:1,PICNIK_EDIT:1,MICROSCOPE_ZOOM:1,FR_V2:1,FAVORITE_SUGGESTION:1,FAVORITE_UPDATE:1,MERGED_PROFILES_SOFTLAUNCH:1,MERGED_PROFILES:1,MERGED_PROFILES_ASYNC:1,NEW_FR_UI:1,GAPLUS_UNMERGED_SOCIALIZATION:1,OPTOUT_ACL_NOTIFICATION:1,HTTPS_VISIBILITY:1,DEFAULT_HTTPS:1,EXTENDED_EXIF:1,DOUBLECLICK_MULTISLOT:1,ONEPICK:1,PER_ALBUM_GEO_VISIBILITY:1,FOCUS_MERGE_LINK_DIALOG_VISIBILITY:1,SHAREBOX_VISIBILITY:1,AUTO_DOWNSIZE:1,BULK_ALBUM_EDITOR_VISIBILITY:1,PROFILE_NAME_CHECK:1,COLLABORATIVE_NAMETAGS:1,NOT_FOUND_404:1,REDIRECT_TO_PLUS:1},
    {
    'gdataVersion': '4.0',
    'updateCartPath': '\x2Flh\x2FupdateCart?rtok=b8S9ibYqrTMF',
    'editCaptionsPath': '',
    'albumMapPath': '',
    'albumKmlUrl': '',
    'selectedPhotosPath': '\x2Flh\x2FselectedPhotos?tok=QUI1UGxRYk9fNmw1Q2tVeS1DWnY3UlFoTTY1RzRNNWphdzoxMzIwNjAyMzA3NDYx',
    'setLicensePath': '',
    'setStarPath': '\x2Flh\x2FsetStar?tok=QUI1UGxRWW4zY1ZKb3U0TzROZU5tUHhIV3hhRW9HcUYwQToxMzIwNjAyMzA3NDYx',
    'peopleManagerPath': '',
    'peopleSearchPath': '',
    'clusterViewPath': '',
    'frOptStatus': 'OptedIn',
    'isNameTagsVisible': '','authUserIsPhotosUser': true,
    'authUserNickname': 'Some Nickname',
    'authUserPortraitUrl': 'https:\x2F\x2Flh4.googleusercontent.com\x2F-UI9ZfIFfyQI\x2FAAAAAAAAAAI\x2FAAAAAAAAAAA\x2Fm0enLvZXYbI\x2Fs32-c\x2Ffirealwaysworks.jpg',
    'authUserProfileUrl':'https:\x2F\x2Fprofiles.google.com\x2F115162402406836485912', 'authUser':{name:'firealwaysworks',isProfileUser:1,isLoggedIn:1,user:1,isOwner:1
    ,'showGeo': 0
    },
    'foreignNickname': '',
    'subjects': [
    ]
    ,
    'owner': {name:'firealwaysworks',nickname:'Michael Brooks',portrait:'https:\x2F\x2Flh4.googleusercontent.com\x2F-UI9ZfIFfyQI\x2FAAAAAAAAAAI\x2FAAAAAAAAAAA\x2Fm0enLvZXYbI\x2Fs32-c\x2Ffirealwaysworks.jpg',link:'https:\x2F\x2Fpicasaweb.google.com\x2F115162402406836485912',defaultalbumnametagsvisibility:'2'
    ,isnewstorageuser:false},
    'album': {id:'',link:'',name:'',title:'',titleshort:'',kmlLink:'',mapLink:'',showGeo:0},
    'picnikHandlerUrl': 'https:\x2F\x2Fpicasaweb.google.com\x2Flh\x2FpicnikHandler?tok=QUI1UGxRYWJYZVVueTE2SjJJLTBjRUliWVlaaU42S0NRQToxMzIwNjAyMzA3NDYx',
    'picnikCloseUrl': 'https:\x2F\x2Fpicasaweb.google.com\x2Flh\x2FpicnikClose',
    'picnikLocale': 'en_US',
    'authUserIsProfileUser': true,
    'ownerIsEsUser': false,
    'profileServerUrl': '',
    'profileSignupUrl': 'https:\x2F\x2Fplus.google.com\x2Fup\x2F?type=st',
    'profileName': 'Google+'
    });
    function google_afs_request_done(google_ads) {
    var google_num_ads = google_ads.length;
    if (google_num_ads <= 0)
    return;

        var wideAds = '<table width="100%" cellspacing="0" cellpadding="8"' +
            'border="0" class="lhcl_search_ad_tbl"><tbody><tr>';
    for(var i = 0; i < google_num_ads; i++){
          if (google_ads[i].type=="text/wide"){
            wideAds += '<td>';
    wideAds += '<table width="100%" cellspacing="0" ' +
              'cellpadding="0" border="0"><tbody><tr><td>';
    wideAds+='<a href="' + google_ads[i].url + '" ' +
            'onmouseout="window.status=\'\';return true" ' +
            'onmouseover="window.status=\'go to ' +
            google_ads[i].visible_url + '\';return true" ' +
            'style="text-decoration:none">' +
    '<span style="text-decoration:underline">' +
    '<b>' + google_ads[i].line1 + '</b><br></span>' +
    '<span style="color:#000000">' +
    google_ads[i].line2 + '<br></span>' +
    '<span style="color:#008000">' +
    google_ads[i].visible_url + '</span></a><BR>';
    wideAds += '</td>';
    if (i == google_num_ads - 1) {
    wideAds += '<td valign="top">' +
    '<div id="lhid_search_ad_spl">Sponsored Links</div></td>';
    }
    wideAds += '</tr></tbody></table>';
    wideAds += '</td>';
    }
    }
    wideAds += '</tr></tbody></table>';
    document.getElementById("lhid_search_ad_unit").innerHTML = wideAds;
    }
    google_afs_query = 'test';
    google_afs_ad = 'w3'; // specify the number of ads you are requesting
    google_afs_client = 'google-picasa_js'; // substitute your client ID
    // google_afs_channel = ''; // enter your comma-separated channel IDs
    google_afs_ie = 'utf8'; // select input encoding scheme
    google_afs_oe = 'utf8'; // select output encoding scheme
    google_afs_adsafe = 'high'; // specify level for filtering non-family-safe ads
    google_afs_adtest = 'off'; // ** set parameter to off before launch to production
    document.write('<' +
        'script src="https://www.google.com/afsonline/show_afs_ads.js"></' +
        'script>');
    </script>

    @MichaelBrooks MichaelBrooks mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Nov 6, 2011
    @ezio-melotti
    Copy link
    Member

    Have you tried with the latest 2.7? (see msg147170)

    @MichaelBrooks
    Copy link
    Mannequin Author

    MichaelBrooks mannequin commented Nov 6, 2011

    Yes I am running python 2.7.2.

    On Sun, Nov 6, 2011 at 12:52 PM, Ezio Melotti <report@bugs.python.org>wrote:

    Ezio Melotti <ezio.melotti@gmail.com> added the comment:

    Have you tried with the latest 2.7? (see msg147170)

    ----------
    nosy: +ezio.melotti
    stage: -> test needed


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue13358\>


    @MichaelBrooks
    Copy link
    Mannequin Author

    MichaelBrooks mannequin commented Nov 7, 2011

    This one should also have a priority change. Tested python 2.7.3

    --MIke

    On Sun, Nov 6, 2011 at 12:54 PM, Michael Brooks <report@bugs.python.org>wrote:

    Michael Brooks <firealwaysworks@gmail.com> added the comment:

    Yes I am running python 2.7.2.

    On Sun, Nov 6, 2011 at 12:52 PM, Ezio Melotti <report@bugs.python.org
    >wrote:

    >
    > Ezio Melotti <ezio.melotti@gmail.com> added the comment:
    >
    > Have you tried with the latest 2.7? (see msg147170)
    >
    > ----------
    > nosy: +ezio.melotti
    > stage: -> test needed
    >
    > _______________________________________
    > Python tracker <report@bugs.python.org>
    > <http://bugs.python.org/issue13358\>
    > _______________________________________
    >

    ----------


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue13358\>


    @MichaelBrooks
    Copy link
    Mannequin Author

    MichaelBrooks mannequin commented Nov 17, 2011

    Has anyone else been able to verify this?

    On Mon, Nov 7, 2011 at 7:46 AM, Michael Brooks <report@bugs.python.org>wrote:

    Michael Brooks <firealwaysworks@gmail.com> added the comment:

    This one should also have a priority change. Tested python 2.7.3

    --MIke

    On Sun, Nov 6, 2011 at 12:54 PM, Michael Brooks <report@bugs.python.org
    >wrote:

    >
    > Michael Brooks <firealwaysworks@gmail.com> added the comment:
    >
    > Yes I am running python 2.7.2.
    >
    > On Sun, Nov 6, 2011 at 12:52 PM, Ezio Melotti <report@bugs.python.org
    > >wrote:
    >
    > >
    > > Ezio Melotti <ezio.melotti@gmail.com> added the comment:
    > >
    > > Have you tried with the latest 2.7? (see msg147170)
    > >
    > > ----------
    > > nosy: +ezio.melotti
    > > stage: -> test needed
    > >
    > > _______________________________________
    > > Python tracker <report@bugs.python.org>
    > > <http://bugs.python.org/issue13358\>
    > > _______________________________________
    > >
    >
    > ----------
    >
    > _______________________________________
    > Python tracker <report@bugs.python.org>
    > <http://bugs.python.org/issue13358\>
    > _______________________________________
    >

    ----------


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue13358\>


    @ezio-melotti
    Copy link
    Member

    I'm working on it, but a minimal example seems to work fine.

    (P.S. there's no need to quote the previous message(s) while replying)

    @ezio-melotti
    Copy link
    Member

    It seems to me that the arguments are parsed correctly, but handle_data is called multiple time between handle_starttag and handle_endtag.
    This might happen, e.g. in case the source lines are fed one by one to the parser, but in this case seems to happen whenever </ is found.
    (The tests didn't detect this because they join the data to avoid buffer artifacts.)
    I'm not sure if this can be considered a bug, but the situation can indeed be improved.

    @MichaelBrooks
    Copy link
    Mannequin Author

    MichaelBrooks mannequin commented Nov 17, 2011

    Ok so until you fix this bug, i'll be overriding HTMLParser with my fix,
    becuase this is a blocking issue for my project. My HTMLParser must behave
    like a browser, period end of story.

    Thanks.

    On Thu, Nov 17, 2011 at 9:24 AM, Ezio Melotti <report@bugs.python.org>wrote:

    Ezio Melotti <ezio.melotti@gmail.com> added the comment:

    It seems to me that the arguments are parsed correctly, but handle_data is
    called multiple time between handle_starttag and handle_endtag.
    This might happen, e.g. in case the source lines are fed one by one to the
    parser, but in this case seems to happen whenever </ is found.
    (The tests didn't detect this because they join the data to avoid buffer
    artifacts.)
    I'm not sure if this can be considered a bug, but the situation can indeed
    be improved.

    ----------


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue13358\>


    @ezio-melotti
    Copy link
    Member

    It already behaves like a browser, it just gives you data in chunks instead of calling handle_data() only once at the end. The documentation is not clear about this though. It says that feed() can be called several times, but it doesn't say that handle_data() (and possibly other methods) might get called more than once. This seems to always be the case while calling feed() several times.

    @MichaelBrooks
    Copy link
    Mannequin Author

    MichaelBrooks mannequin commented Nov 17, 2011

    Oah, then there is a misunderstanding. No browser will parse the html
    that is declared within a javascript variable, it must be treated as a
    continues data segment (with cdata properties) until the exit
    </\s*script\s*> is encountered (and if this tag found anywhere, even in a
    quoted string it will still terminate this data segment, because its a
    cdata element). The snip of html provided must only be a single data
    segment. </ alone is not a proper terminator.

    Thu, Nov 17, 2011 at 11:17 AM, Ezio Melotti <report@bugs.python.org> wrote:

    Ezio Melotti <ezio.melotti@gmail.com> added the comment:

    It already behaves like a browser, it just gives you data in chunks
    instead of calling handle_data() only once at the end. The documentation
    is not clear about this though. It says that feed() can be called several
    times, but it doesn't say that handle_data() (and possibly other methods)
    might get called more than once. This seems to always be the case while
    calling feed() several times.

    ----------


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue13358\>


    @ezio-melotti
    Copy link
    Member

    Attached patch should solve the issue.

    @ezio-melotti ezio-melotti self-assigned this Nov 18, 2011
    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 18, 2011

    New changeset 91163aa3d5b4 by Ezio Melotti in branch '2.7':
    bpo-13358: HTMLParser now calls handle_data only once for each CDATA.
    http://hg.python.org/cpython/rev/91163aa3d5b4

    New changeset 0a32e7e3aa1f by Ezio Melotti in branch '3.2':
    bpo-13358: HTMLParser now calls handle_data only once for each CDATA.
    http://hg.python.org/cpython/rev/0a32e7e3aa1f

    New changeset e12d2b9c88ef by Ezio Melotti in branch 'default':
    bpo-13358: merge with 3.2.
    http://hg.python.org/cpython/rev/e12d2b9c88ef

    @ezio-melotti
    Copy link
    Member

    This should be fixed now, let me know if you find other problems with the parser.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    1 participant