Issue13358
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2011-11-06 19:48 by Michael.Brooks, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Files | ||||
---|---|---|---|---|
File name | Uploaded | Description | Edit | |
issue13358.diff | ezio.melotti, 2011-11-18 10:19 | patch against 2.7 |
Messages (13) | |||
---|---|---|---|
msg147175 - (view) | Author: Michael Brooks (Michael.Brooks) | Date: 2011-11-06 19:48 | |
The HTML tag at the bottom of this page correctly identified has having cdata like properties and trigger set_cdata_mode(). Due to the cdata properties of this tag, the only way to end the data segment is with a closing </script> tag, NO OTHER tag can close this data segment. Currently in cdata mode the HTMLParser will use this regular expression to close this script tag: re.compile(r'<(/|\Z)'), however this script tag is setting a variable with data that contains "</b>" which will terminate this script tag prematurely. I have written and tested the following patch on my system: #used to terminate cdata elements endtagfind_script = re.compile('(?i)</\s*script\s*>') endtagfind_style = re.compile('(?i)</\s*style\s*>') class html_patch(HTMLParser.HTMLParser): # Internal -- sets the proper tag terminator based on cdata element type def set_cdata_mode(self, tag): #We check if the script is either a style or a script #based on self.CDATA_CONTENT_ELEMENTS if tag=="style": self.interesting = endtagfind_style elif tag=="script": self.interesting = endtagfind_script else: self.error("Unknown cdata type:"+tag) # should never happen self.cdata_tag = tag This cdata tag isn't parsed properly by HTMLParser, but it works fine in a browser: <script> pwa.setup( pwa.searchview, 'lhid_searchheader', 'lhid_content', 'lhid_trayhandle', 'lhid_tray', {'query': 'test', 'tagQuery': '', 'searchScope': '', 'owner': '', 'doCrowding': false, 'isOwner': false, 'albumId': '' ,'experimentalsearchquality': true}, 'firealwaysworks' , {feedUrl: 'https://picasaweb.google.com/data/feed/tiny/all?alt=jsonm&kind=photo&access=public&filter=1&q=test', feedPreload: null}, {NEW_HOMEPAGE:1,NEW_ONE_BAR:1,fr:1,tags:1,search:1,globalsearch:1,globalsearchpromo:1,newfeatureslink:1,cart:1,contentcaching:1,developerlink:1,payments:1,newStrings:1,cccquota:1,signups:1,flashSlideshow:1,URL_SHORTENER_VISIBILITY:1,emailupload:1,photopickeralbumview:1,PWA_NEWUI:1,WILDCARD_QUERY_FEED:1,recentphotos:1,editinpicasa:1,imagesearch:1,froptin:1,FR_CONTINUOUS_CLUSTERING:1,asyncUploads:1,PERFORMANCE_EXPERIMENTS:1,BAKED_PRELOAD_FEEDS:1,albumviewlimit:1,HQ_VIDEOS:1,VIDEO_INFO_DISPLAY:1,CSI:1,EXPERIMENTAL_SEARCH_QUALITY:1,COMMENT_TRANSLATION:1,NEW_COMMENT_STYLE:1,ENABLE_NEW_FLAG_ABUSE_FORM:1,QRCODE:1,CHINA:1,GWS_URL_REDIRECTION:1,FEATURED_PHOTOS:1,COMMENT_SUBSCRIPTION:1,COMMENT_SUBSCRIPTION_SETTING:1,PICASA_MAC:1,AD_ON_SEARCHPAGE:1,API_AUTO_ACCOUNTS:1,FOCUS_GROUP_ACL:1,PHOTOSTREAM:1,BACKEND_ACL:1,ADVANCED_SEARCH:1,FACE_SEARCH:1,CAMERA_SEARCH:1,NOTIFICATION:1,PIXELATED_PREVIEW:1,TRANSPARENT_PIXELATED_PREVIEW:1,NEW_SETTINGS_PAGE:1,VIEW_STARRERS:1,FR_FOCUS_MERGE:1,AD_ON_SEARCH_ONEUP:1,GALLERY_COMMENTS:1,COMMENT_ABUSE_BLOCKING:1,FAVORITE_NOTIFICATION:1,IMAGE_ONLY_LINK:1,RECENT_PHOTOS_SLIDESHOW:1,HEART:1,SMALLER_IMAGE:1,FAST_SLIDESHOW:1,VIEW_CONTACTS:1,COLLABORATIVE_ALBUMS:1,PRINT_MARKETPLACE:1,PRINT_MARKETPLACE_REPLACEMENT:1,VIEW_COUNT:1,POST_TO:1,GAPLUS:1,PICASA_PROMO:1,DOUBLECLICK_PREMIUM_ADS:1,DOUBLECLICK_EXPLORE_MAIN:1,DOUBLECLICK_MYPHOTOS:1,DOUBLECLICK_PUBLIC_GALLERY:1,DOUBLECLICK_USER_ALBUM:1,DOUBLECLICK_USER_PHOTO:1,DOUBLECLICK_VISITOR_ADS:1,PRODUCTION:1,NOSCRIPT:1,UNLISTED_GALLERY:1,GA_TRACKING:1,UNLIMITED_GALLERY:1,PICNIK_EDIT:1,MICROSCOPE_ZOOM:1,FR_V2:1,FAVORITE_SUGGESTION:1,FAVORITE_UPDATE:1,MERGED_PROFILES_SOFTLAUNCH:1,MERGED_PROFILES:1,MERGED_PROFILES_ASYNC:1,NEW_FR_UI:1,GAPLUS_UNMERGED_SOCIALIZATION:1,OPTOUT_ACL_NOTIFICATION:1,HTTPS_VISIBILITY:1,DEFAULT_HTTPS:1,EXTENDED_EXIF:1,DOUBLECLICK_MULTISLOT:1,ONEPICK:1,PER_ALBUM_GEO_VISIBILITY:1,FOCUS_MERGE_LINK_DIALOG_VISIBILITY:1,SHAREBOX_VISIBILITY:1,AUTO_DOWNSIZE:1,BULK_ALBUM_EDITOR_VISIBILITY:1,PROFILE_NAME_CHECK:1,COLLABORATIVE_NAMETAGS:1,NOT_FOUND_404:1,REDIRECT_TO_PLUS:1}, { 'gdataVersion': '4.0', 'updateCartPath': '\x2Flh\x2FupdateCart?rtok=b8S9ibYqrTMF', 'editCaptionsPath': '', 'albumMapPath': '', 'albumKmlUrl': '', 'selectedPhotosPath': '\x2Flh\x2FselectedPhotos?tok=QUI1UGxRYk9fNmw1Q2tVeS1DWnY3UlFoTTY1RzRNNWphdzoxMzIwNjAyMzA3NDYx', 'setLicensePath': '', 'setStarPath': '\x2Flh\x2FsetStar?tok=QUI1UGxRWW4zY1ZKb3U0TzROZU5tUHhIV3hhRW9HcUYwQToxMzIwNjAyMzA3NDYx', 'peopleManagerPath': '', 'peopleSearchPath': '', 'clusterViewPath': '', 'frOptStatus': 'OptedIn', 'isNameTagsVisible': '','authUserIsPhotosUser': true, 'authUserNickname': 'Some Nickname', 'authUserPortraitUrl': 'https:\x2F\x2Flh4.googleusercontent.com\x2F-UI9ZfIFfyQI\x2FAAAAAAAAAAI\x2FAAAAAAAAAAA\x2Fm0enLvZXYbI\x2Fs32-c\x2Ffirealwaysworks.jpg', 'authUserProfileUrl':'https:\x2F\x2Fprofiles.google.com\x2F115162402406836485912', 'authUser':{name:'firealwaysworks',isProfileUser:1,isLoggedIn:1,user:1,isOwner:1 ,'showGeo': 0 }, 'foreignNickname': '', 'subjects': [ ] , 'owner': {name:'firealwaysworks',nickname:'Michael Brooks',portrait:'https:\x2F\x2Flh4.googleusercontent.com\x2F-UI9ZfIFfyQI\x2FAAAAAAAAAAI\x2FAAAAAAAAAAA\x2Fm0enLvZXYbI\x2Fs32-c\x2Ffirealwaysworks.jpg',link:'https:\x2F\x2Fpicasaweb.google.com\x2F115162402406836485912',defaultalbumnametagsvisibility:'2' ,isnewstorageuser:false}, 'album': {id:'',link:'',name:'',title:'',titleshort:'',kmlLink:'',mapLink:'',showGeo:0}, 'picnikHandlerUrl': 'https:\x2F\x2Fpicasaweb.google.com\x2Flh\x2FpicnikHandler?tok=QUI1UGxRYWJYZVVueTE2SjJJLTBjRUliWVlaaU42S0NRQToxMzIwNjAyMzA3NDYx', 'picnikCloseUrl': 'https:\x2F\x2Fpicasaweb.google.com\x2Flh\x2FpicnikClose', 'picnikLocale': 'en_US', 'authUserIsProfileUser': true, 'ownerIsEsUser': false, 'profileServerUrl': '', 'profileSignupUrl': 'https:\x2F\x2Fplus.google.com\x2Fup\x2F?type=st', 'profileName': 'Google+' }); function google_afs_request_done(google_ads) { var google_num_ads = google_ads.length; if (google_num_ads <= 0) return; var wideAds = '<table width="100%" cellspacing="0" cellpadding="8"' + 'border="0" class="lhcl_search_ad_tbl"><tbody><tr>'; for(var i = 0; i < google_num_ads; i++){ if (google_ads[i].type=="text/wide"){ wideAds += '<td>'; wideAds += '<table width="100%" cellspacing="0" ' + 'cellpadding="0" border="0"><tbody><tr><td>'; wideAds+='<a href="' + google_ads[i].url + '" ' + 'onmouseout="window.status=\'\';return true" ' + 'onmouseover="window.status=\'go to ' + google_ads[i].visible_url + '\';return true" ' + 'style="text-decoration:none">' + '<span style="text-decoration:underline">' + '<b>' + google_ads[i].line1 + '</b><br></span>' + '<span style="color:#000000">' + google_ads[i].line2 + '<br></span>' + '<span style="color:#008000">' + google_ads[i].visible_url + '</span></a><BR>'; wideAds += '</td>'; if (i == google_num_ads - 1) { wideAds += '<td valign="top">' + '<div id="lhid_search_ad_spl">Sponsored Links</div></td>'; } wideAds += '</tr></tbody></table>'; wideAds += '</td>'; } } wideAds += '</tr></tbody></table>'; document.getElementById("lhid_search_ad_unit").innerHTML = wideAds; } google_afs_query = 'test'; google_afs_ad = 'w3'; // specify the number of ads you are requesting google_afs_client = 'google-picasa_js'; // substitute your client ID // google_afs_channel = ''; // enter your comma-separated channel IDs google_afs_ie = 'utf8'; // select input encoding scheme google_afs_oe = 'utf8'; // select output encoding scheme google_afs_adsafe = 'high'; // specify level for filtering non-family-safe ads google_afs_adtest = 'off'; // ** set parameter to off before launch to production document.write('<' + 'script src="https://www.google.com/afsonline/show_afs_ads.js"></' + 'script>'); </script> |
|||
msg147176 - (view) | Author: Ezio Melotti (ezio.melotti) * ![]() |
Date: 2011-11-06 19:52 | |
Have you tried with the latest 2.7? (see msg147170) |
|||
msg147178 - (view) | Author: Michael Brooks (Michael.Brooks) | Date: 2011-11-06 19:54 | |
Yes I am running python 2.7.2. On Sun, Nov 6, 2011 at 12:52 PM, Ezio Melotti <report@bugs.python.org>wrote: > > Ezio Melotti <ezio.melotti@gmail.com> added the comment: > > Have you tried with the latest 2.7? (see msg147170) > > ---------- > nosy: +ezio.melotti > stage: -> test needed > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue13358> > _______________________________________ > |
|||
msg147223 - (view) | Author: Michael Brooks (Michael.Brooks) | Date: 2011-11-07 14:46 | |
This one should also have a priority change. Tested python 2.7.3 --MIke On Sun, Nov 6, 2011 at 12:54 PM, Michael Brooks <report@bugs.python.org>wrote: > > Michael Brooks <firealwaysworks@gmail.com> added the comment: > > Yes I am running python 2.7.2. > > On Sun, Nov 6, 2011 at 12:52 PM, Ezio Melotti <report@bugs.python.org > >wrote: > > > > > Ezio Melotti <ezio.melotti@gmail.com> added the comment: > > > > Have you tried with the latest 2.7? (see msg147170) > > > > ---------- > > nosy: +ezio.melotti > > stage: -> test needed > > > > _______________________________________ > > Python tracker <report@bugs.python.org> > > <http://bugs.python.org/issue13358> > > _______________________________________ > > > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue13358> > _______________________________________ > |
|||
msg147806 - (view) | Author: Michael Brooks (Michael.Brooks) | Date: 2011-11-17 15:36 | |
Has anyone else been able to verify this? On Mon, Nov 7, 2011 at 7:46 AM, Michael Brooks <report@bugs.python.org>wrote: > > Michael Brooks <firealwaysworks@gmail.com> added the comment: > > This one should also have a priority change. Tested python 2.7.3 > > --MIke > > On Sun, Nov 6, 2011 at 12:54 PM, Michael Brooks <report@bugs.python.org > >wrote: > > > > > Michael Brooks <firealwaysworks@gmail.com> added the comment: > > > > Yes I am running python 2.7.2. > > > > On Sun, Nov 6, 2011 at 12:52 PM, Ezio Melotti <report@bugs.python.org > > >wrote: > > > > > > > > Ezio Melotti <ezio.melotti@gmail.com> added the comment: > > > > > > Have you tried with the latest 2.7? (see msg147170) > > > > > > ---------- > > > nosy: +ezio.melotti > > > stage: -> test needed > > > > > > _______________________________________ > > > Python tracker <report@bugs.python.org> > > > <http://bugs.python.org/issue13358> > > > _______________________________________ > > > > > > > ---------- > > > > _______________________________________ > > Python tracker <report@bugs.python.org> > > <http://bugs.python.org/issue13358> > > _______________________________________ > > > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue13358> > _______________________________________ > |
|||
msg147807 - (view) | Author: Ezio Melotti (ezio.melotti) * ![]() |
Date: 2011-11-17 15:41 | |
I'm working on it, but a minimal example seems to work fine. (P.S. there's no need to quote the previous message(s) while replying) |
|||
msg147809 - (view) | Author: Ezio Melotti (ezio.melotti) * ![]() |
Date: 2011-11-17 16:24 | |
It seems to me that the arguments are parsed correctly, but handle_data is called multiple time between handle_starttag and handle_endtag. This might happen, e.g. in case the source lines are fed one by one to the parser, but in this case seems to happen whenever </ is found. (The tests didn't detect this because they join the data to avoid buffer artifacts.) I'm not sure if this can be considered a bug, but the situation can indeed be improved. |
|||
msg147816 - (view) | Author: Michael Brooks (Michael.Brooks) | Date: 2011-11-17 18:01 | |
Ok so until you fix this bug, i'll be overriding HTMLParser with my fix, becuase this is a blocking issue for my project. My HTMLParser must behave like a browser, period end of story. Thanks. On Thu, Nov 17, 2011 at 9:24 AM, Ezio Melotti <report@bugs.python.org>wrote: > > Ezio Melotti <ezio.melotti@gmail.com> added the comment: > > It seems to me that the arguments are parsed correctly, but handle_data is > called multiple time between handle_starttag and handle_endtag. > This might happen, e.g. in case the source lines are fed one by one to the > parser, but in this case seems to happen whenever </ is found. > (The tests didn't detect this because they join the data to avoid buffer > artifacts.) > I'm not sure if this can be considered a bug, but the situation can indeed > be improved. > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue13358> > _______________________________________ > |
|||
msg147818 - (view) | Author: Ezio Melotti (ezio.melotti) * ![]() |
Date: 2011-11-17 18:17 | |
It already behaves like a browser, it just gives you data in chunks instead of calling handle_data() only once at the end. The documentation is not clear about this though. It says that feed() can be called several times, but it doesn't say that handle_data() (and possibly other methods) might get called more than once. This seems to always be the case while calling feed() several times. |
|||
msg147819 - (view) | Author: Michael Brooks (Michael.Brooks) | Date: 2011-11-17 18:28 | |
Oah, then there is a misunderstanding. No browser will parse the html that is declared within a javascript variable, it must be treated as a continues data segment (with cdata properties) until the exit </\s*script\s*> is encountered (and if this tag found anywhere, even in a quoted string it will still terminate this data segment, because its a cdata element). The snip of html provided must only be a single data segment. </ alone is not a proper terminator. Thu, Nov 17, 2011 at 11:17 AM, Ezio Melotti <report@bugs.python.org> wrote: > > Ezio Melotti <ezio.melotti@gmail.com> added the comment: > > It already behaves like a browser, it just gives you data in chunks > instead of calling handle_data() only once at the end. The documentation > is not clear about this though. It says that feed() can be called several > times, but it doesn't say that handle_data() (and possibly other methods) > might get called more than once. This seems to always be the case while > calling feed() several times. > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue13358> > _______________________________________ > |
|||
msg147846 - (view) | Author: Ezio Melotti (ezio.melotti) * ![]() |
Date: 2011-11-18 10:19 | |
Attached patch should solve the issue. |
|||
msg147883 - (view) | Author: Roundup Robot (python-dev) ![]() |
Date: 2011-11-18 16:03 | |
New changeset 91163aa3d5b4 by Ezio Melotti in branch '2.7': #13358: HTMLParser now calls handle_data only once for each CDATA. http://hg.python.org/cpython/rev/91163aa3d5b4 New changeset 0a32e7e3aa1f by Ezio Melotti in branch '3.2': #13358: HTMLParser now calls handle_data only once for each CDATA. http://hg.python.org/cpython/rev/0a32e7e3aa1f New changeset e12d2b9c88ef by Ezio Melotti in branch 'default': #13358: merge with 3.2. http://hg.python.org/cpython/rev/e12d2b9c88ef |
|||
msg147884 - (view) | Author: Ezio Melotti (ezio.melotti) * ![]() |
Date: 2011-11-18 16:05 | |
This should be fixed now, let me know if you find other problems with the parser. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:57:23 | admin | set | github: 57567 |
2011-11-18 16:05:18 | ezio.melotti | set | status: open -> closed resolution: fixed messages: + msg147884 stage: commit review -> resolved |
2011-11-18 16:03:23 | python-dev | set | nosy:
+ python-dev messages: + msg147883 |
2011-11-18 10:19:14 | ezio.melotti | set | files:
+ issue13358.diff versions: + Python 3.2, Python 3.3 messages: + msg147846 assignee: ezio.melotti keywords: + patch stage: test needed -> commit review |
2011-11-17 18:28:35 | Michael.Brooks | set | messages: + msg147819 |
2011-11-17 18:17:55 | ezio.melotti | set | messages: + msg147818 |
2011-11-17 18:01:06 | Michael.Brooks | set | messages: + msg147816 |
2011-11-17 16:24:24 | ezio.melotti | set | messages: + msg147809 |
2011-11-17 15:41:10 | ezio.melotti | set | messages: + msg147807 |
2011-11-17 15:36:57 | Michael.Brooks | set | messages: + msg147806 |
2011-11-07 14:46:26 | Michael.Brooks | set | messages: + msg147223 |
2011-11-06 19:54:37 | Michael.Brooks | set | messages: + msg147178 |
2011-11-06 19:52:00 | ezio.melotti | set | nosy:
+ ezio.melotti messages: + msg147176 stage: test needed |
2011-11-06 19:48:45 | Michael.Brooks | set | type: behavior |
2011-11-06 19:48:24 | Michael.Brooks | create |