
classification
Title: SGMLParser processing <tr> which include two <a> will have problem
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 2.7
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, moonflow
Priority: normal
Keywords:

Created on 2012-11-20 14:09 by moonflow, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
Uploaded                        Description
moonflow, 2012-11-20 14:09      test python file
ezio.melotti, 2012-11-20 14:25
Messages (5)
msg175990 - Author: moonflow (moonflow) Date: 2012-11-20 14:09
If a <tr> includes two or more <a> tags, SGMLParser has a problem processing it.

For example:
    <td align="center" valign="top" nowrap>
    <script language="Javascript">
      if ( 4 == 4 ) document.write("<strong class=\"Critical small\">Critical</strong>");
      if ( 4 == 3 ) document.write("<strong class=\"High small\">High</strong>");
      if ( 4 == 2 ) document.write("<strong class=\"Medium small\">Medium</strong>");
      if ( 4 == 1 ) document.write("<strong class=\"Low small\">Low</strong>");
    <td valign="top" align="center" nowrap>
    <small><script type="text/javascript">document.write(FormatDate("%d-%b-%y", "2012", "11", "18"));</script></small>
    <td valign="top" align="center" nowrap><small>
    <a title="CPAI-2012-809" style="text-transform:uppercase" href="2012/cpai-08-nov.html">
    <td valign="top" nowrap align="center"><small>
    <a target="_blank" href="">CVE-2011-2089</a><br /></small>
    <td valign="top"><small>SCADA ICONICS WebHMI ActiveX Stack Overflow (2011-2089)</small></td>

def start_a(self, attrs):
        if self.is_td:       
            cve_href = [v for k, v in attrs if k == "target" and v == "_blank"]
            if cve_href:
                self.is_a = True
                self.is_cve = True

            # SGMLParser may have a bug: a <tr> that has two <a> tags causes a problem
            vul_href = [v for k, v in attrs if k == "style"]
            print vul_href
            if vul_href:
                vul_href = "".join([v for k, v in attrs if k == "href"])
                if vul_href.find("cve") == -1:
                    self.href_name = vul_href     
                self.href_name = ""

The print vul_href statement here prints nothing. Is that expected?
msg175992 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2012-11-20 14:12
Have you tried with HTMLParser?
sgmllib is deprecated and has been removed in Python 3.
HTMLParser is also much better at parsing (broken) HTML.
msg175993 - Author: moonflow (moonflow) Date: 2012-11-20 14:18
I haven't tried it. Would the problem go away with HTMLParser?
msg175994 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2012-11-20 14:25
If what you are trying to do is extract the link(s) that contain 'cve', you can try the attached script.
msg175995 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2012-11-20 14:43
Sorry, I misread your code; it looks like you want the href *without* 'cve'.
In that case, change my code to use "'cve' not in attrs['href']" (also avoid using "s.find('cve') == -1", and use the more readable and idiomatic "'cve' not in s").
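
A minimal sketch of the idiom mentioned above. The first href is taken from the HTML in the report; the second is a made-up value for illustration, and the variable names are hypothetical. Note that the test is case-sensitive, so it only matches a lowercase 'cve'.

```python
# The first href comes from the HTML in the report; the second is a
# made-up example used only to show a value that contains 'cve'.
hrefs = ["2012/cpai-08-nov.html", "http://example.com/cve/CVE-2011-2089"]

for s in hrefs:
    # str.find returns -1 when the substring is absent...
    by_find = s.find('cve') == -1
    # ...and the 'not in' operator expresses the same test directly.
    by_in = 'cve' not in s
    assert by_find == by_in

# Keep only the hrefs that do not contain 'cve'.
no_cve = [s for s in hrefs if 'cve' not in s]
print(no_cve)
```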

I think your original script doesn't work for two reasons:
1) you are looking for a table with class="tablesorter", but in the HTML the table doesn't have that class, so self.is_table is never set to True;
2) you are finding the href of the <a> with a "style" attribute and correctly setting it to self.href_name, but the value is then replaced by "" when the following <a> without "style" is found.
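
The second point can be reproduced with a reduced sketch. This is not the original script or the attached one: the class names OverwritingParser and KeepingParser are hypothetical, and the code uses Python 3's html.parser (in Python 2.7 the module is named HTMLParser) rather than sgmllib. The first parser resets href_name for every <a> without "style"; the second simply leaves the stored value alone.

```python
from html.parser import HTMLParser  # Python 3 module name


class OverwritingParser(HTMLParser):
    """Hypothetical reduction of the pattern described in point 2."""

    def __init__(self):
        super().__init__()
        self.href_name = ""

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if "style" in attrs:
            self.href_name = attrs.get("href") or ""
        else:
            # A later <a> without "style" wipes out the value found earlier.
            self.href_name = ""


class KeepingParser(OverwritingParser):
    """Same logic, but an <a> without "style" leaves href_name untouched."""

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if "style" in attrs:
            self.href_name = attrs.get("href") or ""


# Two <a> tags in a row, modelled on the HTML in the report.
html = ('<a title="CPAI-2012-809" style="text-transform:uppercase" '
        'href="2012/cpai-08-nov.html">'
        '<a target="_blank" href="">CVE-2011-2089</a>')

overwriting = OverwritingParser()
overwriting.feed(html)
keeping = KeepingParser()
keeping.feed(html)

print(repr(overwriting.href_name), repr(keeping.href_name))
```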

That said, I still suggest that you abandon sgmllib and use HTMLParser, or possibly an external module like BeautifulSoup or lxml.
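
As a rough illustration of the HTMLParser approach (not the attached script, which is not reproduced here), the sketch below collects the href of every <a> whose href is non-empty and does not contain 'cve'. It assumes Python 3, where the module is html.parser; the class name LinkCollector and the sample string are illustrative, with the sample adapted from the HTML in the report.

```python
from html.parser import HTMLParser  # Python 2.7: the module is named HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose href does not contain 'cve'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value is None for
        # attributes without a value, such as "nowrap".
        href = dict(attrs).get("href") or ""
        if href and "cve" not in href:
            self.links.append(href)


# Fragment adapted from the report; note the unclosed <a> and <td> tags.
sample = """
<td valign="top" align="center" nowrap><small>
<a title="CPAI-2012-809" style="text-transform:uppercase" href="2012/cpai-08-nov.html">
<td valign="top" nowrap align="center"><small>
<a target="_blank" href="">CVE-2011-2089</a><br /></small>
"""

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

Appending to a list instead of assigning to a single attribute means a later <a> cannot overwrite an earlier match.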
Date                 User          Action  Args
2022-04-11 14:57:38  admin         set     github: 60717
2012-11-20 14:43:52  ezio.melotti  set     status: open -> closed; resolution: not a bug; messages: + msg175995; stage: resolved
2012-11-20 14:25:52  ezio.melotti  set     files: +; messages: + msg175994
2012-11-20 14:18:07  moonflow      set     messages: + msg175993
2012-11-20 14:12:18  ezio.melotti  set     nosy: + ezio.melotti; messages: + msg175992
2012-11-20 14:09:36  moonflow      create