classification
Title: HTMLParser attribute parsing bug
Type: enhancement
Stage: test needed
Components: Library (Lib)
Versions: Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: fdrake
Nosy List: BreamoreBoy, calvin, fdrake, r.david.murray, smroid, titus
Priority: normal
Keywords: easy

Created on 2003-02-10 14:57 by fdrake, last changed 2010-08-18 13:15 by BreamoreBoy. This issue is now closed.

Messages (6)
msg60305 - Author: Fred Drake (fdrake) (Python committer) Date: 2003-02-10 14:57
HTMLParser (reportedly) fails to parse this construct:

<a href="http://ss"title="pe">P</a>

(Note that a required space between the two attributes
of the "a" tag has been omitted.) The W3C validator
apparently treats this differently, so there's no
point in arguing the letter of the law.

Assigned to me.
msg60306 - Author: Bastian Kleineidam (calvin) Date: 2003-03-31 10:44
HTMLParser (and lots of other parsers I tried) has
definite limits when it comes to error recovery. I don't
know if it's good to put further development effort into
HTMLParser, as it will IMHO never reach the ability to cope
with all the crappy HTML out there.
If you really want to have an HTML parser in Python, I
suggest you look at my htmlsax module packaged with
linkchecker (linkchecker.sf.net) and webcleaner
(webcleaner.sf.net); the parser is tested with lots of
real-world examples.
The parser packaged with linkchecker has line counting; the
one packaged with webcleaner does not.

Cheers, Bastian
msg60307 - Author: Steven Rosenthal (smroid) Date: 2003-05-14 05:12
Two troublesome input examples:
<table border=0 width="100%"cellspacing=0 cellpadding=0>
<option selected value=>

Here's a fix I came up with in HTMLParser.py: replace the
definition of locatestarttagend with:

locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  \s*                                # whitespace after tag name
  (?:
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )?
       )?
     )
     \s*                             # whitespace between attrs
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)
msg60308 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2004-03-11 04:53
I'm using Python 2.3.3.

I note that bug 699079, which addresses this same issue, was closed
as "not a bug".  As far as I can tell the current behavior of
HTMLParser, unlike what was reported in that bug report, is to
silently stop parsing.  This is a problem, as it took me quite a
while to track down why my application wasn't working, whereas if
an exception had been generated I'd have figured it out right quick.

If it's going to stop parsing when the error occurs, then I'd much
rather it generate an exception.  I can always trap the exception
if I want to keep going.  Since it apparently used to work that
way, I'm hoping a quick poke through CVS by someone familiar
with the code can restore the exception behavior, pending a more
satisfactory resolution to the problem.
msg60309 - Author: Titus Brown (titus) Date: 2004-12-19 00:34
In response to rdmurray's comment: in Python 2.4, at least, an exception 
is raised.

Not sure why this bug is being kept open...  but see bug 736428 and 
patch 755660 for related issues.
msg114217 - Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-18 13:15
Closed as fixed in r23322.
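
For anyone revisiting this thread, the locatestarttagend pattern proposed in msg60307 can be tried on its own against the inputs reported above. The following is only a sketch: the pattern is copied from that message, while the helper start_tag_end and the SAMPLES list are hypothetical names made up for this illustration. It assumes the caller treats a start tag as complete when the character right after the match is ">"; it does not reproduce what HTMLParser itself does with the match, nor what was actually committed in r23322.

```python
import re

# Pattern copied from the fix proposed in msg60307.
locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  \s*                                # whitespace after tag name
  (?:
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )?
       )?
     )
     \s*                             # whitespace between attrs
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)

# Inputs reported in this issue (hypothetical list name).
SAMPLES = [
    '<a href="http://ss"title="pe">P</a>',
    '<table border=0 width="100%"cellspacing=0 cellpadding=0>',
    '<option selected value=>',
]


def start_tag_end(text):
    """Return the index just past the closing '>' of the start tag at the
    beginning of text, or -1 if no complete start tag is recognized.

    Hypothetical helper: it assumes a start tag counts as complete only
    when the pattern match is immediately followed by '>'.
    """
    m = locatestarttagend.match(text)
    if m and text[m.end():m.end() + 1] == ">":
        return m.end() + 1
    return -1


for sample in SAMPLES:
    end = start_tag_end(sample)
    if end == -1:
        print("not recognized: %r" % sample)
    else:
        print("start tag: %r" % sample[:end])
```

Under that assumption, all three reported inputs are recognized as complete start tags, because whitespace between attributes is optional and a value may be empty after "=". The same leniency means an attribute name may directly follow the tag name with no separating space, so the pattern trades strictness for recovery.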
History
Date                 User         Action  Args
2010-08-18 13:15:34  BreamoreBoy  set     status: open -> closed; nosy: + BreamoreBoy; messages: + msg114217; resolution: fixed
2009-04-22 18:49:33  ajaksu2      set     keywords: + easy; versions: + Python 2.7
2009-02-12 03:25:02  ajaksu2      set     type: enhancement; stage: test needed
2009-02-12 03:01:19  ajaksu2      link    issue755670 dependencies
2003-02-10 14:57:40  fdrake       create