classification
Title: HTMLParser.locatestarttagend regex too stringent
Type: feature request Stage: test needed
Components: Library (Lib) Versions: Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: ajaksu2, dyoo (2)
Priority: normal Keywords: easy, patch

Created on 2004-11-01 18:05 by dyoo, last changed 2009-02-14 18:45 by ajaksu2.

Files
File name Uploaded Description
HTMLParser.py.diff dyoo, 2004-11-01 18:05 diff against Lib/HTMLParser.py from Python 2.3.3
Messages (2)
msg22976 - Author: Danny Yoo (dyoo) Date: 2004-11-01 18:05
In Python 2.3.3, HTMLParser uses a certain regex that
is too stringent, and it does not capture slightly
malformed HTML gracefully.

The current definition of HTMLParser.locatestarttagend:

locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*       # tag name
  (?:\s+                             # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)


does not capture strings like:

    <IMG SRC = "abc.jpg"WIDTH=5>

where there is no space between the closing quote and
the next attribute name.  Many sources of HTML are
malformed this way (CNN.com in particular), so being a
little more lenient would help.  We can relax the
constraint by making that whitespace optional:


locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*       # tag name
  (?:\s*                             # optional whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)

which allows the parser to process more of the HTML out
there.
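
The difference between the two patterns can be checked in isolation,
outside HTMLParser. The following is only a sketch: the helper
build_locatestarttagend and the names strict, relaxed and tag are made
up for this illustration and are not part of HTMLParser. It assumes the
two definitions quoted above, which differ only in the whitespace
quantifier before the attribute name.

```python
import re


def build_locatestarttagend(ws_before_attr):
    """Hypothetical helper: build the pattern quoted above, with the
    whitespace quantifier before each attribute name supplied by the caller."""
    return re.compile(r"""
      <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
      (?:%s                              # whitespace before attribute name
        (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
          (?:\s*=\s*                     # value indicator
            (?:'[^']*'                   # LITA-enclosed value
              |\"[^\"]*\"                # LIT-enclosed value
              |[^'\">\s]+                # bare value
             )
           )?
         )
       )*
      \s*                                # trailing whitespace
    """ % ws_before_attr, re.VERBOSE)


strict = build_locatestarttagend(r"\s+")    # current definition
relaxed = build_locatestarttagend(r"\s*")   # proposed definition

tag = '<IMG SRC = "abc.jpg"WIDTH=5>'

# The strict pattern stops right after the closing quote, so the next
# character is "W" rather than the ">" that ends the start tag.
print(strict.match(tag).group(0))

# The relaxed pattern also consumes WIDTH=5 and stops just before ">".
print(relaxed.match(tag).group(0))
```

With the strict pattern the match ends at the closing quote of
"abc.jpg"; with the relaxed one it runs up to the ">" that closes the
tag, which is what the parser needs in order to accept the start tag.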


See:

http://mail.python.org/pipermail/tutor/2004-October/032835.html

and:

http://mail.python.org/pipermail/tutor/2004-October/032869.html

for an explanation of what motivates this change.

Thanks!
msg82104 - Author: Daniel Diniz (ajaksu2) Date: 2009-02-14 18:45
The regex is still the same. This is one of many 'HTMLParser regex for
attributes' issues.
History
Date User Action Args
2009-02-14 18:45:26 ajaksu2 set versions: + Python 2.7, - Python 2.3
nosy: + ajaksu2
messages: + msg82104
keywords: + patch, easy
type: feature request
stage: test needed
2004-11-01 18:05:39 dyoo create