Message 22976 - Python tracker


Author	dyoo
Date	2004-11-01.18:05:39

Content
In Python 2.3.3, HTMLParser uses a regex that is too
strict, so it does not handle slightly malformed HTML
gracefully.

The current definition of HTMLParser.locatestarttagend:

locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*       # tag name
  (?:\s+                             # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)


does not fully match start tags like:

    <IMG SRC = "abc.jpg"WIDTH=5>

where there is no space between the closing quote and
the next attribute name.  Many sources of HTML are
malformed this way --- in particular, CNN.com --- so
being a little more lenient might be good.  We can
relax the constraint so that the whitespace before an
attribute name is optional:


locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*       # tag name
  (?:\s*                             # optional whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)

which allows the parser to process more of the HTML out
there.
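
For illustration, here is a small sketch (written for a current
Python 3, not 2.3.3) that compares the two patterns on the example
tag.  The names TEMPLATE, WS_RULE, STRICT, RELAXED and
char_after_match are made up for this demo and are not part of
HTMLParser.  As far as I can tell, the parser looks at the character
right after the match and expects the tag to close there, so that
character is what the sketch reports.

```python
import re

# Pattern from the report, with the whitespace rule before an
# attribute name left as a placeholder (WS_RULE is a demo-only marker).
TEMPLATE = r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:WS_RULE                         # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
"""

# Current behaviour: at least one whitespace character is required.
STRICT = re.compile(TEMPLATE.replace("WS_RULE", r"\s+"), re.VERBOSE)
# Proposed behaviour: the whitespace is optional.
RELAXED = re.compile(TEMPLATE.replace("WS_RULE", r"\s*"), re.VERBOSE)


def char_after_match(pattern, text):
    """Return the character just past the matched prefix ('' at end of text)."""
    m = pattern.match(text)
    if m is None:
        return None
    return text[m.end():m.end() + 1]


tag = '<IMG SRC = "abc.jpg"WIDTH=5>'
print(repr(char_after_match(STRICT, tag)))   # 'W': match stops at the closing quote
print(repr(char_after_match(RELAXED, tag)))  # '>': match reaches the end of the tag
```

With the strict pattern the match ends right after the closing quote,
so the next character is "W" rather than ">".  With the relaxed
pattern the match runs up to the ">".  A well-formed tag such as
<IMG SRC="abc.jpg" WIDTH=5> reaches the ">" under both patterns.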


See:

http://mail.python.org/pipermail/tutor/2004-October/032835.html

and:

http://mail.python.org/pipermail/tutor/2004-October/032869.html

for an explanation of what motivates this change.

Thanks!

History
Date	User	Action	Args
2007-08-23 14:27:13	admin	link	issue1058305 messages
2007-08-23 14:27:13	admin	create