Message22976
In Python 2.3.3, HTMLParser uses a regex that is too
strict, so it does not handle slightly malformed HTML
gracefully.
The current definition of HTMLParser.locatestarttagend:
locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s+                             # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)
does not capture strings like:
<IMG SRC = "abc.jpg"WIDTH=5>
where there is no space between the closing quote and
the next attribute name. Many sources of HTML are
malformed in this way (CNN.com in particular), so a
little leniency would help. We can relax the constraint
by making the whitespace before an attribute name
optional (\s* instead of \s+):
locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s*                             # optional whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)
which allows the parser to process more of the HTML out
there.
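
The difference between the two patterns can be checked in isolation,
without going through HTMLParser itself. The following is only a
sketch: TEMPLATE, the @WS@ placeholder, strict, relaxed and the
sample strings are names made up for this illustration, and the
script exercises the regexes alone, not the parser's full start-tag
handling.

```python
import re

# Shared pattern text; @WS@ is a hypothetical placeholder that is
# swapped for the whitespace quantifier under test.
TEMPLATE = r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:@WS@                            # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
"""

strict = re.compile(TEMPLATE.replace("@WS@", r"\s+"), re.VERBOSE)
relaxed = re.compile(TEMPLATE.replace("@WS@", r"\s*"), re.VERBOSE)

malformed = '<IMG SRC = "abc.jpg"WIDTH=5>'
wellformed = '<IMG SRC="abc.jpg" WIDTH=5>'

# Whatever is left after the match; a complete start tag leaves just ">".
strict_rest = malformed[strict.match(malformed).end():]
relaxed_rest = malformed[relaxed.match(malformed).end():]

print(repr(strict_rest))    # 'WIDTH=5>' : the strict pattern stops early
print(repr(relaxed_rest))   # '>'        : the relaxed pattern reaches the end
```

On well-formed input such as the wellformed string above, both
patterns stop at the closing ">", so the relaxed version should only
change behaviour for tags that the strict one cannot consume fully.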
See:
http://mail.python.org/pipermail/tutor/2004-October/032835.html
and:
http://mail.python.org/pipermail/tutor/2004-October/032869.html
for an explanation of what motivates this change.
Thanks!

Date | User | Action | Args
2007-08-23 14:27:13 | admin | link | issue1058305 messages
2007-08-23 14:27:13 | admin | create |