Issue1058305
Created on 2004-11-01 18:05 by dyoo, last changed 2009-02-14 18:45 by ajaksu2.
| File name | Uploaded | Description |
| HTMLParser.py.diff | dyoo, 2004-11-01 18:05 | diff against Lib/HTMLParser.py from Python 2.3.3 |
msg22976 | Author: Danny Yoo (dyoo) | Date: 2004-11-01 18:05
In Python 2.3.3, HTMLParser uses a regex that is too
strict, so it does not handle slightly malformed HTML
gracefully.
The current definition of HTMLParser.locatestarttagend:
locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s+                             # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
        )
      )?
    )
  )*
  \s*                                # trailing whitespace
""", re.VERBOSE)
does not capture strings like:
<IMG SRC = "abc.jpg"WIDTH=5>
where there is no space between the closing quote and
the next attribute name. Many sources of HTML are
malformed in this way, CNN.com in particular, so some
leniency would help. We can relax the constraint by
making the whitespace before an attribute name optional:
locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s*                             # optional whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
        )
      )?
    )
  )*
  \s*                                # trailing whitespace
""", re.VERBOSE)
which allows the parser to process more of the HTML out
there.
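The difference between the two patterns can be checked directly. The following is a minimal sketch, not the parser's own code: the names STRICT, LENIENT, and stops_at_tag_close are hypothetical, and testing whether the match ends right before ">" is only an approximation of how a parser might decide that a start tag is complete.

```python
import re

# Pattern as quoted in the report (Python 2.3.3): whitespace is required
# before every attribute name.
STRICT = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s+                             # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
        )
      )?
    )
  )*
  \s*                                # trailing whitespace
""", re.VERBOSE)

# Proposed pattern: identical, except the first "\s+" becomes "\s*".
LENIENT = re.compile(
    STRICT.pattern.replace(r"(?:\s+", r"(?:\s*", 1), re.VERBOSE
)


def stops_at_tag_close(pattern, text):
    """Return True if the match ends immediately before the closing '>'."""
    m = pattern.match(text)
    return m is not None and text[m.end():m.end() + 1] == ">"


sample = '<IMG SRC = "abc.jpg"WIDTH=5>'
print(stops_at_tag_close(STRICT, sample))   # False: match stops before WIDTH
print(stops_at_tag_close(LENIENT, sample))  # True: WIDTH=5 is consumed too
```

With the strict pattern the match ends right after the closing quote of "abc.jpg", so the character that follows is "W" rather than ">". The lenient pattern continues through WIDTH=5 and stops at ">".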
See:
http://mail.python.org/pipermail/tutor/2004-October/032835.html
and:
http://mail.python.org/pipermail/tutor/2004-October/032869.html
for an explanation of what motivates this change.
Thanks!
msg82104 | Author: Daniel Diniz (ajaksu2) | Date: 2009-02-14 18:45
The regex is still the same. This is one of many 'HTMLParser regex for
attributes' issues.
| Date | User | Action | Args |
| 2009-02-14 18:45:26 | ajaksu2 | set | versions: + Python 2.7, - Python 2.3; nosy: + ajaksu2; messages: + msg82104; keywords: + patch, easy; type: feature request; stage: test needed |
| 2004-11-01 18:05:39 | dyoo | create | |