HTMLParser.locatestartagend regex too stringent #41113
Comments
In Python 2.3.3, HTMLParser uses a certain regex that
The current definition of HTMLParser.locatestarttagend:
locatestarttagend = re.compile(r"""
<[a-zA-Z][-.a-zA-Z0-9:_]* # tag name
(?:\s+ # whitespace before attribute name
(?:[a-zA-Z_][-.:a-zA-Z0-9_]* # attribute name
(?:\s*=\s* # value indicator
(?:'[^']*' # LITA-enclosed value
|\"[^\"]*\" # LIT-enclosed value
|[^'\">\s]+ # bare value
)
)?
)
)*
\s* # trailing whitespace
""", re.VERBOSE)
does not capture strings like:
where there is no space between the closing quote and
locatestarttagend = re.compile(r"""
<[a-zA-Z][-.a-zA-Z0-9:_]* # tag name
(?:\s* # optional whitespace before attribute name
(?:[a-zA-Z_][-.:a-zA-Z0-9_]* # attribute name
(?:\s*=\s* # value indicator
(?:'[^']*' # LITA-enclosed value
|\"[^\"]*\" # LIT-enclosed value
|[^'\">\s]+ # bare value
)
)?
)
)*
\s* # trailing whitespace
""", re.VERBOSE)
which allows the parser to process more of the HTML out
See: http://mail.python.org/pipermail/tutor/2004-October/032835.html
and: http://mail.python.org/pipermail/tutor/2004-October/032869.html
for an explanation of what motivates this change. Thanks!
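The difference between the two patterns can be sketched with a small, self-contained script. The example string from the report did not survive, so the `sample` input below is a hypothetical stand-in (an attribute that starts immediately after a closing quote). The names `STRICT`, `RELAXED`, and `consumed` are illustrative only and are not part of HTMLParser.

```python
import re

# Pattern quoted in the report as the Python 2.3.3 definition.
STRICT = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s+                             # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)

# Pattern proposed in the report: whitespace before an attribute is optional.
RELAXED = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s*                             # optional whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)

# Hypothetical input (an assumption, not taken from the report):
# no space between the closing quote of href and the class attribute.
sample = '<a href="x"class="y">'


def consumed(pattern, text):
    """Return the prefix of text that pattern matches from position 0."""
    m = pattern.match(text)
    return text[:m.end()] if m else ""


print(repr(consumed(STRICT, sample)))
print(repr(consumed(RELAXED, sample)))
```

On this input the strict pattern stops right after `href="x"`, so the character following the match is `c` rather than `>`. The relaxed pattern consumes both attributes and ends immediately before the closing `>`.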
The regex is still the same. This is one of many 'HTMLParser regex for
I'll close this in a couple of weeks unless anyone objects.
No reply to msg114390.
Closing this issue as out of date was inappropriate. It may be a duplicate, but someone with an interest should go through and evaluate all the related 'tolerant HTML parser' issues. bpo-1486713 could perhaps serve as a master issue for this set.
Closing this in favor of 1486713, which has a patch and covers additional issues.