classification
Title: HTMLParser fix to accept malformed tag attributes
Type: enhancement
Stage: resolved
Components: Library (Lib)
Versions: Python 3.2
process
Status: closed
Resolution: accepted
Dependencies:
Superseder: HTMLParser : A auto-tolerant parsing mode (issue 1486713)
Assigned To:
Nosy List: Neil Muller, ajaksu2, ezio.melotti, jlgijsbers, nnseva, r.david.murray, svilend
Priority: normal
Keywords: easy, patch

Created on 2004-10-13 10:11 by nnseva, last changed 2011-05-10 14:06 by ezio.melotti. This issue is now closed.

Files
File name Uploaded Description Edit
HTMLParser.py.patch nnseva, 2004-10-15 06:27 This is a patch
html.parser.diff svilend, 2011-05-05 10:34 patch to limit nonstrict-regexp from eating too much
test-htmlparser-attrs.py svilend, 2011-05-05 10:35 test with unquoted attributes
Messages (11)
msg22675 - (view) Author: Vsevolod Novikov (nnseva) Date: 2004-10-13 10:11
This is a patch to fix bugs #975556 and #921657.

I think it should be applied simply because the parser
should accept as many pages as it can. Besides, the
code near the fix already contains regexps that accept
malformed attributes in other cases: compare the values
of the attrfind and locatestarttagend variables.
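
To make the strict-versus-lenient distinction concrete, here is a minimal sketch. The two patterns below are hypothetical simplifications written for this illustration; they are not the actual attrfind or locatestarttagend expressions from HTMLParser, and parse_attrs is an assumed helper name.

```python
import re

# Hypothetical, simplified patterns for illustration only. They are NOT
# the real attrfind / locatestarttagend regexps from HTMLParser.
STRICT_ATTR = re.compile(
    r'''\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)'''           # attribute name
    r'''(?:\s*=\s*('[^']*'|"[^"]*"'''               # quoted value, or
    r'''|[-a-zA-Z0-9./,:;+*%?!&$()_#=~@]*))?'''     # unquoted, limited charset
)
LENIENT_ATTR = re.compile(
    r'''\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)'''           # attribute name
    r'''(?:\s*=\s*('[^']*'|"[^"]*"|[^\s>]*))?'''    # unquoted: anything up to space or '>'
)

def parse_attrs(pattern, text):
    """Match pattern repeatedly from the start of text.

    Returns the (name, value) pairs found and whatever text was left
    unparsed when the pattern stopped matching.
    """
    attrs = []
    pos = 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if not m or m.end() == pos:
            break
        attrs.append((m.group(1), m.group(2)))
        pos = m.end()
    return attrs, text[pos:]

sample = ' href=a|b class=x'
print(parse_attrs(STRICT_ATTR, sample))
print(parse_attrs(LENIENT_ATTR, sample))
```

With the limited character set, the strict pattern stops at the `|` and leaves the rest of the attribute string unparsed; the lenient pattern takes the whole unquoted value and goes on to the next attribute.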
msg22676 - (view) Author: Johannes Gijsbers (jlgijsbers) * (Python triager) Date: 2004-10-13 11:09
There's no uploaded file!  You have to check the
checkbox labeled "Check to Upload & Attach File"
when you upload a file.

Please try again.

(This is a SourceForge annoyance that we can do
nothing about. :-( )
msg22677 - (view) Author: Vsevolod Novikov (nnseva) Date: 2004-10-15 06:27
There's no uploaded file!  You have to check the
checkbox labeled "Check to Upload & Attach File"
when you upload a file.

Please try again.

(This is a SourceForge annoyance that we can do
nothing about. :-( )
msg22678 - (view) Author: Vsevolod Novikov (nnseva) Date: 2004-10-15 06:27
Missed patch, sorry ...
msg81692 - (view) Author: Daniel Diniz (ajaksu2) (Python triager) Date: 2009-02-11 23:57
Heh, the patch applies cleanly to trunk more than four years later and
tests pass fine. We'll surely need better tests if the behavior change
is considered an improvement.
msg114333 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-19 07:18
The patch is a one-line change to a compiled regex.  Could someone with HTML and/or regex knowledge comment on it? I have no idea what the implications are.  I also agree with the comments in msg81692 regarding better unit tests.  Please don't ask me! :)
msg121677 - (view) Author: Neil Muller (Neil Muller) Date: 2010-11-20 16:31
I think this change makes the parser far too lenient. Something like the explicit tolerant mode proposed in #1486713 is a better solution.
msg123176 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-03 04:17
Included this in the 'strict=False' mode in the issue 1486713 patch.
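
As a rough sketch of what tolerant attribute parsing looks like from the caller's side: the snippet below feeds a tag with unquoted attribute values to an HTMLParser subclass. It assumes Python 3.5 or later, where the `strict` argument no longer exists and the tolerant behaviour is the only mode; on Python 3.2 the equivalent would have been requested with `strict=False`. AttrCollector and collect are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Record the tag name and attribute list of every start tag seen."""

    def __init__(self):
        # Assumption: Python 3.5+, so no strict argument is passed here.
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))

def collect(markup):
    """Parse markup and return the list of (tag, attrs) pairs found."""
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.seen

print(collect('<a href=foo.html class=x>link</a>'))
```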
msg135179 - (view) Author: svilen dobrev (svilend) Date: 2011-05-05 10:34
This seems to eat too much into the data and gets past the endpos of the chunk being processed, so the parser gets confused and treats everything that follows as data. I haven't worked out how to fix the regexp itself; instead I limited its span to :endpos so that it does not eat too much.
It seems to happen with unquoted attributes.
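
The effect of capping a match can be sketched with the `endpos` argument of a compiled pattern's `match()` method. The pattern below is a hypothetical, deliberately greedy stand-in, not the non-strict regexp from html.parser and not the attached html.parser.diff; tag_span is an assumed helper name.

```python
import re

# Hypothetical over-greedy pattern: the unquoted value (\S*) happily
# swallows '>' and everything after it up to the next whitespace.
LENIENT_TAG = re.compile(r'<[a-zA-Z]+(?:\s+[a-zA-Z]+=\S*)*\s*')

def tag_span(rawdata, limit=None):
    """Return where the start-tag match ends, or -1 if nothing matches.

    When limit is given, the match may not extend past that index,
    as if rawdata ended there.
    """
    if limit is None:
        m = LENIENT_TAG.match(rawdata)
    else:
        m = LENIENT_TAG.match(rawdata, 0, limit)
    return m.end() if m else -1

raw = '<a href=foo>bar</a>'
print(tag_span(raw))                  # unbounded: runs into the data
print(tag_span(raw, raw.index('>')))  # bounded at the closing '>'
```

Bounding the match this way leaves the pattern itself unchanged, which matches the workaround described above: the regexp is still too greedy, but it can no longer run past the end of the tag that was already located.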
msg135180 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-05-05 10:44
This issue is closed, so it's better if you create a new issue.
Even better if you can attach a patch that adds a testcase to Lib/test/test_htmlparser.py
msg135701 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-05-10 14:06
For the record, the new issue is #12008.
History
Date User Action Args
2011-05-10 14:06:07 ezio.melotti set messages: + msg135701
2011-05-05 10:44:16 ezio.melotti set nosy: + ezio.melotti
    messages: + msg135180
2011-05-05 10:35:01 svilend set files: + test-htmlparser-attrs.py
2011-05-05 10:34:11 svilend set files: + html.parser.diff
    nosy: + svilend
    messages: + msg135179
2010-12-03 04:24:57 r.david.murray set title: HTMLParser fix to accept mailformed tag attributes -> HTMLParser fix to accept malformed tag attributes
2010-12-03 04:17:17 r.david.murray set status: open -> closed
    nosy: + r.david.murray, - BreamoreBoy
    messages: + msg123176
    resolution: accepted
    superseder: HTMLParser : A auto-tolerant parsing mode
    stage: patch review -> resolved
2010-11-20 16:31:00 Neil Muller set nosy: + Neil Muller
    messages: + msg121677
2010-08-19 07:18:41 BreamoreBoy set versions: + Python 3.2, - Python 2.7
    nosy: + BreamoreBoy
    messages: + msg114333
    stage: test needed -> patch review
2009-04-22 18:49:57 ajaksu2 set keywords: + patch, easy
    stage: test needed
2009-02-11 23:57:49 ajaksu2 set nosy: + ajaksu2
    messages: + msg81692
2009-02-09 06:13:35 ajaksu2 set type: enhancement
    versions: + Python 2.7, - Python 2.3
2004-10-13 10:11:24 nnseva create