Title: HtmlParser non-strict goes wrong with unquoted attributes
Created on 2011-05-05 10:47 by svilend, last changed 2011-11-01 12:46 by ezio.melotti.

html.parser.diff svilend, 2011-05-05 10:47 patch to limit nonstrict regexp's span
svilend, 2011-05-05 10:48 standalone test
msg135182 - (view) Author: svilen dobrev (svilend) Date: 2011-05-05 10:47
Non-strict mode seems to eat too much into the data and runs past the endpos of the chunk being processed; the parser then gets confused and treats everything after it as data. I didn't work out how to fix the regexp itself, so instead I limited its span to :endpos so that it does not eat too much.
This seems to happen with unquoted attributes.
msg135183 - (view) Author: svilen dobrev (svilend) Date: 2011-05-05 10:51
(the nonstrict regexp came with Issue1046092)
msg143472 - (view) Author: Piet van Oostrum (pietvo) Date: 2011-09-03 19:23
I was bitten by this bug today. Hope it will be solved in the next release of Python 3.

It is also possible to use the third argument of search in line 285:

                m =, k, endpos)

This seems to me to be a more `natural' solution.
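
For illustration, here is a minimal sketch of what the third argument does. The `search` method of a compiled pattern accepts optional `pos` and `endpos` arguments, and with `endpos` set the string is treated as if it ended there. The pattern `ATTR`, the helper `find_attrs` and the sample input are hypothetical and much simpler than what html.parser really uses; they only show how an unbounded search can run past the end of the start tag.

```python
import re

# Hypothetical, simplified attribute pattern; NOT the regexp from html.parser.
ATTR = re.compile(r'(\w+)\s*=\s*([^\s<>]+)')

def find_attrs(data, start, endpos, limit_span):
    """Collect name=value pairs, scanning from `start` while inside the tag."""
    found = []
    k = start
    while k < endpos:
        if limit_span:
            # Third argument: the search never looks at data[endpos:].
            m = ATTR.search(data, k, endpos)
        else:
            # Unbounded search: free to match text after the tag.
            m = ATTR.search(data, k)
        if not m:
            break
        found.append((m.group(1), m.group(2)))
        k = m.end()
    return found

rawdata = '<a href=foo>text x=1</a>'
endpos = rawdata.index('>') + 1   # one past the end of the start tag

unbounded = find_attrs(rawdata, 2, endpos, limit_span=False)
bounded = find_attrs(rawdata, 2, endpos, limit_span=True)
print(unbounded)  # [('href', 'foo'), ('x', '1')]  <- 'x=1' is really data
print(bounded)    # [('href', 'foo')]
```

Passing `endpos` keeps the call site unchanged apart from one extra argument, and no slice of the input has to be copied.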
msg146772 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-11-01 12:44
New changeset 6107a84e3c44 by Ezio Melotti in branch '3.2':
#12008: add a test.

New changeset 495b31a8b280 by Ezio Melotti in branch 'default':
#12008: merge with 3.2.
msg146773 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-11-01 12:46
This seems to be already fixed in 3.2/3.3, so I extracted the test from your script and added it to the test suite. If you can find a way to break the parser, let me know.
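
A quick way to check this locally is sketched below. It is not the test that was committed: the `EventCollector` class, the `collect` helper and the sample markup are made up for illustration, and it assumes a recent Python 3 where `HTMLParser()` takes no `strict` argument. It feeds a tag with unquoted attributes and checks that the following markup is still reported as tags and not swallowed as data.

```python
from html.parser import HTMLParser

class EventCollector(HTMLParser):
    """Hypothetical helper that records parser callbacks in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('starttag', tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(('endtag', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

def collect(html):
    parser = EventCollector()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.events

events = collect('<a href=foo class=bar>text</a><b>more</b>')
for event in events:
    print(event)
```

If the parser read past the end of the first start tag, the `<b>` element would not show up as a separate start tag in the recorded events.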
