classification
Title: HtmlParser non-strict goes wrong with unquoted attributes
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.2, Python 3.3
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: eric.araujo, ezio.melotti, pietvo, python-dev, r.david.murray, svilend
Priority: normal Keywords: patch

Created on 2011-05-05 10:47 by svilend, last changed 2011-11-01 12:46 by ezio.melotti. This issue is now closed.

Files
File name Uploaded Description
html.parser.diff svilend, 2011-05-05 10:47 patch to limit nonstrict regexp's span
test-htmlparser-attrs.py svilend, 2011-05-05 10:48 standalone test
Messages (5)
msg135182 - Author: svilen dobrev (svilend) Date: 2011-05-05 10:47
Non-strict mode seems to eat too far into the data: it gets past endpos of the chunk being processed, the parser gets confused, and it treats everything after that as data. I haven't worked out how to fix the regexp itself; instead, I limited its span to :endpos so that it does not eat too much.
This seems to happen with unquoted attributes.
msg135183 - Author: svilen dobrev (svilend) Date: 2011-05-05 10:51
(the nonstrict regexp came with Issue1046092)
msg143472 - Author: Piet van Oostrum (pietvo) Date: 2011-09-03 19:23
I was bitten by this bug today. Hope it will be solved in the next release of Python 3.

It is also possible to use the third argument of search in line 285:

                m = attrfind_tolerant.search(rawdata, k, endpos)

This seems to me to be a more "natural" solution.
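
To illustrate the suggestion above, here is a minimal sketch of how the optional pos and endpos arguments of a compiled pattern's search() bound a match. The pattern attr_re and the sample rawdata are hypothetical stand-ins chosen for brevity; they are not the real attrfind_tolerant regexp or the reporter's test input.

```python
import re

# Hypothetical, simplified attribute pattern (NOT the stdlib attrfind_tolerant).
attr_re = re.compile(r'(\w+)=(\w+)')

rawdata = '<a>more=stuff'
endpos = rawdata.index('>') + 1  # index just past the end of the start tag
k = 2                            # position right after the tag name

# Without endpos, the search runs past the tag and matches text in the data.
unbounded = attr_re.search(rawdata, k)

# With endpos, the pattern behaves as if the string ended at endpos.
bounded = attr_re.search(rawdata, k, endpos)

# Slicing, as in the original patch, bounds the search the same way,
# at the cost of copying the string.
sliced = attr_re.search(rawdata[:endpos], k)

print(unbounded.span() if unbounded else None)
print(bounded, sliced)
```

In this sketch the unbounded search finds "more=stuff" in the text after the tag, while both bounded variants find nothing inside the tag, which is the kind of overrun the report describes.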
msg146772 - Author: Roundup Robot (python-dev) (Python triager) Date: 2011-11-01 12:44
New changeset 6107a84e3c44 by Ezio Melotti in branch '3.2':
#12008: add a test.
http://hg.python.org/cpython/rev/6107a84e3c44

New changeset 495b31a8b280 by Ezio Melotti in branch 'default':
#12008: merge with 3.2.
http://hg.python.org/cpython/rev/495b31a8b280
msg146773 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-11-01 12:46
This seems to be already fixed in 3.2/3.3, so I extracted the test from your script and added it to the test suite. If you can find a way to break the parser, let me know.
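
For anyone who wants to check the behavior on their own installation, the following is a rough sketch of such a check. It is not the test from the changeset above; the Collector class and the sample markup are made up for illustration. It is written for current Python 3, where HTMLParser no longer takes a strict argument (the argument was deprecated in 3.3 and removed in 3.5); on 3.2 the tolerant mode would be selected with strict=False.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Hypothetical helper that records parser events in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('starttag', tag, attrs))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_endtag(self, tag):
        self.events.append(('endtag', tag))


parser = Collector()
# Unquoted attribute values, followed by data and an end tag.
parser.feed('<a href=test.html class=x>link</a>')
parser.close()

for event in parser.events:
    print(event)
```

If the parser handles unquoted attributes correctly, the start tag is reported with both attributes, and the text and end tag arrive as separate events instead of being swallowed as data.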
History
Date User Action Args
2011-11-01 12:46:40 ezio.melotti set status: open -> closed; assignee: ezio.melotti; nosy: + ezio.melotti; messages: + msg146773; resolution: out of date; stage: resolved
2011-11-01 12:44:12 python-dev set nosy: + python-dev; messages: + msg146772
2011-09-03 19:23:14 pietvo set nosy: + pietvo; messages: + msg143472
2011-05-06 17:04:48 eric.araujo set nosy: + eric.araujo, r.david.murray; versions: + Python 3.3
2011-05-05 10:51:48 svilend set messages: + msg135183
2011-05-05 10:48:12 svilend set files: + test-htmlparser-attrs.py; type: behavior; components: + Library (Lib); versions: + Python 3.2
2011-05-05 10:47:27 svilend create