classification
Title: Generator-based HTMLParser
Type: enhancement
Stage:
Components: Library (Lib)
Versions: Python 3.4

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, flying sheep, karlcow, ncoghlan, r.david.murray, scoder
Priority: normal
Keywords: patch

Created on 2013-03-13 18:09 by flying sheep, last changed 2013-08-26 04:52 by ncoghlan.

Files
File name | Uploaded | Description
htmltokenizer.patch | flying sheep, 2013-03-13 19:28 | version 1.0.0.1 of the patch. tests still pass.
Messages (10)
msg184096 - (view) Author: (flying sheep) * Date: 2013-03-13 18:09
hi, i have an idea on how to make an internal change to html.parser.HTMLParser, which would expose a token generator interface.

after that, we would be able to do e.g. list(HTMLParser().tokenize(data)) or even

def tokens_from_pipe():
    # "yield from" is only valid inside a generator function
    parser = HTMLParser()
    for chunk in pipe_in_html():
        yield from parser.tokenize(chunk)

---

the changes affect exclusively HTMLParser’s methods and would unfortunately require a behavior change to most (internal) parse_* methods. the changes go as follows:

1. the tokenize(data=None, end=False) method is added. it contains mainly goahead’s body with a prepended snippet to append passed data to raw_data, and all handle_* calls changed to "yield token, data".

2. all parse_* methods which returned an int and called one handle_* method are changed to return an (int, token) tuple (so that tokenize can yield the tokens)

3. goahead is changed to a skeleton implementation based on traversing the list created by tokenize, experiencing no changed behavior.

all changes would only affect the behavior of the parse_* methods, and the addition of the tokenize method: the tokens are discarded if goahead, feed, or close are called. (this can of course be changed if advisable)
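For readers who want a feel for the proposed interface, here is a minimal sketch. It is not the attached patch: instead of rewriting goahead and the parse_* methods, it subclasses the existing HTMLParser and records each handler call as a (token, args) tuple. The class name TokenizingParser, the internal _tokens list, and the choice to carry the handler arguments as a tuple are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TokenizingParser(HTMLParser):
    """Hypothetical wrapper: records handler calls as (token, args) tuples."""

    def __init__(self):
        super().__init__()
        self._tokens = []  # assumption: a simple buffer of pending tokens

    def handle_starttag(self, tag, attrs):
        self._tokens.append(("starttag", (tag, attrs)))

    def handle_endtag(self, tag):
        self._tokens.append(("endtag", (tag,)))

    def handle_data(self, data):
        self._tokens.append(("data", (data,)))

    def handle_comment(self, data):
        self._tokens.append(("comment", (data,)))

    def tokenize(self, data=None, end=False):
        """Feed optional data, then yield every token produced so far."""
        if data is not None:
            self.feed(data)
        if end:
            self.close()
        # Swap the buffer out so each token is yielded exactly once.
        tokens, self._tokens = self._tokens, []
        yield from tokens


tokens = list(TokenizingParser().tokenize("<p class='x'>hi</p>", end=True))
```

Because tokenize is a generator, nothing is fed to the parser until the caller starts iterating. When input arrives in chunks, text may also be reported as several "data" tokens, since the underlying parser emits whatever text it has already seen.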

---

since this is my first contribution, i’m unsure whether i should attach the patch already, not knowing if the changes to the internal parse_* methods are acceptable at all. what do you say?

PS: the tokens are named like the handle_* methods, and the current goahead implementation basically calls getattr(self, 'handle_' + token)(data) for each (token, data) tuple. This can be changed to a token: method dict or a classic “switch” elif stack.
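The two dispatch styles mentioned in the PS can be compared side by side in a small sketch. The Recorder class, both function names, and the sample token list are hypothetical. Tokens here carry their arguments as a tuple that is unpacked with *args, which is an assumption that differs slightly from the single-argument call described above.

```python
class Recorder:
    """Hypothetical handler object that records which methods were called."""

    def __init__(self):
        self.calls = []

    def handle_starttag(self, tag, attrs):
        self.calls.append(("starttag", tag, attrs))

    def handle_data(self, data):
        self.calls.append(("data", data))

    def handle_endtag(self, tag):
        self.calls.append(("endtag", tag))


def dispatch_by_name(handler, tokens):
    """Look the method up by name, as the PS describes."""
    for token, args in tokens:
        getattr(handler, "handle_" + token)(*args)


def dispatch_by_table(handler, tokens):
    """Alternative: an explicit token-to-method dict."""
    table = {
        "starttag": handler.handle_starttag,
        "data": handler.handle_data,
        "endtag": handler.handle_endtag,
    }
    for token, args in tokens:
        table[token](*args)


TOKENS = [("starttag", ("p", [])), ("data", ("hi",)), ("endtag", ("p",))]

by_name, by_table = Recorder(), Recorder()
dispatch_by_name(by_name, TOKENS)
dispatch_by_table(by_table, TOKENS)
```

The getattr variant picks up any handle_* method a subclass adds, while the dict variant fails with a KeyError on an unknown token name, which makes a typo easier to spot.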
msg184100 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-03-13 18:15
If you have a patch you can post it, however new features are allowed only in Python 3.4, and they must be backward compatible (run "python -m test test_htmlparser" to check that).
msg184101 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-03-13 18:32
I think that in order to maintain backward compatibility the existing parse_ names should continue to have the same signature, but they could be re-implemented in terms of new versions that return the token.  That way if an application overrides the methods for some reason that existing code should continue to work.
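One way to read this suggestion, sketched on a toy class rather than on the real HTMLParser: a new private method returns the position together with the token, and the existing method keeps its int-returning signature by delegating to it and calling the handler itself. ToyParser and _parse_comment_token are hypothetical names, and the comment scanning is deliberately simplified.

```python
class ToyParser:
    """Toy stand-in for a parser with one parse_* method (illustrative only)."""

    def __init__(self):
        self.rawdata = ""
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)

    def _parse_comment_token(self, i):
        """New-style method: return (new_position, token), call no handler."""
        end = self.rawdata.find("-->", i + 4)
        if end < 0:
            return -1, None  # incomplete comment, wait for more data
        return end + 3, ("comment", self.rawdata[i + 4:end])

    def parse_comment(self, i):
        """Old-style signature: return an int and call handle_comment."""
        j, token = self._parse_comment_token(i)
        if token is not None:
            self.handle_comment(token[1])
        return j


parser = ToyParser()
parser.rawdata = "<!--hi-->x"
position = parser.parse_comment(0)
```

A subclass that overrides parse_comment keeps working unchanged, while a token generator could call _parse_comment_token directly.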
msg184103 - (view) Author: karl (karlcow) * Date: 2013-03-13 18:50
flying sheep: do you plan to make it easier to use the HTML5 algorithm?
http://www.w3.org/TR/html5/syntax.html#parsing
msg184104 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-03-13 18:52
HTMLParser already parses HTML5, producing the correct result in most cases.
msg184105 - (view) Author: karl (karlcow) * Date: 2013-03-13 18:58
Ezio: I'm talking about the "HTML5 Parsing algorithm", not about parsing html* documents. :)

The only Python parser I know of that is closer to the HTML5 parsing algorithm is https://code.google.com/p/html5lib/
msg184106 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-03-13 19:08
Well, I'm not sure what's the point of implementing that specific algorithm if the end result is the same.  HTMLParser implementation also has the advantage of being much simpler, and probably faster too.  If for some reason you want that specific algorithm you can always use html5lib.
Also if you find places where HTMLParser is not doing the right thing you can report new issues (I know a few corner cases where this happens, but they are so obscure that I intentionally left them unfixed to keep the code simple).
msg184107 - (view) Author: (flying sheep) * Date: 2013-03-13 19:24
no, i didn’t change anything that didn’t have to be changed to expose the tokens. i kept the changes as minimal as possible.

and the tests pass! i attached the patch.

---

aside thoughts:

i had to change _markupbase.py, too, but i wonder why it’s even a separate module: it is only ever imported by html.parser and its only content, ParserBase, is only subclassed once (by HTMLParser). both classes are so intertwined and dependent on each other (ParserBase calls HTMLParser methods that it itself doesn’t even define) that i think _markupbase should just be scrapped and included into HTMLParser.
msg184108 - (view) Author: (flying sheep) * Date: 2013-03-13 19:28
whoops, left my editor modeline in. i knew that was going to happen.
msg196179 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-08-26 04:52
The event generation API for ElementTree being discussed in issue 17741 is potentially relevant here.

I think that style of API is preferable, as it doesn't alter how data is fed into the parser, just how it is extracted.
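For comparison, the ElementTree discussion led to xml.etree.ElementTree.XMLPullParser in Python 3.4, where data is still pushed in with feed() and events are pulled out with read_events(). The sketch below shows that style on XML input. The chunk boundaries and the collect_events helper are made up for illustration, and the events are also drained after close() because the underlying parser may hold some of them back until then.

```python
import xml.etree.ElementTree as ET


def collect_events(chunks):
    """Feed chunks to an XMLPullParser and gather (event, tag) pairs."""
    parser = ET.XMLPullParser(events=("start", "end"))
    seen = []
    for chunk in chunks:
        parser.feed(chunk)  # input side: unchanged push-style feeding
        for event, elem in parser.read_events():  # output side: pull events
            seen.append((event, elem.tag))
    parser.close()
    # Drain anything that only became available once parsing finished.
    for event, elem in parser.read_events():
        seen.append((event, elem.tag))
    return seen


events = collect_events(["<root><item>te", "xt</item></root>"])
```

The total order of events is the same however the input is split. Only the point at which each event becomes available depends on the chunking.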
History
Date | User | Action | Args
2013-08-26 04:52:06 | ncoghlan | set | nosy: + ncoghlan; messages: + msg196179
2013-08-24 08:30:43 | ezio.melotti | set | nosy: + scoder
2013-03-13 20:49:39 | flying sheep | set | files: - htmltokenizer.patch
2013-03-13 19:28:25 | flying sheep | set | files: + htmltokenizer.patch; messages: + msg184108
2013-03-13 19:24:32 | flying sheep | set | files: + htmltokenizer.patch; keywords: + patch; messages: + msg184107
2013-03-13 19:08:40 | ezio.melotti | set | messages: + msg184106
2013-03-13 18:58:19 | karlcow | set | messages: + msg184105
2013-03-13 18:52:42 | ezio.melotti | set | messages: + msg184104
2013-03-13 18:50:00 | karlcow | set | nosy: + karlcow; messages: + msg184103
2013-03-13 18:32:12 | r.david.murray | set | nosy: + r.david.murray; messages: + msg184101
2013-03-13 18:15:17 | ezio.melotti | set | versions: + Python 3.4; nosy: + ezio.melotti; messages: + msg184100; components: + Library (Lib), - XML
2013-03-13 18:10:50 | flying sheep | set | type: enhancement; components: + XML
2013-03-13 18:09:52 | flying sheep | create |