
Author flying sheep
Recipients flying sheep
Date 2013-03-13.18:09:51
Message-id <1363198192.05.0.796177094664.issue17410@psf.upfronthosting.co.za>
In-reply-to
Content
hi, i have an idea on how to make an internal change to html.parser.HTMLParser, which would expose a token generator interface.

after that, we would be able to do e.g. list(HTMLParser().tokenize(data)) or even

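def stream_tokens():  # yield from is only valid inside a generator function
    parser = HTMLParser()
    for chunk in pipe_in_html():  # pipe_in_html() stands for any source of HTML chunks
        yield from parser.tokenize(chunk)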

---

the changes affect exclusively HTMLParser’s methods and would unfortunately require a behavior change to most (internal) parse_* methods. the changes go as follows:

1. the tokenize(data=None, end=False) method is added. it consists mainly of goahead’s body, with a prepended snippet that appends the passed data to rawdata, and with every handle_* call replaced by "yield token, data".

2. all parse_* methods which returned an int and called one handle_* method are changed to return an (int, token) tuple (so that tokenize can yield the tokens)

3. goahead is reduced to a skeleton implementation that iterates over the tokens produced by tokenize and calls the matching handle_* method for each, so its observable behavior does not change.

the only visible changes would be the new return values of the parse_* methods and the addition of the tokenize method. when goahead, feed, or close is called, the tokens are passed to the handle_* methods and then discarded. (this can of course be changed if advisable)
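
to make the three steps concrete, here is a minimal sketch on a toy parser. it is not the proposed patch and not HTMLParser’s code: ToyParser, its events list, and the simplified one-argument handle_* methods are all made up for illustration, and it only understands <tag>, </tag> and text.

```python
class ToyParser:
    """Hypothetical stand-in for HTMLParser: only <tag>, </tag> and text."""

    def __init__(self):
        self.rawdata = ""
        self.events = []  # filled by the handle_* methods (illustration only)

    # step 2: parse_* returns (new_position, token) instead of calling handle_*
    def parse_starttag(self, i):
        end = self.rawdata.find(">", i)
        if end < 0:
            return -1, None  # incomplete tag, wait for more data
        return end + 1, ("starttag", self.rawdata[i + 1:end])

    def parse_endtag(self, i):
        end = self.rawdata.find(">", i)
        if end < 0:
            return -1, None
        return end + 1, ("endtag", self.rawdata[i + 2:end])

    # step 1: append the passed data to the buffer, then yield (token, data)
    def tokenize(self, data=None, end=False):
        if data is not None:
            self.rawdata += data
        i, n = 0, len(self.rawdata)
        while i < n:
            if self.rawdata[i] == "<":
                if self.rawdata.startswith("</", i):
                    k, token = self.parse_endtag(i)
                else:
                    k, token = self.parse_starttag(i)
                if k < 0:
                    if not end:
                        break  # keep the incomplete tag buffered
                    k, token = n, ("data", self.rawdata[i:])  # flush as text
                yield token
                i = k
            else:
                j = self.rawdata.find("<", i)
                if j < 0:
                    j = n
                yield "data", self.rawdata[i:j]
                i = j
        # the buffer is trimmed only once the generator is exhausted
        self.rawdata = self.rawdata[i:]

    # step 3: goahead just walks the tokens and dispatches to handle_*
    def goahead(self, end):
        for token, data in self.tokenize(end=end):
            getattr(self, "handle_" + token)(data)

    def feed(self, data):
        self.rawdata += data
        self.goahead(False)

    def handle_starttag(self, tag):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, text):
        self.events.append(("data", text))


print(list(ToyParser().tokenize("<p>hi</p>")))
# [('starttag', 'p'), ('data', 'hi'), ('endtag', 'p')]
```

one consequence of writing tokenize as a generator shows up in the last line of the method: the buffer is only trimmed after the caller has consumed all tokens, so a caller that stops iterating early would see the same tokens again on the next call. a real patch would have to decide how to handle that.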

---

since this is my first contribution, i’m unsure whether i should attach the patch already, not knowing whether the changes to the internal parse_* methods are acceptable at all. what do you say?

PS: the tokens are named like the handle_* methods, and the current goahead implementation basically calls getattr(self, 'handle_' + token)(data) for each (token, data) tuple. this can be changed to a dict that maps each token to its method, or to a classic “switch”-style if/elif chain.
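
a rough sketch of the two dispatch styles side by side, assuming a hypothetical Dispatcher class with one-argument handlers (the real handle_starttag takes tag and attrs, so the data element would have to carry both):

```python
class Dispatcher:
    """Hypothetical class comparing getattr dispatch with a lookup table."""

    def __init__(self):
        self.calls = []
        # explicit token -> bound method table, built once per instance
        self.handlers = {
            "starttag": self.handle_starttag,
            "data": self.handle_data,
        }

    def handle_starttag(self, data):
        self.calls.append(("starttag", data))

    def handle_data(self, data):
        self.calls.append(("data", data))

    def dispatch_getattr(self, tokens):
        # name-based lookup: the token name must match a handle_* method
        for token, data in tokens:
            getattr(self, "handle_" + token)(data)

    def dispatch_table(self, tokens):
        # table lookup: an unknown token raises KeyError
        for token, data in tokens:
            self.handlers[token](data)


d = Dispatcher()
d.dispatch_getattr([("starttag", "p")])
d.dispatch_table([("data", "hi")])
```

the getattr variant needs no bookkeeping when a new token type is added, while the table makes the set of valid tokens explicit. both pick up handle_* overrides from subclasses, as long as the table is built from self as above.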
History
Date | User | Action | Args
2013-03-13 18:09:52 | flying sheep | set | recipients: + flying sheep
2013-03-13 18:09:52 | flying sheep | set | messageid: <1363198192.05.0.796177094664.issue17410@psf.upfronthosting.co.za>
2013-03-13 18:09:52 | flying sheep | link | issue17410 messages
2013-03-13 18:09:51 | flying sheep | create |