Message184096
hi, i have an idea on how to make an internal change to html.parser.HTMLParser, which would expose a token generator interface.
after that, we would be able to do e.g. list(HTMLParser().tokenize(data)) or even
parser = HTMLParser()
for chunk in pipe_in_html():
    yield from parser.tokenize(chunk)
---
the changes affect exclusively HTMLParser’s methods and would unfortunately require a behavior change to most (internal) parse_* methods. the changes go as follows:
1. the tokenize(data=None, end=False) method is added. it consists mainly of goahead’s body, with a prepended snippet that appends the passed data to rawdata, and with all handle_* calls changed to "yield token, data".
2. all parse_* methods which returned an int and called one handle_* method are changed to return an (int, token) tuple (so that tokenize can yield the tokens)
3. goahead is reduced to a skeleton implementation that iterates over the tokens produced by tokenize; its own behavior does not change.
in total, the only things that change are the behavior of the parse_* methods and the addition of the tokenize method. the tokens are discarded if goahead, feed, or close are called. (this can of course be changed if advisable)
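to give a feeling for the interface, here is a rough sketch that emulates it from the outside, using only the public handle_* hooks. it is not the proposed patch: the real change would live inside goahead and the parse_* methods. the class name TokenizingParser, the helper _make_recorder and the choice to carry the handler arguments as a tuple are assumptions made for this illustration.

```python
from html.parser import HTMLParser


def _make_recorder(token):
    """Build a handle_* replacement that records (token, args) instead of acting."""
    def handler(self, *args):
        self._tokens.append((token, args))
    return handler


class TokenizingParser(HTMLParser):
    """Hypothetical emulation of the proposed tokenize() on top of the public API."""

    def __init__(self):
        super().__init__()
        self._tokens = []

    def tokenize(self, data=None, end=False):
        # Mirrors the proposed signature: optionally append data, optionally finish.
        if data is not None:
            self.feed(data)
        if end:
            self.close()
        # Hand out everything parsed so far and start a fresh buffer.
        tokens, self._tokens = self._tokens, []
        yield from tokens


# Tokens are named like the handle_* methods, as in the proposal.
for _name in ("starttag", "endtag", "startendtag", "data", "comment", "decl", "pi"):
    setattr(TokenizingParser, "handle_" + _name, _make_recorder(_name))


tokens = list(TokenizingParser().tokenize("<p>hi</p>", end=True))
```

since handle_starttag takes two arguments, each token here carries an argument tuple rather than a single data value. text that is split across chunks may come out as several data tokens, just as feed may call handle_data more than once.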
---
since this is my first contribution, i’m unsure whether i should attach the patch already, not knowing whether the changes to the internal parse_* methods are acceptable at all. what do you say?
PS: the tokens are named like the handle_* methods, and the current goahead implementation basically calls getattr(self, 'handle_' + token)(data) for each (token, data) tuple. this can be changed to a token-to-method dict or a classic “switch” elif stack.
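the two dispatch styles from the PS can be compared in a small sketch. the names Recorder, dispatch_getattr, dispatch_table and SAMPLE_TOKENS are made up for this example, and the tokens carry an argument tuple (an assumption, since handle_starttag needs both tag and attrs).

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Hypothetical parser subclass that just logs which handlers were called."""

    def __init__(self):
        super().__init__()
        self.calls = []

    def handle_starttag(self, tag, attrs):
        self.calls.append(("starttag", tag, attrs))

    def handle_data(self, data):
        self.calls.append(("data", data))

    def handle_endtag(self, tag):
        self.calls.append(("endtag", tag))


def dispatch_getattr(parser, tokens):
    # Name-based lookup: the token name selects the handle_* method.
    for token, args in tokens:
        getattr(parser, "handle_" + token)(*args)


def dispatch_table(parser, tokens):
    # Explicit token -> bound method mapping; an unknown token raises KeyError.
    table = {
        "starttag": parser.handle_starttag,
        "data": parser.handle_data,
        "endtag": parser.handle_endtag,
    }
    for token, args in tokens:
        table[token](*args)


SAMPLE_TOKENS = [("starttag", ("p", [])), ("data", ("hi",)), ("endtag", ("p",))]

by_name = Recorder()
dispatch_getattr(by_name, SAMPLE_TOKENS)

by_table = Recorder()
dispatch_table(by_table, SAMPLE_TOKENS)
```

both variants end up calling the same handlers in the same order. the getattr form picks up new handle_* methods automatically, while the dict form makes the set of accepted tokens explicit.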
Date | User | Action | Args
2013-03-13 18:09:52 | flying sheep | set | recipients: + flying sheep
2013-03-13 18:09:52 | flying sheep | set | messageid: <1363198192.05.0.796177094664.issue17410@psf.upfronthosting.co.za>
2013-03-13 18:09:52 | flying sheep | link | issue17410 messages
2013-03-13 18:09:51 | flying sheep | create |