html.parser.HTMLParser: setting 'convert_charrefs = True' leads to dropped text #67333
Comments
If convert_charrefs is set to True, the final data section is not returned by feed(). It is held until the next tag is encountered.

---
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self, convert_charrefs=True)
        self.fed = []
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)
    def handle_data(self, data):
        print("Encountered some data :", data)

parser = MyHTMLParser()
parser.feed("foo <a>link</a> bar")
print("")
parser.feed("spam <a>link</a> eggs")

gives

Encountered some data : foo
Encountered some data : barspam

With 'convert_charrefs = False' it works as expected. |
You “forgot” to call close():

>>> parser.close()
Encountered some data : eggs

Perhaps this is a documentation bug, since there is a lot of example code given, but none of the examples call close(). |
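A minimal sketch of the point made above: calling close() after the last feed() makes the parser hand over any text it is still holding. The class name CollectingParser and its chunks attribute are hypothetical, introduced only for illustration. Whether the trailing text arrives during feed() or only at close() may depend on the Python version, so the sketch only inspects the result after close().

```python
from html.parser import HTMLParser


class CollectingParser(HTMLParser):
    """Hypothetical helper: records every text chunk passed to handle_data()."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


parser = CollectingParser()
parser.feed("foo <a>link</a> bar")
parser.close()  # flush any text the parser is still buffering

# Join the chunks, since the parser may deliver text in several pieces.
text = "".join(parser.chunks)
print(repr(text))
```

Joining the chunks rather than comparing them one by one keeps the check independent of how the parser splits the text between calls.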
That would make sense. Might also be worth mentioning the difference in behaviour with convert_charrefs = True/False as that was what led me to think this was a bug. |
Here is a patch that fixes the problem. |
I still think it would be worthwhile adding close() calls to the examples in the documentation (Doc/library/html.parser.rst).

BTW I haven’t tested this, and maybe it is not a concern, but even with this patch it looks like the parser will buffer unlimited data and output nothing until close() if each string it is fed ends with an ampersand (and otherwise contains only plain text, no tags etc). |
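The buffering concern raised above can be probed with a short experiment. This is only a sketch: CollectingParser and the sample strings are hypothetical, and the outcome reflects the parser's implementation rather than documented behaviour. On CPython versions that hold back text ending in a possibly incomplete character reference, the list captured before close() is expected to be empty.

```python
from html.parser import HTMLParser


class CollectingParser(HTMLParser):
    """Hypothetical helper: records every text chunk passed to handle_data()."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


parser = CollectingParser()
# Plain text only, and every piece ends with an ampersand.
for piece in ("fish &", " chips &"):
    parser.feed(piece)

before_close = list(parser.chunks)  # snapshot of what handle_data() has seen so far
parser.close()                      # force the parser to flush its buffer
after_close = "".join(parser.chunks)

print("before close():", before_close)
print("after close(): ", repr(after_close))
```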
If I add context manager support to HTMLParser I can update the examples to use it, but otherwise I don't think it's worth changing them now.
This is true, but I don't think it's a realistic case.
|
A context manager here would seem a bit strange. Is there any precedent for using context managers with feed parsers? The two others that come to mind are ElementTree.XMLParser and email.parser.FeedParser. These two build an object while parsing, and close() returns that object, so a context manager would be unhelpful. If an exception is raised inside the context manager, should close() be called (like for file objects), or not? |
I still haven't thought this through, but I can't see any problem with it right now. This would be similar to:

from contextlib import closing
with closing(MyHTMLParser()) as parser:
    parser.feed(html)

and this already seems to work fine, including with OP's case.
The parser is guaranteed to never raise parsing-related errors during parsing, so this shouldn't be an issue. I will open a new issue after fixing this so we can keep discussing there. |
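As a runnable variant of the contextlib.closing() idea above, the sketch below also touches on the earlier question about exceptions: closing() calls close() when the block exits, including when the block raises. CollectingParser, the input string, and the simulated RuntimeError are all hypothetical and used only for illustration.

```python
from contextlib import closing
from html.parser import HTMLParser


class CollectingParser(HTMLParser):
    """Hypothetical helper: records every text chunk passed to handle_data()."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


parser = CollectingParser()
try:
    with closing(parser):
        parser.feed("spam <a>link</a> eggs &")
        # Simulate an unrelated failure in the caller's own code.
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass  # closing() has already called parser.close() at this point

text_seen = "".join(parser.chunks)
print(repr(text_seen))
```

Because closing() always calls close() on exit, the trailing text is delivered even though the block did not finish normally. Whether that is the desired behaviour for a dedicated HTMLParser context manager is the open design question in this thread.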
@ezio I think you should commit what you have so far. LGTM. |
@ezio - you seem busy, so I'll commit this next week if it's still pending. |
I'll try to take care of this during the weekend. |
New changeset ef82131d0c93 by Ezio Melotti in branch '3.4':
New changeset 1f6155ffcaf6 by Ezio Melotti in branch '3.5':
New changeset 48ae9d66c720 by Ezio Melotti in branch 'default': |
Fixed, thanks for the report! |