Title: HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts;)
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.1, Python 3.2, Python 2.7
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: orsenthil Nosy List: Martin.Potthast, orsenthil, r.david.murray
Priority: normal Keywords: patch

Created on 2010-12-22 13:51 by Martin.Potthast, last changed 2010-12-28 16:11 by orsenthil. This issue is now closed.

File name Uploaded Description Edit Martin.Potthast, 2010-12-22 13:51 Martin.Potthast, 2010-12-22 16:44
Issue10759.diff orsenthil, 2010-12-23 14:58
Messages (9)
msg124506 - (view) Author: Martin Potthast (Martin.Potthast) Date: 2010-12-22 13:51
The title says it all; try the minimal example.
msg124507 - (view) Author: Martin Potthast (Martin.Potthast) Date: 2010-12-22 13:56
I'd suggest to better verify the input and return such strings unchanged.
msg124510 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-22 15:48
Leaving the input unchanged does seem to be what browsers do.  (Issue 7626 has some info on browser behaviour with invalid entity refs.)

Rather than pre-validating the input, I think the exception can be caught and the putative entity returned unchanged, similar to how a keyerror is handled for a named entity.

Would you have any interest in proposing a patch and tests?
msg124512 - (view) Author: Martin Potthast (Martin.Potthast) Date: 2010-12-22 16:44
Agreed. Here's a patch for HTMLParser. That was easy enough.

With regard to tests, there seems to be already one called test_malformatted_charref in However, the test tests the whole parser and not only HTMLParser.unescape().

At the same time, HTMLParser.unescape() has the following comment:
"# Internal -- helper to remove special character quoting"

It appears the syntax check is done in line 168 already, but since the unescape function is publicly visible, I'd say that it should be capable of handling all kinds of malformed input, despite that comment. Maybe this comment should be removed.

I'm not entirely sure how to write the test properly, since it doesn't fit into the framework provided by; and unfortunately, my time is rather short at the moment.
msg124513 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-22 16:52
Ah, as an undocumented internal interface it may in fact not be appropriate to make this change.  Or it may be.  I'll have to look at the code in more detail to figure that out, or perhaps Senthil will.  (It may even be time to document the function, we'll see.)

Thanks for the patch.
msg124516 - (view) Author: Martin Potthast (Martin.Potthast) Date: 2010-12-22 19:23
Why not simply remove the additional check in line 168 and leave the responsibility to check the validity of its input to the unescape function (be it explicitly or, like now, lazily). That way, the code changes are minimal, the existing test covers the current issue, and the function gets more robust.

By the way, I came across this function via Stackoverflow:
msg124555 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-12-23 14:58
Yes, I too agree that HTMLParser.unescape() should split-out malformed char-ref just as other browsers do.

But, as unescape function has undocumented/unexposed for releases, I am not sure making it exposed is a good idea. HTMLParser is more for event based parsing of tags, and unescape is a just a helper function in that context.

Given that reasoning if you see the malformatted test, you see that event based parsing does return the malformatted data properly For e.g -  ("data", "&#bad;").

Only calling unescape explicitly does not exhibit this behavior.

Martin: I am not sure if changing something in line 168 would solve the issue. In that particular block of code, the else condition is responsible for throwing the malformed charref on an event. If would like to elaborate a bit more on your suggestion, it would be helpful.

However, I do agree that unescape can be changed as per your patch and I have added a simple test to exercise that change. I think, this can go in.
msg124798 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-12-28 16:02
Fixed this in r87542 in (py3k). unescape is undocumented helper method, so no docs are added.

There was already an issue ( Issue6662) on malformed charref handling and it is fixed.
msg124799 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-12-28 16:11
r87544 (release27-maint) and r87545 (release31-maint).
Date User Action Args
2010-12-28 16:11:30orsenthilsetstatus: open -> closed
nosy: orsenthil, r.david.murray, Martin.Potthast
messages: + msg124799
2010-12-28 16:02:53orsenthilsetnosy: orsenthil, r.david.murray, Martin.Potthast
messages: + msg124798
resolution: fixed
stage: patch review -> resolved
2010-12-23 14:58:46orsenthilsetfiles: + Issue10759.diff
versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6
nosy: orsenthil, r.david.murray, Martin.Potthast
messages: + msg124555

assignee: orsenthil
stage: test needed -> patch review
2010-12-22 19:23:16Martin.Potthastsetnosy: orsenthil, r.david.murray, Martin.Potthast
messages: + msg124516
2010-12-22 16:52:44r.david.murraysetnosy: orsenthil, r.david.murray, Martin.Potthast
messages: + msg124513
stage: needs patch -> test needed
2010-12-22 16:44:48Martin.Potthastsetfiles: +

messages: + msg124512
keywords: + patch
nosy: orsenthil, r.david.murray, Martin.Potthast
2010-12-22 15:48:56r.david.murraysetnosy: + r.david.murray, orsenthil

messages: + msg124510
stage: needs patch
2010-12-22 13:56:09Martin.Potthastsettype: behavior
messages: + msg124507
2010-12-22 13:52:45Martin.Potthastsettitle: HTMLParser.unescape() cannot handle HTML entities with incorrect syntax (e.g. &#hearts;) -> HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts;)
2010-12-22 13:51:05Martin.Potthastcreate