Title: HTMLParser does not call handle_data when a tag contains no data.
Messages (7)
msg102392 - (view) Author: Winfried Plappert (wplappert) Date: 2010-04-05 18:08
When parsing HTML that contains a string along the lines of <td></td>, handle_data is not called between handle_starttag and handle_endtag, but only afterwards. The problem is in HTMLParser.goahead, where the positions i and j are calculated. The code reads
if i < j: self.handle_data(rawdata[i:j])
but it should be
if i <= j: self.handle_data(rawdata[i:j])

If there is data between <td> and </td>, everything works fine.

I just checked the trunk of 2.6, this occurs in line 142 of Lib/ The size of is 13407 bytes, and is dated 'Feb 26 19:25'.
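
The behaviour described above can be reproduced with a small event logger. This is only a sketch: it assumes Python 3's html.parser module rather than the 2.6 HTMLParser module, and the names EventLogger and log_events are made up for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every callback the parser issues, in order (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def log_events(html):
    """Parse html and return the list of recorded callbacks (hypothetical helper)."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events


# Empty cell: no "data" event appears between the start and end tag.
print(log_events("<td></td>"))
# Non-empty cell: handle_data is called with the cell text.
print(log_events("<td>x</td>"))
```

For the empty cell only the start and end events are recorded; for the non-empty cell a data event sits between them.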
msg102414 - (view) Author: Winfried Plappert (wplappert) Date: 2010-04-05 21:19
The same code can be found in the 3.1 distribution.
msg102430 - (view) Author: Winfried Plappert (wplappert) Date: 2010-04-06 03:01
Here is a test program (, some sample data (Shannon-2010.0.02-extract.html) and two output files (correct.out and wrong.out).
msg102433 - (view) Author: Winfried Plappert (wplappert) Date: 2010-04-06 03:21
In short, the correct output should be

which implies that one element is missing in the output stream :)
msg102436 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-04-06 04:29
But changing the way HTMLParser.goahead treats tags from
if i < j: self.handle_data(rawdata[i:j])
to
if i <= j: self.handle_data(rawdata[i:j])

is not the correct way to deal with this problem. In principle, what the parser currently does seems correct: as there is no data, it does not call handle_data.

I can understand your testcase, and I think there is some other way to handle the test you are mentioning.

If you change the above line, many of the existing tests may fail, so that *may not be* the way to go.
msg104128 - (view) Author: Winfried Plappert (wplappert) Date: 2010-04-24 20:45
I have modified my program so that it checks for data/no-data at the end of a td-call (td_end). Now it produces the correct result. I think you can close this issue.
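
A workaround along these lines might look roughly like the sketch below. The submitter's actual program is not shown in this thread, so this is an assumption about the approach, written against Python 3's html.parser; CellCollector and extract_cells are hypothetical names.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect one string per <td>, including empty cells (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._buffer = None  # None while outside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._buffer = []  # start collecting data for this cell

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Decide data/no-data here: an empty buffer yields an empty string,
        # so the parser itself never needs to call handle_data("").
        if tag == "td" and self._buffer is not None:
            self.cells.append("".join(self._buffer))
            self._buffer = None


def extract_cells(html):
    """Return the text of every <td> in html (hypothetical helper)."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


print(extract_cells("<tr><td>a</td><td></td><td>c</td></tr>"))
```

The empty cell shows up as an empty string in the result, so no element goes missing from the output stream and HTMLParser.goahead stays unchanged.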
msg104136 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-04-25 00:30
Thanks. Closing on submitter's note.
