classification
Title: HTMLparser does not handle call to handle_data when a tag contains no data.
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.1, Python 2.6
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: eric.araujo, orsenthil, pythonhacker, wplappert
Priority: normal Keywords: easy

Created on 2010-04-05 18:08 by wplappert, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
shannon_data.py wplappert, 2010-04-06 03:01 The test program
Shannon-2010.0.02-extract.html wplappert, 2010-04-06 03:01 The sample data
correct.out wplappert, 2010-04-06 03:02 Expected output, fix applied
wrong.out wplappert, 2010-04-06 03:03 Output with current version of HTMLParser.py
shannon_data-v2.py wplappert, 2010-04-24 20:45 Modified test program
Messages (7)
msg102392 - (view) Author: Winfried Plappert (wplappert) Date: 2010-04-05 18:08
When parsing HTML that contains an empty element such as <td></td>, handle_data is not called between handle_starttag and handle_endtag, but only afterwards. The problem is in HTMLParser.goahead, where the positions i and j are calculated. The code reads
if i < j: self.handle_data(rawdata[i:j])
but it should be
if i <= j: self.handle_data(rawdata[i:j])

If there is data between <td> and </td>, everything works fine.

I just checked the trunk of 2.6; this occurs in line 142 of Lib/HTMLParser.py. The size of HTMLParser.py is 13407 bytes, and it is dated 'Feb 26 19:25'.
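
The behavior described above can be reproduced with a minimal sketch. It assumes Python 3's html.parser module rather than the Python 2.6 HTMLParser module the report was filed against, and the names EventLogger and log_events and the sample row are hypothetical, not taken from the attached test program.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every callback so the order of events is visible."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def log_events(html):
    """Feed html to a fresh parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events


# Hypothetical sample row: the middle cell is empty.
events = log_events("<td>6.5</td><td></td><td>7.8</td>")
for event in events:
    print(event)
```

For the empty cell, only a start event and an end event are recorded; no data event appears between them.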
msg102414 - (view) Author: Winfried Plappert (wplappert) Date: 2010-04-05 21:19
The same code can be found in the 3.1 distribution.
msg102430 - (view) Author: Winfried Plappert (wplappert) Date: 2010-04-06 03:01
Here is a test program (shannon_data.py), some sample data (Shannon-2010.0.02-extract.html) and two output files (correct.out and wrong.out).
msg102433 - (view) Author: Winfried Plappert (wplappert) Date: 2010-04-06 03:21
In short, the correct output should be
2/4/2010;6.3;11.1;0.8;6.5;;7.8;-5
versus
2/4/2010;6.3;11.1;0.8;6.5;7.8;-5

which implies that one element is missing in the output stream :)
msg102436 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-04-06 04:29
But changing the way HTMLParser.goahead treats tags from
if i < j: self.handle_data(rawdata[i:j])
to
if i <= j: self.handle_data(rawdata[i:j])

is not the correct way to deal with this problem. In principle, what it currently does seems correct: as there is no data, handle_data is not called.

I understand your test case, and I think there is another way to handle the situation you mention.

If you change the above line, many of the existing tests may fail, so that *may not be* the way to go.
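
To illustrate why the proposed change would affect other input: when one tag immediately follows another, i and j are equal and the slice rawdata[i:j] is empty. The snippet below is a simplified illustration, not the actual goahead code; it assumes j is simply the position of the next "<" at or after i.

```python
rawdata = "<td></td>"

# After "<td>" has been consumed, the scan position is just past it.
i = len("<td>")
# Assumption: j is the position of the next "<" at or after i.
j = rawdata.find("<", i)

# With i == j the slice is empty, so "i <= j" would pass an
# empty string to handle_data at this tag boundary.
text = rawdata[i:j]
print(i, j, repr(text))
```

Under that assumption, the same empty-string call would presumably occur wherever two tags are adjacent, not only inside empty td elements.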
msg104128 - (view) Author: Winfried Plappert (wplappert) Date: 2010-04-24 20:45
I have modified my program so that it checks for data/no data at the end of a td element (td_end). Now it produces the correct result. I think you can close this issue.
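
A minimal sketch of this kind of workaround, assuming Python 3's html.parser: the parser buffers text while inside a td element and emits the cell in handle_endtag, so an empty cell still produces an empty field. The names CellCollector and row_to_line and the sample row are hypothetical; the attached shannon_data-v2.py may be structured differently.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>, using "" for empty cells."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._buffer = None  # None means "not inside a <td>"

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buffer is not None:
            # Emit the cell here, whether or not any data arrived.
            self.cells.append("".join(self._buffer).strip())
            self._buffer = None


def row_to_line(html):
    """Return the cells of html joined with semicolons."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return ";".join(parser.cells)


# Hypothetical sample row with one empty cell.
row = "<tr><td>6.5</td><td></td><td>7.8</td><td>-5</td></tr>"
print(row_to_line(row))  # prints 6.5;;7.8;-5
```

Deciding in handle_endtag keeps the parser itself unchanged and leaves the handling of empty cells to the application.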
msg104136 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-04-25 00:30
Thanks. Closing on submitter's note.
History
Date User Action Args
2022-04-11 14:56:59 admin set github: 52566
2010-04-25 00:30:27 orsenthil set status: open -> closed; resolution: not a bug; messages: + msg104136; stage: test needed -> resolved
2010-04-24 20:45:08 wplappert set files: + shannon_data-v2.py; messages: + msg104128
2010-04-21 22:09:31 eric.araujo set nosy: + eric.araujo
2010-04-12 13:57:42 pythonhacker set nosy: + pythonhacker
2010-04-06 04:29:51 orsenthil set messages: + msg102436
2010-04-06 03:21:34 wplappert set messages: + msg102433
2010-04-06 03:03:24 wplappert set files: + wrong.out
2010-04-06 03:02:33 wplappert set files: + correct.out
2010-04-06 03:01:47 wplappert set files: + Shannon-2010.0.02-extract.html
2010-04-06 03:01:11 wplappert set files: + shannon_data.py; messages: + msg102430
2010-04-05 21:19:58 wplappert set messages: + msg102414; versions: + Python 3.1
2010-04-05 19:24:27 r.david.murray set priority: normal; nosy: + orsenthil; keywords: + easy; stage: test needed
2010-04-05 18:28:49 wplappert set title: HTMLparser does not handle call to handle_data when a tag contains nor data. -> HTMLparser does not handle call to handle_data when a tag contains no data.
2010-04-05 18:08:54 wplappert create