classification
Title: sgmllib fail to parse html containing
Type: behavior Stage:
Components: Library (Lib) Versions: Python 2.6
process
Status: closed Resolution: works for me
Dependencies: Superseder:
Assigned To: Nosy List: georg.brandl, halfjuice
Priority: normal Keywords:

Created on 2010-10-06 04:28 by halfjuice, last changed 2010-10-06 07:16 by georg.brandl. This issue is now closed.

Messages (6)
msg118048 - (view) Author: halfjuice (halfjuice) Date: 2010-10-06 04:27
When parsing HTML containing the following tag:
... <!- ie6 doesn't allow empty div. -> ...
SGMLParser stops parsing the content that follows, without any warning. When the tag is removed, everything works fine.

Looking into sgmllib.py, I found the statement below:

    if rawdata.startswith("<!", i):
        # This is some sort of declaration; in "HTML as
        # deployed," this should only be the document type
        # declaration ("<!DOCTYPE html...>").

I think that's why something goes wrong here.
msg118049 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-10-06 05:07
Are you sure you got the comment syntax right? e.g.
<!-- ie6 doesn't allow empty div. -->

SGMLParser should handle that.
msg118052 - (view) Author: halfjuice (halfjuice) Date: 2010-10-06 06:08
Well, <!-- ... --> is OK since it's a comment. <!- ... -> is probably an IE hack. See http://www.google.com/dictionary?langpair=en|zh-CN&q=vague&hl=en&aq=f
msg118053 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-10-06 06:39
Is that URL really what you wanted to show me?

Also, I'm not intimate with all of SGML's syntax, but ISTM that what you show here is invalid SGML, and as such SGMLParser is not required to parse it.
msg118054 - (view) Author: halfjuice (halfjuice) Date: 2010-10-06 07:10
Sorry, the URL on the page is sort of broken. The URL contains the "<!- ... ->" stuff.

I think you're right, the <!- is probably just a mistake which is not in the SGML standard. But I'm wondering if the SGMLParser can SKIP such an invalid statement? My browser does this.
msg118055 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-10-06 07:16
The browser needs to be very liberal in what it accepts, since nobody wants their page view to break because of such a technicality. This is different for a tool like SGMLParser.

In light of this, and because sgmllib is removed anyway in Python 3, I'm closing this.
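
For readers who hit the same input: since the issue was closed without changing sgmllib, one possible workaround is to strip the malformed "<!- ... ->" spans before feeding the text to the parser. The snippet below is only a sketch under that assumption. The helper name strip_bogus_comments and the regular expression are illustrative and are not part of sgmllib or of this thread.

```python
import re

# Hypothetical helper, not part of sgmllib. The pattern matches "<!-" that is
# NOT followed by a second "-" (so well-formed "<!--" comments are left alone)
# and then everything up to the first "->".
_BOGUS_COMMENT = re.compile(r"<!-(?!-).*?->", re.DOTALL)


def strip_bogus_comments(html):
    """Return html with malformed '<!- ... ->' pseudo-comments removed."""
    return _BOGUS_COMMENT.sub("", html)


sample = (
    "<div><!- ie6 doesn't allow empty div. -></div>"
    "<!-- real comment --><p>text</p>"
)
cleaned = strip_bogus_comments(sample)
print(cleaned)
```

The cleaned string could then be passed to SGMLParser.feed() as usual. The pattern is deliberately narrow: it targets only the single-dash form reported here and makes no attempt to handle other invalid declarations.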
History
Date User Action Args
2010-10-06 07:16:28 georg.brandl set status: open -> closed; messages: + msg118055
2010-10-06 07:10:14 halfjuice set messages: + msg118054
2010-10-06 06:39:14 georg.brandl set messages: + msg118053
2010-10-06 06:08:01 halfjuice set status: pending -> open; messages: + msg118052
2010-10-06 05:07:08 georg.brandl set status: open -> pending; nosy: + georg.brandl; messages: + msg118049; resolution: works for me
2010-10-06 04:28:01 halfjuice create