classification
Title: Can SGMLParser properly handle <empty/> tags?
Type: behavior
Stage: resolved
Components: Extension Modules, Library (Lib), XML
Versions: Python 2.6
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: eric.araujo, ezio.melotti, once-off
Priority: normal
Keywords: easy

Created on 2009-03-17 11:19 by once-off, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name: sgml_error.py
Uploaded: once-off, 2009-03-17 11:19
Messages (3)
msg83667 - Author: (once-off) Date: 2009-03-17 11:19
The attached script (sgml_error.py) was designed to output XML files
unchanged, other than expanding <empty/> tags into an opening and
closing tag, such as <empty></empty>.

It seems the SGMLParser class recognizes an empty tag, but does not emit
the closing tag until the NEXT forward slash it sees. So everything from
the forward slash in <empty/> (even the closing angle bracket) until the
next forward slash is considered to be textual data. See the
command-line output below.

Have I missed something here (like a conscious design limitation of the
class, an error on my part, etc.), or is this really a bug in the class?

C:\Python24\Lib>python sgmllib.py H:\input.xml
start tag: <root>
data: '\n '
start tag: <tag1>
end tag: </tag1>
data: '\n '
start tag: <tag2>
data: '>\n <tag3>hello<'
end tag: </tag2>
data: 'tag3>\n'
end tag: </root>

C:\Python24\Lib>python
ActivePython 2.4.3 Build 12 (ActiveState Software Inc.) based on
Python 2.4.3 (#69, Apr 11 2006, 15:32:42) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sgml_error

Input:
<root>
 <tag1></tag1>
 <tag2/>
 <tag3>hello</tag3>
</root>

Output:
<root>
 <tag1></tag1>
 <tag2>>
 <tag3>hello<</tag2>tag3>
</root>

Expected:
<root>
 <tag1></tag1>
 <tag2></tag2>
 <tag3>hello</tag3>
</root>
msg98424 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-01-27 13:40
Hello

Tags of the form <tag/> are an SGML hack, or more precisely the combination of two features of SGML. The forward slash closes the start tag, and the angle bracket that follows is character data, not part of the tag.
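
To make that reading concrete, here is a toy sketch in Python 3 of the SGML short-tag form, in which `<name/` opens the element and the next `/` closes it. The names `NET_FORM` and `read_net` are made up for this illustration, and the pattern is an assumption about how such a parser might scan the input, not a copy of sgmllib's code:

```python
import re

# Toy pattern (an assumption for illustration): <name/ data /
# The first slash ends the start tag, everything up to the next slash
# is character data, and that next slash acts as the end tag.
NET_FORM = re.compile(r"<([A-Za-z][-.A-Za-z0-9]*)/([^/]*)/")


def read_net(text):
    """Return (events, remaining_text) for a leading short tag, or None."""
    match = NET_FORM.match(text)
    if match is None:
        return None
    tag, data = match.groups()
    events = [("start", tag), ("data", data), ("end", tag)]
    return events, text[match.end():]


events, rest = read_net("<tag2/>\n <tag3>hello</tag3>\n</root>")
for kind, value in events:
    print(kind, repr(value))
print("rest", repr(rest))
```

Under this reading the `>` after `<tag2/` and the whole of `<tag3>hello<` end up as data inside tag2, and the leftover text starts with `tag3>`, which is the shape of the output in the report.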

The W3C validator uses a real SGML parser for HTML doctypes, and fails on XML-like /> constructs: http://validator.w3.org/check?uri=data%3Atext%2Fhtml%2C%3C!DOCTYPE+html+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+HTML+4.01%2F%2FEN%22+%22http%3A%2F%2Fwww.w3.org%2FTR%2Fhtml4%2Fstrict.dtd%22%3E+%3Chtml%3E+%3Chead%3E+++%3Ctitle%3ETest%3C%2Ftitle%3E+++%3Cmeta+name%3Dtest+content%3Done%2F%3E+++%3Cmeta+name%3Dbug+content%3Dtwo%3E+%3C%2Fhead%3E+%3Cbody%3E+++%3Cp%3ETest%3C%2Fp%3E+%3C%2Fbody%3E+%3C%2Fhtml%3E&charset=%28detect+automatically%29&doctype=Inline&group=1&verbose=1

The complete explanation can be read at http://www.cs.tut.fi/~jkorpela/html/empty.html

In conclusion, sgmllib is right. Use an XML parser for XML or an HTML5 parser for HTML.
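
A minimal sketch of the XML-parser route, assuming Python 3.4 or later (where `xml.etree.ElementTree.tostring` accepts `short_empty_elements`). The function name `expand_empty_tags` and the `SAMPLE` string are illustrative only, with the sample copied from the input in the report. This is not the attached sgml_error.py, just one possible way to get the expected output:

```python
import xml.etree.ElementTree as ET

# Sample input taken from the report above.
SAMPLE = """<root>
 <tag1></tag1>
 <tag2/>
 <tag3>hello</tag3>
</root>"""


def expand_empty_tags(xml_text):
    """Re-serialize XML so that <empty/> becomes <empty></empty>."""
    root = ET.fromstring(xml_text)
    # short_empty_elements=False makes the serializer always write an
    # explicit closing tag, even for elements without content.
    return ET.tostring(root, encoding="unicode", short_empty_elements=False)


print(expand_empty_tags(SAMPLE))
```

ElementTree does not keep comments, the XML declaration or the original attribute quoting, so in general this is not a byte-for-byte round trip of the input file.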

Kind regards
msg98425 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-01-27 13:45
Damn, the URI got fubared :/ Anyway, I just wanted to give an example of the verbose error message, but the second link will contain enough explanation.

Regards
History
Date                 User          Action  Args
2022-04-11 14:56:46  admin         set     github: 49748
2010-02-05 16:23:31  ezio.melotti  set     nosy: + ezio.melotti
2010-02-05 16:00:44  ezio.melotti  set     status: open -> closed
                                           priority: normal
                                           resolution: not a bug
                                           stage: test needed -> resolved
2010-01-27 13:45:03  eric.araujo   set     messages: + msg98425
2010-01-27 13:40:36  eric.araujo   set     nosy: + eric.araujo
                                           messages: + msg98424
2009-04-22 14:38:05  ajaksu2       set     keywords: + easy
                                           stage: test needed
                                           versions: + Python 2.6, - Python 2.5, Python 2.4, 3rd party
2009-03-17 11:19:34  once-off      create