classification
Title: <> in attrs in sgmllib not handled
Components: Library (Lib)
process
Status: closed
Resolution: duplicate
Superseder: sgmllib should allow angle brackets in quoted values (issue 1504333)
Nosy List: barnabas79, benjamin.peterson, fdrake, georg.brandl, loewis, sambayer
Priority: normal

Created on 2003-05-28 16:30 by sambayer, last changed 2022-04-10 16:08 by admin. This issue is now closed.

Messages (9)
msg60332 - (view) Author: Samuel Bayer (sambayer) Date: 2003-05-28 16:30
Hi folks -

This bug is noted in the source code for sgmllib.py,
and it finally bit me. If you feed the SGMLParser class
text such as

<tag attr = "<attrtag> bar </attrtag>">foo</tag>

the <attrtag> will be processed as a tag, as well as
being recognized as part of the attribute. This is
because of the way the end index for the opening tag is
computed.
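
The effect can be reproduced without sgmllib itself. The following is a minimal sketch, assuming the end-of-tag pattern is the plain [<>] regexp described later in this report; naive_tag_end is a hypothetical helper for illustration, not an sgmllib function.

```python
import re

# Assumed to mirror the simple end-of-tag pattern described in this report.
endbracket = re.compile(r'[<>]')

def naive_tag_end(rawdata, i):
    """Return the index of the first '<' or '>' after the '<' at i, or -1."""
    match = endbracket.search(rawdata, i + 1)
    return match.start(0) if match else -1

text = '<tag attr = "<attrtag> bar </attrtag>">foo</tag>'
j = naive_tag_end(text, 0)

# The start tag is cut off inside the quoted attribute value ...
print(repr(text[:j]))
# ... and the remainder begins with <attrtag>, which then looks like a new tag.
print(repr(text[j:]))
```

Because the search ignores quoting, the first bracket it finds is the one that opens <attrtag> inside the attribute value.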

As far as I can tell from the HTML 4.01 specification,
this is legal. The case I encountered was in the value of
an "onmouseover" attribute, which was a JavaScript call
that contained HTML text as one of its arguments.

The problem is in SGMLParser.parse_starttag, which
computes the end of the opening tag in advance with a
simple regexp, [<>], and keeps using that index even when
the parsed attributes extend past it. There's no real need
to check this regexp in advance, as far as I can tell.
I've attached my proposed modification of
SGMLParser.parse_starttag; I've tested this change in
2.2.1, but there are no relevant differences between
2.2.1 and the head of the CVS tree for this method. No
guarantees of correctness, but it works on the examples
I've tested it on.
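
The idea of consuming whole attributes first and only then looking for the closing bracket can be sketched independently of sgmllib. This is an illustration under assumptions, not the attached patch: find_tag_end, attr_pattern, tag_name and at_endbracket are hypothetical names, and the simplified attribute pattern is not the attrfind regexp that sgmllib uses.

```python
import re

# Hypothetical, simplified patterns (not the ones defined in sgmllib).
tag_name = re.compile(r'[a-zA-Z][-.a-zA-Z0-9]*')
attr_pattern = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)'                 # attribute name
    r'(\s*=\s*("[^"]*"|\'[^\']*\'|[^\s<>"\']+))?'     # optional "= value"
)
at_endbracket = re.compile(r'\s*[<>]')   # same role as w_endbracket in the patch
endbracket = re.compile(r'[<>]')

def find_tag_end(rawdata, i):
    """Return (attrs, end_index) for the start tag opening at i, or None."""
    match = tag_name.match(rawdata, i + 1)
    if not match:
        return None
    k = match.end(0)
    attrs = []
    # Consume complete attributes, quoted values included, so that brackets
    # inside quotes are never mistaken for the end of the tag.
    while at_endbracket.match(rawdata, k) is None:
        match = attr_pattern.match(rawdata, k)
        if not match:
            break
        name, rest, value = match.group(1, 2, 3)
        if not rest:
            value = name
        elif value[:1] in ('"', "'") and value[:1] == value[-1:]:
            value = value[1:-1]
        attrs.append((name.lower(), value))
        k = match.end(0)
    # Only now look for the bracket that closes the tag.
    match = endbracket.search(rawdata, k)
    if not match:
        return None
    j = match.start(0)
    if rawdata[j] == '>':
        j += 1
    return attrs, j

text = '<tag attr = "<attrtag> bar </attrtag>">foo</tag>'
attrs, end = find_tag_end(text, 0)
print(attrs)
print(repr(text[end:]))
```

In this sketch the search for the closing bracket starts after the last attribute that was consumed, so the brackets inside the quoted value are skipped.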

Cheers -
Sam Bayer

================================

w_endbracket = re.compile(r"\s*[<>]")

class SGMLParser:
    # Internal -- handle starttag, return length or -1 if not terminated
    def parse_starttag(self, i):
        self.__starttag_text = None
        start_pos = i
        rawdata = self.rawdata
        if shorttagopen.match(rawdata, i):
            # SGML shorthand: <tag/data/ == <tag>data</tag>
            # XXX Can data contain &... (entity or char refs)?
            # XXX Can data contain < or > (tag characters)?
            # XXX Can there be whitespace before the first /?
            match = shorttag.match(rawdata, i)
            if not match:
                return -1
            tag, data = match.group(1, 2)
            self.__starttag_text = '<%s/' % tag
            tag = tag.lower()
            k = match.end(0)
            self.finish_shorttag(tag, data)
            self.__starttag_text = rawdata[start_pos:match.end(1) + 1]
            return k
        
        # Now parse the data between i+1 and the end of the tag into a tag and attrs
        attrs = []
        if rawdata[i:i+2] == '<>':
            # SGML shorthand: <> == <last open tag seen>
            k = i + 1
            tag = self.lasttag
        else:
            match = tagfind.match(rawdata, i+1)
            if not match:
                self.error('unexpected call to parse_starttag')
            k = match.end(0)
            tag = rawdata[i+1:k].lower()
            self.lasttag = tag
        while w_endbracket.match(rawdata, k) is None:
            match = attrfind.match(rawdata, k)
            if not match: break
            attrname, rest, attrvalue = match.group(1, 2, 3)
            if not rest:
                attrvalue = attrname
            elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
                 attrvalue[:1] == '"' == attrvalue[-1:]:
                attrvalue = attrvalue[1:-1]
            attrs.append((attrname.lower(), attrvalue))
            k = match.end(0)
        match = endbracket.search(rawdata, k)
        if not match:
            return -1
        j = match.start(0)   
        if rawdata[j] == '>':
            j = j+1
        self.__starttag_text = rawdata[start_pos:j]
        self.finish_starttag(tag, attrs)
        return j
msg60333 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2003-06-14 07:58
If this is a known bug, why are you reporting it?
msg60334 - (view) Author: Samuel Bayer (sambayer) Date: 2003-06-14 13:35
I'm reporting it because

(a) it's not in the bug queue, and
(b) it's broken

The fact that it's noted as a bug in the source code doesn't
strike me as relevant. Especially since I attached a fix.
msg60335 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2003-06-14 15:23
I see. Can you please attach the fix as a context or unified
diff to this report? I can't follow your changes above at all.
msg60336 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2006-06-23 06:16
See also: http://www.python.org/sf/803422
msg63505 - (view) Author: Paul Molodowitch (barnabas79) Date: 2008-03-13 14:50
I posted a patch that worked on all the test cases for me over at
http://bugs.python.org/issue1504333...
msg63506 - (view) Author: Paul Molodowitch (barnabas79) Date: 2008-03-13 14:53
errr... why was my last message classified as spam? =(
Is there some policy here I'm violating that I'm unaware of?  I would
think consolidating similar issues would be a good thing...
msg63514 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-03-13 20:54
>errr... why was my last message classified as spam? =(
It's not your fault. The spam filter was just confused. Somebody more
powerful than me can, I believe, reeducate the spam filter and
allow us to read it.
msg111928 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-07-29 14:02
Setting #1504333, which has a patch, as superseder.
History
Date                 User               Action  Args
2022-04-10 16:08:55  admin              set     github: 38561
2010-07-29 14:02:30  georg.brandl       set     status: open -> closed; nosy: + georg.brandl; messages: + msg111928; superseder: sgmllib should allow angle brackets in quoted values; resolution: duplicate
2008-03-13 20:54:11  benjamin.peterson  set     nosy: + benjamin.peterson; messages: + msg63514
2008-03-13 14:53:26  barnabas79         set     messages: + msg63506
2008-03-13 14:50:09  barnabas79         set     nosy: + barnabas79; messages: + msg63505
2003-05-28 16:30:03  sambayer           create