classification
Title: Request: getpos() for sgmllib
Type: enhancement
Stage: test needed
Components: Library (Lib)
Versions: Python 3.2
process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To:
Nosy List: BreamoreBoy, ajaksu2, d98dzone, nnorwitz
Priority: normal
Keywords: easy, patch

Created on 2003-11-25 16:47 by d98dzone, last changed 2010-08-22 10:44 by BreamoreBoy. This issue is now closed.

Files
File name: diff.txt
Uploaded: d98dzone, 2003-12-02 11:55
Description: Unix diff between the updated version and the CVS version (1.46)
Messages (6)
msg19144 - (view) Author: Dan Wiklund (d98dzone) Date: 2003-11-25 16:47
While working on my master's thesis I
discovered the need for a working getpos() in
sgmllib.py. As it stands, you can call it successfully,
since it is inherited from markupbase.py, but you will
always get (1, 0) back because the position is never updated.
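
To illustrate why the answer never changes, here is a minimal, self-contained sketch of the kind of (line, offset) bookkeeping that getpos() relies on. The class name PositionTracker is hypothetical, and the logic is only modeled on what markupbase.py does, so treat it as an assumption rather than the actual library code. The point it shows: the position starts at (1, 0) and only moves when something calls updatepos().

```python
class PositionTracker:
    """Hypothetical stand-in for the (line, offset) bookkeeping behind getpos()."""

    def __init__(self, rawdata):
        self.rawdata = rawdata
        self.lineno = 1   # lines are counted from 1
        self.offset = 0   # columns are counted from 0

    def updatepos(self, i, j):
        """Advance the position over rawdata[i:j] and return j."""
        if i >= j:
            return j
        nlines = self.rawdata.count("\n", i, j)
        if nlines:
            self.lineno += nlines
            # The offset restarts after the last newline in the consumed span.
            pos = self.rawdata.rindex("\n", i, j)
            self.offset = j - (pos + 1)
        else:
            self.offset += j - i
        return j

    def getpos(self):
        return self.lineno, self.offset


tracker = PositionTracker("<p>\nhello <b>")
start = tracker.getpos()        # nothing consumed yet
i = tracker.updatepos(0, 3)     # consume "<p>"
i = tracker.updatepos(i, 10)    # consume "\nhello "
after = tracker.getpos()
print(start, after)
```

A parser loop that consumes input without ever calling updatepos() would keep reporting the value in `start` forever, which matches the behaviour described above.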

To fix this, the goahead function needs to change.
This is my own implementation of that change, in part
influenced by the "sister" goahead function in
HTMLParser.py:


************************************
def goahead(self, end):
        rawdata = self.rawdata
        i = 0
        k = 0
        n = len(rawdata)
        tmp=0
        while i < n:
            if self.nomoretags:
                self.handle_data(rawdata[i:n])
                i = n
                break
            match = interesting.search(rawdata, i)
            if match: j = match.start()
            else: j = n
            if i < j:
                self.handle_data(rawdata[i:j])
                tmp = self.updatepos(i, j)
            i = j
            if i == n: break
            startswith = rawdata.startswith
            if rawdata[i] == '<':
                if starttagopen.match(rawdata, i):
                    if self.literal:
                        self.handle_data(rawdata[i])
                        tmp = self.updatepos(i, i+1)
                        i = i+1
                        continue
                    k = self.parse_starttag(i)
                    if k < 0: break
                    tmp = self.updatepos(i, k)
                    i = k
                    continue
                if rawdata.startswith("</", i):
                    k = self.parse_endtag(i)
                    if k < 0: break
                    tmp = self.updatepos(i, k)
                    i = k
                    self.literal = 0
                    continue
                if self.literal:
                    if n > (i + 1):
                        self.handle_data("<")
                        # Advance over the single "<" just emitted; k may be
                        # stale here, so use i+1 rather than k.
                        tmp = self.updatepos(i, i+1)
                        i = i+1
                    else:
                        # incomplete
                        break
                    continue
                if rawdata.startswith("<!--", i):
                        # Strictly speaking, a comment is --.*--
                        # within a declaration tag <!...>.
                        # This should be removed,
                        # and comments handled only in parse_declaration.
                    k = self.parse_comment(i)
                    
                    if k < 0: break
                    tmp = self.updatepos(i, k)
                    i = k

                    continue
                if rawdata.startswith("<?", i):
                    k = self.parse_pi(i)
                    if k < 0: break
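                    # parse_pi() in sgmllib returns a length rather than an
                    # end index (hence i = i+k below), so the end is i+k.
                    tmp = self.updatepos(i, i+k)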
                    i = i+k
                    continue
                if rawdata.startswith("<!", i):
                    # This is some sort of declaration; in "HTML as
                    # deployed," this should only be the document type
                    # declaration ("<!DOCTYPE html...>").
                    k = self.parse_declaration(i)
                    if k < 0: break
                    tmp = self.updatepos(i, k)
                    i = k
                    continue
                tmp = self.updatepos(i, k)
            elif rawdata[i] == '&':
                
                if self.literal:
                    self.handle_data(rawdata[i])
                    #tmp = self.updatepos(i,i+1)#added
                    i = i+1
                    continue
                match = charref.match(rawdata, i)
                if match:
                    name = match.group()[2:-1]
                    self.handle_charref(name)
                    k = match.end()
                    if not startswith(';', k-1):
                        k = k - 1
                    tmp = self.updatepos(i, k)
                    i = k
                    continue
                match = entityref.match(rawdata, i)
                if match:
                    name = match.group(1)
                    self.handle_entityref(name)
                    k = match.end()
                    if not startswith(';', k-1):
                        k = k - 1
                    tmp = self.updatepos(i, k)
                    i = k
                    continue
                
            else:
                self.error('neither < nor & ??')
            # We get here only if incomplete matches but
            # nothing else
            match = incomplete.match(rawdata, i)
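            if not match:
                self.handle_data(rawdata[i])
                tmp = self.updatepos(i, i+1)  # keep getpos() in step with the data
                i = i+1
                continue
            j = match.end(0)
            if j == n:
                break # Really incomplete
            self.handle_data(rawdata[i:j])
            tmp = self.updatepos(i, j)  # keep getpos() in step with the data
            i = j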

            
        # end while
        if end and i < n:
            self.handle_data(rawdata[i:n])
            tmp = self.updatepos(i, n)
            i = n
        self.rawdata = rawdata[i:]
        # XXX if end: check for empty stack

    # Extensions for the DOCTYPE scanner:
    _decl_otherchars = '='

****************************

The major difference is the added updatepos() calls. It
seems to work fine, or at least it has worked fine for
me so far.
msg19145 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2003-11-25 18:51
Can you please post a context diff against the version in
CVS as an attachment?  Formatting is not preserved when
viewing through SF.  Thanks.
msg19146 - (view) Author: Dan Wiklund (d98dzone) Date: 2003-12-02 12:16
Added an attachment with the difference to the current file
version. It has three parts. The first is just updatepos
inserted at the correct places in the function goahead. The
second is the part of the goahead function that
handles the & characters. I had a hard time making it work
with the current model, so I changed it to a version inspired
by the same part of the goahead function in HTMLParser.py.
The last is the printouts in the test function, used to check
that the function performs correctly.
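
The &-handling described above can be sketched roughly as follows. This is a hedged illustration, not the attached diff: the function name scan_charref is hypothetical, and the regular expression is only modeled on the character-reference pattern used in HTMLParser.py, so the exact pattern is an assumption. It shows the one subtle step in the posted code: the pattern consumes one terminating character, and if that character is not ";" it is handed back by stepping k back by one.

```python
import re

# Assumed pattern, modeled on the character-reference regex in HTMLParser.py.
# It matches "&#" + digits (or "x" + hex digits) + exactly one terminator char.
CHARREF = re.compile(r'&#(?:[0-9]+|[xX][0-9a-fA-F]+)[^0-9a-fA-F]')


def scan_charref(rawdata, i):
    """Return (name, k) for a character reference at index i, or None.

    k is the index at which parsing should resume.
    """
    match = CHARREF.match(rawdata, i)
    if not match:
        return None
    name = match.group()[2:-1]   # drop the leading "&#" and the terminator
    k = match.end()
    if not rawdata.startswith(';', k - 1):
        k -= 1                   # terminator was not ";": leave it for the data
    return name, k


print(scan_charref("&#65;x", 0))
print(scan_charref("&#65 x", 0))
print(scan_charref("&amp;", 0))
```

In the patch, the returned k is what gets passed to updatepos(i, k) before i is advanced, so the reported position covers exactly the characters that were consumed.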
msg81883 - (view) Author: Daniel Diniz (ajaksu2) (Python triager) Date: 2009-02-13 05:21
Closed #868908 as a duplicate of this one.
msg114300 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-18 23:35
Anyone interested in this?  I found the patch unreadable but YMMV.
msg114669 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-22 10:44
sgmllib has been deprecated since 2.6 and has been removed from py3k.
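
For anyone landing here on Python 3, a small sketch of getting positions from html.parser.HTMLParser instead, which does keep getpos() up to date. The subclass name TagPositions and the sample input are made up for illustration; only HTMLParser, feed(), close(), handle_starttag() and getpos() come from the standard library.

```python
from html.parser import HTMLParser


class TagPositions(HTMLParser):
    """Hypothetical helper: record (tag, (line, offset)) for every start tag."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() reports where the tag currently being handled begins.
        self.positions.append((tag, self.getpos()))


parser = TagPositions()
parser.feed("<p>\n<b>x</b>")
parser.close()
print(parser.positions)
```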
History
Date User Action Args
2010-08-22 10:44:01 BreamoreBoy set status: open -> closed; resolution: out of date; messages: + msg114669; versions: + Python 3.2, - Python 2.7
2010-08-18 23:35:03 BreamoreBoy set nosy: + BreamoreBoy; messages: + msg114300
2009-04-22 17:18:18 ajaksu2 set keywords: + easy
2009-02-13 05:21:02 ajaksu2 set nosy: + ajaksu2; stage: test needed; messages: + msg81883; versions: + Python 2.7
2009-02-13 05:19:37 ajaksu2 link issue868908 superseder
2008-02-19 23:26:01 akuchling set keywords: + patch; type: enhancement
2003-11-25 16:47:35 d98dzone create