
classification
Title: getpos() for sgmllib
Type: enhancement Stage:
Components: None Versions:
process
Status: closed Resolution: duplicate
Dependencies: Superseder: Request: getpos() for sgmllib
View: 849097
Assigned To: Nosy List: ajaksu2, d98dzone
Priority: normal Keywords:

Created on 2004-01-01 20:01 by d98dzone, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
diff.txt d98dzone, 2004-01-01 20:01 Unix diff of the updated version against the CVS version (1.46)
Messages (2)
msg54083 - (view) Author: Dan Wiklund (d98dzone) Date: 2004-01-01 20:01
Placed here instead of in Bugs since it really isn't a bug.

While working on my master's thesis I discovered the
need for a working getpos() in sgmllib.py. As it is now,
you can successfully call it, since it is inherited from
markupbase.py, but you will always get the answer (1,0),
since the position is never updated.
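
The behaviour described here can be reproduced without sgmllib. What follows is a minimal sketch for Python 3, not the markupbase.py source: `PositionTracker` is a hypothetical stand-in that only mirrors the line/offset bookkeeping behind getpos() and updatepos(). It shows that the reported position stays at (1, 0) unless the scanning loop reports what it has consumed.

```python
class PositionTracker:
    """Hypothetical stand-in for the position bookkeeping of a markup parser."""

    def __init__(self, rawdata):
        self.rawdata = rawdata
        self.lineno = 1   # line numbers start at 1
        self.offset = 0   # column offsets start at 0

    def getpos(self):
        """Return the current (line number, offset) pair."""
        return self.lineno, self.offset

    def updatepos(self, i, j):
        """Advance the recorded position past rawdata[i:j] and return j."""
        if i >= j:
            return j
        nlines = self.rawdata.count("\n", i, j)
        if nlines:
            self.lineno += nlines
            last_newline = self.rawdata.rindex("\n", i, j)
            self.offset = j - (last_newline + 1)
        else:
            self.offset += j - i
        return j


text = "<p>one\ntwo</p>"

# A scanner that consumes input but never calls updatepos().
stale = PositionTracker(text)

# A scanner that reports the span it consumed (everything before "</p>").
fresh = PositionTracker(text)
fresh.updatepos(0, text.index("</p>"))

print(stale.getpos())
print(fresh.getpos())
```

The second tracker ends at line 2, offset 3, because the span before "</p>" contains one newline followed by three characters.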

To fix this, one needs to change the goahead function.
This is my own implementation of this change, in part
influenced by the "sister" goahead function in
HTMLParser.py:


************************************
def goahead(self, end):
    rawdata = self.rawdata
    i = 0
    k = 0
    n = len(rawdata)
    tmp = 0
    while i < n:
        if self.nomoretags:
            self.handle_data(rawdata[i:n])
            i = n
            break
        match = interesting.search(rawdata, i)
        if match: j = match.start()
        else: j = n
        if i < j:
            self.handle_data(rawdata[i:j])
            tmp = self.updatepos(i, j)
        i = j
        if i == n: break
        startswith = rawdata.startswith
        if rawdata[i] == '<':
            if starttagopen.match(rawdata, i):
                if self.literal:
                    self.handle_data(rawdata[i])
                    tmp = self.updatepos(i, i+1)
                    i = i+1
                    continue
                k = self.parse_starttag(i)
                if k < 0: break
                tmp = self.updatepos(i, k)
                i = k
                continue
            if rawdata.startswith("</", i):
                k = self.parse_endtag(i)
                if k < 0: break
                tmp = self.updatepos(i, k)
                i = k
                self.literal = 0
                continue
            if self.literal:
                if n > (i + 1):
                    self.handle_data("<")
                    # Record the single "<" before stepping past it.
                    tmp = self.updatepos(i, i+1)
                    i = i+1
                else:
                    # incomplete
                    break
                continue
            if rawdata.startswith("<!--", i):
                # Strictly speaking, a comment is --.*--
                # within a declaration tag <!...>.
                # This should be removed,
                # and comments handled only in parse_declaration.
                k = self.parse_comment(i)
                if k < 0: break
                tmp = self.updatepos(i, k)
                i = k
                continue
            if rawdata.startswith("<?", i):
                k = self.parse_pi(i)
                if k < 0: break
                # parse_pi returns the length of the instruction,
                # not its end index.
                tmp = self.updatepos(i, i+k)
                i = i+k
                continue
            if rawdata.startswith("<!", i):
                # This is some sort of declaration; in "HTML as
                # deployed," this should only be the document type
                # declaration ("<!DOCTYPE html...>").
                k = self.parse_declaration(i)
                if k < 0: break
                tmp = self.updatepos(i, k)
                i = k
                continue
            tmp = self.updatepos(i, k)
        elif rawdata[i] == '&':
            if self.literal:
                self.handle_data(rawdata[i])
                #tmp = self.updatepos(i,i+1)#added
                i = i+1
                continue
            match = charref.match(rawdata, i)
            if match:
                name = match.group()[2:-1]
                self.handle_charref(name)
                k = match.end()
                if not startswith(';', k-1):
                    k = k - 1
                tmp = self.updatepos(i, k)
                i = k
                continue
            match = entityref.match(rawdata, i)
            if match:
                name = match.group(1)
                self.handle_entityref(name)
                k = match.end()
                if not startswith(';', k-1):
                    k = k - 1
                tmp = self.updatepos(i, k)
                i = k
                continue
        else:
            self.error('neither < nor & ??')
        # We get here only if incomplete matches but
        # nothing else
        match = incomplete.match(rawdata, i)
        if not match:
            self.handle_data(rawdata[i])
            i = i+1
            continue
        j = match.end(0)
        if j == n:
            break # Really incomplete
        self.handle_data(rawdata[i:j])
        i = j
    # end while
    if end and i < n:
        self.handle_data(rawdata[i:n])
        tmp = self.updatepos(i, n)
        i = n
    self.rawdata = rawdata[i:]
    # XXX if end: check for empty stack

# Extensions for the DOCTYPE scanner:
_decl_otherchars = '='

****************************

The major difference is the added updatepos calls. It
seems to work fine, or at least it has worked fine for
me so far.
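
For comparison, this is roughly the behaviour the change aims for. The sketch assumes Python 3, where HTMLParser.py lives on as html.parser (sgmllib itself was removed in Python 3); `TagPositions` is a hypothetical subclass introduced only for illustration. Its getpos() calls return the position where each start tag begins, because that parser's goahead calls updatepos after every consumed chunk.

```python
from html.parser import HTMLParser


class TagPositions(HTMLParser):
    """Hypothetical subclass that records where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns (line number, offset) for the tag being handled.
        self.positions.append((tag, self.getpos()))


parser = TagPositions()
parser.feed("<html>\n  <body>\n<p>hi</p>")
parser.close()

for tag, (line, column) in parser.positions:
    print(tag, line, column)
```

With the sgmllib version under discussion, the same handler would report (1, 0) for every tag.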


Posted a diff taken against the CVS version (1.46).
It has three parts. The first is just updatepos calls
inserted at the correct places in the goahead function. The
second is the part of the goahead function which
handles the & characters. I had a hard time making it work
with the current model and changed it to a version inspired
by the same part of the goahead function in HTMLParser.py.
The last is the printouts in the test function to check that
the function performs OK.
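
The end-index adjustment in the & branch can be tried in isolation. This is a sketch under assumptions: `scan_charref` is a hypothetical helper, and the pattern is modelled on sgmllib's charref expression rather than copied from the CVS version. The pattern consumes one terminating character, so the index is stepped back by one when that character is not a semicolon, which leaves it in the input for the next pass.

```python
import re

# Assumed pattern: digits after "&#", followed by exactly one terminator.
charref = re.compile(r'&#([0-9]+)[^0-9]')


def scan_charref(rawdata, i):
    """Return (name, end_index) for a character reference at i, or None."""
    match = charref.match(rawdata, i)
    if not match:
        return None
    # Strip the leading "&#" and the single terminating character.
    name = match.group()[2:-1]
    k = match.end()
    if not rawdata.startswith(';', k - 1):
        # The terminator was not ';', so it is ordinary data: give it back.
        k = k - 1
    return name, k


print(scan_charref("&#65;B", 0))   # terminator ';' is consumed
print(scan_charref("&#65 B", 0))   # terminator ' ' is left in the input
print(scan_charref("&#65", 0))     # no terminator yet: incomplete
```

The returned end index is the value that would be passed to updatepos(i, k) before the loop sets i = k.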
msg81882 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2009-02-13 05:19
Duplicate of #849097.
History
Date User Action Args
2022-04-11 14:56:01 admin set github: 39750
2009-02-13 05:19:37 ajaksu2 set status: open -> closed; resolution: duplicate; superseder: Request: getpos() for sgmllib; messages: + msg81882; nosy: + ajaksu2
2004-01-01 20:01:48 d98dzone create