Issue 868908: getpos() for sgmllib

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/39750

classification

Title:	getpos() for sgmllib
Type:	enhancement	Stage:
Components:	None	Versions:

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	Request: getpos() for sgmllib View: 849097
Assigned To:		Nosy List:	ajaksu2, d98dzone
Priority:	normal	Keywords:

Created on 2004-01-01 20:01 by d98dzone, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
diff.txt	d98dzone, 2004-01-01 20:01	Unix diff on The updated version and the CVS version(1.46)

Messages (2)
msg54083 - (view)	Author: Dan Wiklund (d98dzone)	Date: 2004-01-01 20:01
Placed here instead of in Bugs since it really isn't a bug. During the process of making my masters thesis I discovered the need for a working getpos() in sgmllib.py. As it is now you can successfully call it since it is inherited from markupbase.py but you will always get the answer (1,0) since it is never updated. To fix this one needs to change the goahead function. This is my own implementation of this change, in part influenced by the "sister" goahead-function in HTLMParser.py: ************************************ def goahead(self, end): rawdata = self.rawdata i = 0 k = 0 n = len(rawdata) tmp=0 while i < n: if self.nomoretags: self.handle_data(rawdata[i:n]) i = n break match = interesting.search(rawdata, i) if match: j = match.start() else: j = n if i < j: self.handle_data(rawdata[i:j]) tmp = self.updatepos(i, j) i = j if i == n: break startswith = rawdata.startswith if rawdata[i] == '<': if starttagopen.match(rawdata, i): if self.literal: self.handle_data(rawdata[i]) tmp = self.updatepos(i, i+1) i = i+1 continue k = self.parse_starttag(i) if k < 0: break tmp = self.updatepos(i, k) i = k continue if rawdata.startswith("</", i): k = self.parse_endtag(i) if k < 0: break tmp = self.updatepos(i, k) i = k self.literal = 0 continue if self.literal: if n > (i + 1): self.handle_data("<") i = i+1 tmp = self.updatepos(i, k) else: # incomplete break continue if rawdata.startswith("<!--", i): # Strictly speaking, a comment is --.-- # within a declaration tag <!...>. # This should be removed, # and comments handled only in parse_declaration. k = self.parse_comment(i) if k < 0: break tmp = self.updatepos(i, k) i = k continue if rawdata.startswith("<?", i): k = self.parse_pi(i) if k < 0: break tmp = self.updatepos(i, k) i = i+k continue if rawdata.startswith("<!", i): # This is some sort of declaration; in "HTML as # deployed," this should only be the document type # declaration ("<!DOCTYPE html...>"). k = self.parse_declaration(i) if k < 0: break tmp = self.updatepos(i, k) i = k continue tmp = self.updatepos(i, k) elif rawdata[i] == '&': if self.literal: self.handle_data(rawdata[i]) #tmp = self.updatepos(i,i+1)#added i = i+1 continue match = charref.match(rawdata, i) if match: name = match.group()[2:-1] self.handle_charref(name) k = match.end() if not startswith(';', k-1): k = k - 1 tmp = self.updatepos(i, k) i = k continue match = entityref.match(rawdata, i) if match: name = match.group(1) self.handle_entityref(name) k = match.end() if not startswith(';', k-1): k = k - 1 tmp = self.updatepos(i, k) i = k continue else: self.error('neither < nor & ??') # We get here only if incomplete matches but # nothing else match = incomplete.match(rawdata, i) if not match: self.handle_data(rawdata[i]) i = i+1 continue j = match.end(0) if j == n: break # Really incomplete self.handle_data(rawdata[i:j]) i = j # end while if end and i < n: self.handle_data(rawdata[i:n]) tmp = self.updatepos(i, n) i = n self.rawdata = rawdata[i:] # XXX if end: check for empty stack # Extensions for the DOCTYPE scanner: _decl_otherchars = '=' *************************** The major diffrence is the updatepos functions. It seems to work fine, or at least it has worked fine for me so far. Posted a diff taken againts the CVS version(1.46). It har three parts. The first is just updatepos inserted at the correct places in the function goahead. The second is from the part of the goahead function which handles the &-characters. I had a hard time making it work with the current model and changed it to a version inspired by the same part of the goahead-function in HTMLParser.py. The last is the printouts in the testfunction to check if the function performs ok.
msg81882 - (view)	Author: Daniel Diniz (ajaksu2) *	Date: 2009-02-13 05:19
Duplicate of #849097.

History
Date	User	Action	Args
2022-04-11 14:56:01	admin	set	github: 39750
2009-02-13 05:19:37	ajaksu2	set	status: open -> closed resolution: duplicate superseder: Request: getpos() for sgmllib messages: + msg81882 nosy: + ajaksu2
2004-01-01 20:01:48	d98dzone	create