Issue 849097: Request: getpos() for sgmllib

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/39602

classification

Title:	Request: getpos() for sgmllib
Type:	enhancement	Stage:	test needed
Components:	Library (Lib)	Versions:	Python 3.2

process

Status:	closed	Resolution:	out of date
Dependencies:		Superseder:
Assigned To:		Nosy List:	BreamoreBoy, ajaksu2, d98dzone, nnorwitz
Priority:	normal	Keywords:	easy, patch

Created on 2003-11-25 16:47 by d98dzone, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
diff.txt	d98dzone, 2003-12-02 11:55	Unix diff between the updated version and the CVS version (1.46)

Messages (6)
msg19144	Author: Dan Wiklund (d98dzone)	Date: 2003-11-25 16:47
While working on my master's thesis I discovered the need for a working getpos() in sgmllib.py. As it is now, you can call it successfully, since it is inherited from markupbase.py, but you will always get the answer (1,0) because the position is never updated. To fix this, the goahead function needs to change. This is my own implementation of the change, in part influenced by the "sister" goahead function in HTMLParser.py:

************************************
    def goahead(self, end):
        rawdata = self.rawdata
        i = 0
        k = 0
        n = len(rawdata)
        tmp=0
        while i < n:
            if self.nomoretags:
                self.handle_data(rawdata[i:n])
                i = n
                break
            match = interesting.search(rawdata, i)
            if match: j = match.start()
            else: j = n
            if i < j:
                self.handle_data(rawdata[i:j])
            tmp = self.updatepos(i, j)
            i = j
            if i == n: break
            startswith = rawdata.startswith
            if rawdata[i] == '<':
                if starttagopen.match(rawdata, i):
                    if self.literal:
                        self.handle_data(rawdata[i])
                        tmp = self.updatepos(i, i+1)
                        i = i+1
                        continue
                    k = self.parse_starttag(i)
                    if k < 0: break
                    tmp = self.updatepos(i, k)
                    i = k
                    continue
                if rawdata.startswith("</", i):
                    k = self.parse_endtag(i)
                    if k < 0: break
                    tmp = self.updatepos(i, k)
                    i = k
                    self.literal = 0
                    continue
                if self.literal:
                    if n > (i + 1):
                        self.handle_data("<")
                        i = i+1
                        tmp = self.updatepos(i, k)
                    else:
                        # incomplete
                        break
                    continue
                if rawdata.startswith("<!--", i):
                    # Strictly speaking, a comment is --.--
                    # within a declaration tag <!...>.
                    # This should be removed,
                    # and comments handled only in parse_declaration.
                    k = self.parse_comment(i)
                    if k < 0: break
                    tmp = self.updatepos(i, k)
                    i = k
                    continue
                if rawdata.startswith("<?", i):
                    k = self.parse_pi(i)
                    if k < 0: break
                    tmp = self.updatepos(i, k)
                    i = i+k
                    continue
                if rawdata.startswith("<!", i):
                    # This is some sort of declaration; in "HTML as
                    # deployed," this should only be the document type
                    # declaration ("<!DOCTYPE html...>").
                    k = self.parse_declaration(i)
                    if k < 0: break
                    tmp = self.updatepos(i, k)
                    i = k
                    continue
                tmp = self.updatepos(i, k)
            elif rawdata[i] == '&':
                if self.literal:
                    self.handle_data(rawdata[i])
                    #tmp = self.updatepos(i,i+1)#added
                    i = i+1
                    continue
                match = charref.match(rawdata, i)
                if match:
                    name = match.group()[2:-1]
                    self.handle_charref(name)
                    k = match.end()
                    if not startswith(';', k-1): k = k - 1
                    tmp = self.updatepos(i, k)
                    i = k
                    continue
                match = entityref.match(rawdata, i)
                if match:
                    name = match.group(1)
                    self.handle_entityref(name)
                    k = match.end()
                    if not startswith(';', k-1): k = k - 1
                    tmp = self.updatepos(i, k)
                    i = k
                    continue
            else:
                self.error('neither < nor & ??')
            # We get here only if incomplete matches but
            # nothing else
            match = incomplete.match(rawdata, i)
            if not match:
                self.handle_data(rawdata[i])
                i = i+1
                continue
            j = match.end(0)
            if j == n:
                break # Really incomplete
            self.handle_data(rawdata[i:j])
            i = j
        # end while
        if end and i < n:
            self.handle_data(rawdata[i:n])
            tmp = self.updatepos(i, n)
            i = n
        self.rawdata = rawdata[i:]
        # XXX if end: check for empty stack

    # Extensions for the DOCTYPE scanner:
    _decl_otherchars = '='
***************************

The major difference is the added updatepos calls. It seems to work fine, or at least it has worked fine for me so far.
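To illustrate why getpos() stays at (1,0) unless updatepos() is called as the parser consumes input, here is a minimal standalone sketch of the line/offset bookkeeping. The class name PositionTracker and the sample string are hypothetical and not part of sgmllib or the attached patch; the logic is modeled on the updatepos() behavior described in this report and may differ in detail from the library code.

```python
class PositionTracker:
    """Hypothetical helper: tracks (lineno, offset) while a buffer is consumed."""

    def __init__(self, rawdata):
        self.rawdata = rawdata
        self.lineno = 1   # line numbers are 1-based
        self.offset = 0   # column offsets are 0-based

    def updatepos(self, i, j):
        """Record that rawdata[i:j] has been consumed and return j."""
        if i >= j:
            return j
        nlines = self.rawdata.count("\n", i, j)
        if nlines:
            # Crossed at least one newline: the offset restarts after the last one.
            self.lineno += nlines
            pos = self.rawdata.rindex("\n", i, j)
            self.offset = j - (pos + 1)
        else:
            # Same line: just advance the column.
            self.offset += j - i
        return j

    def getpos(self):
        return self.lineno, self.offset


tracker = PositionTracker("ab\ncd<e>")
start = tracker.getpos()        # nothing consumed yet
tracker.updatepos(0, 5)         # consume the text "ab\ncd"
after_text = tracker.getpos()
tracker.updatepos(5, 8)         # consume the tag "<e>"
after_tag = tracker.getpos()
print(start, after_text, after_tag)
```

If updatepos() is never called, getpos() keeps returning the initial value, which matches the behavior reported here.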
msg19145	Author: Neal Norwitz (nnorwitz) *	Date: 2003-11-25 18:51
Can you please post a context diff against the version in CVS as an attachment? Formatting is not preserved when viewing through SF. Thanks.
msg19146	Author: Dan Wiklund (d98dzone)	Date: 2003-12-02 12:16
Added an attachment with the difference from the current file version. It has three parts. The first is just updatepos inserted at the correct places in the goahead function. The second is the part of goahead that handles the & characters: I had a hard time making it work with the current model, so I changed it to a version inspired by the same part of the goahead function in HTMLParser.py. The last is the printouts in the test function, to check that the function performs correctly.
msg81883	Author: Daniel Diniz (ajaksu2) *	Date: 2009-02-13 05:21
Closed #868908 as a duplicate of this one.
msg114300	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-08-18 23:35
Anyone interested in this? I found the patch unreadable but YMMV.
msg114669	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-08-22 10:44
sgmllib has been deprecated since 2.6 and has been removed from py3k.
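For readers on Python 3, where sgmllib is no longer available, position tracking of the kind requested here can be had from html.parser.HTMLParser, whose getpos() returns the current (lineno, offset). The following is a small sketch, not a drop-in replacement for sgmllib; the subclass name TagPositions and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TagPositions(HTMLParser):
    """Hypothetical subclass that records where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() reports the (lineno, offset) of the tag being handled.
        self.positions.append((tag, self.getpos()))


parser = TagPositions()
parser.feed("<p>one</p>\n<b>two</b>")
parser.close()
print(parser.positions)
```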

History
Date	User	Action	Args
2022-04-11 14:56:01	admin	set	github: 39602
2010-08-22 10:44:01	BreamoreBoy	set	status: open -> closed resolution: out of date messages: + msg114669 versions: + Python 3.2, - Python 2.7
2010-08-18 23:35:03	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg114300
2009-04-22 17:18:18	ajaksu2	set	keywords: + easy
2009-02-13 05:21:02	ajaksu2	set	nosy: + ajaksu2 stage: test needed messages: + msg81883 versions: + Python 2.7
2009-02-13 05:19:37	ajaksu2	link	issue868908 superseder
2008-02-19 23:26:01	akuchling	set	keywords: + patch type: enhancement
2003-11-25 16:47:35	d98dzone	create