Issue849097
Created on 2003-11-25 16:47 by d98dzone, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Files | ||||
---|---|---|---|---|
File name | Uploaded | Description | Edit | |
diff.txt | d98dzone, 2003-12-02 11:55 | Unix diff of the updated version against the CVS version (1.46) |
Messages (6) | |||
---|---|---|---|
msg19144 - (view) | Author: Dan Wiklund (d98dzone) | Date: 2003-11-25 16:47 | |
While working on my master's thesis I discovered the need for a working getpos() in sgmllib.py. As it is now, you can call it successfully, since it is inherited from markupbase.py, but you will always get the answer (1,0) because the position is never updated. To fix this, one needs to change the goahead function. This is my own implementation of the change, in part influenced by the "sister" goahead function in HTMLParser.py:
************************************
def goahead(self, end):
    rawdata = self.rawdata
    i = 0
    k = 0
    n = len(rawdata)
    tmp = 0
    while i < n:
        if self.nomoretags:
            self.handle_data(rawdata[i:n])
            i = n
            break
        match = interesting.search(rawdata, i)
        if match: j = match.start()
        else: j = n
        if i < j:
            self.handle_data(rawdata[i:j])
        tmp = self.updatepos(i, j)
        i = j
        if i == n: break
        startswith = rawdata.startswith
        if rawdata[i] == '<':
            if starttagopen.match(rawdata, i):
                if self.literal:
                    self.handle_data(rawdata[i])
                    tmp = self.updatepos(i, i+1)
                    i = i+1
                    continue
                k = self.parse_starttag(i)
                if k < 0: break
                tmp = self.updatepos(i, k)
                i = k
                continue
            if rawdata.startswith("</", i):
                k = self.parse_endtag(i)
                if k < 0: break
                tmp = self.updatepos(i, k)
                i = k
                self.literal = 0
                continue
            if self.literal:
                if n > (i + 1):
                    self.handle_data("<")
                    i = i+1
                    tmp = self.updatepos(i, k)
                else:
                    # incomplete
                    break
                continue
            if rawdata.startswith("<!--", i):
                # Strictly speaking, a comment is --.*--
                # within a declaration tag <!...>.
                # This should be removed,
                # and comments handled only in parse_declaration.
                k = self.parse_comment(i)
                if k < 0: break
                tmp = self.updatepos(i, k)
                i = k
                continue
            if rawdata.startswith("<?", i):
                k = self.parse_pi(i)
                if k < 0: break
                tmp = self.updatepos(i, k)
                i = i+k
                continue
            if rawdata.startswith("<!", i):
                # This is some sort of declaration; in "HTML as
                # deployed," this should only be the document type
                # declaration ("<!DOCTYPE html...>").
                k = self.parse_declaration(i)
                if k < 0: break
                tmp = self.updatepos(i, k)
                i = k
                continue
            tmp = self.updatepos(i, k)
        elif rawdata[i] == '&':
            if self.literal:
                self.handle_data(rawdata[i])
                #tmp = self.updatepos(i,i+1)#added
                i = i+1
                continue
            match = charref.match(rawdata, i)
            if match:
                name = match.group()[2:-1]
                self.handle_charref(name)
                k = match.end()
                if not startswith(';', k-1):
                    k = k - 1
                tmp = self.updatepos(i, k)
                i = k
                continue
            match = entityref.match(rawdata, i)
            if match:
                name = match.group(1)
                self.handle_entityref(name)
                k = match.end()
                if not startswith(';', k-1):
                    k = k - 1
                tmp = self.updatepos(i, k)
                i = k
                continue
        else:
            self.error('neither < nor & ??')
        # We get here only if incomplete matches but
        # nothing else
        match = incomplete.match(rawdata, i)
        if not match:
            self.handle_data(rawdata[i])
            i = i+1
            continue
        j = match.end(0)
        if j == n:
            break # Really incomplete
        self.handle_data(rawdata[i:j])
        i = j
    # end while
    if end and i < n:
        self.handle_data(rawdata[i:n])
        tmp = self.updatepos(i, n)
        i = n
    self.rawdata = rawdata[i:]
    # XXX if end: check for empty stack

# Extensions for the DOCTYPE scanner:
_decl_otherchars = '='
****************************
The major difference is the calls to updatepos. It seems to work fine, or at least it has worked fine for me so far. |
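For comparison, the behaviour the report asks for can be observed in the "sister" parser, which is available in Python 3 as html.parser.HTMLParser and does keep its position up to date. The following is a minimal sketch, not part of the submitted patch; the PositionLogger class and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class PositionLogger(HTMLParser):
    """Hypothetical helper: records getpos() for every start tag."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns (lineno, offset): lineno starts at 1,
        # offset is the 0-based column within that line.
        self.positions.append((tag, self.getpos()))


parser = PositionLogger()
parser.feed("<html>\n  <body>\n<p>text</p>")
parser.close()

for tag, (lineno, offset) in parser.positions:
    print(tag, lineno, offset)
```

With a parser that never updates its position, as described above for sgmllib, every entry would report (1, 0) instead of the actual line and column of each tag.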
|||
msg19145 - (view) | Author: Neal Norwitz (nnorwitz) * | Date: 2003-11-25 18:51 | |
Can you please post a context diff against the version in CVS as an attachment? Formatting is not preserved when viewing through SF. Thanks. |
|||
msg19146 - (view) | Author: Dan Wiklund (d98dzone) | Date: 2003-12-02 12:16 | |
Added an attachment with the difference from the current file version. It has three parts. The first is just updatepos inserted at the correct places in the goahead function. The second is the part of the goahead function that handles the & characters. I had a hard time making it work with the current model and changed it to a version inspired by the same part of the goahead function in HTMLParser.py. The last is the printouts in the test function to check whether the function performs correctly. |
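The bookkeeping that each inserted updatepos call performs can be sketched as a standalone function. This is an illustration modelled on the line/offset logic of an updatepos-style method, not the code from the attached diff; the free function track_position and its signature are hypothetical.

```python
def track_position(rawdata, lineno, offset, i, j):
    """Advance a (lineno, offset) pair over rawdata[i:j].

    Hypothetical standalone version of the bookkeeping: count the
    newlines in the consumed span, and recompute the column from the
    last newline if there was one.
    """
    if i >= j:
        # Nothing was consumed, so the position is unchanged.
        return lineno, offset
    nlines = rawdata.count("\n", i, j)
    if nlines:
        lineno += nlines
        last_newline = rawdata.rindex("\n", i, j)
        # Column is the distance from just after the last newline to j.
        offset = j - (last_newline + 1)
    else:
        offset += j - i
    return lineno, offset


text = "<a>\n<b>"
pos = (1, 0)                               # starting position
pos = track_position(text, *pos, 0, 3)     # consume "<a>"
after_tag = pos
pos = track_position(text, *pos, 3, 4)     # consume the newline
after_newline = pos
print(after_tag, after_newline)
```

Calling this once per consumed span, with the span's start and end indices, is what keeps getpos() meaningful; skipping a call for any branch leaves the reported position behind the actual parse position.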
|||
msg81883 - (view) | Author: Daniel Diniz (ajaksu2) * | Date: 2009-02-13 05:21 | |
Closed #868908 as a duplicate of this one. |
|||
msg114300 - (view) | Author: Mark Lawrence (BreamoreBoy) * | Date: 2010-08-18 23:35 | |
Anyone interested in this? I found the patch unreadable but YMMV. |
|||
msg114669 - (view) | Author: Mark Lawrence (BreamoreBoy) * | Date: 2010-08-22 10:44 | |
sgmllib has been deprecated since 2.6 and has been removed from py3k. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:56:01 | admin | set | github: 39602 |
2010-08-22 10:44:01 | BreamoreBoy | set | status: open -> closed resolution: out of date messages: + msg114669 versions: + Python 3.2, - Python 2.7 |
2010-08-18 23:35:03 | BreamoreBoy | set | nosy: + BreamoreBoy messages: + msg114300 |
2009-04-22 17:18:18 | ajaksu2 | set | keywords: + easy |
2009-02-13 05:21:02 | ajaksu2 | set | nosy: + ajaksu2 stage: test needed messages: + msg81883 versions: + Python 2.7 |
2009-02-13 05:19:37 | ajaksu2 | link | issue868908 superseder |
2008-02-19 23:26:01 | akuchling | set | keywords: + patch type: enhancement |
2003-11-25 16:47:35 | d98dzone | create |