Issue34600
Created on 2018-09-07 06:10 by Martin Hosken, last changed 2022-04-11 14:59 by admin. This issue is now closed.
Messages (8) | |||
---|---|---|---|
msg324719 - (view) | Author: Martin Hosken (Martin Hosken) | Date: 2018-09-07 06:10 | |
This is a regression from Python 2, caused by being forced to use cElementTree. I have code that uses iterparse to process an XML file, but I also want to process comments, so I have a comment handling function called by the parser during iterparse. Under Python 3 I can find no way to achieve the same thing:

```
parser = et.XMLParser(target=et.TreeBuilder())
parser.parser.CommentHandler = myCommentHandler
for event, elem in et.iterparse(fh, parser=parser):
    ...
```

Somewhat ugly, but it works in Python 2. I can find no way to set a comment handler on the parser in Python 3.

1. There is no way(?) to get to xml.etree.ElementTree.XMLParser since the C implementation completely masks the Python versions.
2. It is possible to create a subclass of TreeBuilder to add a comment method. But the C version of XMLParser requires that its TreeBuilder not be a subclass when used in iterparse.

The only solution I found was to copy the XMLParser code out of ElementTree into a private module and use that pure Python implementation.

Suggested solutions:

1. Allow access to all the Python implementations in ElementTree and not just Element.
2. Allow a comments method to be passed to the XMLParser on creation.

Thank you. |
|||
msg324835 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2018-09-08 12:19 | |
> But the C version XMLParser requires that its TreeBuilder not be a subclass, when used in iterparse.

Creating a TreeBuilder subclass looks like the Right Way. What are the problems with this? Could you please provide a complete script that works in 2.7, but doesn't work in 3.x? |
|||
msg324837 - (view) | Author: Stefan Behnel (scoder) * | Date: 2018-09-08 12:32 | |
There are dedicated handler methods that you can implement: "def comment(self, comment)" and "def pi(self, target, data)". Both (c)ElementTree and lxml support those. I think the "target" argument to the parser is a bit underdocumented, and the standard TreeBuilder does not implement those methods (because it does not use them). https://docs.python.org/3/library/xml.etree.elementtree.html#xmlparser-objects Probably worth mentioning that in the docs. |
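As a rough illustration of the handler methods mentioned above, the sketch below passes a plain target object to `XMLParser` and feeds it a small document. The class name `CommentAndPICollector` and the sample XML are made up for this example; only the `comment(...)` and `pi(...)` method names come from the message.

```python
import xml.etree.ElementTree as ET


class CommentAndPICollector:
    """Hypothetical parser target that only records comments and PIs."""

    def __init__(self):
        self.comments = []
        self.pis = []

    def comment(self, text):
        # Called by XMLParser for each <!-- ... --> node.
        self.comments.append(text)

    def pi(self, target, data):
        # Called by XMLParser for each <?target data?> node.
        self.pis.append((target, data))

    def close(self):
        # Whatever close() returns becomes the result of parser.close().
        return self.comments, self.pis


parser = ET.XMLParser(target=CommentAndPICollector())
parser.feed("<root><?style sheet?><child>Hello<!-- Greeting -->World</child></root>")
comments, pis = parser.close()
print(comments)
print(pis)
```

Note that this target builds no tree at all, so it does not by itself help with iterparse, which is the case the report is about.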
|||
msg324909 - (view) | Author: Martin Hosken (Martin Hosken) | Date: 2018-09-10 03:46 | |
Sorry. This test is rather long because it is 3 tests:

```
from __future__ import print_function
import sys
import xml.etree.ElementTree as et
import xml.etree.cElementTree as cet
from io import StringIO

teststr = u"""<?xml version="1"?>
<root>
<child>
Hello
<!-- Greeting -->
World
</child>
</root>"""

testf = StringIO(teststr)

if len(sys.argv) >= 2 and 'a' in sys.argv[1]:
    testf.seek(0)
    for event, elem in et.iterparse(testf, events=["end", "comment"]):
        if event == 'end':
            print(elem.tag + ": " + str(elem.text))
        elif event == 'comment':
            print("comment: " + elem.text)

if len(sys.argv) < 2 or 'b' in sys.argv[1]:
    testf.seek(0)
    def doComment(data):
        parser.parser.StartElementHandler("!--", ('text', data))
        parser.parser.EndElementHandler("!--")
    parser = et.XMLParser()
    parser.parser.CommentHandler = doComment
    for event, elem in et.iterparse(testf, parser=parser):
        if hasattr(elem, 'text'):
            print(elem.tag + ": " + str(elem.text))
        else:
            print(elem.tag + ": " + elem.get('text', ""))

if len(sys.argv) < 2 or 'c' in sys.argv[1] or 'd' in sys.argv[1]:
    testf.seek(0)
    useet = et if len(sys.argv) < 2 or 'c' in sys.argv[1] else cet
    class CommentingTb(useet.TreeBuilder):
        def __init__(self):
            self.parser = None
        def comment(self, data):
            self.parser.parser.StartElementHandler("!--", ('text', data))
            self.parser.parser.EndElementHandler("!--")
    tb = CommentingTb()
    parser = useet.XMLParser(target=tb)
    tb.parser = parser
    kw = {'parser': parser} if len(sys.argv) < 2 or 'c' in sys.argv[1] else {}
    for event, elem in useet.iterparse(testf, **kw):
        if hasattr(elem, 'text'):
            print(elem.tag + ": " + str(elem.text))
        else:
            print(elem.tag + ": " + elem.get('text', ""))
```

Test 'a' is how I would like to write the solution to my problem. Not sure why 'comment' isn't supported by iterparse directly, but hey.
Test 'b' is how I solved it in Python 2.
Test 'c' is how I would have to solve it in Python 3 if it worked.
Test 'd' is the same as 'c' but uses cElementTree rather than ElementTree.
Results:

Success output for a test is:

```
!--: None
child: Hello
root:
```

Python2:
a Fails (obviously)
b Succeeds
c Succeeds
d Fails: can't inherit from cElementTree.TreeBuilder

Python3:
a Fails (obviously)
b Fails: XMLParser has no attribute 'parser'
c Fails: event handling only supported for ElementTree.TreeBuilder targets
d Fails: Gives output but no initial comment component (line 1)

The key failure here is Python3 'c'. This is what stops any hope of comment handling using the et.XMLParser. The only way I could get around it was to use my own copy from the source code. |
|||
msg324910 - (view) | Author: Martin Hosken (Martin Hosken) | Date: 2018-09-10 03:51 | |
Blast. Bugs. Sorry. Missing superclass init call in CommentingTb. I enclose the whole thing again to save editing. Also fixes comment output to give text.

```
from __future__ import print_function
import sys
import xml.etree.ElementTree as et
import xml.etree.cElementTree as cet
from io import StringIO

teststr = u"""<?xml version="1"?>
<root>
<child>
Hello
<!-- Greeting -->
World
</child>
</root>"""

testf = StringIO(teststr)

if len(sys.argv) >= 2 and 'a' in sys.argv[1]:
    testf.seek(0)
    for event, elem in et.iterparse(testf, events=["end", "comment"]):
        if event == 'end':
            print(elem.tag + ": " + str(elem.text))
        elif event == 'comment':
            print("comment: " + elem.text)

if len(sys.argv) < 2 or 'b' in sys.argv[1]:
    testf.seek(0)
    def doComment(data):
        parser.parser.StartElementHandler("!--", ('text', data))
        parser.parser.EndElementHandler("!--")
    parser = et.XMLParser()
    parser.parser.CommentHandler = doComment
    for event, elem in et.iterparse(testf, parser=parser):
        if elem.tag == "!--":
            print(elem.tag + ": " + elem.get('text', ""))
        else:
            print(elem.tag + ": " + str(elem.text))

if len(sys.argv) < 2 or 'c' in sys.argv[1] or 'd' in sys.argv[1]:
    testf.seek(0)
    useet = et if len(sys.argv) < 2 or 'c' in sys.argv[1] else cet
    class CommentingTb(useet.TreeBuilder):
        def __init__(self):
            useet.TreeBuilder.__init__(self)
            self.parser = None
        def comment(self, data):
            self.parser.parser.StartElementHandler("!--", ('text', data))
            self.parser.parser.EndElementHandler("!--")
    tb = CommentingTb()
    parser = useet.XMLParser(target=tb)
    tb.parser = parser
    kw = {'parser': parser} if len(sys.argv) < 2 or 'c' in sys.argv[1] else {}
    for event, elem in useet.iterparse(testf, **kw):
        if elem.tag == "!--":
            print(elem.tag + ": " + elem.get('text', ""))
        else:
            print(elem.tag + ": " + str(elem.text))
```
|||
msg325041 - (view) | Author: Stefan Behnel (scoder) * | Date: 2018-09-11 18:00 | |
lxml supports "comment" and "pi" as event types in iterparse (or, more specifically, in the XMLPullParser). If someone wants to implement this for (c)ElementTree, I'd be happy to review the PR. https://lxml.de/api/lxml.etree.XMLPullParser-class.html |
|||
msg325051 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2018-09-11 18:58 | |
Thank you for your example, Martin. Now I see what is not working. Indeed, using a TreeBuilder subclass is the recommended way, but iterparse() was designed to accept only an exact TreeBuilder. Actually there is an intention to consider the parser parameter of iterparse() as internal and remove it from the public API (although the deprecation warning is not emitted yet). I don't see a way to fix this. Since this feature never worked in Python 3, and the Python 2 way looks like relying on implementation details, I think we should consider adding it as a new feature rather than a bug fix. Adding support for "comment" and "pi" as event types in iterparse looks reasonable to me. Eli, what are your thoughts? I agree with you, Martin, that using your own copy of the source code is the best way of solving your problem on current Python 3. |
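For readers who do not need streaming, a TreeBuilder subclass can be handed to `XMLParser` directly, without going through iterparse. The following is only a sketch of that idea, not the approach taken by anyone in this thread; `CommentRecordingBuilder` and the sample document are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET


class CommentRecordingBuilder(ET.TreeBuilder):
    """Hypothetical TreeBuilder subclass that remembers comment text."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def comment(self, text):
        # Invoked by XMLParser for each comment; the tree itself is
        # still built by the inherited start/data/end methods.
        self.comments.append(text)


builder = CommentRecordingBuilder()
parser = ET.XMLParser(target=builder)
parser.feed("<root><!-- Greeting --><child>Hello</child></root>")
root = parser.close()  # returns the root element built by TreeBuilder

print(root.tag, builder.comments)
```

This avoids the iterparse restriction described above, but it parses the whole input into a tree and gives up the incremental event loop that the original report relies on.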
|||
msg342075 - (view) | Author: Stefan Behnel (scoder) * | Date: 2019-05-10 12:50 | |
I think this is resolved by issue 36673 (Py3.8). Please try it in the just released alpha4. |
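A minimal sketch of what test 'a' from this thread might look like on Python 3.8 or later, assuming the "comment" event type added for that release. The sample document is simplified for illustration (the comment sits outside mixed text) and is not the one from the test script.

```python
import io
import xml.etree.ElementTree as ET

xml_text = "<root><!-- Greeting --><child>Hello</child></root>"

seen = []
# Requires Python 3.8+: earlier versions reject the "comment" event name.
for event, elem in ET.iterparse(io.StringIO(xml_text), events=("end", "comment")):
    if event == "comment":
        # elem is a Comment element; its text is the comment body.
        seen.append((event, elem.text))
    else:
        seen.append((event, elem.tag))

print(seen)
```

The "pi" event name can be requested in the same way for processing instructions.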
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:59:05 | admin | set | github: 78781 |
2019-05-10 12:50:04 | scoder | set | status: open -> closed resolution: duplicate messages: + msg342075 stage: resolved |
2018-09-11 18:58:13 | serhiy.storchaka | set | messages: + msg325051 |
2018-09-11 18:00:57 | scoder | set | type: behavior -> enhancement messages: + msg325041 versions: + Python 3.8, - Python 3.6 |
2018-09-10 06:30:18 | xtreak | set | nosy: + xtreak |
2018-09-10 03:51:39 | Martin Hosken | set | messages: + msg324910 |
2018-09-10 03:46:24 | Martin Hosken | set | messages: + msg324909 |
2018-09-08 12:32:16 | scoder | set | messages: + msg324837 |
2018-09-08 12:19:43 | serhiy.storchaka | set | nosy: + serhiy.storchaka, eli.bendersky, scoder messages: + msg324835 |
2018-09-07 06:10:16 | Martin Hosken | create |