classification
Title: python3 regression ElementTree.iterparse() unable to capture comments
Type: enhancement
Stage:
Components: XML
Versions: Python 3.8
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Martin Hosken, eli.bendersky, scoder, serhiy.storchaka, xtreak
Priority: normal
Keywords:

Created on 2018-09-07 06:10 by Martin Hosken, last changed 2018-09-11 18:58 by serhiy.storchaka.

Messages (7)
msg324719 - Author: Martin Hosken (Martin Hosken) Date: 2018-09-07 06:10
This is a regression from Python 2, caused by Python 3 forcing the use of the C implementation (cElementTree).

I have code that uses iterparse to process an XML file, but I also want to process comments and so I have a comment handling function called by the parser during iterparse. Under python3 I can find no way to achieve the same thing:

```
parser = et.XMLParser(target=et.TreeBuilder())
parser.parser.CommentHandler = myCommentHandler
for event, elem in et.iterparse(fh, parser=parser):
    ...
```

This is somewhat ugly but works in Python 2. I can find no way to set a comment handler on the parser in Python 3.


1. There is no way(?) to get to xml.etree.ElementTree.XMLParser since the C implementation completely masks the python versions.
2. It is possible to create a subclass of TreeBuilder to add a comment method. But the C version XMLParser requires that its TreeBuilder not be a subclass, when used in iterparse.

The only solution I found was to copy the XMLParser code out of ElementTree into a private module and use that pure python implementation.

Suggested solutions:
1. Allow access to all the python implementations in ElementTree and not just Element.
2. Allow a comments method to be passed to the XMLParser on creation.

Thank you.
msg324835 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-09-08 12:19
> But the C version XMLParser requires that its TreeBuilder not be a subclass, when used in iterparse.

Creating a TreeBuilder subclass looks like the Right Way. What are the problems with this? Could you please provide a complete script that works in 2.7, but doesn't work in 3.x?
msg324837 - Author: Stefan Behnel (scoder) * Date: 2018-09-08 12:32
There are dedicated handler methods that you can implement: "def comment(self, comment)" and "def pi(self, target, data)". Both (c)ElementTree and lxml support those.

I think the "target" argument to the parser is a bit underdocumented, and the standard TreeBuilder does not implement those methods (because it does not use them).

https://docs.python.org/3/library/xml.etree.elementtree.html#xmlparser-objects

Probably worth mentioning that in the docs.
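
For readers following along, here is a minimal sketch of the handler-method approach described above. The class name `CommentCollector` and the sample XML (including the `<?app hint?>` processing instruction) are made up for illustration. The parser is driven with `feed()`/`close()` rather than `iterparse()`, so it only shows that the target's `comment` and `pi` methods get called, not how to combine them with event iteration.

```python
import xml.etree.ElementTree as ET


class CommentCollector:
    """Hypothetical parser target that records comments and PIs."""

    def __init__(self):
        self.comments = []
        self.instructions = []

    # Element and text callbacks are ignored in this sketch.
    def start(self, tag, attrib):
        pass

    def end(self, tag):
        pass

    def data(self, text):
        pass

    def comment(self, text):
        # Called with the text between "<!--" and "-->".
        self.comments.append(text)

    def pi(self, target, data):
        # Called with the PI target and its data string.
        self.instructions.append((target, data))

    def close(self):
        # Whatever the target returns here is returned by parser.close().
        return self.comments, self.instructions


parser = ET.XMLParser(target=CommentCollector())
parser.feed("<root><?app hint?><child>Hello <!-- Greeting --> World</child></root>")
comments, instructions = parser.close()
print(comments)
print(instructions)
```

Because the target here is a plain object rather than a TreeBuilder, no element tree is built; a target that also needs the tree would have to forward `start`, `end` and `data` to a TreeBuilder itself.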
msg324909 - Author: Martin Hosken (Martin Hosken) Date: 2018-09-10 03:46
Sorry. This test is rather long because it is 3 tests:

from __future__ import print_function
import sys
import xml.etree.ElementTree as et
import xml.etree.cElementTree as cet
from io import StringIO

teststr = u"""<?xml version="1"?>
<root>
    <child>
        Hello <!-- Greeting --> World
    </child>
</root>"""
testf = StringIO(teststr)

if len(sys.argv) >= 2 and 'a' in sys.argv[1]:
    testf.seek(0)
    for event, elem in et.iterparse(testf, events=["end", "comment"]):
        if event == 'end':
            print(elem.tag + ": " + str(elem.text))
        elif event == 'comment':
            print("comment: " + elem.text)

if len(sys.argv) < 2 or 'b' in sys.argv[1]:
    testf.seek(0)
    def doComment(data):
        parser.parser.StartElementHandler("!--", ('text', data))
        parser.parser.EndElementHandler("!--")
    parser = et.XMLParser()
    parser.parser.CommentHandler = doComment
    for event, elem in et.iterparse(testf, parser=parser):
        if hasattr(elem, 'text'):
            print(elem.tag + ": " + str(elem.text))
        else:
            print(elem.tag + ": " + elem.get('text', ""))

if len(sys.argv) < 2 or 'c' in sys.argv[1] or 'd' in sys.argv[1]:
    testf.seek(0)
    useet = et if len(sys.argv) < 2 or 'c' in sys.argv[1] else cet
    class CommentingTb(useet.TreeBuilder):
        def __init__(self):
            self.parser = None
        def comment(self, data):
            self.parser.parser.StartElementHandler("!--", ('text', data))
            self.parser.parser.EndElementHandler("!--")
    tb = CommentingTb()
    parser = useet.XMLParser(target=tb)
    tb.parser = parser
    kw = {'parser': parser} if len(sys.argv) < 2 or 'c' in sys.argv[1] else {}
    for event, elem in useet.iterparse(testf, **kw):
        if hasattr(elem, 'text'):
            print(elem.tag + ": " + str(elem.text))
        else:
            print(elem.tag + ": " + elem.get('text', ""))

Test 'a' is how I would like to write the solution to my problem. Not sure why 'comment' isn't supported by iterparse directly, but hey.

Test 'b' is how I solved it in python2

Test 'c' is how I would have to solve it in python3 if it worked

Test 'd' is the same as 'c' but uses cElementTree rather than ElementTree.

Results:

Success output for a test is:
```
!--: None
child: 
        Hello 
root: 
    
```

Python2:
a    Fails (obviously)
b    Succeeds
c    Succeeds
d    Fails: can't inherit from cElementTree.TreeBuilder

Python3:
a    Fails (obviously)
b    Fails: XMLParser has no attribute 'parser'
c    Fails: event handling only supported for ElementTree.TreeBuilder targets
d    Fails: Gives output but no initial comment component (line 1)

The key failure here is Python3 'c'. This is what stops any hope of comment handling using the et.XMLParser. The only way I could get around it was to use my own copy from the source code.
msg324910 - Author: Martin Hosken (Martin Hosken) Date: 2018-09-10 03:51
Blast. Bugs. Sorry. Missing superclass init call in CommentingTb. I enclose the whole thing again to save editing. Also fixes comment output to give text.

from __future__ import print_function
import sys
import xml.etree.ElementTree as et
import xml.etree.cElementTree as cet
from io import StringIO

teststr = u"""<?xml version="1"?>
<root>
    <child>
        Hello <!-- Greeting --> World
    </child>
</root>"""
testf = StringIO(teststr)

if len(sys.argv) >= 2 and 'a' in sys.argv[1]:
    testf.seek(0)
    for event, elem in et.iterparse(testf, events=["end", "comment"]):
        if event == 'end':
            print(elem.tag + ": " + str(elem.text))
        elif event == 'comment':
            print("comment: " + elem.text)

if len(sys.argv) < 2 or 'b' in sys.argv[1]:
    testf.seek(0)
    def doComment(data):
        parser.parser.StartElementHandler("!--", ('text', data))
        parser.parser.EndElementHandler("!--")
    parser = et.XMLParser()
    parser.parser.CommentHandler = doComment
    for event, elem in et.iterparse(testf, parser=parser):
        if elem.tag == "!--":
            print(elem.tag + ": " + elem.get('text', ""))
        else:
            print(elem.tag + ": " + str(elem.text))

if len(sys.argv) < 2 or 'c' in sys.argv[1] or 'd' in sys.argv[1]:
    testf.seek(0)
    useet = et if len(sys.argv) < 2 or 'c' in sys.argv[1] else cet
    class CommentingTb(useet.TreeBuilder):
        def __init__(self):
            useet.TreeBuilder.__init__(self)
            self.parser = None
        def comment(self, data):
            self.parser.parser.StartElementHandler("!--", ('text', data))
            self.parser.parser.EndElementHandler("!--")
    tb = CommentingTb()
    parser = useet.XMLParser(target=tb)
    tb.parser = parser
    kw = {'parser': parser} if len(sys.argv) < 2 or 'c' in sys.argv[1] else {}
    for event, elem in useet.iterparse(testf, **kw):
        if elem.tag == "!--":
            print(elem.tag + ": " + elem.get('text', ""))
        else:
            print(elem.tag + ": " + str(elem.text))
msg325041 - Author: Stefan Behnel (scoder) * Date: 2018-09-11 18:00
lxml supports "comment" and "pi" as event types in iterparse (or, more specifically, in the XMLPullParser). If someone wants to implement this for (c)ElementTree, I'd be happy to review the PR.

https://lxml.de/api/lxml.etree.XMLPullParser-class.html
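
A rough sketch of what event-based comment handling looks like from the caller's side, written against the standard library rather than lxml. It assumes Python 3.8 or later, where the ElementTree documentation lists "comment" and "pi" as accepted iterparse event names; on older versions the same call is expected to fail. The sample XML and the `seen` list are illustrative only.

```python
import io
import xml.etree.ElementTree as ET

xml_text = "<root><child>Hello <!-- Greeting --> World</child></root>"

seen = []
# Assumption: Python 3.8+, where "comment" is a valid event name.
for event, elem in ET.iterparse(io.StringIO(xml_text), events=("end", "comment")):
    if event == "comment":
        # For comment events, elem.text holds the comment body.
        seen.append(("comment", elem.text))
    else:
        seen.append((elem.tag, elem.text))

for kind, text in seen:
    print(kind, repr(text))
```

This is essentially test 'a' from the script above, with the comment reported as its own event instead of through a hand-installed handler.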
msg325051 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-09-11 18:58
Thank you for your example, Martin. Now I see what is not working. Indeed, using a TreeBuilder subclass is the recommended way, but iterparse() was designed to accept only an exact TreeBuilder. Actually, there is an intention to consider the parser parameter of iterparse() as internal and remove it from the public API (although the deprecation warning is not emitted yet).

I don't see a way to fix this. Since this feature never worked in Python 3, and the Python 2 way looks like it relies on implementation details, I think we should consider adding it as a new feature rather than a bug fix. Adding support for "comment" and "pi" as event types in iterparse looks reasonable to me. Eli, what are your thoughts?

I agree with you, Martin, that using your own copy from the source code is the best way of solving your problem on current Python 3.
History
Date                 User              Action  Args
2018-09-11 18:58:13  serhiy.storchaka  set     messages: + msg325051
2018-09-11 18:00:57  scoder            set     type: behavior -> enhancement; messages: + msg325041; versions: + Python 3.8, - Python 3.6
2018-09-10 06:30:18  xtreak            set     nosy: + xtreak
2018-09-10 03:51:39  Martin Hosken     set     messages: + msg324910
2018-09-10 03:46:24  Martin Hosken     set     messages: + msg324909
2018-09-08 12:32:16  scoder            set     messages: + msg324837
2018-09-08 12:19:43  serhiy.storchaka  set     nosy: + serhiy.storchaka, eli.bendersky, scoder; messages: + msg324835
2018-09-07 06:10:16  Martin Hosken     create