classification
Title: xml.etree.ElementTree skips processing instructions when parsing
Type: enhancement
Stage:
Components: Library (Lib)
Versions: Python 3.8
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: effbot, eli.bendersky, mark, nikratio, scoder
Priority: normal
Keywords: patch

Created on 2010-08-05 09:25 by mark, last changed 2019-04-27 15:56 by scoder.

Files
File name Uploaded Description Edit
testcase.patch nikratio, 2014-01-19 04:52 review
Messages (10)
msg112961 - Author: Mark Summerfield (mark) Date: 2010-08-05 09:25
If you read in an XML file using xml.etree.ElementTree.parse() and then write it out again using xml.etree.ElementTree.write() what is written may not be the same as what was read. In particular any XML declaration and processing instructions are stripped.

It seems to me that the parser should at least preserve any declaration and processing instructions so that reading and writing match up.

Here's an example:

Python 3.1.2 (r312:79147, Jul 15 2010, 10:56:05) 
[GCC 4.4.4] on linux2
Type "copyright", "credits" or "license()" for more information.
>>> file = "control-center.xml"
>>> open(file).read()[:500]
'<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" [\n<!ENTITY VERSION "1.5.7">\n]>\n<article id="index" lang="en_GB">\n  \n  <articleinfo>\n    <abstract role="description">\n      <para>The GNOME Control Centre provides a central place for the user to setup their GNOME experience. It can let you configure anything from the behaviour of your window borders to the default font type.</para>\n   '
>>> import xml.etree.ElementTree as etree
>>> xml = etree.parse(file)
>>> temp = "temp.xml"
>>> xml.write("temp.xml", encoding="utf-8")
>>> open(temp).read()[:500]
'<article id="index" lang="en_GB">\n  \n  <articleinfo>\n    <abstract role="description">\n      <para>The GNOME Control Centre provides a central place for the user to setup their GNOME experience. It can let you configure anything from the behaviour of your window borders to the default font type.</para>\n    </abstract>\n    <title>Control Centre</title>\n    <authorgroup>\n      <author>\n\t<firstname>Kevin</firstname><surname>Breit</surname>\n      </author>\n    </authorgroup>\n    <copyright>\n      <y'
>>>
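A condensed, self-contained version of the round trip above may be easier to rerun. This is only a sketch: the document below is a made-up stand-in for control-center.xml, and in-memory buffers replace the files on disk.

```python
import io
import xml.etree.ElementTree as ET

# Illustrative input (not the reporter's file): an XML declaration, a
# processing instruction before the root, and one inside the root.
SOURCE = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b'<?xml-stylesheet href="style.xsl"?>\n'
    b'<article id="index"><?do-something?><para>text</para></article>'
)

# Parse with the default parser, then serialize again.
tree = ET.parse(io.BytesIO(SOURCE))
buffer = io.BytesIO()
tree.write(buffer, encoding="utf-8")
written = buffer.getvalue()

# Neither the declaration nor either processing instruction survives.
print(written)
```

With the default parser, the serialized result starts directly at the root element.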
msg208445 - Author: Nikolaus Rath (nikratio) Date: 2014-01-19 04:52
I can confirm this. The actual problem is that neither XML nor SGML PIs in the input make it into the etree, and no events are generated for them during incremental parsing.

XML PIs that are added into the tree using Python functions are correctly written out. SGML PIs currently cannot be represented at all (there's no ElementTree.SGMLProcessingInstruction analogous to ElementTree.ProcessingInstruction)

There is special-cased support for the DOCTYPE declaration in the TreeBuilder class to allow retrieving the doctype when not parsing incrementally, but it needs to be retrieved manually and written out manually.
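As a rough sketch of that manual retrieval and manual writing: a parser target may define a doctype(name, pubid, system) method. The class name DoctypeRecorder and the sample document are hypothetical, and the DOCTYPE line is rebuilt by hand, assuming a PUBLIC identifier and no internal subset.

```python
import xml.etree.ElementTree as ET


class DoctypeRecorder(ET.TreeBuilder):
    """Hypothetical TreeBuilder subclass that remembers the DOCTYPE."""

    def __init__(self):
        super().__init__()
        self.doctype_info = None

    def doctype(self, name, pubid, system):
        # Called by XMLParser when it meets a <!DOCTYPE ...> declaration.
        self.doctype_info = (name, pubid, system)


SOURCE = (
    b'<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" '
    b'"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">\n'
    b'<article id="index"/>'
)

builder = DoctypeRecorder()
parser = ET.XMLParser(target=builder)
parser.feed(SOURCE)
root = parser.close()

# The doctype is not part of the tree, so it has to be written by hand.
name, pubid, system = builder.doctype_info
output = '<!DOCTYPE %s PUBLIC "%s" "%s">\n' % (name, pubid, system)
output += ET.tostring(root, encoding="unicode")
print(output)
```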


I have attached a testcase for XML PIs. For proper SGML PI handling, ElementTree first needs to learn about them.

Recommended stage for this issue: needs patch
msg208464 - Author: Stefan Behnel (scoder) (Python committer) Date: 2014-01-19 09:01
When you write "XML PI", do you mean the XML declaration? At least that's what Mark used in his original example.

ET avoids writing them out when they are not necessary, i.e. for UTF-8 compatible encodings. IMHO that's perfectly ok and definitely not an incorrect behaviour.

As for processing instructions (what you used in your test case patch), making them appear in the tree by default would be a behavioural change that might break existing ET code.

Note that lxml keeps PIs in the tree by default, unless you configure its parser explicitly with "remove_pis=True".

There is also a "remove_comments=True" in lxml. ET simply discards comments when parsing IIRC.

http://lxml.de/parsing.html#parser-options

IMHO, both behaviours are ok, with lxml having a tendency towards keeping the data as it came in rather than trying to find the easiest possible way for the user to work with the tree. PIs and comments are a bit 'special' to work with.

A fix could be to add the two keyword arguments also to ET's parser, but make them default to True (as opposed to False in lxml), so that users can switch the removal off when they need PIs or comments in the tree.
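For comparison, a sketch of the opt-in that the standard library gained later: from Python 3.8 on, TreeBuilder accepts insert_pis and insert_comments, both defaulting to False. This is not the remove_pis/remove_comments spelling proposed above, and the sample document is made up.

```python
import xml.etree.ElementTree as ET

SOURCE = b'<body text="some text for the body"><?do-something?><!-- note --></body>'

# Default behaviour: PIs and comments are dropped while parsing.
default_root = ET.fromstring(SOURCE)

# Opt-in (assumes Python 3.8+): ask the TreeBuilder to keep them in the tree.
builder = ET.TreeBuilder(insert_pis=True, insert_comments=True)
keeping_root = ET.fromstring(SOURCE, parser=ET.XMLParser(target=builder))

print(ET.tostring(default_root))
print(ET.tostring(keeping_root))
```

Keeping the defaults at False means existing code that iterates over children does not suddenly encounter PI or comment nodes.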
msg208518 - Author: Nikolaus Rath (nikratio) Date: 2014-01-20 03:13
No, I really mean XML processing instruction. I agree with you that the XML declaration is a non-issue, because there is no information lost: you know that you're going to write XML, and you manually specify the encoding. Thus it's trivial to add the correct XML declaration if desired.
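A minimal sketch of adding the declaration explicitly on output, using the xml_declaration argument of ElementTree.write(); the element is only an illustration.

```python
import io
import xml.etree.ElementTree as ET

root = ET.Element("body", {"text": "some text for the body"})

buffer = io.BytesIO()
# xml_declaration=True forces the declaration even for UTF-8 output.
ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
written = buffer.getvalue()

print(written)
```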

The fact that PIs are not read, however, is a real problem. The XML spec requires that PIs MUST be passed through (http://www.w3.org/TR/REC-xml/#sec-pi). Furthermore, ElementTree is designed to represent XML data, so writing out an ElementTree as XML and reading it back in must (in my opinion) not result in information loss. But currently it does:

>>> import xml.etree.ElementTree as ET
>>> import tempfile
>>> root = ET.Element('body', {'text': 'some text for the body'})
>>> root.insert(1, ET.ProcessingInstruction('do-something'))
>>> tree = ET.ElementTree(root)
>>> tmp = tempfile.NamedTemporaryFile()
>>> tree.write(tmp.name)
>>> tmp.seek(0)
0
>>> tree_copy = ET.parse(tmp.name)
>>> ET.dump(tree)
<body text="some text for the body"><?do-something?></body>
>>> ET.dump(tree_copy)
<body text="some text for the body" />

I think tree and tree_copy not having the same contents is a bug.

Regarding comments: personally I think that throwing them away is not a good idea either. But this is allowed by the XML spec (http://www.w3.org/TR/REC-xml/#dt-comment). This should probably go in a separate bug report if someone is interested in it.

As for backwards compatibility: yes, this is a concern. The keyword argument would be a solution. On the other hand, I'm not sure that the default should be something that causes data loss...?

lxml sounds like it's doing the right things. Is there some connection between lxml and etree that I'm not aware of, or did you just give it as an example of how a different library behaves?
msg208528 - Author: Stefan Behnel (scoder) (Python committer) Date: 2014-01-20 08:02
> As for backwards compatibility: yes, this is a concern. The keyword argument would be a solution. On the other hand, I'm not sure that the default should be something that causes dataloss...?

It's a matter of use cases. How often do people care? My experience tells me that it's much more common to parse XML in and extract information from it, than to do round trips. And when extracting information, both comments and processing instructions usually get in the way.

So I would say that this default behaviour isn't wrong. In fact, I'd even say that both lxml and ET behave as expected (or at least as intended) in their own ways.

Also, I can't really remember spotting a processing instruction anywhere *inside* of a root element in real world XML. For those living next to it, ET currently lacks support in its tree model.


> lxml sounds like it's doing the right things. Is there some connection between lxml and etree that I'm not aware of, or did you just give it as an example of how a different library behaves?

Both implement (essentially) the same API, so I consider compatibility and look-alikes important. The general idea is that lxml extends what's there, but otherwise tries to stay compatible as well as it can.

BTW, lxml also allows parser target objects to have "pi(self, target, data)" and "comment(self, text)" methods, through which it can pass comments and PIs through to user provided tree builders. I think ET lacks those, too.
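A sketch of such a target object. It is written against the standard library, which accepts pi() and comment() on parser targets from Python 3.8 on (so it would not have worked at the time of this comment); the class name CollectingTarget and the input are hypothetical.

```python
import xml.etree.ElementTree as ET


class CollectingTarget:
    """Hypothetical parser target that only records PIs and comments."""

    def __init__(self):
        self.pis = []
        self.comments = []

    def pi(self, target, data):
        # Called once per processing instruction, inside or outside the root.
        self.pis.append((target, data))

    def comment(self, text):
        # Called once per comment.
        self.comments.append(text)

    def close(self):
        # Whatever close() returns is what XMLParser.close() returns.
        return self.pis, self.comments


parser = ET.XMLParser(target=CollectingTarget())
parser.feed(b"<?before root?><body><?do-something now?><!-- note --></body>")
pis, comments = parser.close()

print(pis)
print(comments)
```

Because the target sees the raw callbacks, it also receives the PI that precedes the root element, which the tree model itself cannot hold.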


Changing target version from 3.3/3.4 to 3.5 as this is a new feature that is unlikely to make it into a currently released (or close to release) Python version. Also changing the ticket type to enhancement as the current behaviour isn't "wrong". It's rather a matter of supporting an additional use case.
msg208600 - Author: Nikolaus Rath (nikratio) Date: 2014-01-21 01:18
For the record: I disagree that this is an enhancement. ElementTree supports PIs as first-class tree elements. They can be added, inspected, removed, and written out when serializing into XML. Only when reading in XML, they are silently dropped. I think this is a bug, no matter how rarely people might stumble upon it in practice.

If there was no support for PIs at all (i.e., they couldn't be created as tree objects) I'd agree with you that this is an enhancement.


Unless there are objections, I'll try to work on a patch that either documents that PIs are lost, or optionally adds them to the tree when parsing (depending on how difficult that turns out to be).
msg208628 - Author: Stefan Behnel (scoder) (Python committer) Date: 2014-01-21 08:12
> Unless there are objections, I'll try to work on a patch that either documents that PIs are lost, or optionally adds them to the tree when parsing (depending on how difficult that turns out to be).

Please do. It should not be difficult at all to make the parser create PI objects (or comments).

Note, however, that this needs to be done in both the Python implementation and the parser of celementtree.
msg209037 - Author: Nikolaus Rath (nikratio) Date: 2014-01-24 04:54
I have created issue 20375 with a patch to document the current behavior.
msg340972 - Author: Stefan Behnel (scoder) (Python committer) Date: 2019-04-27 07:25
Comment/PI parsing in general is implemented in issue 36673. Note that there is currently no way to represent comments and PIs in the tree when they appear outside of the root element, which I think is what this ticket is about. After issue 36673 is resolved, however, they can at least be picked up from the parser target or from iterparse() and XMLPullParser().
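A sketch of picking them up from XMLPullParser(), assuming Python 3.8+ where the "pi" and "comment" event names are available; the input document is made up.

```python
import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(events=("start", "pi", "comment"))
parser.feed(b'<?xml-stylesheet href="style.xsl"?>\n')
parser.feed(b"<!-- header comment -->\n<body><?do-something?></body>")
parser.close()

seen = []
for event, elem in parser.read_events():
    if event == "start":
        seen.append((event, elem.tag))
    else:
        # PI and comment events carry special elements; their text holds
        # the PI target plus data, or the comment body.
        seen.append((event, elem.text))

for item in seen:
    print(item)
```

The PI and comment in front of the root element are reported as events even though they cannot be attached to the resulting tree.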
msg340995 - Author: Stefan Behnel (scoder) (Python committer) Date: 2019-04-27 15:56
Issue 24287 is a duplicate of this one and has some additional discussion.
History
Date User Action Args
2019-04-27 15:56:22 scoder set messages: + msg340995
2019-04-27 15:56:08 scoder set messages: - msg340994
2019-04-27 15:55:22 scoder link issue24287 superseder
2019-04-27 15:55:03 scoder set messages: + msg340994
2019-04-27 07:25:53 scoder set messages: + msg340972; versions: + Python 3.8, - Python 3.5
2014-01-24 04:54:12 nikratio set messages: + msg209037
2014-01-21 08:12:41 scoder set messages: + msg208628
2014-01-21 01:18:50 nikratio set messages: + msg208600
2014-01-20 08:02:36 scoder set type: behavior -> enhancement; messages: + msg208528; versions: + Python 3.5, - Python 3.3, Python 3.4
2014-01-20 03:13:40 nikratio set messages: + msg208518
2014-01-19 09:01:29 scoder set messages: + msg208464
2014-01-19 08:40:05 scoder set nosy: + scoder, eli.bendersky
2014-01-19 04:53:36 nikratio set title: xml.etree.ElementTree strips XML declaration and procesing instructions -> xml.etree.ElementTree skips processing instructions when parsing
2014-01-19 04:52:22 nikratio set files: + testcase.patch; versions: + Python 3.3, Python 3.4, - Python 3.1; nosy: + nikratio; messages: + msg208445; keywords: + patch
2010-08-06 03:22:21 r.david.murray set nosy: + effbot
2010-08-05 09:25:41 mark create