xml.etree.ElementTree skips processing instructions when parsing #53730

mark-summerfield · 2010-08-05T09:25:42Z

BPO	9521
Nosy	@scoder, @mark-summerfield
Files	testcase.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2010-08-05.09:25:41.808>
labels = ['3.8', 'type-feature', 'library']
title = 'xml.etree.ElementTree skips processing instructions when parsing'
updated_at = <Date 2019-04-27.15:56:22.021>
user = 'https://github.com/mark-summerfield'

bugs.python.org fields:

activity = <Date 2019-04-27.15:56:22.021>
actor = 'scoder'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2010-08-05.09:25:41.808>
creator = 'mark'
dependencies = []
files = ['33538']
hgrepos = []
issue_num = 9521
keywords = ['patch']
message_count = 10.0
messages = ['112961', '208445', '208464', '208518', '208528', '208600', '208628', '209037', '340972', '340995']
nosy_count = 5.0
nosy_names = ['effbot', 'scoder', 'mark', 'eli.bendersky', 'nikratio']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue9521'
versions = ['Python 3.8']

mark-summerfield · 2010-08-05T09:25:39Z

If you read in an XML file using xml.etree.ElementTree.parse() and then write it out again using xml.etree.ElementTree.write() what is written may not be the same as what was read. In particular any XML declaration and processing instructions are stripped.

It seems to me that the parser should at least preserve any declaration and processing instructions so that reading and writing match up.

Here's an example:

Python 3.1.2 (r312:79147, Jul 15 2010, 10:56:05) 
[GCC 4.4.4] on linux2
Type "copyright", "credits" or "license()" for more information.
>>> file = "control-center.xml"
>>> open(file).read()[:500]
'<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" [\n<!ENTITY VERSION "1.5.7">\n]>\n<article id="index" lang="en_GB">\n  \n  <articleinfo>\n    <abstract role="description">\n      <para>The GNOME Control Centre provides a central place for the user to setup their GNOME experience. It can let you configure anything from the behaviour of your window borders to the default font type.</para>\n   '
>>> import xml.etree.ElementTree as etree
>>> xml = etree.parse(file)
>>> temp = "temp.xml"
>>> xml.write("temp.xml", encoding="utf-8")
>>> open(temp).read()[:500]
'<article id="index" lang="en_GB">\n  \n  <articleinfo>\n    <abstract role="description">\n      <para>The GNOME Control Centre provides a central place for the user to setup their GNOME experience. It can let you configure anything from the behaviour of your window borders to the default font type.</para>\n    </abstract>\n    <title>Control Centre</title>\n    <authorgroup>\n      <author>\n\t<firstname>Kevin</firstname><surname>Breit</surname>\n      </author>\n    </authorgroup>\n    <copyright>\n      <y'
>>>

nikratio · 2014-01-19T04:52:22Z

I can confirm this. The actual problem is that neither XML nor SGML PIs in the input make it into the etree, and no events are generated for them during incremental parsing.

XML PIs that are added into the tree using Python functions are correctly written out. SGML PIs currently cannot be represented at all (there's no ElementTree.SGMLProcessingInstruction analogous to ElementTree.ProcessingInstruction)

There is special cased support for the DOCTYPE element in the TreeBuilder class to allow retrieving the doctype when not parsing incrementally, but it needs to be retrieved manually and written out manually.

I have attached a testcase for XML PIs. For proper SGML PI handling, ElementTree first needs to learn about them.

Recommended stage for this issue: needs patch

scoder · 2014-01-19T09:01:29Z

When you write "XML PI", do you mean the XML declaration? At least that's what Mark used in his original example.

ET avoids writing them out when they are not necessary, i.e. for UTF-8 compatible encodings. IMHO that's perfectly ok and definitely not an incorrect behaviour.

As for processing instructions (what you used in your test case patch), making them appear in the tree by default would be a behavioural change that might break existing ET code.

Note that lxml keeps PIs in the tree by default, unless you configure its parser explicitly with "remove_pis=True".

There is also a "remove_comments=True" in lxml. ET simply discards comments when parsing IIRC.

http://lxml.de/parsing.html#parser-options

IMHO, both behaviours are ok, which lxml having a tendency towards keeping the data as it came in rather than trying to find the easiest possible way for the user to work with the tree. PIs and comments are a bit 'special' to work with.

A fix could be to add the two keyword arguments also to ET's parser, but make them default to True (as opposed to False in lxml), so that users can enable them at need.

nikratio · 2014-01-20T03:13:39Z

No, I really mean XML processing instruction. I agree with you that the XML declaration is a non-issue, because there is no information lost: you know that you're going to write XML, and you manually specify the encoding. Thus it's trivial to add the correct XML declaration if desired.

The fact that PIs are not read, however, is a real problem. The XML spec requires that PIs MUST be passed trough (http://www.w3.org/TR/REC-xml/#sec-pi). Furthermore, ElementTree is designed to represent XML data, so writing out an ElementTree as XML and reading it back in must (in my opinionn not result in information loss. But currently it does:

>>> import xml.etree.ElementTree as ET
>>> import tempfile
>>> root = ET.Element('body', {'text': 'some text for the body'})
>>> root.insert(1, ET.ProcessingInstruction('do-something'))
>>> tree = ET.ElementTree(root)
>>> tmp = tempfile.NamedTemporaryFile()
>>> tree.write(tmp.name)
>>> tmp.seek(0)
0
>>> tree_copy = ET.parse(tmp.name)
>>> ET.dump(tree)
<body text="some text for the body"><?do-something?></body>
>>> ET.dump(tree_copy)
<body text="some text for the body" />

I think tree and tree_copy not having the some contents is a bug.

Regarding comments: personally I think that throwing away is not a good idea either. But this is allowed by the XML spec (http://www.w3.org/TR/REC-xml/#dt-comment). This should probably go in a separate bug report if someone is interested in it.

As for backwards compatibility: yes, this is a concern. The keyword argument would be a solution. On the other hand, I'm not sure that the default should be something that causes dataloss...?

lxml sounds like it's doing the right things. Is there some connection between lxml and etree that I'm not aware of, or did you just give it as an example of how a different library behaves?

scoder · 2014-01-20T08:02:35Z

As for backwards compatibility: yes, this is a concern. The keyword argument would be a solution. On the other hand, I'm not sure that the default should be something that causes dataloss...?

It's a matter of use cases. How often do people care? My experience tells me that it's much more common to parse XML in and extract information from it, than to do round trips. And when extracting information, both comments and processing instructions usually get in the way.

So I would say that this default behaviour isn't wrong. In fact, I'd even say that both lxml and ET behave as expected (or at least as intended) in their own ways.

Also, I can't really remember spotting a processing instruction anywhere *inside* of a root element in real world XML. For those living next to it, ET currently lacks support in its tree model.

lxml sounds like it's doing the right things. Is there some connection between lxml and etree that I'm not aware of, or did you just give it as an example of how a different library behaves?

Both implement (essentially) the same API, so I consider compatibility and look-alikes important. The general idea is that lxml extends what's there, but otherwise tries to stay compatible as good as it can.

BTW, lxml also allows parser target objects to have "pi(self, target, data)" and "comment(text)" methods, through which it can pass comments and PIs through to user provided tree builders. I think ET lacks those, too.

Changing target version from 3.3/3.4 to 3.5 as this is a new feature that is unlikely to make it into a currently released (or close to release) Python version. Also changing the ticket type to enhancement as the current behaviour isn't "wrong". It's rather a matter of supporting an additional use case.

nikratio · 2014-01-21T01:18:49Z

For the record: I disagree that this is an enhancement. ElementTree supports PIs as first-class tree elements. They can be added, inspected, removed, and written out when serializing into XML. Only when reading in XML, they are silently dropped. I think this is a bug, no matter how rarely people might stumble upon it in practice.

If there was no support for PIs at all (i.e., they couldn't be created as tree objects) I'd agree with you that this is an enhancement.

Unless there are objections, I'll try to work on a patch that either documents that PIs are lost, or optionally adds them to the tree when parsing (depending on how difficult that turns out to be).

scoder · 2014-01-21T08:12:41Z

Unless there are objections, I'll try to work on a patch that either documents that PIs are lost, or optionally adds them to the tree when parsing (depending on how difficult that turns out to be).

Please do. It should not be difficult at all to make the parser create PI objects (or comments).

Note, however, that this needs to be done in both the Python implementation and the parser of celementtree.

nikratio · 2014-01-24T04:54:13Z

I have created bpo-20375 with a patch to document the current behavior.

scoder · 2019-04-27T07:25:53Z

Comment/PI parsing in general is implemented in bpo-36673. Note that there is currently no way to represent comments and PIs in the tree when they appear outside of the root element, which I think is what this ticket is about. After bpo-36673 is resolved, however, they can at least be picked up from the parser target or from iterparse() and XMLPullParser().

scoder · 2019-04-27T15:56:22Z

bpo-24287 is a duplicate of this one and has some additional discussion.

mark-summerfield mannequin added type-bug An unexpected behavior, bug, or error stdlib Python modules in the Lib dir labels Aug 5, 2010

nikratio mannequin changed the title ~~xml.etree.ElementTree strips XML declaration and procesing instructions~~ xml.etree.ElementTree skips processing instructions when parsing Jan 19, 2014

scoder added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Jan 20, 2014

scoder added the 3.8 only security fixes label Apr 27, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022

serhiy-storchaka added the topic-XML label May 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

xml.etree.ElementTree skips processing instructions when parsing #53730

xml.etree.ElementTree skips processing instructions when parsing #53730

mark-summerfield mannequin commented Aug 5, 2010

mark-summerfield mannequin commented Aug 5, 2010

nikratio mannequin commented Jan 19, 2014

scoder commented Jan 19, 2014

nikratio mannequin commented Jan 20, 2014

scoder commented Jan 20, 2014

nikratio mannequin commented Jan 21, 2014

scoder commented Jan 21, 2014

nikratio mannequin commented Jan 24, 2014

scoder commented Apr 27, 2019

scoder commented Apr 27, 2019

xml.etree.ElementTree skips processing instructions when parsing #53730

xml.etree.ElementTree skips processing instructions when parsing #53730

Comments

mark-summerfield mannequin commented Aug 5, 2010

mark-summerfield mannequin commented Aug 5, 2010

nikratio mannequin commented Jan 19, 2014

scoder commented Jan 19, 2014

nikratio mannequin commented Jan 20, 2014

scoder commented Jan 20, 2014

nikratio mannequin commented Jan 21, 2014

scoder commented Jan 21, 2014

nikratio mannequin commented Jan 24, 2014

scoder commented Apr 27, 2019

scoder commented Apr 27, 2019