classification
Title: ElementTree ProcessingInstruction uses character entities in content
Type: behavior Stage: resolved
Components: XML Versions: Python 3.1, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Neil Muller, effbot, flox, hodgestar, jerith, pitrou, russell, tlynn, waveform
Priority: normal Keywords: patch

Created on 2008-05-03 15:12 by waveform, last changed 2010-03-17 17:29 by pitrou. This issue is now closed.

Files
File name | Uploaded | Description
issue-2746_py3k.diff | Neil Muller, 2009-06-07 20:59 | Fix & test case for py3k
issue-2746.diff | Neil Muller, 2009-06-07 21:01 | Fixed upload of fix and test case for trunk
Messages (9)
msg66154 - Author: Dave Hughes (waveform) Date: 2008-05-03 15:12
In the ElementTree and cElementTree implementations in Python 2.5 (and 
possibly Python 2.6 as I also found this issue when testing an SVN 
checkout of ElementTree 1.3), the conversion of a ProcessingInstruction 
to a string converts XML reserved characters (<, >, &) to character 
entities:

>>> from xml.etree.ElementTree import *
>>> tostring(ProcessingInstruction('test', '<testing&>'))
'<?test &lt;testing&amp;&gt;?>'

>>> from xml.etree.cElementTree import *
>>> tostring(ProcessingInstruction('test', '<testing&>'))
'<?test &lt;testing&amp;&gt;?>'

The XML 1.0 spec is rather vague on whether character entities are 
permitted in PIs (it explicitly states parameter entities are not 
parsed in PIs, but says nothing about parsing character entities). 
However, it does have this to say in section 2.4 "Character Data and 
Markup":

"The ampersand character (&) and the left angle bracket (<) MUST NOT 
appear in their literal form, except when used as markup delimiters, or 
within a comment, a processing instruction, or a CDATA section."

So XML reserved characters don't need escaping in PIs (according to 
section 2.6 of the spec, the only string not permitted in a PI's 
content is '?>'), which implies that they shouldn't be escaped. There 
are also practical reasons not to escape them:

Breaks generated PHP:

>>> from xml.etree.cElementTree import *
>>> doc = Element('html')
>>> SubElement(doc, 'head')
<Element 'head' at 0x2af4e3b8a9f0>
>>> SubElement(doc, 'body')
<Element 'body' at 0x2af4e3b922a0>
>>> doc[1].append(ProcessingInstruction('php', 'if (2 < 1) print "<p>Something has gone horribly wrong!</p>";'))
>>> tostring(doc)
'<html><head /><body><?php if (2 &lt; 1) print "&lt;p&gt;Something has gone horribly wrong!&lt;/p&gt;";?></body></html>'

Different from xml.dom:

>>> from xml.dom.minidom import *
>>> i = getDOMImplementation()
>>> doc = i.createDocument(None, 'html', None)
>>> doc.documentElement.appendChild(doc.createElement('head'))
<DOM Element: head at 0x8c6170>
>>> doc.documentElement.appendChild(doc.createElement('body'))
<DOM Element: body at 0x8c6290>
>>> doc.documentElement.lastChild.appendChild(doc.createProcessingInstruction('test', '<testing&>'))
<xml.dom.minidom.ProcessingInstruction instance at 0x8c63b0>
>>> doc.toxml()
'<?xml version="1.0" ?>\n<html><head/><body><?test <testing&>?></body></html>'

Different from lxml:

>>> from lxml.etree import *
>>> tostring(ProcessingInstruction('test', '<testing&>'))
'<?test <testing&>?>'

I suspect the only change necessary to fix this is to replace the 
_escape_cdata() call for ProcessingInstruction (and possibly Comment 
too given the spec quote above) in ElementTree._write() with an 
_encode() call, as shown in this patch (which includes the Comment 
change as well):

Index: elementtree/ElementTree.py
===================================================================
--- elementtree/ElementTree.py  (revision 511)
+++ elementtree/ElementTree.py  (working copy)
@@ -663,9 +663,9 @@
         # write XML to file
         tag = node.tag
         if tag is Comment:
-            file.write("<!-- %s -->" % _escape_cdata(node.text, encoding))
+            file.write("<!-- %s -->" % _encode(node.text, encoding))
         elif tag is ProcessingInstruction:
-            file.write("<?%s?>" % _escape_cdata(node.text, encoding))
+            file.write("<?%s?>" % _encode(node.text, encoding))
         else:
             items = node.items()
             xmlns_items = [] # new namespaces in this scope

Sorry I haven't got a similar patch for cElementTree. I've had a quick 
look through the source, but haven't yet figured out where the change 
should be made (unless it's not required - does cElementTree reuse that 
bit of ElementTree?).
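
To make the effect of the proposed change concrete, here is a minimal standalone sketch. It does not use ElementTree's internals: `escape_cdata` and `serialize_pi` are hypothetical helpers written only for illustration, and the `escape` flag is an assumption used to contrast the reported behaviour with the patched one. The check for '?>' reflects the one restriction from section 2.6 mentioned above.

```python
def escape_cdata(text):
    """Escape XML reserved characters, as is done for element text."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def serialize_pi(target, content, escape=False):
    """Render a processing instruction as a string (illustrative only).

    escape=True mimics the behaviour reported in this issue;
    escape=False mimics the behaviour the patch aims for.
    """
    if "?>" in content:
        # "?>" would terminate the PI early, so it cannot appear in content.
        raise ValueError("PI content must not contain '?>'")
    body = escape_cdata(content) if escape else content
    return "<?%s %s?>" % (target, body)


old = serialize_pi("test", "<testing&>", escape=True)
new = serialize_pi("test", "<testing&>")
print(old)  # <?test &lt;testing&amp;&gt;?>
print(new)  # <?test <testing&>?>
```

With escaping disabled, the output matches what xml.dom.minidom and lxml produce in the examples above.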
msg66515 - Author: Simon Cross (hodgestar) Date: 2008-05-10 12:57
cElementTree.ElementTree is a copy of ElementTree.ElementTree with the
.parse(...) method replaced, so the original patch for ElementTree
should fix cElementTree too.

The copying of the ElementTree class into cElementTree happens in the
call to bootstrap in the init_elementtree() function at the bottom of
_elementtree.c.
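
As a rough illustration of why one patch can cover both modules, the sketch below models the relationship with a subclass that replaces only parse(). The class names `PureTree` and `FastTree` are hypothetical stand-ins, and the real bootstrap code in _elementtree.c may be structured differently; the point is only that the serialisation method is shared.

```python
class PureTree:
    """Stand-in for ElementTree.ElementTree (hypothetical, simplified)."""

    def parse(self, source):
        return "parsed %r with the pure-Python parser" % (source,)

    def write(self, text):
        # Serialisation logic is defined here and nowhere else.
        return "<?%s?>" % text


class FastTree(PureTree):
    """Stand-in for cElementTree.ElementTree: only parse() is replaced."""

    def parse(self, source):
        return "parsed %r with the C parser" % (source,)


# Both classes share the same write() function, so a fix to it
# is picked up by the derived class automatically.
shared = FastTree.write is PureTree.write
rendered = FastTree().write("test <testing&>")
print(shared)    # True
print(rendered)  # <?test <testing&>?>
```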
msg89033 - Author: Neil Muller (Neil Muller) Date: 2009-06-07 13:59
Patch which includes the given fix and adds a test case to cover this
(test case from Russell Cloran).
msg89036 - Author: Neil Muller (Neil Muller) Date: 2009-06-07 14:10
Previous patch was missing two lines in the test case. Correct fix uploaded.
msg89056 - Author: Neil Muller (Neil Muller) Date: 2009-06-07 20:59
Issue also affects py3k. Adapted patch attached.
msg89057 - Author: Neil Muller (Neil Muller) Date: 2009-06-07 21:01
Previous upload of issue_2746 was corrupt. Fixed version uploaded.
msg94442 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-10-24 23:37
Can you include the cElementTree fix and test case in your patch as well?
msg94443 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-10-24 23:40
Oops, sorry, I hadn't read your message about the patch also correcting
cElementTree.
msg99130 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-02-09 17:26
I've committed the patch in r78125 (trunk) and r78126 (py3k). I'm not sure I want to backport it to 2.6/3.1, since it might bite people who relied on the old behaviour.
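
Since the behaviour differs between versions with and without this fix, code that depends on it could probe the running interpreter rather than assume one or the other. The following is a minimal sketch of such a probe; `pi_content_is_escaped` is a hypothetical helper name, and it assumes only the public `ProcessingInstruction` and `tostring` functions of xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET


def pi_content_is_escaped():
    """Return True if this ElementTree escapes '<' inside PI content."""
    probe = ET.ProcessingInstruction("probe", "<")
    out = ET.tostring(probe)
    if isinstance(out, bytes):
        # tostring() returns bytes on Python 3 by default.
        out = out.decode("ascii")
    return "&lt;" in out


if pi_content_is_escaped():
    print("PI content is escaped (pre-fix behaviour)")
else:
    print("PI content is written verbatim (post-fix behaviour)")
```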
History
Date | User | Action | Args
2010-03-17 17:29:32 | pitrou | set | status: pending -> closed; nosy: + flox; resolution: accepted -> fixed
2010-02-09 17:26:51 | pitrou | set | status: open -> pending; versions: - Python 2.7, Python 3.2; messages: + msg99130; resolution: accepted; stage: patch review -> resolved
2009-10-24 23:40:20 | pitrou | set | messages: + msg94443
2009-10-24 23:37:55 | pitrou | set | priority: normal; versions: + Python 3.2, - Python 2.5, Python 3.0; nosy: + pitrou; messages: + msg94442; stage: patch review
2009-10-24 20:02:49 | tlynn | set | nosy: + tlynn
2009-06-08 05:06:54 | jerith | set | nosy: + jerith
2009-06-07 21:01:51 | Neil Muller | set | files: + issue-2746.diff; messages: + msg89057
2009-06-07 21:00:57 | Neil Muller | set | files: - issue-2746.diff
2009-06-07 21:00:34 | Neil Muller | set | files: - issue-2746.diff
2009-06-07 20:59:55 | Neil Muller | set | files: + issue-2746_py3k.diff; messages: + msg89056; versions: + Python 2.6, Python 3.0, Python 3.1, Python 2.7
2009-06-07 15:32:18 | russell | set | nosy: + russell
2009-06-07 14:10:50 | Neil Muller | set | files: + issue-2746.diff; messages: + msg89036
2009-06-07 14:00:38 | Neil Muller | set | nosy: + effbot
2009-06-07 13:59:35 | Neil Muller | set | files: + issue-2746.diff; nosy: + Neil Muller; messages: + msg89033; keywords: + patch
2008-05-10 12:57:44 | hodgestar | set | nosy: + hodgestar; messages: + msg66515
2008-05-03 15:12:24 | waveform | create |