Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ElementTree ProcessingInstruction uses character entities in content #46995

Closed
waveform mannequin opened this issue May 3, 2008 · 9 comments
Closed

ElementTree ProcessingInstruction uses character entities in content #46995

waveform mannequin opened this issue May 3, 2008 · 9 comments
Labels
topic-XML type-bug An unexpected behavior, bug, or error

Comments

@waveform
Copy link
Mannequin

waveform mannequin commented May 3, 2008

BPO 2746
Nosy @pitrou, @florentx
Files
  • issue-2746_py3k.diff: Fix & test case for py3k
  • issue-2746.diff: Fixed upload of fix and test case for trunk
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2010-03-17.17:29:32.421>
    created_at = <Date 2008-05-03.15:12:24.752>
    labels = ['expert-XML', 'type-bug']
    title = 'ElementTree ProcessingInstruction uses character entities in content'
    updated_at = <Date 2010-03-17.17:29:32.419>
    user = 'https://bugs.python.org/waveform'

    bugs.python.org fields:

    activity = <Date 2010-03-17.17:29:32.419>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-03-17.17:29:32.421>
    closer = 'pitrou'
    components = ['XML']
    creation = <Date 2008-05-03.15:12:24.752>
    creator = 'waveform'
    dependencies = []
    files = ['14221', '14222']
    hgrepos = []
    issue_num = 2746
    keywords = ['patch']
    message_count = 9.0
    messages = ['66154', '66515', '89033', '89036', '89056', '89057', '94442', '94443', '99130']
    nosy_count = 9.0
    nosy_names = ['effbot', 'tlynn', 'pitrou', 'waveform', 'hodgestar', 'Neil Muller', 'jerith', 'russell', 'flox']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue2746'
    versions = ['Python 2.6', 'Python 3.1']

    @waveform
    Copy link
    Mannequin Author

    waveform mannequin commented May 3, 2008

    In the ElementTree and cElementTree implementations in Python 2.5 (and
    possibly Python 2.6 as I also found this issue when testing an SVN
    checkout of ElementTree 1.3), the conversion of a ProcessingInstruction
    to a string converts XML reserved characters (<, >, &) to character
    entities:

    >>> from xml.etree.ElementTree import *
    >>> tostring(ProcessingInstruction('test', '<testing&>'))
    '<?test &lt;testing&amp;&gt;?>'
    
    >>> from xml.etree.cElementTree import *
    >>> tostring(ProcessingInstruction('test', '<testing&>'))
    '<?test &lt;testing&amp;&gt;?>'

    The XML 1.0 spec is rather vague on whether character entities are
    permitted in PIs (it explicitly states parameter entities are not
    parsed in PIs, but says nothing about parsing character entities).
    However, it does have this to say in section 2.4 "Character Data and
    Markup":

    "The ampersand character (&) and the left angle bracket (<) MUST NOT
    appear in their literal form, except when used as markup delimiters, or
    within a comment, a processing instruction, or a CDATA section."

    So, XML reserved chars don't need converting in PIs (the only string
    not permitted in a PI's content according to the spec, section 2.6, is
    '?>'), which sort of implies that they shouldn't be. As for practical
    reasons why they shouldn't be:

    Breaks generated PHP:

    >>> from xml.etree.cElementTree import *
    >>> doc = Element('html')
    >>> SubElement(doc, 'head')
    <Element 'head' at 0x2af4e3b8a9f0>
    >>> SubElement(doc, 'body')
    <Element 'body' at 0x2af4e3b922a0>
    >>> doc[1].append(ProcessingInstruction('php', 'if (2 < 1) print 
    "<p>Something has gone horribly wrong!</p>";'))
    >>> tostring(doc)
    '<html><head /><body><?php if (2 &lt; 1) print "&lt;p&gt;Something has 
    gone horribly wrong!&lt;/p&gt;";?></body></html>'

    Different from xml.dom:

    >>> from xml.dom.minidom import *
    >>> i = getDOMImplementation()
    >>> doc = i.createDocument(None, 'html', None)
    >>> doc.documentElement.appendChild(doc.createElement('head'))
    <DOM Element: head at 0x8c6170>
    >>> doc.documentElement.appendChild(doc.createElement('body'))
    <DOM Element: body at 0x8c6290>
    >>> 
    doc.documentElement.lastChild.appendChild(doc.createProcessingInstruction('test',
     '<testing&>'))
    <xml.dom.minidom.ProcessingInstruction instance at 0x8c63b0>
    >>> doc.toxml()
    '<?xml version="1.0" ?>\n<html><head/><body><?test <testing&>?></body></
    html>'

    Different from lxml:

    >>> from lxml.etree import *
    >>> tostring(ProcessingInstruction('test', '<testing&>'))
    '<?test <testing&>?>'

    I suspect the only change necessary to fix this is to replace the
    _escape_cdata() call for ProcessingInstruction (and possibly Comment
    too given the spec quote above) in ElementTree._write() with an
    _encode() call, as shown in this patch (which includes the Comment
    change as well):

    Index: elementtree/ElementTree.py
    ===================================================================

    --- elementtree/ElementTree.py  (revision 511)
    +++ elementtree/ElementTree.py  (working copy)
    @@ -663,9 +663,9 @@
             # write XML to file
             tag = node.tag
             if tag is Comment:
    -            file.write("<!-- %s -->" % _escape_cdata(node.text, 
    encoding))
    +            file.write("<!-- %s -->" % _encode(node.text, encoding))
             elif tag is ProcessingInstruction:
    -            file.write("<?%s?>" % _escape_cdata(node.text, encoding))
    +            file.write("<?%s?>" % _encode(node.text, encoding))
             else:
                 items = node.items()
                 xmlns_items = [] # new namespaces in this scope

    Sorry I haven't got a similar patch for cElementTree. I've had a quick
    look through the source, but haven't yet figured out where the change
    should be made (unless it's not required - does cElementTree reuse that
    bit of ElementTree?).

    @waveform waveform mannequin added topic-XML type-bug An unexpected behavior, bug, or error labels May 3, 2008
    @hodgestar
    Copy link
    Mannequin

    hodgestar mannequin commented May 10, 2008

    cElementTree.ElementTree is a copy of ElementTree.ElementTree with the
    .parse(...) method replaced, so the original patch for ElementTree
    should fix cElementTree too.

    The copying of the ElementTree class into cElementTree happens in the
    call to boostrap in the init_elementtree() function at the bottom of
    _elementtree.c.

    @NeilMuller
    Copy link
    Mannequin

    NeilMuller mannequin commented Jun 7, 2009

    Patch which includes the given fix and adds a test case to cover this
    (test case from Russell Cloran)

    @NeilMuller
    Copy link
    Mannequin

    NeilMuller mannequin commented Jun 7, 2009

    Previous patch was missing two lines in the test case. Correct fix uploaded

    @NeilMuller
    Copy link
    Mannequin

    NeilMuller mannequin commented Jun 7, 2009

    Issue also effects p3k. Adapted patch attached.

    @NeilMuller
    Copy link
    Mannequin

    NeilMuller mannequin commented Jun 7, 2009

    Previous upload of issue_2746 was corrupt. Fixed version uploaded.

    @pitrou
    Copy link
    Member

    pitrou commented Oct 24, 2009

    Can you include the cElementTree fix and test case in your patch as well?

    @pitrou
    Copy link
    Member

    pitrou commented Oct 24, 2009

    Oops, sorry, I hadn't read your message about the patch also correcting
    cElementTree.

    @pitrou
    Copy link
    Member

    pitrou commented Feb 9, 2010

    I've committed the patch in r78125 (trunk) and r78126 (py3k). I'm not sure I want to backport it to 2.6/3.1, since it might bite people who relied on the old behaviour.

    @pitrou pitrou closed this as completed Mar 17, 2010
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    topic-XML type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    1 participant