Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xml.etree.ElementTree skips processing instructions when parsing #53730

Open
mark-summerfield mannequin opened this issue Aug 5, 2010 · 10 comments
Open

xml.etree.ElementTree skips processing instructions when parsing #53730

mark-summerfield mannequin opened this issue Aug 5, 2010 · 10 comments
Labels
3.8 only security fixes stdlib Python modules in the Lib dir topic-XML type-feature A feature request or enhancement

Comments

@mark-summerfield
Copy link
Mannequin

mark-summerfield mannequin commented Aug 5, 2010

BPO 9521
Nosy @scoder, @mark-summerfield
Files
  • testcase.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2010-08-05.09:25:41.808>
    labels = ['3.8', 'type-feature', 'library']
    title = 'xml.etree.ElementTree skips processing instructions when parsing'
    updated_at = <Date 2019-04-27.15:56:22.021>
    user = 'https://github.com/mark-summerfield'

    bugs.python.org fields:

    activity = <Date 2019-04-27.15:56:22.021>
    actor = 'scoder'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2010-08-05.09:25:41.808>
    creator = 'mark'
    dependencies = []
    files = ['33538']
    hgrepos = []
    issue_num = 9521
    keywords = ['patch']
    message_count = 10.0
    messages = ['112961', '208445', '208464', '208518', '208528', '208600', '208628', '209037', '340972', '340995']
    nosy_count = 5.0
    nosy_names = ['effbot', 'scoder', 'mark', 'eli.bendersky', 'nikratio']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue9521'
    versions = ['Python 3.8']

    @mark-summerfield
    Copy link
    Mannequin Author

    mark-summerfield mannequin commented Aug 5, 2010

    If you read in an XML file using xml.etree.ElementTree.parse() and then write it out again using xml.etree.ElementTree.write() what is written may not be the same as what was read. In particular any XML declaration and processing instructions are stripped.

    It seems to me that the parser should at least preserve any declaration and processing instructions so that reading and writing match up.

    Here's an example:

    Python 3.1.2 (r312:79147, Jul 15 2010, 10:56:05) 
    [GCC 4.4.4] on linux2
    Type "copyright", "credits" or "license()" for more information.
    >>> file = "control-center.xml"
    >>> open(file).read()[:500]
    '<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" [\n<!ENTITY VERSION "1.5.7">\n]>\n<article id="index" lang="en_GB">\n  \n  <articleinfo>\n    <abstract role="description">\n      <para>The GNOME Control Centre provides a central place for the user to setup their GNOME experience. It can let you configure anything from the behaviour of your window borders to the default font type.</para>\n   '
    >>> import xml.etree.ElementTree as etree
    >>> xml = etree.parse(file)
    >>> temp = "temp.xml"
    >>> xml.write("temp.xml", encoding="utf-8")
    >>> open(temp).read()[:500]
    '<article id="index" lang="en_GB">\n  \n  <articleinfo>\n    <abstract role="description">\n      <para>The GNOME Control Centre provides a central place for the user to setup their GNOME experience. It can let you configure anything from the behaviour of your window borders to the default font type.</para>\n    </abstract>\n    <title>Control Centre</title>\n    <authorgroup>\n      <author>\n\t<firstname>Kevin</firstname><surname>Breit</surname>\n      </author>\n    </authorgroup>\n    <copyright>\n      <y'
    >>>

    @mark-summerfield mark-summerfield mannequin added type-bug An unexpected behavior, bug, or error stdlib Python modules in the Lib dir labels Aug 5, 2010
    @nikratio
    Copy link
    Mannequin

    nikratio mannequin commented Jan 19, 2014

    I can confirm this. The actual problem is that neither XML nor SGML PIs in the input make it into the etree, and no events are generated for them during incremental parsing.

    XML PIs that are added into the tree using Python functions are correctly written out. SGML PIs currently cannot be represented at all (there's no ElementTree.SGMLProcessingInstruction analogous to ElementTree.ProcessingInstruction)

    There is special cased support for the DOCTYPE element in the TreeBuilder class to allow retrieving the doctype when not parsing incrementally, but it needs to be retrieved manually and written out manually.

    I have attached a testcase for XML PIs. For proper SGML PI handling, ElementTree first needs to learn about them.

    Recommended stage for this issue: needs patch

    @nikratio nikratio mannequin changed the title xml.etree.ElementTree strips XML declaration and procesing instructions xml.etree.ElementTree skips processing instructions when parsing Jan 19, 2014
    @scoder
    Copy link
    Contributor

    scoder commented Jan 19, 2014

    When you write "XML PI", do you mean the XML declaration? At least that's what Mark used in his original example.

    ET avoids writing them out when they are not necessary, i.e. for UTF-8 compatible encodings. IMHO that's perfectly ok and definitely not an incorrect behaviour.

    As for processing instructions (what you used in your test case patch), making them appear in the tree by default would be a behavioural change that might break existing ET code.

    Note that lxml keeps PIs in the tree by default, unless you configure its parser explicitly with "remove_pis=True".

    There is also a "remove_comments=True" in lxml. ET simply discards comments when parsing IIRC.

    http://lxml.de/parsing.html#parser-options

    IMHO, both behaviours are ok, which lxml having a tendency towards keeping the data as it came in rather than trying to find the easiest possible way for the user to work with the tree. PIs and comments are a bit 'special' to work with.

    A fix could be to add the two keyword arguments also to ET's parser, but make them default to True (as opposed to False in lxml), so that users can enable them at need.

    @nikratio
    Copy link
    Mannequin

    nikratio mannequin commented Jan 20, 2014

    No, I really mean XML processing instruction. I agree with you that the XML declaration is a non-issue, because there is no information lost: you know that you're going to write XML, and you manually specify the encoding. Thus it's trivial to add the correct XML declaration if desired.

    The fact that PIs are not read, however, is a real problem. The XML spec requires that PIs MUST be passed trough (http://www.w3.org/TR/REC-xml/#sec-pi). Furthermore, ElementTree is designed to represent XML data, so writing out an ElementTree as XML and reading it back in must (in my opinionn not result in information loss. But currently it does:

    >>> import xml.etree.ElementTree as ET
    >>> import tempfile
    >>> root = ET.Element('body', {'text': 'some text for the body'})
    >>> root.insert(1, ET.ProcessingInstruction('do-something'))
    >>> tree = ET.ElementTree(root)
    >>> tmp = tempfile.NamedTemporaryFile()
    >>> tree.write(tmp.name)
    >>> tmp.seek(0)
    0
    >>> tree_copy = ET.parse(tmp.name)
    >>> ET.dump(tree)
    <body text="some text for the body"><?do-something?></body>
    >>> ET.dump(tree_copy)
    <body text="some text for the body" />

    I think tree and tree_copy not having the some contents is a bug.

    Regarding comments: personally I think that throwing away is not a good idea either. But this is allowed by the XML spec (http://www.w3.org/TR/REC-xml/#dt-comment). This should probably go in a separate bug report if someone is interested in it.

    As for backwards compatibility: yes, this is a concern. The keyword argument would be a solution. On the other hand, I'm not sure that the default should be something that causes dataloss...?

    lxml sounds like it's doing the right things. Is there some connection between lxml and etree that I'm not aware of, or did you just give it as an example of how a different library behaves?

    @scoder
    Copy link
    Contributor

    scoder commented Jan 20, 2014

    As for backwards compatibility: yes, this is a concern. The keyword argument would be a solution. On the other hand, I'm not sure that the default should be something that causes dataloss...?

    It's a matter of use cases. How often do people care? My experience tells me that it's much more common to parse XML in and extract information from it, than to do round trips. And when extracting information, both comments and processing instructions usually get in the way.

    So I would say that this default behaviour isn't wrong. In fact, I'd even say that both lxml and ET behave as expected (or at least as intended) in their own ways.

    Also, I can't really remember spotting a processing instruction anywhere *inside* of a root element in real world XML. For those living next to it, ET currently lacks support in its tree model.

    lxml sounds like it's doing the right things. Is there some connection between lxml and etree that I'm not aware of, or did you just give it as an example of how a different library behaves?

    Both implement (essentially) the same API, so I consider compatibility and look-alikes important. The general idea is that lxml extends what's there, but otherwise tries to stay compatible as good as it can.

    BTW, lxml also allows parser target objects to have "pi(self, target, data)" and "comment(text)" methods, through which it can pass comments and PIs through to user provided tree builders. I think ET lacks those, too.

    Changing target version from 3.3/3.4 to 3.5 as this is a new feature that is unlikely to make it into a currently released (or close to release) Python version. Also changing the ticket type to enhancement as the current behaviour isn't "wrong". It's rather a matter of supporting an additional use case.

    @scoder scoder added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Jan 20, 2014
    @nikratio
    Copy link
    Mannequin

    nikratio mannequin commented Jan 21, 2014

    For the record: I disagree that this is an enhancement. ElementTree supports PIs as first-class tree elements. They can be added, inspected, removed, and written out when serializing into XML. Only when reading in XML, they are silently dropped. I think this is a bug, no matter how rarely people might stumble upon it in practice.

    If there was no support for PIs at all (i.e., they couldn't be created as tree objects) I'd agree with you that this is an enhancement.

    Unless there are objections, I'll try to work on a patch that either documents that PIs are lost, or optionally adds them to the tree when parsing (depending on how difficult that turns out to be).

    @scoder
    Copy link
    Contributor

    scoder commented Jan 21, 2014

    Unless there are objections, I'll try to work on a patch that either documents that PIs are lost, or optionally adds them to the tree when parsing (depending on how difficult that turns out to be).

    Please do. It should not be difficult at all to make the parser create PI objects (or comments).

    Note, however, that this needs to be done in both the Python implementation and the parser of celementtree.

    @nikratio
    Copy link
    Mannequin

    nikratio mannequin commented Jan 24, 2014

    I have created bpo-20375 with a patch to document the current behavior.

    @scoder
    Copy link
    Contributor

    scoder commented Apr 27, 2019

    Comment/PI parsing in general is implemented in bpo-36673. Note that there is currently no way to represent comments and PIs in the tree when they appear outside of the root element, which I think is what this ticket is about. After bpo-36673 is resolved, however, they can at least be picked up from the parser target or from iterparse() and XMLPullParser().

    @scoder scoder added the 3.8 only security fixes label Apr 27, 2019
    @scoder
    Copy link
    Contributor

    scoder commented Apr 27, 2019

    bpo-24287 is a duplicate of this one and has some additional discussion.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.8 only security fixes stdlib Python modules in the Lib dir topic-XML type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants