Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xml.dom.minidom toprettyxml: omit whitespace for text-only elements #48397

Closed
thomaslee mannequin opened this issue Oct 19, 2008 · 21 comments
Closed

xml.dom.minidom toprettyxml: omit whitespace for text-only elements #48397

thomaslee mannequin opened this issue Oct 19, 2008 · 21 comments
Assignees
Labels
release-blocker stdlib Python modules in the Lib dir topic-XML type-bug An unexpected behavior, bug, or error

Comments

@thomaslee
Copy link
Mannequin

thomaslee mannequin commented Oct 19, 2008

BPO 4147
Nosy @birkenfeld, @pefu, @benjaminp, @ezio-melotti, @bitdancer, @dankenigsberg
Files
  • minidom-toprettyxml-01.patch: A patch implementing the proposed changes.
  • minidom-Text-toprettyxml.patch: toprettyxml() should conserve Text.data
  • minidom-Text-toprettyxml-02.patch: preserve only Text nodes.
  • issue4147.diff: patch against 2.7
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = <Date 2011-11-18.15:39:13.696>
    created_at = <Date 2008-10-19.15:32:24.037>
    labels = ['expert-XML', 'type-bug', 'library', 'release-blocker']
    title = 'xml.dom.minidom toprettyxml: omit whitespace for text-only elements'
    updated_at = <Date 2011-11-18.15:39:13.695>
    user = 'https://bugs.python.org/thomaslee'

    bugs.python.org fields:

    activity = <Date 2011-11-18.15:39:13.695>
    actor = 'ezio.melotti'
    assignee = 'ezio.melotti'
    closed = True
    closed_date = <Date 2011-11-18.15:39:13.696>
    closer = 'ezio.melotti'
    components = ['Library (Lib)', 'XML']
    creation = <Date 2008-10-19.15:32:24.037>
    creator = 'thomaslee'
    dependencies = []
    files = ['11832', '23286', '23294', '23692']
    hgrepos = []
    issue_num = 4147
    keywords = ['patch']
    message_count = 21.0
    messages = ['74978', '90546', '90574', '102247', '111604', '122289', '144745', '144746', '144747', '144748', '144755', '144770', '144771', '144778', '147655', '147659', '147662', '147685', '147848', '147879', '147880']
    nosy_count = 13.0
    nosy_names = ['georg.brandl', 'pefu', 'techtonik', 'thomaslee', 'benjamin.peterson', 'ezio.melotti', 'Arfrever', 'r.david.murray', 'jgarrison', 'm-samia', 'danken', 'tod', 'python-dev']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue4147'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

    @thomaslee
    Copy link
    Mannequin Author

    thomaslee mannequin commented Oct 19, 2008

    For XML elements containing only text data, it would be nice if
    toprettyxml could omit the whitespace it normally injects before & after
    the text, e.g.

    <person>
    <first-name>
    Bob
    </first-name>
    </person>

    Becomes:

    <person>
    <first-name>Bob</first-name>
    </person>

    From what I understand the handling of whitespace within XML elements is
    application-defined, so I'm classifying this as a nice-to-have feature.
    However it should be noted that in our particular case, the existing
    behaviour caused a few problems with a third-party system which treated
    whitespace as being significant.

    @thomaslee thomaslee mannequin added type-feature A feature request or enhancement stdlib Python modules in the Lib dir topic-XML labels Oct 19, 2008
    @jgarrison
    Copy link
    Mannequin

    jgarrison mannequin commented Jul 15, 2009

    Also needed here. While pretty-printing should be able to insert
    non-significant whitespace BETWEEN xml elements, it should never alter
    the content of (i.e. insert whitespace into) existing text elements.

    @jgarrison
    Copy link
    Mannequin

    jgarrison mannequin commented Jul 16, 2009

    To clarify:

    ... it should never alter the content of (i.e. insert whitespace into)
    existing text elements that contain non-whitespace characters.

    @m-samia
    Copy link
    Mannequin

    m-samia mannequin commented Apr 3, 2010

    Could you please apply that patch? People are starting to use non-standard libraries to process xml files because of this issue

    for example this man is using lxml or pyxml:
    http://ronrothman.com/public/leftbraned/xml-dom-minidom-toprettyxml-and-silly-whitespace/

    Is there any problem with that patch?

    @BreamoreBoy
    Copy link
    Mannequin

    BreamoreBoy mannequin commented Jul 26, 2010

    @thomas: could you provide a unit test to go with your patch.

    @techtonik
    Copy link
    Mannequin

    techtonik mannequin commented Nov 24, 2010

    This one is bug.

    @dankenigsberg
    Copy link
    Mannequin

    dankenigsberg mannequin commented Oct 1, 2011

    Here's another take on fixing this bug, with an accompanying unit test. Personally, I'm monkey-patching xml.dom.minidom in order to avoid it, but please consider fixing it properly upstream.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 1, 2011

    New changeset 086ca132e161 by R David Murray in branch '3.2':
    bpo-4147: minidom's toprettyxml no longer adds whitespace to text nodes.
    http://hg.python.org/cpython/rev/086ca132e161

    New changeset fa0b1e50270f by R David Murray in branch 'default':
    merge bpo-4147: minidom's toprettyxml no longer adds whitespace to text nodes.
    http://hg.python.org/cpython/rev/fa0b1e50270f

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 1, 2011

    New changeset 406c5b69cb1b by R David Murray in branch '2.7':
    bpo-4147: minidom's toprettyxml no longer adds whitespace to text nodes.
    http://hg.python.org/cpython/rev/406c5b69cb1b

    @bitdancer
    Copy link
    Member

    This looks correct to me, and it tested out fine on the test suite (and the provided test failed without the provided fix), so I committed it.

    I have a small concern that the change in output might be a bit radical for a bug fix release, but it does seem to me that this is clearly a bug. If people think it shouldn't go in the bug fix releases let me know and I'll back it out.

    Thanks for the patch, Dan.

    @bitdancer bitdancer added type-bug An unexpected behavior, bug, or error and removed type-feature A feature request or enhancement labels Oct 1, 2011
    @ezio-melotti
    Copy link
    Member

    The patch seems wrong to me:
    >>> d = minidom.parseString('<foo><bar>AAA</bar>BBB<bar>CCC</bar></foo>')
    >>> print(d.toprettyxml())
    <?xml version="1.0" ?>
    <foo>
            <bar>AAA        </bar>
    BBB     <bar>CCC        </bar>
    </foo>

    Even if the newlines are gone, the indentation before the closing tag is preserved. Also a newline is added before the text node BBB.

    It would be good to check what the XML standard says about the whitespace. I'm pretty sure HTML has well defined rules about it, but I don't know if that's the same for XML.

    FWIW the link in msg102247 contains a different fix (not sure if it's any better), and also a link to an article about XML and whitespace: http://www.oracle.com/technetwork/articles/wang-whitespace-092897.html (the link seems broken in the page).

    @ezio-melotti ezio-melotti reopened this Oct 2, 2011
    @dankenigsberg
    Copy link
    Mannequin

    dankenigsberg mannequin commented Oct 2, 2011

    Oh dear.

    Thanks, Enzio, for pointing out that former patch is wrong. It is also quite naive, since the whole NATURE of toprettyprint() is to add whitespace to Text nodes. Tomas Lee's http://bugs.python.org/file11832/minidom-toprettyxml-01.patch made an effort to touch only "simple" Text nodes, that are confined within a single <elem></elem>.

    I did not expect http://bugs.python.org/file23286/minidom-Text-toprettyxml.patch to get in so quickly, after the former one spent several years on queue. However now is time to fix it, possible by my second patch.

    @dankenigsberg
    Copy link
    Mannequin

    dankenigsberg mannequin commented Oct 2, 2011

    btw, http://www.w3.org/TR/xml/#sec-white-space is a bit vague on how should a parser deal with whitespace, and seems to allow non-preservation of text nodes. Preserving "simple" text nodes is allowed, too, and is more polite to applications reading the prettyxml.

    @bitdancer
    Copy link
    Member

    Heh, you happened to post your patch at a time when I wanted something to do as a break from something I didn't want to do...and I *thought* I understood the problem, after reading the various links. But clearly I didn't. We don't have someone who has stepped forward to be xml maintainer, so I just went ahead and committed it.

    I should find time to look at your new patch some time today, or perhaps Ezio will have time. (Clearly minidom doesn't have enough tests.)

    @techtonik
    Copy link
    Mannequin

    techtonik mannequin commented Nov 15, 2011

    I would revert this patch (leaving several test cases though) until a proper fix is available.

    @bitdancer
    Copy link
    Member

    Yeah, I just haven't found time to do the revert yet (my first naive attempt using hg commands failed and I haven't found time to figure it out or do the reverse-patch method).

    @ezio-melotti
    Copy link
    Member

    Here is a new patch based on Dan's last patch.
    Correct me if I'm wrong, but it seems to me that it's not possible for a node to have only text-nodes as children and yet have more than one child; i.e. you can't have two or more adjacent text nodes, because they would be considered as a single text node. I therefore changed the check with all() to check if there's only a child and if it's a text node.
    I also added a test that checks where the \n and \t are added, because testing only that the DOM is preserved is not enough.

    @dankenigsberg
    Copy link
    Mannequin

    dankenigsberg mannequin commented Nov 15, 2011

    Technically, adjacent Text nodes are not illegal, but preserving this oddity in pretty-print is impossible. It is perfectly fine to pretty-print only the simple case of len()==1.

    @ezio-melotti
    Copy link
    Member

    I did some tests, creating an element ('elem') that contains two adjacent text nodes ('text'). With my latest patch the prettyprint is:
    <?xml version="1.0" ?>
    <elem>
    text
    text
    </elem>

    Here both the text nodes are printed on a newline and they are indented.

    With your patch it should be:
    <?xml version="1.0" ?>
    <elem>texttext</elem>

    I'm not sure there's any reason to prefer the second option though.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 18, 2011

    New changeset 7262f8f276ff by Ezio Melotti in branch '2.7':
    bpo-4147: minidom's toprettyxml no longer adds whitespace around a text node when it is the only child of an element. Initial patch by Dan Kenigsberg.
    http://hg.python.org/cpython/rev/7262f8f276ff

    New changeset 5401daa96a21 by Ezio Melotti in branch '3.2':
    bpo-4147: minidom's toprettyxml no longer adds whitespace around a text node when it is the only child of an element. Initial patch by Dan Kenigsberg.
    http://hg.python.org/cpython/rev/5401daa96a21

    New changeset cb6614e3438b by Ezio Melotti in branch 'default':
    bpo-4147: merge with 3.2.
    http://hg.python.org/cpython/rev/cb6614e3438b

    @ezio-melotti
    Copy link
    Member

    I committed my patch with a few more tests. This should be fixed now.
    Thanks for the report and the patch!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    release-blocker stdlib Python modules in the Lib dir topic-XML type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants