classification
Title: xml.dom.minidom wrong indentation writing for CDATA section
Type: enhancement Stage: resolved
Components: XML Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: eli.bendersky, scoder, serhiy.storchaka, vsurjaninov
Priority: normal Keywords: patch

Created on 2019-03-23 15:38 by vsurjaninov, last changed 2019-03-27 12:08 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL       Status  Linked
PR 12514  merged  vsurjaninov, 2019-03-23 16:00
PR 12578  closed  miss-islington, 2019-03-27 06:19
Messages (5)
msg338681 - Author: Vladimir Surjaninov (vsurjaninov) * Date: 2019-03-23 15:38
When a document containing a CDATA section is written with non-empty indentation and newline parameters, the section's parent node receives extra indentation, which is parsed back as text.

Example:
>>> from xml.dom import minidom
>>> doc = minidom.Document()
>>> root = doc.createElement('root')
>>> doc.appendChild(root)
>>> node = doc.createElement('node')
>>> root.appendChild(node)
>>> data = doc.createCDATASection('</data>')
>>> node.appendChild(data)
>>> print(doc.toprettyxml(indent=' ' * 4))
<?xml version="1.0" ?>
<root>
    <node>
<![CDATA[</data>]]>    </node>
</root>

If we parse this output, we do not get the CDATA value back correctly.

The following code returns a string that contains only indentation characters (xml_text is the output shown above):
>>> doc = minidom.parseString(xml_text)
>>> doc.getElementsByTagName('node')[0].firstChild.nodeValue

This returns a string with the CDATA value plus the indentation characters:
>>> doc.getElementsByTagName('node')[0].firstChild.wholeText
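
The steps in this report can be combined into one self-contained script, sketched here for illustration only. The helper names build_doc and inspect_roundtrip are hypothetical and not part of minidom. The script lists the children of <node> after a round trip, so any whitespace-only text nodes introduced by pretty-printing become visible. Which children appear depends on whether the interpreter includes the fix.

```python
from xml.dom import minidom


def build_doc():
    """Build <root><node><![CDATA[</data>]]></node></root> as in the report."""
    doc = minidom.Document()
    root = doc.createElement('root')
    doc.appendChild(root)
    node = doc.createElement('node')
    root.appendChild(node)
    node.appendChild(doc.createCDATASection('</data>'))
    return doc


def inspect_roundtrip(indent=' ' * 4):
    """Pretty-print the document, parse it back, and describe <node>'s children."""
    xml_text = build_doc().toprettyxml(indent=indent)
    parsed = minidom.parseString(xml_text)
    node = parsed.getElementsByTagName('node')[0]
    # Each entry is (nodeType, data). Whitespace-only text nodes show up here
    # on interpreters that still indent around the CDATA section.
    children = [(child.nodeType, child.data) for child in node.childNodes]
    return xml_text, children, node.firstChild.wholeText


if __name__ == '__main__':
    xml_text, children, whole = inspect_roundtrip()
    print(xml_text)
    print(children)
    print(repr(whole))
```

If the list of children contains anything besides a single CDATA section node, the pretty-printer has changed the character content of <node>.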


But there is a workaround:
>>> data.nodeType = data.TEXT_NODE
…
>>> print(doc.toprettyxml(indent=' ' * 4))
<?xml version="1.0" ?>
<root>
    <node><![CDATA[</data>]]></node>
</root>

It will be parsed correctly:
>>> doc.getElementsByTagName('node')[0].firstChild.nodeValue
'</data>'

But I think it would be better to fix the writing function so that this becomes the default behavior.
msg338701 - Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-03-23 21:33
Yes, this case is incorrect. Pretty printing should not change character content inside of a simple tag.

The PR looks good to me.
msg338936 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-27 05:59
New changeset 384b81d923addd52125e94470b11d2574ca266a9 by Serhiy Storchaka (Vladimir Surjaninov) in branch 'master':
bpo-36407: Fix writing indentations of CDATA section (xml.dom.minidom). (GH-12514)
https://github.com/python/cpython/commit/384b81d923addd52125e94470b11d2574ca266a9
msg338939 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-27 06:19
Should we backport this change? I am not sure.
msg338943 - Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-03-27 07:04
I don't think this should be backported. Pretty-printing is not a production-relevant feature; it is more of a "debugging, diffing and help users see what they get" kind of feature. It's good to have it fixed for the future, but we shouldn't bother users with it during a point release.
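
For output that is meant to be parsed again rather than read, one option is to skip pretty-printing and serialize with toxml(), which inserts no indentation or newlines. The following is a minimal sketch under that assumption; the helper name cdata_roundtrip_compact is hypothetical and not part of minidom.

```python
from xml.dom import minidom


def cdata_roundtrip_compact(value):
    """Serialize with toxml() (no pretty-printing) and read the CDATA value back."""
    doc = minidom.Document()
    root = doc.createElement('root')
    doc.appendChild(root)
    node = doc.createElement('node')
    root.appendChild(node)
    node.appendChild(doc.createCDATASection(value))

    xml_text = doc.toxml()  # no indent or newline is inserted
    parsed = minidom.parseString(xml_text)
    first = parsed.getElementsByTagName('node')[0].firstChild
    return xml_text, first.nodeType, first.nodeValue


if __name__ == '__main__':
    xml_text, node_type, value = cdata_roundtrip_compact('</data>')
    print(xml_text)
    print(node_type == minidom.Node.CDATA_SECTION_NODE, repr(value))
```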
History
Date                 User              Action  Args
2019-03-27 12:08:27  serhiy.storchaka  set     status: open -> closed
                                               resolution: fixed
                                               stage: patch review -> resolved
2019-03-27 07:04:43  scoder            set     messages: + msg338943
2019-03-27 06:19:42  serhiy.storchaka  set     messages: + msg338939
2019-03-27 06:19:22  miss-islington    set     pull_requests: + pull_request12522
2019-03-27 05:59:02  serhiy.storchaka  set     messages: + msg338936
2019-03-23 21:33:28  scoder            set     messages: + msg338701
                                               versions: + Python 3.8
2019-03-23 16:00:14  vsurjaninov       set     keywords: + patch
                                               stage: patch review
                                               pull_requests: + pull_request12465
2019-03-23 15:40:39  xtreak            set     nosy: + scoder, eli.bendersky, serhiy.storchaka
2019-03-23 15:38:49  vsurjaninov       create