xml.dom.minidom does not escape CR, LF and TAB characters within attribute values #50002

Tomalak · 2009-04-14T11:13:06Z

BPO	5752
Nosy	@amauryfa, @devdanzin, @dvarrazzo, @ezio-melotti, @wking, @mitar, @moriyoshi
Files	minidom_test.py: A small test script that outlines the problem test_toxml.py: New test script to demonstrate that the patch proposed works test_multiline_roundtrip.diff: Tomalak's minidom_test.py as unittest, patch for test_minidom minidom.patch: Proposed patch to minidom.py minidom.diff: Old file before new file in patch. minidom.py: monkey patch module applying minidom.diff

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2009-04-14.11:13:06.464>
labels = ['expert-XML', 'easy', 'type-bug']
title = 'xml.dom.minidom does not escape CR, LF and TAB characters within attribute values'
updated_at = <Date 2019-06-22.20:28:40.662>
user = 'https://bugs.python.org/Tomalak'

bugs.python.org fields:

activity = <Date 2019-06-22.20:28:40.662>
actor = 'mitar'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['XML']
creation = <Date 2009-04-14.11:13:06.464>
creator = 'Tomalak'
dependencies = []
files = ['13921', '13960', '13966', '13977', '18184', '18325']
hgrepos = []
issue_num = 5752
keywords = ['patch', 'easy']
message_count = 33.0
messages = ['85964', '87178', '87332', '87333', '87351', '87429', '87431', '87432', '87526', '87527', '87528', '87564', '87586', '87590', '87604', '87609', '87676', '87679', '87681', '90548', '90676', '110999', '111493', '111642', '112497', '112570', '119728', '142964', '185398', '185408', '185419', '185423', '185576']
nosy_count = 13.0
nosy_names = ['effbot', 'amaury.forgeotdarc', 'ajaksu2', 'piro', 'ezio.melotti', 'sayap', 'Tomalak', 'devon', 'labrat', 'mitar', 'moriyoshi', 'gauges', 'Jens.Grivolla']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue5752'
versions = ['Python 2.6', 'Python 3.1', 'Python 2.7', 'Python 3.2']

Linked PRs

gh-50002: xml.dom.minidom now preserves whitespaces in attributes #107947

Tomalak · 2009-04-14T11:13:06Z

Current behavior upon toxml() is:

Upon reading the document again, the new line is normalized and
collapsed into a space (according to the XML spec, section 3.3.3), which
means that it is lost.

Better behavior would be something like this (within attribute values only):

sechifrancesco · 2009-05-04T21:13:32Z

Ok, I've tried to solve this problem, but I think that the keyword
'easy' is not suitable for this kind of task, because it is necessary to
modify the expat module that is very complex.

Tomalak · 2009-05-06T13:42:32Z

@francesco Sechi: Would it not just require a minimal change to the
_write_data() method? Something along the lines of (sorry, no Python
expert, maybe I am way off):

def _write_data(writer, data, is_attrib=False):
    "Writes datachars to writer."
    if is_attrib: 
        data = data.replace("\r", "&#13;").replace("\n", "&#10;")
    data = data.replace("&", "&amp;").replace("<", "&lt;")
    data = data.replace("\"", "&quot;").replace(">", "&gt;")
    writer.write(data)

and in Element.writexml():

    #[...]
    for a_name in a_names:
        writer.write(" %s=\"" % a_name)
        _write_data(writer, attrs[a_name].value, True)
    #[...]

Tomalak · 2009-05-06T13:45:56Z

Of course it should be:

def _write_data(writer, data, is_attrib=False):
    "Writes datachars to writer."
    data = data.replace("&", "&amp;").replace("<", "&lt;")
    data = data.replace("\"", "&quot;").replace(">", "&gt;")
    if is_attrib: 
        data = data.replace("\r", "&#13;").replace("\n", "&#10;")
    writer.write(data)

sechifrancesco · 2009-05-06T21:47:30Z

Don't worry, I'm a newer too.
No, your solution does not work, because the method you refer
(_write_data) is called by the toxml() function, but the newline is
replaced with a whitespace by the parsestring() function. The
parsestring function, as I already said, refers to the 'expat' module,
that is very complex (for me).

Tomalak · 2009-05-08T08:22:28Z

Hmm... I thought toxml() is the part that needs to be fixed, not the
parsing/reading. I mentioned the reading only to outline the data loss
that occurs eventually.

My point is: The toxml() (i.e. _write_data) *actually writes* the
newline to the output. And within parameters, it just shouldn't.

Tomalak · 2009-05-08T09:33:28Z

Attaching a patch that fixes the problem.

Tomalak · 2009-05-08T09:38:10Z

Attaching a test file that outlines the problem. Output on my system
(Windows / Python 3.0) is:

Without the patch:
C:\Python30>python.exe c:\minidom_test.py
False
1 -->"multiline
value"
2 -->"multiline value"

With the patch:
C:\Python30>python.exe c:\minidom_test.py
True
1 -->"multiline
value"
2 -->"multiline
value"

sechifrancesco · 2009-05-10T15:22:35Z

I think that the problem is: the xmldoc1 has the newline or not? If it
hasn't your patch works only in the particular case you pass a toxml
return value to 'parsestring'. If I pass an XML string with newlines it
doesn't work. So your solution is not generic and cannot be considered a
patch for the issue you proposed.

sechifrancesco · 2009-05-10T16:24:21Z

I try to explain better what is my opinion:

If you add the attribute by using setAttribute the newlines are kept
and the toxml works well
If you add the attribute by using the parsestring, passing it an XML
string the newlines are replaced
Your patch works only if you act on a well-constructed (i.e.with
newlines kept in internal data structures) xml.dom.minidom.Document
object only, so...
If you try to execute your patched toxml method of a
xml.dom.minidom.Document constructed using parsestring passing it a
string with newline it does not work.

So your patch works only in a specific case: you are trying to fix a
problem in parsestring, acting on its actual parameter.

Tomalak · 2009-05-10T16:34:30Z

Francesco, I think you are missing the point. :-) The problem has two sides.

If I create an XML document using the DOM (not by parsing it from a
string!), then I can put newline characters into attribute value. This
is allowed and conforms to the XML spec.

However, *literal* newlines in an attribute value (i.e. when the
document is parsed from a string) have no meaning. The parser treats
them as if they were insignificant whitespace -- they are converted to a
single space. This is also valid and conforms to the XML spec.

The catch: This leads to an actual data loss if I *wanted* to store
newline characters in an attribute -- unless the newline characters are
properly encoded. Encoding the newline characters is also valid and
conforms to the spec, so the DOM implementation should do it.

In other words - the parsing process you refer to is actually working
fine. If an attribute contains a literal newline, it is indeed okay to
collapse it into a space. It's only the document serializing that is broken.

Minidom is clearly missing functionality here, and it does not conform
to the XML spec. If I store a string of data in an XML document, it must
be ensured that upon reading the document again, I get the *same* data
back. This is what I check with my test script.

sechifrancesco · 2009-05-11T07:31:15Z

All right, now I understand, thanks. But I think that, for internal
class coherence, it is necessary not to modify toxml method, but the
'setAttribute' one, because this is the source of the problem.

sechifrancesco · 2009-05-11T21:06:09Z

A solution for this issue could be to replace the setAttribute method as
follow:

d["value"] = d["nodeValue"] = value
+ d["value"] = d["nodeValue"] = value.replace('\n',' ')

NOTE: I didn't do a patch, because I don't know which python version you
are using.

Please try this solution and give me a feedback, thanks.

sechifrancesco · 2009-05-11T21:30:32Z

I have uploaded a test script that shows that, without my patch, the
methods setAttribute and parseString work differently; adding my patch,
the behaviour is symmetric.

devdanzin · 2009-05-12T03:29:28Z

Francesco,
Your patch still doesn't allow one to add a multiline attribute values
as Tomalak describes:
"The catch: This leads to an actual data loss if I *wanted* to store
newline characters in an attribute -- unless the newline characters are
properly encoded. Encoding the newline characters is also valid and
conforms to the spec, so the DOM implementation should do it."

I'm not sure whether the proposed behavior is correct or desirable. Even
if it is correct, it might introduce backwards incompatible changes in
behavior.

Here's a test case for trunk.

sechifrancesco · 2009-05-12T08:26:33Z

My position is:
if you want to encode the newline character, this should be done by both
parseString and setAttribute methods. Otherwise, the behaviour is not
symmetric.
My patch translates the newline character with a whitespace in the
setAttribute method, because parseString already does it. If you want to
encode the newline in different manner, you should develop a patch that
introduces this kind of encoding in both parseString and setAttribute
methods.

Tomalak · 2009-05-13T11:44:34Z

Francesco,

if you want to encode the newline character,
this should be done by both parseString and
setAttribute methods. Otherwise, the
behaviour is not symmetric.

I believe you still don't see the issue. The behaviour is not symmetric
*now*. You store a '\n' in an attribute value with setAttribute(), save
the document to XML, load it again and out comes a space where the '\n'
should have been.

The point is that parseString() behaves correctly, but serializing does
not. There is only one side to fix, because only one side is broken.

If you want to encode the newline in different
manner, you should develop a patch that
introduces this kind of encoding in both
parseString and setAttribute methods.

It would be pointless to do the encoding in setAttribute(). The valid
ways to XML-encode a '\n' character are '&#xA', '&#x0A' or '&#10'. Doing
so in setAttribute() would produce doubly encoded output, like this:
'&#10'. This is even more wrong.

However, if parseString() encounters a '&#10' in the input, it correctly
translates this to '\n' in the DOM. As I said, there is nothing to fix
in parsing, this exercise is about getting minidom to actually *output*
a ' ' where appropriate. :-)

Tomalak · 2009-05-13T11:57:19Z

Daniel Diniz:

The proposed behaviour is correct:
http://www.w3.org/TR/2000/WD-xml-c14n-20000119.html#charescaping

"In attribute values, the character information items
TAB (#x9), newline (#xA), and carriage-return (#xD)
are represented by " ", " ", and " " respectively."

Since the behaviour is correct, it is also desirable. :-)

I don't think that this change could cause existing solution to break
since the current inconsistency in handling these characters make it
impossible to rely on this anyway.

Thanks for putting up the unit test diff.

Tomalak · 2009-05-13T12:16:26Z

I changed the patch to include support for TAB characters, which were
also left unencoded before.

Also I switched encoding from ' ' etc. to ' '. This is
equivalent, but the spec uses the latter variant.

devon · 2009-07-15T22:25:14Z

see also a similar issue in etree: bpo-6492

Tomalak · 2009-07-18T13:50:01Z

@devon: Thanks for pointing & linking back here.

BreamoreBoy · 2010-07-21T03:31:09Z

Patched test_minidom and ran it test failed. Went to patch minidom.py and it appears up to date, so no idea why the test failed, can someone please take a look as it's 04:30 BST, thanks.

BreamoreBoy · 2010-07-24T17:21:21Z

minidom.patch had the new file listed before the old, so I've uploaded minidom.diff. The patch is tiny and looks clean. Tests have been repeated on Windows against 2.7 and are fine, so I believe this can be committed.

ezio-melotti · 2010-07-26T16:49:38Z

bpo-7139 has been closed as duplicate of this and contains a few messages that might be worth reading.

wking · 2010-08-02T16:47:21Z

As a workaround until the patch gets included, you can import this monkey patch module.

wking · 2010-08-03T10:32:02Z

And while we're at it, we should also

.replace('&', '&').replace('"', """).replace('<', '<')

which would have to go at the beginning to avoid double-escaping the '&'.

We could use xml.sax.saxutils.escape to do all the escaping rather than chaining replaces:

   data = escape(data, {'"':'&quot;', '\r':'&#xD;', '\n':'&#xA;', '\t':'&#x9;'})

which also escapes '>' (not strictly required for attribute values, but shouldn't be harmful either).

gauges · 2010-10-27T19:47:41Z

I tried to apply the minidom.diff patch below, but it seems that removing the two lines that replace the "<" and ">" characters is not a good idea.
It least in some files I now suddenly get "<" and ">" characters in text elements where there should be > and <

At least the part with the tabs seems to work now and if I add the two lines with the replace calls that got deleted by the patch, everything seems fine.

sayap · 2011-08-25T11:32:37Z

Just want to mention that until the patch get included, it will be impossible to use the standard library to generate a working BCP (Bulk Copy Program) XML format file for SQL Server, which always requires a TERMINATOR="\r\n" or TERMINATOR=" " attribute.

dvarrazzo · 2013-03-28T01:39:01Z

Just got bitten by this bug, which affects xml.etree.ElementTree and cElementTree too. Any chance to have it fixed?

Note that lxml.etree is not affected by the bug and can be used as a replacement for the stdlib module if you happen to have some work to do.

amauryfa · 2013-03-28T09:47:40Z

minidom may be broken, but what's the issue with ElementTree?

>>> import xml.etree.cElementTree as etree
>>> doc = etree.fromstring('<xml />')
>>> doc.set('attr', "multiline\nvalue")
>>> etree.tostring(doc)
'<xml attr="multiline&#10;value" />'

dvarrazzo · 2013-03-28T10:17:04Z

ElementTree issue is with tabs:

In [1]: import xml.etree.cElementTree as etree

In [2]: doc = etree.fromstring('<xml />')

In [4]: doc.set('attr', "here\tthere")

In [5]: etree.tostring(doc)
Out[5]: '<xml attr="here\tthere" />'

In [6]: etree.fromstring(_5).attrib['attr']
Out[6]: 'here there'

bye bye tab.

dvarrazzo · 2013-03-28T10:22:17Z

I was going to open an issue on itself about etree and tabs, but I've found this and thought it would have been marked as duplicate. If you don't think it's the case I will open it anyway.

I've added my comment because ElementTree is not reported in this page at all: I've been googling forever about "ElementTree tabs attributes" without result. I've found this (and the network of duplicates) only searching for less generic "xml tabs" into the python bug tracker, when I was already filing the bug.

dvarrazzo · 2013-03-30T16:29:15Z

Added separate issue bpo-17582 as ElementTree implementation doesn't depend on whatever causes the bug in xml.dom.minidom.

jbaker6953 · 2023-01-02T05:59:04Z

I had to make an updated version of this monkey patch for a project I was working on. I leave it here for anyone's reference. The old instance method overwrite was bombing out with TypeError: function() argument 'code' must be code, not function so I just changed it for my needs. It works well with my data set.

https://gist.github.com/jbaker6953/deb9a8e4eae1e622f467fc9b4edf11db

Also double quotes (") are now only quoted in attributes.

…-107947) Also double quotes (") are now only quoted in attributes.

hugovk · 2023-11-10T15:04:03Z

Looks like this has been fixed by #107947.

Tomalak mannequin added topic-XML type-bug An unexpected behavior, bug, or error labels Apr 14, 2009

devdanzin mannequin added the easy label Apr 22, 2009

Tomalak mannequin changed the title ~~xml.dom.minidom does not handle newline characters in attribute values~~ xml.dom.minidom does not escape newline characters within attribute values May 10, 2009

Tomalak mannequin changed the title ~~xml.dom.minidom does not escape newline characters within attribute values~~ xml.dom.minidom does not escape CR, LF and TAB characters within attribute values May 13, 2009

ezio-melotti transferred this issue from another repository Apr 10, 2022

alexlance mentioned this issue Jan 1, 2023

Many export errors alexlance/signal-message-exporter#11

Open

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Aug 14, 2023

pythongh-50002: xml.dom.minidom now preserves whitespaces in attributes

a14a234

Also double quotes (") are now only quoted in attributes.

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Aug 14, 2023

pythongh-50002: xml.dom.minidom now preserves whitespaces in attributes

4562201

Also double quotes (") are now only quoted in attributes.

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Aug 14, 2023

pythongh-50002: xml.dom.minidom now preserves whitespaces in attributes

dd6c84c

Also double quotes (") are now only quoted in attributes.

bedevere-bot mentioned this issue Aug 14, 2023

gh-50002: xml.dom.minidom now preserves whitespaces in attributes #107947

Merged

serhiy-storchaka added a commit that referenced this issue Aug 23, 2023

gh-50002: xml.dom.minidom now preserves whitespaces in attributes (GH…

154477b

…-107947) Also double quotes (") are now only quoted in attributes.

hugovk closed this as completed Nov 10, 2023

xml.dom.minidom does not escape CR, LF and TAB characters within attribute values #50002

xml.dom.minidom does not escape CR, LF and TAB characters within attribute values #50002

Comments

Tomalak mannequin commented Apr 14, 2009 • edited by bedevere-bot

Linked PRs

Tomalak mannequin commented Apr 14, 2009

sechifrancesco mannequin commented May 4, 2009

Tomalak mannequin commented May 6, 2009

Tomalak mannequin commented May 6, 2009

sechifrancesco mannequin commented May 6, 2009

Tomalak mannequin commented May 8, 2009

Tomalak mannequin commented May 8, 2009

Tomalak mannequin commented May 8, 2009

sechifrancesco mannequin commented May 10, 2009

sechifrancesco mannequin commented May 10, 2009

Tomalak mannequin commented May 10, 2009

sechifrancesco mannequin commented May 11, 2009

sechifrancesco mannequin commented May 11, 2009

sechifrancesco mannequin commented May 11, 2009

devdanzin mannequin commented May 12, 2009

sechifrancesco mannequin commented May 12, 2009

Tomalak mannequin commented May 13, 2009

Tomalak mannequin commented May 13, 2009

Tomalak mannequin commented May 13, 2009

devon mannequin commented Jul 15, 2009

Tomalak mannequin commented Jul 18, 2009

BreamoreBoy mannequin commented Jul 21, 2010

BreamoreBoy mannequin commented Jul 24, 2010

ezio-melotti commented Jul 26, 2010

wking mannequin commented Aug 2, 2010

wking mannequin commented Aug 3, 2010

gauges mannequin commented Oct 27, 2010

sayap mannequin commented Aug 25, 2011

dvarrazzo mannequin commented Mar 28, 2013

amauryfa commented Mar 28, 2013

dvarrazzo mannequin commented Mar 28, 2013

dvarrazzo mannequin commented Mar 28, 2013

dvarrazzo mannequin commented Mar 30, 2013

jbaker6953 commented Jan 2, 2023

hugovk commented Nov 10, 2023

Tomalak mannequin commented Apr 14, 2009 •

edited by bedevere-bot