classification
Title: xml.dom.minidom does not escape CR, LF and TAB characters within attribute values
Type: behavior Stage: patch review
Components: XML Versions: Python 3.1, Python 3.2, Python 2.7, Python 2.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Jens.Grivolla, Tomalak, ajaksu2, amaury.forgeotdarc, devon, effbot, ezio.melotti, gauges, labrat, moriyoshi, piro, sayap
Priority: normal Keywords: easy, patch

Created on 2009-04-14 11:13 by Tomalak, last changed 2014-02-03 18:37 by BreamoreBoy.

Files
File name Uploaded Description Edit
minidom_test.py Tomalak, 2009-05-08 09:53 A small test script that outlines the problem
test_toxml.py sechi_francesco, 2009-05-11 21:27 New test script to demonstrate that the patch proposed works
test_multiline_roundtrip.diff ajaksu2, 2009-05-12 03:29 Tomalak's minidom_test.py as unittest, patch for test_minidom review
minidom.patch Tomalak, 2009-05-13 12:11 Proposed patch to minidom.py
minidom.diff BreamoreBoy, 2010-07-24 17:21 Old file before new file in patch. review
minidom.py labrat, 2010-08-02 16:47 monkey patch module applying minidom.diff
Messages (33)
msg85964 - (view) Author: (Tomalak) Date: 2009-04-14 11:13
Current behavior upon toxml() is:

<foo attribute="multiline
value" />

Upon reading the document again, the new line is normalized and
collapsed into a space (according to the XML spec, section 3.3.3), which
means that it is lost.

Better behavior would be something like this (within attribute values only):

<foo attribute="multiline&#x0D;&#x0A;value" />
msg87178 - (view) Author: Francesco Sechi (sechi_francesco) Date: 2009-05-04 21:13
Ok, I've tried to solve this problem, but I think that the keyword
'easy' is not suitable for this kind of task, because it is necessary to
modify the expat module that is very complex.
msg87332 - (view) Author: (Tomalak) Date: 2009-05-06 13:42
@Francesco Sechi: Would it not just require a minimal change to the
_write_data() method? Something along the lines of (sorry, no Python
expert, maybe I am way off):

def _write_data(writer, data, is_attrib=False):
    "Writes datachars to writer."
    if is_attrib: 
        data = data.replace("\r", "&#13;").replace("\n", "&#10;")
    data = data.replace("&", "&amp;").replace("<", "&lt;")
    data = data.replace("\"", "&quot;").replace(">", "&gt;")
    writer.write(data)

and in Element.writexml():

    #[...]
    for a_name in a_names:
        writer.write(" %s=\"" % a_name)
        _write_data(writer, attrs[a_name].value, True)
    #[...]
msg87333 - (view) Author: (Tomalak) Date: 2009-05-06 13:45
Of course it should be:

def _write_data(writer, data, is_attrib=False):
    "Writes datachars to writer."
    data = data.replace("&", "&amp;").replace("<", "&lt;")
    data = data.replace("\"", "&quot;").replace(">", "&gt;")
    if is_attrib: 
        data = data.replace("\r", "&#13;").replace("\n", "&#10;")
    writer.write(data)
msg87351 - (view) Author: Francesco Sechi (sechi_francesco) Date: 2009-05-06 21:47
Don't worry, I'm a newer too.
No, your solution does not work, because the method you refer
(_write_data) is called by the toxml() function, but the newline is
replaced with a whitespace by the parsestring() function. The
parsestring function, as I already said, refers to the 'expat' module,
that is very complex (for me).
msg87429 - (view) Author: (Tomalak) Date: 2009-05-08 08:22
Hmm... I thought toxml() is the part that needs to be fixed, not the
parsing/reading. I mentioned the reading only to outline the data loss
that occurs eventually.

My point is: The toxml() (i.e. _write_data) *actually writes* the
newline to the output. And within parameters, it just shouldn't.
msg87431 - (view) Author: (Tomalak) Date: 2009-05-08 09:33
Attaching a patch that fixes the problem.
msg87432 - (view) Author: (Tomalak) Date: 2009-05-08 09:38
Attaching a test file that outlines the problem. Output on my system
(Windows / Python 3.0) is:

Without the patch:
C:\Python30>python.exe c:\minidom_test.py
False
1 -->"multiline
value"
2 -->"multiline value"

With the patch:
C:\Python30>python.exe c:\minidom_test.py
True
1 -->"multiline
value"
2 -->"multiline
value"
msg87526 - (view) Author: Francesco Sechi (sechi_francesco) Date: 2009-05-10 15:22
I think that the problem is: the xmldoc1 has the newline or not? If it
hasn't your patch works only in the particular case you pass a toxml
return value to 'parsestring'. If I pass an XML string with newlines it
doesn't work. So your solution is not generic and cannot be considered a
patch for the issue you proposed.
msg87527 - (view) Author: Francesco Sechi (sechi_francesco) Date: 2009-05-10 16:24
I try to explain better what is my opinion:
- If you add the attribute by using setAttribute the newlines are kept
and the toxml works well
- If you add the attribute by using the parsestring, passing it an XML
string the newlines are replaced

- Your patch works only if you act on a well-constructed (i.e.with
newlines kept in internal data structures) xml.dom.minidom.Document
object only, so...
- If you try to execute your patched toxml method of a
xml.dom.minidom.Document constructed using parsestring passing it a
string with newline it does not work.

So your patch works only in a specific case: you are trying to fix a
problem in parsestring, acting on its actual parameter.
msg87528 - (view) Author: (Tomalak) Date: 2009-05-10 16:34
Francesco, I think you are missing the point. :-) The problem has two sides.

If I create an XML document using the DOM (not by parsing it from a
string!), then I can put newline characters into attribute value. This
is allowed and conforms to the XML spec. 

However, *literal* newlines in an attribute value (i.e. when the
document is parsed from a string) have no meaning. The parser treats
them as if they were insignificant whitespace -- they are converted to a
single space. This is also valid and conforms to the XML spec.

The catch: This leads to an actual data loss if I *wanted* to store
newline characters in an attribute -- unless the newline characters are
properly encoded. Encoding the newline characters is also valid and
conforms to the spec, so the DOM implementation should do it. 

In other words - the parsing process you refer to is actually working
fine. If an attribute contains a literal newline, it is indeed okay to
collapse it into a space. It's only the document serializing that is broken.

Minidom is clearly missing functionality here, and it does not conform
to the XML spec. If I store a string of data in an XML document, it must
be ensured that upon reading the document again, I get the *same* data
back. This is what I check with my test script.
msg87564 - (view) Author: Francesco Sechi (sechi_francesco) Date: 2009-05-11 07:31
All right, now I understand, thanks. But I think that, for internal
class coherence, it is necessary not to modify toxml method, but the
'setAttribute' one, because this is the source of the problem.
msg87586 - (view) Author: Francesco Sechi (sechi_francesco) Date: 2009-05-11 21:06
A solution for this issue could be to replace the setAttribute method as
follow:
- d["value"] = d["nodeValue"] = value
+ d["value"] = d["nodeValue"] = value.replace('\n',' ')

NOTE: I didn't do a patch, because I don't know which python version you
are using.

Please try this solution and give me a feedback, thanks.
msg87590 - (view) Author: Francesco Sechi (sechi_francesco) Date: 2009-05-11 21:30
I have uploaded a test script that shows that, without my patch, the
methods setAttribute and parseString work differently; adding my patch,
the behaviour is symmetric.
msg87604 - (view) Author: Daniel Diniz (ajaksu2) Date: 2009-05-12 03:29
Francesco,
Your patch still doesn't allow one to add a multiline attribute values
as Tomalak describes:
"The catch: This leads to an actual data loss if I *wanted* to store
newline characters in an attribute -- unless the newline characters are
properly encoded. Encoding the newline characters is also valid and
conforms to the spec, so the DOM implementation should do it."

I'm not sure whether the proposed behavior is correct or desirable. Even
if it is correct, it might introduce backwards incompatible changes in
behavior.

Here's a test case for trunk.
msg87609 - (view) Author: Francesco Sechi (sechi_francesco) Date: 2009-05-12 08:26
My position is: 
if you want to encode the newline character, this should be done by both
parseString and setAttribute methods. Otherwise, the behaviour is not
symmetric.
My patch translates the newline character with a whitespace in the
setAttribute method, because parseString already does it. If you want to
encode the newline in different manner, you should develop a patch that
introduces this kind of encoding in both parseString and setAttribute
methods.
msg87676 - (view) Author: (Tomalak) Date: 2009-05-13 11:44
Francesco,

> if you want to encode the newline character, 
> this should be done by both parseString and 
> setAttribute methods. Otherwise, the 
> behaviour is not symmetric.

I believe you still don't see the issue. The behaviour is not symmetric
*now*. You store a '\n' in an attribute value with setAttribute(), save
the document to XML, load it again and out comes a space where the '\n'
should have been.

The point is that parseString() behaves correctly, but serializing does
not. There is only one side to fix, because only one side is broken.

> If you want to encode the newline in different 
> manner, you should develop a patch that
> introduces this kind of encoding in both 
> parseString and setAttribute methods.

It would be pointless to do the encoding in setAttribute(). The valid
ways to XML-encode a '\n' character are '&#xA', '&#x0A' or '&#10'. Doing
so in setAttribute() would produce doubly encoded output, like this:
'&amp;#10'. This is even more wrong.

However, if parseString() encounters a '&#10' in the input, it correctly
translates this to '\n' in the DOM. As I said, there is nothing to fix
in parsing, this exercise is about getting minidom to actually *output*
a '&#10;' where appropriate. :-)
msg87679 - (view) Author: (Tomalak) Date: 2009-05-13 11:57
Daniel Diniz: 

The proposed behaviour is correct:
http://www.w3.org/TR/2000/WD-xml-c14n-20000119.html#charescaping

"In attribute values, the character information items 
TAB (#x9), newline (#xA), and carriage-return (#xD) 
are represented by "&#x9;", "&#xA;", and "&#xD;" respectively."

Since the behaviour is correct, it is also desirable. :-)

I don't think that this change could cause existing solution to break
since the current inconsistency in handling these characters make it
impossible to rely on this anyway.

Thanks for putting up the unit test diff.
msg87681 - (view) Author: (Tomalak) Date: 2009-05-13 12:16
I changed the patch to include support for TAB characters, which were
also left unencoded before.

Also I switched encoding from '&#13;' etc. to '&#xD;'. This is
equivalent, but the spec uses the latter variant.
msg90548 - (view) Author: (devon) Date: 2009-07-15 22:25
see also a similar issue in etree: #6492
msg90676 - (view) Author: (Tomalak) Date: 2009-07-18 13:50
@devon: Thanks for pointing & linking back here.
msg110999 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-21 03:31
Patched test_minidom and ran it test failed.  Went to patch minidom.py and it appears up to date, so no idea why the test failed, can someone please take a look as it's 04:30 BST, thanks.
msg111493 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-24 17:21
minidom.patch had the new file listed before the old, so I've uploaded minidom.diff.  The patch is tiny and looks clean.  Tests have been repeated on Windows against 2.7 and are fine, so I believe this can be committed.
msg111642 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-07-26 16:49
#7139 has been closed as duplicate of this and contains a few messages that might be worth reading.
msg112497 - (view) Author: W. Trevor King (labrat) Date: 2010-08-02 16:47
As a workaround until the patch gets included, you can import this monkey patch module.
msg112570 - (view) Author: W. Trevor King (labrat) Date: 2010-08-03 10:32
And while we're at it, we should also

   .replace('&', '&amp;').replace('"', "&quot;").replace('<', '&lt;')

which would have to go at the beginning to avoid double-escaping the '&'.

We could use xml.sax.saxutils.escape to do all the escaping rather than chaining replaces:

   data = escape(data, {'"':'&quot;', '\r':'&#xD;', '\n':'&#xA;', '\t':'&#x9;'})

which also escapes '>' (not strictly required for attribute values, but shouldn't be harmful either).
msg119728 - (view) Author: Ralph Gauges (gauges) Date: 2010-10-27 19:47
I tried to apply the minidom.diff patch below, but it seems that removing the two lines that replace the "<" and ">" characters is not a good idea.
It least in some files I now suddenly get "<" and ">" characters in text elements where there should be &gt; and &lt;

At least the part with the tabs seems to work now and if I add the two lines with the replace calls that got deleted by the patch, everything seems fine.
msg142964 - (view) Author: Yap Sok Ann (sayap) Date: 2011-08-25 11:32
Just want to mention that until the patch get included, it will be impossible to use the standard library to generate a working BCP (Bulk Copy Program) XML format file for SQL Server, which always requires a TERMINATOR="\r\n" or TERMINATOR="&#13;&#10;" attribute.
msg185398 - (view) Author: Daniele Varrazzo (piro) Date: 2013-03-28 01:39
Just got bitten by this bug, which affects xml.etree.ElementTree and cElementTree too. Any chance to have it fixed?

Note that lxml.etree is not affected by the bug and can be used as a replacement for the stdlib module if you happen to have some work to do.
msg185408 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2013-03-28 09:47
minidom may be broken, but what's the issue with ElementTree?

>>> import xml.etree.cElementTree as etree
>>> doc = etree.fromstring('<xml />')
>>> doc.set('attr', "multiline\nvalue")
>>> etree.tostring(doc)
'<xml attr="multiline&#10;value" />'
msg185419 - (view) Author: Daniele Varrazzo (piro) Date: 2013-03-28 10:17
ElementTree issue is with tabs:

    In [1]: import xml.etree.cElementTree as etree

    In [2]: doc = etree.fromstring('<xml />')

    In [4]: doc.set('attr', "here\tthere")

    In [5]: etree.tostring(doc)
    Out[5]: '<xml attr="here\tthere" />'

    In [6]: etree.fromstring(_5).attrib['attr']
    Out[6]: 'here there'

bye bye tab.
msg185423 - (view) Author: Daniele Varrazzo (piro) Date: 2013-03-28 10:22
I was going to open an issue on itself about etree and tabs, but I've found this and thought it would have been marked as duplicate. If you don't think it's the case I will open it anyway.

I've added my comment because ElementTree is not reported in this page at all: I've been googling forever about "ElementTree tabs attributes" without result. I've found this (and the network of duplicates) only searching for less generic "xml tabs" into the python bug tracker, when I was already filing the bug.
msg185576 - (view) Author: Daniele Varrazzo (piro) Date: 2013-03-30 16:29
Added separate issue #17582 as ElementTree implementation doesn't depend on whatever causes the bug in xml.dom.minidom.
History
Date User Action Args
2014-02-03 18:37:26BreamoreBoysetnosy: - BreamoreBoy
2013-03-30 16:29:15pirosetmessages: + msg185576
2013-03-28 10:22:17pirosetmessages: + msg185423
2013-03-28 10:17:03pirosetmessages: + msg185419
2013-03-28 09:47:39amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg185408
2013-03-28 01:39:01pirosetnosy: + piro
messages: + msg185398
2011-08-25 11:32:37sayapsetnosy: + sayap
messages: + msg142964
2011-05-10 09:09:23Jens.Grivollasetnosy: + Jens.Grivolla
2010-10-27 19:47:41gaugessetnosy: + gauges
messages: + msg119728
2010-08-03 10:32:03labratsetmessages: + msg112570
2010-08-02 16:47:22labratsetfiles: + minidom.py
nosy: + labrat
messages: + msg112497

2010-07-26 16:49:38ezio.melottisetnosy: + effbot, ezio.melotti, moriyoshi
messages: + msg111642
2010-07-24 17:21:21BreamoreBoysetfiles: + minidom.diff

messages: + msg111493
2010-07-21 03:31:09BreamoreBoysetnosy: + BreamoreBoy

messages: + msg110999
versions: + Python 2.7, Python 3.2
2009-07-18 13:50:00Tomalaksetmessages: + msg90676
2009-07-15 22:25:14devonsetnosy: + devon
messages: + msg90548
2009-05-13 15:06:29sechi_francescosetnosy: - sechi_francesco
2009-05-13 12:17:35Tomalaksettitle: xml.dom.minidom does not escape newline characters within attribute values -> xml.dom.minidom does not escape CR, LF and TAB characters within attribute values
2009-05-13 12:16:26Tomalaksetmessages: + msg87681
2009-05-13 12:11:58Tomalaksetfiles: + minidom.patch
2009-05-13 12:11:16Tomalaksetfiles: - minidom.patch
2009-05-13 11:57:19Tomalaksetmessages: + msg87679
2009-05-13 11:44:34Tomalaksetmessages: + msg87676
2009-05-12 08:26:34sechi_francescosetmessages: + msg87609
2009-05-12 03:30:07ajaksu2setpriority: normal
stage: test needed -> patch review
2009-05-12 03:29:29ajaksu2setfiles: + test_multiline_roundtrip.diff
nosy: + ajaksu2
messages: + msg87604

2009-05-11 21:30:32sechi_francescosetmessages: + msg87590
2009-05-11 21:27:13sechi_francescosetfiles: + test_toxml.py
2009-05-11 21:25:57sechi_francescosetfiles: - test_toxml.py
2009-05-11 21:06:09sechi_francescosetmessages: + msg87586
2009-05-11 07:31:17sechi_francescosetmessages: + msg87564
2009-05-10 17:26:20Tomalaksettitle: xml.dom.minidom does not handle newline characters in attribute values -> xml.dom.minidom does not escape newline characters within attribute values
2009-05-10 16:34:31Tomalaksetmessages: + msg87528
2009-05-10 16:24:21sechi_francescosetmessages: + msg87527
2009-05-10 15:22:35sechi_francescosetmessages: + msg87526
2009-05-08 09:53:34Tomalaksetfiles: + minidom_test.py
2009-05-08 09:52:50Tomalaksetfiles: - toxml_test.py
2009-05-08 09:38:10Tomalaksetfiles: + toxml_test.py

messages: + msg87432
2009-05-08 09:33:29Tomalaksetfiles: + minidom.patch
keywords: + patch
messages: + msg87431
2009-05-08 08:22:28Tomalaksetmessages: + msg87429
2009-05-06 21:47:31sechi_francescosetmessages: + msg87351
2009-05-06 13:45:56Tomalaksetmessages: + msg87333
2009-05-06 13:42:32Tomalaksetmessages: + msg87332
2009-05-04 21:13:31sechi_francescosetmessages: + msg87178
2009-05-02 15:50:58sechi_francescosetfiles: + test_toxml.py
2009-05-02 15:12:06sechi_francescosetnosy: + sechi_francesco
2009-04-22 05:07:21ajaksu2setkeywords: + easy
stage: test needed
versions: - Python 2.5, Python 2.4, Python 3.0, Python 2.7
2009-04-14 11:13:34Tomalaksettype: behavior
2009-04-14 11:13:06Tomalakcreate