Issue 5752: xml.dom.minidom does not escape CR, LF and TAB characters within attribute values

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Unsupported provider

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/50002

classification

Title:	xml.dom.minidom does not escape CR, LF and TAB characters within attribute values
Type:	behavior	Stage:	patch review
Components:	XML	Versions:	Python 3.1, Python 3.2, Python 2.7, Python 2.6

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Jens.Grivolla, Tomalak, ajaksu2, amaury.forgeotdarc, devon, effbot, ezio.melotti, gauges, labrat, mitar, moriyoshi, piro, sayap
Priority:	normal	Keywords:	easy, patch

Created on 2009-04-14 11:13 by Tomalak, last changed 2022-04-11 14:56 by admin.

Files
File name	Uploaded	Description	Edit
minidom_test.py	Tomalak, 2009-05-08 09:53	A small test script that outlines the problem
test_toxml.py	sechi_francesco, 2009-05-11 21:27	New test script to demonstrate that the patch proposed works
test_multiline_roundtrip.diff	ajaksu2, 2009-05-12 03:29	Tomalak's minidom_test.py as unittest, patch for test_minidom	review
minidom.patch	Tomalak, 2009-05-13 12:11	Proposed patch to minidom.py
minidom.diff	BreamoreBoy, 2010-07-24 17:21	Old file before new file in patch.	review
minidom.py	labrat, 2010-08-02 16:47	monkey patch module applying minidom.diff

Messages (33)
msg85964 - (view)	Author: (Tomalak)	Date: 2009-04-14 11:13
Current behavior upon toxml() is: <foo attribute="multiline value" /> Upon reading the document again, the new line is normalized and collapsed into a space (according to the XML spec, section 3.3.3), which means that it is lost. Better behavior would be something like this (within attribute values only): <foo attribute="multiline value" />
msg87178 - (view)	Author: Francesco Sechi (sechi_francesco)	Date: 2009-05-04 21:13
Ok, I've tried to solve this problem, but I think that the keyword 'easy' is not suitable for this kind of task, because it is necessary to modify the expat module that is very complex.
msg87332 - (view)	Author: (Tomalak)	Date: 2009-05-06 13:42
@Francesco Sechi: Would it not just require a minimal change to the _write_data() method? Something along the lines of (sorry, no Python expert, maybe I am way off): def _write_data(writer, data, is_attrib=False): "Writes datachars to writer." if is_attrib: data = data.replace("\r", " ").replace("\n", " ") data = data.replace("&", "&").replace("<", "<") data = data.replace("\"", """).replace(">", ">") writer.write(data) and in Element.writexml(): #[...] for a_name in a_names: writer.write(" %s=\"" % a_name) _write_data(writer, attrs[a_name].value, True) #[...]
msg87333 - (view)	Author: (Tomalak)	Date: 2009-05-06 13:45
Of course it should be: def _write_data(writer, data, is_attrib=False): "Writes datachars to writer." data = data.replace("&", "&").replace("<", "<") data = data.replace("\"", """).replace(">", ">") if is_attrib: data = data.replace("\r", " ").replace("\n", " ") writer.write(data)
msg87351 - (view)	Author: Francesco Sechi (sechi_francesco)	Date: 2009-05-06 21:47
Don't worry, I'm a newer too. No, your solution does not work, because the method you refer (_write_data) is called by the toxml() function, but the newline is replaced with a whitespace by the parsestring() function. The parsestring function, as I already said, refers to the 'expat' module, that is very complex (for me).
msg87429 - (view)	Author: (Tomalak)	Date: 2009-05-08 08:22
Hmm... I thought toxml() is the part that needs to be fixed, not the parsing/reading. I mentioned the reading only to outline the data loss that occurs eventually. My point is: The toxml() (i.e. _write_data) actually writes the newline to the output. And within parameters, it just shouldn't.
msg87431 - (view)	Author: (Tomalak)	Date: 2009-05-08 09:33
Attaching a patch that fixes the problem.
msg87432 - (view)	Author: (Tomalak)	Date: 2009-05-08 09:38
Attaching a test file that outlines the problem. Output on my system (Windows / Python 3.0) is: Without the patch: C:\Python30>python.exe c:\minidom_test.py False 1 -->"multiline value" 2 -->"multiline value" With the patch: C:\Python30>python.exe c:\minidom_test.py True 1 -->"multiline value" 2 -->"multiline value"
msg87526 - (view)	Author: Francesco Sechi (sechi_francesco)	Date: 2009-05-10 15:22
I think that the problem is: the xmldoc1 has the newline or not? If it hasn't your patch works only in the particular case you pass a toxml return value to 'parsestring'. If I pass an XML string with newlines it doesn't work. So your solution is not generic and cannot be considered a patch for the issue you proposed.
msg87527 - (view)	Author: Francesco Sechi (sechi_francesco)	Date: 2009-05-10 16:24
I try to explain better what is my opinion: - If you add the attribute by using setAttribute the newlines are kept and the toxml works well - If you add the attribute by using the parsestring, passing it an XML string the newlines are replaced - Your patch works only if you act on a well-constructed (i.e.with newlines kept in internal data structures) xml.dom.minidom.Document object only, so... - If you try to execute your patched toxml method of a xml.dom.minidom.Document constructed using parsestring passing it a string with newline it does not work. So your patch works only in a specific case: you are trying to fix a problem in parsestring, acting on its actual parameter.
msg87528 - (view)	Author: (Tomalak)	Date: 2009-05-10 16:34
Francesco, I think you are missing the point. :-) The problem has two sides. If I create an XML document using the DOM (not by parsing it from a string!), then I can put newline characters into attribute value. This is allowed and conforms to the XML spec. However, literal newlines in an attribute value (i.e. when the document is parsed from a string) have no meaning. The parser treats them as if they were insignificant whitespace -- they are converted to a single space. This is also valid and conforms to the XML spec. The catch: This leads to an actual data loss if I wanted to store newline characters in an attribute -- unless the newline characters are properly encoded. Encoding the newline characters is also valid and conforms to the spec, so the DOM implementation should do it. In other words - the parsing process you refer to is actually working fine. If an attribute contains a literal newline, it is indeed okay to collapse it into a space. It's only the document serializing that is broken. Minidom is clearly missing functionality here, and it does not conform to the XML spec. If I store a string of data in an XML document, it must be ensured that upon reading the document again, I get the same data back. This is what I check with my test script.
msg87564 - (view)	Author: Francesco Sechi (sechi_francesco)	Date: 2009-05-11 07:31
All right, now I understand, thanks. But I think that, for internal class coherence, it is necessary not to modify toxml method, but the 'setAttribute' one, because this is the source of the problem.
msg87586 - (view)	Author: Francesco Sechi (sechi_francesco)	Date: 2009-05-11 21:06
A solution for this issue could be to replace the setAttribute method as follow: - d["value"] = d["nodeValue"] = value + d["value"] = d["nodeValue"] = value.replace('\n',' ') NOTE: I didn't do a patch, because I don't know which python version you are using. Please try this solution and give me a feedback, thanks.
msg87590 - (view)	Author: Francesco Sechi (sechi_francesco)	Date: 2009-05-11 21:30
I have uploaded a test script that shows that, without my patch, the methods setAttribute and parseString work differently; adding my patch, the behaviour is symmetric.
msg87604 - (view)	Author: Daniel Diniz (ajaksu2) *	Date: 2009-05-12 03:29
Francesco, Your patch still doesn't allow one to add a multiline attribute values as Tomalak describes: "The catch: This leads to an actual data loss if I wanted to store newline characters in an attribute -- unless the newline characters are properly encoded. Encoding the newline characters is also valid and conforms to the spec, so the DOM implementation should do it." I'm not sure whether the proposed behavior is correct or desirable. Even if it is correct, it might introduce backwards incompatible changes in behavior. Here's a test case for trunk.
msg87609 - (view)	Author: Francesco Sechi (sechi_francesco)	Date: 2009-05-12 08:26
My position is: if you want to encode the newline character, this should be done by both parseString and setAttribute methods. Otherwise, the behaviour is not symmetric. My patch translates the newline character with a whitespace in the setAttribute method, because parseString already does it. If you want to encode the newline in different manner, you should develop a patch that introduces this kind of encoding in both parseString and setAttribute methods.
msg87676 - (view)	Author: (Tomalak)	Date: 2009-05-13 11:44
Francesco, > if you want to encode the newline character, > this should be done by both parseString and > setAttribute methods. Otherwise, the > behaviour is not symmetric. I believe you still don't see the issue. The behaviour is not symmetric now. You store a '\n' in an attribute value with setAttribute(), save the document to XML, load it again and out comes a space where the '\n' should have been. The point is that parseString() behaves correctly, but serializing does not. There is only one side to fix, because only one side is broken. > If you want to encode the newline in different > manner, you should develop a patch that > introduces this kind of encoding in both > parseString and setAttribute methods. It would be pointless to do the encoding in setAttribute(). The valid ways to XML-encode a '\n' character are '&#xA', '&#x0A' or '&#10'. Doing so in setAttribute() would produce doubly encoded output, like this: '&#10'. This is even more wrong. However, if parseString() encounters a '&#10' in the input, it correctly translates this to '\n' in the DOM. As I said, there is nothing to fix in parsing, this exercise is about getting minidom to actually output a ' ' where appropriate. :-)
msg87679 - (view)	Author: (Tomalak)	Date: 2009-05-13 11:57
Daniel Diniz: The proposed behaviour is correct: http://www.w3.org/TR/2000/WD-xml-c14n-20000119.html#charescaping "In attribute values, the character information items TAB (#x9), newline (#xA), and carriage-return (#xD) are represented by " ", " ", and " " respectively." Since the behaviour is correct, it is also desirable. :-) I don't think that this change could cause existing solution to break since the current inconsistency in handling these characters make it impossible to rely on this anyway. Thanks for putting up the unit test diff.
msg87681 - (view)	Author: (Tomalak)	Date: 2009-05-13 12:16
I changed the patch to include support for TAB characters, which were also left unencoded before. Also I switched encoding from ' ' etc. to ' '. This is equivalent, but the spec uses the latter variant.
msg90548 - (view)	Author: (devon)	Date: 2009-07-15 22:25
see also a similar issue in etree: #6492
msg90676 - (view)	Author: (Tomalak)	Date: 2009-07-18 13:50
@devon: Thanks for pointing & linking back here.
msg110999 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-07-21 03:31
Patched test_minidom and ran it test failed. Went to patch minidom.py and it appears up to date, so no idea why the test failed, can someone please take a look as it's 04:30 BST, thanks.
msg111493 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-07-24 17:21
minidom.patch had the new file listed before the old, so I've uploaded minidom.diff. The patch is tiny and looks clean. Tests have been repeated on Windows against 2.7 and are fine, so I believe this can be committed.
msg111642 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-07-26 16:49
#7139 has been closed as duplicate of this and contains a few messages that might be worth reading.
msg112497 - (view)	Author: W. Trevor King (labrat) *	Date: 2010-08-02 16:47
As a workaround until the patch gets included, you can import this monkey patch module.
msg112570 - (view)	Author: W. Trevor King (labrat) *	Date: 2010-08-03 10:32
And while we're at it, we should also .replace('&', '&').replace('"', """).replace('<', '<') which would have to go at the beginning to avoid double-escaping the '&'. We could use xml.sax.saxutils.escape to do all the escaping rather than chaining replaces: data = escape(data, {'"':'"', '\r':' ', '\n':' ', '\t':' '}) which also escapes '>' (not strictly required for attribute values, but shouldn't be harmful either).
msg119728 - (view)	Author: Ralph Gauges (gauges)	Date: 2010-10-27 19:47
I tried to apply the minidom.diff patch below, but it seems that removing the two lines that replace the "<" and ">" characters is not a good idea. It least in some files I now suddenly get "<" and ">" characters in text elements where there should be > and < At least the part with the tabs seems to work now and if I add the two lines with the replace calls that got deleted by the patch, everything seems fine.
msg142964 - (view)	Author: Yap Sok Ann (sayap)	Date: 2011-08-25 11:32
Just want to mention that until the patch get included, it will be impossible to use the standard library to generate a working BCP (Bulk Copy Program) XML format file for SQL Server, which always requires a TERMINATOR="\r\n" or TERMINATOR=" " attribute.
msg185398 - (view)	Author: Daniele Varrazzo (piro) *	Date: 2013-03-28 01:39
Just got bitten by this bug, which affects xml.etree.ElementTree and cElementTree too. Any chance to have it fixed? Note that lxml.etree is not affected by the bug and can be used as a replacement for the stdlib module if you happen to have some work to do.
msg185408 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2013-03-28 09:47
minidom may be broken, but what's the issue with ElementTree? >>> import xml.etree.cElementTree as etree >>> doc = etree.fromstring('<xml />') >>> doc.set('attr', "multiline\nvalue") >>> etree.tostring(doc) '<xml attr="multiline value" />'
msg185419 - (view)	Author: Daniele Varrazzo (piro) *	Date: 2013-03-28 10:17
ElementTree issue is with tabs: In [1]: import xml.etree.cElementTree as etree In [2]: doc = etree.fromstring('<xml />') In [4]: doc.set('attr', "here\tthere") In [5]: etree.tostring(doc) Out[5]: '<xml attr="here\tthere" />' In [6]: etree.fromstring(_5).attrib['attr'] Out[6]: 'here there' bye bye tab.
msg185423 - (view)	Author: Daniele Varrazzo (piro) *	Date: 2013-03-28 10:22
I was going to open an issue on itself about etree and tabs, but I've found this and thought it would have been marked as duplicate. If you don't think it's the case I will open it anyway. I've added my comment because ElementTree is not reported in this page at all: I've been googling forever about "ElementTree tabs attributes" without result. I've found this (and the network of duplicates) only searching for less generic "xml tabs" into the python bug tracker, when I was already filing the bug.
msg185576 - (view)	Author: Daniele Varrazzo (piro) *	Date: 2013-03-30 16:29
Added separate issue #17582 as ElementTree implementation doesn't depend on whatever causes the bug in xml.dom.minidom.

History
Date	User	Action	Args
2022-04-11 14:56:47	admin	set	github: 50002
2019-06-22 20:28:40	mitar	set	nosy: + mitar
2014-02-03 18:37:26	BreamoreBoy	set	nosy: - BreamoreBoy
2013-03-30 16:29:15	piro	set	messages: + msg185576
2013-03-28 10:22:17	piro	set	messages: + msg185423
2013-03-28 10:17:03	piro	set	messages: + msg185419
2013-03-28 09:47:39	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg185408
2013-03-28 01:39:01	piro	set	nosy: + piro messages: + msg185398
2011-08-25 11:32:37	sayap	set	nosy: + sayap messages: + msg142964
2011-05-10 09:09:23	Jens.Grivolla	set	nosy: + Jens.Grivolla
2010-10-27 19:47:41	gauges	set	nosy: + gauges messages: + msg119728
2010-08-03 10:32:03	labrat	set	messages: + msg112570
2010-08-02 16:47:22	labrat	set	files: + minidom.py nosy: + labrat messages: + msg112497
2010-07-26 16:49:38	ezio.melotti	set	nosy: + effbot, ezio.melotti, moriyoshi messages: + msg111642
2010-07-24 17:21:21	BreamoreBoy	set	files: + minidom.diff messages: + msg111493
2010-07-21 03:31:09	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg110999 versions: + Python 2.7, Python 3.2
2009-07-18 13:50:00	Tomalak	set	messages: + msg90676
2009-07-15 22:25:14	devon	set	nosy: + devon messages: + msg90548
2009-05-13 15:06:29	sechi_francesco	set	nosy: - sechi_francesco
2009-05-13 12:17:35	Tomalak	set	title: xml.dom.minidom does not escape newline characters within attribute values -> xml.dom.minidom does not escape CR, LF and TAB characters within attribute values
2009-05-13 12:16:26	Tomalak	set	messages: + msg87681
2009-05-13 12:11:58	Tomalak	set	files: + minidom.patch
2009-05-13 12:11:16	Tomalak	set	files: - minidom.patch
2009-05-13 11:57:19	Tomalak	set	messages: + msg87679
2009-05-13 11:44:34	Tomalak	set	messages: + msg87676
2009-05-12 08:26:34	sechi_francesco	set	messages: + msg87609
2009-05-12 03:30:07	ajaksu2	set	priority: normal stage: test needed -> patch review
2009-05-12 03:29:29	ajaksu2	set	files: + test_multiline_roundtrip.diff nosy: + ajaksu2 messages: + msg87604
2009-05-11 21:30:32	sechi_francesco	set	messages: + msg87590
2009-05-11 21:27:13	sechi_francesco	set	files: + test_toxml.py
2009-05-11 21:25:57	sechi_francesco	set	files: - test_toxml.py
2009-05-11 21:06:09	sechi_francesco	set	messages: + msg87586
2009-05-11 07:31:17	sechi_francesco	set	messages: + msg87564
2009-05-10 17:26:20	Tomalak	set	title: xml.dom.minidom does not handle newline characters in attribute values -> xml.dom.minidom does not escape newline characters within attribute values
2009-05-10 16:34:31	Tomalak	set	messages: + msg87528
2009-05-10 16:24:21	sechi_francesco	set	messages: + msg87527
2009-05-10 15:22:35	sechi_francesco	set	messages: + msg87526
2009-05-08 09:53:34	Tomalak	set	files: + minidom_test.py
2009-05-08 09:52:50	Tomalak	set	files: - toxml_test.py
2009-05-08 09:38:10	Tomalak	set	files: + toxml_test.py messages: + msg87432
2009-05-08 09:33:29	Tomalak	set	files: + minidom.patch keywords: + patch messages: + msg87431
2009-05-08 08:22:28	Tomalak	set	messages: + msg87429
2009-05-06 21:47:31	sechi_francesco	set	messages: + msg87351
2009-05-06 13:45:56	Tomalak	set	messages: + msg87333
2009-05-06 13:42:32	Tomalak	set	messages: + msg87332
2009-05-04 21:13:31	sechi_francesco	set	messages: + msg87178
2009-05-02 15:50:58	sechi_francesco	set	files: + test_toxml.py
2009-05-02 15:12:06	sechi_francesco	set	nosy: + sechi_francesco
2009-04-22 05:07:21	ajaksu2	set	keywords: + easy stage: test needed versions: - Python 2.5, Python 2.4, Python 3.0, Python 2.7
2009-04-14 11:13:34	Tomalak	set	type: behavior
2009-04-14 11:13:06	Tomalak	create