Message 87528 - Python tracker

Author	Tomalak
Recipients	Tomalak, sechi_francesco
Date	2009-05-10.16:34:29
Message-id	<1241973273.1.0.543750931833.issue5752@psf.upfronthosting.co.za>
In-reply-to

Content

Francesco, I think you are missing the point. :-) The problem has two sides.

If I create an XML document using the DOM (not by parsing it from a
string!), then I can put newline characters into an attribute value.
This is allowed and conforms to the XML spec.

However, *literal* newlines in an attribute value (i.e. when the
document is parsed from a string) are not preserved. The parser
normalizes them like any other whitespace in an attribute value -- each
one is converted to a single space. This is also valid and conforms to
the XML spec.
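
The parser side can be checked directly. This is a minimal sketch, assuming Python 3 and the standard `xml.dom.minidom` parser; the element name `root` and the attribute name `data` are made up for illustration:

```python
from xml.dom import minidom

# A literal newline inside the attribute value is normalized by the parser.
literal = minidom.parseString('<root data="line one\nline two"/>')
literal_value = literal.documentElement.getAttribute("data")

# A newline written as a character reference survives parsing.
encoded = minidom.parseString('<root data="line one&#10;line two"/>')
encoded_value = encoded.documentElement.getAttribute("data")

print(repr(literal_value))
print(repr(encoded_value))
```

The first value comes back with a space where the newline was; the second keeps the newline.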

The catch: This leads to actual data loss if I *wanted* to store
newline characters in an attribute -- unless the newline characters are
properly encoded. Encoding the newline characters is also valid and
conforms to the spec, so the DOM implementation should do it.
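
To illustrate what "properly encoded" could look like, here is a rough sketch of escaping done by hand. `escape_attribute` is a hypothetical helper written for this example, not part of minidom, and it assumes the value ends up inside a double-quoted attribute:

```python
from xml.dom import minidom


def escape_attribute(value):
    """Hypothetical helper: escape a string for a double-quoted XML attribute."""
    # Escape the ampersand first so the references added below stay intact.
    value = value.replace("&", "&amp;")
    value = value.replace("<", "&lt;")
    value = value.replace('"', "&quot;")
    # Encode whitespace that the parser would otherwise turn into spaces.
    value = value.replace("\r", "&#13;")
    value = value.replace("\n", "&#10;")
    value = value.replace("\t", "&#9;")
    return value


original = 'first line\nsecond "quoted" line\twith a tab & more'
xml_text = '<root data="%s"/>' % escape_attribute(original)

# Parsing the escaped document gives back the original string.
restored = minidom.parseString(xml_text).documentElement.getAttribute("data")
print(restored == original)
```

A serializer that applied this kind of escaping to attribute values would let newlines survive a write/read cycle.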

In other words, the parsing process you refer to is actually working
fine. If an attribute contains a literal newline, it is indeed okay to
convert it into a space. It's only the document serialization that is
broken.

Minidom is clearly missing functionality here, and it does not conform
to the XML spec. If I store a string of data in an XML document, then
reading the document back must give me the *same* data. This is what I
check with my test script.
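
The test script itself is not reproduced in this message. A round-trip check along those lines might look like the sketch below; the function name `attribute_round_trips` and the element and attribute names are assumptions made for illustration:

```python
from xml.dom import minidom


def attribute_round_trips(value):
    """Return True if `value` survives DOM -> XML text -> DOM unchanged."""
    # Build the document through the DOM, not by parsing a string.
    doc = minidom.Document()
    root = doc.createElement("root")
    root.setAttribute("data", value)
    doc.appendChild(root)

    # Serialize, then parse the serialized text again.
    xml_text = doc.toxml()
    reparsed = minidom.parseString(xml_text)
    return reparsed.documentElement.getAttribute("data") == value


print(attribute_round_trips("plain text"))
print(attribute_round_trips("line one\nline two"))
```

The second result depends on whether the minidom in use encodes newlines when it serializes attribute values; a serializer that writes them literally will report a mismatch here.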
