msg85964 - (view) |
Author: (Tomalak) |
Date: 2009-04-14 11:13 |
Current behavior upon toxml() is:
<foo attribute="multiline
value" />
Upon reading the document again, the new line is normalized and
collapsed into a space (according to the XML spec, section 3.3.3), which
means that it is lost.
Better behavior would be something like this (within attribute values only):
<foo attribute="multiline
value" />
|
msg87178 - (view) |
Author: Francesco Sechi (sechi_francesco) |
Date: 2009-05-04 21:13 |
Ok, I've tried to solve this problem, but I think that the keyword
'easy' is not suitable for this kind of task, because it is necessary to
modify the expat module that is very complex.
|
msg87332 - (view) |
Author: (Tomalak) |
Date: 2009-05-06 13:42 |
@Francesco Sechi: Would it not just require a minimal change to the
_write_data() method? Something along the lines of (sorry, no Python
expert, maybe I am way off):
def _write_data(writer, data, is_attrib=False):
"Writes datachars to writer."
if is_attrib:
data = data.replace("\r", " ").replace("\n", " ")
data = data.replace("&", "&").replace("<", "<")
data = data.replace("\"", """).replace(">", ">")
writer.write(data)
and in Element.writexml():
#[...]
for a_name in a_names:
writer.write(" %s=\"" % a_name)
_write_data(writer, attrs[a_name].value, True)
#[...]
|
msg87333 - (view) |
Author: (Tomalak) |
Date: 2009-05-06 13:45 |
Of course it should be:
def _write_data(writer, data, is_attrib=False):
"Writes datachars to writer."
data = data.replace("&", "&").replace("<", "<")
data = data.replace("\"", """).replace(">", ">")
if is_attrib:
data = data.replace("\r", " ").replace("\n", " ")
writer.write(data)
|
msg87351 - (view) |
Author: Francesco Sechi (sechi_francesco) |
Date: 2009-05-06 21:47 |
Don't worry, I'm a newer too.
No, your solution does not work, because the method you refer
(_write_data) is called by the toxml() function, but the newline is
replaced with a whitespace by the parsestring() function. The
parsestring function, as I already said, refers to the 'expat' module,
that is very complex (for me).
|
msg87429 - (view) |
Author: (Tomalak) |
Date: 2009-05-08 08:22 |
Hmm... I thought toxml() is the part that needs to be fixed, not the
parsing/reading. I mentioned the reading only to outline the data loss
that occurs eventually.
My point is: The toxml() (i.e. _write_data) *actually writes* the
newline to the output. And within parameters, it just shouldn't.
|
msg87431 - (view) |
Author: (Tomalak) |
Date: 2009-05-08 09:33 |
Attaching a patch that fixes the problem.
|
msg87432 - (view) |
Author: (Tomalak) |
Date: 2009-05-08 09:38 |
Attaching a test file that outlines the problem. Output on my system
(Windows / Python 3.0) is:
Without the patch:
C:\Python30>python.exe c:\minidom_test.py
False
1 -->"multiline
value"
2 -->"multiline value"
With the patch:
C:\Python30>python.exe c:\minidom_test.py
True
1 -->"multiline
value"
2 -->"multiline
value"
|
msg87526 - (view) |
Author: Francesco Sechi (sechi_francesco) |
Date: 2009-05-10 15:22 |
I think that the problem is: the xmldoc1 has the newline or not? If it
hasn't your patch works only in the particular case you pass a toxml
return value to 'parsestring'. If I pass an XML string with newlines it
doesn't work. So your solution is not generic and cannot be considered a
patch for the issue you proposed.
|
msg87527 - (view) |
Author: Francesco Sechi (sechi_francesco) |
Date: 2009-05-10 16:24 |
I try to explain better what is my opinion:
- If you add the attribute by using setAttribute the newlines are kept
and the toxml works well
- If you add the attribute by using the parsestring, passing it an XML
string the newlines are replaced
- Your patch works only if you act on a well-constructed (i.e.with
newlines kept in internal data structures) xml.dom.minidom.Document
object only, so...
- If you try to execute your patched toxml method of a
xml.dom.minidom.Document constructed using parsestring passing it a
string with newline it does not work.
So your patch works only in a specific case: you are trying to fix a
problem in parsestring, acting on its actual parameter.
|
msg87528 - (view) |
Author: (Tomalak) |
Date: 2009-05-10 16:34 |
Francesco, I think you are missing the point. :-) The problem has two sides.
If I create an XML document using the DOM (not by parsing it from a
string!), then I can put newline characters into attribute value. This
is allowed and conforms to the XML spec.
However, *literal* newlines in an attribute value (i.e. when the
document is parsed from a string) have no meaning. The parser treats
them as if they were insignificant whitespace -- they are converted to a
single space. This is also valid and conforms to the XML spec.
The catch: This leads to an actual data loss if I *wanted* to store
newline characters in an attribute -- unless the newline characters are
properly encoded. Encoding the newline characters is also valid and
conforms to the spec, so the DOM implementation should do it.
In other words - the parsing process you refer to is actually working
fine. If an attribute contains a literal newline, it is indeed okay to
collapse it into a space. It's only the document serializing that is broken.
Minidom is clearly missing functionality here, and it does not conform
to the XML spec. If I store a string of data in an XML document, it must
be ensured that upon reading the document again, I get the *same* data
back. This is what I check with my test script.
|
msg87564 - (view) |
Author: Francesco Sechi (sechi_francesco) |
Date: 2009-05-11 07:31 |
All right, now I understand, thanks. But I think that, for internal
class coherence, it is necessary not to modify toxml method, but the
'setAttribute' one, because this is the source of the problem.
|
msg87586 - (view) |
Author: Francesco Sechi (sechi_francesco) |
Date: 2009-05-11 21:06 |
A solution for this issue could be to replace the setAttribute method as
follow:
- d["value"] = d["nodeValue"] = value
+ d["value"] = d["nodeValue"] = value.replace('\n',' ')
NOTE: I didn't do a patch, because I don't know which python version you
are using.
Please try this solution and give me a feedback, thanks.
|
msg87590 - (view) |
Author: Francesco Sechi (sechi_francesco) |
Date: 2009-05-11 21:30 |
I have uploaded a test script that shows that, without my patch, the
methods setAttribute and parseString work differently; adding my patch,
the behaviour is symmetric.
|
msg87604 - (view) |
Author: Daniel Diniz (ajaksu2) * |
Date: 2009-05-12 03:29 |
Francesco,
Your patch still doesn't allow one to add a multiline attribute values
as Tomalak describes:
"The catch: This leads to an actual data loss if I *wanted* to store
newline characters in an attribute -- unless the newline characters are
properly encoded. Encoding the newline characters is also valid and
conforms to the spec, so the DOM implementation should do it."
I'm not sure whether the proposed behavior is correct or desirable. Even
if it is correct, it might introduce backwards incompatible changes in
behavior.
Here's a test case for trunk.
|
msg87609 - (view) |
Author: Francesco Sechi (sechi_francesco) |
Date: 2009-05-12 08:26 |
My position is:
if you want to encode the newline character, this should be done by both
parseString and setAttribute methods. Otherwise, the behaviour is not
symmetric.
My patch translates the newline character with a whitespace in the
setAttribute method, because parseString already does it. If you want to
encode the newline in different manner, you should develop a patch that
introduces this kind of encoding in both parseString and setAttribute
methods.
|
msg87676 - (view) |
Author: (Tomalak) |
Date: 2009-05-13 11:44 |
Francesco,
> if you want to encode the newline character,
> this should be done by both parseString and
> setAttribute methods. Otherwise, the
> behaviour is not symmetric.
I believe you still don't see the issue. The behaviour is not symmetric
*now*. You store a '\n' in an attribute value with setAttribute(), save
the document to XML, load it again and out comes a space where the '\n'
should have been.
The point is that parseString() behaves correctly, but serializing does
not. There is only one side to fix, because only one side is broken.
> If you want to encode the newline in different
> manner, you should develop a patch that
> introduces this kind of encoding in both
> parseString and setAttribute methods.
It would be pointless to do the encoding in setAttribute(). The valid
ways to XML-encode a '\n' character are '
', '
' or '
'. Doing
so in setAttribute() would produce doubly encoded output, like this:
'&#10'. This is even more wrong.
However, if parseString() encounters a '
' in the input, it correctly
translates this to '\n' in the DOM. As I said, there is nothing to fix
in parsing, this exercise is about getting minidom to actually *output*
a ' ' where appropriate. :-)
|
msg87679 - (view) |
Author: (Tomalak) |
Date: 2009-05-13 11:57 |
Daniel Diniz:
The proposed behaviour is correct:
http://www.w3.org/TR/2000/WD-xml-c14n-20000119.html#charescaping
"In attribute values, the character information items
TAB (#x9), newline (#xA), and carriage-return (#xD)
are represented by "	", "
", and "
" respectively."
Since the behaviour is correct, it is also desirable. :-)
I don't think that this change could cause existing solution to break
since the current inconsistency in handling these characters make it
impossible to rely on this anyway.
Thanks for putting up the unit test diff.
|
msg87681 - (view) |
Author: (Tomalak) |
Date: 2009-05-13 12:16 |
I changed the patch to include support for TAB characters, which were
also left unencoded before.
Also I switched encoding from ' ' etc. to '
'. This is
equivalent, but the spec uses the latter variant.
|
msg90548 - (view) |
Author: (devon) |
Date: 2009-07-15 22:25 |
see also a similar issue in etree: #6492
|
msg90676 - (view) |
Author: (Tomalak) |
Date: 2009-07-18 13:50 |
@devon: Thanks for pointing & linking back here.
|
msg110999 - (view) |
Author: Mark Lawrence (BreamoreBoy) * |
Date: 2010-07-21 03:31 |
Patched test_minidom and ran it test failed. Went to patch minidom.py and it appears up to date, so no idea why the test failed, can someone please take a look as it's 04:30 BST, thanks.
|
msg111493 - (view) |
Author: Mark Lawrence (BreamoreBoy) * |
Date: 2010-07-24 17:21 |
minidom.patch had the new file listed before the old, so I've uploaded minidom.diff. The patch is tiny and looks clean. Tests have been repeated on Windows against 2.7 and are fine, so I believe this can be committed.
|
msg111642 - (view) |
Author: Ezio Melotti (ezio.melotti) * |
Date: 2010-07-26 16:49 |
#7139 has been closed as duplicate of this and contains a few messages that might be worth reading.
|
msg112497 - (view) |
Author: W. Trevor King (labrat) * |
Date: 2010-08-02 16:47 |
As a workaround until the patch gets included, you can import this monkey patch module.
|
msg112570 - (view) |
Author: W. Trevor King (labrat) * |
Date: 2010-08-03 10:32 |
And while we're at it, we should also
.replace('&', '&').replace('"', """).replace('<', '<')
which would have to go at the beginning to avoid double-escaping the '&'.
We could use xml.sax.saxutils.escape to do all the escaping rather than chaining replaces:
data = escape(data, {'"':'"', '\r':'
', '\n':'
', '\t':'	'})
which also escapes '>' (not strictly required for attribute values, but shouldn't be harmful either).
|
msg119728 - (view) |
Author: Ralph Gauges (gauges) |
Date: 2010-10-27 19:47 |
I tried to apply the minidom.diff patch below, but it seems that removing the two lines that replace the "<" and ">" characters is not a good idea.
It least in some files I now suddenly get "<" and ">" characters in text elements where there should be > and <
At least the part with the tabs seems to work now and if I add the two lines with the replace calls that got deleted by the patch, everything seems fine.
|
msg142964 - (view) |
Author: Yap Sok Ann (sayap) |
Date: 2011-08-25 11:32 |
Just want to mention that until the patch get included, it will be impossible to use the standard library to generate a working BCP (Bulk Copy Program) XML format file for SQL Server, which always requires a TERMINATOR="\r\n" or TERMINATOR=" " attribute.
|
msg185398 - (view) |
Author: Daniele Varrazzo (piro) * |
Date: 2013-03-28 01:39 |
Just got bitten by this bug, which affects xml.etree.ElementTree and cElementTree too. Any chance to have it fixed?
Note that lxml.etree is not affected by the bug and can be used as a replacement for the stdlib module if you happen to have some work to do.
|
msg185408 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * |
Date: 2013-03-28 09:47 |
minidom may be broken, but what's the issue with ElementTree?
>>> import xml.etree.cElementTree as etree
>>> doc = etree.fromstring('<xml />')
>>> doc.set('attr', "multiline\nvalue")
>>> etree.tostring(doc)
'<xml attr="multiline value" />'
|
msg185419 - (view) |
Author: Daniele Varrazzo (piro) * |
Date: 2013-03-28 10:17 |
ElementTree issue is with tabs:
In [1]: import xml.etree.cElementTree as etree
In [2]: doc = etree.fromstring('<xml />')
In [4]: doc.set('attr', "here\tthere")
In [5]: etree.tostring(doc)
Out[5]: '<xml attr="here\tthere" />'
In [6]: etree.fromstring(_5).attrib['attr']
Out[6]: 'here there'
bye bye tab.
|
msg185423 - (view) |
Author: Daniele Varrazzo (piro) * |
Date: 2013-03-28 10:22 |
I was going to open an issue on itself about etree and tabs, but I've found this and thought it would have been marked as duplicate. If you don't think it's the case I will open it anyway.
I've added my comment because ElementTree is not reported in this page at all: I've been googling forever about "ElementTree tabs attributes" without result. I've found this (and the network of duplicates) only searching for less generic "xml tabs" into the python bug tracker, when I was already filing the bug.
|
msg185576 - (view) |
Author: Daniele Varrazzo (piro) * |
Date: 2013-03-30 16:29 |
Added separate issue #17582 as ElementTree implementation doesn't depend on whatever causes the bug in xml.dom.minidom.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:56:47 | admin | set | github: 50002 |
2019-06-22 20:28:40 | mitar | set | nosy:
+ mitar
|
2014-02-03 18:37:26 | BreamoreBoy | set | nosy:
- BreamoreBoy
|
2013-03-30 16:29:15 | piro | set | messages:
+ msg185576 |
2013-03-28 10:22:17 | piro | set | messages:
+ msg185423 |
2013-03-28 10:17:03 | piro | set | messages:
+ msg185419 |
2013-03-28 09:47:39 | amaury.forgeotdarc | set | nosy:
+ amaury.forgeotdarc messages:
+ msg185408
|
2013-03-28 01:39:01 | piro | set | nosy:
+ piro messages:
+ msg185398
|
2011-08-25 11:32:37 | sayap | set | nosy:
+ sayap messages:
+ msg142964
|
2011-05-10 09:09:23 | Jens.Grivolla | set | nosy:
+ Jens.Grivolla
|
2010-10-27 19:47:41 | gauges | set | nosy:
+ gauges messages:
+ msg119728
|
2010-08-03 10:32:03 | labrat | set | messages:
+ msg112570 |
2010-08-02 16:47:22 | labrat | set | files:
+ minidom.py nosy:
+ labrat messages:
+ msg112497
|
2010-07-26 16:49:38 | ezio.melotti | set | nosy:
+ effbot, ezio.melotti, moriyoshi messages:
+ msg111642
|
2010-07-24 17:21:21 | BreamoreBoy | set | files:
+ minidom.diff
messages:
+ msg111493 |
2010-07-21 03:31:09 | BreamoreBoy | set | nosy:
+ BreamoreBoy
messages:
+ msg110999 versions:
+ Python 2.7, Python 3.2 |
2009-07-18 13:50:00 | Tomalak | set | messages:
+ msg90676 |
2009-07-15 22:25:14 | devon | set | nosy:
+ devon messages:
+ msg90548
|
2009-05-13 15:06:29 | sechi_francesco | set | nosy:
- sechi_francesco
|
2009-05-13 12:17:35 | Tomalak | set | title: xml.dom.minidom does not escape newline characters within attribute values -> xml.dom.minidom does not escape CR, LF and TAB characters within attribute values |
2009-05-13 12:16:26 | Tomalak | set | messages:
+ msg87681 |
2009-05-13 12:11:58 | Tomalak | set | files:
+ minidom.patch |
2009-05-13 12:11:16 | Tomalak | set | files:
- minidom.patch |
2009-05-13 11:57:19 | Tomalak | set | messages:
+ msg87679 |
2009-05-13 11:44:34 | Tomalak | set | messages:
+ msg87676 |
2009-05-12 08:26:34 | sechi_francesco | set | messages:
+ msg87609 |
2009-05-12 03:30:07 | ajaksu2 | set | priority: normal stage: test needed -> patch review |
2009-05-12 03:29:29 | ajaksu2 | set | files:
+ test_multiline_roundtrip.diff nosy:
+ ajaksu2 messages:
+ msg87604
|
2009-05-11 21:30:32 | sechi_francesco | set | messages:
+ msg87590 |
2009-05-11 21:27:13 | sechi_francesco | set | files:
+ test_toxml.py |
2009-05-11 21:25:57 | sechi_francesco | set | files:
- test_toxml.py |
2009-05-11 21:06:09 | sechi_francesco | set | messages:
+ msg87586 |
2009-05-11 07:31:17 | sechi_francesco | set | messages:
+ msg87564 |
2009-05-10 17:26:20 | Tomalak | set | title: xml.dom.minidom does not handle newline characters in attribute values -> xml.dom.minidom does not escape newline characters within attribute values |
2009-05-10 16:34:31 | Tomalak | set | messages:
+ msg87528 |
2009-05-10 16:24:21 | sechi_francesco | set | messages:
+ msg87527 |
2009-05-10 15:22:35 | sechi_francesco | set | messages:
+ msg87526 |
2009-05-08 09:53:34 | Tomalak | set | files:
+ minidom_test.py |
2009-05-08 09:52:50 | Tomalak | set | files:
- toxml_test.py |
2009-05-08 09:38:10 | Tomalak | set | files:
+ toxml_test.py
messages:
+ msg87432 |
2009-05-08 09:33:29 | Tomalak | set | files:
+ minidom.patch keywords:
+ patch messages:
+ msg87431
|
2009-05-08 08:22:28 | Tomalak | set | messages:
+ msg87429 |
2009-05-06 21:47:31 | sechi_francesco | set | messages:
+ msg87351 |
2009-05-06 13:45:56 | Tomalak | set | messages:
+ msg87333 |
2009-05-06 13:42:32 | Tomalak | set | messages:
+ msg87332 |
2009-05-04 21:13:31 | sechi_francesco | set | messages:
+ msg87178 |
2009-05-02 15:50:58 | sechi_francesco | set | files:
+ test_toxml.py |
2009-05-02 15:12:06 | sechi_francesco | set | nosy:
+ sechi_francesco
|
2009-04-22 05:07:21 | ajaksu2 | set | keywords:
+ easy stage: test needed versions:
- Python 2.5, Python 2.4, Python 3.0, Python 2.7 |
2009-04-14 11:13:34 | Tomalak | set | type: behavior |
2009-04-14 11:13:06 | Tomalak | create | |