Issue7139
Created on 2009-10-15 06:21 by moriyoshi, last changed 2009-11-11 17:38 by moriyoshi.
|
msg94074 - (view) |
Author: Moriyoshi Koizumi (moriyoshi) |
Date: 2009-10-15 06:21 |
|
ElementTree doesn't correctly serialize end-of-line characters (#xa,
#xd) in attribute values. Since bare end-of-line characters are
converted to #x20 by the parser according to the specification [1], any
such character that was written as a character reference in the original
document must be serialized as a character reference again.
[1] http://www.w3.org/TR/xml11/#AVNormalize
### sample code
from xml.etree.ElementTree import ElementTree
from cStringIO import StringIO
# builder = ElementTree(file=StringIO("<foo>\x0d</foo>"))
# out = StringIO()
# builder.write(out)
# print out.getvalue()
out = StringIO()
ElementTree(file=StringIO(
'''<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE foo [
<!ELEMENT foo (#PCDATA)>
<!ATTLIST foo attr CDATA "">
]>
<foo attr=" test
test  test a "> </foo>
''')).write(out)
# should be "<foo attr=" test test test a ">\x0a</foo>
print out.getvalue()
out = StringIO()
ElementTree(file=StringIO(
'''<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE foo [
<!ELEMENT foo (#PCDATA)>
<!ATTLIST foo attr NMTOKENS "">
]>
<foo attr=" test
test  test a "> </foo>
''')).write(out)
# should be "<foo attr="test test test a">\x0a</foo>
print out.getvalue()
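For readers on Python 3, where cStringIO and the print statement are no longer available, the same kind of check can be sketched as a round trip. The helper name roundtrip_attr and the sample values are illustrative assumptions, not the reporter's original input, and the outcome for tabs and carriage returns may differ between Python versions.

```python
import xml.etree.ElementTree as ET


def roundtrip_attr(value):
    """Serialize an element carrying `value` in an attribute, then parse it back.

    Hypothetical helper for illustration only; it is not part of ElementTree.
    """
    elem = ET.Element("foo", {"attr": value})
    serialized = ET.tostring(elem, encoding="unicode")
    reparsed = ET.fromstring(serialized).get("attr")
    return serialized, reparsed


# Show what the serializer emits and what the parser hands back for each sample.
for sample in ("a\nb", "a\tb", "a\rb"):
    serialized, reparsed = roundtrip_attr(sample)
    print(repr(sample), "->", serialized, "->", repr(reparsed))
```

If the reparsed value differs from the sample, the serializer wrote the character in a form that attribute-value normalization does not preserve.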
|
|
msg94077 - (view) |
Author: Moriyoshi Koizumi (moriyoshi) |
Date: 2009-10-15 07:39 |
|
Tabs must be converted to character references as well.
|
|
msg94833 - (view) |
Author: Moriyoshi Koizumi (moriyoshi) |
Date: 2009-11-02 16:12 |
|
Looks like a duplicate of #6492
|
|
msg94853 - (view) |
Author: Ezio Melotti (ezio.melotti) |
Date: 2009-11-02 22:06 |
|
If I understood correctly, the correct behavior while reading is:
* literal newlines (\n or \r) and tabs (\t) should be collapsed and
converted to a space
* newlines (&#10; or &#13;) and tabs (&#9;) as entities should be
converted to the literal equivalents (\n, \r and \t)
(See http://www.w3.org/TR/2000/WD-xml-c14n-20000119.html#charescaping)
This should be ok in both xml.minidom and etree.
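The two reading rules above can be checked against the parser with a short sketch. The helper name parsed_attr and the sample documents are assumptions made for illustration. Note that for an undeclared (CDATA) attribute, the XML 1.0 normalization rules replace each literal whitespace character with one space; collapsing runs of spaces applies only to non-CDATA attribute types.

```python
import xml.etree.ElementTree as ET


def parsed_attr(xml_text):
    """Return the value of `attr` on the root element after parsing."""
    return ET.fromstring(xml_text).get("attr")


# Literal newline and tab inside the attribute value: normalized to spaces.
literal = parsed_attr('<foo attr="a\nb\tc"/>')

# The same characters written as character references: kept as-is.
referenced = parsed_attr('<foo attr="a&#10;b&#9;c&#13;d"/>')

print(repr(literal))
print(repr(referenced))
```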
Instead, while writing, if literal newlines and tabs are written as they
are (\n, \r and \t), they can't be read during the parsing phase because
they are collapsed and converted to a space. They should therefore be
converted to entities (&#10;, &#13; and &#9;) automatically, but this
could be incompatible with the current behavior (i.e. \n, \r or \t that
now are written and collapsed as a space during the parsing will then
become significant).
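A minimal sketch of the writing side described above, assuming a hand-written escaping function. The name escape_attr is hypothetical and this is not ElementTree's own implementation; it only illustrates that a value survives a parse once newlines, carriage returns and tabs are written as character references.

```python
import xml.etree.ElementTree as ET


def escape_attr(value):
    """Escape markup and whitespace for use inside a double-quoted attribute.

    Hypothetical helper for illustration; not part of the standard library.
    """
    # '&' must be replaced first so the references added below are not re-escaped.
    replacements = [
        ("&", "&amp;"),
        ("<", "&lt;"),
        (">", "&gt;"),
        ('"', "&quot;"),
        ("\n", "&#10;"),
        ("\r", "&#13;"),
        ("\t", "&#9;"),
    ]
    for char, ref in replacements:
        value = value.replace(char, ref)
    return value


original = 'line1\r\nline2\tend & "quoted"'
document = '<foo attr="%s"/>' % escape_attr(original)
restored = ET.fromstring(document).get("attr")

print(document)
print(restored == original)
```

The compatibility concern raised above still applies: output produced this way makes whitespace significant that a parser would previously have turned into spaces.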
Moriyoshi, can you confirm that what I said is correct and the problem
is similar to the one described in #5752?
I also closed #6492 as duplicate of this.
|
|
msg94855 - (view) |
Author: Fredrik Lundh (effbot) |
Date: 2009-11-02 22:27 |
|
The real problem here is that XML attributes weren't really designed
to hold data that doesn't survive normalization. One would have
thought that making it difficult to do that, and easy to store such
things as character data, would have made people think a bit before
designing XML formats that do things the other way around, but
apparently some people find it hard having to use their brain when
designing things...
FWIW, the current ET 1.3 beta escapes newline but not tabs and
carriage returns; I don't really mind adding tabs, but I'm less sure
about carriage return -- XML pretty much treats CR as a junk character
also outside attributes, and escaping it in all contexts would just be
silly.
|
|
msg95145 - (view) |
Author: Moriyoshi Koizumi (moriyoshi) |
Date: 2009-11-11 17:38 |
|
@ezio.melotti
Yes, as far as parsing goes, it works flawlessly.
Fixing this would actually break the current behavior, but I believe
this is how it should work.
It seems #5752 pretty much says the same thing.
@effbot
As specified in 2.11 End-of-Line Handling [2], all variants of EOL
characters should have been normalized into a single #xa before the
document actually gets parsed, so bare #xd characters would never appear
as they are among the parsed information items.
[2] http://www.w3.org/TR/xml/#sec-line-ends
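The end-of-line handling in [2] can be observed in element content with a short sketch. The sample documents are assumptions chosen for illustration: literal #xd characters are normalized away before parsing, while a #xd written as a character reference reaches the application.

```python
import xml.etree.ElementTree as ET

# Literal CR LF and a lone literal CR are both normalized to a single LF.
literal_text = ET.fromstring("<foo>a\r\nb\rc</foo>").text

# A carriage return written as a character reference is not normalized.
referenced_text = ET.fromstring("<foo>a&#13;b</foo>").text

print(repr(literal_text))
print(repr(referenced_text))
```

This is why a #xd in parsed data can only have come from a character reference, and why a serializer that writes it back literally would lose it on the next parse.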
|
|
| Date | User | Action | Args |
| 2009-11-11 17:38:12 | moriyoshi | set | messages: + msg95145 |
| 2009-11-02 22:27:29 | effbot | set | messages: + msg94855 |
| 2009-11-02 22:06:35 | ezio.melotti | set | nosy: + ezio.melotti, devon; messages: + msg94853; versions: + Python 2.7 |
| 2009-11-02 16:12:47 | moriyoshi | set | messages: + msg94833 |
| 2009-10-15 07:39:04 | moriyoshi | set | messages: + msg94077 |
| 2009-10-15 06:28:06 | ezio.melotti | set | priority: normal; assignee: effbot; nosy: + effbot |
| 2009-10-15 06:21:42 | moriyoshi | set | title: Incorrect serialization of end-of-line characters in attribute values -> ElementTree: Incorrect serialization of end-of-line characters in attribute values |
| 2009-10-15 06:21:29 | moriyoshi | create | |
|