classification
Title: ElementTree: Incorrect serialization of end-of-line characters in attribute values
Type: behavior Stage:
Components: XML Versions: Python 3.1, Python 3.2, Python 2.7
process
Status: closed Resolution: duplicate
Dependencies: Superseder:
Assigned To: effbot Nosy List: BreamoreBoy, devon, effbot, ezio.melotti, moriyoshi
Priority: normal Keywords:

Created on 2009-10-15 06:21 by moriyoshi, last changed 2010-07-25 13:00 by BreamoreBoy. This issue is now closed.

Messages (7)
msg94074 - (view) Author: Moriyoshi Koizumi (moriyoshi) Date: 2009-10-15 06:21
ElementTree doesn't correctly serialize end-of-line characters (#xa,
#xd) in attribute values.  The parser converts bare end-of-line
characters to #x20, as the specification requires [1], so end-of-line
characters that appear as character references in the original document
must be serialized as character references again.

[1] http://www.w3.org/TR/xml11/#AVNormalize   

### sample code

from xml.etree.ElementTree import ElementTree
from cStringIO import StringIO

# builder = ElementTree(file=StringIO("<foo>\x0d</foo>"))
# out = StringIO()
# builder.write(out)
# print out.getvalue()

out = StringIO()
ElementTree(file=StringIO(
'''<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE foo [
<!ELEMENT foo (#PCDATA)>
<!ATTLIST foo attr CDATA "">
]>
<foo attr="   test
&#13;test&#32; test&#10;a  ">&#10;</foo>
''')).write(out)
# should be: <foo attr="   test &#13;test  test&#10;a  ">\x0a</foo>
print out.getvalue()

out = StringIO()
ElementTree(file=StringIO(
'''<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE foo [
<!ELEMENT foo (#PCDATA)>
<!ATTLIST foo attr NMTOKENS "">
]>
<foo attr="   test
&#13;test&#32; test&#10;a  ">&#10;</foo>
''')).write(out)
# should be: <foo attr="test &#13;test test&#10;a">\x0a</foo>
print out.getvalue()
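
For readers on Python 3, where cStringIO no longer exists, the first (CDATA) sample can be approximated as below. This is only a sketch and not part of the original report: the names DOCUMENT, tree and parsed_attr are illustrative, and what the serializer writes depends on the Python version, so no expected serialized output is claimed.

```python
import io
import xml.etree.ElementTree as ET

# Same document as the CDATA sample, as bytes so that the encoding
# declaration is honoured by the parser.
DOCUMENT = b'''<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE foo [
<!ELEMENT foo (#PCDATA)>
<!ATTLIST foo attr CDATA "">
]>
<foo attr="   test
&#13;test&#32; test&#10;a  ">&#10;</foo>
'''

tree = ET.ElementTree(file=io.BytesIO(DOCUMENT))
root = tree.getroot()

# Parsed value: the literal newline has become a space, while the
# character references &#13; and &#10; have become real CR and LF.
parsed_attr = root.get("attr")
print(repr(parsed_attr))

# Serialize again and inspect how CR and LF inside the attribute are written.
out = io.BytesIO()
tree.write(out)
print(out.getvalue())
```

Comparing the printed bytes with the "should be" comment in the sample shows whether a given interpreter is affected.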
msg94077 - (view) Author: Moriyoshi Koizumi (moriyoshi) Date: 2009-10-15 07:39
Tabs must be converted to character references as well.
msg94833 - (view) Author: Moriyoshi Koizumi (moriyoshi) Date: 2009-11-02 16:12
Looks like a duplicate of #6492
msg94853 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-11-02 22:06
If I understood correctly, the correct behavior while reading is:
  * literal newlines (\n or \r) and tabs (\t) should be collapsed and
converted to a space
  * newlines (&#xA; or &#xD;) and tabs (&#x9;) written as character
references should be converted to the literal equivalents (\n, \r and \t)

(See http://www.w3.org/TR/2000/WD-xml-c14n-20000119.html#charescaping)

This should be ok in both xml.minidom and etree.
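
The reading behavior described above can be checked with a small sketch like the following. The sample string is made up for illustration; it assumes Python 3 and the expat-based parser that xml.etree.ElementTree uses by default.

```python
import xml.etree.ElementTree as ET

# Literal LF and TAB, followed by LF, CR and TAB as character references.
source = '<foo attr="a\nb\tc&#10;d&#13;e&#9;f"/>'
value = ET.fromstring(source).get("attr")

# Literal whitespace comes back as spaces; the character references
# come back as the real control characters.
print(repr(value))  # 'a b c\nd\re\tf'
```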


When writing, on the other hand, literal newlines and tabs that are
written as they are (\n, \r and \t) can't be recovered during parsing,
because they are collapsed and converted to a space. They should
therefore be converted to character references (&#xA;, &#xD; and &#x9;)
automatically, but this could be incompatible with the current behavior
(i.e. a \n, \r or \t that is now written literally and turned into a
space during parsing would then become significant).
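
A minimal sketch of the round-trip problem, assuming Python 3. The helper escape_attr_whitespace is hypothetical and hand-written for illustration; it is not ElementTree's own escaping routine, and the markup is built by string formatting only to keep the comparison independent of any particular serializer version.

```python
import xml.etree.ElementTree as ET


def escape_attr_whitespace(value):
    """Hypothetical helper: escape markup characters, then write TAB, LF
    and CR as numeric character references so they survive parsing."""
    value = value.replace("&", "&amp;").replace("<", "&lt;").replace('"', "&quot;")
    for ch in "\t\n\r":
        value = value.replace(ch, "&#%d;" % ord(ch))
    return value


original = "line1\nline2\tcol\rend"

# Written literally: the parser turns the whitespace into spaces.
naive_value = ET.fromstring('<foo attr="%s"/>' % original).get("attr")

# Written as character references: the parser restores the characters.
escaped_value = ET.fromstring(
    '<foo attr="%s"/>' % escape_attr_whitespace(original)
).get("attr")

print(repr(naive_value))
print(repr(escaped_value))
```

Only the escaped form gives back the original string, which is also why changing the serializer would make previously insignificant whitespace significant.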

Moriyoshi, can you confirm that what I said is correct and the problem
is similar to the one described in #5752?
I also closed #6492 as duplicate of this.
msg94855 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2009-11-02 22:27
The real problem here is that XML attributes weren't really designed
to hold data that doesn't survive normalization.  One would have
thought that making it difficult to do that, and easy to store such
things as character data, would have made people think a bit before
designing XML formats that do things the other way around, but
apparently some people find it hard having to use their brain when
designing things...

FWIW, the current ET 1.3 beta escapes newline but not tabs and
carriage returns; I don't really mind adding tabs, but I'm less sure
about carriage return -- XML pretty much treats CR as a junk character
also outside attributes, and escaping it in all contexts would just be
silly.
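
Which of these characters a given installation escapes varies between releases, so it is easiest to probe it directly. The sketch below assumes Python 3 and makes no claim about any specific version beyond newline, which the message above says is already escaped in ET 1.3.

```python
import xml.etree.ElementTree as ET

# Attribute value containing LF, TAB and CR.
elem = ET.Element("foo", attr="a\nb\tc\rd")
serialized = ET.tostring(elem)
print(serialized)

# Parse the serialized form again and report which characters survived.
reparsed = ET.fromstring(serialized).get("attr")
for name, ch in (("LF", "\n"), ("TAB", "\t"), ("CR", "\r")):
    print(name, "survives round trip:", ch in reparsed)
```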
msg95145 - (view) Author: Moriyoshi Koizumi (moriyoshi) Date: 2009-11-11 17:38
@ezio.melotti

Yes, parsing works flawlessly.

Fixing this would actually break the current behavior, but I believe 
this is how it should work.

It seems #5752 pretty much says the same thing.

@effbot

As specified in 2.11 End-of-Line Handling [2], every variant of an EOL
sequence is normalized to a single #xa before the document is actually
parsed, so bare #xd characters never appear as such among the parsed
information items.


[2] http://www.w3.org/TR/xml/#sec-line-ends
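
The end-of-line handling referred to in [2] can be observed in element content with a short sketch like this one. It assumes Python 3 and the default expat-based parser; the sample string is illustrative.

```python
import xml.etree.ElementTree as ET

# CR LF and a lone CR written literally, then a CR as a character reference.
text = ET.fromstring("<foo>a\r\nb\rc&#13;d</foo>").text

# Both literal line endings arrive as a single LF; only the CR written
# as &#13; reaches the application as a CR.
print(repr(text))  # 'a\nb\nc\rd'
```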
msg111540 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-25 13:00
Closed as a duplicate of #5752 which has patches attached.
History
Date                 User          Action  Args
2010-07-25 13:00:54  BreamoreBoy   set     status: open -> closed
                                           versions: + Python 3.1, Python 3.2, - Python 2.6
                                           nosy: + BreamoreBoy
                                           messages: + msg111540
                                           resolution: duplicate
2009-11-11 17:38:12  moriyoshi     set     messages: + msg95145
2009-11-02 22:27:29  effbot        set     messages: + msg94855
2009-11-02 22:06:35  ezio.melotti  set     nosy: + ezio.melotti, devon
                                           messages: + msg94853
                                           versions: + Python 2.7
2009-11-02 16:12:47  moriyoshi     set     messages: + msg94833
2009-10-15 07:39:04  moriyoshi     set     messages: + msg94077
2009-10-15 06:28:06  ezio.melotti  set     priority: normal
                                           assignee: effbot
                                           nosy: + effbot
2009-10-15 06:21:42  moriyoshi     set     title: Incorrect serialization of end-of-line characters in attribute values -> ElementTree: Incorrect serialization of end-of-line characters in attribute values
2009-10-15 06:21:29  moriyoshi     create