Message185574
XML defines the following chars as whitespace [1]::
S ::= (#x20 | #x9 | #xD | #xA)+
However, these characters are not escaped when they are serialized into attribute values, so the parser converts them to spaces on reading, as per attribute-value normalization [2]::
>>> data = '\x09\x0a\x0d\x20'
>>> data
'\t\n\r '
>>> import xml.etree.ElementTree as ET
>>> e = ET.Element('x', attr=data)
>>> s = ET.tostring(e)
>>> s
'<x attr="\t&#10;\r " />'
>>> e1 = ET.fromstring(s)
>>> data1 = e1.attrib['attr']
>>> data1 == data
False
>>> data1
' \n  '
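The normalization can also be observed on the parser side alone. The following is a minimal sketch, assuming the standard expat-based parser behind xml.etree.ElementTree; the sample attribute values are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Literal tab, newline and carriage return inside the attribute value:
# the parser applies attribute-value normalization and returns spaces.
literal = ET.fromstring('<x attr="a\tb\nc\rd" />').attrib['attr']

# The same characters written as character references are not normalized.
escaped = ET.fromstring('<x attr="a&#9;b&#10;c&#13;d" />').attrib['attr']

print(repr(literal))   # 'a b c d'
print(repr(escaped))   # 'a\tb\nc\rd'
```

Only whitespace written as character references survives parsing, so the serializer has to emit it in that form.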
cElementTree suffers from the same bug::
>>> import xml.etree.cElementTree as cET
>>> cET.fromstring(cET.tostring(cET.Element('a', attr=data))).attrib['attr']
' \n  '
but not the external library lxml.etree::
>>> import lxml.etree as LET
>>> LET.fromstring(LET.tostring(LET.Element('a', attr=data))).attrib['attr']
'\t\n\r '
The bug is analogous to #5752, but it affects a different and independent module. Proper escaping should be added to the _escape_attrib() function in xml/etree/ElementTree.py (and to its equivalent for cElementTree).
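As a rough illustration of the escaping proposed above, here is a minimal sketch. The helper name escape_attrib_whitespace is hypothetical and this is not the actual _escape_attrib() code or a proposed patch; it only shows that writing tab, newline and carriage return as character references lets the value survive a round trip:

```python
import xml.etree.ElementTree as ET

def escape_attrib_whitespace(text):
    # Hypothetical helper, not the real _escape_attrib().
    # Escape the markup characters first ("&" must come first so the
    # references added below are not escaped a second time).
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace('"', "&quot;")
    # Write whitespace other than the plain space as character references,
    # so attribute-value normalization leaves it untouched.
    text = text.replace("\t", "&#9;")
    text = text.replace("\n", "&#10;")
    text = text.replace("\r", "&#13;")
    return text

data = '\x09\x0a\x0d\x20'
# Build the element by hand so that only the sketch's escaping is involved.
xml_text = '<x attr="%s" />' % escape_attrib_whitespace(data)
round_tripped = ET.fromstring(xml_text).attrib['attr']

print(xml_text)
print(round_tripped == data)
```

The plain space (#x20) needs no escaping, since normalization maps it to itself.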
[1] http://www.w3.org/TR/REC-xml/#white
[2] http://www.w3.org/TR/REC-xml/#AVNormalize
Date | User | Action | Args
2013-03-30 16:26:34 | piro | set | recipients: + piro
2013-03-30 16:26:34 | piro | set | messageid: <1364660794.69.0.883579635559.issue17582@psf.upfronthosting.co.za>
2013-03-30 16:26:34 | piro | link | issue17582 messages
2013-03-30 16:26:34 | piro | create |