Message 185574 - Python tracker

Author	piro
Recipients	piro
Date	2013-03-30.16:26:34
Message-id	<1364660794.69.0.883579635559.issue17582@psf.upfronthosting.co.za>
In-reply-to

Content

XML defines the following chars as whitespace [1]::

    S ::= (#x20 | #x9 | #xD | #xA)+

However, ElementTree does not escape all of these characters when writing attribute values, so on reparsing the unescaped ones are converted into spaces as per attribute-value normalization [2]::

    >>> data = '\x09\x0a\x0d\x20'
    >>> data
    '\t\n\r '

    >>> import xml.etree.ElementTree as ET
    >>> e = ET.Element('x', attr=data)
    >>> s = ET.tostring(e)
    >>> s
    '<x attr="\t&#10;\r " />'

    >>> e1 = ET.fromstring(s)
    >>> data1 = e1.attrib['attr']
    >>> data1 == data
    False

    >>> data1
    ' \n  '
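
The normalization step can be seen in isolation by feeding the parser hand-written XML. The following is only a minimal sketch in Python 3 syntax; the two input strings are made up for illustration and are not produced by ElementTree itself. Literal whitespace characters in the attribute value are each turned into a space, while the same characters written as character references survive parsing:

```python
import xml.etree.ElementTree as ET

# Literal tab, newline and carriage return inside the attribute value:
# the parser applies attribute-value normalization to them.
literal = ET.fromstring('<x attr="\t\n\r " />').attrib['attr']

# The same characters written as character references are kept as-is.
escaped = ET.fromstring('<x attr="&#9;&#10;&#13; " />').attrib['attr']

print(repr(literal))
print(repr(escaped))
```

This suggests that the information is lost at serialization time, not in the parser.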

cElementTree suffers from the same bug::

    >>> import xml.etree.cElementTree as cET
    >>> cET.fromstring(cET.tostring(cET.Element('a', attr=data))).attrib['attr']
    ' \n  '

but not the external library lxml.etree::

    >>> import lxml.etree as LET
    >>> LET.fromstring(LET.tostring(LET.Element('a', attr=data))).attrib['attr']
    '\t\n\r '

The bug is analogous to #5752, but it affects a different, independent module. Proper escaping of these characters should be added to the _escape_attrib() function in /xml/etree/ElementTree.py (and to its equivalent for cElementTree).
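
A rough sketch of the kind of escaping meant here, in Python 3 syntax. The function name escape_attrib_whitespace is hypothetical and this is not the actual _escape_attrib() code or a proposed patch; it only illustrates that writing tab, newline and carriage return as character references is enough for the value to round-trip:

```python
import xml.etree.ElementTree as ET


def escape_attrib_whitespace(text):
    """Hypothetical helper: escape an attribute value for double-quoted use.

    In addition to the usual markup characters, tab, newline and carriage
    return are written as character references so that the parser does not
    normalize them to spaces.
    """
    # '&' must be replaced first, before any new '&' is introduced.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace('"', "&quot;")
    text = text.replace("\t", "&#9;")
    text = text.replace("\n", "&#10;")
    text = text.replace("\r", "&#13;")
    return text


data = '\t\n\r '
# Build the element by hand to show the effect of the escaping alone.
s = '<x attr="%s" />' % escape_attrib_whitespace(data)
roundtrip = ET.fromstring(s).attrib['attr']
print(roundtrip == data)
```

The plain space is left untouched, since normalization maps it to itself.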

[1] http://www.w3.org/TR/REC-xml/#white
[2] http://www.w3.org/TR/REC-xml/#AVNormalize

History
Date	User	Action	Args
2013-03-30 16:26:34	piro	set	recipients: + piro
2013-03-30 16:26:34	piro	set	messageid: <1364660794.69.0.883579635559.issue17582@psf.upfronthosting.co.za>
2013-03-30 16:26:34	piro	link	issue17582 messages
2013-03-30 16:26:34	piro	create