Author piro
Recipients piro
Date 2013-03-30.16:26:34
Message-id <1364660794.69.0.883579635559.issue17582@psf.upfronthosting.co.za>
In-reply-to
Content
XML defines the following chars as whitespace [1]::

    S ::= (#x20 | #x9 | #xD | #xA)+

However, only the newline is escaped when these chars are written into attributes; tab and carriage return are emitted literally, so on parsing they are converted into spaces as per attribute-value normalization [2]::

    >>> data = '\x09\x0a\x0d\x20'
    >>> data
    '\t\n\r '

    >>> import xml.etree.ElementTree as ET
    >>> e = ET.Element('x', attr=data)
    >>> s = ET.tostring(e)
    >>> s
    '<x attr="\t&#10;\r " />'

    >>> e1 = ET.fromstring(s)
    >>> data1 = e1.attrib['attr']
    >>> data1 == data
    False

    >>> data1
    ' \n  '
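
The parser side of the problem can be seen in isolation, without going through tostring(). The sketch below assumes a Python 3 interpreter and the standard expat-based parser behind ET.fromstring(); the variable names are only illustrative. It parses the same four chars once written literally and once written as character references:

```python
import xml.etree.ElementTree as ET

# Literal tab, LF and CR inside the attribute value: attribute-value
# normalization turns each of them into a single space.
literal = ET.fromstring('<x attr="\t\n\r " />').attrib["attr"]

# The same chars written as character references are not normalized
# and reach the application unchanged.
referenced = ET.fromstring('<x attr="&#9;&#10;&#13; " />').attrib["attr"]

print(repr(literal))
print(repr(referenced))
```

Only the second form round-trips, which is why the serializer has to emit character references for these chars.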

cElementTree suffers from the same bug::

    >>> import xml.etree.cElementTree as cET
    >>> cET.fromstring(cET.tostring(cET.Element('a', attr=data))).attrib['attr']
    ' \n  '

but not the external library lxml.etree::

    >>> import lxml.etree as LET
    >>> LET.fromstring(LET.tostring(LET.Element('a', attr=data))).attrib['attr']
    '\t\n\r '

The bug is analogous to #5752, but it affects a different, independent module. Proper escaping should be added to the _escape_attrib() function in /xml/etree/ElementTree.py (and the equivalent for cElementTree).
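
As a rough illustration of the kind of escaping meant here, the following is a minimal sketch and not the actual _escape_attrib() code. The helper name escape_attrib_ws is hypothetical, and the XML string is built by hand so the example does not depend on how any particular Python version serializes attributes:

```python
import xml.etree.ElementTree as ET


def escape_attrib_ws(text):
    """Hypothetical helper: escape markup chars plus tab, LF and CR."""
    # '&' is replaced first so the references added later are not re-escaped.
    replacements = [
        ("&", "&amp;"),
        ("<", "&lt;"),
        (">", "&gt;"),
        ('"', "&quot;"),
        ("\t", "&#9;"),
        ("\n", "&#10;"),
        ("\r", "&#13;"),
    ]
    for char, ref in replacements:
        text = text.replace(char, ref)
    return text


data = "\t\n\r "
xml_text = '<x attr="%s" />' % escape_attrib_ws(data)

# Parsing the hand-built document gives back the original value.
round_tripped = ET.fromstring(xml_text).attrib["attr"]
print(round_tripped == data)
```

The plain space is left alone because normalization maps it to itself, so only tab, LF and CR need a character reference.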

[1] http://www.w3.org/TR/REC-xml/#white
[2] http://www.w3.org/TR/REC-xml/#AVNormalize
History
Date                 User  Action  Args
2013-03-30 16:26:34  piro  set     recipients: + piro
2013-03-30 16:26:34  piro  set     messageid: <1364660794.69.0.883579635559.issue17582@psf.upfronthosting.co.za>
2013-03-30 16:26:34  piro  link    issue17582 messages
2013-03-30 16:26:34  piro  create