xml.etree.ElementTree does not preserve whitespaces in attributes #61782

Closed
dvarrazzo mannequin opened this issue Mar 30, 2013 · 14 comments
Assignees
Labels
3.7 (EOL) end of life easy stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@dvarrazzo
Mannequin

dvarrazzo mannequin commented Mar 30, 2013

BPO 17582
Nosy @rhettinger, @scoder, @dvarrazzo, @benjaminp, @skrah
Files
  • 17582-etree-whitespace.patch: Patch for ElementTree
  • 17582-etree-whitespace-test.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2016-09-12.06:24:55.291>
    created_at = <Date 2013-03-30.16:26:34.661>
    labels = ['3.7', 'easy', 'type-bug', 'library']
    title = 'xml.etree.ElementTree does not preserve whitespaces in attributes'
    updated_at = <Date 2020-04-12.12:55:44.283>
    user = 'https://github.com/dvarrazzo'

    bugs.python.org fields:

    activity = <Date 2020-04-12.12:55:44.283>
    actor = 'scoder'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2016-09-12.06:24:55.291>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2013-03-30.16:26:34.661>
    creator = 'piro'
    dependencies = []
    files = ['36929', '40346']
    hgrepos = []
    issue_num = 17582
    keywords = ['patch', 'easy']
    message_count = 14.0
    messages = ['185574', '228569', '229272', '229398', '229399', '240525', '249707', '249932', '250005', '275969', '275972', '275973', '357113', '366245']
    nosy_count = 10.0
    nosy_names = ['rhettinger', 'scoder', 'piro', 'benjamin.peterson', 'duaneg', 'eli.bendersky', 'skrah', 'python-dev', 'lwcolton', 'Emiliano Heyns']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'needs patch'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue17582'
    versions = ['Python 3.7']

    @dvarrazzo
    Mannequin Author

    dvarrazzo mannequin commented Mar 30, 2013

    XML defines the following chars as whitespace [1]::

    S ::= (#x20 | #x9 | #xD | #xA)+
    

    However, these characters are not properly escaped in attribute values, so on parsing they are converted into spaces as per attribute-value normalization [2]::

        >>> data = '\x09\x0a\x0d\x20'
        >>> data
        '\t\n\r '
    
        >>> import xml.etree.ElementTree as ET
        >>> e = ET.Element('x', attr=data)
        >>> s = ET.tostring(e)
        >>> s
        '<x attr="\t&#10;\r " />'
    
        >>> e1 = ET.fromstring(s)
        >>> data1 = e1.attrib['attr']
        >>> data1 == data
        False
    
        >>> data1
        ' \n  '

    cElementTree suffers from the same bug::

        >>> import xml.etree.cElementTree as cET
        >>> cET.fromstring(cET.tostring(cET.Element('a', attr=data))).attrib['attr']
        ' \n  '

    but not the external library lxml.etree::

        >>> import lxml.etree as LET
        >>> LET.fromstring(LET.tostring(LET.Element('a', attr=data))).attrib['attr']
        '\t\n\r '

    The bug is analogous to bpo-5752, but it affects a different and independent module. Proper escaping should be added to the _escape_attrib() function in /xml/etree/ElementTree.py (and the equivalent for cElementTree).

    [1] http://www.w3.org/TR/REC-xml/#white
    [2] http://www.w3.org/TR/REC-xml/#AVNormalize
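
A minimal sketch of the kind of escaping requested above, assuming the whitespace characters are written as numeric character references. The helper name `escape_attr_value` is hypothetical and is not the actual `_escape_attrib()` implementation; it only illustrates that character references survive parsing, whereas literal whitespace is normalized.

```python
import xml.etree.ElementTree as ET


def escape_attr_value(text):
    """Hypothetical helper: escape markup and whitespace for an attribute value."""
    # "&" must be handled first so later replacements are not double-escaped.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace('"', "&quot;")
    # Character references are not subject to attribute-value normalization.
    text = text.replace("\t", "&#9;")
    text = text.replace("\n", "&#10;")
    text = text.replace("\r", "&#13;")
    return text


data = "\t\n\r "
# Build the document by hand so the result does not depend on ET.tostring().
xml_text = '<x attr="%s" />' % escape_attr_value(data)
round_tripped = ET.fromstring(xml_text).attrib["attr"]
print(repr(xml_text))
print(round_tripped == data)
```

The document is assembled manually here so that the outcome depends only on the parser, not on which Python version's serializer is in use.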

    @dvarrazzo dvarrazzo mannequin added topic-XML stdlib Python modules in the Lib dir labels Mar 30, 2013
    @BreamoreBoy BreamoreBoy mannequin added the type-bug An unexpected behavior, bug, or error label Oct 1, 2014
    @scoder
    Contributor

    scoder commented Oct 5, 2014

    Proper escaping should be added to the _escape_attrib() function in /xml/etree/ElementTree.py (and the equivalent for cElementTree).

    Agreed. Can you provide a patch?

    @dvarrazzo
    Mannequin Author

    dvarrazzo mannequin commented Oct 13, 2014

    No, I cannot. I take the fact that there has been no answer for more than 18 months as an acknowledgement that the issue is not deemed important by the Python maintainers; it's not important for me either. I'm not a heavy XML user: just knowing that the Python XML libraries are unreliable and that by default I should use lxml is a sufficient solution for my sporadic XML use. Your mileage may vary.

    @lwcolton
    Mannequin

    lwcolton mannequin commented Oct 15, 2014

    Here is a patch. Please note that in your example \r is replaced by \n, per section 2.11 of the XML specification: http://www.w3.org/TR/REC-xml/#sec-line-ends
    Also, the patch is only for ElementTree; I will investigate cElementTree, but no promises.
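
The line-end handling from section 2.11 can be observed on the parsing side. The following is a small sketch, not part of the patch, showing that a literal carriage return in element text is read back as a newline, while one written as a character reference is kept. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# A literal "\r" (or "\r\n") in the input is normalized to "\n" by the parser.
literal_cr = ET.fromstring("<x>a\rb</x>").text
literal_crlf = ET.fromstring("<x>a\r\nb</x>").text

# A carriage return written as a character reference is not normalized.
escaped_cr = ET.fromstring("<x>a&#13;b</x>").text

print(repr(literal_cr), repr(literal_crlf), repr(escaped_cr))
```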

    @lwcolton
    Mannequin

    lwcolton mannequin commented Oct 15, 2014

    I just realized: does this mean lxml.etree would now be the offender, for not following 2.11 and leaving the \r as-is?

    @benjaminp
    Contributor

    The patch seems reasonable, though it needs a test.

    @duaneg
    Mannequin

    duaneg mannequin commented Sep 4, 2015

    Here is a patch with a unit test for the new escaping functionality. I believe it covers all the new cases. Additional code is not required for cElementTree as the serialisation code is all Python.

    @rhettinger
    Contributor

    Stefan, can you opine on the patches and whether they should be backported?

    @scoder
    Contributor

    scoder commented Sep 6, 2015

    Patch and test look correct. They fix a bug that produces incorrect output, so I vote for backporting them. Most code won't see the difference as whitespace control characters are rare in attribute values. And code that uses them will benefit from correctness. Obviously, there might also be breakage in the rare case that code puts control characters into attribute values and expects them to disappear magically, but then it's the user code that is wrong here.

    The only issue is that serialisation is slow already and this change slows it down a bit more. Every attribute value will now be searched 8 times instead of 5. I added a minor review comment that would normally reduce it to 7. timeit suggests to me that the overall overhead is still tiny, though, and thus acceptable:

    $ python3.5 -m timeit -s "s = 'askhfalsdhfashldfsadf'" "'\n' in s"
    10000000 loops, best of 3: 0.0383 usec per loop
    
    $ python3.5 -m timeit -s "s = 'askhfalsdhfashldfsadf'" "s.replace('\n', 'y')"
    10000000 loops, best of 3: 0.151 usec per loop
    
    $ python3.5 -m timeit -s "s = 'askhfalsdhfashldfsadf'; rep=s.replace" "rep('\n', 'y')"
    10000000 loops, best of 3: 0.12 usec per loop
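
The same comparison can be run from Python rather than the shell. This is a rough sketch using the standard `timeit` module; the loop count and the `timings` dictionary are arbitrary choices for illustration, and the absolute numbers will differ by machine and Python version.

```python
import timeit

setup = "s = 'askhfalsdhfashldfsadf'"
number = 100_000  # far fewer loops than the shell runs above, to keep it quick

timings = {
    # Membership test only.
    "in": timeit.timeit("'\\n' in s", setup=setup, number=number),
    # Unconditional replace via attribute lookup.
    "replace": timeit.timeit("s.replace('\\n', 'y')", setup=setup, number=number),
    # Replace through a pre-bound method, avoiding the attribute lookup.
    "bound replace": timeit.timeit(
        "rep('\\n', 'y')", setup=setup + "; rep = s.replace", number=number
    ),
}

for label, total in timings.items():
    print("%-14s %.4f usec per loop" % (label, total / number * 1e6))
```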

    @kesara kesara mannequin added the 3.7 (EOL) end of life label Sep 12, 2016
    @rhettinger rhettinger assigned rhettinger and skrah and unassigned rhettinger Sep 12, 2016
    @scoder
    Contributor

    scoder commented Sep 12, 2016

    Raymond, you might have meant me when assigning the ticket and not Stefan Krah, but since I'm actually not a core dev, I can't commit the patch myself.

    See my last comment, though, I reviewed the patch and it should get committed.

    @rhettinger rhettinger assigned rhettinger and unassigned skrah Sep 12, 2016
    @python-dev
    Mannequin

    python-dev mannequin commented Sep 12, 2016

    New changeset 0a5596315cf0 by Raymond Hettinger in branch '3.5':
    Issue bpo-17582: xml.etree.ElementTree nows preserves whitespaces in attributes
    https://hg.python.org/cpython/rev/0a5596315cf0

    @rhettinger
    Contributor

    Done.

    @EmilianoHeyns
    Mannequin

    EmilianoHeyns mannequin commented Nov 20, 2019

    I don't see newlines currently preserved in attributes:

       from io import StringIO
       import xml.etree.ElementTree as ET
       elem = ET.parse(StringIO('<test a="   \nab\n    "/>')).getroot()
       print(ET.tostring(elem))
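
One thing worth checking in a case like this is what the parser hands back before any serialization happens. The sketch below, which is an illustration and not taken from the thread, suggests that literal newlines in an attribute are already turned into spaces at parse time by attribute-value normalization, so there is no newline left for `tostring()` to preserve. A newline written as `&#10;` does survive parsing.

```python
import xml.etree.ElementTree as ET
from io import StringIO

# Literal newlines inside the attribute value, as in the report above.
source = '<test a="   \nab\n    "/>'
elem = ET.parse(StringIO(source)).getroot()
parsed_value = elem.get("a")

# The same value with the newlines written as character references.
escaped_elem = ET.fromstring('<test a="   &#10;ab&#10;    "/>')
escaped_value = escaped_elem.get("a")

print(repr(parsed_value))
print(repr(escaped_value))
```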

    @EmilianoHeyns EmilianoHeyns mannequin removed the topic-XML label Nov 20, 2019
    @scoder
    Contributor

    scoder commented Apr 12, 2020

    Also see the later fix in bpo-39011, where the EOL normalisation in attribute text was removed again. This change was applied in Py3.9.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022