classification
Title: ElementTree attributes replace "\r" with "\n"
Type: behavior Stage: resolved
Components: XML Versions: Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: mefistotelis, nows, scoder, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2019-12-09 23:40 by mefistotelis, last changed 2020-04-12 12:52 by scoder. This issue is now closed.

Files
File name Uploaded Description
0001-bpo-39011-Preserve-line-endings-within-attributes.patch mefistotelis, 2020-02-10 00:24 Patch v1
0002-bpo-39011-Test-white-space-preservation-in-attribs.patch mefistotelis, 2020-02-11 10:39
Pull Requests
URL Status Linked
PR 18468 merged python-dev, 2020-02-12 01:11
Messages (9)
msg358154 - (view) Author: Mefistotelis (mefistotelis) * Date: 2019-12-09 23:40
TLDR:
If I place "\r" in an Element attribute, it is handled and idiomized to "&#10;" in the XML file. But wait - \r is not really code 10, right?

Real description:

If I create an ElementTree and read it right after creation, I get what I put there: "\r". But if I save and re-load it, the value turns into "\n". The character is incorrectly converted before being idiomized, so the saved XML file stores the wrong value.

Quick repro:

# python3 -i
Python 3.8.0 (default, Oct 25 2019, 06:23:40)  [GCC 9.2.0 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.ElementTree as ET
>>> elem = ET.Element('TEST')
>>> elem.set("Attt", "a\x0db")
>>> tree = ET.ElementTree(elem)
>>> with open("_test1.xml", "wb") as xml_fh:
...     tree.write(xml_fh, encoding='utf-8', xml_declaration=True)
...
>>> tree.getroot().get("Attt")
'a\rb'
>>> tree = ET.parse("_test1.xml")
>>> tree.getroot().get("Attt")
'a\nb'
>>>
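
The same round trip can be reproduced without touching the file system. This is only a sketch: roundtrip_attr is a hypothetical helper name, not part of ElementTree, and the result depends on the interpreter version (the change from this issue shipped in Python 3.9; older versions are expected to return "\n" in place of "\r").

```python
import io
import sys
import xml.etree.ElementTree as ET


def roundtrip_attr(value):
    """Serialize one attribute value to bytes, then parse it back.

    Hypothetical helper for illustration; returns the raw XML bytes and
    the attribute value as seen after re-parsing.
    """
    elem = ET.Element("TEST")
    elem.set("Attt", value)
    buf = io.BytesIO()
    ET.ElementTree(elem).write(buf, encoding="utf-8", xml_declaration=True)
    raw = buf.getvalue()
    return raw, ET.fromstring(raw).get("Attt")


raw, value = roundtrip_attr("a\rb")
# Print the interpreter version next to the result, since the outcome
# differs before and after the fix.
print(sys.version_info[:2], raw, repr(value))
```

Looking at the raw bytes shows which character reference the serializer actually wrote for the attribute.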

Related issue: https://bugs.python.org/issue5752
(keeping this one separate as it seems to be a simple bug, easy to fix outside of the discussion there)

If there's a good workaround - please let me know.

Tested on Windows, v3.8 and v3.6
msg358181 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-12-10 11:14
See https://www.w3.org/TR/REC-xml/#sec-line-ends.
msg358219 - (view) Author: Mefistotelis (mefistotelis) * Date: 2019-12-10 20:04
Disclaimer: I'm not at all an expert in XML specs.

The linked spec chapter, "End-of-Line Handling", says all line endings should behave as if they were converted to "\n" _before_ parsing.

This means:

1. This part of the spec does not apply to the behavior described in the issue, because the line endings are converted before the file is saved. The spec describes the loading process, not saving.

2. Before parsing, the line endings within attributes are replaced by idioms, so they are no longer line endings in the meaning assigned by the spec. The chapter starts with a clear indication that it only applies to line endings which are used to give structure to the physical file. The affected line endings are narrowed by stating: "files [...], for editing convenience, are organized into lines.". Since line endings in attributes are idiomized, they take no part in organizing the file into lines.
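
The distinction between the two points can be checked directly against the parser. A minimal sketch, assuming the expat-based parser that ElementTree uses by default: a literal carriage return inside an attribute value is subject to normalization on load, while the character reference "&#13;" is decoded to the referenced character and kept.

```python
import xml.etree.ElementTree as ET

# Literal CR byte inside the attribute value: normalized while loading.
literal = ET.fromstring(b'<TEST Attt="a\rb"/>')

# CR written as a character reference: not a physical line ending,
# so the parser hands back the referenced character itself.
reference = ET.fromstring(b'<TEST Attt="a&#13;b"/>')

print(repr(literal.get("Attt")))
print(repr(reference.get("Attt")))
```

So a serializer that writes "&#13;" for "\r" lets the value survive a reload, whereas writing "&#10;" changes it.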


Then again, I'm not an expert. From the various specs I have worked with, I know that the affected industry always settles on a unified interpretation of a spec. If it were widely accepted to apply this chapter to values of attributes, I'd understand.
msg358831 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-12-23 19:19
I think we did it wrong in issue 17582. Parser behaviour is not a reason why the *serialisation* should modify the content.

Luckily, fixing this does not impact the C14N serialisation (which aims to guarantee byte identical serialisation), but it changes the "normal" serialisation. I would therefore suggest that we remove the newline replacement code in the next release only, Py3.9.
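
For comparison, the C14N path can be exercised through ET.canonicalize (available since Python 3.8). This is only an illustration of that separate code path, not part of the proposed change; the input document is made up for the example.

```python
import xml.etree.ElementTree as ET

# The input already carries CR as a character reference, so the parser
# sees a real "\r" in the attribute value before canonicalization.
canonical = ET.canonicalize('<TEST Attt="a&#13;b"/>')
print(canonical)

# Re-parse the canonical form to confirm the value is unchanged.
print(repr(ET.fromstring(canonical).get("Attt")))
```

The canonical form uses a hexadecimal character reference for the carriage return, which is independent of the decimal references written by the regular serializer.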

@mefistotelis, do you want to submit a PR?
msg361664 - (view) Author: Mefistotelis (mefistotelis) * Date: 2020-02-10 00:24
Patch attached.

I was thinking about a single for loop instead, but didn't want to introduce too large a change.

Let me know if you would prefer something like:

    for i in (9,10,13,):
        if chr(i) not in text: continue
        text = text.replace(chr(i), "&#{:02d};".format(i))

That would also make it easy to extend to other chars, i.e. if we'd like the parser to always be able to re-read the XML we've created. Currently, placing control chars in attributes prevents that. But I'm getting out of scope of this issue now.
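
The loop variant can be tried in isolation. A minimal sketch, assuming a standalone function: escape_attrib_whitespace is a hypothetical name, it is not the function patched in ElementTree, and it deliberately skips the other escapes a real serializer needs ("&", "<", '"').

```python
import xml.etree.ElementTree as ET


def escape_attrib_whitespace(text):
    """Replace tab, LF and CR with decimal character references.

    Hypothetical helper for illustration only; it does not escape
    "&", "<" or '"', so it is not a complete attribute escaper.
    """
    for code in (9, 10, 13):
        char = chr(code)
        if char in text:
            text = text.replace(char, "&#{:02d};".format(code))
    return text


original = "a\tb\nc\rd"
escaped = escape_attrib_whitespace(original)
print(escaped)

# Feed the escaped value back through the parser to check the round trip.
reparsed = ET.fromstring('<TEST Attt="{}"/>'.format(escaped)).get("Attt")
print(repr(reparsed))
```

Each of the three characters comes back unchanged because it is written as its own character reference rather than being folded into "\n" first.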
msg361681 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2020-02-10 12:13
Your patch looks good to me. Could you please add (or adapt) the tests and then create a PR from it? You also need to write a NEWS entry for this change, and it also seems worth an entry in the "What's new" document.

https://devguide.python.org/committing/
msg361682 - (view) Author: Reece Johnson (nows) Date: 2020-02-10 12:20
Hope it is fixed now.
msg361795 - (view) Author: Mefistotelis (mefistotelis) * Date: 2020-02-11 10:39
I'm on it.

Test update attached.
msg366244 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2020-04-12 12:52
New changeset 5fd8123dfdf6df0a9c29363c8327ccfa0c1d41ac by mefistotelis in branch 'master':
bpo-39011: Preserve line endings within ElementTree attributes (GH-18468)
https://github.com/python/cpython/commit/5fd8123dfdf6df0a9c29363c8327ccfa0c1d41ac
History
Date User Action Args
2020-04-12 12:52:57  scoder  set  status: open -> closed
    resolution: fixed
    stage: patch review -> resolved
2020-04-12 12:52:18  scoder  set  messages: + msg366244
2020-02-12 01:11:10  python-dev  set  stage: needs patch -> patch review
    pull_requests: + pull_request17838
2020-02-11 10:39:02  mefistotelis  set  files: + 0002-bpo-39011-Test-white-space-preservation-in-attribs.patch
    messages: + msg361795
2020-02-10 12:20:52  nows  set  nosy: + nows
    messages: + msg361682
2020-02-10 12:13:40  scoder  set  messages: + msg361681
2020-02-10 00:24:28  mefistotelis  set  files: + 0001-bpo-39011-Preserve-line-endings-within-attributes.patch
    keywords: + patch
    messages: + msg361664
2019-12-23 19:19:53  scoder  set  stage: needs patch
    messages: + msg358831
    versions: + Python 3.9, - Python 3.6, Python 3.8
2019-12-10 20:04:34  mefistotelis  set  messages: + msg358219
2019-12-10 11:14:09  serhiy.storchaka  set  nosy: + serhiy.storchaka
    messages: + msg358181
2019-12-10 01:40:25  rhettinger  set  nosy: + scoder
2019-12-09 23:40:50  mefistotelis  create