Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ElementTree attributes replace "\r" with "\n" #83192

Closed
mefistotelis mannequin opened this issue Dec 9, 2019 · 9 comments
Closed

ElementTree attributes replace "\r" with "\n" #83192

mefistotelis mannequin opened this issue Dec 9, 2019 · 9 comments
Labels
3.9 only security fixes topic-XML type-bug An unexpected behavior, bug, or error

Comments

@mefistotelis
Copy link
Mannequin

mefistotelis mannequin commented Dec 9, 2019

BPO 39011
Nosy @scoder, @serhiy-storchaka, @mefistotelis
PRs
  • bpo-39011: Preserve line endings within ElementTree attributes #18468
  • Files
  • 0001-bpo-39011-Preserve-line-endings-within-attributes.patch: Patch v1
  • 0002-bpo-39011-Test-white-space-preservation-in-attribs.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-04-12.12:52:57.755>
    created_at = <Date 2019-12-09.23:40:50.208>
    labels = ['expert-XML', 'type-bug', '3.9']
    title = 'ElementTree attributes replace "\\r" with "\\n"'
    updated_at = <Date 2020-04-12.12:52:57.755>
    user = 'https://github.com/mefistotelis'

    bugs.python.org fields:

    activity = <Date 2020-04-12.12:52:57.755>
    actor = 'scoder'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-04-12.12:52:57.755>
    closer = 'scoder'
    components = ['XML']
    creation = <Date 2019-12-09.23:40:50.208>
    creator = 'mefistotelis'
    dependencies = []
    files = ['48885', '48890']
    hgrepos = []
    issue_num = 39011
    keywords = ['patch']
    message_count = 9.0
    messages = ['358154', '358181', '358219', '358831', '361664', '361681', '361682', '361795', '366244']
    nosy_count = 4.0
    nosy_names = ['scoder', 'serhiy.storchaka', 'mefistotelis', 'nows']
    pr_nums = ['18468']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue39011'
    versions = ['Python 3.9']

    @mefistotelis
    Copy link
    Mannequin Author

    mefistotelis mannequin commented Dec 9, 2019

    TLDR:
    If I place "\r" in an Element attribute, it is handled and idiomized to " " in the XML file. But wait - \r is not really code 10, right?

    Real description:

    If I create ElementTree and read it just after creation, I'm getting what I put there - "\r". But if I save and re-load, it transforms into "\n". The character is incorrectly converted before being idiomized, and saved XML file has invalid value stored.

    Quick repro:

    # python3 -i
    Python 3.8.0 (default, Oct 25 2019, 06:23:40)  [GCC 9.2.0 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import xml.etree.ElementTree as ET
    >>> elem = ET.Element('TEST')
    >>> elem.set("Attt", "a\x0db")
    >>> tree = ET.ElementTree(elem)
    >>> with open("_test1.xml", "wb") as xml_fh:
    ...     tree.write(xml_fh, encoding='utf-8', xml_declaration=True)
    ...
    >>> tree.getroot().get("Attt")
    'a\rb'
    >>> tree = ET.parse("_test1.xml")
    >>> tree.getroot().get("Attt")
    'a\nb'
    >>>

    Related issue: https://bugs.python.org/issue5752
    (keeping this one separate as it seem to be a simple bug, easy to fix outside of the discussion there)

    If there's a good workaround - please let me know.

    Tested on Windows, v3.8 and v3.6

    @mefistotelis mefistotelis mannequin added 3.8 only security fixes topic-XML type-bug An unexpected behavior, bug, or error labels Dec 9, 2019
    @serhiy-storchaka
    Copy link
    Member

    @mefistotelis
    Copy link
    Mannequin Author

    mefistotelis mannequin commented Dec 10, 2019

    Disclaimer: I'm not at all an expert in XML specs.

    The linked spec chapter, "End-of-Line Handling", says all line endings should behave like they were converted to "\n" _before_ parsing.

    This means:

    1. This part of spec does not apply to the behavior described in the issue , because line endings are converted before the file is saved. The spec describes loading process, not saving.

    2. Before parsing, the line endings within attributes are replaced by idioms - so they are no longer line endings in the meaning assigned by the spec. The chapter starts with clear indication that it only applies to line endings which are used to give structure to physical file. The affected line endings are narrowed by stating: "files [...], for editing convenience, are organized into lines.". Since line endings in attributes are idiomized, they don't take part of organizing file into lines.

    Then again, I'm not an expert. From the various specs I worked with, I know that the affected industry always comes out with unified interpretation of specs. If it was widely accepted to apply this chapter to values of attributes, I'd understand.

    @scoder
    Copy link
    Contributor

    scoder commented Dec 23, 2019

    I think we did it wrong in bpo-17582. Parser behaviour is not a reason why the *serialisation* should modify the content.

    Luckily, fixing this does not impact the C14N serialisation (which aims to guarantee byte identical serialisation), but it changes the "normal" serialisation. I would therefore suggest that we remove the newline replacement code in the next release only, Py3.9.

    @mefistotelis, do you want to submit a PR?

    @scoder scoder added 3.9 only security fixes and removed 3.8 only security fixes labels Dec 23, 2019
    @mefistotelis
    Copy link
    Mannequin Author

    mefistotelis mannequin commented Feb 10, 2020

    Patch attached.

    I was thinking about one for() instead, but didn't wanted to introduce too large changes..

    Let me know if you would prefer something like:

        for i in (9,10,13,):
            if chr(i) not in text: continue
            text = text.replace(chr(i), "&#{:02d};".format(i))

    That would also make it easy to extend for other chars, ie. if we'd like the parser to be always able to re-read the XML we've created. Currently, placing control chars in attributes will prevent that. But I'm getting out of scope of this issue now.

    @scoder
    Copy link
    Contributor

    scoder commented Feb 10, 2020

    Your patch looks good to me. Could you please add (or adapt) the tests and then create a PR from it? You also need to write a NEWS entry for this change, and it also seems worth an entry in the "What's new" document.

    https://devguide.python.org/committing/

    @nows
    Copy link
    Mannequin

    nows mannequin commented Feb 10, 2020

    Hope it is fixed now.

    @mefistotelis
    Copy link
    Mannequin Author

    mefistotelis mannequin commented Feb 11, 2020

    I'm on it.

    Test update attached.

    @scoder
    Copy link
    Contributor

    scoder commented Apr 12, 2020

    New changeset 5fd8123 by mefistotelis in branch 'master':
    bpo-39011: Preserve line endings within ElementTree attributes (GH-18468)
    5fd8123

    @scoder scoder closed this as completed Apr 12, 2020
    @scoder scoder closed this as completed Apr 12, 2020
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes topic-XML type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants