Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xml.dom.minidom does not escape CR, LF and TAB characters within attribute values #50002

Closed
Tomalak mannequin opened this issue Apr 14, 2009 · 35 comments
Closed

xml.dom.minidom does not escape CR, LF and TAB characters within attribute values #50002

Tomalak mannequin opened this issue Apr 14, 2009 · 35 comments
Labels
easy topic-XML type-bug An unexpected behavior, bug, or error

Comments

@Tomalak
Copy link
Mannequin

Tomalak mannequin commented Apr 14, 2009

BPO 5752
Nosy @amauryfa, @devdanzin, @dvarrazzo, @ezio-melotti, @wking, @mitar, @moriyoshi
Files
  • minidom_test.py: A small test script that outlines the problem
  • test_toxml.py: New test script to demonstrate that the patch proposed works
  • test_multiline_roundtrip.diff: Tomalak's minidom_test.py as unittest, patch for test_minidom
  • minidom.patch: Proposed patch to minidom.py
  • minidom.diff: Old file before new file in patch.
  • minidom.py: monkey patch module applying minidom.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2009-04-14.11:13:06.464>
    labels = ['expert-XML', 'easy', 'type-bug']
    title = 'xml.dom.minidom does not escape CR, LF and TAB characters within attribute values'
    updated_at = <Date 2019-06-22.20:28:40.662>
    user = 'https://bugs.python.org/Tomalak'

    bugs.python.org fields:

    activity = <Date 2019-06-22.20:28:40.662>
    actor = 'mitar'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['XML']
    creation = <Date 2009-04-14.11:13:06.464>
    creator = 'Tomalak'
    dependencies = []
    files = ['13921', '13960', '13966', '13977', '18184', '18325']
    hgrepos = []
    issue_num = 5752
    keywords = ['patch', 'easy']
    message_count = 33.0
    messages = ['85964', '87178', '87332', '87333', '87351', '87429', '87431', '87432', '87526', '87527', '87528', '87564', '87586', '87590', '87604', '87609', '87676', '87679', '87681', '90548', '90676', '110999', '111493', '111642', '112497', '112570', '119728', '142964', '185398', '185408', '185419', '185423', '185576']
    nosy_count = 13.0
    nosy_names = ['effbot', 'amaury.forgeotdarc', 'ajaksu2', 'piro', 'ezio.melotti', 'sayap', 'Tomalak', 'devon', 'labrat', 'mitar', 'moriyoshi', 'gauges', 'Jens.Grivolla']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue5752'
    versions = ['Python 2.6', 'Python 3.1', 'Python 2.7', 'Python 3.2']

    Linked PRs

    @Tomalak
    Copy link
    Mannequin Author

    Tomalak mannequin commented Apr 14, 2009

    Current behavior upon toxml() is:

    <foo attribute="multiline
    value" />

    Upon reading the document again, the new line is normalized and
    collapsed into a space (according to the XML spec, section 3.3.3), which
    means that it is lost.

    Better behavior would be something like this (within attribute values only):

    <foo attribute="multiline value" />

    @Tomalak Tomalak mannequin added topic-XML type-bug An unexpected behavior, bug, or error labels Apr 14, 2009
    @devdanzin devdanzin mannequin added the easy label Apr 22, 2009
    @sechifrancesco
    Copy link
    Mannequin

    sechifrancesco mannequin commented May 4, 2009

    Ok, I've tried to solve this problem, but I think that the keyword
    'easy' is not suitable for this kind of task, because it is necessary to
    modify the expat module that is very complex.

    @Tomalak
    Copy link
    Mannequin Author

    Tomalak mannequin commented May 6, 2009

    @francesco Sechi: Would it not just require a minimal change to the
    _write_data() method? Something along the lines of (sorry, no Python
    expert, maybe I am way off):

    def _write_data(writer, data, is_attrib=False):
        "Writes datachars to writer."
        if is_attrib: 
            data = data.replace("\r", "&#13;").replace("\n", "&#10;")
        data = data.replace("&", "&amp;").replace("<", "&lt;")
        data = data.replace("\"", "&quot;").replace(">", "&gt;")
        writer.write(data)

    and in Element.writexml():

        #[...]
        for a_name in a_names:
            writer.write(" %s=\"" % a_name)
            _write_data(writer, attrs[a_name].value, True)
        #[...]

    @Tomalak
    Copy link
    Mannequin Author

    Tomalak mannequin commented May 6, 2009

    Of course it should be:

    def _write_data(writer, data, is_attrib=False):
        "Writes datachars to writer."
        data = data.replace("&", "&amp;").replace("<", "&lt;")
        data = data.replace("\"", "&quot;").replace(">", "&gt;")
        if is_attrib: 
            data = data.replace("\r", "&#13;").replace("\n", "&#10;")
        writer.write(data)

    @sechifrancesco
    Copy link
    Mannequin

    sechifrancesco mannequin commented May 6, 2009

    Don't worry, I'm a newer too.
    No, your solution does not work, because the method you refer
    (_write_data) is called by the toxml() function, but the newline is
    replaced with a whitespace by the parsestring() function. The
    parsestring function, as I already said, refers to the 'expat' module,
    that is very complex (for me).

    @Tomalak
    Copy link
    Mannequin Author

    Tomalak mannequin commented May 8, 2009

    Hmm... I thought toxml() is the part that needs to be fixed, not the
    parsing/reading. I mentioned the reading only to outline the data loss
    that occurs eventually.

    My point is: The toxml() (i.e. _write_data) *actually writes* the
    newline to the output. And within parameters, it just shouldn't.

    @Tomalak
    Copy link
    Mannequin Author

    Tomalak mannequin commented May 8, 2009

    Attaching a patch that fixes the problem.

    @Tomalak
    Copy link
    Mannequin Author

    Tomalak mannequin commented May 8, 2009

    Attaching a test file that outlines the problem. Output on my system
    (Windows / Python 3.0) is:

    Without the patch:
    C:\Python30>python.exe c:\minidom_test.py
    False
    1 -->"multiline
    value"
    2 -->"multiline value"

    With the patch:
    C:\Python30>python.exe c:\minidom_test.py
    True
    1 -->"multiline
    value"
    2 -->"multiline
    value"

    @sechifrancesco
    Copy link
    Mannequin

    sechifrancesco mannequin commented May 10, 2009

    I think that the problem is: the xmldoc1 has the newline or not? If it
    hasn't your patch works only in the particular case you pass a toxml
    return value to 'parsestring'. If I pass an XML string with newlines it
    doesn't work. So your solution is not generic and cannot be considered a
    patch for the issue you proposed.

    @sechifrancesco
    Copy link
    Mannequin

    sechifrancesco mannequin commented May 10, 2009

    I try to explain better what is my opinion:

    • If you add the attribute by using setAttribute the newlines are kept
      and the toxml works well

    • If you add the attribute by using the parsestring, passing it an XML
      string the newlines are replaced

    • Your patch works only if you act on a well-constructed (i.e.with
      newlines kept in internal data structures) xml.dom.minidom.Document
      object only, so...

    • If you try to execute your patched toxml method of a
      xml.dom.minidom.Document constructed using parsestring passing it a
      string with newline it does not work.

    So your patch works only in a specific case: you are trying to fix a
    problem in parsestring, acting on its actual parameter.

    @Tomalak
    Copy link
    Mannequin Author

    Tomalak mannequin commented May 10, 2009

    Francesco, I think you are missing the point. :-) The problem has two sides.

    If I create an XML document using the DOM (not by parsing it from a
    string!), then I can put newline characters into attribute value. This
    is allowed and conforms to the XML spec.

    However, *literal* newlines in an attribute value (i.e. when the
    document is parsed from a string) have no meaning. The parser treats
    them as if they were insignificant whitespace -- they are converted to a
    single space. This is also valid and conforms to the XML spec.

    The catch: This leads to an actual data loss if I *wanted* to store
    newline characters in an attribute -- unless the newline characters are
    properly encoded. Encoding the newline characters is also valid and
    conforms to the spec, so the DOM implementation should do it.

    In other words - the parsing process you refer to is actually working
    fine. If an attribute contains a literal newline, it is indeed okay to
    collapse it into a space. It's only the document serializing that is broken.

    Minidom is clearly missing functionality here, and it does not conform
    to the XML spec. If I store a string of data in an XML document, it must
    be ensured that upon reading the document again, I get the *same* data
    back. This is what I check with my test script.

    @Tomalak Tomalak mannequin changed the title xml.dom.minidom does not handle newline characters in attribute values xml.dom.minidom does not escape newline characters within attribute values May 10, 2009
    @sechifrancesco
    Copy link
    Mannequin

    sechifrancesco mannequin commented May 11, 2009

    All right, now I understand, thanks. But I think that, for internal
    class coherence, it is necessary not to modify toxml method, but the
    'setAttribute' one, because this is the source of the problem.

    @sechifrancesco
    Copy link
    Mannequin

    sechifrancesco mannequin commented May 11, 2009

    A solution for this issue could be to replace the setAttribute method as
    follow:

    • d["value"] = d["nodeValue"] = value
      + d["value"] = d["nodeValue"] = value.replace('\n',' ')

    NOTE: I didn't do a patch, because I don't know which python version you
    are using.

    Please try this solution and give me a feedback, thanks.

    @sechifrancesco
    Copy link
    Mannequin

    sechifrancesco mannequin commented May 11, 2009

    I have uploaded a test script that shows that, without my patch, the
    methods setAttribute and parseString work differently; adding my patch,
    the behaviour is symmetric.

    @devdanzin
    Copy link
    Mannequin

    devdanzin mannequin commented May 12, 2009

    Francesco,
    Your patch still doesn't allow one to add a multiline attribute values
    as Tomalak describes:
    "The catch: This leads to an actual data loss if I *wanted* to store
    newline characters in an attribute -- unless the newline characters are
    properly encoded. Encoding the newline characters is also valid and
    conforms to the spec, so the DOM implementation should do it."

    I'm not sure whether the proposed behavior is correct or desirable. Even
    if it is correct, it might introduce backwards incompatible changes in
    behavior.

    Here's a test case for trunk.

    @sechifrancesco
    Copy link
    Mannequin

    sechifrancesco mannequin commented May 12, 2009

    My position is:
    if you want to encode the newline character, this should be done by both
    parseString and setAttribute methods. Otherwise, the behaviour is not
    symmetric.
    My patch translates the newline character with a whitespace in the
    setAttribute method, because parseString already does it. If you want to
    encode the newline in different manner, you should develop a patch that
    introduces this kind of encoding in both parseString and setAttribute
    methods.

    @Tomalak
    Copy link
    Mannequin Author

    Tomalak mannequin commented May 13, 2009

    Francesco,

    if you want to encode the newline character,
    this should be done by both parseString and
    setAttribute methods. Otherwise, the
    behaviour is not symmetric.

    I believe you still don't see the issue. The behaviour is not symmetric
    *now*. You store a '\n' in an attribute value with setAttribute(), save
    the document to XML, load it again and out comes a space where the '\n'
    should have been.

    The point is that parseString() behaves correctly, but serializing does
    not. There is only one side to fix, because only one side is broken.

    If you want to encode the newline in different
    manner, you should develop a patch that
    introduces this kind of encoding in both
    parseString and setAttribute methods.

    It would be pointless to do the encoding in setAttribute(). The valid
    ways to XML-encode a '\n' character are '&#xA', '&#x0A' or '&#10'. Doing
    so in setAttribute() would produce doubly encoded output, like this:
    '&#10'. This is even more wrong.

    However, if parseString() encounters a '&#10' in the input, it correctly
    translates this to '\n' in the DOM. As I said, there is nothing to fix
    in parsing, this exercise is about getting minidom to actually *output*
    a ' ' where appropriate. :-)

    @Tomalak
    Copy link
    Mannequin Author

    Tomalak mannequin commented May 13, 2009

    Daniel Diniz:

    The proposed behaviour is correct:
    http://www.w3.org/TR/2000/WD-xml-c14n-20000119.html#charescaping

    "In attribute values, the character information items
    TAB (#x9), newline (#xA), and carriage-return (#xD)
    are represented by " ", " ", and " " respectively."

    Since the behaviour is correct, it is also desirable. :-)

    I don't think that this change could cause existing solution to break
    since the current inconsistency in handling these characters make it
    impossible to rely on this anyway.

    Thanks for putting up the unit test diff.

    @Tomalak
    Copy link
    Mannequin Author

    Tomalak mannequin commented May 13, 2009

    I changed the patch to include support for TAB characters, which were
    also left unencoded before.

    Also I switched encoding from ' ' etc. to ' '. This is
    equivalent, but the spec uses the latter variant.

    @Tomalak Tomalak mannequin changed the title xml.dom.minidom does not escape newline characters within attribute values xml.dom.minidom does not escape CR, LF and TAB characters within attribute values May 13, 2009
    @devon
    Copy link
    Mannequin

    devon mannequin commented Jul 15, 2009

    see also a similar issue in etree: bpo-6492

    @Tomalak
    Copy link
    Mannequin Author

    Tomalak mannequin commented Jul 18, 2009

    @devon: Thanks for pointing & linking back here.

    @BreamoreBoy
    Copy link
    Mannequin

    BreamoreBoy mannequin commented Jul 21, 2010

    Patched test_minidom and ran it test failed. Went to patch minidom.py and it appears up to date, so no idea why the test failed, can someone please take a look as it's 04:30 BST, thanks.

    @BreamoreBoy
    Copy link
    Mannequin

    BreamoreBoy mannequin commented Jul 24, 2010

    minidom.patch had the new file listed before the old, so I've uploaded minidom.diff. The patch is tiny and looks clean. Tests have been repeated on Windows against 2.7 and are fine, so I believe this can be committed.

    @ezio-melotti
    Copy link
    Member

    bpo-7139 has been closed as duplicate of this and contains a few messages that might be worth reading.

    @wking
    Copy link
    Mannequin

    wking mannequin commented Aug 2, 2010

    As a workaround until the patch gets included, you can import this monkey patch module.

    @wking
    Copy link
    Mannequin

    wking mannequin commented Aug 3, 2010

    And while we're at it, we should also

    .replace('&', '&').replace('"', """).replace('<', '<')

    which would have to go at the beginning to avoid double-escaping the '&'.

    We could use xml.sax.saxutils.escape to do all the escaping rather than chaining replaces:

       data = escape(data, {'"':'&quot;', '\r':'&#xD;', '\n':'&#xA;', '\t':'&#x9;'})

    which also escapes '>' (not strictly required for attribute values, but shouldn't be harmful either).

    @gauges
    Copy link
    Mannequin

    gauges mannequin commented Oct 27, 2010

    I tried to apply the minidom.diff patch below, but it seems that removing the two lines that replace the "<" and ">" characters is not a good idea.
    It least in some files I now suddenly get "<" and ">" characters in text elements where there should be > and <

    At least the part with the tabs seems to work now and if I add the two lines with the replace calls that got deleted by the patch, everything seems fine.

    @sayap
    Copy link
    Mannequin

    sayap mannequin commented Aug 25, 2011

    Just want to mention that until the patch get included, it will be impossible to use the standard library to generate a working BCP (Bulk Copy Program) XML format file for SQL Server, which always requires a TERMINATOR="\r\n" or TERMINATOR=" " attribute.

    @dvarrazzo
    Copy link
    Mannequin

    dvarrazzo mannequin commented Mar 28, 2013

    Just got bitten by this bug, which affects xml.etree.ElementTree and cElementTree too. Any chance to have it fixed?

    Note that lxml.etree is not affected by the bug and can be used as a replacement for the stdlib module if you happen to have some work to do.

    @amauryfa
    Copy link
    Member

    minidom may be broken, but what's the issue with ElementTree?

    >>> import xml.etree.cElementTree as etree
    >>> doc = etree.fromstring('<xml />')
    >>> doc.set('attr', "multiline\nvalue")
    >>> etree.tostring(doc)
    '<xml attr="multiline&#10;value" />'

    @dvarrazzo
    Copy link
    Mannequin

    dvarrazzo mannequin commented Mar 28, 2013

    ElementTree issue is with tabs:

    In [1]: import xml.etree.cElementTree as etree
    
    In [2]: doc = etree.fromstring('<xml />')
    
    In [4]: doc.set('attr', "here\tthere")
    
    In [5]: etree.tostring(doc)
    Out[5]: '<xml attr="here\tthere" />'
    
    In [6]: etree.fromstring(_5).attrib['attr']
    Out[6]: 'here there'
    

    bye bye tab.

    @dvarrazzo
    Copy link
    Mannequin

    dvarrazzo mannequin commented Mar 28, 2013

    I was going to open an issue on itself about etree and tabs, but I've found this and thought it would have been marked as duplicate. If you don't think it's the case I will open it anyway.

    I've added my comment because ElementTree is not reported in this page at all: I've been googling forever about "ElementTree tabs attributes" without result. I've found this (and the network of duplicates) only searching for less generic "xml tabs" into the python bug tracker, when I was already filing the bug.

    @dvarrazzo
    Copy link
    Mannequin

    dvarrazzo mannequin commented Mar 30, 2013

    Added separate issue bpo-17582 as ElementTree implementation doesn't depend on whatever causes the bug in xml.dom.minidom.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @jbaker6953
    Copy link

    I had to make an updated version of this monkey patch for a project I was working on. I leave it here for anyone's reference. The old instance method overwrite was bombing out with TypeError: function() argument 'code' must be code, not function so I just changed it for my needs. It works well with my data set.

    https://gist.github.com/jbaker6953/deb9a8e4eae1e622f467fc9b4edf11db

    serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Aug 14, 2023
    Also double quotes (") are now only quoted in attributes.
    serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Aug 14, 2023
    Also double quotes (") are now only quoted in attributes.
    serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Aug 14, 2023
    Also double quotes (") are now only quoted in attributes.
    serhiy-storchaka added a commit that referenced this issue Aug 23, 2023
    …-107947)
    
    Also double quotes (") are now only quoted in attributes.
    @hugovk
    Copy link
    Member

    hugovk commented Nov 10, 2023

    Looks like this has been fixed by #107947.

    @hugovk hugovk closed this as completed Nov 10, 2023
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    easy topic-XML type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants