minidom does not encode correctly when calling Document.writexml #63111

Closed
brianvanderburg2 mannequin opened this issue Sep 3, 2013 · 9 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes docs Documentation in the Doc dir easy topic-XML type-bug An unexpected behavior, bug, or error

Comments

@brianvanderburg2
Mannequin

brianvanderburg2 mannequin commented Sep 3, 2013

BPO 18911
Nosy @scoder, @ezio-melotti, @serhiy-storchaka, @Windsooon
PRs
  • bpo-18911: using xmlcharrefreplace when open a file #13352
  • [3.7] bpo-18911: clarify that the minidom XML writer receives texts but not bytes (GH-13352) #13718
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-06-01.07:00:02.383>
    created_at = <Date 2013-09-03.05:36:59.560>
    labels = ['easy', 'type-bug', '3.8', 'expert-XML', '3.7', 'docs']
    title = 'minidom does not encode correctly when calling Document.writexml'
    updated_at = <Date 2019-06-01.07:00:02.382>
    user = 'https://bugs.python.org/brianvanderburg2'

    bugs.python.org fields:

    activity = <Date 2019-06-01.07:00:02.382>
    actor = 'scoder'
    assignee = 'docs@python'
    closed = True
    closed_date = <Date 2019-06-01.07:00:02.383>
    closer = 'scoder'
    components = ['Documentation', 'XML']
    creation = <Date 2013-09-03.05:36:59.560>
    creator = 'brianvanderburg2'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 18911
    keywords = ['patch', 'easy']
    message_count = 9.0
    messages = ['196824', '196836', '257253', '257255', '257827', '342621', '344061', '344152', '344153']
    nosy_count = 7.0
    nosy_names = ['scoder', 'ezio.melotti', 'docs@python', 'serhiy.storchaka', 'brianvanderburg2', 'upendra-k14', 'Windson Yang']
    pr_nums = ['13352', '13718']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue18911'
    versions = ['Python 3.7', 'Python 3.8']

    @brianvanderburg2
    Mannequin Author

    brianvanderburg2 mannequin commented Sep 3, 2013

    When I save a document containing Unicode data, it is not written correctly and an encode error is raised. I know this happens on 2.7, and from checking the code in xml/dom/minidom.py it looks like 3.2 is affected as well.

    The method call that seems to be problematic is: doc.writexml(open(filename, "wb"), "", " ", "utf-8")

    So far I have found that this works: doc.writexml(codecs.open(filename, "w", "utf-8"), "", " ", "utf-8")

    It seems like this should be handled by the writexml method since it already has the specified encoding.

    @brianvanderburg2 brianvanderburg2 mannequin added topic-XML type-bug An unexpected behavior, bug, or error labels Sep 3, 2013
    @serhiy-storchaka
    Member

    On Python 3 you should not only open the file in text mode with the specified encoding, but also specify the "xmlcharrefreplace" error handler.

    doc.writexml(open(filename, "w", encoding="utf-8", errors="xmlcharrefreplace"), "", "  ", "utf-8")
    

    I can suggest only one solution -- explicitly document this behavior.

    Perhaps we should also add a special module-level function for writing a DOM tree to a binary file. The low-level writexml() should not be used directly.

    @serhiy-storchaka serhiy-storchaka added the docs Documentation in the Doc dir label Sep 11, 2013
    @upendra-k14
    Mannequin

    upendra-k14 mannequin commented Dec 31, 2015

    I am trying to resolve an issue for the first time. Can anybody please explain what "module level function" means specifically in this context?

    @bitdancer
    Member

    It means a function defined in the module namespace, as opposed to as a method on a class, so that 'from xml.dom.minidom import <somefunction>' will get you that function.

    This issue should be for documentation of the problem, since we won't add the function to 2.7. A new issue should be opened for the enhancement request of adding a module level convenience function for writing a dom out to a binary file.
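
To make the idea of a module-level convenience function concrete, here is a minimal sketch of what such a helper might look like. The name `write_document` and its signature are hypothetical: nothing like it exists in xml.dom.minidom, and the thread does not settle on a design. The sketch assumes Python 3 and simply wraps the binary file in an `io.TextIOWrapper` so that writexml() receives a text stream.

```python
import io
from xml.dom import minidom


def write_document(doc, binary_file, encoding="utf-8"):
    """Hypothetical helper (NOT part of xml.dom.minidom).

    Serializes `doc` to a file object opened in binary mode.
    """
    # writexml() emits str, so give it a text layer that does the encoding.
    text_writer = io.TextIOWrapper(
        binary_file,
        encoding=encoding,
        errors="xmlcharrefreplace",  # only matters for unencodable characters
        newline="",                  # no newline translation on write
    )
    try:
        doc.writexml(text_writer, encoding=encoding)
        text_writer.flush()
    finally:
        # Detach so the caller's binary file stays open and usable.
        text_writer.detach()


doc = minidom.parseString("<root>caf\u00e9</root>")
buffer = io.BytesIO()
write_document(doc, buffer)
data = buffer.getvalue()  # bytes, encoded as UTF-8
```

Because it is defined at module scope rather than as a method on a class, a function like this would be imported by name, which is the meaning of "module level" described above.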

    @ezio-melotti
    Member

    > On Python 3 you should not only open file in text mode with specified
    > encoding, but also specify the "xmlcharrefreplace" error handler.

    Isn't this only required in case there are non encodable characters?
    If the encoding is utf-8, this shouldn't be necessary (unless there are lone surrogates). Specifying xmlcharrefreplace might be useful while using ascii or latin1 though.

    The docs of writexml don't seem to specify whether the file should be opened in text or binary mode, but it seems to me that only text mode is supported. The advice to use xmlcharrefreplace could be added in a note.
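
The difference between the encodings can be checked with a short sketch. It assumes Python 3; the `serialize` helper and the sample document are made up for illustration, and an in-memory `io.BytesIO` wrapped in `io.TextIOWrapper` stands in for a file opened with `open(filename, "w", encoding=..., errors=...)`.

```python
import io
from xml.dom import minidom

# Sample document containing one non-ASCII character (e with acute accent).
doc = minidom.parseString("<root>caf\u00e9</root>")


def serialize(encoding, errors):
    """Illustrative helper: write `doc` through a text layer, return the bytes."""
    raw = io.BytesIO()
    writer = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    doc.writexml(writer, encoding=encoding)
    writer.flush()
    return raw.getvalue()


# UTF-8 can encode the character directly, so the default "strict" handler is fine.
utf8_bytes = serialize("utf-8", "strict")

# ASCII cannot, so xmlcharrefreplace turns it into a character reference: caf&#233;
ascii_bytes = serialize("ascii", "xmlcharrefreplace")

# Without the handler, the same write fails for ASCII.
try:
    serialize("ascii", "strict")
    strict_ascii_failed = False
except UnicodeEncodeError:
    strict_ascii_failed = True
```

In this sketch the error handler only changes the result for the ASCII case, which matches the point that it is not needed for UTF-8 unless the text contains something UTF-8 itself cannot encode.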

    @Windsooon
    Mannequin

    Windsooon mannequin commented May 16, 2019

    I added a PR with a note like this:

    .. note::

      You should specify the "xmlcharrefreplace" error handler when opening a file
      with a specified encoding::

         writer = open(
             filename, "w", encoding="utf-8",
             errors="xmlcharrefreplace")
         doc.writexml(writer, "", "  ", "utf-8")

    @scoder
    Contributor

    scoder commented May 31, 2019

    Asking users unconditionally to use the "xmlcharrefreplace" replacement method seems wrong for UTF-8. It should not be necessary.

    We should, however, document explicitly that the file will receive text and not bytes, i.e. that users are themselves responsible for opening the output file with the desired encoding. We should also make it clearer that the "encoding" argument to writexml() does not change that.
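
A small sketch, assuming Python 3, illustrates both points. The sample document and variable names are invented for the example. The writer receives `str`, the `encoding` argument only ends up in the XML declaration, and a binary stream is rejected.

```python
import io
from xml.dom import minidom

doc = minidom.parseString("<root>caf\u00e9</root>")

# A text stream works: writexml() hands it str objects.
text_buffer = io.StringIO()
doc.writexml(text_buffer, encoding="iso-8859-1")
output = text_buffer.getvalue()
# `output` is still a str. The encoding name appears in the declaration,
# but no encoding to bytes has taken place.

# A binary stream does not work: it cannot accept str.
try:
    doc.writexml(io.BytesIO(), encoding="utf-8")
    binary_write_failed = False
except TypeError:
    binary_write_failed = True
```

Whatever encoding is named in the declaration therefore has to match the encoding the caller chose when opening the output file, since writexml() does not enforce it.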

    @scoder
    Contributor

    scoder commented Jun 1, 2019

    New changeset 5ac0b98 by Stefan Behnel (Windson yang) in branch 'master':
    bpo-18911: clarify that the minidom XML writer receives texts but not bytes (GH-13352)

    @scoder scoder added the 3.8 only security fixes label Jun 1, 2019
    @scoder scoder closed this as completed Jun 1, 2019
    @scoder scoder added the 3.7 (EOL) end of life label Jun 1, 2019
    @scoder scoder reopened this Jun 1, 2019
    @scoder
    Contributor

    scoder commented Jun 1, 2019

    New changeset 18e23f2 by Stefan Behnel (Miss Islington (bot)) in branch '3.7':
    bpo-18911: clarify that the minidom XML writer receives texts but not bytes (GH-13718)

    @scoder scoder closed this as completed Jun 1, 2019
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022