classification
Title: xml.etree.ElementTree forgets the encoding
Type: enhancement
Stage:
Components: Library (Lib), XML
Versions: Python 3.4
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: effbot, flox, mark, scoder, serhiy.storchaka
Priority: normal
Keywords:

Created on 2010-08-05 09:32 by mark, last changed 2013-01-07 16:19 by serhiy.storchaka.

Messages (7)
msg112962 - Author: Mark Summerfield (mark) Date: 2010-08-05 09:32
If you read in an XML file that specifies its encoding and later write it out with xml.etree.ElementTree.ElementTree.write(), the output is written using US-ASCII unless you pass an encoding explicitly.

I think the behaviour should be different:
(1) If the XML that was read included an encoding, that encoding should be remembered and used when writing.
(2) If there is no encoding the default for writing should be UTF-8 (which is the standard for XML files).
(3) For non-XML files use US-ASCII.

Naturally, any of these could be overridden using an encoding argument to the write() method.
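
For illustration, here is a minimal sketch of the behaviour described above, assuming a small in-memory ISO-8859-1 document (the sample data and variable names are made up for this example). It writes the same tree once with the default encoding and once with an explicit one:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input: a document that declares ISO-8859-1 and contains "é".
source = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n<name>caf\xe9</name>'
tree = ET.parse(io.BytesIO(source))

# Default write(): the encoding falls back to US-ASCII, so the non-ASCII
# character is emitted as a numeric character reference.
default_out = io.BytesIO()
tree.write(default_out)

# Explicit encoding: the caller has to supply it, because the tree does
# not remember what the input declared.
explicit_out = io.BytesIO()
tree.write(explicit_out, encoding="iso-8859-1")

print(default_out.getvalue())
print(explicit_out.getvalue())
```

The parsed text is identical in both cases; only the byte representation differs between the two writes.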
msg113118 - Author: Florent Xicluna (flox) * (Python committer) Date: 2010-08-06 17:29
It behaves as documented. Moved to "feature request".
http://docs.python.org/library/xml.etree.elementtree.html
msg113238 - Author: Stefan Behnel (scoder) * Date: 2010-08-08 07:44
I think it makes sense to keep input and output separate. After all, the part of the software that outputs a document doesn't necessarily know how it came in, so having the default output encoding depend on the input sounds error prone. Encoding should always be explicit. My advice is to reject this feature request.
msg113663 - Author: Mark Summerfield (mark) Date: 2010-08-12 07:20
Perhaps a useful compromise would be to add an "encoding" attribute that is set to the encoding of the XML file that's read in (and with a default of "ascii").

That way it would be possible to preserve the encoding, e.g.:

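import xml.etree.ElementTree as etree
# parse() returns an ElementTree; ElementTree(in_filehandle) would take the handle as the root element
xml_tree = etree.parse(in_filehandle)
# process the tree
# "encoding" is the proposed attribute; it does not exist in the current API
xml_tree.write(out_filehandle, encoding=xml_tree.encoding)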
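
The `encoding` attribute used above is only a proposal. As a rough sketch of how a caller might preserve the declared encoding with what the standard library offers today, the declaration can be read with an expat `XmlDeclHandler` before parsing. The helper name `sniff_declared_encoding`, its "us-ascii" fallback, and the sample data are assumptions made for this example, not part of ElementTree:

```python
import io
import xml.etree.ElementTree as ET
from xml.parsers import expat


def sniff_declared_encoding(data, default="us-ascii"):
    """Return the encoding named in the XML declaration of `data` (bytes).

    Hypothetical helper: falls back to `default` when there is no
    declaration or the declaration does not name an encoding.
    """
    found = []
    parser = expat.ParserCreate()
    # expat calls this handler with (version, encoding, standalone).
    parser.XmlDeclHandler = lambda version, encoding, standalone: found.append(encoding)
    parser.Parse(data, True)
    return found[0] if found and found[0] else default


data = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n<name>caf\xe9</name>'
encoding = sniff_declared_encoding(data)

tree = ET.ElementTree(ET.fromstring(data))
# process the tree
out = io.BytesIO()
tree.write(out, encoding=encoding)  # the encoding is passed explicitly
```

This reads the document twice, which keeps the sketch short; it also keeps input and output decoupled, since the caller decides whether to reuse the sniffed value.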
msg113666 - Author: Stefan Behnel (scoder) * Date: 2010-08-12 08:05
lxml.etree has encapsulated this in a 'docinfo' property which also holds the XML 'version', the 'standalone' state and the DOCTYPE (if available).

Note that this information is readily available in lxml.etree for any parsed Element (by wrapping it in a new ElementTree). In ET, however, it can only be associated with the ElementTree instance that did the parsing, not with one that merely wraps an already parsed tree of Element objects. I would expect that this is still enough to handle this use case, though.

Stefan
msg113667 - Author: Mark Summerfield (mark) Date: 2010-08-12 08:21
I don't see how lxml is relevant here? lxml is a third party library, whereas etree is part of the standard library. And according to the 3.1.2 docs etree doesn't have a docinfo (or any other) property.
msg113670 - Author: Stefan Behnel (scoder) * Date: 2010-08-12 09:27
That's why I mention it here: to prevent future incompatibilities between the two libraries.
History
Date User Action Args
2013-01-07 16:19:34 serhiy.storchaka set versions: + Python 3.4, - Python 3.2, Python 3.3
2012-07-14 18:48:49 serhiy.storchaka set nosy: + serhiy.storchaka
2010-08-12 09:27:39 scoder set messages: + msg113670
2010-08-12 08:21:29 mark set messages: + msg113667
2010-08-12 08:05:17 scoder set messages: + msg113666
2010-08-12 07:20:19 mark set messages: + msg113663
2010-08-08 07:44:26 scoder set messages: + msg113238
2010-08-06 17:32:39 flox set nosy: + scoder
2010-08-06 17:29:48 flox set type: behavior -> enhancement; messages: + msg113118; components: + XML; versions: + Python 3.2, Python 3.3, - Python 3.1
2010-08-06 03:23:51 r.david.murray set nosy: + effbot, flox
2010-08-05 09:32:08 mark create