Author serhiy.storchaka
Recipients ezio.melotti, serhiy.storchaka
Date 2013-01-31.10:01:17
Message-id <1359626479.25.0.87229024986.issue17089@psf.upfronthosting.co.za>
In-reply-to
Content
xmlparser.Parse() handles string (str) data correctly only if the encoding declared in the XML is UTF-8 (or ASCII). Examples:

>>> import xml.parsers.expat
>>> parser = xml.parsers.expat.ParserCreate()
>>> content = []
>>> parser.CharacterDataHandler = content.append
>>> parser.Parse("<?xml version='1.0' encoding='utf-8'?><tag>\xb5</tag>")
1
>>> content
['µ']
>>> parser = xml.parsers.expat.ParserCreate()
>>> content = []
>>> parser.CharacterDataHandler = content.append
>>> parser.Parse("<?xml version='1.0' encoding='iso8859'?><tag>\xb5</tag>")
1
>>> content
['Âµ']
>>> parser = xml.parsers.expat.ParserCreate()
>>> content = []
>>> parser.CharacterDataHandler = content.append
>>> parser.Parse("<?xml version='1.0' encoding='utf-16'?><tag>\xb5</tag>")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
xml.parsers.expat.ExpatError: encoding specified in XML declaration is incorrect: line 1, column 30

This affects all other modules which work with XML: xml.sax, xml.dom.minidom, xml.dom.pulldom, xml.etree.ElementTree.
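Until the parser handles str input itself, one possible workaround is to encode the string to bytes in the declared encoding before parsing. The following is only a minimal sketch of that idea, not the attached patch; the helper name parse_text is hypothetical, and it assumes the caller knows which encoding the XML declaration names.

```python
import xml.parsers.expat
import xml.etree.ElementTree as ET


def parse_text(xml_text, encoding):
    """Hypothetical helper: encode the str in the declared encoding,
    feed the bytes to expat, and return the collected character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    # Passing bytes lets expat decode according to the XML declaration.
    parser.Parse(xml_text.encode(encoding), True)
    return "".join(chunks)


latin1_doc = "<?xml version='1.0' encoding='iso-8859-1'?><tag>\xb5</tag>"
utf16_doc = "<?xml version='1.0' encoding='utf-16'?><tag>\xb5</tag>"

latin1_text = parse_text(latin1_doc, "iso-8859-1")
utf16_text = parse_text(utf16_doc, "utf-16")

# The same approach applies to modules built on expat, e.g. ElementTree.
etree_text = ET.fromstring(latin1_doc.encode("iso-8859-1")).text
```

The encoding passed to the helper has to match the one in the XML declaration; otherwise the bytes and the declaration disagree and the parser will either fail or return wrong characters.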

Here is a patch which fixes parsing string data with non-UTF-8 XML.
History
Date                 User              Action  Args
2013-01-31 10:01:19  serhiy.storchaka  set     recipients: + serhiy.storchaka, ezio.melotti
2013-01-31 10:01:19  serhiy.storchaka  set     messageid: <1359626479.25.0.87229024986.issue17089@psf.upfronthosting.co.za>
2013-01-31 10:01:19  serhiy.storchaka  link    issue17089 messages
2013-01-31 10:01:18  serhiy.storchaka  create