
classification
Title: SAXParseError on unicode (Japanese) file
Type: behavior
Components: XML
Versions: Python 2.5
process
Status: closed
Resolution: not a bug
Nosy List: amaury.forgeotdarc, gianzula
Priority: normal

Created on 2010-07-13 09:04 by gianzula, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
ff1a.xml (uploaded by gianzula, 2010-07-13 09:04)
Messages (2)
msg110163 - Author: Gianfranco (gianzula) Date: 2010-07-13 09:04
When parsing a UTF-16 little-endian encoded XML file containing some Japanese characters, the xml.sax.parse function raises a SAXParseException saying "no element found". The problem occurs on:

Python 2.5.2/Windows XP Pro SP3 32 bit
Python 2.6.4/Windows XP Pro SP3 32 bit
Python 2.5.2/Windows 2008 Server SP2 64 bit

The same file is processed successfully on:

Python 2.4.3/CentOS 5.4
Python 2.6.3/CentOS 5.4

I've attached a minimal XML file containing a single Japanese character, U+FF1A, that triggers the exception. The code used to parse the file follows:

import xml.sax
xml.sax.parse(open("ff1a.xml"), xml.sax.ContentHandler())

Best regards,
Gianfranco
msg110181 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) Date: 2010-07-13 12:34
Your file contains the byte \x1a (U+FF1A is stored as the bytes 1A FF in UTF-16 little-endian), which text mode on Windows treats as EOF.
You should open the file in binary mode, not text mode; otherwise it is truncated at that byte.

import xml.sax
xml.sax.parse(open("ff1a.xml", 'rb'), xml.sax.ContentHandler())

works on all versions I tried.
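
The truncation described above can be reproduced on any platform with a small sketch. This is an illustration written for Python 3, not the code from the report: the document built here is a hypothetical stand-in for the attached ff1a.xml, and the `Collect` handler and the slice at the first 0x1A byte are assumptions used only to imitate what a Windows text-mode read would hand to the parser.

```python
import codecs
import io
import xml.sax

# Hypothetical stand-in for ff1a.xml: a UTF-16 little-endian document
# whose only text content is U+FF1A.
doc = '<?xml version="1.0" encoding="UTF-16"?><r>\uff1a</r>'
data = codecs.BOM_UTF16_LE + doc.encode('utf-16-le')

# U+FF1A is encoded as the byte pair 1A FF, so the raw data contains 0x1A.
cut = data.index(b'\x1a')


class Collect(xml.sax.ContentHandler):
    """Illustrative handler that gathers character data."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.text = []

    def characters(self, content):
        self.text.append(content)


# Complete bytes (what a binary-mode read returns): parsing succeeds.
handler = Collect()
xml.sax.parse(io.BytesIO(data), handler)
parsed_text = ''.join(handler.text)

# Bytes cut at 0x1A (imitating the truncated text-mode read): parsing fails.
truncated_error = None
try:
    xml.sax.parse(io.BytesIO(data[:cut]), xml.sax.ContentHandler())
except xml.sax.SAXParseException as exc:
    truncated_error = exc.getMessage()

print(repr(parsed_text))
print(truncated_error)
```

The parser never sees the closing tag in the truncated case, which is consistent with the error reported in the first message.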
History
Date User Action Args
2022-04-11 14:57:03 admin set github: 53487
2010-07-13 12:34:12 amaury.forgeotdarc set status: open -> closed; nosy: + amaury.forgeotdarc; messages: + msg110181; resolution: not a bug
2010-07-13 09:04:35 gianzula create