This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: xml.dom.minidom not able to handle utf-8 data
Type: compile error Stage:
Components: Interpreter Core, Unicode, XML Versions: Python 2.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: draghuram, facundobatista, sharmila
Priority: normal Keywords:

Created on 2007-10-18 01:58 by sharmila, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
testdata.txt sharmila, 2007-10-18 01:58
Messages (8)
msg56511 - (view) Author: Sharmila Sivakumar (sharmila) Date: 2007-10-18 01:58
I try to load the data in the testdata.txt file into a dom.

I tried 
import xml.dom.minidom as dom
data = open('testdata.txt','r').read()
mydom = dom.parseString(data)
I get the following error

>>> mydom.firstChild.childNodes
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 18: ordinal not in range(128)


So I tried decoding the data and using it but it failed again.

>>> mydom2 = dom.parseString(data.decode('utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/_xmlplus/dom/minidom.py", line 1925, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib/python2.5/site-packages/_xmlplus/dom/expatbuilder.py", line 942, in parseString
    return builder.parseString(string)
  File "/usr/lib/python2.5/site-packages/_xmlplus/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u014d' in position 173: ordinal not in range(128)


I am willing to fix this myself if I'm given the permission.
msg56514 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2007-10-18 03:36
Downloaded the testdata.txt file, and yes, it's UTF-8:

facundo@pomcat:~/devel$ file testdata.txt 
testdata.txt: UTF-8 Unicode text

But I opened it perfectly!

Python 2.5.1 (r251:54863, May  2 2007, 16:56:35) 
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.dom.minidom as dom
>>> data = open('testdata.txt','r').read()
>>> mydom = dom.parseString(data)
>>> mydom
<xml.dom.minidom.Document instance at 0xb7c03b0c>
>>> 

Which platform are you working on?

And yes, you have absolute permission to fix it; patches are always welcome!
msg56518 - (view) Author: Sharmila Sivakumar (sharmila) Date: 2007-10-18 04:41
Thanks for your quick response Facundo.

I'm working on Ubuntu 7.04, python 2.5.1
Python 2.5.1 (r251:54863, May 2 2007, 16:56:35)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2

This error occurs when the default encoding is 'ascii'.  When I change the
default encoding to 'utf-8' it works for me too.  Is, by any chance, your
default encoding 'utf-8'?

msg56519 - (view) Author: Sharmila Sivakumar (sharmila) Date: 2007-10-18 04:45
Oops Facundo, that will work. It actually fails *after the dom construction* when you do

mydom.firstChild.childNodes

Could you please try it again?

The problem is that some encoding and decoding is done within the parser, and
it uses the default encoding 'ascii'.  This fails for UTF-8 data.

msg56542 - (view) Author: Raghuram Devarakonda (draghuram) (Python triager) Date: 2007-10-18 20:43
When I run the code in a script, I don't get the error.

***************
marvin:cpython$ python
Python 2.5 (r25:51908, Jan 24 2007, 12:48:15) 
[GCC 4.1.0 (SUSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.dom.minidom as dom
>>> data = open('testdata.txt','r').read()
>>> mydom = dom.parseString(data)
>>> mydom.firstChild.childNodes
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 18: ordinal not in range(128)
>>> import sys
>>> sys.getdefaultencoding()
'ascii'

marvin:cpython$ python dom.py 
marvin:cpython$ 
***************

Can you try and see if you can run it from the script too?
msg56543 - (view) Author: Raghuram Devarakonda (draghuram) (Python triager) Date: 2007-10-18 20:44
I forgot to show dom.py source.

marvin:cpython$ cat dom.py 
import xml.dom.minidom as dom
data = open('testdata.txt','r').read()
mydom = dom.parseString(data)
mydom.firstChild.childNodes
msg56556 - (view) Author: Raghuram Devarakonda (draghuram) (Python triager) Date: 2007-10-19 14:56
The fact that the problem occurs only in the interactive interpreter and
not when run from a script indicates that the real issue is in trying to
print the object. Sure enough, if you modify the script to do
repr(mydom.firstChild.childNodes), it hits the same problem. So the
issue may have something to do with how the string is constructed in
repr(). I don't have time right now to dig deeper, but the parser itself
may not have any encoding/decoding issues (apart from the ability to print
these high-level objects).
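
The distinction between parsing and displaying can be checked with a small sketch. It is written for Python 3, which is an assumption on my part since this thread is about Python 2.5; Python 3 no longer does the implicit ASCII encode of repr() output, so that step is simulated here with an explicit .encode('ascii'). The inline XML and the helper name ascii_encode_fails are hypothetical stand-ins, not the contents of testdata.txt.

```python
import xml.dom.minidom as dom

# Hypothetical stand-in for testdata.txt: UTF-8 bytes containing U+2022 (bullet).
data = '<root>\u2022 item</root>'.encode('utf-8')

# Parsing succeeds: the parser itself handles the UTF-8 input.
mydom = dom.parseString(data)
text = mydom.firstChild.firstChild.data  # text node inside <root>


def ascii_encode_fails(s):
    """Return True if s cannot be encoded as ASCII.

    This simulates the implicit encode Python 2 applied when a
    __repr__ returned a unicode string with non-ASCII characters.
    """
    try:
        s.encode('ascii')
    except UnicodeEncodeError:
        return True
    return False


parsed_ok = (text == '\u2022 item')       # the data survived parsing intact
display_fails = ascii_encode_fails(text)  # only the ASCII conversion breaks
```

If both flags are true, the data went through the parser unharmed and only the conversion to ASCII for display is at fault, which matches what the script-versus-prompt behaviour suggests.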
msg56719 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2007-10-24 19:12
CharacterData.__repr__ was constructing a string in response that still
contained a non-ASCII character.

Fixed in rev 58641.
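
For readers who want to see the general idea, here is a minimal sketch of a __repr__ that always returns pure ASCII. It is not the code from the commit: the class TextLike, the 10-character preview, and the use of Python 3's ascii() builtin are all assumptions made for illustration.

```python
class TextLike:
    """Hypothetical stand-in for a character-data node (not minidom's class)."""

    def __init__(self, data):
        self.data = data

    def __repr__(self):
        # Show only a short preview of the data (the length is an arbitrary choice).
        preview = self.data[:10]
        dots = "..." if len(self.data) > 10 else ""
        # ascii() escapes every non-ASCII character (e.g. as \u2022), so the
        # resulting string can always be encoded with the 'ascii' codec.
        return '<DOM %s node "%s%s">' % (
            type(self).__name__, ascii(preview), dots)


node = TextLike('\u2022 item')
r = repr(node)
r.encode('ascii')  # does not raise: the repr contains only ASCII characters
```

The design point is that the escaping happens inside __repr__ itself, so the result no longer depends on the interpreter's default encoding.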
History
Date User Action Args
2022-04-11 14:56:27 admin set github: 45631
2007-10-24 19:12:12 facundobatista set resolution: works for me -> fixed; messages: + msg56719
2007-10-24 16:28:51 facundobatista set files: - unnamed
2007-10-24 16:28:45 facundobatista set files: - unnamed
2007-10-19 14:56:11 draghuram set messages: + msg56556
2007-10-18 20:44:24 draghuram set messages: + msg56543
2007-10-18 20:43:20 draghuram set nosy: + draghuram; messages: + msg56542
2007-10-18 04:45:14 sharmila set files: + unnamed; messages: + msg56519
2007-10-18 04:41:15 sharmila set files: + unnamed; messages: + msg56518
2007-10-18 03:36:31 facundobatista set status: open -> closed; resolution: works for me; messages: + msg56514; nosy: + facundobatista
2007-10-18 01:58:15 sharmila create