classification
Title: writing non-ascii characters in xml file using python code embedded in C
Type: behavior
Stage: resolved
Components: XML
Versions: Python 2.7
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: aimad, benjamin.peterson, ezio.melotti, lemburg, serhiy.storchaka, vstinner, xiang.zhang
Priority: normal
Keywords:

Created on 2017-04-28 10:14 by aimad, last changed 2017-05-16 15:08 by xiang.zhang. This issue is now closed.

Files
File name | Uploaded | Description
write_to_xml1.py | aimad, 2017-04-29 14:16 |
Messages (9)
msg292521 - Author: mahboubi (aimad) Date: 2017-04-28 10:14
My Python code, embedded in a C program, uses etree from lxml to write a plain string as an element attribute in an XML file. The problem is that when the string contains non-English (non-ASCII) characters, the program fails to write it, even with a Unicode conversion such as unicode(mystring, "utf-8"). When I run the Python code on its own, it works.
msg292523 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-04-28 10:25
lxml is not part of the Python standard library. Use the lxml bug tracker if your issue is specific to lxml.

If you can reproduce the issue with xml.etree.ElementTree from the stdlib, please provide a simple code example that does this.
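
A minimal stdlib-only reproduction along the lines requested here might look like the sketch below. It is an illustration, not the reporter's attached script: the element and attribute names are borrowed from the report output quoted later in this thread, while the `write_report` helper, the file name, and the temporary directory are assumptions made for the example.

```python
# -*- coding: utf-8 -*-
import os
import tempfile
import xml.etree.ElementTree as ET


def write_report(path, langue):
    """Write a one-element report whose attribute may hold non-ASCII text."""
    root = ET.Element("rapport_test")
    root.set("langue", langue)
    # With an explicit encoding, ElementTree encodes the document itself,
    # so the caller never has to encode the text by hand.
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


# Hypothetical output location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "report.xml")
write_report(path, u"你好")

# Parse the file again to check that the attribute survived the round trip.
reloaded = ET.parse(path).getroot().get("langue")
```

Running the same small script once directly and once from the embedding program would show whether the two environments really behave differently.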
msg292594 - Author: mahboubi (aimad) Date: 2017-04-29 14:16
I have just tried to do this task using xml.etree.ElementTree and still have the same problem.
In the file 'write_to_xml1.py' I'm trying to develop some functions that create an XML file and then add data containing non-ASCII characters. The Python program works perfectly, but when I tried to call these functions from C, the program crashes. Note that this problem doesn't happen when adding ASCII characters only. Here is the C program:

#include <Python.h>

void create_report(void)
{
    PyObject *pName, *pModule, *pDict, *pFunc;

    // Initialize the Python interpreter
    Py_Initialize();

    // Build the name object
    pName = PyString_FromString("write_to_xml1");
    // Load the module object
    pModule = PyImport_Import(pName);

    // pDict is a borrowed reference
    pDict = PyModule_GetDict(pModule);

    // pFunc is also a borrowed reference
    pFunc = PyDict_GetItemString(pDict, "create_report");

    if (PyCallable_Check(pFunc))
    {
        PyObject_CallObject(pFunc, NULL);
    }
    else
    {
        PyErr_Print();
    }
}

void modif_report(void)
{
    PyObject *pName, *pModule, *pDict, *pFunc;

    // Initialize the Python interpreter (a no-op if it is already running)
    Py_Initialize();

    // Build the name object
    pName = PyString_FromString("write_to_xml1");
    // Load the module object
    pModule = PyImport_Import(pName);
    // pDict is a borrowed reference
    pDict = PyModule_GetDict(pModule);
    // pFunc is also a borrowed reference
    pFunc = PyDict_GetItemString(pDict, "traite");

    if (PyCallable_Check(pFunc))
    {
        PyObject_CallObject(pFunc, NULL);
    }
    else
    {
        PyErr_Print();
    }
}

int main(int argc, char *argv[])
{
    create_report();
    modif_report();
    return 0;
}      // end main()
msg292630 - Author: Xiang Zhang (xiang.zhang) (Python committer) Date: 2017-04-30 14:28
IMHO this doesn't look like an error in the xml library. With a few tweaks to your program (adding an encoding declaration, removing unneeded imports, and replacing open with codecs.open(encoding='utf8') in create_report), the C program works fine for me:

[tmp]$ cat essai_rapport_test_30-04-2017-22-22_你好.xml
<?xml version="1.0" ?>
<rapport_test date="2017-04-30" langue="你好" nb_correcte="你好" nb_incorrecte="abc" time="22:22:46"/>

xml_file.write(doc.toprettyxml()) fails for me because the ascii codec cannot encode the result of doc.toprettyxml(), which is a unicode object. I would also suggest adding failure checks to your C program. Almost every step can fail and then result in a segfault.
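
The encoding failure described in this message can be sketched as follows. This is an illustration under stated assumptions, not the attached script: `build_report_xml` and `write_report` are hypothetical helpers, and `io.open` with an explicit `encoding` is used in place of the `codecs.open` call mentioned above so that the codec in play is visible and the sketch behaves the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile
from xml.dom import minidom


def build_report_xml():
    """Return pretty-printed XML text with a non-ASCII attribute value."""
    doc = minidom.Document()
    root = doc.createElement(u"rapport_test")
    root.setAttribute(u"langue", u"你好")
    doc.appendChild(root)
    return doc.toprettyxml()


def write_report(path, encoding):
    """Write the report text through a file object that uses `encoding`."""
    with io.open(path, "w", encoding=encoding) as xml_file:
        xml_file.write(build_report_xml())


# Hypothetical output location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "report.xml")

# The ascii codec cannot represent the attribute value, so writing fails.
try:
    write_report(path, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# With UTF-8 the same text is written without error.
write_report(path, "utf-8")
with io.open(path, encoding="utf-8") as xml_file:
    written = xml_file.read()
```

The `# -*- coding: utf-8 -*-` line is the encoding declaration mentioned above; Python 2.7 needs it whenever the source file itself contains non-ASCII literals.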
msg292631 - Author: mahboubi (aimad) Date: 2017-04-30 14:48
Thank you, Xiang Zhang, for your reply.
I think the problem is not in xml_file.write(doc.toprettyxml()), because it works when using Python only. The C program doesn't work either, since you didn't get any 'alerte' tag.
msg292633 - Author: Xiang Zhang (xiang.zhang) (Python committer) Date: 2017-04-30 14:51
But I didn't see any crash either. You'd better provide a simple program that reproduces the problem, without so much logic. Or could you debug it and provide a crash backtrace?
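
Short of a debugger backtrace, one way to see what goes wrong inside the embedded interpreter is to log the Python traceback from the Python side. The C program above ignores the return value of PyObject_CallObject, so an exception raised inside the called function is never printed. The sketch below is only a suggestion: the `log_exceptions` decorator and the log file name are hypothetical, and the body of `traite` is a stand-in that simulates a failure.

```python
import functools
import io
import os
import tempfile
import traceback


def log_exceptions(log_path):
    """Decorator factory: append the traceback of any exception to log_path."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                text = traceback.format_exc()
                # On Python 2 format_exc() returns bytes; normalise to text.
                if isinstance(text, bytes):
                    text = text.decode("utf-8", "replace")
                with io.open(log_path, "a", encoding="utf-8") as log:
                    log.write(text)
                raise  # keep the original behaviour for the caller
        return wrapper
    return decorator


# Hypothetical log location, used only for this sketch.
log_path = os.path.join(tempfile.mkdtemp(), "embedded_errors.log")


@log_exceptions(log_path)
def traite():
    # Stand-in body: the real function lives in write_to_xml1.py.
    raise ValueError("simulated failure inside traite()")


try:
    traite()
    raised = False
except ValueError:
    raised = True

with io.open(log_path, encoding="utf-8") as log:
    logged = log.read()
```

With a wrapper like this around the functions called from C, a report that is created but incomplete would leave a traceback in the log file showing where the Python code stopped.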
msg292634 - Author: mahboubi (aimad) Date: 2017-04-30 15:03
Also, using codecs.open(encoding='utf8') gives an unhandled exception in C, but no problem when using Python code only.
msg292639 - Author: mahboubi (aimad) Date: 2017-04-30 18:51
It was probably my mistake to use the word 'crash'. What I mean is that the report generated by the Python program contains the right result (the 'alerte' tag is added), but the report generated by the same Python program embedded in C does not (the report is created, but no 'alerte' tag is added).
msg293761 - Author: mahboubi (aimad) Date: 2017-05-16 13:14
Problem solved. It's not related to embedding Python in C; I just had to use mystring.decode('utf-8') instead of unicode(mystring, "utf-8").
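
For reference, the two spellings named in this message can be compared on a UTF-8 byte string with the sketch below. It assumes `mystring` holds UTF-8 encoded bytes, which the thread does not confirm, and it uses a `text_type` alias (an assumption for portability) so that the same lines run on Python 2.7, where the text type is `unicode`, and on Python 3, where it is `str`.

```python
# -*- coding: utf-8 -*-
import sys

# Python 2 calls its text type `unicode`; Python 3 calls it `str`.
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821

# Assumed input: a byte string holding UTF-8 encoded text.
mystring = u"你好".encode("utf-8")

# Spelling 1: the bytes method.
via_method = mystring.decode("utf-8")

# Spelling 2: the text-type constructor with an explicit encoding.
via_constructor = text_type(mystring, "utf-8")
```

For a byte string both calls should produce the same text, so the difference the reporter observed presumably depended on what `mystring` actually contained at that point; the thread does not say.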
History
Date | User | Action | Args
2017-05-16 15:08:57 | xiang.zhang | set | resolution: not a bug
2017-05-16 13:14:27 | aimad | set | status: open -> closed; messages: + msg293761; stage: test needed -> resolved
2017-05-06 20:31:13 | aimad | set | type: crash -> behavior
2017-04-30 18:51:38 | aimad | set | messages: + msg292639
2017-04-30 15:03:16 | aimad | set | messages: + msg292634
2017-04-30 14:51:56 | xiang.zhang | set | messages: + msg292633
2017-04-30 14:48:48 | aimad | set | status: pending -> open; messages: + msg292631
2017-04-30 14:28:51 | xiang.zhang | set | status: open -> pending; nosy: + xiang.zhang; messages: + msg292630
2017-04-29 14:16:06 | aimad | set | files: + write_to_xml1.py; status: pending -> open; messages: + msg292594
2017-04-28 10:25:16 | serhiy.storchaka | set | status: open -> pending; nosy: + serhiy.storchaka; messages: + msg292523; stage: test needed
2017-04-28 10:14:52 | aimad | create |