classification
Title: incorrect utf-8 conversion with c api
Type: behavior
Stage: resolved
Components: Unicode
Versions: Python 2.7
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: dzaamek, ezio.melotti, mark.dickinson, vstinner
Priority: normal
Keywords:

Created on 2014-03-24 16:17 by dzaamek, last changed 2014-07-03 17:43 by ezio.melotti. This issue is now closed.

Messages (4)
msg214693 - Author: David Zámek (dzaamek) Date: 2014-03-24 16:17
I use python 2.7.6 on win32.

If I enter u'\u010d'.encode('utf-8') in the console, I get '\xc4\x8d' as the response. That's correct.

But if I use the C API for the same thing, I get an incorrect '\xc3\xa8' as the response.

I was testing it on this program:

#include <Python.h>
int main() {
Py_Initialize();
PyObject* dict = PyDict_New();
PyRun_String("u'\u010d'.encode('utf-8')", Py_single_input, dict, dict);
Py_DECREF(dict);
}
msg214694 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-03-24 16:24
In the C language, \u must be escaped as "\\u".
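As a rough illustration of what the escaped form achieves, the Python sketch below models the text that the interpreter should receive once the backslash is doubled in the C literal. The raw string is only a stand-in for the C literal "u'\\u010d'.encode('utf-8')", and the variable names are illustrative, not taken from the report.

```python
# Stand-in for the C literal "u'\\u010d'.encode('utf-8')": once the C
# compiler has processed "\\", Python receives a plain backslash plus u010d.
source = r"u'\u010d'.encode('utf-8')"

# The text handed to Python is pure ASCII, so no compiler code page is involved.
is_ascii = all(ord(ch) < 128 for ch in source)

# Python itself now resolves the \u escape. eval() is acceptable here only
# because the input is a fixed literal.
result = eval(source)
print(is_ascii, repr(result))
```

With the escape left for Python to interpret, the result is the two bytes 0xC4 0x8D that the console session produced.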
msg214713 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2014-03-24 19:16
Indeed: the \u010d is being interpreted by your *C compiler* as a multibyte character, and the individual bytes of that multibyte character end up in the string that you actually pass to Python.  I suspect that the actual bytes you get depend on your locale.  Here I get (signed) bytes -60 and -115.  (See e.g. "translation phase 7" in C99 6.4.5.)

As Victor says, you need to escape the backslash in the C code.
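The signed byte values quoted above can be reproduced with a short Python sketch, assuming the compiler's execution character set is UTF-8 (an assumption; the message does not state which one was in use). The names compiled_bytes and signed_values are illustrative.

```python
import struct

# Assumption: the C compiler expands \u010d using a UTF-8 execution charset.
compiled_bytes = u'\u010d'.encode('utf-8')

# Plain char is commonly signed in C, so read the two bytes as signed
# 8-bit integers, the way they would appear in a char array.
signed_values = struct.unpack('2b', compiled_bytes)
print(signed_values)
```

Under that assumption the two bytes are 0xC4 and 0x8D, which read as -60 and -115 when treated as signed chars.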
msg214714 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2014-03-24 19:21
> I suspect that the actual bytes you get depend on your locale.

And from the output you're seeing, I'd guess that Windows is using the CP1250 (Latin: Central European) codepage to make the translation on your machine: http://en.wikipedia.org/wiki/Windows-1250.
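The CP1250 guess can be checked with a small Python sketch of the presumed chain of conversions. The first step (the compiler emitting the CP1250 byte) follows the guess above. The second step (Python 2 treating a non-ASCII byte inside a u'' literal as Latin-1 when no source encoding is declared) is an assumption added here to complete the picture, and the variable names are illustrative.

```python
# Step 1 (per the guess above): the C compiler turns \u010d into the
# single CP1250 byte for that character.
compiled = u'\u010d'.encode('cp1250')

# Step 2 (assumption): Python 2 reads that byte inside the u'' literal
# as Latin-1, yielding U+00E8 instead of U+010D.
misread = compiled.decode('latin-1')

# Step 3: encoding the misread character as UTF-8 gives the reported bytes.
reported = misread.encode('utf-8')
print(repr(compiled), repr(misread), repr(reported))
```

The chain ends in the bytes 0xC3 0xA8, matching the '\xc3\xa8' from the original report, which is consistent with the CP1250 explanation.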
History
Date                 User            Action  Args
2014-07-03 17:43:22  ezio.melotti    set     stage: resolved
2014-03-24 19:21:29  mark.dickinson  set     messages: + msg214714
2014-03-24 19:16:25  mark.dickinson  set     status: open -> closed; nosy: + mark.dickinson; messages: + msg214713; resolution: not a bug
2014-03-24 16:24:46  vstinner        set     messages: + msg214694
2014-03-24 16:17:02  dzaamek         create