
classification
Title: python c api wchar_t*/char* passing contradiction
Type: compile error
Stage: resolved
Components: Unicode
Versions: Python 3.4, Python 3.5
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, jj, loewis, python-dev, vstinner, zach.ware
Priority: normal
Keywords:

Created on 2014-07-30 17:10 by jj, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (13)
msg224327 - (view) Author: Jonas Jelten (jj) Date: 2014-07-30 17:10
The documentation and the code example at
https://docs.python.org/3.5/extending/embedding.html#very-high-level-embedding

#include <Python.h>

int
main(int argc, char *argv[])
{
  Py_SetProgramName(argv[0]);  /* optional but recommended */
  Py_Initialize();
  PyRun_SimpleString("from time import time,ctime\n"
                     "print('Today is', ctime(time()))\n");
  Py_Finalize();
  return 0;
}

contradicts the actual implementation of the code:
http://hg.python.org/cpython/file/tip/Include/pythonrun.h#l25

which leads to compiler errors. To fix them, ugly char to wchar_t conversions are needed.

Also, I was hoping that Python 3.3 had finally switched from wchar_t to char and UTF-8.
At least that's how I understood PEP 393 http://python.org/dev/peps/pep-0393/

see also:

http://stackoverflow.com/questions/21591908/python-3-3-c-string-handling-wchar-t-vs-char


=> Are the docs wrong (which I hope they are not; the example is straightforward and simple with a char*),
or is CPython wrong?
msg224329 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2014-07-30 17:47
You were misinterpreting PEP 393 - it is only about the representation of string objects, and doesn't affect any pre-existing API. Changing Py_SetProgramName is not possible without breaking existing code, so it could only happen in Python 4. 
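For illustration, a minimal C sketch of what PEP 393 does change: the storage width per character of a str object is chosen from the largest code point in the string. The helper name pep393_bytes_per_char is hypothetical and not part of the C API; it only mirrors the selection rule described in the PEP and says nothing about the parameter types of functions such as Py_SetProgramName.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a CPython function): bytes per character that
   PEP 393 would select for a string whose largest code point is max_cp. */
int pep393_bytes_per_char(uint32_t max_cp)
{
    assert(max_cp <= 0x10FFFF);   /* Unicode code points end at U+10FFFF */
    if (max_cp < 0x100)
        return 1;                 /* fits in Latin-1: one byte per character */
    if (max_cp < 0x10000)
        return 2;                 /* fits in the BMP: UCS-2 */
    return 4;                     /* anything beyond the BMP: UCS-4 */
}
```

A pure ASCII string therefore takes one byte per character, while a single character outside the Basic Multilingual Plane widens the whole string to four bytes per character.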

A proper solution might be adding Py_SetProgramNameUTF8, but it could trick people into believing that argv[0] actually is UTF-8 on their system, which it might not be. Providing Py_SetProgramNameASCII might be better, but it could fail if argv[0] contains non-ASCII characters. Yet another solution could be to expose _Py_char2wchar to the developer.

In any case: yes, the example is outdated, and only valid for Python 2.
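
A minimal sketch of the conversion step the outdated example is missing, using only standard C. decode_locale_arg is a hypothetical helper, not a Python API; it assumes the input is encoded in the current C locale and treats undecodable bytes as an error.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not part of the Python C API): decode a
   locale-encoded C string into a newly allocated wide string.
   Returns NULL if allocation fails or the bytes cannot be decoded in the
   current locale. The caller releases the result with free(). */
wchar_t *decode_locale_arg(const char *arg)
{
    assert(arg != NULL);

    /* Every wide character consumes at least one byte of input, so
       strlen(arg) + 1 elements always leave room for the terminator. */
    size_t capacity = strlen(arg) + 1;
    wchar_t *out = malloc(capacity * sizeof *out);
    if (out == NULL)
        return NULL;

    /* mbstowcs returns (size_t)-1 on an invalid multibyte sequence. */
    if (mbstowcs(out, arg, capacity) == (size_t)-1) {
        free(out);
        return NULL;
    }
    return out;
}
```

In an embedding program, the result would be passed to Py_SetProgramName() before Py_Initialize(), after calling setlocale(LC_ALL, "") so that the user's locale is in effect, and it should stay allocated until after Py_Finalize(). Python 3.5 provides Py_DecodeLocale() for this job; unlike the sketch, it escapes undecodable bytes instead of failing.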
msg224340 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-07-30 18:53
This issue is why I created the issue #18395.
msg224364 - (view) Author: Jonas Jelten (jj) Date: 2014-07-30 23:57
I'd say Python should definitely change its internal string type to char*. Exposing "handy" wchar_t->char conversion functions doesn't address the underlying data representation problem.
msg224400 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2014-07-31 12:43
Jonas, why do you say that?
msg224406 - (view) Author: Zachary Ware (zach.ware) * (Python committer) Date: 2014-07-31 14:39
See also issue20466 (which has a patch for this, but I cannot speak for its effectiveness).

I'd be in favor of closing that issue and this one as duplicates of #18395, and noting in #18395 that the embedding example must be updated before that issue is closed.
msg224444 - (view) Author: Jonas Jelten (jj) Date: 2014-07-31 20:03
Martin, I think the most intuitive and easiest way to work with strings in C is plain char arrays.

Starting with the main() argv being char*, probably most programmers just go with char* and all the encoding just works.
This is because contact with encoding is only needed for the user input software (xorg, keyboard input) and user output (-> your terminal emulator, the gui, ...).
No matter what stuff your program receives, the encoding only matters for the actual output display software to select the correct visual representation.
Requiring a conversion to wide chars just increases the interface complexity and adds really unneeded data transformations that are completely obsolete with UTF-8.

What I'd really like to see in CPython is that the internal storage (and the way it's exposed in the C-API) is just raw bytes (=> char*).

This allows super-easy integration in C projects that probably all just use char as their string type (see the doc example mentioned earlier).

PEP 393 states: "(..) the specification chooses UTF-8 as the recommended way of exposing strings to C code."

And for that, I think using char instead of wchar_t is a better solution for interface developers.
msg224462 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-08-01 00:46
> What I'd really like to see in CPython is that the internal storage (and the way it's exposed in the C-API) is just raw bytes (=> char*).

Python is portable; we care about Windows. On Windows, wchar_t* is the native type for strings (e.g. command line, environment variables).
msg224482 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-08-01 10:34
New changeset 94d0e842b9ea by Victor Stinner in branch 'default':
Issue #18395, #22108: Update embedded Python examples to decode correctly
http://hg.python.org/cpython/rev/94d0e842b9ea
msg224485 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-08-01 10:37
I updated the embedding and extending examples but I didn't try them.

@Jonas: Can you please try the updated examples?
msg224490 - (view) Author: Jonas Jelten (jj) Date: 2014-08-01 12:20
Indeed, that should do it, thanks.

I still plead for Python 4(?) always using char* internally to make this conversion obsolete ;) (except for Windows)
msg224491 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-08-01 12:21
> I still plead for Python 4(?) always using char* internally to make this conversion obsolete ;) (except for Windows)

I don't understand your proposal. We try to have duplicating functions for char* and wchar_t*.
msg224496 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2014-08-01 13:05
Jonas: Python's string type is a Unicode character type, unlike C's (which is wishy-washy when it comes to characters outside of the "basic execution character set"). So just declaring that all APIs take UTF-8 will *not* allow for easy integration with other C code; instead, it will be a source of mojibake.
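
The mojibake risk can be sketched in C: bytes produced under a non-UTF-8 locale are frequently not valid UTF-8 at all, so an API that simply assumes UTF-8 would misread or reject them. is_valid_utf8 is a hypothetical helper written for this illustration, not a CPython function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the n bytes at s are well-formed UTF-8
   (no overlong forms, no surrogates, nothing above U+10FFFF). */
bool is_valid_utf8(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t len;
        unsigned long cp;

        if (c < 0x80) { i++; continue; }                        /* ASCII */
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;            /* stray continuation or bad lead byte */

        if (n - i < len)
            return false;             /* sequence is cut off */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;         /* expected a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
            (len == 4 && cp < 0x10000))
            return false;             /* overlong encoding */
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return false;             /* surrogate or out of range */
        i += len;
    }
    return true;
}
```

The bytes "caf\xC3\xA9" (UTF-8 for "café") pass the check, while "caf\xE9" (the same word in Latin-1) does not, which is the situation argv[0] can be in on a system whose locale is not UTF-8.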

In any case, this issue appears to be resolved now; thanks for the patch.
History
Date User Action Args
2022-04-11 14:58:06 admin set github: 66306
2014-08-01 13:07:10 zach.ware set stage: resolved; versions: - Python 3.3
2014-08-01 13:05:07 loewis set status: open -> closed; resolution: fixed; messages: + msg224496
2014-08-01 12:21:45 vstinner set messages: + msg224491
2014-08-01 12:20:38 jj set messages: + msg224490
2014-08-01 10:37:21 vstinner set messages: + msg224485
2014-08-01 10:34:47 python-dev set nosy: + python-dev; messages: + msg224482
2014-08-01 00:46:57 vstinner set messages: + msg224462
2014-07-31 20:03:00 jj set messages: + msg224444
2014-07-31 14:39:43 zach.ware set nosy: + zach.ware; messages: + msg224406
2014-07-31 12:43:51 loewis set messages: + msg224400
2014-07-30 23:57:16 jj set messages: + msg224364
2014-07-30 18:53:37 vstinner set messages: + msg224340
2014-07-30 17:47:53 loewis set nosy: + loewis; messages: + msg224329
2014-07-30 17:10:42 jj create