python c api wchar_t*/char* passing contradiction #66306

Closed
jj mannequin opened this issue Jul 30, 2014 · 13 comments

Labels
build The build process and cross-build topic-unicode

Comments

@jj
Mannequin

jj mannequin commented Jul 30, 2014

BPO 22108
Nosy @loewis, @vstinner, @ezio-melotti, @zware

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2014-08-01.13:05:07.999>
created_at = <Date 2014-07-30.17:10:42.463>
labels = ['build', 'expert-unicode']
title = 'python c api wchar_t*/char* passing contradiction'
updated_at = <Date 2014-08-01.13:07:10.027>
user = 'https://bugs.python.org/jj'

bugs.python.org fields:

activity = <Date 2014-08-01.13:07:10.027>
actor = 'zach.ware'
assignee = 'none'
closed = True
closed_date = <Date 2014-08-01.13:05:07.999>
closer = 'loewis'
components = ['Unicode']
creation = <Date 2014-07-30.17:10:42.463>
creator = 'jj'
dependencies = []
files = []
hgrepos = []
issue_num = 22108
keywords = []
message_count = 13.0
messages = ['224327', '224329', '224340', '224364', '224400', '224406', '224444', '224462', '224482', '224485', '224490', '224491', '224496']
nosy_count = 6.0
nosy_names = ['loewis', 'vstinner', 'ezio.melotti', 'python-dev', 'zach.ware', 'jj']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'compile error'
url = 'https://bugs.python.org/issue22108'
versions = ['Python 3.4', 'Python 3.5']

@jj
Mannequin Author

jj mannequin commented Jul 30, 2014

The documentation and the code example at
https://docs.python.org/3.5/extending/embedding.html#very-high-level-embedding

#include <Python.h>

int
main(int argc, char *argv[])
{
  Py_SetProgramName(argv[0]);  /* optional but recommended */
  Py_Initialize();
  PyRun_SimpleString("from time import time,ctime\n"
                     "print('Today is', ctime(time()))\n");
  Py_Finalize();
  return 0;
}

contradicts the actual implementation of the code:
http://hg.python.org/cpython/file/tip/Include/pythonrun.h#l25

which leads to compiler errors, because Py_SetProgramName expects a wchar_t* while argv[0] is a char*. To fix them, ugly char to wchar_t conversions are needed.

Also, I was hoping that Python 3.3 had finally switched from wchar_t to char and UTF-8;
at least that's how I understood PEP-393 http://python.org/dev/peps/pep-0393/

see also:

http://stackoverflow.com/questions/21591908/python-3-3-c-string-handling-wchar-t-vs-char

=> Are the docs wrong (which I hope they are not; the example is straightforward and dead simple with a char*),
or is CPython wrong?

@jj jj mannequin added topic-unicode build The build process and cross-build labels Jul 30, 2014
@loewis
Mannequin

loewis mannequin commented Jul 30, 2014

You were misinterpreting PEP-393 - it is only about the representation of string objects, and doesn't affect any pre-existing API. Changing Py_SetProgramName is not possible without breaking existing code, so it could only happen in Python 4.

A proper solution might be adding Py_SetProgramNameUTF8, but it could trick people into believing that argv[0] actually is UTF-8 on their system, which it might not be. Providing Py_SetProgramNameASCII might be better, but it could fail if argv[0] contains non-ASCII characters. Yet another solution could be to expose _Py_char2wchar to the developer.

In any case: yes, the example is outdated, and only valid for Python 2.
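
For readers following along, the conversion discussed above can be sketched in plain standard C. This is only an illustration under stated assumptions, not the code that went into CPython: `decode_locale_to_wide` is a hypothetical helper name, it relies on the C library's `mbsrtowcs`, and it decodes according to whatever locale the program has active (the "C" locale unless `setlocale` has been called).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not part of the Python C API: decode a
 * locale-encoded multibyte string into a newly allocated wide string.
 * Returns NULL if the bytes are not valid in the current locale or if
 * allocation fails. The caller owns the result and must free() it. */
wchar_t *
decode_locale_to_wide(const char *s)
{
    assert(s != NULL);

    mbstate_t state;
    const char *src = s;

    /* First pass: with a NULL destination, mbsrtowcs only counts the
     * wide characters needed (excluding the terminator). */
    memset(&state, 0, sizeof state);
    size_t len = mbsrtowcs(NULL, &src, 0, &state);
    if (len == (size_t)-1) {
        return NULL;  /* invalid multibyte sequence for this locale */
    }

    wchar_t *wide = malloc((len + 1) * sizeof *wide);
    if (wide == NULL) {
        return NULL;
    }

    /* Second pass: convert for real, including the terminating null. */
    memset(&state, 0, sizeof state);
    src = s;
    if (mbsrtowcs(wide, &src, len + 1, &state) == (size_t)-1) {
        free(wide);
        return NULL;
    }
    return wide;
}
```

An embedder could pass the result of `decode_locale_to_wide(argv[0])` to Py_SetProgramName and free it after Py_Finalize. From Python 3.5 on, the C API provides Py_DecodeLocale for this purpose, which is probably the better choice in real code because it also copes with undecodable bytes.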

@vstinner
Member

This issue is why I created the issue bpo-18395.

@jj
Mannequin Author

jj mannequin commented Jul 30, 2014

I'd say Python should definitely change its internal string type to char*. Exposing "handy" char->wchar_t conversion functions doesn't address the underlying data representation issue.

@loewis
Mannequin

loewis mannequin commented Jul 31, 2014

Jonas, why do you say that?

@zware
Member

zware commented Jul 31, 2014

See also bpo-20466 (which has a patch for this, but I cannot speak for its effectiveness).

I'd be in favor of closing that issue and this one as duplicates of bpo-18395, and noting in bpo-18395 that the embedding example must be updated before that issue is closed.

@jj
Mannequin Author

jj mannequin commented Jul 31, 2014

Martin, I think the most intuitive and easiest way to work with strings in C is plain char arrays.

Starting with the main() argv being char*, probably most programmers just go with char* and all the encoding just works.
This is because the encoding only comes into play in the software that handles user input (xorg, keyboard input) and user output (your terminal emulator, the GUI, ...).
No matter what data your program receives, the encoding only matters to the software that actually displays it, which has to select the correct visual representation.
Requiring a conversion to wide chars just increases the interface complexity and adds unneeded data transformations that UTF-8 makes obsolete.

What I'd really like to see in CPython is that the internal storage (and the way it's exposed in the C-API) is just raw bytes (=> char*).

This allows super-easy integration in C projects that probably all just use char as their string type (see the doc example mentioned earlier).

PEP-393 states: "(..) the specification chooses UTF-8 as the recommended way of exposing strings to C code."

And for that, I think using char instead of wchar_t is a better solution for interface developers.

@vstinner
Member

vstinner commented Aug 1, 2014

> What I'd really like to see in CPython is that the internal storage (and the way it's exposed in the C-API) is just raw bytes (=> char*).

Python is portable, and we care about Windows. On Windows, wchar_t* is the native type for strings (e.g. command line, environment variables).

@python-dev
Mannequin

python-dev mannequin commented Aug 1, 2014

New changeset 94d0e842b9ea by Victor Stinner in branch 'default':
Issue bpo-18395, bpo-22108: Update embedded Python examples to decode correctly
http://hg.python.org/cpython/rev/94d0e842b9ea

@vstinner
Member

vstinner commented Aug 1, 2014

I updated the embedding and extending examples but I didn't try them.

@jonas: Can you please try the updated examples?

@jj
Mannequin Author

jj mannequin commented Aug 1, 2014

Indeed, that should do it, thanks.

I'd still plead for Python 4(?) always using char* internally to make this conversion obsolete ;) (except on Windows)

@vstinner
Member

vstinner commented Aug 1, 2014

> I'd still plead for Python 4(?) always using char* internally to make this conversion obsolete ;) (except on Windows)

I don't understand your proposal. We try to have duplicating functions for char* and wchar_t*.

@loewis
Mannequin

loewis mannequin commented Aug 1, 2014

Jonas: Python's string type is a Unicode character type, unlike C's (which is wishy-washy when it comes to characters outside of the "basic execution character set"). So just declaring that all APIs take UTF-8 will *not* allow for easy integration with other C code; instead, it will be a source of mojibake.

In any case, this issue appears to be resolved now; thanks for the patch.
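
As a small illustration of the mojibake concern raised above: a char* carries bytes, not characters, so the same bytes read as different text depending on which encoding the receiver assumes. The sketch below uses two hypothetical helper names and assumes well-formed input; it counts the characters in one byte string under a single-byte interpretation (such as Latin-1) and under UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: character count if every byte is one character,
 * as in a single-byte encoding such as Latin-1. */
size_t
count_chars_single_byte(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0') {
        n++;
    }
    return n;
}

/* Hypothetical helper: code point count if the bytes are UTF-8.
 * Continuation bytes have the bit pattern 10xxxxxx and do not start a
 * new code point. Assumes the input is well-formed UTF-8. */
size_t
count_chars_utf8(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}
```

For the bytes `"caf\xC3\xA9"`, the UTF-8 reading yields four characters ("café"), while the single-byte reading yields five ("cafÃ©"). An API that simply declares its char* arguments to be UTF-8 gets the second kind of result whenever the caller's data is actually in another encoding.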

@loewis loewis mannequin closed this as completed Aug 1, 2014
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022