Document the encoding of functions bytes arguments of the C API #53947
Comments
Many C functions have a bytes argument (char* type), but the encoding is not documented. It would not be a problem if the encoding was always the same, but it is not. Examples:
The attached patch is an attempt to document most low-level functions. I chose to add the names of function arguments in the headers because I consider that a header can be used as quick documentation. I only touched .c files to change argument names. It is hard to get the right encoding, so I cannot ensure that my patch is correct. My patch is just a draft. I don't know if "encoded to utf-8" is the right expression. Or should it be "decoded as utf-8"?
I think either of these is correct:
Possibly use the word "buffer" here, rather than "string", as "string" may suggest the "str" type. Or even: "NUL-terminated buffer of UTF-8-encoded bytes", or whatnot. (sorry for bikeshedding)
Better specifying requirements is good. A few comments:
- The second argument is an error message; it is converted to a string object.
+ The second argument is an error message; it is decoded to a string object
+ with ``'utf-8'`` encoding.
I would write the change as
+ The second argument is a utf-8 encoded error message; it is decoded to a string object.
Is the second part (what the function will do with the arg) really needed? I think in the current version, it serves to indirectly specify that the arg is not to be a string, but bytes. If the specific encoding required is specified, that also says bytes, making 'will be decoded' redundant and irrelevant.
+ a Python exception (class, not an instance). *format* should be a string
*format* should be ISO-8859-1 encoded bytes containing format codes, although I am not clear about the implications of that. Are not all format codes ASCII chars?
I do not really like 'encoded to', but 'decoded to' is wrong. 'will be decoded from xxx bytes' is better.
I think there should be a general discussion somewhere about bytes arguments and the terminology that will be used.
About the encoding of PyErr_Format() and PyUnicode_FromFormat*(): it's not exactly ISO-8859-1. There is a bug; see issue bpo-9769.
bpo-6543 changed the encoding of the filename argument of PyRun_SimpleFileExFlags() (and all functions based on PyRun_SimpleFileExFlags) and of the c_filename attribute of the (private) compiler structure in Python 3.1.3: they now use utf-8 in strict mode instead of the filesystem encoding with surrogateescape.
A (probably crazy) idea that just occurred to me:
typedef char utf8_bytes;
typedef char iso8859_1_bytes;
typedef char fsenc_bytes;
then specify the encoding in the type signature of the API, e.g.:
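As a rough illustration of the typedef idea, here is a minimal sketch. The alias names and the function `example_utf8_length` are hypothetical and do not exist in CPython's headers; the function is only there so that a signature can carry one of the aliases.

```c
#include <stddef.h>

/* Hypothetical aliases following the idea above; none of these names
   exist in CPython's headers. */
typedef char Py_utf8_char;   /* bytes expected to be UTF-8 encoded */
typedef char Py_fsenc_char;  /* bytes in the filesystem encoding */

/* The parameter type now documents the expected encoding.
   Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
   It assumes well-formed input and does not validate it. */
size_t example_utf8_length(const Py_utf8_char *text)
{
    size_t count = 0;
    for (; *text != '\0'; text++) {
        if (((unsigned char)*text & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

Note that a C typedef is only an alias: the compiler will still accept a `Py_fsenc_char *` where a `Py_utf8_char *` is expected, so the benefit is documentation in the signature rather than type checking.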
I like it! Let's see how far we can get without iso8859_1_bytes, though. (It is likely to be locale_bytes anyway.) There are a few places where we'll need ascii_bytes. The added benefit is that we can make these typedefs unsigned char and avoid the ambiguity of char signedness. We will also need to give the typedefs the Py_ prefix. And an obligatory bikeshedding comment: if we typedef char, we should use the singular form. Or we can typedef char* Py_utf8_bytes.
r87504 documents encodings of error functions.
Victor, here is an interesting case for your collection: PyDict_GetItemString. Note that it is documented as not setting an error, but in fact it may if encoding fails. This is rarely an issue because most uses of PyDict_GetItemString are with an ASCII string literal.
It's easier to guess the encoding of such a function: Python 3 always uses UTF-8, but yes, the encoding should be documented. I documented many functions, directly in the header files, and sometimes also in the reST documentation. I am closing this issue because I consider it done. If you would like to document the encoding of some specific functions, please open new issues.
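The signedness point can be sketched as follows. The alias `Py_utf8_uchar` and the function `example_has_non_ascii` are hypothetical names used only for illustration, not part of any real API.

```c
#include <stdbool.h>

/* Hypothetical unsigned alias; not a real CPython typedef. */
typedef unsigned char Py_utf8_uchar;

/* With plain char, the test `c >= 0x80` is never true on platforms where
   char is signed. With unsigned char the comparison behaves the same
   on every platform. */
bool example_has_non_ascii(const Py_utf8_uchar *bytes)
{
    for (; *bytes != 0; bytes++) {
        if (*bytes >= 0x80) {
            return true;
        }
    }
    return false;
}
```

One cost of this variant is that callers passing ordinary string literals need a cast, for example `example_has_non_ascii((const Py_utf8_uchar *)"abc")`.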