Issue 9738: Document the encoding of functions bytes arguments of the C API


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53947

classification

Title:	Document the encoding of functions bytes arguments of the C API
Type:		Stage:
Components:	Documentation, Interpreter Core, Unicode	Versions:	Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	belopolsky, dmalcolm, docs@python, eric.araujo, terry.reedy, vstinner
Priority:	normal	Keywords:	patch

Created on 2010-09-01 22:41 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
encodings.patch	vstinner, 2010-09-01 22:41

Messages (12)
msg115339 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-01 22:41
Many C functions have a bytes argument (char* type), but the encoding is not documented. It would not be a problem if the encoding was always the same, but it is not. Examples:
- the format argument of PyUnicode_FromFormat() should be encoded as ISO-8859-1
- the filename argument of PyParser_ASTFromString() should be encoded as utf-8
- the filename argument of PyErr_SetFromErrnoWithFilename() should be encoded to the filesystem encoding (with the strict error handler, not surrogateescape)
- the 's' argument of PyParser_ASTFromString() should be encoded as utf-8 if the PyPARSE_IGNORE_COOKIE flag is set; otherwise the parser checks for a #coding:xxx cookie (if there is no cookie, utf-8 is used)

The attached patch is an attempt to document most low-level functions. I chose to add the names of function arguments in the headers because I consider that a header can be used as quick documentation. I only touched .c files to change argument names.

It is hard to get the right encoding, so I cannot ensure that my patch is correct. My patch is just a draft.

I don't know if "encoded to utf-8" is the right expression. Or should it be "decoded as utf-8"?
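To make the difference concrete, here is a minimal sketch in plain C (it does not use the Python C API). The helper names utf8_strict_length and latin1_length are hypothetical, invented for this illustration. They show that the same char* buffer can be acceptable under one encoding and rejected, or read differently, under another.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: number of code points in a NUL-terminated buffer
 * if it is well-formed UTF-8 (strict: no overlong forms, no surrogates),
 * or -1 if it is not. */
static long utf8_strict_length(const char *buf)
{
    const unsigned char *p = (const unsigned char *)buf;
    long count = 0;

    assert(buf != NULL);
    while (*p != 0) {
        unsigned long cp, min;
        int extra;

        if (*p < 0x80)                { cp = *p;        extra = 0; min = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; min = 0x80; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; min = 0x800; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; min = 0x10000; }
        else return -1;               /* stray continuation or invalid lead byte */
        p++;

        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)  /* also stops at the terminating NUL */
                return -1;
            cp = (cp << 6) | (*p & 0x3F);
            p++;
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;
        count++;
    }
    return count;
}

/* Hypothetical helper: in ISO-8859-1 every byte is exactly one character,
 * so any NUL-terminated buffer is valid and its length is the byte count. */
static long latin1_length(const char *buf)
{
    assert(buf != NULL);
    return (long)strlen(buf);
}
```

For example, the buffer "caf\xE9" is a four-character ISO-8859-1 string but is rejected as UTF-8, while "caf\xC3\xA9" is four characters as UTF-8 and five as ISO-8859-1. This is why the expected encoding of each argument has to be stated.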
msg115404 - (view)	Author: Éric Araujo (eric.araujo) *	Date: 2010-09-02 20:53
I think either of these is correct:
- a UTF-8-encoded string
- a string encoded in UTF-8
msg115405 - (view)	Author: Dave Malcolm (dmalcolm)	Date: 2010-09-02 21:13
> I think either of these is correct:
> - a UTF-8-encoded string
> - a string encoded in UTF-8

Possibly use the word "buffer" here, rather than "string", as "string" may suggest the "str" type. Or even: "NUL-terminated buffer of UTF-8-encoded bytes", or whatnot.

(sorry for bikeshedding)
msg115523 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2010-09-03 22:38
Better specifying requirements is good. A few comments:

- The second argument is an error message; it is converted to a string object.
+ The second argument is an error message; it is decoded to a string object
+ with ``'utf-8'`` encoding.

I would write the change as

+ The second argument is a utf-8 encoded error message; it is decoded to a string object.

Is the second part (what the function will do with the arg) really needed? I think in the current version, it serves to indirectly specify that the arg is not to be a string, but bytes. If the specific encoding required is specified, that also says bytes, making 'will be decoded' redundant and irrelevant.

-------------------------------

+ a Python exception (class, not an instance). format should be a string
+ encoded to ISO-8859-1, containing format codes,

format should be ISO-8859-1 encoded bytes containing format codes,

although I am not clear about the implications of that. Are not all format codes ASCII chars?

--------------------------------

I do not really like 'encoded to', but 'decoded to' is wrong. 'will be decoded from xxx bytes' is better. I think there should be a general discussion somewhere about bytes arguments and the terminology that will be used.
msg115543 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-03 23:53
About PyErr_Format() and PyUnicode_FromFormat*() encoding: it's not exactly ISO-8859-1... there is a bug => issue #9769.
msg115942 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-09 12:47
#6543 changed the encoding of the filename argument of PyRun_SimpleFileExFlags() (and all functions based on PyRun_SimpleFileExFlags) and of the c_filename attribute of the (private) compiler structure in Python 3.1.3: they now use utf-8 in strict mode instead of the filesystem encoding with surrogateescape.
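The two policies differ on bytes that cannot be decoded. Under the surrogateescape error handler (PEP 383), each undecodable byte 0x80..0xFF is mapped to a lone surrogate U+DC80..U+DCFF so that the original bytes can be recovered later, whereas strict mode reports an error instead. The sketch below is plain C written for illustration, not CPython code; escape_byte and unescape_code_point are hypothetical names.

```c
#include <assert.h>

/* Map an undecodable byte (0x80..0xFF) to the lone surrogate that
 * surrogateescape uses for it: U+DC80..U+DCFF. */
static unsigned int escape_byte(unsigned char byte)
{
    assert(byte >= 0x80);   /* ASCII bytes are never escaped */
    return 0xDC00u + byte;
}

/* Reverse mapping: return the original byte for an escaped code point,
 * or -1 if the code point is not in U+DC80..U+DCFF. */
static int unescape_code_point(unsigned int cp)
{
    if (cp < 0xDC80u || cp > 0xDCFFu)
        return -1;
    return (int)(cp - 0xDC00u);
}
```

With this mapping the byte 0xE9 becomes U+DCE9 and round-trips back to 0xE9. A function that decodes with utf-8 in strict mode would reject the same byte.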
msg123655 - (view)	Author: Dave Malcolm (dmalcolm)	Date: 2010-12-08 22:08
A (probably crazy) idea that just occurred to me:

typedef char utf8_bytes;
typedef char iso8859_1_bytes;
typedef char fsenc_bytes;

then specify the encoding in the type signature of the API, e.g.:

- int PyRun_SimpleFile(FILE *fp, const char *filename)
+ int PyRun_SimpleFile(FILE *fp, const fsenc_bytes *filename)
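One property of this idea is worth spelling out: in C a typedef is only an alias, so the compiler would not reject a buffer passed under the "wrong" name, and the names would serve as documentation in the signature. A minimal sketch, reusing two of the typedef names from the message above together with two hypothetical functions (name_length and mixed_call) that are not part of any real API:

```c
#include <assert.h>
#include <string.h>

typedef char utf8_bytes;
typedef char fsenc_bytes;

/* Hypothetical function whose signature documents the expected encoding. */
static size_t name_length(const fsenc_bytes *filename)
{
    assert(filename != NULL);
    return strlen(filename);
}

/* Passing a utf8_bytes buffer where fsenc_bytes is expected compiles
 * without a diagnostic, because both names are aliases for char. */
static size_t mixed_call(void)
{
    const utf8_bytes *source = "caf\xC3\xA9.py";
    return name_length(source);
}
```

The typedefs make the intent visible to a reader of the header, but checking that callers respect it would still rely on review or external tooling.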
msg123659 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-12-08 22:55
> A (probably crazy) idea that just occurred to me:
> typedef char utf8_bytes;
> typedef char iso8859_1_bytes;
> typedef char fsenc_bytes;

I like it! Let's see how far we can get without iso8859_1_bytes, though. (It is likely to be locale_bytes anyway.) There are a few places where we'll need ascii_bytes.

The added benefit is that we can make these typedefs unsigned char and avoid the ambiguity of char signedness. We will also need to give the typedefs the Py_ prefix.

And an obligatory bikeshedding comment: if we typedef char, we should use the singular form. Or we can typedef char* Py_utf8_bytes.
msg124692 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-12-27 01:50
r87504 documents encodings of error functions. r87505 documents encodings of unicode functions. r87506 documents encodings of AST, compiler, parser and PyRun functions.
msg124696 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-12-27 02:07
While documenting encodings, I found two issues: #10778 and #10779.
msg125359 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2011-01-04 19:18
Victor, here is an interesting case for your collection: PyDict_GetItemString. Note that it is documented as not setting an error, but in fact it may if encoding fails. This is rarely an issue because most uses of PyDict_GetItemString are with an ASCII string literal.
msg137331 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-05-30 21:13
> Here is an interesting case for your collection: PyDict_GetItemString.

It's easier to guess the encoding of such a function: Python 3 always uses UTF-8, but yes, the encoding should be documented.

I documented many functions, directly in the header files, and sometimes also in the reST documentation. I am closing this issue because I consider it done. If you would like to document the encoding of some specific functions, please open new issues.

History
Date	User	Action	Args
2022-04-11 14:57:06	admin	set	github: 53947
2011-05-30 21:13:23	vstinner	set	status: open -> closed resolution: fixed messages: + msg137331
2011-01-04 19:18:35	belopolsky	set	nosy: terry.reedy, belopolsky, vstinner, eric.araujo, dmalcolm, docs@python messages: + msg125359
2010-12-27 02:07:04	vstinner	set	nosy: terry.reedy, belopolsky, vstinner, eric.araujo, dmalcolm, docs@python messages: + msg124696
2010-12-27 01:50:56	vstinner	set	nosy: terry.reedy, belopolsky, vstinner, eric.araujo, dmalcolm, docs@python messages: + msg124692
2010-12-08 22:55:09	belopolsky	set	messages: + msg123659
2010-12-08 22:08:28	dmalcolm	set	messages: + msg123655
2010-11-17 23:54:56	belopolsky	set	nosy: + belopolsky
2010-09-09 12:47:26	vstinner	set	messages: + msg115942
2010-09-03 23:53:46	vstinner	set	messages: + msg115543
2010-09-03 22:38:37	terry.reedy	set	nosy: + terry.reedy messages: + msg115523
2010-09-02 21:13:07	dmalcolm	set	nosy: + dmalcolm messages: + msg115405
2010-09-02 20:53:03	eric.araujo	set	nosy: + eric.araujo messages: + msg115404
2010-09-01 22:41:34	vstinner	create