classification
Title: Document the encoding of functions bytes arguments of the C API
Type: Stage:
Components: Documentation, Interpreter Core, Unicode Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: belopolsky, dmalcolm, docs@python, eric.araujo, terry.reedy, vstinner
Priority: normal Keywords: patch

Created on 2010-09-01 22:41 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
encodings.patch vstinner, 2010-09-01 22:41
Messages (12)
msg115339 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-01 22:41
Many C functions take a bytes argument (char* type), but the encoding is not documented. It would not be a problem if the encoding were always the same, but it is not. Examples:
 - format of PyUnicode_FromFormat() should be encoded as ISO-8859-1
 - filename of PyParser_ASTFromString() should be encoded as utf-8
 - filename of PyErr_SetFromErrnoWithFilename() should be encoded to the filesystem encoding (with strict error handler, and not surrogateescape)
 - 's' argument of PyParser_ASTFromString() should be encoded as utf-8 if PyPARSE_IGNORE_COOKIE flag is set, otherwise the parser checks for #coding:xxx cookie (if there is no cookie, utf-8 is used)
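
To illustrate why the encoding has to be stated, here is a minimal sketch in plain C (no Python headers). The helper names `count_latin1_chars` and `count_utf8_chars` are hypothetical and not part of the C API; the sketch only shows that the same `char*` buffer holds a different number of characters under two of the encodings listed above.

```c
#include <stddef.h>
#include <string.h>

/* ISO-8859-1: every byte is exactly one character. */
size_t count_latin1_chars(const char *buf)
{
    return strlen(buf);
}

/* UTF-8: count every byte that is not a continuation byte (10xxxxxx).
   Assumes buf holds well-formed UTF-8; no validation is done. */
size_t count_utf8_chars(const char *buf)
{
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)buf; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the buffer `"caf\xC3\xA9"`, the first helper reports 5 characters and the second reports 4, so a caller who guesses the wrong encoding ends up with different text.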

The attached patch is an attempt to document most low-level functions. I chose to add the names of function arguments in the headers because I consider that a header can be used as quick documentation. I only touched .c files to change argument names.

It is hard to get the encoding right, so I cannot guarantee that my patch is correct. My patch is just a draft.

I don't know if "encoded to utf-8" is the right expression. Or should it be "decoded as utf-8"?
msg115404 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-09-02 20:53
I think either of these is correct:
- a UTF-8-encoded string
- a string encoded in UTF-8
msg115405 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2010-09-02 21:13
> I think either of these is correct:
> - a UTF-8-encoded string
> - a string encoded in UTF-8

Possibly use the word "buffer" here, rather than "string", as "string" may suggest the "str" type.

Or even: "NUL-terminated buffer of UTF-8-encoded bytes", or whatnot.

(sorry for bikeshedding)
msg115523 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-09-03 22:38
Better specifying requirements is good. A few comments:

-   The second argument is an error message; it is converted to a string object.
+   The second argument is an error message; it is decoded to a string object
+   with ``'utf-8'`` encoding.
 
I would write the change as
+   The second argument is a utf-8 encoded error message; it is decoded to a string object. 

Is the second part (what the function will do with the arg) really needed? I think in the current version, it serves to indirectly specify that the arg is not to be a string, but bytes. If the specific encoding required is specified, that also says bytes, making 'will be decoded' redundant and irrelevant.
-------------------------------

+   a Python exception (class, not an instance). *format* should be a string
+   encoded to ISO-8859-1, containing format codes, 

*format* should be ISO-8859-1 encoded bytes containing format codes,

although I am not clear about the implications of that. Are not all format codes ASCII characters?
--------------------------------

I do not really like 'encoded to', but 'decoded to' is wrong. 'will be decoded from xxx bytes' is better. I think there should be a general discussion somewhere about bytes arguments and the terminology that will be used.
msg115543 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-03 23:53
About PyErr_Format() and PyUnicode_FromFormat*() encoding: it's not exactly ISO-8859-1... there is a bug => issue #9769.
msg115942 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-09 12:47
#6543 changed the encoding of the filename argument of PyRun_SimpleFileExFlags() (and all functions based on PyRun_SimpleFileExFlags) and c_filename attribute of the compiler (private) structure in Python 3.1.3: use utf-8 in strict mode instead of filesystem encoding with surrogateescape.
msg123655 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2010-12-08 22:08
A (probably crazy) idea that just occurred to me:
  typedef char utf8_bytes;
  typedef char iso8859_1_bytes;
  typedef char fsenc_bytes;

then specify the encoding in the type signature of the API e.g.:
- int PyRun_SimpleFile(FILE *fp, const char *filename)
+ int PyRun_SimpleFile(FILE *fp, const fsenc_bytes *filename)
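
A minimal sketch of what such aliases would and would not give, assuming the typedef names from the proposal above (`filename_length` and `call_with_wrong_alias` are hypothetical helpers, not C API functions). In C a typedef is only an alias, so the signature documents the encoding but the compiler does not enforce it:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical aliases taken from the proposal above. */
typedef char utf8_bytes;
typedef char fsenc_bytes;

/* Hypothetical function: the signature documents the expected encoding. */
size_t filename_length(const fsenc_bytes *filename)
{
    return strlen(filename);
}

/* Both aliases are plain char to the compiler, so passing a utf8_bytes
   buffer where fsenc_bytes is expected is accepted without a diagnostic. */
size_t call_with_wrong_alias(const utf8_bytes *name)
{
    return filename_length(name);
}
```

So the typedefs would act as documentation in the headers rather than as a type check.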
msg123659 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-08 22:55
> A (probably crazy) idea that just occurred to me:
>  typedef char utf8_bytes;
>  typedef char iso8859_1_bytes;
>  typedef char fsenc_bytes;

I like it!  Let's see how far we can get without iso8859_1_bytes, though.  (It is likely to be locale_bytes anyways.)  There are a few places where we'll need ascii_bytes.

The added benefit is that we can make these typedefs unsigned char and avoid the ambiguity of char signedness.  We will also need to give the typedefs the Py_ prefix.

And an obligatory bikeshedding comment: if we typedef char, we should use the singular form.  Or we can typedef char* Py_utf8_bytes.
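
A small sketch of the signedness point, in plain C with a hypothetical helper name (`count_non_ascii` is not a C API function). Where plain char is a signed 8-bit type, a test such as `buf[i] >= 0x80` is never true, so the bytes are read through unsigned char instead:

```c
#include <stddef.h>

/* Count bytes outside the ASCII range (>= 0x80).
   Reading through unsigned char gives the same result whether plain
   char is signed or unsigned on the platform. */
size_t count_non_ascii(const char *buf)
{
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)buf; *p != '\0'; p++) {
        if (*p >= 0x80)
            n++;
    }
    return n;
}
```

With an unsigned char typedef in the API, the cast inside the function would no longer be needed.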
msg124692 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-12-27 01:50
r87504 documents encodings of error functions.
r87505 documents encodings of unicode functions.
r87506 documents encodings of AST, compiler, parser and PyRun functions.
msg124696 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-12-27 02:07
While documenting encodings, I found two issues: #10778 and #10779.
msg125359 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2011-01-04 19:18
Victor,

Here is an interesting case for your collection: PyDict_GetItemString.  Note that it is documented as not setting an error, but in fact it may if encoding fails.  This is rarely an issue because most uses of PyDict_GetItemString are with an ASCII string literal.
msg137331 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-30 21:13
> Here is an interesting case for your collection: PyDict_GetItemString.

It's easier to guess the encoding of such a function: Python 3 always uses UTF-8, but yes, the encoding should be documented.

I documented many functions, directly in the header files, and sometimes also in the reST documentation.

I am closing this issue because I consider it done. If you would like to document the encoding of some specific functions, please open new issues.
History
Date                 User         Action  Args
2022-04-11 14:57:06  admin        set     github: 53947
2011-05-30 21:13:23  vstinner     set     status: open -> closed; resolution: fixed; messages: + msg137331
2011-01-04 19:18:35  belopolsky   set     nosy: terry.reedy, belopolsky, vstinner, eric.araujo, dmalcolm, docs@python; messages: + msg125359
2010-12-27 02:07:04  vstinner     set     nosy: terry.reedy, belopolsky, vstinner, eric.araujo, dmalcolm, docs@python; messages: + msg124696
2010-12-27 01:50:56  vstinner     set     nosy: terry.reedy, belopolsky, vstinner, eric.araujo, dmalcolm, docs@python; messages: + msg124692
2010-12-08 22:55:09  belopolsky   set     messages: + msg123659
2010-12-08 22:08:28  dmalcolm     set     messages: + msg123655
2010-11-17 23:54:56  belopolsky   set     nosy: + belopolsky
2010-09-09 12:47:26  vstinner     set     messages: + msg115942
2010-09-03 23:53:46  vstinner     set     messages: + msg115543
2010-09-03 22:38:37  terry.reedy  set     nosy: + terry.reedy; messages: + msg115523
2010-09-02 21:13:07  dmalcolm     set     nosy: + dmalcolm; messages: + msg115405
2010-09-02 20:53:03  eric.araujo  set     nosy: + eric.araujo; messages: + msg115404
2010-09-01 22:41:34  vstinner     create