Document the encoding of functions bytes arguments of the C API #53947

vstinner · 2010-09-01T22:41:35Z

BPO	9738
Nosy	@terryjreedy, @abalkin, @vstinner, @merwok, @davidmalcolm
Files	encodings.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}


GitHub fields:

assignee = None
closed_at = <Date 2011-05-30.21:13:23.317>
created_at = <Date 2010-09-01.22:41:34.740>
labels = ['interpreter-core', 'expert-unicode', 'docs']
title = 'Document the encoding of functions bytes arguments of the C API'
updated_at = <Date 2011-05-30.21:13:23.316>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2011-05-30.21:13:23.316>
actor = 'vstinner'
assignee = 'docs@python'
closed = True
closed_date = <Date 2011-05-30.21:13:23.317>
closer = 'vstinner'
components = ['Documentation', 'Interpreter Core', 'Unicode']
creation = <Date 2010-09-01.22:41:34.740>
creator = 'vstinner'
dependencies = []
files = ['18705']
hgrepos = []
issue_num = 9738
keywords = ['patch']
message_count = 12.0
messages = ['115339', '115404', '115405', '115523', '115543', '115942', '123655', '123659', '124692', '124696', '125359', '137331']
nosy_count = 6.0
nosy_names = ['terry.reedy', 'belopolsky', 'vstinner', 'eric.araujo', 'dmalcolm', 'docs@python']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue9738'
versions = ['Python 3.2']

vstinner · 2010-09-01T22:41:28Z

Many C functions take a bytes argument (char* type), but the encoding is not documented. It would not be a problem if the encoding were always the same, but it is not. Examples:

format of PyUnicode_FromFormat() should be encoded as ISO-8859-1
filename of PyParser_ASTFromString() should be encoded as utf-8
filename of PyErr_SetFromErrnoWithFilename() should be encoded to the filesystem encoding (with strict error handler, and not surrogateescape)
's' argument of PyParser_ASTFromString() should be encoded as utf-8 if PyPARSE_IGNORE_COOKIE flag is set, otherwise the parser checks for #coding:xxx cookie (if there is no cookie, utf-8 is used)
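
To make the stakes concrete, here is a minimal sketch of how the same bytes yield different text depending on which encoding the callee assumes. Both helpers (`decode_latin1`, `decode_utf8_2byte`) are hypothetical and written only for this illustration; they are not part of the C API, and the UTF-8 one is deliberately limited to 1- and 2-byte sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: ISO-8859-1 maps every byte to the code point with
   the same value, so decoding can never fail. Returns the number of code
   points written to out (at most cap). */
size_t decode_latin1(const char *s, uint32_t *out, size_t cap)
{
    size_t n = 0;
    for (; *s != '\0' && n < cap; s++)
        out[n++] = (unsigned char)*s;
    return n;
}

/* Hypothetical helper: strict decoder for 1- and 2-byte UTF-8 sequences
   only (U+0000..U+07FF). Returns (size_t)-1 on any other input. */
size_t decode_utf8_2byte(const char *s, uint32_t *out, size_t cap)
{
    size_t n = 0;
    while (*s != '\0' && n < cap) {
        unsigned char b0 = (unsigned char)s[0];
        if (b0 < 0x80) {
            out[n++] = b0;                    /* ASCII byte */
            s += 1;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {
            unsigned char b1 = (unsigned char)s[1];
            if ((b1 & 0xC0) != 0x80)          /* also catches a NUL here */
                return (size_t)-1;
            out[n++] = ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
            s += 2;
        } else {
            return (size_t)-1;                /* invalid or unsupported */
        }
    }
    return n;
}
```

With these helpers, the two bytes 0xC3 0xA9 decode to the single code point U+00E9 as UTF-8 but to two code points (U+00C3, U+00A9) as ISO-8859-1, while the lone byte 0xE9 is valid ISO-8859-1 and is rejected by the strict UTF-8 decoder.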

The attached patch is an attempt to document most low-level functions. I chose to add the names of function arguments to the headers because I consider that a header can be used as quick documentation. I only touched .c files to change argument names.

It is hard to get the right encoding, so I cannot ensure that my patch is correct. My patch is just a draft.

I don't know if "encoded to utf-8" is the right expression. Or should it be "decoded as utf-8"?

merwok · 2010-09-02T20:53:03Z

I think either of these is correct:

a UTF-8-encoded string
a string encoded in UTF-8

davidmalcolm · 2010-09-02T21:13:07Z

I think either of these is correct:

a UTF-8-encoded string

a string encoded in UTF-8

Possibly use the word "buffer" here, rather than "string", as "string" may suggest the "str" type.

Or even: "NUL-terminated buffer of UTF-8-encoded bytes", or whatnot.

(sorry for bikeshedding)

terryjreedy · 2010-09-03T22:38:37Z

Better specifying requirements is good. A few comments:

-   The second argument is an error message; it is converted to a string object.
+   The second argument is an error message; it is decoded to a string object
+   with ``'utf-8'`` encoding.
 
I would write the change as
+   The second argument is a utf-8 encoded error message; it is decoded to a string object.

Is the second part (what the function will do with the arg) really needed? I think in the current version, it serves to indirectly specify that the arg is not to be a string, but bytes. If the specific encoding required is specified, that also says bytes, making 'will be decoded' redundant and irrelevant.
-------------------------------

+ a Python exception (class, not an instance). *format* should be a string
+ encoded to ISO-8859-1, containing format codes,

*format* should be ISO-8859-1 encoded bytes containing format codes,

although I am not clear about the implications of that. Are not all format codes ASCII chars?
--------------------------------

I do not really like 'encoded to', but 'decoded to' is wrong. 'will be decoded from xxx bytes' is better. I think there should be a general discussion somewhere about bytes arguments and the terminology that will be used.

vstinner · 2010-09-03T23:53:46Z

About PyErr_Format() and PyUnicode_FromFormat*() encoding: it's not exactly ISO-8859-1... there is a bug => issue bpo-9769.

vstinner · 2010-09-09T12:47:26Z

bpo-6543 changed the encoding of the filename argument of PyRun_SimpleFileExFlags() (and all functions based on PyRun_SimpleFileExFlags) and c_filename attribute of the compiler (private) structure in Python 3.1.3: use utf-8 in strict mode instead of filesystem encoding with surrogateescape.

davidmalcolm · 2010-12-08T22:08:29Z

A (probably crazy) idea that just occurred to me:
  typedef char utf8_bytes;
  typedef char iso8859_1_bytes;
  typedef char fsenc_bytes;

then specify the encoding in the type signature of the API e.g.:

- int PyRun_SimpleFile(FILE *fp, const char *filename)
+ int PyRun_SimpleFile(FILE *fp, const fsenc_bytes *filename)
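
As a rough sketch of what this idea would and would not buy: the typedef names below follow the proposal, while `filename_length` and `demo_no_type_checking` are hypothetical functions invented for the illustration. Because each typedef is only an alias for `char`, the signature documents the encoding but the compiler does not enforce it.

```c
#include <assert.h>
#include <string.h>

/* Names taken from the proposal above; not part of the real C API. */
typedef char utf8_bytes;
typedef char fsenc_bytes;

/* Hypothetical function: the signature now states the expected encoding. */
size_t filename_length(const fsenc_bytes *filename)
{
    assert(filename != NULL);
    return strlen(filename);
}

/* Both typedefs alias char, so a utf8_bytes buffer is accepted where
   fsenc_bytes is expected, without any diagnostic from the compiler. */
size_t demo_no_type_checking(void)
{
    const utf8_bytes *name = "caf\xC3\xA9.py";   /* UTF-8 encoded bytes */
    return filename_length(name);
}
```

So the typedefs would act as self-documenting signatures rather than as a type check; catching a mismatch at compile time would need distinct types (for example, wrapper structs), which goes beyond what is proposed here.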

abalkin · 2010-12-08T22:55:09Z

A (probably crazy) idea that just occurred to me:
typedef char utf8_bytes;
typedef char iso8859_1_bytes;
typedef char fsenc_bytes;

I like it! Let's see how far we can get without iso8859_1_bytes, though. (It is likely to be locale_bytes anyways.) There are a few places where we'll need ascii_bytes.

The added benefit is that we can make these typedefs unsigned char and avoid the ambiguity of char signedness. We will also need to give the typedefs the Py_ prefix.

And an obligatory bikeshedding comment: if we typedef char, we should use the singular form. Or we can typedef char* Py_utf8_bytes.
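
The signedness point can be sketched as follows. `Py_utf8_byte` is a hypothetical name in the spirit of the suggestion (singular form, Py_ prefix), and the three functions are illustrative only.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical typedef; no such name exists in the C API. */
typedef unsigned char Py_utf8_byte;

/* Reports whether plain char is signed on this platform; the C standard
   leaves that choice to the implementation. */
int char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* With an unsigned element type, a non-ASCII byte is always in 128..255,
   so the comparison behaves the same everywhere. */
int is_non_ascii(Py_utf8_byte b)
{
    return b >= 0x80;
}

/* With plain char, the byte must be cast first: where char is signed,
   (char)0xE9 is negative and a bare `c >= 0x80` would never be true. */
int is_non_ascii_plain_char(char c)
{
    return (unsigned char)c >= 0x80;
}
```

One cost to weigh, under the same assumptions: string literals have type `char[]`, so callers passing a literal to a parameter declared as `const Py_utf8_byte *` would need a cast.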

vstinner · 2010-12-27T01:50:56Z

r87504 documents encodings of error functions.
r87505 documents encodings of unicode functions.
r87506 documents encodings of AST, compiler, parser and PyRun functions.

vstinner · 2010-12-27T02:07:05Z

While documenting encodings, I found two issues: bpo-10778 and bpo-10779.

abalkin · 2011-01-04T19:18:36Z

Victor,

Here is an interesting case for your collection: PyDict_GetItemString. Note that it is documented as not setting an error, but in fact it may if encoding fails. This is rarely an issue because most uses of PyDict_GetItemString are with an ASCII string literal.
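
ASCII literals are safe here because ASCII bytes mean the same thing in UTF-8, ISO-8859-1 and other ASCII-compatible encodings, so they cannot fail to decode. A caller who wants to check a key before passing it on could use something like the sketch below; `is_ascii` is a hypothetical helper, not a function of the C API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the NUL-terminated buffer contains
   only ASCII bytes (values below 0x80), and 0 otherwise. */
int is_ascii(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Cast first so the test does not depend on char signedness. */
        if ((unsigned char)*s >= 0x80)
            return 0;
    }
    return 1;
}
```

A key such as "__name__" passes this check, whereas "caf\xE9" (an ISO-8859-1 encoded byte) does not, and is exactly the kind of input for which the documented encoding matters.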

vstinner · 2011-05-30T21:13:23Z

Here is an interesting case for your collection: PyDict_GetItemString.

It's easier to guess the encoding of such a function: Python 3 always uses UTF-8, but yes, the encoding should be documented.

I documented many functions, directly in the header files, and sometimes also in the reST documentation.

I am closing this issue because I consider it done. If you would like to document the encoding of some specific functions, please open new issues.

vstinner assigned docspython Sep 1, 2010

vstinner added docs Documentation in the Doc dir interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode labels Sep 1, 2010

vstinner closed this as completed May 30, 2011

ezio-melotti transferred this issue from another repository Apr 10, 2022
