This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Documentation for format units starting with 'e' is inconsistent
Type: enhancement Stage: patch review
Components: Documentation Versions: Python 3.9, Python 3.8, Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: larry Nosy List: ZackerySpytz, benjamin.peterson, georg.brandl, larry, lemburg, martin.panter, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2015-04-16 23:38 by larry, last changed 2022-04-11 14:58 by admin.

Pull Requests
URL Status Linked Edit
PR 19610 open ZackerySpytz, 2020-04-19 21:19
Messages (8)
msg241292 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2015-04-16 23:38
Documentation is here:

    https://docs.python.org/3/c-api/arg.html#arg-parsing


The first line of documentation for each format unit follows this convention:
    formatunit (pythontype) [arguments, to, pyarg_parsetuple]

These represent the format unit itself, followed by the Python type it consumes in parentheses, followed by the C types it outputs in square brackets.  Thus
    i (int) [int]
means the format unit is 'i', it consumes a Python 'int', and it produces a C 'int'.  Similarly,
    s (str) [const char *]
means the format unit is 's', it consumes a Python 'str', and it produces a C 'const char *'.

When you call PyArg_ParseTuple (AndKeywords), you pass in a pointer to the thing you expect.  If it gives you an int, you pass in &my_int. So the type of the expression you pass in for 'i' is actually "int *".  And the type you pass in for 's' is actually "char **".

The format units that deal with encodings are a bit weirder.  You actually pass in a const char * string first, followed by the buffer you want to write data too.  Technically the types of the values you pass in for "es" are "const char *, char **".  But the documentation for es says
    es (str) [const char *encoding, char **buffer]

This led me to believe that I actually had to pass in a "char ***" for buffer!  Which is wrong and doing so makes your programs explode-y.


The documentation should

* explain this first-line convention precisely, and

* use the types consistently.

My suspicion is that the things in brackets have to be the precise C type, e.g. "int *" for i, "char **" for s, "const char *, char **" for es.
msg241298 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2015-04-17 00:35
On 17.04.2015 01:38, Larry Hastings wrote:
> 
> Documentation is here:
> 
>     https://docs.python.org/3/c-api/arg.html#arg-parsing
> 
> 
> The first line of documentation for each format unit follows this convention:
>     formatunit (pythontype) [arguments, to, pyarg_parsetuple]
> 
> These represent the format unit itself, followed by the Python type it consumes in parentheses, followed by the C types it outputs in square brackets.  Thus
>     i (int) [int]
> means the format unit is 'i', it consumes a Python 'int', and it produces a C 'int'.  Similarly,
>     s (str) [const char *]
> means the format unit is 's', it consumes a Python 'str', and it produces a C 'const char *'.
> 
> When you call PyArg_ParseTuple (AndKeywords), you pass in a pointer to the thing you expect.  If it gives you an int, you pass in &my_int. So the type of the expression you pass in for 'i' is actually "int *".  And the type you pass in for 's' is actually "char **".
> 
> The format units that deal with encodings are a bit weirder.  You actually pass in a const char * string first, followed by the buffer you want to write data too.  Technically the types of the values you pass in for "es" are "const char *, char **".  But the documentation for es says
>     es (str) [const char *encoding, char **buffer]

You need to pass in a variable which will then be set up to point to a
buffer which will be written too :-)

The "e" variants (typically) allocate a buffer for you, since it's pretty
much unknown how long the encoded data will be.

> This led me to believe that I actually had to pass in a "char ***" for buffer!  Which is wrong and doing so makes your programs explode-y.

Indeed :-)

> The documentation should
> 
> * explain this first-line convention precisely, and
> 
> * use the types consistently.
> 
> My suspicion is that the things in brackets have to be the precise C type, e.g. "int *" for i, "char **" for s, "const char *, char **" for es.

The paragraph under "Parsing argument" says:

"""
In the following description, the quoted form is the format unit; the entry in (round) parentheses
is the Python object type that matches the format unit; and the entry in [square] brackets is the
type of the C variable(s) whose address should be passed.
"""

So I guess the "e" descriptions need to have the additional * removed
or the paragraph has to be updated and all other listings need
to be converted to precise types (that would be my preference).

I wonder why no one has noticed in all these years. I apparently had
understood the listings back to be precise C types back in the days
I added the documentation for the "e" codes:
https://hg.python.org/cpython-fullhistory/rev/3ae06c57d09e).

The descriptions for the codes do clarify what is going on, though.
msg241307 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2015-04-17 03:50
> The "e" variants (typically) allocate a buffer for you, since it's pretty
> much unknown how long the encoded data will be.

All four will do it if you pass in a NULL pointer.  "es#" and "et#" can reuse
an existing buffer, because you can also pass in its size.

> So I guess the "e" descriptions need to have the additional * removed
> or the paragraph has to be updated and all other listings need
> to be converted to precise types (that would be my preference).

Here's the problem with removing the "additional *".  The first argument
to encoding is a static string.  You literally pass in the char * string,
not a pointer to a variable containing the address of the string.  e.g.

  PyArg_ParseTuple("es", args, "utf-8", &buffer);

So how do we annotate that?  [char, char *]?

> I wonder why no one has noticed in all these years.

Because nobody ever freaking uses the "e" format units.  But by golly
I'm going to see that the docs are right and Clinic supports them
correctly--or die trying.
msg241313 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2015-04-17 04:42
One more idea.  We annotate with an & when you pass in a pointer to a variable.  So format unit 'i' would get [& int], 's' would get [& char *], and 'es#' would get [char *, & char *, & int].
msg241318 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2015-04-17 07:40
On 17.04.2015 06:42, Larry Hastings wrote:
> 
> Larry Hastings added the comment:
> 
> One more idea.  We annotate with an & when you pass in a pointer to a variable.  So format unit 'i' would get [& int], 's' would get [& char *], and 'es#' would get [char *, & char *, & int].

That sounds like a nice solution.
msg241319 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2015-04-17 07:57
On 17.04.2015 05:50, Larry Hastings wrote:
> 
> Larry Hastings added the comment:
> 
>> The "e" variants (typically) allocate a buffer for you, since it's pretty
>> much unknown how long the encoded data will be.
> 
> All four will do it if you pass in a NULL pointer.  "es#" and "et#" can reuse
> an existing buffer, because you can also pass in its size.

Right.

>> So I guess the "e" descriptions need to have the additional * removed
>> or the paragraph has to be updated and all other listings need
>> to be converted to precise types (that would be my preference).
> 
> Here's the problem with removing the "additional *".  The first argument
> to encoding is a static string.  You literally pass in the char * string,
> not a pointer to a variable containing the address of the string.  e.g.
> 
>   PyArg_ParseTuple("es", args, "utf-8", &buffer);
> 
> So how do we annotate that?  [char, char *]?

A literal static string is really a pointer as well. The compiler
will allocate and initialize the string and then provide a pointer
to reference it. It is also possible to pass in the pointer
directly - it just needs to be a const char *, i.e. one who's target
value doesn't change.

So for "es" this would have to be [const char, char *], but I like
your & idea more: [& const char *, & char **]

>> I wonder why no one has noticed in all these years.
> 
> Because nobody ever freaking uses the "e" format units.  But by golly
> I'm going to see that the docs are right and Clinic supports them
> correctly--or die trying.

It's not used much, but there are a few cases in the posixmodule and
the Mac modules of the Python 2.7 stdlib, as well as in extension modules,
of course. The Python 3.4 stdlib does not seem to use them anymore
(the uses were replaced with custom solutions or the modules removed).

I guess people never found out about those parser markers or
simply always fetch Unicode as object rather than as encoded
string.
msg286430 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2017-01-29 05:55
The O! and O& units are in a similar situation. They just use a different font and descriptive name, rather than a specific type:

``O!`` (object) [*typeobject*, PyObject \*]
``O&`` (object) [*converter*, *anything*]

Following this lead, you could write:

``es`` (:class:`str`) [*encoding*, char \*buffer]
``et`` (. . .) [*encoding*, char \*buffer]
``es#`` (:class:`str`) [*encoding*, char \*buffer, int buffer_length]
``et#`` (. . .) [*encoding*, char \*buffer, int buffer_length]

The text description should explain what *encoding* is, but it appears it may already do that well enough.
msg286699 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-02-01 21:13
LGTM.
History
Date User Action Args
2022-04-11 14:58:15adminsetgithub: 68168
2020-04-19 21:50:17ZackerySpytzsetversions: + Python 3.7, Python 3.8, Python 3.9, - Python 2.7, Python 3.4, Python 3.5
2020-04-19 21:19:07ZackerySpytzsetkeywords: + patch
nosy: + ZackerySpytz

pull_requests: + pull_request18944
stage: needs patch -> patch review
2017-02-06 14:05:36vstinnersetnosy: + vstinner
2017-02-01 21:13:51serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg286699
2017-01-29 05:55:23martin.pantersetnosy: + martin.panter
messages: + msg286430
2015-04-17 07:57:47lemburgsetmessages: + msg241319
2015-04-17 07:40:09lemburgsetmessages: + msg241318
2015-04-17 04:42:22larrysetmessages: + msg241313
2015-04-17 03:50:41larrysetmessages: + msg241307
2015-04-17 00:35:29lemburgsetnosy: + lemburg
messages: + msg241298
2015-04-16 23:38:45larrycreate