This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Undefined C behavior going beyond end of struct via a [1] arrays.
Type: compile error Stage: needs patch
Components: Interpreter Core Versions: Python 3.11
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: gregory.p.smith Nosy List: colesbury, gregory.p.smith, pablogsal, pitrou, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2020-03-30 21:04 by gregory.p.smith, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL Status Linked Edit
PR 19232 closed gregory.p.smith, 2020-03-30 21:07
Messages (20)
msg365349 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-03-30 21:04
The correct C99 way to do this is using a char[].  PyBytesObject and unicode's struct encoding_map both do this.

Unclear to me if we should backport this to earlier versions or not (because PyBytesObject may be exposed?)  Probably, but I also doubt it is a big deal as compilers are generally not _yet_ making use of this detail AFAIK.
msg365350 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-03-30 21:13
From the PR comment thread (as I opened that first):

"""Well, there was no other choice in ISO C89 than using char ob_sval[1];, no?

Is char ob_sval[]; supported by the C compiler supported by CPython? Like Visual Studio, GCC, clang and xlc (AIX)? (I don't think that we officially support xlc, but it's more "best effort" support.)

You can use the new buildbot label to test you change on more platforms.""" - vstinner

Per https://www.python.org/dev/peps/pep-0007/ we require some C99 features as of CPython 3.6.  It does not currently list Flexible array member.

I'll be very surprised if we find any compiler that does not support this.

I'll run this through the buildbot testing as you suggested and assuming nothing important falls out, see that we add this to the C99 required feature list.
msg365399 - (view) Author: Sam Gross (colesbury) * (Python triager) Date: 2020-03-31 16:46
It may be worth considering C-API extensions written in C++. Flexible array members are not part of the C++ standard, although GCC, Clang, and MSVC support them as an extension. GCC and Clang will issue warnings with `-Wpedantic` and MSVC will issue warnings with `/Wall`. Currently, C++ code that includes `<Python.h>` is warning-free in GCC (g++) even with `-Wpedantic`. That won't be true after this change, unless Py_LIMITED_API is defined.

Note that GCC also explicitly supports trailing one-element arrays (the current pattern) as an extension. https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
msg365403 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-31 17:25
> Currently, C++ code that includes `<Python.h>` is warning-free in GCC (g++) even with `-Wpedantic`. That won't be true after this change, unless Py_LIMITED_API is defined.

By the way, the current trend is to make more and more structures opaque in the C API. See for example:

* bpo-39573: Make PyObject an opaque structure in the limited C API
* bpo-39947: Make the PyThreadState structure opaque (move it to the internal C API)
msg365410 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-03-31 19:21
"""
The [1] thing is used in more places:

 PyObject *f_localsplus[1]; in PyFrameObject
 PyObject *ob_item[1]; in PyTupleObject
 void *elements[1]; in asdl_seq and int elements[1];` in asdl_int_seq
 digit ob_digit[1]; in PyLongObject
 Py_ssize_t ob_array[1]; in PyMemoryViewObject
 _tracemalloc.c: frame_t frames[1]; in traceback_t

Do we need to update all of them? Do you want to update them all?
""" - vstinner

I believe we probably do.  But I suggest multiple PRs.  I'll update the issue title.

I'm also going to ask clang/llvm folks that prompted me to look into this for comments.
msg365415 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-03-31 19:53
I concur with Sam that we should keep compatibility with C++ in header files.
msg365425 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-03-31 21:02
Isn't the only actual way for a C .h file to be compatible with C++ via

extern "C" {
}

which all of our non-meta include files appear to have already?
msg365464 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-01 13:36
The following C++ code fails to build:
---
#ifdef __cplusplus
#  include <cstdlib>
#else
#  include <stdlib.h>
#endif

#ifdef __cplusplus
extern "C" {
#endif

typedef struct {
    int x;
    int y;
    char array[];
} mystruct_t;

#ifdef __cplusplus
}
#endif

int main()
{
    size_t size = 2;
    mystruct_t *obj = (mystruct_t *)malloc(sizeof(mystruct_t) - 1 + size);
    obj->array[0] = 'O';
    obj->array[1] = 'K';
    free(obj);
    return 0;
}
---

Error:
---
$ LANG= g++ -pedantic -Werror x.cpp
x.c:14:10: error: ISO C++ forbids flexible array member 'array' [-Werror=pedantic]
   14 |     char array[];
      |          ^~~~~
cc1plus: all warnings being treated as errors
---
msg365465 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-01 13:41
Modules/hashtable.c and Modules/hashtable.h use a different approach. The variable size data is *not* part of the structure:

typedef struct {
    /* used by _Py_hashtable_t.buckets to link entries */
    _Py_slist_item_t _Py_slist_item;

    Py_uhash_t key_hash;

    /* key (key_size bytes) and then data (data_size bytes) follows */
} _Py_hashtable_entry_t;

In memory, we have: [_Py_slist_item, key_hash, key, data] where key size is table->key_size bytes (not stored in each table entry, only in the stable).

Pointer to key and data is computed with these macros:

#define _Py_HASHTABLE_ENTRY_PKEY(ENTRY) \
        ((const void *)((char *)(ENTRY) \
                        + sizeof(_Py_hashtable_entry_t)))

#define _Py_HASHTABLE_ENTRY_PDATA(TABLE, ENTRY) \
        ((const void *)((char *)(ENTRY) \
                        + sizeof(_Py_hashtable_entry_t) \
                        + (TABLE)->key_size))

But this approach is more annoying to use, it requires to play with pointers and requires such ugly macros.
msg365466 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-01 13:42
> Undefined C behavior going beyond end of struct via a [1] arrays.

How is it an undefined C behavior? It works well in practice, no?
msg365472 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-04-01 14:09
AFAIK extern "C" only affects mangling of function names. Because of overloading in C++ you can have several functions with the same name, and to distinguish "int abs(int)" from "float abs(float)" the C++ compiler mangles function names, that makes them incompatible with C.
msg365539 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-02 00:36
> How is it an undefined C behavior? It works well in practice, no?

Famous last words ;)
msg365671 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-04-03 02:40
updates:

- extern "C" is indeed really only about linking so it has no bearing.

What I'm hearing from talking to our C++ compiler team is unfortunately sad: The C++ standard does not support flexible array member syntax on purpose because it leads to problems specific to C++ (ex: what do "new" and "del" do?)

So some compilers will reject such code (just as some accept it treating it as C99 does).  Meaning we can't do this in any public header file.

One workaround would indeed be to do something similar to that hashtable code, but it is quite annoying and I don't know that we could actually change the definition of PyBytesObject that way as its internals could be referenced externally.  (though all the bytes should line up regardless so even macros before and after such a change would be compatible?)

Within our internal private pure C code we could move to use this feature; things in .h files are the cross language issue.

Anyways I'm following up internally to better understand the motivation for wanting code to not use the "it's worked forever" technically undefined behavior of the trailing [1] member and out of bounds access.

Pondering, I wonder if this could turn into a "-fwrapv" style of situation, we depend on that behavior working so we adopted the compiler flag when compilers started to care; so at most we might some day need to pass another compiler flag to ensure it stays?  we'll see.

I'm inclined not to move forward with my PRs for now.
msg365691 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-03 12:08
For me, the most sane option is to make structures opaque in the C API, and then flexible array members.

I did something similar for atomic types. First, we got tons of build isssues with various C compilers and then with C++ compilers. I moved the header to our "internal C API", so basically I removed it from the public C API. Since that time, we stopped to get bug reports about pyatomic.h :-)
msg365744 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-04-04 04:26
agreed, being opaque seems ideal.
msg365745 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-04-04 04:27
PyBytesObject could become a struct that contains a single opaque internal struct.

there is some code out there that references PyBytesObjects internals by field name but my searches across a broad swath of code so far seem to suggest that is so rare this change may be plausible (more research needed).  People are using the PyBytes_ macros and APIs.  yay!
msg366883 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2020-04-20 21:13
Another possibility yet would be:

typedef struct {
    PyObject_VAR_HEAD
    Py_hash_t ob_shash;
    char ob_sval;
} PyBytesObject;

#define PyBytes_AS_STRING(op) (assert(PyBytes_Check(op)), \
                                &(((PyBytesObject *)(op))->ob_sval))

Not sure whether that would be UB...
msg411151 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2022-01-21 17:09
How about:

#ifdef __cplusplus
    char array[1];
#else
    char array[];
#endif

?
msg411168 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-01-21 17:49
#ifdef __cplusplus
    char array[1];
#else
    char array[];
#endif

Does it change the size of the structure between C and C++, sizeof(PyBytesObject)? Also, does the size of the struture matter? :-) I guess that the impart part is the ABI: offset of the members.
msg411207 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2022-01-21 23:32
off the top of my head that might actually work as I _think_ "empty" things are required to consume an unused byte of size no matter what meaning sizeof shouldn't change?  Some testing and standards perusing for C99 is in order to confirm that though.
History
Date User Action Args
2022-04-11 14:59:28adminsetgithub: 84301
2022-01-21 23:32:19gregory.p.smithsetmessages: + msg411207
2022-01-21 17:49:24vstinnersetmessages: + msg411168
2022-01-21 17:09:52pitrousetmessages: + msg411151
2021-06-08 00:21:22gregory.p.smithsetversions: + Python 3.11, - Python 3.9
2020-04-20 21:13:05pitrousetnosy: + pitrou
messages: + msg366883
2020-04-04 04:27:58gregory.p.smithsetmessages: + msg365745
2020-04-04 04:26:22gregory.p.smithsetmessages: + msg365744
2020-04-03 12:08:15vstinnersetmessages: + msg365691
2020-04-03 02:40:16gregory.p.smithsetmessages: + msg365671
2020-04-02 00:36:42pablogsalsetnosy: + pablogsal
messages: + msg365539
2020-04-01 14:09:12serhiy.storchakasetmessages: + msg365472
2020-04-01 13:42:58vstinnersetmessages: + msg365466
2020-04-01 13:41:27vstinnersetmessages: + msg365465
2020-04-01 13:36:29vstinnersetmessages: + msg365464
2020-03-31 21:02:48gregory.p.smithsetmessages: + msg365425
2020-03-31 19:53:05serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg365415
2020-03-31 19:22:00gregory.p.smithsetstage: patch review -> needs patch
2020-03-31 19:21:53gregory.p.smithsettitle: Undefined C behavior going beyond end of struct via a char[1]. -> Undefined C behavior going beyond end of struct via a [1] arrays.
2020-03-31 19:21:23gregory.p.smithsetmessages: + msg365410
2020-03-31 17:25:15vstinnersetmessages: + msg365403
2020-03-31 16:46:56colesburysetnosy: + colesbury
messages: + msg365399
2020-03-30 21:13:18gregory.p.smithsetnosy: + vstinner
2020-03-30 21:13:10gregory.p.smithsetmessages: + msg365350
2020-03-30 21:07:36gregory.p.smithsetkeywords: + patch
pull_requests: + pull_request18592
2020-03-30 21:04:23gregory.p.smithcreate