classification
Title: Concrete object C API considered harmful to subclasses of builtin types
Type: behavior Stage:
Components: Documentation Versions: Python 3.3
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Arfrever, Trundle, benjamin.peterson, daniel.urban, docs@python, eric.araujo, eric.snow, josh.r, meador.inge, ncoghlan, pitrou, rhettinger, scoder
Priority: normal Keywords: patch

Created on 2011-01-21 23:49 by rhettinger, last changed 2015-05-13 21:37 by ppperry.

Files
File name Uploaded Description Edit
issue10977_list.patch daniel.urban, 2011-03-20 12:27 Patch for list. review
Messages (31)
msg126800 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2011-01-21 23:49
Currently, the concrete object C API bypasses any methods defined on subclasses of builtin types.  

It has long been accepted that subclasses of builtin types need to override many methods rather than just a few because the type itself was implemented with direct internal calls.  This work-around requires extra work but still makes it possible to create a consistent subclass (a case-insensitive dictionary for example).

However, there is another problem with this workaround.  Any subclassed builtin may still be bypassed by other functions and methods using the concrete C API.

For example, the OrderedDict class overrides enough dict methods to make itself internally consistent and fully usable by all pure python code.  However, any builtin or extension using PyDict_SetItem() or PyDict_DelItem() will update the OrderedDict's internal unordered dict
and bypass all the logic keeping dict ordering invariants intact.  Note, this problem would still impact OrderedDict even if it were rewritten in C.  The concrete calls would directly access the original dict structure and ignore any of the logic implemented in the subclass (whether coded in pure python or in C).

Since OrderedDict is written in pure python, the consequence of having its invariants violated would result in disorganization, but not in a crash.  However if OrderedDict were rewritten in C, augmenting the inherited data structure with its own extra state, then this problem could result in seg-faulting.

Note this is only one example.  Pretty much any subclass of a builtin type that adds additional state is vulnerable to a concrete C API that updates only part of the state while leaving the extra state information in an inconsistent state.

Another example would be a list subclass that kept extra state tracking the number of None objects contained within.  There is no way to implement that subclass, either in C or pure python, even if every method were overridden, that wouldn't be rendered inconsistent by an external tool using PyList_SetItem().

My recommendation is to find all of the mutating methods for the concrete C API and add an alternate path for subclasses.  The alternate path should use the abstract API.

Pseudo-code for PyList_SetItem():

   if type(obj) is list:
      # proceed as currently implemented
   else:
      # use PyObject_SetItem() adapting the 
      # function parameters and return values if necessary
      # to match the API for PyList_SetItem().
   else:
      raise BadInternalCall('object should be a list')
msg131489 - (view) Author: Daniel Urban (daniel.urban) * (Python triager) Date: 2011-03-20 12:27
Here is a patch for list.  It modifies the following C API functions:
PyList_SetItem
PyList_Insert
PyList_Append
PyList_SetSlice
PyList_Sort
PyList_Reverse
_PyList_Extend

It also includes tests (with ctypes).

I plan to do next the same for dict, but first I'd wait for comments, to know if I'm doing something completely wrong :-)
msg131494 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-03-20 13:35
Hmm, making PyList_* an abstract API doesn't make sense to me. These functions do exactly what they say: they operate on concrete instances of list (they are documented as part of the concrete API). With that reasoning, we should have fallback paths in every function in the concrete APIs; that's a lot of complication added to these C files.

IMO we "should" (or, rather, could) instead add abstract PySequence_Append(), etc. functions if we deem it necessary (just as we already have PyMapping_Keys(), etc.).

By the way, PyList_SetItem() already has an abstract counterpart called PyObject_SetItem(), so changing this one seems useless.
msg131537 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2011-03-20 20:17
> By the way, PyList_SetItem() already has an 
> abstract counterpart called PyObject_SetItem(), 
> so changing this one seems useless.

The problem isn't a lack of available abstract
functions. If some code uses PyObject_SetItem(),
then it already works for both list and dict
subclasses. The issue arises only for code that is 
already using PyDict_SetItem() or PyList_SetItem().

The concrete methods pre-date the abstract methods
and pre-date the ability to subclass built-in types.
Those concrete methods were never updated to reflect
the new reality.  They now only make sense for exact
type matches because they know nothing about 
subclasses and their invariants.


IOW, if some patch is accepted, it needs to be aimed
at the concrete functions to make them subclass aware.
msg131538 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-03-20 20:31
> The problem isn't a lack of available abstract
> functions.

Then what is it exactly?
You are arguing for PyList_SetItem() to have the same semantics as
PyObject_SetItem(), but what's the point of having two functions which
do exactly the same thing? It doesn't make sense to me.
msg131550 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-03-20 21:50
It's largely a backwards compatibility hack, but the concrete methods also have an optimised fast path that the generic methods lack.

So (for example), PyList_SetItem would now mean "this is *probably* a list, so check for that and use the fast path if the assumption is correct, otherwise fall back on PyObject_SetItem".

Currently, using the concrete API means your code potentially *breaks* if the assumption is incorrect.

Sounds like a good idea to me, and will fix a lot of cases of bad interaction between concrete types and related objects.
msg131551 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2011-03-20 21:55
I'm just pointing out a problem.  Lots of existing code uses the concrete API and that code can break subclasses of builtin types (possibly resulting in segfaults).

I don't really care what is done about it.  I'm just observing that the current design of the concrete API is incompatible with subclassing of built-in types.

A practical consequence is that I don't see how to write a fast, light C version of OrderedDict that would be safe from calls to PyDict_SetItem().  AFAICT, there is no mechanism for adding extra state to a built-in type and being able to guarantee even very simple invariants.
msg133026 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-04-05 12:24
I may have found another use case for this functionality. Currently, the Python code in heapq.py accepts arbitrary sequences (that are sufficiently compatible with the Sequence ABC), but the C code in _heapq only permits use of a concrete list.

The two currently available approaches to address that discrepancy are:
1. Use PyObject_* calls instead of PyList_* calls throughout the code (yuck, we'd be making the common case significantly worse in the name of semantic purity)
2. Sprinkle type checks throughout the _heapq code to decide whether or not to call the concrete APIs or their abstract equivalents

The extensive use of the PyList macro API in _heapq means that a little bit of 2 might still be necessary even if the concrete API was updated as Raymond suggests, but I think there would be value in changing the meaning of the concrete APIs to include falling back to the abstract APIs if the type assumption is incorrect.
msg133027 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-04-05 12:43
I added a few commments on Daniel's draft patch. However, I'm getting nervous about the performance impact of all these extra pointer comparisons.

We should probably hold off on this until more of the macro benchmark suite runs on Py3k so we can get as hint as to the possible speed impact on real application code.
msg133032 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2011-04-05 14:12
I find this extremely ugly. The correct way to handle this is to use the generic methods themselves. Really, the type-specific names should only be used when you have just created the type or done PyXXX_CheckExact() yourself on it.
msg133126 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-04-06 13:24
"should" is a wonderful word when it comes to external APIs.

We currently have a couple of problems:

1. The concrete APIs will fail noisily if given an instance of something that isn't a list, but may fail *silently* if given a subclass that adds additional state that needs to be kept in sync.

2. We don't have a standard "fast path with fallback" idiom for containers, so any code that wants to support arbitrary sequences has to either accept the performance penalty, or code the fast path at every point it gets used.

Changing the concrete APIs to be more defensive and fall back to the abstract APIs if their assumptions are violated would be a win on both of those fronts.

I also retract my performance concern from above - all of the affected code paths already include a "Py<X>_Check()" that triggers PyErr_BadInternalCall() if it fails. All we would be doing is moving that check further down in the function and dealing with a negative result differently.

Some code paths would become slightly slower when used with subclasses of builtin types, but a lot of currently broken code would automatically start supporting subclasses of builtins correctly (and be well on its way to being duck-typed, instead of only working with the builtin types).
msg133130 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2011-04-06 14:01
2011/4/6 Nick Coghlan <report@bugs.python.org>:
>
> Nick Coghlan <ncoghlan@gmail.com> added the comment:
>
> "should" is a wonderful word when it comes to external APIs.
>
> We currently have a couple of problems:
>
> 1. The concrete APIs will fail noisily if given an instance of something that isn't a list, but may fail *silently* if given a subclass that adds additional state that needs to be kept in sync.
>
> 2. We don't have a standard "fast path with fallback" idiom for containers, so any code that wants to support arbitrary sequences has to either accept the performance penalty, or code the fast path at every point it gets used.
>
> Changing the concrete APIs to be more defensive and fall back to the abstract APIs if their assumptions are violated would be a win on both of those fronts.

Why not add fast paths to the generic functions if that's what you're
concerned about?

It's unexpected for the user of the functions and breaks years of
tradition. What if someone calls PyList_Append on a custom type that
doesn't do as they expect and then PyList_GET_ITEM?

It seems like asking for subtle bugs to me. The only correct way to is
change code that uses these type specific apis to use the generic
ones.
msg133132 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-04-06 14:09
> Why not add fast paths to the generic functions if that's what you're
> concerned about?
> 
> It's unexpected for the user of the functions and breaks years of
> tradition. What if someone calls PyList_Append on a custom type that
> doesn't do as they expect and then PyList_GET_ITEM?
> 
> It seems like asking for subtle bugs to me. The only correct way to is
> change code that uses these type specific apis to use the generic
> ones.

I agree with Benjamin.
IIRC we already have a bunch of fast paths in abstract.c.
msg133137 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-04-06 14:43
1. It's an external API. We *don't control* most of the potentially broken code, so saying "just fix the call sites" effectively solves nothing and leaves classes like OrderedDict at risk of subtle silent corruption whenever they are passed to a 3rd party C extension.

2. The alternate "fix" is to tighten up the Py<X>_Check calls to Py<X>_CheckExact in order to reject subclasses. That would be a huge pain from a backwards compatibility point of view, so it isn't a realistic option.

3. Another alternate "fix" would be to exclude the concrete API from the stable ABI. That was already done for the macro interfaces, but it's too late for the functions - they were included in the stable interface for the 3.2 release.

4. The fact that the macros are irredeemably broken when used without ensuring that the supplied object is exactly the expected type is a very poor excuse for not fixing the function calls. It sounds like "The taillight is busted too, so let's not bother fixing the brakes".

5. There's a reason I put the "fast path" point second - that benefit is just an added bonus that comes from fixing the real design flaw that Raymond pointed out.

All that said, there really is one specific case where fixing this problem could cause a serious backwards compatibility issue: subclasses of builtin types that are *themselves* written in C. In such cases, the method implementations would quite likely use the concrete API to handle the base class state, then do their own thing to handle the rest of the update. Adding an implicit fallback to the concrete API implementations would cause such cases to trigger infinite recursion at the C level.
msg133138 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-04-06 14:48
Having convinced myself that Raymond's original suggestion can't be implemented safely, I have an alternative (arguably even more radical) proposal:

Deprecate the public concrete API functions that modify object state.

Put an underscore in front of them for internal use, have the public versions trigger a deprecation warning (not to be removed until 3.6 or so), provide a C level mechanism to easily make the equivalent of a super() call and advise that everyone switch to the abstract API in order to handle subclasses properly.
msg133139 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-04-06 14:52
> 1. It's an external API. We *don't control* most of the potentially
> broken code, so saying "just fix the call sites" effectively solves
> nothing and leaves classes like OrderedDict at risk of subtle silent
> corruption whenever they are passed to a 3rd party C extension.

But since that problem has been existing for ages, we're not in a rush
to fix it either, are we?
After all, people do have to fix their code from time to time (or port
it to Python 3). Using the right APIs for the job is part of that.
Pointing to the abstract alternatives in the documentation for concrete
APIs may help people do so.

In the long term, having a clean CPython API with a proper separation of
concerns is a major benefit.
msg133141 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-04-06 14:59
Changing the affected components. Antoine and Benjamin are right, this needs to be handled as a documentation and education problem.

Raymond's suggestion can too easily trigger infinite recursion and mine would block legitimate use cases in 3rd party code.

For heapq, we can just fix it to include the fallbacks to the abstract versions if PyList_CheckExact() fails.
msg133176 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2011-04-06 22:22
One thing that could be done is to have an optional warning in the mutating concrete C API functions that gets triggered whenever the input doesn't have an exact type match.  That will let people discover incorrect uses.

We could also scour our own stdlib code to see if there are any cases where there needs to be an alternate code patch for subclasses of builtin types.

I did a scan for PyList_SetItem and found that we were already pretty good about only using it when the target was known to be an exact list. Though those results were good, I'm not hopeful for a similar result with PyDict_SetItem.

In the mean time this bug will preclude a C version of OrderedDict from being a dict subclass.  That's unfortunate because it means that the C version can't be a drop-in replacement for the existing pure python OrderedDict.
msg133191 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-04-07 04:50
I thought about a warning, but the problem is that a subclass using the concrete API as part of its *implementation* of the associated slot or method is actually perfectly correct usage.

I'm not sure this is enough to give up on the idea of OrderedDict being a dict subclass implemented in C, though. It seems to me that some defensive internal checks (to pick up state corruption due to this problem) would be a lesser evil than giving up on being a dict subclass.
msg200741 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2013-10-21 10:01
I'm also in favor of a clean separation between the concrete and abstract
APIs (see also #12965).  Why can't the concrete functions be locked down
with *CheckExact()?  If there's any breakage in third party modules,
it's easy to fix (probably much easier than the Unicode changes in 3.3).
msg200752 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-10-21 11:42
Because calling them *from* method implementations in concrete subclasses
is often a perfectly reasonable thing to do.
msg200804 - (view) Author: Eric Snow (eric.snow) * (Python committer) Date: 2013-10-21 15:38
> Because calling them *from* method implementations in concrete subclasses...

We do the same thing from Python all the time (though less so with the availability of super).
E.G. dict.__init__(self, ...) and dict.__set item__(self, key, value).

Now, if the C-API had some equivalent to super...
msg200837 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-10-21 20:27
I haven't fully thought this one through, but perhaps we could:
- deprecate calling the current concrete APIs with subclasses (since doing
so may break subclass invariants)
- add "PyBaseDict_*" APIs (etc) that subclass implementations can call.
- offer a compatibility header for earlier versions that #defines the new
APIs back to the old ones

This would be 3.5 PEP territory, though.
msg200842 - (view) Author: Eric Snow (eric.snow) * (Python committer) Date: 2013-10-21 20:41
I had roughly the same idea, Nick, though my approach to address backward compatibility was more complicated.  Definitely worth at least looking into for 3.5.
msg201205 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-10-25 05:57
-1 on any changes that make the specific C-API functions less specific. They are perfectly adequate for gaining speed in well defined situations.

+0 on any changes that special case concrete types in abstract calls if they prove to be worth the hassle.

+1 for making sure CPython does not accidentally exclude subclasses internally for anything that's provided by users.

+1 on CPython being the wrong place to work around this for external code. If extensions use the wrong function, *they* must be fixed. At most, there might be a documentation issue hidden here.

Also, we should be careful with adding yet another substantial set of C-API functions. It's not clear to me that they are really needed in practice.
msg201247 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-10-25 15:56
The problem we're trying to solve is CPython *silently* breaking subclass
invariants, which is what the concrete APIs currently do. At the moment,
trying to swap out dicts with ordered dicts in various places will corrupt
the ordered dict instead of calling the appropriate subclass methods or
throwing a type error.
msg201248 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-25 16:18
> The problem we're trying to solve is CPython *silently* breaking subclass
> invariants, which is what the concrete APIs currently do.

To be clear: the problem is with CPython calling the concrete APIs when
it shouldn't, rather than with the concrete APIs not behaving properly.
msg201249 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-10-25 16:23
On 26 Oct 2013 02:18, "Antoine Pitrou" <report@bugs.python.org> wrote:
>
>
> Antoine Pitrou added the comment:
>
> > The problem we're trying to solve is CPython *silently* breaking
subclass
> > invariants, which is what the concrete APIs currently do.
>
> To be clear: the problem is with CPython calling the concrete APIs when
> it shouldn't, rather than with the concrete APIs not behaving properly.

The boilerplate required to use them correctly renders them broken in my
view, particularly when they silently corrupt internal subclass state when
misused instead of throwing an exception.

>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue10977>
> _______________________________________
msg201250 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-25 16:25
> > To be clear: the problem is with CPython calling the concrete APIs when
> > it shouldn't, rather than with the concrete APIs not behaving properly.
> 
> The boilerplate required to use them correctly renders them broken in my
> view, particularly when they silently corrupt internal subclass state when
> misused instead of throwing an exception.

IMO the boilerplate should actually be included inside the abstract
APIs, so as to automatically enable optimized shortcuts when possible.
msg201256 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-10-25 16:44
Nick Coghlan added the comment:
> Antoine Pitrou added the comment:
>> To be clear: the problem is with CPython calling the concrete APIs when
>> it shouldn't, rather than with the concrete APIs not behaving properly.
> 
> The boilerplate required to use them correctly renders them broken in my
> view

I think we disagree here. There is no boilerplate at all involved, as long
as its guaranteed that they are being called on the correct type of object.
User code may provide that guarantee either implicitly or explicitly, and
only the explicit way involves additional code (such as a type check). The
implicit way (that simply knows the exact type) would suffer from any code
additions to the fast paths in the specific C-API functions.

> particularly when they silently corrupt internal subclass state when
> misused instead of throwing an exception.

It has bothered me more than once that even the fast paths in those
functions usually contain type checks, which often means that the type
check is being executed twice: once as a special casing check before
calling the function and once as a safety check after entering it. As long
as the functions are being used correctly, the second check is just useless
code that unnecessarily takes time.

IMHO, it would be nice to have those internal type checks enabled *only*
for debug builds, although even then, there may be legitimate usages that
call these functions on subtypes (such as when called from within a well
defined subtype implementation, as you mentioned).
msg201260 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-10-25 17:02
Keep in mind that part of the problem here is finding stdlib code that
needs to be fixed to handle subclasses correctly. The existing APIs have
permitted use with subclasses for years but are, in fact, the wrong tool
for that in many cases.

The problem to be solved isn't "this code should handle ducktyping" but,
"this code should either reject subclasses or else handle them without
corrupting them". Using the same API for both external invocation and for
subclasses calling up to the parent is a problem. For code in critical
paths, doing the same type check twice is also a problem.
History
Date User Action Args
2015-05-13 21:37:51ppperrysettitle: Concrete object C API considered harmful to subclasses of builtin typesje -> Concrete object C API considered harmful to subclasses of builtin types
2015-05-13 21:36:45ppperrysettitle: Concrete obect C API considered harmful to subclasses of builtin typesje -> Concrete object C API considered harmful to subclasses of builtin typesje
2015-05-13 21:36:24ppperrysettitle: Concrete object C API considered harmful to subclasses of builtin types -> Concrete obect C API considered harmful to subclasses of builtin typesje
2015-03-24 15:06:24r.david.murraylinkissue23757 superseder
2014-10-14 17:58:02skrahsetnosy: - skrah
2014-03-06 22:46:57josh.rsetnosy: + josh.r
2013-10-25 17:02:44ncoghlansetmessages: + msg201260
2013-10-25 16:44:42scodersetmessages: + msg201256
2013-10-25 16:25:21pitrousetmessages: + msg201250
2013-10-25 16:23:05ncoghlansetmessages: + msg201249
2013-10-25 16:18:04pitrousetmessages: + msg201248
2013-10-25 15:56:06ncoghlansetmessages: + msg201247
2013-10-25 05:57:11scodersetnosy: + scoder
messages: + msg201205
2013-10-21 20:41:30eric.snowsetmessages: + msg200842
2013-10-21 20:27:33ncoghlansetmessages: + msg200837
2013-10-21 15:38:02eric.snowsetmessages: + msg200804
2013-10-21 11:42:21ncoghlansetmessages: + msg200752
2013-10-21 10:01:25skrahsetnosy: + skrah
messages: + msg200741
2013-09-10 20:59:16Arfreversetnosy: + Arfrever
2011-11-06 03:05:55meador.ingesetnosy: + meador.inge
2011-11-05 20:05:01eric.snowsetnosy: + eric.snow
2011-04-20 22:25:22rhettingersettitle: Concrete object C API needs abstract path for subclasses of builtin types -> Concrete object C API considered harmful to subclasses of builtin types
2011-04-07 04:50:18ncoghlansetmessages: + msg133191
2011-04-06 22:22:21rhettingersetmessages: + msg133176
2011-04-06 14:59:36ncoghlansetnosy: + docs@python
messages: + msg133141

assignee: docs@python
components: + Documentation, - Interpreter Core
2011-04-06 14:52:54pitrousetmessages: + msg133139
2011-04-06 14:48:57ncoghlansetmessages: + msg133138
2011-04-06 14:43:42ncoghlansetmessages: + msg133137
2011-04-06 14:09:46pitrousetmessages: + msg133132
2011-04-06 14:01:34benjamin.petersonsetmessages: + msg133130
2011-04-06 13:24:25ncoghlansetmessages: + msg133126
2011-04-05 14:12:58benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg133032
2011-04-05 12:43:12ncoghlansetmessages: + msg133027
2011-04-05 12:24:26ncoghlansetmessages: + msg133026
2011-03-20 21:55:22rhettingersetnosy: rhettinger, ncoghlan, pitrou, eric.araujo, Trundle, daniel.urban
messages: + msg131551
2011-03-20 21:50:56ncoghlansetnosy: + ncoghlan
messages: + msg131550
2011-03-20 20:31:43pitrousetnosy: rhettinger, pitrou, eric.araujo, Trundle, daniel.urban
messages: + msg131538
2011-03-20 20:17:59rhettingersetnosy: rhettinger, pitrou, eric.araujo, Trundle, daniel.urban
messages: + msg131537
2011-03-20 13:37:05Trundlesetnosy: + Trundle
2011-03-20 13:35:42pitrousetnosy: + pitrou
messages: + msg131494
2011-03-20 12:27:12daniel.urbansetfiles: + issue10977_list.patch

messages: + msg131489
keywords: + patch
2011-01-28 18:22:21daniel.urbansetnosy: + daniel.urban
2011-01-22 16:46:09eric.araujosetnosy: + eric.araujo

title: Concrete object C API needs abstract path for subclasses of builtin types -> Concrete object C API needs abstract path for subclasses of builtin types
2011-01-21 23:49:24rhettingercreate