New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Concrete object C API considered harmful to subclasses of builtin types #55186
Comments
Currently, the concrete object C API bypasses any methods defined on subclasses of builtin types. It has long been accepted that subclasses of builtin types need to override many methods rather than just a few because the type itself was implemented with direct internal calls. This work-around requires extra work but still makes it possible to create a consistent subclass (a case-insensitive dictionary for example). However, there is another problem with this workaround. Any subclassed builtin may still be bypassed by other functions and methods using the concrete C API. For example, the OrderedDict class overrides enough dict methods to make itself internally consistent and fully usable by all pure python code. However, any builtin or extension using PyDict_SetItem() or PyDict_DelItem() will update the OrderedDict's internal unordered dict Since OrderedDict is written in pure python, the consequence of having its invariants violated would result in disorganization, but not in a crash. However if OrderedDict were rewritten in C, augmenting the inherited data structure with its own extra state, then this problem could result in seg-faulting. Note this is only one example. Pretty much any subclass of a builtin type that adds additional state is vulnerable to a concrete C API that updates only part of the state while leaving the extra state information in an inconsistent state. Another example would be a list subclass that kept extra state tracking the number of None objects contained within. There is no way to implement that subclass, either in C or pure python, even if every method were overridden, that wouldn't be rendered inconsistent by an external tool using PyList_SetItem(). My recommendation is to find all of the mutating methods for the concrete C API and add an alternate path for subclasses. The alternate path should use the abstract API. Pseudo-code for PyList_SetItem(): if type(obj) is list:
# proceed as currently implemented
else:
# use PyObject_SetItem() adapting the
# function parameters and return values if necessary
# to match the API for PyList_SetItem().
else:
raise BadInternalCall('object should be a list') |
Here is a patch for list. It modifies the following C API functions: It also includes tests (with ctypes). I plan to do next the same for dict, but first I'd wait for comments, to know if I'm doing something completely wrong :-) |
Hmm, making PyList_* an abstract API doesn't make sense to me. These functions do exactly what they say: they operate on concrete instances of list (they are documented as part of the concrete API). With that reasoning, we should have fallback paths in every function in the concrete APIs; that's a lot of complication added to these C files. IMO we "should" (or, rather, could) instead add abstract PySequence_Append(), etc. functions if we deem it necessary (just as we already have PyMapping_Keys(), etc.). By the way, PyList_SetItem() already has an abstract counterpart called PyObject_SetItem(), so changing this one seems useless. |
The problem isn't a lack of available abstract The concrete methods pre-date the abstract methods IOW, if some patch is accepted, it needs to be aimed |
Then what is it exactly? |
It's largely a backwards compatibility hack, but the concrete methods also have an optimised fast path that the generic methods lack. So (for example), PyList_SetItem would now mean "this is *probably* a list, so check for that and use the fast path if the assumption is correct, otherwise fall back on PyObject_SetItem". Currently, using the concrete API means your code potentially *breaks* if the assumption is incorrect. Sounds like a good idea to me, and will fix a lot of cases of bad interaction between concrete types and related objects. |
I'm just pointing out a problem. Lots of existing code uses the concrete API and that code can break subclasses of builtin types (possibly resulting in segfaults). I don't really care what is done about it. I'm just observing that the current design of the concrete API is incompatible with subclassing of built-in types. A practical consequence is that I don't see how to write a fast, light C version of OrderedDict that would be safe from calls to PyDict_SetItem(). AFAICT, there is no mechanism for adding extra state to a built-in type and being able to guarantee even very simple invariants. |
I may have found another use case for this functionality. Currently, the Python code in heapq.py accepts arbitrary sequences (that are sufficiently compatible with the Sequence ABC), but the C code in _heapq only permits use of a concrete list. The two currently available approaches to address that discrepancy are:
The extensive use of the PyList macro API in _heapq means that a little bit of 2 might still be necessary even if the concrete API was updated as Raymond suggests, but I think there would be value in changing the meaning of the concrete APIs to include falling back to the abstract APIs if the type assumption is incorrect. |
I added a few commments on Daniel's draft patch. However, I'm getting nervous about the performance impact of all these extra pointer comparisons. We should probably hold off on this until more of the macro benchmark suite runs on Py3k so we can get as hint as to the possible speed impact on real application code. |
I find this extremely ugly. The correct way to handle this is to use the generic methods themselves. Really, the type-specific names should only be used when you have just created the type or done PyXXX_CheckExact() yourself on it. |
"should" is a wonderful word when it comes to external APIs. We currently have a couple of problems:
Changing the concrete APIs to be more defensive and fall back to the abstract APIs if their assumptions are violated would be a win on both of those fronts. I also retract my performance concern from above - all of the affected code paths already include a "Py<X>_Check()" that triggers PyErr_BadInternalCall() if it fails. All we would be doing is moving that check further down in the function and dealing with a negative result differently. Some code paths would become slightly slower when used with subclasses of builtin types, but a lot of currently broken code would automatically start supporting subclasses of builtins correctly (and be well on its way to being duck-typed, instead of only working with the builtin types). |
2011/4/6 Nick Coghlan <report@bugs.python.org>:
Why not add fast paths to the generic functions if that's what you're It's unexpected for the user of the functions and breaks years of It seems like asking for subtle bugs to me. The only correct way to is |
I agree with Benjamin. |
All that said, there really is one specific case where fixing this problem could cause a serious backwards compatibility issue: subclasses of builtin types that are *themselves* written in C. In such cases, the method implementations would quite likely use the concrete API to handle the base class state, then do their own thing to handle the rest of the update. Adding an implicit fallback to the concrete API implementations would cause such cases to trigger infinite recursion at the C level. |
Having convinced myself that Raymond's original suggestion can't be implemented safely, I have an alternative (arguably even more radical) proposal: Deprecate the public concrete API functions that modify object state. Put an underscore in front of them for internal use, have the public versions trigger a deprecation warning (not to be removed until 3.6 or so), provide a C level mechanism to easily make the equivalent of a super() call and advise that everyone switch to the abstract API in order to handle subclasses properly. |
But since that problem has been existing for ages, we're not in a rush In the long term, having a clean CPython API with a proper separation of |
Changing the affected components. Antoine and Benjamin are right, this needs to be handled as a documentation and education problem. Raymond's suggestion can too easily trigger infinite recursion and mine would block legitimate use cases in 3rd party code. For heapq, we can just fix it to include the fallbacks to the abstract versions if PyList_CheckExact() fails. |
One thing that could be done is to have an optional warning in the mutating concrete C API functions that gets triggered whenever the input doesn't have an exact type match. That will let people discover incorrect uses. We could also scour our own stdlib code to see if there are any cases where there needs to be an alternate code patch for subclasses of builtin types. I did a scan for PyList_SetItem and found that we were already pretty good about only using it when the target was known to be an exact list. Though those results were good, I'm not hopeful for a similar result with PyDict_SetItem. In the mean time this bug will preclude a C version of OrderedDict from being a dict subclass. That's unfortunate because it means that the C version can't be a drop-in replacement for the existing pure python OrderedDict. |
I thought about a warning, but the problem is that a subclass using the concrete API as part of its *implementation* of the associated slot or method is actually perfectly correct usage. I'm not sure this is enough to give up on the idea of OrderedDict being a dict subclass implemented in C, though. It seems to me that some defensive internal checks (to pick up state corruption due to this problem) would be a lesser evil than giving up on being a dict subclass. |
I'm also in favor of a clean separation between the concrete and abstract |
Because calling them *from* method implementations in concrete subclasses |
We do the same thing from Python all the time (though less so with the availability of super). Now, if the C-API had some equivalent to super... |
I haven't fully thought this one through, but perhaps we could:
This would be 3.5 PEP territory, though. |
I had roughly the same idea, Nick, though my approach to address backward compatibility was more complicated. Definitely worth at least looking into for 3.5. |
-1 on any changes that make the specific C-API functions less specific. They are perfectly adequate for gaining speed in well defined situations. +0 on any changes that special case concrete types in abstract calls if they prove to be worth the hassle. +1 for making sure CPython does not accidentally exclude subclasses internally for anything that's provided by users. +1 on CPython being the wrong place to work around this for external code. If extensions use the wrong function, *they* must be fixed. At most, there might be a documentation issue hidden here. Also, we should be careful with adding yet another substantial set of C-API functions. It's not clear to me that they are really needed in practice. |
The problem we're trying to solve is CPython *silently* breaking subclass |
To be clear: the problem is with CPython calling the concrete APIs when |
On 26 Oct 2013 02:18, "Antoine Pitrou" <report@bugs.python.org> wrote:
The boilerplate required to use them correctly renders them broken in my
|
IMO the boilerplate should actually be included inside the abstract |
Nick Coghlan added the comment:
I think we disagree here. There is no boilerplate at all involved, as long
It has bothered me more than once that even the fast paths in those IMHO, it would be nice to have those internal type checks enabled *only* |
Keep in mind that part of the problem here is finding stdlib code that The problem to be solved isn't "this code should handle ducktyping" but, |
The current hard-coded assumptions lead to problems not just with mutation. Simply using Let's consider a "copy on write" Addressing the performance issue that has been discussed, fixing this specific case would result in performance improvement. I can't say for certain whether this will hold true for addressing this class of problems in general, but the possibility exists. Here's how the function's performance could be improved in this specific case: the current implementation starts with a guard of the form However, there remains the concern raised by @ncoghlan. Subclasses (written in C) may reasonably need access to the OG dict internals, and the proposed exact type check would prevent that. The public API will require modification one way or the other: either by introducing a new method explicitly for accessing the internals, or by updating the documentation to clarify that Personally I think changing code to do what the documentation says it does is the right solution. If we want subclasses written in C to be able to rely on the internals of Basically it's probably best to either do builtin subclassing support properly or not at all. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: