classification
Title: PyMapping_Check returns 1 for list
Type: enhancement Stage: patch review
Components: C API Versions: Python 3.11
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: aganders3, aviramha, avrahami.ben, bar.harel, benjamin.peterson, brandtbucher, bukzor, georg.brandl, levkivskyi, methane, miss-islington, petr.viktorin, pitrou, rhettinger, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2022-01-14 11:12 by aviramha, last changed 2022-01-25 09:43 by avrahami.ben.

Pull Requests
URL Status Linked Edit
PR 30600 closed aviramha, 2022-01-14 15:32
Messages (23)
msg410555 - (view) Author: Aviram (aviramha) * Date: 2022-01-14 11:12
This is re-open of https://bugs.python.org/issue5945.
In the former issue, it was decided that documenting the odd behavior and later providing clean, good ABC C API would be the long term solution.
Few years passed and there are no alternatives AFAIK
I am wondering if that's due to lack of resources or just forgotten?
I don't mind contributing the necessary change in case there's really nothing in progress and it is something Python community wants to fix.
Hopefully there's an existing solution and I just didn't search well enough.
msg410556 - (view) Author: Aviram (aviramha) * Date: 2022-01-14 13:08
https://github.com/PyO3/pyo3/issues/2072 Relevant discussion in PyO3 related issue
msg410557 - (view) Author: Aviram (aviramha) * Date: 2022-01-14 13:33
Would checking the TPFLAGS for `Py_TPFLAGS_SEQUENCE` & `Py_TPFLAGS_MAPPING` when using `PySequence_Check` & `PyMapping_Check` be a valid fix?
msg410564 - (view) Author: Bar Harel (bar.harel) * Date: 2022-01-14 14:40
Up until now, trying to distinguish between actual sequences and mappings in C-API was a pain. Same methods are implemented in customer user classes, and the ABCs could have either been registered dynamically or had an appropriate __subclasshook__. On top of that, making the same checks as isinstance regarding the whole ABC and __mro__ chain would considerably slow down the code.

The new tp_flags are set in both registered abc classes, and subclasses of the appropriate types. As tp_flags exists, we don't even need to account for tp_as_mapping and tp_as_sequence. According to the docs "if such a flag bit is clear, the type fields it guards must not be accessed and must be considered to have a zero or NULL value instead.". User made classes that don't inherit from the Sequence and Mapping ABCs are not considered either, and there is no __subclasshook__ for these ABCs so inheritance is fully enforced.

As we cover builtins, c-extension classes and user-made Python classes, checking tp_flags is guaranteed to be accurate for distinguishing between mapping and sequence types, and would be as fast as a pointer dereference can be.

Modifying the current PySequence_Check and PyMapping_Check might be a breaking change, especially on a C-API level. However one can argue that checking for a mapping and expecting a failure on user-made mappings counts as a bug. Having them accurately return the correct type in a fast manner using tp_flags is therefore an acceptable breaking change.
msg410566 - (view) Author: Bar Harel (bar.harel) * Date: 2022-01-14 14:54
Do note, the relevant functions are in the Stable ABI, and their promise will slightly change, yet modifying the current functions instead of creating new ones may still be beneficial.
msg410585 - (view) Author: Aviram (aviramha) * Date: 2022-01-14 18:34
I submitted a draft patch. Using TPFlags alone doesn't cut it as some types are excluded (bytes, str, bytearray) in sequence and same for mapping. I'm thinking of checking for those cases specifically as those are very very specific casings. Would love some input.
msg410618 - (view) Author: Bar Harel (bar.harel) * Date: 2022-01-15 04:01
Another question we should ask is about duck typing. Is a sequence which doesn't inherit from abc.Sequence considered a sequence? Whatever the answer is, PySequence specifically looks for a sequence and removes duck typing out of the picture. The object will not pass static typing and will not pass isinstance check, so there's no reason for it to pass PySequence.
msg410619 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2022-01-15 04:34
> It changes behavior for objects not being iterable/sequences 
> if not inheriting from `abc.collections.Sequence`.

This would be a breaking change (for example, it broke the long stable sre_parse code).

The utility of PySequence_Check and PyMapping_Check is already so low that it isn't work breaking other things to just to marginally improve these two minor and rarely used functions.

These functions will never be fully reliable.  The documentation explains that in general, we can't tell if a class with __getitem__ is a mapping or a sequence.  Sometimes hints are present (such as the tp_flags), but the can't get a reliable result.  As Guido observed, "calling PyMapping_Check() was never particularly reliable, and extension modules depending on it probably always had subtle bugs."  That was true in 2011 and it is still true today.

I recommend closing this.  These functions are mostly unimportant and unreliable and cannot be made correct.  In contrast, iterability is important and needs to be stable.  Special cases aren't important enough to break the rules.
msg410621 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2022-01-15 05:00
s/it isn't work breaking other things/it isn't worth breaking other things/
msg410622 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2022-01-15 05:10
collections.abc.Mapping is fixed by https://bugs.python.org/issue43977
We can be same thing if backward compatibility allows it.
msg410624 - (view) Author: Bar Harel (bar.harel) * Date: 2022-01-15 05:36
I thought about it, what about simply excluding TPFLAGS_MAPPING from PySequence and TPFLAGS_Sequence from PyMapping? It will remove the types that are 100% not sequences or mappings, as these flags are mutually exclusive by definition. The result will be much more accurate, yet not cause a breaking change, apart from places where it is truly not a sequence or mapping.
msg410626 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2022-01-15 07:11
> what about simply excluding TPFLAGS_MAPPING from PySequence 
> and TPFLAGS_Sequence from PyMapping? It will remove the types 
> that are 100% not sequences or mappings, as these flags 
> are mutually exclusive by definition.

This is more plausible than the proposed breaking change.


> The result will be much more accurate

If they can't be made fully reliable, why would we ever recommend that someone use these functions in real code?  They can be made to guess better than they guess now, but there is still guesswork.  

ISTM developers should follow the structure pattern matching implementation and refuse the temptation to guess.  If a class declares itself as a mapping or sequence, that is reliable information.  In contrast, these functions attempt to divine meaning in the absence of a clear declaration.  Using these functions will likely result in subtle bugs.

Once Py_TPFLAGS_MAPPING and Py_TPFLAGS_SEQUENCE became available, we should have deprecated these functions.  No one should use them anymore. Their core design is flawed; they tried to deduce semantics from structural artifacts.
msg410627 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2022-01-15 07:32
Here's question to focus on:  In what circumstances should a developer ever prefer PyMapping_Check() over PyType_HasFeature() with Py_TPFLAGS_MAPPING?

The latter doesn't give any new, useful information:

   return o && Py_TYPE(o)->tp_as_mapping &&
               Py_TYPE(o)->tp_as_mapping->mp_subscript;

I don't see any reason to build on top of this. It's best to just let it go gently into the good night without disrupting anything that currently happens to work.
msg410629 - (view) Author: Aviram (aviramha) * Date: 2022-01-15 08:19
I agree that a developer should and would prefer the `Py_TPFLAGS_*` but when you visit https://docs.python.org/3/c-api/sequence.html
It seems like the best practice to determine Sequence protocol is by using this function, hence leading to confusion. There's no recommendation to use the new `Py_TPFLAGS_*`.
To have this knowledge of `Py_TPFLAGS_*` one should be very knowledgable in Python's C-API.
How about adding a deprecation note to `PyMapping_Check` & `PySequence_Check` in the documentation, suggesting the alternative path (to use `PyType_HasFeature`)?
msg410632 - (view) Author: Bar Harel (bar.harel) * Date: 2022-01-15 08:52
@rhettinger I completely understand what you're saying and at first I agreed with you. Before I gave it a closer look, I thought about the same thing - we want reliability. Reliability is important and will avoid subtle bugs, which is why I was against this change for the exact reasons you mentioned: it is both breaking and unreliable.

I then realized that this change can be a reliable replacement for isinstance(obj, collections.abc.Sequence) at the C level. Let's use the broken sre_parse.SubPattern as an example - it does not register or inherit from collections.abc.Sequence, and isinstance(SubPattern, collections.abc.Sequence) == False. We cannot know programmatically if SubPattern is a Sequence, we cannot type hint it as such, and apart from reading the documentation, we cannot deal with the type differently in dynamic code that accepts either sequences or mappings. I dare to say, counting on it being a sequence, especially on a LBYL language like C is even less reliable. While SubPattern "embraces" the spirit of duck typing, it is very hard to fit in light of all recent changes advocating for a more structured and well defined types. After all, this feature was requested in order to solve reliability issues in statically typed languages.

Putting everything aside, the grand question still remains: do you think that there's a use for an efficient C-API isinstance check for Sequences and Mappings? I would presume the answer is yes. Would we encourage it? I have no clue. But if there's a need, we can either change this function as it has the same "spirit" or introduce a new one to prevent breaking existing code.

To answer your question: per specification, testing for Py_TPFLAGS_SEQUENCE using PyType_HasFeature, does not take strings, bytes and bytearray into consideration, and will not suffice. It is an incorrect solution that is even less reliable and falls into the exact pitfall of "guesswork" (for example SubPattern currently doesn't work with it either). It is not encouraged or easily thought of. PySequence_Check which is much more intuitive yet doesn't work either and that's where fixing it can have an edge.

A theoretical `PyIsInstance_Sequence` can check for TPFLAGS_SEQUENCE and Str/Bytes/ByteArray_Check. If I'm not wrong, doing so will be 100% reliable, identical to isinstance(obj, Sequence), and will be very efficient.

As a side-note, the C-API documentation for TP_FLAGS is not clear atm. It mentions for example tp_as_sequence and says "if such a flag bit is clear, the type fields it guards must not be accessed and must be considered to have a zero or NULL value instead" yet Py_TPFLAGS_SEQUENCE does not actually coincide with sequences per specification. I know it has a different explanation as well and the flag has its own docstring, but it is still a bit misleading.
msg410633 - (view) Author: Bar Harel (bar.harel) * Date: 2022-01-15 09:28
> That was true in 2011 and it is still true today

Python's methodology greatly shifted since 2011. For the better or worse, Python lost some of its duck typing. Nowadays, most people use linters. Wherever you'd try to pass sre_parse.SubPattern, the linter will throw an error saying it's not a Sequence even if it fully behaves like one. You can silence that error but you'll continue receiving such warnings all over the code, whether in stdlib or out in the wild. The meaning of Sequence now shifted to "inherits from abc.Sequence". The only thing wrong with PySequence_Check is that the ecosystem shifted, but its view of a Sequence remained the same.
msg410944 - (view) Author: Petr Viktorin (petr.viktorin) * (Python committer) Date: 2022-01-19 13:39
Changing the existing functions is a no-go for backwards compatibility reasons. I think the best way forward would be to add a new function, and then possibly deprecate the old one if it's deemed dangerous.

If you want to push this forward, could you summarize the use case(s) and expected behavior (docs) for such a function? It probably doesn't need a PEP, but it's worth looking here for what to think about when making a proposal: https://www.python.org/dev/peps/pep-0001/#what-belongs-in-a-successful-pep
msg410949 - (view) Author: Aviram (aviramha) * Date: 2022-01-19 14:12
Sure, I will do so. The proposal should be written here, right?
msg410951 - (view) Author: Petr Viktorin (petr.viktorin) * (Python committer) Date: 2022-01-19 14:15
I'd post it to capi-sig, or to the existing thread on python-dev. But here's a good place too, especially if you want feedback from a smaller group of people first.
msg411363 - (view) Author: Aviram (aviramha) * Date: 2022-01-23 11:45
I sent it to sig-capi - https://mail.python.org/archives/list/capi-sig@python.org/thread/T6DHEKHKKZIYU2GEPGHUQJ3DHTJXZGWW/
msg411387 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-01-23 17:30
It is difficult to distinguish Mapping from Sequence because they have the same set of dunder methods. The difference is only in semantic -- __iter__ yields items for Sequence and keys for Mapping, __getitem__ gets an item by index (or a sequence by slice) for Sequence and value by key for Mapping. But semantic cannot be tested here.

In these cases when both Sequence and Mapping are accepted but handled differently (like in the dict constructor) we test for existence of the "keys" method. It is not good, because it is not a dunder method, and therefore should be looked up at instance, not only at a type.
msg411422 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2022-01-23 21:59
I would really like this to be left alone.  We've been without a reliable version for over a decade.  That is strong evidence that we really don't need this unless it can be done perfectly (which it can't).  

Also, PyMapping_Check is just a CPython specific optimization (and a flawed one).  Because it can't be made completely reliable, we should not encourage people to use it, nor should we enshrine it in the stable API.

And if a modification can potentially break long stable code (such as that it the re module), then we absolutely shouldn't do it. x

Further, there is a general design issue.  Abstract base classes were invented to solve this specific problem (distinguishing mappings from sequences).  ABCs (and typing) are a now well established practice, and it is foolish to try to do an end run around using them.

I am strongly opposed to this going forward and request that a PEP be made if it is pursued further.   It is an SC level decision to allow stable code to be broken, to guarantee a CPython specific API for something that cannot be made correct in the general case, and to bypass the intended way to do it.

If the core concern is that isinstance() checks for ABCs are too slow, then efforts should be made to optimize them rather than creating an unreliable, CPython only alternative.
msg411568 - (view) Author: Ben Avrahami (avrahami.ben) * Date: 2022-01-25 09:43
IMHO, I don't think any alternative to aviramha's solution addresses the issue, And I don't think the need is niche enough to be ignored.

PyType_HasFeature excludes strings, bytes, and other esoteric types.
PyMapping_Check includes mappings like dict and MappingProxyType.
PySequence_Check includes non-dict mappings like MappingProxyType.

The only possible solutions right now are:

a. Import the "collections.abc.Sequence" class into C and run "PyObject_IsInstance" on it. This would be the correct solution, but it would be slower than aviramha's propsal.
b. Perform's aviramha's proposal manually: first check the "Py_TPFLAGS_SEQUENCE" feature flag, and fallback for instance checking strings, bytecodes, and other esoteric types individually. This would be correct and fast, but is cumbersome to perform, and users are bound to forget some types.

A question as simple as "would isinstance(X, <Sequence/Mapping>) returns true?" should be easier to answer for low-level developer, one only needs to look at the finagling numpy, among others, has to perform to parse a sequence (the relevant function in numpy is called "PyArray_FromAny").

A simple implementation of a new function for non-CPython alternatives will be to do implement solution a: import "collections.abc.Sequence", and check for instance. For CPYTHON, this can be optimized by implementing solution b.

There is already a precedence for ABI functions that only exist because its trivial implementation can be optimized with CPython specific behaviour. I don't understand the reluctance to add this one.
History
Date User Action Args
2022-01-25 09:43:02avrahami.bensetnosy: + avrahami.ben
messages: + msg411568
2022-01-23 21:59:19rhettingersetmessages: + msg411422
2022-01-23 17:30:55serhiy.storchakasetmessages: + msg411387
2022-01-23 11:45:38aviramhasetmessages: + msg411363
2022-01-19 14:15:20petr.viktorinsetmessages: + msg410951
2022-01-19 14:12:10aviramhasetmessages: + msg410949
2022-01-19 13:39:27petr.viktorinsetnosy: + petr.viktorin
messages: + msg410944
2022-01-15 14:30:52rhettingersetassignee: rhettinger ->
2022-01-15 09:28:31bar.harelsetmessages: + msg410633
2022-01-15 08:52:56bar.harelsetmessages: + msg410632
2022-01-15 08:19:35aviramhasetmessages: + msg410629
2022-01-15 07:32:17rhettingersetmessages: + msg410627
2022-01-15 07:11:20rhettingersetmessages: + msg410626
2022-01-15 05:56:09jmillikinsetnosy: - jmillikin
2022-01-15 05:36:01bar.harelsetmessages: + msg410624
2022-01-15 05:10:01methanesetnosy: + methane
messages: + msg410622
2022-01-15 05:00:37rhettingersetmessages: + msg410621
2022-01-15 04:34:42rhettingersetassignee: rhettinger
messages: + msg410619
2022-01-15 04:01:14bar.harelsetmessages: + msg410618
2022-01-14 18:34:05aviramhasetmessages: + msg410585
2022-01-14 18:21:38iritkatrielsettype: behavior -> enhancement
versions: - Python 3.7, Python 3.8, Python 3.9, Python 3.10
2022-01-14 15:49:45brandtbuchersetnosy: + brandtbucher
2022-01-14 15:32:55aviramhasetkeywords: + patch
stage: patch review
pull_requests: + pull_request28799
2022-01-14 14:54:04bar.harelsetmessages: + msg410566
2022-01-14 14:40:47bar.harelsetnosy: + bar.harel
messages: + msg410564
2022-01-14 13:33:54aviramhasetmessages: + msg410557
2022-01-14 13:31:40aganders3setnosy: + aganders3
2022-01-14 13:08:39aviramhasetmessages: + msg410556
2022-01-14 11:18:28aviramhasetnosy: + georg.brandl, rhettinger, pitrou, benjamin.peterson, jmillikin, bukzor, serhiy.storchaka, levkivskyi, miss-islington
2022-01-14 11:12:18aviramhacreate