Message 349195 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	josh.r
Recipients	canjo, josh.r, pitrou, rhettinger, serhiy.storchaka
Date	2019-08-07.19:51:18
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1565207479.82.0.0380900835422.issue24413@roundup.psfhosted.org>
In-reply-to

Content
To be clear, set().__or__(x) returns NotImplemented, it doesn't raise NotImplementedError. I've edited the title to match. One major problem that gets in the way of a fix is that the interface of set and dict views doesn't match, because dict views only provide the interface of collections.abc.Set, and collections.abc.Set only requires the operator overloads, not the named equivalent methods. And because the named equivalents don't exist, the operator overloads were made more liberal, to accept any iterable, not just set or set-like things. set's rules are: 1. Named methods take varargs, and accept any iterable(s) 2. Operator overloads take only one argument, which must be a set/frozenset dict view's rules are: 1. There are no named methods (aside from isdisjoint, which has no operator equivalent) 2. Operator overloads take only one argument, which can be any iterable So fixing this is problematic so long as we want to keep a simple implementation; right now, all the methods simplify to: 1. Convert to set 2. Call named, in-place method on resulting set to get liberal behavior Fixing this would involve either: 1. Rewriting all the dict views operator overloads to do the work themselves, including the logic to reject non-iterables by returning NotImplemented like set's operators do. Pro: Not a hack. Resulting code would be more efficient in many cases (e.g. dict view's __and__ could make a new empty set [currently makes set with a copy of the view's values], and iterate the right hand side while performing membership tests in the underlying dict before inserting into the result set, avoiding the need to make a large set that will almost certainly shrink, possibly requiring rehashes) Con: Lots of new code to maintain. Would not benefit from any algorithmic improvements to equivalent set code without mirroring them manually. 2. Checking for TypeError from the set method and explicitly returning NotImplemented. Pro: Easy, code remains maintainable, benefits from any future optimizations to set Cons: Can't distinguish fundamental problems that should be TypeErrors (right hand side contains unhashable objects) from simple case that should return NotImplemented (right hand side isn't iterable). Potentially fixable by explicitly performing the conversion to iterator on the dict view's side (though you'd only do it to confirm it's possible; you'd want to pass the original object unmodified for the common case where it's a set/frozenset and could be processed more efficiently). Also makes failure case even slower than it already is (exception is raised, then ignored, then NotImplemented is returned, then right-hand side is checked, and odds are, the right-hand side won't know what to do either, so it'll return NotImplemented anyway, and a TypeError gets thrown after more work). 3. Refactor the set API to expose (private, C level) helpers for the operator overloads (specifically, the in-place versions like __ior__, since we're always making a new true set to call it on anyway, and it seems silly to use __or__ which would make yet another set, and force us to throw away the first set we made) that are less strict about their arguments. Pros: Roughly the same as #2, plus a minor per-call (not per item) performance boost: C helpers could be called directly, avoiding overhead of calling Python level methods via _PyObject_CallMethodIdOneArg(result, &PyId_update, other); invoking dict lookups and the like when we know we just made a true set object, so there is no possibility of update being overridden, so a direct C level call would save time. Continues to benefit from any optimizations to set's code (since 95% of the code remains the same). Con: More code churn than #2; unfortunately, unlike list (which has type loose in-place operator overloads), set's in-place operator overloads are still type strict; alist += {1,2,3} works, but aset \|= [1,2,3] does not, so we can't just call set's __ixor__ unmodified, we'd need a _PySet_IXorIterable separate from the existing set_ior, where both of them might be implemented in terms of set_update_internal, but _PySet_IXorIterable only checks that the argument is iterable to return NotImplemented. #1 and #3 seem to be the only fully correct options; #1 could squeeze out more performance, but requires a lot more new code, #3 requires a lot less new code, and gets a tiny performance boost. I don't see explicit calls to __or__ being all that common though, and I have a hard time envisioning a class that would hit this case without explicitly calling __or__; for `{}.keys() \| x` the only time you'd see a behavioral difference from `set() \| x` is when: 1. x is not iterable, and 2. x defines a __ror__ that can do something meaningful with a set I'd personally assume anything that meets criterion #2 would be iterable; anyone have a counter-example? If this is only a theoretical problem, might it make sense to just leave it as is?

To be clear, set().__or__(x) returns NotImplemented, it doesn't raise NotImplementedError. I've edited the title to match.

One major problem that gets in the way of a fix is that the interface of set and dict views doesn't match, because dict views only provide the interface of collections.abc.Set, and collections.abc.Set *only* requires the operator overloads, not the named equivalent methods. And because the named equivalents don't exist, the operator overloads were made more liberal, to accept any iterable, not just set or set-like things.

set's rules are:

1. Named methods take varargs, and accept any iterable(s)
2. Operator overloads take only one argument, which must be a set/frozenset

dict view's rules are:

1. There are no named methods (aside from isdisjoint, which has no operator equivalent)
2. Operator overloads take only one argument, which can be any iterable

So fixing this is problematic so long as we want to keep a simple implementation; right now, all the methods simplify to:

1. Convert to set
2. Call named, in-place method on resulting set to get liberal behavior

Fixing this would involve either:

1. Rewriting all the dict views operator overloads to do the work themselves, including the logic to reject non-iterables by returning NotImplemented like set's operators do.

Pro: Not a hack. Resulting code would be more efficient in many cases (e.g. dict view's __and__ could make a new empty set [currently makes set with a copy of the view's values], and iterate the right hand side while performing membership tests in the underlying dict before inserting into the result set, avoiding the need to make a large set that will almost certainly shrink, possibly requiring rehashes)

Con: Lots of new code to maintain. Would not benefit from any algorithmic improvements to equivalent set code without mirroring them manually.

2. Checking for TypeError from the set method and explicitly returning NotImplemented.

Pro: Easy, code remains maintainable, benefits from any future optimizations to set

Cons: Can't distinguish fundamental problems that should be TypeErrors (right hand side contains unhashable objects) from simple case that should return NotImplemented (right hand side isn't iterable). Potentially fixable by explicitly performing the conversion to iterator on the dict view's side (though you'd only do it to confirm it's possible; you'd want to pass the original object unmodified for the common case where it's a set/frozenset and could be processed more efficiently). Also makes failure case even slower than it already is (exception is raised, then ignored, then NotImplemented is returned, then right-hand side is checked, and odds are, the right-hand side won't know what to do either, so it'll return NotImplemented anyway, and a TypeError gets thrown after more work).

3. Refactor the set API to expose (private, C level) helpers for the operator overloads (specifically, the *in-place* versions like __ior__, since we're always making a new true set to call it on anyway, and it seems silly to use __or__ which would make yet another set, and force us to throw away the first set we made) that are less strict about their arguments.

Pros: Roughly the same as #2, plus a minor per-call (not per item) performance boost: C helpers could be called directly, avoiding overhead of calling Python level methods via _PyObject_CallMethodIdOneArg(result, &PyId_update, other); invoking dict lookups and the like when we know we just made a true set object, so there is no possibility of update being overridden, so a direct C level call would save time. Continues to benefit from any optimizations to set's code (since 95% of the code remains the same).

Con: More code churn than #2; unfortunately, unlike list (which has type loose in-place operator overloads), set's in-place operator overloads are still type strict; alist += {1,2,3} works, but aset |= [1,2,3] does not, so we can't just call set's __ixor__ unmodified, we'd need a _PySet_IXorIterable separate from the existing set_ior, where both of them might be implemented in terms of set_update_internal, but _PySet_IXorIterable only checks that the argument is iterable to return NotImplemented.

#1 and #3 seem to be the only fully correct options; #1 could squeeze out more performance, but requires a *lot* more new code, #3 requires a lot less new code, and gets a tiny performance boost.

I don't see explicit calls to __or__ being all that common though, and I have a hard time envisioning a class that would hit this case without explicitly calling __or__; for `{}.keys() | x` the only time you'd see a behavioral difference from `set() | x` is when:

1. x is not iterable, and
2. x defines a __ror__ that can do something meaningful with a set

I'd personally assume anything that meets criterion #2 would be iterable; anyone have a counter-example? If this is only a theoretical problem, might it make sense to just leave it as is?

History
Date	User	Action	Args
2019-08-07 19:51:19	josh.r	set	recipients: + josh.r, rhettinger, pitrou, serhiy.storchaka, canjo
2019-08-07 19:51:19	josh.r	set	messageid: <1565207479.82.0.0380900835422.issue24413@roundup.psfhosted.org>
2019-08-07 19:51:19	josh.r	link	issue24413 messages
2019-08-07 19:51:18	josh.r	create