Clean up and fix OrderedDict #69596

serhiy-storchaka · 2015-10-15T08:05:09Z

BPO	25410
Nosy	@rhettinger, @ericsnowcurrently, @serhiy-storchaka, @mr-nfamous, @csabella
Files	odict_cleanup.patch odict_cleanup_2.patch odict_type.patch odict_add_new_node_leak.patch odict_revert_setting_on_error.patch odict_iternext_simpler.patch odict_resize_sentinel.patch odict_copy_iter.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2015-10-15.08:05:08.623>
labels = ['extension-modules', 'type-bug']
title = 'Clean up and fix OrderedDict'
updated_at = <Date 2018-11-04.21:31:17.395>
user = 'https://github.com/serhiy-storchaka'

bugs.python.org fields:

activity = <Date 2018-11-04.21:31:17.395>
actor = 'rhettinger'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Extension Modules']
creation = <Date 2015-10-15.08:05:08.623>
creator = 'serhiy.storchaka'
dependencies = []
files = ['40785', '40803', '40806', '40817', '40839', '40840', '40841', '41418']
hgrepos = []
issue_num = 25410
keywords = ['patch']
message_count = 35.0
messages = ['253033', '253051', '253110', '253122', '253133', '253134', '253135', '253137', '253139', '253140', '253141', '253142', '253144', '253149', '253217', '253218', '253219', '253221', '253227', '253244', '253341', '253346', '253347', '253349', '254072', '254075', '254152', '254180', '254181', '256922', '256923', '256987', '313971', '329220', '329260']
nosy_count = 6.0
nosy_names = ['rhettinger', 'python-dev', 'eric.snow', 'serhiy.storchaka', 'bup', 'cheryl.sabella']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue25410'
versions = ['Python 3.5', 'Python 3.6']

serhiy-storchaka · 2015-10-15T08:05:04Z

Proposed patch cleans up and fixes minor bugs in C implementation of OrderedDict.

Used the "p" format unit instead of manual calling PyObject_True() for parsing boolean parameters.
Used _Py_Identifier private API instead of char* API if appropriate.
Used Py_TYPE instead of the __class__ attribute as in other extension types.
Fixed od_fast_nodes size calculation in __sizeof__().
Simplified __reduce__() taking into account that __dict__ is empty in C implementation of OrderedDict.
popitem() with wrong number of arguments now raises TypeError instead of KeyError for empty dictionary.
Python implementation of move_to_end() calls key comparing only once in common case. C implementation called key comparing twice, it first compares a key with first or last key. Now C implementation calls key comparing only once in common case.
Used PyUnicode_FromFormat() instead of str.__mod__ in __repr__().
update() now takes into account that args and kwargs are always tuple and dict (if not NULL).
Got rid of PyMapping_Items() in update(). PyMapping_Items() creates a copy of items as a list, this is not needed.

Also applied other cleanups. The size of sources is decreased by 105 lines.

Objects/odictobject.c | 194 +-----------------------------!!!!!!!!!!!!!!!!!!
1 file changed, 6 insertions(+), 111 deletions(-), 77 modifications(!)

rhettinger · 2015-10-15T14:59:11Z

These all look good except for perhaps #5 which I need to look at a bit more for its effect on OD subclasses.

ericsnowcurrently · 2015-10-17T01:06:59Z

Thanks for working on this, Serhiy. I've left a review.

As to the points you outlined, I have concerns with the impact of #3 and #5 on subclasses. Notably od.__class__ is not necessarily the same as type(od). Also #7 may introduce an unhandled re-entrancy, causing potentially incorrect outcomes.

Also note that I was extremely careful to (almost) exactly match the pure Python implementation. Not only did this guarantee equivalent behavior, but it simplified the porting effort. I'm not opposed to deviating from the pure Python implementation as long as the behavior remains exactly the same. So if you want to change the behavior of OrderedDict you must be sure to make the equivalent change in the pure Python implementation (with the associated backward-compatibility constraints). Thanks again for working on this though.

serhiy-storchaka · 2015-10-17T06:00:58Z

Thank you for your review Eric.

As for using Py_TYPE(self) instead of the __class__ attribute in #3, this is consistent with the rest of Python core and stdlib. All C implementations use Py_TYPE(self) for repr and pickling, even if Python implementations can use __class__.

>>> class S(set): __class__ = str
... 
>>> s = S()
>>> s.__class__
<class 'str'>
>>> s
S()
>>> s.__reduce_ex__(2)
(<class '__main__.S'>, ([],), {})
>>> s+''
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'S' and 'str'

Note that you can't just set s.__class__ = str, this is forbidden (bpo-24912). You should set __class__ in class definition or make it a property, and this doesn't have affect object's behavior, the only visible effect is the value of the __class__ attribute itself.

One possible argument for Py_TYPE(self) (besides simplicity and historical reasons) is that it is more reliable. It doesn't cause executing arbitrary Python code and therefore is thread-safe and reentrant. It returns the true type of the object, that can be important for debugging.

We should not care about exactly matching Python implementation, but rather about consistency with the rest of Python. If such type of mismatching is considered a bug, Python is full of such bugs.

About #5, be sure that the new code is exact semantic equivalence to the old code besides copying the dict that is not needed now. I just dropped an iteration on always empty dict and related code.

I don't see re-entrancy problem in #7. Could you please provide an example?

ericsnowcurrently · 2015-10-17T16:01:20Z

Regarding Py_TYPE(od) vs. od.__class__, there is a difference for subclasses, as you demonstrated in your example. [1] Thanks for explaining your rationale. I now understand your argument about using PyTYPE() for repr and pickle in C types. I still have concerns, though, regarding parity between the two OrderedDict implementations.

The key difference is that we're talking about an after-the-fact C port of an existing pure Python implementation. The two implementations should behave identically in nearly every case. The only place I would expect a deviation to not matter is for anything where Python-as-a-whole does not guarantee behavior. However there are *very* few of those when all is said and done. Any other variation should not be made casually and only if we are willing to accept that there may be code out there that relies on the pure Python behavior which the C implementation will break.

So like I said before, as a rule I'm absolutely okay with changing the behavior as long as the pure Python implementation is changed to match and OrderedDict remains backward-compatible (and the change is meaningful, e.g. efficiency, consistency). Otherwise my concerns remain and we have to have sufficient justification for the change.

It may be worth taking this to python-dev to get a clearer consensus on both "Py_TYPE(obj) vs. obj.__class__", as well as about parity between dual pure-Python/C implementations in the stdlib, regardless of the outcome of this issue. Both are points about which we should be consistent throughout Python. The type() vs. __class__ question may deserve an entry in the language reference and both may deserve a PEP.

-----

For this particular case, I think we should still aim for compatibility with the pure Python implementation. To that effect, we could use Py_TYPE(od) only if PyODict_CheckExact() returns true (as a fast path) and use od.__class__ otherwise. That fast path would be safe for the C implementation since code can't change OrderedDict().__class__ (due to bpo-24912).

That particular difference in the implementations (i.e. you *can* change od.__class__ for the pure Python one) is an acceptable compatibility break since it's unlikely anyone is changing od.__class__ *and* if they are then they can just switch to a simple subclass that wraps OrderedDict:

    # before:
    from collections import OrderedDict
    od = OrderedDict()
    od.__class__ = SomethingElse

    # after:
    import collections
    class OrderedDict(collections.OrderedDict):
        pass
    od = OrderedDict()
    od.__class__ = SomethingElse

If we *do* continue supporting "type(od) != od.__class__" in repr then I'd say why bother with a fast path for PyOdict_CheckExact(). That sort of efficiency isn't necessary for repr. If we stop supporting a differing od.__class__ then I'm fine simply using Py_TYPE() throughout.

Likewise, if this is not a case we want to support then we must accept that we may break some code out there, however unlikely that is. In that case perhaps we could be more clear in the documentation that OrderedDict().__class__ should not be changed, though such an addition to the OrderedDict docs might just be clutter. A general FAQ or other broader doc entry about not assigning to obj.__class__ for stdlib types might be more appropriate. But that is where clarification from python-dev would help.

[1] There is also a difference between type(obj) and obj.__class__ in the case of proxies (e.g. see bpo-16251), but that is less of an issue here.

ericsnowcurrently · 2015-10-17T16:19:10Z

Regarding #5, you're right about OrderedDict().__dict__ being empty for the C implementation. (Nice observation!) So I'm okay with ripping all that code out of odict_reduce(). Since we're still accessing od.__dict__ through _PyObject_GetAttrId() that should not impact subclassing.

ericsnowcurrently · 2015-10-17T16:45:10Z

Regarding #7, I see what you did now. That looks fine to me.

serhiy-storchaka · 2015-10-17T17:21:35Z

Regarding Py_TYPE(od) vs. od.__class__, there is a difference for
subclasses, as you demonstrated in your example. [1] Thanks for explaining
your rationale. I now understand your argument about using PyTYPE() for
repr and pickle in C types. I still have concerns, though, regarding
parity between the two OrderedDict implementations.

The key difference is that we're talking about an after-the-fact C port of
an existing pure Python implementation.

There is no a difference. io, pickle, ElementTree, bz2, virtually all
accelerator classes was created as replacements of pure Python
implementations. All C implementations use Py_TYPE(self) for repr() and
pickling. I think this deviation is common and acceptable.

Backward compatibility related to __class__ assignment was already broken in C
implementation. In 3.4 following code works:

>>> from collections import *
>>> class foo(OrderedDict):
...     def bark(self): return "spam"
... 
>>> class bar(OrderedDict):
...     pass
... 
>>> od = bar()
>>> od.__class__ = foo
>>> od.bark()
'spam'

In 3.5 it doesn't.

That particular difference in the implementations (i.e. you *can* change
od.__class__ for the pure Python one) is an acceptable compatibility break
since it's unlikely anyone is changing od.__class__ *and* if they are then
they can just switch to a simple subclass that wraps OrderedDict:
# before:
from collections import OrderedDict
od = OrderedDict()
od.\_\_class__ = SomethingElse

# after:
import collections
class OrderedDict(collections.OrderedDict):
    pass
od = OrderedDict()
od.\_\_class__ = SomethingElse

No, this assignment is forbidden (due to bpo-24912). You can't set __class_ for
an instance of a subclass of non-heap type.

It may be worth taking this to python-dev to get a clearer consensus on both
"Py_TYPE(obj) vs. obj.__class__", as well as about parity between dual
pure-Python/C implementations in the stdlib, regardless of the outcome of
this issue. Both are points about which we should be consistent throughout
Python. The type() vs. __class__ question may deserve an entry in the
language reference and both may deserve a PEP.

Could you please raise a discussion on Python-Dev? You will formulate the
problem better.

serhiy-storchaka · 2015-10-17T19:22:54Z

Updated patch addresses Eric's comments. Changes related to #3 are reverted. We will return to this after discussing on Python-Dev.

ericsnowcurrently · 2015-10-17T21:06:31Z

new patch LGTM

ericsnowcurrently · 2015-10-17T21:10:07Z

Backward compatibility related to __class__ assignment was already broken in C
implementation. In 3.4 following code works:
[snip]
In 3.5 it doesn't.

Depending on what feedback we get from python-dev, that may need to be
fixed. I'd be inclined to not worry about it. :)

No, this assignment is forbidden (due to bpo-24912). You can't set __class_ for
an instance of a subclass of non-heap type.

Ah, I see. So the solution to that issue has *forced* a compatibility break.

Could you please raise a discussion on Python-Dev? You will formulate the
problem better.

I will.

ericsnowcurrently · 2015-10-17T22:21:33Z

Posted to python-dev:

https://mail.python.org/pipermail/python-dev/2015-October/141953.html

https://mail.python.org/pipermail/python-dev/2015-October/141954.html

python-dev · 2015-10-18T06:55:32Z

New changeset b6e33798f82a by Serhiy Storchaka in branch '3.5':
Issue bpo-25410: Cleaned up and fixed minor bugs in C implementation of OrderedDict.
https://hg.python.org/cpython/rev/b6e33798f82a

New changeset 741ef17e9b86 by Serhiy Storchaka in branch 'default':
Issue bpo-25410: Cleaned up and fixed minor bugs in C implementation of OrderedDict.
https://hg.python.org/cpython/rev/741ef17e9b86

serhiy-storchaka · 2015-10-18T08:24:03Z

Here is a patch that makes both implementations to use type(self) instead of self.__class__ in __repr__(), __reduce__() and copy().

There is a difference between current implementations. Python implementation uses self.__class__ in copy(), C implementation uses type(self).

serhiy-storchaka · 2015-10-20T13:41:52Z

Seems there is a leak in _odict_add_new_node() when PyObject_Hash(key) fails. Here is a fix.

ericsnowcurrently · 2015-10-20T14:48:34Z

both patches* LGTM

odict_type.patch and odict_add_new_node_leak.patch

ericsnowcurrently · 2015-10-20T14:50:47Z

And thanks again, Serhiy, for taking the time on this. :)

python-dev · 2015-10-20T15:24:07Z

New changeset 93f948120773 by Serhiy Storchaka in branch '3.5':
Issue bpo-25410: Fixed a memory leak in OrderedDict in the case when key's hash
https://hg.python.org/cpython/rev/93f948120773

New changeset c3cec0f77eff by Serhiy Storchaka in branch 'default':
Issue bpo-25410: Fixed a memory leak in OrderedDict in the case when key's hash
https://hg.python.org/cpython/rev/c3cec0f77eff

serhiy-storchaka · 2015-10-20T16:03:47Z

Thank you for your reviews and discussions (and for your appreciated C acceleration of OrderedDict of course) Eric. I just want to make the code a little cleaner and reliable.

As for odict_type.patch, I would prefer to commit only C part of the patch and left Python implementation unchanged. There few not very strong arguments for __class__ against type() in Python code.

Calling type() needs to make globals and builtins lookup. This is more than 2 times slower than access the __class__ attribute. Not critical for __repr__(), __reduce__() and copy().
If the code is invoked at shutdown after destroying the builtins module, type can be None. We already had issues with this in the past. In current Python such situation is almost impossible nevertheless, due to different defensive techniques.

ericsnowcurrently · 2015-10-20T19:25:14Z

Since the python-dev discussion about __class__, leaving the Python implementation alone is fine with me.

python-dev · 2015-10-22T17:20:05Z

New changeset a42c0c1c5133 by Serhiy Storchaka in branch '3.5':
Issue bpo-25410: C implementation of OrderedDict now uses type(self) instead of
https://hg.python.org/cpython/rev/a42c0c1c5133

New changeset 10b965d59b49 by Serhiy Storchaka in branch 'default':
Issue bpo-25410: C implementation of OrderedDict now uses type(self) instead of
https://hg.python.org/cpython/rev/10b965d59b49

serhiy-storchaka · 2015-10-22T18:48:56Z

Thanks Eric.

A comment in PyODict_SetItem suggests to revert setting the value on the dict if adding new node failed. Following patch implements this suggestion. After committing the patch in bpo-25462 PyDict_DelItem can be replaced with _PyDict_DelItem_KnownHash and it will be always successful.

serhiy-storchaka · 2015-10-22T18:53:14Z

Following patch makes the code for the iterating a little simpler by merging common code.

serhiy-storchaka · 2015-10-22T19:14:09Z

In very rare circumstances it is possible that after a series of modifications using raw dict methods, ma_keys becomes to point to the same address, but allocated keys array has different size. But the guard in _odict_get_index uses only the address as resize sentinel. This can cause segmentation error if found key index is larger than the size of od_fast_nodes. Following patch adds the value of keys->dk_size as a sentinel.

serhiy-storchaka · 2015-11-04T21:19:34Z

Could you please make a review of last three patches Eric?

ericsnowcurrently · 2015-11-04T23:04:17Z

I will review those patches soon.

ericsnowcurrently · 2015-11-06T00:02:04Z

All 3 patches look fine to me.

In "odict_resize_sentinel.patch", it would be nice if you could accomplish that with a single sentinel. However, fixing the bug is more critical.

python-dev · 2015-11-06T10:01:06Z

New changeset 45ce4c6b4f36 by Serhiy Storchaka in branch '3.5':
Issue bpo-25410: Made testing that od_fast_nodes and dk_entries are in sync more
https://hg.python.org/cpython/rev/45ce4c6b4f36

New changeset c16af48153a4 by Serhiy Storchaka in branch 'default':
Issue bpo-25410: Made testing that od_fast_nodes and dk_entries are in sync more
https://hg.python.org/cpython/rev/c16af48153a4

serhiy-storchaka · 2015-11-06T10:10:40Z

Thank you for your review Eric.

I made error in commit messages, that is why they are non shown here. odict_revert_setting_on_error.patch and odict_iternext_simpler.patch were committed in 1594c23d8c2f and ad44d551c13c.

od_resize_sentinel2 in odict_resize_sentinel.patch was renamed to od_fast_nodes_size. Now I see that this is not enough. It is possible that ma_keys is located on the same place, has the same size, but has different layout for keys with matched hashes. I'm trying to write more reliable checks.

serhiy-storchaka · 2015-12-23T16:16:15Z

Following code prints X([(1, 1), (3, 3)]) on 3.4 and X([(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]) on 3.5+.

from collections import OrderedDict
class X(OrderedDict):
    def __iter__(self):
        for k in OrderedDict.__iter__(self):
            if k % 2:
                yield k

od = X((i, i) for i in range(5))
print(od.copy())

serhiy-storchaka · 2015-12-23T16:22:20Z

And even simpler example: list(od.keys()) is [1, 3] in 3.4 and [0, 1, 2, 3, 4] in 3.5.

serhiy-storchaka · 2015-12-25T12:21:04Z

Proposed patch makes OrderedDict.copy() more consistent between implementations.

csabella · 2018-03-16T19:45:02Z

Is this issue still relevant?

mr-nfamous · 2018-11-04T09:11:38Z

It might be more appropriate to start a new issue for this, but I'll leave that decision to somehow who would know for sure. Anyway, basically the entire dict/PyDictObject api functions do not appear work at all with OrderedDict. Or rather, OrderedDict doesn't seem to be able to recognize the changes the dict api makes to an object. This is present in both 3.6.0 and 3.7.0 by the way.

	from operator import setitem
	from collections import OrderedDict
	from pprint import pprint
	class thing:
		def __init__(self):
			ns = OrderedDict(a='od.__init__')
			vars(__class__)['__dict__'].__set__(self, ns)
			dict.__setitem__(ns, 'b', 'dict.__setitem__')
			self.c = 'PyObject_SetAttr'
			OrderedDict.__setitem__(ns, 'd', 'od.__setitem__')
			ns.update(e='od.update')
			object.__setattr__(self, 'f', 'PyObject_GenericSetAttr')
			setattr(self, 'f', 'PyObject_SetAttr')
			setitem(ns, 'g', 'od.__setitem__')
			dict.update(ns, h='dict.update')
			dict.setdefault(ns, 'i', 'i')
	self    = thing()
	ns      = self.__dict__
	real_ns = {**ns}
	missing = {k: ns[k] for k in real_ns.keys() - ns.keys()}
	pprint(ns)
	pprint(missing, width=len(f'{missing}')-1)
	print(f'"b" in {ns.keys()} == {"b" in ns.keys()}')
	print(f'"b" in {*ns.keys(),} == {"b" in [*ns.keys()]}')
	del ns['a']
	del ns['b']
	print(f"ns.get('c', KeyError('c')) == {ns.get('c', KeyError('c'))}")
	print(f"ns.pop('c', KeyError('c')) == {ns.pop('c', KeyError('c'))!r}")
	ns.get('i')
	ns.pop('i')

Maybe it's considered undefined behavior for a subclass to use
a method of one of its bases which it has overriden. That's fair enough, but as this example demonstrates, the silence and arbitrariness of the errors is a real problem when OrderedDict is used as a namespace, since it's a complete coin toss on whether one of the many internal API function will set an attribute or name via PyDict_SetItem or PyObject_SetItem. Only the former can invoke the methods OrderedDict overrides and there isn't any easy-to-find on the subject as far as I know.

rhettinger · 2018-11-04T21:31:17Z

Maybe it's considered undefined behavior for a subclass to use
a method of one of its bases which it has overriden.

In general, it's true that if OrderedDict is a subclass of dict, then it would have no defense against someone making a direct call to the dict base class. Such a call should be expected to violate the OrderedDicts invariants.

it's a complete coin toss on whether one of the many internal
API function will set an attribute or name via PyDict_SetItem
or PyObject_SetItem.

Not really. The CPython source is supposed to only call PyDict_SetItem when the target is known to be an exact dictionary. If you find a case where that isn't true, please file a bug report and we'll fix it.

It might be more appropriate to start a new issue for this, but I'll > leave that decision to somehow who would know for sure.

No need. We've known about this sort of problem for years. See https://bugs.python.org/issue10977 for example. There isn't really much we could do about it without causing other issues that would be worse.

FWIW, this doesn't seem to be a problem in practice. Further, OrderedDict is expected to become less relevant now that regular dicts are order preserving.

serhiy-storchaka added extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error labels Oct 15, 2015

ezio-melotti transferred this issue from another repository Apr 10, 2022

Clean up and fix OrderedDict #69596

Clean up and fix OrderedDict #69596

Comments

serhiy-storchaka commented Oct 15, 2015

serhiy-storchaka commented Oct 15, 2015

rhettinger commented Oct 15, 2015

ericsnowcurrently commented Oct 17, 2015

serhiy-storchaka commented Oct 17, 2015

ericsnowcurrently commented Oct 17, 2015

ericsnowcurrently commented Oct 17, 2015

ericsnowcurrently commented Oct 17, 2015

serhiy-storchaka commented Oct 17, 2015

serhiy-storchaka commented Oct 17, 2015

ericsnowcurrently commented Oct 17, 2015

ericsnowcurrently commented Oct 17, 2015

ericsnowcurrently commented Oct 17, 2015

python-dev mannequin commented Oct 18, 2015

serhiy-storchaka commented Oct 18, 2015

serhiy-storchaka commented Oct 20, 2015

ericsnowcurrently commented Oct 20, 2015

ericsnowcurrently commented Oct 20, 2015

python-dev mannequin commented Oct 20, 2015

serhiy-storchaka commented Oct 20, 2015

ericsnowcurrently commented Oct 20, 2015

python-dev mannequin commented Oct 22, 2015

serhiy-storchaka commented Oct 22, 2015

serhiy-storchaka commented Oct 22, 2015

serhiy-storchaka commented Oct 22, 2015

serhiy-storchaka commented Nov 4, 2015

ericsnowcurrently commented Nov 4, 2015

ericsnowcurrently commented Nov 6, 2015

python-dev mannequin commented Nov 6, 2015

serhiy-storchaka commented Nov 6, 2015

serhiy-storchaka commented Dec 23, 2015

serhiy-storchaka commented Dec 23, 2015

serhiy-storchaka commented Dec 25, 2015

csabella commented Mar 16, 2018

mr-nfamous mannequin commented Nov 4, 2018

rhettinger commented Nov 4, 2018