Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up and fix OrderedDict #69596

Open
serhiy-storchaka opened this issue Oct 15, 2015 · 35 comments
Open

Clean up and fix OrderedDict #69596

serhiy-storchaka opened this issue Oct 15, 2015 · 35 comments
Labels
extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error

Comments

@serhiy-storchaka
Copy link
Member

BPO 25410
Nosy @rhettinger, @ericsnowcurrently, @serhiy-storchaka, @mr-nfamous, @csabella
Files
  • odict_cleanup.patch
  • odict_cleanup_2.patch
  • odict_type.patch
  • odict_add_new_node_leak.patch
  • odict_revert_setting_on_error.patch
  • odict_iternext_simpler.patch
  • odict_resize_sentinel.patch
  • odict_copy_iter.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2015-10-15.08:05:08.623>
    labels = ['extension-modules', 'type-bug']
    title = 'Clean up and fix OrderedDict'
    updated_at = <Date 2018-11-04.21:31:17.395>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2018-11-04.21:31:17.395>
    actor = 'rhettinger'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Extension Modules']
    creation = <Date 2015-10-15.08:05:08.623>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['40785', '40803', '40806', '40817', '40839', '40840', '40841', '41418']
    hgrepos = []
    issue_num = 25410
    keywords = ['patch']
    message_count = 35.0
    messages = ['253033', '253051', '253110', '253122', '253133', '253134', '253135', '253137', '253139', '253140', '253141', '253142', '253144', '253149', '253217', '253218', '253219', '253221', '253227', '253244', '253341', '253346', '253347', '253349', '254072', '254075', '254152', '254180', '254181', '256922', '256923', '256987', '313971', '329220', '329260']
    nosy_count = 6.0
    nosy_names = ['rhettinger', 'python-dev', 'eric.snow', 'serhiy.storchaka', 'bup', 'cheryl.sabella']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue25410'
    versions = ['Python 3.5', 'Python 3.6']

    @serhiy-storchaka
    Copy link
    Member Author

    Proposed patch cleans up and fixes minor bugs in C implementation of OrderedDict.

    1. Used the "p" format unit instead of manual calling PyObject_True() for parsing boolean parameters.

    2. Used _Py_Identifier private API instead of char* API if appropriate.

    3. Used Py_TYPE instead of the __class__ attribute as in other extension types.

    4. Fixed od_fast_nodes size calculation in __sizeof__().

    5. Simplified __reduce__() taking into account that __dict__ is empty in C implementation of OrderedDict.

    6. popitem() with wrong number of arguments now raises TypeError instead of KeyError for empty dictionary.

    7. Python implementation of move_to_end() calls key comparing only once in common case. C implementation called key comparing twice, it first compares a key with first or last key. Now C implementation calls key comparing only once in common case.

    8. Used PyUnicode_FromFormat() instead of str.__mod__ in __repr__().

    9. update() now takes into account that args and kwargs are always tuple and dict (if not NULL).

    10. Got rid of PyMapping_Items() in update(). PyMapping_Items() creates a copy of items as a list, this is not needed.

    Also applied other cleanups. The size of sources is decreased by 105 lines.

    Objects/odictobject.c | 194 +-----------------------------!!!!!!!!!!!!!!!!!!
    1 file changed, 6 insertions(+), 111 deletions(-), 77 modifications(!)

    @serhiy-storchaka serhiy-storchaka added extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error labels Oct 15, 2015
    @rhettinger
    Copy link
    Contributor

    These all look good except for perhaps #5 which I need to look at a bit more for its effect on OD subclasses.

    @ericsnowcurrently
    Copy link
    Member

    Thanks for working on this, Serhiy. I've left a review.

    As to the points you outlined, I have concerns with the impact of #3 and #5 on subclasses. Notably od.__class__ is not necessarily the same as type(od). Also #7 may introduce an unhandled re-entrancy, causing potentially incorrect outcomes.

    Also note that I was extremely careful to (almost) exactly match the pure Python implementation. Not only did this guarantee equivalent behavior, but it simplified the porting effort. I'm not opposed to deviating from the pure Python implementation as long as the behavior remains exactly the same. So if you want to change the behavior of OrderedDict you must be sure to make the equivalent change in the pure Python implementation (with the associated backward-compatibility constraints). Thanks again for working on this though.

    @serhiy-storchaka
    Copy link
    Member Author

    Thank you for your review Eric.

    As for using Py_TYPE(self) instead of the __class__ attribute in #3, this is consistent with the rest of Python core and stdlib. All C implementations use Py_TYPE(self) for repr and pickling, even if Python implementations can use __class__.

    >>> class S(set): __class__ = str
    ... 
    >>> s = S()
    >>> s.__class__
    <class 'str'>
    >>> s
    S()
    >>> s.__reduce_ex__(2)
    (<class '__main__.S'>, ([],), {})
    >>> s+''
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unsupported operand type(s) for +: 'S' and 'str'

    Note that you can't just set s.__class__ = str, this is forbidden (bpo-24912). You should set __class__ in class definition or make it a property, and this doesn't have affect object's behavior, the only visible effect is the value of the __class__ attribute itself.

    One possible argument for Py_TYPE(self) (besides simplicity and historical reasons) is that it is more reliable. It doesn't cause executing arbitrary Python code and therefore is thread-safe and reentrant. It returns the true type of the object, that can be important for debugging.

    We should not care about exactly matching Python implementation, but rather about consistency with the rest of Python. If such type of mismatching is considered a bug, Python is full of such bugs.

    About #5, be sure that the new code is exact semantic equivalence to the old code besides copying the dict that is not needed now. I just dropped an iteration on always empty dict and related code.

    I don't see re-entrancy problem in #7. Could you please provide an example?

    @ericsnowcurrently
    Copy link
    Member

    Regarding Py_TYPE(od) vs. od.__class__, there is a difference for subclasses, as you demonstrated in your example. [1] Thanks for explaining your rationale. I now understand your argument about using PyTYPE() for repr and pickle in C types. I still have concerns, though, regarding parity between the two OrderedDict implementations.

    The key difference is that we're talking about an after-the-fact C port of an existing pure Python implementation. The two implementations should behave identically in nearly every case. The only place I would expect a deviation to not matter is for anything where Python-as-a-whole does not guarantee behavior. However there are *very* few of those when all is said and done. Any other variation should not be made casually and only if we are willing to accept that there may be code out there that relies on the pure Python behavior which the C implementation will break.

    So like I said before, as a rule I'm absolutely okay with changing the behavior as long as the pure Python implementation is changed to match and OrderedDict remains backward-compatible (and the change is meaningful, e.g. efficiency, consistency). Otherwise my concerns remain and we have to have sufficient justification for the change.

    It may be worth taking this to python-dev to get a clearer consensus on both "Py_TYPE(obj) vs. obj.__class__", as well as about parity between dual pure-Python/C implementations in the stdlib, regardless of the outcome of this issue. Both are points about which we should be consistent throughout Python. The type() vs. __class__ question may deserve an entry in the language reference and both may deserve a PEP.

    -----

    For this particular case, I think we should still aim for compatibility with the pure Python implementation. To that effect, we could use Py_TYPE(od) only if PyODict_CheckExact() returns true (as a fast path) and use od.__class__ otherwise. That fast path would be safe for the C implementation since code can't change OrderedDict().__class__ (due to bpo-24912).

    That particular difference in the implementations (i.e. you *can* change od.__class__ for the pure Python one) is an acceptable compatibility break since it's unlikely anyone is changing od.__class__ *and* if they are then they can just switch to a simple subclass that wraps OrderedDict:

        # before:
        from collections import OrderedDict
        od = OrderedDict()
        od.__class__ = SomethingElse
    
        # after:
        import collections
        class OrderedDict(collections.OrderedDict):
            pass
        od = OrderedDict()
        od.__class__ = SomethingElse

    If we *do* continue supporting "type(od) != od.__class__" in repr then I'd say why bother with a fast path for PyOdict_CheckExact(). That sort of efficiency isn't necessary for repr. If we stop supporting a differing od.__class__ then I'm fine simply using Py_TYPE() throughout.

    Likewise, if this is not a case we want to support then we must accept that we may break some code out there, however unlikely that is. In that case perhaps we could be more clear in the documentation that OrderedDict().__class__ should not be changed, though such an addition to the OrderedDict docs might just be clutter. A general FAQ or other broader doc entry about not assigning to obj.__class__ for stdlib types might be more appropriate. But that is where clarification from python-dev would help.

    [1] There is also a difference between type(obj) and obj.__class__ in the case of proxies (e.g. see bpo-16251), but that is less of an issue here.

    @ericsnowcurrently
    Copy link
    Member

    Regarding #5, you're right about OrderedDict().__dict__ being empty for the C implementation. (Nice observation!) So I'm okay with ripping all that code out of odict_reduce(). Since we're still accessing od.__dict__ through _PyObject_GetAttrId() that should not impact subclassing.

    @ericsnowcurrently
    Copy link
    Member

    Regarding #7, I see what you did now. That looks fine to me.

    @serhiy-storchaka
    Copy link
    Member Author

    Regarding Py_TYPE(od) vs. od.__class__, there is a difference for
    subclasses, as you demonstrated in your example. [1] Thanks for explaining
    your rationale. I now understand your argument about using PyTYPE() for
    repr and pickle in C types. I still have concerns, though, regarding
    parity between the two OrderedDict implementations.

    The key difference is that we're talking about an after-the-fact C port of
    an existing pure Python implementation.

    There is no a difference. io, pickle, ElementTree, bz2, virtually all
    accelerator classes was created as replacements of pure Python
    implementations. All C implementations use Py_TYPE(self) for repr() and
    pickling. I think this deviation is common and acceptable.

    Backward compatibility related to __class__ assignment was already broken in C
    implementation. In 3.4 following code works:

    >>> from collections import *
    >>> class foo(OrderedDict):
    ...     def bark(self): return "spam"
    ... 
    >>> class bar(OrderedDict):
    ...     pass
    ... 
    >>> od = bar()
    >>> od.__class__ = foo
    >>> od.bark()
    'spam'

    In 3.5 it doesn't.

    That particular difference in the implementations (i.e. you *can* change
    od.__class__ for the pure Python one) is an acceptable compatibility break
    since it's unlikely anyone is changing od.__class__ *and* if they are then
    they can just switch to a simple subclass that wraps OrderedDict:

    # before:
    from collections import OrderedDict
    od = OrderedDict()
    od.\_\_class__ = SomethingElse
    
    # after:
    import collections
    class OrderedDict(collections.OrderedDict):
        pass
    od = OrderedDict()
    od.\_\_class__ = SomethingElse
    

    No, this assignment is forbidden (due to bpo-24912). You can't set __class_ for
    an instance of a subclass of non-heap type.

    It may be worth taking this to python-dev to get a clearer consensus on both
    "Py_TYPE(obj) vs. obj.__class__", as well as about parity between dual
    pure-Python/C implementations in the stdlib, regardless of the outcome of
    this issue. Both are points about which we should be consistent throughout
    Python. The type() vs. __class__ question may deserve an entry in the
    language reference and both may deserve a PEP.

    Could you please raise a discussion on Python-Dev? You will formulate the
    problem better.

    @serhiy-storchaka
    Copy link
    Member Author

    Updated patch addresses Eric's comments. Changes related to #3 are reverted. We will return to this after discussing on Python-Dev.

    @ericsnowcurrently
    Copy link
    Member

    new patch LGTM

    @ericsnowcurrently
    Copy link
    Member

    Backward compatibility related to __class__ assignment was already broken in C
    implementation. In 3.4 following code works:
    [snip]
    In 3.5 it doesn't.

    Depending on what feedback we get from python-dev, that may need to be
    fixed. I'd be inclined to not worry about it. :)

    No, this assignment is forbidden (due to bpo-24912). You can't set __class_ for
    an instance of a subclass of non-heap type.

    Ah, I see. So the solution to that issue has *forced* a compatibility break.

    Could you please raise a discussion on Python-Dev? You will formulate the
    problem better.

    I will.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 18, 2015

    New changeset b6e33798f82a by Serhiy Storchaka in branch '3.5':
    Issue bpo-25410: Cleaned up and fixed minor bugs in C implementation of OrderedDict.
    https://hg.python.org/cpython/rev/b6e33798f82a

    New changeset 741ef17e9b86 by Serhiy Storchaka in branch 'default':
    Issue bpo-25410: Cleaned up and fixed minor bugs in C implementation of OrderedDict.
    https://hg.python.org/cpython/rev/741ef17e9b86

    @serhiy-storchaka
    Copy link
    Member Author

    Here is a patch that makes both implementations to use type(self) instead of self.__class__ in __repr__(), __reduce__() and copy().

    There is a difference between current implementations. Python implementation uses self.__class__ in copy(), C implementation uses type(self).

    @serhiy-storchaka
    Copy link
    Member Author

    Seems there is a leak in _odict_add_new_node() when PyObject_Hash(key) fails. Here is a fix.

    @ericsnowcurrently
    Copy link
    Member

    both patches* LGTM

    • odict_type.patch and odict_add_new_node_leak.patch

    @ericsnowcurrently
    Copy link
    Member

    And thanks again, Serhiy, for taking the time on this. :)

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 20, 2015

    New changeset 93f948120773 by Serhiy Storchaka in branch '3.5':
    Issue bpo-25410: Fixed a memory leak in OrderedDict in the case when key's hash
    https://hg.python.org/cpython/rev/93f948120773

    New changeset c3cec0f77eff by Serhiy Storchaka in branch 'default':
    Issue bpo-25410: Fixed a memory leak in OrderedDict in the case when key's hash
    https://hg.python.org/cpython/rev/c3cec0f77eff

    @serhiy-storchaka
    Copy link
    Member Author

    Thank you for your reviews and discussions (and for your appreciated C acceleration of OrderedDict of course) Eric. I just want to make the code a little cleaner and reliable.

    As for odict_type.patch, I would prefer to commit only C part of the patch and left Python implementation unchanged. There few not very strong arguments for __class__ against type() in Python code.

    1. Calling type() needs to make globals and builtins lookup. This is more than 2 times slower than access the __class__ attribute. Not critical for __repr__(), __reduce__() and copy().

    2. If the code is invoked at shutdown after destroying the builtins module, type can be None. We already had issues with this in the past. In current Python such situation is almost impossible nevertheless, due to different defensive techniques.

    @ericsnowcurrently
    Copy link
    Member

    Since the python-dev discussion about __class__, leaving the Python implementation alone is fine with me.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 22, 2015

    New changeset a42c0c1c5133 by Serhiy Storchaka in branch '3.5':
    Issue bpo-25410: C implementation of OrderedDict now uses type(self) instead of
    https://hg.python.org/cpython/rev/a42c0c1c5133

    New changeset 10b965d59b49 by Serhiy Storchaka in branch 'default':
    Issue bpo-25410: C implementation of OrderedDict now uses type(self) instead of
    https://hg.python.org/cpython/rev/10b965d59b49

    @serhiy-storchaka
    Copy link
    Member Author

    Thanks Eric.

    A comment in PyODict_SetItem suggests to revert setting the value on the dict if adding new node failed. Following patch implements this suggestion. After committing the patch in bpo-25462 PyDict_DelItem can be replaced with _PyDict_DelItem_KnownHash and it will be always successful.

    @serhiy-storchaka
    Copy link
    Member Author

    Following patch makes the code for the iterating a little simpler by merging common code.

    @serhiy-storchaka
    Copy link
    Member Author

    In very rare circumstances it is possible that after a series of modifications using raw dict methods, ma_keys becomes to point to the same address, but allocated keys array has different size. But the guard in _odict_get_index uses only the address as resize sentinel. This can cause segmentation error if found key index is larger than the size of od_fast_nodes. Following patch adds the value of keys->dk_size as a sentinel.

    @serhiy-storchaka
    Copy link
    Member Author

    Could you please make a review of last three patches Eric?

    @ericsnowcurrently
    Copy link
    Member

    I will review those patches soon.

    @ericsnowcurrently
    Copy link
    Member

    All 3 patches look fine to me.

    In "odict_resize_sentinel.patch", it would be nice if you could accomplish that with a single sentinel. However, fixing the bug is more critical.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 6, 2015

    New changeset 45ce4c6b4f36 by Serhiy Storchaka in branch '3.5':
    Issue bpo-25410: Made testing that od_fast_nodes and dk_entries are in sync more
    https://hg.python.org/cpython/rev/45ce4c6b4f36

    New changeset c16af48153a4 by Serhiy Storchaka in branch 'default':
    Issue bpo-25410: Made testing that od_fast_nodes and dk_entries are in sync more
    https://hg.python.org/cpython/rev/c16af48153a4

    @serhiy-storchaka
    Copy link
    Member Author

    Thank you for your review Eric.

    I made error in commit messages, that is why they are non shown here. odict_revert_setting_on_error.patch and odict_iternext_simpler.patch were committed in 1594c23d8c2f and ad44d551c13c.

    od_resize_sentinel2 in odict_resize_sentinel.patch was renamed to od_fast_nodes_size. Now I see that this is not enough. It is possible that ma_keys is located on the same place, has the same size, but has different layout for keys with matched hashes. I'm trying to write more reliable checks.

    @serhiy-storchaka
    Copy link
    Member Author

    Following code prints X([(1, 1), (3, 3)]) on 3.4 and X([(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]) on 3.5+.

    from collections import OrderedDict
    class X(OrderedDict):
        def __iter__(self):
            for k in OrderedDict.__iter__(self):
                if k % 2:
                    yield k
    
    od = X((i, i) for i in range(5))
    print(od.copy())

    @serhiy-storchaka
    Copy link
    Member Author

    And even simpler example: list(od.keys()) is [1, 3] in 3.4 and [0, 1, 2, 3, 4] in 3.5.

    @serhiy-storchaka
    Copy link
    Member Author

    Proposed patch makes OrderedDict.copy() more consistent between implementations.

    @csabella
    Copy link
    Contributor

    Is this issue still relevant?

    @mr-nfamous
    Copy link
    Mannequin

    mr-nfamous mannequin commented Nov 4, 2018

    It might be more appropriate to start a new issue for this, but I'll leave that decision to somehow who would know for sure. Anyway, basically the entire dict/PyDictObject api functions do not appear work at all with OrderedDict. Or rather, OrderedDict doesn't seem to be able to recognize the changes the dict api makes to an object. This is present in both 3.6.0 and 3.7.0 by the way.

    	from operator import setitem
    	from collections import OrderedDict
    	from pprint import pprint
    	class thing:
    		def __init__(self):
    			ns = OrderedDict(a='od.__init__')
    			vars(__class__)['__dict__'].__set__(self, ns)
    			dict.__setitem__(ns, 'b', 'dict.__setitem__')
    			self.c = 'PyObject_SetAttr'
    			OrderedDict.__setitem__(ns, 'd', 'od.__setitem__')
    			ns.update(e='od.update')
    			object.__setattr__(self, 'f', 'PyObject_GenericSetAttr')
    			setattr(self, 'f', 'PyObject_SetAttr')
    			setitem(ns, 'g', 'od.__setitem__')
    			dict.update(ns, h='dict.update')
    			dict.setdefault(ns, 'i', 'i')
    	self    = thing()
    	ns      = self.__dict__
    	real_ns = {**ns}
    	missing = {k: ns[k] for k in real_ns.keys() - ns.keys()}
    	pprint(ns)
    	pprint(missing, width=len(f'{missing}')-1)
    	print(f'"b" in {ns.keys()} == {"b" in ns.keys()}')
    	print(f'"b" in {*ns.keys(),} == {"b" in [*ns.keys()]}')
    	del ns['a']
    	del ns['b']
    	print(f"ns.get('c', KeyError('c')) == {ns.get('c', KeyError('c'))}")
    	print(f"ns.pop('c', KeyError('c')) == {ns.pop('c', KeyError('c'))!r}")
    	ns.get('i')
    	ns.pop('i')

    Maybe it's considered undefined behavior for a subclass to use
    a method of one of its bases which it has overriden. That's fair enough, but as this example demonstrates, the silence and arbitrariness of the errors is a real problem when OrderedDict is used as a namespace, since it's a complete coin toss on whether one of the many internal API function will set an attribute or name via PyDict_SetItem or PyObject_SetItem. Only the former can invoke the methods OrderedDict overrides and there isn't any easy-to-find on the subject as far as I know.

    @rhettinger
    Copy link
    Contributor

    Maybe it's considered undefined behavior for a subclass to use
    a method of one of its bases which it has overriden.

    In general, it's true that if OrderedDict is a subclass of dict, then it would have no defense against someone making a direct call to the dict base class. Such a call should be expected to violate the OrderedDicts invariants.

    it's a complete coin toss on whether one of the many internal
    API function will set an attribute or name via PyDict_SetItem
    or PyObject_SetItem.

    Not really. The CPython source is supposed to only call PyDict_SetItem when the target is known to be an exact dictionary. If you find a case where that isn't true, please file a bug report and we'll fix it.

    It might be more appropriate to start a new issue for this, but I'll > leave that decision to somehow who would know for sure.

    No need. We've known about this sort of problem for years. See https://bugs.python.org/issue10977 for example. There isn't really much we could do about it without causing other issues that would be worse.

    FWIW, this doesn't seem to be a problem in practice. Further, OrderedDict is expected to become less relevant now that regular dicts are order preserving.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants