Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OrderedDict written in C #61195

Closed
ericsnowcurrently opened this issue Jan 18, 2013 · 134 comments
Closed

Add OrderedDict written in C #61195

ericsnowcurrently opened this issue Jan 18, 2013 · 134 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) release-blocker type-feature A feature request or enhancement

Comments

@ericsnowcurrently
Copy link
Member

BPO 16991
Nosy @rhettinger, @gpshead, @ncoghlan, @pitrou, @scoder, @ericvsmith, @tiran, @benjaminp, @ned-deily, @ezio-melotti, @merwok, @alex, @asvetlov, @skrah, @florentx, @markshannon, @ericsnowcurrently, @JimJJewett, @serhiy-storchaka, @1st1, @westurner, @refi64, @MojoVampire
Files
  • odict-speed.diff: pybench addition for OrderedDict
  • cOrderedDict.diff
  • cOrderedDict-2.diff
  • cOrderedDict-3.diff
  • cOrderedDict-4.diff
  • 3b2a9026d48e.diff
  • c3fab329aa7f.diff
  • ba1c6d40ca63.diff
  • 0813b1a88171.diff
  • odict-windows.diff
  • crash-1.py
  • crash-2.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/ericsnowcurrently'
    closed_at = <Date 2015-06-01.16:26:38.776>
    created_at = <Date 2013-01-18.04:08:30.389>
    labels = ['interpreter-core', 'type-feature', 'release-blocker']
    title = 'Add OrderedDict written in C'
    updated_at = <Date 2015-06-02.14:01:03.238>
    user = 'https://github.com/ericsnowcurrently'

    bugs.python.org fields:

    activity = <Date 2015-06-02.14:01:03.238>
    actor = 'eric.snow'
    assignee = 'eric.snow'
    closed = True
    closed_date = <Date 2015-06-01.16:26:38.776>
    closer = 'eric.snow'
    components = ['Interpreter Core']
    creation = <Date 2013-01-18.04:08:30.389>
    creator = 'eric.snow'
    dependencies = []
    files = ['30461', '30661', '32262', '38753', '39445', '39496', '39499', '39500', '39566', '39567', '39581', '39582']
    hgrepos = ['309']
    issue_num = 16991
    keywords = ['patch']
    message_count = 134.0
    messages = ['180170', '180176', '180202', '180236', '180640', '180682', '180688', '180690', '181421', '181422', '182132', '182134', '182145', '182147', '182214', '184801', '184835', '189891', '189893', '189895', '189897', '190589', '190590', '190639', '190640', '190643', '191557', '200607', '209700', '209744', '209831', '215971', '231087', '232412', '232803', '232825', '232828', '239665', '239666', '239740', '239761', '239867', '239871', '243096', '243121', '243123', '243343', '243358', '243371', '243379', '243492', '243699', '243702', '243705', '243706', '243714', '243715', '243716', '243726', '243733', '243749', '243753', '243799', '243801', '243802', '243803', '243806', '243807', '243843', '243849', '243863', '243868', '243903', '243915', '243959', '243960', '243964', '243965', '243966', '244005', '244039', '244040', '244044', '244046', '244047', '244048', '244053', '244054', '244067', '244070', '244071', '244073', '244074', '244076', '244122', '244380', '244401', '244402', '244421', '244436', '244437', '244438', '244439', '244440', '244441', '244442', '244462', '244463', '244464', '244465', '244466', '244467', '244468', '244469', '244475', '244476', '244478', '244480', '244481', '244484', '244486', '244491', '244492', '244493', '244496', '244499', '244535', '244574', '244575', '244586', '244587', '244600', '244664', '244668']
    nosy_count = 29.0
    nosy_names = ['rhettinger', 'gregory.p.smith', 'ncoghlan', 'pitrou', 'scoder', 'eric.smith', 'christian.heimes', 'benjamin.peterson', 'ned.deily', 'ezio.melotti', 'eric.araujo', 'mrabarnett', 'Arfrever', 'alex', 'asvetlov', 'skrah', 'flox', 'BreamoreBoy', 'Mark.Shannon', 'python-dev', 'eric.snow', 'Jim.Jewett', 'serhiy.storchaka', 'yselivanov', 'westurner', 'refi64', 'josh.r', 'tonn81', 'introom']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue16991'
    versions = ['Python 3.5']

    @ericsnowcurrently
    Copy link
    Member Author

    Here's an initial stab at writing OrderedDict in C. Though, the implementation is not heavily optimized and isn't super subclass-friendly, my hope is that it's relatively close to the final version. I'm getting this up now to get some eyes on it.

    The spot in the builtins is mostly for convenience, but I expect it will need to be exposed somewhere (perhaps _collections?).

    My experience with the C-API is relatively limited and my C-fu is not at a professional level. However, I'm pretty sure that I have most everything correct.

    The ultimate goal for this type is to use it for **kwargs.

    Note: this first patch has some reference leaks that I'm tracking down.

    @ericsnowcurrently ericsnowcurrently self-assigned this Jan 18, 2013
    @ericsnowcurrently ericsnowcurrently added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Jan 18, 2013
    @benjaminp
    Copy link
    Contributor

    Nit: This really not be exposed through _collections rather than hacked into builtins.

    @ezio-melotti
    Copy link
    Member

    The tests should probably be updated to use the PEP-399 idiom in order to test both the implementations.

    @ericsnowcurrently
    Copy link
    Member Author

    @benjamin: Yeah, I've fixed that.

    @ezio: Good point. I've touched that up.

    Once I have cleared up reference counting issues I'll put up a new patch.

    @ericsnowcurrently
    Copy link
    Member Author

    Here's a cleanup of test.test_collections that helps keep the subsequent patch (still forthcoming) cleaner.

    @ezio-melotti
    Copy link
    Member

    What's the reason for moving the OrderedDict tests in a separate file?

    @ericsnowcurrently
    Copy link
    Member Author

    What's the reason for moving the OrderedDict tests in a separate file?

    Following the precedent of collections.deque and collections.defaultdict:

    • a big chunk of code
    • the default implementation will be coming via _collections.

    @ezio-melotti
    Copy link
    Member

    These are indeed good reasons.

    @ericsnowcurrently
    Copy link
    Member Author

    Here's an updated patch. I still have some ref-counting issues, but the patch is much closer to what I expect will be the final version. At this point it passes all the main unit tests (segfaults in some of the supplemental Mapping tests).

    One key thing is that for now the various iterators are faked using lists. Once everything is sorted out I'll plug my implementation of the iterators back in (which I'd already mostly finished).

    I see room for efficiency improvements in a couple spots. As well, I plan on improving the subclass-ability of the type once it's otherwise happy.

    Any feedback at the point would be helpful. Regardless, I'm being careful to get this right and I'm no expert with the C-API, so it's taking a while. :)

    Some other considerations:

    • ensure that __init__() can reset and populate an existing dictionary:
      d=OrderedDict(); d[0]=1; d.__init__() # resets
    • avoid any reference cycles so that dicts can clean-up right-away when their ref-count drops to zero
      rather than waiting on GC.
    • possibly use the GIL to make the link updates atomic, trying to make the overall class thread safe.
    • methods like get() pop() or setdefault() shouldn't rely on __getitem__ raising KeyError
      because __missing__ may be present in a subclass

    @ericsnowcurrently
    Copy link
    Member Author

    Looks like I didn't get the patch lined up to tip so the review link isn't showing up. I'll have to fix that tomorrow.

    @ericsnowcurrently
    Copy link
    Member Author

    Are there any objections to this unittest cleanup patch? I'd like to commit it separately prior to anything that will go in for the C OrderedDict.

    @ericsnowcurrently
    Copy link
    Member Author

    Incidently, I'm just about done with the OrdereDict patch. I'm attaching the current one. At present it passes the unit tests, but I have two memory-related issues and a number of open questions (that I don't think are critical).

    The patch has the unittest cleanup merged in just so it will show up in reviewboard.

    @benjaminp
    Copy link
    Contributor

    Why did you copy test_collections.py instead of just creating a new file?

    @ericsnowcurrently
    Copy link
    Member Author

    Why did you copy test_collections.py instead of just creating
    a new file?

    To preserve the history of changes to the bulk of the code in test_ordereddict.py.

    @ericsnowcurrently
    Copy link
    Member Author

    I should clarify, odict.diff passes test_ordereddict. Regardless, it's pretty close. I'm going to figure out why the review link hates me on this issue, but until then any extra eyes here would be appreciated. The memory-related issues are pushing well past my experience.

    My goal is to get this in before PyCon and to have a reasonable plan at least for implementing **kwargs as OrderedDict. My intuition is that writing OrderedDict in C is the hard part.

    @ericsnowcurrently
    Copy link
    Member Author

    My current patch ends up with O(n) deletion, which won't fly, so I'm refactoring. While I'm at it I'm also planning on using the BSD queue.h doubly-linked list rather than the one that I rolled. I'm also going to pull the ordered dict implementation into it's own source file. However, these things should not have much of an impact on most of the code I've already written. I anticipate that the changes won't translate into a further large volume of work.

    In talking to Raymond, he emphasized the importance of making sure we avoid reentrancy problems. I'll be double-checking that and likely making use of the GIL in a couple spots.

    While the bulk of the implementation is complete, the remaining work to do here is what I've described above, along with more testing. An orthogonal problem is addressing the problem of the concrete dict API. I'll bring that up separately when this issue is basically done.

    @rhettinger
    Copy link
    Contributor

    I just looked at the test-collections patch and don't want it committed. It is full of trivial edits that make it hard to review and doesn't make the test suite better.

    It is okay to add some new tests, but I don't want to rearrange the existing code with minor edits. That will just make it more difficult to maintain the code across versions (it is no fun to backport a fix and then have unnecessary merge conflicts).

    @ericsnowcurrently
    Copy link
    Member Author

    It's worth noting that bpo-10977 is pretty closely related to this isssue.

    @ericsnowcurrently
    Copy link
    Member Author

    I just looked at the test-collections patch and don't want it
    committed. It is full of trivial edits that make it hard to
    review and doesn't make the test suite better.

    Thanks for the feedback, Raymond. I agree that there are a number of trivial edits there that shouldn't be there. I'll fix those.

    @ericsnowcurrently
    Copy link
    Member Author

    Raymond, do you have any objections to splitting the OrderedDict-related tests into their own module. Doing so allows for testing all the those test classes at once, which is handy when working on OrderedDict. This is more relevant now with the extra PEP-399-motivated tests.

    @ericsnowcurrently
    Copy link
    Member Author

    I finally had some time so here's an updated patch (including O(1) deletions). I've also added a bunch of notes at the top of odictobject.c. Some notable things left to do:

    • address two recently failing tests due to r83881 (bpo-17900)
    • check for any reference cycles (should be fine)
    • validate reentrancy (make sure everything is thread-safe around calls into Python)
    • make sure subclassing is okay
    • compare performance to dict, pure Python OrderedDict

    My goal here is to get an effective OrderedDict that we are happy with, while recognizing that there is room for optimization. However, I won't be okay with faultiness, so the implementation must be complete. This has been my mindset throughout.

    @ericsnowcurrently
    Copy link
    Member Author

    Without too many optimzations, the C implementation of OrderedDict is basically between 1x and 4x the performance of raw dict. This contrasts with the pure Python OrderedDict, which falls in roughly between 4x and 100x.

    I've attached an addition to pybench, not intended to be committed, that compares the 3. Run any of these commands to get timing:

    ./python Tools/pybench/pybench.py -t ODict
    ./python Tools/pybench/pybench.py -t ODict -t _Py
    ./python Tools/pybench/pybench.py -t ODict -t _C
    ./python Tools/pybench/pybench.py -t ODict -t _Dict

    I tuned the # rounds for each test such that the raw dict results all come out to just about the same value (2ms on my computer). That way it's easy to compare the pure Python or C results to the dict results.

    @ericsnowcurrently
    Copy link
    Member Author

    Here's an updated patch that has fixes for recursive pickles and for a couple memory-related segfaults that I missed before. That leaves the following to be done:

    • handle the case where a node is deleted during iteration
    • check for any reference cycles (should be fine)
    • validate reentrancy (make sure everything is thread-safe around calls into Python)
    • make sure subclassing is okay
    • address any remaining TODOs in the code

    @ericsnowcurrently
    Copy link
    Member Author

    Here's one solution to the deletion-during-iteration problem. It makes the iterators track the key rather than the node, at the expense of a sliver of speed on __iter__() and keys().

    @ericsnowcurrently
    Copy link
    Member Author

    The concern about reference cycles here lies in their existence in the linked-list. To see what I mean, check out the use of weakrefs in the pure Python implementation at Lib/collections/init.py. As the current implementation does not use PyObjects for the linked-list, I'm going to call this a non-issue.

    @ericsnowcurrently
    Copy link
    Member Author

    Here's what I feel is a nearly complete patch. The only outstanding issues that I feel need to be answered are 4 spots where calls into the interpreter may result in unexpected changes to the object or make the current function state out-of-date.

    1. _odict_clear_nodes() while iterating over the nodes.
    2. _odict_keys_equal() while iterating over the nodes.
    3. odict_sizeof() -- I'm not too worried about this one.
    4. _odict_copy() while iterating over the nodes.

    Once I feel comfortable with some resolution for those, I'm going to consider the patch ready to go, pending reviews.

    @ericsnowcurrently
    Copy link
    Member Author

    I figured out what I hope were the last memory-related issues. Apparently tp_traverse is not inherited if tp_flags is set. I had it set on all the view types and all the iterator types. So during GC it would blow up when it tried to call tp_traverse.

    Everything is looking pretty good so I'm attaching the latest diff.

    @serhiy-storchaka
    Copy link
    Member

    I agree with all Stephan's comments on Rietveld. See also bpo-24115 that fixed bugs similar to found in C implementation of OrderedDict.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 30, 2015

    New changeset 0a7380984e37 by Eric Snow in branch '3.5':
    Issue bpo-16991: Use PyObject_TypeCheck instead of PyObject_IsInstance.
    https://hg.python.org/cpython/rev/0a7380984e37

    @ericsnowcurrently
    Copy link
    Member Author

    PyObject_TypeCheck() should be used instead of PyObject_IsInstance() (see
    bpo-24257).

    Thanks for pointing this out. I've fixed both dictobject.h and odictobject.h.

    Perhaps Py_ODict_GetItemId() should be private API as relevant dict function.

    This isn't needed for 3.5, right? I'm not opposed to adding more
    functions to the C API for OrderedDict, but I wanted to start out as
    small as possible. My main motivation for the C implementation was
    for use in the interpreter and I hadn't given much thought to the
    direct use of C OrderedDict by extension modules. That can be
    addressed in 3.6 if it's important. I supposed you could double-check
    with Larry if you want to add more odict support to the C-API for 3.5.

    @BreamoreBoy
    Copy link
    Mannequin

    BreamoreBoy mannequin commented May 30, 2015

    I'm getting the following linker errors on Windows 8.1 for 32 and 64 bit debug and release builds. unresolved external symbol _PyODict_Type in C:\cpython\PCbuild\_collectionsmodule.obj, and _PyODictIter_Type,
    _PyODictValues_Type, _PyODictKeys_Type,_PyODictItems_Type in C:\cpython\PCbuild\object.obj

    @ericsnowcurrently
    Copy link
    Member Author

    New changeset 0a7380984e37 by Eric Snow in branch '3.5':
    Issue bpo-16991: Use PyObject_TypeCheck instead of PyObject_IsInstance.
    https://hg.python.org/cpython/rev/0a7380984e37

    I've also merged this into default:

    changeset: 96384:10eabadba316b6f2034db4c076a60c63d25d8fc6

    @ericsnowcurrently
    Copy link
    Member Author

    I'm getting the following linker errors on Windows 8.1 for 32 and 64 bit debug and release builds. unresolved external symbol _PyODict_Type in C:\cpython\PCbuild\_collectionsmodule.obj, and _PyODictIter_Type,
    _PyODictValues_Type, _PyODictKeys_Type,_PyODictItems_Type in C:\cpython\PCbuild\object.obj

    Hmm. I'm not too familiar with how things work for Windows. I'm also
    not clear on where the leading underscore is coming from.

    @serhiy-storchaka
    Copy link
    Member

    This isn't needed for 3.5, right? I'm not opposed to adding more
    functions to the C API for OrderedDict, but I wanted to start out as
    small as possible.

    You already added public name Py_ODict_GetItemId. It uses private _Py_Identifier API and shouldn't be public.

    @serhiy-storchaka
    Copy link
    Member

    As for Windows issue, perhaps these names should be enumerated in PC/python3.def? See also bpo-23903.

    @ericsnowcurrently
    Copy link
    Member Author

    You already added public name Py_ODict_GetItemId. It uses private _Py_Identifier API and shouldn't be public.

    Ah. I'll remove it.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 30, 2015

    New changeset c9404fba02ba by Eric Snow in branch '3.5':
    Issue bpo-16991: Properly handle return values in several places.
    https://hg.python.org/cpython/rev/c9404fba02ba

    New changeset 951a3ef82180 by Eric Snow in branch '3.5':
    Issue bpo-16991: Do not return None from OrderedDict.__reversed__.
    https://hg.python.org/cpython/rev/951a3ef82180

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 30, 2015

    New changeset 7117e9b0f595 by Eric Snow in branch '3.5':
    Issue bpo-16991: Drop Py_ODict_GetItemId.
    https://hg.python.org/cpython/rev/7117e9b0f595

    @ericsnowcurrently
    Copy link
    Member Author

    If it's just a matter of adding the definitions then here's a patch. Does that look correct?

    @ericsnowcurrently
    Copy link
    Member Author

    That last message is about building on Windows.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 30, 2015

    New changeset 030205f1e716 by Eric Snow in branch '3.5':
    Issue bpo-16991: Fix a few leaks and other memory-related concerns in OrderedDict.
    https://hg.python.org/cpython/rev/030205f1e716

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 30, 2015

    New changeset fbe74badb0c6 by Eric Snow in branch 'default':
    Issue bpo-16991: Ensure that the proper OrderedDict is used in tests.
    https://hg.python.org/cpython/rev/fbe74badb0c6

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 30, 2015

    New changeset 1d851112922f by Eric Snow in branch '3.5':
    Issue bpo-16991: Add PyODict* to Windows builds.
    https://hg.python.org/cpython/rev/1d851112922f

    @ericsnowcurrently
    Copy link
    Member Author

    I'm checking a fix for Windows against a buildbot (http://buildbot.python.org/all/builders/AMD64%20Windows8%203.x).

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 30, 2015

    New changeset 3ba1fa925f17 by Eric Snow in branch 'default':
    Issue bpo-16991: Use the correct version for master.
    https://hg.python.org/cpython/rev/3ba1fa925f17

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 30, 2015

    New changeset aea27164d207 by Eric Snow in branch '3.5':
    Issue bpo-16991: Add odictobject.h on Windows.
    https://hg.python.org/cpython/rev/aea27164d207

    @ericsnowcurrently
    Copy link
    Member Author

    Well, that last one has everything compiling again. I expect it should be okay now. I'll watch the results.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented May 31, 2015

    Opening again. I have too many questions. :)

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jun 1, 2015

    crash-1.py is due to an unchecked return value from _odictnode_VALUE().

    We should probably use PyDict_GetItemWithError(), also in other
    places.

    I normally try to steer clear of stylistic remarks, but the
    _odictnode* macros are hiding too many things. As of now,
    they were hiding that an assert() is always true and that a
    return value was unchecked.

    Also, they're very inconvenient in a debugger.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jun 1, 2015

    crash-2.py is due to the fact that _PyDict_Pop() deletes a reference
    to 'key' in _odict_popkey().

    The INCREF(key) in popitem should take place before calling _odict_popkey().

    Again, I don't see the point of INCREF/DECREF *inside* of _odict_popkey().

    @ericsnowcurrently
    Copy link
    Member Author

    @jim and Stefan, Thanks for thorough reviews!

    @Stefan, I'll take a look at those crashers and other suggestions ASAP. I really appreciate you taking the time. Now that the patch has been landed, would you mind opening new issues for each problem you find? That will help keep individual problems from getting lost. Thanks!

    @tiran
    Copy link
    Member

    tiran commented Jun 1, 2015

    Coverity has found an issue in odict, too:

    *** CID 1302699:  Null pointer dereferences  (NULL_RETURNS)
    /Objects/odictobject.c: 1316 in odict_copy()
    1310             od_copy = PyObject_CallFunctionObjArgs((PyObject *)Py_TYPE(od), NULL);
    1311         if (od_copy == NULL)
    1312             return NULL;
    1313     
    1314         if (PyODict_CheckExact(od)) {
    1315             _odict_FOREACH(od, node) {
    >>>     CID 1302699:  Null pointer dereferences  (NULL_RETURNS)
    >>>     Dereferencing a pointer that might be null "PyDict_GetItem((PyObject *)(PyObject *)od, node->key)" when calling "PyODict_SetItem".
    1316                 int res = PyODict_SetItem((PyObject *)od_copy,
    1317                                           _odictnode_KEY(node),
    1318                                           _odictnode_VALUE(node, od));
    1319                 if (res != 0)
    1320                     goto fail;
    1321             }

    @ericsnowcurrently
    Copy link
    Member Author

    I've opened the following issues to address the 3 last comments:

    bpo-24347
    bpo-24348
    bpo-24349

    I'll be opening a separate issue for outstanding review comments.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jun 2, 2015

    It's fine to open other issues, but I'm not happy with the resize()/get_index() situation. Currently I can't come up even with an informal "proof" that it'll always work (and see bpo-24361).

    I think these functions really *need* 100% code coverage.

    @ericsnowcurrently
    Copy link
    Member Author

    Addressing the concerns with resize()/get_index() is next on my list. I had meant to open up an issue on it last night but it was getting pretty late for me and it slipped my mind. I've opened bpo-24362 to track that work.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) release-blocker type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests