dbm corrupts index on macOS (_dbm module) #77255

Closed
nneonneo mannequin opened this issue Mar 14, 2018 · 7 comments
Labels
OS-mac, type-bug

Comments

@nneonneo
Mannequin

nneonneo mannequin commented Mar 14, 2018

BPO 33074
Nosy @ronaldoussoren, @ned-deily, @zhangyangyu
Superseder
  • bpo-30388: ndbm can't iterate through values on OS X

Files
  • test.db: Test database file in _dbm format (macOS)

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2020-11-14.08:44:36.815>
    created_at = <Date 2018-03-14.06:34:12.674>
    labels = ['OS-mac', 'type-bug']
    title = 'dbm corrupts index on macOS (_dbm module)'
    updated_at = <Date 2020-11-14.08:44:36.814>
    user = 'https://bugs.python.org/nneonneo'

    bugs.python.org fields:

    activity = <Date 2020-11-14.08:44:36.814>
    actor = 'ronaldoussoren'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-11-14.08:44:36.815>
    closer = 'ronaldoussoren'
    components = ['macOS']
    creation = <Date 2018-03-14.06:34:12.674>
    creator = 'nneonneo'
    dependencies = []
    files = ['47484']
    hgrepos = []
    issue_num = 33074
    keywords = []
    message_count = 7.0
    messages = ['313809', '313810', '314665', '314667', '314674', '328849', '380962']
    nosy_count = 4.0
    nosy_names = ['ronaldoussoren', 'nneonneo', 'ned.deily', 'xiang.zhang']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '30388'
    type = 'behavior'
    url = 'https://bugs.python.org/issue33074'
    versions = ['Python 2.7', 'Python 3.5', 'Python 3.6']

    @nneonneo
    Mannequin Author

    nneonneo mannequin commented Mar 14, 2018

    Environment: Python 3.6.4, macOS 10.12.6

    Python 3's dbm appears to corrupt the key index on macOS if objects >4KB are inserted.

    Code:

    <<<<<<<<<<<
    import dbm
    import contextlib

    with contextlib.closing(dbm.open('test', 'n')) as db:
        for k in range(128):
            db[('%04d' % k).encode()] = b'\0' * (k * 128)
    
    with contextlib.closing(dbm.open('test', 'r')) as db:
        print(len(db))
        print(len(list(db.keys())))
    >>>>>>>>>>>

    On my machine, I get the following:

    <<<<<<<<<<<
    94
    Traceback (most recent call last):
      File "test.py", line 10, in <module>
        print(len(list(db.keys())))
    SystemError: Negative size passed to PyBytes_FromStringAndSize
    >>>>>>>>>>>

    (The error says PyString_FromStringAndSize on Python 2.x but is otherwise the same.) The expected output, which I see on Linux (using gdbm), is:

    128
    128

    I get this error with the following Pythons on my system:

    /usr/bin/python2.6 - Apple-supplied Python 2.6.9
    /usr/bin/python - Apple-supplied Python 2.7.13
    /opt/local/bin/python2.7 - MacPorts Python 2.7.14
    /usr/local/bin/python - Python.org Python 2.7.13
    /usr/local/bin/python3.5 - Python.org Python 3.5.1
    /usr/local/bin/python3.6 - Python.org Python 3.6.4

    This seems like a very big problem - silent data corruption with no warning. It appears related to bpo-30388, but in that case they were seeing sporadic failures. The deterministic script above causes failures in every case.

    This was discovered after running some code which used shelve (which uses dbm under the hood) in Python 3, but the bug clearly applies to Python 2 as well.

    @nneonneo
    Mannequin Author

    nneonneo mannequin commented Mar 14, 2018

    (Note: the contextlib stuff is just for Python 2 compatibility; it's not necessary on Python 3.)

    @zhangyangyu
    Member

    I strongly suspect you don't have gdbm installed in your environment and import dbm.gnu will fail. When you simply call dbm.open, it searches through [dbm.gnu, dbm.ndbm, dbm.dumb]. In my environment, macOS 10.13.3, dbm.gnu works correctly (so dbm works correctly) and dbm.ndbm fails with the same error. Currently I cannot see any code in Python's _dbm module that could lead to this error. Also, POSIX only requires a dbm library to support key/value pairs at least 1023 bytes long.
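To check which backend a given installation actually picks, something along these lines may help. It is only a sketch: the helper names available_backends and backend_used_by_default are made up for illustration, the candidate list is the one quoted in the comment above, and other Python versions may search a different list.

```python
import dbm
import os
import tempfile


def available_backends():
    """Return the dbm submodules (from the list quoted above) that import here."""
    found = []
    for name in ("dbm.gnu", "dbm.ndbm", "dbm.dumb"):
        try:
            __import__(name)
        except ImportError:
            # Backend not built or not installed on this system.
            continue
        found.append(name)
    return found


def backend_used_by_default():
    """Create a throwaway database with dbm.open and ask dbm.whichdb what made it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "probe")
        db = dbm.open(path, "n")
        db[b"k"] = b"v"
        db.close()
        return dbm.whichdb(path)


print(available_backends())
print(backend_used_by_default())
```

If the second line printed is dbm.ndbm, the default dbm.open call is going through the platform ndbm library, which is the configuration where the failure in this report shows up.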

    @nneonneo
    Mannequin Author

    nneonneo mannequin commented Mar 29, 2018

    So we have some other problems then:

    (1) It should be documented in dbm, and ideally in shelve, that keys/values over a certain limit might not work. Presently there is no hint that such a limit exists, and until you mentioned it I was unaware that POSIX only required 1023-byte keys and values.
    (2) dbm.ndbm should refuse to perform operations that might corrupt the database, or it should be deprecated entirely if this is impossible. A built-in data storage system for Python should not have an easy corruption route, as it is very surprising for users.
    (3) It might be worth considering "dbm.sqlite" or somesuch, adapting a SQLite database as a key-value store. The key-value store approach is much simpler than sqlite and appropriate for certain applications, while SQLite itself would provide robustness and correctness. I can volunteer to build such a thing on top of the existing Python SQLite support.
    (4) The approach of shelve is incompatible with limited-length values, because shelve's pickles are of an unpredictable length. This suggests that shelve should internally prioritize dumbdbm over ndbm if ndbm cannot guarantee support for arbitrary-length keys/values.
    (5) dbm.gnu is not a default, and I can't even work out how to get it enabled with the stock Python installation (i.e. without building my own Python against e.g. Macports gdbm). Is it a problem to ship dbm.gnu as part of the default install, so that it is more feasible to assume its existence?

    Thoughts?
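Regarding point (4), a user-level workaround is possible today without any change to shelve: hand it a dbm.dumb database explicitly. The following is a rough sketch, not a proposal for how shelve should do it internally; open_dumb_shelf is a hypothetical helper name, and dbm.dumb is assumed to be acceptable for the workload since it is a simple pure-Python store and not built for speed.

```python
import dbm.dumb
import os
import shelve
import tempfile


def open_dumb_shelf(path, flag="c"):
    """Open a shelf stored in the pure-Python dbm.dumb backend (hypothetical helper)."""
    return shelve.Shelf(dbm.dumb.open(path, flag))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shelf")

    # Same size pattern as the reproduction script: values from 0 to ~16 KB.
    with open_dumb_shelf(path, "n") as shelf:
        for k in range(128):
            shelf["%04d" % k] = b"\0" * (k * 128)

    with open_dumb_shelf(path, "r") as shelf:
        count = len(shelf)
        largest = len(shelf["0127"])

print(count, largest)  # prints: 128 16256
```

shelve.Shelf accepts any dict-like object with bytes keys and values, so the same pattern should work with other backends that do not share the size problem.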

    @ned-deily
    Member

    Addressing your point (5):

    > (5) dbm.gnu is not a default, and I can't even work out how to get it enabled with the stock Python installation (i.e. without building my own Python against e.g. Macports gdbm)

    If you are using MacPorts, the easiest way is to use a Python from MacPorts. For example,
    port install py36-gdbm
    would install everything you would need to use gdbm with their python3.6.

    > Is it a problem to ship dbm.gnu as part of the default install, so that it is more feasible to assume its existence?

    The main problem is that gdbm is GPL3 licensed. Python source distributions do not include any GPL3-licensed software to avoid tainting Python itself. We therefore avoid shipping GPL3 software with python.org binary releases, like our macOS installers.

    @nneonneo
    Mannequin Author

    nneonneo mannequin commented Oct 29, 2018

    I just started a new project, thoughtlessly decided to use shelve to store data, and lost it all again thanks to this bug.

    To reiterate: Although gdbm might fix this issue, it's not installed by default. But the issue is with dbm: Python is allowing me to insert elements into the database which exceed internal limits, causing the database to become silently corrupt upon retrieval. This is an unacceptable situation - a very normal, non-complex use of the standard library is causing data loss without any indication that the loss is occurring.

    At the very least there should be a warning or error that the data inserted exceeds dbm's limits, and in an ideal world dbm would not fall over from inserting a few KB of data in a single row (but I understand that's a third party problem at that point).
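As an illustration of the kind of error meant here, a caller-side guard could look roughly like the sketch below. The names checked_store and ASSUMED_LIMIT are hypothetical, the 1023-byte figure is simply the POSIX number quoted earlier in this thread, and applying it to the combined key and value size is an assumption; the real threshold of the macOS ndbm library is not established anywhere in this issue.

```python
# Hypothetical threshold: the POSIX figure quoted earlier in this thread.
# The actual limit of a given ndbm implementation may be different.
ASSUMED_LIMIT = 1023


def checked_store(db, key, value, limit=ASSUMED_LIMIT):
    """Store value under key, refusing pairs whose combined size exceeds limit."""
    size = len(key) + len(value)
    if size > limit:
        raise ValueError(
            "key/value pair is %d bytes, above the assumed limit of %d" % (size, limit)
        )
    db[key] = value


# A plain dict stands in for the database so the sketch runs anywhere.
store = {}
checked_store(store, b"0001", b"\0" * 128)
try:
    checked_store(store, b"0127", b"\0" * (127 * 128))
except ValueError as exc:
    print("refused:", exc)
```

With a guard like this the oversized write fails loudly at insertion time instead of surfacing later as a corrupt index.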

    Can't we just ship a dbm that is backed with a more robust engine, like a SQLite key-value table?
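To make the idea concrete, here is a minimal sketch of a key-value mapping on top of the standard sqlite3 module. It is not an existing dbm backend: the class name SqliteKV and the table layout are invented for illustration, and it commits after every write for simplicity, which a real implementation would probably want to batch.

```python
import contextlib
import sqlite3
from collections.abc import MutableMapping


class SqliteKV(MutableMapping):
    """Hypothetical bytes-to-bytes mapping stored in a single SQLite table."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key BLOB PRIMARY KEY, value BLOB NOT NULL)"
        )
        self._conn.commit()

    def __getitem__(self, key):
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        # REPLACE overwrites an existing row with the same primary key.
        self._conn.execute("REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value))
        self._conn.commit()

    def __delitem__(self, key):
        cur = self._conn.execute("DELETE FROM kv WHERE key = ?", (key,))
        if cur.rowcount == 0:
            raise KeyError(key)
        self._conn.commit()

    def __iter__(self):
        for (key,) in self._conn.execute("SELECT key FROM kv"):
            yield key

    def __len__(self):
        return self._conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]

    def close(self):
        self._conn.close()


# Same workload as the reproduction script, against an in-memory database.
with contextlib.closing(SqliteKV(":memory:")) as db:
    for k in range(128):
        db[("%04d" % k).encode()] = b"\0" * (k * 128)
    db_len = len(db)
    key_count = len(list(db.keys()))

print(db_len)
print(key_count)
```

Because it subclasses MutableMapping, the object also gets keys(), items(), get() and friends for free, and it could be passed to shelve.Shelf in place of a dbm database.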

    @nneonneo nneonneo mannequin added the type-bug label Oct 29, 2018
    @ronaldoussoren
    Contributor

    This is a duplicate of bpo-30388

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022