classification
Title: dbm corrupts index on macOS (_dbm module)
Type: behavior Stage: resolved
Components: macOS Versions: Python 3.6, Python 3.5, Python 2.7
process
Status: closed Resolution: duplicate
Dependencies: Superseder: ndbm can't iterate through values on OS X (issue 30388)
Assigned To: Nosy List: ned.deily, nneonneo, ronaldoussoren, xiang.zhang
Priority: normal Keywords:

Created on 2018-03-14 06:34 by nneonneo, last changed 2020-11-14 08:44 by ronaldoussoren. This issue is now closed.

Files
File name Uploaded Description
test.db nneonneo, 2018-03-14 06:34 Test database file in _dbm format (macOS)
Messages (7)
msg313809 - (view) Author: Robert Xiao (nneonneo) * Date: 2018-03-14 06:34
Environment: Python 3.6.4, macOS 10.12.6

Python 3's dbm appears to corrupt the key index on macOS if objects >4KB are inserted.

Code:

<<<<<<<<<<<
import dbm
import contextlib

with contextlib.closing(dbm.open('test', 'n')) as db:
    for k in range(128):
        db[('%04d' % k).encode()] = b'\0' * (k * 128)

with contextlib.closing(dbm.open('test', 'r')) as db:
    print(len(db))
    print(len(list(db.keys())))
>>>>>>>>>>>

On my machine, I get the following:

<<<<<<<<<<<
94
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    print(len(list(db.keys())))
SystemError: Negative size passed to PyBytes_FromStringAndSize
>>>>>>>>>>>

(The error says PyString_FromStringAndSize on Python 2.x but is otherwise the same). The expected output, which I see on Linux (using gdbm), is

128
128

I get this error with the following Pythons on my system:

/usr/bin/python2.6 - Apple-supplied Python 2.6.9
/usr/bin/python - Apple-supplied Python 2.7.13
/opt/local/bin/python2.7 - MacPorts Python 2.7.14
/usr/local/bin/python - Python.org Python 2.7.13
/usr/local/bin/python3.5 - Python.org Python 3.5.1
/usr/local/bin/python3.6 - Python.org Python 3.6.4

This seems like a very big problem - silent data corruption with no warning. It appears related to issue30388, but in that case they were seeing sporadic failures. The deterministic script above causes failures in every case.

This was discovered after running some code which used shelve (which uses dbm under the hood) in Python 3, but the bug clearly applies to Python 2 as well.
msg313810 - (view) Author: Robert Xiao (nneonneo) * Date: 2018-03-14 06:35
(Note: the contextlib wrapper is just for Python 2 compatibility; it's not necessary on Python 3.)
msg314665 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2018-03-29 18:15
I highly suspect you don't have gdbm installed in your environment and `import dbm.gnu` fails. When you simply call `dbm.open`, it searches through [dbm.gnu, dbm.ndbm, dbm.dumb] and uses the first backend available. In my environment, macOS 10.13.3, dbm.gnu works correctly (so dbm works correctly) and dbm.ndbm fails with the same error. Currently I cannot see any code in Python's _dbm module that could lead to this error. And POSIX only requires a dbm library to support key/value pairs of at least 1023 bytes.
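The backend search described above can be probed directly; this small diagnostic sketch (not part of the original report) shows which dbm submodules are importable, in the order `dbm.open` tries them:

```python
import dbm

def available_backends():
    """Return the dbm submodules importable here, in the order
    dbm.open() tries them: gnu first, then ndbm, then dumb."""
    found = []
    for name in ("dbm.gnu", "dbm.ndbm", "dbm.dumb"):
        try:
            __import__(name)
            found.append(name)
        except ImportError:
            pass
    return found

print(available_backends())
```

`dbm.whichdb(path)` similarly reports which backend format an existing database file uses.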
msg314667 - (view) Author: Robert Xiao (nneonneo) * Date: 2018-03-29 19:05
So we have some other problems then:

(1) It should be documented in dbm, and ideally in shelve, that keys/values over a certain limit might not work. Presently there is no hint that such a limit exists, and until you mentioned it I was unaware that POSIX only required 1023-byte keys and values.
(2) dbm.ndbm should refuse to perform operations that might corrupt the database, or it should be deprecated entirely if this is impossible. A built-in data storage system for Python should not have an easy corruption route, as it is very surprising for users.
(3) It might be worth considering a "dbm.sqlite" module or somesuch, adapting a SQLite database as a key-value store. The key-value interface is much simpler than full SQL and appropriate for certain applications, while SQLite itself would provide robustness and correctness. I can volunteer to build such a thing on top of the existing Python SQLite support.
(4) The approach of shelve is incompatible with limited-length values, because shelve's pickles are of an unpredictable length. This suggests that shelve should internally prioritize dumbdbm over ndbm if ndbm cannot guarantee support for arbitrary-length keys/values.
(5) dbm.gnu is not a default, and I can't even work out how to get it enabled with the stock Python installation (i.e. without building my own Python against e.g. Macports gdbm). Is it a problem to ship dbm.gnu as part of the default install, so that it is more feasible to assume its existence? 

Thoughts?
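The "dbm.sqlite" idea in point (3) above could be sketched roughly as follows; the class and method names here are illustrative only, not a proposed stdlib API:

```python
import sqlite3

class SQLiteKV:
    """Minimal sketch of a SQLite-backed key-value store: bytes keys
    and values, no size limit beyond SQLite's own (around 1 GB)."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key BLOB PRIMARY KEY, value BLOB)")

    def __setitem__(self, key, value):
        # The connection context manager commits on success.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                (key, value))

    def __getitem__(self, key):
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def keys(self):
        return [r[0] for r in self._conn.execute("SELECT key FROM kv")]

    def __len__(self):
        return self._conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]

    def close(self):
        self._conn.close()
```

Unlike ndbm, this handles the repro script's 128 rows of up to ~16 KB each without corruption.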
msg314674 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-03-29 20:47
Addressing your point (5):

> (5) dbm.gnu is not a default, and I can't even work out how to get it enabled with the stock Python installation (i.e. without building my own Python against e.g. Macports gdbm)

If you are using MacPorts, the easiest way is to use a Python from MacPorts. For example,
  port install py36-gdbm
would install everything you would need to use gdbm with their python3.6.
  
> Is it a problem to ship dbm.gnu as part of the default install, so that it is more feasible to assume its existence?

The main problem is that gdbm is GPL3 licensed.  Python source distributions do not include any GPL3-licensed software to avoid tainting Python itself.  We therefore avoid shipping GPL3 software with python.org binary releases, like our macOS installers.
msg328849 - (view) Author: Robert Xiao (nneonneo) * Date: 2018-10-29 18:21
I just started a new project, thoughtlessly decided to use `shelve` to store data, and lost it all again thanks to this bug.

To reiterate: Although `gdbm` might fix this issue, it's not installed by default. But the issue is with `dbm`: Python is allowing me to insert elements into the database which exceed internal limits, causing the database to become silently corrupt upon retrieval. This is an unacceptable situation - a very normal, non-complex use of the standard library is causing data loss without any indication that the loss is occurring.

At the very least there should be a warning or error that the data inserted exceeds dbm's limits, and in an ideal world dbm would not fall over from inserting a few KB of data in a single row (but I understand that's a third party problem at that point).

Can't we just ship a dbm that is backed with a more robust engine, like a SQLite key-value table?
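Until something like that exists, one workaround (a hedged sketch, not from the thread) is to bypass `dbm.open`'s backend search entirely and build the shelf on `dbm.dumb`, the pure-Python backend, which has no key/value size limit:

```python
import dbm.dumb
import os
import shelve
import tempfile

# Hypothetical file name for illustration.
path = os.path.join(tempfile.mkdtemp(), "safedata")

# shelve.Shelf wraps any dict-like database object, so we can hand it
# a dbm.dumb database directly instead of letting dbm.open pick ndbm.
with shelve.Shelf(dbm.dumb.open(path, "c")) as db:
    for k in range(128):
        db["%04d" % k] = b"\0" * (k * 128)

with shelve.Shelf(dbm.dumb.open(path, "r")) as db:
    count = len(list(db.keys()))
    sample = db["0127"]
```

dbm.dumb is slower than ndbm/gdbm, but it trades speed for not silently losing data.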
msg380962 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2020-11-14 08:44
This is a duplicate of #30388
History
Date User Action Args
2020-11-14 08:44:36 ronaldoussoren set status: open -> closed
superseder: ndbm can't iterate through values on OS X
messages: + msg380962
resolution: duplicate
stage: resolved
2018-10-29 18:21:14 nneonneo set type: behavior
messages: + msg328849
2018-03-29 20:47:50 ned.deily set messages: - msg314673
2018-03-29 20:47:39 ned.deily set messages: + msg314674
2018-03-29 20:46:57 ned.deily set messages: + msg314673
2018-03-29 19:05:46 nneonneo set messages: + msg314667
2018-03-29 18:15:43 xiang.zhang set nosy: + xiang.zhang
messages: + msg314665
2018-03-14 06:36:51 ned.deily set nosy: + ronaldoussoren, ned.deily
components: + macOS
2018-03-14 06:35:02 nneonneo set messages: + msg313810
2018-03-14 06:34:12 nneonneo create