classification
Title: dbmmodule inquiry function is performance prohibitive
Type: performance Stage:
Components: Extension Modules Versions: Python 2.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: jafo, jcea, johansen, ysj.ray
Priority: normal Keywords: patch

Created on 2008-02-22 00:41 by johansen, last changed 2011-02-22 13:53 by ysj.ray.

Messages (8)
msg62668 - (view) Author: johansen (johansen) Date: 2008-02-22 00:41
We've been using Python 2.4 to build the new package management software
for OpenSolaris.  We use a ndbm database to hold keywords about
packages, and found that each time we added a new OpenSolaris build to
our package repository, the time to import would increase by about 1/3
of the previous time.

It turns out that we were continually invoking a function in the
dbmmodule that walked the entire database every time the function was
called.

Looking at dbmmodule.c, the source for dbm.so, is instructive:

This is dbm_length, the function that we're _always_ calling.

static int
dbm_length(dbmobject *dp)
{
        if (dp->di_dbm == NULL) {
                 PyErr_SetString(DbmError, "DBM object has already been
closed"); 
                 return -1; 
        }
        if ( dp->di_size < 0 ) {
                datum key;
                int size;

                size = 0;
                for ( key=dbm_firstkey(dp->di_dbm); key.dptr;
                      key = dbm_nextkey(dp->di_dbm))
                        size++;
                dp->di_size = size;
        }
        return dp->di_size;
}

It's a knock-off of function shown in ndbm(3C) that traverses the
database.  It looks like this function walks every record in the
database, and then returns that as its size.

Further examination of dbmmodule shows that dbm_length has been assigned
as the function for the inquiry operator:

static PyMappingMethods dbm_as_mapping = {
        (inquiry)dbm_length,            /*mp_length*/
        (binaryfunc)dbm_subscript,      /*mp_subscript*/
        (objobjargproc)dbm_ass_sub,     /*mp_ass_subscript*/
};


It looks like dbm_length stashes the size of the database, so it doesn't
always have to traverse it.  However, further examination of the source
shows that an insertion doesn't update the di_size counter.  Worse yet,
an update or a delete will cause the counter to be set to -1.  This
means that the next call to dbm_length will have to walk the entire
database all over again.  Ick.

One of the problem parts of the code is this line in catalog.py:
update_searchdb():

                if fmri_list:
                        if not self.searchdb:
                                self.searchdb = \
                                    dbm.open(self.searchdb_file, "c")

This if not triggers the PyObject_IsTrue that invokes the inquiry operator
for the dbm module.  Every time we run this code, we're going to walk
the entire database.  By changing this to:

                if fmri_list:
                        if self.searchdb is None:
                                self.searchdb = \
                                    dbm.open(self.searchdb_file, "c")

We were able to work around the problem by using the is None check,
instead of if not self.searchdb; however, this seems like it is really a
problem with the dbmmodule and should ultimately be resolved there.
msg64126 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2008-03-19 23:56
Proposed patch is inline.
msg64447 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2008-03-24 22:46
I think that "-1" is a sanity check. If the count is updated in the
database, but it is not transactional (or there are bugs, or the DB is
updated by a not up-to-date library, and so on), the cached counter and
the real data can diverge.

Anybody using "Berkeley DB" related databases knows that "length" is
costly. By good reasons, actually :-).

Checking for empty databases should be fast, nevertheless (just iterate
over the first item in the database). We could simply define a
"__nonzero__()" method for that. That would solve the "if" issue.
msg65080 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2008-04-07 17:48
Somebody posted a similar issue in pybsddb. The patch I proposed would
be consistent with current "__len__" *internal* use, but the real
intention, I think, is to return True or False if the database is open
or closed, not if the database is empty or not.

Opinions?

See http://mailman.argo.es/pipermail/pybsddb/2008-April/000028.html
msg65081 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-04-07 17:51
Assigning anything not related to Py3k design to me is a mistake; I
don't have the bandwidth to handle this, sorry.
msg68106 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2008-06-12 23:23
johansen, could you be happy returning True of False according to
database being open/closed?.
msg68107 - (view) Author: johansen (johansen) Date: 2008-06-12 23:29
Yes, True/False should be sufficient for our purposes.  IIRC, we were
trying to determine if we had a stale handle to the database and needed
to open it again.
msg80737 - (view) Author: johansen (johansen) Date: 2009-01-29 01:24
I haven't been able to find any of the patches listed in the comments,
but it does look like providing a nb_nonzero method in the module would
solve our issue.  PyObject_IsTrue checks the tp_as_number methods before
the sequence and mapping methods.  I'm not sure if it's safe to count on
this behavior as always being true, but for 2.4 and the dbmmodule, it
would work.
History
Date User Action Args
2011-02-22 13:53:04ysj.raysetnosy: + ysj.ray
2009-01-29 02:02:27gvanrossumsetnosy: - gvanrossum
2009-01-29 01:24:14johansensetmessages: + msg80737
2008-06-12 23:29:59johansensetmessages: + msg68107
2008-06-12 23:23:33jceasetmessages: + msg68106
2008-04-07 17:51:05gvanrossumsetassignee: gvanrossum ->
messages: + msg65081
2008-04-07 17:48:03jceasetmessages: + msg65080
2008-03-24 22:46:49jceasetmessages: + msg64447
2008-03-24 22:45:08benjamin.petersonsettype: resource usage -> performance
2008-03-24 22:33:05jceasetnosy: + jcea
2008-03-19 23:56:58jafosetpriority: normal
assignee: gvanrossum
messages: + msg64126
keywords: + patch
nosy: + gvanrossum, jafo
2008-02-22 00:41:15johansencreate