Classification
Title: bsddb: segfault on db.associate call with Txn and large data
Type:
Stage:
Components: Extension Modules
Versions: Python 2.4

Process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: nnorwitz
Nosy List: gregory.p.smith, jcea, nnorwitz, rshura
Priority: normal
Keywords:

Created on 2006-01-23 20:35 by rshura, last changed 2008-03-26 18:16 by jcea. This issue is now closed.

Files
File name        Uploaded                       Description
test3.py         rshura, 2006-01-23 20:41       Txn-less code
test1413192.py   nnorwitz, 2006-01-24 06:45
1413192.patch    nnorwitz, 2006-01-24 07:03     fix attempt 1
Messages (16)
msg27331 - Author: Alex Roitman (rshura) Date: 2006-01-23 20:35
Problem confirmed on Python2.3.5/bsddb4.2.0.5 and
Python2.4.2/bsddb4.3.0 on Debian sid and Ubuntu Breezy.

It appears that the associate call, which is needed to
create a secondary index, segfaults when:
1. There is a large amount of data.
2. The environment is transactional.

The archive at
http://www.gramps-project.org/files/bsddb/testcase.tar.gz
contains the example code and two databases, pm.db and
pm_ok.db. Both have the same number of keys, and each
data item is a pickled tuple with two elements. The
secondary index is created over the unpickled data[1].
pm.db segfaults and pm_ok.db does not. The second
db has much smaller data items in data[0].

If the environment is set up and opened without TXN,
then pm.db is also fine. This looks like a problem in the
associate call in a TXN environment that only shows up
with large enough data.

Please let me know if I can be of further assistance.
This is a show-stopper issue for me, and I would go out of
my way to help resolve it or find a workaround.

Thanks!
Alex

P.S. I could not attach the large file, probably due to
the size limit on the upload, hence a link to the testcase.
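
For readers without the testcase archive, the secondary-key callback described above can be sketched roughly as follows. This is a minimal sketch, assuming each stored value is a pickled two-element tuple and the secondary key is data[1]. The name find_surname appears in the traceback later in this thread, but its body here is an assumption, as are the sample records. The callback receives the primary key and the raw stored data, as associate callbacks do in bsddb.

```python
import pickle


def find_surname(key, data):
    """Secondary-key callback: unpickle the value and return element 1.

    The body is an assumption based on the report; only the name comes
    from the thread.
    """
    record = pickle.loads(data)
    surname = record[1]
    # Database keys are byte strings, so encode text if necessary.
    if not isinstance(surname, bytes):
        surname = surname.encode("utf-8")
    return surname


# Hypothetical records: same secondary key, very different data[0] sizes,
# mirroring the pm_ok.db (small) versus pm.db (large) distinction.
small = pickle.dumps(("x" * 10, "Smith"))
large = pickle.dumps(("x" * 100000, "Smith"))
```

The callback returns the same key for both records; only the size of the pickled value differs, which is the one variable the report isolates.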
msg27332 - Author: Alex Roitman (rshura) Date: 2006-01-23 20:41
Attaching test3.py, which contains the same code without
transactions. It works fine with either pm.db or pm_ok.db.
msg27333 - Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2006-01-24 06:45
I've got a much simpler test case.  The problem seems to be
triggered when the txn is deleted after the env (in
Modules/_bsddb.c 917 vs 966).  If I change the variable
names in Python, I don't get the same behaviour (i.e., it
doesn't crash).

I removed the original data file, but if you change the_txn
to txn, that might "fix" the problem.  If not, try playing
with different variable names and see if you can get it to
not crash.  Obviously there needs to be a real fix in C
code, but I'm not sure what needs to happen.  It doesn't
look like we keep enough info to do this properly.
msg27334 - Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2006-01-24 07:03
I spoke too soon.  The attached patch works for me on both
your original test case and my pared-down version.  It also
passes the tests and fixes a potential memory leak.
Let me know if this works for you.
msg27335 - Author: Alex Roitman (rshura) Date: 2006-01-24 18:50
Thanks for a quick response!

OK, first things first: your simpler testcase seems to expose
yet another problem, not the one I had. In particular, your
testcase segfaults for me on python2.4.2/bsddb4.3.0 but
*does not* segfault with python2.3.5/bsddb4.2.0.5.

In my testcase, I can definitely blame the segfault on the
associate call, not on open. I can demonstrate it by either
commenting out the associate call (no segfault) or by
inserting a print statement right before the associate.

So your testcase does not seem to show exactly the same
problem as mine. In my testcase nothing seems to depend on
variable names (as one would expect). I am rebuilding
python2.4 with your patch and will post results soon.
msg27336 - Author: Alex Roitman (rshura) Date: 2006-01-24 19:31
OK, built and installed all kinds of python packages with
the patch. All tests are fine. Here goes:

1. Your testcase goes just fine, no segfault with the
patched version.
2. Mine still segfaults.
3. I ran mine under gdb with python2.4-dbg package, here's
the output (printed "shurafine" is my addition, to make sure
that the correct code is being run):

$ gdb python2.4-dbg
GNU gdb 6.4-debian
Copyright 2005 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i486-linux-gnu"...Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1".

(gdb) run test2.py
Starting program: /usr/bin/python2.4-dbg test2.py
[Thread debugging using libthread_db enabled]
[New Thread -1210038592 (LWP 29629)]
shurafine

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1210038592 (LWP 29629)]
0xb7b57f3e in DB_associate (self=0xb7db9f58, args=0xb7dbd3b4,
    kwargs=0xb7db5e94) at /home/shura/src/python2.4-2.4.2/Modules/_bsddb.c:1219
1219            Py_DECREF(self->associateCallback);
(gdb)

Please let me know if I can be of further assistance.
msg27337 - Author: Alex Roitman (rshura) Date: 2006-01-24 19:37
Ran the same tests on another Debian sid machine, with exactly
the same results (apart from the line number, which differs by
one because of my extra fprintf statement):

(gdb) run test2.py
Starting program: /usr/bin/python2.4-dbg test2.py
[Thread debugging using libthread_db enabled]
[New Thread -1210390848 (LWP 5865)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1210390848 (LWP 5865)]
0xb7b01eb4 in DB_associate (self=0xb7d63df0, args=0xb7d67234,
    kwargs=0xb7d5ee94) at /home/shura/src/python2.4-2.4.2/Modules/_bsddb.c:1218
1218            Py_DECREF(self->associateCallback);
(gdb) 
msg27338 - Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2006-01-24 19:40
Could you pull the version of Modules/_bsddb.c out of SVN
and then apply my patch?  I believe your new problem was
fixed recently.  You can look in the Misc/NEWS file to find
the exact patch.
msg27339 - Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2006-01-24 20:14
fwiw your patch looks good.  it makes sense for a DBTxn to
hold a reference to its DBEnv.

(I suspect there may still be problems if someone calls
DBEnv.close while there are outstanding DBTxns, but doing
something about that would be a lot more work if it's an
actual issue.)
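
The ownership rule mentioned above (a transaction holding a reference to its environment) can be illustrated in pure Python. This is a minimal sketch and not the actual patch, which changes C code in Modules/_bsddb.c. The classes Env and Txn and the log list are hypothetical stand-ins, and the ordering shown relies on CPython's reference counting.

```python
class Env(object):
    """Hypothetical stand-in for DBEnv; records when it is torn down."""

    def __init__(self, log):
        self.log = log

    def __del__(self):
        self.log.append("env closed")


class Txn(object):
    """Hypothetical stand-in for DBTxn that keeps its environment alive."""

    def __init__(self, env):
        # Strong reference: the environment cannot be freed before this txn.
        self.env = env

    def __del__(self):
        self.env.log.append("txn finished")


log = []
env = Env(log)
the_txn = Txn(env)

del env      # only drops the name; the_txn still references the environment
del the_txn  # the txn is finalised first, then the environment it held
```

Without the self.env reference, `del env` would tear the environment down while the transaction was still alive, which is the ordering the earlier message associates with the crash.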
msg27340 - Author: Alex Roitman (rshura) Date: 2006-01-24 20:50
With the SVN version of _bsddb.c I no longer have segfault
with my test. Instead I have the following exception:

Traceback (most recent call last):
  File "test2.py", line 37, in ?
    person_map.associate(surnames,find_surname,db.DB_CREATE,txn=the_txn)
MemoryError: (12, 'Cannot allocate memory -- Lock table is out of available locks')

Now, please bear with me here if you can. It's easy to shrug
this off by saying that I simply don't have enough locks for
this huge txn. But exactly the same code works fine with the
pm_ok.db file from my testcase, and that file has exactly the
same number of elements and exactly the same structure for
both the data and the secondary index computation. So one
would think that it needs exactly the same number of locks,
and yet it works while pm.db does not.

The only difference between the two data files is that in
each data item, data[0] is much larger in pm.db and smaller
in pm_ok.db.

Is it remotely possible that the actual error has nothing to
do with locks but rather with the data size? What can I do
to find out or fix this?

Thanks for your help!
msg27341 - Author: Alex Roitman (rshura) Date: 2006-01-24 21:12
Tried increasing locks, lockers, and locked objects to 10000
each, and that seems to help. So I guess the number of locks
needed depends on the data size. This is indeed a lock issue,
so it's my problem now and not yours :-)

Thanks for your help!
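
One way to raise those limits without changing application code is a DB_CONFIG file in the environment home directory, which Berkeley DB reads when the environment is opened. The sketch below writes such a file. The helper name write_lock_config and the temporary directory are assumptions for illustration, and 10000 is simply the value reported above, not a recommendation. The same limits can also be set from Python with DBEnv.set_lk_max_locks, set_lk_max_lockers and set_lk_max_objects before the environment is opened.

```python
import os
import tempfile


def write_lock_config(env_home, limit=10000):
    """Write a DB_CONFIG file raising the three lock-table limits.

    Hypothetical helper; `limit` defaults to the value from the thread.
    """
    lines = [
        "set_lk_max_locks %d" % limit,
        "set_lk_max_lockers %d" % limit,
        "set_lk_max_objects %d" % limit,
    ]
    path = os.path.join(env_home, "DB_CONFIG")
    with open(path, "w") as config_file:
        config_file.write("\n".join(lines) + "\n")
    return path


# Stand-in for the real environment home directory.
env_home = tempfile.mkdtemp()
config_path = write_lock_config(env_home)
```

The file has to be in place before the environment is opened.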
msg27342 - Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2006-01-25 00:35
BerkeleyDB uses page locking, so it makes sense that a
database with larger data objects in it would require more
locks, assuming it is internally locking each page.  That
kind of tuning gets into BerkeleyDB internals, where I
suspect people on the comp.databases.berkeleydb newsgroup
could answer things better.

glad it's working for you now.
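
The page-locking argument above can be made concrete with some rough arithmetic. This is a back-of-the-envelope sketch, not a description of Berkeley DB's actual lock accounting: the 4096-byte page size, the item count and the item sizes are all made-up assumptions, and per-page overhead is ignored.

```python
import math


def estimated_pages(item_count, item_size, page_size=4096):
    """Rough lower bound on the pages needed to hold the items.

    All inputs are illustrative assumptions, not measured values.
    """
    return int(math.ceil(item_count * item_size / float(page_size)))


# Same number of items, different item sizes (hypothetical figures).
small_db = estimated_pages(10000, 100)     # 245 pages
large_db = estimated_pages(10000, 20000)   # 48829 pages
```

If one transaction touches every page and each page needs its own lock, the larger database needs far more locks even though the key count is identical.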
msg27343 - Author: Alex Roitman (rshura) Date: 2006-01-25 02:21
While you guys are here, can I ask you if there's a way to
return to the checkpoint made in a Txn-aware database?
Specifically, is there a way to return to the latest
checkpoint from within Python?

My problem is that if my data import fails in the middle, I
want to undo some transactions that were committed, to have
a clean import undo. A checkpoint seems like a nice way to do
that, if only I could get back to it :-)
msg27344 - Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2006-01-25 05:23
I'm sorry, I'm not a Berkeley DB developer, I just play one
on TV.  :-)  Seriously, I don't know anything about BDB.  I
was just trying to get it stable.  Maybe Greg can answer
your question.
msg27345 - Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2006-01-25 05:29
Committed revision 42177.
Committed revision 42178. (2.4)
msg27346 - Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2006-01-25 05:31
Oh, I forgot to say thanks for the good bug report and
responding back.
History
Date                 User    Action  Args
2008-03-26 18:16:18  jcea    set     nosy: + jcea
2006-01-23 20:35:45  rshura  create