classification
Title: bsddb: test01_basic_replication fails sometimes
Type: behavior Stage: resolved
Components: Extension Modules, Tests Versions: Python 2.7
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: jcea Nosy List: BreamoreBoy, amaury.forgeotdarc, barry, benjamin.peterson, flox, gregory.p.smith, jcea, mhammond, ocean-city, r.david.murray, trent
Priority: normal Keywords: buildbot, patch

Created on 2008-09-18 01:18 by benjamin.peterson, last changed 2014-03-23 00:52 by r.david.murray. This issue is now closed.

Files
File name Uploaded Description Edit
fix_timeout.patch ocean-city, 2009-01-27 14:29 review
bsddb3-skip-random-failure.patch r.david.murray, 2010-01-29 20:46 review
Messages (23)
msg73348 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-09-18 01:18
This happens on the Windows buildbot everything once and a while [1]:

======================================================================
FAIL: test01_basic_replication
(bsddb.test.test_replication.DBReplicationManager)
----------------------------------------------------------------------
Traceback (most recent call last):
  File
"S:\buildbots\python\trunk.nelson-windows\build\lib\bsddb\test\test_replication.py",
line 122, in test01_basic_replication
    self.assertTrue(time.time()<timeout)
AssertionError

----------------------------------------------------------------------

http://python.org/dev/buildbot/stable/x86%20W2k8%20trunk/builds/159/step-test/0
msg73351 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2008-09-18 01:44
We're going to disable this test for 2.6rc2.  Please jcea and
gregory.p.smith, take a look at it for 2.6 final.  I've made it a
deferred blocker for that release.
msg73352 - (view) Author: Mark Hammond (mhammond) * (Python committer) Date: 2008-09-18 01:51
I instrumented the code a little.  The error is happening because
self.client_startupdone never gets set to True.  This is supposed to be
set in the client_startupdone() method.  It expects an event type of
db.DB_EVENT_REP_STARTUPDONE, but we see exactly one call to this
function with a value of 2, which is apparently DB_EVENT_REP_CLIENT. 
After this call no further calls are made to the method and after 10
seconds the test gives up.
msg73357 - (view) Author: Mark Hammond (mhammond) * (Python committer) Date: 2008-09-18 02:50
As discussed with Barry on #python-dev, I committed r66498 which skips
the failing assertion on Windows and replaces it with some noise to
stderr.  Note that only that one assertion fails - the rest of the test
passes on Windows.

Also, Brett Cannon on #python-dev experimented and could see the
callback function being called a number of times - with 2 first as
Windows does, but then a number of other times and once with the
expected value.  I'm afraid I've no insight into why these aren't
delivered on Windows.
msg73370 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2008-09-18 11:22
Adding Nelson to the nosy list, since I'm unable to debug any Windows
issue by myself.

I'm very interested in the report saying that MS Windows Berkeley DB
correctly sends the db.DB_EVENT_REP_STARTUPDONE event in his setup. That
could indicate some issue in the buildbot.

Barry, I don't consider this a "release blocker" for 2.6.0. This is a
very advanced feature (database replication). This must be solved, but
2.6 doesn't need to be delayed for it.
msg73371 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2008-09-18 11:25
¿Somebody can try this issue in a Windows box?. It could be a Berkeley
DB bug in that platform, but I would like to verify this before
informing Oracle.

Thanks.
msg73386 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2008-09-18 14:43
I've tried this issue on VC6 + db4.7.25.0 + win2k, and I could reproduce
error.

I added following patch for investigation.

Index: Lib/bsddb/test/test_replication.py
===================================================================
--- Lib/bsddb/test/test_replication.py  (revision 66483)
+++ Lib/bsddb/test/test_replication.py  (working copy)
@@ -40,6 +40,7 @@
                 self.confirmed_master=True

         def client_startupdone(a,b,c) :
+            print "DBReplicationManager:", (a, b, c)
             if b==db.DB_EVENT_REP_STARTUPDONE :
                 self.client_startupdone=True

@@ -201,6 +202,7 @@
                 self.confirmed_master = True

         def client_startupdone(a,b,c) :
+            print "DBBaseReplication:", (a, b, c)
             if b == db.DB_EVENT_REP_STARTUPDONE :
                 self.client_startupdone = True

And this is result.

DBReplicationManager: (<DBEnv object at 0x00A80148>, 2, None)
DBReplicationManager: (<DBEnv object at 0x00A80148>, 5, 0)
DBBaseReplication: (<DBEnv object at 0x00B35CD0>, 2, None)
DBBaseReplication: (<DBEnv object at 0x00B35CD0>, 5, 13)
DBBaseReplication: (<DBEnv object at 0x00B35CD0>, 7, None)
DBBaseReplication: (<DBEnv object at 0x00B49658>, 2, None)
DBBaseReplication: (<DBEnv object at 0x00B49658>, 5, 13)
DBBaseReplication: (<DBEnv object at 0x00B49730>, 2, None)
DBBaseReplication: (<DBEnv object at 0x00B49730>, 5, 13)
msg73415 - (view) Author: Mark Hammond (mhammond) * (Python committer) Date: 2008-09-19 02:15
We are seeing one more error almost identical to the one I fixed (even
the method name is the same), but its at line 315 - this is the last
error in the Windows buildbot!  Please let me know if you would like me
to make a similar "fix" to this line, or if you think you will be able
to back out my hack or skip the test some other way.
msg73421 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-09-19 05:56
just make a similar "fix" for now.
msg73452 - (view) Author: Mark Hammond (mhammond) * (Python committer) Date: 2008-09-20 01:18
Actually, I've decided to leave it alone.  The buildbots most recent
failure was:

test test_bsddb3 failed -- Traceback (most recent call last):
  File
"S:\buildbots\python\trunk.nelson-windows\build\lib\bsddb\test\test_replication.py",
line 315, in test01_basic_replication
    self.assertTrue(time.time()<timeout)
AssertionError

But I can't repro that - instead, I now see:

Traceback (most recent call last):
  File "o:\src\python-svn\lib\bsddb\test\test_lock.py", line 127, in
test03_lock_timeout
    self.assertTrue((end_time-start_time) >= 0.1)
AssertionError

So I go back to the windows buildbots and they are all green!!  So any
remaining issues appear truly transient and don't seem to be restricted
to a single place.  I can't think of an obvious improvement to make here.
msg73785 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2008-09-25 12:53
Oracle guys are studying this issue. I will keep you informed.

This issue is not a release blocker, in any case. See rational in msg73370.
msg73912 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2008-09-27 01:37
I need some MS Windows user able to replicate this issue locally (not in
the buildbot). Oracle people need to do some test and I would like to
avoid to mess with builtbots.

Please, help!.
msg73920 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-09-27 08:53
I reproduce the test failure very consistently, on both win2k and winXP, 
and may be of some help to test stuff.
msg80652 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2009-01-27 14:29
I've investigated more. I think this is timeout problem. According to 
following site,

http://www.oracle.com/technology/documentation/berkeley-
db/db/api_c/rep_timeout.html

timeout should be specified in microseconds. This means

self.dbenvMaster.rep_set_timeout(db.DB_REP_CONNECTION_RETRY,100123)

only waits 10msec. This is too small comparing to default timeout. 
Probably 10sec is appropriate here.

After I applied an attached patch "fix_timeout.py", I ran test several 
times and found no error.

# I changed time.time()+10 to time.time()+30 because my machine is too 
slow, so 10sec was not enough. :-[
msg94932 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-11-05 14:49
I changed the timeout to +30, and the bsddb3 test has been passing on
the windows buildbot.  That buildbot is a virtualbox instance running on
a P4, so it makes sense that it just took a long time for the (second)
test to pass.  While testing I noticed that on my gentoo buildbot the
ReplicationManager test would time out but the base test would pass...so
I wonder if the test that is currently being skipped would pass if you
gave that windows buildbot even longer to think about it.  But since
bsddb3 isn't in py3k, I'm not inclined to take the time to find out.
msg95247 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-11-14 18:22
Failures still occur occasionally even with the timeout set to 60.  So
I've turned the check that is skipped on Windows from an assertion into
a warning only on all other platforms, since bsddb support isn't
actively maintained and is gone in py3k.  Fix applied to trunk in r76265
and 2.6 in r76267.
msg97735 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-01-13 19:45
Still occurs, on 2010-01-13.

http://www.python.org/dev/buildbot/all/builders/x86%20Ubuntu%20trunk/builds/264
msg98521 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-01-29 18:35
Hmm.  That's not actually the same timeout.  What that means is that if the startup timeout triggers, then a subsequent test (one that is looking for the results of a client transaction) fails, which contradicts the earlier observation that even if the startup timeout fails then everything else still passes.
msg98537 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-01-29 20:46
Since this isn't likely to get fixed and it is causing buildbot instability, I propose the attached patch to just skip the remainder of the test method if this second timeout happens.
msg100234 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2010-03-01 01:55
r.david.murray - sounds like a good idea.
msg100266 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-03-01 20:09
Applied skip patch to trunk in r78558, and 2.6 in 78560.
msg214525 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-03-23 00:10
Is any follow up needed to this or can it be closed as bsddb has been removed from Python 3?
msg214529 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-03-23 00:52
Yeah, let's just close it.  Nobody is likely to care enough to try to fix it properly.
History
Date User Action Args
2014-03-23 00:52:52r.david.murraysetstatus: languishing -> closed
resolution: out of date
messages: + msg214529

stage: needs patch -> resolved
2014-03-23 00:10:57BreamoreBoysetnosy: + BreamoreBoy

messages: + msg214525
versions: - Python 2.6
2010-03-01 20:09:33r.david.murraysetstatus: open -> languishing

messages: + msg100266
2010-03-01 01:55:27gregory.p.smithsetmessages: + msg100234
2010-01-29 20:47:00r.david.murraysetfiles: + bsddb3-skip-random-failure.patch
keywords: + patch
messages: + msg98537
2010-01-29 18:38:33r.david.murraysetkeywords: + buildbot, - patch
2010-01-29 18:35:40r.david.murraysetmessages: + msg98521
stage: patch review -> needs patch
2010-01-13 19:45:53floxsetnosy: + flox
messages: + msg97735
2009-11-14 18:22:42r.david.murraysettitle: bsddb: test01_basic_replication fails on Windows sometimes -> bsddb: test01_basic_replication fails sometimes
nosy: mhammond, barry, gregory.p.smith, jcea, amaury.forgeotdarc, ocean-city, benjamin.peterson, trent, r.david.murray
messages: + msg95247

components: + Tests
type: behavior
stage: patch review
2009-11-06 02:47:35r.david.murraysetkeywords: - buildbot
2009-11-05 14:49:12r.david.murraysetnosy: + r.david.murray

messages: + msg94932
versions: + Python 2.7
2009-10-30 19:47:16r.david.murraysetkeywords: + buildbot
2009-01-27 14:30:01ocean-citysetfiles: + fix_timeout.patch
keywords: + patch
messages: + msg80652
2008-09-27 08:53:20amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg73920
2008-09-27 01:37:59jceasetmessages: + msg73912
2008-09-26 23:37:33benjamin.petersonsetpriority: release blocker -> normal
2008-09-25 12:53:10jceasetmessages: + msg73785
2008-09-20 01:18:06mhammondsetmessages: + msg73452
2008-09-19 05:56:44gregory.p.smithsetmessages: + msg73421
2008-09-19 02:15:25mhammondsetmessages: + msg73415
2008-09-18 14:43:06ocean-citysetnosy: + ocean-city
messages: + msg73386
2008-09-18 11:25:50jceasetmessages: + msg73371
2008-09-18 11:22:42jceasetnosy: + trent
messages: + msg73370
2008-09-18 05:44:17barrysetpriority: deferred blocker -> release blocker
2008-09-18 02:50:17mhammondsetmessages: + msg73357
2008-09-18 01:51:41mhammondsetnosy: + mhammond
messages: + msg73352
2008-09-18 01:44:45barrysetpriority: deferred blocker
nosy: + barry
messages: + msg73351
2008-09-18 01:43:36barrysetnosy: + gregory.p.smith
2008-09-18 01:18:07benjamin.petersoncreate