Title: mailbox: other programs' messages can vanish without trace
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.4, Python 3.5, Python 2.7
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Jim.Jewett, ajaksu2, baikie, barry, jwilk, loewis, nnorwitz, peter.ll, petri.lehtinen, r.david.murray, terry.reedy
Priority: normal Keywords: patch

Created on 2006-11-19 16:03 by baikie, last changed 2014-12-31 16:25 by akuchling.

File name Uploaded Description Edit
mailbox-copy-back.diff baikie, 2006-11-19 16:03 Rewrite single-file mailbox by copying review
mailbox-test.patch akuchling, 2006-12-15 13:17 Adds a test showing problem review
length-checking.diff akuchling, 2006-12-20 14:46 Add length-checking to single-file mailboxes review
mailbox-update-toc.diff baikie, 2006-12-20 19:17 Update _toc on locking
mailbox-test-lock.diff baikie, 2006-12-20 19:19 Lock/unlock around message count in test case
mailbox-docs.diff akuchling, 2007-01-05 19:24 Doc patch review
mailbox-update-toc-fixed.diff baikie, 2007-01-06 15:24 Return correct key from add()
mailbox-copy-back-53287.diff baikie, 2007-01-06 15:30 Update to be against current version review
mailbox-copy-back-new.diff baikie, 2007-01-06 19:57 Rewrite single-file mailbox by copying review
mailbox-fcntl-warn.diff baikie, 2007-01-13 18:32 Warn when flush() may release lock
mailbox-pending-lock.diff akuchling, 2007-01-17 21:05 Small diff
mailbox-unified-patch.diff akuchling, 2007-01-17 21:06 Unified patch that fixes everything. review
mailbox-unified2-test.diff baikie, 2007-01-18 18:14 New version (test part) review
mailbox-unified2-module.diff baikie, 2007-01-18 18:15 New version (module part, semi-broken) review
mailbox-update-toc-new.diff baikie, 2007-01-21 22:10 Preserve message keys when rereading
mailbox-update-toc-again.diff baikie, 2007-04-13 13:45 New version of _update_toc() review
test_mailbox-reread.diff baikie, 2007-04-13 13:46 Tests for concurrency/rereading _toc review
mailbox-babyl-fix.diff baikie, 2007-04-13 13:47 In Babyl class, write preamble before adding messages to an empty file
mailbox-copy-back-2.7.diff baikie, 2014-03-23 22:57
mailbox-all-tests-2.7.diff baikie, 2014-03-23 22:57
mailbox-tests-2.7-part1-for-copy-back.diff baikie, 2014-03-30 21:50
mailbox-tests-2.7-part2.diff baikie, 2014-03-30 21:50
Messages (51)
msg30590 - (view) Author: David Watson (baikie) Date: 2006-11-19 16:03
The mailbox classes based on _singlefileMailbox (mbox, MMDF, Babyl) implement the flush() method by writing the new mailbox contents into a temporary file which is then renamed over the original. Unfortunately, if another program tries to deliver messages while is working, and uses only fcntl() locking, it will have the old file open and be blocked waiting for the lock to become available. Once has replaced the old file and closed it, making the lock available, the other program will write its messages into the now-deleted "old" file, consigning them to oblivion.

I've caused Postfix on Linux to lose mail this way (although I did have to turn off its use of dot-locking to do so).

A possible fix is attached.  Instead of new_file being renamed, its contents are copied back to the original file.  If file.truncate() is available, the mailbox is then truncated to size.  Otherwise, if truncation is required, it's truncated to zero length beforehand by reopening self._path with mode wb+.  In the latter case, there's a check to see if the mailbox was replaced while we weren't looking, but there's still a race condition.  Any alternative ideas?

Incidentally, this fixes a problem whereby Postfix wouldn't deliver to the replacement file as it had the execute bit set.
msg30591 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-11-19 20:02
Mailbox locking was invented precisely to support this kind of operation. Why do you complain that things break if you deliberately turn off the mechanism preventing breakage?

I fail to see a bug here.
msg30592 - (view) Author: David Watson (baikie) Date: 2006-11-19 20:44
This is a bug.  The point is that the code is subverting the protection of its own fcntl locking.  I should have pointed out that Postfix was still using fcntl locking, and that should have been sufficient.  (In fact, it was due to its use of fcntl locking that it chose  precisely the wrong moment to deliver mail.)  Dot-locking does protect against this, but not every program uses it - which is precisely the reason that the code implements fcntl locking in the first place.
msg30593 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2006-12-12 21:04
I agree with David's analysis; this is in fact a bug.  I'll try to look at the patch.
msg30594 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2006-12-15 13:17
The attached patch adds a test case to that demonstrates the problem.  No modifications to are needed to show data loss.

Now looking at the patch...

File Added: mailbox-test.patch
msg30595 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2006-12-15 14:06
I'm testing the fix using two Python processes running, and my test case fails even with your patch.  This is due to another bug, even in the patched version.  

mbox has a dictionary attribute, _toc, mapping message keys to positions in the file.  flush() writes out all the messages in self._toc and constructs a new _toc with the new file offsets.  It doesn't re-read the file to see if new messages were added by another process.

One fix that seems to work: instead of doing 'self._toc = new_toc' after flush() has done its work, do self._toc = None.  The ToC will be regenerated the next time _lookup() is called, causing a re-read of all the contents of the mbox.  Inefficient, but I see no way around the necessity for doing this.

It's not clear to me that my suggested fix is enough, though.  Process #1 opens a mailbox, reads the ToC, and the process does something else for 5 minutes.  In the meantime, process #2 adds a file to the mbox.  Process #1 then adds a message to the mbox and writes it out; it never notices process #2's change.

Maybe the _toc has to be regenerated every time you call lock(), because at this point you know there will be no further updates to the mbox by any other process.  Any unlocked usage of _toc should also really be regenerating _toc every time, because you never know if another process has added a message... but that would be really inefficient.

msg30596 - (view) Author: David Watson (baikie) Date: 2006-12-16 19:09
Yes, I see what you mean.  I had tried multiple flushes, but only
inside a single lock/unlock.  But this means that in the no-truncate()
code path, even this is currently unsafe, as the lock is momentarily
released after flushing.

I think _toc should be regenerated after every lock(), as with the
risk of another process replacing/deleting/rearranging the messages,
it isn't valid to carry sequence numbers from one locked period to
another anyway, or from unlocked to locked.  However, this runs the
risk of dangerously breaking code that thinks it is valid to do so,
even in the case where the mailbox was *not* modified (i.e. turning
possible failure into certain failure).  For instance, if the program
removes message 1, then as things stand, the key "1" is no longer
used, and removing message 2 will remove the message that followed 1.
If _toc is regenerated in between, however (using the current code, so
that the messages are renumbered from 0), then the old message 2
becomes message 1, and removing message 2 will therefore remove the
wrong message.  You'd also have things like pending deletions and
replacements (also unsafe generally) being forgotten.  So it would
take some work to get right, if it's to work at all...
msg30597 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2006-12-18 17:43
Eep, correct; changing the key IDs would be a big problem for existing code.  We could say 'discard all keys' after doing lock() or unlock(), but this is an API change that means the fix couldn't be backported to 2.5-maint.

We could make generating the ToC more complicated, preserving key IDs when possible; that may not be too difficult, though the code might be messy.

Maybe it's best to just catch this error condition: save the size of the mailbox, updating it in _append_message(), and then make .flush() raise an exception if the mailbox size has unexpectedly changed.
msg30598 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2006-12-20 14:46
Attaching a patch that adds length checking: before doing a flush() on a single-file mailbox, seek to the end and verify its length is unchanged.  It raises an ExternalClashError if the file's length has changed.  (Should there be a different exception for this case, perhaps a subclass of ExternalClashError?)

I verified that this change works by running a program that added 25 messages, pausing between each one, and then did 'echo "new line" > /tmp/mbox' from a shell while the program was running.

I also noticed that the self._lookup() call in self.flush() wasn't necessary, and replaced it by an assertion.

I think this change should go on both the trunk and 25-maint branches.

File Added: length-checking.diff
msg30599 - (view) Author: David Watson (baikie) Date: 2006-12-20 19:17
Yeah, I think that should definitely go in.  ExternalClashError or a
subclass sounds fine to me (although you could make a whole taxonomy
of these errors, really).  It would be good to have the code actually
keep up with other programs' changes, though; a program might just
want to count the messages at first, say, and not make changes until
much later.

I've been trying out the second option (patch attached, to apply on
top of mailbox-copy-back), regenerating _toc on locking, but
preserving existing keys.  The patch allows existing _generate_toc()s
to work unmodified, but means that _toc now holds the entire last
known contents of the mailbox file, with the 'pending' (user-visible)
mailbox state being held in a new attribute, _user_toc, which is a
mapping from keys issued to the program to the keys of _toc
(i.e. sequence numbers in the file).  When _toc is updated, any new
messages that have appeared are given keys in _user_toc that haven't
been issued before, and any messages that have disappeared are removed
from it.  The code basically assumes that messages with the same
sequence number are the same message, though, so even if most cases
are caught by the length check, programs that make
deletions/replacements before locking could still delete the wrong
messages.  This behaviour could be trapped, though, by raising an
exception in lock() if self._pending is set (after all, code like that
would be incorrect unless it could be assumed that the mailbox module
kept hashes of each message or something).

Also attached is a patch to the test case, adding a lock/unlock around
the message count to make sure _toc is up-to-date if the parent
process finishes first; without it, there are still intermittent

File Added: mailbox-update-toc.diff
msg30600 - (view) Author: David Watson (baikie) Date: 2006-12-20 19:19
File Added: mailbox-test-lock.diff
msg30601 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2006-12-20 19:48
Committed length-checking.diff to trunk in rev. 53110.
msg30602 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2007-01-05 19:24
As a step toward improving matters, I've attached the suggested doc patch (for both 25-maint and trunk).  It encourages people to use Maildir :), explicitly states that modifications should be bracketed by lock(), and fixes the examples to match.

It does not say that keys are invalidated by doing a flush(), because we're going to try to avoid the necessity for that.

File Added: mailbox-docs.diff
msg30603 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2007-01-05 19:51
Question about mailbox-update-doc: the add() method still returns self._next_key - 1; should this be 
self._next_user_key - 1?  The keys in _user_toc are the ones returned to external users of the mailbox, right?

(A good test case would be to initialize _next_key to 0 and _next_user_key to a different value like 123456.)

I'm still staring at the patch, trying to convince myself that it will help -- haven't spotted any problems, but this bug is making me nervous...
msg30604 - (view) Author: David Watson (baikie) Date: 2007-01-06 15:24
Aack, yes, that should be _next_user_key.  Attaching a fixed version.

I've been thinking, though: flush() does in fact invalidate the keys
on platforms without a file.truncate(), when the fcntl() lock is
momentarily released afterwards.  It seems hard to avoid this as,
perversely, fcntl() locks are supposed to be released automatically on
all file descriptors referring to the file whenever the process closes
any one of them - even one the lock was never set on.

So, code using such platforms could inadvertently be
carrying keys across an unlocked period, which is not made safe by the
update-toc patch (as it's only meant to avert disasters resulting from
doing this *and* rebuilding the table of contents, *assuming* that
another process hasn't deleted or rearranged messages).

File Added: mailbox-update-toc-fixed.diff
msg30605 - (view) Author: David Watson (baikie) Date: 2007-01-06 15:30
File Added: mailbox-copy-back-53287.diff
msg30606 - (view) Author: David Watson (baikie) Date: 2007-01-06 19:57
Oops, length checking had made the first two lines of this patch
redundant; update-toc applies OK with fuzz.

File Added: mailbox-copy-back-new.diff
msg30607 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2007-01-12 17:12
So shall we document flush() as invalidating keys, then?
msg30608 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2007-01-12 18:41
I realized that making flush() invalidate keys breaks the final example in the docs, which loops over inbox.iterkeys() and removes messages, doing a pack() after each message.

Which platforms lack file.truncate()?  Windows has it; POSIX has it, so modern Unix variants should all have it.  Maybe mailbox should simply raise an exception (or trigger a warning?) if truncate is missing, and we should then assume that flush() has no effect upon keys.
msg30609 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2007-01-12 19:41
One OS/2 port lacks truncate(), and so does RISCOS.
msg30610 - (view) Author: David Watson (baikie) Date: 2007-01-13 18:32
I like the warning idea - it seems appropriate if the problem is
relatively rare.  How about this?

File Added: mailbox-fcntl-warn.diff
msg30611 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2007-01-15 19:01
Comment from Andrew MacIntyre (os2vacpp is the OS/2 that lacks ftruncate()):


I actively support the OS/2 EMX port (sys.platform == "os2emx"; build
directory is PC/os2emx).  I would like to keep the VAC++ port alive, but
the reality is I don't have the resources to do so.  The VAC++ port was
the subject of discussion about removal of build support support from the
source tree for 2.6 - I don't recall there being a definitive outcome,
but if someone does delete the PC/os2vacpp directory, I'm not in a
position to argue.

AMK: ( has a separate section of code used when file.truncate()
isn't available, and the existence of this section is bedevilling me.
It would be convenient if platforms without file.truncate() weren't a
factor; then this branch could just be removed.  In your opinion,
would it hurt OS/2 users of if support for platforms
without file.truncate() was removed?)

aimacintyre: No.  From what documentation I can quickly check, ftruncate() operates
on file descriptors rather than FILE pointers.  As such I am sure that,
if it became an issue, it would not be difficult to write a ftruncate()
emulation wrapper for the underlying OS/2 APIs that implement the
required functionality.

msg30612 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2007-01-17 06:48
Andrew, do you need any help with this?
msg30613 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2007-01-17 19:56
Committed a modified version of the doc. patch in rev. 53472 (trunk) and rev. 53474 (release25-maint).
msg30614 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2007-01-17 20:53
mailbox-unified-patch contains David's copy-back-new and fcntl-warn patches, plus the test-mailbox patch and some additional changes to from me.  (I'll upload a diff to David's diffs in a minute.)

This is the change I'd like to check in. now passes, as does the script I'm using.
msg30615 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2007-01-17 21:05
mailbox-pending-lock is the difference between David's copy-back-new + fcntl-warning and my -unified patch, uploaded so that he can comment on the changes.

The only change is to make _singleFileMailbox.lock() clear self._toc, forcing a re-read of the mailbox file.  If _pending is true, the ToC isn't cleared and a warning is logged.  I think this lets existing code run (albeit possibly with a warning if the mailbox is modified before .lock() is called), but fixes the risk of missing changes written by another process.

Triggering a new warning is sort of an API change, but IMHO it's still worth backporting; code that triggers this warning is at risk of losing messages or corrupting the mailbox.

Clearing the _toc on .lock() is also sort of an API change, but it's necessary to avoid the potential data loss.  It may lead to some redundant scanning of mailbox files, but this price is worth paying, I think; people probably don't have 2Gb mbox files (I hope not, anyway!) and no extra read is done if you create the mailbox and immediately lock it before looking anything up.

Neal: if you want to discuss this patch or want an explanation of something, feel free to chat with me about it.

I'll wait a day or two and see if David spots any problems.  If nothing turns up, I'll commit it to both trunk and release25-maint.

File Added: mailbox-pending-lock.diff
msg30616 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2007-01-17 21:06
Add mailbox-unified-patch.

File Added: mailbox-unified-patch.diff
msg30617 - (view) Author: David Watson (baikie) Date: 2007-01-18 18:14
Unfortunately, there is a problem with clearing _toc: it's basically
the one I alluded to in my comment of 2006-12-16.  Back then I thought
it could be caught by the test you issue the warning for, but the
application may instead do its second remove() *after* the lock() (so
that self._pending is not set at lock() time), using the key carried
over from before it called unlock().  As before, this would result in
the wrong message being deleted.

I've added a test case for this (diff attached), and a bug I found in
the process whereby flush() wasn't updating the length, which could
cause subsequent flushes to fail (I've got a fix for this).  These
seem to have turned up a third bug in the MH class, but I haven't
looked at that yet.

File Added: mailbox-unified2-test.diff
msg30618 - (view) Author: David Watson (baikie) Date: 2007-01-18 18:15
This version passes the new tests (it fixes the length checking bug,
and no longer clears self._toc), but goes back to failing test_concurrent_add.

File Added: mailbox-unified2-module.diff
msg30619 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2007-01-19 15:24
After reflection, I don't think the potential changing actually makes things any worse.  _generate() always starts numbering keys with 1, so if a message's key changes because of lock()'s, re-reading, that means someone else has already modified the mailbox.  Without the ToC clearing, you're already fated to have a corrupted mailbox because the new mailbox will be written using outdated file offsets.  With the ToC clearing, you delete the wrong message.  Neither outcome is good, but data is lost either way.  

The new behaviour is maybe a little bit better in that you're losing a single message but still generating a well-formed mailbox, and not a randomly jumbled mailbox.

I suggest applying the patch to clear self._toc, and noting in the documentation that keys might possibly change after doing a lock().
msg30620 - (view) Author: David Watson (baikie) Date: 2007-01-20 18:20
Hang on.  If a message's key changes after recreating _toc, that does
not mean that another process has modified the mailbox.  If the
application removes a message and then (inadvertently) causes _toc to
be regenerated, the keys of all subsequent messages will be
decremented by one, due only to the application's own actions.

That's what happens in the "broken locking" test case: the program
intends to remove message 0, flush, and then remove message 1, but
because _toc is regenerated in between, message 1 is renumbered as 0,
message 2 is renumbered as 1, and so the program deletes message 2
instead.  To clear _toc in such code without attempting to preserve
the message keys turns possible data loss (in the case that another
process modified the mailbox) into certain data loss.  That's what I'm
concerned about.
msg30621 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2007-01-21 03:16
<sigh>  I'm starting to lose track of all the variations on the bug.  Maybe we should just add more warnings to the documentation about locking the mailbox when modifying it and not try to fix this at all.
msg30622 - (view) Author: David Watson (baikie) Date: 2007-01-21 22:10
Hold on, I have a plan.  If _toc is only regenerated on locking, or at
the end of a flush(), then the only way self._pending can be set at
that time is if the application has made modifications before calling
lock().  If we make that an exception-raising offence, then we can
assume that self._toc is a faithful representation of the last known
contents of the file.  That means we can preserve the existing message
keys on a reread without any of that _user_toc nonsense.

Diff attached, to apply on top of mailbox-unified2.  It's probably had
even less review and testing than the previous version, but it appears
to pass all the regression tests and doesn't change any existing

File Added: mailbox-update-toc-new.diff
msg30623 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2007-01-22 15:46
This would be an API change, and therefore out-of-bounds for 2.5.

I suggest giving up on this for 2.5.1 and only fixing it in 2.6.   I'll add another warning to the docs, and maybe to the module as well.

msg30624 - (view) Author: David Watson (baikie) Date: 2007-01-22 20:24
So what you propose to commit for 2.5 is basically mailbox-unified2
(your mailbox-unified-patch, minus the _toc clearing)?
msg30625 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2007-01-24 20:48
I've strengthened the warning again.

The MH bug in unified2 is straightforward: MH.remove() opens a file object, locks it, closes the file object, and then tries to unlock it.  Presumably the MH test case never bothered locking the mailbox before making changes before.
msg30626 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2007-03-28 16:56
Created a branch in SVN at svn+ssh:// to work on this for 2.5.1.  I've committed the unified2 module patch and the test_concurrent_ad() test.
msg30627 - (view) Author: David Watson (baikie) Date: 2007-04-13 13:45
Here's a possible solution to the rereading problem.  It should allow
existing applications to work whether they behave properly or not, as
well as (probably) most third-party subclasses.

The idea is to compare the new table of contents to the list of known
offset pairs, and raise ExternalClashError if *any* of them have
changed or disappeared.  Any new pairs can then be added to _toc under
new keys.  To maintain the list of known pairs, a special dict
subclass is used on self._toc that records every offset pair ever set
in it - even those that are subsequently removed from the mapping.
However, if self._pending is not set when rereading, then the code
doesn't rely on this special behaviour: it just uses
self._toc.itervalues(), which will work even if a subclass has
replaced the special _toc with a normal dictionary.

Ways the code can break:

- If a subclass replaces self._toc and then the application tries to
lock the mailbox *after* making modifications (so that _update_toc()
finds self._pending set, and looks for the special attribute on _toc).

- If a subclass tries to store something other than sequences in _toc.

- If a subclass' _generate_toc() can produce offsets for messages that
don't match those they were written under.

File Added: mailbox-update-toc-again.diff
msg30628 - (view) Author: David Watson (baikie) Date: 2007-04-13 13:46
Some new test cases for this stuff.

File Added: test_mailbox-reread.diff
msg30629 - (view) Author: David Watson (baikie) Date: 2007-04-13 13:47
This fixes the Babyl breakage.  Perhaps it should be in the superclass?

File Added: mailbox-babyl-fix.diff
msg84585 - (view) Author: Daniel Diniz (ajaksu2) (Python triager) Date: 2009-03-30 17:24
Barry: is this on scope for the email sprint?
msg116185 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-09-12 12:46
Does any core committer fancy reviewing the 18 attached patches? :)
msg138216 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-06-12 19:46
D. Watson, could you remove obsolete patches and leave just the ones a patch reviewer should look at?
msg164004 - (view) Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2012-06-25 19:13
It seems to me that there are three separate issues here:

1) Concurrent access to a single-file mailbox will discard some messages even if locking is correctly used. This is demonstrated by mailbox-test.patch, which despite its age applies on top of 3.2 and fails (at least for me)

2) Detecting changes to mbox files is hard and should be improved.

3) When the TOC of an mbox is re-read, the message keys should probably be kept intact. Otherwise, the application would need to prepare for the fact that keys can change without notice in the middle of a loop, for example.

The first point is a serious bug that should be fixed somehow.

The second point has been discussed in greath lengths, but it strikes me that the mtime tracking has not been suggested. AFAIK, this is what most/all mbox readers do. The toc should be re-read when the mtime changes and/or is newer than atime.

The third point also sounds important, but might be tricky to implement.
msg213979 - (view) Author: Jim Jewett (Jim.Jewett) * (Python triager) Date: 2014-03-18 15:24
What is the status here?

As best I can tell from a skim, 

(a)  It should be broken into at least three separate issues.  (Though maybe some can just be closed instead of separated?)

(b)  None of them are security holes, so we missed 2.5 and 2.6, and should now remove 3.2 and possibly 3.3.

(c)  There are patches with at least test cases (and maybe fixes) for all identified issues.

(d)  None of these patches have been applied, so we should probably add 3.4 and and 3.5 to the affected list, and then apply at least the test cases and the uncontroversial fixes.

(e) Ideally, David Watson and/or Andrew Kuchling should sign off on which patches are uncontroversial.
msg214007 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2014-03-18 19:08
The bug got so complicated, with so many variations and failure modes, that I just lost track of what to do next.  

Let's reboot.  Problem #1 is that the single-file mailboxes subvert their own fcntl locking by writing to a temp. file and renaming; David's first patch takes care of that by truncating the original and copying.

Problem #2 is that the mailbox classes keep an internal table-of-contents mapping messages to positions in the file. If some other process writes to the mailbox, this table-of-contents becomes obsolete, so we need to update it.  But the ToC is also used to derive keys, the identifiers kept by scripts using the mailbox, so updating the ToC means these keys become invalid.  And here's where I got confused: do we invalidate the ToC in all sorts of places?  Or do we just document
that you shouldn't trust keys to remain the same unless you've locked the mailbox?  (Except this is only true for single-file mailbox formats - Maildir uses unique strings that don't change.)

I suggest we apply the fix for #1, and for #2 just discard and update the ToC when we lock the mailbox, ignoring other possible routes to corruption (so no detecting the problem and raising an ExternalClash exception).  Since 2007 the docs have said "If you’re modifying a mailbox, you must lock it by calling the lock() and unlock() methods before reading any messages in the file or making any changes".

(I'm also reducing the priority of this bug; clearly it's not "high".)
msg214043 - (view) Author: Jim Jewett (Jim.Jewett) * (Python triager) Date: 2014-03-19 01:20
OK, if I understand *that* correctly,

(1)  The locking mechanism doesn't really work, and that is too complicated to fix in this issue.  A new issue would be fine.

(2)  The locking failure messes up the Table Of Contents, but that is too comprehensive a change to fix here, and should be handled in the new issue.

(3)  The locking failure also causes data loss/corruption in the messages themselves, and that *can* be fixed simply by truncate+append, instead of renaming.  (Is this actually true, or only true if you *also* implement better change-detection to catch the other process?)

(4)  While the patches do (and test for) #3, there is not currently a patch that *only* does #3.

(5)  This is a pure bug fix; the only reason to change documentation would be to help make triggering the bugs less likely.


Also note that

(6)  At least some of the changes do seem to have been included at some point, so regenerating the patches is not mechanical.

(7)  The documentation has a big warning note, but the wording could use some improvement to emphasize that *even* reading is unsafe, if changes (even status changes, such as a "Seen" flag) could happen later.  Again, based on my possibly faulty understanding:

"If you’re modifying a mailbox, you must lock it by calling the lock() and unlock() methods before reading any messages in the file or making any changes"


"Keys may become invalid at any time, unless the mailbox is locked. 

Notably, another process may modify the underlying file storage so that the key used to retrieve an existing message becomes invalid, or points to a different message.  To prevent this, a mailbox must be locked prior to even reading a message, and must not be unlocked until all changes -- even status changes, such as whether or not a Message was viewed -- for all keys have been completed and flushed."
msg214648 - (view) Author: David Watson (baikie) Date: 2014-03-23 22:57
On Tue 18 Mar 2014, A.M. Kuchling wrote:
> I suggest we apply the fix for #1, and for #2 just discard and
> update the ToC when we lock the mailbox, ignoring other
> possible routes to corruption (so no detecting the problem and
> raising an ExternalClash exception).  Since 2007 the docs have
> said "If you’re modifying a mailbox, you must lock it by
> calling the lock() and unlock() methods before reading any
> messages in the file or making any changes".

Well, the warning you're referring to begins "Be very cautious
when modifying mailboxes that might be simultaneously changed by
some other process. The safest mailbox format to use for such
tasks is Maildir...", suggesting that it only applies to
mailboxes that might in fact be modified by another process.

But if a reread is forced *simply by discarding the ToC*, then
the application's own keys become invalid if it has,
e.g. previously deleted a message, even if it's a private mailbox
that no other process ever touches.  So such a change would make
the warning apply to such mailboxes, whereas previously it (in
effect) did not.

(That does raise the question of why the application is calling
.lock() at all on a mailbox that won't be modified by another
process, but if the programmer thought the warning didn't apply
to their situation, then they presumably wouldn't think that
calling .lock() after modifying might actually be dangerous - and
it currently isn't dangerous for such a mailbox.)

Anyway, I've rebased the mailbox-unified2 patch onto 2.7,
renaming it as mailbox-copy-back-2.7.diff; this fixes the
original (renaming and fcntl locking) issue and adds some
warnings about locking.  I've combined all the tests (I think)
into a separate patch, and ported them to the multiprocessing
module (possibly unwise as I'm not very familiar with it - it
would be nice if someone could test it on Windows).  I haven't
looked at the tests closely again, but they appear to check for
many of the ToC issues.

There actually isn't a test for the original locking issue, as
the tests all use the mailbox API, which doesn't provide a way to
turn off dot-locking.  Also, the module no longer rewrites the
mailbox if messages have only been appended to it - it just syncs
it instead.  However, I have verified by hand that the problem is
still there when the mailbox *is* rewritten due to deletions,
msg215207 - (view) Author: David Watson (baikie) Date: 2014-03-30 21:50
On Sun 23 Mar 2014, David Watson wrote:
> There actually isn't a test for the original locking issue, as
> the tests all use the mailbox API, which doesn't provide a way to
> turn off dot-locking. ...

On second thought, the underlying error doesn't actually have
anything to do with locking - another process can open the
mailbox, and can replace the file before that process
even tries to lock it.  Andrew's test_concurrent_append()
originally *did* test for this, but commit c37cb11b546f changed
the single-file mailbox classes to just sync the file rather than
replacing it when messages have only been added, as in that test,
meaning it's no longer effective (for that issue at least).

I've now made a test for the original renaming-vs-copying issue
which just uses the simple _subprocess mechanism that the other
tests use, rather than trying to set up any special locking
conditions.  I've also split the tests into two patches -
mailbox-tests-2.7-part1-for-copy-back.diff has the test for the
original issue, and passes with the copy-back patch applied,
while mailbox-tests-2.7-part2.diff applies on top and includes
the rest (these involve the separate issues around rereading, and
mostly fail).
msg217378 - (view) Author: Jim Jewett (Jim.Jewett) * (Python triager) Date: 2014-04-28 15:12
pinging David Watson:  What is the status?  If I understand correctly, (and I may well not), you have already opened other issues for parts of this, and (only) the final patch is ready for patch (and hopefully) commit review.  Is this correct?
msg217713 - (view) Author: David Watson (baikie) Date: 2014-05-01 18:11
On Mon 28 Apr 2014, Jim Jewett wrote:
> pinging David Watson:  What is the status?  If I understand correctly, (and I may well not), you have already opened other issues for parts of this, and (only) the final patch is ready for patch (and hopefully) commit review.  Is this correct?

No, I haven't opened any separate issues (I would be perfectly
happy with fixing the original renaming-vs-copying problem and
then leaving this issue open to deal with the rest).  The patches
mailbox-copy-back-2.7.diff and
mailbox-tests-2.7-part1-for-copy-back.diff are what I suggest
applying for the renaming-vs-copying problem.

I haven't looked at the other patches in much detail, but for
what it's worth, the tests in mailbox-tests-2.7-part2.diff do
pass with mailbox-update-toc-again.diff applied.
Date User Action Args
2014-12-31 16:25:26akuchlingsetnosy: - akuchling
2014-05-12 09:18:30jwilksetnosy: + jwilk
2014-05-01 18:11:37baikiesetmessages: + msg217713
2014-04-28 15:12:56Jim.Jewettsetmessages: + msg217378
2014-03-30 21:50:12baikiesetfiles: + mailbox-tests-2.7-part1-for-copy-back.diff, mailbox-tests-2.7-part2.diff

messages: + msg215207
2014-03-23 22:57:20baikiesetfiles: + mailbox-copy-back-2.7.diff, mailbox-all-tests-2.7.diff

messages: + msg214648
2014-03-19 01:20:48Jim.Jewettsetmessages: + msg214043
2014-03-18 19:13:10terry.reedysetversions: + Python 3.4, Python 3.5, - Python 3.2, Python 3.3
2014-03-18 19:08:48akuchlingsetpriority: high -> normal

messages: + msg214007
2014-03-18 15:24:27Jim.Jewettsetnosy: + Jim.Jewett
messages: + msg213979
2012-06-25 19:13:21petri.lehtinensetnosy: + petri.lehtinen
messages: + msg164004
2011-06-12 19:46:14terry.reedysetnosy: + r.david.murray, terry.reedy, - BreamoreBoy

messages: + msg138216
versions: + Python 3.3, - Python 3.1
2010-11-12 21:00:18akuchlingsetassignee: akuchling ->
2010-09-12 12:46:30BreamoreBoysetnosy: + BreamoreBoy

messages: + msg116185
versions: + Python 3.1, Python 3.2, - Python 2.6
2010-05-11 20:50:53terry.reedysetversions: + Python 2.7
2009-03-30 17:24:49ajaksu2settype: behavior
versions: + Python 2.6, - Python 2.5
keywords: + patch
nosy: + barry, ajaksu2

messages: + msg84585
stage: patch review
2007-09-17 10:50:06peter.llsetnosy: + peter.ll
2006-11-19 16:03:08baikiecreate