classification
Title: mailbox.Maildir re-reads directory too often
Type: resource usage Stage: test needed
Components: Library (Lib) Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: akuchling Nosy List: akuchling, doko, joshtriplett
Priority: normal Keywords: patch

Created on 2006-12-03 17:28 by doko, last changed 2009-05-03 02:53 by akuchling. This issue is now closed.

Files
File name Uploaded Description Edit
mailbox-mtime.patch akuchling, 2006-12-13 13:38 Patch implementing mtime checking
mailbox-mtime.patch akuchling, 2009-05-02 19:05 Updated version of patch
Messages (6)
msg30739 - (view) Author: Matthias Klose (doko) * (Python committer) Date: 2006-12-03 17:28
[forwarded from http://bugs.debian.org/401395]

Various functions in mailbox.Maildir call self._refresh, which always re-reads the cur and new directories with os.listdir.  _refresh should stat each of the two directories first to see if they changed.  This cuts processing time of a series of lookups down by a factor of the number of messages in the folder, a potentially large number.
msg30740 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2006-12-13 13:38
By stat()'ing the directories, do you mean checking the mtime?  I think this isn't safe because of the limited resolution of mtime on filesystems; ext3 seems to have a 1-second resolution for mtime, for example.  This means that _refresh() might read a directory, and if some other process adds or deletes a message in the same second, _refresh() couldn't detect the change.  Is there some other property of directories that could be used for a more reliable check?

The attached patch implements checking of mtime, but I don't recommend applying it; it causes the test suite in test_mailbox.py to break all over the place, because the process modifies mailboxes so quickly that the mtime check doesn't notice the process's own changes.

I'll wait a bit for any alternative suggestion, and then close this bug as "won't fix".


File Added: mailbox-mtime.patch
msg30741 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2006-12-14 19:09
Stray thought: would it help if the patch stored the (mtime - 1sec) instead of the mtime?  Successive calls in the same second would then always re-read the directories, but once the clock ticks to the next second, re-reads would only occur if the directories have actually changed.  The check would be 'if new_mtime > self._new_mtime' instead of '=='.

Is this sort of mtime-based checking reliable on remote filesystems such as NFS?
msg30742 - (view) Author: Josh Triplett (joshtriplett) Date: 2007-05-04 18:30
akuchling wrote:
> By stat()'ing the directories, do you mean checking the mtime?  I think
> this isn't safe because of the limited resolution of mtime on filesystems;
> ext3 seems to have a 1-second resolution for mtime, for example.

Regardless of the resolution on any particular filesystem, stat and stat64
return file timestamps as time_t, which has one-second resolution.

> This means that _refresh() might read a directory, and if some other process
> adds or deletes a message in the same second, _refresh() couldn't detect the
> change.

True.  mailbox.Maildir's behavior of always representing the current contents
of the maildir without requiring the caller to explicitly refresh will not
work with an implementation that checks for an mtime increase.

The two solutions below (inotify and your suggested mtime-1 approach) would
allow the automatic updates to work efficiently.

> Is there some other property of directories that could be used for
> a more reliable check?

On Linux, you could use inotify to get a notice when anything changes in the
directory.  You then wouldn't even incur the overhead of a repeated directory
stat, and wouldn't need to re-read the entire directory to find the new mail.

> The attached patch implements checking of mtime, but I don't recommend
> applying it; it causes the test suite in test_mailbox.py to break all over
> the place, because the process modifies mailboxes so quickly that the mtime
> check doesn't notice the process's own changes.

Fair enough.

> I'll wait a bit for any alternative suggestion, and then close this bug as
> "won't fix".

Please don't.  The performance hit of repeatedly re-reading the directory
makes mailbox.Maildir unusably slow on even moderately large maildirs (a few
thousand messages).  For the application we originally wanted mailbox.Maildir
for, we had to rewrite the code to manually operate on the maildir instead,
and we achieved an improvement of several orders of magnitude.

akuchling:
> Stray thought: would it help if the patch stored the (mtime - 1sec) instead
> of the mtime?  Successive calls in the same second would then always re-read
> the directories, but once the clock ticks to the next second, re-reads would
> only occur if the directories have actually changed.  The check would be 'if
> new_mtime > self._new_mtime' instead of '=='.

Good idea.  That would work fine as well, and I believe it would have
addressed our performance problem.  This would work as a good solution on
platforms that lack inotify, or if you don't want to use inotify.

Compared to inotify, this would have somewhat more overhead (but still far
less than the current approach), and would still perform poorly if you insert
messages as you go (sometimes avoidable).  However, compared to the current
approach, this has a massive improvement, so please consider implementing it.

> Is this sort of mtime-based checking reliable on remote filesystems such as
> NFS?

This particular sort of checking, yes, I think so.  The times do not
necessarily match the system clock (because they come from a remote system
that does not necessarily have a synchronized clock), but the times should
remain consistent with *each other*.

Thank you for looking at this problem,
Josh Triplett
msg86969 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2009-05-02 19:05
Updated version of the patch that only stores the current time -1sec,
adds a test case, and passes all tests.
msg86996 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2009-05-03 02:53
Committed to trunk in rev. 72213.

Committed to py3k in rev. 72228.
History
Date User Action Args
2009-05-03 02:53:35akuchlingsetstatus: open -> closed
resolution: fixed
messages: + msg86996

versions: + Python 2.7, - Python 2.6
2009-05-02 19:05:53akuchlingsetfiles: + mailbox-mtime.patch
keywords: + patch
messages: + msg86969
2009-03-30 17:25:06ajaksu2setstage: test needed
type: resource usage
versions: + Python 2.6, - Python 2.5
2006-12-03 17:28:11dokocreate