Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mailbox.Maildir re-reads directory too often #44299

Closed
doko42 opened this issue Dec 3, 2006 · 6 comments
Closed

mailbox.Maildir re-reads directory too often #44299

doko42 opened this issue Dec 3, 2006 · 6 comments
Assignees
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@doko42
Copy link
Member

doko42 commented Dec 3, 2006

BPO 1607951
Nosy @akuchling, @doko42
Files
  • mailbox-mtime.patch: Patch implementing mtime checking
  • mailbox-mtime.patch: Updated version of patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/akuchling'
    closed_at = <Date 2009-05-03.02:53:35.412>
    created_at = <Date 2006-12-03.17:28:11.000>
    labels = ['library', 'performance']
    title = 'mailbox.Maildir re-reads directory too often'
    updated_at = <Date 2009-05-03.02:53:35.398>
    user = 'https://github.com/doko42'

    bugs.python.org fields:

    activity = <Date 2009-05-03.02:53:35.398>
    actor = 'akuchling'
    assignee = 'akuchling'
    closed = True
    closed_date = <Date 2009-05-03.02:53:35.412>
    closer = 'akuchling'
    components = ['Library (Lib)']
    creation = <Date 2006-12-03.17:28:11.000>
    creator = 'doko'
    dependencies = []
    files = ['2246', '13839']
    hgrepos = []
    issue_num = 1607951
    keywords = ['patch']
    message_count = 6.0
    messages = ['30739', '30740', '30741', '30742', '86969', '86996']
    nosy_count = 3.0
    nosy_names = ['akuchling', 'doko', 'joshtriplett']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'test needed'
    status = 'closed'
    superseder = None
    type = 'resource usage'
    url = 'https://bugs.python.org/issue1607951'
    versions = ['Python 2.7']

    @doko42
    Copy link
    Member Author

    doko42 commented Dec 3, 2006

    [forwarded from http://bugs.debian.org/401395]

    Various functions in mailbox.Maildir call self._refresh, which always re-reads the cur and new directories with os.listdir. _refresh should stat each of the two directories first to see if they changed. This cuts processing time of a series of lookups down by a factor of the number of messages in the folder, a potentially large number.

    @doko42 doko42 added the stdlib Python modules in the Lib dir label Dec 3, 2006
    @doko42 doko42 added the stdlib Python modules in the Lib dir label Dec 3, 2006
    @akuchling
    Copy link
    Member

    By stat()'ing the directories, do you mean checking the mtime? I think this isn't safe because of the limited resolution of mtime on filesystems; ext3 seems to have a 1-second resolution for mtime, for example. This means that _refresh() might read a directory, and if some other process adds or deletes a message in the same second, _refresh() couldn't detect the change. Is there some other property of directories that could be used for a more reliable check?

    The attached patch implements checking of mtime, but I don't recommend applying it; it causes the test suite in test_mailbox.py to break all over the place, because the process modifies mailboxes so quickly that the mtime check doesn't notice the process's own changes.

    I'll wait a bit for any alternative suggestion, and then close this bug as "won't fix".

    File Added: mailbox-mtime.patch

    @akuchling
    Copy link
    Member

    Stray thought: would it help if the patch stored the (mtime - 1sec) instead of the mtime? Successive calls in the same second would then always re-read the directories, but once the clock ticks to the next second, re-reads would only occur if the directories have actually changed. The check would be 'if new_mtime > self._new_mtime' instead of '=='.

    Is this sort of mtime-based checking reliable on remote filesystems such as NFS?

    @joshtriplett
    Copy link
    Mannequin

    joshtriplett mannequin commented May 4, 2007

    akuchling wrote:

    By stat()'ing the directories, do you mean checking the mtime? I think
    this isn't safe because of the limited resolution of mtime on filesystems;
    ext3 seems to have a 1-second resolution for mtime, for example.

    Regardless of the resolution on any particular filesystem, stat and stat64
    return file timestamps as time_t, which has one-second resolution.

    This means that _refresh() might read a directory, and if some other process
    adds or deletes a message in the same second, _refresh() couldn't detect the
    change.

    True. mailbox.Maildir's behavior of always representing the current contents
    of the maildir without requiring the caller to explicitly refresh will not
    work with an implementation that checks for an mtime increase.

    The two solutions below (inotify and your suggested mtime-1 approach) would
    allow the automatic updates to work efficiently.

    Is there some other property of directories that could be used for
    a more reliable check?

    On Linux, you could use inotify to get a notice when anything changes in the
    directory. You then wouldn't even incur the overhead of a repeated directory
    stat, and wouldn't need to re-read the entire directory to find the new mail.

    The attached patch implements checking of mtime, but I don't recommend
    applying it; it causes the test suite in test_mailbox.py to break all over
    the place, because the process modifies mailboxes so quickly that the mtime
    check doesn't notice the process's own changes.

    Fair enough.

    I'll wait a bit for any alternative suggestion, and then close this bug as
    "won't fix".

    Please don't. The performance hit of repeatedly re-reading the directory
    makes mailbox.Maildir unusably slow on even moderately large maildirs (a few
    thousand messages). For the application we originally wanted mailbox.Maildir
    for, we had to rewrite the code to manually operate on the maildir instead,
    and we achieved an improvement of several orders of magnitude.

    akuchling:

    Stray thought: would it help if the patch stored the (mtime - 1sec) instead
    of the mtime? Successive calls in the same second would then always re-read
    the directories, but once the clock ticks to the next second, re-reads would
    only occur if the directories have actually changed. The check would be 'if
    new_mtime > self._new_mtime' instead of '=='.

    Good idea. That would work fine as well, and I believe it would have
    addressed our performance problem. This would work as a good solution on
    platforms that lack inotify, or if you don't want to use inotify.

    Compared to inotify, this would have somewhat more overhead (but still far
    less than the current approach), and would still perform poorly if you insert
    messages as you go (sometimes avoidable). However, compared to the current
    approach, this has a massive improvement, so please consider implementing it.

    Is this sort of mtime-based checking reliable on remote filesystems such as
    NFS?

    This particular sort of checking, yes, I think so. The times do not
    necessarily match the system clock (because they come from a remote system
    that does not necessarily have a synchronized clock), but the times should
    remain consistent with *each other*.

    Thank you for looking at this problem,
    Josh Triplett

    @devdanzin devdanzin mannequin added performance Performance or resource usage labels Mar 30, 2009
    @akuchling
    Copy link
    Member

    Updated version of the patch that only stores the current time -1sec,
    adds a test case, and passes all tests.

    @akuchling
    Copy link
    Member

    Committed to trunk in rev. 72213.

    Committed to py3k in rev. 72228.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    performance Performance or resource usage stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants