classification
Title: Mailbox module should support other mbox formats in addition to mboxo
Type: enhancement
Stage: needs patch
Components: email, Library (Lib)
Versions: Python 3.4
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: barry, endolith, petri.lehtinen, r.david.murray
Priority: normal
Keywords:

Created on 2012-01-02 21:46 by endolith, last changed 2012-06-25 18:59 by petri.lehtinen.

Messages (8)
msg150478 - (view) Author: (endolith) Date: 2012-01-02 21:46
The documentation states:

"Several variations of the mbox format exist to address perceived shortcomings in the original. In the interest of compatibility, mbox implements the original format, which is sometimes referred to as mboxo."

http://docs.python.org/dev/library/mailbox.html#mailbox.mbox

But this format is fundamentally broken, corrupting lines that start with "From ", and I can't find any justification for using it in Python.  In fact, all four links included in that section argue against this format.

If only one mbox format is used, it should be mboxrd.  Otherwise, include support for all the variants, with mboxrd as the default.
msg150479 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-01-02 21:55
Well, supporting the other variants would be good (I'll review any proposed patches), but I think the default will have to stay mboxo for backward compatibility reasons (unless the consensus is to go through the warning/deprecation cycle to change it).

As a new feature, this could only go into 3.3 or later.
msg159625 - (view) Author: (endolith) Date: 2012-04-29 16:26
Ok.  I'm not sure what backwards compatibility issues would exist, though.

The only difference is that mboxrd converts
"\nFrom "  → "\n>From "
"\n>From " → "\n>>From "

making the conversion reversible, while mboxo does

"\nFrom "  → "\n>From "
"\n>From " → "\n>From " (no change)

which is ambiguous, and both get converted back to "\nFrom " when converting back to text, corrupting the original message.

mboxrd is essentially a bugfix for mboxo rather than a fundamentally different format.
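
The two rules above can be sketched as follows. This is only an illustration: the helper names are hypothetical and are not part of the mailbox module, and the mboxo reader is assumed to strip one ">" from ">From " lines, as described above.

```python
import re

# mboxrd: every line of the form ">*From " gains one ">" on write
# and loses one ">" on read, so the mapping is reversible.
_RD_QUOTE = re.compile(r"^(>*From )", re.MULTILINE)
_RD_UNQUOTE = re.compile(r"^>(>*From )", re.MULTILINE)

# mboxo: only a bare "From " line is quoted on write; on read,
# ">From " is assumed to be turned back into "From ".
_O_QUOTE = re.compile(r"^(From )", re.MULTILINE)
_O_UNQUOTE = re.compile(r"^>(From )", re.MULTILINE)


def mboxrd_quote(body):
    return _RD_QUOTE.sub(r">\1", body)


def mboxrd_unquote(stored):
    return _RD_UNQUOTE.sub(r"\1", stored)


def mboxo_quote(body):
    return _O_QUOTE.sub(r">\1", body)


def mboxo_unquote(stored):
    return _O_UNQUOTE.sub(r"\1", stored)


body = "From the start\n>From a quote\n"

# mboxrd keeps the two lines distinguishable on disk.
rd_stored = mboxrd_quote(body)
rd_roundtrip = mboxrd_unquote(rd_stored)

# mboxo stores both lines as ">From ...", so the second line
# cannot be restored to what it was.
o_stored = mboxo_quote(body)
o_roundtrip = mboxo_unquote(o_stored)

print(repr(rd_stored), repr(rd_roundtrip))
print(repr(o_stored), repr(o_roundtrip))
```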
msg159629 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-04-29 16:55
If that's really the only difference we might indeed be able to treat it as a bug fix.  I'd have to look at a proposed patch to be sure.
msg163359 - (view) Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2012-06-21 18:51
I'm a little concerned about backwards compatibility. Someone might get upset if extra >'s start appearing in the messages when they read the mailbox contents with an application that uses the mboxo format.

A little analysis on the possible corruptions that happen with these formats:

- When the mailbox is both read and written using the mboxo format, lines starting with "From " are changed to ">From ".

- When the mailbox is both read and written using the mboxrd format, no corruption happens.

- If the mailbox is written using the mboxo format and read using the mboxrd format, lines that were meant to start with ">From " are changed to "From ". So we essentially get a slightly different corruption.

- If the mailbox is written using the mboxrd format and read using the mboxo format, lines that were meant to start with ">From " are changed to ">>From ". This is a new type of corruption.
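
The four cases above can be reproduced with a small simulation. It is a sketch under the assumption used in this analysis, namely that an mboxo reader returns the stored lines without unquoting them; the function names are hypothetical and do not exist in the mailbox module.

```python
import re


def write_mboxo(body):
    # Quote only bare "From " lines.
    return re.sub(r"^From ", ">From ", body, flags=re.MULTILINE)


def write_mboxrd(body):
    # Add one ">" to every ">*From " line.
    return re.sub(r"^(>*From )", r">\1", body, flags=re.MULTILINE)


def read_mboxo(stored):
    # Assumption: the mboxo reader performs no unquoting at all.
    return stored


def read_mboxrd(stored):
    # Remove one ">" from every ">+From " line.
    return re.sub(r"^>(>*From )", r"\1", stored, flags=re.MULTILINE)


original = "From here\n>From there\n"

writers = {"mboxo": write_mboxo, "mboxrd": write_mboxrd}
readers = {"mboxo": read_mboxo, "mboxrd": read_mboxrd}

results = {}
for wname, write in writers.items():
    for rname, read in readers.items():
        results[(wname, rname)] = read(write(original))
        status = "intact" if results[(wname, rname)] == original else "corrupted"
        print(f"write={wname:6} read={rname:6} {status}: {results[(wname, rname)]!r}")
```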
msg163904 - (view) Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2012-06-25 06:21
The default mode for reading mbox files should also be modified a bit to maximize the support for different implementations. See #11728.

I think we should still use the mboxo format by default when writing, and the "default" format of RFC 4155 when reading. We could then add a "format" parameter to the mbox constructor to alter the writing and/or reading behavior to match a specific mbox format.

According to RFC 4155, the best reference for different mbox formats is http://qmail.org./man/man5/mbox.html.
msg163975 - (view) Author: (endolith) Date: 2012-06-25 14:44
> - If the mailbox is written using the mboxrd format and read using the mboxo format, lines that were meant to start with ">From " are changed to ">>From ". This is a new type of corruption.

Well, yes.  So the choices are:

mboxrd as default: Sometimes results in corruption
mboxo  as default: Always results in corruption

Is there a way to reliably detect the format of the file and produce an error if it seems to be reading it wrong?

If not, maybe just include a function that guesses the format so the correct option can be found easily?  If there are consecutive ">" quoted lines, like this, for instance:

>This is the body.
>>From my point of view
>there are 3 lines.

then it was probably encoded with mboxrd?  If instead you find:

>This is the body.
>From my point of view
>there are 3 lines.

then it was probably encoded with mboxo?
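
A rough sketch of such a guessing function, to make the idea concrete. The name guess_quoting and the voting scheme are assumptions made for illustration. It only compares the quote depth of a ">From " line with its quoted neighbours, so it is a heuristic and can easily guess wrong.

```python
import re

_QUOTED_FROM = re.compile(r"^(>+)From ")


def quote_depth(line):
    """Number of leading ">" characters on a line."""
    return len(line) - len(line.lstrip(">"))


def guess_quoting(body):
    """Return "mboxrd", "mboxo", or None if there is no usable evidence."""
    votes = {"mboxrd": 0, "mboxo": 0}
    lines = body.splitlines()
    for i, line in enumerate(lines):
        match = _QUOTED_FROM.match(line)
        if not match:
            continue
        depth = len(match.group(1))
        # Look only at directly adjacent lines that are themselves quoted.
        neighbours = [lines[j] for j in (i - 1, i + 1) if 0 <= j < len(lines)]
        depths = [quote_depth(n) for n in neighbours if n.startswith(">")]
        if not depths:
            continue
        if all(depth == d + 1 for d in depths):
            votes["mboxrd"] += 1  # one extra ">" compared to the neighbours
        elif all(depth == d for d in depths):
            votes["mboxo"] += 1   # same depth as the neighbours
    if votes["mboxrd"] > votes["mboxo"]:
        return "mboxrd"
    if votes["mboxo"] > votes["mboxrd"]:
        return "mboxo"
    return None


first = ">This is the body.\n>>From my point of view\n>there are 3 lines.\n"
second = ">This is the body.\n>From my point of view\n>there are 3 lines.\n"
print(guess_quoting(first), guess_quoting(second))
```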
msg164002 - (view) Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2012-06-25 18:59
endolith wrote:
> > - If the mailbox is written using the mboxrd format and read using
> >   the mboxo format, lines that were meant to start with ">From "
> >   are changed to ">>From ". This is a new type of corruption.
> 
> Well, yes.  So the choices are:
> 
> mboxrd as default: Sometimes results in corruption
> mboxo  as default: Always results in corruption

I don't think so. Assuming that mboxo (the current default) was used
to write the mailbox, both formats sometimes result in corruption.

mboxo as default: "From " lines get written (and subsequently read) as
">From ".

mboxrd as default: ">From " lines were written as ">From " but are
read as "From ".

Furthermore, if Python's mailbox module is used to write the mbox file
and other software that only supports mboxo (e.g. mutt) is used to
read it, having mboxrd as the default would cause ">From " lines to
be written as ">>From ". These lines would then be read as ">>From "
by the reading software.

So, I'd like to keep the default as is, and add a parameter to change
to mboxrd when it's OK for the use case at hand. We should also
clearly document that mboxrd is recommended as it never corrupts data
if used for both reading and writing.

> Is there a way to reliably detect the format of the file and produce
> an error if it seems to be reading it wrong?
>
> If not, maybe just include a function that guesses the format so the
> correct option can be found easily? If there are consecutive ">"
> quoted lines, like this, for instance:
> 
> >This is the body.
> >>From my point of view
> >there are 3 lines.
> 
> then it was probably encoded with mboxrd?  If instead you find:
> 
> >This is the body.
> >From my point of view
> >there are 3 lines.
> 
> then it was probably encoded with mboxo?

It's not possible to automatically detect the format. Guessing like
you suggested is too fragile. It might work in some situations, but
wouldn't work in others.

If it was possible to detect the format by guessing, I'm sure RFC 4155
would mention that, as it aims for the best possible outcome for
reading any of the formats, without knowing which format is actually
in use.
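
One way to see the ambiguity: two different original lines, written with the two different formats, end up byte-for-byte identical in the file. The sketch below uses hypothetical helper functions, not the mailbox module's API.

```python
import re


def write_mboxo(body):
    # mboxo: quote only bare "From " lines.
    return re.sub(r"^From ", ">From ", body, flags=re.MULTILINE)


def write_mboxrd(body):
    # mboxrd: add one ">" to every ">*From " line.
    return re.sub(r"^(>*From )", r">\1", body, flags=re.MULTILINE)


original_a = ">From my point of view\n"  # author really typed ">From "
original_b = "From my point of view\n"   # author typed a bare "From "

stored_a = write_mboxo(original_a)   # mboxo leaves ">From " untouched
stored_b = write_mboxrd(original_b)  # mboxrd quotes "From " once

# Both files contain the same line, so the file contents alone cannot
# tell a reader which original (and which format) produced them.
print(repr(stored_a), repr(stored_b), stored_a == stored_b)
```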
History
Date | User | Action | Args
2012-06-25 18:59:03 | petri.lehtinen | set | messages: + msg164002
2012-06-25 14:44:54 | endolith | set | messages: + msg163975
2012-06-25 06:21:59 | petri.lehtinen | set | nosy: + barry; messages: + msg163904; components: + email
2012-06-21 18:51:48 | petri.lehtinen | set | messages: + msg163359
2012-06-21 10:52:19 | petri.lehtinen | set | nosy: + petri.lehtinen; versions: + Python 3.4, - Python 3.3
2012-04-29 16:55:11 | r.david.murray | set | messages: + msg159629
2012-04-29 16:26:40 | endolith | set | messages: + msg159625
2012-01-02 21:55:03 | r.david.murray | set | versions: - Python 2.7; type: behavior -> enhancement; nosy: + r.david.murray; title: Mailbox module should not use mboxo format -> Mailbox module should support other mbox formats in addition to mboxo; messages: + msg150479; stage: needs patch
2012-01-02 21:47:36 | endolith | set | title: Should not use mboxo format -> Mailbox module should not use mboxo format
2012-01-02 21:46:27 | endolith | create |