classification
Title: mbox parser incorrect behaviour
Type: behavior
Stage:
Components: email, Library (Lib)
Versions: Python 3.10, Python 3.9, Python 3.8
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: barry, petri.lehtinen, r.david.murray, sdaoden, wally1980
Priority: normal
Keywords:

Created on 2011-03-31 12:04 by wally1980, last changed 2020-11-10 18:18 by iritkatriel.

Messages (8)
msg132657 - Author: valera (wally1980) Date: 2011-03-31 12:04
The mailbox.mbox parser splits mbox files on the "^From " pattern, which is wrong; in fact it should split mbox files on "\nFrom ".
Illustration:
------
From bla-blah@localhost
Header1
Header2
body1
body2

From blah-blah2@localhost
Header1
body1
From your dear friend
body3

------
This mbox would be split into 3 messages instead of 2.
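The difference between the two rules can be sketched without the mailbox module. This is only an illustration of the report, not the library's code: count_messages and its require_blank_line flag are hypothetical names, and the sample text is the illustration above.

```python
SAMPLE = (
    "From bla-blah@localhost\n"
    "Header1\n"
    "Header2\n"
    "body1\n"
    "body2\n"
    "\n"
    "From blah-blah2@localhost\n"
    "Header1\n"
    "body1\n"
    "From your dear friend\n"
    "body3\n"
    "\n"
)


def count_messages(text, require_blank_line):
    """Count message starts under one of the two splitting rules."""
    count = 0
    previous_blank = True  # the start of the file counts as a boundary
    for line in text.splitlines():
        if line.startswith("From ") and (previous_blank or not require_blank_line):
            count += 1
        previous_blank = (line == "")
    return count


# Rule 1: every line starting with "From " begins a message.
print(count_messages(SAMPLE, require_blank_line=False))  # 3
# Rule 2: only a "From " line preceded by an empty line begins a message.
print(count_messages(SAMPLE, require_blank_line=True))   # 2
```

Under the first rule the body line "From your dear friend" is counted as a separator; under the second it is not, because the line before it is not empty.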
msg132671 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-03-31 14:13
All the references I could find talk about triggering the match without the preceding newline.  That is, it is not certain that a blank line will precede the 'From ' header, and the typical quoting rules for mbox format call for any 'From ' at the start of a line (whether preceded by a blank line or not) to be quoted.  This might have something to do with the fact that otherwise you have to special case the first line of the mbox, but I don't really know.

What tool are you using that is producing the unquoted 'From ' lines in your mbox?  I know there are variants on the mbox format, so if one of them has the format you propose, this would become a feature request to support that variant mbox format.
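For comparison, the quoting rule can be observed when mailbox.mbox itself writes a message: a body line starting with "From " is stored as ">From ". The following is a minimal sketch; store_and_reload, the file name and the sample message are made up for the demonstration.

```python
import mailbox
import os
import tempfile

RAW_MESSAGE = (
    "From: blah-blah2@localhost\n"
    "Subject: demo\n"
    "\n"
    "body1\n"
    "From your dear friend\n"
    "body3\n"
)


def store_and_reload(raw_message):
    """Add one message to a fresh mbox; return the file bytes and message count."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.mbox")
        box = mailbox.mbox(path)
        box.add(raw_message)
        box.flush()
        box.close()
        with open(path, "rb") as handle:
            data = handle.read()
        reopened = mailbox.mbox(path)
        count = len(reopened)
        reopened.close()
    return data, count


data, count = store_and_reload(RAW_MESSAGE)
print(b">From your dear friend" in data)  # the body line was quoted on write
print(count)                              # number of messages found on re-read
```

A file written this way cannot trigger the reported mis-split, which is why the question of which tool produced the unquoted line matters.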
msg132687 - Author: valera (wally1980) Date: 2011-03-31 16:48
Hello, David!

This is an email from the Netcraft mailing list. The host which accepted
it is running sendmail with some antivirus software on top:
MIMEDefang + SpamAssassin, from what I know.
It could be that something is broken in that chain. I spotted the error
when I was writing the script for mailbox --> maildir conversion
while migrating this server.
So I had to subclass mailbox.mbox and fix it as I needed. I'll investigate
further what led to this behaviour.
Nevertheless, here is a snippet from RFC 4155:
In order to improve interoperability among messaging systems, this
 memo defines a "default" mbox database format, which MUST be
 supported by all implementations that claim to be compliant with this
 specification.

 The "default" mbox database format uses a linear sequence of Internet
 messages, with each message being immediately prefaced by a separator
 line, and being terminated by an empty line.

---
So I think assuming that there should be an empty line before the
"From " separator line is fine (for the second email onwards) and
would help to deal with all kinds of mbox mailboxes. The fix is rather
trivial.

Best regards,
Valery Masiutsin
msg138245 - Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-06-13 13:56
Hello Valery Masiutsin, i recently stumbled over this while searching
for the link to the standard i've stored in another issue.
(Without being logged in, say.)
The de-facto standard (http://qmail.org/man/man5/mbox.html) says:

HOW A MESSAGE IS READ
          A reader scans through an mbox file looking for From_ lines.
          Any From_ line marks the beginning of a message.  The reader
          should not attempt to take advantage of the fact that every
          From_ line (past the beginning of the file) is preceded by a
          blank line.

This is however the recent version.  The "mbox" manpage of my up-to-date
Mac OS X 10.6.7 does not state this, for example.  It's from 2002.
However, all known MBOX standards, i.e. MBOXO, MBOXRD, MBOXCL, require
proper quoting of non-From_ "From " lines (by preceding them with '>').
So your example should not fail in Python.
(But hey - are you sure *that* has been produced by Perl?)

You're right however that Python seems to only support the old MBOXO
way of un-escaping only plain "From " to/from ">From ", which is not
even mentioned anymore in the current standard - that only describes
MBOXRD ("(>*From )" -> ">"+match.group(1)).
(Lucky me: i own Mac OS X, otherwise i wouldn't even know.)
Thus you're in trouble if the unescaping is performed before the split.
This is another issue, though: "MBOX parser uses MBOXO algorithm".
This is another issue, though: "MBOX parser uses MBOXO algorithm".

;> - Ciao, Steffen
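
The MBOXO and MBOXRD quoting rules mentioned in this message can be sketched with regular expressions. This is an illustration only, not what the mailbox module does internally; the four function names are hypothetical, and the functions are meant to be applied to a message body, not to a whole mbox file.

```python
import re


def quote_mboxo(body):
    """MBOXO: prefix only plain 'From ' lines with '>'."""
    return re.sub(r"^From ", ">From ", body, flags=re.MULTILINE)


def unquote_mboxo(body):
    """MBOXO: turn '>From ' back into 'From '."""
    return re.sub(r"^>From ", "From ", body, flags=re.MULTILINE)


def quote_mboxrd(body):
    """MBOXRD: prefix 'From ' lines carrying any number of '>' with one more."""
    return re.sub(r"^(>*From )", r">\1", body, flags=re.MULTILINE)


def unquote_mboxrd(body):
    """MBOXRD: strip exactly one '>' from '>From ', '>>From ', and so on."""
    return re.sub(r"^>(>*From )", r"\1", body, flags=re.MULTILINE)


body = "From a friend\n>From a quoted reply\nplain text\n"

# MBOXRD round-trips the body unchanged.
print(unquote_mboxrd(quote_mboxrd(body)) == body)
# MBOXO does not: the line that already started with '>From ' loses its '>'.
print(unquote_mboxo(quote_mboxo(body)) == body)
```

The round trip is the practical difference: MBOXRD quoting is reversible, while MBOXO cannot tell an original ">From " line from one it quoted itself.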
msg163812 - Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2012-06-24 17:41
It seems to me that "^From " is the correct way to match the start of each message. This is also what the qmail manual page (linked in the previous message) says. So closing as invalid.
msg163872 - Author: valera (wally1980) Date: 2012-06-24 23:03
Hello Petri

The qmail manpage does not sound like a valid reference to me. I've pointed
to the relevant RFC (which dictates the correct behaviour) as a reference; the
Python mbox parser does not conform to it.

Best regards,
Valery Masiutsin

msg163902 - Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2012-06-25 06:15
Actually, you're right. Sorry for overlooking the RFC. But that said, the RFC itself refers to the same manpage as a reference that's "mostly authoritative for those variations that are otherwise only documented in anecdotal form". So I guess it's quite a good reference after all :)

In Appendix A, RFC 4155 defines a set of rules for a "default" mbox format that maximizes interoperability between different mbox implementations.

The important things in the RFC concerning this issue are:

* There MUST be an empty line after each message.

* The RFC does not specify any escape syntax for message body lines starting with "From ". It says: "Recipient systems are expected to parse full separator lines as they are documented above."

Because the RFC states that there must be an empty line after each message, and it aims for maximum interoperability, I think we can assume that there always is an empty line there. But looking for "\n\nFrom " is not enough for finding the starting points of messages. We should actually parse the whole separator line which consists of "From ", an email address (addr-spec in RFC 2822), a timestamp (in UNIX ctime format without timezone), and a newline character.

I think this should be the default mode for reading mbox files. See #13698 for adding support for other formats.
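
A rough sketch of what parsing the whole separator line could look like. It is not a proposed patch: parse_separator and SEPARATOR_RE are hypothetical names, the address pattern is a deliberately simplified stand-in for the RFC 2822 addr-spec grammar, and the timestamp check assumes English day and month names (the default C locale).

```python
import re
import time

# "From ", a simplified address, one space, then a 24-character ctime stamp.
SEPARATOR_RE = re.compile(r"^From (?P<addr>[^\s@]+@[^\s@]+) (?P<stamp>.{24})$")


def parse_separator(line):
    """Return (address, struct_time) for a full separator line, else None."""
    match = SEPARATOR_RE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    try:
        # UNIX ctime format without timezone, e.g. "Thu Mar 31 12:04:25 2011"
        stamp = time.strptime(match.group("stamp"), "%a %b %d %H:%M:%S %Y")
    except ValueError:
        return None
    return match.group("addr"), stamp


print(parse_separator("From bla-blah@localhost Thu Mar 31 12:04:25 2011\n"))
print(parse_separator("From your dear friend\n"))    # None: no address
print(parse_separator("From bla-blah@localhost\n"))  # None: no timestamp
```

Note that a strict check like this also rejects the shortened separator lines used in the original illustration, since they carry no timestamp.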
msg164636 - Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2012-07-04 04:24
Some thoughts on doing "clever tricks" to enhance mbox parsing:

    http://www.jwz.org/doc/content-length.html
History
Date                 User            Action  Args
2020-11-10 18:18:35  iritkatriel     set     versions: + Python 3.8, Python 3.9, Python 3.10, - Python 2.7, Python 3.2, Python 3.3, Python 3.4
2012-07-04 04:24:49  petri.lehtinen  set     messages: + msg164636
2012-06-25 06:15:54  petri.lehtinen  set     status: closed -> open; components: + email; nosy: + barry; messages: + msg163902; resolution: not a bug ->; stage: resolved ->
2012-06-24 23:03:53  wally1980       set     messages: + msg163872
2012-06-24 17:41:08  petri.lehtinen  set     status: open -> closed; nosy: + petri.lehtinen; messages: + msg163812; resolution: not a bug; stage: test needed -> resolved
2011-06-13 13:56:24  sdaoden         set     nosy: + sdaoden; messages: + msg138245
2011-06-01 06:30:44  terry.reedy     set     stage: test needed; type: behavior; versions: - Python 2.6, Python 2.5, Python 3.1
2011-03-31 16:48:39  wally1980       set     messages: + msg132687
2011-03-31 14:13:49  r.david.murray  set     nosy: + r.david.murray; messages: + msg132671
2011-03-31 12:04:25  wally1980       create