Message 164002 - Python tracker

Author	petri.lehtinen
Recipients	barry, endolith, petri.lehtinen, r.david.murray
Date	2012-06-25.18:59:03
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<20120625185900.GA3555@chang>
In-reply-to	<1340635495.1.0.488099254513.issue13698@psf.upfronthosting.co.za>

Content
endolith wrote:
> > - If the mailbox is written using the mboxrd format and read using
> > - the mboxo format, lines that were meant to start with ">From "
> > - are changed to ">>>From ". This is a new type of corruption.
> 
> Well, yes.  So the choices are:
> 
> mboxrd as default: Sometimes results in corruption
> mboxo  as default: Always results in corruption

I don't think so. Assuming that mboxo (the current default) was used
to write the mailbox, both formats sometimes result in corruption.

mboxo as default: "From " lines get written (and subsequently read) as
">From ".

mboxrd as default: ">From " lines were written as ">From " but are
read as "From ".

Furthermore, if Python's mailbox module is used to write the mbox file
and other software that only supports mboxo (e.g. mutt) is used to
read it, having mboxrd as the default would cause ">From " lines to
be written as ">>From ". These lines would then be read as ">>From "
by the reading software.
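
To make the cases above concrete, here is a minimal sketch of the two
quoting rules applied to single body lines. The helper names are
hypothetical and not part of the mailbox module, and the sketch assumes
that an mboxo reader returns lines unchanged, as described above.

```python
import re

# Hypothetical helpers for illustration only; not the mailbox module's API.

def mboxo_quote(line):
    # mboxo: only lines starting with "From " get a ">" prefix.
    return ">" + line if line.startswith("From ") else line

def mboxo_read(line):
    # Assumption: an mboxo reader does not unquote at all.
    return line

def mboxrd_quote(line):
    # mboxrd: any line matching ">*From " gets one more ">".
    return ">" + line if re.match(r">*From ", line) else line

def mboxrd_unquote(line):
    # mboxrd: any line matching ">+From " loses one ">".
    return line[1:] if re.match(r">+From ", line) else line

# Written as mboxo, read as mboxo: "From " comes back as ">From ".
wrote_o_read_o = mboxo_read(mboxo_quote("From me"))

# Written as mboxo, read as mboxrd: ">From " comes back as "From ".
wrote_o_read_rd = mboxrd_unquote(mboxo_quote(">From me"))

# Written as mboxrd, read as mboxo: ">From " comes back as ">>From ".
wrote_rd_read_o = mboxo_read(mboxrd_quote(">From me"))

# Written and read as mboxrd: the line survives the round trip.
wrote_rd_read_rd = mboxrd_unquote(mboxrd_quote(">From me"))
```

Only the last combination returns the original line in every case.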

So, I'd like to keep the default as is, and add a parameter to change
to mboxrd when it's OK for the use case at hand. We should also
clearly document that mboxrd is recommended as it never corrupts data
if used for both reading and writing.

> Is there a way to reliably detect the format of the file and produce
> an error if it seems to be reading it wrong?
>
> If not, maybe just include a function that guesses the format so the
> correct option can be found easily? If there are consecutive ">"
> quoted lines, like this, for instance:
> 
> >This is the body.
> >>From my point of view
> >there are 3 lines.
> 
> then it was probably encoded with mboxrd?  If instead you find:
> 
> >This is the body.
> >From my point of view
> >there are 3 lines.
> 
> then it was probably encoded with mboxo?

It's not possible to automatically detect the format. Guessing like
you suggested is too fragile. It might work in some situations, but
wouldn't work in others.

If it were possible to detect the format by guessing, I'm sure RFC 4155
would mention that, as it aims for the best possible outcome for
reading any of the formats, without knowing which format is actually
in use.
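
As a rough illustration of why guessing is fragile, the sketch below
lists which original lines could have produced a given stored line
under each quoting rule. All function names are hypothetical, and the
quoting rules are simplified to single lines.

```python
import re

# Hypothetical helpers for illustration only; not the mailbox module's API.

def mboxo_quote(line):
    # mboxo: only lines starting with "From " get a ">" prefix.
    return ">" + line if line.startswith("From ") else line

def mboxrd_quote(line):
    # mboxrd: any line matching ">*From " gets one more ">".
    return ">" + line if re.match(r">*From ", line) else line

def possible_origins(stored):
    """Return (format, original) pairs that would yield `stored` on disk."""
    # The original is either the stored line itself or the stored line
    # with one leading ">" removed.
    candidates = [stored]
    if stored.startswith(">"):
        candidates.append(stored[1:])
    origins = []
    for original in candidates:
        if mboxo_quote(original) == stored:
            origins.append(("mboxo", original))
        if mboxrd_quote(original) == stored:
            origins.append(("mboxrd", original))
    return origins

# Both stored lines from the quoted examples have more than one
# plausible origin, so neither identifies the format by itself.
double_quoted = possible_origins(">>From my point of view")
single_quoted = possible_origins(">From my point of view")
```

Here `double_quoted` contains both an mboxo origin (a line that really
began with ">>From ") and an mboxrd origin (a line that began with
">From "), so seeing ">>From " in a file does not settle the question.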

History
Date	User	Action	Args
2012-06-25 18:59:04	petri.lehtinen	set	recipients: + petri.lehtinen, barry, endolith, r.david.murray
2012-06-25 18:59:03	petri.lehtinen	link	issue13698 messages
2012-06-25 18:59:03	petri.lehtinen	create