classification
Title: Mailbox module should use binary I/O, not text I/O
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: r.david.murray Nosy List: georg.brandl, giampaolo.rodola, haypo, holdenweb, lregebro, pitrou, r.david.murray, rhettinger, sdaoden
Priority: release blocker Keywords: patch

Created on 2010-06-30 12:28 by holdenweb, last changed 2011-01-30 06:23 by r.david.murray. This issue is now closed.

Files
File name Uploaded Description Edit
test_thunderbird_mailbox.py holdenweb, 2010-06-30 12:28
test.mailbox holdenweb, 2010-06-30 12:29 Test data for Thunderbird mailbox reader
test2.5.out holdenweb, 2010-06-30 12:32 2.5 profiler output showing non-setup time is concentrated on mailbox itervalues
test3.1.out holdenweb, 2010-06-30 12:34 3.1 profiler output showing that 1.3M+ calls to cp1252.decode use almost half the CPU
mailbox.patch haypo, 2011-01-26 00:58
BytesGenerator_handle_text.patch haypo, 2011-01-26 10:00
mailbox2.patch pitrou, 2011-01-26 14:21
mailbox3.2.patch r.david.murray, 2011-01-28 21:23
mailbox4.patch r.david.murray, 2011-01-29 17:25
mailbox5.patch r.david.murray, 2011-01-29 19:24
mailbox6.patch r.david.murray, 2011-01-29 21:56
Messages (46)
msg108978 - (view) Author: Steve Holden (holdenweb) * (Python committer) Date: 2010-06-30 12:28
The attached program completes in less than half a second under Python 2.5. Under Python 3 it takes almost three minutes on the same system. The issue appears to be heavy use of decoding, at least in a Windows system, during creation of the mailbox toc. The disparity may be less remarkable when not profiling.

Further attachments will include a test data file (a Thunderbird mailbox taken from the same host system) and profiler outputs from the 2.5 and 3.1 runs of this program.
msg108979 - (view) Author: Steve Holden (holdenweb) * (Python committer) Date: 2010-06-30 12:40
Thread at http://aspn.activestate.com/ASPN/Mail/Message/python-dev/3873005 refers to this issue. Posted files are already attached herewith.
msg126078 - (view) Author: Lennart Regebro (lregebro) Date: 2011-01-12 09:35
I can confirm on Ubuntu and with other example mailboxes. Looping through the messages and printing the subjects takes around 200-300 times longer under Python 3 than under Python 2.
msg126079 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2011-01-12 09:40
Can you confirm using the Py3.2 head?
Am curious if Antoine's optimizations helped here.
msg126082 - (view) Author: Lennart Regebro (lregebro) Date: 2011-01-12 10:21
3.2 sees a small improvement when running the Steve test:

Python 2.6.6: 0.0291s

Python 3.1.2: 31.1s

Python 3.2b2+: 28.8s

This is Ubuntu 10.04 on ext3, with all Pythons compiled from source, with no configure attributes except a prefix.

I wonder if the differences between different unix systems can have to do with what the default system encoding is? Mine is UTF-8.
msg126085 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-12 11:56
The aforementioned python-dev thread (available at http://mail.python.org/pipermail/python-dev/2010-June/101186.html ) explains things quite well.
The mailbox module needs to be modified to use binary I/O, both for functionality and for speed. Right now, I don't know how the mailbox module can be useful in py3k (you'd quickly run into unicode errors as soon as you try to read an email with another charset, I think).
msg126139 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2011-01-12 21:14
RDM, you were suggested for this by Thomas Wouters (who wrote much of the existing code).  Are you up for it?
msg126144 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2011-01-12 21:52
With the module being so slow as to be unusable, this can be considered a bugfix, so it is okay if the fix goes into 3.2.1.
msg126153 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-13 00:13
I've been intending to take a look at this issue at some point, but am not sure when I'd get to it.

I took a quick look. It does seems to me that it is true that for data-validity purposes the message files need to be opened in binary and fed to the email package in binary.  But this is so that the message will get decoded using the correct character sets, not to avoid the decoding.  In Python3 it makes no sense to manipulate the subjects as binary strings, so the example of "looping through the messages and printing the subjects" is still going to require decoding.

There may still be ways to make it more efficient for common use cases, but that will require more detailed analysis.
msg126994 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-01-25 10:37
Re-New to Python - Re-Started with Py3K in 2011.
'Found myself in a dead-end after 10 days of work because a KOI8-R spam mail causes the file I/O decoding process to fail -
and there is NO WAY TO HANDLE THIS with mailbox.py!
(Went to python.org, searched for mailbox.py, was redirected to Google and found NOTHING related to this problem.
FINALLY found the IssueTracker but was too stupid to re-search.
Well.  Put an issue 10995 which was wrong - unfortunate.)

But now I will spend this entire day to back-port my script to Python 2.7 (and i did not work with Python for some six years)!

I mean - the plan to rewrite the entire mailbox.py exists for about six months now, but mailbox.py is included in the basic library, documented in the library book - but not a single word, not a single comment states that it is in fact *UNUSABLE* in Py3K!

Wouldn't it be sufficient *in the meanwhile* to apply the 10-minutes-work patch mentioned in my issue 10995?
I know it's almost as wrong, but it would gracefully integrate in my fetchmail(1)/mutt(1) local en_GB.UTF-8 stuff 8-}.
Python 3.2 is about to be released in two weeks - shall this unusable module be included the very same way once again?
Thanks for reading this book.
msg127002 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-25 12:39
I'm afraid so.  The python3 uptake process was expected to take five years overall, and we are only up to about the second year at this point.  So while you may have been away from Python for 6 years, you came back right in the middle of an unprecedented transition period.

I agree that it is unfortunate that a shipping library is not functioning correctly with respect to the Python3 bytes/string separation, but no one had tried to use mailbox in python3 enough to have encountered this problem.  You will note that that this bug report was a *performance* bug report initially, and as such had lower priority.  The encoding issue was recognized much more recently.  And before that could be fixed correctly, the email package had to be fixed to handle bytes input.  That happened only just before the end of the beta phase for 3.2.  Now it is too late to make further API changes for 3.3, and in any case it seems counter-productive to make an API change that we don't really want in the library long term.

You could work up a patch to fix this, use it locally, and contribute it so that it makes it in to 3.3.  Perhaps if you and/or someone else can come up with a patch before RC2 it could even go in to 3.2.  I haven't looked at it (yet), but I'm hoping that the patch isn't actually that hard to fix the encoding issues (as opposed to the performance issue, which may take more work).

Or perhaps you could monkey-patch in your encoding fix until 3.3 comes out.  Of course, right now using 2.7 with an eye to staying compatible with python3 is also a perfectly sensible option.
msg127005 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-25 12:52
That should have been "too late to make API changes for 3.2".
msg127036 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2011-01-25 18:33
ISTM an "API change" is okay if it fixes a critical usability bug.

Also, if this is going to ship as-is, the docs should get a big warning right at the top.  Perhaps the source code should also emit a notice that the module is hosed so that people like Steffen don't waste tons of time on hopeless endeavors.
msg127078 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-01-26 00:58
mailbox.patch:
 - open files in binary mode not as text
 - parse as bytes not as Unicode
 - replace email.generator.Generator() by email.generator.BytesGenerator()
 - use .message_from_bytes() instead of .message_from_str()
 - use .message_from_binary_file() instead of .message_from_file()
 - use BytesIO() instead of StringIO()
 - add more methods to _ProxyFile: readable, writable, seekable, flush, closed
 - don't use universal newline (not supported by binary files): I don't remember if the email binary parser supports directly universal newline

I don't know anything about the mailbox module. I just replaced str functions by bytes functions.

Keep Unicode for some things: MH.get_sequence() reads the file using UTF-8 encoding, labels and sequences.

The patch have to be tested on Windows (Windows uses \r\n newline). I only tested on Linux.
msg127079 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-01-26 01:04
While working on this issue, I found and fixed two bugs in the email binary parser: r88196 and r88197.
msg127087 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2011-01-26 02:59
Nice.  Thanks Victor.
msg127089 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-26 04:11
Thanks, Victor, you beat me to it :)  I'll see if I can review this tomorrow, or if not I can probably do it Thursday.

I reverted r88197 because it was incorrect and caused an email test to fail.  Once I come up with a test for it I'll fix it correctly.  I should write a test for the other one, too, even though it is trivial.

Just a note: being as we're in RC, you should get a review even for seemingly trivial fixes.
msg127097 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-01-26 10:00
> I reverted r88197 because it was incorrect and caused an email test
> to fail. Once I come up with a test for it I'll fix it correctly.

test_mailbox is a good (indirect) test suite for this change. The problem of r88197 is that it replaces msg._payload by msg.get_payload() which is wrong. New attached patch mailbox.patch	keeps msg._payload unchanged, but don't call _has_surrogates() (do nothing) if msg._payload is None. This patch has no test, I'm unable to write a test for this case (directly with the email API).
msg127101 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-01-26 11:55
This message will not help anyone.
And it's not a chat, but because it seems i really made the horses gone
grazy (direct translation of a german proverb):
- msg127002: RDM, i didn't know all of that and i am really sorry.
  Now I am looking forward to do something good for montys,
  but the truth is: today i ain't able to (yet), because i am so god
  damned re-new to python(1) (16 days today including weekends)!
  My local fix *was indeed already* just using encoding=latin1, which,
  by the way, decreased CPU usage of parsing a 150KB archive mbox from
  4 minutes (!) to about 50 seconds (dumping is a tenth in both cases).
  And it did not fail due to conversion errors, too.
msg127078: haypo, thank you!  I couldn't have done all of this, being
  so far away from the great picture of python encoding internals.
msg127036: RH, "people like Steffen" do the camel-walk!
  *This* is slow!!
  But once i've realized that mailbox.py is a dead-end on Py3K
  (for shipping out of my really useless, brutally simple current
  S-Postman script thing), i would have been much happier if a simple
  sentence like "python3 uptake process in transition" would have
  warned me (in whatever RST output).
Thanks for reading this book in a bug tracker.
msg127104 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-01-26 12:19
All test_email and test_mailbox pass with mailbox.patch+BytesGenerator_handle_text.patch on Windows except one test:

======================================================================
ERROR: test_set_item (test.test_mailbox.TestBabyl)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\victor\py3k\lib\test\test_mailbox.py", line 286, in test_set_item
    self._check_sample(self._box[key1])
  File "C:\victor\py3k\lib\mailbox.py", line 76, in __getitem__
    return self.get_message(key)
  File "C:\victor\py3k\lib\mailbox.py", line 1190, in get_message
    body = self._file.read(stop - self._file.tell())
ValueError: read length must be positive or -1
msg127119 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-26 14:21
There's a missing conversion in mailbox.patch. Running with -bb shows the issue. Here is an updated patch.
msg127135 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-26 20:23
Haypo: yeah, in an ideal world Generator would use get_payload() and not _payload, but I had to make some compromises with purity of model separation in order to achieve the practical goal of handling bytes usefully.  I'd like to fix that as I work on email in 3.3.
msg127148 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-01-26 21:49
pitrou> There's a missing conversion in mailbox.patch.
pitrou> Running with -bb shows the issue.
pitrou> Here is an updated patch.

Good catch: test_mailbox now pass on Windows.

--

Some remarks on mailbox2.patch.

get_string() returns a bytes object: I propose to rename it to get_bytes(): """Return a *byte* string representation or raise a KeyError."""

The following comment is outdated, target have to be a *binary* file:
    def _dump_message(self, message, target, mangle_from_=False):
        # This assumes the target file is open in *text* mode ...

get_file(): should we specify that the file-like object is a binary file?

MH.get_sequences() and MH.set_sequences() opens .mh_sequences file in text mode from the locale encoding. I don't know if the locale encoding is a good choice. Does this file contain non-ASCII characters? Should we use ASCII or UTF-8 encoding instead, or parse the file in binary, and only decode requested values from ASCII?

Since all tests of test_mailbox now pass on Windows, it looks like the "universal newline" thing still work. But how can I be sure?

-            from_line = 'From MAILER-DAEMON %s' % time.asctime(time.gmtime())
+            from_line = b'From MAILER-DAEMON ' + time.asctime(time.gmtime()).encode()

Is UTF-8 the right encoding to encode a timestamp? Or should we use something like "=?UTF-8?q?...?=" ?

MH.set_sequences() does... sometimes... decode the sequence name from UTF-8. I don't understand why I had to add the following if:

-                f.write('%s:' % name)
+                if isinstance(name, bytes):
+                    name = name.decode()
+                f.write(name + ':')

Is it correct to decode the timestamp from UTF-8? And is the following change correct?
***********
-            maybe_date = ' '.join(self.get_from().split()[-5:])
+            maybe_date = b' '.join(self.get_from().split()[-5:])
             try:
+                maybe_date = maybe_date.decode('utf-8')
                 message.set_date(calendar.timegm(time.strptime(maybe_date,
                                                       '%a %b %d %H:%M:%S %Y')))
-            except (ValueError, OverflowError):
+            except (ValueError, OverflowError, UnicodeDecodeError):
                 pass
***********


The following change is just enough to fix mailbox. But it would maybe be better to inherit from RawIOBase instead and implement all methods. _PartialFile class might be moved into the io module. All of this can be done later.
****** 
+    def readable(self):
+        return self._file.readable()
 
+    def writable(self):
+        return self._file.writable()
+
+    def seekable(self):
+        return self._file.seekable()
+
+    def flush(self):
+        return self._file.flush()
+
+    @property
+    def closed(self):
+        return self._file.closed
******
msg127168 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-27 02:36
I haven't looked at the items Haypo has pointed to yet, but I have looked at the API issues (get_string, add, etc).  It seems to me that we have to make a decision here: do we break API backward compatibility and convert to consuming and emitting bytes pretty much everywhere we handle non-Message messages, or do we maintain backward compatibility by extending the APIs to handle bytes as well as strings?  In the email module I took the second approach.  I'm leaning toward that approach here as well, but it is a little messier than it was for email, where there ended up being distinct interfaces for bytes versus string.  Here we'd have get_string and get_bytes, but it seems more sensible to have add and similar methods accept both bytes and strings rather than making duplicate methods for each of the cases (since they are polymorphic between string and Message already).

And then there is get_file, which is *documented* as returning a binary file, but in fact has been returning a text file.  It makes sense that it should return a binary file.  So that's one backward incompatible bug fix already....unless we introduce get_binary_file and change the docs.

I think we need opinions from more than just haypo and I, but given that we are in RC I'm leaning toward a polymorphic API for add and friends and new get_bytes and get_binary_file methods.  That's more work than the patch haypo produced, though, since it requires some new code and tests in addition to what he's already done.  

Either approach introduces API changes in an RC, but unless we want to continue to ship a mailbox module that is as half-functional (or less) as what email was in 3.1, we have to do something.

I should have some time to work on this tomorrow, and a bit more on Friday, but we're getting down to the wire here.
msg127245 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-28 04:26
Attached is a patch that builds on Victor's patch, but takes the approach I discussed of maintaining backward compatibility (for the most part; see below).  The test suite in this version is substantially unchanged.  The major changes are adding tests for the bytes input and the new method (get_bytes).  The changes to existing test methods are to methods that test internal interfaces (because those now handle bytes, not string).

I've included doc changes, which are mostly adding notes about where bytes are accepted (add, __setitem__), and for the new get_bytes method.

get_string is now implemented by calling get_bytes, passing the result to email.message.Message, and then calling as_string.  This defeats the efficiency purpose of using get_string, but in that use case code should really be using get_bytes.

I kept the change to get_file: it returns a binary file, as currently documented.  I think this is less likely to cause backward compatibility issues (assuming any 3.x code exists that uses mailbox!) than get_string returning bytes (or dissapearing) would.

As with email.message.Message's get_unixfrom method, get_from returns a string, and set_from takes a string.  Although there are no real standards for this "header", I believe that it is restricted to ASCII, and have written the code accordingly.  This is my answer to Victor's question about maybe_date and the from line: I think we should use ascii as the encoding.  That is certainly true for the date; asctime does not use the locale, and English date fields are definitely the de-facto standard for From lines.

I haven't looked at the mh_sequences question yet.  I don't think there are any formal restrictions on what characters can be used in a sequence name, but I haven't looked to see if there are any standards documents for mh.  I'll test to see if my nmh installation accepts non-ascci chars for sequence names tomorrow.  I'm also going to try to go over Victor's changes section by section, but everything I've looked at other than the mh_sequences issue he raised looks good to me so far.

I note that we still don't have an RM call on whether or not this can go in if it passes review.

Oh, also note that neither Victor's patch nor my patch have any tests for non-ASCII characters.  Some should be added :)
msg127303 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-01-28 14:07
After cloning branches/py3k (i now have three different detached repo snakes in my arena (2.7,3.1,py3k), by the way - not bad for a greenhorn, huh?).
I've applied RDMs patch from msg127245.
Note: the test mails are *malformed*!
Stuff in brackets are my error messages, rest is "str(ex)".

1. Single Latin-1 character in "From:" (00F6;LATIN SMALL LETTER O WITH DIAERESIS):
    [ERROR: failed to handle box "/Users/steffen/tmp/au.latin1":] expected string or buffer
2. Whatever-Encoding in "Subject:" (see example 2 below):
    [PANIC: Box source-changes.mdir: message-add failed, mails may be lost:] 'ascii' codec can't encode character '\ufffd' in position 8: ordinal not in range(128)

Here are two stripped header fields pasted in an UTF-8 environment:

From: "SAJATNAPTAR.COM" <info@sajatnaptar.com>$
Subject: Falinaptár ingyenes házhozszállítással. Már rendeltél? Olvass el!

From: "Syria Trade Center :" <no-reply@syriatc.com>$
Subject: ÅÝÊÊÇÍ ÇáãßÊÈ ÇáÑÆíÓí ááãÑßÒ - æÊÞæíã 2011
msg127305 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-28 14:27
Steffen: thanks for testing.  Do those error messages have tracebacks?  Can you post them?  Can you post example messages and a short program that demonstrates the problem?  I'm going to be creating some non-ascii test cases, but any additional info you can provide will give me a leg up on that.
msg127309 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-01-28 14:59
Indeed i tried to create tracebacks (even with "import traceback"), but these all end up in my code (and all the time).  I have not yet figured out how to create tracebacks which leave my code and reach the source, which surely must be somewhere in email/ - even with the -d command line switch.
Wait a bit for the rest - i would indeed post my halfway-thought-through-and-developed S-Postman if you would ask for it.  It however simply uses the email package (mailbox,email,FeedParser) and is a 30KB thing with config-file parsing etc..
msg127310 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-28 15:08
I don't see those error messages in the mailbox source.  I'm guess your application isn trapping the errors in a try/except.  In that case, just do a bare 'raise' in the except clause, and you should get the full traceback.

I'm sure I'll discover problems just using your simple examples.  Likely your full code would be more of a distraction than a help, unless we end up in a situation where I've fixed all the bugs I can find and you are still having problems.
msg127313 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-01-28 15:21
You're indeed right, i've overseen a try..catch!
I'll even be able to give you some fast code hints now
(and i'll be offline the next few hours - the mails are simply mails with illegal charsets, say):

Traceback (most recent call last):
  File "/Users/steffen/tmp/y/s-postman.py", line 1098, in <module>
    sys.exit(main())
  File "/Users/steffen/tmp/y/s-postman.py", line 1088, in main
    xclass = xclass() # Impl. class chosen upon commline args
  File "/Users/steffen/tmp/y/s-postman.py", line 951, in __init__
    self._walk()
       def _walk(self):
                verb("--dispatch: starting iteration over input boxes")
                for b, t in _Dispatch_Boxes:
                        box = open_mailbox(b, type=t, create=False)
                        try:
                                self._do_box(box)
                        except Exception as e:
                                raise
  File "/Users/steffen/tmp/y/s-postman.py", line 958, in _walk
    self._do_box(box)
        def _do_box(self, box):
                cnt = len(box)
                log("* Box contains ", cnt, " messages")
                for nr in range(cnt):
                        log("  * Dispatching message ", nr+1)
                        msg = box.get_message(nr)
                        Ticket.process_msg(msg)
                log("  @ Dispatched ", cnt, " messages, finished box")
  File "/Users/steffen/tmp/y/s-postman.py", line 982, in _do_box
    Ticket.process_msg(msg)
        @staticmethod
        def process_msg(msg):
                ticket = Ticket(msg)
                (match, ruleset, to_box) = Ruleset.dispatch_ticket(ticket)
                if not match:
                        to_box.add_ticket(ticket)
                        return
                splitter = to_box.get_archive_splitter()
                if not splitter or config.get_keep_archives():
                        to_box.add_ticket(ticket)
                        return
                log("    @ Treating ticket ", ticket._id,
                        " as archive, splitting")
                for msg in splitter(msg):
                        ticket = Ticket(msg)
                        to_box.add_ticket(ticket)
  File "/Users/steffen/tmp/y/s-postman.py", line 898, in process_msg
    to_box.add_ticket(ticket)
        def add_ticket(self, ticket, ignore_errors=False):
                efun = panic
                if ignore_errors:
                        efun = error
                log("    @ Saving ticket ", ticket.get_id(),
                        " in \"", self._ident, "\"")
                mbox = self._mailbox
                if not mbox:
                        mbox = os.path.join(config.get_folder(), self._path)
                        mbox = open_mailbox(mbox, type=self._type, create=True)
                        self._mailbox = mbox
                        try:
                                mbox.lock()
                        except Exception as e:
                                efun("Could not gain mailbox lock!")
                try:
                        mbox.add(ticket.get_msg())
                        mbox.flush()
                except Exception as e:
                        #efun("Box ", self._ident,
                        #       ": message-add failed, ",
                        #       "mails may be lost: ", str(e))
                        raise
  File "/Users/steffen/tmp/y/s-postman.py", line 680, in add_ticket
    mbox.add(ticket.get_msg())
  File "/Users/steffen/usr/lib/python3.2/mailbox.py", line 259, in add
    self._dump_message(message, tmp_file)
  File "/Users/steffen/usr/lib/python3.2/mailbox.py", line 205, in _dump_message
    gen.flatten(message)
  File "/Users/steffen/usr/lib/python3.2/email/generator.py", line 88, in flatten
    self._write(msg)
  File "/Users/steffen/usr/lib/python3.2/email/generator.py", line 141, in _write
    self._write_headers(msg)
  File "/Users/steffen/usr/lib/python3.2/email/generator.py", line 372, in _write_headers
    header_name=h)
  File "/Users/steffen/usr/lib/python3.2/email/header.py", line 197, in __init__
    self.append(s, charset, errors)
  File "/Users/steffen/usr/lib/python3.2/email/header.py", line 275, in append
    s.encode(output_charset, errors)
UnicodeEncodeError: 'ascii' codec can't encode character '\ufffd' in position 8: ordinal not in range(128)
msg127324 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-28 19:12
What is the data type returned by your get_msg?  I bet it is string, and email can't handle messages in string format that have non-ASCII characters (I'm adding an explicit error message for this).  You either need to use a Message object, or, more likely in your case, change the return type of get_msg to be bytes.
msg127327 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-28 19:38
I'm updating the patch to contain a couple tests using non-ASCII.  More are needed.

Before this patch, one could process a file containing non-ASCII characters as text, and if your default encoding happened to be able to decode it, things would appear to more or less work.  In real life doing this is most likely to produce mojibake.  So the patch now rejects string input that contains non-ASCII characters with a helpful message about using bytes or Message input.  Email doesn't handle messages in string format that contain non-ASCII characters, either (which, I think, was the source of the error Steffen encountered).  This means that the string backward-compatibility is reduced to ascii-only messages.  But if mailbox in py3 is being used successfully by anybody, it is most likely to be someone processing ascii only messages for some reason.
msg127338 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-01-28 20:29
> What is the data type returned by your get_msg?  I bet it is string,
> and email can't handle messages in string format that have non-ASCII
> characters

(Now i see that the local names 'box', 'mbox' and 'mailbox' have become somewhat messed up, which may have been misleading.)
The answer is (somewhat) definitely no:

    class Ticket:
        @staticmethod
        def process_msg(msg):
                ticket = Ticket(msg)
        ...
        def __init__(self, msg):
                global _Ticket_Count
                _Ticket_Count += 1
                self._id = _Ticket_Count
                self._msg = msg
                log("    @ Creating ticket number ", self._id, ":")

... instantiated by either:
                        msg = mbox.get_message(nr) # It's a Mailbox
                        Ticket.process_msg(msg)

... or:
def openbsd_splitter(msg):
        if msg.is_multipart():
                log("      @ Multipart message: not splitting")
                return [msg]
        i = msg["Subject"]
        if i is None or "digest," not in i:
                log("      @ \"digest,\" not in Subject: not splitting")
                return [msg]
        # Real splitter: nl, SPLITTER, nl, Date: header..
        SPLITTER = "------------------------------"
        def __create_msg(charset, lines):
                try:
                        fp = email.feedparser.FeedParser()
                        headerok, lastnl = False, False
                        while len(lines) > 0:
                                l = lines.pop(0)
                                if SPLITTER in l and lastnl:
                                        break
                                lastnl = not len(l.strip())
                                if not headerok:
                                        if lastnl:
                                                headerok = True
                                        else:
                                                l = split_header_line_helper(.....)
                                fp.feed(l + "\n")
                        return fp.close()
                except Exception as e:
                        log("      @ Error - not splitting: ", str(e))
                        return None
        result = list()
        lines = msg.get_payload().splitlines()
        while len(lines):
                l = lines.pop(0)
                if SPLITTER in l:
                        break
        while len(lines):
                l = lines[0]
                if l.startswith("Date: "):
                        nm = __create_msg(charset, lines)
                        if not nm:
                                return [msg]
                        result.append(nm)
                else:
                        lines.pop(0)
        return result
... which then ends up as the shown
                for msg in splitter(msg):
                        ticket = Ticket(msg)
                        to_box.add_ticket(ticket) # This is 'class Box'

... and it's the very Box.add_ticket() which has been shown in msg127313.  That's all - note however that the email.message.Message headers may either be strings or 'Header' objects - this is work in transition (i somehow want to deal with these malformed mails and at least encapsulate all str() headers to 'Header' headers with the fallback 'quopri' encoding ISO-8859-1 - like this the mail will at least be clean on the disk ...)
msg127343 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-28 20:58
Well, that's a bunch of code, and I'm afraid I don't know what your answer to my question was.  What error do you get now if you use the new version of mailbox3.patch?

If you feed the new mailbox/email bytes, it will preserve the bytes as is, as long as you don't try to convert the invalid headers to strings.  If you convert them to string (by accessing them through the Message object), it will encode them as 'unknown-8bit' using quopri or base64 as appropriate (ie: depending on how many non-ascii chars there are).  If you want instead to guess that they are latin-1, you can call decode_header on the stringified version to get back the original bytes, and then substitute your preferred guessed charset for the 'unknown-8bit' charset and go from there to unicode.  (For Python3.3 I plan to provide tools to make this kind of processing much simpler.)
msg127346 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-28 21:11
Added two more tests of non-ASCII.  I think the tests now cover the necessary cases.

I still want to do a full code review tomorrow, but I think the patch is in final form if anyone else is available to do a review as well.

Georg, are you OK with this going in?  I think it is an important part of the "email is fixed" story and thus worth bending the rules for.
msg127349 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-01-28 21:33
I missed your mailbox3.patch, but now i've merged it in.
One error changed, it now happens when a re.search is applied to a header value and thus seems to match what you say.  I'm not able to understand this error this evening, but i will review it once again tomorrow and will notify this issue if it seems to me it's something different than what you say.
The other error almost remains the same from my point of view, again, i'm out today, but here i'll add the traceback again.

... as before ...
  File "s-postman.py", line 685, in add_ticket
    mbox.add(ticket.get_msg())
  File "/Users/steffen/usr/lib/python3.2/mailbox.py", line 269, in add
    self._dump_message(message, tmp_file)
  File "/Users/steffen/usr/lib/python3.2/mailbox.py", line 215, in _dump_message
    gen.flatten(message)
  File "/Users/steffen/usr/lib/python3.2/email/generator.py", line 88, in flatten
    self._write(msg)
  File "/Users/steffen/usr/lib/python3.2/email/generator.py", line 141, in _write
    self._write_headers(msg)
  File "/Users/steffen/usr/lib/python3.2/email/generator.py", line 372, in _write_headers
    header_name=h)
  File "/Users/steffen/usr/lib/python3.2/email/header.py", line 197, in __init__
    self.append(s, charset, errors)
  File "/Users/steffen/usr/lib/python3.2/email/header.py", line 275, in append
    s.encode(output_charset, errors)
UnicodeEncodeError: 'ascii' codec can't encode character '\ufffd' in position 8: ordinal not in range(128)
msg127354 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-28 21:57
If you are using the most recent mailbox3 patch (I should have renamed it, sorry...I've no done so to make it clear) you should be getting an error message that tells you to use binary or Message.  So I don't understand how you are getting this message.  I still don't know what it is you are passing to the add method.

RC2 will be cut starting Sunday morning CET, and after that we will be making no further non-critical changes.  So if you've got bugs, best we find them before the end of the day tomorrow ;)
msg127392 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2011-01-29 09:43
I'd really like someone else to throw a pair of eyes at the code changes before it is committed.

But yes, I will allow this into rc2, since a completely broken module isn't really what a minor release is about.
msg127395 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-01-29 10:14
RDM: it seems i was too tired to get your messages right last evening!
Indeed it's now completely my fault, i should inspect the content further in respect to the str/bytes etc. stuff!  Thus - i will now need three or four days to cleanup my hacky code before output of this broken thing is of any further use.  (If afterwards something new shows up i will of course post a feedback.)
It may be of interest for you, however, that speed broke down heavily once again, so that my dumb thing did not take 2 seconds as it did after applying the first patch, but almost 8 seconds.  (It takes max. 1.1 seconds on Python 2.7.)
Beside that there came to me a "hhuuuuh"!  Python is a lot about testing!  I would have been more of a help if i would have simply offered some test cases? !!  Next time!

RDM: thanks a lot for spending long hours of work on this issue!
msg127406 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-01-29 11:40
+            if isinstance(message, io.TextIOWrapper):
+                # Backward compatibility hack.
+                message = message.buffer

Is it a good thing to parse a mailbox using a text file? If not, we should emit a warning and maybe remove this feature in Python 3.3.
msg127418 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-29 14:35
Victor: yes, I was thinking that when I added that comment but forgot to come back to it.  Thanks for spotting that.

Another thing I forgot about yesterday is that I activated the commented out statements that do linesep transformations on the binary file data.  I'm guessing those were commented out when the module was converted to using text files, since the text file would do the transformation itself.  Your patch left them commented out, and the tests passed on Windows.  If the tests *still* pass on windows with them uncommented, then that will prove that, like the old email tests, the line ending variations aren't really being tested.

But, the important thing here is that I haven't run the tests on Windows yet, and that certainly needs to be done.
msg127446 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-29 17:25
OK, I've added deprecation warnings for using StringIO or text mode files as input.  I found one bug thereby, but it is a bug that pre-existed the patch (see issue 11062).  

I've completed my code review.

To address Victor's question about the mh-sequences file: nmh rejects non-ascii sequence names, so the file should contain only ASCII.  The man page specifies that sequences are composed only of alphanumeric characters.  I think opening the file in text mode using the system default encoding is probably fine, since if any mh program does support non-ascii sequence names that is likely what it would do as well.  Of course, in the future I would think utf-8 would be preferred, but I guess we can deal with that issue if we get a bug report.  We're maintaining backward compatibility with 3.1 here, so it's not really an issue for this patch.  As far as the 'if bytes' business goes, the tests pass for me without those lines, and it looks to me like they should not be needed.  On IRC Victor said he thought he may have introduced those before the patch was finished.  We have decided to omit them.

I think I've address the remainder of Victor's issues already.

The last step is running the tests on Windows.  Attached is the updated patch.
msg127459 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-01-29 18:18
> The last step is running the tests on Windows.
> Attached is the updated patch.

mailbox4.patch doesn't pass on Windows, Raymond is working on a patch.
msg127472 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-29 19:24
(I hope you meant I was working on a patch :)

Patch is done, but there is one remaining test failure that I'm not sure how to handle.  The test is test_add_text_file_warns.  The code checks to see if a file is a subclass of io.TextIOWrapper, and if so warns that this is deprecated, grabs the buffer attribute, and reads the file as binary.  This works fine, except that in testing it I used a temporary file.  On Linux that works great, but on Windows the temporary file is a tempfile._TemporaryFileWrapper, and that is *not* a subclass of io.TextIOWrapper.  So the code falls through to the "this must be a binary file" code and fails with a TypeError.

Any thoughts on how to handle this edge case?

I've got stuff to do this afternoon but I'll check back later to see if anybody has any ideas and, if all else fails, will disable that test on Windows.  It means the module doesn't handle temporary-text-file input on Windows correctly, but since we are deprecating text files anyway I think that is not a show stopper.
msg127491 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-29 21:56
Benjamin suggested using hasattr(message, 'buffer'), and that works great.  The test revealed a bug in the patch, which is now fixed.

All tests pass on windows.  As far as I'm concerned the patch is ready to go.  Other reviews would of course be welcome (and perhaps required by Georg).
msg127518 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-30 06:23
Committed (with RM approval on IRC) in r88252.

Note that this does not necessarily solve the performance problem.  A new issue should be opened for that if it still exist.
History
Date User Action Args
2011-01-30 06:23:56r.david.murraysetstatus: open -> closed
nosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127518

resolution: fixed
stage: commit review -> resolved
2011-01-29 21:56:15r.david.murraysetfiles: + mailbox6.patch
nosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127491
2011-01-29 19:24:40r.david.murraysetfiles: + mailbox5.patch
nosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127472
2011-01-29 18:18:40hayposetnosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127459
2011-01-29 17:25:45r.david.murraysetfiles: + mailbox4.patch
nosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127446
2011-01-29 14:35:21r.david.murraysetnosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127418
2011-01-29 11:40:03hayposetnosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127406
2011-01-29 10:14:45sdaodensetnosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127395
2011-01-29 09:43:53georg.brandlsetnosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127392
2011-01-29 09:43:20georg.brandlsetnosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: - msg127391
2011-01-29 09:40:54georg.brandlsetnosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127391
2011-01-28 21:57:06r.david.murraysetnosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127354
2011-01-28 21:33:43sdaodensetnosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127349
2011-01-28 21:23:50r.david.murraysetfiles: - mailbox3.patch
nosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
2011-01-28 21:23:38r.david.murraysetfiles: + mailbox3.2.patch
nosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
2011-01-28 21:11:03r.david.murraysetnosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127346
2011-01-28 20:58:22r.david.murraysetnosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127343
2011-01-28 20:29:20sdaodensetnosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127338
2011-01-28 19:38:12r.david.murraysetfiles: - mailbox3.patch
nosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
2011-01-28 19:38:02r.david.murraysetfiles: + mailbox3.patch
nosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127327
2011-01-28 19:12:45r.david.murraysetnosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127324
2011-01-28 15:21:26sdaodensetnosy: georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127313
2011-01-28 15:20:46akuchlingsetnosy: - akuchling
2011-01-28 15:08:20r.david.murraysetnosy: akuchling, georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127310
2011-01-28 14:59:49sdaodensetnosy: akuchling, georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127309
2011-01-28 14:27:21r.david.murraysetnosy: akuchling, georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127305
2011-01-28 14:07:17sdaodensetnosy: akuchling, georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127303
2011-01-28 04:26:20r.david.murraysetfiles: + mailbox3.patch
nosy: akuchling, georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127245
2011-01-27 02:36:51r.david.murraysetnosy: akuchling, georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127168
2011-01-26 21:49:57hayposetnosy: akuchling, georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127148
2011-01-26 20:23:00r.david.murraysetnosy: akuchling, georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127135
2011-01-26 14:21:12pitrousetfiles: + mailbox2.patch
nosy: akuchling, georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127119
2011-01-26 12:19:26hayposetnosy: akuchling, georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127104
2011-01-26 11:55:45sdaodensetnosy: akuchling, georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127101
2011-01-26 10:00:29hayposetfiles: + BytesGenerator_handle_text.patch
nosy: akuchling, georg.brandl, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127097
2011-01-26 04:11:22r.david.murraysetpriority: critical -> release blocker
versions: - Python 3.3
nosy: + georg.brandl

messages: + msg127089

stage: needs patch -> commit review
2011-01-26 02:59:05rhettingersetnosy: akuchling, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127087
2011-01-26 01:04:12hayposetnosy: akuchling, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127079
2011-01-26 00:58:47hayposetfiles: + mailbox.patch

messages: + msg127078
keywords: + patch
nosy: akuchling, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
2011-01-25 18:33:40rhettingersetnosy: akuchling, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127036
2011-01-25 12:52:15r.david.murraysetnosy: akuchling, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127005
2011-01-25 12:39:40r.david.murraysetnosy: akuchling, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
messages: + msg127002
2011-01-25 11:51:33pitrousetpriority: high -> critical
nosy: akuchling, rhettinger, holdenweb, pitrou, haypo, giampaolo.rodola, lregebro, r.david.murray, sdaoden
versions: + Python 3.3
2011-01-25 10:37:00sdaodensetnosy: + sdaoden
messages: + msg126994
2011-01-24 15:51:34hayposetnosy: + haypo
2011-01-24 14:58:43r.david.murraylinkissue10995 superseder
2011-01-13 00:13:43r.david.murraysetnosy: akuchling, rhettinger, holdenweb, pitrou, giampaolo.rodola, lregebro, r.david.murray
messages: + msg126153
2011-01-12 21:52:17rhettingersetnosy: akuchling, rhettinger, holdenweb, pitrou, giampaolo.rodola, lregebro, r.david.murray
messages: + msg126144
versions: - Python 3.3
2011-01-12 21:14:08rhettingersetassignee: r.david.murray

messages: + msg126139
nosy: + r.david.murray
2011-01-12 11:56:04pitrousetversions: + Python 3.2, Python 3.3, - Python 3.1
nosy: akuchling, rhettinger, holdenweb, pitrou, giampaolo.rodola, lregebro
title: Mailbox module demonstrates infeasibly slow performance -> Mailbox module should use binary I/O, not text I/O
messages: + msg126085

type: performance -> behavior
stage: test needed -> needs patch
2011-01-12 10:21:11lregebrosetnosy: akuchling, rhettinger, holdenweb, pitrou, giampaolo.rodola, lregebro
messages: + msg126082
2011-01-12 09:40:01rhettingersetnosy: + pitrou, rhettinger
messages: + msg126079
2011-01-12 09:37:27rhettingersetpriority: normal -> high
nosy: akuchling, holdenweb, giampaolo.rodola, lregebro
2011-01-12 09:35:22lregebrosetnosy: + lregebro
messages: + msg126078
2010-07-12 01:11:30akuchlingsetnosy: + akuchling
2010-06-30 13:58:20giampaolo.rodolasetnosy: + giampaolo.rodola
2010-06-30 12:40:42holdenwebsetmessages: + msg108979
2010-06-30 12:34:57holdenwebsetfiles: + test3.1.out
2010-06-30 12:32:30holdenwebsetfiles: + test2.5.out
2010-06-30 12:29:27holdenwebsetfiles: + test.mailbox
2010-06-30 12:28:21holdenwebcreate