classification
Title: mailbox.mbox fails on non ASCII characters
Type: enhancement
Stage: patch review
Components: email
Versions: Python 3.10
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: barry, flokli, r.david.murray, terry.reedy
Priority: normal
Keywords: patch

Created on 2020-11-22 10:09 by flokli, last changed 2022-04-11 14:59 by admin.

Pull Requests
PR 23553 (linked by flokli, 2020-11-29 10:47)
Messages (8)
msg381607 - Author: Florian Klink (flokli) Date: 2020-11-22 10:09
I'm importing some mbox archives into my maildirs, and use `mailbox.mbox` to parse archives created by pipermail.

Some of these archives seem to contain non-ASCII characters, and Python just raises a `UnicodeDecodeError` and refuses to process the archive.

Reproducer (the failure reproduces on 3.7.9, 3.8.5, and 3.9.0):

```
curl https://lists.freedesktop.org/archives/systemd-devel/2016-January.txt.gz | zcat > mbox.txt
python3 -c "import mailbox; mb = mailbox.mbox('mbox.txt');mb.items()"
```
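For readers who cannot download the archive, here is a minimal offline sketch of the same situation. The sample message, the address `alice@example.org`, and the helper name `try_parse` are made up for illustration; only the unescaped `From ` line with curly quotes mirrors what the thread describes. Whether it raises depends on the Python version in use, so the helper reports either outcome instead of assuming one.

```python
import mailbox
import os
import tempfile

# Hypothetical sample: the body contains an unescaped "From " line with
# non-ASCII curly quotes, similar to the line discussed in this issue.
RAW = (
    "From alice@example.org Fri Jan  1 00:00:00 2016\n"
    "Subject: test\n"
    "\n"
    "Some text.\n"
    "\n"
    "From here I was hoping to run \u201cdbus-send\u201d\n"
    "More text.\n"
).encode("utf-8")


def try_parse(raw: bytes) -> str:
    """Write raw bytes to a temporary mbox file and try to read every message."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mbox.txt")
        with open(path, "wb") as fh:
            fh.write(raw)
        mb = mailbox.mbox(path, create=False)
        try:
            # items() builds every message, which is where the failure shows up.
            count = len(mb.items())
            return f"parsed {count} message(s)"
        except UnicodeDecodeError as exc:
            return f"UnicodeDecodeError: {exc.reason}"
        finally:
            mb.close()


result = try_parse(RAW)
print(result)
```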
msg381627 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2020-11-22 19:18
The problem with that archive is that it is not in proper mbox format.  It contains the following line (5689):

    From here I was hoping to run something like “dbus-send –system –dest=Test.Me –print-reply /Japan Japan.Reset.Test string:”Hello””

You will note that there is no leading '>' on that line to escape that 'From '.  So mbox tries to build a 'From ' line from it, and fails because 'From ' lines should not contain any non-ascii characters.  It can be argued that this failure is sub-optimal: mbox should probably be calling decode('ascii', errors='replace') so that the parse doesn't fail, just as it would not fail if there were no non-ascii in the unescaped 'From ' line.
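The difference between the two decode calls can be seen in isolation, without involving `mailbox` at all. This is only a sketch of the codec behavior: the byte string is a shortened, made-up stand-in for the offending line, and slicing off the first five bytes is an assumption about how the 'From ' prefix would be dropped.

```python
# Shortened stand-in for the unescaped line, encoded as it would sit on disk.
from_line = "From here I was hoping to run \u201cdbus-send\u201d".encode("utf-8")

# Drop the leading b"From " (5 bytes) to get the part that would be decoded.
payload = from_line[5:]

# Strict ASCII decoding fails on the UTF-8 bytes of the curly quotes.
try:
    payload.decode("ascii")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors='replace' substitutes U+FFFD for each undecodable byte instead.
tolerant = payload.decode("ascii", errors="replace")

print(strict_ok)
print(tolerant)
```

With `errors='replace'` the text is lossy (each non-ASCII byte becomes a replacement character), but decoding no longer aborts the parse.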
msg381637 - Author: Florian Klink (flokli) Date: 2020-11-22 20:50
Yeah, I'm not questioning that this file may be badly formatted, but given that such files are out there, and the parser is somewhat forgiving in other cases, it should be tolerant here as well.
msg381975 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2020-11-28 05:42
(The non-ascii chars are “ and ”, versus ascii ".)

Florian, although you did not select a 'Type', selecting multiple versions implicitly claims that the current behavior is a bug.  I believe R. David has explained that it is not, even if sub-optimal.  Do you want to:
A. Argue on the basis of some claim in the docs that this really is a bug.
B. Close this issue as 'Not a bug'.
C. Turn it into an enhancement issue for 3.10 by calling decode in the appropriate place.  If so, you might first try making the change in your code after finding the appropriate place and see if the improvement is worth the change.
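Independently of the options above, a caller-side workaround is possible without touching `mailbox` itself. The sketch below is not the decode change discussed in this issue; it is a different technique that pre-processes the raw bytes and adds the missing '>' escape to any line that starts with 'From ' and contains non-ASCII bytes. The function name `escape_non_ascii_from_lines` and the sample data are hypothetical.

```python
def escape_non_ascii_from_lines(raw: bytes) -> bytes:
    """Prefix '>' to lines that start with b'From ' and contain non-ASCII bytes.

    Assumption: a genuine separator line is pure ASCII, so any 'From ' line
    with non-ASCII bytes is treated as unescaped body text.
    """
    out = []
    for line in raw.splitlines(keepends=True):
        if line.startswith(b"From ") and not line.isascii():
            line = b">" + line
        out.append(line)
    return b"".join(out)


# Hypothetical sample with one real separator and one unescaped body line.
sample = (
    b"From alice@example.org Fri Jan  1 00:00:00 2016\n"
    b"\n"
    b"From here I was hoping to run \xe2\x80\x9cdbus-send\xe2\x80\x9d\n"
    b"More text.\n"
)
escaped = escape_non_ascii_from_lines(sample)
print(escaped.decode("utf-8"))
```

This only catches unescaped lines that happen to contain non-ASCII bytes; a pure-ASCII unescaped 'From ' line in a body would still be read as a message boundary.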
msg382052 - Author: Florian Klink (flokli) Date: 2020-11-29 10:47
I opened https://github.com/python/cpython/pull/23553 - PTAL.

I made this an enhancement for 3.10, but it could probably also be backported to older versions.
msg382169 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2020-11-30 18:12
After thinking about it some more: given that, when there is no non-ascii, mbox will happily treat *anything* as valid on the "From " line, I think we should consider blowing up on non-ascii to be a bug.
msg382581 - Author: Florian Klink (flokli) Date: 2020-12-05 19:02
Based on https://bugs.python.org/issue42433#msg382169 I added back the versions in which the bug is present.

The PR is up and appropriately linked (I think?) - let me know if there's anything left to be done from my side.
msg382584 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2020-12-05 20:44
3.7 only gets security fixes.  If and when someone merges something, that person will decide whether to backport.
History
Date User Action Args
2022-04-11 14:59:38  admin  set  github: 86599
2020-12-05 20:44:11  terry.reedy  set  messages: + msg382584; versions: - Python 3.7, Python 3.8, Python 3.9
2020-12-05 19:02:01  flokli  set  messages: + msg382581; versions: + Python 3.7, Python 3.8, Python 3.9
2020-11-30 18:12:47  r.david.murray  set  messages: + msg382169
2020-11-29 10:47:12  flokli  set  versions: - Python 3.8, Python 3.9; messages: + msg382052; pull_requests: + pull_request22434; keywords: + patch; type: enhancement; stage: patch review
2020-11-28 05:42:30  terry.reedy  set  nosy: + terry.reedy; messages: + msg381975
2020-11-28 05:21:14  terry.reedy  set  versions: + Python 3.10, - Python 3.7
2020-11-22 20:50:06  flokli  set  messages: + msg381637
2020-11-22 19:18:58  r.david.murray  set  messages: + msg381627
2020-11-22 10:09:32  flokli  create