Title: Email parser creates a message object that can't be flattened
Type: behavior
Stage: patch review
Components: email
Versions: Python 3.9, Python 3.8, Python 3.7, Python 3.6, Python 3.5
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: barry, msapiro, r.david.murray
Priority: normal
Keywords: patch

Created on 2017-12-15 00:25 by msapiro, last changed 2020-01-19 06:38 by msapiro.

Files
File name: bad_email_2.eml
Uploaded: msapiro, 2017-12-15 00:25
Description: Sample message triggering issue

Pull Requests
URL: PR 18059
Status: open
Linked: msapiro, 2020-01-19 06:38

Messages (5)
msg308353 - Author: Mark Sapiro (msapiro) * (Python triager) Date: 2017-12-15 00:25
This is related to but a different exception is thrown for a different reason. This is caused by a defective spam message. I don't actually have the offending message from the wild, but the attached bad_email_2.eml illustrates the problem.

The defect is that the message declares the content charset as us-ascii, but the body contains non-ASCII characters. When the message is parsed into an email.message.Message object and the object's as_string() method is called, a UnicodeEncodeError is raised as follows:

>>> import email
>>> with open('bad_email_2.eml', 'rb') as fp:
...     msg = email.message_from_binary_file(fp)
>>> msg.as_string()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/email/", line 159, in as_string
    g.flatten(self, unixfrom=unixfrom)
  File "/usr/lib/python3.5/email/", line 115, in flatten
  File "/usr/lib/python3.5/email/", line 181, in _write
  File "/usr/lib/python3.5/email/", line 214, in _dispatch
  File "/usr/lib/python3.5/email/", line 243, in _handle_text
    msg.set_payload(payload, charset)
  File "/usr/lib/python3.5/email/", line 316, in set_payload
    payload = payload.encode(charset.output_charset)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 31-33: ordinal not in range(128)
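
For readers without the attachment, the failure can probably be approximated with an in-memory message. This is only a sketch: the RAW bytes below are a hypothetical stand-in, not the contents of bad_email_2.eml, and describe_flatten is an illustrative helper name. Whether as_string() raises depends on the Python version, so both outcomes are handled.

```python
import email

# Hypothetical stand-in for the defective message: the header claims
# us-ascii, but the body carries the UTF-8 bytes of a euro sign.
RAW = (
    b"From: sender@example.com\n"
    b"Subject: test\n"
    b"Content-Type: text/plain; charset=us-ascii\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"Price: \xe2\x82\xac100\n"
)


def describe_flatten(raw):
    """Parse raw bytes and report whether as_string() works."""
    msg = email.message_from_bytes(raw)
    try:
        msg.as_string()
    except UnicodeEncodeError as exc:
        # Expected on interpreters that show the behavior reported here.
        return "as_string() failed: {}".format(exc)
    return "as_string() succeeded"


print(describe_flatten(RAW))
```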
msg308361 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-12-15 02:34
What would you like to see happen in that situation?  Should we use errors=replace like we do for headers?  (That seems reasonable to me.)

Note that it can be re-serialized as binary.
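
A rough illustration of the binary path, assuming a hypothetical stand-in message rather than the attached file: as_bytes() writes the original body bytes back out, and decoding that result with errors='replace' shows what a lossy string form could look like.

```python
import email

# Hypothetical stand-in message: declared us-ascii, non-ASCII body bytes.
RAW = (
    b"From: sender@example.com\n"
    b"Subject: test\n"
    b"Content-Type: text/plain; charset=us-ascii\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"Price: \xe2\x82\xac100\n"
)

msg = email.message_from_bytes(RAW)

# Binary serialization keeps the undeclared non-ASCII bytes intact.
flattened = msg.as_bytes()

# Decoding as ASCII with errors='replace' swaps each offending byte
# for U+FFFD instead of raising.
as_text = flattened.decode("ascii", "replace")
print(as_text)
```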
msg308362 - Author: Mark Sapiro (msapiro) * (Python triager) Date: 2017-12-15 03:23
Yes. I think errors=replace is a good solution. In Mailman, we have our own class which is a subclass of email.message.Message and what we do to work around this and issue27321 is override as_string() with:

    def as_string(self):
        # Work around for this issue and issue27321
        try:
            value = email.message.Message.as_string(self)
        except (KeyError, UnicodeEncodeError):
            value = email.message.Message.as_bytes(self).decode(
                'ascii', 'replace')
        return value
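
To show how an override like this plugs into parsing, here is a self-contained sketch. The class name SafeMessage and the RAW sample are hypothetical and are not Mailman's actual code; the _class argument of email.message_from_bytes() is what makes the parser build the subclass.

```python
import email
import email.message


class SafeMessage(email.message.Message):
    """Hypothetical subclass mirroring the workaround described above."""

    def as_string(self):
        try:
            return email.message.Message.as_string(self)
        except (KeyError, UnicodeEncodeError):
            # Fall back to the bytes form and replace undecodable bytes.
            return email.message.Message.as_bytes(self).decode(
                "ascii", "replace")


# Hypothetical stand-in message: declared us-ascii, non-ASCII body bytes.
RAW = (
    b"From: sender@example.com\n"
    b"Subject: test\n"
    b"Content-Type: text/plain; charset=us-ascii\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"Price: \xe2\x82\xac100\n"
)

# _class tells the parser which Message class to instantiate.
msg = email.message_from_bytes(RAW, _class=SafeMessage)
text = msg.as_string()
print(text)
```

The fallback only runs when the normal string flattening raises, so well-formed messages are expected to go through the standard path unchanged.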
msg308395 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-12-15 14:40
I do wonder where you are using the string version of messages :)

I actually thought I'd already done this (errors=replace), but obviously not.  I don't have time now to work on a patch for this, and the patch in the other issue hasn't been updated to reflect the review I did :(
msg308421 - Author: Mark Sapiro (msapiro) * (Python triager) Date: 2017-12-15 19:16
> I do wonder where you are using the string version of messages :)

There are probably some places where we could use bytes instead, but one of the problem areas is where we save the content of a message held for moderation.
Date User Action Args
2020-01-19 06:38:24  msapiro  set  keywords: + patch; stage: patch review; pull_requests: + pull_request17453
2020-01-19 06:34:55  msapiro  set  versions: + Python 3.7, Python 3.8, Python 3.9
2017-12-15 19:16:50  msapiro  set  messages: + msg308421
2017-12-15 14:40:10  r.david.murray  set  messages: + msg308395
2017-12-15 03:23:27  msapiro  set  messages: + msg308362
2017-12-15 02:34:11  r.david.murray  set  messages: + msg308361
2017-12-15 00:25:27  msapiro  create