Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Email parser creates a message object that can't be flattened as bytes. #83565

Closed
msapiro mannequin opened this issue Jan 18, 2020 · 7 comments
Closed

Email parser creates a message object that can't be flattened as bytes. #83565

msapiro mannequin opened this issue Jan 18, 2020 · 7 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes topic-email

Comments

@msapiro
Copy link
Mannequin

msapiro mannequin commented Jan 18, 2020

BPO 39384
Nosy @warsaw, @msapiro, @bitdancer
PRs
  • bpo-39384 Fix email.generator.BytesGenerator to flatten non-ascii body. #18056
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-02-06.04:30:03.853>
    created_at = <Date 2020-01-18.19:48:43.393>
    labels = ['3.7', '3.8', 'expert-email', '3.9']
    title = "Email parser creates a message object that can't be flattened as bytes."
    updated_at = <Date 2020-02-06.04:30:03.852>
    user = 'https://github.com/msapiro'

    bugs.python.org fields:

    activity = <Date 2020-02-06.04:30:03.852>
    actor = 'msapiro'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-02-06.04:30:03.853>
    closer = 'msapiro'
    components = ['email']
    creation = <Date 2020-01-18.19:48:43.393>
    creator = 'msapiro'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 39384
    keywords = ['patch']
    message_count = 7.0
    messages = ['360249', '360320', '360331', '361359', '361360', '361361', '361469']
    nosy_count = 3.0
    nosy_names = ['barry', 'msapiro', 'r.david.murray']
    pr_nums = ['18056']
    priority = 'normal'
    resolution = None
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue39384'
    versions = ['Python 3.5', 'Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9']

    @msapiro
    Copy link
    Mannequin Author

    msapiro mannequin commented Jan 18, 2020

    This is similar to https://bugs.python.org/issue32330 but is the opposite behavior. In that issue, the message couldn't be flattened as a string but could be flattened as bytes. Here, the message can be flattened as a string but can't be flattened as bytes.

    The original message was created by an arguably defective email client that quoted a message containing a utf8 encoded RIGHT SINGLE QUOTATION MARK and utf-8 encoded separately the three bytes resulting in â** instead of . That's not really relevant but is just to show how such a message can be generated.

    The following interactive python session shows the issue.

    >>> import email
    >>> msg = email.message_from_string("""From user@example.com Sat Jan 18 04:09:40 2020
    ... From: user@example.com
    ... To: recip@example.com
    ... Subject: Century Dates for Insurance purposes
    ... Date: Fri, 17 Jan 2020 20:09:26 -0800
    ... Message-ID: <75ccdd72-d71c-407c-96bd-0ca95abcfa03@email.android.com>
    ... MIME-Version: 1.0
    ... Content-Type: text/plain; charset="utf-8"
    ... Content-Transfer-Encoding: 8bit
    ... 
    ...        Thursday-Monday will cover both days of staging and then storing goods
    ...        post-century. I think thatâ**s the way to go.
    ... 
    ... """)
    >>> msg.as_bytes()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python3.7/email/message.py", line 178, in as_bytes
        g.flatten(self, unixfrom=unixfrom)
      File "/usr/local/lib/python3.7/email/generator.py", line 116, in flatten
        self._write(msg)
      File "/usr/local/lib/python3.7/email/generator.py", line 181, in _write
        self._dispatch(msg)
      File "/usr/local/lib/python3.7/email/generator.py", line 214, in _dispatch
        meth(msg)
      File "/usr/local/lib/python3.7/email/generator.py", line 432, in _handle_text
        super(BytesGenerator,self)._handle_text(msg)
      File "/usr/local/lib/python3.7/email/generator.py", line 249, in _handle_text
        self._write_lines(payload)
      File "/usr/local/lib/python3.7/email/generator.py", line 155, in _write_lines
        self.write(line)
      File "/usr/local/lib/python3.7/email/generator.py", line 406, in write
        self._fp.write(s.encode('ascii', 'surrogateescape'))
    UnicodeEncodeError: 'ascii' codec can't encode character '\xe2' in position 33: ordinal not in range(128)
    >>> 
    

    @msapiro msapiro mannequin added 3.7 (EOL) end of life topic-email 3.8 only security fixes 3.9 only security fixes labels Jan 18, 2020
    @bitdancer
    Copy link
    Member

    Since you parsed it as a string it is not really legitimate to serialize it as bytes. (That will work if the input message only contains ascii, but not if it contains unicode). You'll get the same error if you replace the garbage with the "’". Using errors=replace is not crazy, but it hides the actual problem. Let's see what other people think :)

    In theory you could "fix" this by encoding the unicode using the charset specified by the container. I have no idea how complicated it will be do that, and it would be a new feature: parsing strings is specified to only work with ASCII input, currently.

    I put "fix" in quotes, because even if you make text parts like this example work, you still can't handle non-text 8bit mime parts. Is it worth doing anyway?

    Really, message_as_string and friends should just be avoided entirely, maybe even deprecated.

    @msapiro
    Copy link
    Mannequin Author

    msapiro mannequin commented Jan 20, 2020

    This came about because of an actual situation in a Mailman 3 installation. I can't say for sure what the actual original message looked like, but it was received by Mailman's LMTP server and parsed with email.message_from_bytes(), so it clearly wasn't exactly like the message excerpt I posted in the report above. However, All I had to go by was the message object from the shunted pickle file created as a result of the exception.

    The message was processed by Mailman, but when Mailman's handler pipeline attempted to save it for the digest, it calls an instance of mailbox.MMDF to add the message to the mailbox accumulating messages for the digest, and that in turn calls the flatten method of an email.generator.BytesGenerator instance. and that's where the exception was thrown.

    Perhaps the suggested patch in #18056 doesn't address every possible case, and it can result in a slightly garbled message due to replacing 'invalid' characters, but in my case at least, it is much preferable to the alternative.

    @msapiro
    Copy link
    Mannequin Author

    msapiro mannequin commented Feb 4, 2020

    Other Mailman3 installations are also encountering this issue. See https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/message/VQZORIDL5PNQ4W33KIMVTFTANSGZD46S/

    @bitdancer
    Copy link
    Member

    If we can get an actual reproducer using message_as_bytes I'd feel more comfortable with the fix. I worry that there is some other bug this is exposing that should be fixed instead.

    @bitdancer
    Copy link
    Member

    message_from_bytes

    @msapiro
    Copy link
    Mannequin Author

    msapiro mannequin commented Feb 6, 2020

    I've researched this further, and I know how this happens. The original message contains a text/html part (in my case, the only part) which contains a base64 or quoted-printable body which when decoded contains non-ascii. It is parsed correctly by email.message_from_bytes.

    It is then processed by Mailman's content filtering which retrieves html payload via

    part.get_payload(decode=True).decode(ctype, errors='replace'))
    

    where part is the text/html part and ctype is 'utf-8' in this case. It then uses elinks, lynx or some other configured command to convert the html payload to plain text and that plain text still contains non-ascii.

    It then replaces the payload and sets the content type via

    del part['content-transfer-encoding']
    part.set_payload(plain_text)
    part.set_type('text/plain')
    

    And this results in a message which can't be flattened as_bytes.

    The issue is set_payload() should encode the payload appropriately and in fact, it does if an appropriate charset is given, so this is our error in not providing a charset= argument to set_payload.

    Closing this and the corresponding PR.

    @msapiro msapiro mannequin closed this as completed Feb 6, 2020
    @msapiro msapiro mannequin closed this as completed Feb 6, 2020
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes topic-email
    Projects
    None yet
    Development

    No branches or pull requests

    1 participant