classification
Title: Unicode names break email header
Type: behavior Stage: patch review
Components: email Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Celelibi, _savage, barry, michael.thies, r.david.murray
Priority: normal Keywords: patch

Created on 2018-08-17 23:02 by _savage, last changed 2019-02-10 02:37 by Celelibi.

Pull Requests
URL Status Linked Edit
PR 8803 open python-dev, 2018-08-18 04:56
Messages (6)
msg323686 - (view) Author: Jens Troeger (_savage) * Date: 2018-08-17 23:02
See also this comment and ensuing conversation: https://bugs.python.org/issue24218?#msg322761

Consider an email message with the following:

from email.headerregistry import Address
from email.message import EmailMessage

message = EmailMessage()
message["From"] = Address(addr_spec="bar@foo.com", display_name="Jens Troeger")
message["To"] = Address(addr_spec="foo@bar.com", display_name="Martín Córdoba")

It’s important here that the email address itself is ASCII-encodable, but the recipient's display name is not. Flattening the object (https://github.com/python/cpython/blob/master/Lib/smtplib.py#L964) incorrectly inserts multiple carriage returns, which breaks the email header and in turn mangles the entire email:

flatmsg: b'From: Jens Troeger <bar@foo.com>\r\nTo: Fernando =?utf-8?q?Mart=C3=ADn_C=C3=B3rdoba?= <foo@bar.com>\r\r\r\r\r\nSubject:\r\n Confirmation: …\r\n…'
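The following is a minimal sketch for checking whether a given Python version shows the symptom. It assumes that flattening with `BytesGenerator` and `linesep="\r\n"` is close enough to what `smtplib` does before sending. The subject and body are placeholders (the original values are elided in the report), and `build_message`, `flatten_crlf` and `has_stray_cr` are hypothetical helper names.

```python
from email.generator import BytesGenerator
from email.headerregistry import Address
from email.message import EmailMessage
from io import BytesIO


def build_message() -> EmailMessage:
    """Build the message from the report; subject and body are placeholders."""
    message = EmailMessage()
    message["From"] = Address(addr_spec="bar@foo.com", display_name="Jens Troeger")
    message["To"] = Address(addr_spec="foo@bar.com", display_name="Martín Córdoba")
    message["Subject"] = "Confirmation"  # placeholder, not the original subject
    message.set_content("placeholder body")
    return message


def flatten_crlf(message: EmailMessage) -> bytes:
    """Flatten the message to bytes using CRLF line endings."""
    buffer = BytesIO()
    BytesGenerator(buffer).flatten(message, linesep="\r\n")
    return buffer.getvalue()


def has_stray_cr(flat: bytes) -> bool:
    """Report whether a carriage return is directly followed by another one."""
    return b"\r\r" in flat


flat = flatten_crlf(build_message())
print(flat)
print("stray carriage returns:", has_stray_cr(flat))
```

Whether the last line prints `True` depends on the interpreter: it should only do so on a version that still has the behaviour described here.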

After an initial investigation into the BytesGenerator (used to flatten an EmailMessage object), here is what’s happening.

Flattening the body and attachments of the EmailMessage object works. Eventually _write_headers() is called to flatten the headers, which happens entry by entry (https://github.com/python/cpython/blob/master/Lib/email/generator.py#L417-L418). Flattening a header entry is a recursive process over the entry's parse tree: it descends into the tree, encodes the individual “parts” (tokens of the header entry), and concatenates them into the final flattened string.

The parse tree for a header entry like "Martín Córdoba <foo@bar.com>" eventually results in the correct flattened string:

    '=?utf-8?q?Mart=C3=ADn_C=C3=B3rdoba?= <foo@bar.com>\r\n'

at the bottom of the recursion for this “Mailbox” part. The recursive callstack is then:

    _refold_parse_tree _header_value_parser.py:2687
    fold [Mailbox] _header_value_parser.py:144
    _refold_parse_tree _header_value_parser.py:2630
    fold [Address] _header_value_parser.py:144
    _refold_parse_tree _header_value_parser.py:2630
    fold [AddressList] _header_value_parser.py:144
    _refold_parse_tree _header_value_parser.py:2630
    fold [Header] _header_value_parser.py:144
    fold [_UniqueAddressHeader] headerregistry.py:258
    _fold [EmailPolicy] policy.py:205
    fold_binary [EmailPolicy] policy.py:199
    _write_headers [BytesGenerator] generator.py:418
    _write [BytesGenerator] generator.py:195

The problem now arises from the interplay of 

    # https://github.com/python/cpython/blob/master/Lib/email/_header_value_parser.py#L2629
    encoded_part = part.fold(policy=policy)[:-1] # strip nl

which strips the '\n' from the returned string, and

    # https://github.com/python/cpython/blob/master/Lib/email/_header_value_parser.py#L2686
    return policy.linesep.join(lines) + policy.linesep

which adds the policy’s line separation string linesep="\r\n" to the end of the flattened string upon unrolling the recursion.
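To make the interplay concrete, here is a toy model of the two statements. It is not the code from `_header_value_parser.py`. The function `refold_toy` and its `depth` parameter are hypothetical, and the model assumes that every nesting level strips exactly one character and then re-appends the line separator.

```python
def refold_toy(text: str, linesep: str, depth: int) -> str:
    """Toy model: each nesting level strips one character, then re-appends linesep."""
    folded = text + linesep          # innermost level returns a correctly terminated line
    for _ in range(depth):
        stripped = folded[:-1]       # mirrors the fixed one-character "strip nl"
        folded = stripped + linesep  # mirrors "... + policy.linesep"
    return folded


print(repr(refold_toy("To: x", "\n", depth=3)))    # 'To: x\n'
print(repr(refold_toy("To: x", "\r\n", depth=3)))  # 'To: x\r\r\r\r\n'
```

With a one-character separator the strip and the re-append cancel out. With "\r\n", each level leaves one "\r" behind, which matches the run of carriage returns in the flattened output above.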

I am not sure about a proper fix here, but considering that the policy's linesep can be a string of any length (in this case len("\r\n") == 2), a fixed truncation of one character [:-1] seems wrong. Instead, using:

    encoded_part = part.fold(policy=policy)[:-len(policy.linesep)] # strip nl

seems to work for entries with and without Unicode characters in their display names.
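The difference between the two slices can be checked in isolation. The sketch below is standalone and does not touch the email package. `strip_linesep` is a hypothetical helper, and its `endswith` guard is an added assumption that is not part of the proposed one-line change.

```python
def strip_linesep(folded: str, linesep: str) -> str:
    """Remove one trailing line separator, whatever its length."""
    # Guard against an empty separator: folded[:-0] would return "".
    if linesep and folded.endswith(linesep):
        return folded[:-len(linesep)]
    return folded


line = "=?utf-8?q?Mart=C3=ADn_C=C3=B3rdoba?= <foo@bar.com>\r\n"

print(repr(line[:-1]))                     # fixed slice: still ends with '\r'
print(repr(strip_linesep(line, "\r\n")))   # length-aware: no trailing separator
```

The guard only matters for unusual separators. For "\n" and "\r\n" the helper behaves like the plain `[:-len(policy.linesep)]` slice.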
msg323691 - (view) Author: Jens Troeger (_savage) * Date: 2018-08-18 05:08
Pull request https://github.com/python/cpython/pull/8803/
msg328381 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-10-24 17:07
I've requested some small changes on the PR.  If Jens doesn't respond in another week or so someone else could pick it up.
msg328382 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-10-24 17:10
Michael, if you could check whether Jens's patch fixes your problem, I would appreciate it.
msg328425 - (view) Author: Michael Thies (michael.thies) Date: 2018-10-25 10:46
Thanks for pointing me to this issue. :)

> Michael, if you could check whether Jens's patch fixes your problem, I would appreciate it.

Jens's PR does exactly what I proposed in #35057, so it does indeed fix my problem.
msg334096 - (view) Author: Jens Troeger (_savage) * Date: 2019-01-20 18:22
Can somebody please review and merge https://github.com/python/cpython/pull/8803? I am still waiting for this fix to become mainstream.
History
Date User Action Args
2019-02-10 02:37:38 Celelibi set nosy: + Celelibi
2019-01-20 18:22:25 _savage set messages: + msg334096
2018-10-25 10:46:11 michael.thies set messages: + msg328425
2018-10-24 17:10:07 r.david.murray set nosy: + michael.thies
messages: + msg328382
2018-10-24 17:07:59 r.david.murray set messages: + msg328381
2018-08-18 05:08:14 _savage set messages: + msg323691
2018-08-18 04:56:44 python-dev set keywords: + patch
stage: patch review
pull_requests: + pull_request8279
2018-08-17 23:02:02 _savage create