Author _savage
Recipients _savage, barry, maciej.szulik, matrixise, python-dev, r.david.murray
Date 2018-08-06.18:51:49
Message-id <1533581509.38.0.56676864532.issue24218@psf.upfronthosting.co.za>
In-reply-to
Content
David, I tried to find the mentioned '\r\r…\n' issue but I could not find it here. However, from an initial investigation into the BytesGenerator, here is what’s happening.

Flattening the body and attachments of the EmailMessage object works. Eventually _write_headers() is called to flatten the headers, which happens entry by entry (https://github.com/python/cpython/blob/master/Lib/email/generator.py#L417-L418). Flattening a header entry is a recursive process: it descends into the entry's parse tree, then encodes and concatenates the individual “parts” (tokens of the header entry) to build the final flattened and encoded string.

The parse tree for a header entry like "Martín Córdoba <foo@bar.com>" eventually results in the correct flattened string:

    '=?utf-8?q?Mart=C3=ADn_C=C3=B3rdoba?= <foo@bar.com>\r\n'

at the bottom of the recursion for this “Mailbox” part. The recursive call stack at that point is:

    _refold_parse_tree _header_value_parser.py:2687
    fold [Mailbox] _header_value_parser.py:144
    _refold_parse_tree _header_value_parser.py:2630
    fold [Address] _header_value_parser.py:144
    _refold_parse_tree _header_value_parser.py:2630
    fold [AddressList] _header_value_parser.py:144
    _refold_parse_tree _header_value_parser.py:2630
    fold [Header] _header_value_parser.py:144
    fold [_UniqueAddressHeader] headerregistry.py:258
    _fold [EmailPolicy] policy.py:205
    fold_binary [EmailPolicy] policy.py:199
    _write_headers [BytesGenerator] generator.py:418
    _write [BytesGenerator] generator.py:195
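
To drive this code path end to end, something like the following minimal sketch should do. The helper name, the Subject value, the body text and the choice of policy.SMTP are assumptions for illustration and are not taken from the original report. Whether the output actually contains a doubled '\r' depends on the Python version in use.

```python
from email import policy
from email.generator import BytesGenerator
from email.message import EmailMessage
from io import BytesIO


def flatten_with_smtp_policy(to_value):
    """Hypothetical helper: flatten a small message to bytes."""
    msg = EmailMessage()
    msg["To"] = to_value          # non-ASCII display name triggers RFC 2047 encoding
    msg["Subject"] = "test"       # assumed value, for illustration only
    msg.set_content("body\n")     # assumed body, for illustration only

    buf = BytesIO()
    # policy.SMTP uses linesep="\r\n"; this is the two-character case discussed here.
    BytesGenerator(buf, policy=policy.SMTP).flatten(msg)
    return buf.getvalue()


data = flatten_with_smtp_policy("Martín Córdoba <foo@bar.com>")
# True only on interpreter versions affected by the truncation problem.
has_doubled_cr = b"\r\r\n" in data
```

Inspecting `data` (or `has_doubled_cr`) shows whether the header lines end in a single '\r\n' as expected.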

The problem now arises from the interplay of 

    # https://github.com/python/cpython/blob/master/Lib/email/_header_value_parser.py#L2629
    encoded_part = part.fold(policy=policy)[:-1] # strip nl

which strips the '\n' from the returned string, and

    # https://github.com/python/cpython/blob/master/Lib/email/_header_value_parser.py#L2686
    return policy.linesep.join(lines) + policy.linesep

which appends the policy’s line separator, linesep="\r\n", to the end of the flattened string as the recursion unwinds.
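
The interplay can be reproduced with plain strings, outside the email package. This is only a simplified sketch of the two quoted statements, not the actual _refold_parse_tree logic; the variable names are made up for illustration.

```python
linesep = "\r\n"  # what policy.linesep is in the failing case

# What part.fold(policy=policy) returns at the bottom of the recursion.
folded_part = "=?utf-8?q?Mart=C3=ADn_C=C3=B3rdoba?= <foo@bar.com>" + linesep

# First statement: drop exactly one character. Only '\n' goes, '\r' stays.
encoded_part = folded_part[:-1]

# Second statement: join the collected lines and append linesep again.
lines = [encoded_part]
result = linesep.join(lines) + linesep

# The leftover '\r' plus the freshly appended '\r\n' gives '\r\r\n'.
doubled = result.endswith("\r\r\n")
```

With linesep="\n" the same two statements round-trip cleanly, which would explain why the truncation goes unnoticed under the default policy.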

I am not sure about a proper fix here, but since the policy’s linesep can be a string of any length (in this case len("\r\n") == 2), a fixed truncation of one character with [:-1] seems wrong. Instead, using:

    encoded_part = part.fold(policy=policy)[:-len(policy.linesep)] # strip nl

seems to work for entries with and without Unicode characters in their display names.
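
A slightly more defensive variant of the same idea, as a sketch only: strip_linesep is a hypothetical helper and not part of the email package. It also covers the edge case where linesep is empty, since s[:-0] would otherwise return an empty string.

```python
def strip_linesep(folded, linesep):
    """Hypothetical helper: remove one trailing linesep, if present."""
    # Guard against an empty linesep: folded[:-0] == folded[:0] == ''.
    if linesep and folded.endswith(linesep):
        return folded[:-len(linesep)]
    return folded


crlf_case = strip_linesep("=?utf-8?q?Mart=C3=ADn?= <foo@bar.com>\r\n", "\r\n")
lf_case = strip_linesep("Martin <foo@bar.com>\n", "\n")
```

Both cases come back without any trailing '\r' or '\n', so appending policy.linesep afterwards would yield exactly one line ending.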

David, please advise on how to proceed from here.