Message 323216 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	_savage
Recipients	_savage, barry, maciej.szulik, matrixise, python-dev, r.david.murray
Date	2018-08-06.18:51:49
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1533581509.38.0.56676864532.issue24218@psf.upfronthosting.co.za>
In-reply-to

Content
David, I tried to find the mentioned '\r\r…\n' issue but I could not find it here. However, from an initial investigation into the BytesGenerator, here is what’s happening. Flattening the body and attachments of the EmailMessage object works, and eventually _write_headers() is called to flatten the headers which happens entry by entry (https://github.com/python/cpython/blob/master/Lib/email/generator.py#L417-L418). Flattening a header entry is a recursive process over the parse tree of the entry, which builds the flattened and encoded final string by descending into the parse tree and encoding & concatenating the individual “parts” (tokens of the header entry). Given the parse tree for a header entry like "Martín Córdoba <foo@bar.com>" eventually results in the correct flattened string: '=?utf-8?q?Mart=C3=ADn_C=C3=B3rdoba?= <foo@bar.com>\r\n' at the bottom of the recursion for this “Mailbox” part. The recursive callstack is then: _refold_parse_tree _header_value_parser.py:2687 fold [Mailbox] _header_value_parser.py:144 _refold_parse_tree _header_value_parser.py:2630 fold [Address] _header_value_parser.py:144 _refold_parse_tree _header_value_parser.py:2630 fold [AddressList] _header_value_parser.py:144 _refold_parse_tree _header_value_parser.py:2630 fold [Header] _header_value_parser.py:144 fold [_UniqueAddressHeader] headerregistry.py:258 _fold [EmailPolicy] policy.py:205 fold_binary [EmailPolicy] policy.py:199 _write_headers [BytesGenerator] generator.py:418 _write [BytesGenerator] generator.py:195 The problem now arises from the interplay of # https://github.com/python/cpython/blob/master/Lib/email/_header_value_parser.py#L2629 encoded_part = part.fold(policy=policy)[:-1] # strip nl which strips the '\n' from the returned string, and # https://github.com/python/cpython/blob/master/Lib/email/_header_value_parser.py#L2686 return policy.linesep.join(lines) + policy.linesep which adds the policy’s line separation string linesep="\r\n" to the end of the flattened string upon unrolling the recursion. I am not sure about a proper fix here, but considering that the linesep policy can be any string length (in this case len("\r\n") == 2) a fixed truncation of one character [:-1] seems wrong. Instead, using: encoded_part = part.fold(policy=policy)[:-len(policy.linesep)] # strip nl seems to work for entries with and without Unicode characters in their display names. David, please advise on how to proceed from here.

David, I tried to find the mentioned '\r\r…\n' issue but I could not find it here. However, from an initial investigation into the BytesGenerator, here is what’s happening.

Flattening the body and attachments of the EmailMessage object works, and eventually _write_headers() is called to flatten the headers which happens entry by entry (https://github.com/python/cpython/blob/master/Lib/email/generator.py#L417-L418). Flattening a header entry is a recursive process over the parse tree of the entry, which builds the flattened and encoded final string by descending into the parse tree and encoding & concatenating the individual “parts” (tokens of the header entry).

Given the parse tree for a header entry like "Martín Córdoba <foo@bar.com>" eventually results in the correct flattened string:

    '=?utf-8?q?Mart=C3=ADn_C=C3=B3rdoba?= <foo@bar.com>\r\n'

at the bottom of the recursion for this “Mailbox” part. The recursive callstack is then:

    _refold_parse_tree _header_value_parser.py:2687
    fold [Mailbox] _header_value_parser.py:144
    _refold_parse_tree _header_value_parser.py:2630
    fold [Address] _header_value_parser.py:144
    _refold_parse_tree _header_value_parser.py:2630
    fold [AddressList] _header_value_parser.py:144
    _refold_parse_tree _header_value_parser.py:2630
    fold [Header] _header_value_parser.py:144
    fold [_UniqueAddressHeader] headerregistry.py:258
    _fold [EmailPolicy] policy.py:205
    fold_binary [EmailPolicy] policy.py:199
    _write_headers [BytesGenerator] generator.py:418
    _write [BytesGenerator] generator.py:195

The problem now arises from the interplay of 

    # https://github.com/python/cpython/blob/master/Lib/email/_header_value_parser.py#L2629
    encoded_part = part.fold(policy=policy)[:-1] # strip nl

which strips the '\n' from the returned string, and

    # https://github.com/python/cpython/blob/master/Lib/email/_header_value_parser.py#L2686
    return policy.linesep.join(lines) + policy.linesep

which adds the policy’s line separation string linesep="\r\n" to the end of the flattened string upon unrolling the recursion.

I am not sure about a proper fix here, but considering that the linesep policy can be any string length (in this case len("\r\n") == 2) a fixed truncation of one character [:-1] seems wrong. Instead, using:

    encoded_part = part.fold(policy=policy)[:-len(policy.linesep)] # strip nl

seems to work for entries with and without Unicode characters in their display names.

David, please advise on how to proceed from here.

History
Date	User	Action	Args
2018-08-06 18:51:49	_savage	set	recipients: + _savage, barry, r.david.murray, python-dev, maciej.szulik, matrixise
2018-08-06 18:51:49	_savage	set	messageid: <1533581509.38.0.56676864532.issue24218@psf.upfronthosting.co.za>
2018-08-06 18:51:49	_savage	link	issue24218 messages
2018-08-06 18:51:49	_savage	create