email package folds msg-id identifiers using RFC2047 encoded words where it must not #79986

Closed
mjpieters mannequin opened this issue Jan 22, 2019 · 10 comments
Labels
3.8 only security fixes topic-email

Comments

@mjpieters
Mannequin

mjpieters mannequin commented Jan 22, 2019

BPO 35805
Nosy @warsaw, @mjpieters, @bitdancer, @maxking
PRs
  • bpo-35805: Add parser for Message-ID email header. #13397
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-12-08.17:26:38.540>
    created_at = <Date 2019-01-22.12:56:02.438>
    labels = ['3.8', 'expert-email']
    title = 'email package folds msg-id identifiers using RFC2047 encoded words where it must not'
    updated_at = <Date 2020-08-06.18:19:28.728>
    user = 'https://github.com/mjpieters'

    bugs.python.org fields:

    activity = <Date 2020-08-06.18:19:28.728>
    actor = 'odo2'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-12-08.17:26:38.540>
    closer = 'maxking'
    components = ['email']
    creation = <Date 2019-01-22.12:56:02.438>
    creator = 'mjpieters'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 35805
    keywords = ['patch']
    message_count = 10.0
    messages = ['334210', '334225', '342774', '343272', '344615', '349895', '358012', '374952', '374953', '374955']
    nosy_count = 5.0
    nosy_names = ['barry', 'mjpieters', 'r.david.murray', 'maxking', 'odo2']
    pr_nums = ['13397']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue35805'
    versions = ['Python 3.8']


    When encountering identifier headers such as Message-ID containing a msg-id token longer than 77 characters (including the <...> angle brackets), the email package folds that header using RFC 2047 encoded words, e.g.

    Message-ID: <154810422972.4.16142961424846318784@aaf39fce-569e-473a-9453-6862595bd8da.prvt.dyno.rt.heroku.com>

    becomes

    Message-ID: =?utf-8?q?=3C154810422972=2E4=2E16142961424846318784=40aaf39fce-?=
     =?utf-8?q?569e-473a-9453-6862595bd8da=2Eprvt=2Edyno=2Ert=2Eheroku=2Ecom=3E?=

    The msg-id token here is this long because Heroku Dyno machines use a UUID in the FQDN, but Heroku is hardly the only source of such long msg-id tokens. Microsoft's Outlook.com / Office365 email servers balk at the RFC 2047 encoded words here and attempt to wrap the email in a TNEF winmail.dat attachment. Under some conditions that I haven't quite worked out yet, this fails and the recipient gets an error message with the helpful text "554 5.6.0 Corrupt message content"; otherwise the unsuspecting recipient just gets the ever unhelpful winmail.dat attachment. (I'm only noting these symptoms here for future searches.)

    I encountered this issue with long Message-ID values generated by email.utils.make_msgid(), but it applies to all headers defined in RFC 5322 section 3.6.4 (Identification Fields), as well as the corresponding headers from RFC 822 section 4.6 (covered by section 4.5.4 in RFC 5322).
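To see how easily such a value exceeds 77 characters, here is a minimal sketch that calls email.utils.make_msgid() with a long domain. The domain is simply the hostname from the example above; the helper name `long_msgid` is made up for illustration.

```python
from email.utils import make_msgid

# Long FQDN copied from the example Message-ID in this report.
LONG_DOMAIN = "aaf39fce-569e-473a-9453-6862595bd8da.prvt.dyno.rt.heroku.com"


def long_msgid(domain: str = LONG_DOMAIN) -> str:
    """Return a freshly generated msg-id (including the angle brackets)."""
    return make_msgid(domain=domain)


msgid = long_msgid()
# The local part differs on every call (time, PID, random bits), so only
# the overall shape and length are meaningful here.
print(msgid, len(msgid))
```

With a 60-character domain the msg-id is already past the 77-character mark before any header name is added.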

    What is happening here is that the email._header_value_parser module has no handling for the msg-id tokens *at all*, and email.headerregistry has no dedicated header class for identifier headers. So these headers are parsed as unstructured, and folded at will.
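One way to check which header class a given interpreter uses for these fields is to ask the default header registry. This is only a sketch: `handler_classes` is a hypothetical helper, and the class names it reports depend on the Python version it runs on, so no particular output is claimed.

```python
from email.headerregistry import HeaderRegistry

# Field names of the RFC 5322 identification headers (lower-cased).
IDENTIFICATION_HEADERS = (
    "message-id",
    "in-reply-to",
    "references",
    "resent-message-id",
)


def handler_classes(names=IDENTIFICATION_HEADERS):
    """Map each header name to the specialized class the registry picks."""
    registry = HeaderRegistry()
    # registry[name] returns a dynamically built class whose first base is
    # the specialized header class (the default is UnstructuredHeader).
    return {name: registry[name].__bases__[0].__name__ for name in names}


for header_name, class_name in handler_classes().items():
    print(f"{header_name}: {class_name}")
```

Any header reported as UnstructuredHeader is parsed as free text and is therefore subject to the folding described here.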

    RFC 2047 section 5, on the other hand, states that the msg-id token is strictly off-limits: no RFC 2047 encoding should be used for such elements. Because header lines *can* exceed 78 characters (RFC 5322 section 2.1.1 states that "Each line of characters MUST be no more than 998 characters, and SHOULD be no more than 78 characters[.]"), I think that RFC 5322 msg-id tokens should simply not be folded at all. The obsoleted RFC 822 syntax for msg-id makes them equal to the addr-spec token, where the local-part (before the @) contains word tokens; those would be fair game for folding, but then the RFC 2047 encoded word replacement should at least be applied only to those word tokens.
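The two limits quoted from RFC 5322 section 2.1.1 can be expressed as a small check. This is an illustrative sketch only; the function name and the three labels it returns are invented here.

```python
RECOMMENDED_MAX = 78  # RFC 5322 section 2.1.1: SHOULD be no more than 78
HARD_MAX = 998        # RFC 5322 section 2.1.1: MUST be no more than 998


def classify_header_line(line: str) -> str:
    """Classify one header line by length, excluding the trailing CRLF."""
    length = len(line.rstrip("\r\n"))
    if length > HARD_MAX:
        return "too-long"
    if length > RECOMMENDED_MAX:
        return "over-recommended"
    return "ok"


EXAMPLE = (
    "Message-ID: <154810422972.4.16142961424846318784"
    "@aaf39fce-569e-473a-9453-6862595bd8da.prvt.dyno.rt.heroku.com>\r\n"
)
print(classify_header_line(EXAMPLE))
```

The unfolded example header lands in the middle band: longer than recommended, but well within the hard limit.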

    For now, I worked around the issue by using a custom policy that uses 998 as the maximum line length for identifier headers:

    from email.policy import EmailPolicy

    # Headers that contain msg-id values, RFC 5322 section 3.6.4 and 3.6.6.
    # Note that the field name is Resent-Message-ID; "resent-msg-id" is only
    # the name of the grammar rule in the RFC.
    MSG_ID_HEADERS = {'message-id', 'in-reply-to', 'references', 'resent-message-id'}

    class MsgIdExemptPolicy(EmailPolicy):
        def _fold(self, name, value, *args, **kwargs):
            # Only intervene when the header would not fit on one line.
            if name.lower() in MSG_ID_HEADERS and self.max_line_length - len(name) - 2 < len(value):
                # RFC 5322, section 2.1.1: "Each line of characters MUST be no
                # more than 998 characters, and SHOULD be no more than 78
                # characters, excluding the CRLF.". To keep msg-id tokens from
                # being folded by means of RFC 2047, fold identifier lines to
                # the max length instead.
                return self.clone(max_line_length=998)._fold(name, value, *args, **kwargs)
            return super()._fold(name, value, *args, **kwargs)

    This ignores the fact that In-Reply-To and References contain foldable whitespace in between each msg-id, but it at least let us send email through smtp.office365.com again without confusing recipients.
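For completeness, a minimal usage sketch of the same idea. It repeats the policy under the hypothetical name `LongIdPolicy` so that it runs on its own, and it relies on the private `EmailPolicy._fold` hook, which is an implementation detail and may change between Python versions. The msg-id is taken from a later comment in this thread.

```python
from email.message import EmailMessage
from email.policy import EmailPolicy

MSG_ID_HEADERS = {"message-id", "in-reply-to", "references", "resent-message-id"}


class LongIdPolicy(EmailPolicy):  # hypothetical name, same approach as above
    def _fold(self, name, value, *args, **kwargs):
        too_long = self.max_line_length - len(name) - 2 < len(value)
        if name.lower() in MSG_ID_HEADERS and too_long:
            # Re-run the fold with the RFC 5322 hard limit of 998 characters.
            return self.clone(max_line_length=998)._fold(name, value, *args, **kwargs)
        return super()._fold(name, value, *args, **kwargs)


PARENT_ID = (
    "<92922734221723.1596730568.324691772460444-another-30661-"
    "parent.reference@test-123.example.com>"
)

msg = EmailMessage(policy=LongIdPolicy())
msg["Subject"] = "Re: threading test"
msg["In-Reply-To"] = PARENT_ID
msg.set_content("body")

serialized = msg.as_string()
print(serialized)
```

The In-Reply-To value should come out on a single line, unencoded, while all other headers keep the default 78-character folding.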

    @mjpieters mjpieters mannequin added 3.7 (EOL) end of life 3.8 only security fixes topic-email labels Jan 22, 2019
    @bitdancer
    Member

    Yes, the correct solution would be to write an actual parser for headers containing message ids. All the pieces needed to do this already exist in _header_value_parser; it "just" needs a function that glues them together in the right order, and then that new top-level parser needs to be applied to the appropriate headers via headerregistry.

    See also bpo-34881.
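To illustrate what such a parser has to recognize, here is a deliberately simplified regex sketch of the non-obsolete RFC 5322 msg-id grammar. It is not the approach used in _header_value_parser, it ignores comments (CFWS) and the obsolete forms, and all names in it are invented for this example.

```python
import re
from typing import List

# atext, dot-atom-text and no-fold-literal as defined in RFC 5322.
ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"
DOT_ATOM_TEXT = rf"{ATEXT}+(?:\.{ATEXT}+)*"
NO_FOLD_LITERAL = r"\[[\x21-\x5a\x5e-\x7e]*\]"

# msg-id = "<" id-left "@" id-right ">"   (CFWS and obs-* forms omitted)
MSG_ID = re.compile(
    rf"<(?P<left>{DOT_ATOM_TEXT})@(?P<right>{DOT_ATOM_TEXT}|{NO_FOLD_LITERAL})>"
)


def split_msg_ids(value: str) -> List[str]:
    """Return every well-formed msg-id found in a header value, in order."""
    return [match.group(0) for match in MSG_ID.finditer(value)]


references = "<a.b@example.com>\r\n <c@[192.0.2.1]>"
print(split_msg_ids(references))
```

Once the individual msg-id tokens are known, a folder can break a References value at the whitespace between them and leave each token intact.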

    @maxking
    Contributor

    maxking commented May 17, 2019

    I have created #13397 for this. For now, it only parses the Message-ID header.

    I do plan to add support for other Identification headers soon, perhaps in a 2nd PR.

    @maxking
    Contributor

    maxking commented May 23, 2019

    I have made the requested changes on the PR.

    David, can you please review again?

    @warsaw
    Member

    warsaw commented Jun 4, 2019

    New changeset 46d88a1 by Barry Warsaw (Abhilash Raj) in branch 'master':
    bpo-35805: Add parser for Message-ID email header. (GH-13397)

    @maxking
    Contributor

    maxking commented Aug 17, 2019

    I am not sure whether this should be backported to the bugfix branches, since the ability to parse the Message-ID field is technically a new feature.

    I would love to hear what David and Barry think about this.

    @maxking
    Contributor

    maxking commented Dec 8, 2019

    Closing this since it has been fixed in Python 3.8.

    @maxking maxking removed the 3.7 (EOL) end of life label Dec 8, 2019
    @maxking maxking closed this as completed Dec 8, 2019
    @odo2
    Mannequin

    odo2 mannequin commented Aug 6, 2020

    With regard to msg349895, is there any chance this fix could be considered for backport?

    I imagine you could view it as a new feature, but it seems to be the only official fix we have for the fact that Python 3 generates invalid SMTP messages. And that's not a minor problem because many popular MTAs (GMail, Outlook, etc.) will rewrite non-RFC-conformant Message IDs, causing the original ID to be lost and missing in subsequent replies. This breaks an important mechanism to support email threads.

    To this day, several Linux distributions still ship 3.6 or 3.7, even in their latest LTS, and users and vendors are stuck with supporting those for a while.

    Thanks!

    @odo2
    Mannequin

    odo2 mannequin commented Aug 6, 2020

    Further, under Python 3.8 the issue is not fully solved, as other identification headers are still being folded in a non-RFC-conformant manner (see OP for RFC references). This was indicated on the original PR by the author: #13397 (comment)

    It is a less severe problem than for Message-ID, but it still means that MTAs/MUAs may fail to recognize the threading structure because identifiers are lost.

    Is it better to open a new issue for this?

    Example on 3.8.2: the In-Reply-To header is RFC2047-folded.

    Python 3.8.2 (default, Jul 16 2020, 14:00:26) 
    [GCC 9.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import email.message
    >>> import email.policy
    >>> msg = email.message.EmailMessage(policy=email.policy.SMTP)
    >>> msg['Message-Id'] = '<929227342217024.1596730490.324691772460938-example-30661-some.reference@test-123.example.com>'
    >>> msg['In-Reply-To'] = '<92922734221723.1596730568.324691772460444-another-30661-parent.reference@test-123.example.com>'
    >>> print(msg.as_string())
    Message-Id: <929227342217024.1596730490.324691772460938-example-30661-some.reference@test-123.example.com>
    In-Reply-To: =?utf-8?q?=3C92922734221723=2E1596730568=2E324691772460444-anot?=
     =?utf-8?q?her-30661-parent=2Ereference=40test-123=2Eexample=2Ecom=3E?=
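The same check can be scripted so it can be run against any interpreter. This is a sketch: `encoded_word_report` is a hypothetical helper, and which headers it flags depends on the Python version, so no output is shown.

```python
from email.message import EmailMessage
from email.policy import SMTP

HEADERS = ("Message-ID", "In-Reply-To", "References")
LONG_ID = (
    "<929227342217024.1596730490.324691772460938-example-30661-"
    "some.reference@test-123.example.com>"
)


def encoded_word_report(msg_id: str = LONG_ID) -> dict:
    """For each header, report whether serialization used RFC 2047 words."""
    report = {}
    for name in HEADERS:
        msg = EmailMessage(policy=SMTP)
        msg[name] = msg_id
        # An encoded word starts with "=?charset?"; utf-8 is what the
        # folder emitted in the transcript above.
        report[name] = "=?utf-8?" in msg.as_string().lower()
    return report


for header_name, was_encoded in encoded_word_report().items():
    print(f"{header_name}: {'encoded' if was_encoded else 'left intact'}")
```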

    @odo2
    Mannequin

    odo2 mannequin commented Aug 6, 2020

    Somehow the message identifiers in the code sample got messed up in the previous comment; here's the actual code, for what it's worth ;-)
    https://gist.github.com/odony/0323eab303dad2077c1277076ecc3733

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022