
classification
Title: encoded-word abused for header line folding causes RFC 2047 violation
Type: behavior Stage:
Components: email Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Jeffrey.Kintscher, Mika_Hawkins, barry, equaeghe, r.david.murray
Priority: normal Keywords:

Created on 2020-08-14 11:53 by equaeghe, last changed 2022-04-11 14:59 by admin.

Messages (6)
msg375397 - (view) Author: Erik Quaeghebeur (equaeghe) Date: 2020-08-14 11:53
Encoded-word is apparently sometimes used for header line folding. This appears to me to be an abuse of this encoding technique. However, that is not the main issue: it also causes a violation of RFC 2047, because message ids get encoded as well:

https://tools.ietf.org/html/rfc2047#section-5 says “An
'encoded-word' MUST NOT appear in any portion of an
'addr-spec'.” and
https://tools.ietf.org/html/rfc5322#section-3.6.4 says
“The message identifier (msg-id) syntax is a limited
version of the addr-spec construct enclosed in the angle
bracket characters, "<" and ">".”

This causes actual problems. Namely, email clients cannot parse the message id and so have trouble with generation of In-Reply-To and References headers or problems with thread reconstruction using these headers containing encoded-word versions of message ids.

Minimal example:

---
>>> import email
>>> import email.policy

>>> msg = email.message_from_string("""From: test@example.com
To: test@example.org
Subject: Test
Date: Mon, 10 Aug 2020 22:52:53 +0000
Message-ID:  <VI1PR09MB41911D8371E899C1FE78EE48FA440@abcdefghijklm.nmopqrst.uvwx.example.com>
X-Some-Blobby-Custom-Header: DIZEglcw6TIh1uC2UrnNjWYqe8l/bYo0oxKG7mBX38s1urzvCwQD30Q07DDJFgTVZWKbThu6hVjR53MTYAHYClHPt8UvyFPkAUIc8Ps1/R+HuSQ8gbR1R03sKoFAgPZKO+FKJ9bNbBb60THl81zSCsZiALwi4LLOqnf9ZIB111G4/shFuWxRlPcsPJt72sn+tTHZqK9fRAyoK1OZCZMJmjQGysovicz1Xc6nOXHMQr2+suRwOJwSUqvsfkj8EEtzJGj7ICQ2GbgBaOjcof1AML4RCFy/vD5bG0Y8HQ2KET3SraTki4dPo+xMYSZVFEy/va4rYeynOXPfxXfHSyIFwB6gnH74Ws/XPk8ZxhAQ2wSy7Hvgg3tZ7HOmlLWg4A/vUGN+8RJlgn+hHtuCXnglv+fIKEhW36wcFotngSrcXULbTlqdE5zjuV5O7wNfgIShZnNhnPdLipslmZJGaa6RQpIonZbwUWCM8g9DZmSwo8g0On0l20IVS9s6bUCddwRZ5erHx4eUZ4DGh4YyR2fgm0WsNVW8pVsAdFMClfAJYqyPEqrDN91djfPYRZPMvzYWTAm8MAip6vDa1ZvzywDpGJYD3VwapLfgFy+AR0S/q/V1HHRmSXx1oNLEedhAt0OkIxWxO8FvqNeEfMLVhxTk1g==
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="utf-8"

BODY
""")

>>> print(msg.as_bytes(policy=email.policy.SMTPUTF8).decode())
From: test@example.com
To: test@example.org
Subject: Test
Date: Mon, 10 Aug 2020 22:52:53 +0000
Message-ID: =?utf-8?q?=3CVI1PR09MB41911D8371E899C1FE78EE48FA440=40abcdefghij?=
 =?utf-8?q?klm=2Enmopqrst=2Euvwx=2Eexample=2Ecom=3E?=
X-Some-Blobby-Custom-Header: =?utf-8?q?DIZEglcw6TIh1uC2UrnNjWYqe8l/bYo0oxKG7?=
 =?utf-8?q?mBX38s1urzvCwQD30Q07DDJFgTVZWKbThu6hVjR53MTYAHYClHPt8UvyFPkAUIc8P?=
 =?utf-8?q?s1/R+HuSQ8gbR1R03sKoFAgPZKO+FKJ9bNbBb60THl81zSCsZiALwi4LLOqnf9ZIB?=
 =?utf-8?q?111G4/shFuWxRlPcsPJt72sn+tTHZqK9fRAyoK1OZCZMJmjQGysovicz1Xc6nOXHM?=
 =?utf-8?q?Qr2+suRwOJwSUqvsfkj8EEtzJGj7ICQ2GbgBaOjcof1AML4RCFy/vD5bG0Y8HQ2KE?=
 =?utf-8?q?T3SraTki4dPo+xMYSZVFEy/va4rYeynOXPfxXfHSyIFwB6gnH74Ws/XPk8ZxhAQ2w?=
 =?utf-8?q?Sy7Hvgg3tZ7HOmlLWg4A/vUGN+8RJlgn+hHtuCXnglv+fIKEhW36wcFotngSrcXUL?=
 =?utf-8?q?bTlqdE5zjuV5O7wNfgIShZnNhnPdLipslmZJGaa6RQpIonZbwUWCM8g9DZmSwo8g0?=
 =?utf-8?q?On0l20IVS9s6bUCddwRZ5erHx4eUZ4DGh4YyR2fgm0WsNVW8pVsAdFMClfAJYqyPE?=
 =?utf-8?q?qrDN91djfPYRZPMvzYWTAm8MAip6vDa1ZvzywDpGJYD3VwapLfgFy+AR0S/q/V1HH?=
 =?utf-8?q?RmSXx1oNLEedhAt0OkIxWxO8FvqNeEfMLVhxTk1g=3D=3D?=
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="utf-8"

BODY
---
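
The symptom in the output above can be checked mechanically. The following is a minimal sketch, not part of the email package: the helper name `has_encoded_word` and the regular expression are assumptions made for illustration, and the pattern only approximates the encoded-word grammar of RFC 2047.

```python
import re

# Approximate pattern for an RFC 2047 encoded-word:
#   =?charset?encoding?encoded-text?=
# (hypothetical helper, not an API of the email package)
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def has_encoded_word(header_value):
    """Return True if the header value contains an encoded-word."""
    return ENCODED_WORD.search(header_value) is not None


# The mangled Message-ID value from the generated output above.
mangled = (
    "=?utf-8?q?=3CVI1PR09MB41911D8371E899C1FE78EE48FA440=40abcdefghij?=\n"
    " =?utf-8?q?klm=2Enmopqrst=2Euvwx=2Eexample=2Ecom=3E?="
)
# The Message-ID value as it appeared in the source message.
intact = (
    "<VI1PR09MB41911D8371E899C1FE78EE48FA440"
    "@abcdefghijklm.nmopqrst.uvwx.example.com>"
)

print(has_encoded_word(mangled))
print(has_encoded_word(intact))
```

A check like this could be run over Message-ID, In-Reply-To and References values before a message is sent, since RFC 2047 forbids encoded-words inside an addr-spec.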
msg375409 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2020-08-14 14:09
It's not really an abuse.  It is, however, buggy.  It should be applied *only* when the header contains unstructured text.  Unfortunately I made the choice to treat any header that doesn't have a specific parser as unstructured, and that was a wrong choice which should be fixed.  It is an interesting question what should be used as the default parser, though.  Suggestions and code are welcome :)

There should be specific header parsers for headers that contain message ids.  That was on my todo list but did not get done before my circumstances changed and my free-time focus moved away from python development work :(

The message_id parser exists.  In-Reply-To just needs to be declared in the header registry as a MessageIDHeader (not sure how that got missed).  Writing a Header class for References should be trivial, it's just a list of message ids.  That will fix those headers, and I suggest we do that asap.

Fixing the default-to-unstructured will take a bit more thought and should probably be split out into a separate issue.  I can review and give advice (though you may have to ping me directly) but I won't have time to write any code.
msg375411 - (view) Author: Erik Quaeghebeur (equaeghe) Date: 2020-08-14 14:47
Note that In-Reply-To can also contain multiple message ids: <https://tools.ietf.org/html/rfc5322#section-3.6.4>.
It should be treated the same as References.

When you say that a message_id parser exists, then that means it is not applied to the Message-Id header by default yet, because my example shows that the Message-Id header gets mangled.

Applying encoded-word encoding to (unknown) unstructured fields may break workflows. These are often X-… headers, and one cannot assume that the applications generating and consuming them apply decoding (just as with message ids). The most reliable approach would be to not encode them, but to apply white-space folding and otherwise let lines exceed the configured limit (typically 78 characters). For headers, the increased line length is not that big of a problem, since the 78-character limit exists for visual reasons. If lines still exceed 998 characters, an error should be raised, as that is an RFC violation. Tools generating such headers are severely broken and should not get a free pass. Users could get the option to allow such lines and take their chances when the message is submitted and transported.
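
The folding strategy proposed here can be sketched roughly as follows. This is an illustration only, not code from the email package: the function name `fold_without_encoding` and its parameters are hypothetical, and it simplifies by collapsing runs of white space into single spaces.

```python
def fold_without_encoding(name, value, soft_limit=78, hard_limit=998,
                          linesep="\r\n"):
    """Fold a header at existing white space only, never encoding it.

    Hypothetical helper. A line may exceed soft_limit when it holds a
    single long token, but any line longer than hard_limit raises
    ValueError.
    """
    lines = []
    current = name + ":"
    for index, word in enumerate(value.split()):
        if index > 0 and len(current) + 1 + len(word) > soft_limit:
            # Start a continuation line; it must begin with white space.
            lines.append(current)
            current = " " + word
        else:
            current += " " + word
    lines.append(current)

    for line in lines:
        if len(line) > hard_limit:
            raise ValueError(
                f"header line of {len(line)} characters exceeds "
                f"the hard limit of {hard_limit}"
            )
    return linesep.join(lines) + linesep


# A long token without white space is passed through unencoded.
print(fold_without_encoding("X-Some-Blobby-Custom-Header", "A" * 200))
```

With this approach a blob like the one in the example would stay on one over-long line, and only a value whose line exceeds 998 characters would be rejected.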
msg375434 - (view) Author: Erik Quaeghebeur (equaeghe) Date: 2020-08-14 22:06
We also shouldn't forget Resent-Message-Id.

So in the header registry <https://github.com/python/cpython/blob/2a9f709ba23c8f6aa2bed821aacc4e7baecde383/Lib/email/headerregistry.py#L562>,

'message-id': MessageIDHeader,

should be replaced by

'message-id': UniqueSingleMessageIDHeader,
'resent-message-id': SingleMessageIDHeader,
'in-reply-to': UniqueMessageIDHeader,
'references': UniqueMessageIDHeader,

with Unique/Single used as for the other Headers.
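
Until such classes exist, a registry entry can be added locally. The sketch below assumes Python 3.8 or later, where `email.headerregistry.MessageIDHeader` is available; the `Unique…`/`Single…` class names above are proposals from this message and, as far as I know, not part of the standard library. Only Resent-Message-ID is mapped here, because `MessageIDHeader` parses a single msg-id, whereas In-Reply-To and References may carry several.

```python
from email import policy
from email.headerregistry import HeaderRegistry, MessageIDHeader

# Start from the default registry and add one extra mapping.
registry = HeaderRegistry()
# Assumption: Resent-Message-ID holds exactly one msg-id, like Message-ID.
registry.map_to_type("resent-message-id", MessageIDHeader)

# A policy that uses the extended registry when creating headers.
custom_policy = policy.SMTP.clone(header_factory=registry)

long_id = (
    "<VI1PR09MB41911D8371E899C1FE78EE48FA440"
    "@abcdefghijklm.nmopqrst.uvwx.example.com>"
)
header = registry("Resent-Message-ID", long_id)
folded = header.fold(policy=custom_policy)
print(repr(folded))
```

The same `map_to_type` call could be used for other headers once suitable header classes for lists of message ids are written.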
msg375566 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2020-08-17 20:29
Yes for the registry changes.  I thought we had fixed the bug that was causing message-id to get encoded, but maybe it still exists in 3.7?  I don't remember when we fixed it (and I may be remembering wrong!)

As for X- "unstructured headers" getting trashed, by *definition* in the RFC, if the header body is unstructured it must support RFC 2047 encoding.  If it does not, it is not an unstructured header field.  Which is why I said we need to think about what characteristics the default parser should have.  The RFC doesn't really speak to that, it expects every header to be one of the defined types...but while an X- header might be of a defined type, the email package can't know that unless it is told, so what should we use as the default parsing strategy?  "text without encoded words" isn't really RFC compliant, I think.  (Though I'll admit it has been a while since I last reviewed the relevant RFCs.)

Note that I believe that we have an open issue (or at least an open discussion) that we should change the 'refold_source' default from 'long' to 'none', which means that X- headers would at least be passed through by default.  It would also mitigate this problem, and can be used as a local workaround for headers that are just getting passed through and not modified.
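
As a rough illustration of that local workaround, the sketch below clones a policy with `refold_source='none'` and serializes a message parsed from source. The message content is shortened from the original report; the X-Custom header and its value are made up for the example.

```python
import email
import email.policy

source = (
    "From: test@example.com\n"
    "Message-ID: <VI1PR09MB41911D8371E899C1FE78EE48FA440"
    "@abcdefghijklm.nmopqrst.uvwx.example.com>\n"
    "X-Custom: " + "A" * 200 + "\n"
    "\n"
    "BODY\n"
)
msg = email.message_from_string(source)

# Headers that come straight from the source are written back unchanged.
passthrough = email.policy.SMTPUTF8.clone(refold_source="none")
output = msg.as_bytes(policy=passthrough).decode()
print(output)
```

This only helps for headers that are passed through unmodified; a header that is set or replaced on the message object is still folded by its header class.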
msg375612 - (view) Author: Mika Hawkins (Mika_Hawkins) Date: 2020-08-18 11:57
Truly for the vault changes. I thought we had fixed the bug that was causing message-id to get encoded, yet perhaps it despite everything exists in 3.7? I don't recall when we fixed it (and I might be recollecting incorrectly!) 

With respect to X-"unstructured headers" getting destroyed, by *definition* in the rfc, if the header body is unstructured it must help RFC encoding. In the event that doesn't, it's anything but an unstructured header field. Which is the reason I said we have to consider what attributes the default parser ought to have. The RFC doesn't generally address that, it anticipates that each header should be one of the characterized types...but while a X-header may be of a characterized type, the email bundle can't realize that except if it is told, so what would it be a good idea for us to use as the default parsing procedure? "text without encoded words" isn't generally RFC consistent, I think. (In spite of the fact that I'll let it be known has been some time since I last explored the important RFCs.) 

Note that I accept that we have an open issue (or if nothing else an open conversation) that we should change the 'refold_source' default from 'long' to 'none', which implies that X-headers would in any event be gone through of course. It would likewise alleviate this issue, and can be utilized as a nearby workaround for headers that are simply getting gone through and not altered.

Regards,
Mika Hawkins
History
Date                 User               Action  Args
2022-04-11 14:59:34  admin              set     github: 85725
2020-08-18 11:57:09  Mika_Hawkins       set     nosy: + Mika_Hawkins
                                                messages: + msg375612
2020-08-17 20:29:11  r.david.murray     set     messages: + msg375566
2020-08-14 23:33:12  Jeffrey.Kintscher  set     nosy: + Jeffrey.Kintscher
2020-08-14 22:06:51  equaeghe           set     messages: + msg375434
2020-08-14 14:47:08  equaeghe           set     messages: + msg375411
2020-08-14 14:09:32  r.david.murray     set     messages: + msg375409
2020-08-14 11:53:18  equaeghe           create